Question

I'm writing a crawler in C++. The function crawler downloads a website and extracts all the links from it. I want to run crawler in multiple threads, each thread with different parameters, and I also want to specify the number of threads at the start of the program, so that I can download a specified number of websites simultaneously. I know how to do basic multi-threading, but I need to choose the number of threads at startup. Are there any libraries that would allow me to do this, or is it possible with std::thread?

#include <sstream>
#include <iostream>
#include <string>
#include "gumbo.h"
#include <curl/curl.h>
using namespace std;

//extract links
static void search_for_links(GumboNode* node) {
  if (node->type != GUMBO_NODE_ELEMENT) {
    return;
  }
  GumboAttribute* href;
  if (node->v.element.tag == GUMBO_TAG_A &&
      (href = gumbo_get_attribute(&node->v.element.attributes, "href"))) {
    std::cout << href->value << std::endl;
  }

  GumboVector* children = &node->v.element.children;
  for (unsigned int i = 0; i < children->length; ++i) {
    search_for_links(static_cast<GumboNode*>(children->data[i]));
  }

}



//turn the output from libcurl into a string
size_t write_to_string(void* ptr, size_t size, size_t count, void* stream) {
  //curl buffers are not null-terminated, so pass the byte count explicitly
  static_cast<string*>(stream)->append(static_cast<char*>(ptr), size * count);
  return size * count;
}

//download one page and print its links; each thread will run this with a different URL
int crawler(const char* url)
{
  CURL* myHandle = curl_easy_init();
  if (!myHandle) {
    return 1;
  }

  //set the 'libcurl' parameters
  curl_easy_setopt(myHandle, CURLOPT_USERAGENT, "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16");
  curl_easy_setopt(myHandle, CURLOPT_AUTOREFERER, 1L);
  curl_easy_setopt(myHandle, CURLOPT_FOLLOWLOCATION, 1L);
  curl_easy_setopt(myHandle, CURLOPT_COOKIEFILE, "");

  //set the url
  curl_easy_setopt(myHandle, CURLOPT_URL, url);

  //turn the output into a string using the callback above
  string response;
  curl_easy_setopt(myHandle, CURLOPT_WRITEFUNCTION, write_to_string);
  curl_easy_setopt(myHandle, CURLOPT_WRITEDATA, &response);

  CURLcode res = curl_easy_perform(myHandle);
  curl_easy_cleanup(myHandle);
  if (res != CURLE_OK) {
    cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << endl;
    return 1;
  }

  //HTML parsing
  GumboOutput* output = gumbo_parse(response.c_str());
  search_for_links(output->root);
  gumbo_destroy_output(&kGumboDefaultOptions, output);

  return 0;
}

int main()
{
  //curl_global_init is not thread-safe, so call it once before starting any threads
  curl_global_init(CURL_GLOBAL_ALL);
  crawler("http://wikipedia.org");
  curl_global_cleanup();
  return 0;
}

Solution

You can create a number of std::threads and store them in a vector. Let's say you have your function

void f(int x, std::string const& y);

Then you can create a vector with threads running the function with

std::vector<std::thread> threadgroup;
threadgroup.emplace_back(f, 1, "abc");
threadgroup.emplace_back(f, 2, "def");

This starts two threads stored in the vector. Note that the callable is emplace_back's first argument, followed by the arguments to pass to it. Make sure to join every thread before exiting.

I think what you actually need is a fixed number of threads processing a shared container of links: each thread downloads a page, adds any new links it finds to the container, and, when it finishes one page, fetches the next link from the container.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow