Question

I have a huge list of web pages that each display a status I need to check. Some of the URLs are on the same site; another set is on a different site.

Right now I'm trying to do this in parallel using code like the below, but I have the feeling that I'm causing too much overhead.

while(ListOfUrls.Count > 0){
  Parallel.ForEach(ListOfUrls, url =>
  {
    WebClient webClient = new WebClient();
    webClient.DownloadString(url);
    ... run my checks here.. 
  });

  ListOfUrls = GetNewUrls.....
}

Can this be done with less overhead, and with more control over how many WebClients and connections I use or reuse, so that in the end the job gets done faster?

Solution

Parallel.ForEach is good for CPU-bound computational work, but it will unnecessarily block thread-pool threads on synchronous IO-bound calls such as DownloadString in your case. You can improve the scalability of your code and reduce the number of threads it uses by switching to DownloadStringTaskAsync and tasks instead:

// non-blocking async method
async Task<string> ProcessUrlAsync(string url)
{
    using (var webClient = new WebClient())
    {
        string data = await webClient.DownloadStringTaskAsync(new Uri(url));
        // run checks here.. 
        return data;
    }
}

// ...

if (ListOfUrls.Count > 0) {
    var tasks = new List<Task>();
    foreach (var url in ListOfUrls)
    {
      tasks.Add(ProcessUrlAsync(url));
    }

    Task.WaitAll(tasks.ToArray()); // blocking wait

    // could use await here and make this method async:
    // await Task.WhenAll(tasks.ToArray());
}
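
The question also asks for control over how many requests run at once, which the loop above does not provide: it starts every download immediately. One possible way to cap that, sketched here as an assumption rather than as part of the original answer, is to gate each call with a SemaphoreSlim. The names ThrottledRunner, RunAsync, maxConcurrency and fetch are hypothetical; fetch stands in for something like ProcessUrlAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledRunner
{
    // Hypothetical helper (not from the answer): runs `fetch` for every URL,
    // but allows at most `maxConcurrency` calls to be in flight at once.
    public static async Task<string[]> RunAsync(
        IEnumerable<string> urls,
        int maxConcurrency,
        Func<string, Task<string>> fetch)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                // Wait for a free slot without blocking a thread.
                await gate.WaitAsync();
                try
                {
                    return await fetch(url);
                }
                finally
                {
                    // Free the slot even if fetch throws.
                    gate.Release();
                }
            }).ToList(); // ToList forces the query, so all tasks are created here

            // Results come back in the same order as the input URLs.
            return await Task.WhenAll(tasks);
        }
    }
}
```

With the method from the answer, a call might look like `await ThrottledRunner.RunAsync(ListOfUrls, 8, ProcessUrlAsync);`, where 8 is an arbitrary illustrative limit that would need tuning against the sites being checked.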

Other tips

You can try HttpClient, a new addition in .NET 4.5. It is considered to be faster and might improve your performance a little:

using (HttpClient client = new HttpClient())
using (HttpResponseMessage response = await client.GetAsync(url))
using (HttpContent content = response.Content)
{
    string result = await content.ReadAsStringAsync();
}

An oft-overlooked element in the web.config or app.config file of your application is the connectionManagement tag. In particular, .NET limits the number of simultaneous connections to a single domain to 2 by default. The tag is described in the .NET Framework configuration documentation.

If I understood your question correctly, it stands to reason that creating web clients in parallel against 2 domains will be limited to 4 concurrent connections by default (2 per domain), giving less speedup than you would otherwise expect.

If you are connecting to many domains, however, then the other answers are likely to yield more speedup, since waiting on the response is probably a large part of the cost of each loop iteration. If you are on .NET 4.5, the HttpClient.GetStringAsync method is probably your friend.
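
The per-domain limit described above can also be raised from code instead of the config file. The following is a minimal sketch, assuming .NET Framework, where WebClient requests go through ServicePointManager; the ConnectionSettings class and RaiseLimit method are hypothetical names, and the right value depends on what the target sites tolerate.

```csharp
using System;
using System.Net;

public static class ConnectionSettings
{
    // Hypothetical helper: raises the default per-host connection cap in code.
    // Call it once at startup, before the first request is made, because
    // the new default only applies to ServicePoint objects created afterwards.
    public static void RaiseLimit(int maxConnectionsPerHost)
    {
        ServicePointManager.DefaultConnectionLimit = maxConnectionsPerHost;
    }
}
```

Calling `ConnectionSettings.RaiseLimit(10);` at startup would let up to 10 connections per host run at once; 10 is only an illustrative value.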

Have you thought about executing your code asynchronously? I don't think there is a faster way to get a single page from the Internet, but you can fetch many of them simultaneously.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow