I have 5000+ pages that I want to download using WebClient. Since I want this done as fast as possible, I am trying to use multithreading (a BlockingCollection in my case), but the program always seems to crash after a while with a System.Net.WebException. If I add a Thread.Sleep(3000) delay, it slows down the download process and the error just appears a little later.
It usually takes about 2-3 seconds to download one page.
Normally, I would guess that the problem is in my BlockingCollection, but it works fine with other tasks, so I am pretty sure that something is wrong with my WebClient requests. I think there might be some kind of overlap between the separate WebClient instances, but that is just a guess.
    Multithreading multiThread = new Multithreading(5);
    for (int pageNumber = 1; pageNumber <= 5181; pageNumber++)
    {
        multiThread.EnqueueTask(new Action(() => //add task ("scrape the trader") to the multithread queue
        {
            using (WebClient client = new WebClient())
            {
                client.DownloadFile("http://example.com/page=" + pageNumber.ToString(), @"C:\mypages\page " + pageNumber.ToString() + ".html");
            }
        }));
        //I put the Thread.Sleep(123) delay here
    }
If I add a smaller delay (Thread.Sleep(100), for example), it works fine, but then each task ends up scraping whatever page pageNumber's value points to at that moment, rather than the pages in order as it usually does.
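The symptom described here is consistent with how C# lambdas capture a for loop variable: every lambda created in the loop shares the same pageNumber variable, so a task that runs later reads the value the variable has at that moment, not the value it had when the task was enqueued. The sketch below illustrates that behaviour without any network access. The names CaptureDemo, RunCapturingLoopVariable and RunCapturingLocalCopy are hypothetical, and whether this is what actually triggers the WebException in the code above is an assumption, not something the post confirms.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class: shows loop-variable capture in isolation.
static class CaptureDemo
{
    // Each lambda captures the single shared loop variable, so when the
    // lambdas run after the loop has finished they all see its final value.
    public static List<int> RunCapturingLoopVariable(int count)
    {
        var actions = new List<Func<int>>();
        for (int pageNumber = 1; pageNumber <= count; pageNumber++)
        {
            actions.Add(() => pageNumber);
        }

        var seen = new List<int>();
        foreach (var action in actions)
        {
            seen.Add(action());
        }
        return seen;
    }

    // Copying the loop variable into a local declared inside the loop body
    // gives every lambda its own variable with a fixed value.
    public static List<int> RunCapturingLocalCopy(int count)
    {
        var actions = new List<Func<int>>();
        for (int pageNumber = 1; pageNumber <= count; pageNumber++)
        {
            int page = pageNumber; // fresh variable per iteration
            actions.Add(() => page);
        }

        var seen = new List<int>();
        foreach (var action in actions)
        {
            seen.Add(action());
        }
        return seen;
    }
}
```

Calling CaptureDemo.RunCapturingLoopVariable(3) yields 4, 4, 4, while CaptureDemo.RunCapturingLocalCopy(3) yields 1, 2, 3. If the same effect occurs in the download loop, several tasks could request the same page and write to the same file path, or request a page number past the end of the range.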
Here is my BlockingCollection wrapper (I think I got this code from Stack Overflow):
    class Multithreading : IDisposable
    {
        BlockingCollection<Action> _taskQ = new BlockingCollection<Action>();

        public Multithreading(int workerCount)
        {
            // Create and start a separate Task for each consumer:
            for (int i = 0; i < workerCount; i++)
                Task.Factory.StartNew(Consume);
        }

        public void Dispose() { _taskQ.CompleteAdding(); }

        public void EnqueueTask(Action action) { _taskQ.Add(action); }

        void Consume()
        {
            // This sequence that we’re enumerating will block when no elements
            // are available and will end when CompleteAdding is called.
            foreach (Action action in _taskQ.GetConsumingEnumerable())
                action(); // Perform task.
        }
    }
I also tried putting everything into an endless while loop and handling the error with try...catch statements, but apparently the error is not thrown immediately, only after a while (I am not sure when).
Here is the whole exception:
An exception of type 'System.Net.WebException' occurred in System.dll but was not handled in user code
Additional information: An exception occurred during a WebClient request.
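The message above is generic, so the useful detail is likely to be in the exception's Status and InnerException properties. A minimal sketch of how those could be logged from inside each queued action, on the worker thread where the exception is actually thrown, might look like the following. The names WebExceptionInfo, Describe and Logged are hypothetical helpers, not part of the original code or of the .NET API.

```csharp
using System;
using System.Net;

// Hypothetical helper class for diagnosing WebException failures.
static class WebExceptionInfo
{
    // Flattens the status and inner exception of a WebException into one line.
    public static string Describe(WebException ex)
    {
        string inner = ex.InnerException == null
            ? "none"
            : ex.InnerException.GetType().Name + ": " + ex.InnerException.Message;
        return "Status=" + ex.Status + "; Inner=" + inner;
    }

    // Wraps an action so that a WebException thrown by it is caught and
    // reported through the supplied log callback instead of propagating.
    public static Action Logged(Action download, Action<string> log)
    {
        return () =>
        {
            try
            {
                download();
            }
            catch (WebException ex)
            {
                log(Describe(ex));
            }
        };
    }
}
```

Assuming the Multithreading class shown earlier, the download lambda could be passed through the wrapper before it is enqueued, for example multiThread.EnqueueTask(WebExceptionInfo.Logged(downloadAction, Console.WriteLine)), where downloadAction stands for the existing lambda. Catching inside the action matters because a try...catch around the enqueueing loop only covers the thread that adds tasks, not the worker that runs them.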