Question

My project is an HTML parser that loads HTML pages in parallel with a number of HttpClients (one client for each proxy in my list) and then parses the loaded pages with HtmlAgilityPack (a third-party library for HTML parsing).

This method loads a page using HttpClient, so it uses the network and very little CPU:

LoadObjectPageAsync(client, i) 

And this method parses the loaded page. It doesn't use the network, but it uses a lot of CPU:

ParseObjectPageAsync(i)

In my project I simultaneously execute this async method once for each proxy in my WebProxy list:

    Private Async Function LoadAndParseAsync(ByVal _proxy As WebProxy) As Task

        Dim client As HttpClient = CreateProxyHttpClient(_proxy, 10000)

        For i = 0 To URLS.Length - 1
            Await LoadObjectPageAsync(client, i)
            ParseObjectPageAsync(i)
        Next

    End Function

Each HttpClient loads page after page, and after each load completes I start a parsing task for that page and forget about it.

My internet connection's bandwidth is 30 Mbps. Here is a download speed diagram during this method's execution (I can't post images because of my low rating):

http://oi60.tinypic.com/2ebae4n.jpg

CPU usage is ~50-60%, but in this case the connection isn't always fully loaded during execution.

And if I execute the above method without this line:

ParseObjectPageAsync(i)

(so I just don't parse the loaded pages), then I get this:

http://oi60.tinypic.com/rmu5gn.jpg

CPU usage is about 5-10%, but the bandwidth is fully used. That's what I want to see with parsing enabled too.

So when I call the ParseObjectPageAsync(i) method, I expect it to have no effect on network usage. But it somehow does, even though the CPU is not fully loaded during execution (only 50-60%). So the parse tasks interrupt the load tasks. That's what I want to fix, because the main priority is maximum use of the internet connection.

Maybe there is a way to set the priority of the parse tasks to low, or some other way to solve the problem.

I can read both VB and C# code. Sorry for my bad English.

UPDATE: The ParseObjectPageAsync method is:

    Private Async Sub ParseObjectPageAsync(ByVal _num As Integer)

        ' Await is the first keyword, so the whole method
        ' must run asynchronously, as I expect.
        Await Task.Run(Sub()

                           ' and here some processing of the loaded page.

                       End Sub)

    End Sub

Solution

For mixing asynchronous and CPU-bound work, I recommend TPL Dataflow. You can set up a basic pipeline where the first TransformBlock takes URLs and (asynchronously) downloads them, and the second TransformBlock does the parsing. Then you can adjust the MaxDegreeOfParallelism option for both blocks as needed.
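
To make that concrete, here is a minimal sketch of such a pipeline, not a drop-in replacement for the code in the question. It assumes the System.Threading.Tasks.Dataflow NuGet package is referenced and uses a single HttpClient rather than one per proxy. ParsePage is a hypothetical stand-in for the HtmlAgilityPack step, RunPipelineAsync is an illustrative name, and both MaxDegreeOfParallelism values are guesses you would tune. A third block is added only to collect the parsed results.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Net.Http
Imports System.Threading.Tasks
Imports System.Threading.Tasks.Dataflow

Module PipelineSketch

    ' Hypothetical stand-in for the CPU-bound HtmlAgilityPack parsing step.
    Private Function ParsePage(html As String) As Integer
        Return html.Length
    End Function

    Public Async Function RunPipelineAsync(client As HttpClient,
                                           urls As IEnumerable(Of String)) As Task(Of List(Of Integer))
        ' Typed delegates make the intended constructor overloads explicit.
        Dim downloadStep As Func(Of String, Task(Of String)) =
            Function(address) client.GetStringAsync(address)
        Dim parseStep As Func(Of String, Integer) = AddressOf ParsePage

        ' Stage 1: network-bound, so allow many downloads in flight (assumed value).
        Dim download As New TransformBlock(Of String, String)(
            downloadStep,
            New ExecutionDataflowBlockOptions With {.MaxDegreeOfParallelism = 16})

        ' Stage 2: CPU-bound, capped below the core count (assumed value).
        Dim parse As New TransformBlock(Of String, Integer)(
            parseStep,
            New ExecutionDataflowBlockOptions With {
                .MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1)})

        ' Sink: processes one item at a time by default, so the list needs no lock.
        Dim results As New List(Of Integer)
        Dim collect As New ActionBlock(Of Integer)(Sub(r As Integer) results.Add(r))

        ' PropagateCompletion passes completion (and faults) down the pipeline.
        Dim link As New DataflowLinkOptions With {.PropagateCompletion = True}
        download.LinkTo(parse, link)
        parse.LinkTo(collect, link)

        For Each url In urls
            download.Post(url)
        Next
        download.Complete()

        ' Finishes when every page has been downloaded, parsed and collected.
        Await collect.Completion
        Return results
    End Function

End Module
```

Because the two stages have separate limits, lowering the parse block's MaxDegreeOfParallelism throttles parsing without touching the number of concurrent downloads. If any download throws, the fault propagates through the links and surfaces when collect.Completion is awaited.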

Licensed under: CC-BY-SA with attribution