Question

I am building a component that downloads information from given URLs and parses it into my business classes.

This has to happen in two stages. The pages that are being downloaded contain URLs to a set of further pages which are downloaded in a second stage.

I want all of this to be as parallel as possible and am trying to reduce the overall complexity by using the TPL Dataflow framework.

This is my (simplified) setup:

[Image: diagram of the dataflow setup described in the list below]

  • I post URLs to the buffer block which moves them to the download block.
  • In the download block the HTML is downloaded.
  • The download block has a conditional link to both parse blocks, so the html of Page Type A is moved to "Parse Page A", which is a TransformManyBlock.
  • Parse Page A generates a set of URLs to pages of type B.
  • Those are posted to the Download Block again.
  • Finally the conditional link posts the HTML of Page type B to the last block.

I am reusing the Download Block because I want to limit the maximum number of connections to the server this way by setting MaxDegreeOfParallelism.

The setup would be a lot easier if I simply could use two separate download blocks, but then I would be unable to limit the number of connections this way and still have as many parallel connections as possible.

Now my problem with this setup:

How can I propagate the Completion correctly? I call Complete() on the Buffer Block when I am done posting all URLs. But I cannot propagate this to the download block directly, because it might still be needed for the URLs produced from "Parse Page A" block, even after the buffer block has posted all URLs to it.

But I also cannot couple the Download Block's completion to the completion of both the Buffer Block and the Parse Page A block, because then Parse Page A will never become complete.

I also thought about calling Complete() on "Parse Page A" when the Buffer Block is done, but then there might still be data in the download block which will get rejected by "Parse Page A".

Is there a way out of this circular dilemma?

Or am I on the wrong track completely and should do it in some other fashion?

Solution

You logically have a linear pipeline, so I think that's how you should model it in code too. This means having a separate download block for each type of page. This way, completion will work fine, but you'll have to deal with connection limiting separately.
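
As a rough illustration, a linear layout along these lines might look like the sketch below. `DownloadAsync` and `ExtractLinks` are hypothetical stand-ins for the real download and parsing code, and the class and parameter names are assumptions made for this example. Because every link sets `PropagateCompletion`, calling `Complete()` on the first block is enough to complete the whole chain.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class LinearPipeline
{
    // Hypothetical stand-in for the real HTTP download.
    public static Task<string> DownloadAsync(string url) =>
        Task.FromResult("html(" + url + ")");

    // Hypothetical stand-in for parsing page A into URLs of type-B pages.
    public static IEnumerable<string> ExtractLinks(string html) =>
        new[] { html + "/b1", html + "/b2" };

    public static async Task<List<string>> RunAsync(
        IEnumerable<string> urls, int maxParallelism)
    {
        // Each download block gets its own limit, so together they can
        // run up to 2 * maxParallelism downloads at once.
        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = maxParallelism
        };
        var link = new DataflowLinkOptions { PropagateCompletion = true };

        var results = new List<string>();

        var downloadA = new TransformBlock<string, string>(
            url => DownloadAsync(url), options);
        var parseA = new TransformManyBlock<string, string>(
            html => ExtractLinks(html));
        var downloadB = new TransformBlock<string, string>(
            url => DownloadAsync(url), options);
        // Default MaxDegreeOfParallelism is 1, so the list is not
        // accessed concurrently.
        var parseB = new ActionBlock<string>(html => results.Add(html));

        downloadA.LinkTo(parseA, link);
        parseA.LinkTo(downloadB, link);
        downloadB.LinkTo(parseB, link);

        foreach (var url in urls)
        {
            downloadA.Post(url);
        }

        // Completion flows down the chain once each block has drained.
        downloadA.Complete();
        await parseB.Completion;
        return results;
    }
}
```

Each download block has its own limit in this sketch, so the shared connection limit still has to be handled separately.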

There are two ways I can see how to solve that:

  1. If you're always connecting to the same server, you can limit the number of connections to it by using ServicePoints. You can either set that limit globally at the start of the program:

    ServicePointManager.DefaultConnectionLimit = limit;
    

    or just for the one server:

    ServicePointManager.FindServicePoint(new Uri("http://myserver.com"))
                       .ConnectionLimit = limit;
    
  2. If using ServicePoints won't work for you (because you don't have just one server, because it affects the whole application, …), you can limit the requests manually using something like SemaphoreSlim. The semaphore would be set to your desired limit and it would be shared between the two download blocks.

    MaxDegreeOfParallelism for each block would be set to the same limit (higher value won't add anything, lower value could be inefficient) and their code could look like this:

    // Acquire before entering try, so Release() only runs if the wait succeeded.
    await semaphore.WaitAsync();
    try
    {
        // perform the download
    }
    finally
    {
        semaphore.Release();
    }
    

    If you do need this kind of limiting often, you could create a helper class that encapsulates this logic. Its usage could look like this:

    var factory = new SharedLimitBlockFactory<Input, Output>(
        limit, input => Download(input));
    var downloadBlock1 = factory.CreateBlock();
    var downloadBlock2 = factory.CreateBlock();
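
One possible shape for such a helper is sketched below. Nothing like `SharedLimitBlockFactory` ships with TPL Dataflow; the class, its constructor signature (which assumes the download delegate returns a `Task<TOutput>`) and the choice of `TransformBlock` are all assumptions made for this example. Every block the factory creates waits on the same `SemaphoreSlim`, so the total number of concurrent downloads stays at or below the limit no matter how many blocks exist.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical helper: all blocks created by one factory share a semaphore.
public sealed class SharedLimitBlockFactory<TInput, TOutput>
{
    private readonly int _limit;
    private readonly Func<TInput, Task<TOutput>> _transform;
    private readonly SemaphoreSlim _semaphore;

    public SharedLimitBlockFactory(
        int limit, Func<TInput, Task<TOutput>> transform)
    {
        _limit = limit;
        _transform = transform;
        _semaphore = new SemaphoreSlim(limit, limit);
    }

    public TransformBlock<TInput, TOutput> CreateBlock()
    {
        // A single block may use the whole limit when the others are idle.
        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = _limit
        };

        return new TransformBlock<TInput, TOutput>(async input =>
        {
            // Wait for a free slot shared across all blocks.
            await _semaphore.WaitAsync();
            try
            {
                return await _transform(input);
            }
            finally
            {
                _semaphore.Release();
            }
        }, options);
    }
}
```

With this sketch, the delegate passed to the constructor would need to be asynchronous, for example `input => DownloadAsync(input)` where `DownloadAsync` is whatever method performs the request.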
    
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow