Question

I have a website which offers pages in the format https://www.example.com/X, where X is a sequential, unique number that increases by one every time a user creates a page and is never reused, even if the user later deletes their page. Since the site doesn't offer a quick and painless way to know which of those pages are still up, I resorted to checking them one by one, contacting each through an HttpClient and inspecting the HttpResponseMessage.StatusCode for a 200 or 404 HTTP code. My main method is as follows:

    private async Task CheckIfPageExistsAsync(int PageId)
    {
        string address = $"{ PageId }";
        try
        {
            var result = await httpClient.GetAsync(address);

            Console.WriteLine($"{ PageId } - { result.StatusCode }");

            if (result.StatusCode == HttpStatusCode.OK)
            {
                ValidPagesChecked.Add(PageId);
            }
        }
        // Code for HttpClient timeout handling
        catch (Exception)
        {
            Console.WriteLine($"Failed ID: { PageId }");
        }
    }

This code is called like this in order to have a certain degree of parallelism:

    public void Test()
    {
        var tasks = new ConcurrentBag<Task>();
        var lastId = GetLastPageIdChecked();

        // Opens 30 requests at a time because I found that's the upper limit
        // before getting hit by the rate limiter and receiving 429 errors
        Parallel.For(lastId + 1, lastId + 31, i =>
        {
            tasks.Add(CheckIfPageExistsAsync(i));
        });
        Task.WaitAll(tasks.ToArray());

        lastId += 30;
        Console.WriteLine("STEP");

        WriteLastPageIdChecked(lastId);
        WriteValidPageIdsList();
    }

Now, from what I understand, starting the tasks through Parallel should let the program manage for itself how many concurrent threads are active at any one time, and adding them all to a ConcurrentBag lets me wait for all of them to finish before moving on to the next batch of pages to check. Since this whole operation is incredibly expensive time-wise, I'd like to know whether I've chosen a good approach with regard to parallelism and asynchronous methods.

Solution

The first rule when talking about performance is to measure. The built-in tools in Visual Studio are fairly competent, and in a pinch, adding a Stopwatch or two can help reveal which part of the code is the bottleneck.
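For instance, here is a minimal sketch of timing one batch with System.Diagnostics.Stopwatch; CheckBatchAsync is only a stand-in for whatever runs a batch of checks:

    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;

    var stopwatch = Stopwatch.StartNew();

    await CheckBatchAsync();   // one batch of page checks

    stopwatch.Stop();
    Console.WriteLine($"Batch took {stopwatch.ElapsedMilliseconds} ms");

    // Stand-in so the sketch compiles on its own; replace with the real batch logic.
    static async Task CheckBatchAsync() => await Task.Delay(500);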

Just looking at your code, I would expect each iteration of the parallel loop to complete very quickly, since httpClient.GetAsync should do fairly little work before it returns its task. So the majority of the time would be spent in the Task.WaitAll call. Therefore I would try replacing the Parallel.For with a regular loop and see if the parallelization is worth the effort. The actual web calls will still run concurrently thanks to the async calls, as long as they are not awaited until the end.
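A sketch of that suggestion, reusing CheckIfPageExistsAsync and the batch of 30 from the question (lastId is assumed to come from GetLastPageIdChecked as before):

    var tasks = new List<Task>();

    // A plain loop is enough here: each call only starts the request and
    // returns a Task almost immediately, so all 30 requests still overlap.
    for (int i = lastId + 1; i <= lastId + 30; i++)
    {
        tasks.Add(CheckIfPageExistsAsync(i));
    }

    // Await the whole batch; this is where the time is actually spent.
    await Task.WhenAll(tasks);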

A rule of thumb is that Parallel.For/ForEach is most useful when you are CPU limited, while async is most useful when you are IO limited.
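To illustrate the distinction, here is a small self-contained sketch; the squaring loop and the two example URLs are made up purely for the contrast:

    using System;
    using System.Linq;
    using System.Net.Http;
    using System.Threading.Tasks;

    class RuleOfThumbDemo
    {
        private static readonly HttpClient httpClient = new HttpClient();

        static async Task Main()
        {
            // CPU-limited: Parallel.For spreads the computation across cores.
            var inputs = Enumerable.Range(1, 1_000_000).ToArray();
            var squares = new long[inputs.Length];
            Parallel.For(0, inputs.Length, i =>
            {
                squares[i] = (long)inputs[i] * inputs[i];   // pure computation, no waiting
            });

            // IO-limited: async keeps many requests in flight on very few threads.
            var urls = new[] { "https://www.example.com/1", "https://www.example.com/2" };
            var responses = await Task.WhenAll(urls.Select(url => httpClient.GetAsync(url)));

            Console.WriteLine($"Computed {squares.Length} squares, received {responses.Length} responses.");
        }
    }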

I would also recommend taking a look at SemaphoreSlim for limiting the number of concurrent requests. See How to limit the amount of concurrent async I/O operations for details. I would try something like this and see if it makes a difference.

    var semaphore = new SemaphoreSlim(MaxConcurrentCalls);
    var tasks = new List<Task>();
    for (int i = 0; i < NumberOfPagesToCheck; i++)
    {
        await semaphore.WaitAsync();
        int pageId = i;   // copy so the lambda doesn't capture the changing loop variable
        tasks.Add(Task.Run(async () =>
        {
            try
            {
                await CheckIfPageExistsAsync(pageId);
            }
            finally
            {
                // Release only after the request has finished, so at most
                // MaxConcurrentCalls requests are in flight at any time.
                semaphore.Release();
            }
        }));
    }
    await Task.WhenAll(tasks);
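As a usage sketch, the Test method from the question could then look something like this, keeping the question's GetLastPageIdChecked, WriteLastPageIdChecked and WriteValidPageIdsList helpers and its batch size of 30; the MaxConcurrentCalls value is just an assumption to tune against the rate limiter:

    public async Task TestAsync()
    {
        const int BatchSize = 30;           // same batch size as the original code
        const int MaxConcurrentCalls = 10;  // assumption: tune this against the 429 responses

        var semaphore = new SemaphoreSlim(MaxConcurrentCalls);
        var tasks = new List<Task>();
        var lastId = GetLastPageIdChecked();

        for (int i = lastId + 1; i <= lastId + BatchSize; i++)
        {
            await semaphore.WaitAsync();
            int pageId = i;
            tasks.Add(Task.Run(async () =>
            {
                try
                {
                    await CheckIfPageExistsAsync(pageId);
                }
                finally
                {
                    semaphore.Release();   // frees a slot only once the request is done
                }
            }));
        }

        await Task.WhenAll(tasks);

        Console.WriteLine("STEP");
        WriteLastPageIdChecked(lastId + BatchSize);
        WriteValidPageIdsList();
    }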
Licensed under: CC-BY-SA with attribution