Question

I am trying to be as thorough as I can in this post, as it is very important to me, though the issue itself is very simple; reading just the title of this question should give you the idea.

The question is:

With healthy bandwidth (30 Mb VDSL) available, how is it possible to issue multiple HttpWebRequests for a single piece of data / a single file, so that each request downloads only a portion of the data, and then, when all instances have completed, all parts are joined back into one piece?
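For example (just to make the idea concrete; the URL below is a placeholder), each request would carry a different Range header for the same resource:

    // Illustrative only: one request asking for just the first 1000 bytes.
    // Requires: using System.Net;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/file.bin");
    request.AddRange(0, 999); // sends "Range: bytes=0-999"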

Code:

What I have got working so far is the same idea, except that each task = one HttpWebRequest = a different file, so the speedup comes from plain task parallelism rather than from accelerating one download using multiple tasks/threads, as my question asks.

See the code below.

The next part is a more detailed explanation of, and background on, the subject... if you don't mind reading.

I am still working on a similar project that differs from this one (the one in question) in that it (see the code below) tries to fetch many different data sources, one per separate task (different downloads/files). The speedup was gained because each task does not have to wait for the previous one to complete before it gets a chance to execute.

What I am trying to do in this question (having almost everything ready in the code below) is to target the same URL for the same data, so this time the speedup to gain is for the single task: the current download.

The idea is to implement the same approach as in the code below, only this time letting SmartWebClient target the same URL using multiple instances. Then (only theory for now) each instance would request a partial range of the data, one request per instance.

The last issue is that I need to "put the puzzle back into one piece"... another problem I need to figure out.

As you can see in this code, the only part I have not gotten to work on yet is the data parsing/processing, which I find very easy using HtmlAgilityPack, so it's no problem.

Current code:

Main entry:

        var urlList = new urlsForExtraction().urlsConcrDict();
        var htmlDictionary = new ConcurrentDictionary<string, string>();
        Parallel.ForEach(
                        urlList.Values,
                        new ParallelOptions { MaxDegreeOfParallelism = 20 },
                        url => Download(url, htmlDictionary)
                        );
        foreach (var pair in htmlDictionary)
        {
            // Process(pair);
            MessageBox.Show(pair.Value);
        }

public class urlsForExtraction
{
        const string URL_Dollar = "";
        const string URL_UpdateUsersTimeOut = "";


        public ConcurrentDictionary<string, string> urlsConcrDict()
        {
            // Need a way to enumerate the URL field names instead of adding
            // each one by hand (see the reflection sketch after this class).
            var retDict = new ConcurrentDictionary<string, string>();
            retDict.TryAdd("URL_Dollar", "Any.Url.com");
            retDict.TryAdd("URL_UpdateUserstbl", "http://bing.com");
            return retDict;
        }


}
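A side note on the comment inside urlsConcrDict: the const field names could be enumerated via reflection instead of being registered one by one. A minimal sketch, assuming the URLs stay private const string fields of urlsForExtraction (assumes using System.Linq, System.Reflection, and System.Collections.Concurrent):

    // Enumerate the private const string fields of urlsForExtraction via
    // reflection, so new URLs only need to be declared, not added by hand.
    public static ConcurrentDictionary<string, string> UrlsViaReflection()
    {
        var retDict = new ConcurrentDictionary<string, string>();
        var fields = typeof(urlsForExtraction)
            .GetFields(BindingFlags.NonPublic | BindingFlags.Static)
            .Where(f => f.IsLiteral && f.FieldType == typeof(string));
        foreach (var field in fields)
        {
            // const values live in metadata; GetRawConstantValue reads them
            retDict.TryAdd(field.Name, (string)field.GetRawConstantValue());
        }
        return retDict;
    }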


/// <summary>
/// Second-stage class: consumes the dictionary of URLs for extraction,
/// then downloads each one in a Parallel.ForEach using SmartWebClient (Download()).
/// </summary>
public class InitConcurentHtmDictExtrct
{

    private void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
    {

        using (var webClient = new SmartWebClient())
        {
            webClient.Encoding = Encoding.GetEncoding("UTF-8");
            webClient.Proxy = null;
            htmlDictionary.TryAdd(url, webClient.DownloadString(url));
        }
    }

    private ConcurrentDictionary<string, string> htmlDictionary;
    public ConcurrentDictionary<string, string> LoopOnUrlsVia_SmartWC(Dictionary<string, string> urlList)
    {

        htmlDictionary = new ConcurrentDictionary<string, string>();
        Parallel.ForEach(
                        urlList.Values,
                        new ParallelOptions { MaxDegreeOfParallelism = 20 },
                        url => Download(url, htmlDictionary)
                        );
        return htmlDictionary;

    }
}
/// <summary>
/// The extraction process, done via HtmlAgilityPack:
/// an easy way to collect information from a given HTML document by referencing element attributes.
/// </summary>
public class Results
{
    public struct ExtractionParameters
    {
        public string FileNameToSave;
        public string directoryPath;
        public string htmlElementType;

    }
    public enum Extraction
    {
        ById, ByClassName, ByElementName
    }
    public void ExtractHtmlDict(ConcurrentDictionary<string, string> htmlResults, Extraction by)
    {
        // Helps with easy element extraction from the page.
        HtmlAttribute htAgPcAttrbs;
        HtmlDocument HtmlAgPCDoc = new HtmlDocument();
        // Will hold the name + content of each document part that is eventually extracted;
        // from this container the result page can then be built.
        Dictionary<string, HtmlDocument> dictResults = new Dictionary<string, HtmlDocument>();

        foreach (KeyValuePair<string, string> htmlPair in htmlResults)
        {
            Process(htmlPair);
        }
    }
    private static void Process(KeyValuePair<string, string> pair)
    {
        // do the html processing
    }

}
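To give the Process stub some shape, here is a minimal HtmlAgilityPack sketch of what it might do; the XPath query and console output are hypothetical placeholders, not taken from the original code:

    private static void Process(KeyValuePair<string, string> pair)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(pair.Value); // pair.Value holds the downloaded HTML

        // Hypothetical extraction: list every anchor's href attribute.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return; // SelectNodes returns null when nothing matches
        foreach (var link in links)
            Console.WriteLine(link.GetAttributeValue("href", string.Empty));
    }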
public class SmartWebClient : WebClient
{


    private readonly int maxConcurentConnectionCount;

    public SmartWebClient(int maxConcurentConnectionCount = 20)
    {
        this.Proxy = null;
        this.Encoding = Encoding.GetEncoding("UTF-8");
        this.maxConcurentConnectionCount = maxConcurentConnectionCount;
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        var httpWebRequest = (HttpWebRequest)base.GetWebRequest(address);
        if (httpWebRequest == null)
        {
            return null;
        }

        if (maxConcurentConnectionCount != 0)
        {
            httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
        }

        return httpWebRequest;
    }

}
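For this SmartWebClient to actually open parallel connections, the connection limit matters: the .NET Framework defaults to two concurrent HTTP/1.1 connections per host, which would serialize most of the requests above. Besides the per-request ServicePoint setting in GetWebRequest, there is an app-wide knob; a one-line sketch:

    // Raise the app-wide default (2 per host in .NET Framework) before issuing
    // requests; 20 here mirrors the MaxDegreeOfParallelism used above.
    ServicePointManager.DefaultConnectionLimit = 20;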

The code above allows me to take advantage of the good bandwidth; only I am far from the solution this question is after, and I will really appreciate any clue on where to start.


Solution

If the server supports what Wikipedia calls byte serving, you can multiplex a file download, spawning multiple requests with a specific Range header value (using the AddRange method; see also How to download the data from the server discontinuously?). Most serious HTTP servers do support byte ranges.
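One way to check support up front is to probe the Accept-Ranges header with a HEAD request. A minimal sketch (the helper name is mine; note that some servers honor Range requests without advertising the header, so a negative probe is not conclusive):

    public static bool SupportsByteRanges(string uri)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "HEAD"; // headers only, no body transfer
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            // byte-serving servers typically answer "Accept-Ranges: bytes"
            return "bytes".Equals(response.Headers["Accept-Ranges"],
                                  StringComparison.OrdinalIgnoreCase);
        }
    }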

Here is some sample code that implements a parallel download of a file using byte range:

    public static void ParallelDownloadFile(string uri, string filePath, int chunkSize)
    {
        if (uri == null)
            throw new ArgumentNullException("uri");

        // determine file size first
        long size = GetFileSize(uri);

        using (FileStream file = new FileStream(filePath, FileMode.Create, FileAccess.Write, FileShare.Write))
        {
            file.SetLength(size); // set the length first

            object syncObject = new object(); // synchronize file writes
            // one chunk per range request; round up so a partial last chunk is covered
            Parallel.ForEach(LongRange(0, (size + chunkSize - 1) / chunkSize), (start) =>
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
                request.AddRange(start * chunkSize, start * chunkSize + chunkSize - 1);
                HttpWebResponse response = (HttpWebResponse)request.GetResponse();

                lock (syncObject)
                {
                    using (Stream stream = response.GetResponseStream())
                    {
                        file.Seek(start * chunkSize, SeekOrigin.Begin);
                        stream.CopyTo(file);
                    }
                }
            });
        }
    }

    public static long GetFileSize(string uri)
    {
        if (uri == null)
            throw new ArgumentNullException("uri");

        // HEAD request: only headers are transferred; Content-Length gives the size
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "HEAD";
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            return response.ContentLength;
        }
    }

    private static IEnumerable<long> LongRange(long start, long count)
    {
        // 64-bit equivalent of Enumerable.Range
        for (long i = 0; i < count; i++)
        {
            yield return start + i;
        }
    }
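One caveat in ParallelDownloadFile above: the lock is held while the response stream is copied, so the network read of each chunk is serialized along with the file write. A variant that buffers each chunk in memory first keeps the critical section down to the write alone, trading memory for concurrency; a sketch of the replacement loop body:

    // Download the chunk to a buffer outside the lock...
    byte[] chunk;
    using (Stream stream = response.GetResponseStream())
    using (MemoryStream buffer = new MemoryStream())
    {
        stream.CopyTo(buffer);
        chunk = buffer.ToArray();
    }

    // ...then hold the lock only for the seek + write.
    lock (syncObject)
    {
        file.Seek(start * chunkSize, SeekOrigin.Begin);
        file.Write(chunk, 0, chunk.Length);
    }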

And sample usage:

    private static void TestParallelDownload()
    {
        string uri = "http://localhost/welcome.png";
        string fileName = Path.GetFileName(uri);

        ParallelDownloadFile(uri, fileName, 10000);
    }

PS: I'd be curious to know whether this parallel approach is really worthwhile compared to just using WebClient.DownloadFile... Maybe in slow network scenarios?

Licensed under: CC-BY-SA with attribution