How to download faster?

https://stackoverflow.com/questions/8029025

22-02-2021

Question

What is the fastest way to download webpage source into a memo component? I use Indy and HttpCli components.

The problem is that I have a listbox filled with more than 100 sites. My program downloads each site's source into a memo and parses that source for mp3 files. It is something like a Google music search program; it uses Google queries to make Google search easier.

I started reading about threads, which led to my question: Can I create an IdHttp instance in a thread with a parsing function and tell it to parse half of the sites in the listbox?

So basically when a user clicks parse, the main thread should do:

for i := 0 to listbox1.items.count div 2 - 1 do
    get and parse

and the other thread should do:

for i := form1.listbox1.items.count div 2 to form1.listbox1.items.count - 1 do
    get and parse

so that both would add parsed content to form1.listbox2 at the same time. Or is it maybe easier to start two IdHttp instances in the main thread: one for the first half of the sites and the other for the second?

For this: should I use Indy or Synapse?

Solution

I would create a thread that can read a single URL and process its content. You can then decide how many of those threads you want to run at the same time. Your computer will allow quite a number of connections, so if those 100 sites have different hostnames, it is not a problem to run 10 or 20 at the same time. Too many is overkill, but too few leaves the processor idle while it waits on the network.
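
A minimal sketch of such a one-URL thread, assuming Indy 10's TIdHTTP and a Delphi version whose TThread has Start (Delphi 2010 or later; older versions would call Resume instead). The class name TFetchThread, its properties, and the two example URLs are illustrative and not from the original post.

```pascal
program SingleUrlThreads;

{$APPTYPE CONSOLE}

uses
  Classes, SysUtils, IdHTTP;

type
  // One thread downloads one URL and keeps the page source in memory.
  TFetchThread = class(TThread)
  private
    FUrl: string;
    FSource: string;
    FError: string;
  protected
    procedure Execute; override;
  public
    constructor Create(const AUrl: string);
    property Url: string read FUrl;
    property Source: string read FSource;
    property Error: string read FError;
  end;

constructor TFetchThread.Create(const AUrl: string);
begin
  inherited Create(True);   // created suspended; the caller starts it
  FUrl := AUrl;
end;

procedure TFetchThread.Execute;
var
  Http: TIdHTTP;
begin
  // Each thread owns its own TIdHTTP instance; nothing is shared.
  Http := TIdHTTP.Create(nil);
  try
    try
      FSource := Http.Get(FUrl);
    except
      on E: Exception do
        FError := E.Message;   // keep the failure for the caller to report
    end;
  finally
    Http.Free;
  end;
end;

const
  Urls: array[0..1] of string = ('http://example.com/', 'http://example.org/');
var
  Threads: array[0..1] of TFetchThread;
  I: Integer;
begin
  // Start all downloads first so they run concurrently.
  for I := Low(Urls) to High(Urls) do
  begin
    Threads[I] := TFetchThread.Create(Urls[I]);
    Threads[I].Start;
  end;
  // Then collect the results one by one.
  for I := Low(Threads) to High(Threads) do
  begin
    Threads[I].WaitFor;
    if Threads[I].Error <> '' then
      WriteLn(Threads[I].Url, ': failed: ', Threads[I].Error)
    else
      WriteLn(Threads[I].Url, ': ', Length(Threads[I].Source), ' characters');
    Threads[I].Free;
  end;
end.
```

The main program here only waits and prints; in a GUI application the wait would be replaced by some notification back to the main thread.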

You can tweak this process even further by having separate threads for downloading and processing, so that you can have a number of threads constantly downloading content. Downloading is not very processor intensive. It is basically waiting for a response, so you can easily have a relatively large number of download threads, while a couple of other worker threads can grab items from the pool of results and process them.
But splitting downloading and processing will make it a little bit more complex, and I don't think you're up to that challenge yet.

Because currently, you have some other problems. First, you should not access VCL components from a worker thread. If you need information from a listbox in a thread, you will either need to use Synchronize in the thread to make a 'safe' call to the main thread, or you will have to pass the information needed before you start the thread. The latter is more efficient, because code executed using Synchronize actually runs in the main thread, making your multi-threading less efficient.
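
The second option, passing the information before the thread starts, could look roughly like this. It is only a sketch: TParseThread is a made-up name, the TStringList called Items stands in for ListBox1.Items, and the download itself is left as a comment. Start assumes Delphi 2010 or later.

```pascal
program SnapshotUrls;

{$APPTYPE CONSOLE}

uses
  Classes, SysUtils;

type
  // Works on a private copy of the URL list, so it never touches the UI.
  TParseThread = class(TThread)
  private
    FUrls: TStringList;
    FHandled: Integer;
  protected
    procedure Execute; override;
  public
    constructor Create(AUrls: TStrings);
    destructor Destroy; override;
    property Handled: Integer read FHandled;
  end;

constructor TParseThread.Create(AUrls: TStrings);
begin
  inherited Create(True);   // created suspended; the caller starts it
  FUrls := TStringList.Create;
  FUrls.Assign(AUrls);      // the copy is made in the calling (main) thread
end;

destructor TParseThread.Destroy;
begin
  inherited Destroy;        // waits for Execute to finish
  FUrls.Free;               // only then release the private copy
end;

procedure TParseThread.Execute;
var
  I: Integer;
begin
  for I := 0 to FUrls.Count - 1 do
  begin
    if Terminated then
      Break;
    // A real thread would download and parse FUrls[I] here.
    Inc(FHandled);
  end;
end;

var
  Items: TStringList;   // stands in for ListBox1.Items
  Worker: TParseThread;
begin
  Items := TStringList.Create;
  try
    Items.Add('http://example.com/a');
    Items.Add('http://example.com/b');
    Worker := TParseThread.Create(Items);
    try
      Worker.Start;
      Worker.WaitFor;
      WriteLn('Handled ', Worker.Handled, ' URLs');
    finally
      Worker.Free;
    end;
  finally
    Items.Free;
  end;
end.
```

Because the constructor runs in the thread that calls it, the Assign call reads the list from the main thread, and Execute never needs Synchronize to get at the URLs.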

But my attention was actually drawn to the first line, "download webpage source into a memo component". Don't do that! Don't load those results in a memo for processing. Automatic processing is best done in memory, outside of visual controls. Using strings, streams, or even stringlists for processing text is much faster than using a memo.
A stringlist has some overhead as well, but it indexes the lines in the same way (TMemoStrings, which is the Lines property of a memo, and TStringList both descend from TStrings), so if you have code that relies on this, it will be quite easy to convert it to TStringList.
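
As an illustration of parsing in memory, here is a rough sketch that scans a page held in a plain string and collects links ending in .mp3 into a TStrings. The procedure name ExtractMp3Links and the very simple href="..." matching are assumptions made for the example; a real page would need a more tolerant parser.

```pascal
program ParseInMemory;

{$APPTYPE CONSOLE}

uses
  Classes, SysUtils, StrUtils;

// Collects every href="..." value that ends in .mp3 (case-insensitive).
procedure ExtractMp3Links(const Html: string; Links: TStrings);
const
  Marker = 'href="';
var
  Lower, Link: string;
  StartPos, EndPos: Integer;
begin
  // Search in a lower-cased copy so HREF and href both match;
  // LowerCase only changes ASCII letters, so positions stay aligned.
  Lower := LowerCase(Html);
  StartPos := PosEx(Marker, Lower, 1);
  while StartPos > 0 do
  begin
    Inc(StartPos, Length(Marker));           // first character of the URL
    EndPos := PosEx('"', Lower, StartPos);   // closing quote
    if EndPos = 0 then
      Break;                                 // unterminated attribute
    Link := Copy(Html, StartPos, EndPos - StartPos);
    if AnsiEndsText('.mp3', Link) then
      Links.Add(Link);
    StartPos := PosEx(Marker, Lower, EndPos + 1);
  end;
end;

var
  Links: TStringList;
  I: Integer;
begin
  Links := TStringList.Create;
  try
    ExtractMp3Links(
      '<a href="http://a.example/song.mp3">x</a>' +
      '<a HREF="page.html">y</a>' +
      '<a href="b.MP3">z</a>', Links);
    for I := 0 to Links.Count - 1 do
      WriteLn(Links[I]);
  finally
    Links.Free;
  end;
end.
```

No visual control is involved at any point, so the same procedure can be called from a worker thread as long as each thread passes its own TStrings.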

OTHER TIPS

I would suggest doing ALL of the parsing in threads; don't have the main thread do any parsing at all. The main thread should only manage the UI. Don't parse the HTML from a TMemo. Have each thread download to a TStream or String and then parse from that directly. Use TIdSync or TIdNotify to send parsing results to the UI for display (if speed is important, use TIdNotify). Involving the UI components in your parsing logic will slow it down.
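
A rough sketch of the TIdNotify route, assuming Indy 10's IdSync unit, where a TIdNotify descendant overrides DoNotify and, as far as I know, is freed by Indy after the notification has run. The unit name, the class TResultNotify, and its fields are invented for this example.

```pascal
unit ResultNotify;

interface

uses
  Classes, IdSync;

type
  // Carries one parsed link from a worker thread to the main thread.
  TResultNotify = class(TIdNotify)
  private
    FTarget: TStrings;   // e.g. the Items of the result listbox
    FLink: string;
  protected
    procedure DoNotify; override;   // runs in the main thread
  public
    constructor Create(ATarget: TStrings; const ALink: string); reintroduce;
  end;

implementation

constructor TResultNotify.Create(ATarget: TStrings; const ALink: string);
begin
  inherited Create;
  FTarget := ATarget;
  FLink := ALink;
end;

procedure TResultNotify.DoNotify;
begin
  // Safe to touch the UI here, because this is not the worker thread.
  FTarget.Add(FLink);
end;

end.
```

A worker thread would then call something like TResultNotify.Create(ResultList, Link).Notify, where ResultList is a TStrings reference handed to the thread before it started. Notify returns without waiting for the main thread, which is why it is the faster of the two options.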

Indy and Synapse are both multi-thread ready. I'd recommend using Synapse, which is much lighter than Indy and will be sufficient for your purpose. Do not forget about the HTTP APIs provided by Microsoft.

Simple implementation:

One thread per URI;
Each thread gets the data using one HTTP communication;
Then each thread parses the data;
Then use Synchronize to refresh the UI.

Perhaps my favorite:

Define a maximum number of threads to be used (e.g. 8);
Each of these threads will maintain a remote connection (this is the purpose of HTTP/1.1 and can really make a difference in speed);
All requests are retrieved by those threads one by one: do not pre-assign URLs to threads, but retrieve a new URL from a global list when a thread has finished one (not every URL takes the same time);
The threads may wait until another URI is added to the global list (using a Sleep(100) or a semaphore, for example);
Then parse and update the UI in the main GUI thread, using a dedicated Windows message (WM_USER+...). Parsing will be fast IMHO, but remember that UI refresh can be slow (take a look at the BeginUpdate and EndUpdate methods, for instance). I found that posting a message (with the associated HTML data) is more efficient than using Synchronize, which blocks the background thread;
Another option is to do the parsing in the background thread, just after having retrieved the data from its URI. This is perhaps not worth it (only if your parser is slow), and you may run into multi-threading issues if your parser/data processor is not 100% thread-safe.

The second approach is how popular "download managers" are implemented.

When you deal with multithreading, you'll have to "protect" your shared resources (lists, e.g.). Use a TCriticalSection to access any global list (e.g. the URI list), and release the lock as soon as possible.
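
The fixed-size pool pulling from a locked global list can be sketched as follows. This is only an outline under assumptions: the names UrlLock, PendingUrls, TakeUrl and TWorker are invented, Sleep(10) stands in for the real download, and the workers simply exit when the list is empty instead of waiting for new URIs. Start assumes Delphi 2010 or later.

```pascal
program UrlPool;

{$APPTYPE CONSOLE}

uses
  Classes, SysUtils, SyncObjs;

var
  UrlLock: TCriticalSection;   // guards PendingUrls
  PendingUrls: TStringList;    // the global list every worker pulls from

// Removes and returns the next URL; returns False when the list is empty.
function TakeUrl(out AUrl: string): Boolean;
begin
  UrlLock.Acquire;
  try
    Result := PendingUrls.Count > 0;
    if Result then
    begin
      AUrl := PendingUrls[0];
      PendingUrls.Delete(0);
    end;
  finally
    UrlLock.Release;   // held only for the list access itself
  end;
end;

type
  TWorker = class(TThread)
  private
    FHandled: Integer;
  protected
    procedure Execute; override;
  public
    property Handled: Integer read FHandled;
  end;

procedure TWorker.Execute;
var
  Url: string;
begin
  // URLs are not pre-assigned: each worker takes the next one when it is free.
  while (not Terminated) and TakeUrl(Url) do
  begin
    Sleep(10);        // stand-in for downloading Url, done outside the lock
    Inc(FHandled);
  end;
end;

var
  Workers: array[0..3] of TWorker;
  I, Total: Integer;
begin
  UrlLock := TCriticalSection.Create;
  PendingUrls := TStringList.Create;
  try
    // No worker is running yet, so the list can be filled without the lock.
    for I := 1 to 20 do
      PendingUrls.Add('http://example.com/page' + IntToStr(I));
    for I := Low(Workers) to High(Workers) do
    begin
      Workers[I] := TWorker.Create(True);
      Workers[I].Start;
    end;
    Total := 0;
    for I := Low(Workers) to High(Workers) do
    begin
      Workers[I].WaitFor;
      Inc(Total, Workers[I].Handled);
      Workers[I].Free;
    end;
    WriteLn('Handled ', Total, ' of 20 URLs');
  finally
    PendingUrls.Free;
    UrlLock.Free;
  end;
end.
```

Since every URL is removed from the list under the lock, each one is handled by exactly one worker, however the work happens to be distributed among them.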

And try to test your implementation with several computers and networks, concurrent access, and diverse operating systems. Debugging multi-threaded applications can be difficult, so the simpler the implementation, the better. That is the reason why I recommend making the download part multi-threaded, but letting the main thread process the data (which won't be huge, so it should be fast).

Licensed under: CC-BY-SA with attribution
