Question

Problem: I have a huge raw text file (assume 3 GB). I need to go through each word in the file and find out how many times each word appears.

My Proposed Solution: Split the huge file into multiple files, with each split file holding words in a sorted manner. For example, all the words starting with "a" would be stored in an "_a.dic" file. So, at any time, we would not exceed 26 files.

The problem in this approach is,

I can use streams to read the file, but I want to use threads to read certain parts of the file. For example, read bytes 0-1024 with a separate thread (having at least 4-8 threads based on the number of processors in the box). Is this possible, or am I dreaming?

Any better approach?

Note: It should be a pure C++ or C based solution. No databases, etc. are allowed.

Solution

You need to look at 'The Practice of Programming' by Kernighan and Pike, and specifically chapter 3.

In C++, use a map based on the strings and a count (std::map<string,size_t>, IIRC). Read the file (once - it's too big to read more than once), splitting it into words as you go (for some definition of 'word'), and incrementing the count in the map entry for each word you find.
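For concreteness, here is a minimal sketch of that approach, assuming that whatever operator>> extracts counts as a "word":

// Minimal sketch of the map-based count: one pass over the file, one entry per word.
#include <cstddef>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1]);
    std::map<std::string, std::size_t> counts;
    std::string word;
    while (in >> word)          // read the file once, word by word
        ++counts[word];         // operator[] default-constructs the count to 0
    for (const auto& kv : counts)
        std::cout << kv.second << '\t' << kv.first << '\n';
}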

In C, you'll have to create the map yourself. (Or find David Hanson's "C Interfaces and Implementations".)

Or you can use Perl, or Python, or Awk (all of which have associative arrays, equivalent to a map).

OTHER TIPS

I don't think using multiple threads that read parts of the file in parallel is going to help much. I would expect that this application is bound to the bandwidth and latency of your hard disk, not the actual word counting. Such a multi-threaded version might actually perform worse because "quasi-random" file access is typically slower than "linear file" access.

If the CPU really is busy in a single-threaded version, there is potential for a speed-up. One thread could read the data in big chunks and put them into a queue of limited capacity. A bunch of other worker threads could each operate on their own chunk and count the words. After the counting worker threads have finished, you have to merge the word counters.
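Here is a rough sketch of that pipeline, assuming a hand-rolled bounded queue guarded by a mutex and condition variables, whitespace tokenization, and chunks built from whole lines so no word is ever split; names such as ChunkQueue are made up for illustration:

// One reader thread fills a bounded queue of chunks; worker threads count words
// into per-worker maps; the maps are merged at the end.
#include <algorithm>
#include <condition_variable>
#include <fstream>
#include <iostream>
#include <map>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

struct ChunkQueue {
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable not_empty, not_full;
    bool done = false;

    void push(std::string chunk) {
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [&] { return q.size() < 8; });   // capacity of 8 chunks
        q.push(std::move(chunk));
        not_empty.notify_one();
    }
    bool pop(std::string& chunk) {
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [&] { return !q.empty() || done; });
        if (q.empty()) return false;                       // done and drained
        chunk = std::move(q.front());
        q.pop();
        not_full.notify_one();
        return true;
    }
    void close() {
        std::lock_guard<std::mutex> lk(m);
        done = true;
        not_empty.notify_all();
    }
};

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    ChunkQueue queue;
    const unsigned nworkers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::map<std::string, std::size_t>> partial(nworkers);
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < nworkers; ++i)
        workers.emplace_back([&, i] {
            std::string chunk, word;
            while (queue.pop(chunk)) {                     // count one chunk at a time
                std::istringstream ss(chunk);
                while (ss >> word) ++partial[i][word];
            }
        });

    std::ifstream in(argv[1]);
    std::string line, chunk;
    while (std::getline(in, line)) {                       // whole lines keep words intact
        chunk += line;
        chunk += '\n';
        if (chunk.size() >= (1u << 20)) {                  // roughly 1 MB per chunk
            queue.push(std::move(chunk));
            chunk.clear();
        }
    }
    if (!chunk.empty()) queue.push(std::move(chunk));
    queue.close();
    for (auto& t : workers) t.join();

    std::map<std::string, std::size_t> total;              // merge the per-worker counts
    for (auto& p : partial)
        for (const auto& kv : p) total[kv.first] += kv.second;
    for (const auto& kv : total)
        std::cout << kv.second << '\t' << kv.first << '\n';
}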

First, decide on the data structure for storing the words.

The obvious choice is a map. But perhaps a trie would serve you better. In each node, you store the count for the word; a count of 0 means it is only a prefix of a word. You can insert into the trie using a stream, reading your file character by character.
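A minimal sketch of such a trie, assuming words consist of ASCII letters only and that nodes are never freed:

#include <cctype>
#include <cstddef>
#include <istream>

struct TrieNode {
    std::size_t count = 0;              // 0 means "only a prefix, not a complete word"
    TrieNode* child[26] = {nullptr};
};

// Read the stream character by character, walking down the trie for each word
// and bumping the count at the node where the word ends.
void count_words(std::istream& in, TrieNode& root) {
    TrieNode* node = nullptr;
    char ch;
    while (in.get(ch)) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (std::isalpha(c)) {
            if (!node) node = &root;
            int idx = std::tolower(c) - 'a';
            if (!node->child[idx]) node->child[idx] = new TrieNode();
            node = node->child[idx];
        } else if (node) {              // hit a non-letter: the current word ends here
            ++node->count;
            node = nullptr;
        }
    }
    if (node) ++node->count;            // last word, if the file doesn't end in whitespace
}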

Second, multithreading: yes or no? This one is not easy to answer. Depending on how large the data structure grows and how you parallelize, the answer may differ.

  1. Single-threaded: straightforward and easy to implement.
  2. Multi-threaded with multiple reader threads and one data structure. Then you have to synchronize access to the data structure. In a trie, you only need to lock the node you are actually in, so multiple readers can access the data structure without much interference. A self-balancing tree might be different, especially when rebalancing.
  3. Multi-threaded with multiple reader threads, each with its own data structure. Each thread builds its own data structure while reading a part of the file. After each one is finished, the results have to be combined (which should be easy).

One thing you have to think about: you have to find a word boundary for each thread to start at, but that should not pose a great problem (e.g., each thread advances from its start offset to the first word boundary and begins there, and at the end each thread finishes the word it is working on).
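A minimal sketch of that boundary rule, assuming the data (or a chunk of it) is already in memory and whitespace delimits words; count_range is a made-up helper that a thread would call on its own slice:

#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Count words in data[begin, end): skip a partial word at the start (the previous
// range owns it) and finish a word that straddles `end` (running up to `size`).
std::map<std::string, std::size_t>
count_range(const char* data, std::size_t begin, std::size_t end, std::size_t size) {
    std::map<std::string, std::size_t> counts;
    if (begin != 0 && !std::isspace((unsigned char)data[begin - 1]))
        while (begin < end && !std::isspace((unsigned char)data[begin]))
            ++begin;                                           // skip the straddling word

    std::size_t i = begin;
    while (i < end) {
        while (i < end && std::isspace((unsigned char)data[i])) ++i;   // skip gaps
        if (i >= end) break;
        std::size_t start = i;
        // A word that begins inside the range may finish past `end`, up to `size`.
        while (i < size && !std::isspace((unsigned char)data[i])) ++i;
        ++counts[std::string(data + start, i - start)];
    }
    return counts;
}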

While you can use a second thread to analyze the data after reading it, you're probably not going to gain a huge amount by doing so. Trying to use more than one thread to read the data will almost certainly hurt speed rather than improving it. Using multiple threads to process the data is pointless -- processing will be many times faster than reading, so even with only one extra thread, the limit is going to be the disk speed.

One (possible) way to gain significant speed is to bypass the usual iostreams -- while some are nearly as fast as using C FILE*'s, I don't know of anything that's really faster, and some are substantially slower. If you're running this on a system (e.g. Windows) that has an I/O model that's noticeably different from C's, you can gain considerably more with a little care.

The problem is fairly simple: the file you're reading is (potentially) larger than the cache space you have available -- but you won't gain anything from caching, because you're not going to reread chunks of the file again (at least if you do things sensibly). As such, you want to tell the system to bypass any caching, and just transfer data as directly as possible from the disk drive to your memory where you can process it. In a Unix-like system, that's probably open() and read() (and won't gain you a whole lot). On Windows, that's CreateFile and ReadFile, passing the FILE_FLAG_NO_BUFFERING flag to CreateFile -- and it'll probably roughly double your speed if you do it right.
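On a Unix-like system, a minimal sketch of that low-level flavor might look like the following; the posix_fadvise hint is optional and POSIX-specific, and the Windows FILE_FLAG_NO_BUFFERING route is analogous but additionally requires sector-aligned buffers:

// Read the file in large chunks straight through open()/read(), avoiding iostream overhead.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

void scan_file(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);    // hint: strictly sequential access
    std::vector<char> buf(1u << 20);                   // 1 MiB per read() call
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0) {
        // ... tokenize buf[0 .. n) here, carrying any partial word at the
        //     end of the buffer over to the next iteration ...
    }
    close(fd);
}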

You've also gotten some answers advocating doing the processing using various parallel constructs. I think these are fundamentally mistaken. Unless you do something horribly stupid, the time to count the words in the file will be only a few milliseconds longer than it takes to simply read the file.

The structure I'd use would be to have two buffers of, say, a megabyte apiece. Read data into one buffer. Turn that buffer over to your counting thread to count the words in that buffer. While that's happening, read data into the second buffer. When those are done, basically swap buffers and continue. There is a little bit of extra processing you'll need to do in swapping buffers to deal with a word that may cross the boundary from one buffer to the next, but it's pretty trivial (basically, if the buffer doesn't end with white space, you're still in a word when you start operating on the next buffer of data).
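A rough sketch of that two-buffer scheme, assuming whitespace-delimited words and a 1 MB buffer size chosen purely for illustration; only one worker thread is ever alive at a time, so the shared map and the carried-over word need no locking:

// Main thread reads into one buffer while a worker counts the other; a word that
// straddles a buffer boundary is carried over as a prefix for the next round.
#include <cstddef>
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1], std::ios::binary);
    const std::size_t kBuf = 1u << 20;
    std::vector<char> bufA(kBuf), bufB(kBuf);
    std::vector<char>* front = &bufA;               // buffer being filled
    std::vector<char>* back  = &bufB;               // buffer being counted
    std::map<std::string, std::size_t> counts;
    std::string carry;                              // partial word straddling buffers
    std::thread counter;

    for (;;) {
        in.read(front->data(), (std::streamsize)kBuf);   // overlaps with the worker below
        std::size_t got = (std::size_t)in.gcount();
        if (counter.joinable()) counter.join();          // previous buffer fully counted
        if (got == 0) break;
        std::swap(front, back);                          // fresh data is now in *back

        counter = std::thread([back, got, &counts, &carry] {
            std::string text = carry + std::string(back->data(), got);
            std::size_t cut = text.find_last_of(" \t\r\n");
            carry = (cut == std::string::npos) ? text : text.substr(cut + 1);
            if (cut != std::string::npos) text.resize(cut + 1); else text.clear();
            std::istringstream ss(text);
            std::string w;
            while (ss >> w) ++counts[w];                 // only this thread touches counts
        });
    }
    if (counter.joinable()) counter.join();
    if (!carry.empty()) ++counts[carry];                 // last word of the file
}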

As long as you're sure it'll only be used on a multi-processor (multi-core) machine, using real threads is fine. If there's a chance this might ever be done on a single-core machine, you'd be somewhat better off using a single thread with overlapped I/O instead.

As others have indicated, the bottleneck will be the disk I/O. I therefore suggest that you use overlapped I/O. This basically inverts the program logic. Instead of your code trying to determine when to do I/O, you simply tell the operating system to call your code whenever it has finished a bit of I/O. If you use I/O completion ports, you can even tell the OS to use multiple threads for processing the file chunks.

A C-based solution?

I think Perl was born for this exact purpose.

A stream has only one cursor. If you access the stream with more than one thread at a time, you cannot be sure of reading where you want, because each read starts from the current cursor position.

What I would do is have only one thread (maybe the main one) read the stream and dispatch the bytes it reads to other threads.

For example:

  • Thread #i is ready and asks the main thread to give it the next part,
  • the main thread reads the next 1 MB and provides it to thread #i,
  • thread #i reads the 1 MB and counts words as you want,
  • thread #i finishes its work and asks again for the next 1 MB.

This way you can separate stream reading from stream analysis.

What you are looking for is a regex. This Stack Overflow thread on C++ regex engines should help:

C++: what regex library should I use?

First, I'm pretty sure that C/C++ isn't the best way to handle this. Ideally, you'd use some map/reduce for parallelism, too.

But, assuming your constraints, here's what I'd do.

1) Split the text file into smaller chunks. You don't have to do this by the first letter of the word. Just break them up into, say, 5000-word chunks. In pseudocode, you'd do something like this:

index = 0
numwords = 0
mysplitfile = openfile(index + "-split.txt")
while (bigfile >> word)
    mysplitfile << word
    numwords++
    if (numwords > 5000)
        mysplitfile.close()
        index++
        numwords = 0
        mysplitfile = openfile(index + "-split.txt")

2) Use a shared map data structure and pthreads to spawn new threads to read each of the subfiles. Again, pseudocode:

maplock = create_pthread_lock()
sharedmap = std::map()

for every index-split.txt file:
    spawn-new-thread(myfunction, filename, sharedmap, maplock)

join all spawned threads
dump_map(sharedmap)

void myfunction(filename, sharedmap, lock) {
    localmap = std::map<string, size_t>()
    file = openfile(filename)

    while (file >> word)
        if !localmap.contains(word)
            localmap[word] = 0
        localmap[word]++

    acquire(lock)
    for key, value in localmap
        if !sharedmap.contains(key)
            sharedmap[key] = 0
        sharedmap[key] += value
    release(lock)
}

Sorry for the syntax. I've been writing a lot of python lately.

Not C, and a bit UGLY, but it took only 2 minutes to bang out:

perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq

Loop over each line with -n
Split each line into @F words with -a
Each $_ word increments hash %h
Once the END of file has been reached,
sort the hash by the frequency $h{$b}<=>$h{$a}
If two frequencies are identical, sort alphabetically $a cmp $b
Print the frequency $h{$w} and the word $w
Redirect the results to file 'freq'

I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in 173 seconds.

My input file already had punctuation stripped out, and uppercase converted to lowercase, using this bit of code:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file
(runtime of 144 seconds)


The word-counting script could alternately be written in awk:
awk '{for (i=1; i<=NF; i++){h[$i]++}} END{for (w in h){printf("%s\t%s\n", h[w], w)}}' file | sort -rn > freq

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow