Question

Problem Statement:

I need to search for a particular string pattern in around 10,000 files and find the records in those files that contain the pattern. I can use grep here, but it is taking a lot of time.

Below is the command I am using to search for the string pattern after unzipping the dat.gz files:

gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'

If I simply count how many files there are after unzipping the above dat.gz files:

gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | wc -l

I get around 10,000 files, and I need to search all of them for the above string pattern and find the records that contain it. My command above works fine, but it is very, very slow.

What is the best approach here? Should I take 100 files at a time and search for the pattern in those 100 files in parallel?

Note:

I am running SunOS

bash-3.00$ uname -a
SunOS lvsaishdc3in0001 5.10 Generic_142901-02 i86pc i386 i86pc

Solution

Do NOT run this in parallel! That will bounce the disk head all over the place, and it will be much slower.

Since you are reading archive files, there is one way to get a substantial performance boost: don't write the results of the decompression out to disk. The ideal answer is to decompress to a stream in memory; if that's not viable, decompress to a ramdisk.

In any case you do want some parallelism here: one thread should be obtaining the data and handing it off to another that does the search. That way you will either be waiting on the disk or on the core doing the decompressing, and you won't waste any of that time doing the search.

(Note that in the case of the ramdisk you will want to aggressively read the files written to it and then delete them so the ramdisk doesn't fill up.)
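
As a rough sketch of that decompress-to-ramdisk idea, assuming /tmp on this Solaris 10 box is the usual swap-backed tmpfs (effectively a ramdisk), and reusing the path and pattern from the question:

# Sketch only: decompress one archive at a time into the ramdisk,
# search it, then delete it immediately so the ramdisk never fills up.
for f in /data/newfolder/real-time-newdata/*_20120809_0_*.gz; do
    tmp=/tmp/`basename "$f" .gz`
    gzcat "$f" > "$tmp"
    grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' "$tmp"
    rm -f "$tmp"
done

Note that the question's original gzcat ... | grep pipeline already gives the producer/consumer overlap described above, since the pipe lets grep search one block while gzcat decompresses the next; the ramdisk variant mainly helps if you need the decompressed files on "disk" for some other reason.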

Other tips

For starters, you will need to uncompress the files to disk.

This does work (in bash), but you probably don't want to try to start 10,000 processes all at once. Run it inside the uncompressed directory:

for i in `find . -type f`; do ((grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' "$i") &); done

So, we need to have a way to limit the number of spawned processes. This will loop as long as the number of grep processes running on the machine exceeds 10 (including the one doing the counting):

while [ `top -b -n1 | grep -c grep` -gt 10  ]; do echo true; done

I have run this, and it works... but top takes so long to run that it effectively limits you to one grep per second. Can someone improve on this, incrementing a counter when a new process is started and decrementing it when a process ends?

for i in `find . -type f`; do ((grep -l 'blah' "$i") &); (while [ `top -b -n1 | grep -c grep` -gt 10 ]; do sleep 1; done); done

Any other ideas for how to determine when to sleep and when not to? Sorry for the partial solution, but I hope someone has the other bit you need.
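
One possible way to get that counter without polling top is to let the shell track its own background jobs. This is only a sketch, assuming bash (the question's shell), reusing the pattern from the question and the same cap of 10 concurrent greps:

# Sketch: throttle background greps by counting this shell's own jobs.
# "jobs -pr" prints the PIDs of the still-running background jobs, so the
# count drops on its own as greps finish.
max_jobs=10
for i in `find . -type f`; do
    grep -l 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' "$i" &
    while [ `jobs -pr | wc -l` -ge $max_jobs ]; do
        sleep 1
    done
done
wait   # let the last batch of greps finish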

If you are not using regular expressions you can use the -F option of grep or use fgrep. This may provide you with additional performance.
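
For example, reusing the question's pipeline with fgrep (on Solaris 10 the stock /usr/bin/grep may not accept -F, but fgrep and /usr/xpg4/bin/grep -F treat the pattern as a literal string):

gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | fgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'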

Your gzcat ... | wc -l does not indicate 10,000 files; it indicates roughly 10,000 lines in total across however many files there are.
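
If you want the number of files rather than the number of lines, counting the glob matches directly is one way to check, e.g.:

# Counts the archives matched by the glob, not the decompressed lines.
ls /data/newfolder/real-time-newdata/*_20120809_0_*.gz | wc -l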

This is the type of problem that xargs exists for. Assuming your version of gzip came with a script called gzgrep (or maybe just zgrep), you can do this:

find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'

That will run one gzgrep command with batches of as many individual files as it can fit on a command line (there are options to xargs to limit how many, or for a number of other things). Unfortunately, gzgrep still has to uncompress each file and pass it off to grep, but there's not really any good way to avoid having to uncompress the whole corpus in order to search through it. Using xargs in this way will however cut down some on the overall number of new processes that need to be spawned.
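
If you do want to cap the batch size, say at the 100 files per invocation the question suggested, the POSIX -n option to xargs does exactly that; this is just the same command with the limit added:

# Run gzgrep on at most 100 files per invocation.
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print |
    xargs -n 100 gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'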

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow