Question

(Please don't suggest a Hadoop or MapReduce solution, even though it sounds like a logical fit.)

I have a big file, 70 GB of raw HTML, and I need to parse it to extract the information I need.

I have successfully dealt with a 10 GB file before using standard I/O:

cat input_file | python parse.py > output_file

My Python script reads one HTML document per line from standard input and writes the result to standard output.

from bs4 import BeautifulSoup
import sys

for line in sys.stdin:
    soup = BeautifulSoup(line)   # one HTML document per input line
    print soup.get_text()        # print whatever information is needed

The code is very simple, but processing this big file on a single node is horribly slow. I have a cluster of about 20 nodes, and I am wondering how I could easily distribute this work.

What I have done so far:

split -l 5000 input_file.all input_file_   # I have 60K lines in total in that 70G file

Now the big file has been split into several smaller files:

input_file_aa
input_file_ab
input_file_ac
...

Then I have no problem working with each one of them:

cat input_file_aa | python parser.py > output_file_aa 

What I would probably do is scp the input files to each node, do the parsing, and then scp the results back, but there are 10+ nodes and doing that manually is tedious.

How could I easily distribute these files to the other nodes, do the parsing, and move the results back?

I am open to basic shell, Java, or Python solutions. Thanks a lot in advance, and let me know if you need more explanation.

Note: I do have a folder called /bigShare/ that is accessible on every node, and its contents are synchronized and stay the same. I don't know how the architect implemented that (NFS? I don't know how to check), but I could put my input files and Python script there, so what remains is how to easily log into those nodes and execute the commands. By the way, I am on Red Hat.

Solution

Execute the command remotely, piping its output to stdout, and have the local ssh invocation redirect that output into a local file.

Example:

ssh yourUserName@node1 "cat input_file_node1 | python parser.py" >output_file_node1

If the files have not been copied to the different nodes, then:

ssh yourUserName@node1 "python parser.py" <input_file_node1 >output_file_node1

This assumes that yourUserName has been configured with key-based authentication. Otherwise, you will need to enter your password manually (20 times! :-( ). To avoid this you could use expect, but I strongly suggest setting up key-based authentication. You can do the latter with expect too.
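
For example, a minimal key-setup sketch (assuming the nodes are reachable as node1 ... node20; substitute your real hostnames and user name):

ssh-keygen -t rsa                        # generate a key pair once, accepting the defaults
for i in $(seq 1 20); do
    ssh-copy-id yourUserName@node$i      # install your public key; prompts for the password one last time
done

After that, the ssh commands above run without asking for a password.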

OTHER TIPS

Assuming you want to process each piece of the file on a host of its own: first copy the Python script to the remote hosts, then loop over them:

for x in aa ab ac ...; do
   ssh user@remote-$x python yourscript.py <input_file_$x >output_file_$x &
done;
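
Since the question mentions a shared /bigShare/ directory, a variation of the same loop (just a sketch, assuming /bigShare/ really is mounted on every node and you have placed parser.py and the input_file_* splits there) avoids copying files around at all, and wait blocks until every background job has finished:

for x in aa ab ac; do       # list all of your split suffixes here
    ssh user@remote-$x "python /bigShare/parser.py </bigShare/input_file_$x >/bigShare/output_file_$x" &
done
wait                        # returns once all remote jobs are done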

If the processing nodes don't have names that are easy to generate, you can create aliases for them in your ~/.ssh/config, for example:

Host remote-aa
    Hostname alpha.int.yourcompany

Host remote-ab
    Hostname beta.int.yourcompany

Host remote-ac
    Hostname gamma.int.yourcompany

This particular use case could be more easily solved by editing /etc/hosts though.
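
For example, lines like these in /etc/hosts on the machine you run the loop from (the addresses are placeholders for your nodes' real IPs) give you the same short names:

10.0.0.11    remote-aa
10.0.0.12    remote-ab
10.0.0.13    remote-ac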

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow