(Please don't suggest a Hadoop or MapReduce solution, even though it sounds logically similar.)
I have a big file - 70GB of raw HTML documents - and I need to parse it to extract the information I need.
I have successfully dealt with a 10GB file before using standard I/O:
cat input_file | python parse.py > output_file
In my Python script, I read each HTML document (one document per line) from standard input and write the result back to standard output.
from bs4 import BeautifulSoup
import sys

for line in sys.stdin:
    soup = BeautifulSoup(line)   # each input line is one complete HTML document
    # ... extract the fields I need from soup and print them to standard output ...
The code is very simple, but right now I am dealing with a much bigger file, and processing it on one node is horribly slow. I have a cluster of about 20 nodes, and I am wondering how I could easily distribute this work.
What I have done so far:
split -l 5000 input_file.all input_file_ # I have 60K lines in total in that 70G file
Now the big file has been split into several smaller files:
input_file_aa
input_file_ab
input_file_ac
...
Then I have no problem working with each one of them:
cat input_file_aa | python parser.py > output_file_aa
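So on a single node the whole job is just a loop over the chunks, which is exactly the work I would like to spread across the cluster (a sketch, assuming the chunk naming produced by split above):

for f in input_file_*; do
    cat "$f" | python parser.py > "output_file_${f#input_file_}"   # e.g. input_file_aa -> output_file_aa
done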
What I am probably going to do is scp the input files to each node, do the parsing there, and then scp the results back, but there are 10+ nodes and it would be so tedious to do that manually.
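Done by hand, that means something like the following for every chunk/node pair (the hostnames, user, and remote paths here are just placeholders, and it assumes parser.py is already on the node):

scp input_file_aa someuser@node01:/tmp/
ssh someuser@node01 'cat /tmp/input_file_aa | python parser.py > /tmp/output_file_aa'
scp someuser@node01:/tmp/output_file_aa .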
I am wondering how I could easily distribute these files to the other nodes, do the parsing, and move the results back.
I am open to basic shell, Java, or Python solutions. Thanks a lot in advance, and let me know if you need more explanation.
Note: I do have a folder called /bigShare/ that is accessible on every node and whose contents are synchronized and stay the same. I don't know how the architect implemented that (NFS..? I don't know how to check), but I could put my input files and Python script there, so the remaining question is how to easily log into those nodes and execute the command.
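What I am hoping for is something as simple as a bash loop like the one below, driven from one node. This is only a rough sketch of what I have in mind, not something I know is correct: the node names are made up, the round-robin assignment of chunks to nodes is my guess, and it assumes passwordless ssh and that /bigShare/ really is visible everywhere.

NODES=(node01 node02 node03)                      # placeholder names for my ~20 nodes
i=0
for f in /bigShare/input_file_*; do
    node=${NODES[$((i % ${#NODES[@]}))]}          # round-robin chunks over the nodes
    out=/bigShare/output_file_${f##*input_file_}  # e.g. input_file_aa -> output_file_aa
    ssh "$node" "cat $f | python /bigShare/parser.py > $out" &
    i=$((i + 1))
done
wait                                              # wait for all remote jobs to finish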
Btw, I am on Red Hat.