Question

I am working on a project in Python that is starting to overwhelm my low-end Windows laptop, and I wanted to ask for advice about how to find the additional computing power I think I need.

Here are some details about my project: I am processing and analyzing a fairly large database of text from the web, approximately 10,000 files averaging roughly 500 words each (though with a lot of variance around this mean). The first step is pulling certain key phrases and using GenSim to do a fairly simple similarity analysis. This takes my computer a while, but it can handle it if I'm gentle. Second, once I have identified a short list of candidates, I fingerprint each candidate document to assess similarity more closely. Each file requires fingerprinting and comparison against 2-10 other files, so I don't think it's really an n-to-n comparison of the sort that would require months of computer time.

It is this second step where my computer starts to struggle. I was considering running the script in an EC2 environment, but when I started reading about that on here, I saw a comment to the effect that doing so effectively requires a Linux sysadmin level of sophistication. I am about as far from that level of sophistication as any member of this site can be.

So is there another option? Or is getting a fairly simple Python script running on EC2 not so hard?

The part of the script that seems the most resource-intensive is below. For each text file, it creates a list of fingerprints by selecting certain text files from amdt_word_bags_trim according to criteria in PossDupes_1 (both of which are lists). It uses the fingerprintgenerator module, which I found here: https://github.com/kailashbuki/fingerprint.

# FingerprintGenerator comes from the fingerprintgenerator module linked above.
fingerprints_hold = []
error_count = 0
for index, (amdt, sims) in enumerate(zip(amdt_word_bags_trim, PossDupes_1)):
    # Progress indicator: print a running count every 100 documents.
    if (index + 1) % 100 == 0:
        print(index + 1)
    if len(sims) > 1:
        # Candidate indices, excluding the document itself.
        poss_sim = [sim for sim in sims if sim != index]
        fpg_orig = FingerprintGenerator(input_string=amdt)
        try:
            fpg_orig.generate_fingerprints()
            orig_prints = fpg_orig.fingerprints
        except IndexError as err:
            # Treated here as "text too small to fingerprint".
            orig_prints = ["small"]
            print(err)
            error_count += 1
            print(error_count)
        # Join each candidate's word bag back into a single string.
        cand_text = [''.join(amdt_word_bags_trim[num]) for num in poss_sim]
        fing_cands_hold = []
        for text in cand_text:
            fpg_cands = FingerprintGenerator(input_string=text)
            try:
                fpg_cands.generate_fingerprints()
                fing_cands_hold.append([int(a[0]) for a in fpg_cands.fingerprints])
            except IndexError:
                fing_cands_hold.append('small cand')
            except TypeError:
                fing_cands_hold.append("none")
        fingerprints_hold.append([orig_prints, fing_cands_hold])
    else:
        fingerprints_hold.append("no potential matches")

Solution

How about using Amazon's Elastic MapReduce (EMR)? This is Amazon's Hadoop service, which basically runs on top of EC2. You can copy your data files to Amazon S3 and have your EMR cluster pick up the data from there. You can also send your results to files in Amazon S3.

When you launch your cluster, you can choose how many EC2 instances to use and what size each instance should be. That way you can tailor how much CPU power you get. Once your job is done, you can tear down the cluster so that you are not paying for it while it sits idle.

You can also do all of the above programmatically. For example, in Python I use boto, a library for the Amazon API that is quite popular.
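As a rough illustration of the "copy your data files to Amazon S3" step done programmatically, here is a minimal sketch. It assumes boto3 (the current successor to boto) rather than the older boto library, and it takes the S3 client as a parameter so the function can be tried without AWS credentials. The function name `upload_documents`, the bucket name, the `input/` prefix, and the `.txt` filter are all hypothetical choices for this example, not anything from the original setup.

```python
import os


def upload_documents(s3_client, local_dir, bucket, prefix="input/"):
    """Upload every .txt file in local_dir to the given bucket.

    s3_client is expected to behave like boto3.client("s3"), i.e. to
    provide upload_file(Filename, Bucket, Key). Returns the uploaded keys.
    """
    uploaded = []
    for name in sorted(os.listdir(local_dir)):
        if not name.endswith(".txt"):
            continue  # skip anything that is not a text document
        key = prefix + name
        s3_client.upload_file(os.path.join(local_dir, name), bucket, key)
        uploaded.append(key)
    return uploaded


# Hypothetical usage (needs boto3 installed and AWS credentials configured):
#   import boto3
#   upload_documents(boto3.client("s3"), "amendments", "my-example-bucket")
```

Passing the client in, instead of creating it inside the function, keeps the AWS-specific part in one place and makes it easy to check the logic with a stand-in object first.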

To get started writing Python MapReduce jobs, you can find several posts on the web explaining how to do it. Here's an example: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
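To give a feel for what such a job looks like, here is a small sketch in the Hadoop Streaming style (plain text lines in, tab-separated key/value lines out). It is not the fingerprint module from the question: `shingle_hashes` is a simplified stand-in that hashes every k-word window, and the `doc_id<TAB>text` input format is an assumption made for the example. The mapper emits one line per (hash, document) pair, and the reducer reports hashes shared by more than one document, which is one way to surface candidate near-duplicates.

```python
import hashlib
from itertools import groupby


def shingle_hashes(text, k=5):
    """Stand-in for real fingerprinting: hash every k-word window."""
    words = text.split()
    for i in range(max(len(words) - k + 1, 1)):
        window = " ".join(words[i:i + k])
        yield hashlib.md5(window.encode("utf-8")).hexdigest()[:8]


def mapper(lines):
    """Each input line is 'doc_id<TAB>text'; emit 'hash<TAB>doc_id'."""
    for line in lines:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        for h in set(shingle_hashes(text)):
            yield "%s\t%s" % (h, doc_id)


def reducer(sorted_lines):
    """Input must be sorted by hash (Hadoop sorts between the phases).

    Emit 'hash<TAB>doc_a,doc_b,...' for hashes seen in 2+ documents.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for h, group in groupby(pairs, key=lambda pair: pair[0]):
        doc_ids = sorted(set(doc_id for _, doc_id in group))
        if len(doc_ids) > 1:
            yield "%s\t%s" % (h, ",".join(doc_ids))
```

Locally, the whole pipeline can be simulated with `list(reducer(sorted(mapper(lines))))`. In an actual streaming job, the mapper and reducer would live in two separate scripts that read `sys.stdin` and print their output lines, as in the tutorial linked above.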

Hope this helps.

Licensed under: CC-BY-SA with attribution