Question

I am working with a server whose configuration is:

RAM: 56 GB. Processor: 2.6 GHz x 16 cores. How do I do parallel processing from the shell? How can I use all of the processor's cores?

I have to load data from text files that contain millions of entries; for example, one file contains half a million lines. I am using a Django Python script to load the data into a PostgreSQL database, but it takes a lot of time even though the server is well configured, and I don't know how to use the server's resources in parallel so that processing takes less time. Yesterday I loaded only 15,000 lines from a text file into PostgreSQL and it took nearly 12 hours. My Django Python script is below:

import re
import collections

def SystemType():
    filename = raw_input("Enter file Name:")
    in_file = file(filename,"r")
    out_file = file("SystemType.txt","w+")
    for line in in_file:
        line = line.decode("unicode_escape")
        line = line.encode("ascii","ignore")
        values = line.split("\t")
        if values[1]:
            for list in values[1].strip("wordnetyagowikicategory"):
                out_file.write(re.sub("[^\ a-zA-Z()<>\n""]"," ",list))

# Eliminate Duplicate Entries from extracted data using regular expression

def FSystemType():
    lines_seen = set()
    outfile = open("Output.txt","w+")
    infile = open("SystemType.txt","r+")
    for line in infile:
        if line not in lines_seen:
            l = line.lstrip()
            # Below reg exp is used to handle Camel Case.
            outfile.write(re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', l).lower())
            lines_seen.add(line)
    infile.close()
    outfile.close()

sylist=[]

def create_system_type(stname):
    syslist=Systemtype.objects.all()
    for i in syslist:
        sylist.append(str(i.title))
    if not stname in sylist:
        slu=slugify(stname)
        st=Systemtype()
        st.title=stname
        st.slug=slu
        # st.sites=Site.objects.all()[0]
        st.save()
    print "one ST added."

No correct solution

OTHER TIPS

If you could express your requirements without the code (not every shell programmer can really read Python), we could possibly help here.

For example, your report of 12 hours for 15,000 lines suggests you have an overly busy "for" loop somewhere, and I'd point to the nested for:

for list in values[1]....

What are you trying to strip: individual characters, whole words? ...

Then I'd suggest "awk".
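To make the stripping question concrete: in Python, str.strip with an argument treats that argument as a set of characters to remove from both ends, not as a list of words, and looping over a string yields single characters. A small sketch in Python 3 (the sample value and the prefix are made up for illustration; the question does not show real data):

```python
# Hypothetical sample value; the question does not show real field contents.
value = "wikicategory_Data_entry"

# strip(chars) treats its argument as a SET of characters and removes any of
# them from both ends; it does not remove the words "wordnet", "yago", etc.
stripped = value.strip("wordnetyagowikicategory")
print(stripped)  # _Data_  (the trailing "entry" is eaten too)

# Iterating over a string yields one character at a time, so the inner
# "for" loop in the question runs re.sub and write() once per character.
pieces = [ch for ch in stripped]
print(len(pieces))  # 6

# Removing a known prefix as a whole word (the prefix itself is an assumption).
prefix = "wikicategory_"
without_prefix = value[len(prefix):] if value.startswith(prefix) else value
print(without_prefix)  # Data_entry
```

If whole words or prefixes are what should go, a prefix check like the last step (or a single re.sub per line) is probably closer to the intent than strip.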

If you are able to work out the precise data structure required by Django, you can load the database tables directly, using PostgreSQL's COPY command (or \copy from within psql). You could do this by preparing a CSV file to load into the database.
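A minimal sketch of the CSV preparation step in Python 3, assuming the target table only needs a title and a slug column as in the question's create_system_type. The simple_slug helper is a rough stand-in for Django's slugify, and the database, table, and file names in the trailing comment are assumptions, not taken from the question:

```python
import csv
import io
import re


def simple_slug(title):
    """Rough stand-in for Django's slugify (an assumption, not the real function)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def write_copy_csv(titles, out):
    """Write unique, non-empty titles as (title, slug) CSV rows; return the row count."""
    writer = csv.writer(out)
    seen = set()
    for raw in titles:
        title = raw.strip()
        if not title or title in seen:
            continue  # skip blanks and duplicates
        seen.add(title)
        writer.writerow([title, simple_slug(title)])
    return len(seen)


# In-memory demo; in practice "out" would be a file opened with newline="".
buffer = io.StringIO()
count = write_copy_csv(["operating system\n", "file system\n", "operating system\n"], buffer)
print(count)
print(buffer.getvalue())

# Then, from the shell (database, table, and file names are assumptions):
#   psql -d mydb -c "\copy myapp_systemtype (title, slug) FROM 'systemtype.csv' WITH (FORMAT csv)"
```

Django names tables appname_modelname by default, so the real table name depends on the app the Systemtype model lives in. Any other required columns would need to be added to the CSV as well.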

There are any number of reasons why loading is slow using your approach. First of all, Django has a lot of transactional overhead. Secondly, it is not clear how you are running the Django code: is this via the internal testing server? If so, you may have to deal with the slowness of that. Finally, what makes a database fast is normally not CPU, but rather fast I/O and lots of memory.
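One way to reduce the per-row overhead, sketched in Python 3: the question's create_system_type re-reads every Systemtype row on each call and saves one row at a time, so a cheaper pattern is to read the existing titles once, work out the new ones in memory, and insert them in bulk. The helper below is plain Python; the Django calls in the trailing comment assume the question's Systemtype model and are not run here:

```python
def titles_to_insert(existing_titles, candidates):
    """Return candidate titles that are not stored yet, de-duplicated, in order."""
    seen = set(existing_titles)  # set membership is O(1), unlike a growing list
    fresh = []
    for title in candidates:
        if title not in seen:
            seen.add(title)
            fresh.append(title)
    return fresh


fresh = titles_to_insert(["file system"], ["file system", "kernel", "shell", "kernel"])
print(fresh)

# Assumed Django usage (model and field names come from the question's code;
# the batch size is an arbitrary illustration):
#   existing = Systemtype.objects.values_list("title", flat=True)
#   objs = [Systemtype(title=t, slug=slugify(t))
#           for t in titles_to_insert(existing, lines)]
#   Systemtype.objects.bulk_create(objs, batch_size=1000)
```

Note that bulk_create does not call each object's save() method, so this only fits if the model has no custom save logic that must run.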

Licensed under: CC-BY-SA with attribution