Question

Background: I have 500 formatted *.txt files that I need to insert into a MySQL database. Currently I have a Python script that reads the files line by line and inserts the rows into the database.

Problem: the files are quite big (~100 MB per txt file). I tested the script and it takes too long to insert even one file into the database.

How can I speed up the process by modifying the script?

code:

import os
import re

# db and cursor are assumed to be an open MySQL connection and cursor
# (e.g. from MySQLdb), created earlier in the script.

for file in os.listdir(INPUTFILEPATH):
    inputfilename = INPUTFILEPATH + "/" + file
    open_file = open(inputfilename, 'r')
    contents = open_file.readlines()  # reads the whole file into memory
    totalLines = len(contents)
    for i in range(totalLines):
        lineString = contents[i].rstrip('\n')
        values = lineString.split('\t')
        # skip words containing digits, underscores, quotes or dots
        if len(re.findall(r'[0123456789_\'\.]', values[0])) > 0:
            continue
        message = """INSERT INTO %s (word, year, count, volume) VALUES ('%s','%s','%s','%s')""" % ('1gram', values[0], values[1], values[2], values[3])
        cursor.execute(message)
        db.commit()  # one commit per row

cursor.close()
db.close() 

Solution

Two options to consider:

1) The easiest is to include multiple rows of values in one INSERT. This is way, way faster than doing multiple single-row inserts.

Instead of doing INSERT INTO tbl ( cols ) VALUES ( vals ), do something like INSERT INTO tbl ( cols ) VALUES ( vals ), ( vals ), ( vals )

The number of rows you can insert at once depends on the maximum packet size of the MySQL server, but you can probably do 100, 1000, maybe 10000 rows safely, and it should give you a performance increase of an order of magnitude or more.

See http://dev.mysql.com/doc/refman/5.5/en/insert-speed.html
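As a rough sketch of option 1, reusing the question's table, columns, and parsing: the batch size of 1000 and the helper name insert_batch are arbitrary choices here, not anything prescribed by MySQL.

import re

BATCH_SIZE = 1000  # keep each statement well under max_allowed_packet

def insert_batch(cursor, rows):
    # Build a single multi-row INSERT; the %s placeholders let the
    # driver escape every value.
    placeholders = ", ".join(["(%s, %s, %s, %s)"] * len(rows))
    sql = "INSERT INTO 1gram (word, year, count, volume) VALUES " + placeholders
    flat = [value for row in rows for value in row]  # flatten to match placeholders
    cursor.execute(sql, flat)

# Replacement for the question's inner loop (inputfilename, cursor and
# db come from the surrounding script):
batch = []
for line in open(inputfilename, 'r'):
    values = line.rstrip('\n').split('\t')
    if len(re.findall(r'[0123456789_\'\.]', values[0])) > 0:
        continue
    batch.append((values[0], values[1], values[2], values[3]))
    if len(batch) >= BATCH_SIZE:
        insert_batch(cursor, batch)
        batch = []
if batch:  # flush the last partial batch
    insert_batch(cursor, batch)
db.commit()  # one commit per file instead of one per row

Committing once per file (or once per batch) instead of once per row also saves a disk flush on every insert, which is a large part of why the original loop is slow.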

2) LOAD DATA INFILE is a bit different: it requires more work and has its own requirements for the file format, but it is very, very fast.
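For example, since the question's files are already tab-separated, one statement per file could replace the Python parsing loop entirely. This is only a sketch: it assumes LOAD DATA LOCAL INFILE is permitted on both the server and the client, and that the path contains no quote characters. Note also that it loads every line, so the question's regex filter would have to be applied beforehand (or in a staging table).

# inputfilename, cursor and db come from the question's script
sql = ("LOAD DATA LOCAL INFILE '%s' "
       "INTO TABLE 1gram "
       "FIELDS TERMINATED BY '\\t' "
       "LINES TERMINATED BY '\\n' "
       "(word, year, count, volume)" % inputfilename)
cursor.execute(sql)
db.commit()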

Other tips

You are storing too much text in RAM: readlines() loads the whole file at once. Iterate over the file object instead, so lines are read through a buffer as you go:

with open(inputfilename, 'r') as f:
    for lineString in f:
        # ... do your thing

Try this:

Process the txt files and generate an INSERT statement for each line. Once the files are done processing and you have a .sql file for each txt, import them into the db.
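A minimal sketch of that idea, reusing the question's parsing. The quoting below is naive (data containing quotes or backslashes would need real escaping), and the mysql command line in the final comment is one way to load the generated file:

import re

with open(inputfilename, 'r') as src, open(inputfilename + '.sql', 'w') as out:
    for line in src:
        values = line.rstrip('\n').split('\t')
        if len(re.findall(r'[0123456789_\'\.]', values[0])) > 0:
            continue
        # Naive quoting: fine for plain alphanumeric data, unsafe otherwise.
        out.write("INSERT INTO 1gram (word, year, count, volume) "
                  "VALUES ('%s','%s','%s','%s');\n" % tuple(values[:4]))

# Then, from a shell:
#   mysql your_database < yourfile.txt.sql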
