Question

I have a file which has about 12 million lines; each line looks like this:

0701648016480002020000002030300000200907242058CRLF

What I'm trying to accomplish is adding a row number before the data on each line; the numbers should have a fixed length.

The idea behind this is to be able to do a bulk insert of this file into a SQL Server table, and then perform certain operations with it that require each line to have a unique identifier. I've tried doing this on the database side but I haven't been able to get good performance (under 4 minutes at least; under 1 minute would be ideal).

Right now I'm trying a solution in Python that looks something like this:

file = open('file.cas', 'r')
lines = file.readlines()
file.close()
text = ['%d %s' % (i, line) for i, line in enumerate(lines)]
output = open('output.cas', 'w')
output.write(''.join(text))
output.close()

I don't know if this will work, but it'll help me get an idea of how it will perform, and of any side effects, before I keep trying new things. I also thought of doing it in C so I'd have better control over memory.

Would it help to do it in a low-level language? Does anyone know a better way to do this? I'm pretty sure it has been done before, but I haven't been able to find anything.

thanks

Solution

Oh god no, don't read all 12 million lines in at once! If you're going to use Python, at least do it this way:

# 'with' closes both files automatically, even on error
with open('file.cas', 'r') as file:
    with open('output.cas', 'w') as output:
        output.writelines('%d %s' % tpl for tpl in enumerate(file))

That uses a generator expression, which runs through the file processing one line at a time, so memory use stays constant regardless of file size.

OTHER TIPS

Why don't you try `cat -n`?

Stefano is right:

$ time cat -n file.cas > output.cas

Use `time` just so you can see how fast it is. It'll be faster than Python, since cat is pure C code.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow