Question

I'm using C# and I write my data into CSV files (for further use). However, my files have grown quite large and I need to transpose them. What's the easiest way to do that, in any program?

Gil

Solution

In increasing order of complexity (and also increasing order of ability to handle large files):

  • Read the whole thing into a 2-D array (or a jagged array, i.e. an array of arrays), transpose it in memory, and write it back out (see the C# sketch after this list).
    • Memory required: equal to the size of the file

  • Track a file offset within each row. Start by finding each (non-quoted) newline and storing the current position in a List<Int64>. Then iterate across all rows; for each row, seek to the saved position, copy one cell to the output, and save the new position. Repeat until you run out of columns (all rows reach a newline).
    • Memory required: eight bytes per row
    • Frequent seeks scattered across a file much larger than the disk cache result in disk thrashing and miserable performance, but it won't crash.

  • Like the above, but working on blocks of e.g. 8k rows. This creates a set of stripe files, each with 8k columns. The input block and the outputs all fit within the disk cache, so no thrashing occurs. After building the stripe files, iterate across the stripes, reading one row from each and appending it to the output; repeat for all rows. This results in a sequential scan over each file, which also has very reasonable cache behavior.
    • Memory required: 64 KB of offsets for the first pass, one open file descriptor per stripe for the second pass.
    • Good performance for tables of up to several million in each dimension. For even larger data sets, combine just a few (e.g. 1k) of the stripe files into a smaller set of larger stripes, and repeat until all the data sits in a single stripe in one file.
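
For modest files, the first option is only a few lines of C#. This is a minimal sketch, not code from the answer: it assumes a simple CSV with no quoted fields or embedded commas, every row having the same number of columns, and input.csv / output.csv are placeholder names.

// In-memory transpose: read everything, then write column j as output line j.
using System.IO;
using System.Linq;

class CsvTranspose
{
    static void Main()
    {
        // rows[i][j] is row i, column j of the input.
        string[][] rows = File.ReadAllLines("input.csv")
                              .Select(line => line.Split(','))
                              .ToArray();

        using (var writer = new StreamWriter("output.csv"))
        {
            for (int j = 0; j < rows[0].Length; j++)
            {
                // Output row j is input column j.
                writer.WriteLine(string.Join(",", rows.Select(r => r[j])));
            }
        }
    }
}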

Final comment: You might squeeze out more performance by using C++ (or any language with proper pointer support), memory-mapped files, and pointers instead of file offsets.
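
If you want to stay in C#, memory-mapped files are available through System.IO.MemoryMappedFiles. This is only a rough sketch of the idea (the file name is a placeholder): a saved per-row offset becomes an index into the mapped view instead of an argument to an explicit Seek call.

// Random access through a memory-mapped view rather than explicit seeks.
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmapSketch
{
    static void Main()
    {
        using (var mmf = MemoryMappedFile.CreateFromFile("input.csv", FileMode.Open))
        using (var view = mmf.CreateViewAccessor())
        {
            long offset = 0;                 // any saved per-row position
            byte b = view.ReadByte(offset);  // no explicit file seek needed
            Console.WriteLine((char)b);
        }
    }
}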

OTHER TIPS

It really depends. Are you getting these out of a database? If so, you could use MySQL's LOAD DATA statement: http://dev.mysql.com/doc/refman/5.1/en/load-data.html

Or you could loop through the data and write it to a file stream using a StreamWriter object:

// lstValueList is assumed to already hold the rows as string arrays.
using (StreamWriter sw = new StreamWriter("pathtofile"))
{
    foreach (String[] value in lstValueList)
    {
        String something = value[1] + "," + value[2];
        sw.WriteLine(something);
    }
}

I wrote a little proof-of-concept script in Python. I admit it's buggy and there are likely some performance improvements to be made, but it will do the job. I ran it against a 40x40 file and got the desired result. I started to run it against something more like your data set, but it took too long for me to wait.

import csv
from glob import glob
from os.path import join
from shutil import rmtree
from tempfile import mkdtemp

path = mkdtemp()
try:
    with open('/home/user/big-csv', 'r', newline='') as instream:
        reader = csv.reader(instream)
        for i, row in enumerate(reader):
            # Append each field to a per-column scratch file: input column j
            # becomes output row j.
            for j, field in enumerate(row):
                with open(join(path, 'new row {0:0>2}'.format(j)), 'a') as new_row_stream:
                    new_row_stream.write('{0},'.format(field))
            print('read row {0:0>2}'.format(i))
    with open('/home/user/transpose-csv', 'w') as outstream:
        # Stitch the per-column scratch files together, one per output line.
        files = glob(join(path, '*'))
        files.sort()
        for filename in files:
            with open(filename, 'r') as row_file:
                outstream.write(row_file.read() + '\n')
finally:
    print("done")
    rmtree(path)