Question

I have a 10GB CSV file which is essentially a huge square matrix. I am trying to write a function that can access a single cell of the matrix as efficiently as possible, e.g. matrix[12345,20000].

Given its size, it is obviously not possible to load the entire matrix into a 2D array; I need to somehow read the values directly from the file.

I have Googled around looking at random file access using FileStream.Seek; unfortunately, due to variable rounding, each cell isn't a fixed width, so it is not possible to seek to a specific byte and work out which cell I'm looking at with simple arithmetic.

I considered scanning the file and creating a lookup table of the byte offset of the start of each row. That way, if I wanted to access matrix[12345,20000] I would seek to the start of row 12345 and then scan across the line, counting the commas until I reached the correct cell.
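Roughly what I have in mind is something like this (untested sketch; it assumes a plain numeric CSV with no quoted fields and newline-terminated rows):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class CsvCellReader
{
    private readonly string _path;
    private readonly List<long> _rowOffsets = new List<long>();

    public CsvCellReader(string path)
    {
        _path = path;
        BuildRowIndex();
    }

    // One pass over the whole file, recording the byte offset of the start of each row.
    // This pass will take a while on a 10GB file, but only has to happen once.
    private void BuildRowIndex()
    {
        using (var fs = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            _rowOffsets.Add(0);
            long pos = 0;
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n')
                    _rowOffsets.Add(pos);   // the next row starts right after the newline
            }
        }
    }

    // Seek straight to the start of the row, then split on commas to reach the column.
    public double GetCell(int row, int col)
    {
        using (var fs = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(_rowOffsets[row], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs))
            {
                string[] fields = reader.ReadLine().Split(',');
                return double.Parse(fields[col]);
            }
        }
    }
}
```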

I am about to try this, but has anyone else got any better ideas? I'm sure I wouldn't be the first person to try and deal with a file like this.

Cheers

Edit: I should note that the file contains a very sparse matrix. If parsing the CSV file ends up being too slow, I would consider converting the file to a more appropriate, and easier to process, file format. What is the best way to store a sparse matrix?


Solution

I have used the Lumenworks CSV reader for quite large CSV files; it may be worth a quick look to see how quickly it can parse your file.

Lumenworks CSV
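From memory, reading a cell with it looks roughly like this (sketch only, check the library's documentation for the exact API; the second constructor argument says whether the file has a header row):

```csharp
using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;

static class LumenworksExample
{
    // Streams through the file once and pulls out a single cell.
    static double ReadCell(string path, int targetRow, int targetCol)
    {
        using (var csv = new CsvReader(new StreamReader(path), false))
        {
            int row = 0;
            while (csv.ReadNextRecord())
            {
                if (row == targetRow)
                    return double.Parse(csv[targetCol]);
                row++;
            }
        }
        throw new ArgumentOutOfRangeException("targetRow");
    }
}
```

Note that this still scans from the start of the file for every lookup, so you would want to combine it with something like your row-offset index, or cache rows you have already visited.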

OTHER TIPS

First of all, how do you want to refer to a particular row? By its index, meaning you have another table or something that tells you which row you are interested in? Or by an id of some kind?

These ideas come to mind:

  • Your approach
  • Binary search. Assuming you know the average row length (file size / number of rows), you can binary search for a row, provided each row contains an ordered identifier that tells you whether you have hit or missed (rough sketch after this list).
  • Loading it into a database! By the way, what prevents you from doing that? You can even use SQL Express - which is free - and to get around its size limit, you can shard your data across multiple databases.
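To illustrate the binary-search idea: jump to the middle of the remaining byte range, back up to the start of that line, and compare its id with the one you want. A rough, untested sketch (helper names are mine; it assumes ASCII content and rows sorted by an integer id in the first field):

```csharp
using System;
using System.IO;
using System.Text;

static class RowBinarySearch
{
    static string FindRowById(string path, long targetId)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long lo = 0, hi = fs.Length;
            while (lo < hi)
            {
                long mid = lo + (hi - lo) / 2;
                long lineStart = BackUpToLineStart(fs, mid);
                long nextLineStart;
                string line = ReadLineAt(fs, lineStart, out nextLineStart);
                long id = long.Parse(line.Substring(0, line.IndexOf(',')));

                if (id == targetId) return line;        // hit
                if (id < targetId) lo = nextLineStart;  // target is further down the file
                else hi = lineStart;                    // target is earlier in the file
            }
            return null;                                // miss
        }
    }

    // Walk backwards from pos to the byte just after the previous newline (or the file start).
    static long BackUpToLineStart(FileStream fs, long pos)
    {
        while (pos > 0)
        {
            fs.Seek(pos - 1, SeekOrigin.Begin);
            if (fs.ReadByte() == '\n') break;
            pos--;
        }
        return pos;
    }

    // Read one line starting at 'start' and report where the next line begins.
    static string ReadLineAt(FileStream fs, long start, out long nextLineStart)
    {
        fs.Seek(start, SeekOrigin.Begin);
        var sb = new StringBuilder();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n')
            if (b != '\r') sb.Append((char)b);
        nextLineStart = fs.Position;
        return sb.ToString();
    }
}
```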

An index file would be the best you could do, I bet. With rows of unknown size, there is no way to skip directly to a line other than either scanning the file or keeping an index.

The only question is how large your index gets. If it is too large, you could shrink it by indexing only every 5th line (for example) and scanning within a range of 5 lines.
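A sketch of the lookup with such a sampled index (assuming a list that holds the byte offset of every 5th row):

```csharp
using System.Collections.Generic;
using System.IO;

static class SampledIndex
{
    const int N = 5;   // index every 5th row

    // sampledOffsets[k] holds the byte offset of row k * N; build it the same way
    // as a full index, just record only every Nth newline.
    static string ReadRow(string path, List<long> sampledOffsets, int row)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(sampledOffsets[row / N], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs))
            {
                string line = null;
                for (int i = 0; i <= row % N; i++)   // read forward to the row we actually want
                    line = reader.ReadLine();
                return line;
            }
        }
    }
}
```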

Pre-process the file so that the fields are fixed width. Then you can do your random read easily.

From doing similar sorts of conversions in the past, you should be able to write some simple code that reads the 10GB variable-width file from a local disk and writes a 10GB fixed-width file back to a local disk in a few (~20) minutes. Whether that up-front time investment pays off depends on how many random reads you need to do and how often the file to be read changes.
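The read itself then becomes one seek plus a little arithmetic, something along these lines (sketch; the field and column widths are made-up values for illustration):

```csharp
using System.IO;
using System.Text;

static class FixedWidthMatrix
{
    // Invented layout: each value padded to 20 characters, 20000 columns per row,
    // one '\n' terminating each row.
    const int FieldWidth = 20;
    const int Columns = 20000;
    const long RowWidth = (long)FieldWidth * Columns + 1;

    static double ReadCell(string path, int row, int col)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(row * RowWidth + (long)col * FieldWidth, SeekOrigin.Begin);

            var buffer = new byte[FieldWidth];
            fs.Read(buffer, 0, FieldWidth);
            return double.Parse(Encoding.ASCII.GetString(buffer).Trim());
        }
    }
}
```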

What if you created 12345 separate files that are read with lazy instantiation? Each file would only be read if its data was needed. If the data is completely sparse, you could create a data structure with an IsEmpty bool property.
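Something along these lines, say (rough sketch; the per-row file naming scheme is invented):

```csharp
using System;
using System.IO;
using System.Linq;

static class RowFiles
{
    const int RowCount = 12345;

    // One Lazy<double[]> per row; a row file is only read and parsed the first time it is touched.
    static readonly Lazy<double[]>[] Rows =
        Enumerable.Range(0, RowCount)
                  .Select(r => new Lazy<double[]>(() =>
                      File.ReadAllText("row_" + r + ".csv")   // hypothetical file name
                          .Split(',')
                          .Select(double.Parse)
                          .ToArray()))
                  .ToArray();

    static double GetCell(int row, int col)
    {
        return Rows[row].Value[col];
    }
}
```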

Do you need to access the same element over and over or do you need to just read each element once?

I disagree that you shouldn't load the file into RAM, especially if you use a 64-bit OS.

It shouldn't be a problem to allocate a matrix of size 12345x20000: that's only about 1.9 GB in double precision (12345 × 20000 cells × 8 bytes). And in fact, even if the size were bigger, I would still recommend this approach on a 64-bit platform (see "virtual memory").

Secondly, you stated that your matrix is sparse, so you could load it into RAM but use a sparse representation to save some memory.
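For example, a dictionary that only keeps the non-zero cells, keyed on row and column, is the simplest sparse representation I can think of (sketch; the column count is an assumption used only to build the key):

```csharp
using System.Collections.Generic;
using System.IO;

static class SparseMatrix
{
    const long Columns = 20000;   // assumed column count, used to combine row and column into one key

    // Only non-zero cells are stored; everything else is implicitly zero.
    static Dictionary<long, double> Load(string path)
    {
        var cells = new Dictionary<long, double>();
        int row = 0;
        foreach (string line in File.ReadLines(path))   // streams the file one line at a time
        {
            string[] fields = line.Split(',');
            for (int col = 0; col < fields.Length; col++)
            {
                double value = double.Parse(fields[col]);
                if (value != 0.0)
                    cells[row * Columns + col] = value;
            }
            row++;
        }
        return cells;
    }

    static double GetCell(Dictionary<long, double> cells, int row, int col)
    {
        double value;
        return cells.TryGetValue(row * Columns + col, out value) ? value : 0.0;
    }
}
```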

In conclusion, if your application makes many accesses to the matrix and performance is somewhat important, putting it in RAM would definitely be my favourite approach.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow