Question

Are there any good programs for reading large CSV files? Some of the data files I deal with are in the 1 GB range. They have too many lines for Excel to handle at all. Using Access can be a little slow, as you have to import the files into a database before you can work with them directly. Is there a program that can open large CSV files and give you a simple spreadsheet layout, so you can scan through the data quickly and easily?

Solution

MySQL can import CSV files into tables very quickly using the LOAD DATA INFILE command. It can also read CSV files directly, bypassing any import step, by using the CSV storage engine.

Importing into native tables with LOAD DATA INFILE has an up-front cost, but after that you can INSERT/UPDATE much faster, as well as index fields. Using the CSV storage engine is almost instantaneous to set up, but only sequential scans will be fast.
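
As a rough illustration, here is a minimal sketch of the LOAD DATA approach driven from Python with the mysql-connector-python package; the credentials, table layout, and file path are all hypothetical placeholders:

import mysql.connector

# allow_local_infile lets the server accept LOAD DATA LOCAL INFILE
conn = mysql.connector.connect(user='me', password='secret',
                               database='test', allow_local_infile=True)
cur = conn.cursor()

# Hypothetical three-column table matching the CSV layout
cur.execute("CREATE TABLE IF NOT EXISTS big_csv "
            "(id INT, name VARCHAR(255), amount DECIMAL(10,2))")

# One-time import; afterwards you can add indexes and query normally
cur.execute("LOAD DATA LOCAL INFILE '/path/to/data.csv' "
            "INTO TABLE big_csv "
            "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
            "LINES TERMINATED BY '\\n' "
            "IGNORE 1 LINES")
conn.commit()
cur.close()
conn.close()

The CSV storage engine route is the same CREATE TABLE with ENGINE=CSV appended; MySQL then reads the table's backing .CSV file directly, with no import step but also no indexes.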

Update: This article (scroll down to the section titled "Instant Data Loads") covers both approaches to loading CSV data into MySQL, with examples.

OTHER TIPS

I've found reCSVeditor to be a great program for editing large CSV files. It's ideal for stripping out unnecessary columns. I've used it on files of 1,000,000 records quite easily.

vEdit is great for this. I routinely open 100+ MB files with it (I know you said up to 1 GB; they advertise on their site that it can handle twice that). It has regex support and loads of other features. $70 is cheap for the amount you can do with it.

GVim can handle files that large for free, if you aren't attached to a true spreadsheet view with static field sizes.

vEdit is great, but don't forget you can always go back to basics: check out Cygwin and start grepping.

Helpful commands:

  • grep
  • head
  • tail
  • and, of course, Perl!

It depends on what you actually want to do with the data. With a large text file like that, you typically only want a small subset of the data at any one time, so don't overlook tools like 'grep' for pulling out just the pieces you want to work with.
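
If you'd rather do that filtering in Python, a grep-style streaming pass with the standard csv module never holds more than one row in memory; the file names, column index, and match value below are hypothetical:

import csv

# Stream the large file row by row; memory use stays constant
with open('/path/to/data.csv', newline='') as src, \
     open('subset.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))      # copy the header row
    for row in reader:
        if row[2] == 'some_value':     # hypothetical filter on column 3
            writer.writerow(row)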

If you can fit the data into memory and you like Python, then I recommend checking out the UniTable portion of Augustus. (Disclaimer: Augustus is open source (GPLv2), but I work for the company that writes it.)

It's not very well documented, but this should help you get going:

from augustus.kernel.unitable import UniTable

a = UniTable().from_csv_file('filename')   # load the CSV into an in-memory table
b = a.subtbl(a['key'] == some_value)       # subtable of rows where 'key' matches

It won't directly give you an Excel-like interface, but with a little bit of work you can get many statistics out quickly.
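
For example, since UniTable is NumPy-based, column-wise statistics should fall out almost for free; the column name below is a hypothetical placeholder, and the array-like behaviour of columns is an assumption worth verifying against the Augustus docs:

col = a['amount']                          # hypothetical numeric column
print(col.min(), col.max(), col.mean())    # assumes NumPy-style reductions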

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow