Question

I have a 20 GB file that looks like the following:

Read name, Start position, Direction, Sequence

Note that read names are not necessarily unique.

E.g., a snippet of my file would look like

Read1, 40009348, +, AGTTTTCGTA
Read2, 40009349, -, AGCCCTTCGG
Read1, 50994530, -, AGTTTTCGTA

I want to be able to store these lines in a way that allows me to

  1. keep the file sorted based on the second value
  2. iterate over the sorted file

It seems that databases can be used for this.

The documentation seems to imply that dbm cannot be used to sort the file and iterate over it.

Therefore I'm wondering whether SQLite3 will be able to do 1) and 2). I know that I can sort my file with an SQL query and iterate over the result set with sqlite3. However, will I be able to do this without running out of memory on a computer with 4 GB of RAM?


Solution

SQLite is able to do both 1) and 2).

I recommend you try it and report any problems you encounter.

With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (2^41 bytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file, and many filesystems limit the maximum size of files to something less than this. So if you are contemplating databases of this magnitude, you would do well to consider using a client/server database engine that spreads its content across multiple disk files, and perhaps across multiple volumes.
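As a sketch (assuming the comma-separated format above and Python's built-in `sqlite3` module; the table and index names are made up for illustration), you could load the file once, index the start-position column, and then iterate the sorted rows through a streaming cursor. SQLite fetches rows lazily as you iterate, so memory use stays small regardless of file size:

```python
import sqlite3

# In-memory database for this sketch; use a file path such as
# "reads.db" for the real 20 GB dataset.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reads (
        name      TEXT,
        start_pos INTEGER,
        direction TEXT,
        sequence  TEXT
    )
""")

# Rows as they would be parsed from the file.
rows = [
    ("Read1", 40009348, "+", "AGTTTTCGTA"),
    ("Read2", 40009349, "-", "AGCCCTTCGG"),
    ("Read1", 50994530, "-", "AGTTTTCGTA"),
]
conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)", rows)

# Index the sort column so ORDER BY can walk the index on disk
# instead of sorting the whole table in memory.
conn.execute("CREATE INDEX idx_start ON reads (start_pos)")
conn.commit()

# The cursor streams rows one at a time; the full result set is
# never materialized in RAM.
for name, start, direction, seq in conn.execute(
    "SELECT name, start_pos, direction, sequence "
    "FROM reads ORDER BY start_pos"
):
    print(name, start, direction, seq)
```

For a 20 GB input you would of course read and insert the file line by line rather than building the whole `rows` list in memory.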

OTHER TIPS

See this question about large SQLite databases.

The important bit:

I tried to insert multiple rows into a sqlite file with just one table. When the file was about 7GB (sorry I can't be specific about row counts) insertions were taking far too long. I had estimated that my test to insert all my data would take 24 hours or so, but it did not complete even after 48 hours.

The sample used was ~50 GB of data, though system specs are not mentioned.
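Slow bulk inserts like the ones described above are often caused by committing each row individually and by maintaining indexes during the load. A common remedy, sketched here with hypothetical table and helper names, is to insert in large batches with one commit per batch, and to create the index only after the data is loaded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute(
    "CREATE TABLE reads "
    "(name TEXT, start_pos INTEGER, direction TEXT, sequence TEXT)"
)

def batched_insert(conn, rows, batch_size=10000):
    """Insert rows in large batches: one commit per batch instead of
    one per row avoids a disk sync for every single insert."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)", batch)
            conn.commit()
            batch = []
    if batch:  # flush the final, partially filled batch
        conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)", batch)
        conn.commit()

# Hypothetical sample rows standing in for lines parsed from the file.
sample = (("Read%d" % i, 40009348 + i, "+", "AGTT") for i in range(25000))
batched_insert(conn, sample)

# Building the index once, after the bulk load, is cheaper than
# updating it on every insert.
conn.execute("CREATE INDEX idx_start ON reads (start_pos)")
conn.commit()
```

This won't make a 50 GB load instant, but it typically changes insert performance by orders of magnitude compared with per-row commits.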

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow