Question

I'm looking for the amount of storage in bytes (MB, GB, TB, etc.) required to store a single human genome. I read a few articles on Wikipedia about DNA, chromosomes, base pairs, genes, and have some rough guess, but before disclosing anything I'd like to see how others would approach this issue.

An alternative question would be how many atoms are there in human DNA, but that would be off topic for this site.

I understand that this will be an approximation, so I'm looking for the minimal value that would be able to store DNA of any human.


Solution

If you trust such things, here is what Wikipedia claims (from http://en.wikipedia.org/wiki/Human_genome#Information_content):

The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.

OTHER TIPS

The DNA is usually not stored as one stream; most of the time it is stored per chromosome.

A large chromosome takes about 300 MB and a small one about 50 MB (as plain text, one byte per base).


Edit:

I think the first reason it is not saved at 2 bits per base pair is that this would make the data a hurdle to work with. Most people would not know how to convert it. And even if a conversion program were provided, many people in large companies or research institutes are not allowed to install programs, need to ask first, or do not know how...

1 GB of storage costs next to nothing, and even downloading 3 GB takes only 4 minutes at 100 Mbit/s; most companies have faster connections.

Another point is that the data isn't as simple as you get told.

E.g., the sequencing method invented by Craig Venter was a great breakthrough but has its downsides. It could not resolve long runs of the same base, so it is not always 100% clear whether there are 8 A's or 9 A's. These are things you have to take care of later on...

Another example is DNA methylation, because you can't store this information in a 2-bit representation.

Basically, each base pair takes 2 bits (you can use 00, 01, 10, 11 for T, G, C, and A). Since there are about 2.9 billion base pairs in the human genome, (2 * 2.9 billion) bits = 725 million bytes, or about 691 megabytes when a megabyte is counted as 2^20 bytes.
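The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `two_bit_storage` is made up for illustration, and the 2.9 billion figure is the approximate count quoted in this thread, not an exact genome length.

```python
def two_bit_storage(base_pairs: int) -> dict:
    """Size of a 2-bit-per-base-pair encoding, in several units."""
    bits = 2 * base_pairs
    n_bytes = (bits + 7) // 8  # round up to whole bytes
    return {
        "bytes": n_bytes,
        "MB": n_bytes / 10**6,   # decimal megabytes (10^6 bytes)
        "MiB": n_bytes / 2**20,  # binary megabytes (2^20 bytes)
    }

# Approximate haploid genome length used in the answers above.
sizes = two_bit_storage(2_900_000_000)
print(sizes)
```

The two unit conventions account for the "725" and "691" figures that appear in different answers: they describe the same number of bytes.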

I'm no expert, however, the Human Genome page on Wikipedia states the following:

Raw MB:

  • Male (XY): 770MB
  • Female (XX): 756MB

I'm not sure where their variance comes from, but I'm sure you can figure it out.

Yes, the minimum RAM needed for the whole human DNA is about 770 MB. However, the 2-bit representation is impractical: it is hard to search through or to do computations on. Therefore some mathematicians designed more effective ways to store those sequences of bases and to use them in searching and comparison algorithms, for example in GARLI (www.bio.utexas.edu/faculty/antisense/garli/garli.html). This application is running on my PC right now, so I can tell you that in practice it has the DNA stored in about 1,563 MB.

Most answers, except those by users slayton, rauchen, and Paul Amstrong, are dead wrong if this is about pure one-to-one storage without compression techniques.

The human genome, with 3 Gb (gigabases) of nucleotides, corresponds to 3 GB of bytes and not ~750 MB. The constructed "haploid" genome according to NCBI is currently 3,436,687 kb, or 3.436687 Gb, in size. Check here for yourself.

Haploid = a single copy of each chromosome. Diploid = two copies of the haploid set. Humans have 22 unique autosomes x 2 = 44. In males the 23rd pair is X and Y, making 46 in total. In females the 23rd pair is X and X, which also makes 46 in total.

For males that would be 23 + 1 chromosomes in data storage on an HDD (22 autosomes plus X and Y), and for females 23 chromosomes, which explains the small differences mentioned now and then in the answers. The X chromosome in males is equivalent to the X chromosome in females.

Loading the genome (23 + 1 chromosomes) into memory is therefore done in parts, via BLAST, using databases constructed from FASTA files. Zipped or not, nucleotide sequences are hard to compress. Back in the early days, one of the tricks was to replace tandem repeats with a shorter code (e.g. GACGACGAC becomes "3GAC"; 9 bytes down to 4 bytes). The reason was to save hard-drive space (the era of 500 MB to 2 GB HDDs with 7,200 rpm platters and SCSI connectors). For sequence searching, the same was done with the query.
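A rough sketch of the tandem-repeat trick, for illustration only. It is not the historical tool's format: the token here is written as "3(GAC)" rather than "3GAC", because the parentheses are an assumption added so that decoding stays unambiguous when other bases follow the repeat. The names `collapse` and `expand` and the 2-to-6-base unit limit are likewise made up, and the input is assumed to contain only A, C, G and T.

```python
import re

# A repeat unit of 2 to 6 bases followed by one or more copies of itself.
_REPEAT = re.compile(r"([ACGT]{2,6}?)\1+")
# A token produced by collapse(): a count, then the unit in parentheses.
_TOKEN = re.compile(r"(\d+)\(([ACGT]+)\)")

def collapse(seq: str) -> str:
    """Replace tandem repeats with count(unit) tokens when that is shorter."""
    def repl(match: re.Match) -> str:
        unit = match.group(1)
        count = len(match.group(0)) // len(unit)
        token = f"{count}({unit})"
        # Keep the original text if the token would not save space.
        return token if len(token) < len(match.group(0)) else match.group(0)
    return _REPEAT.sub(repl, seq)

def expand(encoded: str) -> str:
    """Invert collapse() by writing every token out in full."""
    return _TOKEN.sub(lambda m: m.group(2) * int(m.group(1)), encoded)

print(collapse("TTGACGACGACAA"))          # repeat in the middle is collapsed
print(expand(collapse("TTGACGACGACAA")))  # round-trips to the input
```

With the delimiters, the 9-byte example shrinks to 6 bytes instead of 4, so the saving is smaller than in the format described above.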

If "coded nucleotide" storage used 2 bits per letter, then within a byte you would get:

A = 00
C = 01
G = 10
T = 11

Only this way do you fully use bit positions 1 through 8 of each byte. For example, the combination 00.01.10.11 (the byte 00011011) would then correspond to "ACGT" (and show up in a text file as an unrecognizable character). This alone is responsible for the fourfold reduction in file size that we see in other answers. Thus 3.4 Gb would be downsized to 0.85917175 GB, i.e. ~860 MB, plus a then-required conversion program (23 kB to 4 MB).

But in biology you want to be able to read the data, so gzip compression is more than enough; once unzipped, you can still read it. With this byte packing, the data becomes harder to read. That is why FASTA files are in reality plain-text files.
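A minimal sketch of this byte packing, using the A/C/G/T table above with four bases per byte. The names `pack` and `unpack` are made up for illustration. The sketch assumes the sequence contains only A, C, G and T (an ambiguity code such as N does not fit in 2 bits) and that the sequence length is stored separately, since the last byte may be padded.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index in this string == 2-bit code

def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk; the padding bits are zero.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])

packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
```

Seven bases fit in two bytes here instead of seven, which is the fourfold reduction described above (rounded up to a whole byte).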

The human genome contains 2.9 billion base pairs. So if you represented each base pair as a byte, it would take 2.9 billion bytes, or 2.9 GB. You could probably come up with a more creative way of storing base pairs, as each base pair only requires 2 bits. So you could store 4 base pairs per byte, bringing the total down to less than a GB.

There are 4 nucleotide bases that make up our DNA: A, C, G, and T. Therefore each base in the DNA takes up 2 bits. There are around 2.9 billion bases, so that's around 700 megabytes. The weird thing is that this would fill a normal data CD! Coincidence?!?

Just did it too. The raw sequence is ~700 MB. If one uses a fixed stored sequence or a fixed sequence-storage algorithm, and the fact that the changes are 1%, I calculated ~120 MB with a per-chromosome, sequence-offset, state-delta storage. That's it for the storage.
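One way to picture an offset-plus-delta scheme is to store only the positions where a sample differs from a shared reference. The sketch below is an assumption about what such a scheme might look like, not the poster's actual method: it handles substitutions only (no insertions or deletions), works on one chromosome-like string at a time, and the names `diff_against_reference` and `apply_diff` are made up.

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """List (offset, base) pairs where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (offset, s)
        for offset, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]

def apply_diff(reference: str, deltas: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference and the stored differences."""
    seq = list(reference)
    for offset, base in deltas:
        seq[offset] = base
    return "".join(seq)

reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
deltas = diff_against_reference(reference, sample)
print(deltas)
print(apply_diff(reference, deltas) == sample)
```

The storage cost then scales with the number of differences rather than with the genome length, which is the idea behind the much smaller figure quoted above.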

All answers are leaving off the fact that nuDNA is not the only DNA that defines a human genome. mtDNA is also inherited and it contributes an additional 16,500 base pairs to a human genome, bringing it more in line with the Wikipedia guess of 770MB for males, and 756MB for females.

This does not mean that a human genome can easily be stored on a 4 GB USB stick. Bits do not represent information by themselves; it is the combination of bits that represents information. So in the case of nuDNA and mtDNA, the bits are encoded (not to be confused with compressed) to represent proteins and enzymes that in themselves would require many MB of raw data to represent, especially in terms of functionality.

Food for thought: 80% of the human genome is called "non-coding" DNA, so did you really believe that the entire human body and brain can be represented in a mere 151 to 154 MB of raw data?

There are only 2 types of base pairs: cytosine can only bind to guanine, and adenine can only bind to thymine. So each base pair can be considered a single bit. This means that an entire strand of human DNA, ~3 billion "bits", would be around 350 megabytes.

One base -- T, C, A, G (in the base-4 number system: 0, 1, 2, 3) -- is encoded as two bits (not one), so one base pair is encoded by four bits.

Licensed under: CC-BY-SA with attribution