Advances in next generation sequencing technologies has
resulted in the generation of unprecedented levels of sequence
data. Therefore, modern biology now presents new challenges in
terms of data management, query and analysis. Human DNA is
comprised of approximately 3 billion base pairs with a personal
genome representing approximately 100 gigabytes (GB) of data,
the equivalent of 102,400 photos. By the end of 2011, the global
annual sequencing capacity was estimated to be 13 quadrillion
bases and counting, enough data to fill a stack of DVDs two miles
high.
Moore’s Law describes a trend coined by Intel co-founder Gordon Moore which states that ‘‘the number of transistors that can be placed on an integrated circuit board is increasing exponentially, with a doubling time of roughly 18 months’’ . Put more simply: computers double in speed and half in size every 18 months. Similar phenomena have been noted for the capacity of hard disks (Kryder’s Law) and network bandwidth (Nielsen’s Law and Butter’s Law) . This trend has remained true for approximately 40 years, until the completion of the Human Genome project in 2003. Since then, a deluge of biological sequence data has been generated; a phenomenon largely spurred by the falling cost of sequencing. Sequencing a human genome has decreased in cost from $1 million in 2007 to $1 thousand in 2012 . As further evidence of this, the 1,000 Genomes project [8], which involves sequencing and cataloguing human genetic variation, has deposited two times more raw data into NCBI’s GenBank during its first 6 months than all the previous sequences deposited in the last 30 years and with mobile sequencing thumb drives on the horizon, it shows no sign of slowing. Over the coming years, the National Cancer Institute will sequence a million genomes to understand biological pathways and the genomic variation. Given that the whole genome of a tumour and a matching normal tissue sample consumes 1 TB of uncompressed data (this could be reduced by a factor of 10 if compressed); one million genomes will require 1 million TB, equivalent to 1000 petabyte (PB) or 1 Exabyte (EB).