Big Data in Computational Biology
The amount of data generated by next-generation sequencing (NGS) machines is now doubling every five months, and the trend is expected to continue for the next few years [1]. In contrast, the number of transistors on a chip doubles only every two years (Moore’s law), with chip performance doubling at a slightly faster rate of 18 months (a figure attributed to Intel executive David House). Similarly, the doubling time for drive capacity is also about 18 months. Hence, the growth rate of sequence data generation is outpacing that of hardware capabilities by a factor of roughly three to four every year (sequence data grow about 2^(12/5) ≈ 5.3-fold annually, while hardware capacity grows about 2^(12/18) ≈ 1.6-fold). Without human ingenuity, it is apparent that not only will we be restricted to analyzing an ever-smaller fraction of the data generated, we may not even have the capacity to store it all.
While the explosion in data generated by sequencing machines has attracted the most attention, parallel developments are occurring in all fields of biomedicine. Epigenomics, transcriptomics, proteomics, metabolomics, functional genomics, structural biology, single-cell analysis, and biomedical imaging show similarly explosive growth in data generation. With the transition to electronic health records, clinical data analysis will also be joining the rich data party. Interestingly, increasingly large data sets are also being generated by computer simulation, and these often have to be stored and analyzed in the same way as biological assay data. For example, agent-based simulations, which are increasingly popular in the study of complex adaptive systems such as the brain or the immune system, model individual entities (cells, people) as autonomous agents and track their properties over time; such simulations can generate massive amounts of data. In order to meet the challenges of big data in biology and medicine, fundamental innovations in data structures and algorithms will be critical, as will breakthroughs in database technologies, bioinformatics, machine learning, and systems biology. This is a great time for students of computer science with an interest in biology and medicine to become involved, with opportunities as vast as the challenges presented.
USES OF BIG DATA
How can we use big data in biomedicine? Grossly oversimplifying, big data is currently used for understanding disease risk in individuals and, to a lesser extent, for providing insight into disease mechanisms. An example of how big data is used for linking risk of disease to personal biomedical data is the genome-wide association study (GWAS), which makes use of single-nucleotide polymorphism (SNP) arrays to probe for hundreds of thousands to millions of genetic variants. In typical case-control studies, differences in SNP frequencies between cases and controls are then used to find SNPs associated with the disease being studied. Similar association studies are widely used for data from other genomic assays, such as full sequence reads, expression arrays, proteomics, and metabolomics. The ultimate goal of this research is to create a database of disease signatures that can be used to predict the risk of disease in an individual, and then to customize appropriate prevention or therapeutic efforts for personalized medicine. One caveat with such massive data mining is a high risk of false positive results. Fortunately, well established statistical methods that limit such false positives are available (e.g. permutation resampling methods to control the family-wise Type I error rate, sketched below), but the lessons learned by biostatisticians may not have fully filtered down to all research communities. A notorious poster reports on the use of standard functional brain imaging analysis methods to demonstrate “a dead salmon perceiving humans can tell their emotional state” [2].
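To make the permutation resampling idea concrete, here is a minimal Python sketch of a maxT-style adjustment: the case/control labels are repeatedly shuffled, and each SNP’s observed statistic is compared against the permutation distribution of the maximum statistic. The statistic, variable names, and simulated data are purely illustrative, not taken from any particular GWAS pipeline.

```python
import numpy as np

def maxT_permutation_pvalues(genotypes, labels, n_perm=1000, seed=None):
    """Westfall-Young style maxT adjustment controlling the family-wise error rate."""
    rng = np.random.default_rng(seed)

    # simple per-SNP statistic: absolute difference in mean allele count
    def stats(y):
        return np.abs(genotypes[y == 1].mean(axis=0) - genotypes[y == 0].mean(axis=0))

    observed = stats(labels)
    max_null = np.empty(n_perm)
    for i in range(n_perm):
        permuted = rng.permutation(labels)    # shuffling breaks any true association
        max_null[i] = stats(permuted).max()   # most extreme statistic in this permutation
    # adjusted p-value: fraction of permutations whose maximum beats the observed statistic
    return (max_null[None, :] >= observed[:, None]).mean(axis=1)

# toy data: 200 subjects, 1,000 SNPs coded 0/1/2, half cases and half controls
geno = np.random.default_rng(0).integers(0, 3, size=(200, 1000))
y = np.array([1] * 100 + [0] * 100)
print(maxT_permutation_pvalues(geno, y, n_perm=200).min())
```

Because only the single maximum statistic per permutation is retained, the adjustment remains valid however many SNPs are tested, which is exactly what keeps the false positive rate under control in genome-scale scans.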
The use of big data for providing insight into disease mechanisms is less mature; this is a challenging problem for which the appropriate mathematical and statistical framework for analysis is less well defined. Understanding disease mechanisms from big data requires tight feedback loops between experimental research and computational analysis, and cohesive interdisciplinary teams that can perform such work are rare. Finally, the data currently being generated are often highly homogeneous (e.g. DNA strings) and not ideal for mechanism discovery, which may require linking multiple types of data over several time points. Although mechanistic models based on rich data may be used in the future, the analysis of big data has already revealed several surprising challenges to our biological knowledge. One surprise was the discovery that much non-coding DNA (accounting for more than 90 percent of our DNA and sometimes derogatorily labeled “junk” DNA) is highly evolutionarily conserved, suggesting essential, albeit unknown, functionality [3]. Borrowing terminology from cosmology, such DNA is often known as “dark matter,” after the missing matter hypothesized to be necessary for the observed large-scale dynamics and structure of the universe. Another surprise was that the more than 1,200 genetic variants discovered in GWAS account for only a small fraction of total heritability. It remains unknown whether this “missing heritability” is due to rare variants not detected by GWAS or is an artifact of our current statistical models for estimating heritability [4].
BOTTLENECKS IN BIG DATA ANALYSIS
The first bottleneck in big data analysis is data storage and retrieval. Given that the rate of growth in storage capacity is not likely to increase suddenly, attention has focused on more efficient data compression. An interesting direction is the use of probabilistic data structures and algorithms that can store data with dramatic gains in compression efficiency in exchange for only a small loss in certainty. For example, Bloom filters that guarantee a specified false positive rate and zero false negatives can be constructed to store sequence data, as sketched below. A variety of ingenious proposals for compressing NGS data were submitted to the Sequence Squeeze competition sponsored by the Pistoia Alliance (http://www.sequencesqueeze.org).
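The following is a minimal Python sketch of the Bloom filter idea, assuming we want to record which k-mers occur in a set of reads; the hashing scheme and parameters are illustrative rather than taken from any published sequence-storage tool. Membership queries may return occasional false positives at a rate controlled by the filter size and number of hash functions, but never false negatives.

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership with a tunable false positive rate."""
    def __init__(self, n_bits=8_000_000, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # derive several bit positions from salted SHA-256 digests
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# store every 21-mer of a read; the filter uses ~1 MB regardless of how many reads follow
bf = BloomFilter()
read = "ACGTACGTGGTCCAGTACGTACGTAGCTAG"
k = 21
for i in range(len(read) - k + 1):
    bf.add(read[i:i + k])
print(read[:k] in bf)   # True: the k-mer was inserted
print("A" * k in bf)    # almost certainly False: never inserted
```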
Simply storing big data is suboptimal—given the cost of generating it, ideally, big data should be freely shared and reused by different investigators. Funding agencies and top journals require that big data be deposited in online repositories before publication, making the data publicly accessible in principle. However, the data in public repositories may be poorly annotated, and linking information from distinct databases might be impossible because of different data schemas and a lack of unifying metadata. To address this issue, data standards in the form of minimal information requirements have been published for several data types (e.g. MIAME, the minimum information about a microarray experiment), and there is a drive to create standard vocabularies in the form of biomedical ontologies to allow data sharing across databases and machine processing.
Even if the data can be stored and retrieved efficiently, big data is often too large to fit into available RAM, so there is a critical need for online algorithms that can efficiently process input piece by piece; languages that support generators are therefore likely to be increasingly popular for the analysis of big data. In statistics and machine learning, Bayesian models with conjugate priors are a classic example of online algorithms: because the prior family is closed under Bayesian updating, we can recursively apply Bayes’ theorem to update the posterior distribution as data streams in.
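A minimal Python sketch of this idea, assuming a Beta-Bernoulli model streaming over a generator of binary observations (for instance, whether each read maps to a region of interest); the file path, data source, and parameter names are illustrative.

```python
def stream_of_calls(path):
    """Lazily yield 0/1 observations, one per line, without loading the file into RAM."""
    with open(path) as handle:
        for line in handle:
            yield int(line.strip())

def online_beta_update(observations, alpha=1.0, beta=1.0):
    """Recursive Bayesian updating of a Bernoulli rate under a conjugate Beta prior.
    Because the Beta family is closed under Bernoulli updates, each observation
    merely increments one of two counts; the posterior is available at any time."""
    for x in observations:
        alpha += x                      # count of successes
        beta += 1 - x                   # count of failures
        yield alpha / (alpha + beta)    # current posterior mean

# usage: consume the stream item by item, keeping only two numbers in memory
for posterior_mean in online_beta_update(stream_of_calls("calls.txt")):
    pass  # log or act on the running estimate here
```

The key point is that memory use is constant no matter how long the stream is, which is precisely the property an online algorithm needs when the data cannot fit in RAM.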
Machine learning is a field central to the analysis of big data. The information-processing rate of the human brain is severely constrained, necessitating the use of algorithms that can summarize the data and reduce the number of interesting features to a manageable level. Probabilistic graphical models with a foundation in Bayesian statistics play an increasing role in big data machine learning algorithms due to their ability to learn structure as well as parameters, the ease of constructing hierarchical (“mixed effects”) models, and their natural fit to online processing requirements. Another advantage of Bayesian probabilistic models is their declarative nature, which allows algorithms developed for applications such as text mining by Yahoo or social network modeling by Facebook to be easily adapted to biomedical data (or vice versa).
Finally, the ability to visualize or summarize big data is crucial to scientific discovery and insight, since the visual cortex takes up a larger share of our brain than any other sensory modality. Most investigators still rely on variations of pie and bar charts or scatter and line plots to visualize their data.