Big Data in Computational Biology
The amount of data generated by next-generation sequencing (NGS) machines is now doubling every five months, and the trend is expected to continue for the next few years [1]. In contrast, the number of transistors on a chip doubles only every two years (Moore’s law), with chip performance doubling at a slightly faster rate of 18 months (a figure attributed to Intel executive David House). Similarly, the doubling time for drive capacity is also about 18 months. Hence, the growth rate of sequence data generation is outpacing that of hardware capabilities by a factor of roughly three to four every year (sequence data grow about 2^(12/5) ≈ 5.3-fold annually, while hardware capacity grows about 2^(12/18) ≈ 1.6-fold). Without human ingenuity, it is apparent that not only will we be restricted to analyzing an ever-smaller fraction of the data generated, we may not even have the capacity to store it all.
While the explosion in data generated by sequencing machines has attracted the most attention, parallel developments are occurring in all fields of biomedicine. Epigenomics, transcriptomics, proteomics, metabolomics, functional genomics, structural biology, single-cell analysis, and biomedical imaging show similarly explosive growth in data generation. With the transition to electronic health records, clinical data analysis will also be joining the rich data party. Interestingly, increasingly large data sets are also being generated by computer simulation, and these often have to be stored and analyzed in the same way as biological assay data. For example, agent-based simulations, which are increasingly popular in the study of complex adaptive systems such as the brain or the immune system, model individual entities (cells, people) as autonomous agents and track their properties over time; such simulations can generate massive amounts of data. In order to meet the challenges of big data in biology and medicine, fundamental innovations in data structures and algorithms will be critical, as will breakthroughs in database technologies, bioinformatics, machine learning, and systems biology. This is a great time for students of computer science with an interest in biology and medicine to become involved, with opportunities as vast as the challenges presented.
USES OF BIG DATA
How can we use big data in biomedicine? Grossly oversimplifying, big data is currently used for understanding disease risk in individuals and, to a lesser extent, for providing insight into disease mechanisms. An example of how big data is used for linking risk of disease to personal biomedical data is the genome-wide association study (GWAS), which makes use of single-nucleotide polymorphism (SNP) arrays to probe for hundreds of thousands to millions of genetic variants. In typical case-control studies, differences in SNP frequencies between cases and controls are then used to find SNPs associated with the disease being studied. Similar association studies are widely used for data from other genomic assays, such as full sequence reads, expression arrays, proteomics, and metabolomics. The ultimate goal of this research is to create a database of disease signatures that can be used to predict the risk of disease in an individual, and then to customize appropriate prevention or therapeutic efforts for personalized medicine. One caveat with such massive data mining is a high risk of false positive results. Fortunately, well established statistical methods that limit such false positives are available (e.g. permutation resampling methods to control the family-wise Type I error rate, sketched below), but the lessons learned by biostatisticians may not have fully filtered down to all research communities. A notorious poster reports on the use of standard functional brain imaging analysis methods to demonstrate “a dead salmon perceiving humans can tell their emotional state” [2].
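To make the permutation resampling idea concrete, here is a minimal Python sketch of a maxT-style adjustment: the case/control labels are repeatedly shuffled, and each SNP’s observed statistic is compared against the permutation distribution of the maximum statistic. The statistic, variable names, and simulated data are purely illustrative, not taken from any particular GWAS pipeline.

```python
import numpy as np

def maxT_permutation_pvalues(genotypes, labels, n_perm=1000, seed=None):
    """Westfall-Young style maxT adjustment controlling the family-wise error rate."""
    rng = np.random.default_rng(seed)

    # simple per-SNP statistic: absolute difference in mean allele count
    def stats(y):
        return np.abs(genotypes[y == 1].mean(axis=0) - genotypes[y == 0].mean(axis=0))

    observed = stats(labels)
    max_null = np.empty(n_perm)
    for i in range(n_perm):
        permuted = rng.permutation(labels)    # shuffling breaks any true association
        max_null[i] = stats(permuted).max()   # most extreme statistic in this permutation
    # adjusted p-value: fraction of permutations whose maximum beats the observed statistic
    return (max_null[None, :] >= observed[:, None]).mean(axis=1)

# toy data: 200 subjects, 1,000 SNPs coded 0/1/2, half cases and half controls
geno = np.random.default_rng(0).integers(0, 3, size=(200, 1000))
y = np.array([1] * 100 + [0] * 100)
print(maxT_permutation_pvalues(geno, y, n_perm=200).min())
```

Because only the single maximum statistic per permutation is retained, the adjustment remains valid however many SNPs are tested, which is exactly what keeps the false positive rate under control in genome-scale scans.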
The use of big data for providing insight into disease mechanisms is less mature; this is a challenging problem for which the appropriate mathematical and statistical framework for analysis is less well defined. Understanding disease mechanisms from big data requires tight feedback loops between experimental research and computational analysis, and cohesive interdisciplinary teams that can perform such work are rare. Finally, the data currently being generated are often highly homogeneous (e.g. DNA strings) and not ideal for mechanism discovery, which may require linking multiple types of data over several time points. Although mechanistic models based on rich data may be used in the future, the analysis of big data has already revealed several surprising challenges to our biological knowledge. One surprise was the discovery that much non-coding DNA (accounting for more than 90 percent of our DNA and sometimes derogatorily labeled “junk” DNA) is highly evolutionarily conserved, suggesting essential, albeit unknown, functionality [3]. Borrowing terminology from cosmology, such DNA is often known as “dark matter,” after the missing matter hypothesized to be necessary for the observed large-scale dynamics and structure of the universe. Another surprise was that the more than 1,200 genetic variants discovered in GWAS account for only a small fraction of total heritability. It remains unknown whether this “missing heritability” is due to rare variants not detected by GWAS or is an artifact of our current statistical models for estimating heritability [4].
BOTTLENECKS IN BIG DATA ANALYSIS
The first bottleneck in big data analysis is data storage and retrieval. Given that the rate of growth in storage capacity is not likely to increase suddenly, attention has focused on more efficient data compression. An interesting direction is the use of probabilistic data structures and algorithms that can store data with dramatic gains in compression efficiency in exchange for only a small loss in certainty. For example, Bloom filters that guarantee a specified false positive rate and zero false negatives can be constructed to store sequence data, as sketched below. A variety of ingenious proposals for compressing NGS data were submitted to the Sequence Squeeze competition sponsored by the Pistoia Alliance (http://www.sequencesqueeze.org).
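The following is a minimal Python sketch of the Bloom filter idea, assuming we want to record which k-mers occur in a set of reads; the hashing scheme and parameters are illustrative rather than taken from any published sequence-storage tool. Membership queries may return occasional false positives at a rate controlled by the filter size and number of hash functions, but never false negatives.

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership with a tunable false positive rate."""
    def __init__(self, n_bits=8_000_000, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # derive several bit positions from salted SHA-256 digests
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# store every 21-mer of a read; the filter uses ~1 MB regardless of how many reads follow
bf = BloomFilter()
read = "ACGTACGTGGTCCAGTACGTACGTAGCTAG"
k = 21
for i in range(len(read) - k + 1):
    bf.add(read[i:i + k])
print(read[:k] in bf)   # True: the k-mer was inserted
print("A" * k in bf)    # almost certainly False: never inserted
```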
Simply storing big data is suboptimal—given the cost of generating it, ideally, big data should be freely shared and reused by different investigators. Funding agencies and top journals require that big data be deposited in online repositories before publication, making the data publicly accessible in principle. However, the data in public repositories may be poorly annotated, and linking information from distinct databases might be impossible because of different data schemas and a lack of unifying metadata. To address this issue, data standards in the form of minimal information requirements have been published for several data types (e.g. MIAME, the minimum information about a microarray experiment), and there is a drive to create standard vocabularies in the form of biomedical ontologies to allow data sharing across databases and machine processing.
Even if the data can be stored and retrieved efficiently, big data is often too large to fit into available RAM, so there is a critical need for online algorithms that can efficiently process input piece by piece; languages that support generators are therefore likely to be increasingly popular for the analysis of big data. In statistics and machine learning, Bayesian models with conjugate priors are a classic example of online algorithms: because the prior family is closed under Bayesian updating, we can recursively apply Bayes’ theorem to update the posterior distribution as data streams in.
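A minimal Python sketch of this idea, assuming a Beta-Bernoulli model streaming over a generator of binary observations (for instance, whether each read maps to a region of interest); the file path, data source, and parameter names are illustrative.

```python
def stream_of_calls(path):
    """Lazily yield 0/1 observations, one per line, without loading the file into RAM."""
    with open(path) as handle:
        for line in handle:
            yield int(line.strip())

def online_beta_update(observations, alpha=1.0, beta=1.0):
    """Recursive Bayesian updating of a Bernoulli rate under a conjugate Beta prior.
    Because the Beta family is closed under Bernoulli updates, each observation
    merely increments one of two counts; the posterior is available at any time."""
    for x in observations:
        alpha += x                      # count of successes
        beta += 1 - x                   # count of failures
        yield alpha / (alpha + beta)    # current posterior mean

# usage: consume the stream item by item, keeping only two numbers in memory
for posterior_mean in online_beta_update(stream_of_calls("calls.txt")):
    pass  # log or act on the running estimate here
```

The key point is that memory use is constant no matter how long the stream is, which is precisely the property an online algorithm needs when the data cannot fit in RAM.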
Machine learning is a field central to the analysis of big data. The information-processing rate of the human brain is severely constrained, necessitating the use of algorithms that can summarize the data and reduce the number of interesting features to a manageable level. Probabilistic graphical models with a foundation in Bayesian statistics play an increasing role in big data machine learning algorithms due to their ability to learn structure as well as parameters, the ease of constructing hierarchical (“mixed effects”) models, and their natural fit to online processing requirements. Another advantage of Bayesian probabilistic models is their declarative nature, which allows algorithms developed for applications such as text mining by Yahoo or social network modeling by Facebook to be easily adapted to biomedical data (or vice versa).
Finally, the ability to visualize or summarize big data is crucial to scientific discovery and insight, since the visual cortex takes up a larger share of our brain than any other sensory modality. Most investigators still rely on variations of pie and bar charts or scatter and line plots to visualize their data.