ABSTRACT
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.
1. INTRODUCTION
Large-scale analytical data processing has become widespread in web companies and across industries, not least due to low-cost storage that has enabled collecting vast amounts of business-critical data. Putting this data at the fingertips of analysts and engineers has grown increasingly important; interactive response times often make a qualitative difference in data exploration, monitoring, online customer support, rapid prototyping, debugging of data pipelines, and other tasks.

Performing interactive data analysis at scale demands a high degree of parallelism. For example, reading one terabyte of compressed data in one second using today's commodity disks would require tens of thousands of disks. Similarly, CPU-intensive queries may need to run on thousands of cores to complete within seconds. At Google, massively parallel computing is done using shared clusters of commodity machines [5]. A cluster typically hosts a multitude of distributed applications that share resources, have widely varying workloads, and run on machines with different hardware parameters. An individual worker in a distributed application may take much longer to execute a given task than others, or may never complete due to failures or preemption by the cluster management system. Hence, dealing with stragglers and failures is essential for achieving fast execution and fault tolerance [10].

The data used in web and scientific computing is often non-relational. Hence, a flexible data model is essential in these domains. Data structures used in programming languages, messages exchanged by distributed systems, structured documents, etc. lend themselves naturally to a nested representation. Normalizing and recombining such data at web scale is usually prohibitive. A nested data model underlies most of structured data processing at Google [21] and reportedly at other major web companies.

This paper describes a system called Dremel that supports interactive analysis of very large datasets over shared clusters of commodity machines. Unlike traditional databases, it is capable of operating on in situ nested data. In situ refers to the ability to access data 'in place', e.g., in a distributed file system (like GFS [14]) or another storage layer (e.g., Bigtable [8]). Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce (MR [12]) jobs, but at a fraction of the execution time. Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations. Dremel has been in production since 2006 and has thousands of users within Google. Multiple instances of Dremel are deployed in the company, ranging from tens to thousands of nodes.
Examples of using the system include:
• Analysis of crawled web documents.
• Tracking install data for applications on Android Market.
• Crash reporting for Google products.
• OCR results from Google Books.
• Spam analysis.
• Debugging of map tiles on Google Maps.
• Tablet migrations in managed Bigtable instances.
• Results of tests run on Google’s distributed build system.
• Disk I/O statistics for hundreds of thousands of disks.
• Resource monitoring for jobs run in Google’s data centers.
• Symbols and dependencies in Google’s codebase.
Dremel builds on ideas from web search and parallel DBMSs. First, its architecture borrows the concept of a serving tree used in distributed search engines [11]. Just like a web search request, a query gets pushed down the tree and is rewritten at each step. The result of the query is assembled by aggregating the replies received from lower levels of the tree. Second, Dremel provides a high-level, SQL-like language to express ad hoc queries. In contrast to layers such as Pig [18] and Hive [16], it executes queries natively without translating them into MR jobs.
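To make the serving-tree execution model concrete, the following Python sketch shows how an aggregation query can be answered by combining partial results up a tree of servers. The classes, the tree shape, and the (sum, count) decomposition of an average are illustrative assumptions, not Dremel's actual interfaces; Section 6 describes the real mechanism.

# Minimal sketch of multi-level aggregation in the spirit of a serving
# tree (illustrative only; not Dremel's implementation). Intermediate
# servers fan the query out to their children and combine the partial
# aggregates they receive; the root assembles the final result.

from typing import List, Tuple, Union

class Leaf:
    """A leaf server scanning one horizontal partition of the data."""
    def __init__(self, values: List[int]):
        self.values = values

    def execute(self) -> Tuple[int, int]:
        # Compute a partial aggregate (sum, count) over local data.
        return (sum(self.values), len(self.values))

class Server:
    """An intermediate or root server that delegates to children."""
    def __init__(self, children: List[Union["Server", Leaf]]):
        self.children = children

    def execute(self) -> Tuple[int, int]:
        # Aggregate the partial results received from lower levels.
        partials = [child.execute() for child in self.children]
        return (sum(s for s, _ in partials), sum(c for _, c in partials))

# A three-level tree: root -> two intermediate servers -> four leaves.
root = Server([
    Server([Leaf([1, 2, 3]), Leaf([4, 5])]),
    Server([Leaf([6]), Leaf([7, 8, 9])]),
])
total, count = root.execute()
print(f"AVG = {total / count}")  # average assembled from partial (sum, count) pairs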
Lastly, and importantly, Dremel uses a column-striped storage representation, which enables it to read less data from secondary storage and reduce CPU cost due to cheaper compression. Column stores have been adopted for analyzing relational data [1] but to the best of our knowledge have not been extended to nested data models. The columnar storage format that we present is supported by many data processing tools at Google, including MR, Sawzall [20], and FlumeJava [7].
In this paper we make the following contributions:
• We describe a novel columnar storage format for nested data. We present algorithms for dissecting nested records into columns and reassembling them (Section 4).
• We outline Dremel’s query language and execution. Both are designed to operate efficiently on column-striped nested data and do not require restructuring of nested records (Section 5).
• We show how execution trees used in web search systems can be applied to database processing, and explain their benefits for answering aggregation queries efficiently (Section 6).
• We present experiments on trillion-record, multi-terabyte datasets, conducted on system instances running on 1000-4000 nodes (Section 7).
This paper is structured as follows. In Section 2, we explain how Dremel is used for data analysis in combination with other data management tools. Its data model is presented in Section 3. The main contributions listed above are covered in Sections 4-8. Related work is discussed in Section 9. Section 10 is the conclusion.
2. BACKGROUND
We start by walking through a scenario that illustrates how interactive query processing fits into a broader data management ecosystem. Suppose that Alice, an engineer at Google, comes up with a novel idea for extracting new kinds of signals from web pages. She runs an MR job that cranks through the input data and produces a dataset containing the new signals, stored in billions of records in the distributed file system. To analyze the results of her experiment, she launches Dremel and executes several interactive commands:
DEFINE TABLE t AS /path/to/data/*
SELECT TOP(signal1, 100), COUNT(*) FROM t
Her commands execute in seconds. She runs a few other queries to convince herself that her algorithm works. She finds an irregularity in signal1 and digs deeper by writing a FlumeJava [7] program that performs a more complex analytical computation over her output dataset. Once the issue is fixed, she sets up a pipeline which processes the incoming input data continuously. She formulates a few canned SQL queries that aggregate the results of her pipeline across various dimensions, and adds them to an interactive dashboard. Finally, she registers her new dataset in a catalog so other engineers can locate and query it quickly.

The above scenario requires interoperation between the query processor and other data management tools. The first ingredient for that is a common storage layer. The Google File System (GFS [14]) is one such distributed storage layer widely used in the company. GFS uses replication to preserve the data despite faulty hardware and achieve fast response times in the presence of stragglers. A high-performance storage layer is critical for in situ data management. It allows accessing the data without a time-consuming loading phase, which is a major impedance to database usage in analytical data processing [13], where it is often possible to run dozens of MR analyses before a DBMS is able to load the data and execute a single query. As an added benefit, data in a file system can be conveniently manipulated using standard tools, e.g., to transfer to another cluster, change access privileges, or identify a subset of data for analysis based on file names.
Figure 1: Record-wise vs. columnar representation of nested data.

The second ingredient for building interoperable data management components is a shared storage format. Columnar storage proved successful for flat relational data but making it work for Google required adapting it to a nested data model. Figure 1 illustrates the main idea: all values of a nested field such as A.B.C are stored contiguously. Hence, A.B.C can be retrieved without reading A.E, A.B.D, etc. The challenge that we address is how to preserve all structural information and be able to reconstruct records from an arbitrary subset of fields. Next we discuss our data model, and then turn to algorithms and query processing.
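As a rough illustration of the idea behind Figure 1, the Python sketch below shreds nested records so that all values of each path (e.g., A.B.C) are stored contiguously. The record contents are invented, and this naive version deliberately drops the structural information (which record and which repetition each value came from); the actual lossless encoding, based on repetition and definition levels, is the subject of Section 4.

# Illustrative column-striping of nested records (invented data).
# Records are nested dicts; repeated fields are lists of sub-records.

def stripe(record, prefix="", columns=None):
    """Append every atomic value in `record` to the column of its path."""
    if columns is None:
        columns = {}
    for name, value in record.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            stripe(value, path, columns)
        elif isinstance(value, list):          # repeated field
            for item in value:
                stripe(item, path, columns)
        else:                                  # atomic value
            columns.setdefault(path, []).append(value)
    return columns

records = [
    {"A": {"B": [{"C": 1, "D": 2}, {"C": 3}], "E": 10}},
    {"A": {"B": [{"C": 4}], "E": 20}},
]

columns = {}
for r in records:
    stripe(r, columns=columns)

print(columns)
# {'A.B.C': [1, 3, 4], 'A.B.D': [2], 'A.E': [10, 20]}
# Scanning A.B.C now touches only that column, never A.E or A.B.D.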
3. DATA MODEL
In this section we present Dremel’s data model and introduce some terminology used later. The data model originated in the context of distributed systems (which explains its name, ‘Protocol Buffers’ [21]), is used widely at Google, and is available as an open source implementation. The data model is based on strongly-typed nested records. Its abstract syntax is given by:
$$\tau = \mathrm{dom} \;\mid\; \langle A_1 : \tau\,[\ast|?],\ \dots,\ A_n : \tau\,[\ast|?] \rangle$$

where τ is an atomic type or a record type. Atomic types in dom comprise integers, floating-point numbers, strings, etc. Records consist of one or multiple fields. Field i in a record has a name Aᵢ and an optional multiplicity label. Repeated fields (∗) may occur multiple times in a record. They are interpreted as lists of values, i.e., the order of field occurrences in a record is significant. Optional fields (?) may be missing from the record.
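To make the multiplicity labels concrete, here is a small Python sketch of a record conforming to a hypothetical nested schema. The field names are illustrative; in practice such schemas are written as Protocol Buffer message definitions [21].

# Hypothetical nested schema, written informally. Multiplicity labels:
# (no label) = required, exactly once; '?' = optional; '*' = repeated.
#
#   Document:
#     DocId:  int
#     Name:   record    (*)
#       Url:  string    (?)
#
# A conforming record, modeled with Python dicts and lists. A repeated
# field becomes a list whose element order is significant; an optional
# field may simply be absent.
document = {
    "DocId": 10,
    "Name": [
        {"Url": "http://A"},   # first occurrence of Name, Url present
        {},                    # second occurrence, optional Url missing
    ],
}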