Distributed GraphLab: A Framework for Machine Learning
and Data Mining in the Cloud
Yucheng Low
Carnegie Mellon University
ylow@cs.cmu.edu
Joseph Gonzalez
Carnegie Mellon University
jegonzal@cs.cmu.edu
Aapo Kyrola
Carnegie Mellon University
akyrola@cs.cmu.edu
Danny Bickson
Carnegie Mellon University
bickson@cs.cmu.edu
Carlos Guestrin
Carnegie Mellon University
guestrin@cs.cmu.edu
Joseph M. Hellerstein
UC Berkeley
hellerstein@cs.berkeley.edu
ABSTRACT
While high-level data-parallel frameworks, like MapReduce, simplify
the design and implementation of large-scale data processing
systems, they do not naturally or efficiently support many important
data mining and machine learning algorithms and can lead to inefficient
learning systems. To help fill this critical void, we introduced
the GraphLab abstraction which naturally expresses asynchronous,
dynamic, graph-parallel computation while ensuring data consistency
and achieving a high degree of parallel performance in the
shared-memory setting. In this paper, we extend the GraphLab
framework to the substantially more challenging distributed setting
while preserving strong data consistency guarantees.
We develop graph-based extensions to pipelined locking and data
versioning to reduce network congestion and mitigate the effect of
network latency. We also introduce fault tolerance to the GraphLab
abstraction using the classic Chandy-Lamport snapshot algorithm
and demonstrate how it can be easily implemented by exploiting
the GraphLab abstraction itself. Finally, we evaluate our distributed
implementation of the GraphLab abstraction on a large Amazon
EC2 deployment and show 1-2 orders of magnitude performance
gains over Hadoop-based implementations.
1. INTRODUCTION
With the exponential growth in the scale of Machine Learning and
Data Mining (MLDM) problems and increasing sophistication of
MLDM techniques, there is an increasing need for systems that can
execute MLDM algorithms efficiently in parallel on large clusters.
Simultaneously, the availability of Cloud computing services like
Amazon EC2 provides the promise of on-demand access to affordable
large-scale computing and storage resources without substantial
upfront investments. Unfortunately, designing, implementing, and
debugging the distributed MLDM algorithms needed to fully utilize
the Cloud can be prohibitively challenging, requiring MLDM experts
to address race conditions, deadlocks, distributed state, and communication
protocols while simultaneously developing mathematically
complex models and algorithms.
Nonetheless, the demand for large-scale computational and storage
resources has driven many [2, 14, 15, 27, 30, 35] to develop new
parallel and distributed MLDM systems targeted at individual models
and applications. This time-consuming and often redundant effort
slows the progress of the field as different research groups repeatedly
solve the same parallel/distributed computing problems. Therefore,
the MLDM community needs a high-level distributed abstraction
that specifically targets the asynchronous, dynamic, graph-parallel
computation found in many MLDM applications while hiding the
complexities of parallel/distributed system design. Unfortunately,
existing high-level parallel abstractions (e.g., MapReduce [8, 9],
Dryad [19], and Pregel [25]) fail to support these critical properties.
To help fill this void, we introduced the GraphLab abstraction [24], which
directly targets asynchronous, dynamic, graph-parallel computation
in the shared-memory setting.
In this paper we extend the multi-core GraphLab abstraction to the
distributed setting and provide a formal description of the distributed
execution model. We then explore several methods to implement
an efficient distributed execution model while preserving strict consistency
requirements. To achieve this goal we incorporate data
versioning to reduce network congestion and pipelined distributed
locking to mitigate the effects of network latency. To address the
challenges of data locality and ingress, we introduce the atom graph
for rapidly placing graph-structured data in the distributed setting.
We also add fault tolerance to the GraphLab framework by adapting
the classic Chandy-Lamport [6] snapshot algorithm and demonstrate
how it can be easily implemented within the GraphLab abstraction.
We conduct a comprehensive performance analysis of our
optimized C++ implementation on the Amazon Elastic Compute Cloud
(EC2) computing service. We show that applications created
using GraphLab outperform equivalent Hadoop/MapReduce [9]
implementations by 20-60x and match the performance of carefully
constructed MPI implementations. Our main contributions are the
following:
• A summary of common properties of MLDM algorithms and the
limitations of existing large-scale frameworks. (Sec. 2)
• A modified version of the GraphLab abstraction and execution
model tailored to the distributed setting. (Sec. 3)
• Two substantially different approaches to implementing the new
distributed execution model (Sec. 4):
  ◦ Chromatic Engine: uses graph coloring to achieve efficient
    sequentially consistent execution for static schedules.
  ◦ Locking Engine: uses pipelined distributed locking and latency
    hiding to support dynamically prioritized execution.
• Fault tolerance through two snapshotting schemes. (Sec. 4.3)
• Implementations of three state-of-the-art machine learning algorithms
on top of Distributed GraphLab. (Sec. 5)
• An extensive evaluation of Distributed GraphLab using a 512-processor
(64-node) EC2 cluster, including comparisons to Hadoop,
Pregel, and MPI implementations. (Sec. 5)
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 38th International Conference on Very Large Data Bases,
August 27th - 31st, 2012, Istanbul, Turkey.
Proceedings of the VLDB Endowment, Vol. 5, No. 8
Copyright 2012 VLDB Endowment 2150-8097/12/04... $10.00.
2. MLDM ALGORITHM PROPERTIES
In this section we describe several key properties of efficient
large-scale parallel MLDM systems addressed by the GraphLab
abstraction [24] and how other parallel frameworks fail to address
these properties. A summary of these properties and parallel frameworks
can be found in Table 1.
Graph Structured Computation: Many of the recent advances
in MLDM have focused on modeling the dependencies between data.
By modeling data dependencies, we are able to extract more signal
from noisy data. For example, modeling the dependencies between
similar shoppers allows us to make better product recommendations
than treating shoppers in isolation. Unfortunately, data-parallel
abstractions like MapReduce [9] are not generally well suited for
the dependent computation typically required by more advanced
MLDM algorithms. Although it is often possible to map algorithms
with computational dependencies into the MapReduce abstraction,
the resulting transformations can be challenging and may introduce
substantial inefficiency.
As a consequence, there has been a recent trend toward graph-parallel
abstractions like Pregel [25] and GraphLab [24], which
naturally express computational dependencies. These abstractions
adopt a vertex-centric model in which computation is defined as
kernels that run on each vertex. For instance, Pregel is a bulk synchronous
message-passing abstraction in which vertices communicate
through messages. In contrast, GraphLab is a sequential
shared-memory abstraction in which each vertex can read and write
data on adjacent vertices and edges. The GraphLab runtime is
then responsible for ensuring a consistent parallel execution. Consequently,
GraphLab simplifies the design and implementation of
graph-parallel algorithms by freeing the user to focus on sequential
computation rather than the parallel movement of data (i.e.,
messaging).
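To make the vertex-centric, shared-memory style concrete, the following minimal Python sketch writes a PageRank-like kernel as a plain sequential function that reads data on adjacent vertices and writes only its own vertex. It is an illustration under stated assumptions, not the GraphLab API: the names `in_neighbors`, `update`, `run_sequential`, and `toy_graph`, the dictionary-based graph, and the damping value 0.85 are all hypothetical choices made for this example.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # hypothetical representation: vertex -> out-neighbors


def in_neighbors(out_nbrs: Graph) -> Graph:
    """Invert the adjacency lists so each vertex can find its in-neighbors."""
    in_nbrs: Graph = {v: [] for v in out_nbrs}
    for u, targets in out_nbrs.items():
        for v in targets:
            in_nbrs[v].append(u)
    return in_nbrs


def update(v: str, rank: Dict[str, float], out_nbrs: Graph, in_nbrs: Graph,
           damping: float = 0.85) -> None:
    """Kernel for one vertex: read adjacent vertex data directly, write own data.

    No messages are sent; the kernel is ordinary sequential code.
    """
    incoming = sum(rank[u] / len(out_nbrs[u]) for u in in_nbrs[v])
    rank[v] = (1.0 - damping) + damping * incoming


def run_sequential(out_nbrs: Graph, sweeps: int = 100) -> Dict[str, float]:
    """Stand-in for a runtime: apply the kernel to every vertex, one at a time."""
    in_nbrs = in_neighbors(out_nbrs)
    rank = {v: 1.0 for v in out_nbrs}
    for _ in range(sweeps):
        for v in out_nbrs:
            update(v, rank, out_nbrs, in_nbrs)
    return rank


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = run_sequential(toy_graph)
```

In a message-passing formulation the same kernel would instead emit its contribution to each out-neighbor and consume the received values in a later superstep. Here the only thing a parallel runtime would have to add is the guarantee that concurrent calls to `update` on neighboring vertices do not interfere.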
Asynchronous Iterative Computation: Many important
MLDM algorithms iteratively update a large set of parameters.
Because of the underlying graph structure, parameter updates (on
vertices or edges) depend (through the graph adjacency structure)
on the values of other parameters. In contrast to synchronous
systems, which update all parameters simultaneously (in parallel)
using parameter values from the previous time step as input,
asynchronous systems update parameters using the most recent
parameter values as input. As a consequence, asynchronous systems
provide many MLDM algorithms with significant algorithmic
benefits. For example, linear systems (common to many MLDM
algorithms) have been shown to converge faster when solved
asynchronously [4]. Additionally, there are numerous other
cases (e.g., belief propagation [13], expectation maximization
[28], and stochastic optimization [35, 34]) where asynchronous
procedures have been empirically shown to significantly outperform
synchronous procedures. In Fig. 1(a) we demonstrate how asynchronous
computation can substantially accelerate the convergence
of PageRank.
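The difference between the two update rules can be sketched on a toy problem. The Python example below is not the experiment behind Fig. 1(a); it is a small illustration under assumptions chosen to make the effect visible: an 8-vertex directed cycle, ranks initialized to zero, vertices updated in cycle order, and a damping factor of 0.85. The function name `pagerank_sweeps` and all parameters are hypothetical. The synchronous variant reads from a snapshot taken at the start of each sweep, while the asynchronous variant reads the most recent values.

```python
def pagerank_sweeps(out_nbrs, asynchronous, damping=0.85, tol=1e-6, max_sweeps=10_000):
    """Return (sweeps until the largest per-vertex change drops below tol, ranks)."""
    in_nbrs = {v: [] for v in out_nbrs}
    for u, targets in out_nbrs.items():
        for v in targets:
            in_nbrs[v].append(u)

    rank = {v: 0.0 for v in out_nbrs}
    for sweep in range(1, max_sweeps + 1):
        # Synchronous: read values from the previous sweep (a frozen copy).
        # Asynchronous: read whatever is currently stored, including fresh updates.
        source = rank if asynchronous else dict(rank)
        largest_change = 0.0
        for v in out_nbrs:
            incoming = sum(source[u] / len(out_nbrs[u]) for u in in_nbrs[v])
            new_value = (1.0 - damping) + damping * incoming
            largest_change = max(largest_change, abs(new_value - rank[v]))
            rank[v] = new_value
        if largest_change < tol:
            return sweep, rank
    return max_sweeps, rank


n = 8
cycle = {i: [(i + 1) % n] for i in range(n)}  # i -> i+1, wrapping around

sync_sweeps, sync_rank = pagerank_sweeps(cycle, asynchronous=False)
async_sweeps, async_rank = pagerank_sweeps(cycle, asynchronous=True)
print(f"synchronous sweeps: {sync_sweeps}, asynchronous sweeps: {async_sweeps}")
```

On this graph a fresh value propagates around the whole cycle within a single asynchronous sweep, so the asynchronous variant needs fewer sweeps to reach the same tolerance. The size of the benefit depends on the graph structure and the update order, so this should be read as a qualitative illustration only.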
Synchronous computation incurs costly performance penalties
since the runtime of each phase is determined by the slowest machine.
The poor performance of the slowest machine may be caused
by a multitude of factors including: load and network imbalances,
hardware variability, and multi-tenancy (a principal concern in the
Cloud). Even in typical cluster settings, each compute node may also
provide other services (e.g., distributed file systems). Imbalances
in the utilization of these other services will result in substantial
performance penalties if synchronous computation is used.
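The cost of waiting for the slowest machine in every phase can be illustrated with simple arithmetic. The Python sketch below compares a barrier-synchronized schedule, where each phase lasts as long as its slowest machine, with an idealized barrier-free schedule in which each machine proceeds through its own work without waiting. The per-phase timings are made-up numbers, the function names are hypothetical, and the barrier-free figure ignores data dependencies between machines, so it is a lower bound rather than a prediction.

```python
def synchronous_time(times):
    """Total time with a barrier after every phase: sum of per-phase maxima."""
    return sum(max(phase) for phase in times)


def barrier_free_time(times):
    """Idealized total time without barriers: the busiest machine's total work."""
    n_machines = len(times[0])
    return max(sum(phase[m] for phase in times) for m in range(n_machines))


# times[phase][machine]: a different machine is slow in each phase (made-up data).
phase_times = [
    [1, 1, 3],
    [3, 1, 1],
    [1, 3, 1],
]

sync_total = synchronous_time(phase_times)
free_total = barrier_free_time(phase_times)
print(sync_total, free_total)
```

Every machine performs the same total amount of work here, yet the synchronous schedule pays for the slowest machine in each phase. The sum of per-phase maxima is never smaller than the maximum of per-machine sums, so transient slowdowns that rotate among machines are charged in full under barrier synchronization.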
In addition, variability in the complexity and convergence of
the individual vertex kernels c


0/5000

จาก: -

เป็น: -

ผลลัพธ์ (ไทย) 1: [สำเนา]

คัดลอก!

GraphLab กระจาย: กรอบการเรียนรู้ของเครื่องและการทำเหมืองข้อมูลในเมฆโรงแรมยู่เฉิงต่ำคาร์เนกีเมลลอนมหาวิทยาลัยylow@cs.cmu.eduโจเซฟ Gonzalezคาร์เนกีเมลลอนมหาวิทยาลัยjegonzal@cs.cmu.eduโรงแรมกัลมาร์ Kyrolaคาร์เนกีเมลลอนมหาวิทยาลัยakyrola@cs.cmu.eduDanny Bicksonคาร์เนกีเมลลอนมหาวิทยาลัยbickson@cs.cmu.eduCarlos Guestrinคาร์เนกีเมลลอนมหาวิทยาลัยguestrin@cs.cmu.eduโจเซฟ Hellerstein ม.เบิร์กลีย์ UChellerstein@cs.berkeley.eduบทคัดย่อในขณะที่ข้อมูลพื้นฐานแบบขนานกรอบ เช่น MapReduce ง่ายการออกแบบและดำเนินการประมวลผลข้อมูลขนาดใหญ่ระบบ พวกเขาตามธรรมชาติ หรือมีประสิทธิภาพได้มากที่สำคัญการทำเหมืองข้อมูลและเครื่องเรียนรู้อัลกอริทึมและสามารถนำไปต่ำเรียนรู้ระบบ ช่วยกรอกข้อมูลสำคัญนี้โมฆะ เราแนะนำabstraction GraphLab ที่ธรรมชาติแสดงแบบอะซิงโครนัสคำนวณแบบไดนามิก กราฟขนานในขณะที่ข้อมูลแน่และบรรลุเป้าหมายระดับสูงพร้อมประสิทธิภาพในการการตั้งค่าใช้ร่วมกันหน่วยความจำ ในเอกสารนี้ เราได้ขยายการ GraphLabตั้งค่ากระจายกรอบการท้าทายอย่างมากในขณะที่รักษาประกันความสอดคล้องของข้อมูลที่แข็งแรงเราพัฒนาขยายกราฟตาม pipelined ล็อกและข้อมูลรุ่นเพื่อลดการแออัดของเครือข่าย และบรรเทาผลของเวลาแฝงเครือข่าย เรายังแนะนำค่าเผื่อความบกพร่อง GraphLababstraction ที่ใช้อัลกอริทึมการ snapshot ของ Chandy Lamport คลาสสิกและแสดงให้เห็นว่ามันสามารถจะได้ดำเนินการ โดย exploitingabstraction GraphLab เอง สุดท้าย เราประเมินของเรากระจายดำเนินงานของ abstraction GraphLab ใน Amazon ขนาดใหญ่ประสิทธิภาพ EC2 ปรับใช้และดู 1-2 อันดับของขนาดกำไรมากกว่าอย่างไร Hadoop ตามการใช้งาน1. 
บทนำกับที่เรขาในระดับของการเรียนรู้ของเครื่อง และปัญหาการทำเหมืองแร่ (MLDM) ข้อมูลและความซับซ้อนที่เพิ่มขึ้นของเทคนิคการ MLDM การจำระบบที่สามารถเพิ่มขึ้นเป็นดำเนิน MLDM อัลกอริทึมมีประสิทธิภาพพร้อมกันในคลัสเตอร์ขนาดใหญ่พร้อมกัน ความพร้อมของคอมพิวเตอร์บริการคลาวด์Amazon EC2 ให้สัญญาการเข้าถึงตามความต้องการราคาไม่แพงขนาดใหญ่การใช้งานและการจัดเก็บทรัพยากรสำคัญการลงทุนที่ตะวัน อับ ออกแบบ การใช้ และดีบักกระบวน MLDM กระจายที่จำเป็นอย่างแท้จริงเมฆสามารถท้าทาย prohibitively ต้องการผู้เชี่ยวชาญ MLDMสภาพการแข่งขันที่อยู่ หยุดชะงัก รัฐกระจาย และการสื่อสารโปรโตคอลพร้อมพัฒนา mathematicallyรูปแบบที่ซับซ้อนและอัลกอริทึมกระนั้น ความต้องการขนาดใหญ่คำนวณและเก็บข้อมูลทรัพยากร มีการขับเคลื่อนหลาย [2, 14, 15, 27, 30, 35] การพัฒนาแบบขนาน และแบบกระจาย MLDM ระบบแต่ละรุ่นและโปรแกรมประยุกต์ พยายามใช้เวลานาน และมักจะซ้ำซ้อนช้าความคืบหน้าของเขตข้อมูลเป็นกลุ่มวิจัยต่าง ๆ ซ้ำ ๆแก้ปัญหาเดียวกันพร้อมกัน/กระจายคอมพิวเตอร์ ดังนั้นชุมชน MLDM ต้อง abstraction กระจายตัวสูงโดยที่เป้าหมายการแบบอะซิงโครนัส ไดนามิก กราฟขนานคำนวณพบในโปรแกรมประยุกต์ MLDM ในขณะที่ซ่อนตัวความซับซ้อนของการออกแบบระบบขนาน / กระจาย อับabstractions ขนานอยู่ระดับสูง (เช่น MapReduce [8, 9],โรงแรมไดรอาด [19] และ [25] Pregel) ไม่สนับสนุนคุณสมบัติเหล่านี้สำคัญช่วยกรอกข้อมูลนี้ โมฆะเราแนะนำ [24] GraphLab abstraction ที่ตรงเป้าหมายแบบอะซิงโครนัส ไดนามิก กราฟพร้อมคำนวณการจำที่ใช้ร่วมกันในเอกสารนี้ เราขยาย abstraction GraphLab หลายหลักการตั้งค่ากระจาย และให้คำอธิบายอย่างเป็นทางการของการกระจายแบบจำลองการดำเนินการ เราได้หลายวิธีจะใช้แล้วแบบจำลองการดำเนินการแจกจ่ายที่มีประสิทธิภาพในขณะที่รักษาความสอดคล้องอย่างเข้มงวดความต้องการ เพื่อให้บรรลุเป้าหมายนี้ เรารวบรวมข้อมูลรุ่นลดเครือข่ายแออัด และ pipelined กระจายล็อกการบรรเทาผลกระทบของเวลาแฝงเครือข่าย ที่อยู่ความท้าทายของข้อมูลท้องถิ่นและระดับที่เราแนะนำกราฟอะตอมสำหรับการทำกราฟอย่างรวดเร็วโครงสร้างข้อมูลการกระจายเรายังเพิ่มค่าเผื่อความบกพร่องในกรอบ GraphLab โดยดัดแปลงอัลกอริทึมสแนปช็อตคลาสสิ Chandy-Lamport [6] และแสดงให้เห็นถึงว่ามันสามารถจะได้ดำเนินการภายใน GraphLab 
abstractionเราทำการวิเคราะห์ประสิทธิภาพการทำงานที่ครอบคลุมของเราเพิ่มประสิทธิภาพงาน c ++บน Amazon Cloud ยืดหยุ่นบริการคอมพิวเตอร์ (EC2) แสดงว่า โปรแกรมประยุกต์ที่สร้างขึ้นใช้ GraphLab มีประสิทธิภาพสูงกว่าอย่างไร Hadoop เท่า MapReduce [9]ใช้งาน โดย x 20-60 และเปรียบเทียบประสิทธิภาพของอย่างระมัดระวังสร้างการใช้งาน MPI ผลงานหลักของเราคือการต่อไปนี้:•สรุปคุณสมบัติของอัลกอริทึม MLDM และข้อจำกัดของกรอบขนาดใหญ่ที่มีอยู่ (รอบ 2)•ปรับเปลี่ยน GraphLab abstraction และการดำเนินการรูปแบบเหมาะกับการตั้งค่าการกระจาย (รอบ 3)•สองแตกวิธีใช้ใหม่รูปแบบการดำเนินการกระจาย (รอบ 4):สิทธิ์ในการทำดิจิตอลหรือสิ่งพิมพ์ทั้งหมดหรือส่วนหนึ่งของงานนี้ใช้ส่วนตัวหรือห้องเรียนได้รับ โดยไม่มีค่าธรรมเนียมที่มีสำเนาไม่ทำ หรือกระจายกำไร หรือประโยชน์ทางการค้า และสำเนาที่หมีนี้ประกาศและอ้างอิงเต็มหน้าแรก การคัดลอกอื่น การประกาศใหม่ การลงรายการบัญชีบนเซิร์ฟเวอร์ หรือกระจายไปยังรายการ ต้องการเฉพาะก่อนสิทธิ์และ/หรือค่าธรรมเนียม บทความจากไดรฟ์ข้อมูลนี้ได้รับเชิญให้นำเสนอผลลัพธ์ในประชุมนานาชาติ 38 มากขนาดใหญ่ฐานข้อมูล27 31 สิงหาคม2012 อิสตันบูล ตุรกีวิชาการที่ VLDB มอบเงิน ปีที่ 5, 8 หมายเลขลิขสิทธิ์ 2012 VLDB องค์การกองทุน 21508097 /12/04 ... $ 10.00716เครื่องตั้งสายเครื่องยนต์: ใช้การระบายสีกราฟเพื่อให้มีประสิทธิภาพการดำเนินการที่สอดคล้องกันตามลำดับสำหรับกำหนดการคงล็อคเครื่องยนต์: ใช้ล็อกกระจาย pipelined และแฝงซ่อนการสนับสนุนแบบไดนามิกจัดลำดับความสำคัญการดำเนินการค่าเผื่อความบกพร่อง•ผ่านร่าง snapshotting สอง (4.3 วินาที)•การใช้งานของอัลกอริทึมเรียนรัฐ-of-the-art เครื่องที่สามอยู่บนสุดของ GraphLab กระจาย (5 วินาที)•การประเมินอย่างละเอียดโดยใช้ตัวประมวลผล 512 GraphLab กระจาย(โหน 64) คลัสเตอร์ EC2 รวมถึงการให้อย่างไร Hadoop การเปรียบเทียบPregel และการใช้งาน MPI (5 วินาที)2. 
MLDM อัลกอริทึมคุณสมบัติในส่วนนี้ เราอธิบายคุณสมบัติสำคัญหลาย ๆ อย่างของประสิทธิภาพขนาดใหญ่ขนาน MLDM ระบบการ GraphLab การabstraction [24] และวิธีอื่น ๆ กรอบคู่ขนานไม่อยู่คุณสมบัติเหล่านี้ สรุปของกรอบคู่ขนานและคุณสมบัติเหล่านี้สามารถพบในตารางที่ 1กราฟแบบมีโครงสร้างคำนวณ: หลายความก้าวหน้าล่าสุดinMLDM ได้เน้นการสร้างโมเดลความสัมพันธ์ระหว่างข้อมูลโดยอ้างอิงข้อมูลการสร้างโมเดล เราจะสามารถแยกสัญญาณเพิ่มเติมจากข้อมูลคะ ตัวอย่าง การสร้างโมเดลความสัมพันธ์ระหว่างนักช็อปเหมือนกันช่วยให้เราสามารถให้คำแนะนำผลิตภัณฑ์ที่ดีกว่ากว่าการรักษานักช็อปในแยก อับ ข้อมูลแบบขนานabstractions เช่น MapReduce [9] เป็นไม่เหมาะสมสำหรับการคำนวณขึ้นอยู่โดยทั่วไปต้องการเพิ่มเติมขั้นสูงอัลกอริทึม MLDM แม้ว่าบ่อยครั้งการแมปอัลกอริทึมมีการคำนวณขึ้นเป็น MapReduce abstractionแปลงผลลัพธ์ได้อย่างท้าทาย และอาจแนะนำinefficiency พบผล มีแนวโน้มล่าสุดไป graphparallelabstractions ชอบ Pregel [25] และ GraphLab [24] ซึ่งธรรมชาติแสดงอ้างอิงการคำนวณ Abstractions เหล่านี้นำแบบจำลองเกี่ยวกับจุดที่คำนวณไว้เป็นเมล็ดที่ใช้ในแต่ละจุด ตัวอย่าง Pregel เป็นจำนวนมากเป็นแบบซิงโครนัสข้อความที่ช่วย abstraction ที่การสื่อสารของจุดยอดผ่านข้อความ บนมืออื่น ๆ GraphLab เป็นแบบลำดับabstraction หน่วยความจำที่ใช้ร่วมกันซึ่งแต่ละจุดสามารถอ่าน และเขียนข้อมูลบนข้าง ๆ และขอบ รันไทม์ GraphLab เป็นแล้วรับผิดชอบเพื่อการดำเนินการคู่ขนานสอดคล้องกัน ดังนั้นGraphLab ช่วยให้ง่ายการออกแบบและปฏิบัติขนานกราฟอัลกอริทึม โดยพ้นผู้เน้นตามลำดับคำนวณแทนการเคลื่อนไหวคู่ขนานของข้อมูล (เช่นส่งข้อความ)คำนวณซ้ำแบบอะซิงโครนัส: หลายสิ่งสำคัญอัลกอริทึม MLDM ปรับปรุงพารามิเตอร์ชุดใหญ่ซ้ำ ๆเนื่องจากกราฟโครงสร้างพื้นฐาน การปรับปรุงพารามิเตอร์(จุดยอดหรือขอบ) ขึ้น (ผ่านโครงสร้างกราฟ adjacency)ค่าของพารามิเตอร์อื่น ๆ ในการซิงโครนัสระบบ การปรับปรุงพารามิเตอร์ทั้งหมดพร้อมกัน (ในขนาน)ใช้ค่าพารามิเตอร์จากการย้อนเวลาเป็นอินพุทพารามิเตอร์ที่ใช้ล่าสุดปรับปรุงระบบแบบอะซิงโครนัสค่าพารามิเตอร์เป็นอินพุท เป็นสัจจะ ระบบแบบอะซิงโครนัสมีหลาย MLDM อัลกอริทึม ด้วยสำคัญ algorithmicประโยชน์ ตัวอย่าง ระบบเชิงเส้น (ไป MLDM มากอัลกอริทึม) 
ได้รับการแสดงเพื่อมาบรรจบกันได้เร็วขึ้นเมื่อมีแก้ไขแบบอะซิงโครนัส [4] นอกจากนี้ ยังมีอีกมากมายกรณี (เช่น เผยแพร่ความเชื่อ [13] maximization ความคาดหวัง[28], และเพิ่มประสิทธิภาพสโทแคสติก [35, 34]) แบบอะซิงโครนัสกระบวนงานที่ได้รับการแสดง empirically มีประสิทธิภาพสูงกว่าอย่างมีนัยสำคัญกระบวนการซิงโครนัส ใน Fig. 1(a) เราสาธิตวิธีแบบอะซิงโครนัสคำนวณมากสามารถเร่งการบรรจบกันของรถเข้าคำนวณแบบซิงโครนัสก่อโทษประสิทธิภาพค่าใช้จ่ายเนื่องจากรันไทม์ของแต่ละขั้นตอนเป็นไปตามเครื่องช้าที่สุดอาจเกิดจากประสิทธิภาพที่ต่ำของเครื่องช้าที่สุดจากหลากหลายปัจจัยที่รวมทั้ง: สมดุลการผลิตและเครือข่ายความแปรผันของฮาร์ดแวร์ และเช่าหลาย (กังวลหลักในการเมฆ แม้ในการตั้งค่าคลัสเตอร์ทั่วไป แต่ละโหนดการคำนวณอาจจะให้บริการอื่น ๆ (เช่น ระบบแฟ้มแบบกระจาย) ความไม่สมดุลในการใช้ประโยชน์ของ บริการอื่น ๆ จะส่งผลสำคัญโทษปรับประสิทธิภาพถ้าใช้การคำนวณแบบซิงโครนัสนอกจากนี้ ความแปรผันในความซับซ้อนของการลู่เข้าของc จุดยอดแต่ละเมล็ด

การแปล กรุณารอสักครู่..

ผลลัพธ์ (ไทย) 2:[สำเนา]

คัดลอก!

กระจาย GraphLab:
กรอบสำหรับการเรียนรู้เครื่องและการทำเหมืองข้อมูลในเมฆ
Yucheng ต่ำ
Carnegie Mellon University
ylow@cs.cmu.edu~~V
โจเซฟกอนซาเล
Carnegie Mellon University
jegonzal@cs.cmu.edu
Aapo Kyrola
Carnegie Mellon University
akyrola@cs.cmu.edu
แดนนี่ Bickson
Carnegie Mellon University
bickson@cs.cmu.edu~~V
คาร์ลอ Guestrin
Carnegie Mellon University
guestrin@cs.cmu.edu~~V
โจเซฟเมตร Hellerstein
UC Berkeley
hellerstein@cs.berkeley.edu
บทคัดย่อขณะที่ข้อมูลระดับสูงกรอบคู่ขนานเช่น MapReduce ง่าย การออกแบบและการใช้งานของข้อมูลขนาดใหญ่การประมวลผลระบบพวกเขาไม่ได้เป็นธรรมชาติได้อย่างมีประสิทธิภาพหรือการสนับสนุนที่สำคัญมากการทำเหมืองข้อมูลและการเรียนรู้เครื่องขั้นตอนวิธีและสามารถนำไปสู่การไม่มีประสิทธิภาพระบบการเรียนรู้ ที่จะช่วยเติมเต็มช่องว่างที่สำคัญเราแนะนำนามธรรม GraphLab ซึ่งเป็นการแสดงออกถึงความเป็นธรรมชาติที่ไม่ตรงกันแบบไดนามิกคำนวณกราฟขนานขณะที่มั่นใจความสอดคล้องของข้อมูลและประสบความสำเร็จในระดับสูงของประสิทธิภาพการทำงานแบบคู่ขนานในการตั้งค่าหน่วยความจำที่ใช้ร่วมกัน ในบทความนี้เราขยาย GraphLab กรอบไปอย่างมีนัยสำคัญการตั้งค่าการกระจายที่ท้าทายมากขึ้นในขณะที่รักษาความสอดคล้องของข้อมูลการค้ำประกันที่แข็งแกร่ง. เราพัฒนาส่วนขยายตามกราฟจะล็อคไปป์ไลน์และข้อมูลเวอร์ชันเพื่อลดความแออัดของเครือข่ายและลดผลกระทบจากความล่าช้าของเครือข่าย นอกจากนี้เรายังแนะนำความอดทนความผิดไปที่ GraphLab นามธรรมโดยใช้ขั้นตอนวิธีคลาสสิกภาพรวม Chandy-Lamport และแสดงให้เห็นถึงวิธีการที่จะสามารถดำเนินการได้อย่างง่ายดายโดยการใช้ประโยชน์จากนามธรรม GraphLab ตัวเอง สุดท้ายเราประเมินการกระจายของเราดำเนินงานของนามธรรม GraphLab บนอเมซอนที่มีขนาดใหญ่ใช้งานEC2 และแสดง 1-2 คำสั่งของประสิทธิภาพการทำงานที่สำคัญกำไรมากกว่าการใช้งานHadoop-based. 1 บทนำกับการเจริญเติบโตชี้แจงในขนาดของเครื่องการเรียนรู้และการทำเหมืองข้อมูล(MLDM) ปัญหาและความซับซ้อนที่เพิ่มขึ้นของเทคนิคMLDM มีความต้องการที่เพิ่มขึ้นสำหรับระบบที่สามารถดำเนินการขั้นตอนวิธีการMLDM ได้อย่างมีประสิทธิภาพในแบบคู่ขนานในกลุ่มที่มีขนาดใหญ่. พร้อมกันที่ความพร้อมของระบบคลาวด์ บริการคอมพิวเตอร์เช่นAmazon EC2 ให้สัญญาของการเข้าถึงความต้องการที่จะราคาไม่แพงคอมพิวเตอร์ขนาดใหญ่และการจัดเก็บข้อมูลโดยไม่ต้องทรัพยากรที่สำคัญการลงทุนล่วงหน้า แต่น่าเสียดายที่การออกแบบดำเนินการและการแก้จุดบกพร่องขั้นตอนวิธีการ MLDM กระจายที่จำเป็นในการใช้ประโยชน์อย่างเต็มที่เมฆสามารถท้าทายสาหัสที่ต้องใช้ผู้เชี่ยวชาญMLDM เพื่อรับมือกับสภาพการแข่งขันงันรัฐกระจายและการสื่อสารโปรโตคอลในขณะเดียวกันการพัฒนาทางคณิตศาสตร์รูปแบบที่ซับซ้อนและขั้นตอนวิธี. 
Nonetheless, the demand for large-scale computational and storage resources has driven many [2, 14, 15, 27, 30, 35] to develop new parallel and distributed MLDM systems targeted at individual models and applications. This time-consuming and often redundant effort slows the progress of the field as different research groups repeatedly solve the same parallel/distributed computing problems. Therefore, the MLDM community needs a high-level distributed abstraction that specifically targets the asynchronous, dynamic, graph-parallel computation found in many MLDM applications while hiding the complexities of parallel/distributed system design. Unfortunately, existing high-level parallel abstractions (e.g. MapReduce [8, 9], Dryad [19] and Pregel [25]) fail to support these critical properties. To help fill this void we introduced [24] the GraphLab abstraction, which directly targets asynchronous, dynamic, graph-parallel computation in the shared-memory setting.

In this paper, we extend the multi-core GraphLab abstraction to the distributed setting and provide a formal description of the distributed execution model. We then explore several methods to implement an efficient distributed execution model while preserving strict consistency requirements. To achieve this goal we incorporate data versioning to reduce network congestion and pipelined distributed locking to mitigate the effects of network latency. To address the challenges of data locality and ingress we introduce the atom graph for rapidly placing graph structured data in the distributed setting. We also add fault tolerance to the GraphLab framework by adapting the classic Chandy-Lamport [6] snapshot algorithm and demonstrate how it can be easily implemented within the GraphLab abstraction.

We conduct a comprehensive performance analysis of our optimized C++ implementation on the Amazon Elastic Cloud (EC2) computing service. We show that applications created using GraphLab outperform equivalent Hadoop/MapReduce [9] implementations by 20-60x and match the performance of carefully constructed MPI implementations. Our main contributions are the following:
• A summary of common properties of MLDM algorithms and the limitations of existing large-scale frameworks. (Sec. 2)
• A modified version of the GraphLab abstraction and execution model tailored to the distributed setting. (Sec. 3)
• Two substantially different approaches to implementing the new distributed execution model (Sec. 4):
  ◦ Chromatic Engine: uses graph coloring to achieve efficient sequentially consistent execution for static schedules.
  ◦ Locking Engine: uses pipelined distributed locking and latency hiding to support dynamic prioritized execution.
• Fault tolerance through two snapshotting schemes. (Sec. 4.3)
• Implementations of three state-of-the-art machine learning algorithms on top of distributed GraphLab. (Sec. 5)
• An extensive evaluation of Distributed GraphLab using a 512 processor (64 node) EC2 cluster, including comparisons to Hadoop, Pregel, and MPI implementations. (Sec. 5)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 38th International Conference on Very Large Data Bases, August 27th - 31st 2012, Istanbul, Turkey. Proceedings of the VLDB Endowment, Vol. 5, No. 8. Copyright 2012 VLDB Endowment 2150-8097/12/04... $10.00.

2. MLDM ALGORITHM PROPERTIES
In this section we describe several key properties of efficient large-scale parallel MLDM systems addressed by the GraphLab abstraction [24] and how other parallel frameworks fail to address these properties. A summary of these properties and parallel frameworks can be found in Table 1.

Graph Structured Computation: Many of the recent advances in MLDM have focused on modeling the dependencies between data. By modeling data dependencies, we are able to extract more signal from noisy data. For example, modeling the dependencies between similar shoppers allows us to make better product recommendations than treating shoppers in isolation. Unfortunately, data parallel abstractions like MapReduce [9] are not generally well suited for the dependent computation typically required by more advanced MLDM algorithms. Although it is often possible to map algorithms with computational dependencies into the MapReduce abstraction, the resulting transformations can be challenging and may introduce substantial inefficiency.

As a consequence, there has been a recent trend toward graph-parallel abstractions like Pregel [25] and GraphLab [24] which naturally express computational dependencies. These abstractions adopt a vertex-centric model in which computation is defined as kernels that run on each vertex. For instance, Pregel is a bulk synchronous message passing abstraction where vertices communicate through messages. In contrast, GraphLab is a sequential shared memory abstraction where each vertex can read and write to data on adjacent vertices and edges. The GraphLab runtime is then responsible for ensuring a consistent parallel execution. Consequently, GraphLab simplifies the design and implementation of graph-parallel algorithms by freeing the user to focus on sequential computation rather than the parallel movement of data (i.e., messaging).

Asynchronous Iterative Computation: Many important MLDM algorithms iteratively update a large set of parameters. Because of the underlying graph structure, parameter updates (on vertices or edges) depend (through the graph adjacency structure) on the values of other parameters. In contrast to synchronous systems, which update all parameters simultaneously (in parallel) using parameter values from the previous time step as input, asynchronous systems update parameters using the most recent parameter values as input. As a consequence, asynchronous systems provide many MLDM algorithms with significant algorithmic benefits. For example, linear systems (common to many MLDM algorithms) have been shown to converge faster when solved asynchronously [4]. Additionally, there are numerous other cases (e.g., belief propagation [13], expectation maximization [28], and stochastic optimization [35, 34]) where asynchronous procedures have been empirically shown to significantly outperform synchronous procedures. In Fig. 1(a) we demonstrate how asynchronous computation can substantially accelerate the convergence of PageRank.

Synchronous computation incurs costly performance penalties since the runtime of each phase is determined by the slowest machine. The poor performance of the slowest machine may be caused by a multitude of factors including: load and network imbalances, hardware variability, and multi-tenancy (a principal concern in the Cloud). Even in typical cluster settings, each compute node may also provide other services (e.g., distributed file systems). Imbalances in the utilization of these other services will result in substantial performance penalties if synchronous computation is used.

In addition, variability in the complexity and convergence of the individual vertex kernels
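The distinction drawn in this section between synchronous and asynchronous updates can be illustrated with a small single-threaded sketch. It is not GraphLab's API or its distributed engine: the graph `OUT_LINKS`, the damping factor, and the functions `vertex_update` and `pagerank` are hypothetical names chosen for illustration. The synchronous variant reads only values from the previous sweep (Jacobi-style), while the asynchronous variant reads the most recent values, including those written earlier in the same sweep (Gauss-Seidel-style).

```python
# Hypothetical 4-vertex graph (an assumption for illustration):
# vertex -> list of out-neighbors; every vertex has at least one out-link.
OUT_LINKS = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
DAMPING = 0.85  # conventional PageRank damping factor, assumed here


def in_links(out_links):
    """Invert the adjacency list: vertex -> list of vertices linking to it."""
    incoming = {v: [] for v in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            incoming[dst].append(src)
    return incoming


def vertex_update(v, ranks, incoming, out_links):
    """PageRank kernel for one vertex, reading whatever values `ranks` holds."""
    n = len(out_links)
    total = sum(ranks[u] / len(out_links[u]) for u in incoming[v])
    return (1.0 - DAMPING) / n + DAMPING * total


def pagerank(out_links, asynchronous, tol=1e-10, max_sweeps=1000):
    """Run sweeps over all vertices until the largest change falls below `tol`.

    Returns the rank of every vertex and the number of sweeps performed.
    """
    incoming = in_links(out_links)
    ranks = {v: 1.0 / len(out_links) for v in out_links}
    for sweep in range(1, max_sweeps + 1):
        # Synchronous: read from a frozen copy of the previous sweep.
        # Asynchronous: read from `ranks` itself, so updates made earlier
        # in this sweep are immediately visible to later vertices.
        source = ranks if asynchronous else dict(ranks)
        change = 0.0
        for v in sorted(out_links):
            new_value = vertex_update(v, source, incoming, out_links)
            change = max(change, abs(new_value - ranks[v]))
            ranks[v] = new_value
        if change < tol:
            return ranks, sweep
    return ranks, max_sweeps


if __name__ == "__main__":
    for mode in (False, True):
        ranks, sweeps = pagerank(OUT_LINKS, asynchronous=mode)
        label = "asynchronous" if mode else "synchronous"
        print(label, "sweeps:", sweeps, "ranks:", ranks)
```

Both variants reach the same fixed point on this graph; comparing the reported sweep counts gives a rough, sequential analogue of the convergence comparison the text attributes to Fig. 1(a). The sketch says nothing about the cost of barriers or slow machines, which arise only in a genuinely parallel setting.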


