We are concerned with two major
classes of problems with large distributed systems: bugs and operator errors that cause a departure from the
system’s logical intent; and surprising “sustained emergent performance
degradation” of complex systems that
inevitably contain feedback loops.
We know how to use formal specification to find problems in the first class.
However, problems in the second class
can cripple a system even though no
logic bug is involved. A common example
is when a momentary slowdown
in a server (due, perhaps, to Java garbage collection) causes timeouts to be
breached on clients, causing the clients to retry requests, thus adding load
to the server, and further slowdown. In
such scenarios the system eventually
makes progress; it is not stuck in a logical deadlock, livelock, or other cycle.
But from the customer’s perspective
it is effectively unavailable due to sustained unacceptable response times.
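
The retry feedback loop described above can be illustrated with a small deterministic simulation. This is only a sketch under stated assumptions, not a model of any real AWS service: the function name `simulate_backlog`, the tick-based queue, and the rule that every client re-sends once when the estimated wait exceeds the timeout are all hypothetical simplifications chosen to make the effect visible.

```python
def simulate_backlog(ticks, arrivals_per_tick, capacity_per_tick,
                     timeout_ticks, stall_ticks, clients_retry=True):
    """Return the server backlog after each tick (hypothetical toy model).

    The server completes `capacity_per_tick` requests per tick, except during
    `stall_ticks`, when it completes none (standing in for a brief pause such
    as a garbage collection). If the estimated wait exceeds the client
    timeout, each client is assumed to re-send once, doubling offered load.
    """
    backlog = 0
    history = []
    for tick in range(ticks):
        capacity = 0 if tick in stall_ticks else capacity_per_tick
        # Estimated wait for a request that joins the queue at this tick.
        estimated_wait = backlog / capacity_per_tick
        offered = arrivals_per_tick
        if clients_retry and estimated_wait > timeout_ticks:
            offered += arrivals_per_tick  # retries add load to a slow server
        backlog = max(0, backlog + offered - capacity)
        history.append(backlog)
    return history


if __name__ == "__main__":
    params = dict(ticks=60, arrivals_per_tick=80, capacity_per_tick=100,
                  timeout_ticks=2, stall_ticks={10, 11, 12})
    with_retries = simulate_backlog(clients_retry=True, **params)
    without_retries = simulate_backlog(clients_retry=False, **params)
    print("backlog at end, with retries:   ", with_retries[-1])
    print("backlog at end, without retries:", without_retries[-1])
```

In this toy model the server keeps completing requests at full capacity in both runs, so there is no deadlock or livelock. Without retries the backlog left by the three-tick stall drains away. With retries the backlog keeps growing long after the stall has ended, which is the sustained degradation the text describes.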
TLA+ can be used to specify an upper
bound on response time, as a real-time
safety property. However, AWS systems
are built on infrastructure—disks, operating systems, network—that does
not support hard real-time scheduling
or guarantees, so real-time safety properties would not be realistic. We build
soft real-time systems in which very
short periods of slow responses are not
considered errors. However, prolonged severe slowdowns are considered errors. We do not yet know of a feasible
way to model a real system that would
enable tools to predict such emergent
behavior. We use other techniques to
mitigate these risks.
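
The soft real-time distinction drawn above, in which brief slow periods are tolerated but prolonged slowdowns count as errors, can be sketched as a check over a series of measured response times. This is an illustrative sketch only and is not one of the techniques the authors refer to. The function name `sustained_slowdowns`, the latency threshold, and the minimum run length are all assumptions introduced for the example.

```python
def sustained_slowdowns(latencies_ms, threshold_ms, min_run):
    """Return (first_index, last_index) pairs for each run of at least
    `min_run` consecutive samples whose latency exceeds `threshold_ms`.

    Shorter runs are ignored, mirroring the idea that very short periods
    of slow responses are not considered errors. All parameters here are
    hypothetical; real values would depend on the service.
    """
    runs = []
    run_start = None
    for index, latency in enumerate(latencies_ms):
        if latency > threshold_ms:
            if run_start is None:
                run_start = index
        else:
            if run_start is not None and index - run_start >= min_run:
                runs.append((run_start, index - 1))
            run_start = None
    # Close a slow run that lasts until the end of the series.
    if run_start is not None and len(latencies_ms) - run_start >= min_run:
        runs.append((run_start, len(latencies_ms) - 1))
    return runs


if __name__ == "__main__":
    samples = [20, 25, 900, 22, 21, 800, 850, 900, 950, 870, 23]
    print(sustained_slowdowns(samples, threshold_ms=500, min_run=3))
```

With the sample series shown, the single slow response at index 2 is ignored, while the five consecutive slow responses at indices 5 through 9 are reported as one sustained slowdown. A check of this kind observes degradation after the fact; it does not predict emergent behavior, which the text identifies as the open problem.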
