Success with DynamoDB gave us
enough evidence to present TLA+ to
the broader engineering community at
Amazon. This raised a challenge: how
to convey the purpose and benefits
of formal methods to an audience of
software engineers. Engineers think in
terms of debugging rather than “verification,” so we called the presentation
“Debugging Designs.”[18] Continuing
the metaphor, we have found that software engineers more readily grasp the
concept and practical value of TLA+ if
we dub it “exhaustively testable pseudo-code.” We initially avoid the words
“formal,” “verification,” and “proof”
due to the widespread view that formal methods are impractical. We also
initially avoid mentioning what TLA
stands for, as doing so would give an
incorrect impression of complexity.

Immediately after seeing the presentation, a team working on S3 asked
for help using TLA+ to verify a new
fault-tolerant network algorithm.
The documentation for the algorithm
consisted of many large, complicated
state-machine diagrams. To check
the state machine, the team had been
considering writing a Java program
to brute-force explore possible executions: essentially a hard-wired form
of model checking. They were able to
avoid the effort by using TLA+ instead.
Author F.Z. wrote two versions of the
spec over a couple of weeks. For this
particular problem, F.Z. found that
she was more productive in PlusCal
than TLA+, and we have observed that
engineers often find it easier to begin
with PlusCal.
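To make the idea of a hand-written brute-force explorer concrete, the following is a minimal sketch in Python (not the Java program the team considered, and not the S3 algorithm). It performs a breadth-first search over every reachable state of a small state machine and returns the shortest execution that violates an invariant. The toy protocol, the names `explore`, `next_states`, and `mutual_exclusion`, and the state encoding are all assumptions made for illustration: two processes each check a lock and then acquire it in a separate step, which is a deliberately flawed design.

```python
from collections import deque


def explore(initial, next_states, invariant):
    """Breadth-first search over all reachable states.

    Returns the shortest trace (list of states) ending in a state that
    violates the invariant, or None if every reachable state satisfies it.
    """
    parent = {initial: None}  # doubles as the visited set
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            # Walk the parent links back to the initial state.
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


# Hypothetical toy protocol: a state is (program counters, lock flag).
def next_states(state):
    pcs, lock = state
    for i in (0, 1):
        if pcs[i] == "idle" and not lock:
            # Step 1: the process observes that the lock is free.
            yield (pcs[:i] + ("checked",) + pcs[i + 1:], lock)
        elif pcs[i] == "checked":
            # Step 2: the process takes the lock and enters the critical section.
            yield (pcs[:i] + ("critical",) + pcs[i + 1:], True)


def mutual_exclusion(state):
    pcs, _ = state
    return pcs != ("critical", "critical")


trace = explore((("idle", "idle"), False), next_states, mutual_exclusion)
for step in trace or []:
    print(step)
```

Because the check and the acquire are separate steps, the search finds an interleaving in which both processes enter the critical section. A model checker such as TLC automates this kind of search for any specification, so the exploration logic does not have to be rewritten, and debugged, for each new design.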
Model checking revealed two subtle bugs in the algorithm and allowed
F.Z. to verify fixes for both. F.Z. then
used the spec to experiment with the
design, adding new features and optimizations. The model checker quickly
revealed that some of these changes
would have introduced bugs.

This success led AWS management
to advocate TLA+ to other teams working on S3. Engineers from those teams
wrote specs for two additional critical
algorithms and for one new feature.
F.Z. helped teach them how to write
their first specs. We find it encouraging
that TLA+ can be taught by engineers
who are still new to it themselves; this is
important for quickly scaling adoption
in an organization as large as Amazon.

Author B.M. was one such engineer.
His first spec was for an algorithm
known to contain a subtle bug. The bug
had passed unnoticed through multiple design reviews and code reviews
and had surfaced only after months of
testing. B.M. spent two weeks learning
TLA+ and writing the spec. Using it,
the TLC model checker found the bug
in seconds. The team had already designed and reviewed a fix for the bug,
so B.M. changed the spec to include
the proposed fix. The model checker
found the problem still occurred in a
different execution trace. A stronger fix
was proposed, and the model checker
verified the second fix. B.M. later wrote
another spec for a different algorithm.
That spec did not uncover any bugs but
did uncover several important ambiguities in the documentation for the
algorithm, which the spec helped resolve.

Somewhat independently, after seeing internal presentations about TLA+,
authors M.B. and M.D. taught themselves
PlusCal and TLA+ and started
using them on their respective projects
without further persuasion or assistance. M.B. used PlusCal to find three
bugs and wrote a public blog about his
personal experiments with TLA+ outside
of Amazon.[7] M.D. used PlusCal to
check a lock-free concurrent algorithm
and then used TLA+ to find a critical
bug in one of AWS’s most important
new distributed algorithms. M.D. also
developed a fix for the bug and verified the fix. Independently, C.N. wrote
a spec for the same algorithm that was
quite different in style from the spec
written by M.D., but both found the
same bug in the algorithm. This suggests the benefits of using TLA+ are
quite robust to variations among engineers. Both specs were later used to
verify that a crucial optimization to the
algorithm did not introduce any bugs.

Engineers at Amazon continue to
use TLA+, adopting the practice of first
writing a conventional prose-design
document, then incrementally refining
parts of it into PlusCal or TLA+. This
method often yields important insight
about the design, even without going as
far as full specification or model checking. In one case, C.N. refined a prose
design of a fault-tolerant replication
system that had been designed by another Amazon engineer. C.N. wrote
and model-checked specifications
at two levels of concurrency; these
specifications helped him understand
the design well enough to propose
a major protocol optimization that
radically reduced write latency in the
system.

We have also discovered that
TLA+ is an excellent tool for data modeling, as when designing the schema
for a relational or “NoSQL” database.
We used TLA+ to design a non-trivial
schema with semantic invariants over
the data that were much richer than
standard multiplicity constraints and
foreign key constraints. We then added
high-level specifications of some of
the main operations on the data that
helped us correct and refine the schema. This result suggests a data model
can be viewed as just another level of
abstraction of the entire system. It also
suggests TLA+ may help designers improve a system’s scalability. In order to
remove scalability bottlenecks, designers
often break atomic transactions
into finer-grain operations chained
together through asynchronous workflows;
TLA+ can help explore the consequences of such changes with respect
to isolation and consistency.
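The effect of splitting an atomic transaction can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not a schema or workflow from any Amazon system: a hypothetical transfer of `AMOUNT` between two accounts is modeled once as a single atomic step and once as a debit followed by a separate credit, and the invariant is that the two balances always sum to `TOTAL`. All names and values here are invented for the example.

```python
TOTAL = 100   # hypothetical invariant: the two balances always sum to this
AMOUNT = 10   # hypothetical transfer size


def atomic_transfer(state):
    a, b, pc = state
    if pc == "start":
        return (a - AMOUNT, b + AMOUNT, "done")
    return None  # step not enabled in this state


def debit(state):
    a, b, pc = state
    if pc == "start":
        return (a - AMOUNT, b, "debited")
    return None


def credit(state):
    a, b, pc = state
    if pc == "debited":
        return (a, b + AMOUNT, "done")
    return None


def reachable(initial, steps):
    """Collect every state reachable by applying enabled steps in any order."""
    seen = {initial}
    frontier = [initial]
    while frontier:
        state = frontier.pop()
        for step in steps:
            nxt = step(state)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def violations(states):
    """States a concurrent reader could observe with the invariant broken."""
    return sorted(s for s in states if s[0] + s[1] != TOTAL)


initial = (60, 40, "start")
atomic_bad = violations(reachable(initial, [atomic_transfer]))
split_bad = violations(reachable(initial, [debit, credit]))
print("atomic:", atomic_bad)
print("split: ", split_bad)
```

The atomic version never exposes a state that breaks the invariant, while the split version exposes an intermediate state in which the money has left one account but not yet arrived in the other. A TLA+ specification allows the same question to be asked of a realistic design, where the intermediate states and interleavings are far too numerous to enumerate by hand.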