5.2.1 Ease of Use
Once the system is online and the data has been loaded, the programmer begins to write the query or the code needed to perform their task. Like other kinds of programming, this is often an
iterative process: the programmer writes a little bit of code, tests it,
and then writes some more. In both types of systems, the programmer can easily determine
whether their code is syntactically correct: the MR framework can check whether the user’s code compiles, and the SQL engines can determine whether the queries parse
correctly. Both systems also provide runtime support to assist users
in debugging their programs.
It is also worth considering the way in which the programmer
writes the query. MR programs in Hadoop are primarily written in
Java (though other language bindings exist). Most programmers are
more familiar with object-oriented, imperative programming than
with other language technologies, such as SQL. That said, SQL
is taught in many undergraduate programs and is fairly portable –
we were able to share the SQL commands between DBMS-X and
Vertica with only minor modifications.
In general, we found that getting an MR program up and running
with Hadoop took less effort than with the other systems. We did
not need to construct a schema or register user-defined functions in
order to begin processing the data. However, after obtaining our
initial results, we expanded the number of benchmark tasks, causing us to add new columns to our data set. In order to process
this new data, we had to modify our existing MR code and retest
each MR program to ensure that it worked with the new assumptions about the data’s schema. Furthermore, some API methods in
Hadoop were deprecated after we upgraded to newer versions of
the system, which again required us to rewrite portions of our programs. In contrast, once we had built our initial SQL-based applications, we did not have to modify the code despite several changes
to our benchmark schema.
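The contrast can be illustrated with a minimal sketch. It is not the benchmark code: the table name, column names, sample rows, and the added countryCode column are all hypothetical, Python stands in for the Java MR code, and SQLite stands in for the parallel DBMSs. The first function mimics map-side parsing in which the schema exists only as a field index inside the program; the second issues a SQL query that names the column it needs.

```python
import sqlite3

# Hypothetical schema and rows, before and after a column is added.
OLD_COLUMNS = ["pageURL TEXT", "pageRank INTEGER", "avgDuration INTEGER"]
OLD_ROWS = [("example.com/a", 10, 120), ("example.com/b", 3, 45)]

NEW_COLUMNS = ["pageURL TEXT", "countryCode TEXT",
               "pageRank INTEGER", "avgDuration INTEGER"]
NEW_ROWS = [("example.com/a", "US", 10, 120), ("example.com/b", "DE", 3, 45)]


def to_lines(rows):
    """Serialize rows as pipe-delimited text, as a flat input file would hold them."""
    return ["|".join(str(value) for value in row) for row in rows]


def total_rank_positional(lines):
    """MR-style parsing: the schema is implicit in the field index."""
    total = 0
    for line in lines:
        fields = line.split("|")
        total += int(fields[1])  # assumes pageRank is always the second field
    return total


def total_rank_sql(columns, rows):
    """SQL names the column, so its position in the table does not matter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Rankings (" + ", ".join(columns) + ")")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany("INSERT INTO Rankings VALUES (" + placeholders + ")", rows)
    total = conn.execute("SELECT SUM(pageRank) FROM Rankings").fetchone()[0]
    conn.close()
    return total


def run_demo():
    """Run both approaches against the old and the new layout."""
    results = {
        "positional_old": total_rank_positional(to_lines(OLD_ROWS)),
        "sql_old": total_rank_sql(OLD_COLUMNS, OLD_ROWS),
        "sql_new": total_rank_sql(NEW_COLUMNS, NEW_ROWS),
    }
    try:
        results["positional_new"] = total_rank_positional(to_lines(NEW_ROWS))
    except ValueError:
        # The second field now holds a country code, so int() rejects it.
        results["positional_new"] = "failed"
    return results
```

In this sketch the positional parser fails outright on the new layout because the second field is no longer numeric. Had the new column been numeric, the same code would have run and silently returned a wrong total, which is the harder case to catch when retesting. The SQL query is unchanged across both layouts.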
We argue that although it may be easier for developers to get
started with MR, maintenance of MR programs is likely to lead to
significant pain for application developers over time. As we also
argued in Section 3.1, reusing MR code between two deployments
or on two different data sets is difficult, as there is no explicit representation of the schema for data used in the MR model.