Proceedings of the Joint Workshop on NLP&LOD and SWAIE, pages 3–7,
Hissar, Bulgaria, 12 September 2013.
Evaluation of SPARQL query generation from natural language questions
K. Bretonnel Cohen
Computational Bioscience Program
U. Colorado School of Medicine
Jin-Dong Kim
Database Center for Life Science
Abstract
SPARQL queries have become the standard for querying linked open data (LOD) knowledge bases, but SPARQL query construction can be challenging and time-consuming even for experts. SPARQL
query generation from natural language
questions is an attractive modality for interfacing with LOD. However, how to
evaluate SPARQL query generation from
natural language questions is a mostly
open research question. This paper
presents some issues that arise in SPARQL
query generation from natural language, a
test suite for evaluating performance with
respect to these issues, and a case study
in evaluating a system for SPARQL query
generation from natural language questions.
1 Introduction
The SPARQL query language is the standard for
retrieving linked open data from triple stores.
SPARQL is powerful, flexible, and allows the use
of RDF, with all of its advantages over traditional databases. However, SPARQL query construction has been described as “absurdly difficult” (McCarthy et al., 2012), and even experienced users may struggle with it. For this reason, various methods have been suggested for aiding in SPARQL query generation, including assisted query construction (McCarthy et al., 2012)
and, most germane to this work, converting natural language questions into SPARQL queries.
Although a body of work on SPARQL query
generation from natural language questions has
been growing, no consensus has yet developed
about how to evaluate such systems. (Abacha
and Zweigenbaum, 2012) evaluated their system
by manual inspection of the SPARQL queries that
they generated. No gold standard was prepared—
the authors examined each query and determined
whether or not it accurately represented the original natural language question. (Yahya et al.,
2012) used two human judges to manually examine the output of their system at three points—
disambiguation, SPARQL query construction, and
the answers returned. If the judges disagreed, a
third judge examined the output. (McCarthy et
al., 2012) does not have a formal evaluation, but
rather gives two examples of the output of the
SPARQL Assist system. (This is not a system
for query generation from natural language questions per se, but rather an application for assisting
in query construction through methods like autocompletion suggestions.) (Unger et al., 2012) is
evaluated on the basis of a gold standard of answers from a static data set. It is not clear how
(Lopez et al., 2007) is evaluated, although they
give a nice classification of error types. Across this body of work, the prevailing trend is that either systems
are not formally evaluated, or they are evaluated
in a functional, black-box fashion, examining the
mapping between inputs and one of two types of
outputs—either the SPARQL queries themselves,
or the answers returned by the SPARQL queries.
The significance of the work reported here is that
it attempts to develop a unified methodology for
evaluating systems for SPARQL query generation
from natural language questions that meets a variety of desiderata for such a methodology and that
is generalizable to other systems besides our own.
In the development of our system for SPARQL
query generation from natural language questions,
it became clear that we needed a robust approach
to system evaluation. The approach needed to
meet a number of desiderata:
• Automatability: It should be possible to automate tests so that they can be run automatically many times during the day and so that
there is no opportunity for humans to miss errors when doing manual examination.
• Granularity: The approach should allow
for granular evaluation of behavior—that is,
rather than (or in addition to) just returning a
single metric that characterizes performance
over an entire data set, such as accuracy, it
should allow for evaluation of functionality
over specific types of inputs.
• Modularity: The approach should allow for
evaluating individual modules of the system
independently.
• Functionality: The approach should allow
functional, black-box evaluation of the end-to-end performance of the system as a whole.
The hypothesis being explored in the work reported here is that it is possible to conduct a
principled fine-grained evaluation of software for
SPARQL query generation from natural language
questions that is effective in uncovering weaknesses in the software.
As in any software testing situation, various
methods of evaluating the software exist. A typical black-box approach would be to establish a
gold standard of the SPARQL queries themselves,
and/or of the answers that should be returned in response to a natural language question. However,
we ruled out applying the black-box approach to
the SPARQL queries themselves because there are
multiple correct SPARQL queries that are equivalent in terms of the triples that they will return
from a linked open data source. We ruled out
a black-box approach based entirely on examining the triples returned from the query when the
SPARQL query was executed against the triple
store because the specific list of triples is subject to
change unpredictably as the contents of the triple
store are updated by the data maintainers.
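The first objection, that distinct queries can be equivalent, can be illustrated with a minimal sketch. The toy triples, the predicate names, and the tiny pattern matcher below are all assumptions made for illustration; they stand in for a real triple store and SPARQL engine and are not part of the system described in this paper.

```python
# Toy triple store; all identifiers are hypothetical placeholders.
TRIPLES = [
    ("gene:A", "type", "Gene"),
    ("gene:A", "relatedTo", "disease:heart"),
    ("gene:B", "type", "Gene"),
    ("gene:B", "relatedTo", "disease:heart"),
    ("drug:X", "type", "Drug"),
    ("drug:X", "relatedTo", "disease:heart"),
]

def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every triple pattern in the list."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable must bind consistently across all patterns.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)

def answers(patterns, variable):
    """Sorted distinct values bound to one variable."""
    return sorted({b[variable] for b in match(patterns, TRIPLES)})

# Two queries that differ in variable name and pattern order.
query_1 = [("?g", "type", "Gene"), ("?g", "relatedTo", "disease:heart")]
query_2 = [("?x", "relatedTo", "disease:heart"), ("?x", "type", "Gene")]

answers_1 = answers(query_1, "?g")
answers_2 = answers(query_2, "?x")
```

Here `query_1 != query_2`, yet both yield the same answers, so an exact-match comparison against a single gold-standard query would wrongly reject one of them.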
We opted for a gray-box approach, in which we
examine the output at multiple stages of processing. The first was at the point of mapping to TUIs.
The Unified Medical Language System’s Semantic Network contains a hierarchically grouped set
of 133 semantic types, each with a Type Unique
Identifier (TUI). That is, for any given natural language question that should cause a mapping to
a TUI, we examined whether a TUI was generated by
the system and, if so, whether it was the correct TUI.
The second was the point of SPARQL query generation, where we focused on syntactic validity,
rather than the entire SPARQL query (for the reason given above). We also examined the output
of the SPARQL query, but not in terms of exact
match to a gold standard. In practice, the queries
would typically return a long list of triples, and
the specific list of triples is subject to change unpredictably as the contents of the triple store are
updated by the OMIM maintainers. For that reason, we have focused on ensuring that we know
one correct triple which should occur in the output, and validating the presence of that triple in
the output. We have also inspected the output for
triples that we knew from domain expertise should
not be returned, although we have done that manually so far and have not formalized it in the test
suite.
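The output check described above can be sketched roughly as follows. This is an illustrative assumption about how such a check might look, not the authors' test suite: the function name, the triple identifiers, and the sample output are hypothetical. The exclusion check mirrors the inspection that the text describes as manual and not yet formalized.

```python
def check_output(returned, must_include, must_exclude=()):
    """Report expected triples that are absent and forbidden triples that are present."""
    returned = set(returned)
    return {
        "missing": [t for t in must_include if t not in returned],
        "unexpected": [t for t in must_exclude if t in returned],
    }

# Hypothetical query output; identifiers are placeholders, not real OMIM data.
returned = [
    ("gene:G1", "relatedTo", "disease:D1"),
    ("gene:G2", "relatedTo", "disease:D1"),
]
known_correct = [("gene:G1", "relatedTo", "disease:D1")]
known_wrong = [("gene:G9", "relatedTo", "disease:D1")]

report = check_output(returned, known_correct, known_wrong)
passed = not report["missing"] and not report["unexpected"]
```

Because the check asks only for the presence of one known triple, it keeps passing when the maintainers add further triples to the store.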
In this paper, we focus on one specific aspect
of the gray-box evaluation: the mapping to TUIs.
As will be seen, mapping to TUIs when appropriate, and of course to the correct TUI, is an important feature of answering domain-specific questions. As we developed our system beyond the
initial prototype, it quickly became apparent that
we needed to differentiate between elements of the question that referred to specific entities in the triple store, and elements of the question that referred to general semantic categories.
For example, for queries like What genes are related to heart disease?, we noticed that heart disease was being mapped to the correct entity in the
triple store, but genes, rather than being treated as
a general category, was also being mapped (erroneously) to a particular instance in the triple store.
Given the predicates in the triple store, the best solution was to recognize general categories in questions and map them to TUIs, so we developed a method for doing so. Testing this
functionality is the main topic of this paper.
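A test of this kind might be organized as in the sketch below, which groups cases by input type so that results are reported per type and not as a single accuracy figure. Everything here is an assumption for illustration: `map_chunk` is a stand-in for the system under test, the lookup table and plural handling are deliberately crude, and the choice of T028 (Gene or Genome) and T047 (Disease or Syndrome) as target TUIs is ours, not a statement about the system's actual mappings.

```python
from collections import defaultdict

# Hypothetical table of general-category words and their TUIs.
CATEGORY_TO_TUI = {"gene": "T028", "disease": "T047"}

def map_chunk(chunk):
    """Stand-in for the system under test: category -> TUI, anything else -> entity."""
    head = chunk.lower().rstrip("s")  # crude plural stripping, for illustration only
    if head in CATEGORY_TO_TUI:
        return ("tui", CATEGORY_TO_TUI[head])
    return ("entity", chunk)

# (input type, noun chunk, expected mapping)
CASES = [
    ("plural category", "genes", ("tui", "T028")),
    ("plural category", "diseases", ("tui", "T047")),
    ("singular category", "gene", ("tui", "T028")),
    ("specific entity", "heart disease", ("entity", "heart disease")),
]

def run_cases(cases, mapper):
    """Return {input type: [passed, total]} so failures can be localized."""
    tally = defaultdict(lambda: [0, 0])
    for kind, chunk, expected in cases:
        tally[kind][1] += 1
        if mapper(chunk) == expected:
            tally[kind][0] += 1
    return dict(tally)

results = run_cases(CASES, map_chunk)
```

A failure in one group, for example plural categories, then points to a specific weakness without being averaged away across the whole data set.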
2 Materials and methods
2.1 Online Mendelian Inheritance In Man
In this work we focused on a single linked open
data source, known as Online Mendelian Inheritance in Man (OMIM) (Amberger et al., 2011).
The most obvious application of OMIM, and the
one that biomedical researchers are most accustomed to using it for, is queries about genes and
diseases, but this is a much richer resource that is
probably not often exploited to the full extent that
it could be; in fact, the web-based interface offers
no options at all for exploiting it beyond querying
for genes and diseases.
The knowledge model goes far beyond this. It
includes linkages between at least 12 semantic
types, listed below in the Results section. OMIM
makes use of TUIs in typing the participants in
many of the triples that it encodes. In particular,
each of the linkages described above is actually a
pair of TUIs.
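One way to picture a linkage as a pair of TUIs is the small data structure below. It is a sketch under assumptions: the `Linkage` type, the single gene-disease entry, and the order-insensitive lookup are hypothetical choices for illustration, and the actual inventory is whatever the data source encodes.

```python
from typing import NamedTuple

class Linkage(NamedTuple):
    """A typed relation, identified by the TUIs of its two participants."""
    subject_tui: str
    object_tui: str

GENE_OR_GENOME = "T028"
DISEASE_OR_SYNDROME = "T047"

# Hypothetical linkage inventory with a single illustrative entry.
KNOWN_LINKAGES = {Linkage(GENE_OR_GENOME, DISEASE_OR_SYNDROME)}

def is_known_linkage(tui_a, tui_b):
    """True if the two TUIs form a known linkage in either order."""
    return (
        Linkage(tui_a, tui_b) in KNOWN_LINKAGES
        or Linkage(tui_b, tui_a) in KNOWN_LINKAGES
    )
```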
2.2 LODQA
To understand the evaluation methodology that we
developed, it is helpful to understand the system
under test. A prototype version of the system that
differed from the current system primarily in terms
of not performing TUI identification and of using
a default relation for all predicates is described in
some detail in (Kim and Cohen, 2013). We briefly
describe the current version of the system here.
2.2.1 Architecture
In order both to understand what features of our
system need to be tested and to understand how
well the testing approach will generalize to evaluating other systems for SPARQL query generation from natural language questions, it is helpful
to understand, in general terms, the architecture of
the system that we are testing. The primary modules of the system are as follows:
• A dependency parser for determining semantic relations in the question.
• A base noun chunker for finding terms that
need to be mapped to entities or TUIs in the
linked open data set.
• A system for matching base noun chunks to
entities or TUIs in the linked open data set.
• A module for pseudo-SPARQL generation.
• A module for gene