A few years ago we started investigating the collection of
HTML tables on the Web [4], a vast resource that also inspired
several other research efforts, e.g., [10, 3, 13]. Our
goal was twofold. First, we wanted to characterize the size
and quality of this untapped source of structured data. Second,
we wanted to create services that would expose this
content to Google users.
In the past few years, we have been tackling the main
challenges concerning this collection: (1) extracting a high-quality
corpus of HTML data and (2) recovering signals that
provide semantic clues about the content of these tables.
Based on our high-quality corpus, we demonstrated that
structured data from WebTables is relevant to a broad set
of services. First, we created a search engine for structured