Figure 1 shows a conceptual overview of our approach to a content aggregation engine for the web. We aim to enable users to harvest information from any web page, regardless of how that information is presented. For this reason, we propose an architecture that abstracts over specific data formats and consolidates all incoming data before it is processed by further components of the aggregation engine. To this end, each extracted data item is transformed into a hierarchical data structure, similar to a JSON document, and then merged. This abstraction allows us to extend the system with custom extractors for new data formats and to add support for future semantic markup languages without having to adapt the data integration pipeline. Extractors are software modules that process a web page or other web data source whenever it matches a format they support, and they may use any available programming technique to extract information. A number of built-in extractors are described in Sect. 5. In addition, the extraction engine should not only process numerous, diverse file formats and semantic markup specifications, but also allow user-generated content to be fed directly into the system. Manual data entry allows end-users to create new data items to capture information that may not yet be available from existing websites.
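To make this abstraction concrete, the following sketch (in Python, which the text does not prescribe) illustrates one possible shape of the extractor interface and the consolidation step: extractors emit JSON-like items, which are recursively merged into a single hierarchical document. The names `Extractor`, `merge_items`, and `aggregate`, as well as the merge semantics, are hypothetical choices for illustration, not the system's actual pipeline.

```python
from abc import ABC, abstractmethod
from typing import Any

class Extractor(ABC):
    """Hypothetical extractor interface: each extractor inspects a page
    and, if it recognizes the format, emits hierarchical data items."""

    @abstractmethod
    def applies_to(self, page: str) -> bool:
        """Return True if this extractor can handle the given page source."""

    @abstractmethod
    def extract(self, page: str) -> list[dict[str, Any]]:
        """Extract zero or more JSON-like data items from the page."""

def merge_items(a: dict[str, Any], b: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge two hierarchical items; nested objects are merged,
    lists are concatenated, and other values from `b` override `a`."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_items(merged[key], value)
        elif key in merged and isinstance(merged[key], list) and isinstance(value, list):
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged

def aggregate(page: str, extractors: list[Extractor]) -> dict[str, Any]:
    """Run every applicable extractor on the page and consolidate the
    results into one hierarchical document, independent of source format."""
    result: dict[str, Any] = {}
    for extractor in extractors:
        if extractor.applies_to(page):
            for item in extractor.extract(page):
                result = merge_items(result, item)
    return result
```

Under this sketch, a manual-entry facility would simply inject hand-authored items into the same merge step, so user-generated content flows through the pipeline exactly like items produced by format-specific extractors.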