BDBComp: Building a Digital Library for the Brazilian Computer Science Community*
Alberto H. F. Laender¹, Marcos André Gonçalves², Pablo A. Roberto¹
¹ Department of Computer Science, Federal University of Minas Gerais, 31270-901 - Belo Horizonte - MG, Brazil, {laender,pabloa}@dcc.ufmg.br
² Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA, mgoncalv@vt.edu
ABSTRACT
This paper reports initial efforts towards building BDBComp, a digital library for the Brazilian computer science community. BDBComp is based on a number of standards (e.g., OAI, Dublin Core, SQL) as well as on new technologies (e.g., Web data extraction tools), which allowed fast and easy prototyping. The paper focuses on architectural issues, on specific challenges faced during the construction of this digital library, and on the solutions proposed for them.
Categories and Subject Descriptors
H.3.7 [Information Systems]: Information Storage and Retrieval – Digital Libraries
General Terms
Design, Economics.
Keywords
Computing Digital Libraries, OAI, DL Modeling, National DLs.
1. INTRODUCTION
The last two decades have witnessed the consolidation of the Brazilian computer science (CS) community as the largest and most active in Latin America. According to a recent census conducted by the Ministry of Education (www.inep.gov.br), the number of undergraduate programs in computer science and computer engineering in Brazil has grown from fewer than 20 in the early 1980s to more than 360 in 2002. The number of graduate programs has grown at approximately the same rate, and today there are 29 programs in the country, counting only those whose main core is computer science. As a result, there has been a considerable increase in the number of theses and dissertations concluded in these programs, as well as in the number of papers published in international conference proceedings and journals. In addition, the Brazilian Computing Society - SBC (www.sbc.org.br) promotes and organizes about 30 events every year, most of which have official proceedings that collect a substantial part of the community’s scientific production.
Therefore, there is a strong need for mechanisms for archiving, preserving, indexing, and disseminating the wealth of scientific knowledge produced by the Brazilian CS community. This paper reports initial efforts towards this goal by describing the design and building of the Brazilian Digital Library of Computing - BDBComp (www.lbd.dcc.ufmg.br/bdbcomp/). Our focus is on architectural issues, on specific challenges faced during the construction of this DL, and on the solutions proposed for them. BDBComp has been designed to be OAI compliant and adopts Dublin Core (DC) as its metadata standard.
2. THE BDBCOMP ARCHITECTURE
The BDBComp architecture comprises three major layers (Figure 1). The user interfaces serve as a “glue” that binds all provided services together. These are diverse interfaces specially tailored to the needs of different communities of users, among them general users (e.g., educators, apprentices, researchers), contributors, and administrators. The services we expect to provide for general users are those usually available in any DL, such as searching and browsing, as well as more advanced ones such as filtering, recommendation, and automatic linking. Currently, BDBComp provides only searching, browsing, and limited linking facilities, similar to the services provided by DBLP [4].
Figure 1: The BDBComp Architecture
In addition to general-purpose services, BDBComp will provide a self-archiving service that allows contributors to submit metadata to the main repository, including facilities to import such data in batch mode for complete conference proceedings and books. Reviewers will play an important role in this service, since they will be responsible for approving the metadata submissions. There will also be special-purpose administration services. Finally, at the bottom level is the main repository, which stores the metadata describing available resources. In addition to the self-archiving service, we envisage two other ways to collect metadata for the repository: (1) extracting them from existing Web sites, for instance with tools such as the Web-DL environment [1], and (2) harvesting other OAI-compliant repositories. The former deals with the large number of legacy data sources (e.g., conference and institutional Web sites) already on the Web, while the latter supplements the BDBComp information, for example, with data on works by Brazilian authors published in international conferences and journals (e.g., harvested from DLs such as CITIDEL (www.citidel.org)).
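As a rough illustration of the harvesting route, the sketch below builds an OAI-PMH ListRecords request for Dublin Core records. The verb and the metadataPrefix, set, and from arguments are part of OAI-PMH; the base URL, the set name, and the helper list_records_url are hypothetical and are not taken from BDBComp. No network request is made.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # selective harvesting by set
    if from_date:
        params["from"] = from_date    # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address and set name, for illustration only.
url = list_records_url(
    "http://repository.example.org/oai", set_spec="sbbd", from_date="2004-01-01"
)
print(url)
```

A harvester would fetch this URL, parse the returned XML, and repeat the request with the resumptionToken argument while the repository keeps supplying one.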
*This work is partially supported by the I3DL Project (MCT/CNPq/ProTeM-CC grant 680154/01-9).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. JCDL’04, June 7–11, 2004, Tucson, Arizona, USA. Copyright 2004 ACM 1-58113-832-6/04/0006…$5.00.
[Figure 1 diagram labels. Interfaces: Users, Contributors, Reviewers, Administrators. Services: Searching, Browsing, Filtering, Linking, Self-Archiving, Administration. Repositories: BDBComp (metadata), Other Repositories (e.g., CITIDEL, DBLP), Web Sites. Connector labels: OAI Protocol (twice), Web-DL.]
23 Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries (JCDL’04) 1-58113-832-6/04 $ 20.00 © 2004 ACM
3. THE BDBCOMP REPOSITORY
The BDBComp main repository is a relational database implemented in MySQL according to the ER schema depicted in Figure 2. This schema captures the idea that a work (e.g., a paper) might belong to a specific set (e.g., a conference series or a subject). The entity type work has a key work_id, a composite attribute header, and a set of multivalued attributes (represented by double ellipses), which correspond to the works' metadata stored in the repository (e.g., title, author names, and publication date). These multivalued attributes are meant to comply with the DC metadata standard (dublincore.org), whose 15 fields have no occurrence constraints, i.e., any DC field might occur many times. The attribute header is used together with the entity type set to support the OAI Protocol for Metadata Harvesting (OAI-PMH) (www.openarchives.org). The relational database structure follows typical ER/relational mappings. Thus, in addition to the two tables required to represent the entity types work and set, there is a separate table for each multivalued attribute.
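The mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual BDBComp schema: it uses Python's sqlite3 module instead of MySQL so that it runs on its own, it shows only two of the 15 DC fields, and every table and column name other than work_id is hypothetical. It also assumes that each work belongs to at most one set and that header holds an OAI identifier and a datestamp.

```python
import sqlite3

# "set" is an SQL keyword, so the set table is called oai_set here (assumed name).
SCHEMA = """
CREATE TABLE oai_set (
    set_spec TEXT PRIMARY KEY,
    set_name TEXT NOT NULL
);
CREATE TABLE work (
    work_id           INTEGER PRIMARY KEY,
    header_identifier TEXT NOT NULL UNIQUE,   -- composite attribute "header"
    header_datestamp  TEXT NOT NULL,
    set_spec          TEXT REFERENCES oai_set(set_spec)
);
-- One table per multivalued DC attribute; a work may have many rows in each.
CREATE TABLE dc_title   (work_id INTEGER NOT NULL REFERENCES work(work_id), value TEXT NOT NULL);
CREATE TABLE dc_creator (work_id INTEGER NOT NULL REFERENCES work(work_id), value TEXT NOT NULL);
"""

DC_FIELDS = ("title", "creator")  # only two of the 15 DC fields are sketched


def dc_values(conn, field, work_id):
    """Return every value of one DC field for a work, in insertion order."""
    if field not in DC_FIELDS:
        raise ValueError(f"unknown DC field: {field}")
    rows = conn.execute(
        f"SELECT value FROM dc_{field} WHERE work_id = ? ORDER BY rowid", (work_id,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO oai_set VALUES ('jcdl', 'Example conference series')")
conn.execute("INSERT INTO work VALUES (1, 'oai:example:1', '2004-06-07', 'jcdl')")
conn.execute(
    "INSERT INTO dc_title VALUES (1, 'BDBComp: Building a Digital Library "
    "for the Brazilian Computer Science Community')"
)
conn.executemany(
    "INSERT INTO dc_creator VALUES (1, ?)",
    [("Alberto H. F. Laender",), ("Marcos André Gonçalves",), ("Pablo A. Roberto",)],
)
print(dc_values(conn, "creator", 1))
```

Keeping each DC field in its own table lets a field repeat any number of times without changing the work table, at the cost of one query or join per field when a full record is assembled.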
Figure 2: The Repository ER Schema
4. MAJOR CHALLENGES
Legacy Web Data and Construction of the Seed Collection. To achieve its goal as the main source of information about the scientific production of the Brazilian CS community, BDBComp relies strongly on its self-archiving service. However, before making this service available, it was necessary to collect some data to construct its “seed” collection. To start with, we decided to collect as much data as possible from previous major SBC events. One problem was that most of the legacy data about CS events was available only in static form on Web sites. To solve this problem, and based on our previous experience with the Web-DL environment [1], we generated wrappers for extracting data (paper titles, author names, events’ venues and dates, etc.) from more than 60 of these Web sites in order to produce the DC records required to create an OAI source file. DC records have also been generated from tables of contents provided (in textual format) by external contributors. In addition, we collected DC records from the SIBGRAPI Digital Library Archive (iris.sid.inpe.br:1906), a DL that has archived the full-text papers presented at the Brazilian Symposium on Computer Graphics and Image Processing - SIBGRAPI since 1996. Table 1 summarizes the seed data collected for BDBComp. As we can see, of the 2638 records currently available in BDBComp, 748 include the work abstract and 675 a link to the work full text.
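To make the last step of this pipeline concrete, the sketch below serializes one set of extracted fields as an oai_dc record. It is an illustration only and does not reproduce the Web-DL wrappers: the input dictionary stands in for a wrapper's output, and the function name to_oai_dc is hypothetical. The two namespace URIs are the standard ones for oai_dc and for the DC element set.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def to_oai_dc(record):
    """Serialize a dict {DC field name: list of values} as an oai_dc XML string."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for field, values in record.items():
        for value in values:  # every DC field may repeat
            ET.SubElement(root, f"{{{DC}}}{field}").text = value
    return ET.tostring(root, encoding="unicode")


# Stand-in for the fields a wrapper might extract from a proceedings page.
extracted = {
    "title": ["BDBComp: Building a Digital Library for the Brazilian Computer Science Community"],
    "creator": ["Alberto H. F. Laender", "Marcos André Gonçalves", "Pablo A. Roberto"],
    "date": ["2004"],
}
xml_text = to_oai_dc(extracted)
print(xml_text)
```

A complete OAI source file would wrap each such element in a record that also carries the OAI header (identifier, datestamp, and set membership).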
Information Integration. Since BDBComp is consolidating the Brazilian CS literature by including heterogeneous resources from a number of different archives, one important problem is how to identify similar objects or objects that can be integrated (e.g., papers derived from a specific dissertation). This is important for providing homogeneous services and for solving problems such as duplicate removal (deduping). Preliminary experiments with similarity functions that identify such objects based on complex structural information (e.g., as expressed by XML documents) combined with standard IR measures have shown good results [2]. A sub-problem that arises here is how to identify variants of author names [4]. This is particularly important because most of the data collected for BDBComp so far has come from Web sites where name normalization was not an issue. As a result, searching or browsing by author name often returns multiple entries for the same author.
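The name-variant problem can be illustrated with a deliberately crude heuristic. The sketch below is not the similarity method of [2]; it is an assumed stand-in that groups names sharing a surname and first initial after stripping accents and punctuation. All function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalize(name):
    """Lowercase, strip accents and periods, and return the name as tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    stripped = stripped.lower()
    if "," in stripped:  # "Surname, Given" -> "given surname"
        surname, given = stripped.split(",", 1)
        stripped = f"{given} {surname}"
    return stripped.replace(".", " ").split()


def blocking_key(name):
    """Crude key: (surname, first initial). Distinct authors can share a key."""
    tokens = normalize(name)
    if not tokens:
        raise ValueError("empty author name")
    return (tokens[-1], tokens[0][0])


def group_variants(names):
    """Group author-name strings that share the same blocking key."""
    groups = defaultdict(list)
    for name in names:
        groups[blocking_key(name)].append(name)
    return dict(groups)


groups = group_variants([
    "Alberto H. F. Laender",
    "A. Laender",
    "Laender, Alberto",
    "Marcos André Gonçalves",
    "M. A. Goncalves",
])
for key, variants in groups.items():
    print(key, variants)
```

A key like this over-merges authors who share a surname and initial, so in practice it would serve only to produce candidate groups for a finer similarity function or for manual review.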
                      Records Collected
Sources         Total   With Abstract   With FT Link
Web Sites        2227             464            353
Other DLs         359             292            294
Contributors       52              28             28
Total            2638             748            675
Table 1: Seed Data Collected for BDBComp
Involvement of the CS community. Key to the success and sustainability of any DL is the involvement of the target community in its use and maintenance. The BDBComp team is working closely with SBC to galvanize its community around the project. Among the responsibilities of SBC are the adoption of policies to require the submission of full texts, at least for papers presented at its major events, the archiving and preservation of such collections, and the selection of specialists from i