Modern document collections, such as e-mail, newsgroups, and Web pages, can contain groups of documents with largely overlapping content. On the Web, for example, studies have shown that up to 45% of pages are duplicates: pages with (nearly) identical content that are replicated on many different sites [6, 8, 22]. In e-mail collections, documents with significant amounts of overlapping content arise naturally as people reply to (or forward) messages while keeping the original content intact. E-mail exchanges often contain long chains, or threads, of replies to replies, causing early messages in the thread to be replicated over and over. Similar threading patterns are also common in newsgroup discussions.

Information Retrieval (IR) systems typically use an inverted text index to evaluate free-text queries. During indexing, most IR systems process each document separately, so overlapping content is indexed multiple times. This, in turn, leads to larger indexes that take longer to build and longer to query. In this paper, we describe a scheme where overlapping content is indexed just once