The big data trend in natural language processing (NLP) is well expressed in the concluding remarks of
the Google research team (p. 12 in1), which can be summarized in six words: more words and less linguistic
annotation! However, publicly available large-scale n-gram systems remain the privilege of only 11 Indo-European
languages2,3, Chinese4 and Japanese5. In all cases the WaC (Web as Corpus) approach to big data
collection was applied. The WaC trend was also followed by South Slavic computational linguists, who have
recently created corpora for the Croatian and Slovene languages6. In this specific case one must allow for the
closeness of the South Slavic languages. The amount of text written in neighboring languages (those derived from
the former Serbo-Croatian are especially close to each other) within a preselected set of HTML documents
is not negligible, and there is no simple and effective way to filter it out in order to create a “clean” web corpus
for the desired South Slavic language (the standard language identification procedure based on word filters does not
help). As far as we know, the Croatian WaC is still being cleaned, three years after its creation.