however, similarly to word lists in RapidMiner, their informational value relies on eliminating irrelevant tokens from the input.
Figure 11. Identifying skills in demand for Developer (www.tagcrowd.com)
4.4. Data Similarity
RapidMiner supplies processes for computing distances between records in a dataset. By comparing textual attributes and constructing a similarity ranking, these tools identify rows that potentially belong to the same category and can be grouped together for further analysis. With regard to vacancies, this is another effective method of detecting mislabelled jobs or categorising posts with ambiguous JobTitles. Similarity evaluation can be time-consuming, since it calculates and outputs measures for all pairs of records in a dataset. It is, however, particularly helpful in evaluating confusing data. In our case, this technique can be applied to understand the resemblance between new occupations and those already well established. For example, we can answer the questions:
Who is a Data Modeller? Where can we position a Data Modeller in the formal occupational framework?
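The pairwise comparison described above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions, not the RapidMiner process used in this study: it uses cosine similarity over raw term counts as a stand-in measure, and the record identifiers and job descriptions are hypothetical examples, not crawled data. It also shows why the evaluation is costly: n records produce n(n-1)/2 pairs.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Lowercase the text and split on any non-alphanumeric character.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counter objects).
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_pairs(records):
    # Score every pair of records and sort from most to least similar.
    vectors = {rid: Counter(tokenize(text)) for rid, text in records.items()}
    scored = [
        (cosine(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(scored, reverse=True)


# Hypothetical job descriptions, invented for illustration only.
records = {
    "data-modeller": "design data models and database schemas using sql",
    "database-developer": "develop database schemas and sql queries",
    "web-designer": "create page layouts with html and css",
}

ranking = rank_pairs(records)
for score, first, second in ranking:
    print(f"{first} ~ {second}: {score:.3f}")
```

In this toy example the Data Modeller description ranks closest to the database developer one, which is the kind of evidence that could help position a new occupation next to an established one. Removing stop words such as "and" before counting would make the scores more meaningful.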
With similarity ranking we can easily identify the closest and furthest records in our dataset, and assess the similarity between any points of interest. Figure 12 illustrates a sample outcome of similarity ranking for our crawled data. In this instance, we visualise the content of job descriptions for three vacancies: NET-Application-Suport-7441581 (middle), its closest neighbour Applicatio