There are three typical objectives for evaluation: a) to demonstrate a performance advantage over existing
traditional or competing approaches; b) to understand the performance sensitivity of a system by evaluating
different configurations of that system; or c) to assess the usability and user experience of a system. Standard
relevance metrics are used to fulfil the first objective when evaluating search systems.
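For illustration, the following is a minimal sketch of two such standard relevance metrics, precision@k and nDCG@k, computed over a ranked result list and a set of graded relevance judgments; the document identifiers, relevance grades, and function names are hypothetical and not taken from any particular benchmark.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def ndcg_at_k(ranked_ids, graded_relevance, k):
    """Normalised discounted cumulative gain over the top-k results.

    graded_relevance maps a document id to its judged relevance grade
    (0 = not relevant); unjudged documents default to 0.
    """
    def dcg(gains):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [graded_relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(graded_relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance judgments and a hypothetical system ranking.
qrels = {"d1": 3, "d4": 2, "d7": 1}
ranking = ["d1", "d3", "d4", "d9", "d7"]
print(precision_at_k(ranking, set(qrels), 5))  # 0.6
print(ndcg_at_k(ranking, qrels, 5))
```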
However, the specificity of semantic search systems requires tailored benchmark datasets, i.e., a set of annotated