Evaluation methods in information retrieval are typically classified as either system-centric or user-centric.
Methods in the former category are based on or derived from precision and recall metrics (Baeza-Yates &
Ribeiro-Neto, 1999). These metrics, however, have been criticized for failing to reveal the causes of variation across retrieval results, which remain hidden behind averaged precision and recall figures
(Alemayehu, 2003). User-centric evaluations, on the other hand, seek to assess the likelihood that an IR
system will be adopted and used. A closer look at the evaluation of semantic search systems reveals a
lack of end-user involvement (e.g., Castells et al., 2007; Wang et al., 2008; Zhang et al., 2005).
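The critique of averaged figures can be made concrete with a small sketch. The function below computes per-query precision and recall, using hypothetical document IDs (not drawn from any cited study); the two example queries share the same average precision while behaving quite differently, which is exactly the kind of variation that aggregate figures conceal.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.

    retrieved: list of document IDs returned by the system
    relevant:  set of document IDs judged relevant for the query
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Two hypothetical queries with identical precision (0.5) but different recall:
p1, r1 = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d2"})  # 0.5, 1.0
p2, r2 = precision_recall(["d5", "d6"], {"d5", "d7"})              # 0.5, 0.5
```

Averaging over these queries reports precision 0.5 and recall 0.75, saying nothing about why the second query missed half of its relevant documents.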