Search engines have test collections of queries and hand-ranked results
Recall is difficult to measure on the web
Search engines often use precision at top k, e.g., k = 10
… or measures that reward you more for getting rank 1 right than for getting rank 10 right.
NDCG (Normalized Discounted Cumulative Gain)
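To make the contrast between precision at top k and a rank-sensitive measure concrete, here is a minimal sketch of both. The function names are illustrative, not taken from any library. It assumes graded relevance judgments given in ranked order, the common log2(rank + 1) discount, and an ideal ranking obtained by re-sorting the same judged list (a full evaluation would use all judged documents for the query).

```python
import math

def precision_at_k(gains, k=10):
    """Fraction of the top-k results judged relevant (gain > 0)."""
    return sum(1 for g in gains[:k] if g > 0) / k

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k=10):
    """DCG normalized by the DCG of the best possible ordering of these gains."""
    # Assumption: the ideal ranking is the same judged list sorted by gain.
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Two rankings with one relevant document each, at rank 1 versus rank 3.
good = [3, 0, 0]
bad = [0, 0, 3]
print(precision_at_k(good, k=3), precision_at_k(bad, k=3))
print(ndcg_at_k(good, k=3), ndcg_at_k(bad, k=3))
```

Precision at 3 is the same for both rankings, while NDCG is lower for the ranking that places the relevant document at rank 3.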
Search engines also use non-relevance-based measures.
Clickthrough on first result
Not very reliable if you look at a single clickthrough … but pretty reliable in the aggregate.
Studies of user behavior in the lab
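The point about clickthrough being reliable only in the aggregate can be illustrated with a small sketch. It assumes a hypothetical click log reduced to one boolean per impression (True if the user clicked the first result) and uses the standard error of a proportion as a rough reliability indicator. The function name and the log format are assumptions for illustration, not part of any real logging system.

```python
import math

def first_result_ctr(clicked_first):
    """Return (clickthrough rate, standard error) for a list of booleans."""
    n = len(clicked_first)
    if n == 0:
        raise ValueError("need at least one impression")
    rate = sum(clicked_first) / n
    # Normal-approximation standard error; it shrinks roughly as 1/sqrt(n)
    # and is not meaningful for a handful of impressions.
    stderr = math.sqrt(rate * (1 - rate) / n)
    return rate, stderr

# Same underlying rate (60%), different amounts of evidence.
small_log = [True] * 60 + [False] * 40
large_log = [True] * 6000 + [False] * 4000
print(first_result_ctr(small_log))
print(first_result_ctr(large_log))
```

Both logs give the same rate, but the uncertainty around it is about ten times smaller with 100 times as many impressions.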