Should average over large ensembles of document collections and queries
Need human relevance assessments
People aren’t reliable assessors
Assessments have to be binary
Nuanced assessments?
Heavily skewed by collection/authorship
Results may not translate from one domain to another
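
To make the first point concrete, here is a minimal sketch of averaging precision and recall over a set of queries, assuming binary relevance judgments as noted above. The function names (`precision_recall`, `macro_average`) and the toy document ids are hypothetical and chosen only for illustration. Real evaluations would use far larger query sets and human-assessed judgments.

```python
from statistics import mean


def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given binary relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty result lists or queries with no relevant documents.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def macro_average(runs):
    """Average per-query precision and recall over a query ensemble."""
    scores = [precision_recall(ret, rel) for ret, rel in runs]
    return mean(p for p, _ in scores), mean(r for _, r in scores)


# Hypothetical toy data: (retrieved doc ids, doc ids judged relevant) per query.
runs = [
    (["d1", "d2", "d3", "d4"], ["d1", "d3"]),  # 2 hits of 4 retrieved, 2 relevant
    (["d5", "d6"], ["d5", "d7", "d8", "d9"]),  # 1 hit of 2 retrieved, 4 relevant
]
avg_p, avg_r = macro_average(runs)
```

Each query contributes equally to the average here, regardless of how many relevant documents it has. With only two queries the averages say little, which is the reason for wanting large ensembles.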