Abstract
Information retrieval evaluation has to account for the varying "difficulty" of topics. Topic difficulty is often defined in terms of the aggregated effectiveness of a set of retrieval systems at satisfying the respective information need. Current approaches to estimating topic difficulty come with drawbacks, such as being incomparable across different experimental settings. We introduce a new approach to estimating topic difficulty, based on the ratio of systems that achieve an NDCG score better than a baseline formed by a random ranking of the pool of judged documents. We modify the NDCG measure to explicitly reflect a system's divergence from this hypothetical random ranker. In this way we achieve relative comparability of topic difficulty scores across experimental settings as well as stability against outlier systems, features lacking in previous difficulty estimations. We reevaluate the TREC 2012 Web Track's ad hoc task to demonstrate the feasibility of our approach in practice.
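The abstract does not spell out the estimator, but a minimal sketch of the idea it describes might look as follows. The function names, the Monte-Carlo estimate of the random-ranker baseline, and the use of graded relevance gains are assumptions for illustration, not the paper's actual formulation.

```python
import math
import random

def dcg(gains):
    """Discounted cumulative gain for a ranked list of relevance grades."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(ranked_gains, ideal_gains):
    """NDCG: DCG of the ranking normalised by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

def random_ranker_ndcg(judged_pool_gains, trials=1000, seed=0):
    """Monte-Carlo estimate of the expected NDCG of a random ranking
    of the pool of judged documents (the hypothetical baseline)."""
    rng = random.Random(seed)
    pool = list(judged_pool_gains)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(pool)
        total += ndcg(pool, judged_pool_gains)
    return total / trials

def share_above_random(system_run_gains, judged_pool_gains):
    """Fraction of systems whose NDCG exceeds the random-ranker baseline
    for one topic; a difficulty score would then be derived from this ratio."""
    baseline = random_ranker_ndcg(judged_pool_gains)
    better = sum(
        1 for gains in system_run_gains
        if ndcg(gains, judged_pool_gains) > baseline
    )
    return better / len(system_run_gains)
```

Under these assumptions, a topic on which only a few systems beat the random-ranking baseline would receive a low ratio and hence be considered hard.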