The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines

Advances in Information Retrieval, pages 12--19. Cham: Springer International Publishing, 2020.

Abstract

Current best practices for the evaluation of search engines do not take into account duplicate documents. Depending on their prevalence, not discounting duplicates during evaluation artificially inflates performance scores and penalizes those whose search systems diligently filter them. Although these negative effects were demonstrated long ago by Bernstein and Zobel [4], we find that their study has failed to move the community. In this paper, we reproduce the aforementioned study and extend it to incorporate all TREC Terabyte, Web, and Core tracks. The worst-case penalty for having filtered duplicates in any of these tracks was a loss of between 8 and 53 ranks.
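The inflation effect described above can be illustrated with a minimal sketch (not code from the paper; document IDs and qrels are hypothetical): when near-duplicate copies of one relevant document are each judged relevant, a run that keeps them scores higher on Precision@k than a run that diligently filters them.

```python
def precision_at_k(ranking, relevant, k=5):
    """Fraction of the top-k retrieved documents judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

# d1a, d1b, d1c are content-equivalent near-duplicates of one relevant
# document; the (hypothetical) qrels judge every copy relevant.
qrels = {"d1a", "d1b", "d1c", "d2"}

naive_run    = ["d1a", "d1b", "d1c", "d2", "x1"]  # keeps the duplicates
filtered_run = ["d1a", "d2", "x1", "x2", "x3"]    # one copy per document

p_naive = precision_at_k(naive_run, qrels)        # 4/5 = 0.8
p_filtered = precision_at_k(filtered_run, qrels)  # 2/5 = 0.4
```

Under these assumptions the duplicate-keeping run scores twice as high, even though it surfaces less distinct relevant content, which is exactly the penalty on filtering systems the abstract describes.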
