Inproceedings

Sampling Bias Due to Near-Duplicates in Learning to Rank

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1997–2000. New York, NY, USA: Association for Computing Machinery, 2020.
DOI: 10.1145/3397271.3401212

Abstract

Learning to rank (LTR) is the de facto standard for web search, improving upon classical retrieval models by exploiting (in)direct relevance feedback from user judgments, interaction logs, etc. We investigate for the first time the effect of a sampling bias on LTR models due to the potential presence of near-duplicate web pages in the training data, and how (in)consistent relevance feedback of duplicates influences an LTR model's decisions. To examine this bias, we construct a series of specialized LTR datasets based on the ClueWeb09 corpus with varying amounts of near-duplicates. We devise worst-case and average-case train/test splits that are evaluated on popular pointwise, pairwise, and listwise LTR models. Our experiments demonstrate that duplication causes overfitting and thus less effective models, making a strong case for the benefits of systematic deduplication before training and model evaluation.
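To make the abstract's recommendation concrete, below is a minimal sketch (not taken from the paper) of duplicate-aware data preparation for LTR. It assumes each document already carries a near-duplicate cluster id, e.g. obtained via SimHash/MinHash fingerprinting; the toy data and variable names are purely illustrative. It shows both deduplication before training and a split that keeps whole duplicate clusters on one side, so near-duplicates cannot leak between train and test.

    # Minimal sketch (not from the paper): duplicate-aware splitting for LTR data.
    # Assumes near-duplicate cluster ids are already available, e.g. from
    # SimHash/MinHash fingerprinting. All data below is synthetic and illustrative.
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    rng = np.random.default_rng(0)

    # Toy LTR data: query-document feature vectors, graded relevance labels,
    # and a near-duplicate cluster id per document (documents sharing an id
    # are near-duplicates of one another).
    X = rng.normal(size=(100, 5))
    y = rng.integers(0, 3, size=100)
    dup_cluster = rng.integers(0, 40, size=100)

    # Option 1: deduplicate before training by keeping one representative per
    # cluster, so (in)consistent feedback on duplicates cannot skew the model.
    _, keep = np.unique(dup_cluster, return_index=True)
    X_dedup, y_dedup = X[keep], y[keep]

    # Option 2: split so that no cluster spans train and test, preventing
    # near-duplicates from leaking across the evaluation boundary.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(gss.split(X, y, groups=dup_cluster))
    assert set(dup_cluster[train_idx]).isdisjoint(dup_cluster[test_idx])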
