Article in conference proceedings

Sampling Bias Due to Near-Duplicates in Learning to Rank

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1997–2000. New York, NY, USA, Association for Computing Machinery (2020)
DOI: 10.1145/3397271.3401212

Abstract

Learning to rank (LTR) is the de facto standard for web search, improving upon classical retrieval models by exploiting (in)direct relevance feedback from user judgments, interaction logs, etc. We investigate for the first time the effect of a sampling bias on LTR models due to the potential presence of near-duplicate web pages in the training data, and how (in)consistent relevance feedback of duplicates influences an LTR model's decisions. To examine this bias, we construct a series of specialized LTR datasets based on the ClueWeb09 corpus with varying amounts of near-duplicates. We devise worst-case and average-case train/test splits that are evaluated on popular pointwise, pairwise, and listwise LTR models. Our experiments demonstrate that duplication causes overfitting and thus less effective models, making a strong case for the benefits of systematic deduplication before training and model evaluation.
