Abstract
Learning to rank~(LTR) is the de facto standard for web search, improving upon classical retrieval models by exploiting (in)direct relevance feedback from user judgments, interaction logs, etc. We investigate for the first time the effect of a sampling bias on LTR~models due to the potential presence of near-duplicate web pages in the training data, and how (in)consistent relevance feedback on duplicates influences an LTR~model's decisions. To examine this bias, we construct a series of specialized LTR~datasets based on the ClueWeb09 corpus with varying amounts of near-duplicates. We devise worst-case and average-case train/test splits, on which popular pointwise, pairwise, and listwise LTR~models are evaluated. Our experiments demonstrate that duplication causes overfitting and thus less effective models, making a strong case for the benefits of systematic deduplication before training and model evaluation.
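To illustrate the kind of duplicate-aware data splitting alluded to above, the following Python sketch keeps all members of a near-duplicate group on the same side of the train/test boundary so that duplicates cannot leak between splits. The function name, the \texttt{dup\_group\_of} mapping, and the split procedure are hypothetical illustrations, not the paper's released code or its exact worst-case/average-case construction.

\begin{verbatim}
import random
from collections import defaultdict

def duplicate_aware_split(doc_ids, dup_group_of, test_fraction=0.2, seed=42):
    """Assign documents to train/test so that near-duplicates never
    straddle the boundary. `dup_group_of` maps each doc id to an
    identifier of its near-duplicate group; singletons map to themselves.
    (Hypothetical sketch, not the authors' implementation.)"""
    groups = defaultdict(list)
    for doc_id in doc_ids:
        groups[dup_group_of.get(doc_id, doc_id)].append(doc_id)

    group_keys = list(groups)
    random.Random(seed).shuffle(group_keys)

    test_target = int(len(doc_ids) * test_fraction)
    test, train = [], []
    for key in group_keys:
        # Fill the test split group by group until the target size is met.
        bucket = test if len(test) < test_target else train
        bucket.extend(groups[key])
    return train, test
\end{verbatim}

A split produced this way avoids rewarding a model for memorizing documents whose near-duplicates appear in training, which is one way the sampling bias studied here can inflate evaluation scores.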