Article in conference proceedings

Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models

C. Schröder and G. Heyer.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11987--12004. Miami, Florida, USA, Association for Computational Linguistics (November 2024)

Abstract

Active learning is an iterative labeling process that is used to obtain a small labeled subset, despite the absence of labeled data, thereby enabling the training of a model for supervised tasks such as text classification. While active learning has made considerable progress in recent years due to improvements provided by pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, although it is available in considerably larger quantities than the usually small set of labeled data. In this work, we investigate how self-training, a semi-supervised approach that uses a model to obtain pseudo-labels for unlabeled data, can be used to improve the efficiency of active learning for text classification. Building on a comprehensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of active learning or natural language processing, we introduce HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks. Our results show that it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets, using as little as 25% of the data. The code is publicly available at https://github.com/chschroeder/self-training-for-sample-efficient-active-learning.
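To make the two ideas in the abstract concrete, the following is a minimal sketch of an active learning loop combined with generic confidence-threshold self-training. It is not HAST and not the authors' implementation (see the linked repository for that). Everything in it is an assumption chosen for illustration: the toy nearest-centroid classifier on 2-D points, the least-confidence query strategy, the 0.8 pseudo-label threshold, and all function names.

```python
import math


def fit_centroids(points, labels):
    """Toy classifier (assumption): one mean vector (centroid) per class."""
    sums = {}
    for (x, y), label in zip(points, labels):
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict_proba(centroids, point):
    """Softmax over negative Euclidean distances to the class centroids."""
    scores = {label: math.exp(-math.dist(point, mu)) for label, mu in centroids.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def train(pool, assignments):
    """Fit the toy classifier on an {index: label} mapping over the pool."""
    idx = sorted(assignments)
    return fit_centroids([pool[i] for i in idx], [assignments[i] for i in idx])


def active_learning_with_self_training(pool, oracle, seed_indices,
                                       rounds=2, query_size=1, threshold=0.8):
    """Hypothetical loop: query uncertain points, then pseudo-label confident ones."""
    labeled = {i: oracle(pool[i]) for i in seed_indices}  # human-provided labels
    pseudo = {}                                           # model-provided labels
    for _ in range(rounds):
        # Train on human labels plus the pseudo-labels from the previous round.
        model = train(pool, {**pseudo, **labeled})
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        # Active learning step: ask the oracle about the least confident points.
        confidence = {i: max(predict_proba(model, pool[i]).values()) for i in unlabeled}
        for i in sorted(unlabeled, key=confidence.get)[:query_size]:
            labeled[i] = oracle(pool[i])
        # Self-training step: retrain on human labels only, then pseudo-label
        # every remaining point whose top probability reaches the threshold.
        model = train(pool, labeled)
        pseudo = {}
        for i in unlabeled:
            if i in labeled:
                continue
            label, p = max(predict_proba(model, pool[i]).items(), key=lambda kv: kv[1])
            if p >= threshold:
                pseudo[i] = label
    return train(pool, {**pseudo, **labeled}), labeled, pseudo
```

Called with a small pool of 2-D points, a labeling function standing in for the human annotator, and one seed index per class, the function returns the final model together with the human-labeled and pseudo-labeled index mappings. In this sketch the pseudo-labels are regenerated from scratch each round, so an early mistake by the model is not carried forward indefinitely.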

BibTeX key: schroder-heyer-2024-self
Entry type: inproceedings
Address: Miami, Florida, USA
Book title: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Year: 2024
Month: nov
Pages: 11987--12004
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2024.emnlp-main.669


Cite this publication

@inproceedings{schroder-heyer-2024-self,
  abstract = {Active learning is an iterative labeling process that is used to obtain a small labeled subset, despite the absence of labeled data, thereby enabling the training of a model for supervised tasks such as text classification. While active learning has made considerable progress in recent years due to improvements provided by pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, although it is available in considerably larger quantities than the usually small set of labeled data. In this work, we investigate how self-training, a semi-supervised approach that uses a model to obtain pseudo-labels for unlabeled data, can be used to improve the efficiency of active learning for text classification. Building on a comprehensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of active learning or natural language processing, we introduce HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks. Our results show that it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets, using as little as 25{\%} of the data. The code is publicly available at https://github.com/chschroeder/self-training-for-sample-efficient-active-learning.},
  added-at = {2024-12-04T10:11:34.000+0100},
  address = {Miami, Florida, USA},
  author = {Schr{\"o}der, Christopher and Heyer, Gerhard},
  biburl = {https://puma.scadsai.uni-leipzig.de/bibtex/2539d75f44e8927739c8836886f98eef4/scadsfct},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  editor = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung},
  interhash = {f30c803139b36d3201f4b8e7e947306f},
  intrahash = {539d75f44e8927739c8836886f98eef4},
  keywords = {imported xack},
  month = nov,
  pages = {11987--12004},
  publisher = {Association for Computational Linguistics},
  timestamp = {2025-07-29T10:29:54.000+0200},
  title = {Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models},
  url = {https://aclanthology.org/2024.emnlp-main.669},
  year = 2024
}
