An End-to-End Submodular Framework for Data-Efficient In-Context Learning
Lilly Kumari, Shengjie Wang, Arnav Das, Tianyi Zhou, Jeff Bilmes
Proceedings of Findings of the Association for Computational Linguistics: NAACL, 2024
Abstract
Recent advancements in natural language tasks leverage the emergent In-Context Learning (ICL) ability of pretrained Large Language Models (LLMs). ICL enables LLMs to perform new tasks by using a limited number of input-output examples as prompts. While ICL circumvents the costly step of finetuning LLMs, its effectiveness depends heavily on the quality and ordering of the provided examples (called exemplars). In this work, we propose Div-S3, a two-stage data-efficient framework for exemplar selection for ICL. The first stage focuses on data annotation and employs a pool-based active learning approach to select a set of Diverse and informative exemplars from the target task's unlabeled pool. Given a test input/query, the second stage uses Submodular Span Summarization (S3) to select the most relevant and non-redundant exemplars from the annotated pool, whose size is limited by the annotation budget. On 7 different NLP datasets and 5 LLMs of varying complexities, we show that Div-S3 outperforms (1) existing active learning-based methods for data annotation for ICL and (2) similarity-based methods for test query-specific exemplar retrieval.
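To make the two-stage idea concrete, the following is a minimal sketch of greedy submodular selection, not the objectives actually used by Div-S3 (the abstract does not specify them). It assumes a facility-location function as a stand-in for diversity and an additive query-relevance term as a stand-in for the query-aware second stage. The names `greedy_select`, `unit`, `relevance`, and `lam`, and the toy 2-D embeddings, are hypothetical and chosen only for illustration.

```python
import numpy as np

def unit(x):
    """L2-normalize the rows of a 2-D array (assumed embedding matrix)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def greedy_select(sim, budget, relevance=None, lam=1.0):
    """Greedily maximize f(S) = sum_i max_{j in S} sim[i, j] + lam * sum_{j in S} relevance[j].

    sim       : (n, n) non-negative similarity matrix (assumption: clipped cosine).
    relevance : optional (n,) non-negative scores of each item w.r.t. a query.
    The facility-location term rewards coverage of the pool (diversity); the
    modular relevance term is an illustrative stand-in for query awareness.
    """
    n = sim.shape[0]
    rel = np.zeros(n) if relevance is None else np.asarray(relevance, dtype=float)
    selected = []
    coverage = np.zeros(n)  # best similarity of each item to the selected set so far
    for _ in range(min(budget, n)):
        # Marginal gain of adding each candidate column j to the current set.
        gains = np.maximum(sim, coverage[:, None]).sum(axis=0) - coverage.sum()
        gains = gains + lam * rel
        gains[selected] = -np.inf  # never pick the same item twice
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, sim[:, j])
    return selected

# Toy pool: two tight clusters in 2-D (hypothetical data).
pool = unit(np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]))
pool_sim = np.clip(pool @ pool.T, 0.0, None)

# Stage-1-style use: pick a diverse subset of the unlabeled pool to annotate.
to_annotate = greedy_select(pool_sim, budget=2)

# Stage-2-style use: pick exemplars for one query from the annotated pool.
query = np.array([1.0, 0.0])
query_rel = np.clip(pool @ query, 0.0, None)
exemplars = greedy_select(pool_sim, budget=2, relevance=query_rel, lam=100.0)
```

Because both terms are monotone and submodular for non-negative inputs, the greedy loop carries the standard (1 - 1/e) approximation guarantee for this illustrative objective. A purely similarity-based retriever corresponds to dropping the coverage term, which can return near-duplicate exemplars.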