Concept Bottleneck Large Language Models
We introduce concept bottleneck large language models (CB-LLMs), a novel framework for building inherently interpretable large language models. In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs, allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts, significantly enhancing the safety, reliability, and trustworthiness of LLMs. These capabilities are notably absent in existing models.
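To make the concept-bottleneck idea concrete for the classification setting, the sketch below shows a minimal, generic concept-bottleneck head in PyTorch: sentence embeddings are projected onto a layer with one neuron per human-readable concept, and a final linear layer maps concept activations to class logits. This is an illustrative assumption of how such a bottleneck can be structured, not the paper's exact architecture or training procedure; the class name, dimensions, and ReLU activation are all hypothetical choices.

```python
import torch
import torch.nn as nn


class ConceptBottleneckClassifier(nn.Module):
    """Toy concept-bottleneck head for text classification (illustrative only).

    A text encoder (not shown) maps a sentence to an embedding; the concept
    layer projects it onto k concepts (one neuron per concept); a linear layer
    maps concept activations to class logits, so each prediction can be traced
    back to the concepts that drove it.
    """

    def __init__(self, embed_dim: int, num_concepts: int, num_classes: int):
        super().__init__()
        self.concept_layer = nn.Linear(embed_dim, num_concepts)  # one neuron per concept
        self.classifier = nn.Linear(num_concepts, num_classes)   # interpretable final layer

    def forward(self, embeddings: torch.Tensor):
        # Non-negative concept activations act as the model's explanation.
        concept_scores = torch.relu(self.concept_layer(embeddings))
        logits = self.classifier(concept_scores)
        return logits, concept_scores


# Usage with made-up dimensions: 768-d embeddings, 50 concepts, 4 classes.
model = ConceptBottleneckClassifier(embed_dim=768, num_concepts=50, num_classes=4)
fake_embeddings = torch.randn(2, 768)  # stand-in for sentence-encoder output
logits, concept_scores = model(fake_embeddings)
print(logits.shape, concept_scores.shape)  # torch.Size([2, 4]) torch.Size([2, 50])
```

Because the final layer operates only on concept activations, inspecting `concept_scores` together with the classifier's weights yields an explicit, human-readable rationale for each prediction, which is the property the abstract refers to as explicit and interpretable reasoning.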