Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation
December 04, 2023 · Entered Twilight · International Conference on Machine Learning
Repo contents: .gitignore, ID_estimation.py, LICENSE, README.md, data_utils.py, figures, generation_utils.py, huggingface_classifiers.py, imshow_stats.py, incremental_inference.py, inference.py, modeling_llama.py, statistic_analysis.py, text_features_figure.py, toxicity_id_eval
Authors
Randall Balestriero, Romain Cosentino, Sarath Shekkizhar
arXiv ID
2312.01648
Category
cs.AI: Artificial Intelligence
Cross-listed
cs.CL, cs.LG
Citations
6
Venue
International Conference on Machine Learning
Repository
https://github.com/RandallBalestriero/SplineLLM
★ 16
Last Checked
1 month ago
Abstract
Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations. In this work, we propose to shed light on LLMs' inner mechanisms through the lens of geometry. In particular, we develop in closed form $(i)$ the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and $(ii)$ the partition and per-region affine mappings of the feedforward (MLP) network of LLMs' layers. Our theoretical findings further enable the design of novel principled solutions applicable to state-of-the-art LLMs. First, we show that, through our geometric understanding, we can bypass LLMs' RLHF protection by controlling the embedding's intrinsic dimension through informed prompt manipulation. Second, we derive interpretable geometrical features that can be extracted from any (pre-trained) LLM, providing a rich abstract representation of their inputs. We observe that these features are sufficient to help solve toxicity detection, and even allow the identification of various types of toxicity. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM
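To make point $(ii)$ of the abstract concrete, here is a minimal sketch of what "per-region affine mapping" means for a feedforward block. It assumes a toy two-layer MLP with ReLU activations and random weights; the function names (`relu_mlp`, `region_affine_map`), the layer sizes, and the choice of ReLU are all illustrative assumptions and are not taken from the paper or its repository. Inside the region of input space that shares one activation pattern, the network reduces exactly to `A @ x + c`.

```python
import numpy as np


def relu_mlp(x, W1, b1, W2, b2):
    """Toy two-layer MLP with a ReLU nonlinearity (hypothetical stand-in)."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2


def region_affine_map(x, W1, b1, W2, b2):
    """Return (A, c, pattern) for the region of input space containing x.

    `pattern` records which hidden units are active at x. For every input
    sharing that pattern, relu_mlp(z) equals A @ z + c exactly.
    """
    pattern = (W1 @ x + b1 > 0).astype(W1.dtype)
    A = W2 @ (pattern[:, None] * W1)   # zero out rows of inactive units
    c = W2 @ (pattern * b1) + b2
    return A, c, pattern


# Random weights, assumed only for illustration.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 16, 4
W1 = rng.standard_normal((d_hidden, d_in))
b1 = rng.standard_normal(d_hidden)
W2 = rng.standard_normal((d_out, d_hidden))
b2 = rng.standard_normal(d_out)

x = rng.standard_normal(d_in)
A, c, pattern = region_affine_map(x, W1, b1, W2, b2)

# The affine map of the region reproduces the network output at x.
assert np.allclose(relu_mlp(x, W1, b1, W2, b2), A @ x + c)
```

The binary `pattern` vector identifies the region of the partition that `x` falls in. Many deployed LLMs use smoother activations than ReLU, so this sketch only illustrates the piecewise-affine idea and is not the paper's closed-form result.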
Similar Papers
In the same crypt: Artificial Intelligence
Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI
Addressing Function Approximation Error in Actor-Critic Methods
Explanation in Artificial Intelligence: Insights from the Social Sciences
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge