2025 Research Days
Binghamton Research Days Student Presentations

Can LLMs Understand Multi-Armed Bandit Tasks?

Authors: Isaac Cohen, William Hayes, Hiten Malhotra

Field of Study: Science, Technology, Engineering, and/or Math

Program Affiliation: BUPNUR

Faculty Mentors: William Hayes

Easel: 1

Timeslot: Afternoon

Abstract: Large Language Models (LLMs) exhibit emergent in-context learning capabilities that can be applied to sequential decision-making tasks. This study examined how Llama 3 (8B parameters) performs in multi-armed bandit tasks with Bernoulli reward distributions. The LLM’s performance was compared to traditional reinforcement learning (RL) algorithms, including epsilon-greedy and Upper Confidence Bound (UCB), to evaluate the model’s understanding of the exploration-exploitation trade-off. Using a logistic regression classifier on PCA-reduced activation vectors extracted from the LLM’s decoder layers, over 90% accuracy was achieved in distinguishing prompts reflecting greedy versus anti-greedy decisions, indicating that the LLM forms internally consistent representations. However, efforts to steer the LLM’s behavior using steering vectors proved unsuccessful, highlighting the difficulties inherent in manipulating LLM behavior in complex decision-making tasks. These findings raise important questions about interpretability, control, and the emergent nature of in-context learning.
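The two RL baselines named in the abstract can be sketched as follows. This is a minimal illustration of epsilon-greedy and UCB1 on a Bernoulli bandit; the horizon, epsilon value, and arm probabilities here are illustrative assumptions, not the study's actual settings.

```python
import math
import random

def epsilon_greedy(probs, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a Bernoulli bandit: with probability epsilon pick a
    random arm (explore), otherwise pull the arm with the highest empirical
    mean reward (exploit). Returns the average reward per step."""
    rng = random.Random(seed)
    counts = [0] * len(probs)   # pulls per arm
    sums = [0.0] * len(probs)   # total reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(probs))
        else:
            # Unpulled arms get +inf so each arm is tried at least once.
            means = [s / c if c else float("inf") for s, c in zip(sums, counts)]
            arm = means.index(max(means))
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total / steps

def ucb1(probs, steps=1000, seed=0):
    """UCB1: pull the arm maximizing empirical mean plus a confidence bonus
    sqrt(2 * ln(t) / n_arm), which shrinks as an arm is sampled more."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    sums = [0.0] * len(probs)
    total = 0.0
    for t in range(1, steps + 1):
        if 0 in counts:
            arm = counts.index(0)  # initialize: try each arm once
        else:
            ucb = [s / c + math.sqrt(2 * math.log(t) / c)
                   for s, c in zip(sums, counts)]
            arm = ucb.index(max(ucb))
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total / steps
```

Comparing an LLM's choices against these baselines (and against random play) gives a behavioral measure of whether the model balances exploration and exploitation rather than acting greedily or at random.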