Dec 15, 2023
LLMs often face competing pressures (for example, helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-7b-chat on the forbidden fact task. Specifically, we instruct Llama 2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama 2 into 1057 different components and rank each one by how useful it is for forbidding the correct answer. We find that, in aggregate, 41 components are enough to reliably implement the full suppression behavior. However, these components are fairly heterogeneous, and many operate using faulty heuristics. One of these heuristics can be exploited via manually designed adversarial attacks, which we call California Attacks. Our results highlight some roadblocks to successfully interpreting advanced ML systems.
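As a rough illustration of the decomposition and ranking described above, the sketch below enumerates components under one assumed breakdown and finds how many top-ranked components are needed to reach a given share of a total effect. Llama-2-7b has 32 layers with 32 attention heads each; treating the 1057 components as 1024 heads, 32 MLPs, and one initial embedding is an assumption that matches the count, not something the abstract states. The function names and the scoring dictionary are hypothetical, and the cumulative-sum criterion is a simplified stand-in for whatever measure the authors actually use.

```python
# Llama-2-7b shape: 32 transformer layers, 32 attention heads per layer.
N_LAYERS = 32
N_HEADS = 32


def list_components(n_layers=N_LAYERS, n_heads=N_HEADS):
    """Enumerate components under the ASSUMED breakdown:
    one initial embedding, every attention head, and one MLP per layer."""
    components = ["embed"]
    for layer in range(n_layers):
        components += [f"L{layer}.H{head}" for head in range(n_heads)]
        components.append(f"L{layer}.MLP")
    return components


def rank_components(scores):
    """Return component names sorted by score, highest first.
    `scores` maps name -> hypothetical usefulness for suppression."""
    return sorted(scores, key=scores.get, reverse=True)


def smallest_sufficient_prefix(ranked, scores, fraction):
    """Count how many top-ranked components are needed before their
    summed score reaches `fraction` of the total (scores assumed >= 0)."""
    total = sum(scores.values())
    running = 0.0
    for k, name in enumerate(ranked, start=1):
        running += scores[name]
        if running >= fraction * total:
            return k
    return len(ranked)


if __name__ == "__main__":
    print(len(list_components()))  # 1 + 32 * 32 + 32 components
```

In practice the scores would come from an intervention on the model (for example, patching or ablating each component and measuring the change in the forbidden answer's logit); that step is omitted here because the abstract does not specify it.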