Unlocking the Mystery: What Do Neural Networks Truly Learn? 🧠
Discover how neural networks process information and what they actually learn behind the scenes. Dive into the fascinating world of AI cognition and uncover the secrets of these powerful models.

Rational Animations
287.2K views • Jun 14, 2024

About this video
Neural networks have become increasingly impressive in recent years, but there's a big catch: we don't really know what they are doing. We give them data and ways to get feedback, and somehow, they learn all kinds of tasks. It would be really useful, especially for safety purposes, to understand what they have learned and how they work after they've been trained. The ultimate goal is not only to understand in broad strokes what they're doing but to precisely reverse engineer the algorithms encoded in their parameters. This is the ambitious goal of mechanistic interpretability. As an introduction to this field, we show how researchers have been able to partly reverse-engineer how InceptionV1, a convolutional neural network, recognizes images.
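As a small taste of how this reverse-engineering is done in practice, here is a minimal feature-visualization sketch in PyTorch (our own illustrative code, not from the video or the papers below). It optimizes an input image, starting from noise, so that one channel of torchvision's GoogLeNet (the same architecture as InceptionV1) activates strongly; the layer name, channel index, and hyperparameters are arbitrary choices for the example.

# Minimal feature-visualization sketch: find an image that excites one channel.
# Illustrative only; real work adds regularizers and image transformations.
import torch
import torchvision

model = torchvision.models.googlenet(weights="DEFAULT").eval()

activations = {}
def save_activation(module, inputs, output):
    activations["target"] = output

# torchvision's "inception3b" roughly corresponds to the Mixed3b layer linked below.
model.inception3b.register_forward_hook(save_activation)

channel = 42  # which channel of the layer to visualize (arbitrary choice)
img = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from random noise
optimizer = torch.optim.Adam([img], lr=0.05)

for step in range(256):
    optimizer.zero_grad()
    model(img)  # forward pass fills activations["target"] via the hook
    loss = -activations["target"][0, channel].mean()  # maximize the channel's mean activation
    loss.backward()
    optimizer.step()

# img now approximates a stimulus that strongly excites that channel.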
▀▀▀▀▀▀▀▀▀SOURCES & READINGS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
This topic is truly a rabbit hole. If you want to learn more about this important research, or even contribute to it, check out this list of sources on mechanistic interpretability and interpretability in general that we've compiled for you:
On Interpreting InceptionV1:
Feature visualization: https://distill.pub/2017/feature-visualization/
Zoom in: An Introduction to Circuits: https://distill.pub/2020/circuits/zoom-in/
The Distill journal contains several articles that try to make sense of how exactly InceptionV1 does what it does: https://distill.pub/2020/circuits/
OpenAI's Microscope tool lets us visualize the neurons and channels of a number of vision models in great detail: https://microscope.openai.com/models
Here's OpenAI's Microscope tool pointed at layer Mixed3b in InceptionV1: https://microscope.openai.com/models/inceptionv1/mixed3b_0?models.op.feature_vis.type=channel&models.op.technique=feature_vis
Activation atlases: https://distill.pub/2019/activation-atlas/
More recent work applying sparse autoencoders (SAEs) to InceptionV1: https://arxiv.org/abs/2406.03662v1
Transformer Circuits Thread, the spiritual successor of the InceptionV1 circuits thread, this time focused on transformers: https://transformer-circuits.pub/
In the video, we cite "Toy Models of Superposition": https://transformer-circuits.pub/2022/toy_model/index.html
We also cite "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (a minimal sparse-autoencoder sketch follows this list): https://transformer-circuits.pub/2023/monosemantic-features/
More recent progress:
Mapping the Mind of a Large Language Model:
Press: https://www.anthropic.com/research/mapping-mind-language-model
Paper in the Transformer Circuits thread: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Extracting Concepts from GPT-4:
Press: https://openai.com/index/extracting-concepts-from-gpt-4/
Paper: https://arxiv.org/abs/2406.04093
Browse features: https://openaipublic.blob.core.windows.net/sparse-autoencoder/sae-viewer/index.html
Language models can explain neurons in language models (cited in the video):
Press: https://openai.com/index/language-models-can-explain-neurons-in-language-models/
Paper: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
View neurons: https://openaipublic.blob.core.windows.net/neuron-explainer/neuron-viewer/index.html
Neel Nanda on how to get started with Mechanistic Interpretability:
Concrete Steps to Get Started in Transformer Mechanistic Interpretability: https://www.neelnanda.io/mechanistic-interpretability/getting-started
Mechanistic Interpretability Quickstart Guide: https://www.neelnanda.io/mechanistic-interpretability/quickstart
200 Concrete Open Problems in Mechanistic Interpretability: https://www.alignmentforum.org/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability
More work mentioned in the video:
Progress measures for grokking via mechanistic interpretability: https://arxiv.org/abs/2301.05217
Discovering Latent Knowledge in Language Models Without Supervision: https://arxiv.org/abs/2212.03827
Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning: https://www.nature.com/articles/s41551-018-0195-0
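For the dictionary-learning approach cited above in "Towards Monosemanticity", here is a minimal sparse-autoencoder sketch, again our own illustrative code rather than the papers' implementation: it learns an overcomplete dictionary of features whose sparse combinations reconstruct a model's activations. The dimensions, L1 coefficient, and random stand-in activations are assumptions for the example.

# Minimal sparse autoencoder (SAE) sketch for dictionary learning on activations.
import torch
import torch.nn as nn

d_model, d_dict = 512, 4096   # activation width, dictionary size (overcomplete)
l1_coeff = 1e-3               # strength of the sparsity penalty

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        recon = self.decoder(features)          # reconstruction of the original activation
        return recon, features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(1024, d_model)  # stand-in for activations collected from a trained model

for step in range(100):
    optimizer.zero_grad()
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    loss.backward()
    optimizer.step()

# Each decoder column is a candidate "monosemantic" feature direction.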
▀▀▀▀▀▀▀▀▀PATREON, MEMBERSHIP, MERCH▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🟠 Patreon: https://www.patreon.com/rationalanimations
🔵 Channel membership: https://www.youtube.com/channel/UCgqt1RE0k0MIr0LoyJRy2lg/join
🟢 Merch: https://rational-animations-shop.fourthwall.com
🟤 Ko-fi, for one-time and recurring donations: https://ko-fi.com/rationalanimations
▀▀▀▀▀▀▀▀▀SOCIAL & DISCORD▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Discord: https://discord.gg/5Y3Dwz89yH
Reddit: https://www.reddit.com/r/RationalAnimations/
X/Twitter: https://twitter.com/RationalAnimat1
▀▀▀▀▀▀▀▀▀PATRONS & MEMBERS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
AAAA you don't fit in the description this time! But we thank you from the bottom of our hearts. All of you, in this Google Doc: https://docs.google.com/document/d/18S3cEkXrllXdWQMxL9G0KjB26YMZnbA4I4VHw5j55oA/edit?usp=sharing
▀▀▀▀▀▀▀CREDITS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
All the good doggos who worked on this video: https://docs.google.com/document/d/1KQZCfiv1nFKrAm9vcXNjNzQfTLVqY_ofXcWlgXH_dVY/edit?usp=sharing