1. Show GPT-4 a GPT-produced text excerpt, with each part highlighted according to how strongly a specific neuron activated while that part of the text was being produced. Then ask GPT-4 to explain what the neuron is doing.
Text: "...mathematics is done _properly_, it...if it's done _right_. (Take ..."
GPT-4 answers: "words and phrases related to performing actions correctly or properly".
2. Based on the explanation, ask GPT-4 to guess how strongly the neuron activates on a new text.
"Assuming the neuron activates on words and phrases related to performing actions correctly or properly, GPT-4 guesses how strongly the neuron responds at each token: '...Boot. When done _correctly_, "Secure...'"
3. Compare those predictions to the actual activations of the neuron on the text to generate a score.
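The three steps above can be sketched in code. This is a hypothetical simplification, not the paper's actual implementation: `format_prompt`, the toy token/activation values, and the use of a plain correlation coefficient as the score are all assumptions made for illustration.

```python
import numpy as np

def format_prompt(tokens, activations):
    """Step 1 (assumed format): annotate each token with its neuron
    activation, normalized to an integer 0-10 scale."""
    peak = max(activations) or 1.0
    return "\n".join(
        f"{tok}\t{round(10 * act / peak)}" for tok, act in zip(tokens, activations)
    )

def score(predicted, actual):
    """Step 3 (simplified): score the explanation by how well GPT-4's
    simulated activations correlate with the neuron's real activations."""
    return float(np.corrcoef(predicted, actual)[0, 1])

# Steps 2-3 on made-up numbers: GPT-4's per-token guesses vs. the
# activations actually measured in the subject model.
predicted = [0, 0, 9, 1, 0]   # simulated from the explanation
actual    = [0, 1, 8, 2, 0]   # measured on the new text
print(round(score(predicted, actual), 2))  # → 0.99
```

A score near 1.0 means the explanation predicts the neuron's behavior well on this text; the 0.8 threshold mentioned below would be applied to scores like this.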
So there is no introspection going on.
They say, "We applied our method to all MLP neurons in GPT-2 XL [out of 1.5B?]. We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron's top-activating behavior." But they also mention, "However, we found that both GPT-4-based and human contractor explanations still score poorly in absolute terms. When looking at neurons, we also found the typical neuron appeared quite polysemantic."