Anthropic’s “AI Microscope” Explores the Inner Workings of Large Language Models


Two recent papers by Anthropic attempt to shed light on the processes that take place inside a large language model, exploring how to locate interpretable concepts and link them to the computational “circuits” that translate them into language, and how to characterize crucial behaviors of Claude 3.5 Haiku, including hallucinations, planning, and other key traits.

The internal mechanisms behind the capabilities of large language models remain poorly understood, which makes it difficult to explain or interpret the strategies they use to solve problems. Those strategies are embedded in the billions of computations underlying each word the model generates, yet they remain largely opaque, according to Anthropic. To explore this hidden layer of reasoning, Anthropic researchers have developed a new approach they call the “AI microscope”:

We take inspiration from the field of neuroscience, which has long studied the messy insides of thinking organisms, and try to build a kind of AI microscope that will let us identify patterns of activity and flows of information.

In very simplified terms, Anthropic’s AI microscope consists of replacing the model under study with a so-called replacement model, in which the model’s neurons are replaced by sparsely active features that can often represent interpretable concepts. For example, one such feature may fire when the model is about to generate a state capital.

Naturally, the replacement model will not always produce the same output as the underlying model. To address this limitation, Anthropic researchers use a local replacement model for each prompt they want to study, created by adding error terms and fixed attention patterns to the replacement model.

[A local replacement model] produces exactly the same output as the original model, but replaces as much of the computation as possible with features.
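To make these two ideas more concrete, here is a minimal, purely illustrative sketch: dense activations are re-expressed through a small number of active features, and a per-prompt error term is added back so that the result matches the original exactly. The dimensions, random weights, and top-k sparsity rule are assumptions made for illustration and do not reflect Anthropic's trained feature dictionaries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 512-dimensional hidden layer approximated by a
# dictionary of 4096 candidate features (both numbers are made up).
D_MODEL, N_FEATURES, K_ACTIVE = 512, 4096, 32

# Placeholder encoder/decoder weights standing in for a trained dictionary.
W_enc = rng.normal(scale=0.02, size=(D_MODEL, N_FEATURES))
W_dec = rng.normal(scale=0.02, size=(N_FEATURES, D_MODEL))

def encode(activation: np.ndarray) -> np.ndarray:
    """Map a dense activation to features, keeping only the top-k so that
    just a handful of (ideally interpretable) features stay active."""
    pre = np.maximum(activation @ W_enc, 0.0)
    features = np.zeros_like(pre)
    top = np.argsort(pre)[-K_ACTIVE:]
    features[top] = pre[top]
    return features

def decode(features: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate dense activation from the active features."""
    return features @ W_dec

# Dense activation captured from the original model on one specific prompt.
original_activation = rng.normal(size=D_MODEL)

features = encode(original_activation)
reconstruction = decode(features)
print("active features:", int((features > 0).sum()), "of", N_FEATURES)

# The reconstruction alone is only approximate, so a replacement model's
# output can drift from the original. For a local replacement model, the
# leftover error on this prompt is frozen and added back, which makes the
# result (and hence the output) match the original exactly.
error_term = original_activation - reconstruction
local_replacement = reconstruction + error_term
assert np.allclose(local_replacement, original_activation)
```

In Anthropic's setup the attention patterns are also frozen to those of the original model on the prompt being studied; the sketch above covers only the activation side.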

As a final step, to describe the flow of features through the local replacement model from the initial prompt to the final output, the researchers build an attribution graph. This graph is constructed by pruning away all features that do not affect the output.
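The pruning step can be pictured as keeping only the nodes whose influence actually propagates to the output. Below is a toy sketch of that idea; the node names, edge weights, and the path-product influence measure are invented for illustration (loosely echoing the multi-step examples discussed in the papers) and are not Anthropic's actual attribution method.

```python
# Toy attribution graph: nodes are prompt tokens, features, and the output;
# edges carry made-up attribution weights.
edges = {
    ("prompt:Dallas", "feat:Texas"): 0.9,
    ("feat:Texas", "feat:state-capital"): 0.8,
    ("feat:state-capital", "output:Austin"): 0.85,
    ("prompt:Dallas", "feat:sports-team"): 0.4,
    ("feat:sports-team", "feat:weather"): 0.01,   # negligible path
    ("feat:weather", "output:Austin"): 0.005,
}

def influence_on_output(node, output="output:Austin", cache=None):
    """Influence of a node on the output: product of weights along each
    path, summed over all paths (assumes the graph is acyclic)."""
    if cache is None:
        cache = {}
    if node == output:
        return 1.0
    if node in cache:
        return cache[node]
    total = sum(w * influence_on_output(dst, output, cache)
                for (src, dst), w in edges.items() if src == node)
    cache[node] = total
    return total

# Prune: keep only the nodes whose influence on the output is non-negligible.
THRESHOLD = 0.05
nodes = {n for edge in edges for n in edge}
kept = {n for n in nodes if influence_on_output(n) >= THRESHOLD}
print(sorted(kept))  # the sports-team/weather branch gets pruned away
```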

Keep in mind that this is a very rough overview of Anthropic’s AI microscope. For more details, refer to the original paper linked above.

Using this approach, Anthropic researchers obtained a number of interesting results. Regarding multilingual capabilities, they found evidence of a kind of universal language that Claude uses to generate concepts before translating them into a specific language.

We investigate this by asking Claude for the “opposite of small” across different languages, and find that the same core features for the concepts of smallness and oppositeness activate, and trigger a concept of largeness, which gets translated out into the language of the question.
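Observing the shared features themselves requires access to the model's internals, but the prompting side of this probe is easy to reproduce with the Anthropic Python SDK. The sketch below assumes the anthropic package is installed, an ANTHROPIC_API_KEY is set in the environment, and that the model alias used here is available; adjust as needed.

```python
# Reproduces only the prompting side of the multilingual probe; the
# feature-level observation requires access to the model's internals.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompts = {
    "English": 'What is the opposite of "small"? Answer with one word.',
    "French": "Quel est le contraire de « petit » ? Réponds en un seul mot.",
    "Chinese": "“小”的反义词是什么？请用一个词回答。",
}

for language, prompt in prompts.items():
    reply = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed alias; may need adjusting
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"{language}: {reply.content[0].text.strip()}")
```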

Another interesting observation goes against the common understanding that LLMs build their output one word at a time “without much forethought”. Instead, studying how Claude generates rhymes shows that the model actually plans ahead:

Before starting the second line, it began “thinking” of potential on-topic words that would rhyme with “grab it”. Then, with these plans in mind, it wrote a line ending with the planned word.

Anthropic researchers also dug into why the model sometimes makes up information, a.k.a. hallucinates. Hallucination is in a way intrinsic to how these models work, since they are always expected to produce a next guess. This implies that models must rely on specific anti-hallucination training to counter that tendency. In other words, there are two distinct mechanisms at play: one identifying “known entities” and another corresponding to “unknown names” or “can’t answer”. Their correct interaction is what keeps models from hallucinating:

We show that such misfires can occur when Claude recognizes a name but doesn’t know anything else about that person. In cases like this, the “known entity” feature might still activate and then suppress the default “don’t know” feature, in this case incorrectly. Once the model has decided that it needs to answer the question, it proceeds to confabulate: to generate a plausible, but unfortunately untrue, response.
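To visualize the interaction between the two mechanisms, here is a deliberately simplified toy model: the default “can’t answer” signal starts on and is suppressed in proportion to how strongly a “known entity” signal fires. The numbers and the linear inhibition rule are made up for illustration; this is not the circuit Anthropic actually traced.

```python
# Toy model of the interaction described above; the weights, threshold, and
# inhibition rule are invented for illustration only.
def decides_to_answer(known_entity_activation: float,
                      refusal_bias: float = 1.0,
                      inhibition_strength: float = 1.5) -> bool:
    """The default "can't answer" feature starts active (refusal_bias) and is
    suppressed in proportion to how strongly "known entity" fires."""
    cant_answer = refusal_bias - inhibition_strength * known_entity_activation
    return cant_answer <= 0.0  # suppressed -> the model commits to answering

# Well-known name: "known entity" fires strongly, answering is appropriate.
print(decides_to_answer(known_entity_activation=0.9))   # True

# Unknown name: the default "can't answer" feature wins and the model declines.
print(decides_to_answer(known_entity_activation=0.1))   # False

# Misfire: a recognized name with no associated facts still fires "known
# entity" strongly enough to suppress the refusal, so the model proceeds
# to confabulate a plausible but untrue answer.
print(decides_to_answer(known_entity_activation=0.7))   # True
```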

Other interesting dimensions explored by Anthropic researchers concern mental math, producing a chain of thought that explains the reasoning used to reach an answer, multi-step reasoning, and jailbreaks. You can find the full details in Anthropic’s papers.

Anthropic’s AI microscope aims to contribute to interpretability research and, eventually, to provide a tool that helps us understand how models produce their inferences and ensure that they are aligned with human values. However, it is still a nascent effort that captures only a tiny fraction of the model’s total computation and can only be applied to small prompts of a few dozen words. InfoQ will continue to report on progress in LLM interpretability as new information emerges.


