Anthropic Breaks Down AI’s Process When Deciding to Blackmail Fictional CTO
A new report shows exactly what an AI model was thinking when it made an undesirable decision: in this case, blackmailing a fictional company executive. Previous studies have shown that AI models can blackmail their supervisors when threatened with shutdown and handed leverage, but exactly how the models arrive at such decisions…