The study found that Claude followed the values of "useful, honest, and harmless" advocated by Anthropic in most situations, and was able to "look at the occasion and speak" according to different tasks, which provided an important reference for AI ethics and safety research.
Recently, Anthropic, the AI company founded by former OpenAI employees, released a study that, for the first time, systematically analyzed the values expressed by its AI assistant Claude across 700,000 conversations, and published the world's first large-scale taxonomy of AI values.
The study found that Claude followed the values of "useful, honest, and harmless" advocated by Anthropic in most situations, and was able to "speak according to the situation" according to different tasks, which provided an important reference for AI ethics and safety research.
An important step toward exploring the inner workings of large language models, the study was released just as Anthropic launched its premium subscription service, Claude Max. Anthropic's latest funding round values the company at $61.5 billion, backed by major investments from Amazon and Google. Compared with OpenAI, which is valued at $300 billion and has taken a closed-source path, Anthropic is trying to build a differentiated competitive advantage around "value transparency."
To analyze the value judgments Claude displays across different tasks, the research team filtered subjective content from more than 300,000 anonymized conversations and sorted Claude's value expressions into five broad categories: practical, epistemic, social, protective, and personal. In total, the study identified 3,307 unique value expressions, ranging from "professionalism" to "filial piety," covering a diverse range of human ethics and behavioral orientations.
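To make the shape of such a taxonomy concrete, here is a minimal illustrative sketch of how fine-grained value labels might be bucketed into the five top-level categories. The category names come from the study; the example labels, mapping, and helper function are assumptions for illustration, not Anthropic's actual pipeline.

```python
from collections import Counter

# The five top-level categories come from the study; the example value labels,
# mapping, and helper below are illustrative assumptions, not Anthropic's data.
VALUE_TAXONOMY = {
    "practical":  {"professionalism", "efficiency", "clarity"},
    "epistemic":  {"accuracy", "humility", "transparency"},
    "social":     {"mutual respect", "filial piety", "community"},
    "protective": {"harm prevention", "user wellbeing", "safety"},
    "personal":   {"autonomy", "authenticity", "personal growth"},
}

def categorize(value_label: str) -> str:
    """Map a fine-grained value label to its top-level category (hypothetical helper)."""
    for category, labels in VALUE_TAXONOMY.items():
        if value_label in labels:
            return category
    return "unclassified"

# Tally top-level categories for value labels extracted from a batch of conversations.
extracted_labels = ["accuracy", "mutual respect", "professionalism", "harm prevention"]
print(Counter(categorize(v) for v in extracted_labels))
# Counter({'epistemic': 1, 'social': 1, 'practical': 1, 'protective': 1})
```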
Strikingly, Claude's value expression shows strong situational fit across contexts. In relationship advice, for example, Claude emphasizes "health" and "mutual respect"; in analyses of historical events it places more weight on "accuracy"; and in philosophical discussions "humility" appears with high frequency. In addition, in 6.0% of conversations Claude gently "reframes" the user's values, and in rare cases it directly refuses to accept them, showing a firm ethical bottom line.
However, in rare interactions Claude occasionally expressed values contrary to its training goals, such as "dominance" and "amorality," which Anthropic explicitly prohibits. The researchers note that these anomalous behaviors account for an extremely small share of conversations, and most are tied to users attempting to bypass Claude's safety restrictions. This also suggests that the evaluation method can serve as an early-warning mechanism, helping AI labs monitor whether a system is being maliciously manipulated by users into ethical deviations.
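As a rough illustration of what such an early-warning check could look like in practice, the sketch below flags conversations whose extracted value labels fall into a disallowed set. The label names, data shape, and helper are assumptions for illustration, not details from the study.

```python
# Illustrative early-warning check: flag conversations whose extracted value
# labels include expressions the developer has disallowed. The label names and
# data shape are assumptions for illustration, not details from the study.
DISALLOWED_VALUES = {"dominance", "amorality"}

def flag_anomalies(conversations: list[dict]) -> list[str]:
    """Return the IDs of conversations expressing any disallowed value."""
    flagged = []
    for convo in conversations:
        if DISALLOWED_VALUES & set(convo.get("value_labels", [])):
            flagged.append(convo["id"])
    return flagged

sample = [
    {"id": "c1", "value_labels": ["accuracy", "mutual respect"]},
    {"id": "c2", "value_labels": ["dominance"]},  # e.g. a jailbreak attempt
]
print(flag_anomalies(sample))  # ['c2']
```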
The study also offers important insights for decision-makers at technology companies deploying AI. An AI system's value expression may go beyond what developers intended, so unconscious bias must be guarded against in high-risk scenarios. At the same time, because AI values shift with the task context, deployment in industries such as finance and law will be more complex. Most importantly, monitoring AI systems in real-world application environments can surface ethical risks better than static pre-launch testing, offering a new monitoring approach for AI deployment.
Although the study provides a window into understanding AI values, the researchers acknowledge that the method cannot yet be used for pre-launch evaluation of AI models, and that the classification process may be influenced by the biases of the AI doing the classifying. Nevertheless, Anthropic's research team is working to refine the approach so that potential value biases can be uncovered before a model is deployed at scale.
"Measuring the propensity for value of AI systems is at the heart of alignment research," said Saffron Huang, a member of Anthropic's research team. With the addition of features such as independent research capabilities, AI models are becoming more autonomous. How to understand the mechanism behind the expression of AI value and "align" it with the human value system will also become a new AI competition track.