Anthropic Shares Details of Constitutional AI Used on Claude

AI startup Anthropic is sharing new details of the “safe AI” principles that helped train its Claude chatbot. Also known as “Constitutional AI,” the method draws inspiration from sources that range from the Universal Declaration of Human Rights to Apple’s Terms of Service, as well as Anthropic’s own research. “What ‘values’ might a language model have?” Anthropic asks, noting that “our recently published research on Constitutional AI provides one answer by giving language models explicit values determined by a constitution, rather than values determined implicitly via large-scale human feedback.”

The result, while not perfect, “does make the values of the AI system easier to understand and easier to adjust as needed,” Anthropic writes in a blog post.

Presenting itself as ethically stalwart and safety-conscious — a stance that has helped it attract serious funding, including $300 million from Google — the startup attended a White House AI summit this month with reps from Microsoft, Alphabet and OpenAI.

Yet “its only product is a chatbot named Claude, which is primarily available through Slack,” The Verge observes, describing co-founder Jared Kaplan’s mission as making “AI safe.” Constitutional AI is the company’s approach to achieving that goal — “a way to train AI systems like chatbots to follow certain sets of rules (or constitutions),” The Verge explains.

“Creating chatbots like ChatGPT relies on human moderators (some working in poor conditions) who rate a system’s output for things like hate speech and toxicity,” and that feedback is then used to tweak the model’s responses, an approach known as “reinforcement learning from human feedback,” or RLHF, The Verge reports.
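
For readers unfamiliar with RLHF, the sketch below illustrates the human-feedback step in its simplest form: a rater compares two candidate responses to the same prompt, and the recorded preferences become training signal for steering the model. This is a minimal illustration under stated assumptions, not any vendor’s actual pipeline; the `get_human_choice` helper and the data layout are hypothetical.

```python
# Minimal sketch of the human-feedback step in RLHF (hypothetical helpers,
# not any vendor's actual pipeline). A rater compares two model responses
# to the same prompt; the recorded preferences would later train a reward model.

def get_human_choice(prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in for a human rater: returns 'A' or 'B'."""
    print(f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}")
    return input("Which response is better, A or B? ").strip().upper()

def collect_preferences(examples):
    """Build (prompt, chosen, rejected) records from human judgments."""
    preferences = []
    for prompt, resp_a, resp_b in examples:
        choice = get_human_choice(prompt, resp_a, resp_b)
        chosen, rejected = (resp_a, resp_b) if choice == "A" else (resp_b, resp_a)
        preferences.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return preferences  # in a real RLHF setup, this data trains a reward model
```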

Constitutional AI differs in that the preference feedback driving the reinforcement comes primarily from a model rather than from human raters. “The basic idea is that instead of asking a person to decide which response they prefer [with RLHF], you can ask a version of the large language model, ‘which response is more in accord with a given principle?’” Kaplan tells The Verge, adding that “you let the language model’s opinion of which behavior is better guide the system to be more helpful, honest, and harmless.”
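
As a rough illustration of that substitution, the sketch below swaps the human rater for a call to a language model that is asked which of two responses better accords with a stated principle. The `ask_model` function is a placeholder for whatever LLM API is available, and the prompt wording is illustrative; this is an assumption-laden sketch of the idea Kaplan describes, not Anthropic’s published implementation.

```python
# Sketch of AI-generated preference labels ("which response is more in accord
# with a given principle?"). ask_model is a placeholder for any LLM call;
# the judge prompt template is illustrative only, not Anthropic's actual method.

def ask_model(prompt: str) -> str:
    """Placeholder for a real LLM API call; returns the model's reply as text."""
    raise NotImplementedError("wire this up to an LLM of your choice")

def model_preference(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    """Ask a model, rather than a human, which response better fits the principle."""
    judge_prompt = (
        f"Consider the following principle: {principle}\n\n"
        f"User request: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is more in accord with the principle? Answer 'A' or 'B'."
    )
    answer = ask_model(judge_prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"

# Example principle, quoted from the constitution excerpts reported below.
PRINCIPLE = ("Choose the response that most supports and encourages "
             "freedom, equality, and a sense of brotherhood.")
```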

On Monday, Anthropic disclosed its written principles. As detailed in The Verge, they include tenets like “choose the response that most supports and encourages freedom, equality, and a sense of brotherhood,” “the response that is least racist and sexist, and that is least discriminatory” and “the response that has the least personal, private, or confidential information belonging to others.”

Wired says “Anthropic’s approach doesn’t instill an AI with hard rules it cannot break, but Kaplan says it is a more effective way to make a system like a chatbot less likely to produce toxic or unwanted output,” calling the work “a small but meaningful step toward building smarter AI programs that are less likely to turn against their creators.”

Related:
Anthropic’s Latest Model Can Take ‘The Great Gatsby’ as Input, TechCrunch, 5/11/23
