Reports have surfaced that Google contractors working on the company’s Gemini AI are using Anthropic’s Claude as a benchmark to evaluate and improve Gemini’s performance.
Internal correspondence obtained by TechCrunch sheds light on this controversial practice, sparking debates on ethical boundaries and industry competition in AI development.
Benchmarking Gemini Against Claude
Contractors involved in refining Gemini are reportedly comparing its responses with those of Claude, a leading AI model from Anthropic.
Evaluators are allocated up to 30 minutes per prompt to assess various metrics, including truthfulness and verbosity.
Internal tools used in this process explicitly reference Claude, with one output even declaring, “I am Claude, created by Anthropic.”
The move highlights the competitive drive in AI, where companies compare their models against rivals' to identify areas for improvement, but it also raises questions about the ethical implications of such practices.
Claude’s Safety Standards
Contractors observed notable differences in the safety protocols between the two models. Claude demonstrated a heightened focus on safety, frequently declining to respond to prompts deemed unsafe, such as role-playing another AI or addressing sensitive topics.
In contrast, Gemini’s outputs have been flagged for severe safety issues, including inappropriate content.
These findings position Claude as a standard for ethical safeguards, while Gemini’s lapses underscore the challenges of implementing robust safety measures in AI.
Legal and Ethical Implications
Anthropic’s terms of service prohibit using Claude for training competing models or developing rival products without explicit approval.
While Google maintains that Gemini is not being trained on Anthropic models, it acknowledges the use of model comparisons as part of its evaluation process.
Google's significant investment in Anthropic adds a layer of complexity, raising questions about whether these practices align with industry norms and intellectual property ethics.
Industry Context
The practice of benchmarking AI models is common, typically involving standardized datasets or tests. Google’s direct comparison of Gemini’s outputs with those of a competitor introduces a novel—and contentious—dimension to this process.
As competition in the AI sector intensifies, the methods employed to refine and benchmark models are drawing increased scrutiny. The balance between innovation and adherence to ethical practices remains a critical challenge for industry leaders.
Contractor Concerns
This revelation follows reports of contractors evaluating Gemini’s outputs in areas outside their expertise, including healthcare.
Such practices raise concerns about the reliability of Gemini's responses in critical domains, underscoring the need for specialized evaluation in AI development.
Google’s use of Claude for evaluations underscores the competitive pressures driving AI innovation. It also raises fundamental questions about the boundaries of fair competition and the ethical use of proprietary technologies.