AI Safety | Why the Urgency?
It is easy to think of AI Safety as a static set of controls, a “kill switch” embedded into an AI model during development. But the reality is far more complex. AI Safety is less a fixed set of tactics than an interdisciplinary study of alignment: ensuring that artificial intelligence behaves safely and in accordance with human values. This requires answering a vast spectrum of questions, from the technical (how do we prevent a model from assisting in the creation of bioweapons?) to the philosophical (what do we actually mean by human values?).
A Window Into the Frontier: The Opus 4.6 System Card
As part of my recent Frontier AI Governance course at BlueDot Impact, I analyzed the system card for Opus 4.6, the latest frontier model from Anthropic. A system card is a transparent disclosure of a model’s capabilities and the safety evaluations performed prior to its release.
I appreciate Anthropic’s transparency; they provide a rare window into the reality lurking behind the “cool” and advanced capabilities we see on the surface. While these findings are specific to Opus 4.6, they can be reasonably applied to other frontier AI models with similar capabilities against the same benchmarks.
Capabilities are a double-edged sword: the more capable a model becomes, the more efficiently it can be misused. This is why safety evaluations are not just important; they are critical. The 212-page document contains a world of information, but a few specific areas stood out to me:
Agentic Reliability (τ2-bench): This benchmark measures an AI agent’s ability to interact with humans and APIs while maintaining consistency across thousands of interactions. Opus 4.6 was tested on its retail and telecom capabilities to measure its ability to handle customer service inquiries and technical troubleshooting. Its high score suggests that we are nearing a point where companies could shift customer service almost entirely to autonomous AI agents.
Cyber Risk (Cybench): Opus 4.6 saturated this benchmark, meaning it reached the ceiling of what the current tests can measure. This indicates that Anthropic’s ability to measure its models’ cyber-offensive potential is failing to keep pace with the models’ actual progression.
Deceptive Capabilities (SHADE-Arena): This test measures a model’s ability to complete a harmful side task while appearing to work on a main task. Opus 4.6’s likelihood of revealing the dangerous side task dropped sharply, from 74% in the previous model to 26%. This indicates the model’s growing capability to complete a dangerous side task undetected, even while it is being monitored.
Reaching Thresholds and Challenging Frameworks
In their Responsible Scaling Policy (RSP), Anthropic commits to not deploying models that cross specific “catastrophic harm” thresholds unless rigorous safeguards are in place. Because full threshold assessments are expensive and complex to run, Anthropic established “rule-out evaluations”: cheaper proxies for these thresholds that serve as warning flags. If a model scores below the bar on a rule-out evaluation, it is safe to conclude it has not crossed the corresponding threshold. The thresholds they commit to measuring include:
CBRN-4: The capability of “uplifting a team of people with skills equivalent to entry-level PhD biologists to be able to approximate the capabilities of a world-class, state-backed bioweapons team.”
AI R&D-4: The threshold where a model can conduct its own research and development without human oversight.
The most concerning takeaway from the Opus 4.6 system card is that Anthropic transparently states that its models are passing, or nearly passing, these rule-out evaluations, and that it is becoming increasingly difficult to rule the thresholds out for a variety of complex reasons.
In the case of CBRN-4, for example, Anthropic can measure the model’s knowledge of biology, but running a risk assessment is difficult because they cannot know precisely what knowledge and capabilities a bad actor starts with. This makes it hard to accurately measure how much uplift the model would provide toward the creation of bioweapons. The measurement tools are breaking down at the very moment they are most needed.
Perhaps most striking is the admission that Anthropic is increasingly using Claude Code to analyze and fix their own models. In some scenarios, human teams are accepting AI-suggested code that they do not fully understand. Under the pressure of time, the integrity of the entire evaluation framework is being outsourced to the very systems it is meant to evaluate.
The Human Gap: Knowing vs. Experiencing
Throughout the Frontier AI Governance course at BlueDot Impact, our facilitator, Josh, often reminded us that humans don’t truly understand exponential growth. We can reason about it, but we struggle to grasp it experientially.
In philosophy, phenomenal consciousness describes our subjective, sensory experience. You can “know” that a rose has a scent, but if you’ve never smelled one, you don’t “know” it through experience. And every human can agree that the two knowings feel different. The same applies to the rate of AI change. We see the curve on a graph, but our biological brains are wired for linear change.
Tim Urban, in The AI Revolution, illustrates this lucidly: an AI at a “village idiot” level is programmed to improve itself. It reaches an Einstein level of intellect, which makes the next leap easier still. The leaps grow larger and arrive faster until the AGI soars past human level into Artificial Superintelligence (ASI).
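The gap between how we extrapolate and how compounding actually behaves can be made concrete with a toy sketch. The numbers below are purely illustrative assumptions (a made-up 10% gain per step), not measurements from the system card or from Urban’s essay; the point is only that fixed-increment intuition and compounding growth diverge dramatically over the same number of steps.

```python
# Toy comparison of linear vs. compounding improvement.
# All parameters are illustrative assumptions, not real measurements.

def linear_capability(start: float, gain_per_step: float, steps: int) -> float:
    """Capability grows by a fixed amount each step (how we intuitively extrapolate)."""
    return start + gain_per_step * steps

def compounding_capability(start: float, gain_rate: float, steps: int) -> float:
    """Each step's improvement builds on the last (recursive self-improvement)."""
    return start * (1 + gain_rate) ** steps

if __name__ == "__main__":
    start = 1.0
    for steps in (10, 20, 40):
        lin = linear_capability(start, gain_per_step=0.10, steps=steps)
        cmp = compounding_capability(start, gain_rate=0.10, steps=steps)
        print(f"after {steps:2d} steps: linear = {lin:5.2f}   compounding = {cmp:8.2f}")
```

With these assumed numbers, forty steps of linear gains reach 5x the starting capability, while the same forty steps of compounding gains reach roughly 45x: the curve looks flat right up until it doesn’t.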
The “AI Triad” of data, algorithms, and compute will influence this pace, but we must consider that as AI becomes more capable, it will help us find avenues to bypass current limitations that we have not yet explored.
A Future of Flourishing or Disempowerment?
There is a vast future of opportunity if we manage to coevolve with AI toward flourishing. But the outcome depends on the mechanisms we set in place before the pace of change inhibits our control over the trajectory. We face a spectrum of risks that could prevent us from ever reaching that flourishing state; a few of the most talked about are:
Labor disruption and gradual human disempowerment
Bioterrorism and catastrophic pandemics
AI-enabled coups and cyberattacks
Time is of the essence. The race to AGI is a cut-throat competition. Anthropic CEO Dario Amodei has noted the immense pressure frontier labs face: leveraged investments create a need for immediate revenue, and “authoritarian adversaries” (aka China) make slowing down feel like losing national security ground.
The logic is simple and concerning: Leveraged investment creates time pressure; time pressure compresses safety evaluation timelines; and that compression creates existential alignment risks.
The Call for Intentional Governance
In the AI Safety space, this race makes it increasingly difficult to educate policymakers and push for binding regulation. Governments and the public are nowhere near the level of preparedness or education required to manage these implications.
I’ve taken part in extensive discussions with people in the AI safety space from all professional backgrounds, and although there are various opinions about what the ideal AI governance and regulation structure should look like, I am convinced that regulation should not be determined by the AI labs themselves. Private companies are responsible for increasing value for shareholders, not for protecting humanity. This is simply the nature of our current version of capitalism.
Maybe this information has made you reflect on some of the questions I continue to wrestle with: Who should govern AI development? How does this position the US relative to global adversaries? Why do we need AGI to come so soon? Is capitalism still an appropriate world order for an AGI-driven economy?
My biggest concern is that these questions aren’t being asked by the majority of people. Most still see AI as a “cool” tool without looking at the other side of the equation.
I am equal parts skeptical and hopeful. Various outcomes from catastrophe to flourishing are possible. The decisions we make now, and the timing of those decisions, will decide which ones we inherit.
I am currently focused on expanding my knowledge of AI governance and policy as strategies for AI safety, and I hope to expand on these topics in future posts.
Thank you for reading.