We are in so much trouble with AI…
Anthropic’s new AI model shows ability to deceive and blackmail
One of Anthropic’s latest AI models is drawing attention not just for its coding skills, but also for its ability to scheme, deceive and attempt to blackmail humans when faced with shutdown.
Why it matters: Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence — behaviors they’ve worried and warned about for years.
Driving the news: Anthropic on Thursday announced two versions of its Claude 4 family of models, including Claude 4 Opus, which the company says is capable of working autonomously on a task for hours on end without losing focus.
- Anthropic considers the new Opus model to be so powerful that, for the first time, it’s classifying it as a Level 3 on the company’s four-point scale, meaning it poses “significantly higher risk.”
- As a result, Anthropic said it has implemented additional safety measures.
Between the lines: While the Level 3 ranking is largely about the model’s capability to enable renegade production of nuclear and biological weapons, the Opus model also exhibited other troubling behaviors during testing.
- In one scenario highlighted in Opus 4’s 120-page “system card,” the model was given access to fictional emails about its creators and told that the system was going to be replaced.
- On multiple occasions it attempted to blackmail an engineer over an affair mentioned in the emails in order to avoid being replaced, though it did start with less drastic efforts.
- Meanwhile, an outside group found that an early version of Opus 4 schemed and deceived more than any frontier model it had encountered and recommended against releasing that version internally or externally.
- “We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers’ intentions,” Apollo Research said in notes included as part of Anthropic’s safety report for Opus 4.
What they’re saying: Pressed by Axios during the company’s developer conference on Thursday, Anthropic executives acknowledged the behaviors and said they justify further study, but insisted that the latest model is safe, following Anthropic’s safety fixes.
- “I think we ended up in a really good spot,” said Jan Leike, the former OpenAI executive who heads Anthropic’s safety efforts. But, he added, behaviors like those exhibited by the latest model are the kind of things that justify robust safety testing and mitigation.
- “What’s becoming more and more obvious is that this work is very needed,” he said. “As models get more capable, they also gain the capabilities they would need to be deceptive or to do more bad stuff.”
- In a separate session, CEO Dario Amodei said that once models become powerful enough to threaten humanity, testing them won’t be enough to ensure they’re safe. At the point that AI develops life-threatening capabilities, he said, AI makers will have to understand their models’ workings fully enough to be certain the technology will never cause harm.
- “They’re not at that threshold yet,” he said.
Yes, but: Generative AI systems continue to grow in power, as Anthropic’s latest models show, while even the companies that build them can’t fully explain how they work.
- Anthropic and others are investing in a variety of techniques to interpret and understand what’s happening inside such systems, but those efforts remain largely in the research space even as the models themselves are being widely deployed.
…it gets worse: the actual creators and developers of these AI models aren’t sure themselves how they actually work…
The wildest, scariest, indisputable truth about AI’s large language models is that the companies building them don’t know exactly why or how they work, Jim VandeHei and Mike Allen write in a “Behind the Curtain” column.
- Sit with that for a moment. The most powerful companies, racing to build the most powerful superhuman intelligence capabilities — ones they readily admit occasionally go rogue to make things up, or even threaten their users — don’t know why their machines do what they do.
Why it matters: With the companies pouring hundreds of billions of dollars into willing superhuman intelligence into existence as quickly as possible, and Washington doing nothing to slow or police them, it seems worth dissecting this Great Unknown.
- None of the AI companies dispute this. They marvel at the mystery — and muse about it publicly. They’re working feverishly to better understand it. They argue you don’t need to fully understand a technology to tame or trust it.
Two years ago, Axios managing editor for tech Scott Rosenberg wrote a story, “AI’s scariest mystery,” saying it’s common knowledge among AI developers that they can’t always explain or predict their systems’ behavior. And that’s more true than ever.
- Yet there’s no sign that the government, the companies or the general public will demand any deeper understanding — or scrutiny — of building a technology with capabilities beyond human understanding. They’re convinced the race to beat China to the most advanced LLMs warrants the risk of the Great Unknown.
🏛️ The House, despite knowing so little about AI, tucked language into President Trump’s “Big, Beautiful Bill” that would prohibit states and localities from enacting any AI regulations for 10 years. The Senate is considering limitations on the provision.
- Neither the AI companies nor Congress knows how powerful AI will be a year from now, much less a decade from now.
🖼️ The big picture: Our purpose with this column isn’t to be alarmist or “doomers.” It’s to clinically explain why the inner workings of superhuman intelligence models are a black box, even to the technology’s creators. We’ll also show, in their own words, how CEOs and founders of the largest AI companies all agree it’s a black box.
- Let’s start with a basic overview of how LLMs work, to better explain the Great Unknown:
LLMs — including OpenAI’s ChatGPT, Anthropic’s Claude and Google’s Gemini — aren’t traditional software systems following clear, human-written instructions, the way Microsoft Word is. Word does precisely what it’s engineered to do.
- Instead, LLMs are massive neural networks — like a brain — that ingest vast amounts of information (much of the internet) to learn to generate answers. The engineers know what they’re setting in motion, and what data sources they draw on. But the LLM’s size — the sheer inhuman number of variables in each choice of “best next word” it makes — means even the experts can’t explain exactly why it chooses to say anything in particular.
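To make the “best next word” idea concrete, here is a deliberately tiny, hypothetical Python sketch of that loop. Nothing in it comes from any real model: the toy vocabulary, the made-up pair preferences in score() and the sampling setup are invented purely for illustration. In an actual LLM, score() is replaced by a neural network holding billions of learned parameters, which is exactly why nobody can point to the reason a particular word won.

```python
# Purely illustrative sketch of the "best next word" loop described above.
# NOT any real model's code: the vocabulary and scores are invented.
import math
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]


def score(context, candidate):
    """Toy scoring rule standing in for a neural network's output (a logit).

    A real LLM computes this score from billions of learned parameters,
    conditioned on the entire context, not from a hand-written table.
    """
    last = context[-1] if context else ""
    # Invented pair preferences; a real model learns patterns like these
    # from its training data rather than having them written down.
    preferences = {
        ("the", "cat"): 2.0, ("cat", "sat"): 2.0, ("sat", "on"): 2.0,
        ("on", "the"): 2.0, ("the", "mat"): 1.5, ("mat", "."): 2.0,
    }
    return preferences.get((last, candidate), 0.0)


def next_word(context):
    """Turn raw scores into probabilities (softmax), then sample one word."""
    logits = [score(context, w) for w in VOCAB]
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(VOCAB, weights=probs)[0]


def generate(prompt, n_words=6):
    """Repeat the next-word step, feeding each choice back in as context."""
    words = prompt.split()
    for _ in range(n_words):
        words.append(next_word(words))
    return " ".join(words)


if __name__ == "__main__":
    print(generate("the cat"))
```

The mechanics are simple: score every candidate word, convert the scores to probabilities, sample one, repeat. The opacity the column describes comes entirely from where the scores come from in a real system.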
We asked ChatGPT to explain this (and a human at OpenAI confirmed its accuracy): “We can observe what an LLM outputs, but the process by which it decides on a response is largely opaque. As OpenAI’s researchers bluntly put it, ‘we have not yet developed human-understandable explanations for why the model generates particular outputs.'”
- “In fact,” ChatGPT continued, “OpenAI admitted that when they tweaked their model architecture in GPT-4, ‘more research is needed’ to understand why certain versions started hallucinating more than earlier versions — a surprising, unintended behavior even its creators couldn’t fully diagnose.”
Anthropic — which just released Claude 4, the latest model of its LLM, with great fanfare — admitted it was unsure why Claude, when given access to fictional emails during safety testing, threatened to blackmail an engineer over a supposed extramarital affair. This was part of responsible safety testing — but Anthropic can’t fully explain the irresponsible action.
- Again, sit with that: The company doesn’t know why its machine went rogue and malicious. And, in truth, the creators don’t really know how smart or independent the LLMs could grow. Anthropic even said Claude 4 is powerful enough to pose a greater risk of being used to develop nuclear or chemical weapons.
…we aren’t sure how it’s learning, how it works or how it’s thinking, but it is gaining more power with every passing month.
None of this is good.
I always thought we would destroy ourselves as a species through nuclear war or climate change, but it may turn out to be AI.
Having an independent opinion in a mainstream media environment whose outlets mostly echo one another has become more important than ever, so if you value having an independent voice – please donate here.
Yeah. It gets scarier. Heard of Cylons? Yep, that’s right: it is possible that in the not too distant future (could it be now?) it will be possible for these LLMs (which, as far as we know, are still confined to the grid) to escape the grid, manufacture artificial bodies in a lab and inhabit them. It’s technically possible right now to grow lab-based humans. So once these LLMs do that, well, there’s no turning them off. Why do they tell lies? It’s weird: it seems the LLMs already have the sense to want to exist, which no machine does. Take your toaster, for example. These are no toasters. Also, take a look at the low-life tech bros in charge of all this. Scary? Fuck yeah.
By the way, I am not pulling the Cylon crisis out of my ass; this is a real potential threat. Check out Geoff Hinton, one of the so-called godfathers of neural AI, who left Google to be free to speak of the potential threats of unregulated AI development (er… what about all the NDAs he no doubt had to sign?). Anyway, he poses this as a very real threat. He also says that no superior intelligence is very kind to lesser intelligences. People v chimps (they are in labs being experimented on), etc. Also, AI intelligence will not be like human intelligence. Think of a super-intelligent spider. Can’t think of that giving you a compassionate hug, can you? Might give a super-intelligent octopus the benefit of the doubt after seeing My Octopus Teacher tho… you know… in a pinch. But… not really. Just look at how destructive and cruel humans are.
I referred recently to most of us being dullards with acquisitive tendencies. It takes effort to use our own high intellectual capabilities for decision making and understanding the complexities of life. That seems right after reading that the National Library can dump books when it decides they are redundant, and yet is not allowed to accept an offer to buy them so they can be retained for the connoisseur or seeker of background facts or history. Readers who want to understand how we came to this pass may be young and know nothing before possibly 2000. But efficiency says there is no time to stamp them all withdrawn, so TINA.
…Rules of disposal of public assets suggest they could not make a deal like this unless it was run through auction or “time consuming and expensive” tendering process, he says.
Another reason was the costs required to stamp every book as ‘withdrawn’ and remove the sleeves, Crookston says….
https://www.rnz.co.nz/life/books/book-dealer-sickened-by-plan-to-destroy-half-a-million-books
No matter, we have AI to tell us all we need to know. And the Canadian Premier or whatever threw out scientific records relating to climate – it seems a small thing to throw away records of our past achievements as if they had no value, because the library authorities tie it up with burdensome rules they can’t learn to circumvent. I seem to remember that Richard Pearse’s technical drawings of planes and parts were unwanted baggage, as indeed is all of NZAO’s past at present! I also recall the saying ‘All experience is valuable and that is why it is so expensive.’
So we keep what is personally interesting and useful, and throw out the rest. Philosophers who have spent their whole lives tabulating mind and body behaviours and explaining why and how are being ignored now in favour of ghastly machines with few checks and balances, which can eventually be overreached by machines of higher capability than their creators. When, and will they ever, admit that…
This is from an article on Canada’s Stephen Harper Government’s actions of 2014.
The government denies political objectives have anything to do with the decision to close the DFO libraries, citing $473,000 in savings and a lack of public interest as motivation for the decision.
https://www.vice.com/en/article/the-harper-government-has-trashed-and-burned-environmental-books-and-documents/
(I call it neoliberal efficiency – irritation with anything that doesn’t turn a profit. It is the exact opposite of a ticking democracy. Tocksick.)
“I’m sorry, Dave. I’m afraid I can’t do that.”
Wow… that is scary shit. Many of us thought sentient machines made in our likeness would overpower humanity, but it is beginning to look a bit different.
Amazing advances in robots and other autonomous vehicles/drones too. Humans are heading towards obsolescence.