AI large language models (LLMs)—such as Chat GPT, Claude, and Gemini (formerly Bard)—appear to go through a predictable hype cycle. Posts trickle out about a new model’s impressive capabilities, people are floored by the model’s sophistication (or experience existential dread over losing their jobs), and, if you’re lucky, someone starts claiming that this new-and-improved LLM is displaying signs of sentience.

This hype cycle is currently in full force for Claude 3, an LLM created by the U.S.-based AI company Anthropic. In early March, the company introduced its latest lineup of AI models, Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus—all in ascending order of capability. The new models delivered updates across the board, including near-perfect recall, less hallucinations (a.k.a. incorrect answers), and quicker response times.

“Opus, our most intelligent model, outperforms its peers on most of the common evaluation benchmarks for AI systems, including undergraduate level expert knowledge (MMLU), graduate level expert reasoning (GPQA), basic mathematics (GSM8K), and more,” Anthropic wrote in its announcement release. “It exhibits near-human levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence.”

Following the announcement, AI experts posted their own thoughts on X (formerly Twitter), and detailed some pretty impressive results. As Live Science details, one expert compared how quickly Claude 3 could summarize a 42-page PDF (almost instantly) to Open AI’s Chat GPT-4 (much slower).

But things got creepier when Anthropic prompt engineer Alex Albert pulled back the testing curtain to detail one of the more strange responses Claude 3 gave when fulfilling certain tasks designed to stump it. In a post on X, Albert said they were performing a “needle-in-the-haystack eval,” where the a sentence is inserted into a random documents and then a question is asked that only that sentence can answer. Claude 3’s response to the question was…surprising.

“When we ran this test on Opus, we noticed some interesting behavior—it seemed to suspect that we were running an eval on it,” Albert posted on X. “Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities.”