The integration of generative artificial intelligence into the world’s dominant search engine was intended to revolutionize how users interact with information, yet recent performance issues have highlighted a fundamental disconnect between machine learning capabilities and basic linguistic logic. Google’s AI Overview feature, the centerpiece of its search engine overhaul, has come under scrutiny for a series of high-profile errors involving simple spelling and character counting. According to the AI-generated summaries now appearing at the top of many search results, there are two "p"s in the word "Google," a single "r" in the word "poop," and two "d"s in the word "journalism," a word the AI subsequently spelled as "j-o-u-r-n-a-d-i-s-m." Furthermore, while the system correctly identified a single "p" in the surname of a former U.S. president, it presented the spelling as "t-r-p-u-m."
These errors are not isolated incidents; they reflect a recurring challenge for the developers of Large Language Models (LLMs). Despite their ability to synthesize complex medical data, write functional software code, and solve advanced mathematical problems, these systems frequently stumble over tasks that a primary school student would find trivial. The persistence of these "hallucinations" (the industry’s term for confident but incorrect AI outputs) suggests that the transition from traditional indexed search to generative search remains fraught with technical hurdles.
A Chronology of Google’s Generative Search Evolution
The journey toward an AI-first search experience began in earnest in early 2023, following the viral success of OpenAI’s ChatGPT. Google, fearing a threat to its core advertising business, declared a "code red" and accelerated its development of Bard, which later evolved into Gemini. In May 2023, the company introduced the Search Generative Experience (SGE) as an experimental feature within Search Labs, allowing a limited group of users to test AI-generated summaries.
By May 2024, at its annual I/O developer conference, Google announced that AI Overviews would be rolled out to hundreds of millions of users in the United States, with plans to expand globally. However, the initial public launch was met with immediate controversy. Within days, users reported that the AI was citing satirical sources, such as The Onion, or unverified Reddit comments as factual medical and culinary advice. Notable examples included the AI suggesting that users should eat at least one small rock per day for mineral intake or use non-toxic glue to keep cheese from sliding off pizza.
Google responded by scaling back the frequency of AI Overviews for certain "nonsensical" queries and refining the types of websites the model could use as sources. Despite these patches, the current wave of spelling and counting errors demonstrates that the underlying architecture of the model still struggles with the granular components of language.
The Technical Reality: Why AI Cannot Count
The reason for these persistent spelling failures lies in the fundamental architecture of modern LLMs, specifically the Transformer model. Unlike humans, who learn to read by associating sounds with individual letters and then combining those letters into words, AI models process language through a system called tokenization.
When a user enters a query into Google Search, the AI does not "see" the word "strawberry" as a sequence of ten letters. Instead, it converts the word into one or more tokens, chunks of text that are each represented by a number. Depending on the specific tokenizer used, "strawberry" might be a single token, or it might be broken into "straw" and "berry." Because the model operates on these numerical chunks rather than individual characters, it has no direct view of the word’s internal composition.
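The idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the vocabulary `TOY_VOCAB`, its ID numbers, and the greedy longest-match rule in `tokenize` are hypothetical, and production tokenizers learn their vocabularies from data and use more elaborate merging schemes. The point is only that the output handed to the model is a list of numbers with no letters in it.

```python
# Hypothetical toy vocabulary: pieces of text mapped to integer IDs.
# Real tokenizers learn tens of thousands of such pieces from data.
TOY_VOCAB = {"straw": 101, "berry": 102, "the": 103}
# Single-letter fallbacks (a=1 ... z=26) so any lowercase word can be encoded.
TOY_VOCAB.update(
    {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz", start=1)}
)


def tokenize(word, vocab=TOY_VOCAB):
    """Split a lowercase word into token IDs using greedy longest-match."""
    ids = []
    pos = 0
    while pos < len(word):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {word[pos:]!r}")
    return ids


print(tokenize("strawberry"))  # [101, 102]
print(tokenize("the"))         # [103]
```

Nothing in `[101, 102]` records that the original word contained three "r"s. A model that receives only those IDs would have to have learned the spelling separately.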
Matthew Guzdial, an AI researcher and assistant professor at the University of Alberta, explains that when an LLM encounters the word "the," it recognizes the encoding for that specific concept but does not necessarily "know" that it consists of the letters T, H, and E. To the model, the token for a word is an atomic unit. Asking an AI to count the letters in a word is akin to asking someone to identify the chemical composition of a brick by looking at the outside of a house: the observer can see that it is a brick, but not the atoms that make it up.
This phenomenon is famously illustrated by the "strawberry" test, a long-standing informal benchmark in the AI community in which users ask models how many "r"s are in the word "strawberry." Most models, including advanced iterations like GPT-4 and Gemini, have historically failed this test, often insisting there are only two "r"s because tokenization hides the word’s individual letters from the model.
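For contrast, a conventional program that works directly on characters gets these questions right every time. The short Python sketch below uses a hypothetical helper, `count_letter`, and checks the "strawberry" test along with three of the words cited in the erroneous AI Overviews. It is shown only as a point of comparison, not as a description of how any search system works.

```python
def count_letter(word, letter):
    """Count occurrences of a letter by inspecting each character."""
    target = letter.lower()
    return sum(1 for ch in word.lower() if ch == target)


# Words and letters taken from the examples discussed in the article.
checks = [
    ("strawberry", "r"),
    ("Google", "p"),
    ("poop", "r"),
    ("journalism", "d"),
]
for word, letter in checks:
    print(f'"{word}" contains {count_letter(word, letter)} "{letter}"')
```

Run as written, this reports three "r"s in "strawberry" and zero matches for each of the other three words.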

Official Responses and Iterative Fixes
In a statement provided to TechCrunch, Google acknowledged the specific difficulties associated with character-level processing. "Counting within words has been a known challenge for LLMs, and we’re working to fix this particular issue," a company spokesperson stated. This admission reflects a broader industry-wide effort to bridge the gap between semantic understanding and syntactic precision.
Google has already moved to patch some of the more egregious errors. For instance, a recent bug involved the word "disregard." When users searched for the definition of "disregard," the AI Overview would display what appeared to be a dictionary entry, but the definition text read: "Understood. Let me know whenever you have a new prompt or question!" This suggested that the model had misinterpreted the search term as a command to ignore previous instructions—a classic "prompt injection" style failure triggered by a single vocabulary word.
While Google continues to deploy manual overrides and fine-tuning to address these glitches, researchers remain skeptical about whether a perfect solution exists within the current Transformer framework. Sheridan Feucht, a PhD student specializing in LLM interpretability at Northeastern University, suggests that the "fuzziness" of tokenization is a feature of the system’s efficiency, not just a bug. Creating a "perfect" token vocabulary that accounts for every possible character permutation would likely degrade the model’s ability to understand context and nuance at scale.
Broader Implications for Information Integrity
The stakes for Google are significantly higher than for other AI developers. While a chatbot like ChatGPT is understood to be a conversational partner that can occasionally be wrong, Google Search is the primary gateway to the internet for billions of people. It is viewed as a utility and a source of truth. When the world’s most trusted search engine provides factually incorrect information—even regarding something as minor as the spelling of "journalism"—it erodes user confidence in the entire platform.
There are also significant economic implications. Google’s transition to AI-forward search is an attempt to defend its market share against emerging competitors like Perplexity AI and OpenAI’s rumored search products. However, the high computational cost of generating AI responses, combined with the need for constant human-in-the-loop monitoring to catch errors, presents a challenge to Google’s profit margins.
Furthermore, the "hallucination" problem raises legal and ethical questions regarding misinformation. If an AI Overview provides incorrect medical advice or misattributes a quote to a political figure, the liability frameworks are still largely untested. The current trend of "spelling stumbles" serves as a visible reminder that these systems are probabilistic engines rather than deterministic databases. They predict the next most likely token in a sequence; they do not "know" facts in the way humans define knowledge.
The Path Forward for Generative Search
To combat these issues, developers are exploring "Chain of Thought" (CoT) prompting and "System 2" thinking for AI. These approaches prompt or train the model to break a task down into steps (for example, first spelling the word out letter by letter and then counting the matching letters) rather than attempting to provide an answer in a single pass. Some newer models have shown improvement in the "strawberry" test by using these multi-step reasoning techniques.
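The spell-first, count-second decomposition can be mimicked in ordinary code. The Python sketch below is an analogy only: `spell_then_count` is a hypothetical function written for illustration, and it does not show how a language model reasons internally. It simply makes the two steps explicit and keeps a written trace of the intermediate work, much as a step-by-step answer would.

```python
def spell_then_count(word, letter):
    """Step 1: spell the word out. Step 2: count matches, keeping a trace."""
    target = letter.lower()
    spelled = list(word.lower())  # Step 1: one entry per character.

    trace = []
    count = 0
    for position, ch in enumerate(spelled, start=1):
        hit = ch == target
        if hit:
            count += 1  # Step 2: tally each match as it is seen.
        trace.append(f"{position}. {ch}" + ("  <- match" if hit else ""))
    return spelled, trace, count


spelled, trace, count = spell_then_count("strawberry", "r")
print("-".join(spelled))  # s-t-r-a-w-b-e-r-r-y
print("\n".join(trace))
print(f"Total: {count}")  # Total: 3
```

Because every letter is written out before the tally, a mistake at either step is visible in the trace instead of being hidden inside a single-pass answer.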
However, as long as the core architecture remains dependent on tokenization, the tension between linguistic fluidity and character-level accuracy will persist. For the average user, the takeaway is clear: while AI can be an incredibly powerful tool for summarization and creative brainstorming, it is not yet a reliable substitute for traditional fact-checking.
The "two Ps in Google" error is more than just a humorous anecdote; it is a diagnostic look into the limitations of current artificial intelligence. As Google continues to double down on making generative AI the centerpiece of its 29-year-old flagship product, the company faces the monumental task of teaching a system that thinks in numbers how to respect the rigid rules of the alphabet. Until that gap is closed, the burden of verification remains firmly on the shoulders of the human user.
