Most of the companies that have fully committed to building AI models are gobbling up every Nvidia AI accelerator they can get, but Google has taken a different approach. Most of its cloud AI infrastructure is based on its line of custom Tensor Processing Units (TPUs). After announcing the seventh-gen Ironwood TPU in 2025, the company has moved on to the eighth-gen version, but it’s not just a faster iteration of the same chip.
The new TPUs come in two flavors, providing Google and its customers with an AI platform that is faster and more efficient, the company says. Google is pushing the idea that the “agent era” is fundamentally different from the AI systems that came before, necessitating a new approach to the hardware. So engineers have devised the TPU 8t (for training) and the TPU 8i (for inference).
Before AI models become something you can use to analyze data or make silly memes, they need to be trained. The TPU 8t was designed specifically for this part of the AI lifecycle to reduce the training time for frontier AI models from months to weeks.
Updated TPU 8t server clusters, which Google calls “pods,” now house 9,600 chips with two petabytes of shared high-bandwidth memory. Google claims the TPU 8t can even scale linearly, with up to a million chips in a single logical cluster. Innovations like this are making it faster to train super-sized AI models while also driving up RAM prices for everyone else. But if you’re involved in building those giant models, all this hardware saves time, delivering an impressive 121 exaflops of FP4 compute per pod. That’s almost three times higher than Ironwood’s training compute ceiling.
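Some quick back-of-envelope math puts those figures in perspective (a Python sketch using only the numbers above; the per-chip value and the implied Ironwood ceiling are our derivations, not official specs):

# Back-of-envelope math from Google's stated pod figures.
# The per-chip number is derived, not an official spec.
pod_chips = 9_600
pod_eflops_fp4 = 121

# Implied per-chip throughput in petaflops (1 EFlops = 1,000 PFlops)
per_chip_pflops = pod_eflops_fp4 * 1_000 / pod_chips
print(f"~{per_chip_pflops:.1f} PFlops of FP4 compute per TPU 8t chip")  # ~12.6

# "Almost three times" Ironwood implies a prior training ceiling
# of roughly 121 / 3, or about 40 EFlops per pod.
print(f"Implied Ironwood ceiling: ~{pod_eflops_fp4 / 3:.0f} EFlops")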
So the new chips allow for faster training, but Google also says you get more useful computation for every watt you pump into a TPU 8t. The company claims a “goodpute” rate of 97 percent, which means less waiting and less wasted effort. With better handling of irregular memory access, automatic recovery from hardware faults, and real-time telemetry across all connected chips, the TPU 8t spends more time actively advancing model training.
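To make “goodpute” concrete: it’s roughly the fraction of wall-clock time the chips spend doing useful training work, so it translates directly into schedule. A quick illustration with a hypothetical run:

# Illustrative only: how "goodpute" affects wall-clock training time.
# A job needing 30 days of pure compute stretches out as goodpute drops.
ideal_days = 30  # hypothetical compute requirement

for goodpute in (0.97, 0.85, 0.70):
    print(f"goodpute {goodpute:.0%}: ~{ideal_days / goodpute:.1f} days wall-clock")

# At 97 percent, a 30-day job takes about 30.9 days; at 70 percent,
# the same job takes nearly 43.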
When training is done, AI models run in inference mode to generate tokens—that’s the process happening behind the scenes when you tell a model to do something. Inference doesn’t require as much horsepower, so using the same hardware for both parts of the AI lifecycle is inefficient. That’s why inference is the purview of the TPU 8i, which is designed to run multiple specialized agents more efficiently and with lower latency. TPU 8i chips also come in larger pods of 1,152 chips, versus just 256 for the last-gen Ironwood inference clusters. That works out to 11.6 exaflops per pod, far below a TPU 8t pod.
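Dividing those pod figures through suggests the gap is mostly pod size rather than a dramatically weaker chip (again derived from the numbers above, with the caveat that the two figures may not use the same precision format):

# Rough per-chip comparison derived from the pod-level figures.
# Caveat: the precision formats behind the two numbers may differ.
tpu8t = 121 * 1_000 / 9_600    # ~12.6 PFlops per training chip
tpu8i = 11.6 * 1_000 / 1_152   # ~10.1 PFlops per inference chip
print(f"TPU 8t: ~{tpu8t:.1f} PFlops/chip, TPU 8i: ~{tpu8i:.1f} PFlops/chip")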
The TPU 8i has less raw power than TPU 8t. Credit: Google
Google has tripled the amount of on-chip SRAM in each TPU 8i to 384 MB. That allows the new chips to keep a larger key-value (KV) cache on the chip, speeding up models with longer context windows. The eighth-gen AI accelerators are also the first from Google to rely solely on the company’s custom Axion ARM CPUs as hosts, with one CPU for every two TPUs. In Ironwood systems, each x86 host CPU serviced four TPU chips. Google says this “full-stack” ARM-based approach allows for much greater efficiency.
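To see why on-chip SRAM matters for the KV cache, consider the standard sizing formula for a transformer: each token stores one key vector and one value vector per layer. The model configuration below is hypothetical, purely for illustration:

# Rough KV-cache sizing for a hypothetical transformer.
# bytes/token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
layers, kv_heads, head_dim = 48, 8, 128  # hypothetical model config
bytes_per_value = 1                      # assume an 8-bit cache format

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
sram_bytes = 384 * 1024 * 1024

print(f"{kv_bytes_per_token // 1024} KiB of KV cache per token")    # 96 KiB
print(f"~{sram_bytes // kv_bytes_per_token:,} tokens fit on-chip")  # ~4,096

Even a few thousand tokens of hot cache held in SRAM spares a round-trip to high-bandwidth memory on every decoding step, which is where the long-context speedup comes from.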
An efficiency play
It makes sense that efficiency is a core part of Google’s new TPU setup. Training and running frontier AI models is expensive, and the return on investment remains unclear. Companies are still burning money on generative AI in the hope that the economics will turn the corner at some point. Maybe Google’s new TPUs will help get there, maybe not, but the company has made notable improvements.
Generative AI systems consume a lot of power, which is often cited as one of the primary reasons not to use them. The eighth-gen TPUs don’t exactly sip power, but Google claims the chips offer twice the performance per watt of Ironwood. Google also touts improvements in its data centers, which are apparently “co-designed” with the TPUs. Features like integrating networking with compute on a single chip and more efficient pod layouts have reportedly increased computing power per unit of electricity sixfold. Of course, that doesn’t mean data centers will use less power, just that they get more compute for all the power they use.
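That last distinction is just arithmetic: at a fixed facility power budget, an efficiency gain shows up as more compute, not a smaller power bill. A trivial sketch with made-up absolute numbers:

# Efficiency gains at a fixed power budget yield more compute,
# not less power draw. All absolute numbers here are made up.
facility_mw = 100            # hypothetical fixed power budget
old_compute_per_mw = 1.0     # arbitrary baseline units
gain = 6                     # Google's claimed sixfold improvement

print(f"Power draw: {facility_mw} MW before and after")
print(f"Compute: {facility_mw * old_compute_per_mw * gain:.0f} units, "
      f"up from {facility_mw * old_compute_per_mw:.0f}")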
Water usage for cooling data centers is another big efficiency concern. The heat generated by dense AI server deployments can’t be dissipated with air alone, so liquid cooling is the only practical option. Google has adapted its fourth-gen liquid-cooling setup to the new chips, using actively controlled valves to adjust water flow based on workload. Again, this is supposed to be more efficient.
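The general idea behind workload-based valve control is simple; here’s a purely illustrative sketch, not Google’s actual control scheme:

# Purely illustrative: map chip load to a coolant valve opening.
# Google's actual control scheme is not public; this is the general idea.
def valve_opening(load, min_open=0.2, max_open=1.0):
    """Return a valve opening fraction for a chip load in [0, 1]."""
    load = max(0.0, min(1.0, load))
    return min_open + (max_open - min_open) * load

for load in (0.1, 0.5, 0.95):
    print(f"load {load:.0%} -> valve {valve_opening(load):.0%} open")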
Two racks of TPU 8i chips. Each rack has eight boards with four TPUs. Credit: Google
The TPU 8t and TPU 8i will power Google’s Gemini-based agents in the future, but they are also designed with third-party developers in mind. Both new TPUs support the frameworks developers already use, including JAX, MaxText, PyTorch, SGLang, and vLLM.
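In practice, that means code written for today’s accelerators should carry over. A minimal JAX sketch (this is generic JAX, not TPU 8-specific; it runs unchanged on CPU, GPU, or TPU, and jax.devices() simply reports whatever backend is attached):

import jax
import jax.numpy as jnp

# On a TPU host this lists the attached TPU cores; elsewhere, CPU or GPU.
print(jax.devices())

@jax.jit  # compiled through XLA for whatever backend is present
def predict(weights, x):
    return jnp.tanh(x @ weights)

key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (512, 512))
x = jax.random.normal(key, (8, 512))
print(predict(w, x).shape)  # (8, 512)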
Nvidia’s stock price briefly dropped about 1.5 percent after Google’s announcement, but it has since recovered and is again over $200 per share. Surging demand for AI accelerators has more than doubled Nvidia’s value over the past year, and Google’s gains have been even greater. Such is the nature of the potential AI bubble. Of course, the companies benefiting most don’t see it as a bubble—they see this as the beginning of an agentic AI future.