Scaling AI with 8 to 20x energy efficiency

As AI becomes part of daily life for people and organizations around the world, that shift brings a key question from leaders: can AI scale sustainably? This question is especially real in the communities where datacenters operate. Leaders need clear, credible answers about what it takes to run AI on a local and global scale, how much energy and water it uses to serve a user request today, and what we at Microsoft are doing to improve efficiency over time as we scale access to AI.

Our recent research study by Microsoft AI for Good Lab, Microsoft Sustainability, and Azure, published in the peer-reviewed energy journal Joule, answers this question. For organizations evaluating AI adoption, understanding per‑user energy and water impact is essential for scaling responsibly. When a user sends a text request (“a query”) to a large language model (LLM), like the AI models powering Microsoft Copilot, the system reads the input and then generates a response one piece at a time. Each piece is called a “token,” roughly equivalent to three-quarters of a word. This process, known as “inference,” runs on specialized hardware inside datacenters.

The energy used per query depends on how many tokens are read and generated, how fast the hardware processes them, how large and resource-consuming the LLM is, and how efficiently the whole system is managed.

The key finding of this study: AI at scale is significantly more efficient than previously reported in the literature and media. The analysis, focused on serving AI at large scale, finds that a typical AI query to some of the largest and most capable LLMs uses between 0.16 and 0.60 watt-hours of electricity, depending on the length of the query, the LLM used, and datacenter specifications. This is equivalent to the electricity used by a PC (~40 W) running for 15 to 60 seconds, or by a home microwave oven (1000 W) running for 0.6 to 2 seconds. That is 4 to 20 times less energy than previous measurements, as described in the study, mainly because those past reports didn’t account for how efficient large-scale AI systems are.
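The device comparisons above are a plain unit conversion, and a minimal sketch can reproduce them. The function and constant names below are illustrative and not taken from the study; the wattages are the approximate values quoted in the text.

```python
# Convert an energy budget in watt-hours into seconds of runtime for a device
# drawing a constant power. All names here are illustrative assumptions.

def seconds_of_use(energy_wh: float, power_w: float) -> float:
    """Seconds a device drawing power_w watts can run on energy_wh watt-hours."""
    return energy_wh / power_w * 3600.0

QUERY_ENERGY_WH = (0.16, 0.60)   # per-query range quoted in the text
PC_POWER_W = 40.0                # approximate PC draw quoted in the text
MICROWAVE_POWER_W = 1000.0       # home microwave draw quoted in the text

for label, power in (("PC", PC_POWER_W), ("microwave", MICROWAVE_POWER_W)):
    low, high = (seconds_of_use(e, power) for e in QUERY_ENERGY_WH)
    print(f"{label}: {low:.1f} to {high:.1f} s")
```

Under these assumptions the PC comparison works out to roughly 14 to 54 seconds and the microwave comparison to roughly 0.6 to 2.2 seconds, consistent with the rounded figures in the text.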

Understanding energy per query also allows us to estimate the amount of cooling water consumed by a typical query. For large production models under conservative assumptions, we estimate that a typical query uses in the range of 0.0 to 0.067 mL of water, with a median water use equivalent to about one-hundredth of a teaspoon or less than a single drop. As datacenter designs continue to evolve, including our rollout of zero water datacenter designs, this amount of water is expected to decrease further.

Bigger systems unlock greater efficiency

Our analysis considered the efficiency of AI inference at scale: usually the bigger an LLM serving system is, the more efficient it becomes for each individual query or user. Think of it as a major airline versus a small regional carrier. A small airline running just a few flights can’t do much if a plane is half-empty—that’s just wasting fuel or underutilizing aircraft. But a large airline running thousands of flights every day can constantly adjust, fill up planes, reroute aircraft, and apply fuel-saving techniques across every single flight at once.

AI works the same way. When billions of queries are served by a hyperscaler such as Microsoft Azure, thousands of requests can be processed at the same time, multiple efficiency optimization techniques can be applied at various stages of the AI inference process, and trade-offs can be made to reduce the resource consumption of the whole system or product without compromising user experience or response quality. Usually, the bigger the system, the more these efficiency improvements compound.

At a billion queries a day, efficiency cuts energy use in half

Leading AI products already serve on the order of a few billion queries every single day. The analysis in the study shows that serving one billion queries, assuming those are conversational queries with a few hundred tokens per interaction, takes about 0.7 gigawatt-hours (GWh) of electricity at baseline, roughly comparable to about 0.4% of the energy US households use watching TV each day. But when smart efficiency improvements are applied, that number drops by more than half, to about 0.3 GWh.
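As a quick check on the figures above, the fleet-level totals can be converted into an implied average energy per query and a fractional reduction. This is a minimal sketch using only the rounded numbers quoted in the text; the per-query averages it prints are derived from those rounded totals rather than stated in the study, and all names are illustrative.

```python
# Derive implied per-query energy and the fractional saving from the
# fleet-level totals quoted in the text (rounded values, so results are approximate).

QUERIES_PER_DAY = 1_000_000_000
BASELINE_GWH = 0.7   # baseline total quoted in the text
IMPROVED_GWH = 0.3   # total after efficiency improvements, quoted in the text

def wh_per_query(total_gwh: float, queries: int) -> float:
    """Average watt-hours per query implied by a fleet-level total (1 GWh = 1e9 Wh)."""
    return total_gwh * 1e9 / queries

baseline_wh = wh_per_query(BASELINE_GWH, QUERIES_PER_DAY)
improved_wh = wh_per_query(IMPROVED_GWH, QUERIES_PER_DAY)
reduction = 1.0 - IMPROVED_GWH / BASELINE_GWH

print(f"baseline: {baseline_wh:.2f} Wh/query")
print(f"improved: {improved_wh:.2f} Wh/query")
print(f"reduction: {reduction:.0%}")
```

With these inputs the reduction comes out at about 57%, which matches the "more than half" statement.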

Figure: Energy required to serve 1 billion queries per day. “Conversational” = typical queries (median ~300 output tokens). “Mixed” = 90% conversational + 10% long queries (median ~5,000 output tokens). Efficiency improvements reflect conservative line-of-sight gains across model, serving, and hardware layers. Source: Oviedo et al., Joule (2026).

Even when 10% of queries are longer, more complex tasks that consume more than ten times the tokens, such as code generation or multi-step reasoning, our study shows that efficiency improvements still cut total energy use by more than half relative to the baseline.

Microsoft is actively investing in multiple efficiency levers

Efficiency at scale doesn’t happen on its own. It takes deliberate research and development and investment. The study estimates the impact of three main categories of efficiency improvements:

Optimized models and the right model for a task. Carefully designed and specialized models, such as Microsoft’s Fara-7B and Phi models, can match the performance of much larger ones at a small fraction of the energy and cost. In the same way, intelligent model routing, such as Microsoft’s Model Router in Azure AI Foundry, is designed to automatically direct simple questions to lightweight models and reserve large models for complex tasks. Model improvements of this kind, as described under the modeling assumptions in the study, can lead to 5 to 10x reductions in energy use in the near term.
Smarter AI serving. Beyond models, queries must be orchestrated in a datacenter to maximize efficiency while providing a great customer experience. Techniques such as disaggregated serving or adaptive serving, which Microsoft is implementing, can reduce energy use substantially. For long queries generating thousands of tokens, these serving optimizations are especially impactful: the study estimates reductions in energy use of up to 5x.
Better hardware. Next-generation chips deliver substantially more computation per watt. Together with datacenter-level energy use improvements, the study estimates that advances in GPU hardware offer at least a 1.5x to 2.5x reduction in energy per query. And custom AI chips built for inference, such as Microsoft’s Maia 200, can provide even larger efficiency gains.

These improvements build on each other. In the study, we estimate that these efficiency gains, many of which are currently being implemented or scaled up, combine to a near-term reduction in energy per query of 8 to 20x. An efficiency gain made in one area becomes the new starting point for everything that runs on the platform going forward.

Scaling AI responsibly

AI is becoming something that billions of people rely on every day—to learn, to work, and to create. As that happens, it is important that we make sure growing access to AI doesn’t mean growing pressure on local energy grids or on water supplies.

This research shows that scaling AI does not require proportional increases in energy or water use. With the right engineering and investment decisions, organizations can grow AI adoption while improving efficiency. Microsoft remains committed to making that possible by advancing AI capability and infrastructure efficiency together.