According to a new study, the more accurate we try to make AI models, the larger their carbon footprint becomes, with some prompts emitting up to 50 times more CO₂ than others.
Specialized large language models (LLMs) known as reasoning models—like DeepSeek’s R1, OpenAI’s o3, and Anthropic’s Claude—invest more time and processing resources to generate replies that are more accurate than those of their predecessors.
However, despite some remarkable outcomes, these models have been shown to hit significant limits when solving complicated problems. A group of academics has now identified the models’ enormous carbon footprint as another performance limitation. Their results were published in the journal Frontiers in Communication on June 19.
According to study first author Maximilian Dauner, a researcher at Hochschule München University of Applied Sciences in Germany, the environmental impact of querying trained LLMs is heavily influenced by their reasoning approach: explicit reasoning significantly increases energy consumption and carbon emissions. The team found that reasoning-enabled models emit up to 50 times more CO₂ than models that give concise responses.
To respond to prompts, LLMs break language down into tokens, word segments that are converted into strings of numbers and fed into neural networks. These networks are fine-tuned on training data that determines the likelihood of particular patterns appearing, and they then generate answers based on those probabilities.
Reasoning models also try to improve accuracy through a process known as “chain-of-thought.” This approach breaks a large problem down into smaller, more digestible intermediate steps that follow a logical flow, similar to how a person might work through the same problem.
However, these models use far more energy than traditional LLMs, which could be a financial barrier for businesses and consumers looking to deploy them. Despite substantial research on the environmental effects of expanding AI use more broadly, direct comparisons of the carbon footprints of different models remain relatively rare.
The cost of reasoning
To investigate the CO₂ emissions generated by different models, the researchers in the latest study asked 14 LLMs 1,000 questions across a variety of subjects. The models ranged in size from 7 billion to 72 billion parameters.
The calculations were carried out on an NVIDIA A100 GPU using the Perun framework, which profiles LLM performance and energy consumption. The team then converted energy consumption into CO₂ by assuming that every kilowatt-hour of energy produced 480 grams of CO₂.
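The conversion step can be sketched in a few lines of Python. This is a minimal illustration of the study's assumed grid intensity of 480 grams of CO₂ per kilowatt-hour; the function name is illustrative and not part of the Perun framework.

```python
# Grams of CO2 emitted per kWh of energy (the study's assumption).
GRID_INTENSITY_G_PER_KWH = 480.0

def energy_to_co2_grams(energy_kwh: float) -> float:
    """Convert measured energy consumption (kWh) into grams of CO2."""
    return energy_kwh * GRID_INTENSITY_G_PER_KWH

# Example: a benchmark run that drew 2.5 kWh of energy.
print(energy_to_co2_grams(2.5))  # -> 1200.0 grams of CO2
```

In practice the grid intensity varies by region and energy mix, which is one reason the authors caution that their absolute numbers are estimates.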
Their findings indicate that reasoning models produced 543.5 tokens per question on average, whereas concise models produced only 37.7. Those extra tokens mean extra computation, so the more accurate reasoning models generated more CO₂.
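The scale of that gap follows directly from the reported token counts. A quick back-of-the-envelope calculation, using the study's average figures and assuming (for illustration only) that computation scales roughly linearly with tokens generated:

```python
# Average tokens generated per answer, as reported in the study.
REASONING_TOKENS = 543.5  # reasoning models
CONCISE_TOKENS = 37.7     # concise models

# Ratio of output tokens, a rough proxy for extra computation
# under the simplifying assumption of linear per-token cost.
ratio = REASONING_TOKENS / CONCISE_TOKENS
print(f"Reasoning models generate ~{ratio:.1f}x more tokens per answer")
# -> Reasoning models generate ~14.4x more tokens per answer
```

The actual emissions gap (up to 50 times) exceeds this token ratio, since per-token cost also depends on model size and hardware utilization.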
With 84.9% of the benchmark questions answered correctly, the 72-billion-parameter Cogito model was the most accurate in the study. Its CO₂ emissions were three times higher than those of similarly sized models designed to produce concise responses.
LLM systems now exhibit a pronounced accuracy-sustainability trade-off, according to Dauner: none of the models that kept emissions below 500 grams of CO₂ equivalent [total greenhouse gases emitted] answered more than 80% of the 1,000 questions correctly.
The problem extends beyond accuracy, however. Emissions spiked to six times higher for questions requiring longer reasoning, such as those in mathematics or philosophy, than for simple look-up searches.
Emissions also varied between models, the researchers estimated. Answering 60,000 queries with DeepSeek’s 70-billion-parameter R1 model would generate CO₂ emissions equivalent to a round-trip flight between New York and London. Yet Alibaba Cloud’s 72-billion-parameter Qwen 2.5 model could answer questions with comparable accuracy for one-third of the emissions.
The study’s results are not definitive; the researchers stressed that emissions can vary with the hardware and energy grids that power the models. Still, they said, the findings should encourage AI users to weigh their options before deploying the technology.
If users knew the exact CO₂ cost of their outputs, such as jokingly turning themselves into action figures, they might use these technologies more carefully and selectively, Dauner said.