Notes: Defeating Non-Determinism in LLM Inference
Source: This article is based on the research paper “Defeating Nondeterminism in LLM Inference” by Horace He and the Thinking Machines Lab (Sep 2025). What follows is commentary and study notes on their findings.
When we demand deterministic behavior from Large Language Models (LLMs), we expect a simple bargain: same input, same output. It’s a principle fundamental to any system where consistency matters - from medical diagnostics and financial modeling to autonomous vehicle decision-making.
In the realm of software and data systems, this principle underpins our confidence in computation. If we give an LLM the same prompt twice under identical conditions, we expect the same result. Yet our current reliance on high-speed, parallel inference infrastructure systematically betrays this expectation, creating what can only be described as a crisis of data integrity.
Here’s what makes this particularly insidious: we often observe that asking an LLM the same question multiple times yields different answers. Most people dismiss this as expected behavior because LLMs use “sampling” - a probabilistic process that introduces intentional randomness. What’s shocking, however, is that even when we eliminate this randomness entirely - by setting the temperature to 0 (so-called “greedy sampling”, which in theory forces the model to always pick the highest-probability token) - LLM APIs and open-source inference libraries are still not deterministic in practice.
This isn’t a minor bug. This isn’t a configuration issue. This is a fundamental system challenge that undermines debugging, validation, and advanced AI training workflows at their core.
Researchers at Thinking Machines Lab recently ran an experiment in which they generated 1,000 completions from a model using the exact same prompt with temperature set to 0. The expectation? One unique completion, repeated 1,000 times. The reality? 80 distinct completions - a level of run-to-run variance that is unacceptable for any production system that depends on reliability.
The source of this numerical anarchy requires us to journey deep into the architecture of modern GPU processing, where the intersection of hardware optimization and mathematical constraints creates perfect conditions for chaos.
Why Numbers Lie
The Incomplete Hypothesis
When trying to explain LLM nondeterminism, a common but incomplete hypothesis emerges: it’s the combination of concurrent execution and floating-point non-associativity. While this “concurrency + floating point” explanation points to the mechanism of numerical deviation, it fundamentally fails to identify the source of the nondeterministic ordering that triggers these deviations.
Before we can understand the full picture, we must first confront an uncomfortable truth about computer arithmetic itself.
Floating-Point Non-Associativity
In the pristine world of traditional mathematics, addition is associative. This is one of those bedrock properties we learn early and trust completely:
(a + b) + c = a + (b + c)
This relationship seems self-evident. It’s so fundamental that we rarely question it. However, due to the finite precision and rounding inherent in floating-point arithmetic - on GPUs and every other processor - this relationship breaks down.
In floating-point arithmetic:
(a + b) + c ≠ a + (b + c)
This violation isn’t theoretical - it’s constant, pervasive, and unavoidable.
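You can see it on any machine with a few lines of ordinary Python (standard IEEE-754 doubles; the specific values below are my own illustration, not taken from the paper):

```python
# (a + b) + c versus a + (b + c) with ordinary IEEE-754 doubles.
a, b, c = 1e20, -1e20, 1.0

left = (a + b) + c   # 1e20 - 1e20 = 0.0, then + 1.0 -> 1.0
right = a + (b + c)  # 1.0 is absorbed into -1e20 (too small to register), then cancels -> 0.0

print(left, right)   # 1.0 0.0
```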
How Precision Limits Destroy Information
Floating-point numbers represent values in a format similar to scientific notation: mantissa × 10^exponent (base 10 is used here for illustration; actual implementations use base 2). This “floating” scale gives the format a huge dynamic range, allowing computers to represent both incredibly large and incredibly small numbers within the same system.
The problem emerges when adding two numbers with vastly different “scales” - different exponents. The floating-point format must drop trailing digits to maintain its defined precision limit. This isn’t rounding for convenience; it’s information destruction dictated by hardware constraints.
A Concrete Example
Consider a system restricted to 3 significant digits of precision (vastly simplified from real systems, but the principle holds). When we add:
- 1,230 (represented as 1.23 × 10³)
- 23.4
The exact mathematical result should be 1,253.4.
However, the system must maintain only 3 significant figures. Here’s what actually happens:
- The system aligns the numbers by their exponents
- To add them, 23.4 must be represented in the same magnitude as 1,230
- In the 10³ magnitude, 23.4 becomes 0.0234 × 10³
- But we can only store 3 significant digits
- So 0.0234 × 10³ gets rounded to 0.02 × 10³ (which is 20)
- The final result: 1.23 × 10³ + 0.02 × 10³ = 1.25 × 10³ = 1,250
We’ve lost 3.4 in the process. The system has effectively rounded 23.4 down to 20 before performing the final addition.
Since LLMs involve summing vast arrays of floating-point numbers - billions of operations in sequential and parallel patterns - every time these numbers are added in a different order, a different result can emerge. Research from Thinking Machines Lab demonstrates that summing a simple eight-element array can produce 102 possible unique results depending solely on the order of accumulation.
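This order-sensitivity is easy to reproduce. The sketch below - a rough illustration of the same idea, not the paper’s exact experiment - sums the same eight float32 values under every possible ordering and counts the distinct totals; the exact count depends on the values drawn:

```python
import itertools
import math

import numpy as np

# Eight float32 values spanning very different magnitudes.
rng = np.random.default_rng(0)
values = (rng.standard_normal(8) * 10.0 ** rng.integers(0, 8, size=8)).astype(np.float32)

def ordered_sum(order):
    # Accumulate strictly left to right in float32, mimicking one fixed reduction order.
    acc = np.float32(0.0)
    for x in order:
        acc = np.float32(acc + x)
    return acc

distinct = {float(ordered_sum(p)) for p in itertools.permutations(values)}
print(f"{len(distinct)} distinct totals out of {math.factorial(8)} orderings")
```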
The Budgeted Calculator Analogy
Think of floating-point arithmetic as a calculator with a small, fixed budget for recording digits - say, only 7 significant digits total, period. No matter what numbers you’re working with, you can never exceed this limit.
Now consider this calculation: (A) 1,000,000 + (B) 0.10 + (C) -1,000,000
Order 1: (A + B) + C
- First: 1,000,000 + 0.10 = 1,000,000.10, which needs more significant digits than the budget allows
- Result: it is rounded back to 1,000,000 (the 0.10 is completely lost due to precision limits)
- Then: 1,000,000 + (-1,000,000) = 0
Order 2: A + (B + C)
- First: 0.10 + (-1,000,000) = -999,999.9, which fits exactly within the 7-digit budget
- Then: 1,000,000 + (-999,999.9) = 0.10
The same three numbers, added in different orders, produce fundamentally different results. This isn’t a bug - it’s a consequence of the finite precision budget enforced by hardware limitations.
This non-associativity is the numerical prerequisite for nondeterminism. It’s the powder keg. But something still needs to light the fuse - something has to cause the order to vary in the first place.
The True Culprit: Batch Variance
Dispelling the Atomic Add Myth
While floating-point non-associativity explains how results can differ, it doesn’t explain why they differ between identical runs of the same prompt. If the computation order were fixed, we’d get the same (possibly imprecise) answer every time.
There’s a common technical assumption in the GPU programming community: concurrent atomic adds must be causing the chaos. Atomic operations - where multiple threads try to update the same memory location simultaneously - are indeed a classic source of nondeterminism in parallel systems.
However, the research clarifies a crucial insight: this assumption is largely irrelevant for the LLM forward pass. The typical forward pass of an LLM contains virtually no operations that require atomic adds. From a computational graph perspective, the inference process is “run-to-run deterministic” - given the exact same set of concurrent inputs processed together, the server always produces the same output.
Yet from the individual user’s perspective, the experience is still complete chaos.
The Critical Failure: Lack of Batch Invariance
The real culprit is far more subtle and systemic: the lack of batch invariance.
What is Batch Invariance?
A computational kernel (a low-level GPU program) lacks batch invariance if changing the batch size changes the numerical result computed for an individual element within that batch - even though, mathematically, the operation should be independent along the batch dimension.
Think about matrix multiplication as an example. When you multiply matrices, the computation for row i in the batch should be completely independent from the computation for row j. Changing how many rows you process simultaneously shouldn’t affect the numerical result for any specific row. Mathematically, this should be absolute.
Empirically, however, this is not true.
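You can check this for yourself with a test along the lines of the one described in the source. Whether a nonzero difference actually appears depends on your GPU, data type, and library version - the point is that nothing in the math guarantees it won’t:

```python
import torch

# Requires a CUDA-capable GPU.
torch.manual_seed(0)
A = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

row_in_full_batch = torch.mm(A, B)[0]  # row 0 computed alongside 2,047 other rows
row_alone = torch.mm(A[:1], B)[0]      # row 0 computed as a "batch" of one

# Mathematically these are identical; numerically they often are not.
print((row_in_full_batch - row_alone).abs().max())
```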
Here’s the mechanism of nondeterminism laid bare:
- From the user’s perspective, the load on the inference server (i.e., the requests from other users) is completely nondeterministic and outside your control.
- The server load determines the batch size under which your request’s GPU kernel runs. High load means large batches. Low load means small batches.
- Because the kernel implementation is not batch-invariant, the change in batch size forces a change in the internal numerical reduction order - how the billions of floating-point additions are sequenced.
- Since the order of addition changes, the floating-point non-associativity kicks in, resulting in different numerical outputs.
In short, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load - and thus the batch size - nondeterministically varies, and this variance changes how your specific request is computed internally.
The Factory Floor Analogy
Imagine a high-speed factory floor designed to assemble products - this is your GPU kernel. You’ve submitted an order for a single custom widget (your LLM query).
Scenario 1: High Load (Large Batch)
- The factory is swamped with orders
- The floor manager, seeing 500 orders to process, dynamically reorganizes the assembly line
- They split the workload across 50 different teams
- Each team uses highly parallelized procedures to maximize throughput
- Your widget is processed as part of this massive parallel operation
- The specific sequence of assembly steps depends on this 500-unit batch configuration
Scenario 2: Low Load (Small Batch)
- The factory has only 10 orders
- The floor manager uses a simpler, less parallelized procedure to minimize coordination overhead
- Your widget is processed with fewer parallel teams
- The specific sequence of assembly steps is completely different
Here’s the kicker: if the internal numerical steps taken to calculate the final result of your individual request depend on whether it was processed alone or in a huge group, the factory lacks batch invariance.
Since you, the customer, have zero control over the factory’s load (the nondeterministic variable determined by all other users on the inference server), your resulting product is fundamentally unpredictable.
You submit the exact same order. You get different results. Not because of anything you did, but because of when you happened to submit it relative to everyone else.
Engineering the Fix: Achieving Batch Invariance
The Core Principle
To defeat systemic nondeterminism, we must ensure that the reduction order for each element remains fixed regardless of the batch size of the kernel. This requires fixing the parallelism strategy even if it means sacrificing peak performance in some scenarios.
This is where engineering discipline meets mathematical rigor. We’re essentially choosing reproducibility over raw throughput - a trade-off that high-reliability systems must make.
The Three Critical Operations
The challenge focuses on three core operations in transformer models that involve numerical reductions: RMSNorm, Matrix Multiplication, and Attention.
Each presents unique challenges. Let’s dissect them.
Fixing RMSNorm and Matmul
The Ideal Approach
Both RMSNorm and Matmul ideally use a “data-parallel” strategy. In this approach:
- The entire reduction for a single element is kept within a single GPU core
- Parallelization happens only along the batch dimension
- Each batch element is computed independently and identically
This naturally provides batch invariance because each element’s computation is isolated.
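As a rough reference sketch (my own illustration, not the paper’s kernel), a data-parallel RMSNorm looks like this - every reduction stays inside a single row, so row i’s arithmetic cannot be touched by how many other rows share the batch. The actual invariance guarantee still has to be enforced inside the GPU kernel, but the structure is the same:

```python
import torch

def rmsnorm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # The sum of squares for row i touches only row i. As long as the underlying
    # kernel keeps each row's reduction order fixed, the result for that row is
    # independent of the batch size.
    mean_sq = x.pow(2).mean(dim=-1, keepdim=True)  # per-row reduction only
    return x * torch.rsqrt(mean_sq + eps) * weight

x = torch.randn(4, 1024)
w = torch.ones(1024)
diff = (rmsnorm_reference(x, w)[0] - rmsnorm_reference(x[:1], w)[0]).abs().max()
print(diff)  # 0 whenever the per-row reduction order is the same for both calls
```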
The Invariance Breaker
The problem emerges when batch size is very small. Picture this:
You have a powerful GPU with 108 streaming multiprocessors (cores). If your batch size is only 4, a pure data-parallel approach leaves 104 cores completely idle - a 96% waste of hardware capacity.
To maintain performance, kernel engineers implement dynamic optimization strategies:
- For RMSNorm: Split-Reduction (divide the reduction dimension across multiple cores)
- For Matmul: Split-K (split the K dimension across multiple cores)
These strategies dramatically improve hardware utilization for small batches. But they also break batch invariance because now the computation order for a single element depends on how many cores are recruited - which depends on batch size.
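A toy model of why this breaks invariance (again my own sketch, not any library’s code): a split reduction sums chunks on separate “cores” and then combines the partial sums, so the association of the additions - and therefore the floating-point result - changes with the number of splits:

```python
import numpy as np

def reduce_sequential(v):
    # One core owns the whole reduction: a fixed left-to-right order.
    acc = np.float32(0.0)
    for x in v:
        acc = np.float32(acc + x)
    return acc

def reduce_split(v, num_splits):
    # Split reduction: each "core" sums its own chunk, then the partials are combined.
    # Changing num_splits changes how the additions are associated.
    partials = [reduce_sequential(chunk) for chunk in np.array_split(v, num_splits)]
    return reduce_sequential(np.asarray(partials, dtype=np.float32))

v = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
print(reduce_sequential(v), reduce_split(v, 4), reduce_split(v, 16))
# Same data, different split counts, usually slightly different results.
```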
The Solution: Disciplined Uniformity
The fix requires rejecting dynamic optimization in favor of deterministic simplicity:
Compile one fixed kernel configuration and use it for all batch sizes.
This forces the reduction order to remain constant. A batch size of 1 uses the same computational sequence as a batch size of 1,000. Yes, this leads to some performance loss for smaller batches - those idle cores represent wasted potential. But it guarantees that your computation is batch-invariant.
This is the engineering equivalent of saying: “We will accept 20% slower performance on small batches in exchange for 100% reproducibility across all scenarios.”
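In dispatcher terms, the idea looks something like the sketch below (the names and tile sizes are purely illustrative, not drawn from any real library):

```python
# One fixed configuration, reused for every batch size.
FIXED_CONFIG = {"block_m": 128, "block_n": 128, "block_k": 64, "split_k": 1}

def pick_matmul_config(batch_size: int) -> dict:
    # A throughput-tuned dispatcher might branch here, e.g.:
    #   if batch_size < 16:
    #       return {"block_m": 16, "block_n": 128, "block_k": 64, "split_k": 8}
    # recruiting more cores but changing the reduction order.
    # The batch-invariant choice ignores batch_size entirely.
    return FIXED_CONFIG
```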
The Attention Challenge
Attention is the crown jewel of transformer architectures - and the most difficult operation to make deterministic. Why?
- It involves reductions over both the feature dimension and the sequence dimension simultaneously
- Inference optimizations like Key/Value caching create boundary conditions and partial block scenarios
- Different inputs have wildly different sequence lengths, from dozens to hundreds of thousands of tokens
Defining Invariance for Attention
For attention, batch invariance means the reduction order for a given token must not depend on how many other tokens from its sequence are being simultaneously processed.
If the KV cache is handled separately from current tokens, boundary conditions where cache pages are partially filled can change the numerical reduction order - breaking invariance.
The preliminary solution involves updating the KV cache and page table before the attention kernel runs, ensuring consistent data layout. But this only addresses part of the problem.
The Decode Stage: Where Complexity Peaks
The main technical challenge lies in the decode stage of attention - when generating tokens one at a time. During decode:
- Query length is extremely small (often just 1 token)
- KV cache length can be very large (thousands of tokens)
- To saturate GPU compute, you must use a split-reduction kernel (often called Split-KV or FlashDecoding)
Here’s the typical non-invariant approach:
Your KV cache has 1,000 tokens. The kernel determines it needs 4 splits to saturate the GPU.
- Strategy: Divide 1,000 evenly → each of 4 cores handles 250 tokens
- Core 1: tokens 1-250
- Core 2: tokens 251-500
- Core 3: tokens 501-750
- Core 4: tokens 751-1,000
This seems reasonable. But it’s catastrophically non-invariant because the split boundaries depend on the total KV length, which depends on batch composition.
If another request in the batch has a KV length of 800, it would split differently:
- 800 ÷ 4 = 200 tokens per core
- Core 1: tokens 1-200
- Core 2: tokens 201-400
- Core 3: tokens 401-600
- Core 4: tokens 601-800
Even though both requests use 4 splits, the internal reduction order is different due to different split sizes.
The Advanced Solution: Fixed Split-Size Strategy
The breakthrough comes from inverting the strategy: instead of fixing the number of splits, fix the split size.
Establish a fixed split size of 256 tokens.
For a KV length of 1,000:
- Number of splits needed: ⌈1,000 ÷ 256⌉ = 4 splits
- Core 1: tokens 1-256
- Core 2: tokens 257-512
- Core 3: tokens 513-768
- Core 4: tokens 769-1,000 (partial split of 232 tokens)
For a KV length of 800:
- Number of splits needed: ⌈800 ÷ 256⌉ = 4 splits
- Core 1: tokens 1-256
- Core 2: tokens 257-512
- Core 3: tokens 513-768
- Core 4: tokens 769-800 (partial split of 32 tokens)
The split boundaries (256, 512, 768) are now identical regardless of total sequence length. The number of splits may vary, but the internal reduction procedure within each 256-token chunk remains perfectly consistent.
This guarantees identical reduction order for tokens 1-768 across different batch compositions. Token computations only differ in the final partial block - but even that partial block uses the same internal procedure, just with fewer elements.
This is batch invariance for attention.
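A small sketch of the bookkeeping behind the two strategies (my own illustration, using 0-based token offsets; the real kernels obviously do far more than compute boundaries):

```python
import math

SPLIT_SIZE = 256  # fixed number of KV tokens handled per split

def fixed_size_splits(kv_len):
    # Batch-invariant strategy: boundaries depend only on SPLIT_SIZE, so every
    # full block is reduced identically regardless of the total KV length.
    n = math.ceil(kv_len / SPLIT_SIZE)
    return [(i * SPLIT_SIZE, min((i + 1) * SPLIT_SIZE, kv_len)) for i in range(n)]

def fixed_count_splits(kv_len, num_splits=4):
    # Non-invariant strategy: the chunk size (and hence the reduction order
    # inside each chunk) changes with the total KV length.
    size = math.ceil(kv_len / num_splits)
    return [(i * size, min((i + 1) * size, kv_len)) for i in range(num_splits)]

print(fixed_size_splits(1000))   # [(0, 256), (256, 512), (512, 768), (768, 1000)]
print(fixed_size_splits(800))    # [(0, 256), (256, 512), (512, 768), (768, 800)]
print(fixed_count_splits(1000))  # [(0, 250), (250, 500), (500, 750), (750, 1000)]
print(fixed_count_splits(800))   # [(0, 200), (200, 400), (400, 600), (600, 800)]
```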
The Payoff: When Mathematics Meets Reality
Validation Results
When batch-invariant kernels are deployed, the expected mathematical outcome is achieved with beautiful precision:
1,000 completions at temperature 0 → 1 unique result (100% reproducibility)
Not 80 unique results. Not even 2. One.
This isn’t just aesthetically satisfying - it has profound practical implications.
Enabling True On-Policy Reinforcement Learning
Reproducibility simplifies debugging and restores user trust, certainly. But the most exciting payoff lies in enabling breakthrough capabilities in advanced machine learning.
Consider Reinforcement Learning training for LLMs - the technique behind models like ChatGPT and Claude that align AI behavior with human preferences.
The On-Policy vs. Off-Policy Divide
In true on-policy RL, the training environment learns from samples generated by the current version of the policy being optimized. This creates a tight feedback loop where the model learns directly from its own current behavior.
In off-policy RL, the training environment learns from samples generated by a different version of the policy - often an older checkpoint or a separate sampling model.
On-policy methods, when they work, often converge faster and more stably. But they require that the sampling policy and the training policy are truly identical.
The Hidden Problem
Researchers have long documented a frustrating phenomenon: what should be on-policy RL frequently exhibits instability, reward collapse, or loss spikes that seem to emerge from nowhere.
The culprit? Numerical nondeterminism.
Here’s the scenario:
- Training generates a checkpoint of the model
- This checkpoint is deployed to an inference server for sampling
- Due to nondeterminism, the inference server produces slightly different outputs than the training environment would produce for the same inputs
- This creates an implicit distribution shift - the KL divergence between the training policy and sampling policy is non-zero
- What should be on-policy RL has become off-policy RL, without anyone realizing it
This subtle divergence introduces noise into the reward signal and gradient estimates, leading to training instability.
The Deterministic Solution
By achieving bitwise identical results between the sampler and the trainer, we maintain a KL divergence of exactly 0 between the training and sampling policies.
Research from Thinking Machines Lab demonstrates this stabilization empirically: RL training that would previously crash due to mysterious loss spikes now proceeds smoothly when using batch-invariant kernels. The numerical control has direct, measurable impact on model quality and training robustness.
This proves that reproducibility isn’t just about aesthetics or debugging convenience - it’s a prerequisite for cutting-edge AI capabilities.
Moving Forward
The Defeatist Narrative
There’s a pervasive narrative in the AI community that goes something like this:
“LLMs are inherently probabilistic. They use sampling. A little numerical variation doesn’t matter in the grand scheme of things. Demanding exact reproducibility is pedantic - it’s optimizing for a metric that doesn’t affect user experience.”
This narrative is fundamentally wrong and dangerously complacent.
The Engineering Reality
The fact that these nondeterminism issues can be definitively traced to specific kernel implementation decisions - and solved through principled engineering fixes like the fixed split-size strategy - proves that reproducibility in LLM inference is not a philosophical hope or a theoretical ideal.
It’s an achievable technical reality.
Yes, we can choose to enable sampling when we want creative variation. But that should be a controlled source of randomness - a parameter we set intentionally, not a system-level failure that happens even when we explicitly request determinism.
The Path Forward
By committing to batch-invariant kernel implementations, we:
- Restore trust so users and developers can rely on reproducible outputs for the same inputs
- Enable debugging so developers can actually trace issues instead of chasing ghosts caused by nondeterministic infrastructure
- Unlock advanced workflows so techniques like on-policy RL become viable at scale
- Secure data integrity so critical applications in healthcare, finance, and safety can depend on consistent AI outputs
The cost? Accepting some performance overhead in edge cases - small batch scenarios where dynamic optimization would have squeezed out extra throughput.
The benefit? A foundation of reproducibility upon which we can build the next generation of reliable AI systems.
What This Means for AI’s Future
As we deploy AI systems into increasingly critical domains - medical diagnosis, legal reasoning, financial analysis, autonomous systems - the question of reproducibility transitions from “nice to have” to “absolutely essential.”
Imagine a medical AI system that gives different diagnostic recommendations for the same patient data depending on server load. Imagine a financial AI that produces different risk assessments for the same portfolio based on when the query happens to run. Imagine an autonomous vehicle’s decision-making system that produces different outputs for identical sensor inputs.
These scenarios aren’t hypothetical edge cases - they’re the direct consequence of accepting nondeterminism in LLM inference.
The research we’ve explored today, conducted by teams like Thinking Machines Lab, represents more than just a technical achievement. It represents a rejection of the idea that we must accept chaos as the price of performance. It demonstrates that with careful engineering, rigorous mathematical analysis, and a willingness to make principled trade-offs, we can build AI systems that are both powerful and reliable.
Reproducibility isn’t negotiable. It’s foundational.
By solving the nondeterminism problem, we’re not just fixing a bug - we’re securing the integrity of the data pipelines and training workflows that will shape AI’s future. We’re ensuring that as these systems become more capable, they also become more trustworthy.
May your inference be reproducible and your gradients stable!