EST · MMXXVI · INDEPENDENT
VOL. 02 — TECHNICAL
APPENDIX
Mathematics of GPU utilization, flow matching, flow maps, discrete flows and flow maps, and reward alignment.

Contents.

  1. Autoregressive attention is matvec.
  2. Flow matching learns to transform noise into data.
  3. Flow maps compress ODE trajectories into one step.
  4. Why discrete diffusion needs many steps.
  5. Adapting flows to language.
  6. Flow maps for language.
  7. Reward alignment for flows.
  8. Fine-tuning via flow maps.
  9. Inference-time steering with flow maps.
  10. Selected works.

Autoregressive attention is matvec.

The core computational component of autoregressive inference is self-attention. In general, attention is a matrix-matrix operation, but during single-token rollout the keys and values of earlier tokens can be cached, so each step reduces to a matrix-vector multiply.

At each step, the transformer operates on the current token \(x_t \in \mathbb{R}^{1 \times d}\) and its prefix \(X_{1:t} \in \mathbb{R}^{t \times d}\). Projection matrices \(W_Q,\,W_K,\,W_V \in \mathbb{R}^{d \times d}\) define the queries, keys, and values:

\[ \begin{aligned} Q &= x_t W_Q &&\in \mathbb{R}^{1 \times d}, \\ K_{1:t} &= X_{1:t} W_K &&\in \mathbb{R}^{t \times d}, \\ V_{1:t} &= X_{1:t} W_V &&\in \mathbb{R}^{t \times d}. \end{aligned} \]

Causal attention is defined as a softmax over inner products of \(Q\) with cached keys, applied to cached values:

\[ A = \mathrm{softmax}\!\left(\frac{1}{\sqrt{d}}\,Q\,K_{1:t}^\top\right) \in \mathbb{R}^{1 \times t}, \qquad o = A\,V_{1:t} \in \mathbb{R}^{1 \times d}. \]

The KV cache stores \(K_{1:t-1}\) and \(V_{1:t-1}\) from previous steps to make inference efficient by avoiding repeated computation. As a result, each new token only requires building the new rows \(K_t,\,V_t \in \mathbb{R}^{1 \times d}\) and the matrix-vector multiplication

\[ Q\,K_{1:t}^\top \;\in\; \mathbb{R}^{1 \times t}. \]
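A minimal sketch of one cached decode step, written in plain Python with nested lists standing in for tensors. The helper names (`row_times_matrix`, `decode_step`, `attend_from_scratch`, `demo`) and the toy sizes are illustrative assumptions rather than part of any serving stack. It only checks that appending \(K_t, V_t\) to a cache and attending with a single query row reproduces what a full recomputation over the prefix would give.

```python
import math
import random

def row_times_matrix(x, W):
    """Multiply a row vector x (length d) by a matrix W (d x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """o = softmax(q K^T / sqrt(d)) V for a single query row q."""
    d = len(q)
    scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

def decode_step(x_t, W_Q, W_K, W_V, k_cache, v_cache):
    """One cached step: append K_t and V_t, then do vector-against-cache products."""
    k_cache.append(row_times_matrix(x_t, W_K))
    v_cache.append(row_times_matrix(x_t, W_V))
    return attend(row_times_matrix(x_t, W_Q), k_cache, v_cache)

def attend_from_scratch(X, W_Q, W_K, W_V):
    """Reference: rebuild every key and value for the prefix X, query from the last token."""
    keys = [row_times_matrix(x, W_K) for x in X]
    values = [row_times_matrix(x, W_V) for x in X]
    return attend(row_times_matrix(X[-1], W_Q), keys, values)

def demo(d=4, t=5, seed=0):
    """Run t cached decode steps and return (cached output, recomputed output)."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    W_Q, W_K, W_V = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
    X = rand_matrix(t, d)
    k_cache, v_cache = [], []
    for x_t in X:
        cached_out = decode_step(x_t, W_Q, W_K, W_V, k_cache, v_cache)
    return cached_out, attend_from_scratch(X, W_Q, W_K, W_V)
```

Calling `demo()` returns two output rows that should agree, since the cached path performs the same arithmetic while projecting each token only once.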

Arithmetic intensity is the amount of computation performed per byte moved from memory:

\[ \mathrm{AI} \;=\; \frac{\mathrm{FLOPs}}{\mathrm{bytes\ moved}}. \]

High arithmetic intensity means the accelerator can spend most of its time doing math; low arithmetic intensity means runtime is dominated by fetching data from memory.

Let \(S\) be the current KV-cache length, \(T_q\) be the number of query tokens processed in the forward pass, \(B\) be batch size, and \(d\) be model width. The attention FLOPs count both \(QK^\top\) and the weighted sum \(AV\):

\[ \mathrm{FLOPs}_{\mathrm{attn}} \;=\; 2 B S T_q d + 2 B S T_q d \;=\; 4 B S T_q d . \]

In bf16, reading the cached \(K,V\) tensors costs \(4BSd\) bytes, and reading/writing the query-side activations costs \(4BT_qd\) bytes. Thus the arithmetic intensity is

\[ \mathrm{AI}_{\mathrm{AR}} \;=\; \frac{4 B S T_q d}{4 B S d + 4 B T_q d} \;=\; \frac{S\,T_q}{S + T_q}. \]

Autoregressive decode has \(T_q = 1\), so

\[ \mathrm{AI}_{\mathrm{AR,decode}} \;=\; \frac{S}{S + 1} \;\approx\; 1. \]

An arithmetic intensity near \(1\) means AR decode is memory bound: each byte read from the KV cache buys only about one FLOP of work. The accelerator therefore spends more time waiting on memory traffic than doing useful arithmetic. By comparison, modern GPU and TPU accelerators reach peak compute at the roofline ridge, where the arithmetic intensity is on the order of 200–400 FLOPs per byte. Autoregressive serving sits roughly two orders of magnitude below that ridge, with the silicon delivering only a small fraction of its rated throughput.

This forces frontier labs to build complex serving pipelines that batch many users simultaneously to push the arithmetic intensity upward. In practice, batching is fundamentally limited in the arithmetic intensity it can reach: each additional user adds another full KV cache, and memory pressure caps the effective batch size well before the arithmetic intensity approaches the ridge. Batching amortizes the reads of shared weights, but \(B\) cancels in \(\mathrm{AI}_{\mathrm{AR}}\) above, so the attention term itself stays memory bound.
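The attention accounting above is easy to reproduce numerically. The following is a small sketch under the same counting conventions (two bytes per bf16 element, FLOPs from \(QK^\top\) and \(AV\) only). The function name and the default width `d=128` are assumptions made for illustration.

```python
def attention_arithmetic_intensity(S, T_q, B=1, d=128, bytes_per_element=2):
    """FLOPs per byte for attention with cache length S and T_q query tokens."""
    flops = 2 * B * S * T_q * d + 2 * B * S * T_q * d   # QK^T plus AV
    kv_bytes = 2 * B * S * d * bytes_per_element         # read cached K and V
    query_bytes = 2 * B * T_q * d * bytes_per_element    # read Q, write the output
    return flops / (kv_bytes + query_bytes)
```

With `S = 4096`, single-token decode (`T_q = 1`) gives roughly 0.9998 FLOPs per byte for any batch size, while `T_q = 256` gives roughly 240.9, inside the ridge range quoted above. Both `B` and `d` cancel in the ratio.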

Flow matching learns to transform noise into data.

Our technology builds on flow matching, which we co-invented. The core idea is to learn a velocity field \(b_t(x)\) for an ordinary differential equation (ODE) that transports noise to new data samples.

Let \(\rho_0\) be a simple noise distribution on \(\mathbb{R}^d\) such as a Gaussian, and let \(\rho_1\) be the unknown data distribution from which we have collected a large dataset. The goal is to learn \(b_t\) such that the corresponding ODE satisfies:

\[ \dot x_t = b_t(x_t), \qquad \underbrace{x_0 \sim \rho_0}_{\text{initial noise}} \qquad \Longrightarrow \qquad \underbrace{x_1 \sim \rho_1}_{\text{new data}}. \]

The key question is how to learn this velocity field efficiently. Flow matching creates supervised targets by coupling a noise sample \(I_0 \sim \rho_0\) with a data sample \(I_1 \sim \rho_1\) and forming the linear stochastic interpolant:

\[ I_t = (1-t)\,I_0 + t\,I_1 \in \mathbb{R}^d, \qquad t \in [0,1]. \]

The correct velocity $b_t(x)$ at each intermediate point \(x\) is the average slope of all interpolants passing through \(x\):

\[ b_t(x) \;=\; \mathbb{E}\bigl[\dot I_t \,\big|\, I_t = x\bigr] \;=\; \mathbb{E}\bigl[I_1 - I_0 \,\big|\, I_t = x\bigr]. \]

We can learn this velocity field by training a neural network \(\hat b_t\) with standard squared-error regression on those slopes:

\[ \mathcal{L}_{\text{FM}}(\hat b) \;=\; \int_0^1 \mathbb{E}\bigl|\hat b_t(I_t) - \dot I_t\bigr|^2\,dt. \]

The minimizer of this objective is the true velocity $b_t$, because the conditional expectation is the best squared-error predictor of \(\dot I_t\) given \(I_t\). After training, generation is just ODE integration: start from \(x_0 \sim \rho_0\), follow \(\dot x_t = \hat b_t(x_t)\), and read out \(x_1\).
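For intuition, the conditional expectation defining \(b_t\) has a closed form when both endpoints are one-dimensional Gaussians drawn independently, because \(I_1 - I_0\) and \(I_t\) are then jointly Gaussian. The sketch below assumes \(\rho_0 = N(0,1)\) and \(\rho_1 = N(m, \sigma^2)\) with illustrative values `M = 2.0` and `SIGMA = 0.5`, none of which come from the text. It uses the exact velocity in place of a trained \(\hat b_t\) and integrates the ODE with forward Euler.

```python
import random

M, SIGMA = 2.0, 0.5  # assumed toy data mean and standard deviation

def velocity(t, x, m=M, sigma=SIGMA):
    """Exact b_t(x) = E[I_1 - I_0 | I_t = x] for independent Gaussian endpoints."""
    var_t = (1 - t) ** 2 + (t * sigma) ** 2   # Var(I_t)
    cov = t * sigma ** 2 - (1 - t)            # Cov(I_1 - I_0, I_t)
    return m + (cov / var_t) * (x - t * m)

def generate(x0, n_steps=1000):
    """Forward-Euler integration of dx/dt = b_t(x) from t = 0 to t = 1."""
    x, h = x0, 1.0 / n_steps
    for n in range(n_steps):
        x += h * velocity(n * h, x)
    return x

def sample_endpoints(n=2000, seed=0):
    """Push n Gaussian noise samples through the ODE with a coarser step."""
    rng = random.Random(seed)
    return [generate(rng.gauss(0.0, 1.0), n_steps=200) for _ in range(n)]
```

In this toy case the exact endpoint is \(x_1 = m + \sigma x_0\), so `generate(1.0)` should land close to 2.5, and the sample mean and standard deviation of `sample_endpoints()` should be close to `M` and `SIGMA`.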

Flow maps compress ODE trajectories into one step.

To make sampling efficient, we compress the entire generative process into a single neural function evaluation. That is, we learn a function that maps the initial state directly to the final state. The flow map \(X_{s,t} : \mathbb{R}^d \to \mathbb{R}^d\) is the required object. It is defined as the solution operator for the ODE:

\[ X_{s,t}(x_s) \;=\; x_t \qquad \text{for any } s, t \in [0, 1]. \]

Our team has developed some of the core methods for training flow maps, which we describe next. We parameterize the flow map as:

\[ X_{s,t}(x) \;=\; x + (t-s)\,v_{s,t}(x), \]

where $v_{s,t}$ is the average velocity, which coincides with the ODE velocity on the diagonal: $v_{t,t}(x) = b_t(x)$. We enforce this condition on a neural parameterization $\hat v_{s,t}$ with the anchor loss:

\[ \mathcal{L}_{\mathrm{anchor}}(\hat v) = \int_0^1 \mathbb{E}\,\bigl|\hat v_{t,t}(I_t) - b_t(I_t)\bigr|^2\,dt. \]

We have shown that flow maps admit three equivalent characterizations. For example, the semigroup condition states:

\[ X_{u,t}\bigl(X_{s,u}(x)\bigr) = X_{s,t}(x), \qquad \text{for all } s \leq u \leq t \in [0, 1], \]

and can be enforced on $\hat v_{s,t}$ with the following consistency loss:

\[ \mathcal{L}_{\mathrm{cons}}(\hat v) = \iiint \mathbb{E}\,\bigl|\hat X_{s,t}(I_s) - \hat X_{u,t}\bigl(\hat X_{s,u}(I_s)\bigr)\bigr|^2 \, du\, ds\, dt. \]

Minimizing the anchor and consistency losses recovers the true flow map. The flow map can be used as a one-step generative model from noise to data or with multiple steps by applying it repeatedly at intermediate times. These methods are now competitive with their counterparts that involve integrating the ODE, while offering significant computational advantages.
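The defining properties of a flow map can be checked by hand on a problem where it is known exactly. The sketch below assumes a one-dimensional toy problem with \(\rho_0 = N(0,1)\), \(\rho_1 = N(m,\sigma^2)\) and independent coupling (values `M`, `SIGMA` are illustrative), for which the ODE trajectories are \(x_t = t\,m + \sigma_t x_0\) with \(\sigma_t^2 = (1-t)^2 + t^2\sigma^2\). It is a closed-form stand-in, not a trained \(\hat v_{s,t}\).

```python
import math

M, SIGMA = 2.0, 0.5  # assumed toy data mean and standard deviation

def std_t(t, sigma=SIGMA):
    """Standard deviation of I_t = (1 - t) I_0 + t I_1 for independent Gaussian endpoints."""
    return math.sqrt((1 - t) ** 2 + (t * sigma) ** 2)

def velocity(t, x, m=M, sigma=SIGMA):
    """Exact ODE velocity b_t(x) for the toy problem."""
    cov = t * sigma ** 2 - (1 - t)
    return m + (cov / std_t(t, sigma) ** 2) * (x - t * m)

def flow_map(s, t, x, m=M, sigma=SIGMA):
    """Closed-form X_{s,t}(x): trajectories are x_t = t m + std_t(t) * x_0."""
    return t * m + (std_t(t, sigma) / std_t(s, sigma)) * (x - s * m)

def average_velocity(s, t, x):
    """v_{s,t}(x) defined through X_{s,t}(x) = x + (t - s) v_{s,t}(x); needs s != t."""
    return (flow_map(s, t, x) - x) / (t - s)
```

Here `flow_map(u, t, flow_map(s, u, x))` agrees with `flow_map(s, t, x)` (the semigroup condition), `average_velocity(t, t + 1e-6, x)` approaches `velocity(t, x)` (the diagonal condition), and `flow_map(0.0, 1.0, x0)` is a one-step sampler returning \(m + \sigma x_0\).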

Why discrete diffusion needs many steps.

A discrete diffusion model does not sample a sentence as one coupled object. Instead, at each denoising step, it predicts a distribution for every token position. During sampling, the positions that are updated are sampled independently. Updating multiple tokens at once can make discrete diffusion faster, but it also introduces errors because dependencies between those tokens are not captured in that step.

As an illustrative example, consider a dataset with only two examples: New York and San Diego. The probability that the first word is New is \(\tfrac{1}{2}\), and the probability that it is San is \(\tfrac{1}{2}\). Similarly, the second word is York with probability \(\tfrac{1}{2}\) and Diego with probability \(\tfrac{1}{2}\). If we sample the two positions independently, we produce all four combinations — New York, New Diego, San York, and San Diego — each with probability \(\tfrac{1}{4}\). The between-token dependency was ignored during sampling.

Many-step diffusion mitigates this by taking more steps and making only small changes at each step, creating a trade-off between speed and quality. Flow maps are designed to generate all token positions at once in a correlated, coherent manner. Because the flow map is trained to represent the full jump directly, one-step sampling is not a coarse discretization of many smaller diffusion updates.
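The two-city example above can be verified by exact enumeration. This short sketch uses exact fractions; the function names are chosen for illustration. It builds the joint distribution of the dataset, the per-position marginals, and the distribution that results from sampling each position independently.

```python
from fractions import Fraction
from itertools import product

DATASET = [("New", "York"), ("San", "Diego")]

def joint_distribution(data):
    """Empirical distribution over whole sequences."""
    joint = {}
    for seq in data:
        joint[seq] = joint.get(seq, Fraction(0)) + Fraction(1, len(data))
    return joint

def marginal(joint, position):
    """Per-position token distribution, the quantity a single denoising step predicts."""
    out = {}
    for seq, prob in joint.items():
        out[seq[position]] = out.get(seq[position], Fraction(0)) + prob
    return out

def independent_distribution(joint, length=2):
    """Distribution obtained by sampling every position from its own marginal."""
    marginals = [marginal(joint, i) for i in range(length)]
    out = {}
    for tokens in product(*marginals):
        prob = Fraction(1)
        for position, token in enumerate(tokens):
            prob *= marginals[position][token]
        out[tokens] = prob
    return out

def invalid_mass(candidate, joint):
    """Probability assigned to sequences that never occur in the data."""
    return sum(p for seq, p in candidate.items() if seq not in joint)
```

Independent sampling puts probability 1/4 on each of the four combinations, so half of the mass falls on New Diego and San York, while the joint distribution puts none there.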

Adapting flows to language.

The natural question is then: how do we adapt these technologies to language generation, which lives on a discrete state space, and how do we do so while leveraging the architectures and optimization techniques that have made LLMs scale as they have?

For a vocabulary \(V\) and sequences of length \(L\), we assign each token $v$ a unique one-hot vector $e_v \in \mathbb{R}^{|V|}$, and use this to embed sequences:

\[ f(\mathrm{sequence}) = \bigl(\mathrm{onehot}(\mathrm{token\, 1}),\,\ldots,\,\mathrm{onehot}(\mathrm{token\, L})\bigr)^\top \in \mathbb{R}^{L \times |V|}. \]

We can now use the flow matching framework from before to transform noise into one-hot vectors representing language.

However, naively training with the standard flow matching loss does not exploit the objective functions and training paradigms that have made AR LLMs so successful. To do so, we reparameterize the flow in terms of the denoiser, which lives on the probability simplex for each token:

\[ D_t({x}) \;=\; \mathbb{E}\bigl[{I}_1 \,\big|\, I_t = {x}\bigr] \in \bigl(\Delta^{|V|-1}\bigr)^{L}. \]

We parameterize our neural denoiser to output a per-token categorical distribution, \(\hat D_t({x})^l \in \Delta^{|V|-1}\), and train it with the cross-entropy loss:

\[ \mathcal{L}_{\mathrm{CE}}(\hat D) \;=\; \int_0^1 \mathbb{E}\!\left[-\sum_{l=1}^L \log \hat D_t(I_t)^l \cdot I_1^l\right] dt. \]

The denoiser can be used to recover the velocity from flow matching via

\[ b_t({x}) \;=\; \frac{D_t({x}) - {x}}{1-t}, \]

allowing us to simulate the ODE to generate new text.
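To make the denoiser-to-velocity relation concrete, here is a sketch for a single token position with a three-token vocabulary. It assumes Gaussian noise \(I_0 \sim N(0, I)\) and a made-up token prior `PRIOR`, neither of which is specified in the text. Under those assumptions \(I_t\) given \(I_1 = e_v\) is \(N(t\,e_v, (1-t)^2 I)\), so the exact denoiser is a softmax posterior over tokens. A trained \(\hat D_t\) would replace `denoiser` in practice.

```python
import math
import random

PRIOR = [0.5, 0.3, 0.2]  # assumed toy token frequencies for a 3-token vocabulary

def denoiser(t, x, prior=PRIOR):
    """Exact D_t(x) = E[I_1 | I_t = x] for one token; requires t < 1."""
    var = (1 - t) ** 2
    logits = []
    for v, p in enumerate(prior):
        # Squared distance from x to the scaled one-hot vertex t * e_v.
        sq_dist = sum((xi - (t if i == v else 0.0)) ** 2 for i, xi in enumerate(x))
        logits.append(math.log(p) - sq_dist / (2 * var))
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def velocity(t, x):
    """b_t(x) = (D_t(x) - x) / (1 - t)."""
    d = denoiser(t, x)
    return [(di - xi) / (1 - t) for di, xi in zip(d, x)]

def generate(x0, n_steps=100):
    """Forward Euler on a uniform grid; the last step has h = 1 - t and lands on D_t(x)."""
    x, h = list(x0), 1.0 / n_steps
    for n in range(n_steps):
        b = velocity(n * h, x)
        x = [xi + h * bi for xi, bi in zip(x, b)]
    return x
```

Because the final Euler step moves the state onto the denoiser output, `generate` returns a point on the probability simplex (up to rounding), which can be decoded by taking the largest coordinate.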

Flow maps for language.

While we can use cross-entropy to learn the denoiser, the flow map is still parameterized through the average velocity \(v_{s,t}\), which does not stay on the simplex, and so it cannot be trained using cross-entropy objectives.

To address this, our works introduce a quantity that always lives on the probability simplex, the mean denoiser $\delta_{s,t} : \mathbb{R}^{L \times |V|} \to \bigl(\Delta^{|V|-1}\bigr)^{L}$:

\[\delta_{s,t}({x}) \;\coloneqq\; {x} + (1-s)\,v_{s,t}({x}). \]

Intuitively, the mean denoiser shadows the flow map on the simplex.

On the diagonal, the mean denoiser reduces to the standard denoiser $ \delta_{t,t}({x}) = D_t({x})$, and so $\hat \delta$ can be trained via cross-entropy:

\[ \mathcal{L}_{\mathrm{anchor}}(\hat \delta) \;=\; \int_0^1 \mathbb{E}\!\left[\sum_{l=1}^L \textrm{CE}\!\left(D_t(I_t)^l \,\big\|\, \hat \delta_{t,t}(I_t)^l\right)\right] dt. \]

We have shown that the three characterizations of flow maps have analogous conditions for the mean denoiser, leading to natural cross-entropy losses. For example, the semigroup condition yields:

\[ \mathcal{L}_{\mathrm{PSD}}(\hat \delta) = \iiint \mathbb{E}\!\left[\sum_l \mathrm{CE}\!\left(\delta_{\textrm{target}}^l \,\big\|\, \hat \delta_{s,t}(I_s)^l\right)\right] \, du\, ds\, dt, \]

where \(\delta_{\textrm{target}} = \gamma\hat\delta_{s,u}(I_s) + (1-\gamma)\hat\delta_{u,t}(\hat X_{s,u}(I_s))\) and \(\gamma = \frac{(1-t)(u-s)}{(1-u)(t-s)}\). Minimizing the anchor loss combined with one of the consistency losses ensures we learn the true mean denoiser. Moreover, the mean denoiser can be used to exactly recover the flow map:

\[ X_{s,t}({x}) \;=\; \frac{1-t}{1-s}\,{x} \;+\; \frac{t-s}{1-s}\,\delta_{s,t}({x}). \]

Together, these insights enable us to carry over all the machinery developed for modern LLMs to train flow map language models.
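The weight \(\gamma\) in \(\delta_{\textrm{target}}\) follows from substituting the recovery formula for \(X_{s,t}\) into the semigroup condition and matching the coefficients of \(\delta_{s,u}\) and \(\delta_{u,t}\). The sketch below checks both identities numerically on an assumed one-dimensional Gaussian toy flow map (values `M`, `SIGMA` are illustrative). The identities are purely algebraic, so the simplex plays no role in this check.

```python
import math

M, SIGMA = 2.0, 0.5  # assumed toy data mean and standard deviation

def std_t(t):
    """Standard deviation of the interpolant I_t for independent Gaussian endpoints."""
    return math.sqrt((1 - t) ** 2 + (t * SIGMA) ** 2)

def flow_map(s, t, x):
    """Closed-form X_{s,t}(x) for the toy problem."""
    return t * M + (std_t(t) / std_t(s)) * (x - s * M)

def mean_denoiser(s, t, x):
    """delta_{s,t}(x) = x + (1 - s) v_{s,t}(x), with v read off the flow map; needs s != t."""
    v = (flow_map(s, t, x) - x) / (t - s)
    return x + (1 - s) * v

def gamma(s, u, t):
    """Mixing weight from the semigroup condition; needs s < u < t < 1."""
    return ((1 - t) * (u - s)) / ((1 - u) * (t - s))

def semigroup_target(s, u, t, x):
    """gamma * delta_{s,u}(x) + (1 - gamma) * delta_{u,t}(X_{s,u}(x))."""
    g = gamma(s, u, t)
    return g * mean_denoiser(s, u, x) + (1 - g) * mean_denoiser(u, t, flow_map(s, u, x))

def flow_map_from_denoiser(s, t, x):
    """Recover X_{s,t}(x) from the mean denoiser."""
    return ((1 - t) / (1 - s)) * x + ((t - s) / (1 - s)) * mean_denoiser(s, t, x)
```

For any \(s < u < t < 1\), `semigroup_target(s, u, t, x)` should match `mean_denoiser(s, t, x)`, and `flow_map_from_denoiser` should match `flow_map`.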

Reward alignment for flows.

Given a reward function \(r(x)\), alignment can be framed as sampling from the reward-tilted distribution:

\[ \rho_{\mathrm{reward}}(x) \;\propto\; e^{\lambda\,r(x)}\,\rho_1(x). \]

This biases samples toward higher reward, with \(\lambda\) controlling the strength of the preference. Flow-based models give two natural ways to realize this tilt: fine-tuning, which updates the model permanently, and inference-time steering, which keeps the model fixed and does the adaptation on the fly.

In both cases, the goal is to learn or estimate the reward-adapted velocity \(b_t^{\!*}\) whose ODE targets the tilted endpoint distribution:

\[ \dot x_t = b_t^{\!*}(x_t), \qquad x_0 \sim \rho_0 \qquad \Longrightarrow \qquad x_1 \sim \rho_{\mathrm{reward}}. \]

Flow maps make both routes practical because the reward signal can be propagated through entire generated trajectories.
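On a finite state space the reward tilt can be computed directly. The sketch below uses made-up numbers and checks a standard fact: the tilt \(\rho_{\mathrm{reward}} \propto e^{\lambda r}\rho_1\) maximizes \(\mathbb{E}_\rho[r] - \tfrac{1}{\lambda}\mathrm{KL}(\rho\,\|\,\rho_1)\) over all distributions \(\rho\). Choosing KL as the regularizer is an assumption made here for illustration.

```python
import math
import random

def tilt(base, rewards, lam):
    """Reward-tilted distribution proportional to exp(lam * r) * base."""
    weights = [p * math.exp(lam * r) for p, r in zip(base, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def objective(rho, base, rewards, lam):
    """E_rho[r] - (1 / lam) * KL(rho || base)."""
    expected_reward = sum(p * r for p, r in zip(rho, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(rho, base) if p > 0)
    return expected_reward - kl / lam

def random_distribution(n, rng):
    """A random point in the interior of the probability simplex."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    z = sum(w)
    return [x / z for x in w]
```

With `base = [0.5, 0.3, 0.2]`, `rewards = [0.0, 1.0, 2.0]` and `lam = 1.5`, no randomly drawn distribution should score higher than `tilt(base, rewards, lam)`, whose objective value equals \(\tfrac{1}{\lambda}\log Z\) for the normalizer \(Z\).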

Fine-tuning via flow maps.

In the fine-tuning approach, we adapt the flow map itself. The map sends noise or an intermediate state to a terminal sample \(x_1^\theta = X_{t,1}^\theta(x_t)\). When \(r\) is differentiable, we can backpropagate through this map to get a direct update direction for increasing terminal reward:

\[ \theta \leftarrow \theta + \eta\,\nabla_\theta r(x_1^\theta). \]

This reward update is powerful, but by itself would tend to collapse the model onto a few high-reward samples. To preserve diversity and target the reward tilt, we instead pose reward alignment as a regularized objective over the endpoint distribution \(\rho_\theta\) induced by the fine-tuned map:

\[ \max_\theta\; \mathbb{E}\big[r(x_1^\theta)\big] - D\!\left(\rho_{\theta} \,\middle\|\, \rho_1\right). \]

Here \(D\) keeps the fine-tuned distribution close to the base model \(\rho_1\), while the reward term biases it toward \(\rho_{\mathrm{reward}}\). The flow map is key because it gives one object that produces endpoints, propagates reward gradients, and exposes the intermediate probability path needed for regularization.

For non-differentiable rewards, the same lookahead lets us evaluate candidate endpoints and decide how much to up- or down-weight the corresponding samples.
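Returning to the differentiable case, the collapse caused by the unregularized update \(\theta \leftarrow \theta + \eta\,\nabla_\theta r(x_1^\theta)\) shows up even in a two-parameter toy model. The sketch below assumes a one-step map \(x_1 = \text{shift} + \text{scale}\cdot x_0\) and a reward \(r(x) = -(x - \text{target})^2\); both are hypothetical choices for illustration, and the gradient is written out by hand rather than obtained by backpropagation.

```python
import random

def reward_grad(x, target):
    """Derivative of r(x) = -(x - target)^2 with respect to x."""
    return -2.0 * (x - target)

def finetune(target=3.0, lr=0.1, n_steps=200, n_samples=256, seed=0):
    """Gradient ascent on the mean terminal reward of x_1 = shift + scale * x_0."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    shift, scale = 2.0, 0.5  # assumed starting parameters of the toy map
    for _ in range(n_steps):
        grads = [reward_grad(shift + scale * x0, target) for x0 in noise]
        # Chain rule: dx_1/dshift = 1 and dx_1/dscale = x_0.
        g_shift = sum(grads) / n_samples
        g_scale = sum(g * x0 for g, x0 in zip(grads, noise)) / n_samples
        shift += lr * g_shift
        scale += lr * g_scale
    return shift, scale
```

After training, `shift` sits at the target and `scale` has shrunk to essentially zero: every noise sample maps to the same high-reward point, which is the loss of diversity the regularizer \(D\) is meant to prevent.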

[Figure: fine-tuning schematic with labels flow map \(X_\theta\), endpoint \(x_1 = X_{t,1}(x_t)\), reward \(r(x_1)\), and gradient \(\nabla_\theta\, r(x_1)\).]

Inference-time steering with flow maps.

For deployments where retraining is impractical, we can adapt at inference time. Given an intermediate state along the flow, \(x_t\), the base flow moves samples toward the data distribution. To bias samples toward higher-reward regions, we use the flow map as a lookahead mechanism. It produces a candidate terminal sample \(x_1 = X_{t,1}(x_t)\), and differentiating \(r(x_1)\) with respect to the current state gives \(\nabla_{x_t} r(x_1) = \nabla_{x_t} r(X_{t,1}(x_t))\). Intuitively, this points to directions in the current state that increase expected reward at the end of the flow. This suggests an update like:

\[ x_s \leftarrow \underbrace{X_{t,s}(x_t)}_{\text{base flow}} + \eta\,\underbrace{\nabla_{x_t} r(X_{t,1}(x_t))}_{\text{reward signal}}. \]

The workhorse is this lookahead gradient: it turns a terminal reward into a local steering direction available during inference, so the model can be nudged toward higher-reward endpoints without updating its weights. Our recent works develop several variants of this idea, adding correction and normalization mechanisms that target the reward tilt \(\rho_{\mathrm{reward}}\) and estimate the reward-adapted velocity \(b_t^{\!*}\). Across these variants, the common object doing the work is still the flow-map lookahead and its reward gradient, with different variants emphasizing efficiency, sample diversity, and steering accuracy.
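A minimal sketch of the basic steering update \(x_s \leftarrow X_{t,s}(x_t) + \eta\,\nabla_{x_t} r(X_{t,1}(x_t))\), without any of the correction or normalization mechanisms mentioned above. It assumes a one-dimensional Gaussian toy flow map known in closed form (values `M`, `SIGMA` are illustrative) and a reward \(r(y) = -(y - \text{target})^2\) chosen for illustration. Since the toy map is affine, the lookahead gradient is written analytically instead of using automatic differentiation.

```python
import math

M, SIGMA = 2.0, 0.5  # assumed toy data mean and standard deviation

def std_t(t):
    """Standard deviation of the interpolant I_t for independent Gaussian endpoints."""
    return math.sqrt((1 - t) ** 2 + (t * SIGMA) ** 2)

def flow_map(s, t, x):
    """Closed-form X_{s,t}(x) for the toy problem."""
    return t * M + (std_t(t) / std_t(s)) * (x - s * M)

def lookahead_reward_grad(t, x, target):
    """Gradient of r(X_{t,1}(x)) with respect to x, for r(y) = -(y - target)^2."""
    x1 = flow_map(t, 1.0, x)                # candidate terminal sample
    dx1_dx = std_t(1.0) / std_t(t)          # derivative of the affine map X_{t,1}
    return -2.0 * (x1 - target) * dx1_dx

def sample(x0, target, eta, n_steps=8):
    """Multi-step sampling: base flow-map jump plus eta times the lookahead gradient."""
    x = x0
    for n in range(n_steps):
        t, s = n / n_steps, (n + 1) / n_steps
        x = flow_map(t, s, x) + eta * lookahead_reward_grad(t, x, target)
    return x
```

With `eta=0.0` the sampler reproduces the base endpoint \(m + \sigma x_0\). With a small positive step such as `eta=0.1` the endpoint moves toward `target` while the map itself stays fixed.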

[Figure: inference-time steering schematic with labels flow map \(X_\theta\), endpoint \(x_1 = X_{t,1}(x_t)\), reward \(r(x_1)\), and gradient \(\nabla_{x_t} r(x_1)\).]

Selected works.

  1. M. S. Albergo and E. Vanden-Eijnden. Building Normalizing Flows with Stochastic Interpolants. ICLR 2023. arxiv.org/abs/2209.15571
  2. M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic Interpolants: A Unifying Framework for Flows and Diffusions. JMLR 26(209):1–80, 2025. jmlr.org/papers/v26/23-1605.html
  3. N. M. Boffi, M. S. Albergo, and E. Vanden-Eijnden. Flow Map Matching with Stochastic Interpolants: A Mathematical Framework for Consistency Models. TMLR, 2025. openreview.net/forum?id=cqDH0e6ak2
  4. N. M. Boffi, M. S. Albergo, and E. Vanden-Eijnden. How to Build a Consistency Model: Learning Flow Maps via Self-Distillation. NeurIPS 2025. arxiv.org/abs/2505.18825
  5. P. Holderrieth, D. Chen, L. Eyring, I. Shah, G. Anantharaman, Y. He, Z. Akata, T. Jaakkola, N. M. Boffi, and M. Simchowitz. Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps. 2026. arxiv.org/abs/2602.05993
  6. P. Potaptchik, A. Saravanan, A. Mammadov, A. Prat, M. S. Albergo, and Y. W. Teh. Meta Flow Maps Enable Scalable Reward Alignment. 2026. arxiv.org/abs/2601.14430
  7. P. Potaptchik, C.-K. Lee, and M. S. Albergo. Tilt Matching for Scalable Sampling and Fine-Tuning. ICML 2026. arxiv.org/abs/2512.21829
  8. C. Lee, J. Yoo, M. Agarwal, S. Shah, J. Huang, A. Raghunathan, S. Hong, N. M. Boffi, and J. Kim. Flow Map Language Models: One-Step Language Modeling via Continuous Denoising. 2026. arxiv.org/abs/2602.16813
  9. P. Potaptchik, J. Yim, A. Saravanan, P. Holderrieth, E. Vanden-Eijnden, and M. S. Albergo. Discrete Flow Maps. 2026. arxiv.org/abs/2604.09784
  10. S. Dieleman. Learning the Integral of a Diffusion Model. Blog post, May 2026. sander.ai/2026/05/06/flow-maps.html

End of appendix.

That covers the math behind each block of the pitch deck — flow matching, flow maps, the simplex setup for language, reward fine-tuning through differentiable flow-map rollouts, and inference-time guidance via flow-map gradients.
