
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Abstract
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. Existing distillation methods for dLLMs reduce inference steps within a single architecture; none address cross-architecture knowledge transfer, where teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling an 8B dense teacher and a 16B MoE teacher into a 0.6B student through two heterogeneous pipelines, TIDE outperforms the baseline by an average of 1.53 points across eight benchmarks, with notable gains in code generation: HumanEval reaches 48.78, versus 32.30 for the AR baseline. Code and models will be publicly available.
The TIDE Framework
| Component | Paper | Role | One-line description |
|---|---|---|---|
| TIDAL | §2.1 | Scheduling – when to learn | Dual-axis interpolation along training-progress AND diffusion-timestep axes; deweights the teacher at high masking ratios. Generalizes prior single-axis interpolation to the diffusion setting. |
| CompDemo | §2.2 | Contextual – what to enrich | Two-pass teacher inference with complementary mask splits; every masked position sees ~50% revealed context. |
| Reverse CALM | §2.3 | Output – how to project | Reverse-direction chunk-level binary cross-entropy for cross-tokenizer matching. Bounded gradient and dual-end noise filtering; equivalent to a Bernoulli-KL mode-seeking objective. |
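The three components above can be sketched in a few lines. This is an illustrative reading, not the paper's implementation: the exponential schedule shape (`alpha`, `beta`), the random 50/50 split, and the clamped reverse-BCE form are all assumptions layered on the one-line descriptions.

```python
import math
import torch

def tidal_weight(progress, mask_ratio, alpha=4.0, beta=4.0):
    """TIDAL-style dual-axis schedule (assumed exponential form): the
    distillation weight ramps up with training progress and decays with
    the diffusion masking ratio, deweighting the teacher where its
    predictions are least reliable."""
    w_train = 1.0 - math.exp(-alpha * progress)   # training-progress axis
    w_noise = math.exp(-beta * mask_ratio)        # diffusion-timestep axis
    return w_train * w_noise

def compdemo_splits(mask_positions):
    """CompDemo-style complementary split: masked positions are divided
    into two halves, and each teacher pass masks one half while revealing
    the other, so every masked position sees ~50% more context than in a
    single fully-masked pass."""
    perm = torch.randperm(mask_positions.numel())
    half = perm.numel() // 2
    return mask_positions[perm[:half]], mask_positions[perm[half:]]

def reverse_calm_loss(p_student, p_teacher, eps=1e-4):
    """One plausible reading of Reverse CALM: chunk-level binary
    cross-entropy with the roles inverted, so the student's chunk
    probability weights the teacher's log-likelihood. Clamping both ends
    of the teacher probability stands in for dual-end noise filtering."""
    p_t = p_teacher.clamp(eps, 1.0 - eps)
    return -(p_student * p_t.log()
             + (1.0 - p_student) * (1.0 - p_t).log()).mean()
```

Under this reading, the gradient of the loss with respect to `p_student` is the (negated) log-odds of the clamped teacher probability, which is constant in the student and therefore bounded, consistent with the bounded-gradient claim in the table.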
Main Results
Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. Bold: best among dLLM models; italic: second best.
| Benchmark | AR (Qwen3-0.6B) | BD3LM (Shared-Tok.) | KL (Shared-Tok.) | TIDE-Cross (Shared-Tok.) | TIDE-Shared (Shared-Tok.) | CALM (Cross-Tok.) | TIDE-Shared (Cross-Tok.) | TIDE-Cross (Cross-Tok.) |
|---|---|---|---|---|---|---|---|---|
| GSM8K | 59.60 | 45.56 | 43.97 | 45.03 | 48.98 | 48.60 | *49.89* | **52.24** |
| MATH | 32.40 | 13.08 | 9.40 | 9.76 | 11.16 | *13.14* | 12.98 | **13.20** |
| BBH | 41.50 | 26.32 | 25.79 | 26.00 | 26.79 | 24.21 | *26.85* | **27.37** |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 12.88 | *14.48* | 13.47 | 14.02 | **14.52** |
| HellaSwag | 47.40 | 39.28 | 39.78 | 39.50 | **40.50** | *40.42* | 39.57 | 39.88 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.09 | **39.92** | 39.42 | 39.54 | *39.59* |
| HumanEval | 32.30 | 46.34 | 41.46 | 42.68 | *48.78* | 43.90 | **49.39** | 48.17 |
| MBPP | 36.60 | 37.80 | 31.20 | 31.40 | 37.80 | 34.80 | *38.40* | **38.60** |
| Avg | 40.91 | 32.67 | 30.55 | 30.79 | 33.55 | 32.25 | *33.83* | **34.20** |
Case Studies
Dark Knowledge Transfer
Within the shared-tokenizer pipeline, TIDE-Shared reduces KL divergence relative to the WeDLM teacher by 46% on GSM8K, from 12.44 to 6.69, confirming that the distilled student inherits the teacher's prediction distribution.
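The KL figures above measure distributional match, not just accuracy. A minimal sketch of one standard way to compute such a gap (the tensor shapes and averaging convention here are assumptions, not the paper's exact evaluation protocol):

```python
import torch
import torch.nn.functional as F

def mean_token_kl(teacher_logits, student_logits):
    """Average KL(teacher || student) over token positions: small values
    mean the student reproduces the teacher's full predictive
    distribution (its 'dark knowledge'), not merely its argmax."""
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    # F.kl_div takes the student log-probs as input and, with
    # log_target=True, the teacher log-probs as target.
    return F.kl_div(log_p_s, log_p_t, log_target=True, reduction="batchmean")
```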
Qualitative Error Analysis
Each pipeline corrects errors that the non-distilled baseline makes, and each teacher imparts distinct knowledge: LLaDA2 (TIDE-Cross) strengthens multi-step arithmetic and type-level code awareness, while WeDLM (TIDE-Shared) strengthens proportional reasoning and algorithmic inversion.
| Case & Error Type | Base | TIDE-Cross | TIDE-Shared |
|---|---|---|---|
| Math Reasoning (GSM8K) | | | |
| #39: rate-distance confusion in multi-step problem | ✗ | ✓ | ✗ |
| #77: proportional relationships reversed | ✗ | ✗ | ✓ |
| Code Generation (HumanEval) | | | |
| #11: bitwise XOR on string characters | ✗ | ✓ | ✗ |
| #50: decode calls encode without reversal | ✗ | ✗ | ✓ |
GSM8K #39 – Multi-step distance allocation
Prompt. Dana can run four times faster than she walks; skip is half as fast as run; she skips at 3 mph. In six hours, one-third running and two-thirds walking, how many miles does she cover? (Ground truth: 18.)
Baseline (no distill) – answer 16 ✗. Confuses time allocation with speed (treats 6/3 = 2 as a speed instead of an hour count).
TIDE-Cross (LLaDA2) – answer 18 ✓. Correctly computes allocated hours and per-segment distances.
TIDE-Shared (WeDLM) – answer 54 ✗. Allocates time correctly but then multiplies each speed by the full 6 hours instead of the allocated 2.
Raw outputs

```text
Baseline:    Dana skips at 3 mph, so she runs at 3*2 = 6 mph. Running 1/3 of
             6 hours: 6/3 = 2 mph. Walking 2/3: 6/3 = 2 mph. Total: 2*2 + 4*2
             = 16. #### 16
TIDE-Cross:  3*2 = 6 mph running. 6/3 = 2 hours running. 6/3 = 2 hours
             walking. 2*6 = 12 mi running. 2*3 = 6 mi walking. 12 + 6 = 18.
             #### 18
TIDE-Shared: 6 mph running, 6/3 = 2 hours running, 6/3 = 2 hours walking,
             total 4 hours. 6*6 = 36 mi running. 3*6 = 18 mi skipping.
             36 + 18 = 54. #### 54
```
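The ground-truth chain for #39 is easy to check directly (variable names are mine; the numbers follow the prompt):

```python
skip_mph = 3.0                        # given: Dana skips at 3 mph
run_mph = 2 * skip_mph                # skipping is half as fast as running -> 6 mph
walk_mph = run_mph / 4                # running is four times walking -> 1.5 mph

total_hours = 6.0
run_hours = total_hours / 3           # one-third of the time running -> 2 h
walk_hours = total_hours - run_hours  # the remaining two-thirds -> 4 h

miles = run_mph * run_hours + walk_mph * walk_hours
print(miles)                          # 18.0, the ground-truth answer
```

The baseline's 16 and TIDE-Shared's 54 each come from mangling one of these two allocation steps.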
HumanEval #50 – decode_shift reverses encode_shift
Prompt. encode_shift shifts each character by +5 in the alphabet. Implement decode_shift to reverse the encoding.
Baseline (no distill) – fail ✗. return encode_shift(s) applies +5 again instead of reversing.
TIDE-Cross (LLaDA2) – fail ✗. Same erroneous code as the baseline, plus a long explanation that incorrectly justifies it as valid decoding.
TIDE-Shared (WeDLM) – pass ✓. Correctly subtracts 5 modulo 26:

```python
def decode_shift(s):
    # Undo the +5 shift: subtract 5 and wrap within the 26-letter alphabet.
    return "".join(
        [chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a"))
         for ch in s])
```
BibTeX
@misc{zhang2026turningtidecrossarchitecturedistillation,
title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
year={2026},
eprint={2604.26951},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.26951},
}