
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

1 Peking University · 2 Zhejiang University
* Corresponding author
TIDE: cross-architecture distillation overview

Distilling 8B dense and 16B MoE diffusion-LM teachers into a 0.6B student: +1.53 average gain across 8 benchmarks, +16.48 on HumanEval, 22× memory reduction, 5.2× faster inference.

Abstract

Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, where teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. A 0.6B student distilled from 8B dense and 16B MoE teachers via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, with notable gains in code generation: HumanEval reaches 48.78, compared to 32.30 for the AR baseline. Code and models will be publicly available.

The TIDE Framework

TIDE framework: TIDAL + CompDemo + Reverse CALM
Component overview (all three components are sketched below):

TIDAL (§2.1) | Scheduling: when to learn
  Dual-axis interpolation along the training-progress and diffusion-timestep
  axes; deweights the teacher at high masking ratios. Generalizes prior
  single-axis interpolation to the diffusion setting.

CompDemo (§2.2) | Contextual: what to enrich
  Two-pass teacher inference with complementary mask splits; every masked
  position sees ~50% revealed context.

Reverse CALM (§2.3) | Output: how to project
  Reverse-direction chunk-level binary cross-entropy for cross-tokenizer
  matching; bounded gradients and dual-end noise filtering, equivalent to a
  mode-seeking Bernoulli-KL objective.
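
The paper's exact formulations live in §2.1-§2.3; as a rough illustration only, the sketch below shows one way the three ideas could look in PyTorch. All function names, decay shapes, and hyperparameters here are assumptions for exposition, not the paper's implementation.

import torch

def tidal_weight(progress: float, mask_ratio: float,
                 a: float = 1.0, b: float = 2.0) -> float:
    # Hypothetical dual-axis distillation weight. `progress` is training
    # progress in [0, 1] (step / total_steps); `mask_ratio` is the fraction
    # of masked tokens at the sampled diffusion timestep. The weight decays
    # at high masking ratios, where the teacher is less reliable; the decay
    # along the training axis and the power-law shapes a, b are placeholder
    # choices, not the paper's schedule.
    return (1.0 - progress) ** a * (1.0 - mask_ratio) ** b

def compdemo_split(mask: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Split the masked positions of one noisy sequence into two complementary
    # halves (a guess at CompDemo's mechanics from the description above).
    # `mask` is a bool tensor [seq_len], True where a token is masked.
    coin = torch.rand(mask.shape) < 0.5
    half_a = mask & coin     # stays masked in teacher pass A
    half_b = mask & ~coin    # stays masked in teacher pass B
    return half_a, half_b

def reverse_calm_loss(p_student: torch.Tensor, p_teacher: torch.Tensor,
                      eps: float = 1e-6) -> torch.Tensor:
    # Bernoulli KL(student || teacher) over chunk-level probabilities, one
    # reading of the note above that Reverse CALM is equivalent to a
    # mode-seeking Bernoulli-KL objective. Clamping stands in for the
    # paper's bounded-gradient property.
    ps = p_student.clamp(eps, 1.0 - eps)
    pt = p_teacher.clamp(eps, 1.0 - eps)
    return (ps * torch.log(ps / pt)
            + (1.0 - ps) * torch.log((1.0 - ps) / (1.0 - pt))).mean()

In CompDemo's two passes, half_b would be revealed with ground-truth tokens while the teacher predicts half_a, and vice versa, so every masked position sees roughly half of the masked context restored. Per the note under Main Results, the final loss would also add a cross-entropy term, e.g. loss = tidal_weight(p, m) * distill_loss + ce_loss.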

Main Results

Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. * marks the best among dLLM models; † the second best.

Benchmark     Qwen3-0.6B      Shared-Tokenizer                  Cross-Tokenizer
              AR      BD3LM   KL      TIDE-Cross  TIDE-Shared   CALM    TIDE-Shared  TIDE-Cross
GSM8K         59.60   45.56   43.97   45.03       48.98         48.60   49.89†       52.24*
MATH          32.40   13.08    9.40    9.76       11.16         13.14†  12.98        13.20*
BBH           41.50   26.32   25.79   26.00       26.79         24.21   26.85†       27.37*
MMLU-Pro      24.70   13.80   13.19   12.88       14.48†        13.47   14.02        14.52*
HellaSwag     47.40   39.28   39.78   39.50       40.50*        40.42†  39.57        39.88
MMLU          52.80   39.15   39.57   39.09       39.92*        39.42   39.54        39.59†
HumanEval     32.30   46.34   41.46   42.68       48.78†        43.90   49.39*       48.17
MBPP          36.60   37.80   31.20   31.40       37.80         34.80   38.40†       38.60*
Avg           40.91   32.67   30.55   30.79       33.55         32.25   33.83†       34.20*

Case Studies

Dark Knowledge Transfer

Within the shared-tokenizer pipeline, TIDE-Shared cuts the KL divergence to the WeDLM teacher on GSM8K by 46%, from 12.44 to 6.69, indicating that the distilled student inherits a substantial share of the teacher's prediction distribution, i.e. its dark knowledge.

KL divergence to the WeDLM teacher on GSM8K (TIDE-Shared 6.69 vs. baseline 12.44)
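
As a minimal sketch of how such a diagnostic can be computed, assuming teacher and student share a tokenizer (as in TIDE-Shared); the paper's exact reduction over tokens and examples is not stated here, so the averaging convention below is a placeholder:

import torch
import torch.nn.functional as F

def mean_token_kl(teacher_logits: torch.Tensor,
                  student_logits: torch.Tensor) -> torch.Tensor:
    # Average per-token KL(teacher || student) over [batch, seq_len, vocab]
    # logits. A shared tokenizer/vocabulary is assumed.
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # i.e. KL(teacher || student) here.
    per_token = F.kl_div(s_logp, t_logp, log_target=True,
                         reduction="none").sum(dim=-1)
    return per_token.mean()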

Qualitative Error Analysis

Each pipeline corrects errors that the non-distilled baseline makes, and each teacher imparts distinct knowledge: LLaDA2 (TIDE-Cross) strengthens multi-step arithmetic and type-level code awareness, while WeDLM (TIDE-Shared) strengthens proportional reasoning and algorithmic inversion.

Case & Error Type                                      Base   TIDE-Cross   TIDE-Shared
Math Reasoning (GSM8K)
  #39: rate-distance confusion in multi-step problem    ✗       ✓            ✗
  #77: proportional relationships reversed              ✗       ✗            ✓
Code Generation (HumanEval)
  #11: bitwise XOR on string characters                 ✗       ✓            ✗
  #50: decode calls encode without reversal             ✗       ✗            ✓

GSM8K #39: Multi-step distance allocation

Prompt. Dana can run four times faster than she walks, and she skips at half her running speed, 3 mph. If she spends one-third of six hours running and two-thirds walking, how many miles does she cover? (Ground truth: 18.)

Baseline (no distill): answer 16. Confuses time allocation with speed (treats 6/3 = 2 as a speed instead of an hour count).

TIDE-Cross (LLaDA2): answer 18 ✓. Correctly computes allocated hours and per-segment distances.

TIDE-Shared (WeDLM): answer 54. Allocates time correctly but then multiplies each speed by the full 6 hours instead of the allocated segment hours.

Raw outputs
Baseline:    Dana skips at 3 mph, so she runs at 3*2 = 6 mph. Running 1/3 of
             6 hours: 6/3 = 2 mph. Walking 2/3: 6/3 = 2 mph. Total: 2*2 + 4*2
             = 16. #### 16

TIDE-Cross:  3*2 = 6 mph running. 6/3 = 2 hours running. 6/3 = 2 hours
             walking. 2*6 = 12 mi running. 2*3 = 6 mi walking. 12 + 6 = 18.
             #### 18

TIDE-Shared: 6 mph running, 6/3 = 2 hours running, 6/3 = 2 hours walking,
             total 4 hours. 6*6 = 36 mi running. 3*6 = 18 mi skipping.
             36 + 18 = 54. #### 54
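
For reference, the ground-truth arithmetic as a quick check (all values taken from the prompt above):

skip = 3.0               # mph, given
run = 2 * skip           # skipping is half as fast as running -> 6 mph
walk = run / 4           # running is four times walking       -> 1.5 mph

hours = 6.0
run_h, walk_h = hours / 3, 2 * hours / 3   # 2 h running, 4 h walking
print(run * run_h + walk * walk_h)         # 6*2 + 1.5*4 = 18.0 miles

# TIDE-Shared's failure mode: multiplying by the full 6 hours (and using
# the skipping speed for walking) gives 6*6 + 3*6 = 54, as in its output.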

HumanEval #50: decode_shift reverses encode_shift

Prompt. encode_shift shifts each character by +5 in the alphabet. Implement decode_shift to reverse the encoding.

Baseline (no distill): fail. return encode_shift(s) applies +5 again instead of reversing.

TIDE-Cross (LLaDA2): fail. Same erroneous code as the baseline, plus a long explanation that incorrectly justifies it as valid decoding.

TIDE-Shared (WeDLM): pass ✓. Correctly subtracts 5 modulo 26:

return "".join(
    [chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a"))
     for ch in s])
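
For completeness, here is a self-contained round-trip check. The encode_shift body is reconstructed from the prompt's description (+5 shift wrapping within a-z), so treat it as illustrative rather than a verbatim copy of the task:

def encode_shift(s: str) -> str:
    # Shift each lowercase letter forward by 5, wrapping within a-z.
    return "".join(chr(((ord(ch) + 5 - ord("a")) % 26) + ord("a")) for ch in s)

def decode_shift(s: str) -> str:
    # TIDE-Shared's inverse: shift back by 5 with the same wrap-around.
    return "".join(chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a")) for ch in s)

assert decode_shift(encode_shift("hello")) == "hello"
# The baseline instead returned encode_shift(s), shifting by +5 again:
assert encode_shift(encode_shift("hello")) != "hello"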

BibTeX

@misc{zhang2026turningtidecrossarchitecturedistillation,
      title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
      author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
      year={2026},
      eprint={2604.26951},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.26951},
}