
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Abstract
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. Existing distillation methods for dLLMs reduce inference steps within a single architecture; none address cross-architecture knowledge transfer, where teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling an 8B dense teacher and a 16B MoE teacher into a 0.6B student through two heterogeneous pipelines, TIDE outperforms the baseline by an average of 1.53 points across eight benchmarks, with notable gains in code generation: HumanEval reaches 48.78, versus 32.30 for the AR baseline. Code and models will be publicly available.
The TIDE Framework
| Component | Paper | Role | One-line description |
|---|---|---|---|
| TIDAL | §2.1 | Scheduling – when to learn | Dual-axis interpolation along training-progress AND diffusion-timestep axes; deweights the teacher at high masking ratios. Generalizes prior single-axis interpolation to the diffusion setting. |
| CompDemo | §2.2 | Contextual – what to enrich | Two-pass teacher inference with complementary mask splits; every masked position sees ~50% revealed context. |
| Reverse CALM | §2.3 | Output – how to project | Reverse-direction chunk-level binary cross-entropy for cross-tokenizer matching. Bounded gradient and dual-end noise filtering; equivalent to a Bernoulli-KL mode-seeking objective. |
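The three components above can be sketched in a few lines. This is an illustrative reading, not the paper's implementation: the exponential schedule shape (`alpha`, `beta`), the random 50/50 split, and the clamped reverse-BCE form are all assumptions layered on the one-line descriptions.

```python
import math
import torch

def tidal_weight(progress, mask_ratio, alpha=4.0, beta=4.0):
    """TIDAL-style dual-axis schedule (assumed exponential form): the
    distillation weight ramps up with training progress and decays with
    the diffusion masking ratio, deweighting the teacher where its
    predictions are least reliable."""
    w_train = 1.0 - math.exp(-alpha * progress)   # training-progress axis
    w_noise = math.exp(-beta * mask_ratio)        # diffusion-timestep axis
    return w_train * w_noise

def compdemo_splits(mask_positions):
    """CompDemo-style complementary split: masked positions are divided
    into two halves, and each teacher pass masks one half while revealing
    the other, so every masked position sees ~50% more context than in a
    single fully-masked pass."""
    perm = torch.randperm(mask_positions.numel())
    half = perm.numel() // 2
    return mask_positions[perm[:half]], mask_positions[perm[half:]]

def reverse_calm_loss(p_student, p_teacher, eps=1e-4):
    """One plausible reading of Reverse CALM: chunk-level binary
    cross-entropy with the roles inverted, so the student's chunk
    probability weights the teacher's log-likelihood. Clamping both ends
    of the teacher probability stands in for dual-end noise filtering."""
    p_t = p_teacher.clamp(eps, 1.0 - eps)
    return -(p_student * p_t.log()
             + (1.0 - p_student) * (1.0 - p_t).log()).mean()
```

Under this reading, the gradient of the loss with respect to `p_student` is the (negated) log-odds of the clamped teacher probability, which is constant in the student and therefore bounded, consistent with the bounded-gradient claim in the table.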
Main Results
Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. Bold: best among dLLM models; italic: second best.
| Benchmark | AR (Qwen3-0.6B) | BD3LM (Shared-Tok.) | KL (Shared-Tok.) | TIDE-Cross (Shared-Tok.) | TIDE-Shared (Shared-Tok.) | CALM (Cross-Tok.) | TIDE-Shared (Cross-Tok.) | TIDE-Cross (Cross-Tok.) |
|---|---|---|---|---|---|---|---|---|
| GSM8K | 59.60 | 45.56 | 43.97 | 45.03 | 48.98 | 48.60 | *49.89* | **52.24** |
| MATH | 32.40 | 13.08 | 9.40 | 9.76 | 11.16 | *13.14* | 12.98 | **13.20** |
| BBH | 41.50 | 26.32 | 25.79 | 26.00 | 26.79 | 24.21 | *26.85* | **27.37** |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 12.88 | *14.48* | 13.47 | 14.02 | **14.52** |
| HellaSwag | 47.40 | 39.28 | 39.78 | 39.50 | **40.50** | *40.42* | 39.57 | 39.88 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.09 | **39.92** | 39.42 | 39.54 | *39.59* |
| HumanEval | 32.30 | 46.34 | 41.46 | 42.68 | *48.78* | 43.90 | **49.39** | 48.17 |
| MBPP | 36.60 | 37.80 | 31.20 | 31.40 | 37.80 | 34.80 | *38.40* | **38.60** |
| Avg | 40.91 | 32.67 | 30.55 | 30.79 | 33.55 | 32.25 | *33.83* | **34.20** |
Case Studies
Dark Knowledge Transfer
Within the shared-tokenizer pipeline, TIDE-Shared reduces KL divergence relative to the WeDLM teacher by 46% on GSM8K, from 12.44 to 6.69, confirming that the distilled student inherits the teacher's prediction distribution.
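The KL figures above measure distributional match, not just accuracy. A minimal sketch of one standard way to compute such a gap (the tensor shapes and averaging convention here are assumptions, not the paper's exact evaluation protocol):

```python
import torch
import torch.nn.functional as F

def mean_token_kl(teacher_logits, student_logits):
    """Average KL(teacher || student) over token positions: small values
    mean the student reproduces the teacher's full predictive
    distribution (its 'dark knowledge'), not merely its argmax."""
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    # F.kl_div takes the student log-probs as input and, with
    # log_target=True, the teacher log-probs as target.
    return F.kl_div(log_p_s, log_p_t, log_target=True, reduction="batchmean")
```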
Qualitative Error Analysis
Each pipeline corrects errors that the non-distilled baseline makes, and each teacher imparts distinct knowledge: LLaDA2 (TIDE-Cross) strengthens multi-step arithmetic and type-level code awareness, while WeDLM (TIDE-Shared) strengthens proportional reasoning and algorithmic inversion.
| Case & Error Type | Base | TIDE-Cross | TIDE-Shared |
|---|---|---|---|
| Math Reasoning (GSM8K) | | | |
| #39: rate-distance confusion in multi-step problem | ✗ | ✓ | ✗ |
| #77: proportional relationships reversed | ✗ | ✗ | ✓ |
| Code Generation (HumanEval) | | | |
| #11: bitwise XOR on string characters | ✗ | ✓ | ✗ |
| #50: decode calls encode without reversal | ✗ | ✗ | ✓ |
GSM8K #39 – Multi-step distance allocation
Prompt. Dana can run four times faster than she walks; skip is half as fast as run; she skips at 3 mph. In six hours, one-third running and two-thirds walking, how many miles does she cover? (Ground truth: 18.)
Baseline (no distill) – answer 16 ✗. Confuses time allocation with speed (treats 6/3 = 2 as a speed instead of an hour count).
TIDE-Cross (LLaDA2) – answer 18 ✓. Correctly computes allocated hours and per-segment distances.
TIDE-Shared (WeDLM) – answer 54 ✗. Allocates time correctly but then multiplies each speed by the full 6 hours instead of the allocated 2.
Raw outputs

```text
Baseline:    Dana skips at 3 mph, so she runs at 3*2 = 6 mph. Running 1/3 of
             6 hours: 6/3 = 2 mph. Walking 2/3: 6/3 = 2 mph. Total: 2*2 + 4*2
             = 16. #### 16
TIDE-Cross:  3*2 = 6 mph running. 6/3 = 2 hours running. 6/3 = 2 hours
             walking. 2*6 = 12 mi running. 2*3 = 6 mi walking. 12 + 6 = 18.
             #### 18
TIDE-Shared: 6 mph running, 6/3 = 2 hours running, 6/3 = 2 hours walking,
             total 4 hours. 6*6 = 36 mi running. 3*6 = 18 mi skipping.
             36 + 18 = 54. #### 54
```
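The ground-truth chain for #39 is easy to check directly (variable names are mine; the numbers follow the prompt):

```python
skip_mph = 3.0                        # given: Dana skips at 3 mph
run_mph = 2 * skip_mph                # skipping is half as fast as running -> 6 mph
walk_mph = run_mph / 4                # running is four times walking -> 1.5 mph

total_hours = 6.0
run_hours = total_hours / 3           # one-third of the time running -> 2 h
walk_hours = total_hours - run_hours  # the remaining two-thirds -> 4 h

miles = run_mph * run_hours + walk_mph * walk_hours
print(miles)                          # 18.0, the ground-truth answer
```

The baseline's 16 and TIDE-Shared's 54 each come from mangling one of these two allocation steps.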
HumanEval #50 – decode_shift reverses encode_shift
Prompt. encode_shift shifts each character by +5 in the alphabet. Implement decode_shift to reverse the encoding.
Baseline (no distill) – fail ✗. return encode_shift(s) applies +5 again instead of reversing.
TIDE-Cross (LLaDA2) – fail ✗. Same erroneous code as the baseline, plus a long explanation that incorrectly justifies it as valid decoding.
TIDE-Shared (WeDLM) – pass ✓. Correctly subtracts 5 modulo 26:

```python
def decode_shift(s):
    # Undo the +5 shift: subtract 5 and wrap within the 26-letter alphabet.
    return "".join(
        [chr(((ord(ch) - 5 - ord("a")) % 26) + ord("a"))
         for ch in s])
```
BibTeX
@misc{zhang2026turningtidecrossarchitecturedistillation,
title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
year={2026},
eprint={2604.26951},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.26951},
}