I-DLM: A Diffusion Language Model That Can Check Itself

The Problem with Current Diffusion Language Models

Diffusion language models (DLMs) make an appealing promise: generate the whole output from noise and refine it step by step, instead of generating tokens left to right as autoregressive (AR) models do. In theory this allows full parallelization and a significant speedup.

In practice? Current DLMs are consistently worse than AR models in quality.

The authors argue that the cause is a lack of introspective consistency: AR models "agree" with what they generate because generation and verification happen in the same forward pass. DLMs do not: they learn to denoise but not to introspect.

Three specific bottlenecks:

Low introspective consistency: DLMs generate tokens but never verify them afterwards. SDAR scores 0.699, I-DLM scores 0.984.
Compute inefficiency: earlier verification approaches cost about 7.8x overhead, versus about 2.5x for I-DLM.
Infrastructure mismatch: earlier DLMs cannot take advantage of continuous batching and paged KV cache.

The Solution: Introspective Strided Decoding (ISD)

Introspective-Consistency Training

Converting a pretrained AR model into an I-DLM:

Input: a fully masked sequence concatenated with the clean sequence, [x_t | x_0]
Attention: strict causal masking on all positions
Loss: auto-balanced cross-entropy over both masked and clean positions
Scale: 4.5B tokens, 8 H100 GPUs, 2 epochs with a stride curriculum (N=2, then N=3)

Introspective Strided Decoding

In each forward pass:

MASK positions: propose N new tokens (distribution q)
Clean positions: verify the previously proposed tokens (anchor distribution p)
Acceptance: min(1, p(x)/q(x)), which keeps the output equivalent to the AR distribution
Stride N=4 gives TPF ≈ 2.96, i.e. roughly a 3x wall-clock speedup in the memory-bound regime
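The acceptance rule above has the same form as the one used in speculative sampling, so it can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, and stopping at the first rejected token is an assumption borrowed from standard speculative decoding. Standard speculative sampling also resamples a rejected token from the normalized residual max(0, p - q) to keep the output distribution exact; that step is omitted here.

```python
import random


def acceptance_probability(p_x: float, q_x: float) -> float:
    """Return min(1, p(x)/q(x)) for one proposed token x.

    p_x: probability of x under the anchor (verifying) distribution p.
    q_x: probability of x under the proposal distribution q.
    """
    if q_x <= 0.0:
        raise ValueError("q(x) must be positive for a token that was proposed")
    return min(1.0, p_x / q_x)


def count_accepted(p_probs, q_probs, rng=random):
    """Count how many proposed tokens are kept in one verification step.

    p_probs[i] and q_probs[i] are the anchor and proposal probabilities
    of the i-th proposed token. Assumption (not stated in the source):
    verification stops at the first rejection, as in speculative decoding.
    """
    accepted = 0
    for p_x, q_x in zip(p_probs, q_probs):
        if rng.random() < acceptance_probability(p_x, q_x):
            accepted += 1
        else:
            break
    return accepted
```

When p and q agree on every proposed token, all N proposals are accepted; the more the proposal distribution over-estimates a token relative to the anchor, the more likely that token is rejected.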

AR-Compatible Serving

Strict causal attention allows direct integration into SGLang, with no custom infrastructure required:

Paged KV cache and continuous batching
CUDA graph capture (+42-76% throughput)
Stationary-batch decode-loop scheduling (+11-21%)

Results

I-DLM-8B is reported as the first DLM to match the quality of an AR model of the same size:

Benchmark	Qwen3-8B (AR)	LLaDA-2.1-mini (DLM 16B)	I-DLM-8B
AIME-24	73.1	43.3	69.6
MATH-500	95.8	85.0	96.8
HumanEval	95.1	86.0	93.3
MMLU	83.5	74.5	82.4
LiveCodeBench-v6	50.3	30.4	45.7

I-DLM-8B beats LLaDA-2.1-mini (16B, twice the parameters) by about 26 points on AIME-24 and about 15 points on LiveCodeBench-v6.
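The gaps quoted above can be recomputed directly from the results table. The snippet below is only a convenience for checking the arithmetic; the scores are copied from the table, while the dictionary layout and the `gap` helper are illustrative choices.

```python
# Scores copied from the results table (higher is better).
scores = {
    "AIME-24": {"Qwen3-8B": 73.1, "LLaDA-2.1-mini": 43.3, "I-DLM-8B": 69.6},
    "MATH-500": {"Qwen3-8B": 95.8, "LLaDA-2.1-mini": 85.0, "I-DLM-8B": 96.8},
    "HumanEval": {"Qwen3-8B": 95.1, "LLaDA-2.1-mini": 86.0, "I-DLM-8B": 93.3},
    "MMLU": {"Qwen3-8B": 83.5, "LLaDA-2.1-mini": 74.5, "I-DLM-8B": 82.4},
    "LiveCodeBench-v6": {"Qwen3-8B": 50.3, "LLaDA-2.1-mini": 30.4, "I-DLM-8B": 45.7},
}


def gap(benchmark: str, model: str, baseline: str) -> float:
    """Score difference (model minus baseline), rounded to one decimal."""
    return round(scores[benchmark][model] - scores[benchmark][baseline], 1)


for name in scores:
    vs_dlm = gap(name, "I-DLM-8B", "LLaDA-2.1-mini")
    vs_ar = gap(name, "I-DLM-8B", "Qwen3-8B")
    print(f"{name}: {vs_dlm:+.1f} vs LLaDA-2.1-mini, {vs_ar:+.1f} vs Qwen3-8B")
```

Against the AR baseline, the same calculation shows I-DLM-8B ahead on MATH-500 and behind by 1.1 to 4.6 points on the other four benchmarks.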

Throughput compared with LLaDA-2.1-mini at batch size C=64: 2.9-4.1x higher.

Quick Start

# Launch server
python -m sglang.launch_server \
    --model-path yifanyu/I-DLM-8B \
    --trust-remote-code --tp-size 1 --dtype bfloat16 \
    --attention-backend flashinfer --dllm-algorithm IDLMBlockN \
    --dllm-algorithm-config inference/configs/idlm_blockN4_config.yaml \
    --port 30000

# Generate
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Prove sqrt(2) is irrational"}], "max_tokens": 4096}'
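The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above (same URL, headers, and payload); the helper names `build_request` and `chat` are hypothetical, and the response parsing assumes the usual OpenAI-style shape `choices[0].message.content`, which the curl example does not show.

```python
import json
import urllib.request


def build_request(prompt, max_tokens=4096, base_url="http://localhost:30000"):
    """Build the POST request equivalent to the curl command above."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt to a running server and return the reply text.

    Assumption: the response follows the OpenAI chat-completions layout.
    """
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from the launch step running, `print(chat("Prove sqrt(2) is irrational"))` would issue the same request as the curl example.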

Models

Model	Base	Notes
`yifanyu/I-DLM-8B`	Qwen3-8B	Main model, matches AR quality
`yifanyu/I-DLM-32B`	Qwen3-32B	Outperforms LLaDA-2.1-flash (100B)
`yifanyu/I-DLM-8B-lora-r128`	Qwen3-8B	Lossless variant (output bit-for-bit identical to the AR model)

All models require trust_remote_code=True.

Why Developers Should Care

If inference latency is your bottleneck, especially at high concurrency (large batch sizes), I-DLM is a direction worth experimenting with. It needs no custom infrastructure: it plugs directly into SGLang the same way AR models do. The lossless variant (R-ISD with gated LoRA) produces output that is bit-for-bit identical to the base AR model, so there is no quality tradeoff.
