BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

*Equal contribution Corresponding author
Xi'an Jiaotong University

Abstract

While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated strong reasoning capabilities in robotic tasks, their sequential decoding often incurs high inference latency and may amplify error accumulation during long-horizon execution. Discrete Diffusion Language Models (dLLMs) offer a promising alternative through parallel token refinement, but their practical deployment in robotics remains limited by the many denoising function evaluations (NFEs) they require and by the difficulty of applying standard KV caching to bidirectional iterative decoding. To bridge these paradigms, we propose BlockVLA, a framework that adapts pretrained AR backbones into an efficient discrete diffusion policy through a block diffusion paradigm. BlockVLA maintains autoregressive dependencies at the block level while enabling parallel denoising within each block, thereby combining global causal coherence with local parallel generation. This design enables prefix KV-cache reuse across completed blocks, reduces the effective cost of iterative denoising, and provides a smoother transition from AR pretraining to diffusion-based policy fine-tuning. We conduct extensive evaluations on the LIBERO and SimplerEnv benchmarks. BlockVLA achieves a 3.3× inference speedup over standard discrete diffusion baselines. It also trains more efficiently: success rates converge substantially faster than those of the baselines, and the gain is most pronounced on complex, long-horizon tasks in the early stages of training. This work establishes block diffusion as a robust bridge between large-scale pretrained AR models and efficient, high-frequency real-time robotic control.

Overview

Overall architecture of the proposed BlockVLA framework.
BlockVLA adapts a pretrained autoregressive VLA into an efficient diffusion-based policy. Instead of moving to fully bidirectional denoising, BlockVLA uses blockwise causal masking to preserve global causal dependencies while enabling parallel denoising within each block. During inference, this block-level autoregressive flow supports VL prefix caching and KV-Cache reuse, reducing the latency of discrete diffusion policies for real-world robotic control. The key distinction from standard Discrete Diffusion is summarized below.
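To make blockwise causal masking concrete, here is a minimal sketch of such an attention mask in plain Python. It is not the authors' implementation: the function name `block_causal_mask`, the token layout (a vision-language prefix followed by equally sized action blocks), and the choice to treat the whole prefix as a single fully visible block are all assumptions made for illustration.

```python
def block_causal_mask(prefix_len, num_blocks, block_size):
    """Return mask[i][j] = True when query position i may attend to key position j.

    Assumed layout (illustration only): `prefix_len` vision-language tokens
    followed by `num_blocks` action blocks of `block_size` tokens each.
    """
    total = prefix_len + num_blocks * block_size

    def block_id(pos):
        # Assumption: the whole VL prefix counts as block 0;
        # action blocks are numbered 1, 2, ...
        if pos < prefix_len:
            return 0
        return 1 + (pos - prefix_len) // block_size

    # A token sees every token in its own block (parallel denoising)
    # and in all earlier blocks (block-level causality), never later blocks.
    return [[block_id(j) <= block_id(i) for j in range(total)]
            for i in range(total)]


mask = block_causal_mask(prefix_len=2, num_blocks=2, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Because no row ever attends to a later block, the keys and values of a completed block do not change when later blocks are denoised, which is the property that makes cache reuse possible.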

Discrete Diffusion (Baseline)

  • Parallel Decoding
  • KV-Cache Reuse
  • Efficient Seq Scaling
Introducing: 1. VL Prefix Caching, 2. Blockwise Causal Masking

Block Diffusion (Ours)

  • Parallel Decoding
  • KV-Cache Reuse
  • Efficient Seq Scaling
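
The interplay of parallel decoding and cache reuse can be sketched as a toy decoding loop. Everything here is hypothetical: `toy_denoiser` is a random stand-in for the policy network, the confidence-based unmasking schedule is one common choice for masked discrete diffusion and is not confirmed by the source, and the `context` list only stands in for a real KV cache.

```python
import math
import random

MASK = None  # placeholder for a not-yet-decoded action token


def toy_denoiser(context, block, rng, vocab_size=8):
    """Hypothetical stand-in for the policy network.

    Returns one (token, confidence) guess per position in the block.
    A real model would attend to `context` through cached keys/values.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in block]


def decode(num_blocks, block_size, steps_per_block, seed=0):
    rng = random.Random(seed)
    context = []  # finished tokens; stands in for the reusable prefix KV cache
    nfe = 0       # number of denoiser calls
    per_step = math.ceil(block_size / steps_per_block)
    for _ in range(num_blocks):
        block = [MASK] * block_size
        while MASK in block:
            guesses = toy_denoiser(context, block, rng)
            nfe += 1
            # Rank still-masked positions by confidence, commit the top ones.
            masked = [i for i, tok in enumerate(block) if tok is MASK]
            masked.sort(key=lambda i: guesses[i][1], reverse=True)
            for i in masked[:per_step]:
                block[i] = guesses[i][0]
        context.extend(block)  # the block is final, so its cache entries are fixed
    return context, nfe


tokens, nfe = decode(num_blocks=3, block_size=4, steps_per_block=2)
print(len(tokens), nfe)  # 12 6
```

In this sketch each denoiser call only has to produce outputs for the current block, while earlier blocks are read from `context` and never recomputed.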

Simulation Experiments on LIBERO

BlockVLA accelerates both training convergence and inference: it exceeds an 85% overall success rate 2.5× earlier in training and runs inference 3.3× faster.

Simulation experiments on LIBERO comparing BlockVLA with baselines.

Real-world Experiment Demo

We evaluate on a real-world robot arm across three task horizons:

  • Short-horizon (150-200 frames): Put the cube into the bowl.
  • Mid-horizon (250-300 frames): Put the corn into the plate (with different gripper orientations).
  • Long-horizon (450-550 frames): Put both carrot and eggplant into the plate.

For each task, we collect 100 trajectories and evaluate over 24 trials. The paired videos are synchronized at 3× playback speed. When one policy finishes early, its video holds on the final frame until the paired run also completes.

DiscreteDiffusionVLA (Baseline)
  • High latency
  • Frequent stuttering
  • Not smooth
BlockVLA (Ours)
  • Faster inference
  • Greatly reduced stuttering
  • Smooth execution
[Paired videos at 3× speed, DiscreteDiffusionVLA (Baseline) vs. BlockVLA (Ours), one pair per task: Put the cube into the bowl (150-200 frames); Put the corn into the plate (250-300 frames); Put both carrot and eggplant into the plate (450-550 frames).]

Real-world Success Rate

BlockVLA also converges faster in training and runs inference faster in real-world experiments. The convergence acceleration becomes more pronounced as the task horizon increases.

Put the cube into the bowl

Model                 Checkpoint  Successes / trials
DiscreteDiffusionVLA  10k         14/24
DiscreteDiffusionVLA  30k         20/24
DiscreteDiffusionVLA  50k         19/24
BlockVLA              10k         18/24
BlockVLA              30k         20/24
BlockVLA              50k         21/24

Put the corn into the plate

Model                 Checkpoint  Successes / trials
DiscreteDiffusionVLA  10k         9/24
DiscreteDiffusionVLA  30k         19/24
DiscreteDiffusionVLA  50k         23/24
BlockVLA              10k         17/24
BlockVLA              30k         22/24
BlockVLA              50k         23/24

Put both carrot and eggplant into the plate

Model                 Checkpoint  Successes / trials
DiscreteDiffusionVLA  10k         5/24
DiscreteDiffusionVLA  30k         14/24
DiscreteDiffusionVLA  50k         13/24
BlockVLA              10k         17/24
BlockVLA              30k         18/24
BlockVLA              50k         16/24

Inference time: DiscreteDiffusionVLA 2.35 s, BlockVLA 0.51 s (4.6× faster)
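
The reported numbers can be checked with a few lines of arithmetic. The sketch below uses only the 10k-checkpoint success counts and the two inference times listed above; the variable names and short task labels are ours, chosen for illustration.

```python
TRIALS = 24

# (DiscreteDiffusionVLA successes, BlockVLA successes) at the 10k checkpoint
successes_10k = {
    "cube into bowl (short)": (14, 18),
    "corn into plate (mid)": (9, 17),
    "carrot and eggplant into plate (long)": (5, 17),
}

gaps = {}
for task, (baseline, ours) in successes_10k.items():
    gaps[task] = ours - baseline
    print(f"{task}: {baseline / TRIALS:.1%} vs {ours / TRIALS:.1%} "
          f"(+{gaps[task]} trials)")

# Ratio of the two reported inference times.
speedup = 2.35 / 0.51
print(f"speedup: {speedup:.1f}x")
```

At the 10k checkpoint the gap grows from 4 to 8 to 12 trials as the horizon lengthens, consistent with the statement that the convergence advantage increases with task horizon, and the ratio of the two inference times rounds to the stated 4.6×.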

BibTeX

@article{wang2026blockvla,
  title={BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning},
  author={Wang, Ruiheng and Bai, Shuanghao and Zhang, Haoran and Chen, Badong and Xu, Xiangyu},
  journal={arXiv preprint arXiv:2605.13382},
  year={2026}
}