Online RL · Flow-based VLA

π-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs

Critic-free, likelihood-free online RL fine-tuning for flow-based VLA policies using Flow-SDE exploration and step-wise contrastive ranking.

Siting Wang, Xiaofeng Wang, Zheng Zhu, Minnan Pei, Xinyu Cui, Cheng Deng, Jian Zhao, Guan Huang, Haifeng Zhang, Jun Wang

TL;DR

π-StepNFT is a critic-free, likelihood-free online RL fine-tuning method for flow-based VLA policies.
Key insight: wider exploration space needs finer, step-wise alignment — achieved via Flow-SDE exploration + step-wise contrastive ranking.

Highlights

Critic-free & likelihood-free online RL for flow-based VLAs (single forward pass per update).
Flow-SDE sampling widens exploration beyond the expert manifold.
Step-wise supervision + contrastive ranking stabilizes alignment across solver steps.
Strong gains on LIBERO (few-shot) and improved OOD robustness on ManiSkill.
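To make the "wider exploration" idea concrete, here is a minimal sketch of stochastic flow rollouts. It is not the paper's exact Flow-SDE sampler; it is a simplified Euler–Maruyama integrator (the function name `flow_rollout` and the constant-`sigma` noise schedule are assumptions for illustration) that shows how injecting Gaussian noise at each solver step turns a deterministic flow rollout into an exploratory one, while `sigma=0` recovers the ODE rollout.

```python
import numpy as np

def flow_rollout(velocity, x0, n_steps=10, sigma=0.0, rng=None):
    """Integrate a flow policy from t=0 to t=1 with Euler-Maruyama.

    sigma=0 gives the deterministic ODE rollout used at test time;
    sigma>0 adds per-step Gaussian noise (a simplified stand-in for
    Flow-SDE exploration), so rollouts can leave the expert manifold.
    Returns the final sample and the per-step states, which is what
    step-wise supervision needs access to.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    states = [x.copy()]  # record every intermediate solver state
    for k in range(n_steps):
        t = k * dt
        x = x + velocity(x, t) * dt          # deterministic drift
        if sigma > 0:
            # diffusion term: scaled Brownian increment per solver step
            x = x + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
        states.append(x.copy())
    return x, states
```

Because every intermediate state is returned, the same rollout can later be supervised step by step rather than only at its final output.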

Method at a Glance

Figure 1 — Comparison of training paradigms.
How π-StepNFT works (conceptually):
1) Explore wider with Flow-SDE rollouts during training.
2) Align finer: supervise each solver step (local transition) instead of only the final output.
3) Learn preferences via contrastive ranking using positive/negative mirror branches.
4) Update the policy without critic and without likelihood estimation.
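Steps 2–4 can be sketched as a per-step preference loss. The following is a hypothetical Bradley–Terry-style formulation, not the paper's exact objective: at each solver step, a branch is scored by the negative squared error between the policy's predicted local transition and that branch's transition, and the positive branch is ranked above the negative one. All names (`stepwise_contrastive_loss`, the squared-error score) are illustrative assumptions; note that the loss needs only forward predictions, with no critic and no likelihood term.

```python
import numpy as np

def stepwise_contrastive_loss(pred_steps, pos_steps, neg_steps):
    """Hypothetical step-wise contrastive ranking loss.

    pred_steps: the policy's predicted local transition at each solver step.
    pos_steps / neg_steps: the corresponding transitions from the
    higher- and lower-ranked (mirror) rollout branches.

    Each step contributes -log sigmoid(s_pos - s_neg), where a branch's
    score is the negative squared error to the prediction. Minimizing the
    loss pulls predictions toward the positive branch, step by step.
    """
    losses = []
    for pred, pos, neg in zip(pred_steps, pos_steps, neg_steps):
        s_pos = -np.sum((pred - pos) ** 2)
        s_neg = -np.sum((pred - neg) ** 2)
        margin = s_pos - s_neg
        # -log sigmoid(margin), written stably via logaddexp
        losses.append(np.logaddexp(0.0, -margin))
    return float(np.mean(losses))
```

Supervising every step this way, rather than only the final denoised action, is what keeps the wider SDE exploration from destabilizing training.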

Why It Matters

Flow-based VLA policies often collapse into a narrow expert manifold after SFT. When the trajectory drifts at test time, deterministic rollouts struggle to recover. Online RL should expand the recoverable behavior manifold — but naïve stochastic rollouts increase variance and destabilize training.

π-StepNFT couples wide exploration with fine-grained step-wise alignment, enabling stable online improvement and better generalization.

Main Results

Note: comparisons are under the π-RL Flow-SDE setting.

LIBERO (Few-shot)
| Model | Spatial | Object | Goal | Long | Avg. | Δ Avg. |
|---|---|---|---|---|---|---|
| **Full SFT** | | | | | | |
| π0 | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 | — |
| π0.5 | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 | — |
| **Few-shot SFT + RL (π0)** | | | | | | |
| SFT | 65.3 | 64.4 | 49.8 | 51.2 | 57.6 | — |
| πRL (PPO) | 98.4 | 99.4 | 96.2 | 90.2 | 96.0 | +38.4 |
| πRL (GRPO) | 97.8 | 97.8 | 83.2 | 81.4 | 90.0 | +32.4 |
| Ours | 93.5 | 98.0 | 83.7 | 86.7 | 90.5 | +32.9 |
| **Few-shot SFT + RL (π0.5)** | | | | | | |
| SFT | 84.6 | 95.4 | 84.6 | 43.9 | 77.1 | — |
| πRL (PPO) | 99.6 | 100 | 98.8 | 93.0 | 97.9 | +20.8 |
| πRL (GRPO) | 97.4 | 99.8 | 91.2 | 77.6 | 91.5 | +14.4 |
| Ours | 97.8 | 100 | 98.2 | 79.8 | 94.0 | +16.9 |
Table 1 — Success rates (%) on LIBERO in the few-shot setting.
ManiSkill (IND & OOD)
| Model | IND | OOD Vision | OOD Semantic | OOD Execution | OOD Avg. |
|---|---|---|---|---|---|
| **π0** | | | | | |
| Full SFT | 38.4 | 32.6 | 8.4 | 13.2 | 18.1 |
| πRL (PPO) | 78.8 | 61.1 | 25.4 | 31.5 | 39.3 |
| Ours | 79.2 | 69.1 | 49.1 | 33.1 | 50.4 |
| **π0.5** | | | | | |
| Full SFT | 40.1 | 40.2 | 16.6 | 22.4 | 26.4 |
| πRL (PPO) | 90.9 | 68.0 | 34.5 | 45.4 | 49.3 |
| Ours | 85.4 | 76.9 | 56.6 | 45.1 | 59.5 |
Table 2 — Success rates (%) on ManiSkill across IND and OOD settings.

LIBERO (few-shot): π-StepNFT improves average success by 32.9 percentage points over the few-shot SFT baseline (π0).

ManiSkill (OOD generalization): π-StepNFT improves average OOD success by 11.1 percentage points over the critic-based πRL (PPO) baseline (π0) by mitigating multimodal overfitting.

Key Ablations

Figure 2 — On-policy stability under stochastic rollouts.
Figure 3 — Contrastive ranking enables stable critic-free learning.
Figure 4 — Hyperparameter sensitivity analysis.

Citation

@misc{wang2026pistepnftwiderspaceneeds,
      title={$\pi$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs}, 
      author={Siting Wang and Xiaofeng Wang and Zheng Zhu and Minnan Pei and Xinyu Cui and Cheng Deng and Jian Zhao and Guan Huang and Haifeng Zhang and Jun Wang},
      year={2026},
      eprint={2603.02083},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.02083}, 
}