📣 Changyeon Kim is actively on the job market! If you're interested in my work or would like to chat about research, feel free to reach out at cykim1006@gmail.com.
DEAS is a simple yet effective offline RL framework that learns a critic over H-step action sequences with detached value learning, enabling stable, plug-and-play RL enhancement of VLA models.
Prior actor-critic methods that adopt action sequences let the policy π propose a full H-step sequence âH:2H−1. Because this expanded action space is far larger than what the offline dataset supports, the target critic Q̄ is queried on out-of-distribution sequences, and the actor is free to exploit the resulting critic errors — causing severe value overestimation and unstable training in offline settings.
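The overestimation mechanism described above can be illustrated with a toy simulation. This is a minimal sketch and not an experiment from the paper: it assumes every candidate action sequence has a true value of zero and that the critic's error on each candidate is independent Gaussian noise. The function names and parameters are hypothetical.

```python
import random


def max_over_noisy_estimates(num_candidates, noise_std, rng):
    """Return the largest critic estimate among candidates whose true value is 0.

    Assumption: critic error on each candidate is independent Gaussian noise.
    """
    return max(rng.gauss(0.0, noise_std) for _ in range(num_candidates))


def average_overestimation(num_candidates, noise_std=1.0, trials=2000, seed=0):
    """Average value the 'actor' believes it gets by picking the best-looking candidate.

    Since every true value is 0, any positive average is pure overestimation.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max_over_noisy_estimates(num_candidates, noise_std, rng)
    return total / trials


if __name__ == "__main__":
    # More candidates (a larger action-sequence space) means more room to exploit errors.
    for n in (1, 10, 100):
        print(n, round(average_overestimation(n), 3))
```

With a single candidate the average stays near zero, while selecting among more candidates pushes the believed value upward even though nothing has actually improved. A larger space of H-step sequences gives the actor more such candidates to exploit.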
DEAS removes the actor from the critic update loop entirely. A value network V(s) and an action-sequence critic Q(s, a0:H−1) are trained with IQL-style expectile regression (the τ-weighted target), biasing value estimates toward high-return, in-distribution action sequences. We further combine this with distributional RL, which represents values as categorical distributions over a fixed value support, and with dual discount factors γ1 and γ2 that separately control reward aggregation within a sequence and bootstrapping across sequence-level decision points. The policy is then extracted with any off-the-shelf method (BoN, DPG, AWR, etc.), which makes DEAS directly applicable to expressive policies, including large-scale VLAs.
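The two ingredients of detached value learning can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses scalar values instead of categorical distributions, the helper names are hypothetical, and the exact way γ2 enters the bootstrap term (here, one multiplicative factor per sequence-level step) is an assumption.

```python
def expectile_loss(diff, tau):
    """IQL-style asymmetric squared loss: |tau - 1(diff < 0)| * diff**2."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def sequence_target(rewards, next_value, gamma1, gamma2):
    """Target for Q(s, a_{0:H-1}) with dual discounting (assumed form).

    gamma1 discounts rewards inside the H-step sequence; gamma2 discounts the
    bootstrapped value V(s_{t+H}) at the next sequence-level decision point.
    """
    within = sum((gamma1 ** i) * r for i, r in enumerate(rewards))
    return within + gamma2 * next_value


def fit_expectile(samples, tau, lr=0.05, steps=2000):
    """Fit a scalar v that minimizes the mean expectile loss over the samples.

    Stands in for V(s) regressing onto Q values of dataset action sequences;
    no actor and no out-of-distribution action is ever queried.
    """
    v = sum(samples) / len(samples)
    for _ in range(steps):
        grad = 0.0
        for q in samples:
            diff = q - v
            weight = tau if diff >= 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # derivative of the loss w.r.t. v
        v -= lr * grad / len(samples)
    return v


if __name__ == "__main__":
    q_values = [0.0, 1.0, 2.0, 3.0]  # hypothetical Q values of in-dataset sequences
    print(fit_expectile(q_values, tau=0.5))  # tau = 0.5 recovers the mean
    print(fit_expectile(q_values, tau=0.9))  # larger tau leans toward the best samples
    print(sequence_target([1.0, 1.0, 1.0], next_value=10.0, gamma1=0.5, gamma2=0.9))
```

Raising τ moves the fitted value toward the highest-return sequences that actually appear in the data, which is how the expectile objective favors good actions without ever maximizing over unseen ones.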
We report the partial success rate (%, over 20 trials per task) on 3 tasks from 5 initial points. Bold and underline indicate best and runner-up results, respectively.
DEAS demonstrates consistent performance improvements across all tasks. In contrast, baseline methods are inconsistent: some perform well on certain tasks but degrade significantly or improve only minimally on others.
[Videos: GR00T N1.5 | Filtered BC | IQL | QC | DEAS (Ours)]
We report the success rate (%, over 50 trials per task) on 4 tasks, aggregated over 3 different seeds. Bold and underline indicate best and runner-up results, respectively.
DEAS achieves the highest success rates in 3 out of 4 tasks, with the remaining task also showing improved performance compared to the base model.
[Comparison panels: GR00T N1.5 | Filtered BC | IQL | QC | DEAS (Ours)]
DEAS consistently outperforms prior offline RL methods on OGBench, and its advantage is most pronounced on the more challenging tasks (e.g., puzzle and cube-quadruple). Furthermore, DEAS performs consistently across different data scales on diverse tasks.
@inproceedings{kim2026deas,
title={DEAS: DEtached value learning with Action Sequence for Scalable Offline RL},
author={Changyeon Kim and Haeone Lee and Younggyo Seo and Kimin Lee and Yuke Zhu},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026},
}