DEAS: DEtached value learning with Action Sequence for Scalable Offline RL

Changyeon Kim1    Haeone Lee1    Younggyo Seo2    Kimin Lee1    Yuke Zhu3,4
1 KAIST    2 UC Berkeley    3 The University of Texas at Austin    4 NVIDIA
ICLR 2026

📣 Changyeon Kim is actively on the job market! If you're interested in my work or would like to chat about research, feel free to reach out at cykim1006@gmail.com.

TL;DR & Key Insights

DEAS is a simple yet effective offline RL framework that learns a critic over H-step action sequences with detached value learning, enabling stable, plug-and-play RL enhancement of VLA models.

  • Detached value learning. Decouple the critic from the actor to avoid value overestimation over expanded action spaces.
  • Distributional RL + dual discount factors. Categorical values with separate intra- / inter-sequence discounts γ1, γ2^H stabilize multi-step returns.
  • Plug-and-play with any policy, including large VLAs. Boosts GR00T N1.5 / pi_0 on RoboCasa Kitchen and real-world Franka manipulation without architectural changes.

Overview

Previous Methods: Value Overestimation over Expanded Action Spaces

Prior actor-critic methods that adopt action sequences let the policy π propose a full H-step sequence âH:2H−1. Because this expanded action space is far larger than what the offline dataset covers, the target critic Q̄ is queried on out-of-distribution sequences, and the actor is free to exploit the resulting critic errors. The result is severe value overestimation and unstable training in offline settings.

Illustration of value overestimation in previous action-sequence actor-critic methods
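
The overestimation mechanism described above can be illustrated with a toy experiment. This is only an analogy under simplifying assumptions, not an experiment from the paper: every candidate has a true value of 0, the critic's estimate is the true value plus Gaussian noise, and the "actor" simply picks the candidate with the highest estimate. The function name and all parameters are hypothetical.

```python
import random
import statistics

def mean_max_of_noisy_estimates(num_candidates, noise_std=1.0, trials=2000, seed=0):
    """Average of max_i Q_hat(a_i) when every true Q(a_i) is 0 and
    Q_hat = Q + Gaussian noise. Any positive result is pure overestimation."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(trials):
        # Noisy critic estimates for candidates that are all equally good.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_candidates)]
        # A maximizing actor selects the largest estimate, i.e. the largest error.
        maxima.append(max(estimates))
    return statistics.fmean(maxima)

bias_small = mean_max_of_noisy_estimates(num_candidates=4)
bias_large = mean_max_of_noisy_estimates(num_candidates=256)
print(f"4 candidates:   selected value exceeds truth by about {bias_small:.2f}")
print(f"256 candidates: selected value exceeds truth by about {bias_large:.2f}")
```

In this toy setting the selected value is biased upward even though no candidate is actually better than another, and the bias grows as the number of candidates grows. An H-step sequence space offers far more candidates than a single-step one, which is the intuition behind the failure mode above.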

DEAS: Detached Value Learning with Action Sequences

DEAS removes the actor from the critic update loop entirely. A value network V(s) and an action-sequence critic Q(s, a0:H−1) are trained with IQL-style expectile regression (the τ-weighted target), which biases value estimates toward high-return, in-distribution action sequences. We further combine this with distributional RL (categorical distributions over a fixed value support) and dual discount factors γ1, γ2 that separately control reward aggregation within a sequence and bootstrapping across sequence-level decision points. The policy is then extracted with any off-the-shelf method (BoN, DPG, AWR, etc.), which makes DEAS directly applicable to expressive policies, including large-scale VLAs.
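
A minimal scalar sketch of the two ingredients named above may help: the expectile loss used in IQL-style value learning, and an H-step target with two discount factors. The exact target used by DEAS is not given on this page, so the form below (γ1 applied to rewards inside the sequence, γ2^H applied to the bootstrapped value) is an assumption. The sketch also uses plain scalars rather than categorical distributions, and all function names and numbers are hypothetical.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared error |tau - 1[diff < 0]| * diff**2, as in IQL.
    Here diff = Q(s, a_seq) - V(s), with a_seq taken from the dataset."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff

def sequence_target(rewards, next_value, gamma1, gamma2):
    """Assumed H-step target: gamma1 discounts rewards inside the sequence,
    gamma2**H discounts the value at the next sequence-level decision point."""
    horizon = len(rewards)
    within = sum((gamma1 ** i) * r for i, r in enumerate(rewards))
    return within + (gamma2 ** horizon) * next_value

# Hypothetical transition with H = 4: a sparse reward at the last step.
rewards = [0.0, 0.0, 0.0, 1.0]
q_target = sequence_target(rewards, next_value=2.0, gamma1=0.9, gamma2=0.99)

# With tau > 0.5, a V that sits below Q is penalized more than one above it,
# so V drifts toward the better in-distribution sequences.
tau = 0.9
loss_v_too_low = expectile_loss(0.5, tau)    # Q - V = +0.5
loss_v_too_high = expectile_loss(-0.5, tau)  # Q - V = -0.5
print(q_target, loss_v_too_low, loss_v_too_high)
```

Note that neither function queries a policy: both the regression target for V and the bootstrapped target for Q depend only on dataset sequences and the value network, which is what "detached" refers to in this sketch.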

Overview of DEAS
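
Because the critic is learned without the actor, policy extraction can be a separate step. As one example, best-of-N (BoN) sampling can be sketched as follows. The policy and critic here are toy stand-ins with hypothetical names and signatures, not the GR00T N1.5 or DEAS interfaces.

```python
import random

def best_of_n(sample_sequence, q_fn, state, n, rng):
    """Draw n candidate action sequences from a frozen policy and keep the
    one the critic scores highest. The policy itself is never updated."""
    candidates = [sample_sequence(state, rng) for _ in range(n)]
    return max(candidates, key=lambda seq: q_fn(state, seq))

def toy_policy(state, rng, horizon=3):
    # Hypothetical stand-in: H = 3 one-dimensional actions drawn uniformly.
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]

def toy_critic(state, seq):
    # Hypothetical stand-in: prefers actions close to the scalar state.
    return -sum((a - state) ** 2 for a in seq)

rng = random.Random(0)
state = 0.5
chosen = best_of_n(toy_policy, toy_critic, state, n=32, rng=rng)
print(chosen, toy_critic(state, chosen))
```

The base policy only needs to expose a sampling function, which is why this style of extraction requires no architectural changes to the policy.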

Experiments

Franka Research 3 Kitchen

We report the partial success rate (%, over 20 trials per task) on 3 tasks from 5 initial points. Bold and underline indicate best and runner-up results, respectively.

Experimental results on Franka Research 3 Kitchen

DEAS demonstrates consistent performance improvements across all tasks. In contrast, baseline methods are inconsistent: some perform well on certain tasks but show significant degradation or minimal improvement on others.

All videos are 2x real-time.

Peach: GR00T N1.5, Filtered BC, IQL, QC, DEAS (Ours)

Hichew: GR00T N1.5, Filtered BC, IQL, QC, DEAS (Ours)

RoboCasa Kitchen

We report the success rate (%, over 50 trials per task) on 4 tasks, aggregated with 3 different seeds. Bold and underline indicate best and runner-up results, respectively.

Experimental results on RoboCasa Kitchen

DEAS achieves the highest success rates in 3 out of 4 tasks, with the remaining task also showing improved performance compared to the base model.


CoffeeSetupMug: GR00T N1.5, Filtered BC, IQL, QC, DEAS (Ours)

PnPCounterToMicrowave: GR00T N1.5, Filtered BC, IQL, QC, DEAS (Ours)

PnPMicrowaveToCounter: GR00T N1.5, Filtered BC, IQL, QC, DEAS (Ours)

TurnOffStove: GR00T N1.5, Filtered BC, IQL, QC, DEAS (Ours)



OGBench

Experimental results on OGBench

Data scaling results on OGBench

DEAS consistently outperforms prior offline RL methods on OGBench, and its advantage is most pronounced on the more challenging tasks (e.g., puzzle and cube-quadruple). DEAS also performs consistently across different data scales on diverse tasks.


Citation


@inproceedings{kim2026deas,
    title={DEAS: DEtached value learning with Action Sequence for Scalable Offline RL},
    author={Changyeon Kim and Haeone Lee and Younggyo Seo and Kimin Lee and Yuke Zhu},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2026},
}