
Improving RECAP: Better Advantage Labeling and Active Intervention Requests for VLA Models

Federico Sarrocco
VLA • Physical Intelligence • Reinforcement Learning
25 minutes

We propose two improvements to RECAP: failure point detection for precise advantage labeling in failed trajectories, and active intervention requests that let the robot ask for help when uncertain - shifting from constant human monitoring to efficient collaboration.

Improving RECAP: Better Advantage Labeling and Active Intervention Requests for Vision-Language-Action Models

RECAP [↓] (RL with Experience and Corrections via Advantage-conditioned Policies) enables Vision-Language-Action models to learn from both successes and failures through advantage conditioning, bypassing the need for policy gradients. However, the original method has two key limitations: (1) in failed trajectories, all actions are treated uniformly despite only some causing the failure, and (2) human supervision requires constant monitoring. We propose two improvements. First, failure point detection uses the value function to identify the exact moment a trajectory transitions from approaching success to drifting toward failure, enabling precise labeling - only post-failure actions receive negative advantage labels. Second, active intervention requests enable the policy to explicitly request human help when uncertain, shifting from passive monitoring to efficient collaboration. We implement these improvements on top of π₀.₆* and evaluate on simulated manipulation tasks.


1. Introduction

Recent work on π₀.₆* [↓] introduced RECAP, a method for enabling Vision-Language-Action (VLA) models to improve through experience without requiring policy gradients. By conditioning the policy on binary advantage indicators ("good" vs "bad"), RECAP converts reinforcement learning into conditional supervised learning, making RL-style improvement compatible with flow matching and diffusion policies [↓].

If you're unfamiliar with RECAP or π₀.₆*, I recommend reading my deep dive on the topic first. This article assumes familiarity with the core mechanism.

However, RECAP has practical limitations that reduce its effectiveness:

Problem 1: Imprecise labeling of failed trajectories. RECAP labels actions based on trajectory-level outcomes. In a failed trajectory, this means all actions receive negative advantage - but this is often incorrect. Consider a robot reaching for an object: it may move correctly toward the target for many steps before making a single mistake that causes failure. Labeling all actions as "bad" teaches the model to avoid good behaviors, contaminating the training signal.

Problem 2: Human supervision doesn't scale. RECAP relies on human interventions to demonstrate recovery behaviors and escape local optima. The current approach requires constant monitoring - one person per robot - which is impractical for fleet deployment.

We propose two improvements to address these limitations:

Contributions

  1. Failure Point Detection: We use the value function to identify the failure point - the step where the robot was closest to success before things went wrong. Only actions after this point receive negative advantage labels. Actions before are discarded as ambiguous rather than mislabeled as bad.

  2. Active Intervention Requests: Instead of passive human monitoring, the robot estimates its own uncertainty using an ensemble of value functions. When uncertainty is high, it pauses and explicitly requests human help.

These improvements are complementary to RECAP's core mechanism and can be integrated into the existing π₀.₆* pipeline.


2. Background: RECAP and π₀.₆*

We briefly review RECAP as introduced in π₀.₆* [↓] to establish context for our improvements.

2.1 The Problem RECAP Solves

VLAs trained with behavioral cloning are limited by their demonstration data - they cannot improve beyond what they've seen. Standard RL (PPO, SAC) could enable improvement, but these algorithms require log-probabilities that flow matching policies cannot provide tractably.

2.2 Advantage-Conditioned Policies

RECAP's key insight is that RL-style improvement can be achieved through conditional supervised learning:

  1. Train a value function V_φ(s) to predict steps-to-success from any state
  2. Compute advantages: A_t = G_t - V_φ(s_t), where G_t is the actual return
  3. Binarize advantages: Top k% labeled "positive", rest "negative"
  4. Train policy with conditioning: The policy learns to produce actions conditioned on this binary indicator
  5. At inference: Always condition on "positive" to sample from successful behavior patterns
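
To make steps 2-3 concrete, here is a minimal sketch of advantage binarization (NumPy; the percentile-based top-k cutoff and the flat array layout are my assumptions, not the π₀.₆* implementation):

```python
import numpy as np

def binarize_advantages(returns, values, top_k_pct=30.0):
    """Compute A_t = G_t - V(s_t) and reduce it to a binary 'good'/'bad' label.

    Args:
        returns: array of Monte Carlo returns G_t for every step in the dataset.
        values:  array of value predictions V(s_t) for the same steps.
        top_k_pct: fraction of steps (by advantage) labeled "positive" (assumed value).
    """
    advantages = np.asarray(returns) - np.asarray(values)
    cutoff = np.percentile(advantages, 100.0 - top_k_pct)  # top-k% threshold
    return advantages >= cutoff  # True -> "positive" conditioning token, False -> "negative"
```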

The theoretical justification comes from the observation that the optimal improved policy has the form:

π_improved(a|s) ∝ π_ref(a|s) · P(improvement | a)^β

When β = 1, this reduces to conditioning on an improvement indicator - which is exactly what RECAP's binary advantage label provides.

2.3 Human Interventions in RECAP

When a human operator takes control during a rollout, their actions are recorded and their advantage labels are forced to "positive" (trusting expert corrections). This provides recovery demonstrations that the policy learns to imitate.

2.4 Limitations We Address

Trajectory-level labeling: The original RECAP computes advantages based on episode returns. For failed trajectories, this assigns negative advantage to all actions - even those that were making progress toward the goal.

The Mislabeling Problem
Failed trajectory with ORIGINAL RECAP labeling:

Step 0        Step 5        Step 10       Step 15       Step 20 (FAIL)
  │             │             │             │              │
  ▼             ▼             ▼             ▼              ▼
[reach]  →  [approach]  →  [align]  →  [FUMBLE]  →  [DROP]
  ✗             ✗             ✗             ✗              ✗
NEGATIVE     NEGATIVE     NEGATIVE     NEGATIVE       NEGATIVE

↑ These were GOOD actions!    ↑ Only THESE caused the failure!
Original RECAP assigns negative advantage to ALL actions in a failed trajectory, including those that were making correct progress.

Passive supervision: Human operators must continuously monitor rollouts to catch failures and intervene. This requires one person per robot and doesn't scale.


3. Method

3.1 Failure Point Detection

Our first contribution is a method for precisely identifying which actions in a failed trajectory actually caused the failure.

Key insight: The value function V_φ(s) predicts steps-to-success. In a failed trajectory, there is typically a point where the robot was closest to success before things went wrong. Actions before this point were making progress; actions after were compounding errors.

Algorithm:

Given a failed trajectory τ = (s_0, a_0, s_1, a_1, ..., s_T):

  1. Compute V_φ(s_t) for each step t
  2. Find the failure point: K = argmin_t V_φ(s_t)
    • This is the step where predicted time-to-success was lowest (closest to goal)
  3. Label actions:
    • Steps 0 to K-1: Discard (these were approaching the goal)
    • Steps K to T: Negative advantage (these caused the failure)
Failure Point Detection
Failed Trajectory Timeline:

Step 0 ──────────── Step K (failure point) ───────────> Step T (failure)
    │                        │                               │
    │   Value decreasing     │       Value increasing        │
    │   (approaching goal)   │       (drifting away)         │
    │                        │                               │
    │      DISCARDED         │      KEPT (advantage = -)     │
    │   (ambiguous data)     │      (caused failure)         │
The value function identifies the inflection point where the trajectory transitions from progress to failure.

Why discard instead of labeling them positive? Actions before the failure point were moving toward the goal, but they were insufficient - they didn't prevent the eventual failure. Labeling them positive could teach suboptimal behaviors. Labeling them negative is clearly wrong. Discarding them avoids contaminating the training signal.

Why this works: The value function provides a dense progress signal that identifies exactly when the trajectory "turned bad." This is more precise than trajectory-level returns, which assign blame uniformly.
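
A minimal sketch of the detection rule, assuming the value function predicts steps-to-success as described above (function and argument names are illustrative, not from the π₀.₆* codebase):

```python
import numpy as np

def split_failed_trajectory(states, value_fn):
    """Find the failure point K and return the step indices to discard / label negative.

    `states` holds s_0..s_T of a failed rollout; `value_fn` maps a batch of states
    to predicted steps-to-success (lower = closer to the goal).
    """
    values = value_fn(states)             # V_φ(s_t) for every step
    K = int(np.argmin(values))            # failure point: closest to success
    discard = np.arange(0, K)             # ambiguous prefix: dropped from training
    negative = np.arange(K, len(states))  # suffix that drifted into failure
    return discard, negative
```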

🔧 Value Function Dynamics in Failed Trajectories

The value function V_φ(s) predicts steps-to-success (equivalently, the negative of the cumulative return under the step-count reward described below). In successful trajectories, V_φ(s_t) decreases monotonically toward 0 as the robot approaches the goal. In failed trajectories, V_φ(s_t) typically shows a characteristic pattern:

  • Early phase: the robot is approaching the goal, so V_φ(s_t) decreases (fewer predicted steps remaining)
  • Failure point K: V_φ(s_t) reaches its minimum (closest to success)
  • Late phase: the robot drifts away, so V_φ(s_t) increases (predicted steps remaining grow and the failure penalty looms)

The argmin captures the inflection point between these phases. This requires a reasonably well-calibrated value function - if V_φ is inaccurate, the detected failure point will be noisy. However, RECAP's value function is trained on a mix of successful and failed trajectories with simple step-count rewards (r = -1 per step, r = 0 at success, r = -C_fail at failure), which provides sufficient signal for reliable inflection detection after a few training iterations.
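
For illustration, Monte Carlo value targets under this reward scheme could be computed as follows (a sketch; the helper name and the choice of C_fail are mine, not from the original method):

```python
import numpy as np

def steps_to_success_targets(T, succeeded, c_fail=50.0):
    """Monte Carlo targets for V(s_t) under r = -1 per step, 0 at success, -C_fail at failure.

    Returns targets in the positive steps-to-success convention used above,
    i.e. the negative of the cumulative return from each step t of an episode of length T.
    """
    t = np.arange(T + 1)
    remaining = (T - t).astype(float)                          # steps until the episode ended
    returns = -remaining - (0.0 if succeeded else c_fail)      # cumulative reward from step t
    return -returns                                            # flip sign: steps-to-success
```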

3.2 Active Intervention Requests

Our second contribution enables the robot to request help when uncertain, rather than requiring constant human monitoring.

Uncertainty estimation via ensemble: We train M value functions {V_φ₁, ..., V_φₘ} with different random initializations. Uncertainty is measured by disagreement:

u(s) = Var[V_φ₁(s), ..., V_φₘ(s)]

High disagreement indicates the ensemble hasn't seen similar states during training - the robot is in unfamiliar territory.
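
In code, the disagreement measure is a one-liner over the ensemble (a sketch; the ensemble is assumed to be a plain list of value-function callables):

```python
import numpy as np

def ensemble_uncertainty(state, value_fns):
    """u(s): variance of the M ensemble value predictions at a state."""
    preds = np.array([v(state) for v in value_fns])  # one prediction per ensemble member
    return float(preds.var())                        # high variance = unfamiliar state
```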

Intervention request protocol:

At each step:

  1. Compute uncertainty u(s)
  2. If u(s) > θ:
    • Pause execution
    • Signal for human assistance (audio/visual alert)
    • Wait for human to take control
    • Record human actions with positive advantage
    • Resume autonomous execution when human releases control
  3. Else: continue autonomous execution
Active Intervention Request Flow
At each timestep:

Compute uncertainty u(s) = Var[V_φ₁(s), ..., V_φₘ(s)]
     │
     ├── u(s) ≤ θ ──────────► Continue autonomous execution
     │
     └── u(s) > θ ──────────► PAUSE execution
                                  │
                                  ├── Signal for human (audio/visual)
                                  ├── Wait for human takeover
                                  ├── Record human actions (advantage = +)
                                  └── Resume when human releases
The robot monitors its own uncertainty and requests help only when needed.
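
A sketch of the request loop, under an assumed minimal simulator interface (reset() returning an observation, step(action) returning (obs, reward, done, info)) and a placeholder request_human_help hand-off that blocks until the operator releases control:

```python
import numpy as np

def run_episode(env, policy, value_fns, theta, request_human_help):
    """Autonomous rollout that pauses and requests help when ensemble uncertainty is high.

    `request_human_help(env, obs)` is a placeholder for the teleoperation hand-off;
    it is assumed to block until the operator releases control and to return the
    recorded human steps as dicts containing at least "next_obs".
    """
    obs, trajectory, done = env.reset(), [], False
    while not done:
        u = np.var([v(obs) for v in value_fns])          # ensemble disagreement u(s)
        if u > theta:
            # Pause, alert the operator, and record their corrections as "positive".
            for step in request_human_help(env, obs):
                trajectory.append({**step, "advantage": "positive"})
            obs = trajectory[-1]["next_obs"]
        else:
            action = policy(obs, advantage="positive")   # always condition on "positive"
            next_obs, reward, done, info = env.step(action)
            trajectory.append({"obs": obs, "action": action, "next_obs": next_obs,
                               "reward": reward, "advantage": None})
            obs = next_obs
    return trajectory
```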

Benefits over passive monitoring:

Aspect               | Passive (Original)  | Active (Ours)
Human role           | Watch continuously  | Respond to requests
Attention required   | 100% of runtime     | Only when needed
Scalability          | 1 human per robot   | 1 human per N robots
Intervention timing  | Human judgment      | Uncertainty-driven

When does the robot request help?

  • Entering states not well-represented in training data
  • Facing ambiguous situations where the value function is uncertain
  • Approaching potential failure modes
🔧 Ensemble Design and Calibration

Number of ensemble members: We use M = 5 value functions, following the deep ensemble literature [↓]. More members improve uncertainty estimates but increase compute linearly. Five provides a good trade-off between calibration quality and overhead.

Training: Each value function is initialized with different random seeds but trained on the same data. Diversity comes from random initialization and stochastic gradient descent - this is sufficient for meaningful disagreement without requiring different architectures or data splits [↓].

Threshold selection: The intervention threshold θ controls the safety-efficiency trade-off. Lower θ means more intervention requests (safer but more human burden); higher θ means fewer requests (more autonomous but higher failure risk). We calibrate θ by monitoring the false-negative rate on a validation set: how often does the robot fail in states where u(s) < θ?

Computational cost: The ensemble adds M× forward passes through the value function at each step. Since value functions are small relative to the VLA policy (they only predict a scalar from state), this overhead is negligible compared to action generation.
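
One possible calibration procedure on logged validation rollouts (a sketch; the candidate grid and target miss rate are assumptions, not values from the method):

```python
import numpy as np

def calibrate_threshold(uncertainties, failed, target_miss_rate=0.05):
    """Choose the largest θ whose false-negative rate stays below a target.

    Args:
        uncertainties: u(s) recorded at each validation step, shape (N,).
        failed: boolean per step, True if that step's episode ended in failure.
        target_miss_rate: max acceptable fraction of failure steps with u(s) < θ.
    """
    uncertainties = np.asarray(uncertainties)
    failed = np.asarray(failed, dtype=bool)
    candidates = np.quantile(uncertainties, np.linspace(0.5, 0.99, 50))  # ascending θ grid
    best = candidates[0]
    for theta in candidates:
        miss_rate = np.mean(uncertainties[failed] < theta)  # failures we would not flag
        if miss_rate <= target_miss_rate:
            best = theta       # keep raising θ while misses stay acceptable
    return best
```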

3.3 Integration with RECAP Pipeline

Our improvements integrate into the standard RECAP training loop:

Improved RECAP Pipeline
IMPROVED RECAP PIPELINE
═══════════════════════

For each training iteration:

1. COLLECT ROLLOUTS
   ├── Execute policy with advantage = "positive"
   ├── Monitor uncertainty u(s) at each step
   └── Request intervention when u(s) > θ

2. PROCESS TRAJECTORIES
   ├── Successful trajectories:
   │   └── All steps → advantage = positive
   ├── Failed trajectories:
   │   ├── Find failure point K = argmin V(s_t)
   │   ├── Discard steps 0 to K-1
   │   └── Steps K to T → advantage = negative
   └── Intervention segments:
       └── All human actions → advantage = positive

3. TRAIN VALUE FUNCTION ENSEMBLE
   └── M value functions on successes, failures, and intervention segments (step-count rewards)

4. TRAIN POLICY
   └── With advantage conditioning on processed data
The complete training loop with failure point detection and active intervention requests.

The key differences from standard RECAP:

  • Step 1: Active intervention requests replace passive monitoring
  • Step 2: Failure point detection replaces trajectory-level labeling for failed trajectories
  • Step 3: An ensemble of value functions is trained instead of a single value function
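
A sketch of step 2 of this pipeline, combining the three labeling rules (the trajectory fields success, human, states, and steps are illustrative, not the actual data schema):

```python
import numpy as np

def process_trajectory(traj, value_fn):
    """Assign advantage labels to one rollout before policy training.

    `traj` is assumed to carry `states` (array of s_0..s_T), per-step dicts in
    `steps` with a `human` flag for intervention segments, and a `success` flag.
    """
    labeled = []
    values = value_fn(traj["states"])
    K = int(np.argmin(values))                  # failure point (unused if successful)
    for t, step in enumerate(traj["steps"]):
        if step.get("human"):
            labeled.append((step, "positive"))  # human corrections: trusted
        elif traj["success"]:
            labeled.append((step, "positive"))  # successful rollout: keep everything
        elif t >= K:
            labeled.append((step, "negative"))  # post-failure-point actions
        # else: pre-failure-point steps of a failed rollout are discarded
    return labeled
```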

4. Experiments

4.1 Setup

We implement our improvements on top of the RECAP codebase and evaluate on simulated reaching tasks in Genesis.

Environment: Robot arm reaching task with random target positions.

  • Observation: joint positions/velocities, end effector position, relative target position (24-dim)
  • Action: delta joint positions (7-dim)
  • Success: end effector within 0.08m of target

Baselines:

  • RECAP (original): trajectory-level advantage labeling, passive monitoring
  • RECAP + FPD: with failure point detection only
  • RECAP + AIR: with active intervention requests only
  • RECAP + Both: full method

4.2 Metrics

  • Success rate: Percentage of episodes reaching the goal
  • Label accuracy: For failed trajectories, what fraction of "negative" labels are actually on bad actions?
  • Human effort: Interventions per episode, total human attention time
  • Sample efficiency: Success rate vs. training episodes

4.3 Results

Results: Under Review

This section is still being refined and reviewed for accuracy. The planned analyses are outlined below.

Failure Point Detection Analysis (planned):

  • Comparison of label quality with/without FPD
  • Visualization of failure points in example trajectories
  • Impact on final policy performance

Active Intervention Request Analysis (planned):

  • Human effort reduction vs. passive monitoring
  • Comparison of intervention timing (uncertainty-based vs. human judgment)
  • Impact on learning efficiency from interventions

5. Discussion

5.1 When Does Failure Point Detection Help Most?

Failure point detection provides the most benefit when:

  • Failed trajectories have significant "good" portions before failure
  • The value function accurately predicts progress
  • Failures stem from specific actions rather than gradual drift

It helps less when:

  • Failures occur immediately (no good actions to preserve)
  • The value function is poorly calibrated
  • The entire trajectory is uniformly suboptimal
⚠️

Important caveat: Failure point detection is only as good as the value function. If V_φ is inaccurate - especially early in training when data is limited - the detected failure point may be noisy. The method improves as the value function becomes more calibrated through training iterations.

5.2 Limitations of Active Intervention Requests

Uncertainty calibration: If the ensemble is overconfident (low uncertainty in unfamiliar states), the robot won't request help when it should. Conversely, if underconfident, it will request help too frequently.

Threshold sensitivity: The intervention threshold must balance safety (lower threshold, more requests) against human burden (higher threshold, fewer requests).

Not a replacement for good pretraining: Active intervention requests reduce monitoring burden but don't eliminate the need for human expertise. The quality of interventions still matters.

5.3 Relationship to Prior Work

Failure point detection relates to credit assignment in RL. While TD methods distribute credit temporally through bootstrapping, we use the value function directly to segment trajectories. This is simpler and doesn't require on-policy data assumptions.

Active intervention requests build on uncertainty-based methods in active learning and safe RL. ThriftyDAgger [↓] uses novelty detection to decide when to query an expert during DAgger-style training. The key difference is integration with advantage conditioning - uncertainty drives not just exploration decisions but also the labeling of human collaboration data within the RECAP framework.

Both DAgger [↓] and our active intervention requests address distribution shift between training and deployment. DAgger queries the expert at on-policy states; we request interventions at uncertain states and feed the resulting demonstrations into the advantage-conditioned training pipeline.

5.4 Future Directions

Adaptive failure margins: Rather than a hard cutoff at the argmin, use a soft margin before the failure point to account for actions that led to the failure. The transition from "good" to "bad" is rarely instantaneous - a smooth weighting scheme based on the time derivative dV_φ(s_t)/dt could capture this more faithfully.

Language-based intervention requests: Instead of a simple "help needed" signal, the robot could describe what it's uncertain about, leveraging the VLM backbone (e.g., "unsure how to orient the gripper for this object shape").

Cross-task transfer of value functions: Training a general progress predictor that transfers across tasks could improve failure point detection in new domains where task-specific data is limited.


6. Conclusion

We presented two improvements to RECAP for Vision-Language-Action models:

  1. Failure point detection uses the value function to identify exactly when a trajectory transitions from progress to failure, enabling precise advantage labeling that avoids mislabeling good actions as bad.

  2. Active intervention requests enable the robot to request help when uncertain, reducing human supervision burden while maintaining sample-efficient learning from expert corrections.

Together, these make RECAP's training signal cleaner and its human supervision requirements practical for multi-robot deployment.


References

1. Black et al., "π₀: A Vision-Language-Action Flow Model for General Robot Control", 2024. The foundational VLA model that RECAP builds upon.

2. Physical Intelligence, "π₀.₆*: A VLA That Learns From Experience", 2025. Introduces RECAP - the advantage-conditioned policy improvement method we build upon.

3. Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", 2023. Pioneered diffusion-based action generation for robotics.

4. Hoque et al., "ThriftyDAgger: Budget-Aware Novelty and Risk Gating for Interactive Imitation Learning", 2021. Related work on uncertainty-based intervention timing in imitation learning.

5. Ross, Gordon, and Bagnell, "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning", 2011. The foundational DAgger paper establishing interactive imitation learning.

6. Lakshminarayanan, Pritzel, and Blundell, "Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles", 2017. Established deep ensembles as a practical method for uncertainty estimation.

@article{federicosarrocco2026,
  author = {Federico Sarrocco},
  title = {Improving RECAP: Better Advantage Labeling and Active Intervention Requests for VLA Models},
  year = {2026},
  month = {February},
  day = {12},
  url = {https://federicosarrocco.com/blog/improving-recap}
}

About the author

Federico Sarrocco


View Portfolio