
Improving RECAP: Better Advantage Labeling and Active Intervention Requests for VLA Models

Federico Sarrocco
VLA • Physical Intelligence • Reinforcement Learning
25 minutes

We propose two improvements to RECAP: failure point detection for precise advantage labeling in failed trajectories, and active intervention requests that let the robot ask for help when uncertain - shifting from constant human monitoring to efficient collaboration.

Improving RECAP: Better Advantage Labeling and Active Intervention Requests for Vision-Language-Action Models

RECAP [↓] (RL with Experience and Corrections via Advantage-conditioned Policies) enables Vision-Language-Action models to learn from both successes and failures through advantage conditioning, bypassing the need for policy gradients. However, the original method has two key limitations: (1) in failed trajectories, all actions are treated uniformly despite only some causing the failure, and (2) human supervision requires constant monitoring. We propose two improvements. First, failure point detection uses the value function to identify the exact moment a trajectory transitions from approaching success to drifting toward failure, enabling precise labeling - only post-failure actions receive negative advantage labels. Second, active intervention requests enable the policy to explicitly request human help when uncertain, shifting from passive monitoring to efficient collaboration. We implement these improvements on top of π₀.₆* and evaluate on simulated manipulation tasks.


1. Introduction

Recent work on π₀.₆* [↓] introduced RECAP, a method for enabling Vision-Language-Action (VLA) models to improve through experience without requiring policy gradients. By conditioning the policy on binary advantage indicators ("good" vs "bad"), RECAP converts reinforcement learning into conditional supervised learning, making RL-style improvement compatible with flow matching and diffusion policies [↓].

If you're unfamiliar with RECAP or π₀.₆*, I recommend reading my deep dive on the topic first. This article assumes familiarity with the core mechanism.

However, RECAP has practical limitations that reduce its effectiveness:

Problem 1: Imprecise labeling of failed trajectories. RECAP labels actions based on trajectory-level outcomes. In a failed trajectory, this means all actions receive negative advantage - but this is often incorrect. Consider a robot reaching for an object: it may move correctly toward the target for many steps before making a single mistake that causes failure. Labeling all actions as "bad" teaches the model to avoid good behaviors, contaminating the training signal.

Problem 2: Human supervision doesn't scale. RECAP relies on human interventions to demonstrate recovery behaviors and escape local optima. The current approach requires constant monitoring - one person per robot - which is impractical for fleet deployment.

We propose two improvements to address these limitations:

Contributions

  1. Failure Point Detection: We use the value function to identify the failure point - the step where the robot was closest to success before things went wrong. Only actions after this point receive negative advantage labels. Actions before are discarded as ambiguous rather than mislabeled as bad.

  2. Active Intervention Requests: Instead of passive human monitoring, the robot estimates its own uncertainty using an ensemble of value functions. When uncertainty is high, it pauses and explicitly requests human help.

These improvements are complementary to RECAP's core mechanism and can be integrated into the existing π₀.₆* pipeline.


2. Background: RECAP and π₀.₆*

We briefly review RECAP as introduced in π₀.₆* [↓] to establish context for our improvements.

2.1 The Problem RECAP Solves

VLAs trained with behavioral cloning are limited by their demonstration data - they cannot improve beyond what they've seen. Standard RL (PPO, SAC) could enable improvement, but these algorithms require log-probabilities that flow matching policies cannot provide tractably.

2.2 Advantage-Conditioned Policies

RECAP's key insight is that RL-style improvement can be achieved through conditional supervised learning:

  1. Train a value function V_φ(s) to predict steps-to-success from any state
  2. Compute advantages: A_t = G_t - V_φ(s_t), where G_t is the actual return
  3. Binarize advantages: Top k% labeled "positive", rest "negative"
  4. Train policy with conditioning: The policy learns to produce actions conditioned on this binary indicator
  5. At inference: Always condition on "positive" to sample from successful behavior patterns
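
To make steps 2-3 concrete, here is a minimal sketch of advantage binarization (NumPy; the percentile-based top-k cutoff and the flat array layout are my assumptions, not the π₀.₆* implementation):

```python
import numpy as np

def binarize_advantages(returns, values, top_k_pct=30.0):
    """Compute A_t = G_t - V(s_t) and reduce it to a binary 'good'/'bad' label.

    Args:
        returns: array of Monte Carlo returns G_t for every step in the dataset.
        values:  array of value predictions V(s_t) for the same steps.
        top_k_pct: fraction of steps (by advantage) labeled "positive" (assumed value).
    """
    advantages = np.asarray(returns) - np.asarray(values)
    cutoff = np.percentile(advantages, 100.0 - top_k_pct)  # top-k% threshold
    return advantages >= cutoff  # True -> "positive" conditioning token, False -> "negative"
```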

The theoretical justification comes from the observation that the optimal improved policy has the form:

π_improved(a|s) ∝ π_ref(a|s) · P(improvement | a)^β

When β = 1, this reduces to conditioning on an improvement indicator - which is exactly what RECAP's binary advantage label provides.

2.3 Human Interventions in RECAP

When a human operator takes control during a rollout, their actions are recorded and their advantage labels are forced to "positive" (trusting expert corrections). This provides recovery demonstrations that the policy learns to imitate.

2.4 Limitations We Address

Trajectory-level labeling: The original RECAP computes advantages based on episode returns. For failed trajectories, this assigns negative advantage to all actions - even those that were making progress toward the goal.

The Mislabeling Problem
Failed trajectory with ORIGINAL RECAP labeling:

Step 0        Step 5        Step 10       Step 15       Step 20 (FAIL)
  │             │             │             │              │
  ▼             ▼             ▼             ▼              ▼
[reach]  →  [approach]  →  [align]  →  [FUMBLE]  →  [DROP]
  ✗             ✗             ✗             ✗              ✗
NEGATIVE     NEGATIVE     NEGATIVE     NEGATIVE       NEGATIVE

↑ These were GOOD actions!    ↑ Only THESE caused the failure!
Original RECAP assigns negative advantage to ALL actions in a failed trajectory, including those that were making correct progress.

Passive supervision: Human operators must continuously monitor rollouts to catch failures and intervene. This requires one person per robot and doesn't scale.


3. Method

3.1 Failure Point Detection

Our first contribution is a method for precisely identifying which actions in a failed trajectory actually caused the failure.

Key insight: The value function V_φ(s) predicts steps-to-success. In a failed trajectory, there is typically a point where the robot was closest to success before things went wrong. Actions before this point were making progress; actions after were compounding errors.

Algorithm:

Given a failed trajectory τ = (s_0, a_0, s_1, a_1, ..., s_T):

  1. Compute V_φ(s_t) for each step t
  2. Find the failure point: K = argmin_t V_φ(s_t)
    • This is the step where predicted time-to-success was lowest (closest to goal)
  3. Label actions:
    • Steps 0 to K-1: Discard (these were approaching the goal)
    • Steps K to T: Negative advantage (these caused the failure)
Failure Point Detection
Failed Trajectory Timeline:

Step 0 ──────────── Step K (failure point) ───────────> Step T (failure)
    │                        │                               │
    │   Value decreasing     │       Value increasing        │
    │   (approaching goal)   │       (drifting away)         │
    │                        │                               │
    │      DISCARDED         │      KEPT (advantage = -)     │
    │   (ambiguous data)     │      (caused failure)         │
The value function identifies the inflection point where the trajectory transitions from progress to failure.

Why discard instead of labeling them positive? Actions before the failure point were moving toward the goal, but they were insufficient - they didn't prevent the eventual failure. Labeling them positive could teach suboptimal behaviors. Labeling them negative is clearly wrong. Discarding them avoids contaminating the training signal.

Why this works: The value function provides a dense progress signal that identifies exactly when the trajectory "turned bad." This is more precise than trajectory-level returns, which assign blame uniformly.
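
A minimal sketch of the detection rule, assuming the value function predicts steps-to-success as described above (function and argument names are illustrative, not from the π₀.₆* codebase):

```python
import numpy as np

def split_failed_trajectory(states, value_fn):
    """Find the failure point K and return the step indices to discard / label negative.

    `states` holds s_0..s_T of a failed rollout; `value_fn` maps a batch of states
    to predicted steps-to-success (lower = closer to the goal).
    """
    values = value_fn(states)             # V_φ(s_t) for every step
    K = int(np.argmin(values))            # failure point: closest to success
    discard = np.arange(0, K)             # ambiguous prefix: dropped from training
    negative = np.arange(K, len(states))  # suffix that drifted into failure
    return discard, negative
```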

🔧 Value Function Dynamics in Failed Trajectories

The value function V_φ(s) predicts steps-to-success (equivalently, the negative of the cumulative return under the step-count reward described below). In successful trajectories, V_φ(s_t) decreases monotonically toward 0 as the robot approaches the goal. In failed trajectories, V_φ(s_t) typically shows a characteristic pattern:

  • Early phase: the robot is approaching the goal, so V_φ(s_t) decreases (fewer predicted steps remaining)
  • Failure point K: V_φ(s_t) reaches its minimum (closest to success)
  • Late phase: the robot drifts away, so V_φ(s_t) increases (predicted steps remaining grow and the failure penalty looms)

The argmin captures the inflection point between these phases. This requires a reasonably well-calibrated value function - if V_φ is inaccurate, the detected failure point will be noisy. However, RECAP's value function is trained on a mix of successful and failed trajectories with simple step-count rewards (r = -1 per step, r = 0 at success, r = -C_fail at failure), which provides sufficient signal for reliable inflection detection after a few training iterations.
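
For illustration, Monte Carlo value targets under this reward scheme could be computed as follows (a sketch; the helper name and the choice of C_fail are mine, not from the original method):

```python
import numpy as np

def steps_to_success_targets(T, succeeded, c_fail=50.0):
    """Monte Carlo targets for V(s_t) under r = -1 per step, 0 at success, -C_fail at failure.

    Returns targets in the positive steps-to-success convention used above,
    i.e. the negative of the cumulative return from each step t of an episode of length T.
    """
    t = np.arange(T + 1)
    remaining = (T - t).astype(float)                          # steps until the episode ended
    returns = -remaining - (0.0 if succeeded else c_fail)      # cumulative reward from step t
    return -returns                                            # flip sign: steps-to-success
```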

3.2 Active Intervention Requests

Our second contribution enables the robot to request help when uncertain, rather than requiring constant human monitoring.

Uncertainty estimation via ensemble: We train M value functions {V_φ₁, ..., V_φₘ} with different random initializations. Uncertainty is measured by disagreement:

u(s) = Var[V_φ₁(s), ..., V_φₘ(s)]

High disagreement indicates the ensemble hasn't seen similar states during training - the robot is in unfamiliar territory.
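
In code, the disagreement measure is a one-liner over the ensemble (a sketch; the ensemble is assumed to be a plain list of value-function callables):

```python
import numpy as np

def ensemble_uncertainty(state, value_fns):
    """u(s): variance of the M ensemble value predictions at a state."""
    preds = np.array([v(state) for v in value_fns])  # one prediction per ensemble member
    return float(preds.var())                        # high variance = unfamiliar state
```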

Intervention request protocol:

At each step:

  1. Compute uncertainty u(s)
  2. If u(s) > θ:
    • Pause execution
    • Signal for human assistance (audio/visual alert)
    • Wait for human to take control
    • Record human actions with positive advantage
    • Resume autonomous execution when human releases control
  3. Else: continue autonomous execution
Active Intervention Request Flow
At each timestep:

Compute uncertainty u(s) = Var[V_φ₁(s), ..., V_φₘ(s)]
     │
     ├── u(s) ≤ θ ──────────► Continue autonomous execution
     │
     └── u(s) > θ ──────────► PAUSE execution
                                  │
                                  ├── Signal for human (audio/visual)
                                  ├── Wait for human takeover
                                  ├── Record human actions (advantage = +)
                                  └── Resume when human releases
The robot monitors its own uncertainty and requests help only when needed.
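
A sketch of the request loop, under an assumed minimal simulator interface (reset() returning an observation, step(action) returning (obs, reward, done, info)) and a placeholder request_human_help hand-off that blocks until the operator releases control:

```python
import numpy as np

def run_episode(env, policy, value_fns, theta, request_human_help):
    """Autonomous rollout that pauses and requests help when ensemble uncertainty is high.

    `request_human_help(env, obs)` is a placeholder for the teleoperation hand-off;
    it is assumed to block until the operator releases control and to return the
    recorded human steps as dicts containing at least "next_obs".
    """
    obs, trajectory, done = env.reset(), [], False
    while not done:
        u = np.var([v(obs) for v in value_fns])          # ensemble disagreement u(s)
        if u > theta:
            # Pause, alert the operator, and record their corrections as "positive".
            for step in request_human_help(env, obs):
                trajectory.append({**step, "advantage": "positive"})
            obs = trajectory[-1]["next_obs"]
        else:
            action = policy(obs, advantage="positive")   # always condition on "positive"
            next_obs, reward, done, info = env.step(action)
            trajectory.append({"obs": obs, "action": action, "next_obs": next_obs,
                               "reward": reward, "advantage": None})
            obs = next_obs
    return trajectory
```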

Benefits over passive monitoring:

Aspect               | Passive (Original)  | Active (Ours)
Human role           | Watch continuously  | Respond to requests
Attention required   | 100% of runtime     | Only when needed
Scalability          | 1 human per robot   | 1 human per N robots
Intervention timing  | Human judgment      | Uncertainty-driven

When does the robot request help?

  • Entering states not well-represented in training data
  • Facing ambiguous situations where the value function is uncertain
  • Approaching potential failure modes
🔧 Ensemble Design and Calibration

Number of ensemble members: We use M = 5 value functions, following the deep ensemble literature [↓]. More members improve uncertainty estimates but increase compute linearly. Five provides a good trade-off between calibration quality and overhead.

Training: Each value function is initialized with different random seeds but trained on the same data. Diversity comes from random initialization and stochastic gradient descent - this is sufficient for meaningful disagreement without requiring different architectures or data splits [↓].

Threshold selection: The intervention threshold θ controls the safety-efficiency trade-off. Lower θ means more intervention requests (safer but more human burden); higher θ means fewer requests (more autonomous but higher failure risk). We calibrate θ by monitoring the false-negative rate on a validation set: how often does the robot fail in states where u(s) < θ?

Computational cost: The ensemble adds M× forward passes through the value function at each step. Since value functions are small relative to the VLA policy (they only predict a scalar from state), this overhead is negligible compared to action generation.
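
One possible calibration procedure on logged validation rollouts (a sketch; the candidate grid and target miss rate are assumptions, not values from the method):

```python
import numpy as np

def calibrate_threshold(uncertainties, failed, target_miss_rate=0.05):
    """Choose the largest θ whose false-negative rate stays below a target.

    Args:
        uncertainties: u(s) recorded at each validation step, shape (N,).
        failed: boolean per step, True if that step's episode ended in failure.
        target_miss_rate: max acceptable fraction of failure steps with u(s) < θ.
    """
    uncertainties = np.asarray(uncertainties)
    failed = np.asarray(failed, dtype=bool)
    candidates = np.quantile(uncertainties, np.linspace(0.5, 0.99, 50))  # ascending θ grid
    best = candidates[0]
    for theta in candidates:
        miss_rate = np.mean(uncertainties[failed] < theta)  # failures we would not flag
        if miss_rate <= target_miss_rate:
            best = theta       # keep raising θ while misses stay acceptable
    return best
```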

3.3 Integration with RECAP Pipeline

Our improvements integrate into the standard RECAP training loop:

Improved RECAP Pipeline
IMPROVED RECAP PIPELINE
═══════════════════════

For each training iteration:

1. COLLECT ROLLOUTS
   ├── Execute policy with advantage = "positive"
   ├── Monitor uncertainty u(s) at each step
   └── Request intervention when u(s) > θ

2. PROCESS TRAJECTORIES
   ├── Successful trajectories:
   │   └── All steps → advantage = positive
   ├── Failed trajectories:
   │   ├── Find failure point K = argmin V(s_t)
   │   ├── Discard steps 0 to K-1
   │   └── Steps K to T → advantage = negative
   └── Intervention segments:
       └── All human actions → advantage = positive

3. TRAIN VALUE FUNCTION ENSEMBLE
   └── M value functions on successes, failures, and intervention segments (step-count rewards)

4. TRAIN POLICY
   └── With advantage conditioning on processed data
The complete training loop with failure point detection and active intervention requests.

The key differences from standard RECAP:

  • Step 1: Active intervention requests replace passive monitoring
  • Step 2: Failure point detection replaces trajectory-level labeling for failed trajectories
  • Step 3: An ensemble of value functions is trained instead of a single value function
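
A sketch of step 2 of this pipeline, combining the three labeling rules (the trajectory fields success, human, states, and steps are illustrative, not the actual data schema):

```python
import numpy as np

def process_trajectory(traj, value_fn):
    """Assign advantage labels to one rollout before policy training.

    `traj` is assumed to carry `states` (array of s_0..s_T), per-step dicts in
    `steps` with a `human` flag for intervention segments, and a `success` flag.
    """
    labeled = []
    values = value_fn(traj["states"])
    K = int(np.argmin(values))                  # failure point (unused if successful)
    for t, step in enumerate(traj["steps"]):
        if step.get("human"):
            labeled.append((step, "positive"))  # human corrections: trusted
        elif traj["success"]:
            labeled.append((step, "positive"))  # successful rollout: keep everything
        elif t >= K:
            labeled.append((step, "negative"))  # post-failure-point actions
        # else: pre-failure-point steps of a failed rollout are discarded
    return labeled
```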

4. Experiments

4.1 Setup

We implement our improvements on top of the RECAP codebase and evaluate on simulated reaching tasks in Genesis.

Environment: Robot arm reaching task with random target positions.

  • Observation: joint positions/velocities, end effector position, relative target position (24-dim)
  • Action: delta joint positions (7-dim)
  • Success: end effector within 0.08m of target

Baselines:

  • RECAP (original): trajectory-level advantage labeling, passive monitoring
  • RECAP + FPD: with failure point detection only
  • RECAP + AIR: with active intervention requests only
  • RECAP + Both: full method

4.2 Metrics

  • Success rate: Percentage of episodes reaching the goal
  • Label accuracy: For failed trajectories, what fraction of "negative" labels are actually on bad actions?
  • Human effort: Interventions per episode, total human attention time
  • Sample efficiency: Success rate vs. training episodes

4.3 Results

Results: Under Review

This section is still being refined and reviewed for accuracy. The planned analyses are outlined below.

Failure Point Detection Analysis (planned):

  • Comparison of label quality with/without FPD
  • Visualization of failure points in example trajectories
  • Impact on final policy performance

Active Intervention Request Analysis (planned):

  • Human effort reduction vs. passive monitoring
  • Comparison of intervention timing (uncertainty-based vs. human judgment)
  • Impact on learning efficiency from interventions

5. Discussion

5.1 When Does Failure Point Detection Help Most?

Failure point detection provides the most benefit when:

  • Failed trajectories have significant "good" portions before failure
  • The value function accurately predicts progress
  • Failures stem from specific actions rather than gradual drift

It helps less when:

  • Failures occur immediately (no good actions to preserve)
  • The value function is poorly calibrated
  • The entire trajectory is uniformly suboptimal
⚠️

Important caveat: Failure point detection is only as good as the value function. If V_φ is inaccurate - especially early in training when data is limited - the detected failure point may be noisy. The method improves as the value function becomes more calibrated through training iterations.

5.2 Limitations of Active Intervention Requests

Uncertainty calibration: If the ensemble is overconfident (low uncertainty in unfamiliar states), the robot won't request help when it should. Conversely, if underconfident, it will request help too frequently.

Threshold sensitivity: The intervention threshold must balance safety (lower threshold, more requests) against human burden (higher threshold, fewer requests).

Not a replacement for good pretraining: Active intervention requests reduce monitoring burden but don't eliminate the need for human expertise. The quality of interventions still matters.

5.3 Relationship to Prior Work

Failure point detection relates to credit assignment in RL. While TD methods distribute credit temporally through bootstrapping, we use the value function directly to segment trajectories. This is simpler and doesn't require on-policy data assumptions.

Active intervention requests build on uncertainty-based methods in active learning and safe RL. ThriftyDAgger [↓] uses novelty detection to decide when to query an expert during DAgger-style training. The key difference is integration with advantage conditioning - uncertainty drives not just exploration decisions but also the labeling of human collaboration data within the RECAP framework.

Both DAgger [↓] and our active intervention requests address distribution shift between training and deployment. DAgger queries the expert at on-policy states; we request interventions at uncertain states and feed the resulting demonstrations into the advantage-conditioned training pipeline.

5.4 Future Directions

Adaptive failure margins: Rather than a hard cutoff at the argmin, use a soft margin before the failure point to account for actions that led to the failure. The transition from "good" to "bad" is rarely instantaneous - a smooth weighting scheme based on the time derivative dV_φ(s_t)/dt could capture this more faithfully.

Language-based intervention requests: Instead of a simple "help needed" signal, the robot could describe what it's uncertain about, leveraging the VLM backbone (e.g., "unsure how to orient the gripper for this object shape").

Cross-task transfer of value functions: Training a general progress predictor that transfers across tasks could improve failure point detection in new domains where task-specific data is limited.


6. Conclusion

We presented two improvements to RECAP for Vision-Language-Action models:

  1. Failure point detection uses the value function to identify exactly when a trajectory transitions from progress to failure, enabling precise advantage labeling that avoids mislabeling good actions as bad.

  2. Active intervention requests enable the robot to request help when uncertain, reducing human supervision burden while maintaining sample-efficient learning from expert corrections.

Together, these make RECAP's training signal cleaner and its human supervision requirements practical for multi-robot deployment.


References

1. Black et al., "π₀: A Vision-Language-Action Flow Model for General Robot Control", 2024. The foundational VLA model that RECAP builds upon.

2. Physical Intelligence, "π₀.₆*: A VLA That Learns From Experience", 2025. Introduces RECAP - the advantage-conditioned policy improvement method we build upon.

3. Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", 2023. Pioneered diffusion-based action generation for robotics.

4. Hoque et al., "ThriftyDAgger: Budget-Aware Novelty and Risk Gating for Interactive Imitation Learning", 2021. Related work on uncertainty-based intervention timing in imitation learning.

5. Ross, Gordon, and Bagnell, "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning", 2011. The foundational DAgger paper establishing interactive imitation learning.

6. Lakshminarayanan, Pritzel, and Blundell, "Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles", 2017. Established deep ensembles as a practical method for uncertainty estimation.

@article{federicosarrocco2026,
  author = {Federico Sarrocco},
  title = {Improving RECAP: Better Advantage Labeling and Active Intervention Requests for VLA Models},
  year = {2026},
  month = {February},
  day = {12},
  url = {https://federicosarrocco.com/blog/improving-recap}
}

About the author

Federico Sarrocco


View Portfolio