OpenAI Agent RFT 2026: Generally Available on o4-mini, SafetyKit Achieves F1 Score of 90%

2026-01-23 Ryuta Hamamoto

In 2026, OpenAI's Agent RFT (Reinforcement Fine-Tuning) became generally available on o4-mini, dramatically improving AI agent performance. SafetyKit achieved an F1 score improvement from 86% to 90% in content moderation.


Hello, this is Hamamoto from TIMEWELL.

"Content moderation F1 score improved from 86% to 90%" — in 2026, OpenAI's Agent RFT (Reinforcement Fine-Tuning) has completed its alpha program and become generally available on o4-mini, dramatically improving AI agent performance. SafetyKit achieved this result with advanced content moderation capabilities, and Runloop automated the use of complex third-party APIs like the Stripe API.

With Agent RFT, agents learn from their own experience while autonomously interacting with external tools, yielding performance gains that far exceed conventional fine-tuning methods. This article explains what Agent RFT is and how to apply it in practice.

What Is Agent RFT: Status in 2026

General Availability and o4-mini Support

Agent RFT status in 2026:

| Item | Details |
|---|---|
| Availability | Generally available since May 2025; 2026 is the full adoption period |
| Supported models | o4-mini (for verified organizations) |
| Price | $100/hour of core training loop runtime |
| Cost vs. SFT | 100–700x higher, but performance gains are dramatic |

In May 2025, OpenAI's Reinforcement Fine-Tuning exited its alpha program and became generally available. This is an important milestone, as specialized AI is now accessible to more organizations.

How Agent RFT Works

The basic process of Reinforcement Fine-Tuning (RFT):

  1. Define custom graders: Define reward signals for each task
  2. Generate multiple candidates: Sample multiple answer candidates for each prompt
  3. Scoring: The grader scores each candidate
  4. Policy gradient update: Fine-tune the model toward high-scoring answers

Unlike traditional supervised learning (SFT), which "imitates correct examples," RFT "learns the actions that maximize reward" — a key distinction.
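The four steps above can be sketched as a toy training step in Python. The sampler, grader, and reference answers here are illustrative stand-ins, not OpenAI's actual API; a real update would adjust model weights rather than return advantages:

```python
import random

# Toy reference answers; a real custom grader encodes task-specific rules.
REFERENCE = {"What is 6 * 7?": "42"}

def grade(prompt: str, answer: str) -> float:
    """Step 1/3: custom grader producing a reward signal per candidate."""
    target = REFERENCE[prompt]
    if answer == target:
        return 1.0
    return 0.5 if target in answer else 0.0

def sample_candidates(prompt: str, n: int = 4) -> list[str]:
    """Step 2: stand-in for sampling n completions from the current policy."""
    return [random.choice(["42", "about 42", "7"]) for _ in range(n)]

def rft_step(prompt: str) -> list[tuple[str, float]]:
    """One training step: score candidates and compute their advantages."""
    candidates = sample_candidates(prompt)
    scores = [grade(prompt, c) for c in candidates]   # step 3: scoring
    baseline = sum(scores) / len(scores)
    # Step 4: a real policy-gradient update would weight each candidate's
    # log-probability by its advantage (score minus the batch mean).
    return [(c, s - baseline) for c, s in zip(candidates, scores)]

for cand, adv in rft_step("What is 6 * 7?"):
    print(f"{cand!r}: advantage {adv:+.2f}")
```

Candidates scoring above the batch mean get positive advantages and are reinforced; those below are suppressed, which is exactly the "learn the actions that maximize reward" behavior described above.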

Comparing SFT, DPO, and RFT

Comparison of fine-tuning methods:

| Method | Objective | Cost | Use Cases |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Imitate correct examples | Low | Style/tone adjustment |
| DPO (Direct Preference Optimization) | Learn preferences | Medium | Reflecting user preferences |
| RFT (Reinforcement Fine-Tuning) | Learn reward-maximizing behavior | High (100–700x) | Agent tasks, complex rule application |

Three Key Use Cases for Agent RFT

1. Converting Instructions to Working Code

Use case:

The agent receives user instructions and generates code that actually works. Tests are run in a code execution environment — high scores for success, low scores for failure.
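This pass/fail scoring can be sketched as an execution-based grader. The `grade_code` helper below is a simplified assumption: a production grader would sandbox execution properly and might award partial credit per passing test:

```python
import subprocess
import sys
import tempfile

def grade_code(candidate: str, tests: str, timeout: int = 30) -> float:
    """Execution-based grader: run the candidate plus its tests in a
    subprocess and reward success. Simplified sketch; real graders
    isolate execution in a sandboxed environment."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # hanging code earns no reward

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(grade_code(good, tests), grade_code(bad, tests))
```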

Real example: MacO (automatic GPU kernel generation)

  • Challenge: The model couldn't learn from a limited dataset
  • Solution: RFT with 100 PyTorch prompts
  • Result: GPU kernel generation capability improved by 72%

2. Extracting Facts in Structured Formats

Use case:

Extract necessary facts from vast information sources and output them in structured formats. Ideal for analyzing financial reports, medical records, legal documents, and more.

Real example: Rogo (financial analysis)

  • Task: Extract information needed for investment decisions from financial reports
  • Custom grader: Evaluates fact-checking, reasoning accuracy, information completeness, and clarity of explanation
  • Result: 21% performance improvement from the base model; significant reduction in misinformation and citation omissions

3. Accurately Applying Complex Rules

Use case:

Accurately apply complex rules such as corporate business rules, compliance requirements, and regulations.

Real example: SafetyKit (content moderation)

  • Task: Advanced content moderation
  • Model: o3-mini RFT
  • Result: F1 score improved from 86% to 90%


Demonstrated Performance Gains from Agent RFT

Financial QA Task Case Study

Experiment setup:

  • Data: Approximately 2,800 financial reports
  • Constraint: Maximum 10 tool calls
  • Task: Accurate numerical answers to questions

Results:

| Metric | Base Model | After RFT | Improvement |
|---|---|---|---|
| Average reward | 0.6 | 0.74+ | +14 points |
| Tool call count | 6–9 | 4 | 50% reduction |
| Reasoning tokens | 2,500 | 1,500 | 40% reduction |

Cognition (Devin) Case Study

Devin (Cognition's autonomous AI engineer) improvements:

  • Initial challenge: 8–10 communications required per user query
  • After RFT: Communication count reduced by half
  • Effect: Reduced wait time for editing tasks; faster user experience

Devin's operational flow:

  1. Enter planning mode
  2. Gather information via file search and shell operations (read-only tools)
  3. Execute necessary actions in parallel
  4. Accurate execution with minimal tool calls
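Steps 2 and 3, gathering with read-only tools and then running independent calls in parallel, can be sketched as follows. The `search_files` and `read_head` tools are hypothetical illustrations, not Cognition's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical read-only tools an agent might use during planning.
def search_files(root: str, pattern: str) -> list[str]:
    """File-search tool: list matching paths without modifying anything."""
    return sorted(str(p) for p in Path(root).rglob(pattern))

def read_head(path: str, n_lines: int = 3) -> str:
    """Shell-style read-only tool: return the first few lines of a file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return "".join(next(f, "") for _ in range(n_lines))

def gather(root: str) -> dict[str, str]:
    """Find relevant files, then read them in parallel so independent
    tool calls don't serialize into extra wall-clock time."""
    paths = search_files(root, "*.py")
    with ThreadPoolExecutor() as pool:
        return dict(zip(paths, pool.map(read_head, paths)))
```

Parallelizing only read-only calls is the safe version of this pattern: reads commute, so fanning them out cannot corrupt state the way parallel writes could.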

Runloop (Stripe API Usage) Case Study

Runloop's challenge and solution:

  • Challenge: Wanted to utilize large, complex third-party APIs like the Stripe API without human intervention
  • Solution: Learned Stripe API call optimization with Agent RFT
  • Effect: Automated complex API operations; reduced error rate

Precautions and Success Strategies for Agent RFT Adoption

1. High-Quality Task Design and Grader Construction

Key factors for success:

  • Clear evaluation criteria: Consistently define what constitutes a correct answer
  • Partial scoring: Award partial credit for minor decimal differences and formatting errors
  • Expert input: Specialists with business knowledge participate in grader design

Financial QA evaluation example:

  • Accurate number: 1.0 point
  • Correct number but wrong format: 0.8 points
  • Minor decimal difference: 0.6 points
  • Large error: 0.0 points
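A grader implementing this rubric might look like the following sketch. The tolerance used for "minor decimal difference" is an assumption; the real threshold is task-specific:

```python
def grade_number(answer: str, expected: float, tol: float = 0.05) -> float:
    """Tiered partial credit for a numeric answer, mirroring the rubric
    above. `tol` is an illustrative relative tolerance for 'minor
    decimal difference'."""
    raw = answer.strip()
    try:
        value = float(raw.replace(",", "").replace("$", "").rstrip("%"))
    except ValueError:
        return 0.0                      # not a number at all
    if raw == str(expected):
        return 1.0                      # accurate number, expected format
    if value == expected:
        return 0.8                      # correct number, wrong format
    if abs(value - expected) <= tol * abs(expected):
        return 0.6                      # minor decimal difference
    return 0.0                          # large error

for ans in ["1234.5", "$1,234.50", "1240", "900"]:
    print(ans, "->", grade_number(ans, 1234.5))
```

Graded tiers like this keep the reward signal smooth: a candidate that is close but imperfect still receives gradient pressure toward the exact answer instead of a flat zero.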

2. Ensuring Initial Performance

Important principle:

It's important for the agent to produce reasonably correct results from the start. If none of the agent's sampled attempts ever reaches the correct answer, every candidate receives the same zero reward, exploration produces no learning signal, and the model loses the opportunity to improve.

Recommended approach:

  1. First ensure baseline performance with SFT
  2. Then apply RFT on top for optimization

3. Training Environment Close to Production

Infrastructure preparation points:

  • Tool call stability: Endpoint failures have a negative impact on learning
  • Server load monitoring: Prevent failures from overload
  • Error handling: Appropriate handling when tool calls fail
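In practice this means wrapping every training-time tool endpoint so transient failures are retried and hard failures return a structured error instead of crashing the rollout. A minimal sketch, where the retry policy is an assumption:

```python
import time

def call_tool(fn, *args, retries: int = 3, backoff: float = 0.5) -> dict:
    """Retry transient endpoint failures with exponential backoff; on final
    failure, return a structured error the agent (and grader) can observe,
    rather than aborting the training rollout."""
    for attempt in range(retries):
        try:
            return {"ok": True, "result": fn(*args)}
        except Exception as exc:  # e.g. a flaky endpoint returning 503
            if attempt == retries - 1:
                return {"ok": False, "error": str(exc)}
            time.sleep(backoff * 2 ** attempt)
```

Surfacing the error as data lets the reward design decide how to treat a failed call, instead of the failure silently poisoning the training batch.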

4. Countermeasures Against the Repeat Phenomenon

What is the repeat phenomenon:

A phenomenon where the agent calls the same tool unnecessarily in succession. This increases latency across the entire system and has a negative impact on the user experience.

Countermeasures:

  • Apply a light penalty during training
  • Build "reaching the correct answer in as few tries as possible" into the reward design
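One way to encode both countermeasures is to fold a per-repeat deduction into the reward. The penalty size here is illustrative:

```python
def reward_with_repeat_penalty(base_score: float,
                               tool_calls: list[tuple],
                               penalty: float = 0.02) -> float:
    """Deduct a small amount for each consecutive duplicate tool call,
    nudging the policy toward reaching the answer in as few calls as
    possible. The 0.02 default is an illustrative value."""
    repeats = sum(1 for prev, cur in zip(tool_calls, tool_calls[1:])
                  if prev == cur)
    return base_score - penalty * repeats

calls = [("search", "q1"), ("search", "q1"), ("read", "report.pdf")]
print(reward_with_repeat_penalty(1.0, calls))
```

Keeping the penalty light matters: it should break ties between equally correct trajectories, not outweigh correctness itself.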

5. Adjusting the Compute Multiplier

What is the compute multiplier:

A parameter that determines how many additional attempts the agent makes per sample.

| Setting | Effect | Caution |
|---|---|---|
| Low | Cost reduction, faster training | Risk of insufficient exploration |
| Medium | Balanced | Recommended setting |
| High | Discovering superior decision patterns | Increased load on tool endpoints |
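In a job configuration this knob might appear as in the sketch below. The field names, including `compute_multiplier` and the grader shape, are assumptions modeled loosely on OpenAI's reinforcement fine-tuning job schema; verify them against the current API documentation before use:

```python
# Hypothetical RFT job configuration; treat every field name as an
# assumption and check the current OpenAI fine-tuning documentation.
rft_job = {
    "model": "o4-mini",
    "method": {
        "type": "reinforcement",
        "reinforcement": {
            "grader": {"type": "python", "name": "financial_qa_grader"},
            "hyperparameters": {
                "compute_multiplier": 1.0,  # "medium": balanced cost vs. exploration
                "n_epochs": 2,
            },
        },
    },
}
print(rft_job["method"]["reinforcement"]["hyperparameters"]["compute_multiplier"])
```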

Cost vs. Benefit: Is RFT Worth the Investment?

Cost Structure

Agent RFT costs:

  • Training cost: $100/hour (core training loop)
  • Compared to SFT: 100–700x higher
  • Example: If SFT costs hundreds of dollars for an equivalent dataset, RFT costs tens to hundreds of thousands of dollars

ROI Analysis

Cases where RFT justifies the investment:

  1. Automating agent tasks: Dramatically reducing human workload

    • Example: SafetyKit's content moderation automation
  2. High accuracy required: Tasks where error costs are high

    • Example: Financial analysis, medical diagnostic support, legal document review
  3. Large-scale operations: Processing large volumes of tasks with a single investment

    • Example: Optimizing millions of API calls per month

Cases where RFT is not needed:

  • Simple style/tone adjustment (SFT is sufficient)
  • You already have a large labeled dataset (SFT is more cost-effective at that scale)
  • Projects with tight budget constraints

TIMEWELL's Agent RFT Support

Build an Enterprise Agent RFT Environment with ZEROCK

ZEROCK is an enterprise AI platform that supports Agent RFT from adoption through operation.

Key features:

  • Custom grader design: Build evaluation criteria in collaboration with business specialists
  • Training dataset management: Creating and managing high-quality datasets
  • AWS servers hosted in Japan: Ensuring security and privacy

Example of ZEROCK + Agent RFT integration:

  1. Define business tasks and design evaluation criteria
  2. Leverage company-specific knowledge base with ZEROCK
  3. Optimize agent performance with Agent RFT
  4. Deploy to production and monitor

Optimize Agent RFT Adoption Strategy with WARP

WARP maximizes ROI through Agent RFT adoption consulting.

Support includes:

  • Determining applicability of Agent RFT (SFT/DPO/RFT selection)
  • Cost vs. benefit ROI analysis
  • Task design and grader construction support
  • Strategic planning by former enterprise DX specialists

Summary: The Dawn of the Agent RFT Era

Key Points

  • General availability: Alpha ended May 2025; full adoption period in 2026
  • o4-mini support: Available for verified organizations
  • SafetyKit case: F1 score 86% → 90% (content moderation)
  • Financial QA case: Average reward +14 points, tool calls reduced by 50%
  • MacO case: GPU kernel generation capability improved by 72%
  • Rogo case: 21% performance improvement in financial analysis
  • Cost: 100–700x more than SFT, but dramatic improvement in agent performance

The Future of Agent RFT

In 2026, Agent RFT is opening a new era in which AI agents learn by autonomously calling tools. Moving beyond conventional prompt engineering, agents can now connect to real-time external information and steer toward the answer their objective requires, bringing innovation to every field: finance, healthcare, engineering, content moderation, and more.

The cost is high at 100–700x more than SFT, but as the track records demonstrate — SafetyKit's 4-point F1 score improvement, MacO's 72% performance gain, and Rogo's 21% improvement — for agent tasks requiring high accuracy, the returns justify the investment.

What Companies Should Do Now

  1. Identify use cases: Map out business tasks where Agent RFT can be applied
  2. SFT vs. RFT decision: Compare costs and benefits and choose the appropriate method
  3. Grader design: Build clear evaluation criteria in collaboration with business specialists
  4. Pilot deployment: Demonstrate effectiveness with a small-scale task
  5. Scale rollout: Replicate successful tasks across the organization

Agent RFT is a powerful technology for taking AI agent performance to the next level. In 2026, companies that master this technology will build competitive advantage in the AI agent era.
