OpenAI Agent RFT 2026: Generally Available on o4-mini, SafetyKit Achieves F1 Score of 90%

2026-01-23 Ryuta Hamamoto

In 2026, OpenAI's Agent RFT (Reinforcement Fine-Tuning) became generally available on o4-mini, dramatically improving AI agent performance. SafetyKit achieved an F1 score improvement from 86% to 90% in content moderation.


Hello, this is Hamamoto from TIMEWELL.

"Content moderation F1 score improved from 86% to 90%" — in 2026, OpenAI's Agent RFT (Reinforcement Fine-Tuning) has completed its alpha program and become generally available on o4-mini, dramatically improving AI agent performance. SafetyKit achieved this result with advanced content moderation capabilities, and Runloop automated the use of complex third-party APIs like the Stripe API.

With Agent RFT, agents learn from their own experience while autonomously interacting with external tools, yielding performance gains that far exceed conventional fine-tuning methods. This article explains what Agent RFT is and how to apply it in practice.

What Is Agent RFT: Status in 2026

General Availability and o4-mini Support

Agent RFT status in 2026:

| Item | Details |
|---|---|
| Availability | Generally available since May 2025; 2026 is the full adoption period |
| Supported models | o4-mini (for verified organizations) |
| Price | $100/hour of core training loop runtime |
| Cost vs. SFT | 100–700x higher, but performance gains are dramatic |

In May 2025, OpenAI's Reinforcement Fine-Tuning exited its alpha program and became generally available. This is an important milestone, as specialized AI is now accessible to more organizations.

How Agent RFT Works

The basic process of Reinforcement Fine-Tuning (RFT):

  1. Define custom graders: Define reward signals for each task
  2. Generate multiple candidates: Sample multiple answer candidates for each prompt
  3. Scoring: The grader scores each candidate
  4. Policy gradient update: Fine-tune the model toward high-scoring answers

Unlike traditional supervised learning (SFT), which "imitates correct examples," RFT "learns the actions that maximize reward" — a key distinction.
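The four steps above can be sketched as a toy training step in Python. The sampler, grader, and reference answers here are illustrative stand-ins, not OpenAI's actual API; a real update would adjust model weights rather than return advantages:

```python
import random

# Toy reference answers; a real custom grader encodes task-specific rules.
REFERENCE = {"What is 6 * 7?": "42"}

def grade(prompt: str, answer: str) -> float:
    """Step 1/3: custom grader producing a reward signal per candidate."""
    target = REFERENCE[prompt]
    if answer == target:
        return 1.0
    return 0.5 if target in answer else 0.0

def sample_candidates(prompt: str, n: int = 4) -> list[str]:
    """Step 2: stand-in for sampling n completions from the current policy."""
    return [random.choice(["42", "about 42", "7"]) for _ in range(n)]

def rft_step(prompt: str) -> list[tuple[str, float]]:
    """One training step: score candidates and compute their advantages."""
    candidates = sample_candidates(prompt)
    scores = [grade(prompt, c) for c in candidates]   # step 3: scoring
    baseline = sum(scores) / len(scores)
    # Step 4: a real policy-gradient update would weight each candidate's
    # log-probability by its advantage (score minus the batch mean).
    return [(c, s - baseline) for c, s in zip(candidates, scores)]

for cand, adv in rft_step("What is 6 * 7?"):
    print(f"{cand!r}: advantage {adv:+.2f}")
```

Candidates scoring above the batch mean get positive advantages and are reinforced; those below are suppressed, which is exactly the "learn the actions that maximize reward" behavior described above.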

Comparing SFT, DPO, and RFT

Comparison of fine-tuning methods:

| Method | Objective | Cost | Use Cases |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Imitate correct examples | Low | Style/tone adjustment |
| DPO (Direct Preference Optimization) | Learn preferences | Medium | Reflecting user preferences |
| RFT (Reinforcement Fine-Tuning) | Learn reward-maximizing behavior | High (100–700x) | Agent tasks, complex rule application |

Three Key Use Cases for Agent RFT

1. Converting Instructions to Working Code

Use case:

The agent receives user instructions and generates code that actually works. Tests are run in a code execution environment — high scores for success, low scores for failure.
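This pass/fail scoring can be sketched as an execution-based grader. The `grade_code` helper below is a simplified assumption: a production grader would sandbox execution properly and might award partial credit per passing test:

```python
import subprocess
import sys
import tempfile

def grade_code(candidate: str, tests: str, timeout: int = 30) -> float:
    """Execution-based grader: run the candidate plus its tests in a
    subprocess and reward success. Simplified sketch; real graders
    isolate execution in a sandboxed environment."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # hanging code earns no reward

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(grade_code(good, tests), grade_code(bad, tests))
```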

Real example: MacO (automatic GPU kernel generation)

  • Challenge: The model couldn't learn from a limited dataset
  • Solution: RFT with 100 PyTorch prompts
  • Result: GPU kernel generation capability improved by 72%

2. Extracting Facts in Structured Formats

Use case:

Extract necessary facts from vast information sources and output them in structured formats. Ideal for analyzing financial reports, medical records, legal documents, and more.

Real example: Rogo (financial analysis)

  • Task: Extract information needed for investment decisions from financial reports
  • Custom grader: Evaluates fact-checking, reasoning accuracy, information completeness, and clarity of explanation
  • Result: 21% performance improvement from the base model; significant reduction in misinformation and citation omissions

3. Accurately Applying Complex Rules

Use case:

Accurately apply complex rules such as corporate business rules, compliance requirements, and regulations.

Real example: SafetyKit (content moderation)

  • Task: Advanced content moderation
  • Model: o3-mini RFT
  • Result: F1 score improved from 86% to 90%


Demonstrated Performance Gains from Agent RFT

Financial QA Task Case Study

Experiment setup:

  • Data: Approximately 2,800 financial reports
  • Constraint: Maximum 10 tool calls
  • Task: Accurate numerical answers to questions

Results:

| Metric | Base Model | After RFT | Improvement |
|---|---|---|---|
| Average reward | 0.6 | 0.74+ | +14 points |
| Tool call count | 6–9 | 4 | 50% reduction |
| Reasoning tokens | 2,500 | 1,500 | 40% reduction |

Cognition (Devin) Case Study

Devin (Cognition's autonomous AI engineer) improvements:

  • Initial challenge: 8–10 communications required per user query
  • After RFT: Communication count reduced by half
  • Effect: Reduced wait time for editing tasks; faster user experience

Devin's operational flow:

  1. Enter planning mode
  2. Gather information via file search and shell operations (read-only tools)
  3. Execute necessary actions in parallel
  4. Accurate execution with minimal tool calls
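Steps 2 and 3, gathering with read-only tools and then running independent calls in parallel, can be sketched as follows. The `search_files` and `read_head` tools are hypothetical illustrations, not Cognition's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical read-only tools an agent might use during planning.
def search_files(root: str, pattern: str) -> list[str]:
    """File-search tool: list matching paths without modifying anything."""
    return sorted(str(p) for p in Path(root).rglob(pattern))

def read_head(path: str, n_lines: int = 3) -> str:
    """Shell-style read-only tool: return the first few lines of a file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return "".join(next(f, "") for _ in range(n_lines))

def gather(root: str) -> dict[str, str]:
    """Find relevant files, then read them in parallel so independent
    tool calls don't serialize into extra wall-clock time."""
    paths = search_files(root, "*.py")
    with ThreadPoolExecutor() as pool:
        return dict(zip(paths, pool.map(read_head, paths)))
```

Parallelizing only read-only calls is the safe version of this pattern: reads commute, so fanning them out cannot corrupt state the way parallel writes could.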

Runloop (Stripe API Usage) Case Study

Runloop's challenge and solution:

  • Challenge: Wanted to utilize large, complex third-party APIs like the Stripe API without human intervention
  • Solution: Learned Stripe API call optimization with Agent RFT
  • Effect: Automated complex API operations; reduced error rate

Precautions and Success Strategies for Agent RFT Adoption

1. High-Quality Task Design and Grader Construction

Key factors for success:

  • Clear evaluation criteria: Consistently define what constitutes a correct answer
  • Partial scoring: Award partial credit for minor decimal differences and formatting errors
  • Expert input: Specialists with business knowledge participate in grader design

Financial QA evaluation example:

  • Accurate number: 1.0 point
  • Correct number but wrong format: 0.8 points
  • Minor decimal difference: 0.6 points
  • Large error: 0.0 points
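A grader implementing this rubric might look like the following sketch. The tolerance used for "minor decimal difference" is an assumption; the real threshold is task-specific:

```python
def grade_number(answer: str, expected: float, tol: float = 0.05) -> float:
    """Tiered partial credit for a numeric answer, mirroring the rubric
    above. `tol` is an illustrative relative tolerance for 'minor
    decimal difference'."""
    raw = answer.strip()
    try:
        value = float(raw.replace(",", "").replace("$", "").rstrip("%"))
    except ValueError:
        return 0.0                      # not a number at all
    if raw == str(expected):
        return 1.0                      # accurate number, expected format
    if value == expected:
        return 0.8                      # correct number, wrong format
    if abs(value - expected) <= tol * abs(expected):
        return 0.6                      # minor decimal difference
    return 0.0                          # large error

for ans in ["1234.5", "$1,234.50", "1240", "900"]:
    print(ans, "->", grade_number(ans, 1234.5))
```

Graded tiers like this keep the reward signal smooth: a candidate that is close but imperfect still receives gradient pressure toward the exact answer instead of a flat zero.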

2. Ensuring Initial Performance

Important principle:

It's important for the agent to produce reasonably correct results from the start. If none of the agent's sampled attempts ever reaches the correct answer, every candidate receives the same zero reward, exploration produces no learning signal, and the model loses the opportunity to improve.

Recommended approach:

  1. First ensure baseline performance with SFT
  2. Then apply RFT on top for optimization

3. Training Environment Close to Production

Infrastructure preparation points:

  • Tool call stability: Endpoint failures have a negative impact on learning
  • Server load monitoring: Prevent failures from overload
  • Error handling: Appropriate handling when tool calls fail
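In practice this means wrapping every training-time tool endpoint so transient failures are retried and hard failures return a structured error instead of crashing the rollout. A minimal sketch, where the retry policy is an assumption:

```python
import time

def call_tool(fn, *args, retries: int = 3, backoff: float = 0.5) -> dict:
    """Retry transient endpoint failures with exponential backoff; on final
    failure, return a structured error the agent (and grader) can observe,
    rather than aborting the training rollout."""
    for attempt in range(retries):
        try:
            return {"ok": True, "result": fn(*args)}
        except Exception as exc:  # e.g. a flaky endpoint returning 503
            if attempt == retries - 1:
                return {"ok": False, "error": str(exc)}
            time.sleep(backoff * 2 ** attempt)
```

Surfacing the error as data lets the reward design decide how to treat a failed call, instead of the failure silently poisoning the training batch.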

4. Countermeasures Against the Repeat Phenomenon

What is the repeat phenomenon:

A phenomenon where the agent calls the same tool unnecessarily in succession. This increases latency across the entire system and has a negative impact on the user experience.

Countermeasures:

  • Apply a light penalty during training
  • Build "reaching the correct answer in as few tries as possible" into the reward design
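One way to encode both countermeasures is to fold a per-repeat deduction into the reward. The penalty size here is illustrative:

```python
def reward_with_repeat_penalty(base_score: float,
                               tool_calls: list[tuple],
                               penalty: float = 0.02) -> float:
    """Deduct a small amount for each consecutive duplicate tool call,
    nudging the policy toward reaching the answer in as few calls as
    possible. The 0.02 default is an illustrative value."""
    repeats = sum(1 for prev, cur in zip(tool_calls, tool_calls[1:])
                  if prev == cur)
    return base_score - penalty * repeats

calls = [("search", "q1"), ("search", "q1"), ("read", "report.pdf")]
print(reward_with_repeat_penalty(1.0, calls))
```

Keeping the penalty light matters: it should break ties between equally correct trajectories, not outweigh correctness itself.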

5. Adjusting the Compute Multiplier

What is the compute multiplier:

A parameter that determines how many additional attempts the agent makes per sample.

| Setting | Effect | Caution |
|---|---|---|
| Low | Cost reduction, faster training | Risk of insufficient exploration |
| Medium | Balanced | Recommended setting |
| High | Discovering superior decision patterns | Increased load on tool endpoints |
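In a job configuration this knob might appear as in the sketch below. The field names, including `compute_multiplier` and the grader shape, are assumptions modeled loosely on OpenAI's reinforcement fine-tuning job schema; verify them against the current API documentation before use:

```python
# Hypothetical RFT job configuration; treat every field name as an
# assumption and check the current OpenAI fine-tuning documentation.
rft_job = {
    "model": "o4-mini",
    "method": {
        "type": "reinforcement",
        "reinforcement": {
            "grader": {"type": "python", "name": "financial_qa_grader"},
            "hyperparameters": {
                "compute_multiplier": 1.0,  # "medium": balanced cost vs. exploration
                "n_epochs": 2,
            },
        },
    },
}
print(rft_job["method"]["reinforcement"]["hyperparameters"]["compute_multiplier"])
```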

Cost vs. Benefit: Is RFT Worth the Investment?

Cost Structure

Agent RFT costs:

  • Training cost: $100/hour (core training loop)
  • Compared to SFT: 100–700x higher
  • Example: If SFT costs hundreds of dollars for an equivalent dataset, RFT costs tens to hundreds of thousands of dollars

ROI Analysis

Cases where RFT justifies the investment:

  1. Automating agent tasks: Dramatically reducing human workload

    • Example: SafetyKit's content moderation automation
  2. High accuracy required: Tasks where error costs are high

    • Example: Financial analysis, medical diagnostic support, legal document review
  3. Large-scale operations: Processing large volumes of tasks with a single investment

    • Example: Optimizing millions of API calls per month

Cases where RFT is not needed:

  • Simple style/tone adjustment (SFT is sufficient)
  • You already have a large labeled dataset (SFT is more cost-effective at that scale)
  • Projects with tight budget constraints

TIMEWELL's Agent RFT Support

Build an Enterprise Agent RFT Environment with ZEROCK

ZEROCK is an enterprise AI platform that supports Agent RFT from adoption through operation.

Key features:

  • Custom grader design: Build evaluation criteria in collaboration with business specialists
  • Training dataset management: Creating and managing high-quality datasets
  • AWS servers hosted in Japan: Ensuring security and privacy

Example of ZEROCK + Agent RFT integration:

  1. Define business tasks and design evaluation criteria
  2. Leverage company-specific knowledge base with ZEROCK
  3. Optimize agent performance with Agent RFT
  4. Deploy to production and monitor

Optimize Agent RFT Adoption Strategy with WARP

WARP maximizes ROI through Agent RFT adoption consulting.

Support includes:

  • Determining applicability of Agent RFT (SFT/DPO/RFT selection)
  • Cost vs. benefit ROI analysis
  • Task design and grader construction support
  • Strategic planning by former enterprise DX specialists

Summary: The Dawn of the Agent RFT Era

Key Points

  • General availability: Alpha ended May 2025; full adoption period in 2026
  • o4-mini support: Available for verified organizations
  • SafetyKit case: F1 score 86% → 90% (content moderation)
  • Financial QA case: Average reward +14 points, tool calls reduced by 50%
  • MacO case: GPU kernel generation capability improved by 72%
  • Rogo case: 21% performance improvement in financial analysis
  • Cost: 100–700x more than SFT, but dramatic improvement in agent performance

The Future of Agent RFT

In 2026, Agent RFT is opening a new era in which AI agents learn by autonomously calling tools. Moving beyond conventional prompt engineering, agents can now connect to real-time external information and steer toward the answer their objective requires, bringing innovation to every field: finance, healthcare, engineering, content moderation, and more.

The cost is high at 100–700x more than SFT, but as the track records demonstrate — SafetyKit's 4-point F1 score improvement, MacO's 72% performance gain, and Rogo's 21% improvement — for agent tasks requiring high accuracy, the returns justify the investment.

What Companies Should Do Now

  1. Identify use cases: Map out business tasks where Agent RFT can be applied
  2. SFT vs. RFT decision: Compare costs and benefits and choose the appropriate method
  3. Grader design: Build clear evaluation criteria in collaboration with business specialists
  4. Pilot deployment: Demonstrate effectiveness with a small-scale task
  5. Scale rollout: Replicate successful tasks across the organization

Agent RFT is a powerful technology for taking AI agent performance to the next level. In 2026, companies that master this technology will build competitive advantage in the AI agent era.
