Hello, this is Hamamoto from TIMEWELL.
"Content moderation F1 score improved from 86% to 90%" — in 2026, OpenAI's Agent RFT (Reinforcement Fine-Tuning), having completed its alpha program and become generally available on o4-mini, is dramatically improving AI agent performance. SafetyKit achieved the result above with advanced content moderation, while Runloop automated the use of large, complex third-party APIs such as the Stripe API.
Agent RFT adopts a mechanism by which agents self-learn while autonomously interacting with external tools, realizing performance improvements that far exceed conventional fine-tuning methods. This article explains the full scope of Agent RFT and how to apply it in practice.
What Is Agent RFT: Status in 2026
General Availability and o4-mini Support
Agent RFT status in 2026:
| Item | Content |
|---|---|
| Availability | Generally available since May 2025; full adoption period in 2026 |
| Supported models | o4-mini (for verified organizations) |
| Price | $100/hour (core training loop runtime) |
| Comparison with SFT | 100–700x the cost, but performance gains are dramatic |
In May 2025, OpenAI's Reinforcement Fine-Tuning exited its alpha program and became generally available. This is an important milestone, as specialized AI is now accessible to more organizations.
How Agent RFT Works
The basic process of Reinforcement Fine-Tuning (RFT):
- Define custom graders: Define reward signals for each task
- Generate multiple candidates: Sample multiple answer candidates for each prompt
- Scoring: The grader scores each candidate
- Policy gradient update: Fine-tune the model toward high-scoring answers
The key distinction: unlike traditional supervised fine-tuning (SFT), which imitates correct examples, RFT learns the actions that maximize reward.
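The four steps above can be sketched as a toy policy-gradient loop. Everything here (the action set, the grader, the learning rate) is illustrative and not OpenAI's API; it only shows how sampling, grading, and the update shift probability mass toward high-reward behavior:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grader(answer):
    # Toy reward signal: only answer "B" is the behavior we want to reinforce.
    return 1.0 if answer == "B" else 0.0

def rft_step(logits, actions, lr=0.5, n_samples=8):
    """One RFT iteration: sample candidates, grade them, policy-gradient update."""
    probs = softmax(logits)
    for _ in range(n_samples):
        idx = random.choices(range(len(actions)), weights=probs)[0]
        reward = grader(actions[idx])
        # REINFORCE: raise the log-probability of the sampled action in
        # proportion to its reward (baseline omitted for brevity).
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * reward * grad
        probs = softmax(logits)
    return logits

random.seed(0)
actions = ["A", "B", "C"]
logits = [0.0, 0.0, 0.0]
for _ in range(50):
    logits = rft_step(logits, actions)
print(softmax(logits))  # probability mass has shifted strongly toward "B"
```

Note that no "correct example" was ever shown to the policy; it only received scores, which is exactly what distinguishes RFT from SFT.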
Comparing SFT, DPO, and RFT
Comparison of fine-tuning methods:
| Method | Objective | Cost | Use Cases |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Imitate correct examples | Low | Style/tone adjustment |
| DPO (Direct Preference Optimization) | Learn preferences | Medium | Reflecting user preferences |
| RFT (Reinforcement Fine-Tuning) | Learn reward-maximizing behavior | High (100–700x) | Agent tasks, complex rule application |
Three Key Use Cases for Agent RFT
1. Converting Instructions to Working Code
Use case:
The agent receives user instructions and generates code that actually works. Tests are run in a code execution environment — high scores for success, low scores for failure.
Real example: MacO (automatic GPU kernel generation)
- Challenge: The model couldn't learn from a limited dataset
- Solution: RFT with 100 PyTorch prompts
- Result: GPU kernel generation capability improved by 72%
2. Extracting Facts in Structured Formats
Use case:
Extract necessary facts from vast information sources and output them in structured formats. Ideal for analyzing financial reports, medical records, legal documents, and more.
Real example: Rogo (financial analysis)
- Task: Extract information needed for investment decisions from financial reports
- Custom grader: Evaluates fact-checking, reasoning accuracy, information completeness, and clarity of explanation
- Result: 21% performance improvement from the base model; significant reduction in misinformation and citation omissions
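Rogo's actual grader scores several dimensions at once; a much-simplified sketch of the core idea is per-field partial credit against a reference extraction (field names below are invented for illustration):

```python
def extraction_grader(extracted: dict, reference: dict) -> float:
    """Per-field partial credit: reward is the fraction of reference fields
    the model extracted correctly. Missing or wrong fields earn nothing."""
    if not reference:
        return 0.0
    correct = sum(
        1 for key, value in reference.items() if extracted.get(key) == value
    )
    return correct / len(reference)

reference = {"revenue": "4.2B", "eps": "1.35", "guidance": "raised"}
output = {"revenue": "4.2B", "eps": "1.35", "guidance": "lowered"}
print(extraction_grader(output, reference))  # 2 of 3 fields correct
```

Because every field contributes independently, the reward also directly penalizes the omissions mentioned above: a field left out scores the same as a wrong one.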
3. Accurately Applying Complex Rules
Use case:
Accurately apply complex rules such as corporate business rules, compliance requirements, and regulations.
Real example: SafetyKit (content moderation)
- Task: Advanced content moderation
- Model: o3-mini RFT
- Result: F1 score improved from 86% to 90%
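For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows the formula; the operating points are hypothetical (SafetyKit's actual precision/recall split is not public), chosen only to reproduce the headline numbers:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical balanced operating points matching the reported F1 values.
print(round(f1_score(0.86, 0.86), 2))  # 0.86
print(round(f1_score(0.90, 0.90), 2))  # 0.90
```

At these balanced operating points, moving from 0.86 to 0.90 cuts the error rate from 14% to 10%, i.e. roughly 29% fewer moderation mistakes, which is why a "4-point" gain matters at scale.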
Demonstrated Performance Gains from Agent RFT
Financial QA Task Case Study
Experiment setup:
- Data: Approximately 2,800 financial reports
- Constraint: Maximum 10 tool calls
- Task: Accurate numerical answers to questions
Results:
| Metric | Base Model | After RFT | Improvement |
|---|---|---|---|
| Average reward | 0.6 | 0.74+ | +14 points |
| Tool call count | 6–9 | 4 | 50% reduction |
| Reasoning tokens | 2,500 | 1,500 | 40% reduction |
Cognition (Devin) Case Study
Devin (autonomous AI engineer) improvements:
- Initial challenge: 8–10 communications required per user query
- After RFT: Communication count reduced by half
- Effect: Reduced wait time for editing tasks; faster user experience
Devin's operational flow:
- Enter planning mode
- Gather information via file search and shell operations (read-only tools)
- Execute necessary actions in parallel
- Accurate execution with minimal tool calls
Runloop (Stripe API Usage) Case Study
Runloop's challenge and solution:
- Challenge: Wanted to utilize large, complex third-party APIs like the Stripe API without human intervention
- Solution: Learned Stripe API call optimization with Agent RFT
- Effect: Automated complex API operations; reduced error rate
Precautions and Success Strategies for Agent RFT Adoption
1. High-Quality Task Design and Grader Construction
Key factors for success:
- Clear evaluation criteria: Consistently define what constitutes a correct answer
- Partial scoring: Handle differences below the decimal point and formatting errors
- Expert input: Specialists with business knowledge participate in grader design
Financial QA evaluation example:
- Accurate number: 1.0 point
- Correct number but wrong format: 0.8 points
- Minor decimal difference: 0.6 points
- Large error: 0.0 points
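The rubric above can be implemented directly as a grading function. This is a sketch with assumed thresholds (e.g. "minor decimal difference" taken as within 1% relative error); the tier values come from the rubric itself:

```python
def numeric_grader(answer: str, reference: float, rel_tol: float = 0.01) -> float:
    """Tiered partial credit matching the financial QA rubric above."""
    cleaned = answer.replace(",", "").replace("$", "").strip()
    try:
        value = float(cleaned)
    except ValueError:
        return 0.0  # unparseable output falls into the "large error" tier
    if cleaned == str(reference):
        return 1.0  # exact number in the expected format
    if value == reference:
        return 0.8  # correct number, wrong formatting (commas, currency sign)
    if reference != 0 and abs(value - reference) / abs(reference) <= rel_tol:
        return 0.6  # minor decimal difference
    return 0.0      # large error

print(numeric_grader("1234.5", 1234.5))     # 1.0
print(numeric_grader("$1,234.50", 1234.5))  # 0.8
print(numeric_grader("1234.9", 1234.5))     # 0.6 (within 1%)
print(numeric_grader("999", 1234.5))        # 0.0
```

Smooth tiers like these matter: an all-or-nothing grader would give the early-stage model almost no positive signal to learn from.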
2. Ensuring Initial Performance
Important principle:
It's important for the agent to produce reasonably correct results from the start. If the agent never reaches a correct answer in any of its sampled attempts, the grader yields no positive reward signal, exploration stalls, and the model loses its opportunity to improve.
Recommended approach:
- First ensure baseline performance with SFT
- Then apply RFT on top for optimization
3. Training Environment Close to Production
Infrastructure preparation points:
- Tool call stability: Endpoint failures have a negative impact on learning
- Server load monitoring: Prevent failures from overload
- Error handling: Appropriate handling when tool calls fail
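One practical pattern for the points above is to wrap tool endpoints in a retry layer so transient failures don't get graded as wrong answers. This is an illustrative sketch (the tool, sentinel shape, and retry parameters are all assumptions, not part of OpenAI's API):

```python
import time

def call_tool_with_retry(tool, *args, retries: int = 3, backoff: float = 0.05):
    """Retry transient failures with exponential backoff; after exhausting
    retries, return a sentinel the grader can treat as 'infrastructure
    error' rather than 'incorrect answer'."""
    for attempt in range(retries):
        try:
            return {"ok": True, "result": tool(*args)}
        except ConnectionError:
            time.sleep(backoff * (2 ** attempt))  # back off before retrying
    return {"ok": False, "result": None}

# Simulated flaky endpoint: fails twice under load, then succeeds.
calls = {"n": 0}
def flaky_search(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("endpoint overloaded")
    return f"results for {query}"

print(call_tool_with_retry(flaky_search, "Q3 revenue"))
# {'ok': True, 'result': 'results for Q3 revenue'}
```

Distinguishing infrastructure errors from genuine mistakes keeps endpoint flakiness from injecting noise into the reward signal during training.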
4. Countermeasures Against the Repeat Phenomenon
What is the repeat phenomenon:
A phenomenon where the agent calls the same tool unnecessarily in succession. This increases latency across the entire system and has a negative impact on the user experience.
Countermeasures:
- Apply a light penalty during training
- Build "reaching the correct answer in as few tries as possible" into the reward design
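Both countermeasures can be combined in the reward design. A minimal sketch, assuming the trajectory's tool calls are logged as a list and a penalty of 0.05 per consecutive duplicate (the penalty size is an illustrative choice):

```python
def shaped_reward(base_reward: float, tool_calls: list,
                  penalty: float = 0.05) -> float:
    """Subtract a light penalty for each consecutive duplicate tool call,
    nudging the agent to reach the answer in as few tries as possible."""
    repeats = sum(
        1 for prev, cur in zip(tool_calls, tool_calls[1:]) if prev == cur
    )
    return max(0.0, base_reward - penalty * repeats)

# Both trajectories reach the correct answer, but the second one
# repeated the same search three times in a row.
print(shaped_reward(1.0, ["search", "fetch", "answer"]))             # 1.0
print(shaped_reward(1.0, ["search", "search", "search", "answer"]))  # 0.9
```

Keeping the penalty light is deliberate: correctness should still dominate the reward, with efficiency only breaking ties between equally correct trajectories.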
5. Adjusting the Compute Multiplier
What is the compute multiplier:
A parameter that determines how many additional attempts the agent makes per sample.
| Setting | Effect | Caution |
|---|---|---|
| Low | Cost reduction, faster training | Risk of insufficient exploration |
| Medium | Balanced | Recommended setting |
| High | Discovering superior decision patterns | Increased load on tool endpoints |
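The exploration risk at a low setting has a simple probabilistic reading: if the base model solves a hard prompt with probability p per attempt, the chance that at least one of n attempts succeeds is 1 − (1 − p)^n, and prompts where every attempt fails contribute no learning signal. A quick illustration (the 10% per-attempt success rate is a made-up figure):

```python
def hit_probability(p_success: float, n_attempts: int) -> float:
    """Probability that at least one of n independent attempts reaches a
    correct answer; prompts with zero correct attempts teach the model nothing."""
    return 1 - (1 - p_success) ** n_attempts

# Base model solves a hard prompt 10% of the time per attempt.
for n in (1, 4, 16, 64):
    print(n, round(hit_probability(0.10, n), 3))
# 1 0.1
# 4 0.344
# 16 0.815
# 64 0.999
```

This is why raising the multiplier helps discover better decision patterns on hard prompts, at the price of proportionally more load on your tool endpoints.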
Cost vs. Benefit: Is RFT Worth the Investment?
Cost Structure
Agent RFT costs:
- Training cost: $100/hour (core training loop)
- Compared to SFT: 100–700x higher
- Example: If SFT costs hundreds of dollars for an equivalent dataset, RFT costs tens to hundreds of thousands of dollars
ROI Analysis
Cases where RFT justifies the investment:
Automating agent tasks: Dramatically reducing human workload
- Example: SafetyKit's content moderation automation
High accuracy required: Tasks where error costs are high
- Example: Financial analysis, medical diagnostic support, legal document review
Large-scale operations: Processing large volumes of tasks with a single investment
- Example: Optimizing millions of API calls per month
Cases where RFT is not needed:
- Simple style/tone adjustment (SFT is sufficient)
- Abundant labeled data is available (a large dataset makes plain SFT effective at far lower cost)
- Projects with tight budget constraints
TIMEWELL's Agent RFT Support
Build an Enterprise Agent RFT Environment with ZEROCK
ZEROCK is an enterprise AI platform that supports Agent RFT from adoption through operation.
Key features:
- Custom grader design: Build evaluation criteria in collaboration with business specialists
- Training dataset management: Creating and managing high-quality datasets
- AWS domestic servers: Ensuring security and privacy
Example of ZEROCK + Agent RFT integration:
- Define business tasks and design evaluation criteria
- Leverage company-specific knowledge base with ZEROCK
- Optimize agent performance with Agent RFT
- Deploy to production and monitor
Optimize Agent RFT Adoption Strategy with WARP
WARP maximizes ROI through Agent RFT adoption consulting.
Support includes:
- Determining applicability of Agent RFT (SFT/DPO/RFT selection)
- Cost vs. benefit ROI analysis
- Task design and grader construction support
- Strategic planning by former enterprise DX specialists
Summary: The Dawn of the Agent RFT Era
Key Points
- General availability: Alpha ended May 2025; full adoption period in 2026
- o4-mini support: Available for verified organizations
- SafetyKit case: F1 score 86% → 90% (content moderation)
- Financial QA case: Average reward +14 points, tool calls reduced by 50%
- MacO case: GPU kernel generation capability improved by 72%
- Rogo case: 21% performance improvement in financial analysis
- Cost: 100–700x more than SFT, but dramatic improvement in agent performance
The Future of Agent RFT
In 2026, Agent RFT is opening a new era in which AI agents self-learn by autonomously calling tools. Moving beyond the confines of conventional prompt engineering, agents that connect to real-time external information and steer toward answers aligned with their objectives are bringing innovation to every field: finance, healthcare, engineering, content moderation, and more.
The cost is high at 100–700x more than SFT, but as the track records demonstrate — SafetyKit's 4-point F1 score improvement, MacO's 72% performance gain, and Rogo's 21% improvement — for agent tasks requiring high accuracy, the returns justify the investment.
What Companies Should Do Now
- Identify use cases: Map out business tasks where Agent RFT can be applied
- SFT vs. RFT decision: Compare costs and benefits and choose the appropriate method
- Grader design: Build clear evaluation criteria in collaboration with business specialists
- Pilot deployment: Demonstrate effectiveness with a small-scale task
- Scale rollout: Replicate successful tasks across the organization
Agent RFT is a powerful technology for taking AI agent performance to the next level. In 2026, companies that master this technology will build competitive advantage in the AI agent era.
References
- Reinforcement fine-tuning | OpenAI API
- Fine-tuning updates: Reinforcement fine-tuning now available + GPT-4.1 nano fine-tuning | OpenAI Developer Community
- Reinforcement fine-tuning use cases | OpenAI API
- Is OpenAI's Reinforcement Fine-Tuning (RFT) Worth It? | TensorZero
- Fine-Tuning Techniques - Choosing Between SFT, DPO, and RFT | OpenAI Cookbook
- Exploring Model Graders for Reinforcement Fine-Tuning | OpenAI Cookbook
