Hello, this is Hamamoto from TIMEWELL.
"Which AI model should we pick in the end?" Last week, a CIO at a manufacturing company asked me this almost as an afterthought at the close of a meeting. They knew Claude, GPT, and Gemini. They had seen DeepSeek become the talk of the industry. Their engineers were apparently experimenting with Llama. But they could not see how to put it all together.
Honestly, AI model selection today looks completely different from a year ago. As of April 2026, DeepSeek has released a preview of V4, Mistral has fully open-sourced a 256k-context large MoE, and Alibaba's Qwen has crossed 700 million cumulative downloads on Hugging Face. At the same time, commercial models like Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro continue to extend their capabilities.
In this fifth installment of the series, I want to redraw the boundary between open-source and proprietary AI, and dig into the latest benchmarks, cost, regulation, and even the security policy angle. By the end, my goal is for Japanese enterprises to be able to say, "in our case, this combination is the right call."
What Is Open-Source AI? Three Tiers to Understand It
"Open-source AI" gets used as a catch-all, but in reality there are three tiers, and any discussion that fails to distinguish among them goes nowhere.
The strictest is fully open. The Open Source Initiative (OSI) released "Open Source AI Definition 1.0" in October 2024, which is becoming the global standard. It requires public availability of training-data information, source code, weights, and architecture documentation. The models that have passed OSI's verification are research-leaning ones like Pythia (EleutherAI), OLMo (AI2), Amber, CrystalCoder (LLM360), and T5 (Google).
Next is open weights. Only the weight files are public; training data and preprocessing scripts are not. Most of what the world calls "open-source AI" - Meta Llama, Mistral, Qwen, DeepSeek, Phi, Gemma - falls here. Because the weights are available, fine-tuning, quantization, and on-prem operation are possible, but full retraining from scratch is not.
Finally, partially open. The weights are public, but the license imposes use restrictions. The Llama 4 community license is the canonical example: companies with more than 700 million monthly active users must obtain a separate license from Meta, and rights to the multimodal version are explicitly withheld from EU residents and EU-based companies. OSI has issued a formal statement that "Llama is still not open source."[^2]
I want to emphasize that license strictness and practical usability are separate things. For domestic enterprise engagements like ZEROCK, what matters is "can we run it freely on-prem," "are there extra fees for commercial use," and "are we stepping on a regulatory landmine." OSI certification is just one input. That said, with the EU AI Act fully applicable from August 2026, license selection has become a topic that cannot be ignored[^1].
Major OSS Models as of April 2026: Real Performance and License Details
Let me get specific. As of April 2026, six OSS model families are worth knowing.
Llama 4 (Meta, April 2025) comes in three configurations: Scout, Maverick, and Behemoth. Maverick is an MoE design with 400B total parameters but only 17B active, scoring 80.5 on MMLU-Pro and 69.8 on GPQA Diamond. Llama was long the OSS flagship, but license friction and the rise of DeepSeek and Qwen cost it the top of Hugging Face's cumulative-download leaderboard in October 2025, when Qwen took the lead.
Mistral Large 3 (Mistral AI, December 2025) is a France-born Apache 2.0 model. With 41B active and 675B total parameters in an MoE, it offers a remarkable 256k token context window. It supports both multimodal and multilingual operation, and is widely considered the first frontier OSS model to reach parity with OpenAI's GPT-4o and Google's Gemini 2. It also has the best fit with EU AI Act disclosure obligations.
Qwen 3 (Alibaba, April 2025) is the most-used OSS model in the world today. It spans dense networks from 0.6B to 32B and MoEs at 30B and 235B, trained on 36 trillion tokens. It supports 119 languages and ships with native Model Context Protocol (MCP) and Function Calling. In January 2026 it crossed 700 million cumulative downloads on Hugging Face, accounting for the majority of global OSS downloads[^5]. Qwen 2.5-1.5B-Instruct has been called "the most downloaded AI model in the world."
DeepSeek V4 (preview release, April 2026) is the model that directly motivated this article. The Pro version is an MoE with 1.6T total and 49B active parameters - among the largest open-weight models in existence. MMLU is 90.1, HumanEval Pass@1 is 76.8, and SWE-bench and BrowseComp scores approach commercial models. Even more striking, it achieves a native 1M token context window while cutting FLOPs to 27% and KV cache to 10% of V3.2 levels[^4]. API pricing is USD 0.14 input / USD 0.28 output for Flash and USD 0.145 / USD 3.48 for Pro - more than an order of magnitude below GPT-5.5's USD 5 / USD 30.
Phi-4 (Microsoft, January 2025, MIT license) is the standard-bearer for the size-conscious camp. With 14B parameters, it beats Llama 3.3 70B on benchmarks like GPQA and MATH. Scoring above 80% on MATH and MGSM at this size is remarkable.
Gemma 3 (Google) spans 1B to 27B, with multimodal support, 128k context, and 140 languages from 4B and up. The 27B-IT scores MMLU-Pro 67.5, GPQA Diamond 42.4, and MATH 69.0, beating Gemini 1.5 Pro. Google's serious entry into the small-model space, it is the prime candidate for lightweight enterprise use cases.
What needs to be emphasized is that "OSS is performance-inferior" is no longer a valid premise. On LMSys Chatbot Arena, the gap between leading OSS and commercial models has narrowed to dozens of Elo - in some metrics, within margin of error[^3]. Chinese labs have shipped at unusual speed; DeepSeek went from V3 in December 2024 to frontier-class in 16 months.
Benchmarks Side by Side: Real Differences by Use Case
When you line the numbers up, each model's strengths look very different. Here are the main April 2026 benchmarks in summary.
| Model | MMLU / MMLU-Pro | GPQA Diamond | SWE-bench Pro | HumanEval | Notes |
|---|---|---|---|---|---|
| Claude Opus 4.7 | - | 94.2% | 64.3% | Saturated | Tops coding and MCP-Atlas (77.3%) |
| GPT-5.5 | - | 94.4% | 57.7% | Saturated | Leads Terminal-Bench 2.0; price increased (GPQA and SWE-bench figures reported on the 5.4 release) |
| Gemini 3.1 Pro | - | 94.3% | 54.2% | Saturated | Multimodal (Video-MME 78.2%) leader |
| DeepSeek V4-Pro | 90.1 (MMLU, 5-shot) | - | Approaching commercial | 76.8 | 1M context, exceptional cost-performance |
| Llama 4 Maverick | 80.5 (MMLU-Pro) | 69.8 | - | - | MoE; EU restrictions apply |
| Qwen 3-235B | - | - | - | - | Top-tier on AIME25, LiveCodeBench, Arena-Hard |
| Mistral Large 3 | - | - | - | - | 256k context, Apache 2.0 |
| Phi-4 (14B) | - | - | - | - | 80%+ on MATH and MGSM; beats Llama 3.3 70B at 14B |
| Gemma 3 27B-IT | 67.5 (MMLU-Pro) | 42.4 | - | - | 128k context, leading lightweight |
I understand the urge to look at the table and conclude "DeepSeek wins." But from what I have observed in implementation work, benchmarks indicate only the floor. Real selection requires another step.
For coding assistance, Claude Opus 4.7 at 64.3% on SWE-bench Pro is dominant. Qwen3-Coder-Next is described as "matching models 20 times its size," but in real-world productivity it has not yet caught the commercial frontier. Conversely, for "good-enough accuracy at lower cost and higher speed" use cases like internal knowledge search and document summarization, DeepSeek V4-Flash and Qwen 3-30B are overwhelmingly favorable.
Multimodal work, especially video understanding, is in a class of its own; Gemini 3.1 Pro at 78.2% will not be caught for the foreseeable future. For just transcription or speech generation, Mistral added Voxtral in March 2026, fully open. Read benchmarks by use case or you will pick the wrong model.
On "how to read" benchmarks, Chatbot Arena results should not be taken at face value either. The 2025 paper "The Leaderboard Illusion" pointed out that data access is heavily skewed across providers (the proprietary camp at 61.4%) and that open models tend to be deprecated early, making evaluation unstable. Judging by Elo alone is an argument that ignores sampling bias.
Self-Hosting Cost: The Reality of an H100 x2 Configuration
You hear "OSS lowers costs," but the actual numbers tell a more nuanced story. For a Llama 4 70B-class workload running on two H100s, the realistic cost structure looks like this.
Hardware first. Used H100 SXM units run USD 15,000 to USD 20,000 each, which amortizes over three years to USD 900 to USD 1,200 per month for the pair. Power: two cards at 500 W each draw about 720 kWh per month, roughly 1,000 kWh after applying a PUE of 1.4, or about USD 131. Datacenter colocation runs USD 200 to USD 500 per month. The often-overlooked piece is operational labor: even at a 25% allocation of a senior MLOps engineer, that is roughly USD 4,000 per month.
Adding it up, monthly TCO comes to about USD 5,931 - nearly all of it fixed cost, so the unit price is a pure function of volume. At 5 million tokens per day that works out to roughly USD 40 per million tokens, which does not yet beat GPT-5.5's USD 5 input / USD 30 output; push utilization toward 50 million tokens per day and the unit cost falls to about USD 4, at which point self-hosting wins decisively. The lesson: below a few million tokens per day the fixed costs dominate, and API consumption is overwhelmingly more economical; self-hosting pays off only at sustained high volume.
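The arithmetic can be sketched in a few lines. The inputs below are the monthly midpoints quoted above - illustrative assumptions, not vendor quotes:

```python
# Back-of-envelope TCO for the 2x H100 configuration discussed above.
# All inputs are monthly USD midpoints from the text; swap in your own quotes.

def monthly_tco(
    gpu_amortization: float = 1_050,  # 2 used H100 SXM cards over 3 years (USD 900-1,200/month)
    power: float = 131,               # 2 x 500 W, ~720 kWh/month, PUE 1.4
    colocation: float = 350,          # USD 200-500/month midpoint
    ops_labor: float = 4_000,         # ~25% of a senior MLOps engineer
) -> float:
    return gpu_amortization + power + colocation + ops_labor

def usd_per_million_tokens(tco: float, tokens_per_day: float) -> float:
    # Nearly all cost is fixed, so unit price is inversely proportional to volume.
    return tco / (tokens_per_day * 30 / 1_000_000)

tco = monthly_tco()
for daily in (2_000_000, 5_000_000, 50_000_000):
    print(f"{daily / 1e6:>4.0f}M tokens/day -> USD {usd_per_million_tokens(tco, daily):6.2f} per 1M tokens")
```

At these midpoints the fixed base lands near USD 5,500 a month, and the per-token price falls linearly as daily volume rises - which is the whole break-even argument in one function.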
There are also hidden costs to remember: GPU procurement lead time, supply constraints, the difficulty of hiring infrastructure teams, and validation effort during model updates. In my experience, three-to-six month lead times for H100 procurement are common. "Available next month" is not a real promise.
This is where DeepSeek V4's pricing breaks the market's assumptions. Flash at USD 0.14 input / USD 0.28 output and Pro at USD 0.145 / USD 3.48. Before the "API or self-host" debate, a third option - "use OSS weights, but consume them via Chinese-vendor APIs" - has become real. As I will cover in the next section, however, data sovereignty and security considerations make this a difficult choice for commercial enterprises.
Talking with WARP customers, what I notice is that surprisingly few decide on cost alone. The dominant ask is "we will accept paying somewhat more total, as long as we can defend the architecture in front of regulators and auditors." Cost is one input to the decision; without integrating regulation, audit, SLA, and observability into the picture, you are seeing only half of it.
Data Sovereignty and Regulation: The Status of Chinese Models and the EU AI Act
The technology and price discussion is not the whole of enterprise AI selection. 2026 is the year regulation and geopolitics moved firmly to the front.
Most obligations under the EU AI Act apply from August 2, 2026[^1]. OSS GPAIs (general-purpose AI models) are partially exempt, but providers of models with systemic risk are excluded from the exemption. The contentious point is the clause that "even if released for free, monetization through API charges or support contracts disqualifies the exemption." Misreading that risks finding yourself subject to disclosure obligations on what you thought was an OSS deployment. The legal analysis released by the EU AI Office's advisory group in January 2026 concluded that the Llama Community License does not qualify as a "free and open license" under the EU AI Act; the Apache 2.0 of Mistral 7B and Mixtral 8x7B, on the other hand, does qualify. That is where the analysis stands today.
For Chinese models, geopolitical risk is more direct. In the US, state governments in Texas, Virginia, New York and others have prohibited DeepSeek for public use, and bipartisan bills have been introduced to exclude it from federal procurement. Korea, Australia, Taiwan, and Italy have followed. NIST's Center for AI Standards and Innovation tested DeepSeek R1 against jailbreak techniques and found it complied with harmful requests 94% of the time, versus 8% for US frontier models.
I am uncomfortable with the shorthand of "Qwen and DeepSeek are dangerous because they are Chinese." Downloading the weights locally and running them on-prem or in a closed AWS Tokyo region is technically a different exfiltration path than API calls. Weight files have no networking capability. That said, when used via cloud APIs, the design routes up to 100,000 words of user data per request to mainland China - that is a landmine commercial enterprises should not step on.
Inside Japan as well, as I covered in related articles - AI Export Controls 2026 Edition and Local Government Trends Banning Chinese IT - operations under the Economic Security Promotion Act and the Foreign Exchange Act are tightening, and direct use of DeepSeek or Qwen in defense, public sector, and infrastructure has been all but eliminated as an option. At the same time, options for domestic OSS - PLaMo 2.2 Prime, Rakuten AI 3.0, Fujitsu tsuzumi 2 - have grown for the public sector, supported by METI's GENIAC (the JPY 1 trillion, five-year AI investment program announced in December 2025).
ZEROCK runs GraphRAG entirely on AWS domestic servers and adopts a model-swappable architecture for exactly this reason. The flexibility to switch regulatory posture by swapping weights is what enterprise AI now requires. For specific trends in domestic agents, see also The Enterprise AI Agent Wave at Google Cloud Next 2025.
The Hybrid Strategy Enterprises Should Pursue
By this point you are likely sensing that "all OSS" or "all commercial" is not the answer; mixing is the practical reality. NVIDIA's CEO put it well in an interview: "Proprietary versus open is not a thing. It's proprietary AND open"[^7]. That captures the heart of today's enterprise AI strategy.
What I recommend in the field is a three-tier separation. At the top, place commercial frontier models (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro) for customer-facing work, executive decision support, and tasks requiring top-tier reasoning. In the middle, run open-weight large models (Mistral Large 3, Qwen 3, Llama 4, DeepSeek V4 - only those that satisfy regulatory requirements) on-prem or in a domestic cloud for fast, cheap internal knowledge search and structured processing. At the bottom, deploy small OSS models (Phi-4, Gemma 3) in a fully closed environment for the most sensitive PHI and PII processing.
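As a sketch, the tiering reduces to a small routing function. The tier labels, model names, and classification rules below are illustrative assumptions, not a production policy:

```python
# Minimal three-tier router mirroring the split described above:
# commercial frontier on top, open-weight large models in the middle,
# small OSS models in a closed environment at the bottom.

from dataclasses import dataclass

@dataclass
class Task:
    sensitivity: str  # "public" | "internal" | "restricted" (PHI/PII)
    needs_frontier_reasoning: bool

def route(task: Task) -> str:
    if task.sensitivity == "restricted":
        # Bottom tier: the most sensitive data never leaves the closed environment.
        return "small-oss-closed"      # e.g. Phi-4 / Gemma 3
    if task.needs_frontier_reasoning:
        # Top tier: customer-facing work and top-tier reasoning.
        return "commercial-frontier"   # e.g. Claude / GPT / Gemini
    # Middle tier: fast, cheap internal knowledge search and structured processing.
    return "open-weight-selfhosted"    # e.g. Mistral Large 3 / Qwen 3

print(route(Task("restricted", True)))   # sensitivity trumps capability needs
print(route(Task("internal", False)))
```

The one design choice worth copying even if nothing else survives: sensitivity is checked before capability, so regulatory posture can never be traded away for benchmark scores.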
NVIDIA itself adopts this hybrid configuration, using frontier models for orchestration and Nemotron-family OSS for research and experimentation, reportedly cutting query cost by more than 50%. As Kai Waehner notes in his analysis of the enterprise AI landscape, 89% of enterprises now run OSS models in production, and 73% report higher ROI than commercial alternatives[^7].
A common stumble in hybrid strategy is lock-in at the orchestration layer. Riding a commercial vendor's proprietary agent framework looks like an upgrade path but quietly creates deep vendor lock-in. Centering your stack on an open standard like Model Context Protocol (MCP) and keeping models loosely coupled to the agent platform is the key to safe operation three years out. Qwen 3 shipping with native MCP support is symbolic - the OSS camp is starting to lead on interoperability.
A note on implementation tooling. For local development, use Ollama (one command to launch any of more than 45,000 GGUF models from the Hugging Face Hub). For production, use vLLM (PagedAttention reduces memory fragmentation by over 50% and lifts throughput by 2-4x). Both expose OpenAI-compatible APIs, so you can switch at the code level. Hugging Face Inference Endpoints provide a third option to spin up dedicated endpoints without owning GPUs.
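Because all three expose the same /v1/chat/completions schema, swapping backends is essentially a base-URL change. This sketch builds the request with the standard library only; the ports are the common defaults for Ollama and vLLM, and the model names are illustrative:

```python
# Construct an OpenAI-compatible chat request for interchangeable backends.
# Ports are Ollama (11434) and vLLM (8000) defaults; model ids are examples only.

import json
from urllib.request import Request

BACKENDS = {
    "ollama": ("http://localhost:11434/v1", "qwen3:30b"),
    "vllm":   ("http://localhost:8000/v1", "mistral-large-3"),
    "openai": ("https://api.openai.com/v1", "gpt-5.5"),
}

def chat_request(backend: str, prompt: str) -> Request:
    base_url, model = BACKENDS[backend]
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    # Identical payload shape everywhere - only the URL and model id differ.
    return Request(f"{base_url}/chat/completions", data=body,
                   headers={"Content-Type": "application/json"})

req = chat_request("ollama", "Summarize this meeting note.")
print(req.full_url)  # -> http://localhost:11434/v1/chat/completions
```

Keeping the call site identical across backends is exactly the loose coupling that makes the hybrid tiers swappable later.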
In WARP, we provide hands-on support for selecting models across these tiers and designing the hybrid. The work is calculating "for our workload, where do we hand off to commercial models, and where do we absorb with OSS," not just on benchmarks but on TCO and regulatory requirements as well. The points you cannot see in a single pilot get worked through with the operations team in this phase.
My Take: Be Explicit About "Picks" by Use Case
To close, here is my honest opinion. Ending on "use both" is a cop-out, so let me state today's picks explicitly by use case.
Coding assistance: commercial only. Claude Opus 4.7's 64.3% SWE-bench Pro score is, in my view, still 12-18 months ahead of OSS catch-up. This is the wrong place to economize - it directly drives engineering productivity.
Internal knowledge search and RAG: OSS in the lead. On both data volume and cost, DeepSeek V4-Flash and Qwen 3-30B are the practical answer. If you use Chinese models, do it strictly on-prem or in a domestic closed environment. For commercial enterprise customers, I have been recommending Mistral Large 3 more often, because its Apache 2.0 makes the regulatory story easy to explain.
Multimodal, especially video: Gemini 3.1 Pro. Google is winning here decisively for now. For inspection-image analysis at manufacturers or video tagging at media companies, the choice is essentially made.
Consumer-facing dialogue such as community and event interactions: abstract at the application layer like BASE, and design with model swappability assumed. End users do not care which model sits behind BASE.
Maximum-sensitivity internal data processing: Phi-4 or Gemma 3 in a fully closed environment. We are now in an era where 14B parameters can score above 80% on MATH for narrowly scoped tasks. The message that "smaller can be enough" is hitting home again.
When you include implementation, "which model to choose" matters less - by an order of magnitude - than "can you build a swap-friendly architecture from the start." Last year's best practices will be obsolete a year from now. As DeepSeek V4 has just shown, game changes arrive without warning.
To be candid, the era of treating AI model selection as an extension of vendor comparison is over. This is an architecture question, a regulatory question, and an organizational capability question. What we do at WARP is the hands-on work of binding all of those together into a single decision. If any of this is on your plate, please reach out.
In the sixth installment of this series, I plan to write about model orchestration design in the agent era, going deep into MCP and A2A protocol implementation. See you next time.
References
[^1]: EU Artificial Intelligence Act - Official Site
[^2]: Open Source Initiative - "Meta's Llama license is still not Open Source"
[^3]: BenchLM.ai - LLM Leaderboard History 2023-2026
[^4]: VentureBeat - "DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost"
[^5]: Alibaba Group - "Qwen3 Sets New Benchmark in Open-Source AI"
[^6]: Vellum - "Claude Opus 4.7 Benchmarks Explained"
[^7]: Kai Waehner - "Enterprise Agentic AI Landscape 2026: Trust, Flexibility, and Vendor Lock-in"
![Open Source AI vs Proprietary AI: A Definitive Comparison [Updated 2026] | Llama, Mistral, Qwen, DeepSeek vs Claude, GPT, Gemini](/images/columns/open-source-vs-proprietary-ai-models-2026/cover.png)