


Cover Story: When One Whale Is Born, All Things Are Born v2.0
Longtime readers will know that we have been looking forward to DeepSeek V4 ever since our first issues of the year. We even worried that its repeated delays would push expectations higher and higher, only for the final release to turn out mediocre. The results show that we worried too much. From performance to size to price, and even chip adaptation, it delivers almost across the board.
It holds its own even against GPT-5.5, which has just returned to the top. All of this makes us wonder whether the so-called 2.7% gap between China and the United States, which we reported only last week, is already out of date (lol).
Even more exciting is how much room for applications a V4 like this will open up.

Hey, my friend 😊, welcome to “There Is Something New in AI Under the Sun,” a weekly newsletter produced by JoinAI|Zhuoyin Intelligent Algorithm Team.
With the unique technical perspective and restraint of “AI builders,” we carefully select this week’s Top 3 papers, projects, and industry updates for you. We do not care about the illusion of hot-topic traffic. We only track technologies and trends that are truly worth attention. We do not only praise the good side of AI; we also expose its problems.
The most important progress in DeepSeek-V4 is that it pushes million-token context from a showcase capability into usable intelligence. The paper builds a full system design around hybrid attention, mHC, Muon, KV cache, quantization, parallelism, and agent sandbox, with a very clear goal: allowing long context to truly participate in reasoning, coding, search, and complex agent workflows. Its significance has gone beyond “a larger window.” It is more like building a complete runtime stack for long-context intelligence.
Vision Banana sends a very strong signal: inside high-level image generation models, a substantial amount of general visual representation has already been accumulated. It rewrites visual tasks such as segmentation, depth, and normal estimation into RGB image generation tasks, then releases these abilities through lightweight instruction-tuning. This direction matters because visual understanding and image generation are beginning to share the same interface, and generative visual pretraining is increasingly looking like a candidate route for the next-generation CV foundation model.
The focus of Omni lies in Context Unrolling. It places text, image, video, and 3D capabilities into the same shared workspace, allowing the model to call intermediate abilities before output, such as textual reasoning, visual tokens, camera poses, depth estimation, and novel view synthesis. In this way, the value of a unified multimodal model moves from “being able to handle more inputs and outputs” toward “being able to actively organize cross-modal context.” This may be a key step for multimodal models entering the stage of complex reasoning.
Kimi K2.6 has a very clear positioning: it is built for long-horizon coding, complex tool calling, and multi-agent collaboration. It supports 256K context, multimodal input, thinking / non-thinking modes, and emphasizes long-duration code execution, instruction following, self-correction, and autonomous agent execution. The value of this project lies in pushing open-source models further toward a production-grade agent foundation. The core test shifts from “can it answer?” to “can it keep executing?”
CubeSandbox solves a very fundamental problem in agent deployment: how a code execution environment can balance security, speed, and concurrency. It uses MicroVM / KVM for isolation, emphasizes lightweight cold starts, high-concurrency execution, and E2B SDK compatibility, and covers a full service chain from template creation to runtime management and sandbox startup. For agent infrastructure, projects like this are important because real workflows need a stable, secure, and standardizable execution layer that can scale.
The attraction of OpenMythos lies in how it places hot directions such as Recurrent-Depth Transformer, recurrent computation, MoE routing, variable-depth reasoning, and large context into an open-source framework that can be directly experimented with. It explicitly describes itself as a community-driven theoretical reconstruction and does not represent the actual Claude Mythos architecture. Even so, it remains worth attention, because future models may increasingly rely on computation that unfolds on demand, and this kind of playground can help researchers test ideas earlier.
The release of DeepSeek-V4 sends a very strong industry signal: competition among domestic open-source models is moving from leaderboard capability toward an integrated competition across long-context usability, inference cost, chip collaboration, and open ecosystems. Both V4-Pro and V4-Flash support 1M context, while continuing to lower the barrier in FLOPs, KV cache, and API cost. Together with adaptation to Huawei’s Ascend ecosystem, DeepSeek’s move this time is advancing model capability as well as the deployment of domestic compute collaboration.
The main thread of OpenAI’s round of releases is very clear: connecting stronger model capabilities to real workflows. GPT-5.5 strengthens coding, research, data analysis, and document-heavy tasks. Workspace Agents push GPTs toward organization-level persistent agents. Images 2.0 fills in multimodal content production capability. Taken together, OpenAI is once again competing for the core narrative around execution-oriented knowledge work, team collaboration, and complex workflows.
Google’s commitment to invest up to $40 billion in Anthropic, together with Amazon locking in up to 5GW of new compute capacity for Anthropic, shows that frontier model competition has entered the stage of infrastructure binding. Model capability remains important, while capital, cloud platforms, long-term compute, and enterprise distribution channels are now jointly determining the upper limit of competition. OpenAI is strengthening workflows, DeepSeek is advancing open source and domestic chip collaboration, and Anthropic is receiving capital and compute support. Together, these three lines form the biggest picture worth watching this week.

When making synthetic data, do not simply ask AI to “generate 1,000 samples.” First, write down the rules behind the data clearly: what scenarios exist, how variables change, how difficulty levels are divided, what kind of answer counts as qualified, and which cases should be treated as boundary cases. Then let AI generate samples, standard answers, and self-check results in batches according to these rules.
If you are working on title generation, summary rewriting, customer-service Q&A, sales scripts, or agent task sets, you can apply this method: first define the “scenario mechanism + variable range + evaluation standard,” then generate the data.
The inspiration from Google’s paper Reasoning-Driven Synthetic Data Generation and Evaluation is that the value of synthetic data lies not only in quantity, but also in structure and coverage. Writing rules first can prevent the generation of many samples that look rich but are actually repetitive. It also makes it easier to actively cover simple, medium, difficult, and boundary cases.
Data produced this way is more controllable and easier to reuse. Later, when you find that the model performs poorly in a certain type of scenario, you can add data in that direction, making testing and iteration more stable.
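A minimal Python sketch of the “rules first, samples second” idea follows. The scenario names, variable ranges, difficulty levels, and pass criteria are hypothetical placeholders and are not taken from the Google paper. The only point is that the spec is expanded into a coverage grid before any model is asked to generate data.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DataSpec:
    """Rules written down before any sample is generated (all values hypothetical)."""
    scenarios: tuple      # what scenarios exist
    tones: tuple          # one variable and its range
    difficulties: tuple   # how difficulty is divided, including boundary cases
    pass_criteria: str    # what counts as a qualified answer


def build_prompts(spec, samples_per_cell):
    """Build one generation prompt per (scenario, tone, difficulty) cell."""
    prompts = []
    for scenario, tone, difficulty in product(spec.scenarios, spec.tones, spec.difficulties):
        prompts.append(
            f"Generate {samples_per_cell} customer-service Q&A samples. "
            f"Scenario: {scenario}. Customer tone: {tone}. Difficulty: {difficulty}. "
            f"For each sample, also give a reference answer and a self-check "
            f"against this standard: {spec.pass_criteria}"
        )
    return prompts


spec = DataSpec(
    scenarios=("refund request", "delivery delay"),
    tones=("neutral", "angry"),
    difficulties=("simple", "medium", "difficult", "boundary"),
    pass_criteria="states the policy correctly and offers one concrete next step",
)
prompts = build_prompts(spec, samples_per_cell=5)
print(len(prompts))   # 2 scenarios x 2 tones x 4 difficulties = 16 cells
print(prompts[0])
```

Because every cell of the grid gets its own prompt, a weak spot found later (say, angry customers in boundary cases) maps directly to the cells that need more samples.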

Link: see Reference 1.1
Recommendation Index: 🌟🌟🌟🌟🌟
The core progress of DeepSeek-V4 is that it extends context length to one million tokens and, through hybrid compressed attention, mHC residual enhancement, and the Muon optimizer, places “ultra-long context + reasoning + agent” into the same efficient architecture.
DeepSeek-V4 deserves attention because it targets a very real bottleneck: reasoning models and agent systems are moving toward longer workflows, while the quadratic complexity of traditional attention turns ultra-long context into a cost sink. The direction the paper takes is clear: rebuild the system around “million-token context intelligence.” Architecturally, it introduces hybrid attention composed of CSA + HCA to compress long-context computation, uses mHC to stabilize signal propagation in the residual stream, and formally brings Muon into mainstream large-scale training. More importantly, the work goes beyond structural innovation and lays out the entire infrastructure stack: training, parallelism, KV cache, quantization, inference framework, and agent sandbox. In terms of results, both DeepSeek-V4-Pro and DeepSeek-V4-Flash natively support 1M context. At this length, Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2. This goes beyond “making the window larger”: it pushes ultra-long context to the point where it can be used in daily practice.
Reasoning models are being reshaped by long-workflow tasks: the paper opens by emphasizing that test-time scaling, agent workflows, and cross-document analysis are all increasing the demand for ultra-long context, while traditional attention has become the core bottleneck.
Simply expanding the window is not enough; efficiency and stability are the key: when context reaches one million tokens, the real problem is no longer just whether the input can fit. The question is whether single-token FLOPs, KV cache, training stability, and inference infrastructure can all hold together.
DeepSeek wants to build a complete foundation for the long-context era: this paper covers architecture, optimizer, kernels, parallelism, quantization, KV cache management, post-training, RL, and agent sandbox. It is not a single-point structural paper.
Hybrid attention makes million-token context truly usable: DeepSeek-V4 uses CSA + HCA to form hybrid attention. CSA first compresses KV and then performs sparse attention. HCA performs more aggressive KV compression while retaining dense attention. The goal is very clear: reduce the FLOPs and KV cache of 1M-token context to a level that can run in daily use. The two figures on the first page of the paper show that at 1M context, V4-Pro’s single-token FLOPs are about 27% of V3.2, and KV cache is about 10%; V4-Flash can further reduce them to 10% FLOPs and 7% KV cache.
mHC + Muon are not decoration; they are rewriting training stability: DeepSeek-V4 uses mHC to strengthen stable propagation in the residual stream, while systematically integrating the Muon optimizer into large-scale training. It also adds mixed Newton-Schulz, fine-grained checkpointing, parallelism, and kernel optimization. This shows that it is not only trying to make the window longer. It wants to turn “how to train and infer long-context models stably” into an engineering foundation.
Post-training clearly leans toward agents and long-workflow tasks: the paper includes not only standard benchmarks, but also on-policy distillation, a million-token RL framework, agent sandbox, and real-task evaluation such as search, white-collar tasks, and code agents. The signal is very clear: DeepSeek-V4 is addressing reasoning, agents, and complex workflows under long-context conditions, rather than single-turn Q&A.
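To make the efficiency figures above concrete, the short sketch below converts the reported 1M-context ratios (relative to DeepSeek-V3.2) into reduction factors. Only the four percentages come from the text. The helper name `reduction_factor` is illustrative.

```python
# Ratios reported for 1M-token context, relative to DeepSeek-V3.2 (= 1.0).
REPORTED = {
    "V4-Pro":   {"flops": 0.27, "kv_cache": 0.10},
    "V4-Flash": {"flops": 0.10, "kv_cache": 0.07},
}


def reduction_factor(ratio):
    """A ratio of 0.10 means the baseline cost is 1 / 0.10 = 10x larger."""
    return 1.0 / ratio


for model, ratios in REPORTED.items():
    flops_x = reduction_factor(ratios["flops"])
    kv_x = reduction_factor(ratios["kv_cache"])
    print(f"{model}: {flops_x:.1f}x fewer single-token FLOPs, {kv_x:.1f}x smaller KV cache")
```

Read this way, the KV cache saving (roughly 10x to 14x) is larger than the FLOPs saving for both models. That matters for serving, where memory per request often limits how many long-context sessions fit on one machine.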

Benchmark and inference efficiency comparison chart
The value of DeepSeek-V4 lies in pushing “million-token context” into a real system capability that can participate in reasoning and agent workflows. The technical route in the paper is complete: long-context efficiency relies on hybrid attention, deep stability relies on mHC, and training convergence plus engineering controllability are jointly supported by Muon, quantization, kernels, parallelism, and KV cache management. This combination shows that DeepSeek-V4 is addressing a full runtime stack for long-context intelligence, beyond a single metric improvement. Another signal worth noting is that V4-Pro also clearly strengthens adaptation to domestic compute ecosystems such as Huawei Ascend. V4-Pro has been optimized for Huawei chips, and Huawei has also announced that its Ascend supernode will support DeepSeek V4. This extends the meaning of DeepSeek-V4 from the model layer to the coordinated deployment of domestic models and domestic compute.
From a cost-performance perspective, DeepSeek’s strategy this time is also aggressive. V4-Pro and V4-Flash are already available through the official API. The day after launch, DeepSeek offered developers a limited-time 75% discount on the newly released V4-Pro and cut the input cache-hit price across the entire API to one-tenth of its previous level. This means the V4 generation is doing three things at once: strengthening long-context capability, reducing inference cost, and lowering the price barrier again. For development scenarios that already rely on long workflows, shared prefixes, and large context, V4 will be very competitive on cost performance.
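A rough sketch of why the cache-hit price matters for shared-prefix workloads follows. All prices are hypothetical placeholders, not DeepSeek’s real price list, and the assumption that the 75% discount stacks on top of the cache-hit cut is ours. Only the “one-tenth” and “75%” figures come from the text.

```python
def input_cost(tokens, cache_hit_ratio, miss_price, hit_price):
    """Cost of `tokens` input tokens; prices are per one million tokens."""
    hit_tokens = tokens * cache_hit_ratio
    miss_tokens = tokens - hit_tokens
    return (hit_tokens * hit_price + miss_tokens * miss_price) / 1_000_000


# Hypothetical prices (currency units per 1M input tokens), NOT real list prices.
MISS_PRICE = 4.0
OLD_HIT_PRICE = 1.0
NEW_HIT_PRICE = OLD_HIT_PRICE / 10   # cache-hit price cut to one-tenth
DISCOUNT = 0.75                      # limited-time 75% off, so you pay 25%

# An agent that keeps resending a long shared prefix: 90% of input tokens hit the cache.
tokens, hit_ratio = 1_000_000, 0.9

before = input_cost(tokens, hit_ratio, MISS_PRICE, OLD_HIT_PRICE)
after = input_cost(tokens, hit_ratio, MISS_PRICE, NEW_HIT_PRICE)
after_discount = after * (1 - DISCOUNT)   # assumes the discount stacks with the cut

print(f"before: {before:.2f}, after hit-price cut: {after:.2f}, with discount: {after_discount:.4f}")
```

The higher the cache-hit ratio, the more of the bill is governed by the hit price, which is why the change favors workloads with long workflows and shared prefixes.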
Link: see Reference 1.2
Recommendation Index: 🌟🌟🌟🌟🌟
This paper proposes a strong judgment: high-level image generation models themselves have already learned general visual representations. With only a small amount of instruction-tuning, these hidden capabilities can be released into standard visual task outputs such as segmentation, depth, and normals.
This work is well worth watching because it makes concrete a view that many people have vaguely held but lacked strong evidence for: a model that can generate images may already understand images. Based on Nano Banana Pro, the authors build a lightweight instruction-tuned version called Vision Banana. The core method is uniform: all visual task outputs are re-parameterized as RGB images, so the model keeps “generating images,” except that it now generates visual answers such as decodable segmentation maps, depth maps, and normal maps. The strength of this idea is that it neither adds a dedicated head for each task nor turns the model into a specialist that can only do a single perception task. In terms of results, Vision Banana reaches or approaches SOTA on multiple 2D and 3D understanding tasks while largely preserving its original image generation and editing capabilities. For computer vision, the importance of this paper lies not only in high metrics. It advances a paradigm-level claim: image generation pretraining is becoming a foundation for general visual learning.
The vision field has long been dominated by discriminative and task-specific models: past mainstream routes for visual representation learning mainly came from supervised learning, contrastive learning, self-distillation, and autoencoding. Although generative visual pretraining has long been discussed, its overall influence has been weaker than generative pretraining in language.
Image generation models have already shown signs of “understanding what they see”: the paper mentions that some recent image and video generation models can zero-shot produce visualizations such as segmentation maps, depth maps, and normal maps, although past methods had difficulty producing stable formats for quantitative evaluation.
This work borrows the classic path of LLMs: just as language models first undergo generative pretraining and then instruction-tuning, the authors treat the image generation model as a visual base model and use a small amount of visual task data for lightweight alignment, unlocking understanding capabilities.
Visual tasks are unified as image generation: Vision Banana does not directly output masks, depth tensors, or normal maps. Instead, it generates decodable RGB images according to prompts, such as using color maps for semantic classes, reversible color maps to encode metric depth, and RGB values to directly express surface normals. In this way, the same model can cover multiple types of tasks simply by changing the prompt.
Lightweight instruction-tuning is enough: the authors do not retrain at large scale. They mix a small amount of visual task data into Nano Banana Pro’s original training mixture at a very low proportion for alignment. This allows the model to learn to “submit answers” in the specified format while preserving the original generation capability as much as possible. Table 1 of the paper shows that it has a win rate of 53.5% over the base model on GenAI-Bench and 47.8% on ImgEdit, indicating that generation capability is basically maintained.
Understanding capability is already close to or stronger than specialists: according to Table 1 and later experiments, Vision Banana surpasses or approaches strong specialist models such as SAM 3, Depth Anything 3, and Lotus-2 on tasks including RefCOCOg, ReasonSeg, Cityscapes, metric depth, and surface normal. Its average δ1 on metric depth reaches 0.882, and neither training nor inference depends on camera intrinsics.
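The “answers as RGB images” idea can be illustrated per pixel. The sketch below is not the paper’s encoding. The normal mapping is the common `(n + 1) / 2` convention, and the depth ramp (black to red to yellow to white, with an assumed 80 m range) is a hypothetical color map chosen only because it is easy to invert.

```python
import math


def encode_normal(n):
    """Map a unit normal with components in [-1, 1] to 8-bit RGB (common convention)."""
    return tuple(round((c + 1.0) / 2.0 * 255) for c in n)


def decode_normal(rgb):
    """Invert encode_normal, then renormalize to undo 8-bit rounding."""
    v = [c / 255.0 * 2.0 - 1.0 for c in rgb]
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def encode_depth(depth_m, max_depth_m=80.0):
    """Map metric depth to a black -> red -> yellow -> white ramp (hypothetical color map)."""
    t = min(max(depth_m / max_depth_m, 0.0), 1.0)
    level = round(t * 765)                      # 3 channels x 255 steps
    return (min(level, 255), min(max(level - 255, 0), 255), max(level - 510, 0))


def decode_depth(rgb, max_depth_m=80.0):
    """The ramp is reversible because the channel sum grows monotonically with depth."""
    return sum(rgb) / 765 * max_depth_m


normal = (0.0, 0.6, 0.8)
print(encode_normal(normal), decode_normal(encode_normal(normal)))
print(encode_depth(12.5), decode_depth(encode_depth(12.5)))
```

Both round trips lose only quantization error, which is the property that lets a generated image be scored with standard depth and normal metrics.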

Vision Banana method and capability overview
The paper offers a judgment that may influence future visual routes: image generators are moving from content generation tools into generalist vision learners. The smartest part of the authors’ approach is that they did not create a pile of complex task heads, nor did they distort the model in pursuit of single-task metrics. Instead, they insisted that the model continue using the single interface of “generating images” to complete understanding tasks. This unified interface is critical because it lets generation and understanding truly share the same output paradigm for the first time. The signals given by Table 1 and Figures 1, 6, and 8 are also strong: a generalist model lightly tuned from an image generation model can already push close to specialists, and even surpass them in some cases, on fundamental tasks such as segmentation, depth, and normals. At the same time, it still preserves its original generation and editing capabilities. This result looks very much like the early stage in which language models, after generative pretraining, gradually swallowed traditional NLP tasks. In the short term, this work may not immediately replace all specialist vision models, because inference cost is still significantly higher, and the paper itself acknowledges deployment cost as an important obstacle in the discussion section. Directionally, however, it has pushed the question of whether “generative visual pretraining can become the CV foundation” to a point where it is hard to ignore.
Link: see Reference 1.3
Recommendation Index: 🌟🌟🌟🌟🌟
The key point proposed by ByteDance’s Omni is that it does more than place text, images, video, and 3D into the same model. Before output, it lets the model call and combine intermediate cross-modal contexts, turning abilities such as “seeing,” “thinking,” “generating,” and “estimating geometry” into reusable reasoning primitives.
This paper gives a more concrete explanation for why “unified multimodal models” are valuable. Many works understand a unified model as a multitask container with a shared backbone. Omni argues that the truly important part is Context Unrolling: before answering questions or generating content, the model can first pull task-relevant context from a heterogeneous modality pool, such as textual chain-of-thought, visual tokens, camera poses, depth, and novel view synthesis results, write these contexts back into a shared workspace, and then make the final prediction. This perspective is powerful because it upgrades the benefit of unified models from “feature stacking” to “context construction capability.” The paper’s results also illustrate the point well: whether in image generation, spatial understanding, or depth estimation, performance continues to rise as the model receives more and more appropriate intermediate context. In other words, what Omni really wants to prove is that the potential of unified models lies not only in any-to-any input and output, but also in the ability to “first unroll context across modalities, then make decisions.”
Unified multimodal models used to be understood more as capability collections: a common approach is to place understanding, generation, editing, video, 3D, and other tasks into one architecture, while these abilities are often still used separately and lack a clear cross-modal reasoning mechanism.
Different modalities are essentially projections of the same world knowledge: the paper treats text, images, video, 3D geometry, and hidden representations as different projections of a shared multimodal manifold. Each modality brings partial but complementary information.
The question is whether a unified model can actively call intermediate abilities after unification: Omni focuses on whether the model can actively use primitives such as “description,” “generating visual tokens,” “predicting camera poses,” “estimating depth,” and “doing novel view synthesis” during inference to build stronger context for the final answer.
Context Unrolling is the core contribution of this paper: the paper formulates reasoning as an iterative context construction process. It first continuously expands context through atomic capabilities, then performs context-conditioned decoding. Figure 1 and Section 2 both emphasize this shared workspace perspective.
More and finer intermediate context directly improves results: Figure 2 on page 3 shows that on GenEval2, from baseline to short text, long text, visual tokens, and then long + visual, the score rises from 29.2 all the way to 53.4. Spatial understanding and monocular depth estimation also continue to improve as textual / visual context increases.
The unified model can already chain generation, understanding, and 3D primitives together: Omni does not only perform ordinary VLM tasks. It also uses 3D abilities such as camera pose estimation, novel view synthesis, and depth estimation as callable context for spatial understanding and geometric reasoning. Tables 9 and 10 also show that it can already compete with specialist 3D models in camera pose and depth estimation.
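The control flow of Context Unrolling described above can be sketched as a loop that grows a shared workspace and then decodes from it. Everything here is a stand-in. The primitives return fixed strings, and the `NEEDED` routing table is a hypothetical substitute for the model’s own decision about which intermediate ability to call next.

```python
def describe(workspace):
    return "a red mug on a wooden table"           # stub for textual reasoning


def estimate_depth(workspace):
    return "mug at about 0.6 m from the camera"    # stub for depth estimation


def predict_camera_pose(workspace):
    return "camera tilted 30 degrees downward"     # stub for camera pose


PRIMITIVES = {"describe": describe, "depth": estimate_depth, "pose": predict_camera_pose}

# Hypothetical routing table; in the paper the model itself chooses what to unroll.
NEEDED = {"caption": ["describe"], "spatial_qa": ["describe", "depth", "pose"]}


def choose_next(task, workspace):
    """Return the next primitive the task needs that is not yet in the workspace."""
    done = {name for name, _ in workspace}
    for name in NEEDED[task]:
        if name not in done:
            return name
    return None


def unroll_context(task, observation, max_steps=8):
    """Iteratively expand the shared workspace with intermediate context."""
    workspace = [("input", observation)]
    for _ in range(max_steps):
        name = choose_next(task, workspace)
        if name is None:
            break
        workspace.append((name, PRIMITIVES[name](workspace)))
    return workspace


def decode(task, workspace):
    """Stand-in for context-conditioned decoding over everything unrolled so far."""
    return f"answer to {task} using " + ", ".join(name for name, _ in workspace)


ws = unroll_context("spatial_qa", "<image>")
print(decode("spatial_qa", ws))
```

The design point is that each primitive writes back into the same workspace, so later steps and the final decoder can condition on earlier intermediate results.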

Context Unrolling concept diagram
The most interesting aspect of Omni is that it pushes “unified multimodal model” from an architecture label into a reasoning paradigm. Its focus is beyond any-to-any capability. It teaches the model to construct context before output: when needed, first think through a piece of text; when needed, first generate visual tokens; when needed, first estimate camera pose, fill in novel views, and then feed those results back to itself. In this way, the meaning of a unified model is no longer simply that it can do more tasks. It begins to possess a capability closer to “workspace reasoning.” The signals in the paper are consistent: as long as intermediate context becomes richer and better structured, the model’s performance in understanding, generation, spatial reasoning, and depth estimation continues to rise. This judgment matters because it suggests that future multimodal competition may depend not only on who has more modalities or larger parameters, but on who can better reorganize multimodal abilities into iteratively unrolled context. Of course, the paper still has clear boundaries. On understanding benchmarks, it does not comprehensively surpass strong models at the same level, and its video resolution and duration still lag behind top specialist video models. As a directional paper, however, it has already explained the next-step value of unified multimodal models very clearly.

Link: see Reference 2.1
The positioning of Kimi K2.6 is very clear: its main battlefield is not ordinary chat, but long-horizon coding, complex tool calling, and multi-agent collaboration. Moonshot officially describes it as a native multimodal model that supports text, image, and video input, supports thinking / non-thinking modes, and clearly emphasizes stronger and more stable long-range code-writing ability, significantly improved instruction following and self-correction, and stronger autonomous agent execution capability. The official page also shows entry points such as Agent Swarm, Kimi Code, and document-to-skill, indicating that it is targeting a complete workflow execution scenario rather than single-turn dialogue.

Kimi K2.6 benchmarks
One outstanding feature of Kimi K2.6 is that it pushes “open-source large models” one step closer to a production-grade agent foundation. The most recognizable selling points in public information are that long-horizon coding can run continuously for more than 12 hours, that it supports 4,000+ tool calls, and that it covers multi-language software engineering tasks in languages such as Rust and Go. It also supports 256K context and has multi-step reasoning and complex tool-calling capabilities. The product judgment is straightforward: Moonshot wants to prove not that the model can answer questions, but that it can remain stable in long, tool-heavy, continuously executing tasks. Combined with its open platform documentation and official capability demonstrations, K2.6 looks more like an open foundation optimized around coding and agent workflows than a chat model built only to climb leaderboards.
Link: see Reference 2.2
Tencent’s open-source CubeSandbox targets a hard foundational problem in agent deployment: how exactly can a code execution environment be safe, fast, and highly concurrent at the same time? This project is not an ordinary code runner, nor a lightweight sandbox only for local demos. The official repository states its positioning very directly: it is an Instant, Concurrent, Secure & Lightweight Sandbox for AI Agents, and it is compatible with the E2B Code Interpreter SDK, allowing existing agent code to be integrated relatively smoothly. More importantly, it does not only provide an SDK interface. It builds out the entire service stack from template creation and runtime management to sandbox startup. For agent infrastructure, this kind of project has high value because what truly limits agents from entering production environments is often not the model itself, but the isolation, security, and stability of the execution layer.

CubeSandbox repo screenshot
CubeSandbox has three highlights. First, it solves secure execution. Public information emphasizes that it uses MicroVM / KVM hardware virtualization for isolation, avoiding the shared-kernel route. This is critical for executing untrusted code, tool calls, and multi-agent concurrency. Second, it addresses startup speed and concurrency density. One of its headline selling points is sub-60ms cold start and lightweight resource usage, which means its target is not a traditional virtual machine but a high-frequency execution layer better suited to serve as an agent runtime. Third, it takes a practical approach to developer integration: the repository quick start provides a complete process from environment setup and service startup to template creation and Python SDK calls. It also explicitly supports example scenarios such as browser automation, OpenClaw integration, and RL training workflows. Overall, CubeSandbox is best seen as a piece of agent execution infrastructure. Its selling point is not “can it run a piece of code,” but “can it support a large volume of agent execution tasks safely, stably, and in a standardized way.”
Link: see Reference 2.3
The README of the OpenMythos repository states clearly that it is a community-driven theoretical reconstruction of the Claude Mythos architecture based on public research and speculation, and explicitly declares that it has no relation to Anthropic. In implementation, the core structure it proposes is a Recurrent-Depth Transformer: a Prelude at the front, a Recurrent Block in the middle that can loop multiple times, and a Coda at the end. Attention can switch between MLA and GQA, while the feed-forward layer uses sparse MoE with routed / shared experts. This design is attractive because it places “variable-depth reasoning,” “recurrent computation,” “MoE routing,” and “large context,” all currently hot topics, into a unified framework that can be tried directly. For people who care about reasoning architectures, agent foundations, and compute-adaptive models, this kind of project is well worth reading.
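The Prelude / Recurrent Block / Coda layout can be shown with a toy forward pass. This is not OpenMythos code. The arithmetic is deliberately trivial, and re-injecting the embedded input at every loop is an assumption borrowed from the general recurrent-depth literature, not something confirmed by the README. The sketch only shows that one shared block is reused, so depth becomes a call-time choice.

```python
def prelude(tokens):
    """Embed the input once (toy embedding: scale token ids)."""
    return [t / 10.0 for t in tokens]


def recurrent_block(state, embedded):
    """One shared block, applied repeatedly; the input is re-injected at every loop."""
    return [0.5 * s + 0.5 * e for s, e in zip(state, embedded)]


def coda(state):
    """Read out a prediction from the final state (toy readout: the mean)."""
    return sum(state) / len(state)


def forward(tokens, n_loops):
    embedded = prelude(tokens)
    state = [0.0] * len(embedded)
    for _ in range(n_loops):          # depth is chosen per call, with no new parameters
        state = recurrent_block(state, embedded)
    return coda(state)


for n_loops in (1, 4, 16):
    print(n_loops, forward([1, 2, 3], n_loops))
```

In this toy, more loops move the output closer to a fixed point, which stands in for the idea of spending more computation on demand without growing the model.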

Claude Mythos speculative architecture diagram
The value of OpenMythos mainly lies in taking an architecture idea that originally stayed at the discussion level and pushing it into an actual open-source playground that can be experimented with. The repository already provides complete entry points from installation and inference calls to training scripts, and offers multiple preset configurations from 1B to 1T. The README lists experts, loop iterations, context, and max output at different scales. The training section also directly gives a script for a 3B model on FineWeb-Edu, including optimizer, token budget, and multi-GPU method. This level of engineering completion shows that the author wants to build more than a “concept replica.” It is an experimental framework around recurrent-depth Transformers. Its boundaries are also clear: first, the project itself emphasizes that this is a theoretical reconstruction and does not represent the real Claude Mythos implementation; second, it leans more toward research exploration and architecture prototyping, with a long distance from production-grade validation. But if you care about whether future models will move from fixed-depth Transformers toward stronger recursive, recurrent, and on-demand computation-unfolding systems, OpenMythos has strong reference value.

From repeated delays since the beginning of the year to its official landing, DeepSeek-V4 did not deliver a mediocre version. Million-token context, significantly reduced long-context inference cost, and adaptation signals for Huawei Ascend make this update look more like a systematic push.
DeepSeek-V4 had been delayed again and again since the start of the year, but the result shows that the wait was worth it. The official release includes both DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both support 1M context and focus on long-context efficiency, agent usability, and API availability. The signal sent by this launch is also very clear: domestic open-source model competition is moving from single-point leaderboards and parameter scale toward a comprehensive competition of long-context usability, deployment efficiency, chip collaboration, and open ecosystems.
The DeepSeek-V4 series consists of V4-Pro and V4-Flash, targeting flagship capability and lower-cost deployment respectively. Both support 1M context and have already opened API access. DeepSeek also announced that the existing deepseek-chat and deepseek-reasoner will be retired after July 24, 2026, indicating that the V4 series is taking over the main product line.
The key to this upgrade is not only context length, but efficiency optimization under long context. The technical report shows that in a 1M-token scenario, V4-Pro’s single-token inference FLOPs drop to 27% of V3.2, and KV cache drops to 10%; V4-Flash further reduces them to 10% and 7%. This means DeepSeek is pushing “million-token context” from a showcase capability closer to a usable capability.
Chip adaptation is also an important change this time. Reuters reported that V4 is DeepSeek’s first version adapted to Huawei Ascend advanced AI chips. Huawei also stated that its Ascend 950 supernode already supports V4, and that its chips participated in part of the training of V4-Flash. This shows that DeepSeek is strengthening adaptation and migration capability for the domestic chip ecosystem. Perhaps in the future, Nvidia chips will no longer be the scarce bottleneck that can hold domestic models by the throat.
From an industry perspective, the value of DeepSeek-V4 does not lie in whether it wins a single benchmark. It further proves that open-source model competition is increasingly moving away from the catch-up narrative of “who is closer to closed-source flagships” and forming its own independent route. On one hand, V4 continues to maintain open source and open interface compatibility. On the other hand, it pushes forward more system-engineering-oriented capabilities such as 1M context, low KV cache, agent integration, and Ascend adaptation. This shows that the domestic open-source camp is truly entering a comprehensive competition of “model capability + engineering efficiency + ecosystem collaboration.”
Looking back at the path from the repeated delays since the beginning of the year to today’s release, this launch carries another meaning: the market was not simply waiting for a stronger model. It was waiting to see whether DeepSeek could use V4 to prove that it is more than a short-term phenomenon following its early-year breakout. Judging from the result, it has at least delivered an answer worth continued discussion. That may not mean the frontier landscape has been completely rewritten, but it does show that competition among domestic open-source models has entered a new stage that is harder to summarize with a single leaderboard.

Performance comparison of DeepSeek-V4-Pro-Max and other models
GPT-5.5 strengthens the ability to execute complex tasks. Workspace Agents push GPTs toward organization-level agents. Images 2.0 fills in the multimodal content production layer. OpenAI’s round of releases points to an entire work system.
OpenAI’s direction in this round of moves is very clear: systematically push model capabilities into real workflows. GPT-5.5 strengthens sustained execution in coding, research, data analysis, and document-heavy tasks. Workspace Agents address how such abilities enter team and organizational processes. ChatGPT Images 2.0 completes the multimodal content production layer. Some believe GPT-5.5 and Images 2.0 alone are enough to put OpenAI back at the peak. What the releases as a whole make clear, though, is that OpenAI is advancing a complete execution-oriented work system rather than a single model upgrade.
GPT-5.5 is defined by OpenAI as a frontier model for real work, focusing on tasks such as coding, research, data analysis, documents, and software operation. Official evaluations include GDPval 84.9%, OSWorld-Verified 78.7%, Tau2-bench Telecom 98.0%, and Terminal-Bench 2.0 82.7%, pointing toward stronger sustained execution and tool-based work capability.
At the same time, OpenAI released Workspace Agents and directly defined them as the evolution of GPTs. They are powered by Codex, can run in the cloud, support ChatGPT and Slack, and possess memory, files, tools, and workspace capabilities. They will switch to credit-based pricing after May 6, 2026. This change means OpenAI is pushing the more personal GPTs system toward a persistent agent system for teams and organizations.
On the multimodal side, ChatGPT Images 2.0 has been officially released. The official focus is on complex layouts, multilingual text rendering, UI scenarios, and high-density information typesetting. Its role is clear: extending OpenAI’s workflow capability further into the image and content production layer.
OpenAI’s round of releases points in one common direction: stronger execution-oriented models, more organized team agents, and more mature multimodal content production.
In recent months, Anthropic has continued to hold a reputational advantage in coding and agents. With the release of GPT-5.5, OpenAI has at least regained a strong voice in execution-oriented knowledge work and complex workflows.
If we only focus on GPT-5.5’s benchmark performance, that may be somewhat superficial. What we think is more worth noting is that OpenAI is rapidly connecting the stronger execution-oriented ability of GPT-5.5 to concrete product layers such as Codex, Workspace Agents, Images 2.0, and Chronicle.
This shows that OpenAI’s core goal is no longer simply to prove that “the model has become stronger again.” It wants to turn that added strength directly into productivity in development, documents, research, content generation, and organizational collaboration. From this perspective, OpenAI is thinking very clearly even while at a disadvantage in public opinion. It knows exactly what has been taken away from it, and what it should fight to take back.
Seen against the competitive landscape, this also explains why many practitioners have read the release of GPT-5.5 as a signal that OpenAI is “back at the table.” In recent months, Anthropic has continued to hold the upper hand in public opinion, especially in coding and agents. GPT-5.5 did not end this competition in one stroke, but it clearly helped OpenAI regain the core narrative around execution-oriented knowledge work and workflow capability. In other words, commercial competition among the major companies no longer stops at the model layer; the frontier position has to be won through the trinity of model, product, and workflow.

Text-to-image model ranking screenshot from Artificial Analysis
OpenAI is using GPT-5.5 to regain part of the initiative in public opinion, while Google is betting heavily on Anthropic during the same period. Top companies are not waiting for a winner to emerge. They are using capital, cloud, and compute resources to lock in their camp positions ahead of time.
Google’s committed financing for Anthropic directly lays bare the underlying logic of industry competition. Frontier model competition is moving from model capability comparison into compound competition among models, capital, cloud, compute, and ecosystems. Google / Alphabet plans to invest up to $40 billion in Anthropic, including an initial $10 billion in cash and the remaining $30 billion tied to performance goals. A few days earlier, Anthropic had just expanded its partnership with Amazon, locking in up to 5GW of new compute capacity. For Anthropic, this is more than financing; for the industry, it looks more like an early reshuffling of infrastructure alliances.
Reuters reported that Google / Alphabet plans to invest up to $40 billion in Anthropic, corresponding to a valuation of about $350 billion. Anthropic has confirmed an initial $10 billion in cash, with the remaining $30 billion tied to performance goals. Reuters also mentioned that Anthropic’s annualized revenue run rate has exceeded $30 billion. This shows that Anthropic is no longer simply a “promising model company,” but a platform with high-speed commercialization capability.
Another piece of information worth reading alongside this comes from the collaboration between Anthropic and Amazon. On April 20, Anthropic officially announced that the new agreement between the two sides will lock in up to 5GW of new compute capacity and provide close to 1GW of combined Trainium2 and Trainium3 capacity by the end of 2026. In other words, Anthropic is receiving more than a single round of financing: its capital and its infrastructure resources are growing at the same time.
Google commits to investing up to $40 billion, Amazon provides up to 5GW of new compute, and Anthropic has an annualized revenue run rate above $30 billion. Taken together, competition among frontier model companies has long gone beyond “who is stronger.” It has become about who can bind capital and compute resources faster.
This also shows that top-tier AI competition is escalating from a model war into a camp-based infrastructure war.
With Google and Amazon flanking Anthropic like guardians on its left and right, the real competitive logic of today’s frontier AI companies is laid rather bare: as models grow stronger and stronger, the ceiling on competition is set by who can obtain capital faster, who can lock in long-term compute, and who can establish more stable entry points in cloud platforms and enterprise distribution. Anthropic’s deep ties to both Google and Amazon are essentially the embodiment of this logic.
If we view this together with the other two trend stories, GPT-5.5 and DeepSeek-V4, the same picture emerges: OpenAI is strengthening execution-oriented workflows, DeepSeek is advancing collaboration between open-source models and domestic chips, and Anthropic is gaining greater support at the capital and infrastructure layer. This suggests that model competition is escalating across the board into platform, ecosystem, and infrastructure competition.

Google and Anthropic cooperation diagram
1.1 DeepSeek-V4 technical report
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
1.2 Vision Banana paper
https://arxiv.org/abs/2604.20329
1.3 Omni paper
https://arxiv.org/abs/2604.21921
2.1 Kimi K2.6
https://huggingface.co/moonshotai/Kimi-K2.6
2.2 CubeSandbox
https://github.com/TencentCloud/CubeSandbox
2.3 OpenMythos
https://github.com/kyegomez/OpenMythos
3.1 DeepSeek API news
https://api-docs.deepseek.com/news/news260424
3.2.1 OpenAI GPT-5.5
https://openai.com/index/introducing-gpt-5-5
3.2.2 ChatGPT Images 2.0
https://openai.com/zh-Hans-CN/index/introducing-chatgpt-images-2-0
3.3 Reuters report on Google / Anthropic
https://www.reuters.com/business/google-plans-invest-up-40-billion-anthropic-bloomberg-news-reports-2026-04-24/





