



Cover Story: 100% - 2.7% / 3 years
Stanford’s latest report shows that the capability gap between Chinese and U.S. models has narrowed to 2.7%. That is not the whole story of the report, of course; for more, see News 03. Here at the top, we only want to look from another angle at what this means for us.
The core meaning is this: there is no longer any need to feel anxious about not being able to use Claude or GPT. In just 3 years, the gap has narrowed to 2.7%. Even OpenAI and Anthropic have both started leaping toward productization and workflow integration, and on that front domestic Chinese players may well catch up even faster.
At the same time, this also means you should insist on being a “sea king” (slang for someone who keeps many romantic partners at once).
Set aside the recent incident in which Claude suddenly banned more than 60 accounts belonging to a Spanish company and directly cut off its operations, even though the company was not in a restricted region. Whether it is Claude or any other provider, depending on a single point of support can suddenly leave you without water or food.
Models are already this close. Please do not fall in love with only one flower.
What is more, each model differs in terms of capability range and strengths, so please continue playing the open-world game.

Hey, my friend 😊, welcome to the newsletter “There’s Always Something Fresh in AI Under the Sun,” produced by the JoinAI | Zhuoyin Intelligent Algorithm Team.
With the unique technical perspective and restraint of “AI builders,” we carefully select the weekly Top 3 for you: papers, projects, and developments. We do not care here about the illusions of trending traffic; we only track technologies and trends that are truly worth paying attention to. We will not only praise the good of AI here; we will also expose the problems of AI.
π0.7 pushes things one step forward. It expands the input of robot models from a simple task description into a more complete execution context. Once subtask instructions, subgoal images, episode metadata, and control mode all enter the context together, the model sees not just the target but also action style, execution quality, local future state, and failure experience. As a result, when the training data comes from multiple sources and is heterogeneous and uneven in quality, the model can more easily tell what each trajectory represents. The significance of the work is direct: the evolution of robot foundation models increasingly depends on execution context design and on how VLA and world models collaborate.
Seedance 2.0 shows a very strong tendency toward systematization. The paper unifies four input types—text, image, audio, and video—into one multimodal audiovisual architecture, while simultaneously covering generation, reference, editing, continuation, subject control, motion control, style transfer, and audio-visual synchronization. This design brings video models one layer closer to real production workflows. The industry’s requirements for video models have already clearly risen. What people care about now includes complex motion, multi-shot storytelling, character expressions, audio expressiveness, and overall controllability. The signal given by Seedance 2.0 is very clear: the next round of competition in video generation will increasingly look like a contest between creative engines.
Lyra 2.0 pushes its target even further. It is concerned not only with novel-view completion from a single image, but also with allowing users to continuously explore the same generated world along arbitrarily long trajectories. The paper identifies two long-standing challenges: when the camera moves far away and then looks back, the model tends to forget regions it has already generated; during long-horizon generation, it is also prone to color drift and geometric drift. Lyra 2.0 addresses these issues with a per-frame 3D cache, geometry-aware retrieval, and self-augmentation training, and further connects the generated results to 3DGS, meshes, and Isaac Sim. This overall pipeline is already very close to the world-building middleware needed for 3D content production and embodied AI.
The step that HY-World 2.0 takes forward is very close to practical application. It supports starting from text, a single image, multi-view images, and video, and finally generates 3D results such as 3DGS, mesh, and point cloud that can continue to be edited, imported, and reused. This output form better matches the needs of game engines, simulation platforms, and robot environments. What the project as a whole presents is a complete world-building workflow, including Panorama Generation, Trajectory Planning, World Expansion, and World Composition. After being open-sourced, it will also have higher reference value for researchers and developers.
LingBot-Map focuses on a very hard engineering problem: in long-sequence streaming 3D reconstruction, how can stability and efficiency both be maintained? It adopts the route of a feed-forward 3D foundation model, continuously receiving video streams or image streams and outputting reconstruction results, no longer relying on slow scene-by-scene optimization. The designs given in the repository, such as long sequence inference, paged KV cache attention, and browser-based visualization, also reflect a strong engineering mindset. The value of this project is directly related to robotics, spatial understanding, AR/VR, map building, and similar directions, because what these scenarios really need is 3D infrastructure that can run continuously.
MiniMax M2.7 is positioned very clearly, with the emphasis on long-running, multi-tool, multi-role collaborative agent tasks. The model card gives Model Self-Evolution a prominent place, emphasizing the model’s ability to keep optimizing memory, skills, scaffolding, and learning workflows. It also covers scenarios such as software engineering, SRE diagnosis, multi-round editing of documents/spreadsheets/PPTs, tool use, and multi-agent collaboration, a capability structure better suited to complex tasks in real production environments. As a 229B open-weight model, it already offers a relatively clear deployment path and support for multiple inference frameworks. Overall, M2.7 represents another path for flagship models: one that places more emphasis on long-session stability, depth of tool collaboration, and the capacity to carry complex tasks.
From Codex and Agents SDK to Cyber and Rosalind, OpenAI’s focus has become embedding AI more deeply into development, security, and scientific research workflows. OpenAI’s recent official releases respectively cover development entry points, agent infrastructure, cybersecurity, and life sciences: Codex has added computer use, in-app browser, automations, and memories; Agents SDK has added model-native harness and native sandbox; GPT-5.4-Cyber has opened higher-privilege access to certified defenders; and GPT-Rosalind is explicitly aimed at biology, drug discovery, and translational medicine.
Opus 4.7 continues to strengthen coding, agent, and vision capabilities, while Claude Design connects Claude to the output layer of prototypes, PPTs, and visual documents. Anthropic officially positions Opus 4.7 as a stronger general flagship for coding, agents, vision, and multi-step tasks. At the same time it has launched Claude Design, which generates prototypes, presentations, and one-pagers from prompts, documents, code repositories, and web content, and can export to Canva, PDF, PPTX, or standalone HTML.
Several things are happening at once: AI models are getting stronger, capabilities are converging, capital is pouring in, and real-world work is being rewritten. Stanford HAI points out in AI Index 2026 that frontier model capabilities are still improving rapidly, the performance gap between Chinese and U.S. models has narrowed to 2.7%, SWE-bench Verified has risen within a year from about 60% to close to 100%, and AI adoption, investment, and organizational usage are all rising as well.

How to do it:
When doing prompt testing, copy generation evaluation, or Q&A effect comparison, do not hand every sample to a large model as a judge right from the start. Build a two-layer evaluation setup first. In the first layer, use a lightweight evaluator for batch pre-screening, for example a reference-based evaluator built on a model like BERT that judges whether an answer is semantically close to the reference answer, and quickly filter out obviously poor results. In the second layer, hand the close, hard-to-distinguish, and high-value samples to a stronger LLM for final review. Your process then becomes “low-cost batch filtering + high-quality targeted re-checking.” If you regularly test title generation, summary rewriting, customer service replies, or knowledge Q&A, this setup applies directly.
Why it works well:
The insight from this work is that many evaluation tasks do not need an expensive large model for every single item. As long as the task has a reference answer, or at least a relatively stable target output, a lightweight evaluator can take on most of the repetitive labor first. There are two direct benefits. First, the cost drops sharply, which suits high-frequency iteration. Second, the results are usually more stable, because fluctuations in the LLM judge itself are less likely to skew the test results. What really needs a large model are the boundary samples: semantically close, stylistically different, and high in business impact. Once the evaluation process is split into two layers, both iteration speed and reproducibility improve.
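To make the two-layer idea concrete, here is a minimal sketch of the first layer only. It is an illustration under stated assumptions, not the method from the work being discussed: the token-overlap F1 scorer is a cheap stand-in for a BERT-based semantic scorer, and the names `Sample`, `triage`, and the thresholds `low` and `high` are hypothetical. Samples that land in `llm_review` are the ones you would then pass to your stronger LLM judge.

```python
from collections import Counter
from dataclasses import dataclass


def token_f1(candidate: str, reference: str) -> float:
    """Cheap reference-based score in [0, 1]: token-overlap F1.

    Stand-in for an embedding-based scorer; swap in your own.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


@dataclass
class Sample:
    sample_id: str
    candidate: str
    reference: str
    high_value: bool = False  # hypothetical flag for business-critical items


def triage(samples, low=0.3, high=0.8, scorer=token_f1):
    """Layer 1: route each sample to 'reject', 'accept', or 'llm_review'."""
    routed = {"reject": [], "accept": [], "llm_review": []}
    for s in samples:
        score = scorer(s.candidate, s.reference)
        if score < low:
            bucket = "reject"        # obviously poor: filtered out cheaply
        elif s.high_value or score < high:
            bucket = "llm_review"    # borderline or high-value: escalate
        else:
            bucket = "accept"        # clearly close to the reference
        routed[bucket].append((s.sample_id, round(score, 3)))
    return routed
```

The thresholds 0.3 and 0.8 are placeholders; in practice you would calibrate them on a small hand-labeled set so that the `llm_review` bucket stays small enough to afford.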

Link: https://www.pi.website/download/pi07.pdf
Recommendation index: 🌟🌟🌟🌟🌟
One-sentence guide:
π0.7 brings subtask language, subgoal images, episode metadata, and control mode together into the context, so what the robot model learns is not only the task target, but also the execution style, motion style, and local future state.
⚽️ Why recommend it:
The core contribution of π0.7 is that it expands the input of robot foundation models from “task description” to “execution context.” When earlier VLAs faced large-scale, heterogeneous data, the common problem was not that they could not do the task at all, but that they blended different strategies, different quality levels, and even failure trajectories into averaged actions, which dulled the actions and weakened generalization. π0.7’s approach is more systematic: it brings subtask instructions, episode metadata, control mode, and subgoal images generated by a lightweight world model into the model context together, so the model sees a more complete execution context during training. What the model learns is therefore not just “what the task needs to accomplish,” but also “how it should be accomplished.” Judging from the results, the design brings clear capability gains: π0.7 shows strong out-of-the-box ability on a variety of long-horizon and dexterous manipulation tasks, can approach or even match task-specific specialists, and also demonstrates zero-shot cross-embodiment transfer and strong compositional generalization.
📚 Background:
Robot foundation models still lack compositional generalization: existing VLAs, although growing larger and larger, still have a clear gap compared with language models when it comes to recombining existing skills into new tasks and stably executing complex instructions in new environments.
As data becomes more abundant, ambiguity also increases at the same time: when training data simultaneously contains multiple robots, multiple strategies, automatically collected trajectories, failure samples, and non-robot data, if there are no finer-grained context labels, the model can easily learn a fuzzy “average action.”
Many key execution details are hard to express in one sentence: things like speed, quality, mistakes, grasping posture, and local future state can all significantly affect action generation, but usually cannot be fully covered by a simple language instruction.
📌 Key points:
Expand the context from task description to execution context: in addition to task instruction, the input of π0.7 also includes subtask instructions, subgoal images, episode metadata, and control mode, allowing the model to understand task goals, execution strategies, and local state change at the same time.
Absorb heterogeneous and suboptimal data more stably: the authors explicitly include low-quality demonstrations, failure trajectories, autonomous execution data, egocentric human video, and web multimodal data, and use metadata to distinguish how these samples should be used.
Stronger generalization and transfer emerge: the paper reports out-of-the-box dexterity, instruction generalization, cross-embodiment generalization, and compositional task generalization, and gives very strong zero-shot performance on tasks such as folding laundry, making coffee, packing boxes, and operating an air fryer.
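As a rough illustration of what “execution context” means as a data structure, the sketch below bundles the fields named above into one record and serializes the text part. It is not π0.7’s actual input format: the class name `ExecutionContext`, the field names, the metadata keys (`quality`, `is_failure`), and the `<image>` placeholder are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExecutionContext:
    """Hypothetical container for the conditioning signals of one trajectory."""
    task: str
    subtask: Optional[str] = None
    subgoal_image_path: Optional[str] = None  # image predicted by a world model
    control_mode: Optional[str] = None
    episode_metadata: dict = field(default_factory=dict)

    def to_prompt(self) -> str:
        """Serialize the non-image fields, skipping anything that is missing."""
        parts = [f"task: {self.task}"]
        if self.subtask:
            parts.append(f"subtask: {self.subtask}")
        if self.control_mode:
            parts.append(f"control_mode: {self.control_mode}")
        # Sort keys so the same metadata always yields the same string.
        for key in sorted(self.episode_metadata):
            parts.append(f"{key}: {self.episode_metadata[key]}")
        if self.subgoal_image_path:
            parts.append("subgoal_image: <image>")
        return "; ".join(parts)
```

With a record like this, a clean demonstration and a failed attempt at the same task produce different conditioning strings, which is the property that keeps the two from being averaged together during training.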

Overall method framework diagram of π0.7
π0.7 makes one key issue in robot foundation models very clear: as data becomes more abundant and more mixed in quality, the truly scarce capability is understanding the execution style corresponding to different samples. The value of this paper lies in making these differences, which were previously easy to average away during training, explicit again. It uses language, images, and metadata to jointly form an execution context, allowing the model to distinguish what different speeds, different qualities, different strategies, and even failure experiences each mean. More importantly, π0.7 also releases a very clear directional signal: robot foundation models are moving from a single VLA toward a system form in which VLA and WAM (World Action Model) are deeply coupled. Here, subgoal images are essentially turning the world model’s prediction of future state directly into the policy’s conditional input, so that high-level semantics, future visual state, and low-level action generation begin to form a closed loop. If it develops one step further, WAM may no longer just be responsible for generating subgoals, but may further absorb planning, control interfaces, and even part of the policy function, gradually elevating its central position in the system. You can see π0.7 as a signal: future competition among robot foundation models will increasingly depend on how VLA and WAM cooperate, and on whether execution context can be designed into the whole system.
Link: https://arxiv.org/abs/2604.14148
Recommendation index: 🌟🌟🌟🌟🌟
One-sentence guide:
Seedance 2.0 attempts to push video generation from “being able to produce a clip” to “being able to handle real-world complexity”: it uses a unified multimodal audiovisual architecture to take in text, image, audio, and video input at the same time, and brings generation, reference, editing, and continuation into one system.
⚽️ Why recommend it:
The core value of Seedance 2.0 is that it pushes video generation from a single-modality, short-clip, weakly controllable model toward a more creator-engine-like system form. The paper states this very clearly: this generation is a natively multimodal audiovisual joint generation model, supporting four types of input—text, image, audio, and video—while simultaneously covering multiple tasks such as subject control, motion manipulation, style transfer, special effects design, creative generation, and video extension. This is very significant, because today competition among video models is no longer just “generate a nice-looking video,” but rather a competition over who can better enter real workflows: can it reference multiple sources of material, can it edit, continue, control characters, control style, control rhythm, and generate audio and visuals together? Seedance 2.0 gives a very complete answer. Combined with its systematic strengthening in complex motion, facial expressions, multi-shot narration, stereo audio, and audio-visual synchronization, this work looks more like it is defining the appearance of the “next-generation video generation platform” than simply releasing a stronger single-point model.
📚 Background:
Video generation is moving from short-clip generation toward creative infrastructure: the paper directly proposes a paradigm shift at the beginning, where the goal has moved from generating short video clips to a highly controllable synthesis system that natively supports multiple control signals and can enter complex creative workflows.
The industry is no longer satisfied with single-modality generation: real creative scenarios often require simultaneous reference to text scripts, character images, existing video clips, and audio materials, and traditional single T2V or I2V capabilities can hardly cover real production workflows.
A truly usable video model must simultaneously solve visuals, motion, narration, and audio: the paper treats motion stability, prompt following, cinematographic language, audio expressiveness, and audio-visual alignment all as core indicators, showing that video models are moving from visual generators toward complete audiovisual systems.
📌 Key points:
Unified multimodal audiovisual architecture: Seedance 2.0 is a native multimodal audio-video joint generation model that supports four categories of input—text, image, audio, and video—and can complete combined tasks such as generation, reference, editing, and continuation.
Its modeling of “world complexity” is clearly enhanced: the paper emphasizes that it has significantly improved in complex human motion, multi-subject interaction, physical consistency, close-up detail, camera language, and cross-frame consistency, with the goal of reproducing real-world dynamics more stably.
Its performance in evaluation is very strong: in the paper’s own SeedVideoBench 2.0, it ranks first in all evaluated dimensions of T2V, I2V, and R2V; the paper also gives Arena.AI leaderboard results, showing that its 720p version ranks first on both the text-to-video and image-to-video leaderboards.

Seedance 2.0 T2V benchmark in the paper
Seedance 2.0 is very hot right now, and the reason is not hard to understand: from user experience, to platform dissemination, to the various evaluation results in the paper, it has already clearly entered the first tier of current video models. Whether it is the multi-dimensional lead on SeedVideoBench 2.0 or the double first-place ranking on Arena.AI for text-to-video and image-to-video, the signal being conveyed is very strong: Seedance 2.0 already has the position of an industry-leading model, and it is raising people’s expectations for the upper bound of video generation capability.
One prominent feature of this paper is that it talks a lot about capabilities, evaluation, and application scenarios, but gives relatively little detail about the core implementation. The paper very completely presents multimodal input, reference generation, editing, continuation, audio-visual integration, and complex motion control capabilities, and also lays out the evaluation framework and results very fully; but once it goes deeper into the truly critical implementation layer, such as how the unified architecture is built, how audio and video are jointly modeled, how the four modalities are aligned, how the training data is organized, and what the core engineering techniques are, the public information clearly tightens.
This style of writing itself also shows one thing: Seedance 2.0 has already demonstrated its capability advantage, but the most valuable technical details are still being kept very tightly. For readers, the main value of this paper lies in helping you see clearly why it is hot, in what dimensions it is strong, and what position it wants to occupy; if you want to reconstruct the complete technical solution only from this paper, the information is far from enough. This disclosure style itself also matches the status of a leading model.
Link: https://arxiv.org/abs/2604.13036
Recommendation index: 🌟🌟🌟🌟🌟
One-sentence guide:
Lyra 2.0 attempts to push single-image generation from “completing a few new views” to “continuously exploring an entire world”: users can move the camera along arbitrarily long trajectories, while the model generates long-horizon, 3D-consistent video and reconstructs the results into an interactive, simulatable, and exportable 3D scene.
⚽️ Why recommend it:
Lyra 2.0 is highly attractive because what it targets is not ordinary single-image-to-3D, nor ordinary camera-controlled video generation, but a more complete goal: starting from one image, generate a 3D world that can be continuously explored. The difficulty here is very clear, and the paper splits the problem into two core bottlenecks: first, spatial forgetting—when the camera moves far away and then looks back, the model forgets the regions it has seen before; second, temporal drifting—the longer the autoregressive generation continues, the more likely colors drift, structures go off, and geometry distorts. The method design of Lyra 2.0 is also very neat: it uses per-frame 3D cache and geometry-aware retrieval for anti-forgetting, and self-augmentation training to alleviate observation bias and long-horizon drift. More importantly, it does not stop at “generating long video,” but continues to lift the results to 3DGS and surface mesh, and directly connects them to an interactive GUI and Isaac Sim. In this way, this work is no longer just a visual generation paper, but a generative world-building pipeline for embodied AI and 3D content production.
📚 Background:
The key difficulty of generating a 3D world from a single image lies in long-horizon consistency: current camera-controlled video generation is already very strong at short-horizon novel-view synthesis, but once the trajectory becomes long, the viewpoint changes become large, and revisiting old areas is required, the model easily forgets earlier structures and begins to drift.
Existing methods usually solve only half the problem: some methods strengthen spatial memory through 3D structure, but easily feed geometry errors back into the generator; some methods rely on historical frames and long context to support consistency, but still remain unstable under large viewpoint change and long-horizon revisit.
Generative reconstruction is becoming a new route for 3D scene creation: the paper explicitly treats generative reconstruction as a new paradigm—first generate a camera-controlled video, then use feed-forward 3D reconstruction to turn it into an explicit scene that can be rendered and simulated.
📌 Key points:
Use per-frame 3D cache to solve spatial forgetting: Lyra 2.0 does not fuse all history into one global point cloud, but maintains independent geometry for each frame, and when a target viewpoint arrives, retrieves the most relevant historical frames and establishes dense 3D correspondence for information routing.
Video generation and 3D reconstruction are connected: the generated long video is further sent into a 3DGS pipeline and mesh extraction, and can finally be exported to an interactive browser and Isaac Sim, forming a complete downstream loop.
The experimental results and visualizations are very strong: the paper reports that Lyra 2.0 outperforms baselines such as GEN3C, CaM, SPMem, Yume1.5, and VMem in both long video generation and 3D scene generation, especially in long-horizon quality, style consistency, and reconstructability.
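The per-frame cache plus retrieval idea can be pictured with a toy example. The sketch below is only an illustration of the general pattern, not Lyra 2.0’s implementation: it assumes each cached frame keeps its own small list of world-space points, scores a frame by the fraction of those points that fall inside the target camera’s view cone, and returns the top-k frames. The names `CachedFrame`, `visible_fraction`, and `retrieve`, and the 45-degree half field of view, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CachedFrame:
    frame_id: int
    points: list  # (x, y, z) world-space points observed by this frame


def visible_fraction(points, cam_pos, cam_forward, half_fov_deg=45.0):
    """Fraction of a frame's points that lie inside the target view cone."""
    norm = math.sqrt(sum(c * c for c in cam_forward))
    if norm == 0.0:
        raise ValueError("cam_forward must be a non-zero vector")
    if not points:
        return 0.0
    forward = [c / norm for c in cam_forward]
    cos_limit = math.cos(math.radians(half_fov_deg))
    inside = 0
    for p in points:
        ray = [p[i] - cam_pos[i] for i in range(3)]
        dist = math.sqrt(sum(c * c for c in ray))
        if dist == 0.0:
            continue  # point sits at the camera center; treat as not visible
        cos_angle = sum(ray[i] * forward[i] for i in range(3)) / dist
        if cos_angle >= cos_limit:
            inside += 1
    return inside / len(points)


def retrieve(cache, cam_pos, cam_forward, k=2):
    """Return ids of the k cached frames most visible from the target camera."""
    scored = [
        (visible_fraction(f.points, cam_pos, cam_forward), f.frame_id)
        for f in cache
    ]
    scored = [s for s in scored if s[0] > 0.0]
    # Highest visibility first; ties broken by the older frame id.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [frame_id for _, frame_id in scored[:k]]
```

Because each frame keeps its own geometry, turning the camera back toward an old region simply brings the old frames back to the top of the ranking, which is the intuition behind using retrieval to counter spatial forgetting.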

Overview of the Lyra 2.0 method
Lyra 2.0 captures a very clear trend: video generation is moving from outputting content to outputting worlds that can be explored, reconstructed, and simulated. The value of this work lies in connecting video generation, 3D reconstruction, and embodied AI scene construction into one chain. Methodologically, it splits long-horizon 3D consistency into two core problems very precisely: spatial forgetting is handled by the per-frame 3D cache and geometry-aware retrieval, while temporal drifting is suppressed by self-augmentation training. The generated results can then feed into 3DGS, mesh, and Isaac Sim, which makes the whole route feel ready for practical use. Another plus is that Lyra 2.0 has already been open-sourced, making it easier for the research community and developers to reproduce it, modify it, and connect it into their own 3D/world model workflows. The paper is also candid about its limits: the current focus is still on static environments, and dynamic scenes have not yet really been solved; the work also leans more toward high-level system integration and training strategy innovation.

Link: https://arxiv.org/abs/2604.14148
🦄 Why recommend it:
HY-World 2.0 pushes “world models” one step from watchable video toward real 3D assets that can persist, be interacted with, and be imported into engines. The project and technical report make this very clear: it supports text, single-image, multi-view image, and video input, and its output is an editable, persistent 3D world in forms such as 3DGS, mesh, and point cloud, not pixel video that ends after playback. It also unifies world generation and world reconstruction in one multimodal framework and splits the pipeline into four stages: Panorama Generation, Trajectory Planning, World Expansion, and World Composition. Its goal is no longer only to “generate content,” but to “build worlds.” This route matters because it is closer to the underlying asset form that game engines, simulation platforms, and embodied AI actually need. In addition, the project has already been open-sourced, so its discussion value and reference value are both relatively high.

Overall architecture of HY-World 2.0
The strengths of HY-World 2.0 are mainly on three levels.
First, the output form is more practical: compared with world models that only generate video, it directly produces 3D assets that can be imported into Blender, Unity, Unreal Engine, and Isaac Sim, which makes it naturally closer to content production and simulation workflows.
Second, the input coverage is more complete: it can not only do world generation from text and a single image, but also do reconstruction from multi-view images and video, showing that the team wants to build a unified world-construction framework, not just a single-point demo.
Third, the system structure is more like a pipeline: from panorama generation and trajectory planning to world expansion and composition, what the project presents is not the capability of a single model, but a complete building process for 3D world generation.
Its current boundary is also clear: the full “generating a large world” experience that many people truly care about still depends heavily on hardware and engineering configuration, and in the short term it is better suited to researchers and developers. But overall, HY-World 2.0 has already pushed open-source world models from “looking at demo renders” to a stage where they can “make assets, enter engines, and connect to simulation.”
Link: https://github.com/Robbyant/lingbot-map
🦄 Why recommend it:
LingBot-Map is worth paying attention to because it targets a very practical problem that has long gone unsolved: how to make long-sequence streaming 3D reconstruction both stable and fast. The positioning on the project homepage is very clear: it is a feed-forward 3D foundation model for streaming 3D reconstruction. Its focus is not slow scene-by-scene optimization, but continuously taking in video or image streams and steadily outputting 3D results. The three key points listed in the repository are also direct: first, a Geometric Context Transformer, which unifies coordinate grounding, dense geometric cues, and long-range drift correction in one streaming framework; second, highly efficient streaming inference, which uses paged KV cache attention to support stable long-sequence inference; third, stronger reconstruction performance than both streaming and optimization-based methods. Projects like this are valuable because they are closer to the capabilities that robotics, spatial understanding, AR/VR, and map building actually need, not just “offline reconstruction of a nice-looking scene.”

LingBot-Map workflow diagram
The strengths of LingBot-Map are mainly reflected in three points.
First, the architectural direction is very clear: it does not stitch together traditional SfM/SLAM and reconstruction processes, but explicitly takes the route of a feed-forward streaming foundation model, using one unified model to handle long-sequence geometric modeling.
Second, its engineering usability is very strong: the README directly provides complete usage from image streams and video streams to keyframe interval and long-sequence windowed inference, and it also supports browser-based visualization, showing that it is not just a paper implementation, but clearly intends to let developers run it directly.
Third, its awareness of long sequences is very strong: the project specifically distinguishes lingbot-map, lingbot-map-long, and keyframe/windowed modes, and explicitly discusses cache degradation beyond 320 views and long-sequence inference strategies, showing that the team understands the constraints of real streaming scenarios very well.
Its current boundary is also very clear: this type of method is still sensitive to VRAM, CUDA, FlashInfer, and long-sequence cache management, and is still some distance away from “real-time large-scale deployment on any machine.” But overall, LingBot-Map has already pushed streaming 3D reconstruction one step from research demo toward “runnable infrastructure.”
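To show why keyframe intervals and windowing matter for long streams, here is a generic sketch of how the two knobs bound the number of views held in a cache. It does not use or reproduce LingBot-Map’s API: the function `stream_keyframes` and its parameters `keyframe_interval` and `max_cached_views` are hypothetical, and the cache here holds only frame indices rather than real attention state.

```python
from collections import deque


def stream_keyframes(num_frames, keyframe_interval, max_cached_views):
    """Yield (keyframe_index, cached_indices) for each selected keyframe.

    Only every `keyframe_interval`-th frame is kept, and the cache is a
    sliding window that never holds more than `max_cached_views` entries.
    """
    if keyframe_interval < 1 or max_cached_views < 1:
        raise ValueError("keyframe_interval and max_cached_views must be >= 1")
    cache = deque(maxlen=max_cached_views)  # oldest entry is evicted first
    for idx in range(0, num_frames, keyframe_interval):
        cache.append(idx)
        yield idx, list(cache)


if __name__ == "__main__":
    for keyframe, cached in stream_keyframes(10, 2, 3):
        print(keyframe, cached)
```

A larger interval stretches the same cache budget over a longer stretch of video at the cost of temporal density, while a smaller window caps memory at the cost of long-range context; the right trade-off depends on the scene and the hardware.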
Link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7
🦄 Why recommend it:
MiniMax M2.7 is worth paying attention to because it states its positioning very clearly: the focus is not single-turn chat, but long-running, multi-tool, multi-role collaborative agent tasks. The Hugging Face model card directly emphasizes that it is MiniMax’s first model to “deeply participate in its own evolution.” During development, it not only updated its own memory, but also built complex skills, participated in RL experiments, and even iterated on its own learning process; internal versions once autonomously optimized programming scaffolds for more than 100 rounds, yielding a 30% performance improvement. This direction is interesting because it pushes the model from “answering questions” toward “an agent foundation that can continuously improve workflows.” In terms of results, M2.7 is also clearly aimed at engineering and production scenarios: it reports strong scores on SWE-Pro, Terminal Bench 2, NL2Repo, GDPval-AA, Toolathon, and other benchmarks, while supporting Agent Teams, multi-skill calling, and local deployment with frameworks such as vLLM, SGLang, Transformers, and NVIDIA NIM. Overall, M2.7 looks more like an open-weight agent model for complex workflows and professional tasks than just another general chat model.

MiniMax M2.7 Benchmarks
What stands out most about MiniMax M2.7 is that it makes the idea of an “agent foundation” relatively complete. The model card puts Model Self-Evolution front and center, and what the team clearly wants to emphasize is not single-response ability but the model’s potential for continuous optimization in memory, skills, scaffolding, and learning workflows. The scenarios it covers (software engineering, SRE diagnosis, multi-round editing of documents/spreadsheets/PPTs, tool use, and multi-agent collaboration) also show that it targets longer-running, more production-oriented professional tasks, not just the code completion or Q&A common in ordinary coding models. One layer further down, as a 229B-parameter open-weight model, it already provides an explicit local deployment path and supports multiple inference frameworks such as SGLang, vLLM, Transformers, ModelScope, and NVIDIA NIM, showing that the team is considering not only leaderboard performance but also how the model can actually enter agent engineering systems. The threshold is real, though: the model is very large, and running it stably and at high quality requires considerable compute, inference framework support, and engineering tuning. But if you care about agent scenarios involving long sessions, heavy tool use, and close collaboration, M2.7 is well worth watching.

From Codex, Agents SDK, Cyber, and Rosalind to image-line updates, OpenAI’s moves are no longer just feature iteration, but an effort to embed AI more deeply into development, security, scientific research, and content production workflows.
If you look at OpenAI’s recent series of moves together, the most noteworthy change is no longer “another new model came out” or “another new feature was added,” but that it is simultaneously advancing two things: on one side, turning products such as Codex and Agents SDK into more complete work entry points and agent infrastructure; on the other, continuing to push models toward high-value scenarios such as cybersecurity, life sciences, and multimodal content generation. What OpenAI is building is increasingly looking like a system that can enter real work, not just a list of capabilities.
🧩 Related important information
The development line is the clearest. After the recent update, Codex has clearly evolved from a coding assistant toward a more complete development partner: it supports parallel task handling in the background, computer use, built-in browser, historical preference memory, continuous-thread automation, and stronger PR review and task management capabilities. This means that Codex’s positioning is shifting from “help you write a piece of code” toward “help you continuously advance a stretch of development work.”
At the same time, OpenAI is also strengthening its agent infrastructure. The new version of Agents SDK natively integrates harness and sandbox execution, so agents can not only converse and call tools, but can also handle longer and more complex task flows in a more controlled execution environment. For developers and enterprises, this is equivalent to OpenAI gradually platformizing and standardizing the act of “building agents.”
In high-value professional scenarios, OpenAI is also going deeper at the same time. GPT-5.4-Cyber opens higher-privilege access to certified security defenders, showing that OpenAI is pushing models into real cybersecurity workflows; GPT-Rosalind, on the other hand, clearly points toward biology, drug discovery, and translational medicine, beginning to let models go deeper into processes such as literature review, hypothesis generation, experiment planning, and scientific tool calling.
The image line is also sending a clear signal. GPT Image 2 has already shown signs of limited gray-release testing, while DALL·E 2 / 3 have been confirmed for retirement on May 12, showing that OpenAI’s image capabilities are entering a generational transition window. Viewed alongside the other moves, what the image line fills in is the content-generation link in OpenAI’s multimodal productivity chain: from writing code, building agents, and entering professional industries, to generating images that can be used more directly in product and content scenarios, OpenAI is bringing different capabilities together into one unified work system.
🧭 Industry impact analysis
The most central change in OpenAI’s recent round of moves is further organizing model capabilities into a work system: at the front end, there is a development entry point such as Codex; in the middle layer, there is agent infrastructure such as Agents SDK; in depth, there are professional scenario products such as Cyber and Rosalind; and on the multimodal side, there are image capabilities that are undergoing generational transition.
This means that OpenAI’s competitive focus is shifting from single-turn response quality toward who can enter real workflows more deeply and who can form the default entry point in higher-value industries.
Looking at Codex and Agents SDK together, one faces the daily development entry point, and the other faces agent infrastructure; together with the simultaneous advance of Cyber, Rosalind, and the image line, OpenAI’s layout is clearly no longer about letting users “send a question to the model,” but about competing to define the default interface, default calling layer, and default industry solution of future work.
From this angle, OpenAI’s keywords in this round are no longer just benchmarks, but computer use, automation, sandbox, trusted access, scientific workflows, and multimodal generation. As the pure capability gap among top models continues to shrink, commercial value will increasingly be determined by whether a model can enter high-frequency workflows, whether it can take on long-cycle tasks, whether it is allowed to be used in highly constrained industries, and whether it can cover the complete chain from analysis to execution to output.

Comment feature of the built-in browser after the Codex update
Opus 4.7 continues to deepen coding, agent, and vision capabilities, while Claude Design further connects Claude to the output layer of prototypes, PPTs, and visual documents.
If you look at Anthropic’s current moves together, the most noteworthy thing is that Claude’s product boundary is clearly expanding outward. Opus 4.7 continues to strengthen long tasks, complex engineering tasks, and high-resolution visual understanding, while Claude Design lets Claude directly begin generating visual results such as prototypes, presentations, and one-pagers. Claude’s role is evolving from “a very strong AI collaborator” toward “a platform that can cover more of the output chain of knowledge work.”
🧩 Related important information
On April 16, Anthropic released Claude Opus 4.7, officially defining it as a clear upgrade over Opus 4.6, with a focus on strengthening advanced software engineering, agents, vision, and multi-step tasks. One change in the official description that is particularly worth noticing is that Opus 4.7 will proactively design verification steps and check its own output before giving a result, which makes it more suitable for long-chain, low-supervision complex tasks. The pricing remains the same as Opus 4.6 and it is already available in Claude products and major cloud platforms.
The significance of this upgrade is not limited to benchmarks. Anthropic cited feedback from multiple early testers: Cursor said its score on CursorBench rose from 58% to 70%; Notion said complex multi-step workflows improved by 14% relative to Opus 4.6; Rakuten mentioned that on Rakuten-SWE-Bench, the number of production tasks solved by Opus 4.7 reached three times that of 4.6; XBOW said its visual acuity benchmark rose from 54.5% to 98.5%. Taken together, this feedback points to one trend: Anthropic is continuing to concentrate Claude’s advantages on capabilities close to real work, such as long-task execution, engineering collaboration, complex tool calling, and visual understanding.
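The tester figures above mix two kinds of numbers: before/after scores (Cursor, XBOW) and a relative improvement (Notion). As a rough sketch, the snippet below converts the before/after pairs into both percentage-point gains and relative gains so they can be read on the same footing. The helper name `gains` is our own, and the inputs are only the numbers quoted above.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in %) for two scores given in %."""
    points = after - before
    relative = (after - before) / before * 100
    return points, relative


# Before/after scores as quoted in the paragraph above.
reported = {
    "CursorBench (Cursor)": (58.0, 70.0),
    "Visual acuity (XBOW)": (54.5, 98.5),
}

for name, (before, after) in reported.items():
    points, relative = gains(before, after)
    print(f"{name}: +{points:.1f} points, +{relative:.1f}% relative")
```

On these inputs, the CursorBench change is 12 points (about 20.7% relative) and the XBOW change is 44 points (about 80.7% relative), which is worth keeping in mind when comparing them with Notion’s 14% figure.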
What is more noteworthy is that Anthropic is simultaneously pushing Claude toward more concrete artifact generation. Claude Design, released on April 17, allows users to directly generate designs, interactive prototypes, presentations, and one-page documents from text prompts, images, DOCX, PPTX, XLSX, code repositories, and web content; it also supports commenting, partial editing, fine-grained adjustment, and can export to Canva, PDF, PPTX, or standalone HTML, and can hand the design over to Claude Code with one click for further implementation. The significance of this product is that Claude is no longer just “understanding and suggesting,” but has begun directly delivering usable visual artifacts.
The target users of Claude Design also reveal Anthropic’s ambition. The official copy directly names founders, product managers, marketers, and account executives, emphasizing speed improvements from “idea to working prototype,” “rough outline to complete deck,” and “static design to interactive prototypes.” In other words, Anthropic does not only want to make Claude an assistant for engineering teams, but is extending it toward a wider knowledge-work entry point. (Figma’s stock price fell sharply as a result!)
🧭 Industry impact analysis
Previously, Anthropic’s most stable mindshare position had long been coding and agents. What is more worth noting this time is that it is expanding this capability set to the broader results layer of knowledge work. Opus 4.7 is responsible for raising the upper bound of underlying capability, Claude Code is responsible for taking over the execution environment and engineering flow, and Claude Design further pushes Claude toward visual expression, solution presentation, and prototype validation, links that in the past were usually completed by independent design tools.
Looking at the broader industry competition, this shows that the focus of competition among top model companies is changing. They stopped competing only over “who is smarter” some time ago (or, put differently, differences in raw intelligence can no longer open up a wide gap) and are now competing over “who can finish an entire stretch of work and directly deliver a usable artifact”; that is, on top of model capability, they are beginning to fight for entry points, workflows, and ecosystems. From this angle, the significance of Claude Design is not only that AI can do design, but that Anthropic has clearly begun competing for application entry points and control over workflows.

Official demo image of the Claude Design website
What is most worth paying attention to now is not only that models are getting stronger, but that capability convergence, capital inflow, and the rewriting of the real working world are all happening at the same time.
If what OpenAI and Anthropic showed this week was company-level strategic movement, then Stanford AI Index 2026 provides a more macro background picture of the industry. The report shows that capability gaps among frontier models are continuing to converge, with the gap between top Chinese and U.S. models already compressed to 2.7%; at the same time, AI investment is still growing rapidly, organizational adoption and workplace usage continue to rise, and AI’s influence on the real world of work is becoming more concrete. Technological progress, commercial expansion, and social friction are all accelerating together.
🧩 Related important information
At the technical level, one of the most eye-catching signals in this report is that the frontier capability gap is continuing to narrow. Stanford HAI clearly writes that since early 2025, U.S. and Chinese models have alternated multiple times in taking the lead, and as of March 2026, the lead of Anthropic’s top model over China’s top model had shrunk to only 2.7%. This shows that the industry is increasingly no longer a “single-pole leadership” pattern, but more like a stage in which frontier capabilities are rapidly approaching each other and competition is becoming more multipolar.
But capability improvement is not happening evenly. The report particularly emphasizes the so-called jagged frontier: on some difficult tasks, models are already approaching or even surpassing human baselines, but on some seemingly more basic tasks, they may still remain fragile. One very typical example is that Gemini Deep Think has already reached gold-medal-level performance on IMO-level tasks, but the accuracy of top models in reading analog clocks is still only 50.1%; meanwhile, AI agents’ task success rate on OSWorld has risen from 12% to about 66%. This means that “AI is very strong” is true, but “AI can already stably handle all real tasks” is still obviously an overstatement.
At the economic level, AI is still one of the most capital-concentrated directions. Stanford HAI pointed out in its accompanying commentary on the report that in 2025, global corporate AI investment reached $581.7 billion, up 130% year over year; private AI investment reached $344.7 billion, up 127.5% year over year. U.S. private AI investment still far exceeds that of other countries, but the report also reminds readers that looking only at private investment will underestimate the actual scale of capital that China channels into AI through government-guided funds and similar avenues.
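To get a feel for what these growth rates imply, here is a minimal back-of-the-envelope sketch that derives the prior-year baselines from the 2025 totals and year-over-year percentages quoted above. The derived values are our own arithmetic, not figures stated in the report, and they are sensitive to rounding in the published growth rates. The function name `implied_prior_year` is hypothetical.

```python
def implied_prior_year(current: float, growth_pct: float) -> float:
    """Back out the prior-year value from a current value and its YoY growth in %."""
    return current / (1 + growth_pct / 100)


# 2025 totals in billions of USD and YoY growth, as quoted above.
corporate_2024 = implied_prior_year(581.7, 130.0)
private_2024 = implied_prior_year(344.7, 127.5)

print(f"Implied 2024 corporate AI investment: ~${corporate_2024:.1f}B")
print(f"Implied 2024 private AI investment:   ~${private_2024:.1f}B")
```

Under these assumptions, the implied 2024 baselines are roughly $252.9 billion (corporate) and $151.5 billion (private); in other words, both totals more than doubled in a single year.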
Changes in employment and usage are also becoming more concrete. The economic chapter of AI Index points out that the number of employed software developers aged 22 to 25 has fallen by nearly 20% since 2024, with the impact concentrating first among the youngest workers in exposed occupations; meanwhile, the public-opinion chapter shows that in 2025, 58% of workers globally already said that they use AI semi-regularly or regularly at work, and in India, China, Nigeria, the UAE, and Saudi Arabia, that proportion is over 80%. In other words, AI is no longer just “something some people are trying,” but has entered a more normalized stage of organizational use in many regions.
🧭 Industry impact analysis
This report and this week’s moves by OpenAI and Anthropic in fact corroborate each other. Since the pure capability gap among top models is narrowing, companies will naturally become more proactive in competing for workflow entry points, professional scenarios, and product forms. This is also why we are simultaneously seeing OpenAI move toward “platform + specialization” directions such as Codex, Agents SDK, Cyber, and Rosalind, while also seeing Anthropic use Opus 4.7 and Claude Design to connect “underlying capability + application results layer.”
But AI Index also reminds us of another layer of reality: the closer models get to work systems, the more concrete the social frictions become. Employment structure, transparency, reliability evaluation, responsibility boundaries, and governance capacity are no longer side issues far away from business, but will increasingly become part of productization and scaled deployment itself. In other words, in the next stage, it will be very hard for large-model competition to be won only through “being smarter.”

Comparison chart of the performance gap between Chinese and U.S. AI models
1.1 https://www.pi.website/blog/pi07
1.2 https://arxiv.org/abs/2604.14148
1.3 https://arxiv.org/abs/2604.13036
2.1 https://github.com/Tencent-Hunyuan/HY-World-2.0
2.2 https://huggingface.co/robbyant/lingbot-map
2.3 https://huggingface.co/MiniMaxAI/MiniMax-M2.7
3.1 https://openai.com/zh-Hans-CN/index/codex-for-almost-everything/
3.2 https://www.anthropic.com/news/claude-design-anthropic-labs
3.3 https://hai.stanford.edu/ai-index/2026-ai-index-report





