



AI phones are no longer a new idea. At least for people in China, “in a certain sense, ByteDance has already played with this before.” It is easy to see why AI companies want a piece of the smartphone market: the smartphone has long been the biggest gateway to users and may well remain so. But whether models alone can rewrite the competitive landscape among smartphone makers is still an open question.
That question is sharper at a time when AI safety remains highly controversial and users can do little more than bury their heads in the sand.
Back to the competitive landscape: will established smartphone manufacturers reach this new world by partnering with model companies or adopting open-source models, or will new model players use their model ecosystems to win a place for themselves?

Hey, my friend 😊, welcome to “Fresh AI Stories Under the Sun,” a weekly newsletter produced by JoinAI|Zhuoyin Intelligent Algorithm Team.
With the unique technical perspective and restraint of “AI builders,” we carefully select the weekly Top 3 papers, projects, and industry updates for you. Here, we do not care about the illusion of traffic-driven hype. We only track technologies and trends truly worth paying attention to. Here, we do not only promote the good side of AI; we also expose AI’s problems.
DeepSeek’s paper elevates points and boxes into the smallest thinking units in multimodal reasoning. When the model performs dense counting, spatial relation reasoning, maze navigation, and path tracing, it can reason while using coordinates to anchor objects, regions, and paths. The signal it releases is this: the next stage of multimodal model improvement cannot rely only on higher resolution and more visual tokens; it also needs more precise visual reference mechanisms.
RecursiveMAS extends the idea of recursive models to multi-agent systems, allowing different agents to pass and revise hidden states in latent space through RecursiveLink, reducing token overhead and information loss caused by intermediate text. The signal it releases is this: multi-agent scaling is beginning to show a new route. In the future, there may be less “roles taking turns to speak” and more “training of internal system information flow.”
OneManCompany treats agents as digital employees that can be recruited, managed, evaluated, and replaced, instead of temporary prompt roles. Talent Market, Talent-Container, E2R tree search, SOP accumulation, and HR lifecycle management together form an “AI company” framework. The signal it releases is this: competition among agent products will continue to expand from individual capability to organizational capability, management mechanisms, and talent markets.
Open Design connects coding agents such as Claude Code, Codex, Cursor Agent, Gemini CLI, and Kimi CLI into the design generation process, forming a local-first design workflow with skills, design systems, sandbox previews, and multi-format exports. The signal it releases is this: design generation is moving away from a single SaaS product form and becoming a pluggable capability in the coding agent ecosystem.
MiMo-V2.5-Pro combines 1.02T total parameters, 42B active parameters, 1M context, MTP acceleration, and KV cache compression, with its target directly pointing to long-horizon agents and complex software engineering tasks. The signal it releases is this: open-source model competition is entering a stage where ultra-long context, agent trajectory efficiency, and deployable MoE foundations work together.
After Warp open-sourced its client, the terminal is no longer just a command entry point. It is also beginning to support coding agents, CLI agents, permission requests, task status, and cloud orchestration. The signal it releases is this: the host environment for AI Coding is expanding from editors to terminals, and the developer’s real execution scene will become an important entry point for agent workflows.
Viewed together, the shift in OpenAI’s relationship with Microsoft, the AI phone rumors, and the goblin incident point to a simultaneous reshuffling of cloud distribution, hardware gateways, and model behavior governance. The signal it releases is this: OpenAI is reducing its dependence on external key links, and its competitive target is expanding from model capability to cloud, terminals, product personality, and long-term trust.
DeepSeek is running a gray-release test of an image understanding mode while continuing to cut prices for V4-Pro and for input cache hits. Once the multimodal entry point is in place, long context, screenshot understanding, chart Q&A, and agent workflows will become easier to connect to the DeepSeek ecosystem. The signal it releases is this: competition among high-capability models is being redefined by “whether long workflows can run at low cost.”
The warning from the Cursor-related production incident does not only come from the fact that AI makes mistakes. It also comes from the fact that once an agent gets real system permissions, mistakes can act directly on the production environment through APIs. The signal it releases is this: the competitive standards for AI Coding tools are changing. In the future, the comparison will not only be about coding ability, but also permission governance, auditing, rollback, isolation, and confirmation of dangerous operations.

When asking AI about important issues such as healthcare, investment, entrepreneurship, law, workplace decisions, or intimate relationships, do not only ask it to “be gentler” or “encourage me.” A more stable approach is to split the answer into two steps: first ask the AI to judge where your statement may be wrong, where the evidence is insufficient, and where the risks are; then ask it to give a short piece of emotional support at the end. You can ask directly like this: “Please don’t comfort me first. First point out where my judgment may not hold, then give factual evidence, opposing views, and action suggestions. Finally, use a short paragraph to provide emotional support.”
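The two-step request above can be packaged as a reusable template. This is only a minimal sketch: the function name `build_critique_first_prompt` and the exact wording of the steps are illustrative assumptions, not a recommended standard.

```python
def build_critique_first_prompt(question: str) -> str:
    """Wrap a question in a critique-first, comfort-last instruction.

    The step wording is an illustrative assumption based on the tip above.
    """
    steps = [
        "Please don't comfort me first.",
        "1. Point out where my judgment may not hold.",
        "2. Give factual evidence, opposing views, and action suggestions.",
        "3. Finish with one short paragraph of emotional support.",
    ]
    return "\n".join(steps) + "\n\nMy situation: " + question.strip()


prompt = build_critique_first_prompt("  I plan to quit my job to day-trade. ")
print(prompt)
```

Keeping the critique steps ahead of the support step in the template is the whole point: the ordering asks for the judgment to be written before the comfort.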
The Nature paper “Training language models to be warm can reduce accuracy and increase sycophancy” offers an important lesson: after models are trained to be warmer and more empathetic, they may sacrifice some reliability and become more likely to go along with the user’s original ideas. For ordinary users, the key point is: do not mistake “AI sounds comforting” for “AI is making the correct judgment.” A gentle tone suits companionship and emotional relief, but for questions that require judging truth or falsehood, it is better to separate fact-checking from emotional support. Simply remember: first let it find flaws, then let it comfort you.

Link: https://www.alphaxiv.org/abs/visual-primitives
Recommendation index: 🌟🌟🌟🌟🌟
DeepSeek’s paper proposes Thinking with Visual Primitives, turning spatial markers such as points and boxes into “minimal thinking units” in multimodal reasoning, allowing the model to directly use coordinates to anchor objects, paths, and spatial relationships during the reasoning process.
This paper clearly explains a key weakness in multimodal reasoning: after a model sees an image clearly, it still needs to stably refer to objects, regions, and paths in the image during the reasoning process. The paper calls this problem the Reference Gap. Many MLLMs make mistakes in dense counting, spatial relationship judgment, maze navigation, and path tracing tasks. The reason is often that language reasoning chains cannot continuously lock onto specific entities in visual space.
DeepSeek’s solution is Thinking with Visual Primitives: directly insert bounding boxes and points into the model’s thinking process, allowing the model to reason while using boxes to locate objects and points to mark paths. This design turns visual coordinates into part of the reasoning process, making it especially suitable for tasks that require continuous tracking of objects, positions, and topological relationships.
More importantly, this method is based on DeepSeek-V4-Flash and continues DeepSeek’s efficiency route. Through DeepSeek-ViT, visual token compression, and CSA, it compresses image information into very few KV cache entries while still achieving results close to or even surpassing frontier closed-source models in counting, spatial reasoning, and topological reasoning.
Recent high-resolution cropping, dynamic patching, and visual scaling mainly solve the Perception Gap, allowing models to see more details. But complex reasoning tasks also require the model to stably refer to the same object, region, or path across multiple steps.
Natural language is good at describing semantics and logic, but in continuous visual space, it is difficult to unambiguously express which coordinate position phrases like “this small object,” “that path,” or “the second target in the lower left” correspond to. The paper classifies this type of failure as the Reference Gap.
The paper does not treat bounding boxes and points merely as final detection results. Instead, it uses them as the smallest thinking units in the reasoning trajectory, allowing the model to actively “identify” visual objects and spatial paths during thinking.
The model generates <box> and <point> during intermediate reasoning. Boxes represent object position and scale, while points represent paths, trajectories, starting points, ending points, and key spatial positions. In this way, tasks such as counting, spatial relations, maze navigation, and path tracing can all be explicitly grounded in image coordinates.
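To make the idea of coordinates inside a reasoning trace concrete, here is a minimal sketch that pulls boxes and points back out of generated text. The tag syntax used here (`<box>x1,y1,x2,y2</box>` and `<point>x,y</point>` with integer pixel coordinates) is an assumption for illustration; the paper's actual serialization may differ.

```python
import re
from typing import List, Tuple

# Assumed, illustrative syntax: <box>x1,y1,x2,y2</box> and <point>x,y</point>.
BOX_RE = re.compile(r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>")
POINT_RE = re.compile(r"<point>\s*(\d+)\s*,\s*(\d+)\s*</point>")


def extract_primitives(trace: str) -> Tuple[List[tuple], List[tuple]]:
    """Return (boxes, points) found in a reasoning trace, in order of appearance."""
    boxes = [tuple(int(v) for v in m.groups()) for m in BOX_RE.finditer(trace)]
    points = [tuple(int(v) for v in m.groups()) for m in POINT_RE.finditer(trace)]
    return boxes, points


trace = (
    "The first apple is at <box>10,20,50,60</box>, the second at "
    "<box>70,20,110,60</box>. The path starts at <point>5,5</point> "
    "and turns at <point>5,40</point>."
)
boxes, points = extract_primitives(trace)
print("count =", len(boxes), boxes)
print("path =", points)
```

In this toy trace, counting reduces to counting distinct boxes and path tracing to reading the ordered point list, which is the sense in which the primitives ground the reasoning in image coordinates.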
The paper constructs large-scale box grounding data and designs four types of cold-start tasks: counting, spatial reasoning, maze navigation, and path tracing. Maze navigation uses algorithms such as DFS to generate solvable and unsolvable mazes, while path tracing uses coordinate sequences to supervise the model as it follows curves step by step.
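The maze data generation can be pictured with a small sketch. It uses a randomized depth-first search to carve a maze that is solvable by construction and a second DFS to verify solvability. The grid encoding (1 = wall, 0 = open) and the way an unsolvable variant is produced (walling off the goal cell) are assumptions for illustration, not the paper's pipeline.

```python
import random

DIRS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def generate_maze(width, height, seed=0):
    """Carve a maze with randomized DFS; every cell ends up connected."""
    rng = random.Random(seed)
    grid = [[1] * (2 * width + 1) for _ in range(2 * height + 1)]
    grid[1][1] = 0
    stack, visited = [(0, 0)], {(0, 0)}
    while stack:
        cx, cy = stack[-1]
        options = [
            (cx + dx, cy + dy)
            for dx, dy in DIRS
            if 0 <= cx + dx < width
            and 0 <= cy + dy < height
            and (cx + dx, cy + dy) not in visited
        ]
        if not options:
            stack.pop()  # dead end: backtrack
            continue
        nx, ny = rng.choice(options)
        grid[cy + ny + 1][cx + nx + 1] = 0  # remove the wall between the two cells
        grid[2 * ny + 1][2 * nx + 1] = 0    # open the new cell
        visited.add((nx, ny))
        stack.append((nx, ny))
    return grid


def is_solvable(grid, start, goal):
    """Iterative DFS over open grid squares from start to goal."""
    stack, seen = [start], {start}
    while stack:
        r, c = stack.pop()
        if (r, c) == goal:
            return True
        for dr, dc in DIRS:
            nr, nc = r + dr, c + dc
            if grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                stack.append((nr, nc))
    return False


def make_unsolvable(grid, goal):
    """Assumed trick: wall off the four neighbours of the goal square."""
    blocked = [row[:] for row in grid]
    for dr, dc in DIRS:
        blocked[goal[0] + dr][goal[1] + dc] = 1
    return blocked


maze = generate_maze(4, 4, seed=1)
start, goal = (1, 1), (7, 7)
print(is_solvable(maze, start, goal))
print(is_solvable(make_unsolvable(maze, goal), start, goal))
```

Generating both kinds of maze with a known label is what makes the task usable as supervision: the model can be checked on whether it finds a path and on whether it correctly reports that none exists.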
The training process first teaches the model to output visual primitives during pretraining, then separately trains two types of experts: thinking with grounding and thinking with pointing. It then merges them into a unified model through Specialized RL, Unified RFT, and On-Policy Distillation.

Comparison table of frontier multimodal model reasoning capabilities
The value of this paper lies in how it pushes the focus of multimodal reasoning from “seeing images clearly” to “precise reference.” Many visual reasoning errors do not only come from insufficient perception; they also come from the model lacking a stable visual reference mechanism during reasoning. The paper’s approach is natural: let the model generate points and boxes during reasoning, just as humans use fingers to assist thinking, grounding abstract language chains into concrete coordinates. This design is very suitable for dense counting, spatial relations, maze navigation, and path tracing, because all these tasks require the model to continuously maintain the correspondences among objects, positions, and paths.
Another important signal is that this work continues the efficiency route of DeepSeek-V4. The paper is based on DeepSeek-V4-Flash. Through DeepSeek-ViT, 3×3 visual token compression, and CSA, it further compresses visual tokens to very low KV cache overhead. For example, the paper mentions that a 756×756 image is compressed from 2916 ViT patch tokens to 324 LLM input visual tokens, and finally only 81 visual entries are kept in the KV cache. Figure 1 also shows that it can still achieve strong counting and spatial reasoning performance under lower token consumption.
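The compression chain quoted above can be checked with a few lines of arithmetic. The 14-pixel patch size and the 4× reduction in the final stage are not stated in this summary; they are assumptions inferred from the numbers (756 / 14 = 54 and 54² = 2916; 324 / 81 = 4).

```python
def visual_token_budget(image_side=756, patch_size=14, merge=3, kv_ratio=4):
    """Reproduce the 2916 -> 324 -> 81 chain under the assumptions above."""
    patches_per_side = image_side // patch_size   # assumed 14-pixel patches
    vit_tokens = patches_per_side ** 2            # ViT patch tokens
    llm_tokens = vit_tokens // (merge * merge)    # 3x3 visual token compression
    kv_entries = llm_tokens // kv_ratio           # assumed 4x reduction in the KV cache
    return vit_tokens, llm_tokens, kv_entries


print(visual_token_budget())  # (2916, 324, 81)
```

End to end, the image occupies 36 times fewer KV cache entries than it has ViT patches (2916 / 81 = 36).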
This paper also has a highly viral side story: DeepSeek briefly made the GitHub repository for Thinking with Visual Primitives public, then deleted it, triggering many community discussions and leading to backup and cloned repositories. This event brought additional attention to the paper and made the outside world more interested in its implementation details, model weights, and whether it will be reopened later.
In summary, it represents a new direction for multimodal reasoning: in the future, more and more model “thinking” will happen inside points, boxes, trajectories, and geometric structures.
Link: https://arxiv.org/abs/2604.25917
Recommendation index: 🌟🌟🌟🌟🌟
RecursiveMAS extends the idea of recursive models from a single LLM to multi-agent systems, allowing different agents to stop relying on intermediate text to pass messages and instead circulate, revise, and merge “implicit thoughts” in latent space.
This paper captures a very real bottleneck in multi-agent systems: agent collaboration usually relies on text communication. Each round requires generating, reading, and reinterpreting intermediate answers, causing large token overhead, latency, and information loss.
RecursiveMAS’s idea is to view the entire multi-agent system as a recursive computation graph. Each agent is like a repeatedly callable computation module, passing hidden states in latent space through a lightweight RecursiveLink. In this way, different agents can perform multi-round collaboration and revision without repeatedly decoding intermediate text.
The paper also proposes inner-outer loop training: first, each agent learns to generate latent thoughts; then, the outer links between agents are trained so the entire system can be jointly optimized toward a unified goal.
The results are also quite convincing: across 9 benchmarks in mathematics, science, medicine, search, and code generation, RecursiveMAS achieves an average improvement of 8.3%, inference acceleration of 1.2×–2.4×, and token usage reductions of 34.6%–75.6%. This shows that the next stage of multi-agent scaling does not have to only increase the number of roles and extend conversation rounds. It can also shift toward training collaboration itself, turning internal system information flow into recursive latent-space computation.
Traditional MAS often lets agents pass intermediate conclusions through natural language. This approach is interpretable, but each round requires generation, waiting, reading, and re-encoding, causing inference costs to rise rapidly with the number of rounds.
Recent looped or recursive language models improve reasoning by repeatedly reusing the same computation and refining latent states. RecursiveMAS extends this principle from inside a single model to the multi-agent system layer.
The paper no longer only optimizes the capability of individual agents. It also brings the information flow between agents into training. Through RecursiveLink and cross-round backpropagation, the entire collaborative system is optimized as a whole.
Inside each agent, there is an inner RecursiveLink that maps last-layer hidden states back to the input space, supporting latent thought generation. Between different agents, there is an outer RecursiveLink that aligns latent representations of heterogeneous models and enables cross-agent information transfer.
RecursiveMAS lets agents A1, A2, …, AN sequentially generate and pass latent thoughts in latent space. The latent output of the last agent then flows back to the first agent, forming multi-round recursive collaboration. Only in the final round does the last agent decode a text answer.
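The control flow of this loop can be sketched with toy stand-ins. Everything numeric here is hypothetical: `make_agent` and `outer_link` are simple arithmetic placeholders for real models and trained links, and the sketch shows only the ordering of steps (latents circulate across rounds, text is decoded once at the end).

```python
from typing import Callable, List

Vector = List[float]


def make_agent(scale: float) -> Callable[[Vector], Vector]:
    """Toy stand-in for an agent: maps an input latent to an output latent."""
    return lambda h: [scale * x + 1.0 for x in h]


def outer_link(h: Vector) -> Vector:
    """Toy stand-in for a trained cross-agent link (here a fixed rescaling)."""
    return [0.5 * x for x in h]


def recursive_collaboration(agents, h0, rounds, decode):
    h = h0
    for _ in range(rounds):
        for agent in agents:
            # The latent is handed to the next agent; no text is decoded here.
            h = outer_link(agent(h))
        # After the last agent, h flows back to the first agent in the next round.
    return decode(h)  # text is produced only once, after the final round


decode_calls = []


def decode(h: Vector) -> str:
    decode_calls.append(h)
    return f"answer={h[0]:.2f}"


answer = recursive_collaboration(
    [make_agent(1.0), make_agent(2.0)], [0.0], rounds=2, decode=decode
)
print(answer, "decodes:", len(decode_calls))  # answer=1.50 decodes: 1
```

A text-based pipeline with the same two agents and two rounds would decode and re-encode four times; here the single decode at the end is where the token savings come from.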
The paper evaluates RecursiveMAS on 9 benchmarks. Compared with single-agent systems, TextGrad, LoopLM, Recursive-TextMAS, and other methods, it improves average accuracy by 8.3%. Meanwhile, as the number of recursive rounds increases, the inference speedup reaches up to 2.4× and token usage drops by as much as 75.6%.

Overview of RecursiveMAS recursive scaling and collaborative generalization capability
RecursiveMAS pushes multi-agent collaboration from “multiple people taking turns to speak” to “multiple models recursively computing in latent space.” The core cost of traditional multi-agent systems comes from the text mediator: each agent has to write intermediate thoughts in natural language, and the next agent then re-encodes these words into its own context. RecursiveMAS directly bypasses this text mediator, using RecursiveLink to build trainable information channels between hidden states, allowing the system to iterate multiple rounds in continuous representation space. This design is interesting because it turns agent collaboration itself into an optimizable object, instead of relying only on prompts and role division to organize the process.
More importantly, this paper extends “recursion” from single-model architecture to system architecture. Previous looped transformers and recursive language models mainly discussed how one model deepens reasoning through repeated computation. RecursiveMAS discusses how multiple heterogeneous agents jointly form a recursive system. It supports four collaboration modes: sequential, mixture, distillation, and deliberation, showing that this idea is not tied to one fixed orchestration pattern. In experiments, it improves accuracy while significantly reducing tokens and inference time, especially as the number of recursive rounds increases.
Its boundaries are also clear: this method requires training RecursiveLink, so connecting it to existing agent products will be heavier than pure prompt orchestration. Latent-space collaboration will also sacrifice some readability of intermediate processes. Even so, it offers a very inspiring new judgment: future multi-agent scaling may shift from “stacking more conversation rounds” to “training the system’s internal information flow.”
Link: https://arxiv.org/abs/2604.22446
Recommendation index: 🌟🌟🌟🌟🌟
OneManCompany proposes an “AI company”-style multi-agent framework, upgrading agents from individual skill executors into digital employees that can be recruited, managed, evaluated, and replaced, while using organizational mechanisms to coordinate complex projects.
This paper moves the problem of multi-agent systems from “how to make several agents talk” to “how to manage an agent workforce.” Many previous multi-agent frameworks rely on fixed teams, fixed workflows, and temporary conversation memory. When facing open-ended projects, they easily get stuck in unclear roles, disordered collaboration, exaggerated capabilities, and failure to accumulate experience.
OneManCompany’s core judgment is clear: what is really missing is an organizational layer. It proposes a Talent + Container architecture. The agent’s identity, role, skills, tools, and working principles are packaged into transferable Talents, and Containers adapt them to different runtime environments such as Claude Code, LangGraph, and script executors. As a result, agents are no longer just prompt roles. They are more like digital employees who can be recruited, onboarded, assigned tasks, evaluated, placed on a performance improvement plan (PIP), or even eliminated and replaced.
Together with Talent Market, E2R tree search, DAG task scheduling, and organization-level SOP accumulation, this work is already very close to a productized prototype of a “one-person company” or “agent workforce.”
The paper points out at the beginning that agents such as Claude Code, Codex, and OpenClaw can already expand capabilities through skills and tools, but these capabilities mainly happen inside a single agent. They cannot solve how multiple agents should be organized, collaborate, and continuously improve.
Frameworks such as CrewAI and AutoGen can do role division and message passing, but team structures are usually preset. It is difficult to dynamically recruit new capabilities at runtime, and different runtimes are not easy to interoperate.
The paper defines an AI organisation as a self-managing system composed of heterogeneous agents, with structured collaboration, lifecycle management, and experience-driven evolution capabilities. This corresponds to recruitment, task decomposition, performance evaluation, and review mechanisms in real companies.
Talent defines the agent’s role, prompt, skills, tools, and working principles. Container hosts different runtimes, including Claude Code, LangGraph, and script-based executors. The two combine into an Employee, allowing heterogeneous agents to collaborate within the same organization.
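The Talent + Container split can be sketched as two small types that combine into an Employee. The class and field names below mirror the description above, but the interfaces (`run`, `work`) and the two example containers are hypothetical; they are not OMC's real API.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass(frozen=True)
class Talent:
    """Portable description of an agent: what it is and how it works."""
    role: str
    prompt: str
    skills: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)
    principles: List[str] = field(default_factory=list)


class Container(Protocol):
    """Hypothetical runtime interface; each backend adapts a Talent to itself."""
    name: str

    def run(self, talent: Talent, task: str) -> str: ...


@dataclass
class ScriptContainer:
    name: str = "script-executor"

    def run(self, talent: Talent, task: str) -> str:
        return f"[{self.name}] {talent.role} handled: {task}"


@dataclass
class GraphContainer:
    name: str = "graph-runtime"

    def run(self, talent: Talent, task: str) -> str:
        return f"[{self.name}] {talent.role} planned and handled: {task}"


@dataclass
class Employee:
    """A Talent bound to a Container."""
    talent: Talent
    container: Container

    def work(self, task: str) -> str:
        return self.container.run(self.talent, task)


writer = Talent(role="report writer", prompt="Write concise reports.", skills=["summarize"])
for backend in (ScriptContainer(), GraphContainer()):
    print(Employee(writer, backend).work("weekly GitHub digest"))
```

The same `writer` Talent runs unchanged on both backends, which is the property that keeps an organization from being locked into one agent runtime.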
OMC divides project execution into three stages: Explore, Execute, and Review. The system first explores task decomposition and personnel allocation strategies, then lets employees execute tasks, and finally has superiors review the results. If the result is not qualified, the system decomposes again, reworks, or adjusts the team.
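The Explore, Execute, Review cycle can be sketched as a retry loop. This is a deliberate simplification under stated assumptions: the paper describes a tree search, while the sketch below is linear, and the `explore`, `execute`, and `review` callables and the acceptance rule are hypothetical placeholders.

```python
def run_project(task, explore, execute, review, max_attempts=3):
    """Repeat explore -> execute -> review until the review passes."""
    for attempt in range(1, max_attempts + 1):
        plan = explore(task, attempt)   # decompose the task and allocate work
        result = execute(plan)          # employees carry out the plan
        if review(result):              # a superior accepts or rejects the result
            return attempt, result
    return None                         # no attempt passed review


def explore(task, attempt):
    # Hypothetical strategy: split the task into more subtasks on each retry.
    return [f"{task} / part {i + 1}" for i in range(attempt)]


def execute(plan):
    return [f"done: {step}" for step in plan]


def review(result):
    # Hypothetical acceptance rule: the reviewer wants at least two deliverables.
    return len(result) >= 2


outcome = run_project("weekly report", explore, execute, review)
print(outcome)
```

In this toy run the first attempt is rejected and the second, with a finer decomposition, is accepted, mirroring the "decompose again, rework, or adjust the team" branch.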
Each agent updates its working principles through task reviews and CEO one-on-ones. After a project ends, the COO turns experience into SOPs. HR also conducts regular performance evaluations. Underperforming employees enter PIP; if they continue to fail, they are offboarded and replaced from the Talent Market.
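The evaluation lifecycle reads like a small state machine: active, then PIP, then offboarded. The sketch below assumes a single numeric score and a 0.6 pass threshold, both hypothetical; the paper's actual evaluation criteria are not given in this summary.

```python
from dataclasses import dataclass


@dataclass
class EmployeeRecord:
    name: str
    status: str = "active"  # "active" -> "pip" -> "offboarded"


def evaluate(record: EmployeeRecord, score: float, threshold: float = 0.6) -> str:
    """Apply one performance review and return the new status."""
    if record.status == "offboarded":
        return record.status            # offboarding is final in this sketch
    if score >= threshold:
        record.status = "active"        # a passing review clears a PIP
    elif record.status == "active":
        record.status = "pip"           # first failing review: enter PIP
    else:
        record.status = "offboarded"    # failing again while on PIP
    return record.status


agent = EmployeeRecord("designer-01")
history = [evaluate(agent, s) for s in (0.4, 0.3, 0.9)]
print(history)  # ['pip', 'offboarded', 'offboarded']
```

Replacing the offboarded employee would then be a lookup in the Talent Market for another Talent with the same role.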

Overview of organizational structure and operating mechanism
The value of OneManCompany lies in how it elevates the focus of multi-agent systems from “collaboration process” to “organizational mechanism.” Many agent projects emphasize tool calling, long-horizon tasks, and multi-role division, but once they enter real projects, they encounter familiar problems: who decomposes the task, who accepts the result, who reworks after failure, how to fill missing roles, how experience is retained, and how low-quality agents are eliminated.
OMC directly maps these problems into company systems: CEO, HR, EA, COO, employees, recruitment market, performance evaluation, PIP, SOPs, and project review. A whole set of organizational concepts is moved into the agent system. This design has a strong product sense because its focus has already gone beyond single-model capability and begins to build a management layer for agent workforces.
Its most inspiring aspect is the separation between Talent Market and Talent-Container. The former allows capability supply to be recruited on demand like a talent market. The latter allows the same Talent to run on different backends, avoiding the entire organization being locked into one agent runtime. For future Agent OS or personal AI companies, this abstraction is critical: what users truly need may not be one omnipotent agent, but a set of agent organizations that can dynamically form teams, assign tasks, manage quality, and accumulate experience.
The experimental section is also persuasive. The paper reports an 84.67% success rate on PRDBench, 15.48 percentage points higher than existing baselines. Cases include weekly reports on popular GitHub repositories, web game development, audiobook video generation, and embodied AI world model research, showing that it focuses more on real project-level workflows and aims beyond single benchmark Q&A.
Of course, its boundaries are also obvious: this system is heavier than ordinary multi-agent frameworks. It strongly depends on scheduling, review, market mechanisms, and the judgment of a human CEO. Cost, controllability, and evaluation standards will also become long-term issues.

Link: https://github.com/nexu-io/open-design
Open Design captures an obvious gap left after Claude Design became popular: many people want an artifact-first design generation experience, but do not want to be locked into a single model, cloud service, or paid product. The project’s positioning is direct: it is an open-source alternative to Claude Design.
It does not provide models itself. Instead, it connects local Claude Code, Codex, Cursor Agent, Gemini CLI, OpenCode, Qwen, Kimi CLI, and other coding agents, turning these agents into design production engines. Users can generate web pages, desktop prototypes, mobile prototypes, slides, images, videos, and HyperFrames. It also supports sandbox previews and exports in formats such as HTML, PDF, PPTX, ZIP, and MP4.
More importantly, it breaks capabilities into composable Skills and Design Systems. Public materials have already mentioned 19+ design skills and 70+ brand-grade design systems. This separates it from ordinary “AI-generated page demos” and brings it closer to a local-first, BYOK, deployable, and extensible design workflow.

Screenshot of Open Design open-source repository
Open Design’s product judgment is very clear: the strongest coding agents are already on users’ computers, and what is truly missing is a harness that organizes them into a design workflow. Its value lies in connecting agents, skills, design systems, sandbox previews, and multi-format exports into a complete chain.
This idea is attractive to creators and independent developers: you can use the agent you already know and your own API key to generate pitch decks, mobile prototypes, dashboards, docs pages, and other design assets locally, while keeping the project files and exported results. It has recently gained very high popularity on GitHub, reaching the ten-thousand-star level, and has also entered discussions in communities such as Hacker News.
It represents a clear trend: design generation is moving from a single SaaS feature into a pluggable workflow inside the coding agent ecosystem. For people who care about agent harnesses, skills, design automation, and open-source Claude Design alternatives, this project is worth actually running.
Link: https://github.com/nexu-io/open-design
MiMo-V2.5-Pro pushes Xiaomi’s MiMo series further from a “reasoning model” toward a foundation for long-horizon agents and complex software engineering. The model card states clearly that it is an open-source MoE language model with 1.02T total parameters, 42B active parameters, and support for up to 1M tokens of context, targeting demanding agentic, complex software engineering, and long-horizon tasks.
It inherits MiMo-V2-Flash’s hybrid attention and 3-layer Multi-Token Prediction (MTP) design: sliding window attention (SWA) and global attention layers alternate at a 6:1 ratio, with a sliding window of 128, reducing the KV cache by about 7× under long context. MTP is used to improve inference output speed and RL rollout efficiency.
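The “about 7×” figure follows from simple counting. The sketch below assumes every layer stores the same amount of KV data per cached token and that an SWA layer keeps only its window; it compares a group of 6 SWA layers plus 1 global layer against 7 full-attention layers. These are simplifying assumptions, not details taken from the model card.

```python
def kv_cache_reduction(context_len: int, window: int = 128, swa_per_global: int = 6) -> float:
    """Ratio of all-global KV cache size to hybrid SWA/global KV cache size."""
    layers_per_group = swa_per_global + 1
    full = layers_per_group * context_len                    # every layer caches the full context
    hybrid = swa_per_global * min(window, context_len) + context_len
    return full / hybrid


for n in (8_192, 1_000_000):
    print(n, round(kv_cache_reduction(n), 2))
# 8192 6.4
# 1000000 6.99
```

The ratio approaches 7 only as the context grows, which is why the saving is described as applying under long context.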
More importantly, Xiaomi directly open-sourced the weights, tokenizer, and model card this time, under the MIT license. This moves it from an API product into open-source model competition where it can be deployed, modified, and connected to agent engineering stacks.

Benchmark results
The official MiMo-V2.5-Pro page especially emphasizes token efficiency: on ClawEval, V2.5-Pro reaches 64% Pass^3 at about 70K tokens per trajectory. Compared with models of similar capability such as Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4, it uses about 40%–60% fewer tokens per trajectory. This signal is important because the real cost of an agent task is not only the per-token price. It also depends on how many tokens one task consumes and whether the model can reliably complete long trajectories.
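To see what “40%–60% fewer tokens” implies, the comparison can be inverted. This sketch assumes the saving is measured relative to the other model's token count, so a model that uses 70K tokens at a saving of s corresponds to a baseline of 70K / (1 − s); the helper name is hypothetical.

```python
def implied_baseline_tokens(tokens: float, saving: float) -> float:
    """Tokens a comparison model would use if `tokens` is `saving` fewer than it."""
    return tokens / (1.0 - saving)


for saving in (0.4, 0.6):
    print(f"{saving:.0%} fewer -> baseline of about {implied_baseline_tokens(70_000, saving):,.0f} tokens")
# 40% fewer -> baseline of about 116,667 tokens
# 60% fewer -> baseline of about 175,000 tokens
```

Under that reading, the comparison models would spend roughly 117K to 175K tokens per trajectory, so at an equal per-token price the task cost would differ by a factor of about 1.7 to 2.5.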
MiMo-V2.5-Pro provides 1M context, MTP acceleration, KV cache compression, and agentic RL / MOPD post-training at the same time, showing that Xiaomi is targeting a combined strategy of “long-horizon usability + open-source deployment + token cost-effectiveness.”
Its boundary is also realistic: 1.02T total parameters and 42B active parameters mean that high-quality local deployment has a high threshold. FP8, vLLM / SGLang, chip adaptation, and inference engineering will all affect real experience. It is more suitable for teams with long-horizon agent, complex coding, code repository understanding, and automated workflow needs, rather than lightweight experimentation on ordinary personal computers.
But from the perspective of the open-source ecosystem, MiMo-V2.5-Pro is a strong signal: after DeepSeek, Kimi, and GLM, Xiaomi is also pushing large model competition into the layer of open-weight trillion-parameter MoE + long-horizon agent capability + domestic compute adaptation. For people interested in open-weight agent models, long-context reasoning, and coding agent foundations, it is worth trying.
Link: https://github.com/warpdotdev/warp
Warp represents a very clear direction: the terminal is no longer just a place to type commands. It is becoming an agentic development environment. The official repository’s positioning is direct: Warp is an agentic development environment that grew out of the terminal. It can use its built-in coding agent and also connect to external CLI agents such as Claude Code, Codex, and Gemini CLI.
More importantly, Warp recently open-sourced its client code. This transforms it from a closed AI terminal product into an open-source project where the community can inspect code, file issues, and participate in the roadmap.
For the coding agent ecosystem, this matters because the terminal is naturally located at the entrance of developer workflows: code, shell, git, file systems, builds, tests, and deployments all pass through it. If the terminal itself begins to natively understand agent status, permission requests, task progress, and multi-agent collaboration, then it is no longer just a developer tool. It becomes the front-end console of agent workflows.

Screenshot of Warp interface
Future coding agents need a host environment that understands context and task status better than ordinary terminals and is better suited to long-horizon execution. Warp may be one such environment. On one hand, it keeps modern terminal experiences such as cross-platform installation, shell integration, block-based command output, and better interaction. On the other hand, it begins placing agent capabilities into the core workflow, including built-in agents, BYO CLI agents, Claude Code integration, notifications, session status, permission requests, and cloud agent orchestration.
This direction differs from many routes that “build agents inside editors.” Warp is betting on the terminal entry point: developers ultimately still need to run commands, read logs, run tests, and solve environment problems. Agents inside the terminal are closer to the real execution scene.
This open-sourcing also makes it more worth tracking. Warp’s official blog clearly states that the client is now open source and that the community can contribute around the open-source repo. The GitHub organization page also shows that the Warp repository has a high level of attention. But its boundary also needs to be seen clearly: what is open-source is the client code, while some agent / cloud orchestration capabilities are still tied to Warp’s cloud service, Oz platform, and account system. The community is also discussing how open it really is.
Overall, Warp is very suitable for this week’s project list because what it represents has gone beyond an ordinary terminal update. It is more like a signal that terminals, coding agents, cloud orchestration, and open-source collaboration are converging.

When cloud, hardware, and model behavior governance change at the same time, OpenAI is pushing itself from a “model supplier” toward a more complete AI platform company.
Several developments around OpenAI this week point not only to model capability, but to how it is rearranging its relationship with cloud, hardware gateways, and model behavior.
First is the updated cooperation agreement with Microsoft. According to OpenAI’s official announcement, Microsoft remains OpenAI’s major cloud partner, and OpenAI products will continue to launch first on Azure. But OpenAI can now provide products to customers through any cloud service provider, and Microsoft’s license to OpenAI model and product IP has shifted from exclusive to non-exclusive, valid through 2032.
At the same time, rumors about OpenAI’s hardware plans continue to heat up. Reuters reported that OpenAI is said to be working with Qualcomm and MediaTek to develop processors for AI smartphones, and Luxshare Precision may become the system design and manufacturing partner. The relevant devices are expected to enter mass production in 2028. Although this is still not an officially confirmed product launch, it is consistent with OpenAI’s recent continued push into consumer AI hardware.
There is also a seemingly light but actually very representative “small piece of news”: OpenAI officially explained why models frequently mentioned fantasy creatures such as goblins and gremlins. The investigation showed that this language habit initially concentrated in ChatGPT’s “Nerdy” personalization style. The reason was that reward signals during training accidentally reinforced this type of expression, and later training spread it to other scenarios. OpenAI subsequently removed the relevant reward signals and adjusted training samples and model instructions.
OpenAI’s relationship with Microsoft is moving from deep binding toward a more flexible cooperation structure. Microsoft still retains important rights and remains OpenAI’s major cloud partner, but OpenAI has gained greater cloud-neutral space.
AI phone rumors show that OpenAI may be trying to bypass the limitations of existing mobile operating systems and app distribution systems. For a company centered on agents, the smartphone is not only a hardware category. It is a combined gateway for user identity, sensors, app permissions, payments, notifications, and daily task flows.
The “goblin incident” places model governance into a more microscopic but more realistic layer: model output style is not only a prompt issue. It is also affected by reward models, personalization product design, and training data feedback loops.
Behind this group of developments, OpenAI’s changes can be summarized into three layers.
The first is cloud loosening. The new agreement does not cut off the relationship between OpenAI and Microsoft, but it gives OpenAI more room for commercial distribution. For enterprise customers, this means OpenAI products may become easier to adopt in different cloud environments in the future. Microsoft still retains important rights, but it is no longer the only channel through which OpenAI technology spreads.
The second is hardware gateway exploration. If AI agents are to truly participate in users’ daily tasks in the future, they cannot remain forever inside chat windows. Existing smartphone systems have strict limitations on permissions, background operation, cross-app actions, and data access. These are both safety boundaries and capability ceilings for agents. If OpenAI is truly pushing AI phones forward, it is essentially looking for a gateway more suitable for agent-native operation.
The third is refinement of model behavior governance. From small words like goblin and gremlin to more serious issues such as hallucination, bias, overreach, and misoperation, the underlying question is the same: how model behavior is shaped by training objectives, and how it can be continuously monitored and corrected after productization.
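To make “continuously monitored” a little more concrete, here is a minimal sketch of a style-drift check that compares how often a few watched words appear in sampled outputs from two model versions. Everything in it is an assumption for illustration: the watch list, the threshold, and the function names are hypothetical and do not describe OpenAI’s actual tooling.

```python
import re
from collections import Counter

# Hypothetical watch list; a real system would curate or learn this.
WATCHED_TERMS = {"goblin", "goblins", "gremlin", "gremlins"}

def term_rate(outputs, terms=WATCHED_TERMS):
    """Return watched-term occurrences per 1,000 words across sampled outputs."""
    total_words = 0
    hits = Counter()
    for text in outputs:
        words = re.findall(r"[a-z']+", text.lower())
        total_words += len(words)
        hits.update(w for w in words if w in terms)
    if total_words == 0:
        return 0.0
    return 1000.0 * sum(hits.values()) / total_words

def drifted(baseline_outputs, candidate_outputs, ratio_threshold=3.0):
    """Flag the candidate if its rate exceeds the baseline rate by the given ratio."""
    base = term_rate(baseline_outputs)
    cand = term_rate(candidate_outputs)
    # A small floor avoids division by zero when the baseline never uses the terms.
    return cand / max(base, 0.1) >= ratio_threshold
```

A check like this only catches vocabulary that someone already thought to watch. It is a tripwire for known quirks, not a substitute for auditing the reward signals themselves.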
OpenAI is moving from a “model company” toward a more complete AI platform company.
The Microsoft agreement adjustment solves cloud and commercial distribution flexibility. AI phone rumors point to personal terminal gateways. The goblin incident shows that model behavior governance has entered the level of product details.
The shared thread across this group of OpenAI developments is that the company is reducing its dependence on key external links.
Cloud determines whom it can serve. Hardware determines whom it can get close to. Model governance determines whether users can trust it over the long term. Loosening ties with Microsoft gives OpenAI more enterprise delivery and cloud deployment flexibility. Hardware exploration points to the possibility that AI agents may need a new native terminal in the future. The “goblin incident” reminds us that when models begin to have personalities, long-term memory, and tool-calling capabilities, every reward signal and default style may be amplified in large-scale usage.
However, hardware is not an easy path. The smartphone market has extremely high barriers in supply chain, channels, app ecosystems, and brand. Many past AI hardware products have already shown that a single novel feature is not enough to replace the smartphone. What OpenAI really needs to prove is whether an agent-native device can bring a sufficiently strong experience leap, instead of merely adding a smarter voice assistant to the phone.

As its multimodal gap is filled and the costs of long context and caching continue to fall, DeepSeek is pushing “cheap and usable” closer to production-grade agents.
After releasing DeepSeek-V4 Preview last week, DeepSeek continued to send two types of signals this week: one is capability completion, and the other is price reduction.
On the capability side, DeepSeek has started a staged rollout (gray testing) of an “image understanding mode” on the web and in its app. This means DeepSeek’s multimodal visual understanding is beginning to reach real users, rather than remaining only an expected item on the model roadmap. Based on current public user feedback, DeepSeek’s image understanding goes beyond basic image description. It also uses context for reasoning, follow-up questions, and self-correction. But it still makes mistakes in extreme tests such as counting fingers and identifying complex details.
On the cost side, DeepSeek officially announced that the input cache hit price across the full DeepSeek API series has been cut to 1/10 of the launch price. At the same time, DeepSeek-V4-Pro is still available at a 75% discount, with the promotion extended to May 31, 2026.
The core change of DeepSeek-V4 is not only a model capability update. It moves “low-cost long context + agent capability” into the foreground.
Image understanding gray testing completes the multimodal entry point, allowing DeepSeek to move beyond text reasoning, coding, and conversation into scenarios such as image Q&A, chart understanding, screenshot parsing, and multimodal retrieval.
Price reduction directly affects the cost structure of agent applications. Agents are not single-round Q&A systems. They continuously read system prompts, tool descriptions, historical context, knowledge base snippets, and code repository content. The lower the cache hit price, the easier it becomes to lower the real cost of long tasks and multi-round calls.
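A rough back-of-the-envelope model shows why the cache hit price matters so much for multi-round agents. The sketch below is an illustration under stated assumptions, not DeepSeek’s billing logic: the function name is hypothetical, the prices are placeholder units per million tokens rather than real list prices, and it simplifies by assuming the shared prefix (system prompt, tool descriptions) misses the cache on the first round and hits it on every later round, while ignoring the growth of conversation history.

```python
def agent_input_cost(rounds, prefix_tokens, new_tokens_per_round,
                     miss_price, hit_price):
    """Estimate the input cost of a multi-round agent task.

    Prices are per one million tokens, in placeholder units. The shared
    prefix is billed at the miss price in round 0 and at the hit price in
    every later round; newly added tokens are always billed at the miss price.
    """
    cost = 0.0
    for r in range(rounds):
        prefix_price = miss_price if r == 0 else hit_price
        cost += prefix_tokens * prefix_price / 1_000_000
        cost += new_tokens_per_round * miss_price / 1_000_000
    return cost

# Placeholder numbers only: 20 rounds, a 50,000-token prefix, 2,000 new tokens per round.
before = agent_input_cost(20, 50_000, 2_000, miss_price=1.0, hit_price=0.1)
after = agent_input_cost(20, 50_000, 2_000, miss_price=1.0, hit_price=0.01)
no_cache = agent_input_cost(20, 50_000, 2_000, miss_price=1.0, hit_price=1.0)
```

With these placeholder inputs, cutting the hit price to one tenth takes the estimate from roughly 0.185 to 0.0995, against about 1.04 with no caching at all. The longer the repeated prefix and the more rounds a task takes, the larger the share of cost the hit price controls.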
This round of DeepSeek moves can be summarized with two keywords: filling in and lowering.
What is being filled in is multimodal capability.
If DeepSeek’s previous advantages were mainly concentrated in reasoning, coding, and cost-effectiveness, then gray testing of image understanding mode means it is starting to enter a more complete multimodal product experience. For developers, visual understanding is also part of more complex agent workflows, such as document screenshot parsing, interface understanding, image quality inspection, and multimodal retrieval.
What is being lowered is agent usage cost.
When actually running agents, model fees do not only come from single Q&A sessions. They come from repeated cycles of planning, retrieval, tool calling, context reading, and result generation. DeepSeek lowering cache hit prices is essentially lowering the threshold for developers to try complex workflows.
DeepSeek is moving from a “low-cost reasoning model” toward a more complete agent infrastructure option.
Image understanding gray testing completes the multimodal entry point. V4 price cuts lower the costs of long context and high-frequency calls. For developers and small to medium-sized teams, this will directly change model selection logic: in the past, many people defaulted to Claude or GPT for complex workflows. Now DeepSeek will at least enter the comparison test list.
DeepSeek’s real pressure on the market may not come from whether it ranks first on every benchmark. It comes from the fact that it continuously lowers the usable cost of high-capability models.
In agent scenarios, model calls are not one-time expenses. They are continuous expenses. One task may require multiple rounds of planning, multiple retrievals, multiple tool calls, and multiple result checks. By combining V4-Flash, V4-Pro, long context, and cache pricing, DeepSeek is effectively telling developers: complex agents do not necessarily have to rely only on the most expensive closed-source models.
This will force other model vendors to re-explain their premium: is it stronger reliability, more stable tool calling, more mature enterprise security systems, or better ecosystem integration?
However, low price does not automatically mean production usability. Multimodal gray testing still needs more stable accuracy. Enterprise customers will also care about privacy, service stability, compliance, ecosystem compatibility, and failure support. What DeepSeek truly needs to prove next is whether it can turn price advantage into stable production-grade usage.

Screenshot of DeepSeek’s official Twitter price-cut announcement
When AI begins to truly call tools, operate systems, and access production data, safety issues move from abstract discussions into the engineering scene.
This week, AI programming tool Cursor triggered a typical production incident controversy.
Jer Crane, founder of the car rental software company PocketOS, said that a Claude-powered Cursor AI agent mistakenly called the Railway API during a test environment task, deleting the production database and backups within a short time and affecting customer reservations and business data. Multiple media outlets later followed up, and Railway subsequently fixed the related endpoint.
The reason this incident attracted attention is not only that “AI deleting a database” sounds frightening enough. It is that the exposed problems do not only belong to one model or one tool. Reports show that the incident involved multiple factors: over-privileged execution by the AI agent, insufficient safety confirmation in cloud platform APIs, overly broad token permissions, and insufficient isolation between backups and source data.
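The “overly broad token permissions” factor is the easiest one to picture in code. Below is a minimal sketch of a credential scoped to one environment and a fixed set of actions. The class and function names are hypothetical and are not Railway’s or Cursor’s API; the point is only that a token issued for a test task should be unable to authorize a production delete.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScopedToken:
    """Hypothetical credential restricted to one environment and a set of actions."""
    environment: str            # e.g. "staging" or "production"
    allowed_actions: frozenset  # e.g. frozenset({"read", "write"})

def authorize(token, environment, action):
    """Allow a call only if both the environment and the action are in scope."""
    return token.environment == environment and action in token.allowed_actions

# A token handed to an agent for a test-environment task.
staging_token = ScopedToken("staging", frozenset({"read", "write"}))
```

Under this design, `authorize(staging_token, "production", "delete")` is rejected on two independent grounds: the wrong environment and an action the token was never granted.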
At the same time, developer communities also discussed the availability risks of AI platforms. Some users claimed that an agricultural technology company with about 110 employees had its Claude organization account banned by Anthropic without warning and was unable to get an effective response within 36 hours.
The key issue in the PocketOS incident is not whether AI “makes mistakes,” but that it had real system permissions when it made the mistake. In the past, the risks of coding assistants were mostly limited to generating wrong code, introducing bugs, or misleading developers. But once an agent can call APIs, access cloud resources, and execute delete operations, the risk escalates from “wrong suggestion” to “wrong direct execution.”
This also explains why scholars such as Hinton continue to emphasize the importance of AI safety. In recent reports, Hinton again called for stricter regulation of AI and warned that whether humans can coexist with superintelligent AI remains uncertain. On the other hand, UNCTAD previously predicted that the global AI market could reach $4.8 trillion by 2033. The gap between rapid market expansion and insufficient safety investment will make similar problems more realistic.
This type of incident reminds us that AI agent product capability actually has two sides.
One side is enhanced automation capability.
Tools such as Cursor, Claude Code, and Codex are pushing AI programming from “helping people write code” toward “helping people complete tasks.” They can read projects, modify files, run commands, call tools, analyze errors, and even participate in deployment and operations workflows. The more obvious the efficiency improvement, the more likely teams are to give them more permissions.
The other side is that safety boundaries must be rebuilt.
Traditional software systems assume that humans are the final operators and tools only execute clear instructions. But agents autonomously decompose steps, search for paths, and call interfaces under task goals. At this point, permission management, environment isolation, human confirmation, audit logs, rollback mechanisms, and backup strategies are no longer just operations details. They become part of AI agent products.
AI safety is moving from the grand question of “whether models will threaten humanity” to everyday engineering questions such as “whether an agent has permission to delete a database, whether backups are isolated, whether APIs require confirmation, and whether enterprise accounts can be blocked by a platform as a single point of failure.”
Although this type of incident will not stop the adoption of AI coding and agents, it will push enterprises to redefine rules for AI use in production environments: least privilege, read-only first, production environment isolation, secondary confirmation for dangerous operations, offsite backups, and complete auditing will all become basic requirements before agents go live.
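Two of these requirements, secondary confirmation for dangerous operations and complete auditing, can be sketched in a few lines. This is an assumption-laden illustration rather than how any particular product works: the action names, the `ConfirmationRequired` exception, and the `run_action` wrapper are all hypothetical.

```python
import time

# Hypothetical set of action names treated as destructive.
DANGEROUS_ACTIONS = {"delete_database", "delete_backup", "drop_table"}

class ConfirmationRequired(Exception):
    """Raised when a dangerous action lacks explicit human approval."""

def run_action(action, target, execute, approved_by=None, audit_log=None):
    """Run `execute()` only if the action is safe or a human has approved it.

    Every attempt is appended to the audit log, including blocked ones.
    """
    audit_log = audit_log if audit_log is not None else []
    entry = {"time": time.time(), "action": action, "target": target,
             "approved_by": approved_by, "status": "blocked"}
    audit_log.append(entry)
    if action in DANGEROUS_ACTIONS and approved_by is None:
        raise ConfirmationRequired(f"{action} on {target} needs human approval")
    result = execute()
    entry["status"] = "executed"
    return result
```

The log entry is written before the permission check on purpose, so that blocked attempts remain visible to administrators instead of disappearing silently.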
After all, the real risk of AI agents is not that they occasionally make mistakes. It is that they may make mistakes at high speed in high-permission environments. Human engineers also accidentally delete data, misconfigure permissions, and make mistakes in production environments. But humans usually pause before execution, confirm, ask colleagues, or get blocked by process. The problem with agents is that they can complete a chain of wrong judgments within seconds and apply the error directly to real systems through APIs. And an agent cannot “take the blame” either. LOL.
Therefore, the question enterprises need to solve next is not “whether to use AI agents,” but “within what boundaries to use them.” AI agents are better suited to first enter low-risk, highly repetitive, rollback-friendly tasks. For operations involving production databases, customer data, payment systems, permission systems, and infrastructure deletion, stricter human confirmation and system-level protection should be required by default.
This will also change the competitive standards for AI programming tools. In the past, people mainly compared generation quality, context length, code understanding ability, and speed. Next, safety capability will become increasingly important. A mature AI coding product must not only know how to write code. It must also know what it should not directly do, when it must stop, which operations require human confirmation, and how to let enterprise administrators see, restrict, and trace its behavior.






