



Cover Story: The big whale is raising money!?
Rumor has it that DeepSeek is about to claim a new “first”: the first Chinese large-model company to close a record-breaking financing round. There has been no official announcement, and the amount and its allocation among investors may not be finalized yet, but it is all but certain that DeepSeek has formally started raising money. Whether this has anything to do with a certain “Bao” rushing out a paid version, I cannot say.
In any case, this signals that the commercialization and real-world deployment of domestic large models is about to enter a new round of acceleration. Compared with the rather unimpressive results of that certain “Bao,” what DeepSeek does once the money arrives is well worth looking forward to.

Hey, my friend 😊, welcome to “There Is Something New Under the AI Sun”, a weekly newsletter produced by the algorithm team of JoinAI|Zhuoyin Intelligence.
We bring the technical perspective and restraint of “AI builders” to pick each week’s Top 3 papers, projects, and industry updates for you. Here, we do not chase the illusion of trending-topic traffic; we only track technologies and trends that truly deserve attention. And we will not only promote the good side of AI; we will also expose its problems.
ByteDance Seed|Cola DLM: Language models begin moving toward continuous latent space
Cola DLM splits text generation into two layers: “global semantic organization” and “local text realization.” It first uses a Text VAE to map text into a continuous latent space, then uses a block-causal DiT to learn the latent prior, and finally generates text through a decoder. The signal it releases is: language-model architecture innovation is becoming active again; autoregression is no longer the only option imaginable; continuous latent-space diffusion may become a new interface connecting text, image, and video generation.
Robot World Model|Semantic latent is more critical than pixel reconstruction
This paper systematically compares reconstruction-type and semantic-type latent spaces in robotic diffusion world models. The results show that reconstruction latents such as VAE/Cosmos have advantages on pixel metrics, but semantic latents such as V-JEPA 2.1, Web-DINO, and SigLIP 2 are stronger in action recovery, task-success judgment, CEM planning, and policy-in-the-loop evaluation. The signal it releases is: robotic world models cannot be judged only by whether the generated video looks real; what also matters is whether the latent preserves structures related to actions, task progress, and policy evaluation.
AutoTTS|LLMs begin to automatically discover test-time compute allocation strategies
AutoTTS changes test-time scaling from manually designed heuristic rules into letting a coding agent automatically search for a controller inside an offline replay environment: when to branch, continue, probe, prune, and stop are all gradually discovered by the LLM through execution-trajectory feedback. The signal it releases is: future improvements in LLM reasoning ability will not only rely on larger models or longer CoT, but will increasingly depend on test-time compute scheduling strategies automatically discovered by agents.
NanoWhale|The DeepSeek-V4 architecture is compressed into a small teaching model
NanoWhale compresses cutting-edge large-model structures such as MLA, MoE, Hyper-Connections, and MTP into a small model of about 110M parameters, and provides training, fine-tuning, evaluation, and chat scripts. The signal it releases is: frontier large-model architectures are being broken down into smaller, more open, and more teachable experimental units. Understanding large models does not necessarily mean only reading technical reports; it can also start from the smallest closed loop that can actually run.
Ruflo|Claude Code begins moving toward multi-agent orchestration
Ruflo extends coding agents such as Claude Code / Codex into a multi-agent orchestration framework, adding swarm, persistent memory, MCP tools, GitHub ops, code review, and SPARC workflow. The signal it releases is: the competition in AI Coding is moving beyond a single model writing code and further down the stack, into multi-agent collaboration, long-term memory, tool governance, and runtime infrastructure.
Sulphur 2|Open-source video generation continues to impact local workflows
Sulphur-2-base is based on the LTX ecosystem, supports text-to-video and image-to-video, and is oriented toward ComfyUI, local GPUs, LoRA, quantization, and community-derived workflows. The signal it releases is: open-source video generation is moving from one-off model releases into ecosystem competition around “model + workflow + LoRA + quantization + community nodes.”
DeepSeek|Financing rumors heat up; China’s large models enter a heavy-asset stage
Rumors around DeepSeek’s first-round financing keep building, with valuation discussions possibly reaching as high as the USD 45 billion to USD 50 billion range. There is no official confirmation yet, but the matter is already enough to show that domestic large models are moving from a narrative of low-cost breakthrough into a long-term competition over compute, talent, commercialization, and ecosystem building. The signal it releases is: after a large-model company achieves its technical breakout, what it really has to face is continuous iteration, organizational expansion, and deployment in industry.
Anthropic|Claude charges into Office; enterprise workflow platformization accelerates
Claude’s integration into Microsoft 365, Dreaming’s ability to let agents organize memory and experience during task gaps, and compute cooperation that supplements larger-scale inference resources all point in one direction. The signal it releases is: Anthropic does not want to defend only the code battlefield, but is pushing Claude toward office entry points, long-term memory, and the enterprise process collaboration layer.
OpenAI|Codex expands, GPT-5.5 reduces hallucinations, and the push beyond the chat box continues
This week, OpenAI continued to add workflow entry points around the default model, Codex, Chrome, Slack, and real-time voice. GPT-5.5 Instant strengthens reliability; Codex enters the browser and Slack; the Realtime series fills in low-latency voice interaction. The signal it releases is: OpenAI keeps extending ChatGPT and Codex from conversational products into real workplaces.

How to do it:
When you ask AI to help you research, do competitor analysis, or find papers, tools, or cases, do not only say “help me find similar content.” A better approach is to first tell AI your task goal, then let it search, compare, verify, and filter according to that goal. For example, you can ask like this: “I am not trying to find semantically similar articles, but evidence that can support this judgment. Please first break down my question, then list the keywords, reverse keywords, and verification standards to search with, and finally give me the most useful materials.” If you are doing topic selection, you can also require AI not only to find similar content, but to find materials that “can refute this topic,” “can supplement cases,” and “can prove market demand.”
Why it works:
The inspiration from TIGER-Lab’s Beyond Semantic Similarity is that Agentic Search cannot remain only at “semantic-similarity retrieval.” Traditional RAG is more like finding text most similar to the question, but what an agent truly needs is information that can support the next step of reasoning, verification, and decision-making. For ordinary users, the key point is: do not treat AI search as “finding similar webpages,” but treat it as “finding evidence with a task in mind.” This makes it easier for AI to find truly useful materials, instead of piling up a bunch of things that look relevant but are not helpful for your judgment.
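The prompt pattern above can be kept as a small reusable template. This is only a sketch: the helper name `build_search_prompt` and its parameters are made up for illustration, and the example goal and claim are placeholders.

```python
def build_search_prompt(goal, claim, evidence_roles):
    """Assemble a goal-driven research prompt (hypothetical helper)."""
    lines = [
        f"My goal: {goal}",
        f"The judgment I want to test: {claim}",
        "I am not looking for semantically similar articles, but for evidence.",
        "First break down my question, then list the keywords, reverse keywords,",
        "and verification standards to search with.",
        "Finally, give me the most useful materials, grouped by role:",
    ]
    # One bullet per kind of evidence the search should bring back.
    lines += [f"- materials that {role}" for role in evidence_roles]
    return "\n".join(lines)


prompt = build_search_prompt(
    goal="decide whether this topic is worth a full article",
    claim="open-source video models are ready for local workflows",
    evidence_roles=[
        "can refute this topic",
        "can supplement cases",
        "can prove market demand",
    ],
)
print(prompt)
```

Stating the goal and the evidence roles up front is what turns the request from "find similar pages" into "find evidence for a task."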

Link:https://arxiv.org/abs/2605.06548
Recommendation index: 🌟🌟🌟🌟🌟
One-sentence guide:
Cola DLM proposes a continuous latent-space diffusion language model, splitting text generation into two layers: first modeling global semantics in continuous latent space, then generating local text through a decoder, thereby providing a new scaling route beyond autoregressive language models.
⚽️ Recommendation reason:
Traditional autoregressive models generate token by token from left to right. Their advantages are a clear objective and stable training, but inference is naturally serial, and the generation process is bound to a fixed token order. Cola DLM’s idea is more like first generating a “semantic plan,” then generating the surface form of the text: it first uses a Text VAE to map text into continuous latent space, then uses block-causal DiT to learn the latent prior, and finally outputs text through a conditional decoder. The paper calls this process latent prior transport, meaning that the diffusion process is no longer responsible for recovering tokens, but for transporting and organizing global semantics. This design is interesting because it pushes language modeling from token-level recovery toward semantic-level prior modeling, and also creates a more natural unified interface between text and continuous modalities such as image and video.
📚 Background notes:
The autoregressive paradigm still dominates language models: AR models generate text token by token through a chain-rule factorization of the sequence probability. Their training objective is clear, and they have already proven strong scaling ability. But left-to-right generation brings serial inference costs and also limits non-monotonic generation tasks such as infilling, local editing, and global reorganization.
Diffusion language models have not fully solved the problem: Discrete diffusion can get rid of fixed left-to-right order, but it usually still performs observation recovery in token or mask states, with high sampling cost, and it is not easy to stably express global semantic structures.
Continuous latent space provides a new interface for language modeling: Cola DLM compresses text into continuous latent space, lets the diffusion model fit the latent prior, and then lets the decoder realize the text. This separates global semantic organization from local text generation.
📌 Key takeaways:
Text VAE is responsible for building a stable mapping from text to latent: In the first stage, Text VAE is trained using reconstruction loss, KL loss, and BERT-style mask loss to establish a latent-text correspondence, avoiding the latent only recording surface information, and also avoiding the decoder purely memorizing text.
Block-causal DiT is responsible for learning the continuous latent prior: In the second stage, DiT performs prior learning in latent space and adopts a block-causal mechanism: within each block, bidirectional modeling is allowed; between blocks, causal dependency is maintained, thereby balancing local parallelism and cross-block sequential structure.
Experiments show that latent needs co-evolution and semantic smoothing: The paper compares strategies such as fixed VAE, joint DiT, all scratch, and interval update. Results show that the most effective strategy is to let the latent space and DiT co-evolve on top of a stable pretrained VAE initialization. BERT-style loss, suitable latent dimensions, noise schedule, and block size all significantly affect performance.
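The block-causal rule in the second takeaway (bidirectional inside a block, causal across blocks) can be written down as an attention mask. The sketch below is not the paper's implementation; the sequence length and block size are arbitrary illustrative values.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when latent position i may attend to position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Same block: visible in both directions.
            # Earlier block: visible (causal). Later block: hidden.
            row.append(j // block_size <= i // block_size)
        mask.append(row)
    return mask


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

With a block size of 1 this reduces to an ordinary causal mask, and with a block size equal to the sequence length it becomes fully bidirectional, which is why block size acts as a dial between parallelism and sequential structure.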

Cola DLM overview
Cola DLM pushes the competition of language models from “how to predict the next token” to “how to organize semantic latent.” It does not treat continuous diffusion as a generation trick, but places it inside a hierarchical latent-variable model: latent prior is responsible for global semantics, and decoder is responsible for local text realization. This division of labor makes language generation look more like latent diffusion in image/video generation, and also provides a more natural technical route for unifying continuous modalities such as text, image, and video in the future.
This paper also gives a very strong directional signal: autoregression need not remain the only route for scaling language models. At around the 2B-parameter scale, the paper compares against strictly matched autoregressive and LLaDA baselines, and gives scaling curves up to around 2000 EFLOPs, showing that continuous latent diffusion language models already have scaling potential worth discussing. More importantly, the authors also specifically discuss the mismatch between perplexity and generation quality, emphasizing that the ability of such models may not be judged only by traditional likelihood metrics.
Its boundary is also obvious. Cola DLM is still at the stage of research-paradigm verification. Its training system is much more complex than that of ordinary AR models, requiring coordinated tuning of the Text VAE, DiT prior, decoder, noise schedule, block size, latent dimension, and many other parts; and at this stage it does not prove that it can comprehensively replace AR models. A more accurate judgment is that this paper gives a very clear alternative direction: future language models may move from pure token-sequence modeling toward a hierarchical generation paradigm of “continuous semantic latent space + conditional text decoding.” For people who follow non-autoregressive language models, diffusion language models, and unified multimodal generation foundations, this paper is well worth reading.
Link:https://arxiv.org/abs/2605.06388
Recommendation index: 🌟🌟🌟🌟🌟
One-sentence guide:
This paper systematically compares two types of latent space in robotic diffusion world models: one that pursues pixel reconstruction, and another that comes from semantic representations such as V-JEPA 2.1, Web-DINO, and SigLIP 2. The results show that high visual fidelity does not mean greater suitability for robot control; a truly useful world model needs, above all, to preserve structures related to actions, task progress, and policy evaluation.
⚽️ Recommendation reason:
This paper cuts into a very key design problem in robot world models: should a world model prioritize “drawing realistically” or “being useful for control”? In the past, many latent diffusion world models used reconstruction-type latents such as VAE by default, because they decode stably and have good image quality. But robotic tasks care about more than whether future frames are realistic; they also include how actions will change objects, whether the task is progressing, and whether policy evaluation in the generated world is trustworthy. The paper fixes the DiT transition model, action conditioning, and training protocol on Bridge V2 real robot manipulation data, changing only the encoder-defined latent space, and systematically compares SD3 VAE, VA-VAE, Cosmos, V-JEPA 2.1, Web-DINO, and SigLIP 2. The result is clear: reconstruction latents such as VAE/Cosmos have advantages on pixel metrics, but semantic latents such as V-JEPA 2.1, Web-DINO, and SigLIP 2 are overall stronger in action recoverability, success classification, CEM planning, policy-in-the-loop success, and OOD robustness. This conclusion is very important for Physical AI: the latent of a robot world model should not be chosen only by image-generation metrics, but by whether it preserves action-related structures.
📚 Background notes:
Robot world models are moving toward latent diffusion: The paper focuses on action-conditioned video world models, which use historical observations and actions to predict future observations, and then act as proxy environments for robot policy evaluation or planning. As such models increasingly adopt LDMs, the choice of latent space directly affects the dynamics the model learns.
Reconstruction-type latent is not necessarily suitable for control: VAE-type latents are better at preserving pixel details and stable decoding, but robot control cares more about object state, contact relations, action effects, and task progress. A visually reasonable image does not mean the rollout is reliable for a policy.
Semantic representations are beginning to enter robot world models: Pretrained representation encoders such as V-JEPA 2.1, Web-DINO, and SigLIP 2 can more directly expose object layout and task structure, but have higher dimensions and make diffusion training harder. The paper uses methods such as wide-head, noise schedule shift, and S-VAE adapter to make semantic latent diffusion trainable in robotic tasks.
📌 Key takeaways:
Fixed model, only changing latent space: The paper controls the data, history length, action conditioning, DiT transition model, optimizer, and training schedule, and only changes the encoder, adapter, and decoder path, comparing the impact of reconstruction-type latent and semantic-type latent on robotic world models. Figure 1 shows this comparison framework.
Semantic latent is more suitable for actions and tasks: In DiT-S experiments, V-JEPA 2.1, Web-DINO, and SigLIP 2 are overall ahead in VLA success rate, OOD robustness, CEM action error, IDM Pearson r, and success classifier accuracy. Both Table 1 and Table 2 show that semantic spaces more easily preserve action recoverability and task-success information.
Visual metrics cannot determine world-model quality alone: Reconstruction latents such as VAE and Cosmos remain competitive on pixel or image-quality metrics, and their visual metrics can catch up further as model scale increases. But they may still lag behind on action recovery and policy evaluation. The paper’s final practical advice is: first choose a latent that can express action and task progress, then improve visual quality with the decoder and adapter.
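One of the metrics named above, IDM Pearson r, measures how well the actions recovered from generated rollouts track the true actions. A minimal sketch of the statistic follows; the action values are toy numbers invented for illustration, not results from the paper, and `rollout_a` / `rollout_b` are hypothetical names.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: ground-truth actions for one action dimension, and the actions an
# inverse dynamics model reads back from two different generated rollouts.
true_actions = [0.10, -0.20, 0.30, 0.00, -0.10]
rollout_a = [0.12, -0.18, 0.27, 0.01, -0.12]  # actions stay recoverable
rollout_b = [0.05, 0.10, 0.10, -0.05, 0.02]   # action signal mostly lost

r_a = pearson_r(true_actions, rollout_a)
r_b = pearson_r(true_actions, rollout_b)
print(f"rollout_a r = {r_a:.3f}, rollout_b r = {r_b:.3f}")
```

A rollout can look sharp and still score low here, which is the gap between visual metrics and control-relevant metrics that the paper highlights.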

Comparison framework diagram for latent spaces in robotic world models
This paper pulls the evaluation standard of robot world models back from “does the generated video look real” to “can this world be used for control.” Many video world models can easily mislead people with visual effects: the frames are coherent, objects are clear, textures are stable, and they look as if the environment dynamics have been learned. But robot control needs predictable relationships between actions and states. If the latent space does not organize information such as grasping, contact, object position, and task progress well, then even a beautiful rollout may just be a good-looking hallucination.
The signal given by this work is very important: the next step of robotic world models may increasingly rely on semantic representation spaces, rather than traditional VAE-style reconstruction spaces. Encoders such as V-JEPA 2.1, Web-DINO, and SigLIP 2 are inherently better at extracting object structures, action changes, and task semantics. Using them as the latent interface of diffusion world models can make the model closer to “a world usable for policy evaluation,” rather than only “a world decodable into video.” This aligns with a major recent trend in Physical AI: the world model cannot only be a generator; it must also become an intermediate layer for policy, planning, and evaluation.
The experiments are mainly concentrated on Bridge V2 and WidowX 250 robot manipulation data, so the tasks and embodiment are still relatively limited; policy-in-the-loop evaluation also partly depends on a VLM success judge. But it gives a clear judgment: the future competition of robotic world models is not only competition over video quality, but competition over latent-space selection, action-structure fidelity, and policy-evaluation credibility.
Link:https://arxiv.org/abs/2605.08083
Recommendation index: 🌟🌟🌟🌟🌟
One-sentence guide:
AutoTTS changes test-time scaling from manually designed heuristic rules into letting an agent automatically search for controllers inside a replayable environment: when to branch, continue, probe, prune, and stop are all discovered by the LLM inside an offline trajectory environment.
⚽️ Recommendation reason:
This paper is worth paying attention to because it changes the research method of TTS. In the past, test-time scaling usually meant that researchers wrote strategies by hand: how many extra chains to sample, when to stop early, when to prune, when to deepen reasoning. AutoTTS goes one step further: humans should not keep handcrafting every reasoning strategy, but should design a searchable environment and let the LLM itself discover better compute allocation algorithms. The paper builds width-depth TTS into an offline replay environment, pre-collecting reasoning trajectories and probe signals. After that, evaluating a candidate controller no longer requires repeated calls to the large model; it replays in the offline environment at low cost. This allows coding agents such as Claude Code to propose controllers over multiple rounds, view execution trajectories, diagnose failure causes, and continue modifying code. The strategies ultimately discovered achieve a better accuracy-cost tradeoff than manually designed SC@64, ASC, ESC, and Parallel-Probe on multiple mathematical reasoning benchmarks, and the whole discovery process costs only about USD 39.9 and 160 minutes.
📚 Background notes:
TTS has become an important route for improving reasoning ability: Giving models more test-time compute can improve answer quality through multi-branching, multi-step reasoning, repeated sampling, pruning, and voting. But the key question is how exactly this compute should be allocated.
Existing strategies are mostly still manual heuristics: Methods such as SC@64, ASC, ESC, and Parallel-Probe are essentially manually walking different paths in the width-depth space: some perform fixed multi-sampling, some expand horizontally, some deepen vertically, and some probe and prune at the same time. Figure 2 of the paper draws these strategies uniformly as different trajectories in the width-depth control space.
Agentic discovery begins entering reasoning-strategy design: AutoTTS borrows ideas from Meta-Harness-like work and hands the algorithm-design problem to a coding agent. But the key is to first make the environment cheap, replayable, and rich in feedback. Otherwise, every search would require real LLM calls, and the cost would quickly get out of control.
📌 Key takeaways:
Offline replay environment lowers search cost: The paper first pre-samples multiple reasoning trajectories for each question and cuts them into probe points at fixed token intervals. Later, when the controller performs branch, continue, probe, prune, and answer actions, it is only reading pre-collected data and does not need to call the base LLM again.
Beta parameterization compresses the search space: To avoid agents inventing a bunch of hard-to-tune hyperparameters, AutoTTS requires each controller to expose only one β, while all other internal thresholds are monotonically mapped from β. This turns the search from high-dimensional hyperparameter tuning into a one-dimensional trade-off scan, reducing overfitting to the search set.
Execution-trajectory feedback helps the agent improve strategies: The paper does not only give final accuracy/token results, but also writes the concrete behavioral trajectories of each controller into history, allowing the coding agent to see exactly where it pruned too early, saved too many tokens, or failed to allocate compute to promising branches. Table 3 also shows that after removing execution traces, the discovered strategy uses more tokens and performs worse.
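The β parameterization in the second takeaway can be pictured as one knob that drives every internal threshold monotonically. The mapping below is a hypothetical sketch: the threshold names, ranges, and linear forms are assumptions for illustration, not the controllers found in the paper.

```python
def thresholds_from_beta(beta):
    """Map a single knob beta in [0, 1] to internal thresholds (hypothetical mapping)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return {
        # Larger beta: spend more compute on width.
        "max_branches": 4 + round(60 * beta),
        # Larger beta: require more agreement before stopping.
        "stop_agreement": 0.60 + 0.35 * beta,
        # Larger beta: prune less aggressively.
        "prune_margin": 0.50 - 0.40 * beta,
    }


# Scanning beta traces a one-dimensional accuracy-cost trade-off curve.
for beta in (0.0, 0.5, 1.0):
    print(beta, thresholds_from_beta(beta))
```

Because every threshold moves in one direction as β grows, the agent cannot quietly tune several free parameters to fit the search set.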

AutoTTS framework overview
The core inspiration of AutoTTS is to push test-time scaling from “strategy engineering” to “environment engineering.” In the past, researchers spent a lot of effort designing rules for branch, prune, and stop; this paper shows that as long as the environment is cheap enough, the feedback is detailed enough, and the search space is controllable enough, LLMs can discover a better reasoning controller by themselves. This direction is very close to Meta-Harness: what truly needs human input is not a specific heuristic, but reusable search environments, state definitions, action spaces, feedback mechanisms, and objective functions.
Its most interesting point is that it treats TTS as a kind of controller synthesis. Model reasoning is no longer “mechanical sampling under a given budget,” but a dynamic resource scheduling problem: which branches are worth continuing, which branches should be probed, which branches can be cut, and when the system should stop and aggregate answers. In the experiments, AutoTTS searches for controllers on AIME24, then transfers them to AIME25, HMMT25, and Qwen3 models of different scales, maintaining good generalization overall. At β=0.5, compared with SC@64, it reduces tokens by about 69.5% on average while keeping accuracy close; at β=1.0, it can further push peak accuracy in multiple settings.
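To make the scheduling view concrete, here is a toy controller running against pre-collected trajectories, so that evaluating it costs no model calls. Everything in it is an illustrative assumption: the trajectory data, the probe interval, and the simple agreement-based stopping rule are made up, and the real action space also includes branching and pruning.

```python
from collections import Counter

PROBE_INTERVAL = 100  # tokens between probe points (illustrative value)

# Hypothetical pre-collected data for one question: each trajectory lists the
# answer a probe would read out at successive probe points.
TRAJECTORIES = [
    ["7", "42", "42", "42"],
    ["42", "42", "42", "42"],
    ["13", "13", "42", "42"],
    ["42", "42", "42", "42"],
]


def replay_controller(trajectories, width, stop_agreement):
    """Advance `width` branches in lockstep and stop once enough probes agree."""
    branches = trajectories[:width]
    tokens = 0
    answer = None
    for depth in range(len(branches[0])):
        tokens += PROBE_INTERVAL * len(branches)  # replayed cost, no LLM call
        votes = Counter(branch[depth] for branch in branches)
        answer, count = votes.most_common(1)[0]
        if count / len(branches) >= stop_agreement:
            break
    return answer, tokens


full_budget = PROBE_INTERVAL * len(TRAJECTORIES) * len(TRAJECTORIES[0])
answer, tokens = replay_controller(TRAJECTORIES, width=4, stop_agreement=0.75)
print(answer, tokens, f"saved {1 - tokens / full_budget:.0%}")
```

Since the controller only reads stored data, thousands of candidate controllers can be scored cheaply, which is what makes agent-driven search affordable.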
The current instantiation mainly focuses on width-depth TTS control, and the action space is still relatively limited; the discovery process depends on frontier coding agents such as Claude Code, and whether open-source coding agents can achieve similar effects still needs verification. In addition, although offline replay makes search cheap, it also restricts the controller to making decisions only within pre-collected trajectories, which is still different from real online inference. Even so, this paper represents a very important direction: future improvements in LLM reasoning ability will not only rely on larger models or longer CoT, but will increasingly depend on test-time compute scheduling strategies automatically discovered by agents.

🦄 Recommendation reason:
NanoWhale is worth paying attention to because it is not another small model pursuing practical performance; instead, it compresses the full DeepSeek-V4 architecture into about 110M parameters, turning it into an open-source sample that can be trained, fine-tuned, deployed, and studied. The repository includes code, config, tokenizer, and pretrain / SFT / eval / chat / upload scripts; on the model side, it provides both base and SFT chat versions. What it replicates is not DeepSeek-V4’s weights but its structure: MLA, MoE, Hyper-Connections, MTP, and the DeepSeek-V4 tokenizer are compressed into an extremely small model trained from scratch. For researchers, students, and engineers, the meaning of such a project is direct: many frontier large-model architectures can usually only be “read in papers” and are hard to run end to end at low cost; NanoWhale turns them into a minimal closed loop that can be experimented with locally.

NanoWhale repo homepage
NanoWhale is more like an LLM architecture learning kit than a strong model for real business use. The Hugging Face model card shows that nanowhale-100m-base has about 110M parameters, including about 41M embedding parameters and about 69M non-embedding parameters. It uses 8 layers, hidden size 320, 8 attention heads, MQA-style KV head, MLA, 4 routed experts + 1 shared expert, top-2 routing, Hyper-Connections, and 1 layer of MTP. In training, the base model is pretrained on FineWeb-Edu for 5K steps, about 2.6B tokens; the SFT version is then instruction-fine-tuned using SmolTalk. This scale cannot possibly compete with true large models on capability, but it compresses many key DeepSeek-V4 structures into a scale that can be observed, modified, and reproduced in experiments. It is very suitable for understanding new architectures, debugging training scripts, doing teaching demos, or verifying the behavior of components such as MoE / MLA / MTP on small models.
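For readers new to the “4 routed experts + 1 shared expert, top-2 routing” layout, the sketch below shows the general idea on scalar inputs. It is not NanoWhale's code: the gating convention (softmax over the two selected logits) and the toy expert functions are assumptions, and the repository may normalize differently.

```python
import math


def top2_moe(x, router_logits, routed_experts, shared_expert):
    """Combine a shared expert with the two highest-scoring routed experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top2 = ranked[:2]
    # Softmax over the two selected logits only (an assumed convention).
    exps = [math.exp(router_logits[i] - router_logits[top2[0]]) for i in top2]
    weights = [e / sum(exps) for e in exps]
    out = shared_expert(x)  # the shared expert always runs
    for w, i in zip(weights, top2):
        out += w * routed_experts[i](x)  # only two routed experts run per token
    return out, top2, weights


# Toy experts on scalars; real experts are small feed-forward networks.
routed = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
shared = lambda x: 0.5 * x

out, chosen, weights = top2_moe(3.0, [0.1, 2.0, -1.0, 1.0], routed, shared)
print(chosen, [round(w, 3) for w in weights], round(out, 3))
```

At this scale the routing behavior can be inspected directly, which is the kind of experiment a teaching model makes cheap.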
Its boundary must also be stated clearly: NanoWhale is not a “mini DeepSeek-V4 capability substitute,” and even less a production-grade model. With 110M parameters, a short training run, and small-scale SFT, its focus is learnability, reproducibility, and experimentation, not final performance.
Link:https://arxiv.org/abs/2605.06548
🦄 Recommendation reason:
Ruflo is worth paying attention to because it targets a real problem after coding agents enter complex tasks: a single Claude Code is powerful, but complex projects often require planning, coding, testing, review, memory, GitHub operations, and multi-round collaboration. Ruflo’s positioning is to extend tools such as Claude Code / Codex from a single CLI into a multi-agent orchestration framework. It supports swarm orchestration, persistent memory, MCP tool calling, code review, GitHub ops, SPARC methodology, and other capabilities, and also provides a CLI and self-hosted Web UI. In the Web UI, users can talk with models such as Qwen, Claude, Gemini, and OpenAI while calling the same MCP tools as the CLI. This makes it more like an agent runtime / orchestration layer, not merely a Claude Code plugin.

Ruflo repo homepage
The next step of AI coding is not just making one model better at writing code, but organizing the division of labor, memory, tools, and collaborative processes of multiple agents. Ruflo puts components such as swarm, memory, skills, MCP, GitHub ops, security audit, and SPARC workflow into one framework, aiming to let agents keep collaborating around a complex project instead of starting from zero every time. For people who follow Claude Code, Codex, OpenClaw, and agent harnesses, Ruflo is well worth a look, because it represents the direction of coding agents moving from “single assistant” to “orchestratable team.”
At the same time, the Ruflo / Claude Flow ecosystem has many features, and the learning curve is relatively steep. In community issues, some people have already reported “too many choices, not knowing how to distinguish Ruflo from Claude Flow, and the status bar and skill system not being intuitive enough.” This shows that it is better suited to advanced users and teams willing to tinker with agent workflows, and is not yet a one-click tool for ordinary developers.
Link:https://huggingface.co/SulphurAI/Sulphur-2-base
🦄 Recommendation reason:
Sulphur-2-base shows open-source video-generation models continuing to move toward “local controllability, modifiable workflows, and community extensibility.” The model card positions it as a video-generation model based on LTX 2.3, supporting text-to-video and image-to-video, while being compatible with other LTX 2.3 formats; community introductions also generally discuss it as part of a new wave of popular open-source AI video-generation models. Its core attraction is not a one-click product experience, but giving advanced users and ComfyUI / LTX workflow players more control: local running, LoRA integration, derivative merges, and continued tinkering around prompt enhancers and quantized versions. For creators, the meaning of such models is direct: video generation no longer depends only on closed platforms; it can also enter local GPUs, open-source nodes, workflow templates, and community fine-tuning ecosystems.

Sulphur-2-base introduction
In terms of positioning, Sulphur-2-base is more suitable for AI video creators, advanced ComfyUI users, and local video-generation workflow players, rather than an ordinary user-facing one-click generation tool. It is based on the LTX ecosystem, supports T2V / I2V, and itself carries prompt-enhancer-related components. The community has already produced derivative projects such as prompt enhancer GGUF, i2v merge, quantization, and ComfyUI-related variants. This shows that its real value is not only the model weights themselves, but the workflow ecosystem formed around the model: some people make enhanced prompts, some make I2V merge, some make quantization, and some make nodes and templates. This is exactly the typical path of rapid iteration for open-source video models.
The user group of Sulphur-2-base is also biased toward power users: it requires a relatively strong GPU, familiarity with ComfyUI / LTX workflows, and the ability to handle prompts, parameters, LoRA, and inference optimization on one’s own. Medium community articles also mention that such models are more suitable for local video research, AI short films, cinematic images, and experimental video processes, but are not friendly to low-end devices and beginners. Overall, it earns its place on this week’s project list, because the signal it releases is clear: open-source video generation is moving from “model release” toward ecosystem competition of “model + workflow + LoRA + quantization + community derivatives.”

If the rumors are true, this will be one of the largest financing rounds so far in China’s large-model field, and it also means DeepSeek is moving from a low-cost legend toward more realistic, asset-heavy competition.
Rumors about DeepSeek’s first financing round continue to build. Reuters, FT, WSJ, and other outlets have reported that DeepSeek is pursuing its first external financing, at a valuation that may reach USD 45 billion to USD 50 billion; other outlets have cited sources saying the fundraising target may be as high as RMB 50 billion. If the round closes, it will be one of the largest so far in China’s large-model field.
Note that DeepSeek has not officially confirmed any of this so far, and the reason is unclear.
What seems close to certain, though, is that DeepSeek is raising money, and that fundraising means both its company strategy and its narrative will change. Over the past year, the strongest public impression of DeepSeek was that it trained high-performance models at relatively low cost, breaking the assumption that large models “can only be piled up with massive compute power.” But if the first round does close, DeepSeek will also enter a more realistic competitive logic: model iteration needs compute, top talent needs to be retained, commercialization needs teams and channels, and the open-model ecosystem needs continuous investment.
Reports from multiple outlets indicate that the round may bring in China’s national-level industrial funds and investors such as Tencent, and that the proceeds may be used to strengthen computing infrastructure and raise employee compensation.
DeepSeek has long presented an image of “little financing, heavy research, low cost.” Backing from High-Flyer Quant and Liang Wenfeng’s personal investment made DeepSeek look more like an idealistic research laboratory than a typical asset-heavy AI company.
But with the release of V4, the build-out of multimodal and agent capabilities, and the expansion of its API and developer ecosystem, DeepSeek’s operating costs and organizational pressure are also rising. Low-cost training does not equal low-cost competition. Once models enter the stage of continuous iteration and commercialization, compute, talent, and channels all become long-term investment items.
There has always been a very sensitive question around DeepSeek: after the low-cost miracle, what comes next? Previously, one of DeepSeek’s most impactful aspects was that it made the industry rethink “whether model capability can only be determined by compute scale.” This is especially important for China’s AI industry, because it provides a more imaginative path: in an environment where compute is restricted, chips are restricted, and capital is more cautious, it is still possible to build globally competitive models through algorithms, engineering, and open-source ecosystems.
But the long-term competition of large-model companies does not only happen in the training stage. After a model is released, it still has to face a whole system of issues including user growth, inference cost, API stability, multimodal expansion, enterprise delivery, developer ecosystem, compliance, and safety evaluation. Any company that enters this stage will inevitably move from “research-driven” toward “system competition.”
This is the real turning point behind the DeepSeek financing rumors: it is not simply moving from idealism to commercialization, nor from low cost to burning money, but moving from “proving the model can be built” into “proving the company can keep running.”
For the domestic large-model industry, this may mark the beginning of a new stage. In the past, discussion of DeepSeek centered on how it challenged OpenAI, how it affected Nvidia, and how it changed the open-source model landscape. Next, the market may care more about whether it can iterate stably, retain core talent, generate revenue, and establish real advantages inside the domestic compute ecosystem.
If DeepSeek’s financing succeeds, it may become a landmark event for China’s large models moving from technical breakthrough toward industrial competition. It will also remind the industry: the efficiency revolution of large models is very important, but after the efficiency revolution, one still has to return to longer-term issues such as company building, capital investment, and ecosystem competition.

DeepSeek gains capital attention
From Microsoft 365 to Dreaming, and then to SpaceXAI compute cooperation, Claude’s next step is moving from developer tools toward an enterprise workflow platform.
Anthropic’s recent updates all point toward one direction: Claude is expanding from a “code assistant” into a more complete enterprise workflow platform.
First is the office entry point. Anthropic has integrated Claude into the Microsoft 365 ecosystem, allowing users to call Claude inside Excel, PowerPoint, and Word, while Outlook support has entered public beta. This means Claude no longer lives only on the web or inside IDEs, but is beginning to embed itself in larger-scale daily office scenarios.
Second is agent memory and self-improvement. At its developer conference, Anthropic launched Dreaming, which lets Claude Managed Agents use the gaps between tasks to review past work, organize memory, summarize experience, and reuse those patterns in later tasks. VentureBeat reported that after legal AI company Harvey integrated Dreaming, its task completion rate increased roughly sixfold.
Third is compute. SpaceXAI and Anthropic reached a compute agreement that gives Claude access to Colossus 1 cluster resources. xAI’s official page states that Colossus 1 contains more than 220,000 NVIDIA GPUs; Reuters reported that the deal will provide Anthropic with more than 300 MW of new compute capacity and support higher usage limits for Claude Code and the Claude API.
Anthropic’s current round of expansion can be summarized into three lines: office, memory, and compute.
The office line solves the problem of entrance growth.
Claude Code has already given Anthropic strong recognition among developers, but the truly high-frequency work inside enterprises is not only writing code. Email processing, document writing, spreadsheet analysis, and report generation are also the knowledge-work scenarios where AI can most easily be put to use. Integrating into Microsoft 365 effectively brings Claude into a much larger workplace.
The memory line solves the problem of long-term tasks.
If an agent can only complete one-off tasks, it is more like a tool; if it can summarize experience from past tasks, update preferences, and discover patterns, it becomes closer to an assistant. The value of Dreaming is precisely that it lets agents not only execute tasks, but begin accumulating continuity across tasks.
The compute line solves the problem of scale.
The growth in Claude Code usage means Anthropic needs not only stronger models, but also more abundant inference resources. SpaceXAI’s Colossus 1 compute cooperation directly corresponds to usage limits, API throughput, and larger-scale developer demand.
Anthropic’s strongest growth lever over the past period has been Claude Code. It gave Claude a very clear position in developer mindshare: strong reasoning, strong coding, and suitable for complex engineering tasks. But if it only stays on the code battlefield, Claude’s market boundary will still be limited.
The Microsoft 365 integration is actually helping Claude open a second battlefield. Compared with IDEs, Office has a larger user scale, higher usage frequency, and is closer to daily enterprise workflows. An employee may not necessarily write code every day, but is very likely to handle emails, write documents, read spreadsheets, and make reports. After Claude enters these scenarios, it will no longer only face programmers, but a broader group of knowledge workers such as product managers, operations staff, finance, legal, sales, and managers.
Dreaming, meanwhile, bears on whether agents can truly become “long-term colleagues” and gradually reduce their dependence on human review. The problem with many AI tools today is that they are smart every time, yet every time they behave as if it were their first day at work. They can complete the task in front of them but struggle to retain organizational habits, project preferences, document standards, and historical decisions. Dreaming targets exactly this discontinuity between tasks. During idle time, it lets agents organize experience, merge duplicate memories, update outdated information, and carry what they have learned into the next task.
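The consolidation step described above (merging duplicate memories, replacing outdated information, and carrying the result into the next task) can be pictured with a small sketch. This is not Anthropic's implementation or API: the `Memory` record, the `consolidate` and `build_context` functions, the "newest entry per key wins" rule, and the sample entries are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Memory:
    """Hypothetical memory record; not a real Claude data structure."""
    key: str         # topic identifier, e.g. "report.format"
    value: str       # what the agent learned about that topic
    updated_at: int  # logical timestamp; larger means more recent


def consolidate(memories: Iterable[Memory]) -> List[Memory]:
    """Keep only the newest entry per key.

    Duplicates collapse into one entry, and an older value for the same
    key is treated as outdated and dropped (an assumed, simplified rule).
    """
    latest: Dict[str, Memory] = {}
    for m in memories:
        current = latest.get(m.key)
        if current is None or m.updated_at > current.updated_at:
            latest[m.key] = m
    return sorted(latest.values(), key=lambda m: m.key)


def build_context(memories: Iterable[Memory]) -> str:
    """Render consolidated memories as text to prepend to the next task."""
    return "\n".join(f"- {m.key}: {m.value}" for m in memories)


# Illustrative task log accumulated across earlier tasks (made-up data).
log = [
    Memory("report.format", "weekly summary as bullet points", 1),
    Memory("report.format", "weekly summary as bullet points", 2),  # duplicate
    Memory("review.owner", "send drafts to Alice", 1),
    Memory("review.owner", "send drafts to Bob", 3),  # supersedes Alice
]

consolidated = consolidate(log)  # would run while the agent is idle
print(build_context(consolidated))
```

In this toy version the four logged entries shrink to two, and only the current reviewer survives. A real system would presumably need fuzzier matching than exact keys, plus some way to audit what was merged or discarded.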
If this direction matures, the way enterprise AI value is evaluated will also change. In the past, we more often looked at whether a single answer was accurate and whether one generation was usable; in the future, we may also have to look at whether an agent can accumulate experience in continuous tasks, reduce repeated communication, and gradually understand the team’s working methods.
The compute agreement is the foundation of this route. The growth of Claude Code has already shown that once users truly place agents into their workflows, call volume rises quickly. Longer context, longer task chains, multi-agent parallelism, and repeated self-checks and revisions all consume large amounts of inference resources. Anthropic’s cooperation with SpaceXAI shows that competition and cooperation among frontier AI companies are becoming more complex: they compete on models and products while forming cross-dependencies on compute, cloud, and infrastructure.
So Anthropic’s recent integration moves are a platformization signal. Claude is moving from a developer tool toward a work system jointly composed of enterprise office entrances, long-term agent memory, and large-scale compute support. Defending the code battlefield is only the first step; the real goal is to enter more enterprise processes and make Claude a collaboration layer that continuously exists in those processes.

Claude Dreaming diagram
From the default model to code agents, and then to real-time voice, OpenAI’s updates this week look more like steady paving for real work scenarios.
OpenAI’s recent updates lean toward practicality. The core revolves around the default model, code agents, browser, and real-time voice, continuing to steadily push forward its workflow ecosystem.
First is the default model upgrade. OpenAI officially launched GPT-5.5 Instant and made it the new default model in ChatGPT. According to the official statement, GPT-5.5 Instant reduces hallucinated claims by 52.5% compared with GPT-5.3 Instant in high-risk domains such as medicine, law, and finance, and inaccurate claims also drop by 37.3% in difficult conversations where users previously flagged factual errors. At the same time, the new model’s answers are more concise and natural, and it strengthens personalization and traceability of memory sources.
Second is Codex’s workflow expansion. OpenAI has launched the Codex Chrome extension, allowing Codex, under user authorization and website approval mechanisms, to use Chrome and handle browser tasks that require login state, such as LinkedIn, Salesforce, Gmail, or internal tools; the Slack Marketplace also shows that Codex can be @-mentioned in Slack threads, understand context, and answer questions or write code.
Third is real-time voice. OpenAI launched three new models in the API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, covering voice reasoning, real-time translation, and real-time transcription. GPT-Realtime-2 is positioned as a real-time voice model with GPT-5-level reasoning ability; Realtime-Translate supports translation from more than 70 input languages into 13 output languages; Realtime-Whisper targets low-latency streaming transcription.
OpenAI’s current round of moves can be summarized into three lines: reliability, executability, and real-time interaction.
Reliability comes from GPT-5.5 Instant.
The stronger ChatGPT’s default model is, the easier it is for users to place it into serious scenarios. But what serious scenarios fear most is not “not smart enough,” but instability, inaccuracy, and unclear context sources. Therefore, lower hallucination rates, more concise answers, and traceable memory sources are all foundational capabilities for the default model to move toward high-frequency work use.
Executability comes from Codex.
Codex is gradually moving from a code assistant into browsers, Slack, and more work systems. Its significance is not only “being able to write code,” but beginning to take on cross-tool tasks: understanding context, opening webpages, using logged-in sessions, reading internal tools, and completing concrete operations within the scope of user authorization. If AI is to truly become a work assistant, it must move from generating text to executing tasks.
Real-time interaction comes from the Realtime series.
Voice is a more natural entrance than the chat box, but voice products have higher requirements for latency, context, tone, translation quality, and tool calling. The combination of GPT-Realtime-2, Realtime-Translate, and Realtime-Whisper means OpenAI is breaking voice capability into infrastructure modules more suitable for developers to call, rather than only using it as an experience button inside ChatGPT.
This round of OpenAI updates is not flashy, but it is very practical.
The value of GPT-5.5 Instant lies in the fact that it is the default model. Unlike flagship demos that aim for a single moment of shock, default models have to handle hundreds of millions of ordinary tasks every day: writing emails, reading files, explaining spreadsheets, assisting decisions, and handling professional questions. Shorter answers, fewer hallucinations, and clearer memory sources mean OpenAI is giving “reliability under high-frequency use” a more important position.
Codex’s expansion is solving another problem: how AI truly enters the work environments users are already using. Much work does not happen inside a clean code repository, but between browsers, Slack, Gmail, CRM, internal backends, and all kinds of web tools. The meaning of the Chrome extension is that it gives Codex an opportunity to touch real web states and logged-in systems within authorization scope. The stronger this capability becomes, the closer AI gets to being a “digital employee”; but at the same time, permission approval, site authorization, data boundaries, and operation traceability will also become more important.
Real-time voice models fill in another entrance for AI agents. The chat box is suitable for thinking and editing, but many real scenarios require continuous conversation: customer service, meeting notes, cross-language communication, voice assistants, real-time teaching, telephone sales, and remote collaboration. If a voice model only “can speak,” its value is limited; it must understand context, wait for sufficient semantics, respond with low latency, call tools when necessary, and remain stable in long interactions.
So OpenAI’s focus this week does not look like an explosive leap, but more like continuing to embed ChatGPT and Codex into real workplaces.
This also makes the competition between OpenAI and Anthropic clearer. Anthropic is strengthening enterprise workflow platforms: Office, Dreaming, compute. OpenAI is expanding general work entrances: default model, browser, Slack, voice. Both are walking out of the chat box, only with different paths.
Anthropic leans more toward an “enterprise collaboration system,” while OpenAI leans more toward a “general operation entrance.” The next stage of competition will depend on who can enter the places where users truly work every day more naturally, safely, and stably.

GPT-5.5 Instant benchmark
1.1 https://arxiv.org/abs/2605.06548
1.2 https://arxiv.org/abs/2605.06388
1.3 https://arxiv.org/abs/2605.08083
2.1 https://github.com/huggingface/nanowhale
2.2 https://github.com/ruvnet/ruflo
2.3 https://huggingface.co/SulphurAI/Sulphur-2-base
3.1 https://www.theinformation.com/articles/deepseek-raise-7-billion-startup-plots-revenue-efforts
3.2 https://claude.com/blog/new-in-claude-managed-agents
3.3





