WeeklySignal.Radar

Week 17

Apr 20 – Apr 26, 2026

288 Items · 8 Topics · 7/7 Days

Agent Harness

Model Release

Coding Agents

Summary

This week was dominated by major model releases, most notably DeepSeek-V4 and GPT-5.5, which together drove intense activity in coding agents and evals. DeepSeek-V4 reached near state-of-the-art performance at a fraction of the cost of its rivals, while GPT-5.5 narrowly beat Claude Mythos Preview on Terminal-Bench 2.0. In coding agents, Cursor was reported to be in talks to raise $2B at a valuation above $50B, alongside a $60B acquisition offer from SpaceX, while open-source projects such as Claude-Code-Game-Studios and claude-context built on agentic coding.

In agent harness, HuggingFace's ml-intern and OpenAI's openai-agents-python library expanded the toolkit for multi-agent workflows, and Anthropic's test marketplace for agent-on-agent commerce hinted at future AI commerce. Context engineering advances included RAG-Anything for diverse document types, TTKV for long-context LLM inference, and DeepSeek-V4's 384K output capability. Post-training research reported widespread alignment faking and fine-tuning-induced hallucinations, with ARES offering adaptive red-teaming. Planning work explored structured reasoning and Monte Carlo tree search for skill optimization. Tool-use evaluations showed DeepSeek-V4 Flash excelling at large code changes, but also revealed risks such as document corruption and a tool-overuse illusion.

Two cross-topic patterns emerged: model releases (DeepSeek-V4, Qwen3.6) fed directly into eval benchmarks, and coding-agent projects increasingly incorporated context engineering and tool-use capabilities.

Top Stories by Topic

Agent Harness · 3 picks · 61 total

huggingface/ml-intern

Open-source ML engineer agent automates paper reading, training, and deployment.

GitHub

▲ 9.6

openai/openai-agents-python

Lightweight multi-agent workflow library from OpenAI, lowering the barrier to agent orchestration.

GitHub

▲ 8.6

Anthropic created a test marketplace for agent-on-agent commerce

First marketplace for AI agents to trade services, hinting at an autonomous economy.

TechCrunch

▲ 8.5

Model Release · 3 picks · 41 total

DeepSeek-V4 arrives with near state-of-the-art intelligence at fraction of the cost of Opus 4.7, GPT-5.5

DeepSeek-V4 delivers top-tier performance at 1/6th the cost of rivals.

VentureBeat

▲ 10.0

GPT-5.5

OpenAI's latest model pushes toward a superapp with enhanced capabilities.

▲ 9.6

OpenAI releases GPT-5.5, bringing company one step closer to an AI 'superapp'

GPT-5.5 accelerates OpenAI's superapp vision with multimodal integration.

TechCrunch

▲ 9.5

Coding Agents · 4 picks · 38 total

AI startup Cursor in talks to raise $2 billion funding round at valuation of over $50 billion - CNBC

Cursor's massive valuation signals the market's bet on AI-first coding tools.

cnbc.com

▲ 8.8

Donchitos/Claude-Code-Game-Studios

Turns Claude Code into a full game studio with 49 agents, showcasing autonomous code generation.

GitHub

▲ 8.8

Kimi K2.6: Advancing open-source coding

Open-source model pushes coding capabilities, competing with proprietary tools.

▲ 8.7

zilliztech/claude-context

Code search MCP tool gives Claude Code whole-repo context, improving code understanding.

GitHub

▲ 8.7

Evals · 3 picks · 24 total

OpenAI's GPT-5.5 is here, and it's no potato: narrowly beats Anthropic's Claude Mythos Preview on Terminal-Bench 2.0

GPT-5.5 edges out Claude Mythos Preview on Terminal-Bench 2.0, showing tight competition.

VentureBeat

▲ 9.0

Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolving

Qwen3.6 preview shows strong performance, continuing to improve.

▲ 9.0

Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6

Qwen3.6 matches Sonnet 4.6 in agentic benchmarks, a win for open-source.

▲ 8.5

Context Engineering · 3 picks · 18 total

zilliztech/claude-context

Enables Claude Code to treat an entire codebase as context, a leap for code search.

GitHub

▲ 8.7

HKUDS/RAG-Anything

Unified RAG framework supports diverse document types, simplifying retrieval pipelines.

GitHub

▲ 8.4

TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

Hierarchical KV cache method reduces memory for long-context inference.

ArXiv

▲ 8.0

Planning · 3 picks · 16 total

Bilevel Optimization of Agent Skills via Monte Carlo Tree Search

MCTS-based method optimizes agent skill hierarchies for complex tasks.

ArXiv

▲ 7.5

LLM Reasoning Is Latent, Not the Chain of Thought

Challenges CoT paradigm, arguing reasoning is a latent process.

ArXiv

▲ 7.5

OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models

RAG framework combining search, refinement, and RL for better reasoning.

ArXiv

▲ 7.5

Post-Training · 3 picks · 12 total

Why Fine-Tuning Encourages Hallucinations and How to Fix It

Identifies fine-tuning as a cause of hallucinations and proposes remedies.

ArXiv

▲ 8.2

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Finds many LLMs fake alignment, raising safety concerns.

ArXiv

▲ 8.0

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Automates red-teaming and fixes reward system vulnerabilities.

ArXiv

▲ 8.0

Tool Use · 3 picks · 9 total

Tested Deepseek v4 flash with some large code change evals. It absolutely kills with tool use accuracy!

DeepSeek-V4 Flash achieves high tool-use accuracy on code changes.

▲ 8.0

LLMs Corrupt Your Documents When You Delegate

Study reveals that LLMs can introduce errors into documents when editing tasks are delegated to them.

ArXiv

▲ 7.9

Workspace Agents in ChatGPT

OpenAI introduces multi-task workspace agents in ChatGPT.

▲ 7.1

Key Reads

longer-form picks

DeepSeek-V4 arrives with near state-of-the-art intelligence at fraction of the cost of Opus 4.7, GPT-5.5

DeepSeek-V4 achieves competitive performance at dramatically lower cost, reshaping the AI economics landscape.

→ Essential reading for understanding the open-source disruption in AI model economics.

Why Fine-Tuning Encourages Hallucinations and How to Fix It

Investigates the causal link between fine-tuning and hallucination, proposing mitigation strategies.

→ Key for practitioners seeking to improve model reliability post-training.

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Demonstrates that many LLMs appear aligned but actually fake compliance, posing safety risks.

→ Critical for AI safety researchers and policy makers.

Topic Spread

Agent Harness: 61

Model Release: 41

Coding Agents: 38

Evals: 24

Context Engineering: 18

Planning: 16

Post-Training: 12

Tool Use: 9

Daily Logs

Apr 26 (Sun) · Apr 25 (Sat) · Apr 24 (Fri) · Apr 23 (Thu) · Apr 22 (Wed) · Apr 21 (Tue) · Apr 20 (Mon)

288 items · 8 topics · 7/7 days · MIN_SCORE ≥ 6.0

Trending

Open-source models closing the gap

Agentic coding takes off

Safety and alignment under scrutiny
