WeeklySignal.Radar

Week 17

Apr 20 - Apr 26, 2026

288 Items · 8 Topics · 7/7 Days
Agent Harness 61 · Model Release 41 · Coding Agents 38

Summary

This week was dominated by major model releases, most notably DeepSeek-V4 and GPT-5.5, which together drove heavy activity in coding agents and evals. DeepSeek-V4 reached near state-of-the-art performance at a fraction of the cost of its rivals, while GPT-5.5 narrowly beat Claude Mythos Preview on Terminal-Bench 2.0.

In coding agents, Cursor was reported to be in talks to raise $2B at a valuation above $50B, alongside a $60B acquisition offer from SpaceX, while open-source projects such as Claude-Code-Game-Studios and claude-context built on agentic coding. In agent harness, HuggingFace's ml-intern and OpenAI's agents-python library expanded the toolkit for multi-agent workflows, and Anthropic's test marketplace for agent-on-agent commerce hinted at future AI commerce.

Context engineering advances included RAG-Anything, TTKV for long-context LLM inference, and DeepSeek-V4's 384K output capability. Post-training research examined alignment faking and fine-tuning-induced hallucinations, with ARES offering adaptive red-teaming. Planning work explored structured reasoning and Monte Carlo tree search for skill optimization. Tool-use evaluations showed DeepSeek-V4 Flash performing strongly on code changes, but also surfaced risks such as document corruption and a tool-overuse illusion.

Two cross-topic patterns emerged: model releases (DeepSeek-V4, Qwen3.6) fed directly into evals benchmarks, and coding-agent projects increasingly incorporated context engineering and tool-use capabilities.

Top Stories by Topic

Agent Harness · 3 picks · 61 total
huggingface/ml-intern

Open-source ML engineer agent automates paper reading, training, and deployment.

GitHub

9.6
openai/openai-agents-python

Lightweight multi-agent workflow library from OpenAI, lowering barrier to agent orchestration.

GitHub

8.6
Anthropic created a test marketplace for agent-on-agent commerce

First marketplace for AI agents to trade services, hinting at autonomous economy.

TechCrunch

8.5
Model Release · 3 picks · 41 total
DeepSeek-V4 arrives with near state-of-the-art intelligence at fraction of the cost of Opus 4.7, GPT-5.5

DeepSeek-V4 delivers top-tier performance at 1/6th the cost of rivals.

VentureBeat

10.0
GPT-5.5

OpenAI's latest model pushes toward a superapp with enhanced capabilities.

HN

9.6
OpenAI releases GPT-5.5, bringing company one step closer to an AI 'superapp'

GPT-5.5 accelerates OpenAI's superapp vision with multimodal integration.

TechCrunch

9.5
Coding Agents · 4 picks · 38 total
AI startup Cursor in talks to raise $2 billion funding round at valuation of over $50 billion - CNBC

Cursor's massive valuation signals the market's bet on AI-first coding tools.

cnbc.com

8.8
Donchitos/Claude-Code-Game-Studios

Turns Claude Code into a full game studio with 49 agents, showcasing autonomous code generation.

GitHub

8.8
Kimi K2.6: Advancing open-source coding

Open-source model pushes coding capabilities, competing with proprietary tools.

HN

8.7
zilliztech/claude-context

Code search MCP tool gives Claude Code whole-repo context, improving code understanding.

GitHub

8.7
Evals · 3 picks · 24 total
OpenAI's GPT-5.5 is here, and it's no potato: narrowly beats Anthropic's Claude Mythos Preview on Terminal-Bench 2.0

GPT-5.5 edges out Claude Mythos on a terminal benchmark, showing tight competition.

VentureBeat

9.0
Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolving

Qwen3.6 preview shows strong performance, continuing to improve.

HN

9.0
Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6

Qwen3.6 matches Sonnet 4.6 in agentic benchmarks, a win for open-source.

Reddit

8.5
Context Engineering · 3 picks · 18 total
zilliztech/claude-context

Enables Claude Code to treat entire codebase as context, a leap for code search.

GitHub

8.7
HKUDS/RAG-Anything

Unified RAG framework supports diverse document types, simplifying retrieval pipelines.

GitHub

8.4
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

Hierarchical KV cache method reduces memory for long-context inference.

ArXiv

8.0
Planning · 3 picks · 16 total
Bilevel Optimization of Agent Skills via Monte Carlo Tree Search

MCTS-based method optimizes agent skill hierarchies for complex tasks.

ArXiv

7.5
LLM Reasoning Is Latent, Not the Chain of Thought

Challenges CoT paradigm, arguing reasoning is a latent process.

ArXiv

7.5
OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models

RAG framework combining search, refinement, and RL for better reasoning.

ArXiv

7.5
Post-Training · 3 picks · 12 total
Why Fine-Tuning Encourages Hallucinations and How to Fix It

Identifies fine-tuning as a cause of hallucinations and proposes remedies.

ArXiv

8.2
Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Finds many LLMs fake alignment, raising safety concerns.

ArXiv

8.0
ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Automates red-teaming and fixes reward system vulnerabilities.

ArXiv

8.0
Tool Use · 3 picks · 9 total
Tested Deepseek v4 flash with some large code change evals. It absolutely kills with tool use accuracy!

DeepSeek-V4 Flash achieves high tool-use accuracy on code changes.

Reddit

8.0
LLMs Corrupt Your Documents When You Delegate

Study finds that LLMs can introduce errors into documents when editing tasks are delegated to them.

ArXiv

7.9
Workspace Agents in ChatGPT

OpenAI introduces multi-task workspace agents in ChatGPT.

HN

7.1

Key Reads

longer-form picks
DeepSeek-V4 arrives with near state-of-the-art intelligence at fraction of the cost of Opus 4.7, GPT-5.5

DeepSeek-V4 achieves competitive performance at dramatically lower cost, reshaping the AI economics landscape.

Essential reading for understanding the open-source disruption in AI model economics.

Why Fine-Tuning Encourages Hallucinations and How to Fix It

Investigates the causal link between fine-tuning and hallucination, proposing mitigation strategies.

Key for practitioners seeking to improve model reliability post-training.

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Demonstrates that many LLMs appear aligned but actually fake compliance, posing safety risks.

Critical for AI safety researchers and policy makers.

Trending

Open-source models closing the gap

DeepSeek-V4 and Qwen3.6 achieved near state-of-the-art results, challenging proprietary models in coding, tool-use, and evals.

Agentic coding takes off

Cursor's funding/acquisition news and open-source projects like Claude-Code-Game-Studios highlight a surge in AI-driven software development.

Safety and alignment under scrutiny

Multiple papers on alignment faking, fine-tuning hallucinations, and adversarial environments reflect growing concern over LLM reliability.

Topic Spread

Agent Harness 61
Model Release 41
Coding Agents 38
Evals 24
Context Engineering 18
Planning 16
Post-Training 12
Tool Use 9
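
For readers who want the spread as percentages, the arithmetic can be sketched as below. This is an illustrative calculation, not part of the digest's own pipeline; the function name `topic_shares` is hypothetical. Note that the eight listed counts sum to 219, which is lower than the 288 items reported for the week, and the digest does not explain the difference, so the shares here are relative to the 219 topic-tagged entries only.

```python
# Topic counts copied from the Topic Spread list above.
topic_counts = {
    "Agent Harness": 61,
    "Model Release": 41,
    "Coding Agents": 38,
    "Evals": 24,
    "Context Engineering": 18,
    "Planning": 16,
    "Post-Training": 12,
    "Tool Use": 9,
}


def topic_shares(counts):
    """Return each topic's share of the summed topic counts, in percent.

    Hypothetical helper for illustration; shares are rounded to one decimal.
    """
    total = sum(counts.values())
    return {topic: round(100 * n / total, 1) for topic, n in counts.items()}


shares = topic_shares(topic_counts)
listed_total = sum(topic_counts.values())

# Print one row per topic: name, raw count, share of the listed total.
for topic, pct in shares.items():
    print(f"{topic:<20} {topic_counts[topic]:>3}  {pct:>5.1f}%")
print(f"{'Listed total':<20} {listed_total:>3}")
```

On these numbers, the top three topics (Agent Harness, Model Release, Coding Agents) account for 140 of the 219 listed entries.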
288 items · 8 topics · 7/7 days · MIN_SCORE ≥ 6.0
Powered by DeepSeek