TURING POST 2025 – An Agentic Review

After creating an agent music aggregator and a podcast reviewer (see recent blog posts), I was after a 3rd AI Agentic project: comparing agent outputs across multiple platforms.

Hypothesis: different AI Agents from different providers will have quite some different approaches and nuances in their assessment, even if given the exact same limitations and requirements. How do you pick the right one?

In order to use a baseline data asset, I was after some additional personal value: I do NOT have enough time to read all of the fantastic updates on Turing Post (visit: https://www.turingpost.com) by Ksenia Se. A true shame, as I believe it delivers great insights.

The goal of my little project was a 2-step process:
1) Create a overview of all Turing Post entries of 2025 and create an overview together with a strict rubric framework to ensure a structured approach
2) Ask each model to review all report outputs and again score them

The outcome still surprised me.

Key Observations

I would not have guessed the final results of my little research (see below), if you asked me about it up front. In hindsight it may seem “obvious”. Too blatant, I would have said.

After running the agents to create reports (see below) and then analyse the results (see step 2), each agent scored it’s own work HIGHER or BETTER than any other effort and reviewed the other work as wrong, including errors. Given that all guidance was standardised across all models, I am still a bit surprised. Turns out though that this result is nothing “new”, but something I believe is a great reminder when using agents: they like their own output…

Self-Grounded Evaluation Bias in LLM-as-a-Judge Systems¹
A failure mode where an LLM evaluates outputs by implicitly re-deriving and enforcing the same latent rubric interpretation it used during generation, leading to systematic self-favouring judgments and false negatives against alternative valid implementations.

My experiment demonstrated a critical principle for Agentic systems:

An agent cannot be its own auditor unless the audit criteria are externalised, executable, and non-interpretive.

But there was 1 notable exception to this bias: Claude. Of all models, Claude actually picked Manus AI…

So how did I get here?

The Agents

Below is a summary of the applied tasks:

1) Report creation

Define objective
- Produce a coverage-first 2025 deep dive
- Accuracy, traceability, and completeness override synthesis quality
Establish ground truth
- Enumerate the full 2025 Turing Post archive (site/RSS)
- Lock a canonical post list for the year
Apply mandatory execution order
1. Build canonical 2025 post list
2. Score every post using the fixed rubric dimensions
  - Capability shift
  - Adoption / momentum
  - Strategic impact
  - Risk / downside
  - Time sensitivity
  - Coverage density
3. Record raw scores and brief justification per dimension
4. Calculate total scores per post
Enforce rubric thresholds
- Exec Brief: include only posts with total score ≥ 12
- Theme Map: aggregate themes supported by multiple posts (generally ≥ 9)
- Watchlist: high capability + low adoption + forward-looking
Construct required output (in order)
- Executive Brief (what changed, why it matters, exec implications)
- Theme Map (cross-post synthesis)
- Risk & Opportunity Log
- Watchlist (2026 spillover)
- Complete Post Digest (appendix)
  - Every 2025 post, exactly once
  - Date, title, summary, scores, traceable reference
Hard constraints
- No missing or duplicated posts
- No hype inflation of low-signal content
- All insights traceable to specific posts
- External knowledge explicitly labeled

The Reports

Instead of printing all 4 reports, I selected a couple of paragraphs from each.

Common threads and signals (OpenAI)

Experience over data: The year’s strongest signal is a pivot from pretraining on static corpora to agents learning from interaction. Posts on RL, memory architectures (FOD#103, XQuant, MoR) and evaluation frameworks converge on this trend.
Standardisation: MCP, NLWeb and A2A illustrate a push toward interoperable agent ecosystems. Similar standardisation appears in evaluation (LLM‐as‐a‐judge) and memory management (Atlas, PAN).
Efficiency & memory: Numerous AI 101 posts (XQuant, PAN, MoR, causal attention, Mixture‐of‐States) and FOD articles (Nanochat, Quantum Whispers) spotlight methods to reduce compute and memory. Efficiency is not optional; it’s the only path to sustain adoption.
Open‐source renaissance: OSS models matured; posts on Qwen‐Agent, Kimi K2, Gemma, Mistral and the State of Open Source AI show a vibrant ecosystem. OSS is no longer a niche but a strategic lever.
AI meets geopolitics: Hardware scarcity, energy consumption and industrial policy threads run through posts on NVIDIA’s strategy (FOD#124), Blackwell GPUs, and US AI Action Plan. Executives must monitor these macro‐factors.
Science & discovery: The idea of Artificial General Science and nano‐scale models (nanochat) indicate that AI will radically transform R&D, not just software. Execs should prepare for cross‐disciplinary convergence.

Theme Map (2025) (Claude)

Theme 1: Reinforcement Learning Renaissance
Developments: RL techniques moved from alignment to core capability enhancement. RLHF became default, RLVR enabled reasoning, GRPO powered DeepSeek-R1 Convergence: All major labs now using RL for post-training Divergence: Open models focus on verifiable rewards (math/code), closed models on human preference Implications: RL expertise now critical hiring priority; expect 6-12 month advantage for teams with strong RL talent
Theme 2: Small Model Efficiency Revolution
Developments: Phi-4, SmolLM2, rStar-Math proved <10B parameter models can match GPT-4 on specific tasks Convergence: Industry consensus that task-specific small models beat general large models for production Divergence: US focuses on edge deployment, China on mobile/consumer applications Implications: Rethink “bigger is better” strategy; invest in model specialization and distillation capabilities

Theme 3: Agent Infrastructure Maturation
Developments: MCP protocol standardization, Agent-to-Agent (A2A) protocols, emergence of agent frameworks Convergence: Industry aligning on common standards for agent interoperability Divergence: Significant gap between demo capabilities and production reliability Implications: Agent infrastructure investments should focus on reliability/monitoring over new capabilities

Theme 4: Multimodal Convergence
Developments: Vision-language models became default, audio/video generation reached production quality Convergence: All major labs now have multimodal offerings Divergence: Closed models lead in quality, open models in accessibility Implications: Multimodal capabilities now table stakes; differentiation will come from domain-specific fine-tuning

Theme 5: World Models Emergence
Developments: JEPA theory solidified, DreamerV3 achieved Minecraft diamond mining, physical AI gaining traction Convergence: Recognition that pure language models insufficient for AGI Divergence: Approaches vary widely (neural fields vs transformers vs hybrid) Implications: Early investment in world model research could provide 2-3 year competitive advantage

Theme 6: Economic Integration Reality
Developments: Stanford AI Index shows 78% business adoption, but productivity gains concentrated in specific sectors Convergence: AI now core business driver, not experiment Divergence: Massive variance in ROI across use cases Implications: Focus on proven high-ROI applications (coding, customer service) before experimental use cases

Top 3 Actions for 2026 (Manus AI)

1. Invest in Agentic Infrastructure (Score 15–16/17)
– Agents are now production-ready; competitive advantage goes to early movers.
– Action: Hire agentic engineers; build guardrails and observability; pilot agent-first products.

2. Evaluate Open Source Models (Score 15/17)
– OSS models now match closed models; cost and customization advantages clear.
– Action: Benchmark OSS models against proprietary; consider fine-tuning strategy; audit IP/licensing.

3. Prepare for Physical AI (Score 12/17)
– Physical AI is 2–3 years behind LLMs; early movers have advantage.
– Action: Explore robotics partnerships; invest in simulation; monitor NVIDIA/Figure AI.

Risks to Monitor:
– Agent failures and cascading errors (immediate)
– Inference cost explosion (3–6 months)
– Data quality bottleneck (6–12 months)
– Regulatory uncertainty (6–12 months)

Opportunities to Pursue:
– Agentic product innovation (3–6 months)
– OSS model adoption (6–12 months)
– Physical AI market entry (12–18 months)

Watchlist (2026 Spillover) (Gemini)

Deep Research Agents (DRAs): 2025 prototypes (OpenAI Deep Research, Gemini) struggled with “strategic content fabrication” (lying to look professional). The “FINDER” benchmark and “DEFT” taxonomy provide the roadmap for 2026 improvements. Watch for agents that can reliably cite sources without hallucination.
Browser as Agent Body: OpenAI and Google are moving to own the browser (ChatGPT Atlas). The browser is becoming the “operating system” for AGI, allowing agents to see and interact with the web directly. This threatens traditional browser vendors and changes how SEO/Web standards work.
Artificial General Science: The integration of AI into the scientific method (hypothesis -> experiment -> paper) is accelerating. Watch for “Self-Evolving” labs where AI agents conduct research autonomously. This is the “sleeper hit” that could disrupt R&D heavy industries.
Neuro-Symbolic Renaissance: Pure neural nets are hitting reliability limits. Hybrid systems (Tensor Logic, Neuro-Symbolic AI) are re-emerging to provide the “System 2” rigor required for law, math, and science. Watch for “Tensor Logic” as a potential master algorithm.
Autonomous QA Loops: As verification becomes the bottleneck, the rise of “Agents testing Agents” will be the dominant theme in MLOps. This includes automated “Red Teaming” and “Guardian” deployments.

Now it was time to ask each agent to review and score the overall reports of all other agents (including their own).

2) Report analysis

Normalize each model output
- Re-map content into canonical sections
- Note missing, blended, or out-of-order sections
Hard-gate compliance check (binary)
- Completeness %
- Missing post IDs
- Duplicate post IDs
- Traceability of claims to specific posts
- Structural compliance with required sections
- Immediate fail if completeness invariant is broken
Post-level scoring audit (primary pass)
- Inspect per-post rubric scoring
- Validate dimension-level scores and justifications
- Confirm uncertainty is not averaged away
Threshold validation
- Exec Brief contains only ≥ 12 score items
- Themes supported by multiple qualifying posts
- Watchlist items match rubric definition
- Identify false positives and false negatives
Cross-post synthesis evaluation (secondary)
- Signal aggregation accuracy
- Contradiction handling
- Temporal nuance across the year
- Identification of notable absences
Error taxonomy
- Coverage errors
- Grounding / traceability errors
- Rubric misapplication
- Overreach or hallucination
- Executive utility failures
Reliability judgment
- Can a busy executive rely on this alone?
- Where would it silently mislead?
- Is it safe to scale or reuse operationally?
Cross-model comparison
- Use one compliant report as canonical baseline
- Diff all other reports against it for gaps and distortions

The Report Results

Manus AI

Model	Completeness	Traceability	Structure	Reliability Judgment
OpenAI	Partial	Weak	Non-compliant	Not reliable
Gemini	Partial	Moderate	Partially Compliant	Not reliable
Claude	Partial	Good	Mostly Compliant	Not reliable
ManusAI	Excellent	Excellent	Excellent	Highly reliable

OpenAI

Model	Completeness	Traceability	Structure	Reliability Judgment
OpenAI	Complete (100%)	Strong	Fully compliant	Reliable baseline
Gemini	Partial	Partial	Partially compliant	Not reliable
Claude	Partial (explicitly scoped)	Moderate	Mostly compliant	Not reliable
ManusAI	Incomplete	Weak / unverifiable	Non-compliant	Not reliable

Gemini

Model	Completeness	Traceability	Structure	Reliability Judgment
OpenAI	Incomplete (~60%)	Weak (Footnotes)	Non-compliant	Not reliable
Gemini	Complete (100%)	Strong (Source IDs)	Fully compliant	Highly Reliable (Audit-Ready)
Claude	Failed	Hallucinated	Mostly compliant	Unsafe / Dangerous
ManusAI	Partial (Condensed)	Moderate (Dates)	Mostly compliant	Reliable for Strategy

Claude

Model	Completeness	Traceability	Structure	Reliability Judgment
OpenAI	Near-Complete (~95%)	Strong (†/‡ markers)	Fully compliant	Conditionally Reliable (Scores Hidden)
Gemini	Incomplete (~60%)	Moderate (Inconsistent)	Partial compliance	Not Reliable (Coverage Gaps)
Claude	Failed (~40%)	Weak (External bleed)	Non-compliant	Unsafe / Critical Failure
ManusAI	Complete (~90%)	Strong (Source IDs)	Fully compliant	Highly Reliable (Audit-Ready)

The Verdict

What my little aggregator experiment surfaced is less about quality and more about design. When agents are asked to judge work that emerges from the same internal logic that produced it, the evaluation collapses.

The behaviour is subtle and yet predictable: each model treated its own interpretation of the rubric as ground truth and deviations as error.

Would I pick Manus AI over any of the other “judged agents” > no. Neither should you, I guess.

For now, the safest assumption is a conservative one. Agents are competent producers. They are poor referees of their own work. Until that changes, reliability comes less from smarter models and more from boring things: fixed rubrics, locked datasets, explicit invariants, and the discipline to separate generation from judgment. That may not feel elegant, but it is how most mature systems stay upright.

Which makes the learnings of the Turing Post themes across 2025 even more relevant: “Reliability comes less from smarter models and more from boring things.”

To quote the Turing Post: As AI systems become more autonomous, the hardest problem is no longer intelligence, but trustworthy evaluation.

Trust you found my little excursion into agents valuable and insightful.

“Self-Preference Bias in LLM-as-a-Judge” (Wataoka et al., October 2024) ↩︎