Generative AI: 5th August 2025
📣 Headlines
• Google launched Gemini 2.5 Deep Think, a new reasoning model that tests multiple ideas in parallel and enhances problem-solving capabilities with improved coding skills and computational efficiency.
• Anthropic is set to raise up to $5bn in a round led by Iconiq Capital, tripling its valuation to $170bn and placing it among the world's most valuable private companies alongside OpenAI and SpaceX.
• Security concerns emerged as a flaw in Gemini CLI could allow hackers to execute malicious commands, while OpenAI removed a ChatGPT feature after shared conversations surfaced in Google search results.
• GitHub Copilot crossed 20 million users, driven by significant enterprise adoption among Fortune 100 companies and competition with tools like Cursor.
• OpenAI announced its first European data center in Norway, partnering with Nscale and Aker to utilize renewable energy and innovative cooling technologies for enhanced AI infrastructure.
• Meta's profits surged, enabling Zuckerberg to accelerate AI investments and develop 'superintelligence' tools to compete with OpenAI and Google.
• AI cyberattacks are outpacing security measures according to IBM's report, which points to rising breach costs and vulnerabilities as organizations struggle with attacks involving ungoverned 'shadow AI'.
• The Federal Reserve released analysis suggesting AI productivity growth will be slow and risky, drawing parallels to historical technologies like electricity adoption.
🔧 Company Engineering Blogs
Covariate Selection in Causal Inference: Good and Bad Controls (booking​.ai). Explores covariate selection in causal inference, discussing confounding, mediators, colliders, and biases impacting causal effect estimates using observational data
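As a quick illustration of the "bad controls" point above, here is a minimal simulation (not from the Booking.com post; variables and effect sizes are invented): conditioning on a collider that is caused by both treatment and outcome manufactures a spurious effect where none exists.

```python
# Illustrative only: shows why conditioning on a collider ("bad control") biases
# an effect estimate. Variable names and effect sizes are made up, not from the post.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)            # treatment
y = 0.0 * x + rng.normal(size=n)  # outcome: true causal effect of x is 0
z = x + y + rng.normal(size=n)    # collider: caused by both x and y

def ols_coef_on_x(features):
    """Coefficient on x from an OLS fit of y on [intercept, features...]."""
    X = np.column_stack([np.ones(n)] + features)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]  # x is the first feature after the intercept

print("y ~ x    :", round(ols_coef_on_x([x]), 3))     # ~0.0  (unbiased)
print("y ~ x + z:", round(ols_coef_on_x([x, z]), 3))  # ~-0.5 (collider bias)
```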
AlphaEarth Foundations helps map our planet in unprecedented detail (deepmind​.google). AlphaEarth Foundations integrates vast Earth observation data using AI to enhance global mapping, benefiting various applications like food security and environmental monitoring
How AI Test Automation Cut Developer Productivity Bottlenecks by 30% at Scale (engineering​.salesforce​.com). AI-powered Test Failure Triage Agent improves developer productivity by 30%, enhancing test failure resolution with context-driven suggestions and semantic search integration
Automate your project with GitHub Models in Actions (github​.blog). Integrate AI with GitHub Actions using GitHub Models to automate workflows, improve issue triage, and enhance developer productivity
🇨🇳 Chinese Models & Open Source LLMs
The best available open weight LLMs now come from China (simonwillison​.net). Chinese AI labs now lead with top open weight LLMs, outperforming Mistral and Gemma, featuring Qwen and Z.ai among notable releases
Qwen3 30B-A3B (huggingface​.co). Qwen3-30B-A3B-Instruct-2507 model enhances text generation, features non-thinking mode, supports tool-calling via Qwen-Agent, provides API endpoints, and requires updated libraries
Interviewing Ross Taylor on the state of AI: Chinese open models, scaling reasoning, useful tools, and what comes next (interconnects​.ai). Ross Taylor discusses the rapid development of AI, focusing on Chinese open models, reasoning, LLM training dynamics, and emerging technologies in a podcast
GLM 4.5: test drive (konradb​.substack​.com). GLM 4.5 offers advanced agentic functionality with 355B parameters, optimizing energy research and automating presentations with generative AI tools
🤖 AI Agents & Production Systems
How we built AI agents at Airtable (medium​.com/airtable-eng). Airtable's agentic framework enables AI features like Omni and Field Agents, enhancing automation, reasoning, and decision-making capabilities within its app platform
Context Engineering: Building Production-Grade AI (akashbajwa​.co). Exploring context engineering in AI: stateful products, single vs multi-agent systems, KV caches, and key practices for optimizing performance
Architecting the Foundation — LLM Function Calling and Toolchains (digitalthoughtdisruption​.com). Explore LLM function calling using LangChain and OpenAI's function schema for production deployment and AI agent creation
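For readers unfamiliar with the function schema the post builds on, here is a minimal sketch of OpenAI-style tool calling; the `get_weather` tool, its JSON schema, and the model choice are illustrative, and the post's LangChain toolchain is not reproduced here.

```python
# Minimal OpenAI tool-calling sketch; the tool and schema are hypothetical.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

msg = response.choices[0].message
if msg.tool_calls:  # the model may also answer directly without calling the tool
    call = msg.tool_calls[0]
    # Arguments arrive as a JSON string matching the declared schema.
    print(call.function.name, json.loads(call.function.arguments))
```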
How LLMs Actually Process Your Prompts, Tools, and Schemas (hippocampus-garden​.com). Explores how LLMs serialize prompts, tools, and schemas into token sequences, using Llama 4 and Kimi K2 as examples
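One easy way to see that serialization for yourself is to render a conversation plus a tool definition through a model's chat template (a small sketch, using Qwen2.5-Instruct's tokenizer as an arbitrary stand-in for the models discussed in the article):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def get_weather(city: str) -> str:
    """
    Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny"

messages = [{"role": "user", "content": "What's the weather in Oslo?"}]

# tokenize=False returns the exact prompt string the model sees, with the tool's
# JSON schema serialized into the prompt; tokenizing it shows the token sequence.
prompt = tokenizer.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True, tokenize=False
)
print(prompt)
print(len(tokenizer(prompt).input_ids), "tokens")
```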
🎨 Models & Infrastructure
Releasing weights for FLUX.1 Krea (krea​.ai). Krea releases FLUX.1 Krea, an open-source image model focusing on superior aesthetic control, collaborating with Black Forest Labs to enhance image quality
The Reviewer is Dead, Long Live the Review: Re-engineering Peer Review for the Age of AI (sigarch​.org). Exploring Large Language Models (LLMs) to enhance peer review efficiency, bias reduction, and analytical rigor in scientific publishing
AI Engineer World's Fair 2025: My Day 2 Highlights (craftycto​.com). Highlights from AI Engineer World's Fair 2025 include innovations from Google DeepMind, Dagger, Morph, and insights on agentic coding and AI infrastructure
A Survey Of Architectures And Methodologies For Distributed LLM Disaggregation (api.follow.it). Survey on architectures, methodologies, and tools for distributed LLM disaggregation, covering KV-cache optimization, scheduling, resource management, and the impact of heterogeneous systems
🔍 RAG Systems & Data Quality
Red-teaming a RAG app: gpt-4o-mini v. llama3.1 v. hermes3 (blog​.pamelafox​.org). Red-teaming evaluation of RAG applications using gpt-4o-mini, llama3.1, and hermes3 to assess safety against unsafe outputs
Data quality and rubrics: how to build trust in your models (s46486​.pcdn​.co). Rubric-based evaluation enhances data annotation quality for generative AI models, addressing outdated methods with structured, systematic frameworks for better trust and performance
Multilingual RAG: Does Query-Doc Language Mismatch Matter? (mikulskibartosz​.name). Exploring query-document language mismatch in a multilingual RAG chatbot using Pinecone and a Japanese-English corpus for effective semantic search
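A stripped-down way to probe the same question locally (this is not the post's Pinecone pipeline; the multilingual embedding model is an arbitrary choice): embed an English query against Japanese and English versions of a document and compare cosine similarities.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query_en = "How do I cancel my reservation?"
# Japanese doc: "To cancel a reservation, open your booking history in account settings."
doc_ja = "予約をキャンセルするには、アカウント設定から予約履歴を開いてください。"
doc_en = "To cancel a reservation, open your booking history in account settings."

embs = model.encode([query_en, doc_ja, doc_en], normalize_embeddings=True)
print("EN query vs JA doc:", float(util.cos_sim(embs[0], embs[1])))
print("EN query vs EN doc:", float(util.cos_sim(embs[0], embs[2])))
```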
Red-teaming a RAG app: What happens? (blog​.pamelafox​.org). Red-teaming a RAG app using automated tools like Azure AI evaluates LLM safety against malicious queries within product databases
📊 Evaluation & Benchmarking
Achieving Early Wins in Generative AI (cacm​.acm​.org). Exploring Structured Outputs in Generative AI for better IT integration, unstructured data processing, and enhanced automation in enterprise systems
How Kimi RL’ed Qualitative Data to Write Better (dbreunig​.com). Kimi K2 enhances qualitative writing using reinforcement learning, addressing challenges in qualitative scoring while demonstrating effective categorization techniques in AI
AI at Play - Lessons from a silly benchmark (andreasthinks​.me). Andreas Varotsis explores LLM interactions in the game Risk, highlighting model behaviors and insights gained from his open-source project AI at Risk
The Hidden Homework Problem: How ArxivRoll Exposed AI’s Inflated Test Scores (emsi​.me). ArxivRoll reveals AI models' inflated test scores due to training on leaked benchmarks, introducing new evaluation strategies for accurate assessments
🔬 Technical Research & Optimization
Native Sparse Attention (aclanthology​.org). Native Sparse Attention (NSA) introduces efficient long-context modeling through hardware-aligned sparse mechanisms, enhancing performance and reducing computational needs for language models
GEPA: Reflective prompt evolution can outperform reinforcement learning (arxiviq​.substack​.com). GEPA algorithm utilizes reflective prompt evolution, outperforming RL in sample efficiency, optimizing LLM prompts, and enabling cost-effective AI system adaptations
Attention Probes (blog​.eleuther​.ai). Exploration of attention probes in language models using an attention layer for classification, outperforming traditional mean and last-token probes in specific contexts
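For intuition, a toy probe in this spirit might look like the sketch below (a single learned query, no value projection; dimensions are arbitrary and this is not EleutherAI's implementation):

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    def __init__(self, d_model: int, n_classes: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model) / d_model**0.5)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model) activations from a frozen LM
        scores = hidden_states @ self.query                           # (batch, seq_len)
        weights = torch.softmax(scores, dim=-1)                       # attention over tokens
        pooled = (weights.unsqueeze(-1) * hidden_states).sum(dim=1)   # (batch, d_model)
        return self.classifier(pooled)                                # (batch, n_classes)

probe = AttentionProbe(d_model=768, n_classes=2)
logits = probe(torch.randn(4, 128, 768))  # e.g. a batch of cached activations
print(logits.shape)                       # torch.Size([4, 2])
```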
Paper Review: Group Sequence Policy Optimization (andlukyane​.com). GSPO optimizes reinforcement learning for LLMs using sequence-level importance ratios, improving efficiency, stability, and performance compared to GRPO and token-level methods
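The central quantity, as I read the paper, is a length-normalized sequence-level importance ratio that replaces GRPO's per-token ratios; a rough sketch (not the authors' code, padding masks omitted):

```python
import torch

def sequence_importance_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """logp_*: (batch, seq_len) per-token log-probs of the sampled response.

    GSPO clips this single per-sequence ratio: the geometric mean of the
    per-token ratios, i.e. exp(mean log-ratio over the response length).
    """
    return torch.exp((logp_new - logp_old).mean(dim=-1))

def token_importance_ratios(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """GRPO-style per-token ratios, shown for contrast."""
    return torch.exp(logp_new - logp_old)
```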
Optimizing training a GPT style Tokenizer with C++ (justinhj​.github​.io). C++ optimization of a GPT-style tokenizer, leveraging BPE and experiments reducing training time by 23x, guided by Andrej Karpathy's concepts
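The inner loop being optimized is the classic BPE merge step; a deliberately slow pure-Python sketch (not the author's C++) shows the shape of the work:

```python
# Core BPE training loop: count adjacent token pairs, merge the most frequent one.
from collections import Counter

def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "low lower lowest"
ids = list(text.encode("utf-8"))
merges = {}
for step in range(6):          # grow the vocab by 6 merges on this toy text
    if len(ids) < 2:
        break
    pair = most_frequent_pair(ids)
    new_id = 256 + step
    merges[pair] = new_id
    ids = merge(ids, pair, new_id)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens after merges")
```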
Putting Math Behind the Madness: A Theoretical Framework for LLM Hallucinations (emsi​.me). Esmail Gumaan's framework clarifies LLM hallucinations, introduces mathematical definitions, quantifies hallucination risk, and proposes unified detection and mitigation strategies using PAC-Bayes and Rademacher complexity
📚 Academic Research
How Far Are AI Scientists from Changing the World? (arxiv:cs). Survey on AI Scientist systems, large language models' impact on scientific discovery, bottlenecks, achievements, and future goals for innovative research
A Survey on Code Generation with LLM-based Agents (arxiv:cs). Code generation agents utilizing LLMs enhance software development through autonomy, extended task scope, and practical engineering solutions, with a focus on SDLC integration
Can large language models assist choice modelling? Insights into prompting strategies and current models' capabilities (arxiv:econ). Explores LLMs like ChatGPT and Claude for Multinomial Logit model specification and estimation, assessing prompting strategies and information availability for choice modelling
ITUNLP at SemEval-2025 Task 8: Question-Answering over Tabular Data: A Zero-Shot Approach using LLM-Driven Code Generation (arxiv:cs). Zero-shot question answering over tabular data using LLM-driven Python code generation, achieving notable rankings in SemEval-2025 DataBench tasks
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective (arxiv:cs). Explores softmax attention's expressiveness vs. linear attention, using RNN methods to analyze components and interactions, revealing insights into performance discrepancies
Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation (arxiv:cs). Proposes Interaction Distillation framework to optimize preference modeling in reward models, mitigating attention hacking in reinforcement learning from human feedback for LLMs
Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models (arxiv:cs). Causal2Vec enhances decoder-only LLMs for effective semantic encoding with lightweight BERT-style pre-encoding, achieving state-of-the-art MTEB performance and reduced computational costs
👋 Before you go
I've got a big favor to ask: keeping Blaze running isn't expensive, but it all adds up, so I'm asking readers like you to help if you can.
That's why I'm launching a Patreon page! Nothing flashy, just a way for folks who find value in these newsletters to chip in a little each month. In return, you'll get:
- Real say in how Blaze evolves — vote on new topics, features, topic curation ideas
- First dibs on merch (details still cooking)
- That warm fuzzy feeling knowing you're supporting something that saves you time and keeps you plugged into great tech writing
If you're getting value from Blaze, checking this out would mean the world. And if you can't contribute, no worries: the newsletters keep coming either way, and you can follow along on Patreon for free.
Thanks for reading and being part of this nerdy corner of the internet. All the best - Alastair.
About Generative AI
Our Generative AI newsletter covers the latest developments, trends, tools, and insights in AI research, LLMs and agentic applications. Each week, we curate the most important content from over 50,000 blogs and news sites so you don't have to spend hours searching.
Whether you're a beginner or expert in generative AI, our newsletter provides valuable information to keep you informed and ahead of the curve in this rapidly evolving field.
Subscribe now to join thousands of professionals who receive our weekly updates!