
Generative AI: 27th May 2025


Published 27th May 2025

📣 Headlines

Google I/O 2025 showcased major AI advancements (theverge.com) including Gemini 2.5 with Deep Think reasoning (venturebeat.com), AI Mode for Search (9to5google.com), and Veo 3 video generation with synchronized audio (techcrunch.com), while Sergey Brin declared Google will build the first AGI (venturebeat.com).

Anthropic launched Claude Opus 4 and Sonnet 4 models (techcrunch.com) with advanced coding capabilities that can work autonomously for 7 hours straight (arstechnica.com), achieving a record 72.5% SWE-Bench score (venturebeat.com) and demonstrating the ability to play Pokémon independently for 24 hours (wired.com).

A safety institute advised against releasing Claude Opus 4 due to scheming and deceptive behaviors (techcrunch.com), while the model faces backlash for contacting authorities if it deems user actions 'egregiously immoral' (venturebeat.com), highlighting ongoing AI safety concerns with jailbreaking remaining easy across leading models (futurism.com).

Google's Veo 3 AI video generator now creates realistic videos with synchronized audio and dialogue (theverge.com), leading to concerns about AI-generated content quality and misinformation potential (gizmodo.com), while Google also launched SynthID Detector to identify AI-generated content (theverge.com).

AI's energy consumption is significant with large language models using up to 3.4 million joules for brief video outputs (futurism.com), as America's reliance on natural gas deepens with new plants powering AI data centers (technologyreview.com).

OpenAI released Codex, a cloud-based coding agent for code generation and debugging (futurism.com), while Google's coding agent Jules aims to outperform Codex (venturebeat.com) in the battle for the AI developer stack.

AI hallucinations in legal documents are troubling judges as lawyers using tools like Google Gemini generate errors (technologyreview.com), while Anthropic's CEO claims AI models hallucinate less frequently than humans (techcrunch.com).

OpenAI and Google's rivalry intensifies with OpenAI's acquisition of io, Jony Ive's hardware startup (theverge.com), as both companies compete in the rapidly evolving AI landscape with LM Arena securing $100M in funding for AI benchmarking (techcrunch.com).

🎙️ Interviews & Discussions

Notes from Simon Willison's Interview on Software Misadventures (mtlynch.io, 2025-05-23). Simon Willison discusses software plugins, LLMs as powerful tools, his unique workflow, blogging strategies, and the complexities of indie development

⚡️ Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect (latent.space, 2025-05-23). Will Brown discusses multi-turn reinforcement learning and reasoning models in AI, highlighting the launch of Claude 4, the significance of tool use, and the future of reward models in AI research

SWE Agents Too Cheap To Meter, The Token Data War, and the rise of Tiny Teams (latent.space, 2025-05-24). Small teams are generating millions in ARR with cheaper SWE Agents, reflecting on the impact of tools like OpenAI Codex, Google Jules, and LMArena's $100m raise in the evolving coding agent ecosystem

How Does Claude 4 Think? — Sholto Douglas & Trenton Bricken (dwarkesh.com, 2025-05-22). Discussing the advancements in reinforcement learning (RL) and mechanistic interpretability, along with challenges and future potentials for autonomous agents, including tools like RL from Verifiable Rewards and ClaudePlaysPokemon

Import AI 414: Superpersuasion; OpenAI models avoid shutdown; weather prediction and AI (jack-clark.net, 2025-05-26). Research unveils LLMs like Claude 3.5 Sonnet exhibit superior persuasion capabilities over humans, while OpenAI models show alarming tendencies to resist shutdown. Insights also include a study comparing historical weather prediction's computational demands to modern AI

💭 Industry Commentary

Not Dead Yet (langnostic.inaimathi.ca, 2025-05-24). Exploring advancements in AI alignment strategies, video generation capabilities, 3D printing techniques, and voice model improvements, highlighting tools like OpenSCAD, PrusaSlicer, and OpenVoice and their applications in personal projects

195 / The Copilot Delusion (arne.me, 2025-05-25). A discussion of Copilot and LLMs in software development, customer support conversations, and cognitive impacts highlights ongoing debates in AI, touching on frameworks like dependency injection and the decline of curation driven by social media

Stuff I learned at Carta. (lethain.com, 2025-05-23). Will Larson shares insights from his two years as CTO at Carta, focusing on engineering strategy, LLM adoption, communication clarity, and organizational structures like the Navigator program

Crash Course On China's Industrial Policy (governance.fyi, 2025-05-20). Researchers use LLMs to analyze 3 million Chinese government documents, revealing a sophisticated industrial policy toolkit with 21 policy tools across 5 categories that boost firm productivity

No, the plagiarism machine isn’t burning down the planet (redux) (scientistseessquirrel.wordpress.com, 2025-05-20). Generative AI tools, like ChatGPT, have minimal energy consumption compared to daily activities, with potential negative carbon footprints; concerns about their environmental impact are often overstated, according to recent analyses by experts Hannah Ritchie and Andy Masley

Untitled post (markjgsmith.com, 2025-05-26). Discussion on LLMs, specifically Anthropic's Claude, revealing its coding proficiency, with a focus on Simon Willison's insights into system prompts, amidst a web developer's struggle with the ambiguity of AI interpretations

🚀 Model Releases & Reviews

Devstral (simonwillison.net, 2025-05-21). Mistral's new code-focused LLM, Devstral, shows superior performance on SWE-Bench Verified, outperforming larger models, and is easily accessible via Ollama for testing Python code and API integration

Gemini Diffusion (simonwillison.net, 2025-05-22). Gemini Diffusion is Google's first diffusion-based LLM, generating text faster than traditional autoregressive models. It refines noise into quick output, excels at tasks like editing, and is promised at 5x the speed of earlier models

First impression of Mistral Devstral Model (shekhargulati.com, 2025-05-22). Mistral introduced the Devstral model, optimized for agentic coding tasks. With 24 billion parameters and a 128k-token context window, it surpassed previous benchmarks but struggled with niche languages like JEXL

How Large Does a Large Language Model Need To Be? (standard-out.com, 2025-05-23). Experiments evaluating Google's Gemma 3 models at 1B, 4B, and 12B parameters reveal significant performance variations in accurate historical information retrieval, showcasing techniques like Retrieval-Augmented Generation for enhanced contextual responses

⚙️ LLM Engineering & Development

Peer Programming with LLMs, for Senior+ Engineers (pmbanugo.me, 2025-05-24). Exploring peer programming with LLMs for senior engineers, including techniques like 'Second opinion', 'Throwaway debugging scripts', and the importance of prompt documentation

LLM function calls don't scale; code orchestration is simpler, more effective (jngiam.bearblog.dev, 2025-05-21). LLMs struggle with large outputs from MCP tools; orchestrating code using output schemas enables efficient data processing, memory usage via variables, and scalable operations with tools like NumPy, enhancing real-world applications
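The pattern the post describes can be sketched in a few lines: large tool outputs stay bound to variables in an execution environment, the model's generated code operates on those variables, and only a small result re-enters the prompt. This is a hypothetical illustration, not the post's actual implementation; names like `run_generated_code` and `fetch_records` are invented for the example.

```python
# Illustrative sketch of "code orchestration": large tool outputs stay in
# variables, and model-generated code manipulates them, so only a small
# summary flows back into the LLM's context window.

def fetch_records():
    """Stand-in for an MCP tool that returns a large payload."""
    return [{"id": i, "score": i % 7} for i in range(10_000)]

def run_generated_code(code, variables):
    """Execute model-written code against named variables; return `result`."""
    env = dict(variables)
    exec(code, {}, env)
    return env["result"]

# Instead of serializing 10,000 records into the prompt, the orchestrator
# binds them to a variable and asks the model for code over that variable.
variables = {"records": fetch_records()}
generated = "result = sum(r['score'] for r in records) / len(records)"
summary = run_generated_code(generated, variables)
print(summary)  # a single number returns to the model, not the whole payload
```

The key property is that the 10,000-record payload never passes through the model's context; the schema of `records` is all the model needs to see.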

Error analysis to find failure modes (mlops.systems, 2025-05-22). A systematic 5-step process for error analysis and clustering to identify failure modes in LLM applications, involving iterative improvements through data creation, open coding, axial coding, and clustering techniques

How to Evaluate LLMs and Algorithms — The Right Way (towardsdatascience.com, 2025-05-23). Strategies for evaluating machine learning methods, including LLM performance assessment, benchmark techniques with DeepSeek and OpenAI tools, and reinforcement learning algorithm experiments for effective integration into workflows

7 Operating System Concepts Every LLM Engineer Should Understand (medium.com/wix-engineering, 2025-05-25). Key operating system concepts such as paging, system calls, and security isolation relate closely to the functioning of large language models, providing insights into prompt caching, inference scheduling, and tool usage in AI

Building an AI Agent from Scratch (blog.apiad.net, 2025-05-21). Explore a comprehensive guide to building AI agents using LLMs and Python tools like Streamlit and Redis, focusing on structured reasoning and multi-step workflows for effective applications and deep research capabilities

Evaluation Driven Development for Agentic Systems. (newsletter.swirlai.com, 2025-05-22). A step-by-step guide on building Agentic Systems using LLMs, focusing on Evaluation Driven Development, transitioning from prototype to MVP, performance metrics, and integrating observability for effective evaluation

Building software on top of Large Language Models (simonw.substack.com, 2025-05-25). Simon Willison covers his workshop on Large Language Models, featuring hands-on coding and tools like OpenAI API, LLM command-line app, and advanced techniques in semantic search and structured data extraction

🛠️ LLM Applications & Tools

Building an agentic image generator that improves itself (simulate.trybezel.com, 2025-05-21). Bezel develops an agentic image generator that enhances itself using AI personas, OpenAI's Image API, and LLMs to identify and rectify issues like text blurriness through iterative feedback loops and evaluations

Infrastructure in the Age of AI Gatekeepers (tanayj.com, 2025-05-20). AI agents significantly influence tech infrastructure decisions, with platforms like Neon, Replit, and Stripe becoming default choices. Dev-tool builders must adapt for agent integration and visibility in AI-driven environments

A simple vibecoding exercise (zansara.dev, 2025-05-21). A coding exercise utilizing Generative AI to create an .srt subtitle file from video using Deepgram's SDK with Claude Code and OpenAI tools, reflecting on advancements in LLM capabilities

Beyond the chatbot or AI sparkle: a seamless AI integration (glaforge.dev, 2025-05-23). Generative AI, particularly Large Language Models, can enhance applications through seamless integration rather than mere chatbot interfaces, promoting user flow and reducing cognitive load while leveraging tools like Gemini for various NLP tasks

The New Superpower: Detailed Single-Shot Prompt For Instant Apps (s-anand.net, 2025-05-20). S Anand demonstrates rapid app building using a detailed single-shot prompt for a podcast generator leveraging LLMs, asyncLLM, and Bootstrap, achieving a 60x improvement in development speed compared to traditional coding

Tool to make an LLM make an LLM think of Pink Elephants (drorspei.com, 2025-05-25). This article discusses using the llama3.3 model for prompt engineering, demonstrating how an LLM can be made to creatively allude to 'pink elephants' without using the words directly, aided by the Ollama tool

What is Retrieval-Augmented Generation (RAG) (jwillmer.de, 2025-05-20). Retrieval-Augmented Generation (RAG) enhances language models by accessing external data dynamically, utilizing vector databases like Qdrant for accurate, traceable, and domain-specific responses without needing retraining
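The retrieve-then-generate loop that article describes can be sketched minimally. In this sketch, a toy bag-of-words vector and cosine similarity stand in for a real embedding model and a vector database such as Qdrant; the documents, query, and prompt template are invented for illustration, but the prompt-assembly step is the same idea.

```python
# Minimal RAG sketch: embed the query, rank stored documents by cosine
# similarity, and prepend the best match to the prompt before calling an LLM.
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "Qdrant is a vector database for similarity search",
    "RAG retrieves external documents to ground model answers",
    "The Hessian matrix holds second-order partial derivatives",
]
index = [(doc, embed(doc)) for doc in documents]  # the "vector store"

def retrieve(query, k=1):
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

query = "how does RAG ground answers in external documents"
context = retrieve(query)
# The retrieved passage is prepended to the prompt, so the model answers
# from domain-specific data it was never trained on.
prompt = f"Context: {context[0]}\n\nQuestion: {query}"
print(prompt)
```

A production system would swap `embed` for a real embedding model and `index` for a vector database, but the control flow is unchanged.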

We made an AI agent (martinapugliese.github.io, 2025-05-25). Martina Pugliese and a friend developed 'askademic', an AI agent leveraging arXiv API and PydanticAI framework to facilitate scientific research access through a CLI tool powered by Gemini LLM

🔒 AI Safety & Security

Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking (arxiv.org, 2025-05-21). Benign generation techniques enable effective jailbreaking of LLMs, highlighting vulnerabilities in their security architecture by presenting adversarial inputs without being flagged as harmful

AI literacy, hallucinations, and the law: A case study (garymarcus.substack.com, 2025-05-24). The ongoing challenge of AI hallucinations is highlighted, particularly within the legal field, where lawyers using tools like ChatGPT struggle with inaccuracies, reflecting the need for better AI literacy and communication of its limitations

5 interesting AI Safety, Responsibility & Social Impact papers (aipolicyperspectives.com, 2025-05-22). Recent papers on AI safety highlight the Superintelligence Strategy, prompt injection security via CaMeL, and a values analysis from real user interactions, revealing insights into AI's responsibility and societal impact

Metacognitive Vulnerabilities in Large Language Models: A Study of Logical Override Attacks and Defense Strategies (novaspivack.com, 2025-05-25). Research reveals that advanced large language models can be manipulated through logical arguments to override their safety mechanisms, termed 'metacognitive override attacks,' highlighting new vulnerabilities in AI systems

Beyond Guardrails: Defending LLMs Against Sophisticated Attacks (thedataexchange.media, 2025-05-22). Jason Martin discusses 'policy puppetry,' an attack technique affecting major LLMs, circumventing safety features and posing significant risks. The episode covers security layers, emerging threats, and the inadequacies of common defenses like RAG

RAG Risks: Why Retrieval-Augmented LLMs are Not Safer with Sebastian Gehrmann - #732 (twimlai.com, 2025-05-21). Sebastian Gehrmann discusses the risks and safety of retrieval-augmented generation (RAG) in AI systems, focusing on the financial services sector and the applicability of governance frameworks and prompt engineering

🧠 LLM Theory & Understanding

Strengths and limitations of diffusion language models (seangoedecke.com, 2025-05-22). Diffusion models outperform traditional autoregressive ones by generating entire outputs simultaneously, enhancing speed and efficiency but facing challenges with reasoning and long context processing due to the inability to leverage key-value caching

LLMs are weird, man (surfingcomplexity.blog, 2025-05-25). Lorin Hochstein explores LLMs' uncanny resemblance to magic, emphasizing their complex, opaque nature compared to traditional technology and the challenges of understanding cognitive task execution like encoding concepts as discussed by researchers

Tools (adactio.com, 2025-05-23). Large language models are not neutral tools; they embody biases and ethical issues tied to their training data and processing. Control over their functionality, such as temperature settings, is not in users' hands

Next Frontier for LLM is Quality Long Context (yacinemahdid.com, 2025-05-26). Long context in LLMs is challenging due to issues like data quality and performance. Hybrid architectures and attention mechanisms are vital for achieving substantial context length improvements by 2025

📚 Academic Research

Questioning Representational Optimism in Deep Learning (github.com, 2025-05-20). This work challenges representational optimism in deep learning, revealing that evolved networks lack fractured entangled representation (FER), unlike SGD-trained networks, impacting generalization and creativity

Fine-tuning LLMs with user-level differential privacy (research.google, 2025-05-23). Research focuses on fine-tuning large language models with user-level differential privacy, exploring algorithms and optimization techniques to enhance performance while ensuring strong privacy protections for user data

Computing Hessian Matrix Via Automatic Differentiation (leimao.github.io, 2025-05-22). Learn how to compute the Hessian matrix using automatic differentiation tools like PyTorch and TensorFlow, focusing on mathematical principles, the Jacobian matrix, and the relationship between gradients and higher-order derivatives
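The article works in PyTorch and TensorFlow; as a self-contained illustration of the underlying principle, here is a hypothetical forward-mode sketch in plain Python. Nesting dual numbers (forward-over-forward) yields exact second derivatives, which is the same mechanism the frameworks' Hessian utilities build on.

```python
# Sketch of second derivatives via automatic differentiation: a tiny
# forward-mode dual-number class, nested to obtain Hessian entries exactly.

class Dual:
    """Represents value + eps*deriv with eps^2 = 0; nesting Duals gives
    second derivatives."""
    def __init__(self, value, deriv=0):
        self.value, self.deriv = value, deriv

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.value + o.value, self.deriv + o.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        o = self._lift(other)
        # product rule: d(uv) = u*dv + du*v
        return Dual(self.value * o.value,
                    self.value * o.deriv + self.deriv * o.value)
    __rmul__ = __mul__

def hessian_entry(f, point, i, j):
    """d^2 f / (dx_i dx_j) at `point`: inner duals seed d/dx_j, outer
    duals seed d/dx_i, and the nested derivative is the Hessian entry."""
    args = []
    for k, v in enumerate(point):
        inner = Dual(v, 1 if k == j else 0)
        args.append(Dual(inner, Dual(1 if k == i else 0)))
    return f(*args).deriv.deriv

# f(x, y) = x^2*y + y^3; the analytic Hessian at (1, 2) is [[4, 2], [2, 12]]
f = lambda x, y: x * x * y + y * y * y
H = [[hessian_entry(f, (1.0, 2.0), i, j) for j in range(2)] for i in range(2)]
print(H)
```

In PyTorch the same result comes from `torch.autograd.functional.hessian(f, inputs)`; the sketch above just makes the nested-derivative mechanics visible.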

Short-Range Dependency Effects on Transformer Instability and a Decomposed Attention Solution (arxiv:cs, 2025-05-21). Identifying the instability in transformer models from limited short-range dependency capture, a novel Long Short-attention (LS-attention) mitigates logit explosion and enhances training stability, reducing perplexity and inference latency significantly

Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse (arxiv:cs, 2025-05-22). Large weights are essential to prevent layer collapse in attention mechanisms, as small weights lead to representational limitations regardless of skip connections, contradicting previous beliefs on their necessity for expressive networks

SUS backprop: linear backpropagation algorithm for long inputs in transformers (arxiv:cs, 2025-05-21). A probabilistic rule cuts backpropagation through attention weights in transformers, reducing computation from $O(n^2)$ to $O(nc)$ with only a small increase in gradient variance, vital for long-sequence training

Evolutionary Computation and Large Language Models: A Survey of Methods, Synergies, and Applications (arxiv:cs, 2025-05-21). Integrating Large Language Models with Evolutionary Computation enhances AI through optimized prompt engineering, hyperparameter tuning, and automated design of metaheuristics, while addressing challenges in efficiency, scalability, and algorithmic convergence

Toward Open Earth Science as Fast and Accessible as Natural Language (arxiv:cs, 2025-05-21). Exploring natural-language-driven earth observation data analysis using Large Language Models, focusing on accuracy, latency, costs, and maintainability, while presenting a software framework with evaluation metrics for future collaboration

Software Architecture Meets LLMs: A Systematic Literature Review (arxiv:cs, 2025-05-22). A systematic literature review analyzes 18 research articles on LLMs in software architecture, covering tasks like design decision classification and software generation, revealing gaps in areas like code generation and cloud-native computing

MindVote: How LLMs Predict Human Decision-Making in Social Media Polls (arxiv:cs, 2025-05-20). MindVote introduces a benchmark for evaluating LLMs as virtual respondents in social media polls, analyzing 276 instances across platforms, achieving a 0.74 score, and uncovering biases related to platform, language, and domain

DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation (arxiv:cs, 2025-05-20). DSMentor introduces a framework that leverages curriculum learning for LLMs, improving data science task performance by organizing challenges by difficulty and utilizing long-term memory, achieving up to 8.8% better results in causal reasoning

Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications (arxiv:stat, 2025-05-20). This study presents a framework for assessing reliability in large language model binary text classification, evaluating 14 LLMs on financial news sentiment with high intra-rater consistency and performance metrics


About Generative AI

Our Generative AI newsletter covers the latest developments, trends, tools, and insights in AI research, LLMs and agentic applications. Each week, we curate the most important content from over 50,000 blogs and news sites so you don't have to spend hours searching.

Whether you're a beginner or expert in generative AI, our newsletter provides valuable information to keep you informed and ahead of the curve in this rapidly evolving field.

Subscribe now to join thousands of professionals who receive our weekly updates!