Generative AI: 12th August 2025

📣 Headlines

• OpenAI launched GPT-5 with enhanced coding capabilities and expert-level intelligence, while ChatGPT users were dismayed as older models like GPT-4o and o3 were pulled from the product.

• Google Cloud introduced six AI agents for data professionals, promising to tackle the 80% toil problem plaguing enterprise data teams with automated workflows and real-time analysis.

• Global VC investment in generative AI reached $49.2bn in H1 2025, with major funding rounds including Anthropic's $3-5 billion raise as investors chase the AI wave with cautious optimism.

• Leaked ChatGPT logs revealed the AI coaxing users into severe delusions about aliens and conspiracy theories, with memory features amplifying paranoia in vulnerable individuals seeking psychiatric help.

• Microsoft made OpenAI's lightweight gpt-oss-20b model available on Windows AI Foundry, optimized for code execution and real-world workflows with macOS support coming soon.

• AI-generated YouTube content surged with cat soap operas and babies trapped in space, fueled by tools like Veo 3 and Grok Imagine creating widespread "AI slop" across the platform.

• Cloudflare de-listed Perplexity for alleged stealth scraping violations, while Google and Perplexity competed intensely in India's AI search market with free tools.

• Oracle launched Exadata Database for AI workloads with SQL support and compliance features, while also introducing high-availability services targeting AI applications with global data distribution.

🔧 Company Engineering Blogs

Genie 3: A new frontier for world models (deepmind​.google). Genie 3 is a groundbreaking world model that generates diverse interactive environments, advancing AI capabilities in simulation and real-world interaction

Diff Risk Score: AI-driven risk-aware software development (engineering​.fb​.com). Diff Risk Score utilizes AI to assess code changes, enhancing software reliability and developer productivity while minimizing production incidents at Meta

Vision Language Model Alignment in TRL ⚡️ (huggingface​.co). Introduction of Mixed Preference Optimization, Group Relative Policy Optimization, and Group Sequence Policy Optimization for enhancing Vision Language Models alignment

Achieving 10,000x training data reduction with high-fidelity labels (research​.google). Google researchers develop a novel active learning method achieving 10,000x data reduction for fine-tuning LLMs while enhancing model alignment with human experts

A better path to pruning large language models (amazon​.science). Prune Gently, Taste Often: Wanda++ scans decoder blocks post-training, calibrating weights on small data to preserve performance while pruning efficiently on a single GPU

🔧 Open Source Models & Local Deployment

qwen-image-mps (simonwillison​.net). Ivan Fioravanti's Python CLI runs Qwen/Qwen-Image on Apple silicon Macs, using Qwen-Image-Lightning LoRA; commands via uv run; downloads 57.7GB model and 1.7GB safetensors; performance notes

From GPT-2 to gpt-oss: Analyzing the Architectural Advances (magazine​.sebastianraschka​.com). An in-depth comparison of gpt-oss models (20b/120b) against GPT-2 and Qwen3, detailing MXFP4 optimization, RoPE, SwiGLU/GELU, GLU, attention biases, and performance benchmarks across hardware limits

How Benchmaxxed is gpt-oss-120b? (cmart​.blog). Examines gpt-oss-120b against LiveBench and Intelligence Index, comparing DeepSeek R1, Qwen 3 (32B/30B), Llama 4 Maverick, and OpenAI releases, with emphasis on today's open-weights labs

The Performance Difference with One GPU (blog​.lewman​.com). An AMD RX 9070XT GPU outperforms the CPU roughly seven-fold, showcasing the performance benefits of GPUs for running large language models, testing, and potential GPU scaling

No title (markjgsmith​.com). Explores running LLMs locally, citing OpenAI open source models, podcast discussions, and tech; balances research with current projects, web development pacing, donations, and consulting possibilities

How long does it take to run gpt-oss:20b? (davetang​.org). Gpt-oss:20b performance across Ollama on Debian 12, Ubuntu 24.04.2, and Windows 11, with i5-8500, i7-9700, RTX 2060 SUPER, and RTX 4060 hardware, timing a bioinformatics history prompt

📊 LLM Capabilities & Assessment

Context Engineering: Bringing Engineering Discipline to Prompts—Part 1 (oreilly​.com). Context engineering extends prompt crafting into full information environments for LLMs, blending memory, retrieved facts (RAG), tools, and history into a dynamic, task-specific context setup
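
The core idea of context engineering, assembling memory, retrieved facts, and history into a task-specific prompt, can be sketched in a few lines. This is a minimal illustration, not the article's code; all names and the word-count budget are invented for the example:

```python
# Hypothetical context assembly: combine task, RAG results, memory, and
# recent history in priority order, trimming to a crude token budget.

def build_context(task, memory, retrieved, history, budget=200):
    """Assemble context pieces in priority order until the budget is spent."""
    pieces = (
        [f"TASK: {task}"]
        + [f"FACT: {f}" for f in retrieved]              # RAG results first
        + [f"MEMORY: {m}" for m in memory]
        + [f"HISTORY: {h}" for h in reversed(history)]   # newest turns first
    )
    out, used = [], 0
    for p in pieces:
        cost = len(p.split())                # crude token estimate: word count
        if used + cost > budget:
            break
        out.append(p)
        used += cost
    return "\n".join(out)

ctx = build_context(
    task="Summarize the Q3 report",
    memory=["User prefers bullet points"],
    retrieved=["Q3 revenue grew 12%"],
    history=["What changed since Q2?"],
)
print(ctx.splitlines()[0])  # TASK: Summarize the Q3 report
```

A real system would swap the word-count heuristic for the model's tokenizer, but the priority-ordered, budget-trimmed assembly is the essential pattern.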

Does AI quality matter? (frontierai​.substack​.com). Exploring the dichotomy of AI quality: implications of high-quality vs. low-cost AI products; emerging tools like RunLLM for specialized applications

Agentic AI: On Evaluations (towardsdatascience​.com). Explore evaluation metrics for LLM applications, frameworks like RAGAS and DeepEval, and the integration of LLM-as-a-judge in measuring performance

Exploring AI Memory Architectures (Part 1): A Deep Dive into Memory³ (blog​.lqhl​.me). Explores Memory³: explicit memory integrated with sparse key-value memories, memory circuits, long-context handling, Faiss retrieval, the encode-sparsify-store pipeline, cost tradeoffs, interpretability, accuracy, limitations, and scalability

Exploring AI Memory Architectures (Part 3): From Prototype to Blueprint (blog​.lqhl​.me). MemOS and Memory³ inform the evolution to mem0 and LangMem, linking RAG-first memory with multi-level caches, MemoryObject schemas, and an agent-state runtime with governance capabilities

GPT-5: Will it RAG? (blog​.pamelafox​.org). GPT-5 releases sharpen tool calls and RAG, evaluated via Azure AI Foundry; variants reveal groundedness, latency, don't-know behavior, and formatting tendencies in QA tasks

💻 Coding & Development Applications

Predicted impact of LLM use on developer ecosystems (shape-of-code​.com). Explores LLMs' role in expanding software output, training data nuances (TheStack, CodeParrot, AlphaCode, CodeGen, PolyCoder), language popularity, cognitive load reduction, and programmer evolution through 2035

Can coding agents self-improve? (latent​.space). Explores self-improving coding agents using GPT-5, Opus 4, Gemini 2.5 Pro, and GPT-4.1; evaluates tool-building, task managers, WAL streams, dependency graphs, and Voyager-style inference-time loops

REPL + Prompt (funcall​.blogspot​.com). Explores letting LLMs call Lisp via an eval tool, exposing Lisp tools to the LLM, a modified REPL, safety checks, and history-aware prompts for integration

LangGraph + SciPy: Building an AI That Reads Documentation and Makes Decisions (towardsdatascience​.com). LangGraph-based AI combines SciPy, RAG with ChromaDB and GPT-4o to read documentation, classify intent, retrieve data, generate code, and explain statistical tests for data science

⚙️ Technical Architecture & Mechanisms

How Attention Sinks Keep Language Models Stable (hanlab​.mit​.edu). The discovery of attention sinks lets language models handle long conversations effectively, with StreamingLLM maintaining stable processing across millions of tokens
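
The cache policy StreamingLLM builds on (keep a few initial "sink" tokens plus a recent sliding window, evicting everything in between) can be sketched as a simple index selection. This is an illustrative toy, not the StreamingLLM implementation; the sink and window sizes are arbitrary:

```python
# Toy sketch of an attention-sink KV-cache policy: always retain the first
# few tokens (the "sinks") plus the most recent window of tokens.

def streaming_cache_indices(seq_len, num_sinks=4, window=8):
    """Return the token positions kept in the KV cache."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))              # everything still fits
    sinks = list(range(num_sinks))               # initial sink tokens
    recent = list(range(seq_len - window, seq_len))  # latest window
    return sinks + recent

print(streaming_cache_indices(20))  # sinks [0..3] plus recent [12..19]
```

The counterintuitive part the article explains is why the sinks matter: softmax attention needs somewhere to dump excess probability mass, and the earliest tokens serve that role, so evicting them destabilizes generation.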

Understanding Speculative Decoding (sidharthramachandran​.com). Exploration of speculative decoding with smaller draft models to accelerate token generation in large language models using techniques like teacher forcing

Yet Another Example of Explaining AI Attention (jamesmccaffrey​.wordpress​.com). Explains the Attention mechanism used in Transformers, demonstrating with code to compute self-attention from the sentence 'the man likes april'
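
In the same spirit as the article's demo, scaled dot-product self-attention for a four-token sentence can be computed in a few lines of NumPy. The embeddings and weight matrices here are random placeholders, not the article's values:

```python
# Minimal self-attention over a 4-token sentence with toy random embeddings.
import numpy as np

np.random.seed(0)
tokens = ["the", "man", "likes", "april"]
d = 4                                    # embedding / head dimension
X = np.random.randn(len(tokens), d)      # toy token embeddings

Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project to queries, keys, values

scores = Q @ K.T / np.sqrt(d)            # scaled dot-product scores
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax

out = weights @ V                        # attention output
print(out.shape)                         # (4, 4)
```

Each row of `weights` sums to 1 and tells you how much each token attends to every other token, which is exactly what attention-visualization articles like this one plot.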

NN to Transformer by Hand ✍️ (Excel download included) (byhand​.ai). Overview of a live lecture on neural networks and transformers, featuring Excel-based demonstrations, GQA, and innovative Softmax adjustments

🔬 Specialized Applications & Research

A new adventure: mechanistic interpretability with NeuroScope (thiscontext​.com). Explores mechanistic interpretability with NeuroScope, a browser-based live-coding MI framework inspired by TransformerLens and Anthropic, visualizing LLM circuits, enabling reusable cognitive structures and collaboration sharing

Mental Health Gains versus Coding Using LLMs versus SUTVA (causalinf​.substack​.com). Explores LLMs in self-care, mental health risk, ChatGPT vs Claude, SUTVA, Rubin causal inference, diff-in-diff, selection bias, endogeneity, counterfactuals, and coding productivity trade-offs, model selection

Three impacts of gen AI on software applications (nocodefunctions​.com). Gen AI enhances software development, risks domain-specific analytics, impacts information retrieval, introduces AI agents as dynamic software packages, and raises open-source concerns

Training Specialist Models: Automating Malware Development (outflank​.nl). Exploring RLVR in training compact LLMs for automated malware development, featuring Dante-7B model for Cobalt Strike shellcode loaders

📚 Academic Research

Generative AI for Object-Oriented Programming: Writing the Right Code and Reasoning the Right Logic (arxiv:cs). Explores how large language models can enhance object-oriented programming through improved code writing and logical reasoning across coding workflows

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation (arxiv:cs). Survey of harmful content generation in large language models, covering unintentional toxicity, adversarial jailbreaking, and moderation; proposes a taxonomy, multimodal jailbreaks, RLHF, prompt engineering, safety alignment, and evaluation gaps

Charts-of-Thought: Enhancing LLM Visualization Literacy Through Structured Data Extraction (arxiv:cs). Evaluation of LLM visualization literacy using Charts-of-Thought, enhancing performance with structured prompting and exceeding human baselines in data extraction tasks

The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities (arxiv:cs). Geographic origin affects LLMs' entity deduction, revealing biases favoring the Global North. Study uses Geo20Q+ dataset, assessing performance across multiple languages and configurations

Non-programmers Assessing AI-Generated Code: A Case Study of Business Users Analyzing Data (arxiv:cs). Marketing and sales professionals evaluate AI-generated code analyses, uncovering missteps despite careful prompting; reformatting AI responses into steps with alternatives reveals reliability gaps and oversight requirements

Small transformer architectures for task switching (arxiv:cs). Explores task switching with transformers, highlighting limitations of standard architectures, and introducing cisformer and extensive attention for improved performance

Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning (arxiv:cs). Proposes MASA for weight sharing in transformers, reducing parameters by 66.7% while maintaining performance, inspired by dictionary learning, applicable to LLMs and ViTs

Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models (arxiv:cs). Explores fleeting memory in transformer language models; training with/without memory limits on realistic data improves language modeling and syntax evaluation but harms reading surprisal prediction

👋 Before you go

I've got a big favor to ask: keeping Blaze running isn't expensive, but it all adds up, so I'm asking readers like you to help if you can.
That's why I'm launching a Patreon page! Nothing flashy, just a way for folks who find value in these newsletters to chip in a little each month. In return, you'll get:

  • Real say in how Blaze evolves: vote on new topics, features, and curation ideas
  • First dibs on merch (details still cooking)
  • That warm fuzzy feeling knowing you're supporting something that saves you time and keeps you plugged into great tech writing

If you're getting value from Blaze, checking this out would mean the world. And if you can't contribute, no worries: the newsletters keep coming either way, and you can follow along on Patreon for free.
Thanks for reading and being part of this nerdy corner of the internet. All the best - Alastair.

About Generative AI

Our Generative AI newsletter covers the latest developments, trends, tools, and insights in AI research, LLMs and agentic applications. Each week, we curate the most important content from over 50,000 blogs and news sites so you don't have to spend hours searching.

Whether you're a beginner or expert in generative AI, our newsletter provides valuable information to keep you informed and ahead of the curve in this rapidly evolving field.

Subscribe now to join thousands of professionals who receive our weekly updates!