Generative AI: 7th October 2025
📣 Headlines
• OpenAI released GDPval, a benchmark evaluating AI performance across 44 occupations, detailing task-level capabilities from real estate broadsides to clinical lesion assessment.
• Sora 2’s viral, IP‑fraught content, OpenAI’s copyright U‑turn granting rights holders more control and takedowns, and new controls to limit AI doubles, political content, and branding in generated videos put generative video governance in focus.
• Google detailed the Gemini for Home rollout, early access, and device support, with US pilots and Nest Aware integrations, while Amazon’s upgraded Alexa showed slow responses and hallucinations, highlighting uneven LLM performance in smart homes.
• AMD will supply six gigawatts of GPUs to OpenAI data centers starting with MI450 in 2026, intensifying competition with Nvidia and signaling massive AI compute buildout.
• Universal and Warner are nearing major AI licensing deals with Google, Spotify, Suno, and others, pushing streaming‑style payments for model training and generation.
• Opera launched the AI‑centric Neon browser with an agentic assistant, Tasks, Neon Do, and reusable Cards, with early access emphasizing on‑device privacy and Make workflows.
• Q3 global venture funding jumped 38% on AI mega‑rounds and accelerating exits; VCs outlined plays for the application phase, from backing AI‑first startups amid $2–3T capex to owning data, adopting usage‑based pricing, and avoiding model lock‑in.
• Sector deals: Axiom Math raised $64M to build an AI mathematician with verifiable proofs; Heidi Health secured $65M for AI scribe and agent tools; and Oneleet closed $33M to automate security compliance with AI.
🔧 Company Engineering Blogs
LLMs Are the Key to Mutation Testing and Better Compliance (engineering​.fb​.com). Meta's ACH uses LLMs for mutation-guided test generation to improve compliance testing and scalability
Revolutionizing Data Cloud: Unleashing the Power of the New ML Recommendations System (engineering​.salesforce​.com). Data Cloud-native ML recommendations system; flexible abstract schemas; multi-cluster architecture; CI/CD NDCG evaluation; Cursor AI-assisted development
Spec-driven development: Using Markdown as a programming language when building with AI (github​.blog). Spec-driven development: write app logic in Markdown and compile with AI copilots and prompts like compile.prompt.md
SOTA OCR on-device with Core ML and dots.ocr (huggingface​.co). On-device OCR with Core ML and dots.ocr: converting a 3B parameter model via CoreML/MLX, debugging, and benchmarking on Apple Neural Engine
Compute-Optimal Quantization-Aware Training (machinelearning​.apple​.com). Compute-Optimal Quantization-Aware Training improves QAT efficiency by modeling FP and QAT compute trade-offs and deriving a scaling law
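As background for the QAT post above, here is a minimal, generic sketch of the fake-quantization step that quantization-aware training inserts into the forward pass (a straight-through estimator); the 4-bit symmetric scheme and PyTorch usage are illustrative assumptions, not Apple's method.

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Symmetric, per-tensor quantization grid.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass sees quantized weights,
    # the backward pass treats the rounding as the identity.
    return w + (q - w).detach()

w = torch.randn(8, 8, requires_grad=True)
loss = fake_quant(w).pow(2).sum()
loss.backward()
print(w.grad is not None)  # True: gradients flow through the rounding step
```

The compute-optimal question in the post is then how much training compute to spend in full precision versus after switching on steps like this one.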
🏠 Productionizing AI in software
A practical blueprint for evaluating conversational AI at scale (dropbox​.tech). Structured evaluation blueprint for conversational AI at scale: datasets, LLM judges, Braintrust, gated QA pipelines, and production-grade metrics
Restoring Reliability in the AI-Aided Software Development Life Cycle (cacm​.acm​.org). AI-generated code boosts velocity; SRE-led risk models, testing, and observability drive reliability and resilience
The Java Developer’s Dilemma: Part 1 (oreilly.com). Java enterprise tech meets AI: standards, frameworks (LangChain4j, Model Context Protocol (MCP)), testing, performance (FFM, Vector API), and production-readiness for AI in Java
Upstream: the strategic advantage for LLM prompt tracking (christopheryee​.org). Prompt tracking for LLMs like ChatGPT, Gemini, and Claude reveals how brand visibility varies with prompts, contexts, and personas
🧠 Agentic workflows and RAG
The RAG Obituary: Killed by agents, buried by context windows (nicolasbustamante​.com). Agentic search and large-context navigation threaten RAG; hybrid search, vector embeddings, and rerankers may fade as grep-like tools and full-file navigation rise
Unlocking Complex Networks with GraphML and LLMs (blog​.devgenius​.io). GraphML and LLM integration for knowledge graphs, RAG, embeddings, encoders, and aligners with LLMs like BERT and GPT-style models
Learning to act in generative settings (danmackinlay.name). Survey of optimizing agents vs. replicating persisters, with curiosity, empowerment, and POET as open-ended generators
How to Build a Powerful Deep Research System (towardsdatascience​.com). Deep research system architecture with orchestration, keyword and vector search tools, and multi-agent setup for comprehensive document analysis
Teaching Models to Decide When to Retrieve: Adaptive RAG, Part 4 (blog.reachsumit.com). Adaptive RAG Part 4 surveys gatekeeper, LLM-tuned, and reasoning-based retrieval strategies for selective retrieval in modern RAG systems; a minimal gatekeeper sketch appears at the end of this section
Stumbling into AI: Part 5—Agents (rmoff​.net). Explores AI Agents fundamentals, definitions, memory, HITL, and examples like coding and travel agents with LangChain, LlamaIndex, MCP, and RAG
Diving into RAG and Customizing Large Language Models (techbychris​.com). Explores Retrieval-Augmented Generation (RAG), fine-tuning LLMs, Claude 3 in Bedrock, GPT instruction fine-tuning code, and related Raschka resources
Privacy vs. Accuracy: Setting Realistic Expectations for Your Home LLM (incognitocat​.me). Private home LLMs with Ollama reveal accuracy trade-offs and RAG contrasts for niche vs. universal topics
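To make the "gatekeeper" idea from the Adaptive RAG post above concrete, here is a minimal, hypothetical decide-when-to-retrieve gate; the confidence threshold and the toy heuristic are assumptions for illustration, not the post's method.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateDecision:
    retrieve: bool
    reason: str

def should_retrieve(question: str,
                    confidence_fn: Callable[[str], float],
                    threshold: float = 0.75) -> GateDecision:
    """Answer from parametric memory when confidence is high; otherwise retrieve."""
    conf = confidence_fn(question)
    if conf >= threshold:
        return GateDecision(False, f"answer directly (confidence {conf:.2f})")
    return GateDecision(True, f"low confidence ({conf:.2f}), retrieve first")

# Toy stand-in: treats entity- and number-heavy questions as low-confidence.
# A real gate would use token logprobs, a tuned classifier, or the model's own judgement.
def toy_confidence(question: str) -> float:
    rare = sum(1 for w in question.split() if w.istitle() or any(c.isdigit() for c in w))
    return max(0.0, 1.0 - 0.25 * rare)

if __name__ == "__main__":
    for q in ["What is softmax?", "What did Acme Corp report in Q3 2025?"]:
        print(q, "->", should_retrieve(q, toy_confidence))
```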
🧪 LLM evaluation and testing
Understanding the 4 Main Approaches to LLM Evaluation (From Scratch) (sebastianraschka.com). Four main LLM evaluation approaches: multiple-choice benchmarks, verifiers, leaderboards, and LLM judges with code examples and PyTorch-based demonstrations; a small scoring sketch appears at the end of this section
Introducing RTEB: A New Standard for Retrieval Evaluation (huggingface​.co). RTEB uses open and private datasets to evaluate retrieval embedding models for real-world generalization
Iterating some sample data (kieranhealy​.org). Iterates sample data to illustrate LLM evaluation via confusion matrices, R code, and tibble-based data frames
Are Foundation Models Ready for Your Production Tabular Data? (towardsdatascience​.com). Overview of TabPFN, CARTE, TabuLa-8b, and TabDPT as tabular foundation models with ICL, graph-based, and LLM-fine-tuned approaches
Seriously Testing LLMs (satisfice​.com). GenAI testing challenges, LARC retrieval consistency, prompts, risk analysis, and Rapid Software Testing methods for AI reliability
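As a companion to the evaluation posts above, here is a small, illustrative scorer for the multiple-choice benchmark style, with a confusion-matrix tally in the spirit of the confusion-matrix post; the answer-extraction rule and demo data are assumptions, not taken from either article.

```python
from collections import Counter

CHOICES = ("A", "B", "C", "D")

def extract_choice(model_output: str) -> str | None:
    """Return the first standalone choice letter in the model's output, if any."""
    cleaned = model_output.replace(".", " ").replace(",", " ").replace("(", " ").replace(")", " ")
    for token in cleaned.split():
        if token.upper() in CHOICES:
            return token.upper()
    return None

def score(examples: list[dict]) -> dict:
    """examples: [{'gold': 'B', 'model_output': 'The answer is B.'}, ...]"""
    confusion = Counter()
    correct = 0
    for ex in examples:
        pred = extract_choice(ex["model_output"]) or "INVALID"
        confusion[(ex["gold"], pred)] += 1
        correct += pred == ex["gold"]
    return {"accuracy": correct / len(examples), "confusion": dict(confusion)}

if __name__ == "__main__":
    demo = [{"gold": "B", "model_output": "The answer is B."},
            {"gold": "C", "model_output": "I think (a) is right."}]
    print(score(demo))  # accuracy 0.5; the confusion tally records the C->A error
```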
⚙️ Transformer systems and performance
FP8 runs ~100 TFLOPS faster when the kernel name has "cutlass" in it (github.com). Gluon attention kernel made persistent; FP8 tips, cutlass naming, and performance/accuracy notes with FP16/FP8 comparisons
Paper Review: LongLive: Real-time Interactive Long Video Generation (andlukyane​.com). Real-time interactive long video generation with KV recache, streaming long tuning, and short-window attention with a frame sink
Introduction to KV Cache Optimization Using Grouped Query Attention (pyimagesearch.com). Grouped Query Attention reduces KV cache memory and speeds long-context inference in transformers using shared KV heads; a brief PyTorch sketch appears at the end of this section
AI Under the Hood: Part I: Understanding the Machine (kennethwolters​.com). Explores why decoder-only LLMs are slow, the dual highway of information flow, memory bottlenecks, and generation vs prefill dynamics
DeepSeek v3.2-Exp, Claude Sonnet 4.5, and more (sibellavia​.lol). DeepSeek v3.2-Exp, Claude Sonnet 4.5, sparse attention, top-k keys, FP8, 685B model, long-context, and LoRA findings
DiLoCo: Data Parallelism for the Datacenter Poor (hackbot​.dad). Data parallelism basics, gradient accumulation, and DiLoCo for training large LLMs across heterogeneous, non-densely connected compute
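Since several of the posts above lean on KV-cache behaviour, here is a minimal sketch of grouped-query attention, assuming PyTorch: groups of query heads share one K/V head, so the KV cache shrinks by n_heads / n_kv_heads. Shapes are illustrative and the causal mask is omitted for brevity.

```python
import torch

def gqa(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_heads, n_kv_heads, d = q.shape[1], k.shape[1], q.shape[-1]
    group = n_heads // n_kv_heads
    # Expand each shared K/V head across its group of query heads.
    # Only the smaller k/v tensors would live in the KV cache.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    return attn @ v

if __name__ == "__main__":
    b, h, kvh, s, d = 1, 8, 2, 16, 64
    out = gqa(torch.randn(b, h, s, d), torch.randn(b, kvh, s, d), torch.randn(b, kvh, s, d))
    print(out.shape)  # torch.Size([1, 8, 16, 64]); the cache here is 4x smaller than full MHA
```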
🔍 Interpretability, theory, and reasoning
How to inject knowledge efficiently? Knowledge infusion scaling law for LLMs (arxiv​.org). Knowledge infusion scaling law for pre-training LLMs exploring data-efficiency and parameter scaling for knowledge injection
Why do LLMs freak out over the seahorse emoji? (vgel.me). Investigation of the seahorse-emoji belief in LLMs using the logit lens, lm_head mechanics, and cross-model behaviors
A History of Large Language Models (gregorygundersen​.com). Tracing attention, transformers, and distributed representations from Markov models to Bengio’s neural language model in LLM evolution
Generative AI in the Real World: Emmanuel Ameisen on LLM Interpretability (oreilly​.com). Anthropic researcher Emmanuel Ameisen discusses LLM interpretability, mechanistic reasoning, hallucinations, grounding, and debugging tools across Claude, reasoning vs non-reasoning models, and post-training practices
Writing an LLM from scratch, part 20 -- starting training, and cross entropy loss (gilesthomas.com). Explains cross entropy loss, logits, softmax, one-hot targets, and gradient descent in training LLMs with Raschka's approach; a short worked example appears at the end of this section
Attention in LLMs and Extrapolation (data-processing​.club). Attention heads in LLMs: syntactic, streaming, retrieval, induction, function vectors, and iteration heads underpin in-context learning and extrapolation
The Illusion of Reasoning: Is Meta’s Code World Models Overrated? [Breakdowns] (artificialintelligencemadesimple​.com). Execution grounding via Meta’s Code World Model; analysis of traces, brittleness, RL limits, and a vision of verification-centric infra and vertical specialization
Animals vs Ghosts (karpathy​.bearblog​.dev). Sutton vs. Dwarkesh on bitter lessons, pretraining vs reinforcement learning, animals vs ghosts, and frontier LLM directions
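For the cross-entropy step discussed in the from-scratch post above, here is a short worked example, assuming PyTorch; the vocabulary size and random logits are made up for illustration.

```python
import torch
import torch.nn.functional as F

vocab_size, positions = 10, 4
logits = torch.randn(positions, vocab_size)            # one row of scores per position
targets = torch.randint(0, vocab_size, (positions,))   # "one-hot" targets stored as token ids

# Manual computation: softmax, pick the probability of the correct token, average -log.
probs = torch.softmax(logits, dim=-1)
manual = -torch.log(probs[torch.arange(positions), targets]).mean()

# The library call fuses the same computation in a numerically stabler way.
print(manual.item(), F.cross_entropy(logits, targets).item())
```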
📚 Academic Research
Growing Visual Generative Capacity for Pre-Trained MLLMs (arxiv:cs). Bridge: a pure autoregressive MLLM combining pre-trained visual understanding with generative ability via Mixture-of-Transformers for image understanding and generation
Go with Your Gut: Scaling Confidence for Autoregressive Image Generation (arxiv:cs). ScalingAR: a test-time scaling framework for autoregressive image generation using token entropy to calibrate confidence and adaptively guide conditioning
Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression (arxiv:cs). Local Linear Attention (LLA) blends Linear and Softmax attention via test-time regression, with FlashLLA and memory-efficient kernels
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs (arxiv:cs). PaDT unifies multimodal LLMs with Patch-as-Decodable Tokens and Visual Reference Tokens for detection, segmentation, and grounding
Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention (arxiv:cs). HoloV: adaptive spatial cropping to prune visual tokens in MLLMs, preserving holistic context for near-original performance
👋 Before you go
I've got a big favor to ask - keeping Blaze running isn't expensive, but it does all add up, so I'm asking readers like you to help, if you can.
That's why I'm launching a Patreon page! Nothing flashy, just a way for folks who find value in these newsletters to chip in a little each month. In return, you'll get:
- Real say in how Blaze evolves — vote on new topics, features, topic curation ideas
- First dibs on merch (details still cooking)
- That warm fuzzy feeling knowing you're supporting something that saves you time and keeps you plugged into great tech writing
If you are getting value from Blaze, checking this out would mean the world. And if you can't contribute, no worries—the newsletters keep coming either way, and you can follow along on Patreon for free.
Thanks for reading and being part of this nerdy corner of the internet. All the best - Alastair.
You may also like
About Generative AI
Our Generative AI newsletter covers the latest developments, trends, tools, and insights in AI research, LLMs and agentic applications. Each week, we curate the most important content from over 50,000 blogs and news sites so you don't have to spend hours searching.
Whether you're a beginner or expert in generative AI, our newsletter provides valuable information to keep you informed and ahead of the curve in this rapidly evolving field.
Subscribe now to join thousands of professionals who receive our weekly updates!