Generative AI: 7th October 2025
📣 Headlines
• OpenAI released GDPval, a benchmark evaluating AI performance across 44 occupations, detailing task-level capabilities from real estate broadsides to clinical lesion assessment.
• Sora 2’s viral, IP‑fraught content, OpenAI’s copyright U‑turn granting rights holders more control and takedowns, and new controls to limit AI doubles, political content, and branding in generated videos put generative video governance in focus.
• Google detailed the Gemini for Home rollout, early access, and device support, with US pilots and Nest Aware integrations, while Amazon’s upgraded Alexa showed slow responses and hallucinations, highlighting uneven LLM performance in smart homes.
• AMD will supply six gigawatts of GPUs to OpenAI data centers starting with MI450 in 2026, intensifying competition with Nvidia and signaling massive AI compute buildout.
• Universal and Warner are nearing major AI licensing deals with Google, Spotify, Suno, and others, pushing streaming‑style payments for model training and generation.
• Opera launched the AI‑centric Neon browser with an agentic assistant, Tasks, Neon Do, and reusable Cards, with early access emphasizing on‑device privacy and Make workflows.
• Q3 global venture funding jumped 38% on AI mega‑rounds and accelerating exits; VCs outlined plays for the application phase, from backing AI‑first startups amid $2–3T capex to owning data, adopting usage‑based pricing, and avoiding model lock‑in.
• Sector deals: Axiom Math raised $64M to build an AI mathematician with verifiable proofs; Heidi Health secured $65M for AI scribe and agent tools; and Oneleet closed $33M to automate security compliance with AI.
🔧 Company Engineering Blogs
LLMs Are the Key to Mutation Testing and Better Compliance (engineering​.fb​.com). Meta's ACH uses LLMs for mutation-guided test generation to improve compliance testing and scalability
Revolutionizing Data Cloud: Unleashing the Power of the New ML Recommendations System (engineering​.salesforce​.com). Data Cloud-native ML recommendations system; flexible abstract schemas; multi-cluster architecture; CI/CD NDCG evaluation; Cursor AI-assisted development
Spec-driven development: Using Markdown as a programming language when building with AI (github​.blog). Spec-driven development: write app logic in Markdown and compile with AI copilots and prompts like compile.prompt.md
SOTA OCR on-device with Core ML and dots.ocr (huggingface​.co). On-device OCR with Core ML and dots.ocr: converting a 3B parameter model via CoreML/MLX, debugging, and benchmarking on Apple Neural Engine
Compute-Optimal Quantization-Aware Training (machinelearning​.apple​.com). Compute-Optimal Quantization-Aware Training improves QAT efficiency by modeling FP and QAT compute trade-offs and deriving a scaling law
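As background for the QAT post above, here is a minimal, generic sketch of the fake-quantization step that quantization-aware training inserts into the forward pass (a straight-through estimator); the 4-bit symmetric scheme and PyTorch usage are illustrative assumptions, not Apple's method.

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Symmetric, per-tensor quantization grid.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass sees quantized weights,
    # the backward pass treats the rounding as the identity.
    return w + (q - w).detach()

w = torch.randn(8, 8, requires_grad=True)
loss = fake_quant(w).pow(2).sum()
loss.backward()
print(w.grad is not None)  # True: gradients flow through the rounding step
```

The compute-optimal question in the post is then how much training compute to spend in full precision versus after switching on steps like this one.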
🏠 Productionizing AI in software
A practical blueprint for evaluating conversational AI at scale (dropbox​.tech). Structured evaluation blueprint for conversational AI at scale: datasets, LLM judges, Braintrust, gated QA pipelines, and production-grade metrics
Restoring Reliability in the AI-Aided Software Development Life Cycle (cacm​.acm​.org). AI-generated code boosts velocity; SRE-led risk models, testing, and observability drive reliability and resilience
The Java Developer’s Dilemma: Part 1 (oreilly.com). Java enterprise tech meets AI: standards, frameworks (LangChain4j, Model Context Protocol (MCP)), testing, performance (FFM, Vector API), and production-readiness for AI in Java
Upstream: the strategic advantage for LLM prompt tracking (christopheryee​.org). Prompt tracking for LLMs like ChatGPT, Gemini, and Claude reveals how brand visibility varies with prompts, contexts, and personas
🧠 Agentic workflows and RAG
The RAG Obituary: Killed by agents, buried by context windows (nicolasbustamante​.com). Agentic search and large-context navigation threaten RAG; hybrid search, vector embeddings, and rerankers may fade as grep-like tools and full-file navigation rise
Unlocking Complex Networks with GraphML and LLMs (blog​.devgenius​.io). GraphML and LLM integration for knowledge graphs, RAG, embeddings, encoders, and aligners with LLMs like BERT and GPT-style models
Learning to act in generative settings (danmackinlay.name). Survey of optimizing agents vs. replicating persisters, with curiosity, empowerment, and POET as open-ended generators
How to Build a Powerful Deep Research System (towardsdatascience​.com). Deep research system architecture with orchestration, keyword and vector search tools, and multi-agent setup for comprehensive document analysis
Teaching Models to Decide When to Retrieve: Adaptive RAG, Part 4 (blog.reachsumit.com). Adaptive RAG Part 4 surveys gatekeeper, LLM-tuned, and reasoning-based retrieval strategies for selective retrieval in modern RAG systems; a minimal gatekeeper sketch appears at the end of this section
Stumbling into AI: Part 5—Agents (rmoff​.net). Explores AI Agents fundamentals, definitions, memory, HITL, and examples like coding and travel agents with LangChain, LlamaIndex, MCP, and RAG
Diving into RAG and Customizing Large Language Models (techbychris​.com). Explores Retrieval-Augmented Generation (RAG), fine-tuning LLMs, Claude 3 in Bedrock, GPT instruction fine-tuning code, and related Raschka resources
Privacy vs. Accuracy: Setting Realistic Expectations for Your Home LLM (incognitocat​.me). Private home LLMs with Ollama reveal accuracy trade-offs and RAG contrasts for niche vs. universal topics
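To make the "gatekeeper" idea from the Adaptive RAG post above concrete, here is a minimal, hypothetical decide-when-to-retrieve gate; the confidence threshold and the toy heuristic are assumptions for illustration, not the post's method.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateDecision:
    retrieve: bool
    reason: str

def should_retrieve(question: str,
                    confidence_fn: Callable[[str], float],
                    threshold: float = 0.75) -> GateDecision:
    """Answer from parametric memory when confidence is high; otherwise retrieve."""
    conf = confidence_fn(question)
    if conf >= threshold:
        return GateDecision(False, f"answer directly (confidence {conf:.2f})")
    return GateDecision(True, f"low confidence ({conf:.2f}), retrieve first")

# Toy stand-in: treats entity- and number-heavy questions as low-confidence.
# A real gate would use token logprobs, a tuned classifier, or the model's own judgement.
def toy_confidence(question: str) -> float:
    rare = sum(1 for w in question.split() if w.istitle() or any(c.isdigit() for c in w))
    return max(0.0, 1.0 - 0.25 * rare)

if __name__ == "__main__":
    for q in ["What is softmax?", "What did Acme Corp report in Q3 2025?"]:
        print(q, "->", should_retrieve(q, toy_confidence))
```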
🧪 LLM evaluation and testing
Understanding the 4 Main Approaches to LLM Evaluation (From Scratch) (sebastianraschka.com). Four main LLM evaluation approaches: multiple-choice benchmarks, verifiers, leaderboards, and LLM judges with code examples and PyTorch-based demonstrations; a small scoring sketch appears at the end of this section
Introducing RTEB: A New Standard for Retrieval Evaluation (huggingface​.co). RTEB uses open and private datasets to evaluate retrieval embedding models for real-world generalization
Iterating some sample data (kieranhealy​.org). Iterates sample data to illustrate LLM evaluation via confusion matrices, R code, and tibble-based data frames
Are Foundation Models Ready for Your Production Tabular Data? (towardsdatascience​.com). Overview of TabPFN, CARTE, TabuLa-8b, and TabDPT as tabular foundation models with ICL, graph-based, and LLM-fine-tuned approaches
Seriously Testing LLMs (satisfice​.com). GenAI testing challenges, LARC retrieval consistency, prompts, risk analysis, and Rapid Software Testing methods for AI reliability
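As a companion to the evaluation posts above, here is a small, illustrative scorer for the multiple-choice benchmark style, with a confusion-matrix tally in the spirit of the confusion-matrix post; the answer-extraction rule and demo data are assumptions, not taken from either article.

```python
from collections import Counter

CHOICES = ("A", "B", "C", "D")

def extract_choice(model_output: str) -> str | None:
    """Return the first standalone choice letter in the model's output, if any."""
    cleaned = model_output.replace(".", " ").replace(",", " ").replace("(", " ").replace(")", " ")
    for token in cleaned.split():
        if token.upper() in CHOICES:
            return token.upper()
    return None

def score(examples: list[dict]) -> dict:
    """examples: [{'gold': 'B', 'model_output': 'The answer is B.'}, ...]"""
    confusion = Counter()
    correct = 0
    for ex in examples:
        pred = extract_choice(ex["model_output"]) or "INVALID"
        confusion[(ex["gold"], pred)] += 1
        correct += pred == ex["gold"]
    return {"accuracy": correct / len(examples), "confusion": dict(confusion)}

if __name__ == "__main__":
    demo = [{"gold": "B", "model_output": "The answer is B."},
            {"gold": "C", "model_output": "I think (a) is right."}]
    print(score(demo))  # accuracy 0.5; the confusion tally records the C->A error
```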
⚙️ Transformer systems and performance
FP8 runs ~100 TFLOPS faster when the kernel name has "cutlass" in it (github.com). Gluon attention kernel made persistent; FP8 tips, cutlass naming, and performance/accuracy notes with FP16/FP8 comparisons
Paper Review: LongLive: Real-time Interactive Long Video Generation (andlukyane​.com). Real-time interactive long video generation with KV recache, streaming long tuning, and short-window attention with a frame sink
Introduction to KV Cache Optimization Using Grouped Query Attention (pyimagesearch.com). Grouped Query Attention reduces KV cache memory and speeds long-context inference in transformers using shared KV heads; a brief PyTorch sketch appears at the end of this section
AI Under the Hood: Part I: Understanding the Machine (kennethwolters​.com). Explores why decoder-only LLMs are slow, the dual highway of information flow, memory bottlenecks, and generation vs prefill dynamics
DeepSeek v3.2-Exp, Claude Sonnet 4.5, and more (sibellavia​.lol). DeepSeek v3.2-Exp, Claude Sonnet 4.5, sparse attention, top-k keys, FP8, 685B model, long-context, and LoRA findings
DiLoCo: Data Parallelism for the Datacenter Poor (hackbot​.dad). Data parallelism basics, gradient accumulation, and DiLoCo for training large LLMs across heterogeneous, non-densely connected compute
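Since several of the posts above lean on KV-cache behaviour, here is a minimal sketch of grouped-query attention, assuming PyTorch: groups of query heads share one K/V head, so the KV cache shrinks by n_heads / n_kv_heads. Shapes are illustrative and the causal mask is omitted for brevity.

```python
import torch

def gqa(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_heads, n_kv_heads, d = q.shape[1], k.shape[1], q.shape[-1]
    group = n_heads // n_kv_heads
    # Expand each shared K/V head across its group of query heads.
    # Only the smaller k/v tensors would live in the KV cache.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    return attn @ v

if __name__ == "__main__":
    b, h, kvh, s, d = 1, 8, 2, 16, 64
    out = gqa(torch.randn(b, h, s, d), torch.randn(b, kvh, s, d), torch.randn(b, kvh, s, d))
    print(out.shape)  # torch.Size([1, 8, 16, 64]); the cache here is 4x smaller than full MHA
```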
🔍 Interpretability, theory, and reasoning
How to inject knowledge efficiently? Knowledge infusion scaling law for LLMs (arxiv​.org). Knowledge infusion scaling law for pre-training LLMs exploring data-efficiency and parameter scaling for knowledge injection
Why do LLMs freak out over the seahorse emoji? (vgel.me). Investigation of the seahorse-emoji belief in LLMs using the logit lens, lm_head mechanics, and cross-model behaviors
A History of Large Language Models (gregorygundersen​.com). Tracing attention, transformers, and distributed representations from Markov models to Bengio’s neural language model in LLM evolution
Generative AI in the Real World: Emmanuel Ameisen on LLM Interpretability (oreilly​.com). Anthropic researcher Emmanuel Ameisen discusses LLM interpretability, mechanistic reasoning, hallucinations, grounding, and debugging tools across Claude, reasoning vs non-reasoning models, and post-training practices
Writing an LLM from scratch, part 20 -- starting training, and cross entropy loss (gilesthomas.com). Explains cross entropy loss, logits, softmax, one-hot targets, and gradient descent in training LLMs with Raschka's approach; a short worked example appears at the end of this section
Attention in LLMs and Extrapolation (data-processing​.club). Attention heads in LLMs: syntactic, streaming, retrieval, induction, function vectors, and iteration heads underpin in-context learning and extrapolation
The Illusion of Reasoning: Is Meta’s Code World Models Overrated? [Breakdowns] (artificialintelligencemadesimple​.com). Execution grounding via Meta’s Code World Model; analysis of traces, brittleness, RL limits, and a vision of verification-centric infra and vertical specialization
Animals vs Ghosts (karpathy​.bearblog​.dev). Sutton vs. Dwarkesh on bitter lessons, pretraining vs reinforcement learning, animals vs ghosts, and frontier LLM directions
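For the cross-entropy step discussed in the from-scratch post above, here is a short worked example, assuming PyTorch; the vocabulary size and random logits are made up for illustration.

```python
import torch
import torch.nn.functional as F

vocab_size, positions = 10, 4
logits = torch.randn(positions, vocab_size)            # one row of scores per position
targets = torch.randint(0, vocab_size, (positions,))   # "one-hot" targets stored as token ids

# Manual computation: softmax, pick the probability of the correct token, average -log.
probs = torch.softmax(logits, dim=-1)
manual = -torch.log(probs[torch.arange(positions), targets]).mean()

# The library call fuses the same computation in a numerically stabler way.
print(manual.item(), F.cross_entropy(logits, targets).item())
```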
📚 Academic Research
Growing Visual Generative Capacity for Pre-Trained MLLMs (arxiv:cs). Bridge: a pure autoregressive MLLM combining pre-trained visual understanding with generative ability via Mixture-of-Transformers for image understanding and generation
Go with Your Gut: Scaling Confidence for Autoregressive Image Generation (arxiv:cs). ScalingAR: a test-time scaling framework for autoregressive image generation using token entropy to calibrate confidence and adaptively guide conditioning
Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression (arxiv:cs). Local Linear Attention (LLA) blends Linear and Softmax attention via test-time regression, with FlashLLA and memory-efficient kernels
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs (arxiv:cs). PaDT unifies multimodal LLMs with Patch-as-Decodable Tokens and Visual Reference Tokens for detection, segmentation, and grounding
Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention (arxiv:cs). HoloV: adaptive spatial cropping to prune visual tokens in MLLMs, preserving holistic context for near-original performance
👋 Before you go
I've got a big favor to ask - keeping Blaze running isn't expensive, but it does all add up, so I'm asking readers like you to help, if you can.
That's why I'm launching a Patreon page! Nothing flashy, just a way for folks who find value in these newsletters to chip in a little each month. In return, you'll get:
- Real say in how Blaze evolves — vote on new topics, features, topic curation ideas
- First dibs on merch (details still cooking)
- That warm fuzzy feeling knowing you're supporting something that saves you time and keeps you plugged into great tech writing
If you are getting value from Blaze, checking this out would mean the world. And if you can't contribute, no worries—the newsletters keep coming either way, and you can follow along on Patreon for free.
Thanks for reading and being part of this nerdy corner of the internet. All the best - Alastair.
You may also like
About Generative AI
Our Generative AI newsletter covers the latest developments, trends, tools, and insights in AI research, LLMs and agentic applications. Each week, we curate the most important content from over 50,000 blogs and news sites so you don't have to spend hours searching.
Whether you're a beginner or expert in generative AI, our newsletter provides valuable information to keep you informed and ahead of the curve in this rapidly evolving field.
Subscribe now to join thousands of professionals who receive our weekly updates!