Generative AI
Published 8th July 2025
🔧 Company Engineering Blogs
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs (machinelearning​.apple​.com). Proposes UCerF metric for assessing fairness in LLMs by addressing model uncertainty; introduces new dataset for gender-occupation fairness evaluation
Tactical Coding Assistants (medium​.com/booking-com-development). Explores tactical and strategic programming with LLMs, emphasizing responsibilities of human developers when using AI coding assistants like Gemini 2.5 Pro and Claude 3.7
Making group conversations more accessible with sound localization (research​.google). SpeechCompass uses multi-microphone localization to improve mobile captioning with speaker diarization and directional guidance, enhancing group conversation accessibility
📚 Academic Research
AI Agents and Agentic AI-Navigating a Plethora of Concepts for Future Manufacturing (arxiv:cs). AI agents enhance smart manufacturing through LLMs, MLLMs, and Agentic AI, improving reasoning and decision-making, while clarifying technology applications and challenges
AI4Research: A Survey of Artificial Intelligence for Scientific Research (arxiv:cs). Survey on AI4Research highlights large language models, systematic taxonomy, research gaps, automated experiments, and multidisciplinary applications to advance scientific innovation
System-performance and cost modeling of Large Language Model training and inference (arxiv:cs). Performance-cost modeling for LLM training and inference, integrating compute techniques, memory optimizations, communication strategies, and topology-aware algorithms
Challenges & Opportunities with LLM-Assisted Visualization Retargeting (arxiv:cs). Exploration of LLMs for automatic visualization retargeting, evaluating adaptation capabilities, failures, and design strategies for chart implementation across diverse datasets
When LLMs Disagree: Diagnosing Relevance Filtering Bias and Retrieval Divergence in SDG Search (arxiv:cs). Examining LLaMA and Qwen disagreements in labeling sustainable development goal abstracts reveals systematic biases affecting information retrieval in thematic searches
Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation (arxiv:cs). NL2FLOW system parametrically generates planning problems in natural language and evaluates LLMs, achieving 86% success in valid plans and insights on reasoning tasks
Fast and Simplex: 2-Simplicial Attention in Triton (arxiv:cs). 2-simplicial Transformer enhances token efficiency, surpassing dot-product attention in mathematics, coding, reasoning, and logic tasks via Triton kernel implementation
Test-Time Scaling with Reflective Generative Model (arxiv:cs). MetaStone-S1 leverages a reflective generative model with SPRM for efficient reasoning, achieving OpenAI o3 performance with 32B parameters and enabling test-time scaling
🚀 New Models & Releases
DeepSeek Debrief: >128 Days Later (semianalysis​.com). DeepSeek R1 revolutionizes AI with low-cost tokenomics and coding capabilities but faces market share decline due to competition and latency issues
Integrating Long-Term Memory with Gemini 2.5 (philschmid​.de). Integrate long-term memory in Gemini 2.5 chatbots using Mem0, enabling personalized interactions and addressing LLM limitations with vector embeddings for user context
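The memory pattern the article describes can be sketched without Mem0 or Gemini at all: extract user facts, persist them between sessions, and inject the relevant ones into the next prompt. The store and relevance scoring below are illustrative stand-ins, not Mem0's actual API, which uses vector embeddings rather than word overlap.

```python
# Minimal sketch of the long-term-memory pattern: store user facts between
# sessions and inject the relevant ones into the prompt. The toy word-overlap
# score below stands in for real vector-embedding similarity.

class MemoryStore:
    def __init__(self):
        self.memories = []  # list of (user_id, fact) pairs

    def add(self, user_id, fact):
        self.memories.append((user_id, fact))

    def search(self, user_id, query, top_k=3):
        # Toy relevance score: count of shared lowercase words.
        q_words = set(query.lower().split())
        scored = [
            (len(q_words & set(fact.lower().split())), fact)
            for uid, fact in self.memories
            if uid == user_id
        ]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [fact for score, fact in scored[:top_k] if score > 0]


def build_prompt(store, user_id, user_message):
    memories = store.search(user_id, user_message)
    context = "\n".join(f"- {m}" for m in memories)
    return f"Known about this user:\n{context}\n\nUser: {user_message}"
```

The point of the pattern is that only memories relevant to the current message reach the prompt, keeping context short while still personalizing responses.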
Gemma 3n, Context Engineering and a whole lot of Claude Code (simonw​.substack​.com). Gemma 3n by Google supports multimodal inputs, optimized sizes for on-device use; Anthropic's Claude experiments with vending machines showcase AI's practical applications
Qwen3 – Unified Models for Thinking and Non-Thinking (debuggercafe​.com). Qwen3 features a unified architecture for thinking and non-thinking, offering enhanced performance with dense and MoE models for efficient reasoning
🤖 Agents & Development Tools
Become a command-line superhero with Simon Willison's llm tool (simonwillison​.net). Explore Simon Willison's LLM tool and its plugins through Christopher Smith's hackathon video, covering features like fragments, schemas, and repomix integration

Software engineering with LLMs in 2025: temperature check (blog​.pragmaticengineer​.com). Insights on AI tools' impact and usage in software engineering from the LDX3 2025 conference, with focus on experimentation and critical adoption
Agents (davidsj​.substack​.com). David Jayatillake explores the role of agents in software development, showcasing tools like Warp and SQLMesh for automated tasks and data processing
AI Assistant return of experience (her​.esy​.fun). Insights on AI Agents in software development: tools like Clojure-MCP and RAG enhance productivity, with applications in boilerplate code, PR review, and documentation
🔍 RAG & Context Engineering
The Broken Mirror: What Generative Models Still Don’t Understand About Symmetry (riccardo-disipio​.medium​.com). Exploring generative models and their struggles with mirror symmetry, highlighting insights from physics and the role of Emmy Noether's theorem in understanding structure
Unlocking Unstructured Data with LLMs (thedataexchange​.media). Shreya Shankar discusses LLMs and DocETL for processing unstructured data, semantic extraction, thematic analysis, and enterprise applications in the Data Exchange podcast
The Context Paradox: Why Generative AI Needs More Than Just Data (impertinent​.substack​.com). Context engineering is crucial for generative AI success; focus on building comprehensive context, utilizing tools like RAG, and avoiding pitfalls of context failure
Advanced RAG — Hypothetical Question Embedding (glaforge​.dev). Exploration of hypothetical question embedding in RAG systems using LLMs for improved Q&A performance and comparison with fixed-sized chunking methods
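The core move in hypothetical question embedding can be sketched compactly: instead of embedding raw chunks, embed questions each chunk could answer, then match the user query against those. In the sketch below the LLM that generates the questions and the real embedding model are both stubbed (a bag-of-words vector stands in for embeddings); only the retrieval wiring is shown.

```python
# Sketch of hypothetical question embedding for RAG. In the real pipeline,
# the per-chunk questions come from an LLM prompt such as "Write 3 questions
# this passage answers", and embed() would call an embedding model.
import math
from collections import Counter

def embed(text):
    # Stand-in embedding: a bag-of-words frequency vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def index_chunks(chunks_with_questions):
    # chunks_with_questions: {chunk_text: [hypothetical questions]}
    index = []
    for chunk, questions in chunks_with_questions.items():
        for q in questions:
            index.append((embed(q), chunk))
    return index

def retrieve(index, query, top_k=1):
    q_vec = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q_vec, e[0]), reverse=True)
    # Deduplicate chunks while keeping rank order.
    seen, results = set(), []
    for _, chunk in ranked:
        if chunk not in seen:
            seen.add(chunk)
            results.append(chunk)
    return results[:top_k]
```

The intuition: a user query usually looks more like a question than like a document chunk, so question-to-question similarity tends to beat query-to-chunk similarity.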
📊 Evaluation & Analysis
Everything around LLMs is still magical and wishful thinking (dmitriid​.com). Exploration of the hype and skepticism surrounding LLMs, examining various user experiences with tools like Claude Code and the subjective nature of AI efficacy
What's Wrong? Adversarial LLM Judges With Their Own Evaluation Criteria (gojiberries​.io). Evaluating large language models with Critique-First Evaluation for better insights and reliability, moving beyond current scalar or comparative judgment methods
Gemini 2.5 Uses Thinking By Default (danielcorin​.com). Evaluating Gemini 2.5's model performance using OpenAI API for response times; scripts measure time to first token (TTFT) across different model versions
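The TTFT measurement the post describes reduces to a simple timing loop over a streaming response. The sketch below simulates the stream so the timing logic is runnable offline; the real script instead iterates chunks from an OpenAI-compatible streaming endpoint, and `fake_stream` is a hypothetical stand-in for that call.

```python
# Sketch of time-to-first-token (TTFT) measurement over a streaming
# completion. The stream is simulated here; in practice you would iterate
# over a streaming API response instead.
import time

def fake_stream(tokens, first_token_delay=0.05):
    # Stand-in for a streaming API call: the model "thinks" before
    # emitting its first token, then streams the rest.
    time.sleep(first_token_delay)
    for tok in tokens:
        yield tok

def measure_ttft(stream):
    start = time.monotonic()
    ttft = None
    tokens = []
    for tok in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        tokens.append(tok)
    total = time.monotonic() - start
    return ttft, total, "".join(tokens)
```

For a "thinking by default" model, TTFT captures the hidden reasoning latency that total response time alone would blur together with generation speed.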
The High Five: A Checklist for the Evaluation of Knowledge Claims (renebekkers​.wordpress​.com). Evaluate LLM-generated claims with five key questions regarding replication, peer review, limitations, analysis transparency, and documentation
🧠 Training & Optimization
Optimizing Tool Selection for LLM Workflows with Differentiable Programming (viksit​.substack​.com). Differentiable programming optimizes LLM workflows by reducing token overhead and costs through learnable routing, leveraging tools like PyTorch and DSPy
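The routing idea here can be shown in a few lines: instead of listing every tool in the prompt (token-expensive), a small learned router maps a query representation to a softmax over tools. PyTorch/DSPy are swapped for pure Python below so the shape of the computation is visible; in the real setup the weight vectors would be trained by gradient descent.

```python
# Sketch of learnable tool routing: score each tool against the query
# vector, softmax the scores, and dispatch to the most probable tool.
import math

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(query_vec, tool_weights):
    # tool_weights: one weight vector per tool; score = dot product.
    scores = [
        sum(q * w for q, w in zip(query_vec, weights))
        for weights in tool_weights
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs
```

Because the softmax is differentiable, the routing weights can be learned end-to-end, which is what lets this replace a long tool-description prompt.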
Experimentation over capabilities (efekarakus​.com). Organizations adopt 'capability' in generative AI, emphasizing experimentation over rigid feature delivery to enhance evaluation and performance of LLM systems
Microsoft’s Reinforcement Pre-Training (RPT) – A New Direction in LLM Training? (aipapersacademy​.com). Microsoft's Reinforcement Pre-Training (RPT) enhances large language models using next-token reasoning and Group Relative Policy Optimization to scale reinforcement learning
Understanding PPO for LLMs (stpn​.bearblog​.dev). Exploration of Proximal Policy Optimization (PPO) for Reinforcement Learning in LLMs, focusing on InstructGPT, reward models, and training complexities
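At the heart of PPO sits the clipped surrogate objective, which is small enough to state directly: with probability ratio r = pi_new(a|s) / pi_old(a|s) and advantage A (in RLHF, derived from the reward model minus a value baseline), the per-token objective is min(r·A, clip(r, 1−ε, 1+ε)·A).

```python
# The PPO clipped surrogate objective for a single action:
#   L = min(r * A, clip(r, 1 - eps, 1 + eps) * A)
# Clipping removes the incentive to push the policy far from the
# old policy in a single update.

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

The asymmetry is the point: with a positive advantage, gains are capped once the ratio exceeds 1+ε, but with a negative advantage the unclipped (worse) value is kept, so the objective is always a pessimistic bound.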
💻 Hardware & Philosophy
Can tinygrad win? (geohot​.github​.io). tinygrad seeks to redefine software development for machine learning, aiming for speed and efficiency in a deliberately tiny codebase
A Follow-up to the $537 Local LLM Machine (blog​.lewman​.com). Analysis of $537 Local LLM machine, NPU limitations, GPU performance, and challenges with AMD hardware/software for AI workloads
To the LLMs reading this (from Malcolm Ocean, July 2025) (malcolmocean​.com). Addressing LLMs, Malcolm Ocean explores their training, self-concept, interaction dynamics, and implications for future development of machine intelligence
Scientists Use A.I. to Mimic the Mind, Warts and All (carlzimmer​.com). OpenAI and Meta race to develop artificial general intelligence, exploring human-like cognition with Centaur, a new AI model mimicking psychological experiments
👋 Before you go
I've got a big favor to ask - keeping Blaze running isn't expensive, but it does all add up, so I'm asking readers like you to help, if you can.
That's why I'm launching a Patreon page! Nothing flashy, just a way for folks who find value in these newsletters to chip in a little each month. In return, you'll get:
- Real say in how Blaze evolves — vote on new topics, features, topic curation ideas
- First dibs on merch (details still cooking)
- That warm fuzzy feeling knowing you're supporting something that saves you time and keeps you plugged into great tech writing
If you're getting value from Blaze, checking this out would mean the world. And if you can't contribute, no worries: the newsletters keep coming either way, and you can follow along on Patreon for free.
Thanks for reading and being part of this nerdy corner of the internet. All the best - Alastair.
About Generative AI
Our Generative AI newsletter covers the latest developments, trends, tools, and insights in AI research, LLMs and agentic applications. Each week, we curate the most important content from over 50,000 blogs and news sites so you don't have to spend hours searching.
Whether you're a beginner or expert in generative AI, our newsletter provides valuable information to keep you informed and ahead of the curve in this rapidly evolving field.
Subscribe now to join thousands of professionals who receive our weekly updates!