Machine Learning Engineer: 5th August 2025

Published 5th August 2025

🔧 Company Engineering Blogs

Covariate Selection in Causal Inference: Good and Bad Controls (booking.ai). Explores covariate selection in causal inference, discussing confounding, mediators, colliders, and biases impacting causal effect estimates using observational data

Solving Dispatch in a Ridesharing Problem Space (eng.lyft.com). Lyft's dispatch team tackles dynamic ridesharing challenges using graph theory and optimization algorithms for efficient rider-driver matching

MLE-STAR: A state-of-the-art machine learning engineering agent (research.google). MLE-STAR automates machine learning engineering tasks by leveraging web search, code refinement, and ensemble strategies, achieving high performance in Kaggle competitions

Multiagent AI for generating chain-of-thought training data (amazon.science). Ensembles of AI agents enhance chain-of-thought data generation, improving LLM performance by 29% across safety benchmarks while ensuring policy adherence

🔬 Scientific Applications & Research

An integrated single-nucleus and spatial transcriptomics atlas reveals the molecular landscape of the human hippocampus (lcolladotor.github.io). Integrated single-nucleus and spatial transcriptomics reveal the molecular landscape of the human hippocampus, highlighting neuronal cell types and spatial organization through NMF and label transfer techniques

AI Enters the Scientific Loop: Simulation, Integrity, and the Rise of Open Reasoning (firstprinciples.org). AI technologies like SmolLM3 and PhysiX are transforming scientific practices, raising issues around integrity, peer review, and the validity of simulation data

AI-Powered Disease Prevention Without Privacy and Security Trade-offs (transmitter.ieee.org). Federated learning aids dengue outbreak prediction while securing patient data, enabling better public health responses without compromising privacy and ethical standards

Using Generative AI in the Battle Against Invasive Plants (cmu.edu). CMU researchers enhance AI tools to detect invasive species like leafy spurge, improving ecological management with synthetic images and machine learning techniques

Neural population-based approaches have opened new windows into neural computations and behavior (thetransmitter.org). Neural manifolds enhance understanding of neural computations, enabling flexible behavior and revealing geometric patterns across individuals and species, utilizing tools like RATS and MARBLE

Is data advancing science at the cost of deeper insight? (news.stanford.edu). Grace Huckins discusses the balance between AI-driven advancements and deeper scientific understanding, highlighting issues in neuroscience and the role of big data tools like AlphaFold

⚙️ ML Engineering & Infrastructure

Run Python functions on K8s (valatka.dev). Deploy Python functions on Kubernetes using @kuberun for seamless integration with ML workflows, leveraging Docker and Kubernetes Pod specs

The Network Impact on Job Completion Time in AI Model Training (kentik.com). Network performance is crucial in large-scale AI training, impacting Job Completion Time (JCT) through factors like microbursts and GPU synchronization delays

Logging and registering models with MLflow (medium.com/marvelous-mlops). Explore MLflow's capabilities for logging and registering machine learning models, particularly with Databricks and scikit-learn pipelines

Graph Neural Networks at Scale: DGL with ROCm on AMD Hardware (rocm.blogs.amd.com). Explore DGL's role in enabling scalable graph neural networks on AMD's ROCm platform, enhancing performance across diverse AI applications

🔧 Algorithm Optimization & Model Performance

Optimizing training a GPT style Tokenizer with C++ (justinhj.github.io). Optimizes training of a GPT-style BPE tokenizer in C++, with experiments that reduce training time by 23x, guided by Andrej Karpathy's concepts

Evolving Integer Compression Algorithms with LLMs (mathieularose.com). Mathieu Larose optimizes integer compression algorithms using LLMs, producing a Go implementation surpassing several C methods for sorted 32-bit unsigned integers

When Models Stop Listening: How Feature Collapse Quietly Erodes Machine Learning Systems (towardsdatascience.com). Feature collapse makes ML systems rely on only a few inputs, producing brittle models and unnoticed failures. Detecting it is crucial for reliability

The Misconception of Retraining: Why Model Refresh Isn’t Always the Fix (towardsdatascience.com). Explores the pitfalls of routine model retraining in machine learning, arguing that diagnosing why performance declines matters more than data volume

Optimizing User Narratives for Foundation Models (building.nubank.com). Nubank uses hyperparameter search to optimize user narratives for AI models, improving data representation and model performance through efficient token usage

🤖 Advanced AI Systems & Research

ZKTorch: Open-Sourcing the First Universal ZKML Compiler for Real-World AI (medium.com/@danieldkang). ZKTorch is an open-source ZKML compiler for AI, enabling verifiable AI outputs without exposing proprietary data, supporting a range of machine learning models

Simulating large systems with Regression Language Models (research.google). Text-to-text regression using Regression Language Models (RLM) predicts performance metrics, optimizing resource allocation in Google’s Borg infrastructure

How Kimi RL’ed Qualitative Data to Write Better (dbreunig.com). Kimi K2 enhances qualitative writing using reinforcement learning, addressing challenges in qualitative scoring while demonstrating effective categorization techniques in AI

Postdoc & PhD positions at Queen Mary University of London (appliedtopology.org). Postdoc and PhD positions at Queen Mary University, focusing on Applied Topology and AI research using mathematical foundations

AMI not AGI? (languagelog.ldc.upenn.edu). Yann LeCun advocates for Advanced Machine Intelligence (AMI) over AGI, emphasizing self-supervised learning and world models through the V-JEPA 2 framework

🧮 Mathematical Foundations & Statistical Modeling

Coding Latent Discrete Parameters in Stan (yongfu.name). Marginalization technique in Stan for cognitive modeling of children's number understanding using the Give-N task, integrating Bayesian methods

Hierarchical Revenue & Retention Modeling (juanitorduz.github.io). Hierarchical revenue-retention modeling using Bayesian techniques across markets, leveraging JAX, NumPyro, and statistical data generation

The Box-Cox power exponential distribution (blog.djnavarro.net). Discussion on the Box-Cox power exponential distribution, GAMLSS models, and utilizing NHANES data for growth curve modeling in pharmacometrics

The Kepler Problem (Part 8) (johncarlosbaez.wordpress.com). Equivalence of the hydrogen atom and massless spin-½ particles in Einstein’s universe; implications of Dirac operators and their eigenvalues on the 3-sphere

Eigenvalues of the Laplacian on a square (johndcook.com). Exploring eigenvalues of the Laplacian on a square, including zero boundary conditions and Pólya's bounds on eigenvalue distribution

📚 Academic Research

Comparing Cluster-Based Cross-Validation Strategies for Machine Learning Model Evaluation (arxiv:cs). Investigates cluster-based cross-validation strategies, proposing Mini Batch K-Means with class stratification; analyzes bias, variance, and computational cost across diverse datasets

Your Spending Needs Attention: Modeling Financial Habits with Transformers (arxiv:cs). Transformers enhance financial predictive modeling by leveraging self-supervised learning on transaction data, improving customer behavior understanding at Nubank

MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation (arxiv:stat). MIBoost employs gradient boosting for variable selection after multiple imputation, enhancing prediction performance using a uniform mechanism across imputed datasets

Consistency of Feature Attribution in Deep Learning Architectures for Multi-Omics (arxiv:cs). Investigates SHAP in multi-omics deep learning for feature attribution, exploring robustness and consistency across architectures and initial weight settings

Multi-Task Learning 1997–2024: Part II Regularization and Optimization (hdsr.mitpress.mit.edu). Survey on multitask learning focusing on regularization methods, optimization strategies, and techniques for effective information sharing across tasks

Domain Generalization and Adaptation in Intensive Care with Anchor Regression (arxiv:stat). Causality-inspired domain generalization using anchor regression and boosting improves predictive model performance across diverse ICU settings, using data on 400,000 patients

Are Recommenders Self-Aware? Label-Free Recommendation Performance Estimation via Model Uncertainty (arxiv:cs). Investigates recommendation model self-awareness through probability-based List Distribution uncertainty (LiDu), linking uncertainty to performance in various datasets

Multi-Task Learning 1997–2024: Part III Applications (hdsr.mitpress.mit.edu). Explores Multi-Task Learning applications, deep learning evolution, pre-trained foundation models, and hybrid architectures for multimodal data handling

Exploration on Demand: From Algorithmic Control to User Empowerment (arxiv:cs). Adaptive clustering framework for movie recommendations enhancing personalization and diversity, utilizing sentence-transformer embeddings and user-controlled exploration, improving content discovery

Cluster-Based Random Forest Visualization and Interpretation (arxiv:cs). Visualization method for interpretability of random forests using clustered decision trees, featuring new distance metrics and visualizations like Feature Plot and Rule Plot

👋 Before you go

I've got a big favor to ask - keeping Blaze running isn't expensive, but it does all add up, so I'm asking readers like you to help, if you can.
That's why I'm launching a Patreon page! Nothing flashy, just a way for folks who find value in these newsletters to chip in a little each month. In return, you'll get:

Real say in how Blaze evolves — vote on new topics, features, topic curation ideas
First dibs on merch (details still cooking)
That warm fuzzy feeling knowing you're supporting something that saves you time and keeps you plugged into great tech writing

If you are getting value from Blaze, checking this out would mean the world. And if you can't contribute, no worries: the newsletters keep coming either way, and you can follow along on Patreon for free.
Thanks for reading and being part of this nerdy corner of the internet. All the best - Alastair.

About Machine Learning Engineer

Our Machine Learning Engineer newsletter covers the latest developments, research papers, tools, and techniques in ML engineering and deployment. Each week, we curate the most important content so you don't have to spend hours searching.

Whether you're a beginner or expert in machine learning engineering, our newsletter provides valuable information to keep you informed and ahead of the curve in this technically challenging field.