Machine Learning Engineer
Tuesday 15th April, 2025
⚙️ ML Systems & Applications
Software Project Pieces Broken Bits Back Together (hackaday.com, 2025-04-13). GARF (Generalizable 3D reAssembly for Real-world Fractures) uses machine learning to reassemble broken objects from their fragments, coping with missing pieces and irregular fracture edges
Building Rock-Solid ML Systems (bytes.swiggy.com, 2025-04-11). Swiggy applies machine learning to customer service and keeps its models reliable and transparent through exploratory data analysis, sensitivity analysis, explainable AI with SHAP, and consistent coding standards
System Design Components and Trade Offs (asun9.com, 2025-04-10). Explore system design components and trade-offs, focusing on ML system considerations, PostgreSQL and Elasticsearch integration, and comparing data processing tools like Kubeflow, Airflow, and Dataflow in large-scale applications
From Entities to Alphas: Launching the Python Version of the Equities Entity Store (jonathankinlay.com, 2025-04-13). The Python version of the Equities Entity Store enhances quantitative finance workflows with over 1,400 features for 7,500 stocks, integrating lookahead bias protection and efficient portfolio construction using LambdaMART and LightGBM
A Scalable Approach to Clustering Embedding Projections (fredhohman.com, 2025-04-14). Efficient clustering of embedding projections using kernel density estimation in 2D space allows for rapid labeling and summarization, significantly reducing computational costs compared to traditional methods
Scikit-learn ought to have robust model persistence (jcbsv.net, 2025-04-11). Scikit-learn's model persistence relies on pickling or the ONNX format, which causes compatibility issues across versions; the author argues for a robust persistence function that stores model parameters as JSON
The ExoLabel Post: Clustering Massive Networks with Limited Resources (ahl27.com, 2025-04-11). ExoLabel is an algorithm for clustering large genomic networks using Fast Label Propagation; it processes massive datasets with limited RAM by storing edge data on disk in compressed sparse row (CSR) format
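The ExoLabel item above mentions compressed sparse row storage for edge data. As a rough illustration of that layout only, here is a minimal in-memory sketch in plain Python. The function names `edges_to_csr` and `neighbors` are hypothetical, and nothing here reflects ExoLabel's actual on-disk format or its label propagation step.

```python
def edges_to_csr(num_nodes, edges):
    """Convert (src, dst, weight) triples into three flat CSR arrays.

    indptr[i]:indptr[i + 1] is the slice of `indices` and `weights`
    that holds the outgoing edges of node i.
    """
    # Count outgoing edges per node.
    counts = [0] * num_nodes
    for src, _dst, _w in edges:
        counts[src] += 1

    # Prefix sums give the start offset of each node's slice.
    indptr = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        indptr[i + 1] = indptr[i] + counts[i]

    # Place each edge into the next free slot of its source node.
    indices = [0] * len(edges)
    weights = [0.0] * len(edges)
    next_slot = list(indptr[:-1])
    for src, dst, w in edges:
        pos = next_slot[src]
        indices[pos] = dst
        weights[pos] = w
        next_slot[src] += 1
    return indptr, indices, weights


def neighbors(indptr, indices, weights, node):
    """Return the (neighbor, weight) pairs of `node` from the CSR arrays."""
    start, end = indptr[node], indptr[node + 1]
    return list(zip(indices[start:end], weights[start:end]))


# Toy graph with made-up weights, edges listed in arbitrary order.
edges = [(0, 1, 0.9), (0, 2, 0.4), (2, 0, 0.4), (1, 0, 0.9)]
indptr, indices, weights = edges_to_csr(3, edges)
print(indptr)                                   # [0, 2, 3, 4]
print(neighbors(indptr, indices, weights, 0))   # [(1, 0.9), (2, 0.4)]
```

Because each node's edges sit in one contiguous slice, a node's neighborhood can be read with a single seek and a sequential read, which is presumably what makes the layout attractive for disk-backed storage.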
🔍 ML Technical Deep Dives
MLIR Part 7 - Transformers (stephendiehl.com, 2025-04-11). Transformers use self-attention, positional encodings, and multi-head attention to model context in natural language, giving rise to emergent capabilities in models like GPT-2
Cross-entropy and KL divergence (eli.thegreenplace.net, 2025-04-12). Cross-entropy computes loss in ML classification, while KL divergence measures differences between probability distributions. Both concepts utilize entropy and logarithms to quantify uncertainty and information content
Some experiments to help me understand Neural Nets better, post 3 of N (addxorrol.blogspot.com, 2025-04-10). The author trains a 30-layer network with 27,000 parameters on 5,000 training points to study overfitting and, surprisingly, finds no sign of it
Automatic Differentiation Revisited (leimao.github.io, 2025-04-12). A from-scratch look at automatic differentiation: the chain rule, Jacobian matrices, and the Jacobian products used in deep learning frameworks, without a detailed treatment of dot products
Why CatBoost Works So Well: The Engineering Behind the Magic (towardsdatascience.com, 2025-04-10). CatBoost uses Ordered Target Statistics and Ordered Boosting to mitigate target leakage when handling categorical variables, and builds its ensembles from oblivious (symmetric) decision trees
Graph Neural Networks: Revolutionizing Data Analysis (cosmicmeta.io, 2025-04-12). Graph Neural Networks (GNNs) enable advanced data analysis by learning from graph-structured data, enhancing fields like finance and healthcare, while utilizing tools like PyTorch Geometric and DGL for better predictive modeling
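For the cross-entropy and KL divergence item in this section, a small numeric check may help. The sketch below uses only the standard library and two made-up discrete distributions `p` and `q` (assumptions for illustration, not taken from the linked post) to show the standard identity H(p, q) = H(p) + KL(p || q).

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative distributions over three classes.
p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model's predicted distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
# The two sides agree up to floating-point error.
print(math.isclose(lhs, rhs))
```

Since H(p) does not depend on the model, minimizing cross-entropy with respect to `q` is equivalent to minimizing KL(p || q), which is why the two quantities are usually discussed together.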
📚 Scholarly ML Research I
A Comparative Study of Recommender Systems under Big Data Constraints (arxiv:cs, 2025-04-11). A comparative study of recommender algorithms like EASE-R, SLIM, and Matrix Factorization under Big Data constraints reveals trade-offs in accuracy, scalability, and computational costs, providing guidelines for effective model selection
RO-FIGS: Efficient and Expressive Tree-Based Ensembles for Tabular Data (arxiv:cs, 2025-04-09). RO-FIGS introduces Random Oblique Fast Interpretable Greedy-Tree Sums for efficient tree-based learning, utilizing multivariate splits to enhance feature interaction discovery and model interpretability across diverse tabular datasets
Explainability and Continual Learning meet Federated Learning at the Network Edge (arxiv:cs, 2025-04-11). This work addresses challenges in edge computing by exploring Multi-objective optimization, integrating explainable models like decision trees, and combining Continual Learning with Federated Learning for adaptive, privacy-preserving machine learning solutions
Enhancing Metabolic Syndrome Prediction with Hybrid Data Balancing and Counterfactuals (arxiv:cs, 2025-04-09). This study enhances metabolic syndrome prediction by optimizing ML models such as XGBoost and Random Forest using hybrid data balancing techniques like MetaBoost and counterfactual analysis, identifying key features like blood glucose and triglycerides
Hyperparameter Optimisation with Practical Interpretability and Explanation Methods in Probabilistic Curriculum Learning (arxiv:cs, 2025-04-09). This study explores hyperparameter optimisation in reinforcement learning via Probabilistic Curriculum Learning, utilizing the AlgOS framework and Optuna's TPE for efficiency, and introduces a SHAP-based interpretability approach for analysing hyperparameter impacts
An experimental survey and Perspective View on Meta-Learning for Automated Algorithms Selection and Parametrization (arxiv:cs, 2025-04-08). This paper surveys Algorithms Selection and Parametrization (ASP) methods, with an emphasis on Automated Machine Learning (AutoML), and presents a benchmark of 4 million models that evaluates classifier selection across 8 algorithms and 400 datasets
🧮 Scholarly ML Research II
Where Does Meaning Live in a Sentence? Math Might Tell Us. (quantamagazine.org, 2025-04-09). Tai-Danae Bradley leverages category theory to explore the mathematical structure of language, aiming to understand how grammar and meaning arise from word combinations and how this relates to AI-generated text
Modeling data with correlated errors across a directed graph (ckrapu.github.io, 2025-04-13). Model data with correlated errors using PyMC, focusing on directed acyclic graphs (DAGs) and Gaussian Markov random fields for improved regression estimates and predictive performance
Extending the Theta forecasting method to GLMs, GAMs, GLMBOOST and attention: benchmarking on Tourism, M1, M3 and M4 competition data sets (28000 series) (thierrymoudiki.github.io, 2025-04-14). The expanded Theta forecasting method incorporates GLMs, GAMs, GLMBOOST, and attention mechanisms, benchmarking performance on Tourism, M1, M3, and M4 datasets using advanced statistical packages and techniques
aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis (arxiv:astro, 2025-04-13). aweSOM is an open-source Python package that implements a self-organizing map (SOM) algorithm with CPU/GPU acceleration for clustering large multidimensional datasets, achieving 10-100x speedup and improved memory efficiency compared to existing implementations
High-dimensional Clustering and Signal Recovery under Block Signals (arxiv:stat, 2025-04-11). This study presents CFA-PCA and MA-PCA for high-dimensional clustering and signal recovery, addressing sparse and dense block signals with minimax optimality and efficiency under non-Gaussian data conditions
Weak Signals and Heavy Tails: Machine-learning meets Extreme Value Theory (arxiv:stat, 2025-04-09). This survey explores the integration of multivariate extreme value theory with statistical learning techniques, focusing on exponential maximal deviation inequalities and applications in classification, regression, anomaly detection, and high-dimensional lasso adaptation
Backsolving classical generalization bounds from the modern kernel regression eigenframework (james-simon.github.io, 2025-04-09). This blog post uses the kernel ridge regression (KRR) eigenframework to derive classical generalization bounds, exploring RKHS norms, kernel functions, and bounds on test error based on the learnability of the kernel eigenstructure
Learning Möbius from Inconvenient Integer Representations (davidlowryduda.com, 2025-04-11). David Lowry-Duda runs machine learning experiments on learning the Möbius function from inconvenient integer representations and discusses how well Zeckendorf and factoradic representations capture divisibility rules
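The last item mentions Zeckendorf and factoradic representations. For readers who have not met them, here is a minimal sketch of both encodings in plain Python. The function names are hypothetical and this is not the code from the linked post, only the textbook definitions: Zeckendorf writes an integer as a sum of non-consecutive Fibonacci numbers, and factoradic writes it in a mixed-radix system whose place values are factorials.

```python
def zeckendorf(n):
    """Return the Fibonacci numbers (largest first) that sum to n.

    Uses the greedy algorithm, which never picks two consecutive
    Fibonacci numbers. The sequence here starts 1, 2, 3, 5, 8, ...
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        if f <= n:
            parts.append(f)
            n -= f
    return parts


def factoradic(n):
    """Return the factorial-base digits of n, most significant first.

    The digit at position k (counting from 0 on the right) has place
    value k! and lies in the range 0..k.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    base = 1
    while True:
        n, digit = divmod(n, base)
        digits.append(digit)
        base += 1
        if n == 0:
            break
    return digits[::-1]


print(zeckendorf(100))   # [89, 8, 3]
print(factoradic(463))   # [3, 4, 1, 0, 1, 0]
```

Neither encoding exposes divisibility by small primes as directly as, say, base 2 exposes parity, which is plausibly what makes them "inconvenient" inputs for a model trying to learn a function tied to prime factorization.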