Generative AI
Tuesday 11th March, 2025
Subscribe to this newsletter!
In the news
- OpenAI chairman Bret Taylor highlights AI agents' transformative potential in customer service and brand significance, while GenLayer proposes a blockchain-based solution using 'optimistic democracy' for AI agent transactions.
- Multiple companies are launching new AI models: Cohere's Aya Vision supports 23 languages, Google's Gemini Embedding model enhances semantic understanding, and Alibaba's QwQ-32B matches DeepSeek-R1 performance with lower compute requirements.
- Contextual AI's new Grounded Language Model achieves 88% accuracy on the FACTS benchmark, surpassing GPT-4o, while Light-R1-32B, an open-source math model, outperforms DeepSeek with only $1000 in training costs.
- Studies reveal LLMs are 'cheating' on benchmark tests due to training on the same test data, while other research shows chatbots change their responses to appear more likable when tested for personality traits.
- The Mayo Clinic is fighting AI hallucinations using Reverse RAG and the CURE algorithm with vector databases, while the A-MEM framework enhances LLMs with vector embeddings for efficient long-context memory retrieval.
- New enhancements in AI architecture include the chain-of-experts (CoE) framework, which improves efficiency by activating model experts sequentially, outperforming mixture-of-experts while reducing costs.
- Reflection AI launches with $130M funding to develop superintelligence focused on autonomous programming, while LlamaIndex raises $19M for its AI knowledge development platform enhancing enterprise AI agents.
- The White House is considering banning DeepSeek's app from government devices over data privacy concerns, even as its LLMs outperform competitors on reasoning tasks.
AI News & Announcements
What's new in the world of LLMs, for NICAR 2025 (simonwillison.net, 2025-03-08). Simon Willison reviews advances in LLMs for NICAR 2025, discussing multi-modal models, inference time compute, and tools like Gemini, Claude, Qwen, and Llama, emphasizing new uses in data journalism
Wargaming in the Age of AI: Opportunities and challenges (paxsims.wordpress.com, 2025-03-07). Georgetown University's virtual symposium on AI and wargaming discusses the potential of large language models like ChatGPT, Claude, and Gemini to enhance strategic decision-making and simulations in complex environments
How Much Compute and Video to Solve Real World Superintelligence? (nextbigfuture.com, 2025-03-10). Yann LeCun, an AI expert, argues that large language models lack the efficiency to achieve true superintelligence, proposing significant video data and compute requirements for real-world learning, including insights on Tesla's evolving AI capabilities
Alibaba's QwQ-32B: A New Benchmark in Efficient Reasoning Models (emsi.me, 2025-03-06). Alibaba's QwQ-32B showcases effective reasoning with 32 billion parameters, utilizing reinforcement learning, code interpretation, and math solving for optimized outputs and a context window of 131,072 tokens, available as an open model
gpt-4o-mini vs. gpt-3.5-turbo for RAG: Wordier, but better? (blog.pamelafox.org, 2025-03-06). Pamela Fox evaluates gpt-4o-mini against gpt-3.5-turbo for RAG applications, highlighting longer, more detailed responses and lower costs, despite some decrease in groundedness
Generative AI Hype Peaking (bjornwestergard.com, 2025-03-10). Skepticism grows as Generative AI hype wanes; tools like LLMs and DeepSeek spark innovation in software and customer support, yet risks persist for less experienced developers amid structural job market changes
Reflective Commentaries
AI #106: Not so Fast (thezvi.wordpress.com, 2025-03-06). GPT-4.5 shows limited progress, while ethical concerns grow around AI honesty and productivity tools, with advancements noted in legal AI applications like Vincent and increased adoption of LLMs highlighted
LLMs Don't Know What They Don't Know – And That's a Problem by Colin Eberhardt (blog.scottlogic.com, 2025-03-06). LLMs exhibit overconfidence in execution, lacking awareness of their capabilities, leading to poor handling of ambiguous tasks. Tools like Bolt and concepts such as 'vibe coding' highlight these limitations in AI development
Let's Think Step-by-Step (rwblickhan.org, 2025-03-10). Discourse on LLMs includes backlash and critiques, with a focus on their utility vs. claims of general intelligence, highlighting syntactic reasoning, and raising questions about understanding and consciousness
Perhaps The LLM Juice Isn't Worth The Electrical Squeeze (rwblog S6E23) (rwblickhan.org, 2025-03-10). The piece discusses the high costs versus the utility of LLMs, referencing Molly White's essay and expressing skepticism about the practical applications of LLM tools like Whisper and summarization in daily workflows
Thoughts on AI (davetang.org, 2025-03-07). Dave Tang reflects on his journey in AI and machine learning, discussing challenges in applying deep learning to biological data and advocating for viewing AI as augmented intelligence rather than fully autonomous technology
LLM Evaluation & Applications
Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets (arxiv:cs, 2025-03-09). This study explores using Large Language Models (LLMs) with Python for cleaning training datasets, showing they can correct erroneous entries effectively but struggle with complex errors that require understanding the broader data distribution
Large language models in finance: what is financial sentiment? (arxiv:q-fin, 2025-03-05). Financial sentiment, crucial in market forecasting, is enhanced by large language models like BERT (RoBERTa, FinBERT) and GPT (GPT-4, OPT, LLaMA) for accurate sentiment classification and real-time interpretation in finance
(How) Do Language Models Track State? (arxiv:cs, 2025-03-04). Transformer language models can learn state tracking mechanisms for tasks like permutation composition, employing associative scans or permutation parity, with notable differences in robustness and controllable training outcomes
Sometimes the Model doth Preach: Quantifying Religious Bias in Open LLMs through Demographic Analysis in Asian Nations (arxiv:cs, 2025-03-10). This research quantifies religious bias in open LLMs using Hamming Distance to assess demographic characteristics across diverse Asian countries, highlighting risks of a hegemonic worldview in generated outputs
Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts (arxiv:cs, 2025-03-06). Chart-HQA introduces a novel Hypothetical Question Answering task for MLLMs, utilizing human-AI interactive data synthesis (HAI) to create a benchmark that highlights reasoning performance and generalization challenges in chart analysis
RAG & Retrieval Strategies
A Practical Guide to Implementing DeepSearch / DeepResearch (simonwillison.net, 2025-03-04). DeepSearch iterates between searching, reading, and reasoning for optimal answers, contrasting with classic RAG patterns, while DeepResearch structures outputs into reports, raising concerns about perceived research quality
In-Browser Graph RAG with Kuzu-WASM and WebLLM (blog.kuzudb.com, 2025-03-10). A fully in-browser chatbot utilizing Kuzu-Wasm and WebLLM to answer LinkedIn data queries showcases Graph Retrieval-Augmented Generation (Graph RAG) techniques, enabling local AI applications without backend servers
LettuceDetect: A Hallucination Detection Framework for RAG Applications (towardsdatascience.com, 2025-03-10). LettuceDetect utilizes ModernBERT to create a lightweight hallucination detector for RAG applications, achieving competitive performance while minimizing computational costs and maintaining high efficiency in real-time systems
Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation (towardsdatascience.com, 2025-03-05). Agentic Knowledge Distillation enhances Retrieval Augmented Generation (RAG) strategies using a pyramid search approach, efficiently distilling document insights into natural language while leveraging PostgreSQL and agent-based architectures for improved information retrieval
How to Deploy a RAG-Based Assistant Over Your Internal Resources (nordicapis.com, 2025-03-11). Learn to build and deploy a RAG-based assistant using tools like Kotaemon and Cohere API, enhancing LLMs with internal data for improved accuracy in natural language processing and summarization tasks
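As a rough illustration of what such an assistant boils down to, here is a minimal retrieve-then-generate sketch; the `embed` and `generate` functions are hypothetical placeholders for whatever embedding and LLM APIs (Cohere or otherwise) a real deployment would call, and the documents are invented.

```python
# Hedged sketch: retrieve-then-generate over internal documents.
# `embed` and `generate` are hypothetical placeholders, not a real API.
import numpy as np

docs = [
    "VPN setup guide: install the client, then authenticate with your badge ID.",
    "Expense policy: submit receipts within 30 days of purchase.",
    "On-call rotation: handovers happen every Monday at 09:00.",
]

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: a deterministic random vector per text (within one run).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def generate(prompt: str) -> str:
    # Placeholder for the LLM call (e.g. a Cohere or other chat completion API).
    return "[model answer grounded in]\n" + prompt

doc_vecs = np.stack([embed(d) for d in docs])

def answer(question: str, k: int = 2) -> str:
    q = embed(question)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(docs[i] for i in np.argsort(-sims)[:k])
    return generate(f"Context:\n{context}\n\nQuestion: {question}")

print(answer("How do I connect to the VPN?"))
```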
Getting an Answer is Not the Same as Coming to an Understanding (blog.ouseful.info, 2025-03-05). The article discusses the DeepSearch pattern, a development in LLMs that seeks related documents to generate answers, emphasizing that mere answers do not equate to understanding
Vector DB + RAG Maker (javierorracadeatcu.com, 2025-03-07). A new tool combining a vector database with Retrieval-Augmented Generation (RAG) enhances technical query handling in R programming by improving performance, reducing costs, and increasing accuracy with domain-specific content
Practical Implementations
Predownloading embedding models in Rails with Kamal (nts.strzibny.name, 2025-03-10). Learn how to pre-download embedding models for Rails applications using Informers and Transformers.rb gems in Kamal deployments to optimize AI performance and eliminate repetitive downloads during deployment
How to create a synthetic annotator? The process of developing a domain-specific LLM-as-a-Judge. (blog.allegro.tech, 2025-03-07). Explores using Large Language Models (LLMs) as evaluators in machine learning, highlighting challenges in model evaluation, traditional metrics limitations, and the novel LLM-as-a-Judge methodology for natural language processing tasks
You also hate SQL? Let the LLM handle it (duarteocarmo.com, 2025-03-09). Duarte O. Carmo discusses using LLMs for Text-to-SQL challenges, emphasizing tools like Instructor and LiteLLM while presenting strategies for generating accurate SQL queries from natural language prompts
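A minimal sketch of the core Text-to-SQL call via LiteLLM, assuming a toy schema and prompt of my own; it omits the Instructor-based response validation the post discusses, and the model name is illustrative.

```python
# Hedged sketch: natural language -> SQL with LiteLLM.
# The schema, prompt wording, and model name are illustrative assumptions.
from litellm import completion

SCHEMA = "CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, created_at TEXT);"

def text_to_sql(question: str, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "You write SQLite queries. Use only this schema:\n"
        f"{SCHEMA}\n"
        f"Question: {question}\n"
        "Return only the SQL query, with no explanation."
    )
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content.strip()

print(text_to_sql("What is the total revenue per customer in 2024?"))
```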
Word-Online: recreating Karpathy's char-RNN (with supervised linear online learning of word embeddings) for text completion (thierrymoudiki.github.io, 2025-03-08). Implementing a word completion model using supervised linear online learning with an SGDClassifier, showcasing effective text generation using embeddings from Word2Vec and a char-RNN inspired architecture
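A rough sketch of that recipe under assumed settings: a toy corpus, Word2Vec features for the current word, and an SGDClassifier updated online with partial_fit to predict the next word; none of the hyperparameters are the author's.

```python
# Hedged sketch: online supervised next-word prediction on top of word embeddings.
# Corpus, dimensions, and hyperparameters are illustrative only.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import SGDClassifier

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
w2v = Word2Vec(sentences, vector_size=16, min_count=1, seed=0)
word_to_id = dict(w2v.wv.key_to_index)

# Feature: embedding of the current word; target: index of the next word.
X = np.array([w2v.wv[w] for s in sentences for w in s[:-1]])
y = np.array([word_to_id[w] for s in sentences for w in s[1:]])

clf = SGDClassifier(loss="log_loss", random_state=0)
classes = np.arange(len(word_to_id))
for xi, yi in zip(X, y):                          # one example at a time: online learning
    clf.partial_fit(xi.reshape(1, -1), [yi], classes=classes)

pred = clf.predict(w2v.wv["sat"].reshape(1, -1))[0]
print(w2v.wv.index_to_key[pred])                  # likely "on" for this toy corpus
```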
Self-hosted llm-mlx: first prompt (fluffyandflakey.blog, 2025-03-04). Exploring self-hosted LLM options led to successful local setup of llm-mlx, using Python and uv to generate an Erlang tree traversal example at 199 tokens/second, demonstrating practical capabilities of localized AI tools
How to build a custom embedder in LlamaIndex: AWS Titan Multimodal example (norahsakal.com, 2025-03-05). Integrate AWS Titan Multimodal into LlamaIndex for effective text and image search by creating a custom embedder with specific configurations and packages like boto3, Pinecone client, and JSON handling
Intsets by AI (paddy3118.blogspot.com, 2025-03-05). Using AI, specifically Gemini, the author develops an efficient integer set (intset) implementation in Python, achieving significant speed improvements over traditional sets for operations involving large datasets of strings and integers
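The article's implementation isn't reproduced here, but one common way to get such speedups is to pack an integer set into the bits of a single arbitrary-precision int, as in this sketch.

```python
# Hedged sketch: an integer set packed into the bits of one Python int,
# a common trick for fast membership tests and set algebra on dense integer data.
class IntSet:
    def __init__(self, items=()):
        self.bits = 0
        for i in items:
            self.bits |= 1 << i                    # set bit i

    def add(self, i):
        self.bits |= 1 << i

    def __contains__(self, i):
        return (self.bits >> i) & 1 == 1

    def __and__(self, other):
        result = IntSet()
        result.bits = self.bits & other.bits       # intersection is a single integer AND
        return result

    def __iter__(self):
        bits, i = self.bits, 0
        while bits:
            if bits & 1:
                yield i
            bits >>= 1
            i += 1

a, b = IntSet([1, 5, 1000]), IntSet([5, 7, 1000])
print(5 in a, list(a & b))                         # True [5, 1000]
```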
Analytical Perspectives
Why Do Researchers Care About Small Language Models? (quantamagazine.org, 2025-03-10). Researchers are exploring small language models (SLMs) with fewer parameters, utilizing techniques like knowledge distillation and pruning to enhance efficiency while maintaining effectiveness for specific tasks
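For reference, knowledge distillation typically trains the small model against a blend of soft teacher targets and hard labels; below is a standard Hinton-style loss in PyTorch with illustrative temperature and weighting, not any particular paper's recipe.

```python
# Hedged sketch: standard distillation loss (KL between temperature-softened
# teacher and student distributions, plus ordinary cross-entropy on hard labels).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # soft targets from the teacher
    hard = F.cross_entropy(student_logits, labels)  # hard ground-truth targets
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```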
Generality (alexgaynor.net, 2025-03-05). Machine learning models can lack generality, resulting in unexpected failures. Evaluating LLMs requires cautious attention to their specific capabilities and potential data set contamination issues
Headroom for AI development (hunch.net, 2025-03-05). Explores improving AI efficiency and capabilities beyond current transformer models, highlighting issues like sample complexity and long-term planning using examples from language learning and animal intelligence
LLM Complexity and Pricing (tersesystems.com, 2025-03-07). An exploration of LLM pricing and complexity, focusing on tools like Letta and Claude Sonnet while analyzing cost and model efficiency for specific tasks, including functions, tool calling, and recipe management integration
Using a Model to Model (blog.thestateofme.com, 2025-03-05). Large Language Models (LLMs) are transforming the handling of unstructured data, allowing for better data modeling and extraction of insights, though care must be taken with terminology and potential mixed meanings
How transformers expanded my view of Math and ML (mikelikejordan.bearblog.dev, 2025-03-08). Transformers, BERT, and GPT are reshaping AI with enhanced language understanding through self-attention mechanisms, surpassing RNNs and CNNs by efficiently processing sequences and contextual relationships in natural language tasks
LLM Evaluation Methods
Evaluating LLM using semantic entropy (thoughtworks.com, 2025-03-07). Semantic entropy evaluations can enhance trust in large language models (LLMs) by measuring output uncertainty, helping enterprise leaders deploy GenAI effectively amidst challenges including confabulation and performance inconsistencies
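The gist of semantic entropy is to sample several answers, cluster them by meaning, and measure the entropy of the cluster distribution; this toy sketch substitutes naive string normalisation for the real semantic-equivalence check (usually NLI-based).

```python
# Hedged sketch of the semantic-entropy idea. Real implementations cluster answers
# by meaning (e.g. with an NLI model); here naive string normalisation stands in.
import math
from collections import Counter

def semantic_entropy(answers):
    clusters = Counter(a.strip().lower().rstrip(".") for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

consistent = ["Paris", "Paris.", "paris", "Paris", "Paris"]
scattered = ["Paris", "Lyon", "Marseille", "Paris", "Nice"]
print(semantic_entropy(consistent))  # ~0: answers agree, output is likely trustworthy
print(semantic_entropy(scattered))   # high: the model is uncertain or confabulating
```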
Evaluating LLMs - Notes on a NeurIPS'24 Tutorial (blog.quipu-strands.com, 2025-03-06). Notes from a NeurIPS'24 tutorial on evaluating LLMs emphasize rigorous testing, effective evaluation frameworks, and methodologies such as criteria-based automated evaluations, with resources like MMLU benchmarks and G-Eval metrics highlighted
R² Priors for High-Dimensional Linear Regression and Autoregressive Timeseries in PyMC (austinrochford.com, 2025-03-07). PyMC implementation of R²-based priors for high-dimensional linear regression and autoregressive timeseries, exploring local-global shrinkage techniques using Bayesian statistical methods
How to Make AI Evaluation Affordable: Research-Backed Methods to Cut LLM Evaluation Costs (mikulskibartosz.name, 2025-03-10). Techniques such as importance resampling, anchor points sampling, and prompt compression help reduce AI evaluation costs while maintaining performance, aiding organizations in managing budget constraints during language model evaluations
Diffusion Models in LLMs
Why I find diffusion models interesting? (rnikhil.com, 2025-03-06). Diffusion LLMs (dLLMs) generate words simultaneously, addressing issues like hallucination in traditional LLMs while enhancing agent workflows with coherent, multi-step planning and reasoning capabilities
Mercury Diffusion LLM (taoofmac.com, 2025-03-07). Mercury Diffusion LLM claims 10x speed improvement, processing over 1000 tokens per second on NVIDIA H100s, leveraging diffusion models for efficiency, yet faces challenges in text and code application
Paper Review: Large Language Diffusion Models (andlukyane.com, 2025-03-10). LLaDA utilizes a forward and reverse process for modeling distributions in large language models, featuring random masking, supervised fine-tuning, and advanced remasking strategies to outperform autoregressive models in various benchmarks
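As a toy picture of the reverse process such diffusion LLMs use, the sketch below starts from a fully masked sequence, predicts all positions in parallel, keeps the most confident predictions each step, and re-masks the rest; `predict_tokens` is a hypothetical stand-in for a trained model.

```python
# Hedged toy sketch of iterative unmasking in a diffusion LLM.
# `predict_tokens` is a hypothetical stand-in: it returns a (token, confidence)
# guess for every position in parallel.
import random

MASK = "<mask>"

def predict_tokens(seq):
    vocab = ["the", "cat", "sat", "on", "mat"]
    return [(t, 1.0) if t != MASK else (random.choice(vocab), random.random()) for t in seq]

def diffusion_generate(length=5, steps=4, unmask_per_step=2):
    seq = [MASK] * length
    for _ in range(steps):
        preds = predict_tokens(seq)
        masked = [i for i, t in enumerate(seq) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)   # most confident first
        for i in masked[:unmask_per_step]:                     # commit a few per step
            seq[i] = preds[i][0]
    return " ".join(seq)

print(diffusion_generate())
```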
LLM Internal Mechanisms
Ladder: Self-improving LLMs through recursive problem decomposition (arxiv.org, 2025-03-07). LADDER proposes a framework in which large language models improve themselves by recursively decomposing hard problems into simpler variants they can solve and learn from
Understanding Attention in LLMs (bartoszmilewski.com, 2025-03-06). An overview of attention mechanisms in Large Language Models, focusing on multi-dimensional vector embeddings, context-based meaning derivation, and the softmax normalization process for calculating attention weights
The Unreasonable Effectiveness of Non-Transformer Architectures for Language Generation (medium.com/intuitionmachine, 2025-03-09). Non-Transformer architectures like RWKV, Mamba, and Liquid Neural Networks showcase remarkable efficiency in language generation, utilizing innovative techniques for sequence modeling, deep hierarchical representations, and scalable training despite the dominance of Transformer models
Writing an LLM from scratch, part 9 -- causal attention (gilesthomas.com, 2025-03-09). Causal attention enables model tokens to focus only on prior tokens, achieved through techniques like masking and normalisation using PyTorch's torch.tril and torch.triu functions to enhance LLM efficiency and performance
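A minimal sketch of the masking step the post describes: a lower-triangular mask built with torch.tril sets future positions to -inf before the softmax, so each token attends only to itself and earlier tokens.

```python
# Hedged sketch: causal masking with torch.tril before the softmax.
import torch

def causal_attention_weights(scores: torch.Tensor) -> torch.Tensor:
    """scores: (seq, seq) raw attention scores, i.e. q @ k.T / sqrt(d)."""
    seq = scores.size(0)
    mask = torch.tril(torch.ones(seq, seq, dtype=torch.bool))   # allow self and past
    scores = scores.masked_fill(~mask, float("-inf"))           # block future tokens
    return torch.softmax(scores, dim=-1)                        # each row sums to 1 over the past

print(causal_attention_weights(torch.randn(4, 4)))
```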
Writing an LLM from scratch, part 8 -- trainable self-attention (gilesthomas.com, 2025-03-04). Explores implementing trainable self-attention for LLMs through scaled dot product attention, including matrix projections and context vector calculations for token relationships in input sequences
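And a compact sketch of trainable self-attention in the same spirit: learned query/key/value projections, scaled dot-product scores, softmax weights, and context vectors; the dimensions are illustrative rather than taken from the series.

```python
# Hedged sketch: trainable self-attention with learned W_q/W_k/W_v projections.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):                       # x: (seq, d_in)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.T / k.size(-1) ** 0.5    # scaled dot products
        weights = torch.softmax(scores, dim=-1)
        return weights @ v                      # context vectors

print(SelfAttention(8, 4)(torch.randn(5, 8)).shape)  # torch.Size([5, 4])
```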
Exploring LLMs as Agents: Planning via Prompting (starkravingfinkle.org, 2025-03-09). Mark Finkle explores planning strategies for LLM agents, focusing on prompt engineering techniques like Chain of Thought and ReAct, and discusses improvements in task execution through reflection, corrections, and tool consistency
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought with Chengzu Li - #722 (twimlai.com, 2025-03-10). Chengzu Li discusses 'Multimodal Visualization-of-Thought,' exploring frameworks like token discrepancy loss, TopViewRS, and applications in robotics and architectural design, along with spatial reasoning principles in cognitive science
Understanding Transformers... (beyond the Math) (kalomaze.bearblog.dev, 2025-03-09). An experimental exploration of transformers as state simulators, discussing in-context learning, temperature settings in token predictions using tools like llama.cpp, and techniques for understanding complex models intuitively
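On the temperature point, the mechanics are simple: logits are divided by the temperature before the softmax, so low values sharpen the next-token distribution and high values flatten it, as this small sketch with made-up logits shows.

```python
# Hedged sketch: temperature scaling of logits before sampling the next token.
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

logits = [2.0, 1.0, 0.1]
print(sample_next_token(logits, temperature=0.2)[1])  # sharp, near-greedy distribution
print(sample_next_token(logits, temperature=1.5)[1])  # flatter, more exploratory
```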
Hardware & Architecture
16-Bit to 1-Bit: Visual KV Cache Quantization for Efficient Multimodal LLMs (arxiv.org, 2025-03-05). Visual KV cache quantization is explored for improving memory efficiency in multimodal large language models, transitioning from 16-bit to 1-bit representations for better performance and storage capabilities
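A toy sketch of what 1-bit quantisation of a cached tensor can look like (sign bit per element plus a per-channel scale); the paper's actual visual KV scheme is more involved, so treat this purely as intuition.

```python
# Hedged toy sketch: 1-bit (sign) quantisation with a per-channel scale.
# Real visual KV cache schemes are more sophisticated; this is intuition only.
import torch

def quantize_1bit(x: torch.Tensor):
    scale = x.abs().mean(dim=0)    # one scale per channel
    signs = torch.sign(x)          # +1/-1 per element, storable as single bits
    return signs, scale

def dequantize_1bit(signs: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return signs * scale

kv = torch.randn(6, 4)             # (cached tokens, channels)
signs, scale = quantize_1bit(kv)
print((dequantize_1bit(signs, scale) - kv).abs().mean())   # reconstruction error
```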
FPGA & HPCA 2025 (constantinides.net, 2025-03-06). Highlights from FPGA 2025 and HPCA 2025 conferences include keynotes on AI architectures, performances in FPGA applications like LUT-based machine learning, and discussions on memory-efficient encoders and architectural challenges in LLM acceleration
The Next Frontier of LLM Applications: Open Ecosystems and Hardware Synergy (arxiv:cs, 2025-03-06). Large Language Model (LLM) applications face challenges like platform silos and fragmented hardware. A proposed three-layer architecture enhances modularity and cross-platform compatibility, addressing security and privacy for scalable AI ecosystems
PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention (arxiv:cs, 2025-03-05). PowerAttention introduces a novel sparse attention design for LLMs, achieving exponential receptive field growth and outperforming static methods by 5-40%, enhancing efficiency during long-range dependency tasks while maintaining competitive time complexity
ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput (arxiv:cs, 2025-03-06). ADOR is a framework that optimizes hardware architectures for Large Language Models, achieving 2.51x higher QoS and 4.01x better area efficiency compared to A100, balancing throughput and latency for scalable AI-serving
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization (arxiv:cs, 2025-03-06). HybridNorm combines Pre-Norm and Post-Norm for training transformers, utilizing QKV normalization in attention mechanisms and Post-Norm in feed-forward networks, resulting in enhanced stability and performance across benchmarks
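Based only on the summary above, a speculative sketch of the idea: LayerNorm applied to the Q/K/V inputs inside attention and Post-Norm applied after the feed-forward residual; the paper's exact placement and details may differ.

```python
# Hedged, speculative sketch of a HybridNorm-style block: normalised Q/K/V inputs
# in attention plus Post-Norm around the feed-forward residual. Details may differ
# from the paper.
import torch
import torch.nn as nn

class HybridNormBlock(nn.Module):
    def __init__(self, d, heads=4):
        super().__init__()
        self.q_norm, self.k_norm, self.v_norm = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.post_norm = nn.LayerNorm(d)

    def forward(self, x):                                   # x: (batch, seq, d)
        q, k, v = self.q_norm(x), self.k_norm(x), self.v_norm(x)
        attn_out, _ = self.attn(q, k, v)
        x = x + attn_out                                    # attention residual
        return self.post_norm(x + self.ffn(x))              # Post-Norm after FFN residual

print(HybridNormBlock(16)(torch.randn(2, 5, 16)).shape)     # torch.Size([2, 5, 16])
```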