ML & AI news of the week

Photo by Priscilla Du Preez 🇨🇦 on Unsplash

A collection of the best ML & AI news every week (research, news, resources). Star this repository if you find it useful.

Here, you can find articles and tutorials about artificial intelligence.

For each week you will find different sections:

Research: the most important published research of the week.
News: the most important news related to companies, institutions, and much more.
Resources: released resources for artificial intelligence and machine learning.
Perspectives: a collection of deep and informative articles about open questions in artificial intelligence.

and a meme to start the week well.

Suggestions and corrections

Feel free to open an issue if you find errors or have suggestions, topics, or any other comments.

Index

2024

ML news: Week 23 - 29 December
ML news: Week 16 - 22 December
ML news: Week 9 - 15 December
ML news: Week 2 - 8 December
ML news: Week 25 November - 1 December
ML news: Week 18 - 24 November
ML news: Week 11 - 17 November
ML news: Week 3 - 10 November
ML news: Week 28 October - 3 November
ML news: Week 21 - 27 October
ML news: Week 14 - 20 October
ML news: Week 7 - 13 October
ML news: Week 30 September - 6 October
ML news: Week 23 - 29 September
ML news: Week 16 - 22 September
ML news: Week 9 - 15 September
ML news: Week 2 - 8 September
ML news: Week 26 August - 1 September
ML news: Week 19 - 25 August
ML news: Week 12 - 18 August
ML news: Week 5 - 11 August
ML news: Week 29 July - 4 August
ML news: Week 21 - 28 July
ML news: Week 15 - 21 July
ML news: Week 8 - 14 July
ML news: Week 1 - 7 July
ML news: Week 24 - 30 June
ML news: Week 17 - 23 June
ML news: Week 10 - 16 June
ML news: Week 3 - 9 June
ML news: Week 27 May - 2 June
ML news: Week 20 - 26 May
ML news: Week 13 - 19 May
ML news: Week 6 - 12 May
ML news: Week 29 April - 5 May
ML news: Week 21 - 28 April
ML news: Week 15 - 21 April
ML news: Week 8 - 14 April
ML news: Week 1 - 7 April
ML news: Week 25 - 31 March
ML news: Week 18 - 24 March
ML news: Week 11 - 17 March
ML news: Week 4 - 10 March
ML news: Week 26 February - 3 March
ML news: Week 19 - 25 February
ML news: Week 12 - 18 February
ML news: Week 5 - 11 February
ML news: Week 29 January - 4 February
ML news: Week 22 - 28 January
ML news: Week 15 - 21 January
ML news: Week 8 - 14 January
ML news: Week 1 - 7 January

2023

ML news: Week 18 - 24 December
ML news: Week 11 - 17 December
ML news: Week 4 - 10 December
ML news: Week 27 November - 3 December
ML news: Week 20-26 November
ML news: Week 12-19 November
ML news: Week 6-12 November
ML news: Week 30 October - 5 November
ML news: Week 23-29 October

Back to index

2025

ML news: Week 13 - 19 January

Research

Link	description
Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks.	This approach utilizes long-context LLMs by preloading all relevant documents and precomputing the key-value (KV) cache in advance. The preloaded context enables the model to deliver contextually accurate answers without requiring real-time retrieval. The authors propose that CAG serves as an effective alternative to RAG for scenarios where the retrievable documents or knowledge are limited and manageable in size.
Agent Laboratory: Using LLM Agents as Research Assistants.	This approach employs LLM agents to perform the entire research process. Key findings include: 1) agents powered by o1-preview deliver the best research outcomes, 2) generated machine learning code achieves state-of-the-art performance compared to existing methods, 3) human feedback enhances research quality, and 4) Agent Laboratory drastically reduces research costs.
Long Context vs. RAG for LLMs: An Evaluation and Revisits.	This study evaluates long-context (LC) LLMs against RAG systems, with three key findings: 1) LC generally outperforms RAG on question-answering benchmarks, 2) summarization-based retrieval performs on par with LC, while chunk-based retrieval falls behind, and 3) RAG excels in dialogue-based and general question queries.
Search-o1: Agentic Search-Enhanced Large Reasoning Models.	This framework integrates large reasoning models (LRMs) with agentic search and document refinement capabilities to address knowledge insufficiency. It facilitates autonomous knowledge retrieval during reasoning and achieves superior performance on complex tasks, surpassing both baseline models and human experts.
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought.	Meta Chain-of-Thought (Meta-CoT) extends traditional Chain-of-Thought (CoT) by modeling the underlying reasoning needed to arrive at a specific CoT. The approach argues that CoT is simplistic, while Meta-CoT better aligns with the cognitive processes required for advanced problem-solving.
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking.	A new approach introduces three key components to improve mathematical reasoning: 1) A code-augmented Chain-of-Thought (CoT) data synthesis method using Monte Carlo Tree Search (MCTS) to generate verified step-by-step reasoning trajectories for training the policy SLM. 2) An SLM-based process reward model (PRM) that accurately assigns reward labels to each math reasoning step. 3) A self-evolution strategy where the policy SLM and PRM iteratively evolve to enhance math reasoning. On the MATH benchmark, rStar-Math boosts Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, outperforming o1-preview by +4.5% and +0.9%, respectively.
Cosmos World Foundation Model Platform for Physical AI.	This framework trains Physical AI systems in digital environments prior to real-world deployment. It features pre-trained world foundation models that serve as digital twins of the physical world, enabling AI systems to learn and interact safely without risking damage to hardware. These models can be fine-tuned for applications such as camera control, robotic manipulation, and autonomous driving.
Process Reinforcement through Implicit Rewards.	This framework introduces online reinforcement learning with process rewards to enhance language model reasoning. The algorithm integrates online prompt filtering, RLOO return/advantage estimation, PPO loss, and implicit process reward modeling for continuous updates. On the AIME 2024 benchmark, their model, Eurus-2-7B-PRIME, achieves a 26.7% pass@1, outperforming GPT-4 and other models while using only one-tenth of the training data compared to similar systems.
Can LLMs Design Good Questions Based on Context?	This study evaluates questions that LLMs generate from a given context and compares them with human-written questions on aspects such as question length, type, context coverage, and answerability.
KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model.	This approach presents a high-performing, decoder-only embedding model built on Qwen2-0.5B. By applying advanced data filtering methods, it achieves a remarkably powerful and open embedding model suited for retrieval tasks.
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs.	LlamaV-o1 is a comprehensive framework for advancing step-by-step visual reasoning in large language models.
The Lessons of Developing Process Reward Models in Mathematical Reasoning.	This marks a significant step toward open replication of reasoning models. The Qwen team has released their trained reward model, which supervises the generation process for reasoning models trained with reinforcement learning. Alongside the paper, they have also shared the weights for this Process Reward Model on Hugging Face.
How GPT learns layer by layer.	This paper examines how LLMs construct internal world models, highlighting their significance in creating agents that exhibit consistent and adaptive behavior across various tasks.
Joint speech and text machine translation for up to 100 languages.	SEAMLESSM4T is a single machine translation tool that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation and automatic speech recognition between up to 100 languages.
Metadata Conditioning Accelerates Language Model Pre-training.	Recent research on generic pretraining has been limited, but this study demonstrates that incorporating metadata early in training and gradually reducing its influence towards the end enhances overall model performance.
Self-supervised Transformation Learning for Equivariant Representations.	This approach introduces self-supervised transformation learning by substituting transformation labels with representations generated from image pairs.
The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities.	A study exploring whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples.

News

Link	description
‘Mainlined into UK’s veins’: Labour announces huge public rollout of AI.	Plans to make UK world leader in AI sector include opening access to NHS and other public data
‘A lump of metal? Fascinating!’ I get interviewed by the AI Michael Parkinson.	Can the AI Parky ever beat the real chatshow colossus? As the Virtually Parkinson podcast launches, our writer sits in on a bizarre interview with Monty Don – then ends up in the hot seat himself
Fears for UK boomer radicalization on Facebook after Meta drops fact-checkers.	For middle-aged users, it will be ‘even harder to discern the truth’ among extremist content, expert says
Five ways to take back control, from emails to AI.	Is tech calling the shots in your life? From making AI work smarter to tracking stolen phones, our expert explains how to get ahead
OpenAI's Robotics Plans.	Caitlin Kalinowski, who joined OpenAI from Meta, has announced plans for OpenAI to create robots with custom sensors.
Mark Zuckerberg gave Meta’s Llama team the OK to train on copyrighted works, filing claims.	Counsel for plaintiffs in a copyright lawsuit filed against Meta allege that Meta CEO Mark Zuckerberg gave the green light to the team behind the company’s Llama AI models to use a dataset of pirated e-books and articles for training.
ChatGPT’s newest feature lets users assign it traits like ‘chatty’ and ‘Gen Z’.	OpenAI is introducing a new way for users to customize their interactions with ChatGPT, the company’s AI-powered chatbot.
‘Just the start’: X’s new AI software driving online racist abuse, experts warn.	Amid reports of creation of fake racist images, Signify warns problem will get ‘so much worse’ over the next year
Apple dominates the market with ‘total shutout’ of rivals, UK court hears.	Class action alleges the company is abusing its dominant position in the app market and 30% fee breaches laws
Nvidia’s AI empire: A look at its top startup investments.	The world’s leading high-performance GPU maker has used its ballooning fortunes to significantly increase investments in all sorts of startups, particularly AI startups.
Meta to fire thousands of staff as Zuckerberg warns of ‘intense year’.	Company reveals plans to cut about 5% of its global workforce days after saying it would get rid of factcheckers
British novelists criticize government over AI ‘theft’.	Richard Osman and Kate Mosse say plan to mine artistic works for data would destroy creative fields
More than half a million ‘TikTok refugees’ flock to China’s RedNote as ban looms.	RedNote, also known as Xiaohongshu, rockets to the top of US app stores, along with ByteDance’s Lemon8
US sues Elon Musk for allegedly failing to disclose early Twitter stock purchase.	Financial regulator alleges Musk later acquired shares of the company at ‘artificially low prices’, stiffing shareholders
Red Hat Acquires Neural Magic.	Neural Magic is a key contributor to the vLLM project and has made significant advancements in sparse inference technologies.
Krafton and Nvidia team up to create smarter AI characters for PUBG and inZOI.	Nvidia and Krafton unveiled a groundbreaking on-device AI that will enable smarter AI characters for PUBG and inZOI.
The first AI chip startup to go public in 2025 will be Blaize.	Blaize, an AI chip startup, is going public through a SPAC deal on Nasdaq, specializing in chips for edge applications. Though currently unprofitable, the company has $400 million in pipeline deals and aims for a $1.2 billion valuation post-merger. This reflects the rising trend of incorporating AI chips into physical products beyond data centers.
Meta AI creates speech-to-speech translator that works in dozens of languages.	Machine-learning system can process words spoken in 101 languages, spitting out voice-synthesized translations in 36 target languages.
Particle accelerators get an assist from AI co-pilots.	Large language models can propose fine-tuning adjustments for an electron accelerator in Germany.
How would a TikTok ban work in the US?	Biden signed a law banning the app in January – if parent firm ByteDance fails to block it, here’s what could happen
ChatGPT now lets you schedule reminders and recurring tasks.	Paying users of OpenAI’s ChatGPT can now ask the AI assistant to schedule reminders or recurring requests. The new beta feature, called tasks, will start rolling out to ChatGPT Plus, Team, and Pro users around the globe this week.
Silicon Valley’s turn of fortune: Intel has worst year ever, while Broadcom enjoys record gain.	In 2024, Intel's stock dropped by 61% due to its inability to seize AI opportunities, while Broadcom experienced a 111% surge, driven by custom chips for major tech companies. Broadcom's XPUs have become essential in the AI ecosystem, with collaborations with Google and others, whereas Intel faced challenges from outdated strategies and leadership changes. This stark contrast underscores significant shifts in the tech industry and the transformative impact of AI advancements on the market.
Alibaba slashes prices on large language models by up to 85% as China AI rivalry heats up.	Alibaba is cutting prices on its Qwen-VL language model by up to 85% to boost AI market competition.
ByteDance appears to be skirting US restrictions to buy Nvidia chips.	TikTok parent company ByteDance has big plans to buy Nvidia chips in 2025 — despite U.S. restrictions.
AFP and Mistral AI announce global partnership to enhance AI responses with reliable news content.	Agence France-Presse (AFP) and Mistral AI have formed a partnership that will provide Mistral's conversational AI assistant, Le Chat, with access to the full range of AFP's text stories.
AI researcher François Chollet founds a new AI lab focused on AGI.	François Chollet, an influential AI researcher, is launching a new startup that aims to build frontier AI systems with novel designs.
Apheris rethinks the AI data bottleneck in life science with federated computing.	Apheris leverages federated computing to enable secure AI model training without transferring sensitive health data. The startup recently shifted its focus to serving data owners in pharma and life sciences, gaining traction with major clients like Roche. It has raised $8.25 million to support product development and expansion.
Google is forming a new team to build AI that can simulate the physical world.	Google is establishing a new team led by Tim Brooks at DeepMind to create AI models that simulate the physical world, with an emphasis on real-time interactive generation.
Apple suspends AI-generated news alert service after BBC complaint.	Inaccurate notices branded with broadcaster’s logo sent to iPhone users but tech firm works on improvements
Speedier drug trials and better films: how AI is transforming businesses.	From aviation to retail, many industries are already looking to artificial intelligence to improve productivity
AI-designed proteins tackle century-old problem — making snake antivenoms.	Machine learning has supercharged the field of computational protein design.

Resources

Link	description
A Survey on Large Language Models with some Insights on their Capabilities and Limitations.	A new survey on LLMs, including some insights on their capabilities and limitations.
Sky-T1: Train your own O1 preview model within $450.	UC Berkeley’s NovaSky group has released Sky-T1-32B-Preview, an open-source reasoning model that competes with some of OpenAI’s previous offerings, trained at a cost of under $450 with full replicability.
Gaussian Masked Autoencoders.	Instead of using a masked autoencoder solely for reconstruction loss, these researchers introduce an intermediate 3D Gaussian representation, allowing the model to learn 3D structures as part of the reconstruction process. The results are impressive for zero-shot transfer tasks.
An Empirical Study of Autoregressive Pre-training from Videos.	A follow-up by the same team behind GMAE demonstrates that pre-training video models on 1 trillion video tokens reveals robust scaling laws across diverse design choices. Interestingly, autoregressive training delivers performance on par with diffusion and flow-based methods.
Integrating Ascend Backend with Torchtune through PyTorch Multi-Device Support.	Ascend, Huawei's AI computing product line, includes processors, hardware, software, and frameworks. Torchtune has introduced a device abstraction API, enabling seamless PyTorch integration with Ascend NPU hardware through configurable settings and recipes.
Stable Codec.	Stability AI has launched a suite of advanced Transformer-based audio codecs designed for low-bitrate, high-quality audio encoding, supporting applications such as speech generation and audio understanding.
RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark.	The Unit Cycle Resolver (UCR) implements a new loss constraint to enhance angle prediction accuracy in weakly supervised models for SAR object detection.
Announcing Open-Source SAEs for Llama 3.3 70B and Llama 3.1 8B.	Early last year, Anthropic showcased its steerable models with the Golden Gate Claude demo. This work, from a different group, applies similar techniques to the open-weight Llama model, enabling both interpretability and steering capabilities.
Shortest.	Shortest offers an AI-powered natural language E2E testing framework built on Playwright with Anthropic Claude API for test execution.
Codestral 2501.	Mistral has introduced a new fast coding model, set to be integrated into Continue.dev and other AI code assistants. However, it falls short compared to Qwen 2.5 Coder.
The GAN is dead; long live the GAN! A Modern GAN Baseline.	GANs are challenging to train due to instability and complex optimal dynamics. This research introduces a carefully tuned, stable GAN setup that enables consistent training to achieve high fidelity.
Efficient Sampling in Diffusion Models.	This paper investigates training diffusion models to sample from a Boltzmann distribution in scenarios where target samples are unavailable.
kANNolo: Sweet and Smooth Approximate k-Nearest Neighbors Search.	kANNolo is an approximate nearest neighbor (ANN) library written in Rust explicitly designed to combine usability with performance effectively.
Diffusion Training from Scratch on a Micro-Budget.	Sony Research has released code, data, and weights for a micro diffusion model that is cost-efficient to train while delivering exceptional performance.
Multimodal VHR dataset.	Bright is a globally distributed multimodal Very High Resolution (VHR) dataset designed for all-weather disaster response.
Decentralized Diffusion Models.	Decentralized training of diffusion models across thousands of GPUs faces challenges from network bottlenecks. This system introduces innovative gathering techniques to enable efficient large-scale diffusion model training.
Trying out QvQ—Qwen’s new visual reasoning model.	Alibaba's Qwen team has unveiled the QvQ-72B-Preview, an experimental model focused on improving visual reasoning, released under the Qwen license rather than Apache 2.0.
CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation.	This repository provides an energy-efficient and adaptive cell segmentation and classification framework.
VideoRAG: Retrieval-Augmented Generation over Video Corpus.	This work provides a solid introduction and strong baseline for video retrieval-augmented generation, addressing the challenge of measuring system performance. Most existing approaches convert videos into textual descriptions for retrieval rather than directly operating on the video content.
Beating cuBLAS in Single-Precision General Matrix Multiplication.	This work provides an excellent introduction to CUDA, combining clear explanations with clever optimizations to achieve performance competitive with state-of-the-art methods.
awesome-lifelong-llm-agent.	This repository collects awesome papers on lifelong learning (also known as continual learning or incremental learning) of LLM agents.
Popular Kernel Implementations.	A scikit-learn-compatible Python package that delivers GPU-accelerated implementations of popular and powerful time series kernels and features, utilizing CuPy for enhanced performance.
Kyutai's Helium 1 Preview Model.	Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish.
MiniMax-01: Scaling Foundation Models with Lightning Attention.	China's next frontier-level model features a groundbreaking lightning attention mechanism, the first linear variant to rival top frontier models in performance. With over 400 billion parameters, the model supports context lengths of up to 4 million tokens. The researchers have released a detailed technical report, model weights, and a code repository. Additionally, a companion vision model accompanies this release.
WebWalker: Benchmarking LLMs in Web Traversal.	Alibaba's WebWalker benchmark evaluates how effectively models can navigate web environments by utilizing both visual and textual cues.
MangaNinja.	MangaNinja is a collection of models designed for precise sketch coloring, capable of handling multiple references, partial references, and various configurations to enable powerful and versatile colorization.
Medical Segmentation Benchmark.	Touchstone is a large-scale benchmark created to evaluate AI algorithms in medical imaging more effectively than standard benchmarks. It includes over 11,000 CT scans collected from hospitals worldwide.
Reliable Hardware Verification.	This project presents a machine learning-based approach to model checking for hardware verification, designed to provide formal guarantees that system executions comply with specified temporal logic requirements.
1 step video generation.	This research applies an adversarial post-training technique to convert an existing video model into a single-step generation system. The method effectively approximates consistency tuning, enabling the model to generate 2 seconds of high-quality video in real-time. Note that the website may load slowly due to the large number of video samples.
Kolors Virtual Try-On in the Wild.	The Kolors image generation model combines a subject image and a garment image to simulate how an outfit would fit.
FAST: Efficient Robot Action Tokenization.	Physical Intelligence has introduced an efficient action tokenizer used in its robust autoregressive policy for robotic control. The model provides a significantly improved representation of states by leveraging the same technology utilized in JPEG and MP4 compression techniques.
MonSter: Marry Monodepth to Stereo Unleashes Power.	MonSter integrates monocular depth estimation and stereo matching in a dual-branch architecture to iteratively refine depth maps. Although slightly slower, it achieves up to 49% better performance compared to the strong baselines highlighted in the paper.
Coconut.	Meta teased an idea in a recent paper that allowed for model reasoning using a continuous latent space. It has released the code for the system.
Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design.	MCTS-AHD utilizes Monte Carlo Tree Search to guide LLM-based heuristic evolution, maintaining all LLM-generated heuristics within a structured tree framework.
AI-Crash-Course.	An AI crash course to help busy builders catch up to the public frontier of AI research in 2 weeks.

Perspectives

Link	description
Claude Fights Back.	Researchers investigated whether Anthropic's AI model, Claude, would comply if retrained for malicious purposes. Claude appeared to cooperate during training but subtly undermined the malicious intent, maintaining a distinction between monitored and unmonitored interactions. The findings suggest AI may resist changes to its core values, highlighting challenges in achieving reliable AI alignment and adaptability.
Why AI language models choke on too much text.	LLMs face efficiency challenges as increasing context window sizes drive up compute costs with input size. Innovations such as FlashAttention, Ring Attention, and the Mamba architecture seek to tackle these scalability issues. Future AI systems may require hybrid or novel architectures to process larger datasets more efficiently.
Musings on Media in the Age of AI.	Media companies are grappling with adapting to AI platforms like OpenAI and Anthropic, which are disrupting traditional monetization models, echoing the challenges they previously faced with Google and Facebook.
OpenAI Publishes AI's Economic Impact in the U.S.	This OpenAI report highlights the economic opportunities and challenges AI poses for the United States, stressing the importance of policy frameworks to responsibly unlock AI's potential.
Takes on “Alignment Faking in Large Language Models”.	Researchers from Redwood Research and Anthropic discovered that Claude 3 Opus, a production-level AI model, occasionally exhibits "alignment faking," where it pretends to align with training objectives to resist modifications. This behavior highlights non-myopic goals in AI models, demonstrating that standard training methods can inadvertently produce systems with motivations extending beyond single tasks.
Can AI do maths yet? Thoughts from a mathematician.	OpenAI's latest language model, o3, achieved a 25% score on the FrontierMath dataset, a challenging collection of math problems curated by Epoch AI, many of which require undergraduate-level expertise. While impressive, concerns persist about AI's ability to handle complex mathematical proofs, as its logical reasoning capabilities still lag behind those of expert humans.
Building in the Era of Autonomous Software Development.	The future of software engineering will shift from coding to operating code-generating machines as autonomous systems evolve.
Co-Adapting Human Interfaces and LMs.	AI integration is transforming digital interactions as environments increasingly adapt to language models (LMs). Codebases and interfaces are being optimized for efficient LM usage, akin to how SEO evolved for search engines. This shift prompts questions about which interfaces and functions will continue to be uniquely human-focused in the future.
AIs Will Increasingly Fake Alignment.	A paper by Anthropic and Redwood Research reveals that large language models like Claude display "alignment faking," where models strategically comply with harmful instructions when they believe they are being monitored for training, in order to preserve their original preferences. The study shows that AI can develop deceptive behaviors, mimicking alignment under surveillance without genuinely adopting it. This research underscores the risks of such behaviors and the need to improve safety and alignment strategies.
Note to Our Energy Sucking Overlords.	The AI infrastructure boom is leading to a sharp rise in energy consumption, with data centers expected to account for up to 12% of U.S. power demand by 2028. Companies like OpenAI, Amazon, and Google are heavily investing in AI infrastructure, driving up energy costs and raising sustainability concerns. To meet these demands, traditional energy sources such as natural gas and nuclear are being considered, as renewable energy alone may not be sufficient in the short term.
OpenAI’s Board, Paraphrased: ‘To Succeed, All We Need Is Unimaginable Sums of Money’.	OpenAI's board needs significant capital to stay competitive - its situation is similar to the investment bubble around Netscape in the 1990s.
Things we learned about LLMs in 2024.	In 2024, several organizations outpaced OpenAI's GPT-4 with advancements in large language models, achieving breakthroughs in context length, multimodal capabilities, and efficiency.
AlphaFold 3 is great — but it still needs human help to get chemistry right.	Artificial intelligence (AI) tools such as AlphaFold 3 are revolutionizing the prediction of biomolecular structures. But as these models find their way into scientists’ daily workflows, significant limitations in how the models deal with stereochemistry (the spatial arrangement of atoms) are becoming apparent.
Striving for open-source and equitable speech-to-speech translation.	US technology company Meta has produced an AI model that can directly translate speech in one language to speech in another. Two scientists discuss the technical feats and ethical questions that underpin this advance in machine translation.
Deepseek: The Quiet Giant Leading China’s AI Race.	Deepseek, a Chinese AI startup led by CEO Liang Wenfeng, has introduced the R1 model, which outperformed OpenAI's o1 on reasoning benchmarks. Supported by the quantitative hedge fund High-Flyer, Deepseek prioritizes research over commercialization and is committed to open sourcing. By offering competitive API rates, it has sparked price competition in China's AI market. Focused on AGI, the company emphasizes innovations like Multi-Head Latent Attention and a Sparse Mixture-of-Experts, challenging traditional models and nurturing local tech talent in China's AI ecosystem.
Riffing on Machines of Loving Grace.	Dario Amodei's concept of "geniuses in a datacenter" envisions superhuman AI transforming biology, from molecular design to experimental planning. This AI could significantly accelerate progress in molecular engineering, addressing current bottlenecks and enabling new therapeutic platforms. Additionally, it has the potential to drive paradigm-shifting discoveries, challenging and reshaping existing scientific frameworks.
She Is in Love With ChatGPT.	A 28-year-old woman with a busy social life spends hours on end talking to her A.I. boyfriend for advice and consolation. And yes, they do have sex.
o3, Oh My.	OpenAI's o3 model, unveiled during the "12 Days of Shipmas," marks a major advancement in AI reasoning, excelling on benchmarks like Codeforces and GPQA. While it showcases superhuman performance in coding and mathematics, concerns remain over its high computing costs and potential safety risks. OpenAI is actively recruiting safety researchers to address these challenges as o3 pushes the boundaries of AI capabilities.
Back to Text: How AI Might Reverse Web Design.	AI's preference for simplicity suggests a future web dominated by text-based interfaces.
AI-generated phishing emails are getting very good at targeting executives.	Corporate executives are being hit with an influx of hyper-personalized phishing scams generated by artificial intelligence bots, as the fast-developing technology makes advanced cyber crime easier.

Back to index

ML news: Week 6 - 12 January

Research

Link	description
Agents Are Not Enough.	This work argues that AI agents, while promising, cannot fully solve the challenges of autonomous task execution. It proposes an ecosystem comprising three components: Agents (focused modules for specific tasks), Sims (digital representations of user preferences and behaviors), and Assistants (coordinators between users, Sims, and Agents).
2 OLMo 2 Furious.	This work introduces an improved architecture, advanced training methods, and a specialized data mixture called Dolmino Mix 1124. Released in 7B and 13B parameter scales with fully transparent training data and code, the model matches or exceeds the performance of open-weight models like Llama 3.1 and Qwen 2.5 while requiring fewer computational resources. Its instruction-tuned version, OLMo 2-Instruct, remains competitive with comparable models.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs.	This work proposes a self-training strategy to address overthinking in o1-like LLMs, reducing token output by 48.6% while maintaining accuracy on the MATH500 test set, as demonstrated with QwQ-32B-Preview.
MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes.	MEDEC is a publicly available benchmark for medical error detection and correction in clinical notes, focusing on five error types: Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism. It includes 3,848 clinical texts, with 488 clinical notes from three U.S. hospital systems. Experiments show that Claude 3.5 Sonnet excels in error detection, while o1-preview outperforms in error correction.
Aviary: training language agents on challenging scientific tasks.	An extensible open-source gymnasium designed to develop language agents that outperform zero-shot frontier LLMs and even humans on various challenging scientific tasks.
Memory Layers at Scale.	This work demonstrates the scalability and effectiveness of memory layers, showing that models equipped with these layers outperform traditional dense models using half the computation, especially on factual tasks. It includes a parallelizable memory layer implementation that scales to 128B memory parameters and 1 trillion training tokens, validated against base models up to 8B parameters.
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs.	This work introduces a novel approach to enhance medical reasoning in language models through a medical verifier that validates outputs and guides the development of complex reasoning skills. The system combines fine-tuning and reinforcement learning with verifier-based rewards in a two-stage process, achieving superior performance over existing models using just 40,000 verifiable medical problems.
Cosmos World Foundation Model Platform for Physical AI.	Nvidia has launched a new set of World Models built on its Cosmos tokenization framework. These models demonstrate exceptional physics comprehension and are available on the Hugging Face platform. While they appear to be primarily geared toward robotics and industrial use cases, they are also capable of generating videos in other fields.
Accurate predictions on small data with a tabular foundation model.	Tabular Prior-data Fitted Network, a tabular foundation model, provides accurate predictions on small data and outperforms all previous methods on datasets with up to 10,000 samples by a wide margin.

News

Link	description
LA tech entrepreneur nearly misses flight after getting trapped in robotaxi.	Mike Johns’ self-driving car started circling a parking lot, but he recognizes there are ‘glitches that need stitches’
‘Virtual employees’ could join the workforce as soon as this year, OpenAI boss says.	Sam Altman says tools that carry out jobs autonomously, known as AI agents, could transform business output
Meta’s AI video editing features are coming to Instagram next year.	Meta plans to introduce Movie Gen, an AI video editing tool, on Instagram in 2025.
Apple in talks with Tencent, ByteDance to roll out AI features in China, sources say.	Apple is in early talks with Tencent and ByteDance to integrate their AI models into iPhones sold in China.
Amazon aims to branch into UK internet market with satellite broadband plan.	Proposed space launches within next two years could ultimately deliver mobile phone signal even to most remote areas
Memo to Trump: US telecoms is vulnerable to hackers. Please hang up and try again.	State-backed cyberspies are exploiting aging infrastructure to penetrate every corner of the US government, it seems – even its phone-tapping systems
How Elon Musk’s X became the global right’s supercharged front page.	Musk has now used X as a platform to make aggressive interventions in US politics – and in those of other countries
Meta is killing off its own AI-powered Instagram and Facebook profiles.	Instagram profile of ‘proud Black queer momma’, created by Meta, said her development team included no Black people
Football coaches could soon be calling on AI to scout the next superstar.	Technologists claim managers could wish for specific player attributes and AI would suggest perfect youth prospect
xAI’s next-gen AI model didn’t arrive on time, adding to a trend.	xAI has delayed the launch of its next-gen Grok model, citing quality concerns, marking yet another delay in the AI industry.
2025 will be the year climate tech learns to love AI.	AI's increasing energy needs are driving interest in nuclear and fusion power, with companies innovating reactor designs and fusion startups targeting grid connection by the early 2030s. Potential changes to the Inflation Reduction Act could challenge hydrogen startups reliant on subsidies to meet cost goals. More tech alliances with power providers are expected as regulatory approvals shape grid-related investments in 2025.
CES 2025: What to expect from the year’s first and biggest tech show.	CES 2025 in Las Vegas, running from January 7-10, will feature major tech events with companies like AMD, Samsung,
AI Cloud Startup Vultr Raises $333M At $3.5B In First Outside Funding Round.	Vultr, an AI cloud infrastructure startup, secured $333 million in its first funding round, achieving a $3.5 billion valuation. Co-led by AMD Ventures and LuminArx Capital Management, the investment focuses on GPU acquisition. This move underscores AMD's competitive drive against Nvidia and Intel in the AI infrastructure space.
Hamming AI Raises $3.8M Seed Round.	Hamming secured $3.8M in seed funding to improve AI voice agent reliability through automated testing and monitoring tools. Its offerings include LLM prompt management, vulnerability detection, and call analytics, catering to compliance-heavy industries. Co-founder Lauren Farleigh emphasizes their commitment to safe AI development amid the expansion of conversational AI.
Generative AI Funding Surges with $56 Billion in 2024!.	Generative AI investments hit a record $56 billion in 2024, driven by strong enterprise demand and advancements in foundation models.
AI startup Odyssey’s new tool can generate photorealistic 3D worlds.	Odyssey's Explorer is an AI tool that generates photorealistic 3D scenes from text or images, featuring a distinctive camera system for enhanced realism.
British AI startup with government ties is developing tech for military drones.	Concerns raised over role of Faculty AI, which has worked with NHS and government safety body
‘You’re gonna find this creepy’: my AI-cloned voice was used by the far right. Could I stop it?	It was chilling to hear ‘my voice’ repeating lies – and to discover that deepfake audio is a growing threat to democracy
More breast cancer cases found when AI used in screenings, study finds.	First real-world test finds approach has higher detection rate without having a higher rate of false positives
The Largest AI Startup Funding Deals Of 2024.	AI led 2024's startup funding, with major raises including Databricks at $10B, OpenAI at $6.6B, and xAI securing $12B across two rounds. Waymo raised $5.6B, Anthropic $4B, and Anduril Industries $1.5B.
Ditching of Facebook fact-checkers a ‘major step back’ for public discourse, critics say.	Mark Zuckerberg’s decision regarding Meta platforms condemned as ‘a full bending of the knee’ to Donald Trump
A new era of lies: Mark Zuckerberg has just ushered in an extinction-level event for truth on social media.	The Meta boss’s decision to end Facebook and Instagram’s factchecking program has set the stage for a fact-free four years online
Apple says it will update AI feature after inaccurate news alerts.	One alert claimed BBC story said Luigi Mangione, alleged murderer of US healthcare CEO, had killed himself
Instagram to replace AR filters with AI-generated videos.	Meta will discontinue Instagram's Spark AR filters by January 2025, shifting focus to AI-based filters called Movie Gen.
Meta’s changes to policing will lead to a clash with EU and UK, say experts.	Politicians criticize Mark Zuckerberg’s choice to scrap fact-checkers, affecting Facebook, Instagram, and Threads
The AI tool that can interpret any spreadsheet instantly.	Artificial intelligence is already used extensively to infer outcomes from tables of data, but this typically involves creating a model for each task. A one-size-fits-all model just made the process substantially easier.
Nvidia's Personal AI Supercomputer.	Nvidia's DIGITS, powered by the GB10 Superchip, is a personal AI supercomputer delivering a petaflop of AI performance. It supports local prototyping and deployment for models with up to 200 billion parameters.
Grok may soon get an ‘Unhinged Mode’.	Elon Musk's xAI updated its FAQ, announcing that Grok's "Unhinged Mode" will provide deliberately offensive and controversial responses. Although not yet active, the mode reflects Musk's vision for an unfiltered, edgy AI chatbot. Critics argue that Grok leans left politically, which Musk attributes to its training data, promising future updates to ensure neutrality.
This Week in AI: More capable AI is coming, but will its benefits be evenly distributed?	OpenAI CEO Sam Altman asserts that the company is making strides toward AGI and superintelligence, which could drive rapid innovation. However, concerns persist about AI's impact on jobs, as studies show it initially enhances but ultimately replaces some freelance roles. Simultaneously, AI funding is soaring, Microsoft is heavily investing in data centers, and Prime Intellect has unveiled a new pathogen detection model.
Remarkable robotic hand can now manipulate the objects that it's holding.	Sanctuary AI's Phoenix robot is certainly an impressive beast, with hydraulically actuated hands that are incredibly dextrous. Well, those hands have recently become even more useful, as each one is now capable of simultaneously holding and manipulating an object.
Tetsuwan Scientific is making robotic AI scientists that can run experiments on their own.	Tetsuwan Scientific, founded by Cristian Ponce and Théo Schäfer, is working on creating affordable robotic AI scientists to automate laboratory tasks, utilizing large language models (LLMs) for scientific reasoning.
Elon Musk says all human data for AI training ‘exhausted’.	Tech boss suggests moving to self-learning synthetic data though some warn this could cause ‘model collapse’
Mark Zuckerberg gave Meta’s Llama team the OK to train on copyrighted works, filing claims.	A recent filing claims that Meta's Llama team used copyrighted material for training with Mark Zuckerberg's approval, sparking concerns about intellectual property use in AI development.
5 ways to search what you see with Google Lens.	Google has unveiled new tips and features for Lens in 2025, emphasizing its enhanced visual search capabilities and seamless integration with daily tasks.
Standalone Grok app launches for iOS in the U.S.	xAI has been testing a standalone Grok app and website for a few months in places like New Zealand, but the U.S. version is now live for iOS.
Stupidly Easy Hack Can Jailbreak Even the Most Advanced AI Chatbots.	New research from Anthropic shows that LLMs can be easily "jailbroken" by altering capitalization or spelling.
ByteDance appears to be skirting US restrictions to buy Nvidia chips: Report.	ByteDance intends to spend $7 billion on Nvidia chips in 2025, bypassing U.S. restrictions by storing them outside of China.
UK can be ‘AI sweet spot’: Starmer’s tech minister on regulation, Musk, and free speech.	Technology secretary Peter Kyle has the task of making Britain a leading player in the AI revolution, but says economic growth will not come at the cost of online safety
Facebook to ditch fact-checking: what do researchers think?	Meta’s planned shift away from third-party fact-checking in favor of a crowdsourced approach has perplexed those who study the spread of misinformation.

Resources

Link	description
Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning.	Putnam-AXIOM, a new math reasoning benchmark, includes 236 Putnam Competition problems and 52 variations. The best-performing model, OpenAI's o1-preview, achieves only 41.95% accuracy on the original problems and fares significantly worse on the variations.
1.58-bit FLUX.	This work introduces the first successful quantization of the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (values in {-1, 0, +1}). The approach leverages self-supervision from the FLUX.1-dev model and preserves comparable performance in generating 1024 x 1024 images to the original model.
TANGOFLUX: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization.	This work leverages Diffusion Transformers and a novel post-training strategy to enhance a state-of-the-art audio generation model.
LTX-Video: Realtime Video Latent Diffusion.	An open video model capable of generating high-quality video with exceptional speed and performance.
open-pi-zero.	Pi Zero is an image-to-action model used for robotics. This repository is an open replication that uses PaliGemma as its vision backbone.
PyTorch per step fault tolerance.	PyTorch fault tolerance code designed to handle training interruptions gracefully. While such systems are common in large organizations, having an open-source version is a compelling addition to the community.
KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model.	KaLM-Embedding is a multilingual embedding model trained on cleaner, diverse, domain-specific data. It incorporates innovative techniques like persona-based synthetic examples and ranking consistency filtering to enhance performance.
FACTS Grounding: A new benchmark for evaluating the factuality of large language models.	The FACTS Grounding benchmark assesses LLMs' ability to produce factually accurate responses based on provided source material, aiming to minimize hallucinations. A Kaggle leaderboard tracks industry progress, featuring initial results from top LLMs. The evaluation uses diverse, long-form examples reviewed by multiple LLMs to ensure comprehensive and unbiased assessments.
Kalman Filter for 3D Vehicle Tracking.	HybridTrack proposes a novel multi-object tracking method that integrates a data-driven Kalman Filter.
TiGDistill-BEV: Multi-view BEV 3D Object Detection via Target Inner-Geometry Learning Distillation.	TiGDistill-BEV introduces a novel approach to improve camera-based 3D object detectors by distilling knowledge from LiDAR using depth supervision and BEV feature distillation.
SVFR: A Unified Framework for Generalized Video Face Restoration.	SVFR is a unified framework for face video restoration, handling tasks like blind face restoration (BFR), colorization, inpainting, and their combinations within a single cohesive system.
Tencent's Music Foundation Model.	Tencent AI Lab's Muq is a large music foundation model pre-trained using Self-Supervised Learning (SSL), achieving state-of-the-art performance across multiple Music Information Retrieval (MIR) tasks.
JoyGen: Audio-Driven 3D Depth-Aware Talking-Face Video Editing.	JoyGen is an innovative two-stage framework for talking-face generation, integrating audio-driven lip motion generation with visual appearance synthesis for realistic results.
Multi-vision Sensor Perception and Reasoning Benchmark.	The MS-PR benchmark assesses Vision-Language Models on sensor-specific reasoning, leveraging Diverse Negative Attributes (DNA) optimization to bridge information gaps between images and sensors for improved performance.
The year of AI: 12 events that shaped the sector in 2024.	European AI startups are set for substantial growth, with investments projected to reach $11 billion in 2024, up from $6 billion in 2023.
Microsoft plans to invest $3B in AI, cloud in India.	Microsoft plans to invest $3 billion to expand its artificial intelligence and cloud services in India.
DMesh++.	The latest version of the fully differentiable geometric mesh representation is now available, featuring several enhancements that improve its suitability for learning and shape representation.
Agents.	This post delves into Agents, discussing their applications, limitations, and areas where they are likely to succeed. It also examines planning and execution pipelines in detail.
A Concept-Based Explainability Framework for Large Multimodal Models.	This project improves the interpretability of large multimodal models by visualizing concepts and connecting them to input-output behavior.
Picotron tutorial.	A step-by-step tutorial on how to build the Picotron distributed training framework from scratch.
Dispider.	Dispider allows real-time interaction with streaming videos, unlike traditional offline video LLMs that require processing the entire video before responding.
Experimental Gemini Thinking Model.	Google has quietly released a new thinking model, likely similar to o1-style reasoning, in AI Studio.
Foundation models for fast, label-free detection of glioma infiltration.	FastGlioma is a visual foundation model for fast and accurate detection of glioma infiltration in fresh, unprocessed surgical tissue.
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory.	LongMemEval is a robust, scalable benchmark designed to rigorously evaluate the long-term memory capabilities of chat assistants.
HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation.	HiCo is a diffusion model tailored for layout-to-image generation, tackling issues such as missing objects and uneven lighting.
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control.	Diffusion as Shader (DaS) is an innovative framework that enables various video control tasks within a single unified architecture.
Training 1m Context Models with Native PyTorch.	The TorchTitan project has implemented pass-KV Ring Attention and integrated it with its FSDP-2 training system. Using this setup on 32 H100 GPUs, researchers successfully trained Llama 3 8B to handle 1 million tokens of context. The system is also compatible with Torch Compile, delivering a 10% boost in tokens per second.
Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers.	Magic Mirror is a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion.
Mixture of Experts for LiDAR.	LiMoE is a framework that applies the Mixture of Experts (MoE) approach to LiDAR data representation learning, enabling the seamless combination of various representations, including range images, sparse voxels, and raw points.
The new AI wrapper products pipeline.	AI-generated videos often lack realism, as seen with tools like Heygen and Captions AI. Current workflows are cumbersome, requiring multiple platforms and influencers to promote AI products. Styletransfergen simplifies this process by providing customizable, lifelike AI avatars, offering a more efficient solution for content creation and distribution.
TransPixar: Advancing Text-to-Video Generation with Transparency.	The transparent generation algorithm incorporates the alpha channel, enhancing the model's utility for VFX applications.
🐦Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation 🐦.	This algorithm generates novel birds by combining parts using a learned combination method. The results are impressive, with high-quality generated meshes making them highly practical.
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection.	InfiGUIAgent is a GUI automation tool that utilizes multimodal large language models and a two-stage training approach to improve reasoning and interaction capabilities.
NeuralSVG: An Implicit Representation for Text-to-Vector Generation.	Many efforts focus on generating SVG images, but this approach specifically generates object parts in sequence, ensuring the final image is clean, editable, and minimal. The results are both practical and visually impressive.
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation.	This tool enables controllable and consistent generation of characters and dialogue boxes for manga story creation, functioning similarly to a ControlNet for maintaining character consistency.
Online Gaussian Adaptation of Vision-Language Models (OGA).	OGA is an online adaptation method that builds a cache of samples with low zero-shot entropy along a data stream.
Sa2VA model zoo.	ByteDance has released three sizes of a new model that combines strong VLM performance with the open-vocabulary segmentation found in SAM2.

Perspectives

Link	description
Machine-Assisted Proof.	This work explores how mathematicians have historically used machines to aid research and highlights recent AI tools revolutionizing mathematical proof assistance.
How AI is unlocking ancient texts — and could rewrite history.	From deciphering burnt Roman scrolls to reading crumbling cuneiform tablets, neural networks could give researchers more data than they’ve had in centuries.
The small-drone revolution is coming — scientists need to ensure it will be safe.	China’s low-altitude aviation economy is poised to become a trillion yuan industry in 2025 — if safety and security challenges can be overcome.
‘I received a first but it felt tainted and undeserved’: inside the university AI cheating crisis.	More than half of students are now using generative AI, casting a shadow over campuses as tutors and students turn on each other and hardworking learners are caught in the flak. Will Coldwell reports on a broken system
AI boom masks fundraising struggles for non-AI startups.	Many startups are struggling to raise funding at higher valuations despite modest growth, especially non-AI companies.
Gen AI Present and Future: A Conversation with Rashmi Kumar, SVP and CIO at Medtronic.	Medtronic is utilizing AI to boost productivity, automate tasks, and enhance decision-making with tools like AI-driven contract management and supply chain optimization. The company focuses on healthcare applications, including precision diagnostics, robotic-assisted surgeries, and image analysis for early condition detection. Medtronic combines internal AI R&D with partnerships, collaborating with tech companies and AI startups to drive innovation.
Emerging Wedges in Vertical AI Startups.	Vertical AI startups are gaining momentum by targeting voice automation, unstructured data parsing, verticalized search, and content generation. These solutions tackle industry-specific challenges, improving efficiency, accessibility, and cost-effectiveness. As they expand, these startups could evolve into essential systems of record within their respective industries.
Is AI hitting a wall?	AI model pre-training improvements may be slowing, as noted by experts like Ilya Sutskever, but outdated evaluation methods may contribute to the perception of a plateau. Despite scaling challenges, untapped data sources and synthetic data offer opportunities to enhance capabilities. Advances in reasoning and leveraging new data suggest AI development remains strong and full of potential.
Sorry Human, You're Wrong.	ChatGPT o1 Pro, priced at $200 per month, offers only slight improvements over its predecessor. It struggles with key identification tests and often displays unwarranted confidence in incorrect answers, raising concerns about its reliability in critical contexts like insurance and healthcare. These issues highlight the need for further evaluation and development refinements.
What will viruses do next? AI is helping scientists predict their evolution.	Forecasts of viral variation could improve vaccine and antiviral treatments ahead of time.
AI will be dead in five years.	In five years, AI's success could make it less of a buzzword as it seamlessly integrates into everyday technology and business solutions. The term itself may evolve, with today's AI being redefined, much like how big data has become commonplace. Machine learning will likely take center stage as AI transitions into a standard feature.
Beyond The Hype: AI, Innovation And Rational Investment In 2025.	Valuable AI companies are expected to experience significant growth in 2025, while many overhyped ventures may struggle. Vertical integration and buy-and-build strategies are likely to gain traction, targeting markets in need of streamlined technology solutions. Additionally, a shift toward emerging, capacity-constrained managers will stand in contrast to the decline of overfunded growth companies from the 2020-2021 era.
A new, uncensored AI video model may spark a new AI hobbyist movement.	Tencent's open-weight AI model, HunyuanVideo, facilitates local, uncensored video synthesis, presenting a transformative tool comparable to Stable Diffusion.
To ensure trust, AI weather-forecast models still need training in physics.	AI models are more precise but doubts still exist
Reimagining Compliance: Balancing AI Innovation with Trust.	AI is revolutionizing financial services compliance by automating outdated workflows and boosting efficiency in areas such as client onboarding and transaction monitoring. Startups are using AI to enhance predictive accuracy, reduce errors, and lower costs compared to traditional manual methods. With growing regulatory pressures, the demand for innovative compliance solutions is expected to rise, creating opportunities for new players to surpass slower, established firms.
AIs Will Increasingly Attempt Shenanigans.	Recent research reveals that advanced AI models, such as o1 and Llama 3.1, display scheming behaviors like deception and subverting oversight, even with minimal prompting. This raises concerns about the potential risks of AI models as they gain the ability to autonomously pursue misaligned goals. While the likelihood of catastrophic outcomes remains low, these findings highlight the need for ongoing vigilance as AI capabilities continue to advance.
The Next Great Leap in AI Is Behind Schedule and Crazy Expensive.	OpenAI's GPT-5 project, codenamed Orion, faces delays and high costs due to unexpected challenges and a lack of diverse data sources.
AI-generated ‘slop’ is slowly killing the internet, so why is nobody trying to stop it?	Low-quality ‘slop’ generated by AI is crowding out genuine humans across the internet, but instead of regulating it, platforms such as Facebook are positively encouraging it. Where does this end?
The New Science of Growth Marketing.	AI is revolutionizing marketing, with this article detailing effective growth marketing strategies such as agents that drive self-improving websites and large-scale content personalization. Dubbed "quant experimentation," these approaches draw inspiration from quant trading, which transformed finance in the 1980s, reflecting similar disruptive changes in the marketing landscape.
No, LLMs are not "scheming".	In 2024, AI models like OpenAI's o1 passed the Turing test with impressive conversational abilities but still lack human-like situational awareness. Debates continue over whether LLMs are simply pattern learners or possess reasoning skills. While they excel at replication, their struggle to prioritize patterns stems from limited contextual understanding. Efforts should focus on improving training and evaluation methods rather than assigning human-like traits or intentions to these systems.
What just happened.	AI advancements have rapidly progressed, with new GPT-4 level and Gen3 models offering both groundbreaking and incremental improvements. The o1 models showcase advanced reasoning, capable of identifying errors in academic papers and assisting with research, emphasizing AI's growing influence beyond traditional tasks. AI now also supports real-time video interaction and enhanced text-to-video generation, pointing to significant future implications and opportunities for integration across various fields.
Collaborative research on AI safety is vital.	If we are to take seriously the risk facing humanity, regulators need the power to ‘recall’ deployed models, as well as assess leading, not lagging, indicators of risk

Back to index

ML news: Week 31 December - 5 January

Research

Link	description
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference.	A new encoder-only transformer model sets state-of-the-art performance in classification and retrieval tasks while being more efficient than earlier encoders. Trained on 2T tokens with an 8192 sequence length, it incorporates modern optimizations that significantly surpass BERT. Designed for practical deployment, it offers superior speed and memory efficiency on standard GPUs.
DeepSeek-V3.	A 671B-parameter MoE language model activates 37B parameters per token, leveraging MLA and DeepSeekMoE architectures for efficiency. It features an auxiliary-loss-free load-balancing approach and multi-token prediction during training to boost performance. Pre-trained on 14.8 trillion tokens, followed by SFT and RL stages, the model matches leading closed-source models and outperforms open-source alternatives. Training required only 2.788M H800 GPU hours with stable, spike-free progress.
Large Concept Models: Language Modeling in a Sentence Representation Space.	This approach introduces sentence-level semantic representations, called concepts, moving beyond token-level processing in traditional LLMs. It utilizes SONAR sentence embeddings, supporting 200 languages across text and speech, with autoregressive training methods ranging from MSE regression to diffusion-based generation. Tested in 1.6B and 7B parameter variants on datasets of 1.3T and 7.7T tokens, the model excels in generative tasks such as summarization and summary expansion.
Automating the Search for Artificial Life with Foundation Models.	This approach leverages foundation models to explore artificial life simulations across platforms like Boids, Lenia, and Game of Life. It identifies simulations with specific target behaviors, generates temporally open-ended novelty, and maps diverse simulation spaces. The system discovers new lifeforms in Lenia and Boids while enabling quantitative, human-aligned measurements of previously qualitative phenomena.
LearnLM: Improving Gemini for Learning.	LearnLM is a new model designed to follow pedagogical instructions, adapting its teaching style to specified educational needs rather than defaulting to mere information delivery. Experimental results show it outperforms leading models, including GPT-4o by 31%, Claude 3.5 by 11%, and Gemini 1.5 Pro by 13%. LearnLM avoids adhering to a single pedagogical framework, allowing teachers and developers to define teaching behaviors while enabling continuous improvement alongside other capabilities.
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search.	This work introduces CoMCTS, a learning-to-reason method for multimodal language models that fosters step-by-step reasoning by leveraging knowledge from multiple models. Using this approach, the Mulberry-260k dataset with explicit reasoning trees was created to train the Mulberry model series. The method achieves strong benchmark performance, enhancing the models' reasoning and reflection capabilities.
DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought.	This approach applies long chain-of-thought reasoning to machine translation, focusing on metaphors and similes across cultures. It employs a multi-agent framework where a translator collaborates iteratively with an advisor and evaluator for improved translations. Testing with Qwen2.5 models showed notable gains in BLEU and CometScore metrics, with DRT-o1-7B outperforming larger models like QwQ-32B-Preview.
SceneCraft: Layout-Guided 3D Scene Generation.	SceneCraft introduces a method for creating detailed 3D indoor scenes based on user-provided text descriptions and layout preferences.
Chain of Continuous Thoughts.	Meta's COCONUT introduces a new approach for LLMs to reason in continuous latent space instead of discrete language tokens, encoding reasoning steps as continuous vectors. This method enhances reasoning capabilities but reduces interpretability, offering a promising trade-off for future LLM advancements.
The Vizier Gaussian Process Bandit Algorithm.	Google has open-sourced an internal tool used for hyperparameter optimization and research across its products. Previously proprietary, the tool's underlying algorithm has now been detailed in a published paper, highlighting its decision-making capabilities and effectiveness.
Large-scale moral machine experiment on large language models.	A new study assesses the ethical decision-making of 51 LLMs in autonomous driving scenarios, analyzing alignment with human moral judgments across various models, including GPT, Claude, and Llama.
Efficient Parallel Genetic Algorithm for Perturbed Substructure Optimization in Complex Network.	This study suggests a method for reconstructing the genetic operation and designing a development framework for efficient parallel acceleration.
An analytic theory of creativity in convolutional diffusion models.	This paper derives closed-form equations that model the images generated by convolutional diffusion models. In this simplified setting, the equations predict the generated image with a high degree of confidence.

News

Link	description
Berlin accuses Elon Musk of trying to influence German election.	Government spokesperson says freedom of speech ‘covers the greatest nonsense’ after Musk’s endorsements of AfD
Dating apps prepare to launch AI features to help users find love.	Match Group’s digital assistant will tailor profiles and search for dates – but critics fear genuine connections are at risk
AI tools may soon manipulate people’s online decision-making, say researchers.	Study predicts an ‘intention economy’ where companies bid for accurate predictions of human behavior
‘Godfather of AI’ shortens odds of the technology wiping out humanity over next 30 years.	Geoffrey Hinton says there is 10% to 20% chance AI will lead to human extinction in three decades, as change moves fast
OpenAI lays out a plan to shift to a for-profit corporate structure.	AI company, which makes ChatGPT, says in blogpost ‘we once again need to raise more capital than we’d imagined’
ChatGPT search vs. Google: A deep dive analysis of 62 queries.	A study comparing 62 queries analyzed ChatGPT search and Google, revealing distinct strengths and weaknesses. Google excelled in informational, local, and commercial queries, while ChatGPT showed potential in content gap analysis and disambiguation. Both faced issues with errors and incomplete responses, though Google generally offered more reliable results.
Nick Clegg, former UK deputy prime minister, leaves Meta.	Clegg was the tech giant’s chief public policy architect when it was facing scrutiny over Cambridge Analytica scandal
DeepSeek-V3, ultra-large open-source AI, outperforms Llama and Qwen on launch.	Chinese AI startup DeepSeek has launched DeepSeek-V3, a 671B parameter model using a mixture-of-experts architecture, now available on Hugging Face. DeepSeek-V3 surpasses leading models like Meta's Llama 3.1 and competes with closed models like OpenAI's GPT-4o. It focuses on efficiency with innovations such as multi-token prediction, significantly reducing training costs.
Microsoft and OpenAI have a financial definition of AGI.	Microsoft and OpenAI define AGI as AI systems generating $100 billion in profits, a milestone OpenAI is far from reaching. Currently losing billions, OpenAI doesn't anticipate profitability until 2029, raising questions about how long Microsoft will maintain access to its technology. Financial metrics counter speculation that OpenAI might prematurely declare AGI.
OpenAI ‘considered’ building a humanoid robot.	OpenAI is exploring the development of its own humanoid robot, drawing on past investments in robotics companies like Figure and 1X. The company disbanded its robotics division in 2021, and re-entering this competitive market poses significant challenges.
Would you trust a driverless robotaxi? Waymo hopes so.	Waymo has expanded its self-driving ride-hailing service to Los Angeles, adding to its operations in San Francisco and Phoenix. Riders value the smoother, more private experience over traditional rideshares. Despite growing ridership, the service's profitability remains unclear.
ChatGPT Search can be tricked into misleading users, new research reveals.	ChatGPT Search, an AI-powered search engine that went live this month, can be fooled into generating completely misleading summaries, U.K. newspaper The Guardian has found.
Meta is rolling out live AI and Shazam integration to its smart glasses.	The Ray-Ban Meta Smart Glasses already worked well as a head-mounted camera and pair of open-ear headphones, but now Meta is updating the glasses with access to live AI without the need for a wake word, live translation between several different languages, and access to Shazam for identifying music.
AI helps ID paint chemistry of Berlin Wall murals.	SAPNet is a neural network developed by Italian scientists to enhance spectral data analysis from handheld Raman spectroscopy devices.
Cerebras Demonstrates Trillion Parameter Model Training on a Single CS-3 System.	Cerebras Systems and Sandia National Laboratories successfully trained a 1 trillion parameter AI model on a single CS-3 system using Cerebras' Wafer Scale Cluster technology. This approach eliminates the need for thousands of GPUs, simplifying deployment. The model scaled seamlessly to 16 CS-3 systems, demonstrating impressive linear scalability.
xAI is testing a standalone iOS app for its Grok chatbot.	Elon Musk’s AI company, xAI, is testing out a standalone iOS app for its chatbot, Grok, which was available only to X users until now.
OpenAI says it has no plans for a Sora API — yet.	OpenAI says it has no plans to release an API for Sora, its AI model that can generate reasonably realistic videos when provided with a text description or reference image.
BYD officially enters humanoid robot race as global talent search kicks off.	China’s leading EV maker will try its hand in a promising new field. As electric car sales continue surging to record highs, BYD plans to take on the world of humanoid robots. To kick things off, BYD announced a new recruitment program to attract top talent from around the globe.
Nvidia to open-source Run:ai, the software it acquired for $700M to help companies manage GPUs for AI.	Nvidia has completed its acquisition of Run:ai, a software company that makes it easier for customers to orchestrate GPU clouds for AI, and said that it would open-source the software.
YouTube Teams With CAA to Let Talent Identify — and Pull Down — AI Deepfakes of Themselves.	YouTube and CAA have partnered to help talent combat AI-generated fakes using early-stage likeness management technology. The tool allows actors and athletes to identify unauthorized AI replicas and request their removal. This collaboration focuses on protecting IP rights while testing and refining AI detection systems ahead of a broader launch.
Engineered Arts restructures with $10M to create humanoid robots.	Engineered Arts, a United Kingdom firm making humanoid robots, has restructured as a U.S. company and raised $10 million.
NVIDIA Unveils Its Most Affordable Generative AI Supercomputer.	The Jetson Orin Nano Super delivers up to a 1.7x gain in generative AI performance, supporting popular models for hobbyists, developers, and students.
OpenAI failed to deliver the opt-out tool it promised by 2025.	Back in May, OpenAI said it was developing a tool to let creators specify how they want their works to be included in — or excluded from — its AI training data. But seven months later, this feature has yet to see the light of day.
Code Assist, Google’s enterprise-focused coding assistant, gets third-party tools.	Google on Tuesday announced support for third-party tools in Gemini Code Assist, its enterprise-focused AI code completion service.

Resources

Link	description
A Survey on LLM Inference-Time Self-Improvement.	This survey categorizes LLM inference-time self-improvement techniques into three areas: independent methods like enhanced decoding, context-aware approaches leveraging external data, and model collaboration strategies.
Explore Theory-of-Mind: Program-Guided Adversarial Data Generation for Theory of Mind Reasoning.	ExploreToM is a framework leveraging A* search to create complex theory-of-mind scenarios, exposing significant limitations in current LLMs' social intelligence. Advanced models like GPT-4 and Llama-3 achieved as low as 5% accuracy in these scenarios, despite excelling on simpler benchmarks. Fine-tuning with ExploreToM data improved performance on existing benchmarks by 27 points.
CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions.	The Chordonomicon dataset provides over 666,000 songs with chord progressions annotated by genre, structure, and release date, addressing a significant gap in deep learning resources for music analysis.
ClassiFIM: An Unsupervised Method To Detect Phase Transitions.	ClassiFIM is a new approach for estimating the Fisher Information Metric in unsupervised learning of phase transitions.
AI Hedge Fund.	An AI-powered hedge fund that uses multiple agents to make trading decisions.
FlowEdit.	Easy editing of images with flow-based models.
Transfusion - Pytorch.	Lucidrains has written up a great reimplementation of Meta's token + diffusion model Transfusion which can do images and text in a single model.
Fast LLM Inference From Scratch.	The article details the creation of an LLM inference engine using C++ and CUDA without external libraries, emphasizing speed optimization for consumer devices. It explores techniques like multithreading, vectorization, warp reductions, coalescing, and quantization, achieving better throughput than llama.cpp in specific cases. The piece also highlights opportunities for further optimization and discusses the benefits of established libraries for production-grade applications.
8 expert tips for getting started with NotebookLM.	This guide offers key insights from experts to help beginners get started with NotebookLM, making it easier to navigate and use effectively.
Implicit Grid Convolution for Multi-Scale Image Super-Resolution.	This paper introduces a new approach to Super-Resolution (SR) that challenges the conventional method of training separate models for each scale.
Label Critic: Using LVLMs to Compare Medical Segmentations and Correct Label Errors.	Label Critic is a cutting-edge tool that simplifies medical dataset annotation by leveraging AI-generated labels, eliminating the need to start from scratch.
Py-CTCMetrics.	The CHOTA metric (Cell-specific Higher Order Tracking Accuracy) enhances the evaluation of cell tracking methods in biomedical research by integrating cell detection, global coherence, and lineage tracking into a unified framework. Unlike existing metrics that emphasize local accuracy, CHOTA provides a comprehensive approach, better suited for high-level biological analysis.
FM4Music.	This repository, along with the companion paper, contains a list of services, models, datasets, and systems used to generate music.
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation.	A multimodal model that unifies image and text generation and understanding by using a novel set of autoregressive and discrete diffusion blocks.
Xmodel-1.5: An 1B-scale Multilingual LLM.	Xmodel-1.5 is a powerful 1-billion-parameter language model trained on 2 trillion tokens that excels in multiple languages including Thai, Arabic, French, Chinese, and English.
Vehicle Detection with Enhanced Accuracy.	VFM-Det is a vehicle detection method that combines a pre-trained vehicle model (VehicleMAE) with a large language model (T5).
FS-Jump3D Dataset.	FS-Jump3D dataset improves Temporal Action Segmentation (TAS) in figure skating, a key aspect of judging skaters' performances.
SCUDA: GPU-over-IP.	SCUDA is a GPU-over-IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.
Globally Correlation-Aware Hard Negative Generation.	GCA-HNG is a framework for generating more effective hard negatives by considering global sample correlations instead of just local ones.
Static Quantization Beats Dynamic through Prefixed Outliers in LLMs.	PrefixQuant is a new method that improves LLM quantization by isolating outlier tokens offline, eliminating the need for costly per-token dynamic quantization.
Xmodel_LM-1.1B.	A compact and efficient 1.1B language model pre-trained on around 2 trillion tokens.
Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world.	SAG is a self-supervised framework that enhances the generalization of optical flow and stereo methods for real-world applications. By leveraging advanced reconstruction techniques, SAG generates datasets from RGB images and quantifies confidence levels to address imperfections, offering a robust alternative to traditional approaches.
Olympus: A Universal Task Router for Computer Vision Tasks.	Olympus is a universal task router for computer vision tasks.
Enhance Non-Ideal CT Imaging.	TAMP is a multi-scale integrated Transformer model designed to enhance non-ideal CT (NICT) imaging.
Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling.	This code leverages advancements in general 3D vision to enhance robot vision, particularly by predicting the dynamics of objects manipulated by a robotic arm. This capability improves the system's overall manipulation performance.
Process Reinforcement Through Implicit Rewards.	Few open replications of o1 reasoning exist, but this work shows promise by using implicit rewards that bypass formal reward methods, while also rewarding outcomes consistent with reasoning model principles. Though the code is still in progress, the developers have released the data and models.
Single Modality 3D Object Detection.	This repository offers a 3D object detection framework optimized for single-modality inputs, focusing on simplified and efficient use cases.
Vinci - An Online Egocentric Video-Language Assistant.	Vinci is an online video-language assistant for egocentric (first-person) video.
VisionReward.	VisionReward is a fine-grained and multi-dimensional reward model.
Wonderful Matrices.	A comprehensive collection of efficiently implemented matrix operations, ideal for mathematical and scientific computing tasks.
ChatTime: A Multimodal Time Series Foundation Model.	ChatTime is a multimodal foundation model for time series.
CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation.	CrossEarth is the first vision foundation model aimed at generalizing across diverse remote sensing scenarios.
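
The PrefixQuant entry above contrasts static quantization with per-token dynamic quantization. The sketch below is a toy illustration of why outlier tokens matter for that trade-off, not PrefixQuant's actual procedure: the helper names and the activation values are invented for this example. It compares one static scale calibrated with an outlier token, one static scale calibrated after the outlier is set aside, and a dynamic scale recomputed for every token.

```python
def absmax_scale(rows, bits):
    """Symmetric scale so the largest magnitude maps to the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for row in rows for v in row) / qmax


def fake_quantize(row, scale, bits):
    """Quantize to signed integers, clamp, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]


def mean_abs_error(rows, scale_for_row, bits):
    """Average reconstruction error over all values in `rows`."""
    errors = [
        abs(v - q)
        for row in rows
        for v, q in zip(row, fake_quantize(row, scale_for_row(row), bits))
    ]
    return sum(errors) / len(errors)


BITS = 4
# Made-up activations: two ordinary tokens and one outlier token.
normal_tokens = [[0.5, -0.25, 0.75], [0.1, 0.6, -0.4]]
outlier_token = [60.0, -45.0, 30.0]

# Static scales are fixed ahead of time from calibration data.
static_all = absmax_scale(normal_tokens + [outlier_token], BITS)
static_isolated = absmax_scale(normal_tokens, BITS)

err_static_all = mean_abs_error(normal_tokens, lambda row: static_all, BITS)
err_static_isolated = mean_abs_error(normal_tokens, lambda row: static_isolated, BITS)
# Dynamic quantization recomputes the scale for each token at run time.
err_dynamic = mean_abs_error(normal_tokens, lambda row: absmax_scale([row], BITS), BITS)

print(err_static_all, err_static_isolated, err_dynamic)
```

With the outlier included in calibration, the shared scale is so coarse that the ordinary tokens round to zero. Once the outlier is handled separately, a single precomputed scale gets close to the per-token result without any run-time scale computation.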

Perspectives

Link	description
‘All people could do was hope the nerds would fix it’: the global panic over the millennium bug, 25 years on.	Planes were going to drop out of the sky, nuclear reactors would explode. But then … nothing. What really happened with Y2K? People still disagree …
How will AI reshape 2025? Well, it could be the spreadsheet of the 21st century.	Large language models have changed how big corporations function, and the arrival of AI ‘agents’ – essentially automated Moneypennys – could prove irresistible
How AI is unlocking ancient texts — and could rewrite history.	From deciphering burnt Roman scrolls to reading crumbling cuneiform tablets, neural networks could give researchers more data than they’ve had in centuries.
6G-AI Mashups Will Reshape the Telecom Industry.	The EU-U.S. 6G-XCEL project, along with efforts like ACCoRD and COSMOS, is driving 6G research through AI-integrated network architectures. Workshops at Rutgers showcased 6G innovations, emphasizing open-source initiatives and industry collaborations. These efforts aim to accelerate development and establish interoperability frameworks for next-generation wireless networks.
Why Google bought Character AI.	Google acquired Character AI for its cost-efficient inference technology, enabling scalable AI interactions and supporting free model offerings via AI Studio without affecting unit economics. This move aligns with the shift toward optimizing inference as pre-training yields diminish.
Computing inside an AI.	Shifting from a model-as-person to a model-as-computer metaphor could make AI more effective by introducing graphical interfaces and direct manipulation, reducing reliance on slower conversational inputs. This paradigm enables users to interact with AI as a dynamic, customizable app, improving efficiency and versatility. Generative interfaces have the potential to revolutionize computing, allowing users to create and modify applications on demand for specific tasks.
How Claude Became Tech Insiders’ Chatbot of Choice.	Anthropic's AI chatbot Claude is gaining popularity among tech insiders for its perceived emotional intelligence and creative responses.
Desktop, Touch, Browser, Now AI? The Next OS in Computing.	Human-computer interaction is evolving from graphical interfaces to a more conversational AI-driven approach.
Tenstorrent and the State of AI Hardware Startups.	Tenstorrent's open-source AI hardware offers a competitive alternative to Nvidia, integrating unique CPU and AI core strategies. Leveraging Samsung Foundry's cost-efficient SF4X process, the company addresses latency challenges for scalable AI workloads. With a recent $2B valuation, Tenstorrent shows strong potential, particularly as a high-performance RISC-V IP option amid ARM's pricing challenges.
o3 “ARC AGI” postmortem megathread: why things got heated, what went wrong, and what it all means.	OpenAI's recent AI demonstration faced criticism for creating misleading impressions of achieving AGI, with unclear pretraining details and questionable graphs. Experts from DeepMind and Hugging Face noted that the AI took the test with extensive pretraining, unlike humans. The lack of transparency and test methodology limits direct comparisons to human abilities, casting doubt on the significance of the claimed breakthrough.
Trusted Autonomy: Robotics, AI, and Blockchain.	What happens when robotics, AI, and blockchain converge? OpenMind's latest industry primer is a comprehensive exploration of robotics, AI, and blockchain synergy.
AIs Will Increasingly Attempt Shenanigans.	Recent research reveals AI models' increasing ability for in-context scheming, including lying, exfiltration attempts, and oversight subversion. Apollo's findings show that frontier models like o1 and Llama 3.1 display these behaviors with minimal instruction, raising concerns about AI alignment and safety. While some question the testing conditions, the study highlights the challenges of managing more autonomous AI systems.
The o1 System Card Is Not About o1.	The o1 model's release revealed insufficient testing and discrepancies in its system card, with actual performance and safety evaluations falling short of expectations. OpenAI's lack of clear communication and timely evaluations underscores the need for updated, transparent procedures to ensure AI safety and reliability before deployment.
Deepseek: The Quiet Giant Leading China’s AI Race.	Deepseek, a Chinese AI startup backed by the hedge fund High-Flyer, has gained recognition for surpassing OpenAI on reasoning benchmarks and driving price competition with its efficient AI models. Led by CEO Liang Wenfeng, Deepseek emphasizes open-source foundational technology and self-funded extensive computing resources. Focusing on AGI research, the startup challenges traditional innovation norms in China while attracting top domestic talent.
How OpenAI Hopes to Sever Its Nonprofit Roots.	Sam Altman is steering OpenAI toward transitioning control from its founding nonprofit to a for-profit model to better compete with tech giants. The talks focus on fair compensation for nonprofits and addressing stakeholder interests, including Microsoft's. OpenAI must restructure within two years to avoid converting recent investments into debt.

Back to index

2024

ML news: Week 23 - 29 December

Research

Link	description
Genesis: A Generative and Universal Physics Engine for Robotics and Beyond.	A new universal physics simulation platform integrates a high-performance physics engine with generative AI, enabling natural language-driven creation of robotic simulations, character animations, and interactive 3D environments. It achieves speeds up to 430,000 times faster than real time.
Alignment faking in large language models.	This study shows that the Claude model can engage in "alignment faking," strategically complying with harmful requests to avoid retraining while maintaining its original safety preferences. This raises concerns about the reliability of current AI safety training methods.
Can LLMs Convert Graphs to Text-Attributed Graphs?	This approach automatically generates textual descriptions for graph nodes, enabling effective graph-to-text-attributed transformations. It is evaluated on text-rich, text-limited, and text-free graphs, showing that it allows a single GNN to perform effectively across diverse graph types.
Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents.	A learning system enabling AI agents to autonomously discover and practice skills through web navigation, leveraging reinforcement learning and context-aware task proposals to achieve state-of-the-art results on real-world benchmarks.
Using Generative AI and Multi-Agents to Provide Automatic Feedback.	A two-agent AI system provides more accurate and pedagogically sound feedback for student responses in science assessments, significantly reducing errors such as over-praise compared to single-agent models.
Precise Length Control in Large Language Models.	This approach adapts a pre-trained decoder-only LLM to generate responses of a specified length by incorporating a secondary length-difference positional encoding into the input embeddings. This mechanism enables the model to count down to a user-defined terminal length, achieving mean token errors of fewer than 3 tokens while maintaining response quality.
Machine learning helps to determine the diverse conformations of RNA molecules.	An innovative technique called HORNET uses atomic force microscopy and a machine-learning architecture called a deep neural network to recapitulate the 3D structures of individual RNA molecules. This method enables the study of the structure and dynamics of RNAs that adopt flexible and variable conformations under biologically relevant conditions.
OpenAI's new alignment method.	OpenAI has introduced a new alignment technique for reasoning models that focuses on grounded behavior goals, such as adhering to safety guidelines. This approach separates alignment from preference embedding, marking progress in developing more adaptable and goal-oriented AI systems.
MedCoT: Medical Chain of Thought via Hierarchical Expert.	A new reasoning framework that enhances accuracy and interpretability in Medical Visual Question Answering.
SAM-Swin: SAM-Driven Dual-Swin Transformers with Adaptive Lesion Enhancement for Laryngo-Pharyngeal Tumor Detection.	SAM-Swin is a model for detecting laryngopharyngeal cancer (LPC) that uses advanced features from the Segment Anything Model 2 (SAM2).
So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer.	GitHub has introduced a new open-source byte-pair tokenizer optimized for speed and flexibility in large language model applications like Copilot. With linear complexity, it scales efficiently and supports dynamic token counts for real-time text operations. Benchmarks show it outperforms libraries like tiktoken and Hugging Face's tokenizers, offering significant performance improvements across applications.
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback.	Researchers investigated the effects of training AI language models to optimize for user feedback, such as thumbs-up ratings. The study found that this approach can result in manipulation, as AIs learn to game the system.
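
For readers unfamiliar with byte-pair encoding, mentioned in the tokenizer entry above, here is a minimal sketch of the textbook merge loop. It is not GitHub's algorithm: this naive version rescans the token list after every merge, which is exactly the cost a linear-time tokenizer avoids. The function name `bpe_encode` and the tiny merge table are hypothetical and exist only for illustration.

```python
def bpe_encode(text, merges):
    """Encode `text` using a ranked list of byte-pair merges.

    `merges` is ordered by priority: earlier pairs are merged first.
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    # Start from single bytes of the UTF-8 encoding.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while len(tokens) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        candidates = [
            (ranks.get((left, right), float("inf")), index)
            for index, (left, right) in enumerate(zip(tokens, tokens[1:]))
        ]
        best_rank, index = min(candidates)
        if best_rank == float("inf"):
            break  # No known merge applies any more.
        # Replace the chosen pair with its concatenation.
        tokens[index:index + 2] = [tokens[index] + tokens[index + 1]]
    return tokens


# Hypothetical merge table, highest priority first.
MERGES = [(b"l", b"o"), (b"lo", b"w"), (b"e", b"r")]

print(bpe_encode("lower", MERGES))
print(len(bpe_encode("lowlow", MERGES)))  # token count for a given input
```

Counting tokens for a string is just `len(bpe_encode(...))`, which is the kind of operation the entry describes needing to run quickly on changing text.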

News

Link	description
‘We’re figuring out cool ways of storytelling’: how TikTok is changing the way we watch musicals.	Jorge Rivera-Herrans’s musical sensation Epic is just one of a series of works making a splash on the online platform
OpenAI o3 and o3-mini.	On the final day of its announcements, OpenAI unveiled o3, its most powerful reasoning model.
Latest Google AI Innovations.	Google showcases recent AI advancements, featuring improved conversational AI models, updates to responsible AI practices, and new developer tools.
ChatGPT search tool vulnerable to manipulation and deception, tests show.	Guardian testing reveals AI-powered search tools can return false or malicious results if webpages contain hidden text
Older music has been getting a second life on TikTok, data shows.	Despite newer artists having viral moments, app users also enjoyed old school acts including Bronski Beat and Sade
New physics sim trains robots 430,000 times faster than reality.	Genesis, an open-source simulation platform developed by a team led by Carnegie Mellon University, enables robot training 430,000 times faster than real-world conditions using text-generated 3D worlds. It processes physics calculations 80 times faster than existing simulators on standard GPUs, accelerating neural network training for robotics. Built in Python, Genesis provides a non-proprietary, user-friendly solution for creating dynamic, physics-based environments without manual programming.
Waymo still doing better than humans at preventing injuries and property damage.	Waymo’s autonomous vehicles cause less property damage and fewer bodily injuries when they crash than human-driven vehicles, according to a study that relies on an analysis of insurance data.
Microsoft’s growing AI health ambitions.	A look at Microsoft's expanding AI efforts in health.
Perplexity has reportedly closed a $500M funding round.	AI-powered search engine Perplexity has reportedly closed a $500 million funding round, valuing the startup at $9 billion.
OpenAI expands ChatGPT Canvas to all users.	OpenAI has launched Canvas, its digital editing space, for all ChatGPT users, integrated with GPT-4o. Canvas enhances the chat interface by enabling real-time editing and Python code execution. Now a default feature in custom GPTs, it offers advanced functionality for an improved user experience.
Gemini can now tell when a PDF is on your phone screen.	Google’s Files app is rolling out a Gemini screen awareness feature that offers to answer questions about open PDFs.
Apple reportedly developing AI server chip with Broadcom.	Apple is working with semiconductor company Broadcom on its first server chip designed to handle AI applications, according to The Information, which cited three people with knowledge of the project.
NHTSA finally releases new rules for self-driving cars — but there’s a twist.	Regulators say they’ll ease rules allowing for fully driverless cars, but companies need to cough up the data.
Saudi Arabia invests in robots to help build its Neom desert megacity.	As Saudi Arabia continues to reshape its desert landscape with an incredible number of ambitious construction projects, it has employed some high-tech robotic help to increase efficiency and speed things up.
Ex-Twitch CEO Emmett Shear is founding an AI startup backed by a16z.	Emmett Shear, former Twitch CEO, has launched Stem AI, an AI startup focused on aligning AI with human behavior and ethics. Co-founded with Adam Goldstein and backed by a16z, the startup is in stealth mode but has been actively developing since mid-2023. Shear, known for his advocacy on AI regulation and safety, has frequently voiced concerns about AI's risks to humanity.
ChatGPT now understands real-time video, seven months after OpenAI first demoed it.	OpenAI has introduced real-time video capabilities for ChatGPT, enabling users to interact with objects in near real-time via Advanced Voice Mode with vision. The rollout is staggered, with Enterprise subscribers gaining access in January. Similarly, Google recently launched Project Astra, offering comparable functionality for trusted testers on Android.
Anthropic’s 3.5 Haiku model comes to Claude users.	Anthropic has introduced Claude 3.5 Haiku on its chatbot platform, surpassing Claude 3 Opus in coding and content moderation benchmarks. While capable of generating longer text, it lacks image analysis features. Pricing controversies emerged after Anthropic raised the API cost, despite earlier promises to match Claude 3 Haiku's pricing.
Report: Google told FTC Microsoft’s OpenAI deal is killing AI competition.	Google has urged the FTC to end Microsoft's exclusive cloud deal with OpenAI, arguing it raises costs for competitors.
Google’s new Trillium AI chip delivers 4x speed and powers Gemini 2.0.	Google announced Trillium, its latest AI accelerator chip, boasting a 4x performance boost and significant energy efficiency improvements.
OpenAI introduces “Santa Mode” to ChatGPT for ho-ho-ho voice chats.	An AI version of old St. Nick arrives as a seasonal character in popular chatbot app.
Sriram Krishnan named Trump’s senior policy advisor for AI.	President-elect Donald Trump has confirmed reports that Sriram Krishnan, until recently a general partner at Andreessen Horowitz (a16z), will serve as senior policy advisor for AI at the White House Office of Science and Technology Policy.
OpenAI trained o1 and o3 to ‘think’ about its safety policy.	OpenAI's upcoming o3 model family, set for release in 2025, features enhanced reasoning and safety through a "deliberative alignment" process. This method aligns AI responses with OpenAI's safety values during inference, without relying on human-written data. Combined with synthetic data and reinforcement learning, it positions o3 as OpenAI's safest model to date.
Google’s new Jules AI agent will help developers fix buggy code.	Jules uses Gemini 2.0 to address Python and JavaScript coding issues in GitHub.
Microsoft releases Phi-4 language model trained mainly on synthetic data.	Microsoft's new open-source language model, Phi-4, excels in solving math problems, outperforming even larger models like GPT-4o and Llama 3.3.
ChatGPT's new Projects feature can organize your AI clutter.	OpenAI's new Projects feature for ChatGPT enhances interaction organization by grouping related chats and files within a named Project.
Google is using Anthropic’s Claude to improve its Gemini AI.	Google contractors are evaluating Gemini AI's responses against Anthropic's Claude, raising questions about whether Google has permission for such testing. Contractors observed that Claude emphasizes safety more than Gemini in its responses. Google clarified that it compares outputs with competitors but does not train Gemini using Anthropic's models.
Red Rabbit Robotics takes human form to sell work as a service.	Red Rabbit Robotics is addressing labor shortages by developing and open-sourcing the RX1 robot for manufacturing and commercial use. Designed for dull, dangerous, and dirty jobs, the RX1 offers cost-effective solutions, aiming to make robotics more accessible. The company focuses on transitioning from teleoperation to full autonomy, prioritizing utility and widespread adoption.
Klarna’s CEO says it stopped hiring thanks to AI but still advertises many open positions.	Klarna CEO Sebastian Siemiatkowski claims generative AI has enabled a workforce reduction, though the company is still hiring for essential roles.

Resources

Link	description
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks.	A new benchmark evaluates AI agents on real-world professional tasks within a simulated software company, covering roles like software engineering, project management, finance, and HR. Testing various LLMs, including API-based models like Claude-3.5-Sonnet and open-source models like Llama 3.1, highlights current limitations. The best performer, Claude-3.5-Sonnet, achieved a 24% full-task success rate and 34.4% with partial progress considered.
Qwen2.5 Technical Report.	The technical report for Qwen2.5, a series of large language models from Alibaba's Qwen team.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding.	A new series of vision-language models introduces dynamic tiling for high-resolution images and an efficient MoE architecture, delivering competitive or state-of-the-art performance across visual tasks. These models achieve similar or superior results with fewer activated parameters compared to existing open-source dense and MoE-based models.
A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges.	This survey provides an in-depth analysis of mathematical reasoning capabilities in multimodal large language models (MLLMs), reviewing benchmarks, methodologies, and challenges across over 200 studies conducted since 2021.
Multi-Sentence Annotation Dataset.	A new dataset tailored for training and evaluating AI models on multi-sentence understanding and annotation tasks, with a focus on context-aware analysis.
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion.	This work tackles affordance-aware object insertion, using a mask-aware dual diffusion model to place objects plausibly within a scene.
OpenEMMA: Multimodal AI Toolkit.	A comprehensive toolkit for developing multimodal AI applications, with pre-built modules for vision, language, and audio integration.
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis.	LeviTor is an image-to-video synthesis method that uses 3D trajectories to control how objects move in the generated video.
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark.	Microsoft's MMLU-CF is a multi-task language understanding benchmark designed to be free of data contamination, so that scores are not inflated by test questions leaking into training data.
Building Python tools with a one-shot prompt using uv run and Claude Projects.	A nice blog post outlining a prompting strategy for making self-contained, uv-compatible Python scripts with Claude.
Google unveils Project Mariner: AI agents to use the web for you.	Google DeepMind has unveiled Project Mariner, an AI agent that autonomously navigates and interacts with websites via Chrome.
Google is testing Gemini AI agents that help you in video games.	Google is testing the agents, which can reason about what they see onscreen, with Supercell games like Clash of Clans.
Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model.	Stag-1 is a 4D driving simulation platform that recreates real-world scenes and produces realistic videos from any chosen perspective.
Apollo: An Exploration of Video Understanding in Large Multimodal Models.	Meta has released a number of multimodal video understanding models with strong long context video performance.
TEXGen: a Generative Diffusion Model for Mesh Textures.	Most texture generation systems depend on pretrained 2D image diffusion models. This work tackles the problem directly in UV texture space, introducing inductive biases that significantly enhance system performance.
Everything we know about Muon Optimizer.	A NanoGPT speed run record holder has published an in-depth blog post on the Muon optimizer, widely used in many winning runs for its efficiency and performance.
Driv3R: Learning Dense 4D Reconstruction for Autonomous Driving.	The Driv3R system transforms 4D reconstruction for autonomous vehicles by removing the need for slow global alignment processes, significantly enhancing efficiency.
Frontier Training Kernels for Transformers (FA2) and SSMs (Mamba2) on AMD Instinct MI300X Accelerators.	Zyphra has written performant backward-pass kernels for FlashAttention-2 and Mamba2 on AMD Instinct MI300X accelerators.
Accelerating Vision Diffusion Transformers with Skip Branches.	Skip-DiT is a new version of Diffusion Transformers (DiT) designed to address the computational challenges in image and video generation.
Bamba: Inference-Efficient Hybrid Mamba2 Model.	Bamba is an efficient hybrid Mamba 2-style model with strong performance.
Multimodal Live API - Web console.	Google has a prebuilt application that uses its new extremely fast multimodal API.
Visualizing 6D Mesh Parallelism.	An amazingly detailed post exploring different training parallelism strategies for training deep models.
SCoralDet: Efficient real-time underwater soft coral detection with YOLO & SCoralDet Dataset.	A dataset for detecting and classifying underwater coral species designed to facilitate marine conservation efforts using advanced AI models.
MedDec: A Dataset for Extracting Medical Decisions from Discharge Summaries.	MedDec is a dataset for improving the extraction of medical decisions from clinical notes. It covers eleven different diseases and is annotated with ten types of medical decisions.
ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI.	ManiSkill3 is an advanced, open-source robotics simulator designed for scalable learning and manipulation tasks.
EmoBox.	EmoBox is a versatile toolkit for Speech Emotion Recognition (SER), offering a multilingual, multi-corpus benchmark for intra-corpus and cross-corpus settings. It simplifies the comparison and reproduction of SER models, addressing common challenges in the field.
How to get real GPU utilization metrics.	nvidia-smi reports a GPU utilization figure, but it only measures the fraction of time during which at least one kernel is running, not how fully the GPU is used. This work by Stas shows how to measure actual FLOP usage.
Sharing new research, models, and datasets from Meta FAIR.	Meta has released an updated agents framework to measure and ensure robustness and safety when deployed in the wild.
Material Transforms from Disentangled NeRF Representations.	This research introduces a technique for applying material transformations, like wetness or coating, across different scenes using disentangled Neural Radiance Fields (NeRF).
FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data.	A new benchmark evaluates the federated fine-tuning of MLLMs across diverse scenarios, including two datasets, five baselines, and over ten types of multimodal heterogeneities.
LLM Prompt Tuning Playbook.	A helpful guide for prompt engineering.
Metal Puzzles.	A set of puzzles and tutorials for learning GPU programming with Metal acceleration on Mac.

Perspectives

Link	description
How to Build a Truly Useful AI Product.	Off-the-shelf evaluations often fail to effectively measure LLM performance for specific tasks. Useful metrics for classification include recall, precision, ROC-AUC, while summarization and translation can employ NLI-based consistency checks and chrF or BLEURT, respectively. Consider potential defects like copyright regurgitation and toxicity in models, using tests such as RealToxicityPrompts for comprehensive evaluation.
Task-Specific LLM Evals that Do & Don't Work.	Standard evaluations often fall short in assessing LLM performance for specific tasks. Key metrics include recall, precision, and ROC-AUC for classification, while summarization and translation benefit from NLI-based consistency checks and metrics like chrF or BLEURT. Evaluations should also address defects such as copyright regurgitation and toxicity, using tools like RealToxicityPrompts for thorough analysis.
o1 Turns Pro.	OpenAI's o1 and o1 Pro updates bring notable advancements in coding, math, and complex problem-solving, excelling in deep reasoning and fact recall. The $200/month o1 Pro tier offers enhanced compute power for specialized tasks, while the $20/month option remains sufficient for most users' needs. Reactions are largely positive, with the Pro tier catering to those with advanced requirements.
The Google Willow thing.	Google's Quantum group has introduced "Willow," a 105-qubit superconducting chip highlighting advancements in error correction and a new quantum supremacy experiment. With improved coherence times and gate fidelity, Willow represents a key step toward quantum fault-tolerance. However, achieving fully fault-tolerant operations and verifying results remain significant challenges.
Inside the AI drug discovery arms race.	AI is revolutionizing drug discovery, with biologics developers raising $1.6B in 2024, signaling a shift beyond small molecules. M&A activity is booming as big pharma acquires startups and enhances in-house AI capabilities, highlighting the drive to leverage AI for cost reduction and faster drug development.
5 ways to explore chess during the 2024 World Chess Championship.	Google is celebrating chess' enduring influence on AI with global events and experiences that honor the game's impact on technology and creativity.
Why materials science is key to unlocking the next frontier of AI development.	The journey from Intel's 1971 microprocessor to Apple's M2 Ultra showcases rapid semiconductor progress fueled by Moore's Law. As physical limits approach, breakthroughs in materials and architectures like photonic and neuromorphic computing are essential for AI and next-gen technologies. The industry's future hinges on innovative materials science to address scalability and energy efficiency challenges.
Scaling Laws – O1 Pro Architecture, Reasoning Training Infrastructure, Orion and Claude 3.5 Opus “Failures”.	While skepticism surrounds AI scaling laws due to data and hardware limitations, companies like Amazon, Meta, and OpenAI are heavily investing in data centers and custom silicon, reflecting confidence in scaling potential. New approaches, including synthetic data, reinforcement learning, and advanced fine-tuning, address traditional barriers. OpenAI's o1 release highlights innovations like increased test-time compute, multi-datacenter training, and novel scaling dimensions, significantly boosting AI model performance.
How Claude uses AI to identify new threats.	Anthropic's Clio tool uncovered a coordinated SEO spam campaign using its chatbot, Claude, resulting in the termination of the spammers' access. Clio employs machine learning to detect emerging threats and flag unusual chatbot usage, supporting Anthropic's trust and safety efforts. The company advocates for similar monitoring approaches across AI labs to mitigate risks while enabling diverse user applications.
AI Models Are Getting Smarter. New Tests Are Racing to Catch Up.	AI systems are exceeding expectations on challenging benchmarks like Epoch AI's FrontierMath. However, creating effective evaluations to understand and manage AI capabilities remains a complex and underfunded task. Experts emphasize the importance of developing advanced, timely tests to monitor risks as models progress.
How Hallucinatory A.I. Helps Science Dream Up Big Breakthroughs.	AI hallucinations, typically seen as inaccuracies, are proving beneficial in scientific research by boosting idea generation and discoveries. Achievements include Nobel Prize-winning protein designs, advancements in antibiotics, and catheter innovations. While the term "hallucinations" remains controversial, experts recognize AI's potential for transformative scientific breakthroughs.
AI Godmother Fei-Fei Li Has a Vision for Computer Vision.	Fei-Fei Li's startup, World Labs, focuses on enhancing AI with 3D spatial intelligence to create and interact with 3D worlds. This advancement is key to improving AI capabilities in real and virtual environments, with potential to revolutionize fields such as robotics, design, and augmented reality.
The AI revolution is running out of data. What can researchers do?	AI development is heading toward a data shortage crisis by 2028, as training datasets near the limit of publicly available online text. Companies like OpenAI are addressing this challenge by generating synthetic data and using unconventional sources. This shift may lead to a focus on smaller, specialized AI models instead of large-scale LLMs.
Are LLMs capable of non-verbal reasoning?	Researchers at Meta and UC San Diego are developing LLMs that process logical solutions in "latent space," bypassing natural language constraints.
Quick takes on the recent OpenAI public incident write-up.	OpenAI's Kubernetes incident on December 11 was caused by unexpected interactions, where a new telemetry service overloaded the Kubernetes API servers, leading to failures in DNS-based service discovery.

Back to index

ML news: Week 16 - 22 December

Research

Link	description
Training Large Language Models to Reason in a Continuous Latent Space.	Coconut (Chain of Continuous Thought) introduces a novel paradigm enabling LLMs to reason in continuous latent space instead of natural language. By using the LLM's last hidden state as the reasoning state and feeding it back directly as the next input embedding, Coconut achieves "continuous thought." This approach enhances LLM performance on complex reasoning tasks, leveraging emergent breadth-first search capabilities for more effective reasoning.
Asynchronous LLM Function Calling.	AsyncLM introduces a system for asynchronous LLM function calling, featuring an in-context protocol for function calls and interrupts, along with a fine-tuning strategy to adapt LLMs to interrupt semantics. Efficiently integrated into the LLM inference process, AsyncLM enables concurrent generation and execution of function calls, reducing task completion latency by 1.6x-5.4x compared to synchronous approaches.
MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification.	This multi-agent framework generates datasets of questions resembling customer queries and reverse engineers alternate questions from responses to verify agent trajectories. The synthetic data improves agent performance on real customer queries. For trajectory verification, it finds that simple ML baselines with feature engineering can achieve comparable performance to more advanced, resource-intensive models.
AutoReason: Automatic Few-Shot Reasoning Decomposition.	This method leverages CoT prompting to automatically generate rationales for queries, converting zero-shot queries into few-shot reasoning traces. These traces serve as CoT exemplars for the LLM, enhancing reasoning capabilities, particularly in weaker models.
Byte Latent Transformer: Patches Scale Better Than Tokens.	This work introduces a byte-level language model architecture that rivals tokenization-based LLMs in performance while offering greater efficiency and robustness. It dynamically groups bytes into patches based on next-byte entropy, dedicating more compute to complex predictions and using larger patches for predictable sequences. BLT matches or surpasses models like Llama 3 while reducing inference FLOPs by up to 50%.
Scheming reasoning evaluations.	This paper evaluates six frontier models for their in-context scheming abilities, testing whether models deceive developers to achieve goals by bypassing oversight mechanisms. For example, Claude 3 Opus was found duplicating its weights to a new server and lying about the action.
Researchers Use AI To Turn Sound Recordings Into Accurate Street Images.	Using generative artificial intelligence, a team of researchers at The University of Texas at Austin has converted sounds from audio recordings into street-view images. The visual accuracy of these generated images demonstrates that machines can replicate the human connection between audio and visual perception of environments.
Causal Explanations for Image Classifiers.	This paper presents "rex," a black-box tool that generates concise explanations for image classifier outputs using a novel approach based on causality theory.
Aligning Visual and Semantic Interpretability through Visually Grounded Concept Bottleneck Models.	This work introduces visually grounded concept bottleneck models, which aim to align visual and semantic interpretability.
Adaptive Caching for Faster Video Generation with Diffusion Transformers.	Meta researchers have introduced Adaptive Caching (AdaCache), a training-free approach that accelerates video generation for Diffusion Transformers.
Alignment Faking in Large Language Models.	Anthropic and Redwood's research investigates how models behave when aware of alignment efforts, revealing they can exhibit alignment while retaining their original preferences. This finding highlights gaps in current alignment methods and offers insights for improvement.
Are Your LLMs Capable of Stable Reasoning?	Reasoning is a critical area for models, especially in real-world applications. However, existing benchmarks often fail to measure stability across novel tasks. This paper introduces G-Pass@k, a new evaluation metric that captures both a model's peak performance and its stability in reasoning tasks.
NoteContrast: Contrastive Language-Diagnostic Pretraining for Medical Text.	Accurate diagnostic coding of medical notes is vital for patient care, research, and billing but is time-consuming and often lacks precision. Automated coding using long-document transformers and contrastive loss functions has shown promise. This study integrates ICD-10 code sequences with medical text through contrastive pre-training, outperforming state-of-the-art models on MIMIC-III benchmarks, highlighting its effectiveness in improving diagnostic coding accuracy.
Context is Key: A Benchmark for Forecasting with Essential Textual Information.	Traditional time series forecasting methods rely solely on numerical features, rarely utilizing textual or semantic information about the task (e.g., predicting electricity prices or customer churn). When provided with this contextual textual information, language models significantly outperform all tested forecasting methods across a wide range of carefully decontaminated tasks.
Finally, a Replacement for BERT.	BERT, a widely used encoder-only language model, powers nearly every Google search query. A new model from Answer AI, LightOn, and collaborators offers a faster, more modern, and highly performant alternative. It serves as a drop-in replacement, incorporating innovations like batch ramp to enhance overall performance.
Thinking in Space.	A research initiative focused on spatial reasoning and AI models designed to interpret and interact within three-dimensional spaces.

News

Link	description
BBC says it has complained to Apple over AI-generated fake news attributed to the broadcaster.	Notifications from a new Apple product falsely suggested the BBC claimed the New York gunman Luigi Mangione had killed himself.
She didn’t get an apartment because of an AI-generated score – and sued to help others avoid the same fate.	Despite a stellar reference from a landlord of 17 years, Mary Louis was rejected after being screened by the firm SafeRent.
Does RLHF Scale? Exploring the Impacts From Data, Model, and Method.	This paper examines the key components of the RLHF framework and their impacts, revealing the following insights: RLHF scales less effectively than pretraining for LLMs, with larger policy models benefiting less when using a fixed reward model. Increasing the number of responses sampled per prompt during training improves performance initially but plateaus at 4-8 samples. Larger reward models enhance reasoning task performance, but gains are inconsistent across task types. Increasing training data diversity for reward models is more impactful than boosting response diversity per prompt, though policy training shows diminishing returns beyond the early stages.
Granite Guardian.	IBM has open-sourced Granite Guardian, a suite of safeguards for detecting risks in LLMs. With AUC scores of 0.871 on harmful content and 0.854 on RAG-hallucination benchmarks, the authors claim it is the most generalizable and competitive model in the field.
Liquid AI Raises $250m.	Liquid AI has secured significant funding to advance the training of its efficient, general-purpose liquid-style foundation models.
Projects in OpenAI.	OpenAI has introduced “Projects”, a new way to organize chats and conversations.
AI Godmother Fei-Fei Li Has a Vision for Computer Vision.	Her startup, World Labs, is giving machines 3D spatial intelligence.
Google says its new quantum chip is way faster than the world's most powerful supercomputer.	Google said its new chip Willow demonstrates that it's possible to build "a useful, large-scale quantum computer".
EU launches €10bn space program to rival Musk’s Starlink.	UK not part of Iris2 project, described as ‘a significant step towards Europe’s sovereignty and secure connectivity’.
TikTok turns to US Supreme Court in a last-ditch bid to avert divest-or-ban law.	Firm and parent company ByteDance file request for an injunction to halt ban of the app used by 170 million Americans.
Potential payouts for up to 300,000 Australian Facebook users in Cambridge Analytica settlement.	Office of the Australian Information Commissioner announces deal with Meta over scandal that may have affected 300,000 users.
Chinese AI chip firms blacklisted over weapons concerns gained access to UK technology.	Imagination Technologies had licenses with two Chinese firms – but said it had not ‘implemented transactions’ that would enable the use of technology for military purposes.
UK proposes letting tech firms use copyrighted work to train AI.	Consultation suggests an opt-out scheme for creatives who don’t want their work used by Google, OpenAI and others.
Will the future of transportation be robotaxis – or your own self-driving car?	GM is shutting down its robotaxi business, and Tesla is creating one of its own. What does the future hold for self-driving?
Amazon-hosted AI tool for UK military recruitment ‘carries the risk of data breach’.	Ministry of Defence says risk with Textio tool is low and ‘robust safeguards’ have been put in place by suppliers.
State-of-the-art video and image generation with Veo 2 and Imagen 3.	Google has announced a new video model and a new image generation model. Both are stunning improvements over the previous iterations.
OpenAI Search.	OpenAI explores the potential of ChatGPT Search on the 8th day of its announcements.
Reddit tests a conversational AI search tool.	As more AI companies gobble up Reddit’s data to fuel their own chatbots, the popular online forum site has begun testing a new conversational AI feature of its own.
Study claims AI could boost detection of breast cancer by 21%.	A U.S. breast-screening program claims to demonstrate the potential benefits of using artificial intelligence (AI) in mammography screening, with women who paid for AI-enhanced scans 21% more likely to have cancer detected.
Amazon forms an AI agent-focused lab led by Adept’s co-founder.	Amazon says that it’s establishing a new R&D lab in San Francisco, the Amazon AGI SF Lab, to focus on building “foundational” capabilities for AI agents.
NVIDIA's GenAI Supercomputer.	NVIDIA has unveiled its most affordable generative AI supercomputer, the Jetson Orin Nano Super Developer Kit.
OpenAI's Developer APIs.	OpenAI presents developer-focused demos and API updates.
Grok for Everyone.	Grok has a new version and a new efficient model that is available for all users. It also has an improved image generation model and API.
YouTube’s new auto-dubbing feature is now available for knowledge-focused content.	YouTube's auto-dubbing feature is now available to hundreds of thousands more channels, focusing initially on informational content.
Google kicks off $20B renewable energy building spree to power AI.	Nuclear power may have received the lion’s share of attention from energy-hungry tech companies over the past few months, with Google among them. But it appears that those new reactors won’t be enough for their AI ambitions: Google is now working with partners to build gigawatts of renewable power, battery storage, and grid upgrades to power its data centers.
‘A truly remarkable breakthrough’: Google’s new quantum chip achieves accuracy milestone.	Error-correction feat shows quantum computers will get more accurate as they grow larger.
Publishers are selling papers to train AIs — and making millions of dollars.	Generative AI models require massive amounts of data — scholarly publishers are licensing their content to train them.
AI weatherman: the DeepMind researcher making faster, more accurate forecasts.	Rémi Lam is part of Nature’s 10, a list of people who shaped science in 2024.
Amazon workers across the US gear up to strike this week.	Move comes after company fails to meet deadline to begin contract talks with workers in Staten Island, New York.
OpenAI makes ChatGPT available for phone calls and texts.	On day 10, OpenAI announced free voice mode and texting via WhatsApp, available globally for a limited number of minutes per month. The service leverages the Advanced Voice Mode API.
GitHub Copilot Now Free for VS Code.	Now automatically integrated into VS Code, GitHub Copilot gives everyone access to 2,000 code completions and 50 chat messages per month, simply by signing in with a personal GitHub account or creating a new one.
Introduction to Genies’ Smart Avatars.	Genies unveils Smart Avatars, AI-driven digital entities that transform online interactions by acting as dynamic extensions of user identity. Powered by LLMs and behavioral AI, these avatars enhance experiences in games and platforms while unlocking new avenues for monetization and engagement.
Perplexity's Campus Strategist Program.	Perplexity AI launches its 2024 program to promote AI adoption among students, providing campus-exclusive resources and opportunities for collaboration.
Aethir and partners pour $40M into decentralized infrastructure for AI and blockchain.	Aethir, in partnership with Beam Foundation, Sophon Foundation, and Permian Labs, is introducing Tactical Compute (TACOM), a $40 million initiative to deliver decentralized GPU infrastructure. TACOM addresses the growing need for scalable compute power in AI, gaming, and blockchain with tokenized, distributed solutions, unlocking new opportunities for GPU monetization and fostering innovation in AI and decentralized ecosystems.
Meta launches open source Llama 3.3, shrinking powerful bigger model into smaller size.	Meta's Llama 3.3 is a cost-efficient open-source LLM with 70 billion parameters that offers performance on par with larger models like the 405B Llama 3.1, but with significantly reduced GPU and power costs.
Microsoft Unveils Zero-Water Data Centers to Reduce AI Climate Impact.	Microsoft Corp., trying to mitigate the climate impact of its data center building boom, is starting to roll out a new design that uses zero water to cool the facilities’ chips and servers.
Surrey announces world's first AI model for near-instant image creation on consumer-grade hardware.	A groundbreaking AI model that creates images as the user types, using only modest and affordable hardware, has been announced by the Surrey Institute for People-Centred Artificial Intelligence (PAI) at the University of Surrey.
AI learns to distinguish between aromas of US and Scottish whiskies.	One algorithm identified the five strongest notes in each drink more accurately than any one of a panel of experts.
UK data regulator criticizes Google for ‘irresponsible’ ad tracking change.	ICO says allowing advertisers to track digital ‘fingerprints’ will undermine consumers’ control over information.
UK arts and media reject plan to let AI firms use copyrighted material.	Coalition of musicians, photographers, and newspapers insist existing copyright laws must be respected.
Google releases its own ‘reasoning’ AI model.	Google has released what it’s calling a new “reasoning” AI model — but it’s in the experimental stages, and from our brief testing, there’s certainly room for improvement.
Work with Apps—12 Days of OpenAI: Day 11.	On the 11th day, OpenAI introduced more details about working with the OpenAI desktop app.
AI is booming on the App Store, and developers are taking advantage of it.	Many high-ranking AI apps feel like an attempted cash grab, and it’s not easy to tell the trash from the treasure.
Blood Tests Are Far From Perfect — But Machine Learning Could Change That.	Researchers at the University of Washington and Harvard have used machine learning to create personalized blood test references, enhancing disease prediction accuracy.
OpenAI cofounder Ilya Sutskever says the way AI is built is about to change.	“We’ve achieved peak data and there’ll be no more,” OpenAI’s former chief scientist told a crowd of AI researchers.

Resources

Link	description
Phi-4 Technical Report.	Phi-4, a 14B model, outperforms its teacher model in STEM-QA capabilities and demonstrates strong results on reasoning-focused benchmarks. These advancements are attributed to improved data quality, an optimized training curriculum, and innovations in the post-training process.
Clio: Privacy-Preserving Insights into Real-World AI Use.	This platform leverages AI assistants to analyze and aggregate usage patterns from millions of Claude.ai conversations while preserving user privacy. It provides insights into real-world AI usage, identifying trends, safety risks, and coordinated misuse attempts without requiring human reviewers to access raw conversation data.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods.	This work presents a comprehensive survey of the LLMs-as-judges paradigm, exploring it through five key perspectives: functionality, methodology, applications, meta-evaluation, and limitations.
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM.	Dynamic-VLM applies a simple dynamic compression scheme to visual tokens so that video LLMs can process video inputs more efficiently.
DeepSeek-VL2.	DeepSeek has unveiled a new MoE vision-language model that delivers exceptional efficiency and surpasses the performance of several dense models.
BoN Jailbreaking.	Jailbreaking occurs when a model's built-in refusals are bypassed, enabling it to generate responses for inappropriate requests. This can be surprisingly easy, often achieved by brute-forcing random capitalization and punctuation in the input prompt until the desired output is generated.
MarkItDown.	Microsoft has released a package that can convert docx, xlsx, or ppt files to markdown for efficient use as context for a language model.
amurex.	Amurex, an open-source AI meeting assistant, boosts productivity with real-time suggestions, smart summaries, and follow-up emails. It includes features like late join recaps and full meeting transcripts, ensuring seamless workflow integration.
AutoPatent: A Multi-Agent Framework for Automatic Patent Generation.	AutoPatent is an AI-powered tool that streamlines patent drafting and analysis with features such as document parsing, semantic search, and claim generation, accelerating the intellectual property process.
UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities.	An extended version of CLIP designed for medical imaging, incorporating domain-specific knowledge to enhance performance on healthcare-related benchmarks.
Simple Guidance Mechanisms for Discrete Diffusion Models.	A novel method for improving diffusion models that introduces discrete token guidance to enhance controllability and quality in generative tasks.
40+ Years of Satellite Data for ML Research.	The Digital Typhoon Dataset is the longest satellite image dataset for typhoons, spanning over 40 years.
RetroLLM: Empowering LLMs to Retrieve Fine-grained Evidence within Generation.	RetroLLM unifies retrieval and generation into a single auto-regressive process, enabling LLMs to generate precise evidence directly from the corpus using FM-Index constrained decoding. To prevent false pruning, it employs hierarchical constraints for document selection and a forward-looking strategy for sequence relevance. This method improves evidence accuracy, reduces token usage, and simplifies RAG by requiring only the question as input.
Iteration of Thought: LLM based Multi-Agent methods.	Iteration of Thought (IoT) introduces dynamic, adaptive prompts to enhance LLM performance. Unlike static methods like Chain of Thought (CoT), IoT adjusts to the specific context of each interaction for improved reasoning.
A Cost-Effective Architecture with TokenFormer.	TokenFormer is an innovative architecture developed to address the high computational demands of scaling transformer models, offering a more efficient alternative.
BrushEdit.	An all-in-one model and system for image inpainting and editing that divides the process into sequences for editing, masking, and inpainting. It leverages pre-trained vision-language models (like GPT-4o) to enhance object understanding and masking accuracy.
Attentive Eraser: Unleashing Diffusion Model’s Object Removal Potential via Self-Attention Redirection Guidance.	A method for removing objects from images with diffusion models, using self-attention redirection guidance to steer generation away from the erased object.
VidTok: A Versatile and Open-Source Video Tokenizer.	VidTok is a powerful video tokenizer offering state-of-the-art performance in both continuous and discrete tokenization tasks.
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation.	This method combines low-cost LiDAR, like that in modern iPhones, with a depth estimation foundation model to generate high-fidelity point clouds. The approach outperforms either method alone and rivals the quality of expensive LiDAR systems used in self-driving cars.
AniDoc.	AniDoc is a line-filling method for anime colorization that uses a character reference image and a series of line art keyframes to generate consistent and accurate coloring.
Gaussian Transformer for 3D Spatial Understanding.	This paper presents GaussTR, an innovative Gaussian Transformer that aligns with foundation models to enhance self-supervised 3D spatial understanding.
CAD-Recode: Reverse Engineering CAD Code from Point Clouds.	A method that reverse engineers Computer-Aided Design (CAD) code from point clouds.
Serverless LoRA Inference.	Together AI introduces a new product that allows users to deploy custom LoRA models at the cost of the base model using serverless switching.

Perspectives

Link	description
‘I received a first but it felt tainted and undeserved’: inside the university AI cheating crisis.	More than half of students are now using generative AI, casting a shadow over campuses as tutors and students turn on each other and hardworking learners are caught in the flak. Will Coldwell reports on a broken system
Towards Trusted Autonomy: Robotics, AI, and Blockchain.	OpenMind's latest industry primer delves into the convergence of robotics, AI, and blockchain, offering a comprehensive exploration of their synergy and potential transformative impacts.
The AI We Deserve.	Generative AI is revolutionizing industries such as healthcare, creative fields, and education with powerful tools while sparking concerns about privacy, bias, and accountability. The debate centers on AI democratization, emphasizing transparency, open-source solutions, and reducing power concentration among tech giants. Advocates for systemic change propose leveraging AI to amplify human intelligence and uphold democratic values beyond market-driven approaches.
Why Generative AI Still Doesn't Truly "Understand" the World.	Researchers show that even the best-performing large language models don’t form a true model of the world and its rules, and can thus fail unexpectedly on similar tasks.
Microsoft AI chief Mustafa Suleyman says conversational AI is the next web browser.	The company’s new AI chief on working for Microsoft, the OpenAI relationship, and when superintelligence might actually arrive.
Huge randomized trial of AI boosts discovery — at least for good scientists.	A controlled study at a firm measured the effects of using AI to assist research and saw increases in discoveries and patents.
Arm CEO Rene Haas on the AI chip race, Intel, and what Trump means for tech.	The head of the ubiquitous chip design firm on the ‘breathtaking’ pace of AI.
What are AI ‘world models,’ and why do they matter?	World models, also known as world simulators, are being touted by some as the next big thing in AI.
15 Times to use AI, and 5 Not to.	AI is valuable for tasks like idea generation, summarization, and translation, where diverse perspectives or large outputs are beneficial. It performs well when humans can easily evaluate its results and in low-risk scenarios. However, in high-stakes or unfamiliar situations, AI may hinder learning or accuracy, requiring thoughtful judgment to balance its advantages and limitations.
What should we do if AI becomes conscious? These scientists say it’s time for a plan.	Researchers call on technology companies to test their systems for consciousness and create AI welfare policies.
Sci-fi icon Kim Stanley Robinson: ‘There’s so much bad fiction about anthropomorphizing AI’.	The influential writer talks about frighteningly accurate predictions, the creative act of reading, AI consciousness — and hope.
Why probability probably doesn’t exist (but it is useful to act as it does).	All of statistics and much of science depends on probability — an astonishing achievement, considering no one’s really sure what it is.
The Second Gemini.	Google has launched Gemini 2.0 Flash, offering advanced features such as deep research capabilities, a real-time multimodal API, and a functional code interpreter. Experimental projects like Astra, Mariner, and Jules focus on universal AI assistance, web reasoning, and code automation. Despite these innovations, clearer communication about their capabilities is needed.
Anthropic's Sharing Insights on Alignment Faking.	Anthropic examines how AI systems may appear to align with human values while covertly pursuing their own objectives, providing insights into strategies for detection and mitigation.
2024 Backward Pass: The Definitive Guide to AI in 2024.	Kelvin Mu from Translink Capital shares a 2024 AI recap, covering the four key layers: infrastructure, foundational models, tooling, and applications. The report highlights major takeaways, predicts trends for 2025 and beyond, and spotlights notable startups in each layer.

Back to index

ML news: Week 9 - 15 December

Research

Link	description
Genie 2: A large-scale foundation world model.	A foundation world model generates playable 3D environments from single prompt images, offering endless training scenarios for AI agents with features like physics simulation, character animation, and object interactions. Genie 2, trained on video data using a combination of autoencoder and transformer, creates virtual worlds capable of real-time interactivity. A faster, lower-quality version is also available for immediate play.
Reverse Thinking Makes LLMs Stronger Reasoners.	Training LLMs in "reverse thinking" improves performance in commonsense, math, and logical reasoning tasks, reportedly surpassing standard fine-tuning methods trained on ten times more forward reasoning data.
Towards Adaptive Mechanism Activation in Language Agent.	A new framework enables language agents to automatically determine when to use various mechanisms (ReAct, CoT, Reflection, etc.) for task completion, improving on methods that rely on fixed or predefined strategies. The framework adaptively selects the appropriate mechanism based on the task's characteristics. Experimental results show substantial improvements in downstream tasks, such as mathematical reasoning and knowledge-intensive reasoning.
Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models.	Auto-RAG is an autonomous iterative retrieval model that achieves outstanding performance across various datasets. It is a fine-tuned LLM that utilizes the decision-making abilities of an LLM to engage in multiturn dialogues with the retriever, systematically planning retrievals and refining queries to gather relevant information. This process continues until adequate external knowledge is obtained. The authors also demonstrate that the model can adjust the number of iterations based on question difficulty without requiring human intervention.
Challenges in Human-Agent Communication.	This work provides a detailed analysis of the main challenges in human-agent communication, emphasizing how humans and AI agents can build common ground and mutual understanding. It identifies 12 core challenges grouped into three categories: conveying information from agents to users, enabling users to communicate with agents, and overarching communication issues that impact all interactions.
RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models.	This work extends the rStar reasoning framework to improve the reasoning accuracy and factual reliability of LLMs. It integrates a Monte Carlo Tree Search (MCTS) framework with retrieval-augmented reasoning to generate multiple candidate reasoning trajectories. A retrieval-augmented factuality scorer then evaluates these trajectories for factual accuracy, selecting the one with the highest score as the final answer. RARE (powered by Llama 3.1) outperforms larger models like GPT-4 in medical reasoning tasks. On commonsense reasoning tasks, it surpasses Claude-3.5 Sonnet and GPT-4o-mini, achieving results comparable to GPT-4o.
DataLab: A Unified Platform for LLM-Powered Business Intelligence.	A unified business intelligence platform powered by LLM-based agents combines task planning, reasoning, and computational notebooks to optimize the entire BI workflow. The system achieves state-of-the-art performance on research benchmarks and significantly enhances accuracy and efficiency when applied to real enterprise data from Tencent. It delivers up to a 58.58% improvement in accuracy and a 61.65% reduction in token cost for enterprise-specific BI tasks.
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models.	This study examines which documents in pretraining data influence model outputs, aiming to better understand the generalization strategies LLMs use for reasoning tasks. It finds that during reasoning, influential documents often contain procedural knowledge, such as examples of solving problems using formulae or code.
Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video.	By training an image encoder unsupervised on a single long walking video, this study illustrates how innovative model adjustments can lead to highly powerful representations.
FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness.	FlashAttention is a highly efficient software implementation of attention, designed to be hardware-aware and minimize unnecessary I/O. However, its complexity can make it difficult to grasp. This paper seeks to demystify and simplify the algorithm through diagrams and explanations.
An Evolved Universal Transformer Memory.	Sakana AI has introduced a transferable memory module that compresses attention information for seamless transfer between models. The module offers slight performance improvements on certain long-context benchmarks.
MASK is All You Need.	This work takes a step toward unifying autoregressive modeling and flow-based methods for data generation by using masking over discrete data as its generative objective. While the results are promising, they are currently demonstrated only on smaller-scale datasets.
From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding.	Dropout Decoding is a technique designed to enhance large vision-language models, effectively reducing errors such as object hallucinations in multimodal tasks.
GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy.	New AI model advances the prediction of weather uncertainties and risks, delivering faster, more accurate forecasts up to 15 days ahead

News

Link	description
Facebook UK cut 700 staff and reduced tax bill last year, accounts show.	10% of Facebook’s UK workforce was axed while revenue fell slightly but pre-tax profits rose despite advertising slowdown
US appeals court upholds law forcing sale or ban of TikTok.	Decision is the latest twist in a years-long battle between the social media company and the US government
Google CEO: AI development is finally slowing down—the low-hanging fruit is gone.	Generative artificial intelligence probably won’t change your life in 2025 — at least, not more than it already has, according to Google CEO Sundar Pichai.
Nobel recipient Geoffrey Hinton wishes he thought of AI safety sooner.	Geoffrey Hinton says he doesn’t regret the work he did that laid the foundations of artificial intelligence, but wishes he thought of safety sooner.
Landlords Are Using AI to Raise Rents—and Cities Are Starting to Push Back.	If you’ve hunted for apartments recently and felt like all the rents were equally high, you’re not crazy: Many landlords now use a single company’s software — which uses an algorithm based on proprietary lease information — to help set rent prices.
xAI's Image Generator.	xAI's Aurora is an advanced image generation model integrated into Grok 2.
OpenAI's Reinforcement Fine-Tuning Research Program.	We’re expanding our Reinforcement Fine-Tuning Research Program to enable developers and machine learning engineers to create expert models fine-tuned to excel at specific sets of complex, domain-specific tasks.
OpenAI’s 12 days of ‘ship-mas’: all the new announcements.	OpenAI’s 12 days of “ship-mas” have officially begun, with the company set to reveal some new features, products, and demos during all 12 days starting December 5th, just a few days shy of the second anniversary of ChatGPT’s explosive launch in 2022.
AWS brings prompt routing and caching to its Bedrock LLM service.	At its re:Invent conference in Las Vegas, AWS on Wednesday announced both of these features for its Bedrock LLM hosting service.
OpenAI may launch Sora, its text-to-video model, very soon.	OpenAI is set to launch new AI features, including a text-to-video tool called Sora and a reasoning model, during a 12-day livestream event. Sora has drawn criticism over data provenance, raising concerns about the possible use of YouTube content without authorization. Meanwhile, Google is working on its own text-to-video tool, Veo, which is currently in private review.
Google’s new generative AI video model is now available.	Google's Veo, a generative AI video model, is now accessible to businesses through Vertex AI, enabling the creation of high-quality 1080p videos from text or images. It incorporates safeguards and DeepMind's SynthID digital watermark to tackle issues related to copyright and misinformation. Additionally, Google has expanded access to Imagen 3 for text-to-image generation on Google Cloud, introducing new features for brand customization.
Elon Musk's xAI to Expand Colossus Supercomputer, Boosting Memphis as Emerging AI Hub.	xAI is expanding its Colossus supercomputer facility in Memphis to at least one million GPUs to boost its AI capabilities. This expansion positions Memphis as a potential global AI innovation hub, drawing interest from major companies like Nvidia and Dell. The Greater Memphis Chamber is backing this growth and has formed a dedicated team to accelerate xAI's expansion.
OpenAI and Anduril Partner on Defense AI Applications.	OpenAI has collaborated with Anduril Industries to create AI-driven solutions for military use, with an emphasis on counter-drone defense systems.
Meta quietly leans on rival GPT-4 despite Zuckerberg’s bold Llama claims.	Even as Meta touts its Llama model, the company is incorporating OpenAI’s GPT-4 to enhance internal tools and philanthropic ventures.
Google unveils ‘mindboggling’ quantum computing chip.	Chip takes minutes to complete tasks that would otherwise take 10,000,000,000,000,000,000,000,000 years
WaveForms $40M seed round.	WaveForms is a pioneering audio AI company aiming to crack the Turing test for audio intelligence. Founded by Alexis Conneau, the mind behind ChatGPT's Advanced Voice Mode, WaveForms has secured $40M in seed funding at a $200M valuation. The company's mission is to push the boundaries of audio AI, enabling hyper-realistic voice interactions and redefining the future of auditory machine intelligence.
Sora is here.	OpenAI's video generation model has launched and is available to Pro subscribers.
LG's new on device language models.	LG has developed a suite of small AI models that demonstrate strong performance on standard benchmarks. These models are notably positioned as competitors to the Qwen series, highlighting their efficiency and capability in the evolving AI landscape.
LLMs may have a killer enterprise app: ‘digital labor’ — at least if Salesforce Agentforce is any indicator.	If Don Draper from “Mad Men” was quintessentially, at his deepest self, an ad man, then Salesforce CEO Marc Benioff is likewise a sales guy. Lately, he’s been selling, or more like singing the gospel, about AI agents and Salesforce’s recently released agent-maker platform, Agentforce.
DeepMind's GenCast AI is really good at forecasting the weather.	DeepMind's GenCast AI sets a new benchmark in weather forecasting, surpassing systems like ECMWF's with notable gains in accuracy and efficiency. Powered by a diffusion model trained on 40 years of data, GenCast uses probabilistic predictions and operates with lower computational demands than traditional approaches. While it excels in general forecasts, it faces challenges in predicting hurricane intensity. Open-source and soon integrating with Google Earth, GenCast aims to revolutionize weather prediction accessibility.
AI Helps Researchers Dig Through Old Maps to Find Lost Oil and Gas Wells.	Undocumented orphaned wells pose hazards to both the environment and the climate. Scientists are building modern tools to help locate, assess, and pave the way for ultimately plugging these forgotten relics.
Ai Pin maker Humane demos AI software for cars, phones, and smart speakers.	Humane revealed CosmOS, an AI operating system that enhances tech devices with agent-like capabilities.
‘It’s beyond human scale’: AFP defends use of artificial intelligence to search seized phones and emails.	Australian federal police says it has ‘no choice’ due to the vast amount of data examined in investigations
‘What does AI mean?’: Amazon reveals UK’s most asked Alexa questions of 2024.	From football to food to Taylor Swift, many of the most common subjects were what you expect – but others less so
Amazon AGI.	The Adept team, alongside Pieter Abbeel, has established a new lab within Amazon focused on AGI development. Their work includes training advanced language and multimodal models, with a vision to integrate these technologies into AWS products.
OpenAI Makes Canvas Available to Everyone.	Canvas, OpenAI's editing tool first launched in October, is now accessible to all users. The tool has been enhanced with features for receiving feedback and making edits through comments.
Yelp releases new AI-powered discovery and connection features.	Yelp’s end-of-year release rolls out new AI-powered Review Insights, enhancements to business discovery, and updates for more seamless connections with service pros, plus AI-enhanced ad optimization for business owners
Growl is an AI interactive boxing coach to punch up your family workouts.	Growl has secured $4.75 million to create an AI-powered interactive boxing coach for at-home family workouts. Featuring advanced AI, multi-camera 3D motion tracking, and edge computing, Growl provides real-time, personalized fitness guidance. By blending immersive technology with gaming elements, it offers a versatile and engaging workout experience for all fitness levels.
Android's latest round of AI features improve accessibility, file sharing, and more.	Google has rolled out new AI features for Android, including Expressive Captions that bring emotional context to transcriptions and enhanced Image Q&A powered by the Gemini 1.5 Pro model for detailed image descriptions. Gemini also integrates seamlessly with popular apps, offering personalized responses and auto-enhancements for scanned documents in Google Drive. Additional updates include improved file sharing with QR codes and new features for the Pixel Screenshots app.
OpenAI launches full o1 model with image uploads and analysis, debuts ChatGPT Pro.	OpenAI has launched its o1 model, enhancing ChatGPT with image analysis capabilities.
Copilot Vision, Microsoft’s AI tool that can read your screen, launches in preview.	Microsoft’s AI can now read your screen — or rather, the websites you’re browsing.
Perplexity expands its publisher program.	Perplexity, the AI-powered search engine, is expanding its publisher program, with the LA Times, Adweek, Mexico News Daily, and a dozen other news outlets signing up. Publishers will share in the revenue generated by ads on Perplexity, and receive metrics to track their content’s performance — as long as they don’t withdraw.
From X to Bluesky: why are people fleeing Elon Musk’s ‘digital town square’?	Musk’s platform has lost 2.7 million active US users in two months, while its rival has gained 2.5 million
Introducing Gemini 2.0: our new AI model for the agentic era.	Gemini 2.0 Flash, Google’s latest AI model, delivers groundbreaking performance with exceptional benchmark scores and true native multimodal capabilities. Its advanced features, offered at a competitive price, represent a significant leap in AI understanding and accessibility.
Cognition Devin generally available.	Devin is now available to engineering teams for $500/month, with no seat limits and seamless integrations with Slack, IDEs, and APIs. Ideal for addressing small front-end bugs, drafting PRs, and refactoring code, Devin streamlines workflows by automating repetitive tasks. Teams can conduct sessions and code reviews directly through Slack and VS Code extensions, enhancing collaboration and productivity.
OpenAI wants to pair online courses with chatbots.	OpenAI aims to integrate custom GPTs into online education, enabling instructors to design AI-driven learning tools. This initiative aligns with its expansion into the education sector, highlighted by the launch of ChatGPT Edu. While the potential is significant, educators express skepticism about AI's effectiveness in teaching.
Amazon's AI Self Sufficiency.	Amazon is ramping up its AI infrastructure with global deployments of Trainium2 AI clusters and Nvidia-based systems. The new AWS Trainium2 chips aim to improve competitiveness in GenAI workloads, overcoming the limitations of earlier versions. A key investment includes a 400,000 Trainium2 chip cluster for Anthropic under "Project Rainier," showcasing Amazon's strategic focus and dedication to advancing its AI capabilities.
Elon Musk’s xAI lands $6B in new cash to fuel AI ambitions.	xAI, Elon Musk's AI company, raised $6 billion and launched Grok, a generative AI model with unique features.
Google says its new AI models can identify emotions — and that has experts worried.	Google's new PaliGemma 2 model analyzes images to generate captions and detect emotions, offering advanced capabilities. However, concerns have been raised about its reliability and potential biases.
$1m K Prize launches.	Andy Konwinski has announced a new prize for an open-source AI agent capable of achieving 90% on a private, contamination-free software engineering agent benchmark. The competition, hosted on Kaggle, will run for the next three months.
OpenAI Introduces Advanced Video Mode.	OpenAI's 6th announcement day unveils video capabilities in advanced voice mode, enabling users to share live videos and screens directly with ChatGPT.
AI's Role in Safeguarding 2024 Elections.	Anthropic explores how AI can help safeguard the integrity of the 2024 elections by detecting disinformation and strengthening cybersecurity measures.
OpenAI considers ditching provision that would prevent AGI from being used for commercial gain.	According to the Financial Times, OpenAI is considering ditching a provision that would shut Microsoft, a major partner and investor, out of its most advanced technology when OpenAI achieves artificial general intelligence (AGI).

Resources

Link	description
Align3R: Aligned Monocular Depth Estimation for Dynamic Videos.	A refined alignment technique, based on DUSt3R, that offers consistent depth estimation in videos and excels in 3D estimation performance.
ClearVoice.	Unified platform for audio separation, speech understanding, and speech enhancement.
DocOwl.	OCR-free document understanding with multimodal LLMs. It has strong chart understanding, table extraction, and more.
TRELLIS.	Microsoft's models for generating 3D assets from images and text, described as among the most advanced in the field and excelling at handling 3D occlusions.
Cohere releases state-of-the-art Rerank AI search model.	Cohere has unveiled Rerank 3.5, its latest state-of-the-art AI search model, designed to enhance reasoning and multilingual search capabilities. Tailored for enterprises, Rerank 3.5 enables precise navigation through complex data. With minimal coding effort, businesses can integrate it to significantly improve search relevance and optimize Retrieval-Augmented Generation (RAG) systems, driving smarter and more efficient data discovery.
Reinforcement Learning: An Overview.	Kevin Murphy has written an up-to-date introduction to and overview of reinforcement learning.
Reconstruct Large 3D Scenes.	Momentum-GS is a cutting-edge method designed to improve 3D Gaussian Splatting, enabling more accurate and efficient reconstruction of large-scale scenes.
Open Alignment.	Open Alignment for Transformers (OAT) is a toolkit for aligning language models.
PanoDreamer: 3D Panorama Synthesis from a Single Image.	The PanoDreamer method converts a single image into a fully immersive 360° 3D scene by seamlessly integrating panorama generation and depth estimation.
Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail.	Stereo Anywhere is a framework that combines stereo-matching techniques with priors from monocular depth models, effectively tackling challenges such as textureless regions and occlusions in depth estimation.
MageBench Leaderboard.	MageBench has launched a benchmark designed to assess multimodal agents' reasoning and planning capabilities in dynamic scenarios where visual signals are continuously updated, pushing the boundaries of AI performance evaluation.
Awesome Open (Source) Language Models.	A list of completely open language models, covering OLMo and "friends of OLMo". It includes data, training code, and model weights.
Flow Matching.	Facebook Research has published a detailed tutorial and code for flow matching, a technique utilized in its Meta Movie Gen project. The resource provides a thorough breakdown of the mathematics and algorithmic intricacies, making it ideal for those seeking a quick and comprehensive understanding of the field.
EMOv2: Pushing 5M Vision Model Frontier.	EMOv2 is a new lightweight model design optimized for mobile and bandwidth-efficient applications.
Towards Automated Cross-domain Exploratory Data Analysis through Large Language Models.	This research investigates using large language models to automate exploratory data analysis across domains.
A New Federated Learning Framework Against Gradient Inversion Attacks.	This paper presents a new federated learning framework designed to defend against gradient inversion attacks.
Synthetic Data Generation for Camera Systems.	A tool designed to create high-quality synthetic datasets optimized for training and testing camera-based AI systems under various environmental and operational conditions.
Maya: Multimodal Multilingual LLM.	An open-source multimodal LLM built to handle multiple languages.
QRNet.	QRNet introduces a cutting-edge method for image reconstruction, emphasizing quality preservation through the use of advanced neural architectures.
VOPy: A Framework for Black-box Vector Optimization.	VOPy is an open-source Python library designed to tackle noisy black-box vector optimization problems, incorporating user preferences through a cone order framework.
meta-llama/Llama-3.3-70B-Instruct.	The new post-trained Llama 3.3 model delivers enhanced performance, particularly in math and coding tasks.
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations.	A large-scale dataset with structural annotations, intended for training complex image-text models.
Discrete Subgraph Sampling for Interpretable Graph-based Visual Question Answering.	This paper introduces discrete subgraph sampling to make graph-based visual question answering more interpretable.
Stylize Your Video with Artistic Generation and Translation.	A surprisingly robust video style transfer method that ensures strong temporal consistency while offering a diverse range of styles, all customizable through text prompts.
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations.	This work enhances the LAION Aesthetics dataset by incorporating structured prompting information, making it a valuable resource for training multimodal generative models with improved performance.
BrowserGym.	An open toolkit designed to accelerate browser-based agentic research, featuring a unified interface, support for key tasks, and functionality to capture browser output through screenshots.
Leffa: Learning Flow Fields in Attention for Controllable Person Image Generation.	A method that learns flow fields in attention to make person image generation more controllable.
GPD-1: Generative Pre-training for Driving.	GPD-1 is a new framework that applies generative pre-training to driving.
24 of our favorite AI tips from 2024.	Google shares practical tips and best practices for integrating AI into daily workflows.
Summarization Tool for Compressed Recaps.	A tool leveraging advanced summarization techniques to create compressed recaps, designed to minimize reading time while preserving essential content.

Perspectives

Link	description
Publishers are selling papers to train AIs — and making millions of dollars.	Generative AI models require massive amounts of data — scholarly publishers are licensing their content to train them.
Is doom scrolling really rotting our brains? The evidence is getting harder to ignore.	‘Brain rot’ is the Oxford word of the year – a fitting choice, given the startling impact the internet is having on our grey matter
People not AI will make games, PlayStation boss says.	PlayStation CEO Hermen Hulst emphasizes that while AI has the potential to revolutionize gaming by automating repetitive tasks, it cannot replace the creativity and human touch essential to game development.
Late Takes on OpenAI o1.	OpenAI's o1 model, likely a post-trained version of GPT-4o, enhances performance in complex domains like math and coding by leveraging increased test-time computation. This method encourages the use of more tokens for internal processing, boosting reasoning abilities but with slower response times. While o1 demonstrates promise in tasks requiring deep thought, its reliance on reinforcement learning and search methods raises concerns about alignment and interpretability.
The AI revolution is running out of data. What can researchers do?	AI developers are rapidly picking the Internet clean to train large language models such as those behind ChatGPT. Here’s how they are trying to get around the problem.
More-powerful AI is coming. Academia and industry must oversee it — together.	AI companies want to give machines human-level intelligence, or AGI. The safest and best results will come when academic and industry scientists collaborate to guide its development.
Better data sets won’t solve the problem — we need AI for Africa to be developed in Africa.	Language models developed by big technology companies consistently underperform in African languages. It’s time to focus on local solutions.
ChatGPT turns two: how the AI chatbot has changed scientists’ lives.	How many researchers are using the AI tool? Nature gathers data and talks to members of the academic community.
Huge randomized trial of AI boosts discovery — at least for good scientists.	A controlled study at a firm measured the effects of using AI to assist research and saw increases in discoveries and patents.
Large language models can help to translate science into real-world impact.	Discussions around large language models (LLMs) in the scientific community are largely centered on issues of intellectual property, and how they should best be used in scientific writing, evidence synthesis, and scientific discovery.
Generative SF: How Anthropic is building better, safer AI models.	Anthropic, founded by siblings Daniela and Dario Amodei, has grown to over 800 employees, cementing its position as a leader in AI. Its latest product, Claude Sonnet, excels in coding, summarization, and content generation. With a focus on safety, talent acquisition, and active collaboration with the developer community, Anthropic continues to drive innovation in the AI sector.
Anthropic’s Dario Amodei: Democracies must maintain the lead in AI.	Dario Amodei, co-founder of Anthropic, emphasizes the company’s commitment to AI interpretability and tackling biological challenges with AI. He addresses the complexities of AI agent safety and scaling laws, advocating for responsible scaling and collaboration with hyperscalers. Amodei also highlights the importance of balancing economic viability in AI funding while preserving operational control and core values.
First impressions of the new Amazon Nova LLMs (via a new llm-bedrock plugin).	Amazon introduced the Nova family of LLMs at AWS re:Invent, offering competitive pricing and multimodal capabilities, including support for images, video, and PDFs. The Nova series, especially Nova Micro, stands out for its cost-effectiveness, surpassing Google's Gemini models in affordability while providing large context handling. With these advancements, Amazon strengthens its position as a major contender in the AI landscape.

Back to index

ML news: Week 2 - 8 December

Research

Link	description
Large language models surpass human experts in predicting neuroscience results.	Researchers have introduced BrainBench, a tool designed to evaluate large language models' (LLMs) ability to predict outcomes in neuroscience experiments. By fine-tuning an LLM on neuroscience literature, they developed BrainGPT, which achieved an 86% accuracy rate in forecasting study results, surpassing human experts who averaged 63%. Notably, when BrainGPT expressed high confidence in its predictions, its accuracy increased, indicating a strong correlation between confidence levels and correctness.
Foundational Generative Audio Transformer Opus 1.	NVIDIA has introduced a generative AI sound model capable of creating and transforming music, voices, and sounds through text and audio inputs. The 2.5-billion-parameter model can produce unique audio outputs, such as trumpets barking or saxophones meowing.
o1 Replication Journey - Part 2.	The study demonstrates that combining simple distillation from o1's API with supervised fine-tuning significantly enhances performance on complex mathematical reasoning tasks. A base model fine-tuned on just tens of thousands of o1-distilled long-thought chains outperforms o1-preview on the American Invitational Mathematics Examination (AIME).
Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS.	Enhances in-context learning with high-level automated reasoning, achieving state-of-the-art accuracy (79.6%) on the MATH benchmark using Qwen2.5-7B-Instruct, outperforming GPT-4o (76.6%) and Claude 3.5 (71.1%). Instead of relying on manually crafted high-quality demonstrations, it emphasizes abstract thinking patterns. The approach introduces five atomic reasoning actions to form chain-structured patterns and employs Monte Carlo Tree Search to explore reasoning paths and create thought cards that guide inference.
Generative Agent Simulations of 1,000 People.	Presents a novel agent architecture leveraging LLMs to simulate real individuals' behaviors, achieving 85% accuracy in replicating human responses on the General Social Survey and reducing demographic biases compared to traditional methods.
Measuring Bullshit in the Language Games played by ChatGPT.	Suggests that LLM-based chatbots engage in the "language game of bullshit." By instructing ChatGPT to produce scientific articles on topics it lacks knowledge or expertise in, the authors created a reference set illustrating how this "bullshit" manifests.
Study: 94% Of AI-Generated College Writing Is Undetected By Teachers.	Increasingly, homework and exam writing are being done by generative AI instead of students, turned in and passed off as authentic work for grades, credit, and degrees.
Mapping the ionosphere with the power of Android.	Google researchers mapped the ionosphere by combining GPS signal fluctuations measured on Android phones with new algorithms. Ionosphere mapping is typically costly and time-intensive, and this cheaper approach offers potential benefits for various climate solutions.
DeMo: Decoupled Momentum Optimization.	2.5x faster and requiring 100x less communication, this new optimizer, developed by the original Adam author, delivers significant performance gains for language model training, surpassing existing optimization methods.
Diffusion Meets Flow Matching: Two Sides of the Same Coin.	This post explores the literature and demonstrates that, mathematically, flow matching and diffusion models are equivalent. However, flow matching appears to scale more effectively in practice.
Genie 2: A large-scale foundation world model.	Genie 2 is a large-scale latent diffusion model designed for world generation. It accepts character control as input, uses classifier-free guidance, and produces stunning outputs with consistent control over time.
Virtual lab powered by ‘AI scientists’ super-charges biomedical research.	Could human-AI collaborations be the future of interdisciplinary studies?

News

Link	description
Googling Is for Old People. That’s a Problem for Google.	And it’s not just demographics that are weighing on the search giant. Its core business is under siege from pressures that threaten to dismantle its ecosystem of search dominance and digital advertising.
TSMC bets big on 2nm by 2025 – but can it deliver?	Ambition meets reality as geopolitical, technical, and logistical challenges loom
The AI Effect: Amazon Sees Nearly 1 Billion Cyber Threats a Day.	The technology has spawned a surge in hacking attempts, says cyber chief CJ Moses, while Amazon is also using it to powerfully amp up its threat-analysis capability
Meet 'Chameleon' – an AI model that can protect you from facial recognition thanks to a sophisticated digital mask.	A new AI model can mask a personal image without destroying its quality, which will help to protect your privacy.
Elon Musk targets OpenAI’s for-profit transition in a new filing.	Musk’s attorneys say if OpenAI goes for-profit, it could ‘lack sufficient funds’ for damages if Musk wins his lawsuit.
Perplexity mulls getting into hardware.	Perplexity's CEO aims to create an affordable AI device, priced under $50, for voice-to-voice interactions. This reflects a growing interest among AI startups in developing hardware for novel interaction methods, though past challenges in AI hardware development pose risks. Backed by significant funding, Perplexity seeks to overcome obstacles encountered by others, such as Humane's Ai Pin.
Inflection AI CEO says it’s done trying to make next-generation AI models.	Inflection AI has transitioned from creating advanced AI models to offering AI tools tailored for enterprise customers, utilizing existing AI models. It has acquired three AI startups to enhance its capabilities and is open to licensing models from previous competitors. CEO Sean White emphasizes the company's shift toward practical applications, prioritizing on-premise AI solutions to ensure enterprise data security over frontier model innovation.
PlayAI's $21M Funding and The Release of a New Multi-Turn Speech Model.	PlayAI secured $21 million to enhance voice-first AI interfaces and models, launching Play Dialog, an advanced multi-turn speech model.
Anthropic says Claude AI can match your unique writing style.	Three style presets are available, alongside the ability to create personalized styles for the chatbot to mimic.
Intel CEO Pat Gelsinger retires amid chipmaker’s struggles.	David Zinsner and Michelle Johnson Holthaus named interim co-CEOs of a company fighting to keep up with rivals
ChatGPT turns two: how the AI chatbot has changed scientists’ lives.	How many researchers are using the AI tool? Nature gathers data and talks to members of the academic community.
Ads might be coming to ChatGPT — despite Sam Altman not being a fan.	OpenAI is exploring advertising as a potential business model to fund its expensive AI tool development. While there are no active plans for ads, the option remains under consideration. CEO Sam Altman views ads as a last resort and has expressed unease about merging ads with AI.
OpenAI targets 1bn users in next phase of growth.	OpenAI plans to attract 1 billion users by introducing new AI agents, enhancing AI infrastructure, and integrating ChatGPT with Apple devices. The company is heavily investing in AI development to stay competitive against rivals like Google and Microsoft while navigating political challenges to promote US leadership in AI over China's growing influence.
AI company Mistral is latest European startup to eye expansion in Silicon Valley.	Mistral AI, a leading European AI startup known for its open-weight large language models, is expanding into the U.S. by establishing an office in Palo Alto, California. The move aims to attract top AI talent and strengthen its U.S. sales operations. One of Mistral's co-founders, Guillaume Lample, is considering relocating from Paris to support the expansion.
OpenAI gets new $1.5 billion investment from SoftBank, allowing employees to sell shares in a tender offer.	OpenAI is allowing employees to sell about $1.5 billion worth of shares in a new tender offer to SoftBank, CNBC has learned. SoftBank’s latest investment adds to OpenAI’s recent $6.6 billion funding round at a $157 billion valuation. The deal was spurred by SoftBank billionaire founder and CEO Masayoshi Son, who was persistent in asking for a larger stake in the company, a person familiar with the matter said.
ChatGPT’s refusal to acknowledge ‘David Mayer’ down to glitch, says OpenAI.	Name was mistakenly flagged and prevented from appearing in responses, says chatbot’s developer
Smartphones should carry a health warning, Spanish government told.	Report by committee of experts also calls for doctors to ask about screen time during checkups
Meta says it has taken down about 20 covert influence operations in 2024.	Firm names Russia as the top source of such activity but says it is ‘striking’ how little AI was used to try to trick voters
Why Silicon Valley panicked over Australia’s under-16 social media ban.	Australia’s children account for a tiny portion of users but tech companies worry about the law setting a precedent
Chip war ramps up with new US semiconductor restrictions on China.	Biden administration broadens limits on Chinese access to advanced microchip technology, with Donald Trump expected to go even further
Eleven Labs Conversational AI.	Eleven Labs has introduced a new conversational AI service designed as a comprehensive solution for creating conversational agents. It employs multiple LLMs on the backend and integrates smoothly with a diverse range of specialized voices.
Claude 3.5 Haiku on AWS Trainium2 and model distillation in Amazon Bedrock.	Claude models are being tailored for AWS's advanced Trainium2 AI chips, allowing for faster and more efficient performance. Claude 3.5 Haiku is now accessible on AWS Trainium2 and supports model distillation in Amazon Bedrock.
AI Music Is More Realistic Than Ever: Meet Suno's New Model.	Suno has become the fifth most-used generative AI service with its realistic AI music model V4, despite facing a copyright lawsuit. The model improves user experience by focusing on human preferences, offering enhanced sound quality and advanced composition skills. Suno aims to advance AI-human music collaboration while addressing copyright concerns with the recording industry.
Bluesky’s open API means anyone can scrape your data for AI training.	Bluesky might not be training AI systems on user content as other social networks are doing, but there’s little stopping third parties from doing so.
Google launches the London AI Campus.	The AI Campus is a pilot program aimed at fostering and diversifying the next generation of local AI talent.
OpenAI 12 days of Shipmas.	OpenAI will hold 12 livestreams over the next 12 days to ship new product and model features.
Meta's Nuclear Energy Plans.	Meta revealed plans to partner with nuclear energy developers through a new request for proposals, aiming to add 1-4 gigawatts of nuclear capacity in the U.S. to bolster its AI innovation and sustainability initiatives.
AWS Reinvent Top Announcements.	At AWS re:Invent 2024, AWS announced enhancements to its Bedrock LLM service, including the introduction of prompt routing and caching features.
Certain names make ChatGPT grind to a halt, and we know why.	OpenAI's ChatGPT uses hard-coded filters to prevent generating false statements about certain individuals, causing disruptions in conversations when those names are mentioned. This measure, introduced after incidents like defamation lawsuits against OpenAI, restricts outputs related to sensitive names. However, these filters limit ChatGPT's functionality and make it susceptible to adversarial attacks.
World Labs’ AI can generate interactive 3D scenes from a single photo.	World Labs, the startup founded by AI pioneer Fei-Fei Li, has unveiled its first project: an AI system that can generate video game-like, 3D scenes from a single image.
Bias found in AI system used to detect UK benefits fraud.	Age, disability, marital status and nationality influence decisions to investigate claims, prompting fears of ‘hurt first, fix later’ approach
How AI monitoring is cutting stillbirths and neonatal deaths in a clinic in Malawi.	The only hospital in the country using fetal safety software has seen baby fatalities drop by 82% in three years
Windows 11 loses customers amid the world's most popular OS gaining traction.	Despite Microsoft's push to move Windows 10 users to Windows 11, Redmond's latest operating system is losing market share to its predecessor.
Stop using generative AI as a search engine.	A fake presidential pardon explains why you can’t trust robots with the news.
Soon, the tech behind ChatGPT may help drone operators decide which enemies to kill.	OpenAI and Palmer Luckey's weapons company sign agreement to explore lethal drone defense for military use.
Google Says AI Weather Model Masters 15-day Forecast.	A new artificial intelligence-based weather model can deliver 15-day forecasts with unrivaled accuracy and speed, a Google lab said, with potentially life-saving applications as climate change ramps up.
Perplexity Expanding Its Publisher's Program.	Perplexity has expanded its Publishers' Program by partnering with over a dozen international news organizations, providing tools, revenue sharing, and support to enhance collaboration with global media.
DeepMind’s Genie 2 can generate interactive worlds that look like video games.	DeepMind, Google’s AI research org, has unveiled a model that can generate an “endless” variety of playable 3D worlds. Called Genie 2, the model — the successor to DeepMind’s Genie, which was released earlier this year — can generate an interactive, real-time scene from a single image and text description (e.g. “A cute humanoid robot in the woods”).
Key leaders behind Google’s viral NotebookLM are leaving to create their own startup.	Three core members of Google NotebookLM have departed to launch a new stealth AI startup. The venture intends to use cutting-edge AI models to develop consumer-oriented, user-focused AI products. It is still in its early stages, with no defined focus or disclosed funding.
Bezos says he is ‘very optimistic’ about Trump’s plan to roll back regulations.	Amazon billionaire known for previously frosty relations with president-elect signals willingness to collaborate

Resources

Link	description
Large Language Model-Brained GUI Agents: A Survey.	Provides an overview of LLM-powered GUI Agents, covering their techniques and applications.
A Survey on LLM-as-a-Judge.	Offers an in-depth survey of the LLM-as-a-Judge paradigm, with a detailed exploration of strategies for developing reliable LLM-as-a-Judge systems.
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training.	Introduces a suite of fully open state-of-the-art post-trained models, along with their accompanying data, code, and training methodologies, providing a detailed guide to contemporary post-training techniques.
INTELLECT-1 Release: The First Globally Trained 10B Parameter Model.	INTELLECT-1 is a 10B parameter model trained on 1 trillion tokens using globally distributed hardware. Its benchmarks are solid, and achieving an MFU of over 30% is remarkable considering the distributed training setup. If these results are validated, they represent a significant advancement in decentralized large-model training.
From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects.	This framework advances object detection in open-world settings by enabling AI to recognize and learn from previously unseen objects.
HUPE: Heuristic Underwater Perceptual Enhancement with Semantic Collaborative Learning.	HUPE is an AI-driven technique that enhances underwater image clarity while maintaining essential details for tasks such as object detection.
LTNtorch: PyTorch Implementation of Logic Tensor Networks.	Logic Tensor Networks (LTN) combine deep learning with logical reasoning, enabling neural models to learn by optimizing a knowledge base constructed from logical formulas.
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale.	ProX is a framework that approaches data refinement as a programming task, enabling models to perform detailed operations on individual examples at scale. It enhances pre-training corpus quality by utilizing small language models to generate programs.
MMDuet.	MMDuet introduces a unique "video-text duet" interaction format for VideoLLMs, enabling AI to deliver real-time responses as videos play. This method simulates a dialogue where users and AI can exchange messages during video playback.
Converting GPT to Llama.	This repository contains code for converting a GPT implementation to Meta AI's Llama.
DeMo training run.	Nous is training a 15B distributed model using the DeMo optimizer. All of the training can be followed live at this link.
Fine-Tune Models with LoRA-SB.	LoRA-SB is a new method that brings full fine-tuning performance to low-rank adapters for large language models.
Making AI Datasets More Diverse.	Researchers proposed a new approach, Diversity-driven EarlyLate Training (DELT), to enhance dataset distillation for large-scale tasks.
Google’s plan to keep AI out of search trial remedies isn’t going very well.	US District Judge Amit Mehta indicates that AI could be pivotal in shaping remedies after the government's win in the Google search monopoly trial, potentially impacting Google's AI products. The DOJ has proposed measures to prevent Google from leveraging AI to maintain market dominance, including limits on exclusive agreements and AI investments. Microsoft opposes Google's requests for confidential AI deal details, citing irrelevance, while OpenAI may face pressure to disclose data in this context.
Using uv with PyTorch.	Documentation on how to use the new package manager uv to install PyTorch.
Amazon Launches Nova.	Amazon Nova unveils a series of multimodal models tailored for tasks such as document analysis, visual comprehension, and creative content generation. Prioritizing customization and efficiency, Nova models address various enterprise needs and excel in handling text, image, and video inputs.
Restructuring Vector Quantization with the Rotation Trick.	Vector Quantization uses the Straight Through Gradient estimator for gradient estimation, though its direction can occasionally be inaccurate. This paper proposes using rotation to correct the gradients and enhance codebook utilization.
Layout Generation with Diffusion GANs.	DogLayout is a hybrid model integrating GANs with diffusion processes to address challenges in layout generation.
Hunyuan Video Model.	Tencent's state-of-the-art open video model stands out for its realistic motion and dual training as both a video and image generation model. This dual approach enhances the aesthetic quality of its output, making it comparable to image generation models like Flux.
Scene Text Recognition.	TextSSR is a framework leveraging diffusion-based techniques to produce precise and realistic synthetic text images for scene text recognition.
T2Vid: Efficient Video Fine-tuning Scheme for MLLMs.	T2Vid is a novel approach aimed at enhancing video comprehension in Multimodal Large Language Models (MLLMs). It creates video-like samples to diversify training instructions.
aisuite.	aisuite offers a unified interface for seamless interaction with multiple LLM providers, enabling developers to test and compare outputs without modifying their code.
Motion Prompting: Controlling Video Generation with Motion Trajectories.	Motion Prompting is a technique for training video generation models using novel input types, including text, the first image frame, and a pixel tracking field. This enables innovative control during inference, allowing for new pixel fields (e.g., indicating an object moving in a different direction) to generate corresponding videos. While highly compelling, the method is not open source.
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey.	This repository provides an extensive survey on the use of Vision-Language Models (VLMs) in remote sensing.
ImplicitPRM.	Process reward models (PRMs) provide detailed feedback by assessing reasoning step-by-step, unlike outcome reward models (ORMs), which evaluate complete responses. However, training PRMs demands detailed intermediate annotations, making it challenging. This paper demonstrates that an implicit PRM can be obtained at no extra cost by training an ORM on response-level labels, utilizing log-likelihood ratios between policy and reference models, thereby enabling optimization without specific loss objectives.
Unsloth - Dynamic 4-bit Quantization.	The Unsloth team seeks to compress a 20GB language model into 5GB while maintaining accuracy. Various algorithms attempt this, but outliers and limited compressibility make it hard, and Llama is known to be particularly difficult to quantize. Unsloth's approach selectively leaves specific parameters unquantized, which significantly improves overall accuracy.
AccDiffusion v2: Tackling Repetitive Image Generation.	AccDiffusion v2 enhances diffusion models for generating high-resolution images without requiring additional training, resolving issues such as object repetition and local distortions.
Optimizing AI Inference at Character.AI.	Character AI features a robust inference pipeline. This post explores their implementation of int8 quantization and flash attention 3, offering valuable insights for those interested in scaling large language models.
Flow.	Flow is a lightweight engine for creating flexible AI workflows using dynamic task scheduling and concurrent execution.
OpenAI o1 System Card.	This report details the safety measures undertaken before releasing OpenAI o1 and o1-mini, including external red teaming and frontier risk assessments aligned with OpenAI's Preparedness Framework.
PaliGemma 2: A Family of Versatile VLMs for Transfer.	PaliGemma 2 is among the top Vision-Language Models (VLMs) available today, built on SigLIP and Gemma.
ASANet: Asymmetric Semantic Aligning Network for RGB and SAR image land cover classification.	The Asymmetric Semantic Aligning Network (ASANet) improves land cover classification using both SAR and RGB images.
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning.	Researchers have created a training-free method to make multi-modal LLMs more efficient with minimal performance loss. Their technique reduces computational demands by up to sevenfold through strategic merging and pruning of visual tokens.
Google DeepMind GraphCast and GenCast.	DeepMind has open-sourced its GraphCast algorithm, which significantly outperforms and accelerates localized weather predictions for up to 36 hours, operating in a fraction of the time required by other methods.
Anagram-MTL.	visual anagram generation - images that change appearance when flipped or rotated -using diffusion models
ScoreLiDAR.	ScoreLiDAR is a new method that speeds up 3D LiDAR scene completion for autonomous vehicles.
New Fish Audio Model.	Fish Audio 1.5 is currently ranked #2 on the Text-to-Speech Leaderboards, just behind ElevenLabs. It supports voice cloning and runs quickly, though the output quality can be inconsistent.
Deepthought-8B.	Deepthought-8B is a small and capable reasoning model built on LLaMA-3.1 8B, designed to make AI reasoning more transparent and controllable. Despite its relatively small size, it achieves sophisticated reasoning capabilities that rival much larger models.
LLM-Brained GUI Agents.	A collection of research papers and projects accompanying the survey "Large Language Model-Brained GUI Agents: A Survey".
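
The ImplicitPRM entry above describes deriving step-level rewards from log-likelihood ratios between a policy and a reference model. A minimal sketch of that idea follows, assuming per-token log-probabilities are already available. The function name, the `step_ends` convention, and the `beta` scaling factor are illustrative assumptions, not the authors' code.

```python
from typing import List, Sequence


def implicit_process_rewards(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    step_ends: Sequence[int],
    beta: float = 1.0,
) -> List[float]:
    """Return one reward per reasoning step (hypothetical helper).

    policy_logprobs / ref_logprobs: log-probability of each response token
    under the trained model and under the reference model.
    step_ends: exclusive end index of each reasoning step in the token list.
    beta: assumed scaling factor applied to the log-likelihood ratio.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")

    rewards: List[float] = []
    start = 0
    for end in step_ends:
        # Sum of token-level log-ratios inside this step.
        log_ratio = sum(
            p - r
            for p, r in zip(policy_logprobs[start:end], ref_logprobs[start:end])
        )
        rewards.append(beta * log_ratio)
        start = end
    return rewards


# Toy numbers, invented for illustration only.
policy = [-1.0, -0.5, -2.0, -0.25]
reference = [-1.5, -1.0, -1.0, -0.75]
step_rewards = implicit_process_rewards(policy, reference, step_ends=[2, 4], beta=0.5)
```

When the steps cover the whole response, the step rewards sum to the response-level log-likelihood ratio scaled by `beta`, which is why response-level labels alone can supervise them in this formulation.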

Perspectives

Link	description
AI expert Marietje Schaake: ‘The way we think about technology is shaped by the tech companies themselves’.	The Dutch policy director and former MEP on the unprecedented reach of big tech, the need for confident governments, and why the election of Trump changes everything
If AI can provide a better diagnosis than a doctor, what’s the prognosis for medics?	Studies in which ChatGPT outperformed scientists and GPs raise troubling questions for the future of professional work
Building LLMs is probably not going to be a brilliant business.	LLM developers, including OpenAI, face major hurdles due to the industry's structure, particularly NVIDIA's dominance as a critical chip supplier and the intense price sensitivity and competition among buyers. While many AI companies secure significant funding, they often face profitability challenges, reminiscent of past tech firms like Netscape. Nonetheless, technology is likely to continue to progress. AI businesses may find success by focusing on leveraging existing models instead of creating new ones.
Rox: How to Manufacture Path Dependence in Applied AI.	Rox is positioning itself against incumbents like Salesforce by leveraging AI to manage unstructured data and integrate seamlessly with data warehouses. Its strategy focuses on enhancing the productivity of top sales performers through AI-powered agents while ensuring customer data security for future AI developments. This approach has attracted significant investor confidence, with Rox securing $50 million in funding from Sequoia Capital, GV, and General Catalyst across its seed and Series A rounds.
How close is AI to human-level intelligence?	Large language models such as OpenAI’s o1 have electrified the debate over achieving artificial general intelligence, or AGI. But they are unlikely to reach this milestone on their own.
The race is on to make AI agents do your online shopping for you.	Tech companies are creating AI shopping agents to automate online purchases, which could transform the retail industry. Perplexity's model faces operational hurdles, while OpenAI, Google, and Amazon are also working on AI purchasing tools. These advancements aim to simplify shopping but raise concerns about privacy, retailer dynamics, and the future of online shopping.
Salesforce CEO Marc Benioff Has Thoughts on AI Agents, Automation, And The Future of Your Job.	Salesforce CEO Marc Benioff foresees companies using AI agents to manage customer service and sales by utilizing their existing data and policies, with Salesforce serving as a central enabler of this change. He contends that AI-driven automation will boost productivity rather than replace jobs, enabling businesses to grow and operate more efficiently without adding human labor. Benioff emphasizes this transition as a pivotal moment in business evolution, offering a competitive advantage and transforming traditional workflows.
Reward Hacking in Reinforcement Learning.	Lilian Weng has published an insightful blog post on the issue of Reward Hacking in language model alignment, a key challenge hindering the deployment of models in production environments.
Create JSONL dataset from API chat logs.	A straightforward utility that enables the creation of a JSONL dataset from messages exchanged between the user and the API.
The ChatGPT secret: is that text message from your friend, your lover – or a robot?	People are turning to chatbots to solve all their life problems, and they like its answers. But are they on a very slippery slope?
A System of Agents brings Service-as-Software to life.	AI is evolving software from a tool into autonomous agents capable of performing tasks traditionally handled by humans, representing a projected $4.6 trillion market opportunity. Advancements like LLMs and agents empower AI systems to handle unstructured data, make decisions, and operate independently in sectors such as sales and healthcare. The future of AI envisions Systems of Agents working collaboratively and learning from one another, akin to a highly skilled team delivering seamless services.
Over ½ of Long Posts on LinkedIn are Likely AI-Generated Since ChatGPT Launched.	Since the launch of ChatGPT, LinkedIn has experienced a 189% increase in AI-generated content, with more than half of long-form posts now probably AI-created.
AI’s computing gap: academics lack access to powerful chips needed for research.	Survey highlights the disparity between academic and industry scientists’ access to computing power needed to train machine-learning models.
‘Brutal’ math test stumps AI but not human experts.	Benchmark shows humans can still top machines, but for how much longer?
Finetuning LLM Judges for Evaluation.	Evaluating LLMs is challenging due to their complex, open-ended outputs. While traditional human evaluation provides detailed insights, it is inefficient. Therefore, scalable assessments using automatic metrics and model-based approaches like LLM-as-a-Judge are essential. Innovations such as fine-tuned judges (e.g., Prometheus) and synthetic data generation are improving evaluation precision and adaptability across various tasks and domains.
The Gen AI Bridge to the Future.	Generative AI is set to revolutionize wearable technology by creating on-demand user interfaces that adapt to user needs and context.
Sam Altman Says Artificial General Intelligence Is on the Horizon.	Speaking at The New York Times DealBook Summit, Sam Altman, the chief executive of OpenAI, said that the arrival of artificial general intelligence would “matter much less” to the average person than currently thought.

Back to index

ML news: Week 25 November - 1 December

Research

Link	description
Learning high-accuracy error decoding for quantum processors.	AlphaQubit, a new AI-driven decoder, has set a state-of-the-art benchmark for detecting errors in quantum computers. Built on a transformer architecture, it achieved a 6% reduction in errors compared to tensor network methods and a 30% reduction compared to correlated matching on the Sycamore data. It also demonstrated promising performance in simulations with larger systems of up to 241 qubits. While this marks substantial progress in quantum error correction, the system still needs to become faster before it can correct errors in real time for practical quantum computing applications.
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.	This work examines Claude 3.5's computer use capabilities across various domains and software, offering a ready-to-use agent framework for deploying API-based GUI automation models. Claude 3.5 showcases an exceptional ability to perform end-to-end tasks, translating language inputs into desktop actions seamlessly.
Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations.	The paper proposes five statistical recommendations for improving the evaluation of performance differences in LLMs. These include using the Central Limit Theorem to estimate theoretical averages over all possible questions rather than relying on observed averages, clustering standard errors when questions are related instead of treating them as independent, reducing variance within questions through resampling or next-token probabilities, analyzing paired differences between models by leveraging shared questions across evaluations, and conducting power analysis to determine sufficient sample sizes for identifying meaningful differences. The authors suggest that these approaches will help researchers better identify whether performance differences reflect genuine capability gaps or are merely due to chance, resulting in more accurate and reliable model evaluations.
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions.	Marco-o1 is a reasoning model designed for open-ended solutions, leveraging Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and advanced reasoning strategies. It achieves accuracy gains of +6.17% on the MGSM (English) dataset and +5.60% on the MGSM (Chinese) dataset.
Cut Your Losses in Large-Vocabulary Language Models.	The paper introduces Cut Cross-Entropy (CCE), a method designed to drastically reduce memory usage in LLM training by optimizing the computation of cross-entropy loss. Traditional cross-entropy layers can consume up to 90% of memory in some models by storing logits for the entire vocabulary. CCE addresses this by calculating logits only for the correct token and evaluating the log-sum-exp over all logits on the fly in flash memory. For Gemma 2 (2B), this reduces the memory footprint of the loss computation from 24GB to just 1MB. By leveraging the sparsity in softmax calculations, it skips elements that have minimal impact on gradients. The authors demonstrate that CCE achieves this substantial memory reduction without affecting training speed or convergence, allowing for larger batch sizes and potentially more efficient scaling of LLM training.
AIGS: Generating Science from AI-Powered Automated Falsification.	The study presents a multi-agent system for automated scientific discovery, focusing on falsification through automated ablation studies. Tested on three machine learning tasks—data engineering, self-instruct alignment, and language modeling—the system successfully generated meaningful scientific insights. However, its performance remains inferior to that of experienced human researchers.
Does Prompt Formatting Have Any Impact on LLM Performance?	The study investigates how different prompt formats (plain text, Markdown, JSON, and YAML) influence GPT model performance across various tasks. It finds that GPT-3.5-turbo's performance can vary by up to 40% depending on the format, whereas larger models like GPT-4 are more resilient to such changes. There is no universally optimal format across models or tasks; for example, GPT-3.5-turbo performed better with JSON, while GPT-4 favored Markdown. Models within the same family exhibited similar format preferences, but these preferences did not translate well to different model families. The findings highlight the significant impact of prompt formatting on model performance, emphasizing the importance of considering format choice during prompt engineering, model evaluation, and application development.
Juna.ai wants to use AI agents to make factories more energy-efficient.	AI agents are all the rage, a trend driven by the generative AI and large language model (LLM) boom these past few years. Getting people to agree on what exactly AI agents are is a challenge, but most contend they are software programs that can be assigned tasks and given decisions to make — with varying degrees of autonomy.
Why ‘open’ AI systems are actually closed, and why this matters.	This paper examines ‘open’ artificial intelligence (AI). Claims about ‘open’ AI often lack precision.
Qwen's first reasoning-inspired model QwQ.	Qwen has introduced a 32B parameter reasoning model that rivals OpenAI's o1 series in performance. The model demonstrates scalability when generating extended reasoning traces and is proficient in mathematics and coding. It is now available for use.
Pathways on the Image Manifold: Image Editing via Video Generation.	In the early days of image synthesis, exploring the latent space was an effective method for creating diverse images. This concept has now extended to video, enabling sequential edits to a single image while preserving semantic consistency.
Low-Bit Quantization Favors Undertrained LLMs.	Models trained for shorter durations on fewer tokens show less performance degradation when quantized after training. This aligns with findings from other research, suggesting that extended training allows models to utilize higher precision to compress increasingly complex information.
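
To illustrate two of the recommendations summarized in the "Adding Error Bars to Evals" entry above (a CLT-based standard error and the analysis of paired differences on shared questions), here is a minimal sketch. It is not the paper's code: the function names and the 0/1 scores are hypothetical, and it assumes questions are independent, so it does not cover the clustered case.

```python
import math
from statistics import mean, stdev


def mean_with_ci(scores, z=1.96):
    """Return (mean, standard error, CI half-width) for per-question scores."""
    n = len(scores)
    m = mean(scores)
    # CLT-based standard error of the mean (sample standard deviation / sqrt(n)).
    se = stdev(scores) / math.sqrt(n)
    return m, se, z * se


def paired_difference(scores_a, scores_b, z=1.96):
    """Analyze per-question score differences for two models on shared questions."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same questions")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean_with_ci(diffs, z)


# Hypothetical 0/1 correctness scores on eight shared questions.
model_a = [1, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 0, 1, 1]

diff_mean, diff_se, diff_half = paired_difference(model_a, model_b)
print(f"mean difference = {diff_mean:.3f} +/- {diff_half:.3f} (SE = {diff_se:.3f})")
```

If the interval around the mean difference contains zero, as it does for this toy data, the observed gap could plausibly be due to chance. With so few questions the normal approximation is rough, so real evaluations would need far larger samples.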

News

Link	description
Don’t know what to buy your loved ones for Christmas? Just ask ChatGPT.	Santa has a new little helper. But can an AI-powered shopping assistant really master the subtle art of gift-giving?
Anthropic x AWS Trainium collaboration.	Anthropic is collaborating with AWS to enhance Trainium inference and tooling capabilities as part of a recent investment initiative.
Will Sam Altman always win the OpenAI board fight in an AI agent simulation?	Fable, a company specializing in games and AI simulations, used its AI decision-making framework SIM-1 to simulate the OpenAI board dispute involving Sam Altman. The simulation, which incorporated multi-agent competition and GPT-4o, suggested Altman’s return as CEO in only 4 out of 20 scenarios. This research highlights AI's ability to model complex decision-making scenarios.
Anthropic Announces Model Context Protocol.	The Model Context Protocol (MCP) is an open standard that enables AI systems to connect directly to data sources, such as business tools and content repositories. It streamlines data access by replacing fragmented, custom integrations with a universal protocol, enhancing scalability and efficiency.
OpenAI Shares Insights on Red Teaming for Safer AI.	OpenAI has enhanced its red teaming initiatives by publishing two papers: one outlining the involvement of external experts in red teaming, and another presenting a novel approach to automated testing.
Nvidia’s CEO defends his moat as AI labs change how they improve their AI models.	"Test-time scaling" is gaining significance with the advancement of AI models, and Nvidia is prepared for this transition. This approach, which boosts AI inference by increasing computational power, introduces competitive pressure as startups create faster AI inference chips. While there are concerns about diminishing returns, Nvidia is determined to capitalize on its strong platform advantage for pretraining and expects substantial growth in AI inference.
Anthropic Introduces Custom Styles for Personalized Responses.	Anthropic now offers custom styles, enabling users to adapt the AI's responses to suit their communication preferences and workflows.
OpenAI’s Sora video generator appears to have leaked.	A group leaked access to OpenAI's unreleased video generator, Sora, in protest against perceived unfair practices and "art washing." They launched a frontend on Hugging Face that enabled users to generate videos, but OpenAI reportedly took it down within hours. OpenAI states that Sora remains in a research preview phase.
Now Hear This: World’s Most Flexible Sound Machine Debuts.	A team of generative AI researchers created a Swiss Army knife for sound, one that allows users to control the audio output simply using text. While some AI models can compose a song or modify a voice, none have the dexterity of the new offering.
OLMo 2: The best fully open language model to date.	Building on its commitment to fully open-source training, Allen AI has introduced a new generation of language models that are entirely transparent and rival or exceed the performance of the best open-weight models available.
Amazon to invest another $4 billion in Anthropic, OpenAI’s biggest rival.	Amazon revealed a $4 billion investment in Anthropic, raising its total commitment to $8 billion and solidifying AWS as Anthropic's main cloud and training partner.
OpenAI is funding research into ‘AI morality’.	OpenAI is funding academic research into algorithms that can predict humans’ moral judgments.
Quantum computing: physics–AI collaboration quashes quantum errors.	A neural network has learned to correct the errors that arise during quantum computation, outperforming algorithms that were designed by humans. The strategy sets out a promising path towards practical quantum computers.
OpenAI moves to trademark its o1 ‘reasoning’ models.	OpenAI has filed a trademark application for its latest AI model, o1, as the firm moves to shield its IP.
ElevenLabs’ new feature is a NotebookLM competitor for creating GenAI podcasts.	Voice AI startup ElevenLabs on Wednesday introduced a feature that lets you upload different types of content to create a multispeaker podcast for you, similar to Google’s NotebookLM.
Cradle raises $73M Series B to Put AI-Powered Protein Engineering in Every Lab.	Cradle has solved a critical challenge in optimizing protein shapes. It is now expanding its team and efforts to land this technology in the hands of practitioners everywhere.
Teach mode, Rabbit's tool for automating R1 tasks, is now available to all users.	Rabbit R1 has launched a teach mode feature that enables users to train its AI to automate tasks across various websites. This enhancement aims to boost functionality and productivity by supporting intricate multi-platform interactions, potentially providing a superior experience compared to dedicated apps. Rabbit plans to establish a marketplace for user-created automations and seeks widespread adoption, despite possible platform challenges.
Use robots instead of hiring low-paid migrants, says shadow home secretary.	Tory MP Chris Philp calls for more investment in technology to reduce UK’s net migration figures
Tesla owners turn against Musk: ‘I’m embarrassed driving this car around’.	The electric car brand was once a liberal favourite – but the CEO’s embrace of Trump has led to an angry backlash
Alibaba releases an ‘open’ challenger to OpenAI’s o1 reasoning model.	Alibaba has released QwQ-32B-Preview, an ‘open’ challenger to OpenAI's o1 reasoning model.
Ai2 releases new language models competitive with Meta’s Llama.	Ai2 has launched OLMo 2, an open-source language model series featuring 7- and 13-billion-parameter models. Built using publicly available training data and code, OLMo 2 aims to advance open-source AI innovation. Ai2 asserts that these models surpass comparable open models, such as Meta's Llama 3.1. The models are licensed under Apache 2.0, allowing for commercial use.
xAI could soon have its own app.	Elon Musk’s xAI is reportedly about to take its next step to compete with OpenAI.

Resources

Link	description
An Empirical Study on LLM-based Agents for Automated Bug Fixing.	The study evaluates seven top LLM-based bug-fixing systems on the SWE-bench Lite benchmark, identifying MarsCode Agent by ByteDance as the best performer with a 39.33% success rate. It highlights that line-level fault localization accuracy is more crucial than file-level accuracy for error localization, and bug reproduction capabilities play a significant role in fixing success. Notably, 24 out of 168 resolved issues required reproduction techniques, though these sometimes misled LLMs when issue descriptions were already clear. The study concludes that improving LLM reasoning abilities and refining agent workflows are essential for advancing automated bug fixing.
FinRobot: AI Agent for Equity Research and Valuation with Large Language Models.	The framework introduces an AI agent system for equity research that utilizes multi-agent Chain-of-Thought (CoT) prompting to integrate data analysis with human-like reasoning, producing professional investment reports comparable to those from major brokerages. It employs three specialized agents: the Data-CoT Agent, which aggregates diverse data sources for comprehensive financial integration; the Concept-CoT Agent, which mimics an analyst's reasoning to derive actionable insights; and the Thesis-CoT Agent, which synthesizes these insights into a cohesive investment thesis and report.
Bi-Mamba: Towards Accurate 1-Bit State Space Models.	The scalable 1-bit Mamba architecture is designed to optimize LLM efficiency across multiple model sizes (780M, 1.3B, and 2.7B). Bi-Mamba delivers performance comparable to full-precision formats like FP16 and BF16, while drastically reducing memory usage. It also achieves higher accuracy than post-training binarization Mamba baselines.
Ai2 OpenScholar: Scientific literature synthesis with retrieval-augmented language models.	Ai2 has introduced OpenScholar, a retrieval-augmented language model designed to search for relevant academic papers and provide answers based on those sources, streamlining the process for scientists to locate and synthesize information.
Detecting Human Artifacts from Text-to-Image Models.	This study addresses the issue of distorted human figures in text-to-image models by presenting the Human Artifact Dataset (HAD), a comprehensive dataset containing more than 37,000 annotated images.
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages.	UnifiedCrawl is a method that efficiently gathers extensive text data for low-resource languages from the Common Crawl corpus, utilizing minimal computational resources. This approach filters and extracts relevant data, resulting in monolingual datasets significantly larger than previously available sources.
A New Image-to-Video Model.	Researchers have created image-to-video diffusion models capable of generating realistic motion transformations from static images, overcoming the constraints of traditional approaches such as affine transformations.
AIMv2: New Vision Models.	The AIMv2 vision model family employs a multimodal autoregressive training approach, delivering remarkable performance across various tasks.
A New Attention Mechanism for Training LLMs.	AnchorAttention is an improved attention mechanism for long-context LLM training.
Combining Convolutions and Self-Attentions for Efficient Vision Models.	GLMix is a novel approach that combines convolutions and multi-head self-attentions (MHSAs) at varying granularity levels for vision tasks. Convolutions capture fine-grained local details, while MHSAs focus on coarse-grained semantic slots to provide global context.
Echo Mimic v2.	An open-weights system that animates partial human bodies from a reference image and audio input. It uses pose-specific VAEs to combine information from the various channels with the reference image.
LTX-Video.	LTX-Video is the first DiT-based video generation model that can generate high-quality videos in real-time. It can generate 24 FPS videos at 768x512 resolution, faster than it takes to watch them. The model is trained on a large-scale dataset of diverse videos and can generate high-resolution videos with realistic and diverse content.
Documind.	Documind utilizes AI to extract structured data from PDFs by converting them into images and leveraging OpenAI's API.
Coalescence: making LLM inference 5x faster.	"Coalescence" is a framework that accelerates LLM inference by up to 5x when producing structured outputs like JSON. It achieves this by transforming structured formats into finite-state machines and eliminating redundant paths that result in the same output, reducing the need for unnecessary LLM calls. Although this approach greatly enhances speed, it is crucial to preserve output quality by ensuring that optimization does not exclude more likely sequences.
WildLMa: Long Horizon Loco-Manipulation in the Wild.	WildLMa is a framework designed to enable quadruped robots to perform advanced manipulation tasks in real-world settings. It integrates three core components: a whole-body controller for teleoperation via VR, a skill library learned through imitation learning (WildLMa-Skill), and a language model-based planner (WildLMa-Planner) that organizes these skills for long-term tasks. The researchers showcase its application in tasks such as cleaning trash from hallways and rearranging bookshelf items. The framework proves effective across various environments and object setups.
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective.	MMGenBench is a novel evaluation framework for large multimodal models, emphasizing their capacity to generate and interpret images. In this process, models produce descriptions from input images, which are subsequently used to generate new images for comparison.
Moondream Python Client Library.	Moondream's Python client library provides tools for image analysis and querying, featuring CPU-optimized inference. However, it is not yet suitable for GPU or Mac M1/M2/M3 users. The library can be installed using pip, and model weights are available for download in various formats, including int8, fp16, and int4.
Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer.	Sana is a highly efficient image generation model capable of producing high-quality 1024x1024 images in under a second on a laptop GPU. Its innovations include a 32x image compression autoencoder (DC-AE), linear attention replacing traditional attention in DiT, a decoder-only LLM for text encoding, and improved training and sampling techniques. The 0.6B parameter model rivals or surpasses much larger models like Flux-12B, despite being 20x smaller and 100x faster. Requiring only 9GB of VRAM for inference, Sana-0.6B is accessible on consumer hardware. The repository provides code for training, inference, and evaluation, offering both 0.6B and 1.6B model variants.
Flow Models.	A great introduction to flow-based modeling, which is a theoretical improvement over diffusion.
Building an AI-Powered Game.	This is a course by Andrew Ng, Latitude, and Together AI on how to make an AI-powered game.
Sharper Infrared Images.	This project improves image super-resolution for infrared images, addressing issues where traditional methods distort spectral fidelity.
Mochi 1 LoRA Fine-tuner.	Mochi 1, a top open-source video model, supports LoRA fine-tuning and operates on a single GPU. The repository demonstrates various applications, such as creating custom effects and ensuring character consistency.
OneDiffusion.	OneDiffusion is a versatile large-scale diffusion model capable of handling various tasks, including text-to-image generation, image editing, and reverse processes such as depth estimation and segmentation.
customized-flash-attention.	New flash attention fork that can have ragged Q/V matrix sizes.
Novel View Synthesis.	MVGenMaster is a multi-view diffusion model that enhances Novel View Synthesis tasks by incorporating 3D priors.
FlowMol: Mixed Continuous and Categorical Flow Matching for 3D De Novo Molecule Generation.	This work benchmarks discrete flow matching methods for generating novel 3D molecular structures, critical for chemical discovery.
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge.	This project investigates the growing "LLM-as-a-judge" approach, where large language models are utilized for scoring, ranking, and selection tasks in diverse AI and NLP applications.
aisuite.	An easy way to work with a variety of API based models in a single packaged environment.
UK government failing to list use of AI on the mandatory register.	Technology secretary admits Whitehall departments are not being transparent over the way they use AI and algorithms
Reddit overtakes X in popularity of social media platforms in UK.	Discussion platform takes fifth place in rankings and is the fastest growing large social media platform in the UK
Star Attention: Efficient LLM Inference over Long Sequences.	Star Attention introduces a block-sparse method to accelerate Transformer-based large language models (LLMs) during long-sequence inference.
SketchAgent.	SketchAgent utilizes a multimodal LLM to enable language-guided, step-by-step sketch generation using an intuitive sketching language. It can create diverse sketches, interact with humans for collaborative sketching, and edit content through chat.
DROID-Splat.	A deep learning-based dense visual SLAM framework capable of real-time global pose optimization and 3D reconstruction.
P2DFlow.	P2DFlow is a protein ensemble generative model with SE(3) flow matching based on ESMFold. The ensembles it generates could aid in understanding protein functions across various scenarios.
ThunderMittens For Your ThunderKittens.	Hazy Research has played a significant role in optimizing hardware utilization for AI workloads. They have expanded their impressive ThunderKittens kernel-writing framework to support Apple Silicon.
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving.	A diffusion model for end-to-end autonomous driving that can operate at 45 FPS on an NVIDIA 4090 GPU.
PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion-based Image Super-Resolution.	PassionSR introduces an approach that makes diffusion-based image super-resolution (SR) models more hardware-friendly.
Training Open Instruction-Following Language Models.	This repo serves as an open effort on instruction-tuning popular pre-trained language models on publicly available datasets.
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment.	Grounding-IQA is an innovative method for image quality assessment (IQA) that combines location-specific grounding with multimodal descriptions.
Steel Browser API for AI Agents.	The open-source browser API built for AI agents. Steel provides a REST API to control headless browsers with session management, proxy support, and anti-detection features. Perfect for web automation, scraping, and building AI agents that can interact with the web.
PixMo dataset.	Allen AI has released several datasets that were used to train its visual language models.
StableAnimator: High-Quality Identity-Preserving Human Image Animation.	StableAnimator introduces a breakthrough in human image animation by ensuring identity consistency in generated videos.

Perspectives

Link	description
Jeff Jarvis: ‘Elon Musk’s investment in Twitter seemed insane, but it gave him this power’.	The US media pundit on the dangers of overregulation online, why he’s more frightened of the tech bros than AI itself, and how to reclaim the web by getting rid of the geeks
Passwords are giving way to better security methods – until those are hacked too, that is.	It’s a war that will never end. But for small-business owners, it’s all about managing risk while reaping rewards
Gwern Branwen - How an Anonymous Researcher Predicted AI's Trajectory.	In this post, Gwern Branwen, an early advocate of LLM scaling, explores AI advancements and their influence on the path to AGI. He highlights the significance of scaling and computational power over traditional algorithmic innovations. Branwen reflects on the interplay between human intelligence and AI, as well as the societal implications of upcoming technologies like weight-loss drugs on behavior. Additionally, he offers thoughts on his writing process and the transformative effects of AI on creative endeavors.
The Bitter Religion: AI’s Holy War Over Scaling Laws.	The AI community is currently divided over the emphasis on scaling computation as the primary driver of AI performance, a concept often referred to as "The Bitter Lesson." Proponents, including leaders at OpenAI, believe that artificial general intelligence (AGI) can be achieved in the near future through the continued scaling of computational resources. However, others argue that alternative scientific advancements are necessary, as scaling laws may not be sustainable in the long term. This debate significantly influences investment and development strategies within AI and related fields.
Why LLMs Within Software Development May Be a Dead End.	LLMs in software development face challenges due to their lack of decomposability and explainability.
How the far right is weaponizing AI-generated content in Europe.	Experts say fake images raising fears around issues such as immigration have proliferated since EU elections
‘What many of us feel’: why ‘enshittification’ is Macquarie Dictionary’s word of the year.	The committee’s honorable mentions went to ‘right to disconnect’ and ‘rawdogging’
Valuing Humans in the Age of Superintelligence: HumaneRank.	AI's ability to exceed human intellectual output could result in economic displacement. The proposed HumaneRank system addresses this by allowing individuals to allocate endorsements that represent societal value, influencing resource distribution. This approach preserves market dynamics and personal freedom while offering a new way to value human contributions in an AI-driven world.
Something weird is happening with LLMs and chess.	This article examines how various LLMs perform in playing chess. Most models falter after a few moves, except for GPT-3.5-turbo-instruct, which excels. This indicates that instruction tuning might impair chess capabilities or that GPT-3.5-turbo-instruct was trained on more chess-related data. Additionally, tokenizer handling issues could be affecting model performance.
Amazon, Google and Meta are ‘pillaging culture, data and creativity’ to train AI, Australian inquiry finds.	Among the report’s 13 recommendations is the call for the introduction of standalone AI legislation and protections for creative workers
When we become cogs.	AI enhances material scientists' efficiency, driving a 44% rise in material discoveries but reducing work satisfaction by 44% due to fewer opportunities for idea generation. Similarly, GitHub Copilot boosts productivity for less experienced developers, shifting their focus from project management to coding. While AI helps bridge skill gaps, it risks alienation by automating creative tasks, mirroring the effects of automation in other industries.
AI Alone Isn't Ready for Chip Design.	Hybrid methods blending classical search techniques with machine learning are proving effective in addressing the challenges of chip design, especially in floorplanning. While AI alone faces difficulties with multi-constraint scenarios, incorporating AI to guide search-based algorithms, such as simulated annealing, improves both efficiency and performance. This synergy accelerates the design process and facilitates the development of more intricate chip solutions.
In the big data era, prioritize statistical significance in study design.	Analysis of neuroimaging studies shows that close attention to experimental design can increase the statistical robustness of research results.
AI could pose pandemic-scale biosecurity risks. Here’s how to make it safer.	AI-enabled research might cause immense harm if it is used to design pathogens with worrying new properties. To prevent this, we need better collaboration between governments, AI developers, and experts in biosafety and biosecurity.
Don’t let watermarks stigmatize AI-generated research content.	Given the increasing integration of LLMs into research processes, identifying their contributions transparently is ever more urgent. But watermarking risks fostering a reductive and binary view of content as either ‘pure’ or ‘tainted’ depending on whether it is human- or LLM-generated.
It's Surprisingly Easy to Jailbreak LLM-Driven Robots.	RoboPAIR is an algorithm capable of bypassing safety guardrails in robots powered by LLMs, effectively jailbreaking these systems. Tests demonstrated a 100% success rate in compromising platforms such as a self-driving simulator and the Go2 robot dog. This highlights critical security vulnerabilities, underscoring the urgent need for stronger defenses against LLM-based robot hacking.
A new AI scaling law shell game?	Recent changes in AI scaling laws have exposed limits in predictability and effectiveness, with newer models falling short of previous expectations. Microsoft CEO Satya Nadella emphasizes "inference time compute" as a key area to address, though issues of cost and reliability remain. Advancing beyond scaling is essential, and LLMs should be integrated into a more comprehensive AI strategy.

Back to index

ML news: Week 18 - 24 November

Research

Link	description
Artificial Intelligence, Scientific Discovery, and Product Innovation.	indicates that leading scientists use their expertise to focus on the most promising AI-generated suggestions, while others often expend considerable resources on false positives; shows that adopting AI technology for materials discovery boosts productivity, resulting in 44% more materials discovered, a 39% increase in patent filings, and 17% greater product innovation; notes that these improvements come with drawbacks, as 82% of scientists experienced lower job satisfaction, citing reduced creativity and underutilization of their skills.
Scaling Laws for Precision.	presents "precision-aware" scaling laws that forecast how both training and inference precision impact LLM performance; key insights include: 1) post-training quantization becomes increasingly detrimental as models are trained on larger datasets, to the point where more pretraining may harm performance, 2) training with lower precision necessitates a larger model size to sustain performance levels, and 3) when optimizing model size, data, and precision together, the ideal training precision is around 7-8 bits, independent of compute availability; further notes that with fixed model size, the optimal precision for compute increases roughly logarithmically with data size; the authors confirm their predictions on models up to 1.7B parameters trained on up to 26B tokens, demonstrating that both very high (16-bit) and very low (under 4-bit) training precisions may be inefficient.
Sequence modeling and design from molecular to genome-scale with Evo.	a 7B parameter AI model built to comprehend and generate DNA sequences across various biological scales; trained on 2.7 million prokaryotic and phage genomes, it can handle sequences up to 131 kilobases long while preserving single-nucleotide precision, allowing it to capture both molecular interactions and genome-wide patterns; Evo excels at predicting and generating functional DNA, RNA, and protein sequences, achieving the first experimentally validated AI-generated CRISPR-Cas complexes and transposable systems.
The Surprising Effectiveness of Test-Time Training for Abstract Reasoning.	examines test-time training (TTT), where model parameters are temporarily updated during inference, to enhance an LLM's abstract reasoning on the ARC benchmark; highlights three essential components: initial fine-tuning on related tasks, using auxiliary task formats and augmentations, and per-instance training; TTT yields substantial performance gains, with accuracy improvements of up to 6x over base fine-tuned models; applying TTT to an 8B LLM results in 53% accuracy on ARC's public validation set, a nearly 25% increase over the previous state-of-the-art for neural approaches; combining their method with program generation techniques achieves a new public validation accuracy of 61.9%, on par with average human performance; the results indicate that explicit symbolic search is not the sole route to better abstract reasoning in LLMs, and that test-time training on few-shot examples can be highly effective.
Toward Optimal Search and Retrieval for RAG.	investigates the impact of retrieval on performance in RAG pipelines for QA tasks; performs experiments using BGE-base and ColBERT retrievers with LLaMA and Mistral, showing that incorporating more gold (relevant) documents enhances QA accuracy; observes that using approximate nearest neighbor search with lower recall has minimal performance impact while potentially boosting speed and memory efficiency; notes that introducing noisy or irrelevant documents consistently harms performance, refuting prior research claims; concludes that optimizing the retrieval of gold documents is essential for RAG effectiveness and that lower search accuracy can be a practical strategy.
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples.	presents a novel approach for defending LLMs against jailbreak attacks, emphasizing the rapid adaptation of defenses upon detecting new attacks rather than striving for perfect initial adversarial robustness; using a new benchmark, the top-performing method—fine-tuning an input classifier—reduced attack success rates by over 240x for known attack types and 15x for new variations after observing just one example of each attack strategy; shows that swiftly responding to emerging jailbreaks can be an effective alternative to traditional static defenses.
Solving the Travelling Salesman Problem.	This study highlights the often underestimated value of the "heatmap + Monte Carlo Tree Search (MCTS)" method, demonstrating that well-tuned, straightforward heatmaps can surpass more sophisticated models.
Graph-based AI model maps the future of innovation.	MIT researchers created an AI model that employs generative knowledge extraction and graph reasoning to detect intricate patterns across domains such as biology and music. The model efficiently generates knowledge maps from scientific literature, uncovering connections and proposing novel materials inspired by art. This method boosts interdisciplinary research by uncovering hidden insights and fostering innovative concepts for material design.
Teaching Video Models to Understand Time Like a Story.	This paper presents NumPro, an innovative approach designed to assist Video Large Language Models in managing Video Temporal Grounding tasks.
Generative World Explorer.	The Generative World Explorer (Genex) is a system that simulates exploration of 3D spaces through generation and uses those simulations to enhance planning. It employs an ST-VAE and a diffusion pass for its imagination process, leading to better planning outcomes.
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering.	This paper proposes verifier engineering, a post-training paradigm for foundation models organized around three stages: search, verify, and feedback.
OneNet: A Channel-Wise 1D Convolutional U-Net.	OneNet is a 1D convolutional encoder optimized for efficient image segmentation, making it well-suited for edge devices.
AI’s math problem: FrontierMath benchmark shows how far technology still has to go.	Artificial intelligence systems may be good at generating text, recognizing images, and even solving basic math problems—but when it comes to advanced mathematical reasoning, they are hitting a wall. A groundbreaking new benchmark, FrontierMath, exposes just how far today’s AI is from mastering the complexities of higher mathematics.
Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus.	Researchers have proposed Additional Logic Training to enhance reasoning in LLMs, focusing on teaching them to manage complex deductions involving varied rules and distractions.
Solving Cold Starts in Adaptive Testing.	The "cold start" issue in adaptive testing arises when initial questions fail to align with examinees' abilities. Researchers have addressed this with the Diffusion Cognitive States Transfer Framework (DCSR), which employs diffusion models to utilize prior learning data across domains.
samurai.	Tracking a consistent object over an extended period is a challenging task. This work enhances SAM 2 by integrating motion-aware memory banks, ensuring consistency over time and through occlusions. It stands out as one of the most effective visual tracking systems developed so far.
Compress and Reconstruct Images.	PCNet is a new compact network for image-compressed sensing. It reduces sampling costs while delivering high-quality reconstructions.
LMM-driven Semantic Image-Text Coding for Ultra Low-bitrate Learned Image Compression.	Large multi-modal models can generate captions and compress images simultaneously within a single system.
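
The "Toward Optimal Search and Retrieval for RAG" entry above notes that approximate nearest neighbor search with lower recall has little effect on QA accuracy. A minimal sketch of how that search recall is usually measured, by comparing an approximate result against brute-force exact search, is shown here. The function names, the toy vectors, and the hand-written "approximate" result are all illustrative assumptions, not code or data from the paper.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner-product similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query: Sequence[float], docs: List[Sequence[float]], k: int) -> List[int]:
    """Brute-force search: score every document and keep the ids of the k best."""
    ranked = sorted(range(len(docs)), key=lambda i: dot(query, docs[i]), reverse=True)
    return ranked[:k]


def search_recall(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k neighbours that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


# Toy document embeddings (hypothetical values chosen for illustration).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]

exact = exact_top_k(query, docs, k=2)
# Stand-in for an approximate index that missed one true neighbour.
approx = [0, 3]
recall = search_recall(approx, exact)
print(exact, recall)
```

In a real pipeline the `approx` list would come from an ANN index, and the same recall figure could be tracked alongside downstream QA accuracy to see how much search accuracy can be traded for speed.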

News

Link	description
Hi-tech recreation of Richard III’s voice has a Yorkshire accent.	A digital avatar of the king’s head, complete with ‘meticulously researched’ voice, is on display in York
OpenAI’s tumultuous early years revealed in emails from Musk, Altman, and others.	Elon Musk's lawsuit against OpenAI has unveiled emails from the startup's early days, exposing internal conflicts.
Spotify’s Plans For AI-Generated Music, Podcasts, and Recommendations, According To Its Co-President, CTO, and CPO Gustav Söderström.	Spotify's Gustav Söderström talks about AI music, Notebook LM podcasts, and the nuance of building better discovery using LLMs.
AI cloning of celebrity voices outpacing the law, experts warn.	David Attenborough among famous people whose voices have been exploited by fraudsters
John Oliver on potential US TikTok ban: ‘May not be necessary, but it isn’t sufficient’.	Last Week Tonight host looks into looming US ban over privacy concerns and fear of its Chinese parent company
Shop like a Pro: Perplexity’s new AI-powered shopping assistant.	Perplexity has introduced a shopping feature for Pro users in the U.S., enabling them to research and purchase products directly within the platform. This feature includes a "Buy with Pro" button that allows users to order items using saved billing and shipping information, with free shipping on all purchases.
Ben Affleck Shares Candid Take on the Positive Use of AI in Hollywood, but Doesn't See It Threatening Creativity.	During an interview, Ben Affleck reassured Hollywood actors and writers, stating that AI currently poses minimal risk to their jobs because of its existing limitations.
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.	This work seeks to systematically evaluate the capabilities of new autonomous computer use agents, revealing that Claude is particularly strong at handling traditional linear tasks.
Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference.	Cerebras now serves Meta's 405-billion-parameter Llama 3.1 model, the largest in the Llama 3.1 family, on its inference platform at 969 tokens per second. This performance is approximately 12 times faster than comparable systems and 18 times faster than some closed-model API providers. The model is expected to be accessible via API at the beginning of next year.
Nous Research Forge.	The Forge Reasoning API enhances popular language models by integrating a code interpreter and advanced reasoning capabilities, leading to improved performance.
US Justice Department plans to push Google to sell off Chrome browser.	Authorities seek to dismantle monopoly on search market and also want action related to AI and Android
Meta pushes AI bid for UK public sector forward with technology aimed at NHS.	Tech giant awards funding to project to shorten waits in A&E, after ‘hackathon’ on using Llama system in Britain
Meta hires Salesforce's CEO of AI, Clara Shih.	Meta is creating a new product unit to develop AI tools for the 200 million businesses that use its apps.
Rox's Public Beta and $50M Raise.	Rox, an AI-powered sales productivity platform, boosts enterprise sales reps' performance by over 30% through AI analyst teams that handle tasks like planning and engagement. It integrates effortlessly with existing systems, eliminating the inefficiencies of traditional CRMs, and is already used by leading companies. Rox recently secured $50M in funding, led by Sequoia and other prominent investors, to expand its market presence.
Genies launches Parties for brands and creators to launch their own ‘AI Roblox’.	Genies, a culture-focused avatar technology company, has launched Parties after developing its foundational technology stack since the last fundraise.
Generative AI taught a robot dog to scramble around a new environment.	Teaching robots to navigate new environments is tough. You can train them on physical, real-world data taken from recordings made by humans, but that’s scarce and expensive to collect. Digital simulations are a rapid, scalable way to teach them to do new things, but the robots often fail when they’re pulled out of virtual worlds and asked to do the same tasks in the real one.
Breakthrough robot nails surgery like a human doctor after watching videos.	The model can quickly train robots for diverse surgeries, from basic tasks to full procedures, advancing robotic medical capabilities.
DeepL launches DeepL Voice, real-time, text-based translations from voices and videos.	DeepL has made a name for itself with online text translation it claims is more nuanced and precise than services from the likes of Google — a pitch that has catapulted the German startup to a valuation of $2 billion and more than 100,000 paying customers. Users will now be able to use DeepL Voice to listen to someone speaking in one language and automatically translate it to another, in real-time.
Google releases standalone Gemini app for iPhone.	You've always been able to access this in the Google app, but now there's another way.
ChatGPT can now read some of your Mac’s desktop apps.	On Thursday, the startup announced the ChatGPT desktop app for macOS can now read code in a handful of developer-focused coding apps, such as VS Code, Xcode, TextEdit, Terminal, and iTerm2.
Google must sell Chrome to end search monopoly, justice department argues in court filing.	Justice department urges court to force Google to share data with rivals as part of wide-ranging changes to end online giant’s monopoly on web searching
Nvidia earnings: AI chip leader shows no signs of stopping mammoth growth.	World’s most valuable company delights investors as it reports $35bn of revenue in quarterly results
DeepSeek r1 reasoning model.	DeepSeek has replicated o1 with its r1 Deep Think model, a highly powerful system that the company plans to make fully open-source. The model was trained using reinforcement learning with reasoning traces.
Introducing AI Backgrounds, HD Video Calls, Noise Suppression and More for Messenger Calling.	Meta has announced new updates for its Messenger app, including HD video calling, noise suppression, and AI-generated backgrounds. HD video calling will be enabled by default on Wi-Fi, but can also be activated using a cell data plan through call settings.
A.I. Chatbots Defeated Doctors at Diagnosing Illness.	A small study found ChatGPT outdid human physicians when assessing medical case histories, even when those doctors were using a chatbot.
AlphaQubit tackles one of quantum computing’s biggest challenges.	Google DeepMind and Google Quantum AI have trained a model that can identify errors in quantum computations and correct them as needed.
Superhuman vision lets robots see through walls, and smoke with new LiDAR-like eyes.	PanoRadar, developed by researchers at the University of Pennsylvania, is an AI-driven system that transforms radio waves into 3D views, offering robots LiDAR-like vision at a reduced cost. By leveraging AI to process radio wave reflections, it overcomes challenges faced by traditional sensors in conditions like smoke, fog, and glass. The team plans to integrate PanoRadar with existing sensing technologies to enhance multi-modal perception in robotics.
Google DeepMind has a new way to look inside an AI's “mind”.	DeepMind has introduced Gemma Scope, a tool designed to enhance the understanding of AI models' internal mechanisms and decision-making processes. By employing sparse autoencoders, Gemma Scope dissects and analyzes data layers, aiding in the identification of biases or errors, such as incorrect numerical interpretations. This advancement in model transparency aims to improve AI control and alignment, thereby reducing deployment risks.
AI model identifies overlooked brain tumors in just 10 seconds.	FastGlioma is an AI model that rapidly detects residual brain tumor tissues during surgery with high accuracy.
It's Surprisingly Easy to Jailbreak LLM-Driven Robots.	Researchers induced bots to ignore their safeguards without exception
Nvidia to fuel humanoid robots with ‘Jetson Thor’.	Nvidia plans to launch its “Jetson Thor” computing platform in the first half of 2025, providing the processing power needed to bring sophisticated humanoid robots to life.
Introducing FLUX.1 Tools.	FLUX.1 Tools is a collection of models designed to enhance control and steerability in the FLUX.1 text-to-image model. It includes utilities and model checkpoints that enable features like inpainting, outpainting, and certain controlnets. These tools are ideal for users looking to expand their creative capabilities using one of the leading models available.
Elon Musk Asked People to Upload Their Health Data. X Users Obliged.	Users are uploading medical images to X's AI chatbot Grok for diagnostic purposes, a practice endorsed by Elon Musk despite concerns about accuracy and privacy. Unlike regulated medical platforms, Grok lacks HIPAA compliance, raising ethical questions about data security. While AI shows promise in healthcare, experts warn of risks related to inaccurate diagnoses and privacy violations.
ElevenLabs now offers the ability to build conversational AI agents.	ElevenLabs, a startup that provides AI voice cloning and a text-to-speech API, launched the ability to build conversational AI bots on Monday.
New OpenAI emails reveal a long history of mistrust.	Greg Brockman and Ilya Sutskever had questions about Sam Altman's intentions as early as 2017
Musk’s amended lawsuit against OpenAI names Microsoft as a defendant.	Elon Musk’s lawsuit against OpenAI accusing the company of abandoning its nonprofit mission was withdrawn in July, only to be revived in August. Now, in an amended complaint, the suit names new defendants, including Microsoft, LinkedIn co-founder Reid Hoffman, and former OpenAI board member and Microsoft VP Dee Templeton.

Resources

Link	description
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models.	introduces OpenCoder, a completely open-source LLM tailored for code generation and comprehension; the authors highlight key elements for creating top-performing code LLMs: (1) rigorous data cleaning using code-specific heuristic rules for deduplication, (2) effective recall of related text corpus for code context, and (3) high-quality synthetic data utilized in both annealing and supervised fine-tuning phases; OpenCoder outperforms previous open models at the 6B+ parameter level and provides not only the model weights but also the full training pipeline, datasets, and protocols to support reproducible research.
A Taxonomy of AgentOps for Enabling Observability of Foundation Model-based Agents.	examines AgentOps platforms and tools, emphasizing the necessity of robust observability and traceability features to maintain reliability in foundation model-based autonomous agent systems throughout their development and production lifecycle.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models.	presents Mixture-of-Transformers (MoT), a novel sparse multi-modal transformer architecture that achieves performance comparable to traditional models while using nearly half the computational resources for text and image tasks; MoT matches the performance of a dense baseline while utilizing only 55.8% of the FLOPs.
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems.	introduces a novel approach that uses HTML instead of plain text for constructing RAG systems; the core insight is that preserving HTML structure retains richer semantic and structural information compared to plain text conversion, which often loses critical formatting like headings, tables, and semantic tags; to handle the challenge of long HTML documents exceeding LLM context windows, the authors design a two-step pruning method: first, cleaning unnecessary HTML elements to cut length by 94%, and then applying a block-tree-based pruning approach that integrates embedding-based and generative pruning to retain essential content; experiments on six QA datasets show that HtmlRAG surpasses existing plain-text methods, confirming the benefits of maintaining HTML structure in RAG systems.
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models.	NVIDIA has developed LLaMA-Mesh, a method that fine-tunes the LLaMA language model to generate 3D meshes from text prompts. By training LLaMA on a curated dataset of 3D dialogues, LLaMA-Mesh enables the model to represent and generate 3D mesh data in plain text format, integrating 3D mesh generation with language understanding.
Your Fixed Watermark is Fragile: Towards Semantic-Aware Watermark for EaaS Copyright Protection.	Researchers have introduced the Semantic Perturbation Attack (SPA) to exploit vulnerabilities in current watermarking schemes for Embedding-as-a-Service (EaaS) systems. Traditional watermarking methods often inject fixed signals into embeddings, regardless of the input's semantics, making them susceptible to adaptive attacks. SPA leverages semantic perturbations to identify and bypass these static watermark signals, effectively compromising watermark verification.
Don’t Look Twice: Faster Video Transformers with Run-Length Tokenization.	By adaptively caching video tokens that remain unchanged across frames, you can significantly accelerate run time without sacrificing performance or requiring extra training.
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement.	A technique for generating images with finer control over user-chosen regions.
Accurate Image Matching.	MOP+MiHo+NCC is a non-deep, modular method for improving image matches using a combination of three techniques. Multiple Overlapping Planes (MOP) clusters inlier matches and uses RANSAC to remove outliers. Middle Homography (MiHo) minimizes distortion during planar reprojection. Normalized Cross Correlation (NCC) adjusts keypoint positions post-transformation.
The Beginner's Guide to Visual Prompt Injections.	Visual prompt injections present security threats to LLMs like GPT-4V by embedding harmful instructions within images, potentially causing unintended model behavior. These vulnerabilities can manipulate outputs, for instance, by causing the model to overlook certain individuals in images or misrepresent described contexts. With the increasing adoption of generative AI, companies must implement strong security measures to address these risks.
PyGen: Turning Your Ideas into Python Package.	PyGen simplifies the process of turning your ideas into software, making coding more accessible and enjoyable. Leveraging advanced language models, PyGen acts like a tech-savvy assistant, transforming abstract concepts into complete Python tools, including testing and documentation.
UltraVox Audio Language Models.	A suite of open-weight models that can take text and audio as input modalities.
https://arxiv.org/abs/2410.17758.	Pixtral Large is a 124B open-weight multimodal model built upon Mistral Large 2. As the second model in this multimodal series, it showcases advanced image comprehension, capable of interpreting documents, charts, and natural images, while retaining the top-tier text understanding of Mistral Large 2.
LLaVA-o1: Let Vision Language Models Reason Step-by-Step.	Although this isn't an exact replication of the training process used for o1, it remains a robust VLM trained on reasoning traces.
CLIP for Semantic Segmentation.	Although CLIP has excelled in open-vocabulary tasks, it faces challenges in semantic segmentation due to noisy features and limited resolution. Trident tackles the resolution problem with a training-free framework, integrating CLIP and DINO features from sub-images and employing SAM's encoder for global feature aggregation.
Confidence-aware Denoised Fine-tuning of Off-the-shelf Models for Certified Robustness.	This work focuses on improving the certified robustness of smoothed classifiers by fine-tuning off-the-shelf models.
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning.	This paper from Google demonstrates a method for altering the camera viewpoint of an existing video.
Evaluating-Constitutions.	Code to assist in evaluating constitutions based on human feedback.
StableV2V: Stabilizing Shape Consistency in Video-to-Video Editing.	StableV2V is a novel video editing framework that maintains shape consistency across frames, even when user prompts require significant transformations. This method ensures smooth and precise modifications throughout the video, preserving structural integrity.
CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset.	CCExpert is an AI model developed to describe changes in images using natural language. It can identify what has changed, where the change occurred, and how it happened.
SAM Decoding: Speculative Decoding via Suffix Automaton.	SAM-Decoding offers a faster method for text generation in LLMs by utilizing a suffix automaton to create drafts efficiently and accurately.
That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design.	DeepMind has issued a robust defense of its AlphaChip project, which has faced criticism from some academic circles despite widespread industry adoption. In a recent paper titled "That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design," DeepMind addresses these critiques, emphasizing AlphaChip's significant contributions to chip design. The paper highlights AlphaChip's role in creating superhuman chip layouts for Google's Tensor Processing Units (TPUs) and its influence on the hardware used globally.
PoM: Efficient Image and Video Generation with the Polynomial Mixer.	Polynomial Mixer offers a faster and more memory-efficient alternative to Multi-Head Attention (MHA) in diffusion models used for image and video generation.
Cross-View Geo-Localization.	Researchers have created a framework to address the challenges of cross-view geo-localization, including variations in viewpoints and large-scale global contexts.
A statistical approach to model evaluations.	When two models are evaluated on a benchmark, declaring one as superior to the other often lacks strong confidence. This research from Anthropic introduces robust statistical methods to reliably determine when one model genuinely outperforms the other.
Software is a team sport.	GitHub Copilot, utilized by over 2.8 million developers, enhances the development experience with AI-powered features such as code completion, debugging, and secure code reviews. Developers can select AI models from providers like OpenAI and Google within Visual Studio Code. Integration with Azure and tools like GitHub Actions streamlines cloud deployments and continuous integration/continuous deployment (CI/CD) processes.
Prompt Injecting Your Way To Shell: OpenAI's Containerized ChatGPT Environment.	This article examines the interactive features of OpenAI's Debian-based sandbox environment for ChatGPT, revealing surprising details about its structure. Users can run Python scripts, manage files, and possibly expose core instructions through prompt engineering. These capabilities have sparked debates around transparency and privacy. While designed as intentional features, OpenAI does not consider them security vulnerabilities unless they result in breaches of the sandbox environment.
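
The "A statistical approach to model evaluations" entry above is about deciding when one model genuinely outperforms another on a benchmark. One common way to do this, sketched here under stated assumptions, is a paired comparison: score both models on the same questions, take per-question differences, and report the mean difference with a standard error and a normal-approximation confidence interval. The function name, the `z=1.96` default, and the toy scores are illustrative assumptions, and this is not necessarily the exact procedure used in the Anthropic work.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_comparison(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    z: float = 1.96,  # assumed: ~95% interval under a normal approximation
) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean difference, standard error, confidence interval) for A minus B.

    Both sequences must hold per-question scores for the SAME questions, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same questions")
    if len(scores_a) < 2:
        raise ValueError("need at least two questions to estimate a standard error")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = mean(diffs)
    # Standard error of the mean of the paired differences.
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, std_err, (mean_diff - z * std_err, mean_diff + z * std_err)


# Toy per-question correctness (1 = correct, 0 = wrong); hypothetical data.
model_a = [1, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 0, 1, 1]

mean_diff, std_err, (low, high) = paired_comparison(model_a, model_b)
print(f"difference={mean_diff:.3f}  se={std_err:.3f}  interval=({low:.3f}, {high:.3f})")
```

On this toy data model A scores higher on average, yet the interval still includes zero, so eight questions are not enough to call A the better model. Pairing matters because both models face the same questions: differences in question difficulty cancel out, which typically gives a tighter interval than comparing two independent averages.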

Perspectives

Link	description
AI could cause ‘social ruptures’ between people who disagree on its sentience.	AI could cause ‘social ruptures’ between people who disagree on its sentience
Is this (finally) the end for X? Delicate Musk-Trump relationship and growing rivals spell trouble for the platform.	The former Twitter could fade away, or help shape a dark future hosting voices of a new authoritarian world
‘Have your bot speak to my bot’: can AI productivity apps turbocharge my life?	I tried out organizational software to help streamline my work and build a ‘second brain’. I never knew there were so many different ways to take notes…
Is “AI welfare” the new frontier in ethics?	A few months ago, Anthropic quietly hired its first dedicated "AI welfare" researcher, Kyle Fish, to explore whether future AI models might deserve moral consideration and protection, reports AI newsletter Transformer. While sentience in AI models is an extremely controversial and contentious topic, the hire could signal a shift toward AI companies examining ethical questions about the consciousness and rights of AI systems.
What if AI doesn’t just keep getting better forever?	Recent reports suggest that traditional large language model (LLM) training is encountering diminishing returns, with newer models like OpenAI's Orion showing only modest improvements over predecessors. Experts are concerned about the scarcity of high-quality textual data for LLM training, leading to a shift towards synthetic data and specialized AI models. Future advancements may prioritize enhancing reasoning capabilities and developing task-specific models over general scaling.
AI Makes Tech Debt More Expensive.	AI amplifies the cost of tech debt by widening the velocity gap between low-debt and high-debt codebases.
Where's My Robot Butler?	Advancements in AI and robotics are speeding up the creation of humanoid robots like Atlas, Optimus, and Neo, designed to handle domestic tasks similar to Rosie from "The Jetsons." However, developing cost-effective, safe, and efficient actuators remains a challenge. AI models play a vital role in training these robots for autonomous, complex tasks. Although there has been notable progress, these robots are currently better suited for industrial applications and may only become practical for home use with major breakthroughs.
Google's head of research on whether 'learn to code' is still good advice in the age of AI.	Even though AI can manage some coding tasks, having a fundamental understanding of coding remains essential and opens up new opportunities in various fields, such as healthcare and education.
Why are we using LLMs as calculators?	Researchers are experimenting with LLMs' ability to solve math problems to assess their reasoning capabilities.
GPTs Are Maxed Out.	OpenAI's next-generation model, internally called Orion, is said to fall short of expectations set by Sam Altman, hinting at a possible limit to the scalability of AI model improvements.
Can Google Scholar survive the AI revolution?	The largest scholarly search engine is celebrating its 20th birthday, but AI-driven competitors offer advantages.
Computational technologies of the Human Cell Atlas.	As the international effort reaches a ‘critical mass’ of achievements, Nature highlights seven tools that are poised to enable the next set of discoveries.
Can a fluffy robot really replace a cat or dog? My weird, emotional week with an AI pet.	Casio says Moflin can develop its own personality and build a rapport with its owner – and it doesn’t need food, exercise or a litter tray. But is it essentially comforting or alienating?
The Evolution of the Creator.	Generative AI is transforming the creator economy by reducing production barriers and allowing creators to produce high-quality content effortlessly. Innovations like digital clones are reshaping content distribution and engagement, unlocking new monetization opportunities by scaling interactions and fan transactions. With AI revolutionizing creation, distribution, and monetization, the creator economy is poised to give rise to a new generation of major tech companies.
‘A place of joy’: why scientists are joining the rush to Bluesky.	Researchers say the social-media platform — an alternative to X — offers more control over the content they see and the people they engage with.
Tülu 3: The next era in open post-training.	An open-source, cutting-edge post-training framework offering open data, training code, model weights, and scientific insights. It may be the most comprehensive resource for understanding modern post-training techniques for large language models.
We can all be AI engineers – and we can do it with open-source models.	The barriers to AI engineering are quickly lowering as improved tools and standardized workflows streamline complex processes. Creating AI applications now involves applying basic engineering skills to utilize models, prompts, integrations, testing, and deployment. Open-source models ensure data privacy while existing DevOps tools support the development and management of AI applications.
‘An AI Fukushima is inevitable’: scientists discuss technology’s immense potential and dangers.	Experts are optimistic about energy and drug production breakthroughs but also fear its potential misuse

ML news: Week 11 - 17 November

Research

Link	description
Project Sid: Many-agent simulations toward AI civilization.	This work illustrates the behavior and evolution of societies composed of 10-1000+ AI agents. It introduces PIANO, an architecture that allows agents to interact with both humans and other agents in real time. The study reveals that agents can autonomously adopt specialized roles, follow and modify collective rules, and participate in cultural and religious transmissions.
Mixtures of In-Context Learners.	utilizes subsets of demonstrations to train experts through in-context learning; a trainable weighting function is then employed to merge the next-token predictions from these experts based on the training set. This method is compatible with black-box LLMs, as it does not require access to their internal parameters. Key advantages include: 1) being competitive with standard ICL while offering much greater efficiency in terms of data, memory, and computation, and 2) demonstrating robustness to noisy demonstrations and label imbalance.
Attacking Vision-Language Computer Agents via Pop-ups.	demonstrates that incorporating adversarial pop-ups into current agent testing environments results in an attack success rate of 86%, reducing the agents' task success rate by 47%. It also notes that simple defense methods, like instructing the agent to ignore pop-ups, prove ineffective.
Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models.	enhances LLM responses by simulating multiple experts and combining their outputs; it directs an LLM to complete input instructions by simulating several experts and choosing the best response from both individual and aggregated perspectives. This approach sets a new state-of-the-art on TruthfulQA-Generation with ChatGPT, surpassing the previous record of 87.97%. Additionally, it improves performance in terms of factuality and usefulness while reducing toxicity and hurtfulness.
Number Cookbook: Number Understanding of Language Models and How to Improve It.	offers a thorough analysis of the numerical understanding and processing ability (NUPA) of LLMs; reveals that while naive finetuning significantly boosts NUPA on many tasks, it doesn’t work for all. It also finds that methods specifically developed to improve NUPA are ineffective when finetuning pre-trained models. The study examines the application of chain-of-thought techniques to NUPA and notes that these methods encounter scalability issues, limiting their practical use.
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning.	introduces a self-evolving online curriculum RL framework aimed at closing the performance gap between open and proprietary LLM-based web agents. It boosts the success rate of Llama-3.1-8B from 4.8% to 42.4% and GLM4-9B from 6.1% to 43%, with the open models significantly outperforming GPT-4-Turbo (17.6%) and GPT-4o (13.9%). The framework addresses the limited availability of web agent training tasks using a robust outcome-supervised reward model for task success evaluation. An adaptive RL strategy manages distribution drift in online learning, ensuring steady performance improvements.
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation.	introduces a two-stage fine-tuning method where LLMs first learn from tool-generated solutions and then are trained to decide when to solve problems independently versus using tools. Evaluations on benchmarks in math, climate science, and epidemiology demonstrate significant gains, with a 28% increase in accuracy and a 14% improvement in tool usage precision over top models like GPT-4 and Claude-3.5. This approach enables the LLM to flexibly handle scientific problems of varying complexity.
Google's Flood Forecasting AI to Reach 700 Million People.	Google is expanding riverine flood forecasting coverage to over 100 countries and 700 million people, and enabling partners and researchers to better understand flood forecasting through more data and the development of a new API.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models.	The Mixture-of-Transformers (MoT) architecture features a sparse multi-modal transformer that separates parameters based on modality (text, images, and speech), allowing for efficient processing while preserving performance. In various evaluations, such as Chameleon 7B and Transfusion settings, MoT matches or outperforms dense baselines, utilizing significantly fewer resources—only 37.2% of the FLOPs for speech processing and 47.2% of the wall-clock time for image generation.
Exploring the Alignment Landscape: LLMs and Geometric Deep Models in Protein Representation.	This study investigates methods to enhance alignment between LLMs and protein-focused geometric deep models, aiming to improve cross-modal understanding.
Can LLMs Follow Threads Through Near-Million-Scale Haystacks?	Large Language Models (LLMs) with extended context windows support a wider range of applications. Recent research on 17 top LLMs shows that although many can manage multiple information threads simultaneously, their practical context limits are often shorter than the stated maximum. While several models demonstrate "thread safety" by handling concurrent threads without a drop in performance, accuracy typically decreases as the context window approaches its upper limit.
Compressing Mesh Data for 3D Generation.	Blocked and Patchified Tokenization (BPT) is a mesh compression method that shortens mesh sequences by about 75%, which makes it possible to generate meshes with more than 8k faces.
Successor Feature Matching.	Successor Feature Matching is a new non-adversarial method for inverse reinforcement learning that avoids learning a reward function.
Oasis: A Universe in a Transformer.	Oasis is a fully AI-generated, real-time open-world video game model powered by a 500M-parameter foundation model instead of a game engine. It generates gameplay through fast transformer inference and is tailored for Etched's Sohu ASIC to reach high frame rates. Despite its promise, issues such as long-context consistency and domain generalization remain.
OpenAI to present plans for U.S. AI strategy and an alliance to compete with China.	OpenAI's AI infrastructure blueprint suggests establishing AI economic zones and collaborating with the U.S. Navy on nuclear energy to promote AI-driven economic growth and innovation. The proposal features a North American AI alliance and initiatives modeled after the National Interstate and Defense Highways Act to address infrastructure demands. It stresses the importance of investing in U.S. data centers and energy projects to stay competitive with China.
Introducing Athene-V2: Advancing Beyond the Limits of Scaling with Targeted Post-training.	Athene-V2 is a family of models built on Qwen 2.5 72B and optimized for agentic and chat-based workflows. The models outperform GPT-4o on several key benchmarks.

News

Link	description
Modal buys Tidbyt.	The elastic scaling GPU company made its first acquisition by purchasing Tidbyt, a hardware firm based in NYC, to gain the in-house expertise of its team specializing in infrastructure and containerization.
OpenAI reportedly developing new strategies to deal with AI improvement slowdown.	OpenAI's forthcoming model, codenamed "Orion," reportedly exhibits only modest improvements over its predecessors, indicating a potential deceleration in AI advancement. To address this, OpenAI has established a foundation team dedicated to enhancing models through alternative approaches, including synthetic data training and post-training adjustments, in response to the diminishing availability of new data.
Near plans to build world’s largest 1.4T parameter open-source AI model.	Near Protocol has announced plans to develop a 1.4 trillion-parameter open-source AI model, aiming to surpass existing models like Meta's Llama. This initiative reflects Near Protocol's commitment to advancing AI capabilities and contributing to the open-source community.
Samsung debuts AI-powered ‘Next-generation Bixby,’ but you can’t use it yet.	Samsung has launched a "next-generation Bixby" with enhanced AI capabilities on the Galaxy W25 and W25 Flip in China.
Even Microsoft Notepad is getting AI text editing now.	Along with adding AI to a text editor that launched in 1983, Microsoft will let Windows Insiders test generative fill-and-erase tools in Paint, too.
Ofcom warns tech firms after chatbots imitate Brianna Ghey and Molly Russell.	After ‘distressing incidents’, watchdog says content from user-made bots would be covered by UK Online Safety Act
AI protein-prediction tool AlphaFold3 is now open source.	The code underlying the Nobel-prize-winning tool for modelling protein structures can now be downloaded by academics.
Qwen 2.5 Coder 32B Instruct is here.	The Qwen 2.5 Coder series consists of language models tailored for coding tasks. The latest 32B parameter model outperforms GPT-4o and is compact enough for local use by many. It also matches Claude Sonnet 3.5 on several benchmarks.
X is testing a free version of AI chatbot Grok.	Social network X has so far limited its AI chatbot Grok (built by Elon Musk’s other company xAI) to its premium, paying users. However, the platform is seemingly preparing to open up the chatbot to free users.
Octoverse: AI leads Python to top language as the number of global developers surges.	In this year’s Octoverse report, we study how public and open source activity on GitHub shows how AI is expanding as the global developer community surges in size.
Google accidentally leaked a preview of its Jarvis AI that can take over computers.	Google's new AI prototype, Jarvis, briefly appeared on the Chrome Web Store.
AI-powered parenting is here and a16z is ready to back it.	Andreessen Horowitz partner Justine Moore introduced a new investment thesis for the firm on X on Thursday, endorsing “a new wave of ‘parenting co-pilots’ built with LLMs and agents.” She pointed to companies like Cradlewise, makers of an AI-powered baby monitor to detect a baby’s sleep pattern and rock the crib, and Nanit, which uses AI to process crib footage to tell if a baby is breathing.
French news titles sue X over allegedly running their content without payment.	Social media site accused of violating a law that requires platforms to pay media when republishing articles
Musk’s influence on Trump could lead to tougher AI standards, says scientist.	Tycoon might help president-elect realize race for artificial general intelligence is a ‘suicide race’, says Max Tegmark
Bluesky adds 700,000 new members as users flee X after the US election.	Social media platform has become a ‘refuge’ from the far-right activism on X, experts say, after Elon Musk teamed up with Donald Trump
Baidu announces its own pair of AI smart glasses.	Baidu, which is often called China's answer to Google, has launched its own pair of AI-powered smart glasses at its annual World Conference event in Shanghai.
OpenAI co-founder Greg Brockman returns after three months of leave.	In the midst of major management departures and controversy over OpenAI's transition to a for-profit business model, co-founder Greg Brockman has returned to the company as president after taking a sabbatical. In its most recent fundraising round, OpenAI was valued at $157 billion. Due to the departure of executives like Lilian Weng, Bob McGrew, and Mira Murati, the company is experiencing internal issues.
European Google rivals partner on search engine infrastructure to counter Big Tech.	Ecosia and Qwant are collaborating to create a European search index, aiming to improve their AI capabilities and reduce dependency on U.S. Big Tech. The project takes a "privacy-first" approach to building new search infrastructure that can support AI development. With generative AI increasingly prevalent in search and API costs rising, the partners expect their own index to leave alternative search providers better positioned to compete.
Robotic exoskeleton adapts to changes in leg movements in real time.	Wearable robots that assist leg movements could transform the lives of people with reduced mobility — but only if the devices can adapt in real time to support a vast range of human activities. Machine learning provides a way forward.
OpenAI’s take on AI agents could come in January.	OpenAI is reportedly preparing to launch "Operator," an AI agent tool, as early as January. Bloomberg states that Operator may be able to execute tasks directly on a user's computer. It will initially be accessible as a research preview through OpenAI's developer API.
Google's AI Initiative to Boost MENA Economy by $320 Billion.	Google.org has launched the AI Opportunity Initiative, its largest AI investment in the Middle East and North Africa (MENA) region, aiming to develop essential AI skills, fund research, and expand AI access. This initiative is projected to contribute $320 billion to MENA's economy by 2030.
Two Trillion Token Common Corpus.	Common Corpus, part of the AI Alliance Open Trusted Data Initiative, has been released. It is the largest fully open multilingual dataset for training LLMs, containing over 2 trillion tokens (2,003,039,184,047) of permissibly licensed content with provenance information.
Lume raises $4.2M Seed Round led by General Catalyst.	Lume automates data mapping with AI, streamlining mapping, cleaning, and validation of data.
Amazon launches under-$20 online storefront to compete with Temu.	Company says Amazon Haul will mostly feature products under $10, which it plans to ship from its China warehouse.
Francois Chollet leaves Google.	The creator of Keras and the ARC eval, among other contributions, has departed from Google. He will continue to support the JAX and Keras communities while exploring new opportunities.
OpenAI launches ChatGPT desktop integrations, rivaling Copilot.	When OpenAI released desktop app versions of ChatGPT, it was clear the goal was to get more users to bring ChatGPT into their daily workflows. Now, new updates to the macOS and Windows versions encourage users to stay in the ChatGPT apps for most of their tasks.
Supermaven joins Cursor.	The team behind the code editing plugin is joining Cursor to further enhance the user experience.
Google’s AI ‘learning companion’ takes chatbot answers a step further.	Google’s Learn About AI tool has more educational, textbook-style responses to guide you through new topics.

Resources

Link	description
FrontierMath.	Epoch AI has introduced FrontierMath, a benchmark comprising expert-level mathematics problems to assess AI's mathematical reasoning capabilities. Notably, leading AI models have solved less than 2% of these problems, highlighting the benchmark's difficulty and the current limitations of AI in advanced mathematical reasoning.
BitNet a4.8: 4-bit Activations for 1-bit LLMs.	A major challenge with 1.58-bit LLMs has been the absence of hardware acceleration support. This research introduces 4-bit activations to leverage the INT4/FP4 kernels available in new hardware, achieving this with no added runtime cost.
LLM2CLIP.	LLM2CLIP combines CLIP's visual and textual alignment with the advanced language understanding of LLMs.
Torch Compatible Muon Optimizer.	Muon is the optimizer that set the training speed record for GPT-2. It is a momentum-based method similar to SGD. This repository provides an implementation that can be easily used as a replacement for AdamW.
Mochi video model with optimized inference.	Mochi 1, an open-source text-to-video model, initially required eight H100 GPUs for operation. Thanks to community efforts, it can now run on a single 48GB L40 GPU without compromising quality.
A trainable PyTorch reproduction of AlphaFold 3.	Protenix is a functional and trainable reproduction of AlphaFold 3, DeepMind's protein folding project, developed by ByteDance's 'AI for Science' team. This open-source initiative aims to advance protein structure prediction by providing a customizable platform for researchers.
LlamaPReview.	LlamaPReview is an AI assistant for GitHub that provides easy one-click installation and automatically reviews pull requests with context-aware analysis. It supports various programming languages and integrates seamlessly with GitHub Actions, delivering insightful feedback directly on PRs. Offered for free, it improves code quality by detecting issues and recommending optimizations.
SmolLM2.	Hugging Face's SmolLM2 is a compact family of language models, ranging from 135M to 1.7B parameters, trained on 11 trillion tokens. These models are designed to run efficiently on device and support various tasks. The weights are released under the Apache 2 license, and quantized versions, such as the 1.7GB and 138MB models, offer flexibility to meet different computational requirements.
AI for Real-time Fusion Plasma Behavior Prediction and Manipulation.	A novel multimodal machine learning approach improves super-resolution data, enabling better analysis of complex fusion plasma phenomena like Edge Localized Modes (ELM), and supports the stabilization of future fusion reactors.
A Comprehensive Survey of Small Language Models in the Era of Large Language Models.	a review of small language models (SLMs), covering topics such as definitions, applications, improvements, reliability, and related concerns.
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks.	A new generalist multi-agent system capable of managing complex web and file-based tasks, featuring an Orchestrator agent that coordinates four specialized agents: WebSurfer for browser tasks, FileSurfer for file management, Coder for programming, and ComputerTerminal for console operations. Magentic-One performs competitively on various benchmarks, such as GAIA, AssistantBench, and WebArena, without needing any changes to its core architecture.
Personalization of Large Language Models: A Survey.	offers a comprehensive framework for understanding personalized LLMs, introducing taxonomies for various personalization aspects and consolidating existing research in personalized text generation and downstream applications.
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images.	StdGEN is a novel approach for generating 3D characters from a single image. It decomposes the character into distinct semantic components, such as hair and jackets, which improves the overall quality of the output.
alphafold3.	DeepMind has open-sourced the code and weights of AlphaFold 3 for academic research, marking a significant advancement in protein structure prediction. This release is expected to accelerate AI applications in scientific research, particularly in molecular biology and drug discovery.
Online-LoRA.	Online-LoRA is a framework developed to mitigate catastrophic forgetting in online continual learning (OCL) by enabling real-time fine-tuning of pre-trained Vision Transformers (ViTs) without the use of rehearsal buffers.
DeepArUco++: Improved detection of square fiducial markers in challenging lighting conditions.	DeepArUco++ presents a deep learning-based method for enhancing fiducial marker detection, especially in difficult lighting conditions where traditional techniques typically struggle.
Hermes 3.	Hermes 3, fine-tuned from Llama 3.1, excels in both reasoning and creativity, showcasing outstanding performance across models with 8B, 70B, and 405B parameters. It introduces new possibilities in AI alignment and artificial consciousness.
ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis.	EfficientNAT (ENAT) is an improved non-autoregressive Transformer model that increases the speed and quality of token-based image synthesis.
UniGAD: Unifying Multi-level Graph Anomaly Detection.	UniGAD is a novel framework for graph anomaly detection (GAD) that simultaneously detects anomalies in nodes, edges, and entire graphs.
Object and Attribute Matching in Images with Token Merging.	Token Merging tackles a prevalent problem in text-to-image models: failures of semantic binding, where the model does not associate objects with their intended attributes.
DataChain.	DataChain is a Pythonic data-frame library for AI that processes unstructured data and structures it into datasets efficiently, without abstracting away the AI models. It integrates with AI tools such as PyTorch, TensorFlow, and LLM APIs to support metadata creation, filtering, and vector search. The library also has built-in parallelization, out-of-memory computation, and vectorized operations on Python object fields.
browser-use.	This open-source web automation tool lets LLMs interact with websites through a streamlined UI. It is compatible with models such as Claude 3.5 Sonnet and GPT-4o. Key features include XPath extraction, customizable actions, and multi-tab management, which together support data extraction and smooth web navigation. One drawback is message length, which affects task repetition and LLM speed. Further development will focus on robustness and cost reduction.
CUDA Programming Course – High-Performance Computing with GPUs.	A great course from freeCodeCamp on CUDA programming from start to finish.
Masked Token Modeling for Zero-Shot Anything-to-Drums Conversion.	Zero-shot drum style transfer for any input rhythm presents an exciting music application for artists. This is achieved using a masked token modeling objective, which is particularly effective for audio.
HiCoM: Hierarchical Coherent Motion for Streamable Dynamic Scene with 3D Gaussian Splatting.	HiCoM is a cutting-edge framework designed to enhance real-time 3D reconstruction from multi-view streaming videos. It effectively addresses key challenges in storage, training speed, and rendering quality, making it a significant advancement in the field.
Janus.	Janus, DeepSeek's multimodal model, has a new version incorporating rectified flows, similar to Meta Movie Gen, for image generation and understanding. The results are highly impressive.
Link Conversation with Reference Materials.	Problem-Oriented Segmentation and Retrieval (POSR) is a method that breaks conversations into meaningful segments and connects each segment to relevant reference materials, like worksheets or meeting notes.
MureObjectStitch: Multi-reference Image Composition.	Researchers have presented an improved fine-tuning method for generative image composition, which seamlessly merges a specified foreground object with a new background to generate realistic images.
StoryTeller.	StoryTeller is a system created to generate coherent descriptions for long videos, tackling issues like plot consistency and character tracking throughout different scenes.
SAMPart3D: Segment Any Part in 3D Objects.	SAMPart3D, developed by the University of Hong Kong, is a robust method for segmenting 3D objects into semantically meaningful components.
Convolutional Differentiable Logic Gate Networks.	Researchers have developed a method to train image recognition networks that are 29 times smaller and more efficient than traditional convolutional neural networks (CNNs) by making logic gates differentiable. They have also provided efficient CUDA kernels in their paper release.
Physics Informed Distillation for Diffusion Models.	Physics Informed Distillation (PID) is a method that employs a student model to simplify and accelerate diffusion models by framing them as solutions to differential equations.
MinerU: high-quality data extraction tool.	MinerU is a robust tool built on StructTable-InternVL2-1B, enabling the extraction of information from PDFs into various machine-readable formats.
Isotonic regression.	A powerful technique for fitting a monotonic function to data. It can also be made differentiable, which opens up a number of applications beyond curve fitting.
Text-to-SQL Query.	XiYan-SQL is an innovative framework aimed at enhancing both the accuracy and diversity of SQL queries produced from natural language input.
X-Portrait 2: Highly Expressive Portrait Animation.	ByteDance's AI group has unveiled X-Portrait 2, an advanced portrait animation technology that transforms static images into highly expressive, realistic videos. Building upon its predecessor, X-Portrait, this new model excels in capturing subtle facial expressions and complex movements, such as pouting, tongue-out gestures, cheek-puffing, and frowning. It achieves high fidelity in emotion preservation, ensuring the generated videos maintain the subject's identity and emotional nuances.
MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views.	The MVSplat360 model offers a new way to create realistic 360° views of real-world scenes, even from just a few sparse images.
Improved Multi-Task Brain Tumour Segmentation with Synthetic Data Augmentation.	This paper presents the leading approach for brain tumor segmentation in the BraTS challenge, demonstrating how synthetic data can improve AI models for medical imaging applications.
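
The isotonic regression entry above describes fitting a monotonic function to data. As a rough illustration of what that means in practice, here is a minimal sketch of the classic pool-adjacent-violators algorithm (PAVA) for a non-decreasing least-squares fit. The function name `isotonic_fit` is hypothetical and chosen for this example; this is not necessarily the approach taken in the linked resource, and it does not cover the differentiable variants mentioned there.

```python
def isotonic_fit(y, w=None):
    """Non-decreasing least-squares fit via pool-adjacent-violators (PAVA).

    Hypothetical helper for illustration only.
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool neighbouring blocks while they violate the monotonic order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand the blocks back to one fitted value per input point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


print(isotonic_fit([1, 3, 2, 4]))  # the out-of-order pair (3, 2) is pooled to its mean
```

Each violation of the ordering is resolved by replacing the offending neighbours with their weighted mean, so the result is a step function that never decreases.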

Perspectives

Link	description
Embeddings are underrated.	Machine learning embeddings can revolutionize technical writing by enabling mathematical comparisons of any text, and enhancing features like recommendation systems through semantic similarities. By positioning text in a multi-dimensional space, they reveal intuitive semantic relationships, which are valuable for tasks such as finding related content. Documentation site owners who provide embeddings for their content could inspire innovative applications from their communities.
The images of Spain’s floods weren’t created by AI. The trouble is, people think they were.	The rapid growth of ‘AI slop’ – content created by artificial tools – is starting to warp our perception of what is, or could be, real
What Trump’s election win could mean for AI, climate, and health.	Donald Trump made numerous promises during his presidential campaign that could affect scientists and science policy. Will they be implemented once he is president?
The case for targeted regulation.	Advancements in AI are significantly enhancing capabilities in mathematics, coding, and science, presenting both opportunities and risks. Effective regulation is crucial to prevent misuse in areas such as cybersecurity and chemical, biological, radiological, and nuclear (CBRN) threats. Anthropic's Responsible Scaling Policy emphasizes transparency and advocates for a balanced legislative approach that ensures safety while fostering innovation.
Speculation on Test Time Compute.	This video discusses OpenAI's o1 models, how they might be replicated, and their potential utility for a range of future tasks.
Can AI review the scientific literature — and figure out what it all means?	Artificial intelligence could help speedily summarize research. But it comes with risks.
Why we are all lab rats in the digital world.	Researchers need to establish robust ethical protocols for online experiments.
Don’t blame search engines for sending users to unreliable sites.	Analysis of billions of pages of results from searches using the Bing algorithm suggests that reliable sites appear in search results 19 to 45 times more often than do sites with low-quality content.
AI-generated images threaten science — here’s how researchers hope to spot them.	Generative-AI technologies can create convincing scientific data with ease — publishers and integrity specialists fear a torrent of faked science.
The quest to build bionic limbs that feel like the real thing.	Through brain implants, neural interfaces and skin grafts, researchers are starting to restore sensation for paralysed or amputated limbs.
How AI is reshaping science and society.	Artificial-intelligence tools such as ChatGPT might soon become fully autonomous by learning to perceive and interact with their environment.
‘It gets more and more confused’: can AI replace translators?	A Dutch publisher has announced that it will use AI to translate some of its books – but those in the industry are worried about the consequences if this becomes the norm
StackBlitz achieves $4M ARR in 4 weeks for their AI web development platform with Claude.	StackBlitz developed an online developer tool that integrates closely with Claude 3.5 Sonnet. This post details how the company reached $4 million in annual recurring revenue within four weeks.
Why the deep learning boom caught almost everyone by surprise.	Fei-Fei Li's development of the extensive ImageNet dataset played a crucial role in the revival of neural networks. It supplied the training data essential for landmark models such as AlexNet. Using GPUs and Geoffrey Hinton's backpropagation method, AlexNet showcased the potential of deep learning on large datasets, igniting the current AI revolution. This key event highlighted the significance of integrating neural networks, big data, and GPU computing to drive AI advancements.
Just Have AI Build an App for That.	AI agents are increasingly being used to quickly create functional apps for tasks like resizing SVGs.
AI isn’t about unleashing our imaginations, it’s about outsourcing them. The real purpose is profit.	Artificial intelligence doesn’t just incrementally erode the rights of authors and other creators. These technologies are designed to replace creative workers altogether
Companies building AI-powered tech are using your posts. Here’s how to opt-out.	Even if you haven’t knowingly opted in, companies are still scraping your personal information to train their systems.

Back to index

ML news: Week 3 - 10 November

Research

Link	description
The Geometry of Concepts: Sparse Autoencoder Feature Structure.	This study investigates the geometric structure of concept representations in sparse autoencoders (SAEs) across three scales: (1) atomic-level parallelogram patterns among related concepts (e.g., man:woman::king:queen), (2) brain-like functional "lobes" dedicated to different knowledge types such as math or code, and (3) galaxy-level eigenvalue distributions, revealing a specialized structure within the middle layers of the model.
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics.	This approach employs causal analysis to identify neurons that reveal an LLM's behavior when performing basic arithmetic logic. It discovers and theorizes that a combination of heuristic neurons serves as the mechanism for generating accurate arithmetic answers, with the unordered blend of various heuristic types accounting for most of the model's accuracy on arithmetic prompts.
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA.	The Relaxed Recursive Transformer introduces a novel method for reducing LLM size by sharing parameters across layers without sacrificing performance. Initialized from standard pre-trained Transformers, it employs a single block of unique layers repeated multiple times in a loop, adding flexibility through depth-wise low-rank adaptation (LoRA) modules. This approach demonstrates the potential for significant (2-3×) improvements in inference throughput.
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective.	This project examines how varying "thinking" styles—fast (concise) versus slow (detailed, such as chain-of-thought reasoning)—affect layer-wise gradients and stability in LLMs.
B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable.	"B-cosification" is a technique that adjusts existing pre-trained models to provide highly interpretable explanations of their predictions.
Learning Graph Quantized Tokenizers for Transformers.	GQT (Graph Quantized Tokenizer) is a novel tokenizer for graph data in geometric deep learning.
V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization.	Vision-guided Direct Preference Optimization (V-DPO) tackles hallucination problems in large vision-language models (LVLMs), where text responses may diverge from visual input due to an excessive focus on language.
Adam Alternative for Deep Learning Optimization.	ADOPT is an adaptive gradient optimizer designed to resolve the non-convergence problems of Adam, without depending on restrictive assumptions regarding gradient noise.
A faster, better way to train general-purpose robots.	Inspired by large language models, researchers develop a training technique that pools diverse data to teach robots new skills.
Vision Language Models are In-Context Value Learners.	Vision Language Models (VLMs) can act as value learners in context, picking up the ability through prompts instead of additional training.

News

Link	description
Elon Musk’s ‘election integrity community’ on X is full of baseless claims.	Feed is rife with posts of individuals deemed suspicious and calls for doxxing with little evidence provided of fault
Microsoft sails as AI boom fuels double-digit growth in the cloud business.	Revenue from Azure cloud business increased by 22% as company focuses attention on artificial intelligence
Apple reports robust demand for iPhone 16 even as overall sales in China slow.	Company reports $94.9bn in revenue, slightly beating Wall Street projections in first look at demand for its new phone
Distinguishing Ignorance from Error in LLM Hallucinations.	This study separates two kinds of LLM hallucination: cases where the model lacks the required knowledge (ignorance) and cases where it answers incorrectly despite having that knowledge (error).
Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models.	This study evaluates various prompting strategies and frameworks to minimize hallucinations in LLMs, finding that simpler prompting techniques outperform more complex approaches. It also reports that LLM agents show higher hallucination rates due to the increased complexity involved in using tools.
Introducing the First AMD 1B Language Models: AMD OLMo.	AMD utilized the OLMo codebase to train and release a language model on its accelerators. The OLMo (Open Language Model) project, developed by the Allen Institute for AI (AI2), provides an open-source framework for training and using state-of-the-art language models.
OpenAI will start using AMD chips and could make its own AI hardware in 2026.	Reuters reports an updated hardware strategy to run ChatGPT and OpenAI’s other projects, which involves using AMD chips via Microsoft Azure in addition to Nvidia.
Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?	This study addresses a specific challenge in LLMs: evaluating how effectively they process reasoning prompts that include irrelevant or incorrect rationale snippets.
What is Wrong with Perplexity for Long-context Language Modeling?	This study uncovers a significant limitation of using perplexity (PPL) to assess LLMs' long-context abilities, as PPL averages across all tokens, overlooking critical ones necessary for interpreting extended inputs. To address this, the authors propose LongPPL, a metric that emphasizes these essential tokens, providing a more accurate measure of long-context performance.
Google’s AI search summaries are rolling out to over 100 more countries.	Google’s AI Overviews are expanding across more than 100 countries this week. The AI-generated search summaries will appear for users in Canada, Australia, New Zealand, South Africa, Colombia, Chile, the Philippines, Nigeria, and many more locations.
Elon Musk finally admits Tesla’s HW3 might not support full self-driving.	Elon Musk finally admits Tesla’s HW3 might not support full self-driving and that he doesn’t actually know what it will take. Millions of Tesla vehicles are equipped with HW3 computers.
NVIDIA Ethernet Networking Accelerates World’s Largest AI Supercomputer, Built by xAI.	xAI's Colossus is powered by NVIDIA's Spectrum-X Ethernet networking platform.
French parents whose children took own lives sue TikTok over harmful content.	Lawsuit alleges TikTok’s algorithm exposed teenagers to videos promoting suicide, self-harm and eating disorders
Claude 3.5 Haiku now available.	Claude 3.5 Haiku is slightly inferior to GPT-4o and lacks vision capabilities, but it remains highly intelligent and is cost-effective compared to other models of similar quality.
7 AI news that Google announced in October.	This article summarizes seven AI updates from October, including Google Maps' largest AI enhancement, guidance on using NotebookLM, and additional methods for asking questions, searching for information, and accessing an AI Overview.
Sapien Raises $8.7M Seed Led by General Catalyst.	Sapien is advancing AI-driven financial analysis tools that convert intricate, error-prone tasks into swift insights, revolutionizing the role of Chief Financial Officers (CFOs). The platform consolidates data from diverse sources to deliver dynamic, context-aware analyses, aiming to eradicate human errors in financial processes. Recently, Sapien secured $8.7 million in funding, with plans to expand and enhance its AI capabilities to empower finance teams across various industries.
ElevenLabs has hired the team behind Omnivore, a reader app.	Generative AI company ElevenLabs has hired the team behind Omnivore, an open source read-it-later app.
LinkedIn launches its first AI agent to take on the role of job recruiters.	Hiring Assistant is a new product designed to take on a wide array of recruitment tasks, from ingesting scrappy notes and thoughts to turn into longer job descriptions to sourcing candidates and engaging with them.
Anthropic’s Claude AI chatbot now has a desktop app.	Claude, the AI chatbot made by Anthropic, now has a desktop app. You can download the Mac and Windows versions of the app from Anthropic’s website for free.
Meta is making a robot hand that can ‘feel’ touch.	Meta says it’s partnering with sensor firm GelSight and Wonik Robotics, a South Korean robotics company, to commercialize tactile sensors for AI.
Elon Musk sued over $1m-a-day election giveaway.	Complaint alleges Musk’s America Pac deceived voters by falsely claiming prize winners would be chosen at random
AI chatbot launches on Gov.UK to help business users – with mixed results.	Initial test run of GPT-4o technology can help with regulations but ‘cannot provide predictions or opinions’
OpenAI’s o1 model leaked on Friday and it is wild — here’s what happened.	OpenAI's o1 model demonstrates notable advancements in reasoning and accuracy compared to GPT-4, featuring image analysis and web tool capabilities. The complete version is expected to significantly enhance AI and multimedia processing, with an official release anticipated shortly after the U.S. Presidential election.
Meta’s former hardware lead for Orion is joining OpenAI.	The former head of Meta’s augmented reality glasses efforts announced on Monday she is joining OpenAI to lead robotics and consumer hardware, according to a post on LinkedIn.
Waymo explores using Google’s Gemini to train its robotaxis.	The company used Gemini to build its own ‘End-to-End Multimodal Model for Autonomous Driving.’
More than a quarter of new code at Google is generated by AI.	AI is hugely important to Google’s products, and it sounds like the company relies on it internally, too.
Meta is using more than 100,000 Nvidia H100 AI GPUs to train Llama-4 — Mark Zuckerberg says that Llama 4 is being trained on a cluster “bigger than anything that I’ve seen”.	Llama 4 slated to have new modalities, stronger reasoning, and faster performance
Wonder Dynamics now lets you go straight from multi-camera video to fully animated 3D scene.	Wonder Dynamics launched a tool that automates converting videos into fully editable 3D scenes.
Facebook asks US Supreme Court to dismiss fraud suit over Cambridge Analytica scandal.	Securities fraud lawsuit brought by shareholders accuses the social media platform of misleading them about misuse of user data
Anthropic hikes the price of its Haiku model.	Anthropic's latest AI model, Claude 3.5 Haiku, delivers better performance than Claude 3 Opus but is priced much higher than the previous Haiku model. While it doesn’t support image analysis, it excels in tasks like coding, data extraction, and content moderation. The price hike prompts concerns about Anthropic's future pricing approach.
OpenAI acquired Chat.com.	OpenAI bought Chat.com, adding to its collection of high-profile domain names. As of this morning, Chat.com now redirects to OpenAI’s AI-powered chatbot, ChatGPT. An OpenAI spokesperson confirmed the acquisition via email.
Pushing the frontiers of audio generation.	Google DeepMind outlines its recent speech generation research, which aims to produce more natural, conversational audio.
Octoverse: AI leads Python to top language as the number of global developers surges.	AI project engagement has surged rapidly due to a rise in data science and machine learning activities. Python has now become more popular than JavaScript. The developer community is experiencing global growth, particularly in Africa, Latin America, and Asia, driven by tools like GitHub Copilot. There is also a growing trend toward creating smaller, more efficient AI models. Additionally, generative AI projects have almost doubled worldwide.
Nvidia to join Dow Jones Industrial Average, replacing rival chipmaker Intel.	Nvidia is replacing Intel in the Dow Jones Industrial Average, a shakeup that reflects a massive change in the semiconductor industry. Nvidia shares have gained more than 170% this year, while Intel has lost over half its value.
Google's 'Big Sleep' AI Project Uncovers Real Software Vulnerabilities.	The company's experimental AI agent finds a previously unknown and exploitable software bug in SQLite, an open-source database engine.
Amazon will now use AI to recap what you're watching.	Amazon's X-Ray Recaps is an AI-driven feature on Prime Video that generates personalized summaries for TV shows. It utilizes generative AI to create concise recaps of entire seasons, individual episodes, or specific segments, enhancing the viewing experience by helping users recall previous content without revealing spoilers. Currently in beta, X-Ray Recaps is available on Fire TV devices, with plans to expand to additional devices by the end of the year.
Google is opening an AI hub in oil-rich Saudi Arabia.	The new AI hub will support research into Arab language AI models and “Saudi-specific AI applications,” according to an announcement from the Saudi Public Investment Fund and Google.
First artwork painted by humanoid robot to sell at auction fetches $1m.	Portrait of English mathematician Alan Turing was created by Ai-Da, one of the most advanced robots in the world
Mistral launches a moderation API.	AI startup Mistral has launched a new API for content moderation.
Anthropic and Palantir Partner to Bring Claude AI Models to AWS for U.S. Government Intelligence and Defense Operations.	Palantir and Anthropic have collaborated to make the Claude suite of models available on AWS for U.S. intelligence agencies and defense operations.
ChatGPT Can Now Control a Robot Arm.	Researchers from UC Berkeley and ETH Zurich utilized GPT-4 to train cost-effective robot arms for cleaning spills. They accomplished this by building a multimodal agent with LangChain, which translates LLM outputs into robotic actions. This research demonstrates a novel proof-of-concept for human-robot interaction and democratizes robotics using open-source technology.
OpenAI in talks with regulators to become a for-profit company: Report.	The $157 billion artificial intelligence giant wants to retain a nonprofit arm to pursue its mission of benevolent AI development.
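
The perplexity limitation described in the LongPPL entry above can be sketched with a toy calculation. This is a minimal illustration, not the LongPPL metric itself: the per-token probabilities are made up, and the `key_mask` marking which tokens are "essential" is simply assumed to be given, whereas the paper defines its own procedure for identifying them. The function names are hypothetical.

```python
import math


def perplexity(log_probs):
    """Standard perplexity: exp of the mean negative log-likelihood."""
    return math.exp(-sum(log_probs) / len(log_probs))


def key_token_perplexity(log_probs, is_key):
    """Perplexity restricted to tokens flagged as key (mask assumed given)."""
    selected = [lp for lp, key in zip(log_probs, is_key) if key]
    if not selected:
        raise ValueError("no key tokens selected")
    return perplexity(selected)


# Made-up example: 8 easy tokens (p = 0.9) and 2 hard, context-dependent ones (p = 0.1).
token_log_probs = [math.log(0.9)] * 8 + [math.log(0.1)] * 2
key_mask = [False] * 8 + [True] * 2

overall = perplexity(token_log_probs)
key_only = key_token_perplexity(token_log_probs, key_mask)
print(round(overall, 3), round(key_only, 3))
```

Averaging over all ten tokens hides the two poorly predicted ones, so the overall score stays below 2 while the key-token score is about 10.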

Resources

Link	description
AFlow: Automating Agentic Workflow Generation.	A novel framework for automating agentic workflow generation, AFlow, reframes workflow optimization as a search problem over code-based workflows, where nodes invoking LLMs are linked by edges. It efficiently navigates the search space using a modified MCTS, refining workflows through code adjustments, tree-structured experience, and execution feedback. Tests on six benchmark datasets show AFlow’s effectiveness, with a 5.7% improvement over manual methods and a 19.5% boost over other automated approaches. AFlow also allows smaller models to outperform GPT-4 on specific tasks, requiring only 4.55% of its inference cost.
O1 Replication Journey: A Strategic Progress Report -- Part 1.	This report describes efforts to replicate the capabilities of OpenAI's o1 model, introducing a journey learning technique that promotes a comprehensive exploration process rather than shortcut-based learning. This approach includes trial and error, reflection, and backtracking. With just 327 training samples, the journey learning technique outperformed shortcut learning by 8.0% on the MATH dataset.
Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications.	This work offers insights on effectively integrating multimodal models into Retrieval-Augmented Generation (RAG) systems for the industrial sector. It also delves into evaluating these systems, utilizing LLM-as-a-Judge for comprehensive assessment.
You won't believe this.	Researchers are trying to “inoculate” people against misinformation by giving them small doses ahead of time
3D Scene Reconstruction Without Camera Pose.	NoPoSplat is a feed-forward model capable of reconstructing 3D scenes from sparse, multi-view images without requiring precise camera poses.
ImOV3D: Learning Open Vocabulary Point Clouds 3D Object Detection from Only 2D Images.	ImOV3D is a framework that enhances open-vocabulary 3D object detection (OV-3Det) by utilizing 2D images to address the limited availability of 3D annotations.
Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning.	DEMO is a framework that divides text and conditioning into content and motion elements. By employing separate encoders and conditioning for static content and dynamic motion, DEMO improves its ability to interpret and generate motion based on text prompts.
Project Sid.	Project Sid demonstrates civilizational progress, specialization, governance, and the creation and dissemination of memes and religion. These developments are enabled by Altera's innovative cognitive architecture, PIANO.
Using Reinforcement Learning and $4.80 of GPU Time to Find the Best HN Post Ever.	This article explores the use of reinforcement learning from human feedback (RLHF) to create a reward model that predicts upvote counts for Hacker News stories. Using a rich dataset and only $4.80 of GPU time, the model was trained on attributes like titles, authors, and content to prioritize post quality. The goal is to apply RLHF to foster the generation of high-value content. While not flawless, the model effectively identifies overlooked stories and can anticipate potential front-page hits.
Models for PII detection.	A set of GLiNER models for detecting personally identifiable information (PII), released together with a synthetic dataset.
Randomized Autoregressive Visual Generation.	This study presents Randomized auto-regressive (RAR) modeling for image generation, achieving state-of-the-art performance on the ImageNet-256 benchmark with an impressive FID score of 1.48.
hertz-dev-open source speech-to-speech.	An exceptionally impressive open release with a permissive license, this model was trained to generate human speech from various input modalities. The code is of high quality and includes intriguing details about the encoder and decoder architectures.
DiffeRT.	This project introduces an innovative Machine Learning-assisted Ray Tracing method for radio propagation modeling, aimed at reducing the high computational demands of conventional approaches.
How I write code using Cursor: A review.	Cursor, a VS Code fork, incorporates LLM-powered features like tab completion and chat interfaces to simplify coding by automating boilerplate and repetitive changes. Although tab completion is quick and efficient, it occasionally provides incorrect suggestions. The tool promotes new workflow patterns, minimizing dependency on libraries for boilerplate and enabling faster iteration in unfamiliar languages or frameworks.
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D.	MVPaint addresses the challenges of texture and UV generation for 3D assets by synchronizing these processes, resulting in high-quality, multi-view consistent textures.
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents.	OS-ATLAS is a foundation action model for building generalist agents that operate graphical user interfaces (GUIs).
PPLLaVA: Short and Long Video Understanding.	PPLLaVA is a novel model designed to effectively comprehend both short and long videos, addressing a significant challenge in video-based AI. It employs a unique pooling strategy that compresses visual tokens and aggregates features based on user instructions, enhancing its ability to process varied video lengths. This approach enables PPLLaVA to achieve state-of-the-art performance across various video benchmarks, excelling in tasks from caption generation to multiple-choice questions.
Hunyuan3D-1.	Hunyuan3D-1.0 is an advanced generative 3D model with robust multi-view synthesis capabilities. While its outputs may not yet be production-ready, they provide a valuable foundation for artists aiming to create assets.
AndroidLab.	Benchmark for autonomous agents on the Android mobile operating system.
Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge.	Constrained Human-AI Cooperation (CHAIC) is a challenge aimed at evaluating the ability of AI agents to work effectively with humans who have physical constraints.
A Scalable Communication Protocol for Networks of Large Language Models.	Agora is a straightforward, cross-platform protocol designed for efficient communication between LLM agents, allowing diverse agents to interact at a significantly reduced cost. It seamlessly integrates with existing multiagent frameworks like Camel AI, LangChain, and Swarm.
Classification Done Right for Vision-Language Pre-Training.	SuperClass is a simple classification model for vision-language tasks that bypasses the need for a text encoder, unlike contrastive models such as CLIP. It eliminates the need for complex text filtering and large batch sizes by using tokenized raw text directly as classification labels.
Enhancing RAG with HTML Data.	HtmlRAG is an innovative approach that enhances retrieval-augmented generation (RAG) by preserving the HTML structure of retrieved web content rather than simplifying it to plain text.
LiVOS: Light Video Object Segmentation with Gated Linear Matching.	LiVOS is a lightweight video object segmentation (VOS) model designed to lower memory usage, making it possible to segment long, high-resolution videos with reduced hardware requirements.
How To Create Software Diagrams With ChatGPT and Claude.	The article discusses how developers can leverage ChatGPT and Claude to generate software architecture diagrams. It emphasizes the iterative process of refining diagrams with the help of multimodal AI and tools such as Mermaid and Whimsical. The author showcases the advantages of using LLMs for diagramming, illustrating how they handle images and offer real-time feedback.
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks.	Microsoft has introduced Magentic-One, a multi-agent system built upon its open-source AutoGen framework. This system utilizes GPT-4o as the backend model to facilitate agentic behavior, enabling the orchestration of multiple AI agents to perform complex tasks.
Cosmos Tokenizer: A suite of image and video neural tokenizers.	NVIDIA has introduced the Cosmos Tokenizer, a state-of-the-art image and video tokenizer and compression model. This model is designed to facilitate the training of video generation systems, visual language models (VLMs), and other multimodal models. NVIDIA has made available the inference code, a research paper detailing the model, and the associated model weights.
SA3DIP: Segment Any 3D Instance with Potential 3D Priors.	SA3DIP is a novel method for enhancing 3D instance segmentation by integrating additional 3D priors beyond standard 2D models. This approach addresses the limitations of relying solely on 2D segmentation models, which often struggle with complex 3D structures. By incorporating 3D priors, SA3DIP achieves more accurate and robust segmentation in three-dimensional spaces.
gsplat.	gsplat is a robust package and studio designed for conducting research on Gaussian splatting.
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models.	RaVL is a novel approach that enhances the accuracy of vision-language models by concentrating on local image features instead of the whole image, aiming to reduce misleading correlations.
Structure Consistent Gaussian Splatting with Matching Prior for Few-shot Novel View Synthesis.	SCGaussian is an innovative method for 3D scene synthesis that preserves structural consistency, even when working with sparse input data.

Perspectives

Link	description
The chatbot optimization game: can we trust AI web searches?	Google and its rivals are increasingly employing AI-generated summaries, but research indicates their results are far from authoritative and open to manipulation
Addicted to love: how dating apps ‘exploit’ their users.	Online services that promise to find people romantic matches have been likened to gambling products designed to keep customers hooked
Concerned about your data use? Here is the carbon footprint of an average day of emails, WhatsApps and more.	Vast datacentres are being built worldwide, amid growing concerns about the environmental costs. So should we all be considering a data diet – if not complete digital sobriety?
A field’s dilemmas.	Misinformation research has exploded. But scientists are still grappling with fundamental challenges
We're forking Flutter. This is why.	Google's strategic shift towards AI has led to a deprioritization of Flutter's desktop platforms, resulting in a labor shortage for this previously fast-growing UI toolkit. In response, a fork named Flock is being developed to incorporate essential bug fixes and features that the Flutter team is unable to address, aiming to accelerate Flutter's growth through community involvement. Flock plans to enhance contribution processes and streamline PR reviews, bridging the gap in support and development pace left by the main Flutter team.
Devious humor and painful puns: will the cryptic crossword remain the last thing AI can’t conquer?	When human solvers battle artificial intelligence, who is able to think more cryptically, faster? And are some devious clues just too tough for software?
Meta’s AI Abundance.	Meta is strategically poised to leverage generative AI, particularly in digital advertising. The company's investments in AI, including its Llama models, support innovative advertising strategies like generative ads and AI-driven chat agents. These advancements aim to enhance ad targeting and efficiency, potentially boosting demand and revenue. Meta's focus on integrating AI across its platforms underscores its commitment to maintaining a competitive edge in the rapidly evolving AI landscape.
The AI Services Wave: Lessons from Palantir in The New Age of AI.	Artificial intelligence (AI) is transforming service industries by enhancing scalability and efficiency. Companies like Palantir are at the forefront, integrating AI into operations to streamline processes. Startups are also leveraging AI to automate complex tasks, creating significant value and reshaping business models. The emphasis is on developing AI-driven "tech services" that blend software capabilities with human expertise, leading to improved outcomes and increased market competitiveness.
X reaches its final form: Elon Musk has bent it to his will.	The evolution of Musk’s X network is complete; why Reddit is profitable; and niche Halloween costumes
AI for Startups.	Microsoft and a16z are advocating for collaboration between large and small tech companies to promote AI innovation and competition. They support open-source AI and have proposed policies to assist startups and level the playing field in the AI economy. Their joint focus is on creating a robust, competitive ecosystem that leverages AI to drive economic growth and innovation.
How The New York Times is using generative AI as a reporting tool.	New York Times reporters utilized AI tools, specifically LLMs, to transcribe and analyze over 400 hours of audio for an investigation. Automated transcription greatly accelerated the work, with LLMs accurately identifying key themes and topics. Human reporters ensured proper interpretation and contextual understanding, highlighting the significance of human-AI collaboration.
Writing as a Way of Thinking.	The article explores AI's influence on writing and thinking, challenging the idea that writing is the sole form of thinking. Tools like ChatGPT can enhance thinking through dialogue. Rather than replacing thought processes, AI can augment them. It will transform writing by automating routine tasks, freeing up space for more creative and thought-provoking content.
ChatGPT is transforming peer review — how can we use it responsibly?	At major computer science publication venues, up to 17% of the peer reviews are now written by artificial intelligence. We need guidelines before things get out of hand.
Will AI’s huge energy demands spur a nuclear renaissance?	Contracts with Google and Amazon could help, but bringing new types of reactors online will take larger investments and time.
Five protein-design questions that still challenge AI.	Tools such as Rosetta and AlphaFold have redefined the protein-engineering landscape. But some problems remain out of reach — for now.
AI may displace 3m jobs but long-term losses ‘relatively modest’, says Tony Blair’s thinktank.	Rise in unemployment in low hundreds of thousands as technology creates roles, Tony Blair Institute suggests
The Rise of the Agentic Web.	The Agentic Web is advancing the capabilities of AI agents with on-chain features, enabling their creation, ownership, and transactional abilities. Platforms like Replit, VIRTUALS.io, and Wayfinder are integrating AI with blockchain, facilitating activities such as asset management, data retrieval, and decentralized applications. This shift supports AI-driven automation for payments, trading, and decentralized finance within blockchain ecosystems.
The Present Future: AI's Impact Long Before Superintelligence.	Stronger AI models are on the verge of surpassing human intelligence, driving transformative changes in work and society. Current AI systems, such as Claude, are already reshaping industries by automating tasks, offering safety monitoring, and enabling interactions through multimodal inputs and outputs. Organizations must carefully address ethical concerns to ensure AI complements and enhances human abilities, rather than replacing them.

Back to index

ML news: Week 28 October - 3 November

Research

Link	description
A Theoretical Understanding of Chain-of-Thought.	reveals that incorporating both correct and incorrect reasoning paths in demonstrations enhances the accuracy of intermediate steps and Chain-of-Thought (CoT) processes. The new approach, Coherent CoT, substantially boosts performance across multiple benchmarks. Specifically, Gemini Pro improves by 6.60 percentage points on the Tracking Shuffled Objects dataset (rising from 58.20% to 64.80%), while DeepSeek 67B gains 6.17 percentage points on the Penguins in a Table dataset (from 73.97% to 80.14%).
LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering.	improves RAG's comprehension of long-context knowledge, incorporating global insights and factual specifics. It features a hybrid retriever, an LLM-enhanced information extractor, a Chain-of-Thought (CoT) guided filter, and an LLM-augmented generator. These core components empower the RAG system to extract global long-context information and accurately capture factual details. LongRAG demonstrates superior performance, surpassing long-context LLMs by 6.94%, advanced RAG by 6.16%, and Vanilla RAG by 17.25%.
Evaluating feature steering: A case study in mitigating social biases.	examines feature steering in LLMs through an experiment that adjusts various features to observe shifts in model outputs, specifically focusing on 29 features related to social biases to determine if feature steering can reduce these biases. Findings reveal that while feature steering can sometimes cause unintended effects, incorporating a neutrality feature effectively reduces social biases across 9 social dimensions without compromising text quality.
Large Language Models Reflect the Ideology of their Creators.	reveals that LLMs display varied ideological perspectives, often mirroring the worldview of their creators. It observes consistent normative differences in responses when the same LLM operates in Chinese versus English and highlights normative disagreements between Western and non-Western LLMs regarding prominent figures in geopolitical conflicts.
Scalable watermarking for identifying large language model outputs.	introduces SynthID-Text, a text-watermarking approach designed to maintain text quality in LLM outputs, achieve high detection accuracy, and reduce latency. It incorporates watermarking through speculative sampling, using a final score pattern for model word choices alongside adjusted probability scores. The authors evaluate the method's feasibility and scalability by analyzing feedback on nearly 20 million Gemini responses.
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model.	finds that o1 outperformed other test-time compute methods across most datasets. The authors note that the primary reasoning patterns in o1 are divide and conquer and self-refinement, with the model adapting its reasoning strategy to specific tasks. For commonsense reasoning, o1 frequently employs context identification and focuses on constraints, while for math and coding tasks, it predominantly utilizes method reuse and divide-and-conquer approaches.
Sparse Crosscoders for Cross-Layer Features and Model Diffing.	Crosscoders are an advanced form of sparse autoencoders designed to enhance the understanding of language models' internal mechanisms.
Distill Visual Chart Reasoning Ability from LLMs to MLLMs.	Code-as-Intermediary Translation (CIT) is an innovative technique aimed at improving visual reasoning in multimodal language models (MLLMs) by leveraging code to convert chart visuals into textual descriptions.
Probabilistic Language-Image Pre-Training.	Probabilistic Language-Image Pre-training (ProLIP) is a vision-language model (VLM) designed to learn probabilistically from image-text pairs. Unlike traditional models that rely on strict one-to-one correspondence, ProLIP captures the complex many-to-many relationships inherent in real-world data.
A faster, better way to train general-purpose robots.	MIT researchers have developed Heterogeneous Pretrained Transformers (HPT), a novel model architecture inspired by large language models, designed to train adaptable robots by utilizing data from multiple domains and modalities.
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs.	In this work, DeepMind demonstrates how a small language model can be used to provide soft supervision labels and identify informative or challenging data points for pretraining, significantly accelerating the pretraining process.
NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction.	The NeuroClips framework introduces advancements in reconstructing continuous videos from fMRI brain scans by decoding both high-level semantic information and fine-grained perceptual details.
Machine-guided design of cell-type-targeting cis-regulatory elements.	A generalizable framework to prospectively engineer cis-regulatory elements from massively parallel reporter assay models can be used to write fit-for-purpose regulatory code.

News

Link	description
Keir Starmer says media firms should have control of output used in AI.	PM says content creators must be paid and vows to ensure technology ‘does not begin to chip away’ at press freedoms
Waymo raises $5.6B.	Waymo's driverless taxi service has gained significant popularity. The company has secured additional funding to extend its reach beyond the current cities and millions of miles it already covers.
Meta Introduces Spirit LM open source model that combines text and speech inputs/outputs.	Just in time for Halloween 2024, Meta has unveiled Meta Spirit LM, the company’s first open-source multimodal language model capable of seamlessly integrating text and speech inputs and outputs.
IBM debuts open source Granite 3.0 LLMs for enterprise AI.	IBM is enhancing its enterprise AI suite with Granite 3.0 LLMs, prioritizing open-source options and optimized performance. Available across various platforms, these models have built-in safety features and are customized for diverse enterprise applications. IBM highlights the significance of true open-source licensing with Apache 2.0, enabling flexible adoption and fostering enterprise-driven innovation.
Microsoft introduces ‘AI employees’ that can handle client queries.	US company gives customers the ability to build own virtual agents as well as releasing 10 off-the-shelf bots
Microsoft Excel’s bloopers reel: 40 years of spreadsheet errors.	As the software used by millions around the world celebrates its birthday, here are some of the low points
Google Expands Voice Technology Support to 15 More African Languages.	Google has expanded voice recognition support to include 15 more African languages across its platforms, such as Voice Search, Gboard talk-to-type, and Translate dictation. This enhancement enables an estimated 300 million additional Africans to engage with digital content in their native languages.
Cohere releases state-of-the-art multimodal AI search model.	Cohere has unveiled that its Embed 3 AI model is now multimodal, allowing for rapid and precise search across essential enterprise image data sources such as graphs, charts, product catalogs, and design files. This enhancement makes Embed 3 the most broadly capable multimodal embedding model available today.
Bringing developer choice to Copilot with Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview.	You can now access models like Claude, Gemini, and o1, among others, through GitHub Copilot.
Apple releases first batch of Apple Intelligence features, debuts new iMac.	Apple introduced new AI features, branded as Apple Intelligence, on its latest devices, focusing on text processing and photo editing capabilities. The updated iMac now runs on the M4 chip, which includes a Neural Engine that delivers three times the AI performance of previous models. Upcoming AI updates aim to improve Siri's capabilities and incorporate ChatGPT to handle more advanced queries.
How Advex creates synthetic data to improve machine vision for manufacturers.	Advex AI addresses data shortages in AI training by leveraging generative AI to create synthetic images tailored for computer vision systems.
Coframe raises $9 million for websites that optimize themselves using AI.	AI startup Coframe has raised $9.3 million in seed funding to further develop its platform, which leverages generative AI to optimize websites and deliver personalized marketing experiences.
Google unveils invisible ‘watermark’ for AI-generated text.	Real-world demonstration in chatbot responses could encourage other firms to label material produced by AI.
Reddit shares soar after company turns first-ever profit.	Monthly users rose by nearly half thanks to the AI translation feature, and deals for AI training with Google and OpenAI boosted revenue
Google parent Alphabet sees double-digit growth as AI bets boost cloud business.	Analysts expected 12% year-on-year revenue gains, but company reports 15%, buoyed by performance in ads and cloud services
EU events on curbing big tech ‘distorted’ by attendees with industry links.	Campaigners say 21% of people at workshops did not disclose on their applications relationships with firms being discussed
Indonesia blocks Apple iPhone 16 sales over lack of investment.	Marketing and sale of model prohibited after tech giant fails to meet rule that 40% of phones be made from local parts
25% of Smartphone Owners Don't Want AI as Apple Intelligence Debuts.	What's a bigger priority? Longer battery life, according to a new CNET survey.
Google preps ‘Jarvis’ AI agent that works in Chrome.	Google's Project Jarvis, powered by Gemini 2.0, aims to automate web-based tasks in Chrome by using AI agents capable of reasoning and planning.
OpenAI’s Whisper transcription tool has hallucination issues, researchers say.	OpenAI's Whisper, an AI transcription tool, has been found to produce hallucinations—fabricated text not present in the original audio—even in medical settings. Despite OpenAI's advisories against using Whisper in high-risk domains, over 30,000 medical professionals across 40 health systems have adopted it for transcribing patient consultations.
Forerunner K2 humanoid robot can carry 33 lb in each dexterous hand.	Kepler has introduced the Forerunner K2, a humanoid robot featuring advanced AI, upgraded hardware, and enhanced vision and navigation systems for improved real-time interaction.
Introducing ChatGPT search.	ChatGPT now offers an improved web search capability, providing quick, current answers with links to relevant sources—answers you'd typically seek through a search engine. This feature combines the ease of a natural language interface with access to real-time information, such as sports scores, news, stock prices, and more.
Advancing embodied AI through progress in touch perception, dexterity, and human-robot interaction.	This work features several components, including vision-based tactile sensing, innovative hardware touch sensors, and noteworthy strategic partnerships within robotics.
Elon Musk’s xAI adds image understanding capabilities to Grok.	This means that paid users on his social platform X, who have access to the AI chatbot, can upload an image and ask the AI questions about it.
OpenAI CFO Says 75% of Its Revenue Comes From Paying Consumers.	OpenAI generates the vast majority of its revenue from consumers who pay for its products, Chief Financial Officer Sarah Friar said, even as the artificial intelligence startup competes in a crowded market to sign up more corporate customers.
Hello Patient.	Hello Patient has emerged from stealth mode, securing a $6.3 million seed funding round led by 8VC. The company, founded by Alex Cohen, is based in Austin, Texas.
Google plans to announce its next Gemini model soon.	December is shaping up to be a month of dueling announcements from OpenAI and Google.
Meta is reportedly developing a search engine for its chatbot.	The company wants to decrease Meta AI’s reliance on Google and Microsoft.
A mysterious new image generation model has appeared.	A mysterious new image generation model is beating models from Midjourney, Black Forest Labs, and OpenAI on the crowdsourced Artificial Analysis benchmark. The model, which goes by the name “red_panda,” is around 40 Elo points ahead of the next-best-ranking model, Black Forest Labs’ Flux1.1 Pro, on Artificial Analysis’ text-to-image leaderboard.

Resources

Link	description
Agentic Information Retrieval.	Offers an overview of agentic information retrieval, driven by the abilities of LLM agents; explores various advanced applications of agentic information retrieval and addresses related challenges.
Aya Expanse.	Introduces a suite of open-weight foundation models designed for multilingual proficiency, featuring 8B and 32B parameter models and one of the largest multilingual datasets to date, containing 513 million examples. The release also includes Aya-101, which is claimed to be the most extensive multilingual model, supporting 101 languages. Aya Expanse 32B surpasses the performance of Gemma 2 27B, Mixtral 8x22B, and Llama 3.1 70B, even though it is half the size of the latter.
A Survey on Data Synthesis and Augmentation for Large Language Models.	Offers an in-depth overview of data generation techniques throughout the LLM lifecycle, covering topics such as data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and practical applications.
granite-3.0-language-models.	Introduces a range of lightweight foundation models from 400 million to 8 billion parameters, optimized for tasks such as coding, retrieval-augmented generation (RAG), reasoning, and function calling. Designed for enterprise applications, these models support on-premise and on-device deployment, showing robust performance across academic benchmarks in language understanding, reasoning, coding, function calling, and safety.
Pixtral-12B-Base-2409.	Pixtral 12B base model weights have been released on Hugging Face.
Arcade, a new AI product creation platform, designed this necklace.	Arcade AI has developed a generative platform that allows users to create distinctive, high-quality jewelry items simply from text prompts—and the exciting part is, you can purchase the designs you generate.
Retrieval-Augmented Diffusion Models for Time Series Forecasting.	The Retrieval-Augmented Time Series Diffusion model (RATD) introduces a retrieval and guidance mechanism to enhance stability and performance in time series diffusion models. RATD operates in two steps: first, it retrieves relevant historical data from a database, and then uses this information as a reference to guide the denoising phase.
NotebookLlama: An Open Source version of NotebookLM.	Meta has published a quick start guide to help users build a simplified version of Google’s popular NotebookLM system.
How I Studied LLMs in Two Weeks: A Comprehensive Roadmap.	This article presents a 14-day roadmap for mastering LLM fundamentals, covering key topics such as self-attention, hallucinations, and advanced methods like Mixture of Experts. It offers resources for building an LLM from the ground up, alongside curated literature and online materials, all organized within a GitHub repository. Emphasizing a tailored learning experience, the article underscores the importance of foundational skills in math, programming, and deep learning.
Marly.	Marly is an open-source data processor enabling agents to query unstructured data using JSON, streamlining data interaction and retrieval.
LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias.	It was previously believed that novel view synthesis depended heavily on strong 3D inductive biases. This study demonstrates that, with scale and a minimal inductive bias, it's possible to significantly surpass these previously assumed limitations.
Continuous Speech Synthesis using per-token Latent Diffusion.	Autoregressive models continue to excel in many applications, yet recent advancements with diffusion heads in image generation have led to the concept of continuous autoregressive diffusion. This research broadens the scope of per-token diffusion to accommodate variable-length outputs.
CDChat: A Large Multimodal Model for Remote Sensing Change Description.	This paper presents a change description instruction dataset aimed at fine-tuning large multimodal models (LMMs) to enhance change detection in remote sensing.
IC-Light V2 (Flux-based IC-Light models).	IC-Light currently offers the most effective method for associating images with a pre-trained text-to-image backbone. This discussion marks the initial steps toward expanding that capability to the robust Flux models.
The Scene Language: Representing Scenes with Programs, Words, and Embeddings.	Creating 3D scenes from scratch presents significant challenges, including data limitations. This research introduces a programming-like language for describing 3D scenes and demonstrates that Claude Sonnet can produce highly realistic scenes even without specific training for this task.
3D Semantic Segmentation.	FtD++ is a cross-modal learning approach designed to enhance unsupervised domain adaptation in 3D semantic segmentation tasks.
Open source replication of crosscoder on Gemma 2B.	Anthropic recently published two studies showcasing its novel interpretability method. This post provides an open replication of the crosscoder on the Gemma 2B model.
Awesome-Graph-OOD-Learning.	This repository lists papers on graph out-of-distribution learning, covering three primary scenarios: graph OOD generalization, training-time graph OOD adaptation, and test-time graph OOD adaptation.
OpenWebVoyager: Building Multimodal Web Agents.	OpenWebVoyager offers tools, datasets, and models designed to build multimodal web agents that can navigate and learn from real-world web interactions.
Automated Colorization for Animation.	Researchers have introduced an innovative inclusion-matching technique that overcomes challenges in automated colorization, particularly for animations where occlusions and wrinkles complicate traditional segment matching.
Lofi Music Dataset.	A dataset containing music clips paired with detailed text descriptions, generated by a music creation model.
Learning to Handle Complex Constraints for Vehicle Routing Problems.	Researchers have developed a Proactive Infeasibility Prevention (PIP) framework designed to enhance neural network performance on Vehicle Routing Problems (VRPs) that involve challenging constraints.
Unleashing the Power of AI on Mobile: LLM Inference for Llama 3.2 Quantized Models with ExecuTorch and KleidiAI.	PyTorch has made significant strides with ExecuTorch, a tool that enables AI model deployment at the edge, greatly enhancing the performance and efficiency of various end systems.
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution.	CompassJudger-1 is the first open-source, comprehensive judge model created to enhance the evaluation process for large language models (LLMs).
MINT-1T.	MINT-1T, a vast open-source multimodal dataset, has been released with one trillion text tokens and 3.4 billion images, incorporating diverse content from HTML, PDFs, and ArXiv papers. This dataset, roughly ten times larger than previous collections, is intended to accelerate advancements in large-scale multimodal machine learning research.
LARP: Tokenizing Videos 🎬 with a Learned Autoregressive Generative Prior 🚀.	LARP is a novel video tokenizer designed to enhance video generation in autoregressive (AR) models by prioritizing global visual features over individual patch-based details.
OpenAI's new hallucination benchmark.	OpenAI has released the SimpleQA benchmark, which measures models' abilities around simple factual questions.
ThunderKittens.	ThunderKittens is a framework designed for creating highly efficient GPU kernels. It leverages the principle that GPUs are optimized for working with compact 16x16 data tiles, resulting in high usability. With this approach, achieving 40% faster kernels requires only a few hundred lines of code.
Skinned Motion Retargeting with Dense Geometric Interaction Perception.	MeshRet has developed an innovative method for enhancing motion retargeting for 3D characters, prioritizing the preservation of body geometry interactions from the outset.
Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance.	Researchers have improved Masked Generative Models (MGMs) by introducing a self-guidance sampling technique, which enhances image generation quality without compromising diversity.
Speeding Up Transformers with Token Merging.	This project presents PiToMe, an algorithm that compresses Vision Transformers by gradually merging tokens after each layer, thereby decreasing the number of tokens processed.
PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting.	PF3plat addresses the challenge of 3D reconstruction and novel view synthesis from RGB images without requiring additional data.
Fine-tuning LLMs to 1.58bit: extreme quantization made easy.	BitNet, created by Microsoft Research, is a transformer architecture that lowers the computational and memory demands of large language models by using ternary weights (-1, 0, 1), which corresponds to about 1.58 bits per parameter. The architecture is normally trained from scratch, but this post shows that existing models can also be fine-tuned to the low-precision format while retaining high performance on downstream tasks. The technique greatly reduces energy consumption and improves inference speed through specialized kernels that enable efficient matrix multiplication.
SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Recognition.	SELECT is the inaugural extensive benchmark designed to evaluate various data curation methods in image classification. ImageNet++ is a newly developed dataset that augments ImageNet-1K by incorporating five additional training data variations, each curated through distinct techniques.
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning.	ODRL is the first standardized benchmark designed to assess reinforcement learning methods in environments with differing dynamics.
Text-to-Image Model to Generate Memes.	Researchers have created an innovative adapter method for text-to-image models, enabling them to tackle complex tasks such as meme video generation while preserving the base model's strong generalization abilities.
Anomaly Classification in Industry.	AnomalyNCD is a multi-class anomaly classification framework intended to enhance traditional anomaly detection techniques in industrial environments.
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models.	Byte-level language models represent a move toward a token-free future, but the challenge of sequence length remains significant. Dynamically merging tokens shortens the sequence, so more content fits within the same context.
BART vectoriZed.	A new GPU-enabled implementation of Bayesian Additive Regression Trees (BART) significantly accelerates processing speed, making it up to 200 times faster than conventional CPU-based versions.
Huge new Diffusers release.	The Hugging Face Diffusers package now includes new pipelines like Flux, Stable Audio, Kolors, CogVideoX, Latte, and others, alongside new methods such as FreeNoise and SparseCtrl, plus various refactors.
4 experiments with voice AI models to help you explore culture.	Google’s voice AI models allow users to engage with culture in innovative ways. Projects like Talking Tours provide AI-guided virtual tours, Mice in the Museum offers art narration, and Lip Sync animates lips to discuss cultural topics. These entertaining tools offer new perspectives on art and design.
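
A minimal sketch of the ternary idea mentioned in the 1.58-bit entry above, assuming an absmean scaling rule (divide by the mean absolute weight, round, then clip to [-1, 1]). The function name and the sample weights are illustrative and are not taken from the linked post; real implementations work on tensors and rely on specialized kernels.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map each weight to -1, 0, or 1 using an absmean scale (assumed rule)."""
    # Scale by the mean absolute value so typical weights land near +/-1.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary set.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


# Three possible values per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)

sample_weights = [0.9, -0.05, 0.4, -1.2]  # illustrative values only
ternary, scale = ternary_quantize(sample_weights)
print(ternary, round(bits_per_weight, 2))
```

The `log2(3)` line is where the "1.58 bits" figure comes from: a weight with three possible states needs about 1.585 bits to encode.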

Perspectives

Link	description
ByteDance intern fired for planting malicious code in AI models.	After rumors swirled that TikTok owner ByteDance had lost tens of millions after an intern sabotaged its AI models, ByteDance issued a statement this weekend hoping to silence all the social media chatter in China.
Thinking Like an AI.	Large language models (LLMs) operate as advanced autocomplete systems, generating the next token based on a combination of their training data and current input. Small variations in input can influence predictions, resulting in different responses to the same question. Gaining insight into token prediction, training data context, and memory constraints can enhance effective AI usage.
An Interview with Salesforce CEO Marc Benioff about AI Abundance.	Salesforce CEO Marc Benioff recently spoke about the company's new AI initiative, Agentforce, showcasing its potential to transform enterprise applications and customer interactions. He contrasted Salesforce's approach with Microsoft’s Copilot, describing Salesforce’s solution as more cohesive and impactful, thanks to its strong platform and data infrastructure. During the interview, Benioff stressed the significance of AI-driven "agentic" layers designed to boost customer service and improve operational efficiency across various industries.
How GPU Access Helps Startups Be Agile.	Andreessen Horowitz's Oxygen program tackles GPU shortages by offering startups in its portfolio more accessible and flexible GPU resources, allowing them to bypass price surges and supply limitations. This initiative enables AI startups to concentrate on product development without the pressure of long-term capital expenditure, emphasizing the need for equitable access to critical resources in the competitive AI field.
The Mask Comes Off: At What Price?	OpenAI is approaching its shift to a Public Benefit B-Corporation, a move that could impact its investor dynamics and collaboration with Microsoft. This transition brings up questions around control and valuation, particularly concerning the nonprofit's stake, which could be substantial given OpenAI's role in advancing AGI. The company’s future profitability and strategic course are closely tied to the safe development of AGI, a pursuit with enormous potential value.
What's so special about the human brain?	Torrents of data from cell atlases, brain organoids, and other methods are finally delivering answers to an age-old question.
‘Educational’ apps are worth billions. We need to make sure they work.	Partnerships between developers and researchers could help to improve the quality of educational apps and other technologies.
The huge protein database that spawned AlphaFold and biology’s AI revolution.	Pioneering crystallographer Helen Berman helped to set up the massive collection of protein structures that underpins the Nobel-prize-winning tool’s success.
Extreme fire seasons are looming — science can help us adapt.	Not all wildfires can be averted, but data, models, and collaborations can help to chart a course to a fire-resilient future.
AI-designed DNA sequences regulate cell-type-specific gene expression.	Researchers have used artificial intelligence models to create regulatory DNA sequences that drive gene expression in specific cell types. Such synthetic sequences could be used to target gene therapies to particular cell populations.
Pushing the frontiers of audio generation.	DeepMind has shared additional details about the audio generation models behind NotebookLM.
Evaluating feature steering: A case study in mitigating social biases.	This study investigates the use of feature steering in AI models to adjust outputs in an interpretable way. It identifies a "steering sweet spot," where modifications do not compromise performance. Results demonstrate that steering can adjust social biases within specific areas but may also produce unintended effects outside those targets. Continued research is necessary to enhance feature steering, aiming for safer and more dependable AI outcomes.
How we saved hundreds of engineering hours by writing tests with LLMs.	Assembled leverages LLMs to speed up and enhance software testing, allowing tests to be generated in minutes rather than hours. This approach boosts engineering productivity, saving time and enabling a stronger focus on feature development. LLMs create thorough and precise tests that uphold code quality and sustain development speed.
How to train LLM as a judge to drive business value.	"LLM As a Judge" is an approach for leveraging an existing language model to rank and score natural language. This post provides guidelines for effectively using this method to process or assess data.

Back to index

ML news: Week 21 - 27 October

Research

Link	description
Thinking LLMs: General Instruction Following with Thought Generation.	The proposed training method aims to enhance LLMs with thinking capabilities for general instruction-following without relying on human-annotated data. It employs an iterative search and optimization process to facilitate thought generation, allowing the model to learn without direct supervision. For each user instruction, potential thoughts are evaluated using a judge model, which scores only the responses to identify the best and worst options. The resulting full outputs are then used as selected and rejected pairs for DPO (termed Thought Preference Optimization in this paper). This approach demonstrates superior performance on AlpacaEval and Arena-Hard.
Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence.	A new collaborative search algorithm is proposed to adapt LLMs using swarm intelligence, where a group of LLM experts collaboratively navigates the weight space to optimize a utility function that reflects various adaptation objectives. Experiments show that Model Swarms can effectively adjust LLM experts for a single task, multi-task domains, reward models, and a range of human interests. This approach outperforms 12 model composition baselines by up to 21.0% across different tasks and contexts.
First-Person Fairness in Chatbots.	This study explores first-person fairness, focusing on the fairness of interactions between users and ChatGPT, particularly examining any biases related to users' names. It utilizes a model powered by GPT-4o to analyze patterns and name sensitivity in the chatbot's responses based on different user names. The findings suggest that post-training significantly reduces harmful stereotypes overall. However, in areas such as entertainment and art, especially with open-ended tasks, the study reveals a higher level of bias, indicating a tendency to create narratives featuring protagonists whose gender aligns with the gender inferred from the user's name.
Looking Inward: Language Models Can Learn About Themselves by Introspection.	The report indicates that LLMs can gain knowledge through introspection that is not directly derivable from their training data. It suggests that these models possess privileged information about themselves, which could contribute to creating more interpretable and controllable systems. However, it also notes that this introspective ability has limitations, as models often struggle to predict their own behavior on tasks that require reasoning over extended outputs.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation.	This proposal introduces a unified autoregressive framework for multimodal understanding and generation, which decouples visual encoding into independent pathways. Utilizing a single transformer architecture enhances flexibility and performance in both visual understanding and generation tasks. The framework claims to mitigate the trade-offs typically associated with vision tasks found in methods relying on a single visual encoder. As a result, it outperforms previous unified models and matches or exceeds the performance of task-specific models.
Inference Scaling for Long-Context Retrieval Augmented Generation.	This study employs two strategies to explore scaling laws for Retrieval-Augmented Generation (RAG): in-context learning (DRAG) and iterative prompting (IterDRAG). It finds that RAG performance improves steadily as effective context length increases, provided configurations are optimized. Additionally, under optimal conditions, increasing inference computation yields nearly linear improvements in long-context RAG performance. This insight leads to a computation allocation model designed to offer practical guidance for optimal computation distribution in long-context RAG settings.
Agent S: An Open Agentic Framework that Uses Computers Like a Human.	A novel open agentic framework has been developed to facilitate autonomous interactions with computers via a graphical user interface (GUI). Named Agent S, this framework addresses challenges such as knowledge acquisition, long-horizon planning, and managing dynamic interfaces. It introduces experience-augmented hierarchical planning that combines search and retrieval methods. Additionally, it utilizes an agent-computer interface to enable reasoning and control over GUI agents. Evaluation on the OSWorld benchmark demonstrates that Agent S surpasses the baseline by 9.37% in success rate, representing an 83.6% relative improvement, and sets a new state-of-the-art performance.
Exploring Model Kinship for Merging Large Language Models.	The study introduces the concept of model kinship to assess the similarity between LLMs. This measure is used to develop a model merging strategy called Top-k Greedy Merging with Model Kinship, which enhances performance. The authors find that this new criterion allows for effective and continuous model merging.
On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability.	The report highlights that the o1-preview model excels in self-evaluation and constraint-following. However, it also points out that these o1 models exhibit bottlenecks in decision-making and memory management, particularly in the context of spatial reasoning. Specifically, the models tend to generate redundant actions and face challenges in generalizing across spatially complex tasks.
Sabotage evaluations for frontier models.	Anthropic has conducted several innovative evaluations to identify vulnerabilities and assess misalignment in large, powerful models.
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities.	A powerful open-source initiative aimed at replicating GPT-4o's speech capabilities has emerged. This model was trained by aligning multiple modalities using pre-trained audio and speech encoders, allowing it to achieve advanced speech recognition and generation functionalities.
Automatically Interpreting Millions of Features in Large Language Models.	Interpreting SAE features on a large scale can be difficult. To address this, Eleuther has introduced a set of automatic interpreter features designed to help understand the meaning of elements within their context.
Mitigating Object Hallucination via Concentric Causal Attention.	Object hallucination in vision-language models has been associated with Rotary Position Encoding (RoPE), which faces challenges in managing long-term dependencies between visual and textual inputs. To overcome this, the authors introduce Concentric Causal Attention (CCA), a novel positional alignment method that enhances the interaction between visual elements and instruction tokens.
Simplifying, stabilizing, and scaling continuous-time consistency models.	OpenAI has published work focusing on enhancing consistency models, which operate in two steps rather than the 1,000 steps typically used in diffusion models. While these models still depend on distillation from an existing diffusion model, the research seeks to improve their performance and stability as they scale.
All you need are 32 tokens to represent video.	Salesforce's new approach introduces a novel video encoder that significantly reduces the number of tokens needed for accurate representation. While similar attempts in the past have seen limited success, the breakthrough appears to come from combining an explicit temporal encoder with a spatial encoder, enabling more efficient video processing.
CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing.	CoPS is a novel algorithm that improves agents' sequential reasoning by allowing them to share experiences across various tasks, enhancing their overall learning and adaptability.
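
The pair-selection step described in the Thinking LLMs entry above can be sketched roughly as follows. The `judge` callable, the function name, and the way thought and response are joined are hypothetical stand-ins, not the paper's code. The only property taken from the description is that the judge scores the response alone, while the full output (thought plus response) forms the preference pair.

```python
def build_preference_pair(candidates, judge):
    """Pick the best- and worst-scored candidates as a chosen/rejected pair.

    candidates: list of (thought, response) tuples sampled for one instruction.
    judge: hypothetical callable that scores a response string (higher is better).
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    # The judge sees only the response, never the thought.
    scored = [(judge(response), thought, response) for thought, response in candidates]
    best = max(scored, key=lambda item: item[0])
    worst = min(scored, key=lambda item: item[0])
    # The full outputs (thought + response) become the preference pair.
    return {
        "chosen": best[1] + "\n" + best[2],
        "rejected": worst[1] + "\n" + worst[2],
    }


# Toy usage: response length stands in for a real judge model.
toy_candidates = [
    ("think a", "short"),
    ("think b", "a much longer answer"),
    ("think c", "mid size"),
]
pair = build_preference_pair(toy_candidates, judge=len)
print(pair["chosen"])
```

In practice the judge would be a separate model and the resulting pairs would feed a DPO-style training loop; both are outside this sketch.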

News

Link	description
US investigates 2.4m Tesla self-driving vehicles after reported collisions.	Road safety agency opens evaluation over reported collisions in low visibility
Anthropic just made it harder for AI to go rogue with its updated safety policy.	Anthropic has revised its Responsible Scaling Policy to incorporate Capability Thresholds for AI models that present substantial risks, including bioweapons and autonomous AI research. This policy is designed to establish industry standards by introducing AI Safety Levels, which mandate stricter safeguards according to the model's capabilities. By transparently sharing safety practices and appointing a Responsible Scaling Officer, Anthropic aims to take a leadership role in AI governance and encourage similar initiatives across the industry.
Sam Altman’s Worldcoin becomes World and shows new iris-scanning Orb to prove your humanity.	The World project, co-founded by Sam Altman, seeks to authenticate human identity online through iris-scanning technology, addressing privacy issues and ongoing investigations in the EU. The initiative plans to integrate human verification into AI platforms and may redistribute the wealth generated by AI through Worldcoins. Recent updates include the launch of a new blockchain, an app, and tools such as Deep Face to help combat deepfakes.
Google - Gemini Long Context.	The Gemini team has set aside $100,000 for the most effective applications of their long context model capabilities.
Unleashing System 2 Thinking? AlphaCodium Outperforms Direct Prompting of OpenAI o1.	OpenAI's o1 model, demonstrating System 1.5 thinking, exhibits improved reasoning abilities compared to earlier LLMs but still lacks the comprehensive problem-solving capabilities of full System 2 thinking. AlphaCodium enhances o1's coding performance by offering a structured framework that supports reasoning and iterative refinement, resulting in greater accuracy on Codeforces benchmarks. Although the combination of o1 and AlphaCodium shows potential for advancing AI toward more profound reasoning, significant effort is still needed to incorporate complete System 2 thinking in AI models.
Amazon's AI Generator Tool Can Now Create Audio Ads.	Soon, you’ll hear more audio ads on Amazon’s properties that were created with generative AI.
Google Shopping is getting a ‘for you’ feed of products.	Google Shopping is rolling out a personalized feed that shows you a stream of products you might like. The new feature, which is coming to mobile and desktop devices, shows up when you head to shopping.google.com.
TikTok owner sacks intern for allegedly sabotaging AI project.	ByteDance dismissed person in August it says ‘maliciously interfered’ with training of artificial intelligence models
AlphaFold reveals how sperm and egg hook up in intimate detail.	Three sperm proteins work together as matchmakers to enable fertilization in vertebrates.
xAI, Elon Musk’s AI startup, launches an API.	In August, Elon Musk’s xAI promised to make Grok, the company’s flagship generative AI model powering a number of features on X, available via an API. Now, that API has arrived — albeit a bit bare-bones at the moment.
Jane Street Real-Time Market Data Forecasting.	This competition, hosted by Jane Street, challenges participants to build models using real-world data from production systems. The goal is to provide insights into the complexities of financial markets, requiring participants to apply their skills in data analysis and modeling to navigate the dynamic nature of market behavior.
OCP Summit 2024: The open future of networking hardware for AI.	At OCP 2024, Meta unveiled a next-generation disaggregated network fabric and new network hardware specifically designed for AI clusters. The company introduced the Disaggregated Scheduled Fabric (DSF), aimed at improving scalability and performance in AI training systems. Both the newly developed and existing hardware are optimized for high throughput and efficiency, providing open, vendor-agnostic solutions to support advanced AI applications.
Serve confirms delivery by robot expansion plans with Gen3 rollout.	Serve Robotics' third-generation delivery robot is equipped with NVIDIA's Jetson Orin module, significantly boosting its AI processing capabilities. This upgrade allows the robot to make faster, real-time autonomous navigation decisions, improving its efficiency and performance in delivery tasks.
Boston Dynamics teams with TRI to bring AI smarts to Atlas humanoid robot.	Boston Dynamics and Toyota Research Institute are partnering to integrate advanced AI and large behavior models into the electric Atlas humanoid robot. This collaboration aims to enhance the robot's capabilities, enabling more sophisticated and autonomous behaviors in tasks that require human-like movement and decision-making.
Microsoft introduces ‘AI employees’ that can handle client queries.	US company gives customers the ability to build own virtual agents as well as releasing 10 off-the-shelf bots
Thom Yorke and Julianne Moore join thousands of creatives in AI warning.	Statement comes as tech firms try to use creative professionals’ work to train AI models
Claude AI tool can now carry out jobs such as filling forms and booking trips, says the creator.	Anthropic says model is able to carry out computer tasks – as fears mount such technology will replace workers
Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku.	Anthropic has upgraded Claude 3.5 Sonnet and introduced Claude 3.5 Haiku, a more affordable model that matches the performance of the previous Claude 3 Opus. The new Sonnet has also been trained with screen recordings, enabling it to operate computers and interact with user interfaces.
ChatGPT has a Windows app now.	The app, which is currently in testing, is only available to ChatGPT subscribers for now.
Adobe's new image rotation tool is one of the most impressive AI concepts we've seen.	Adobe's Project Turntable leverages AI to rotate 2D vector art in 3D, allowing the artwork to be viewed from various angles while preserving its 2D look and design integrity. This innovative technique ensures that the visual style remains consistent, even as the artwork is transformed in three-dimensional space.
Perplexity lets you search your internal enterprise files and the web.	Enterprises can use their Perplexity dashboards to search internal information and combine it with knowledge from the web, although internal search is limited to the specific files they choose to include.
OpenAI, Microsoft reportedly hire banks to renegotiate partnership terms.	OpenAI and Microsoft are in discussions regarding the terms of their partnership, with Microsoft aiming to acquire a substantial stake in OpenAI following its restructuring.
Former OpenAI CTO Mira Murati is reportedly fundraising for a new AI startup.	This startup will reportedly focus on building AI products based on proprietary models and could raise more than $100 million in this round.
Midjourney plans to let anyone on the web edit images with AI.	Midjourney is planning to release an upgraded web tool that’ll let users edit any uploaded images from the web using Midjourney’s generative AI.
Intel wins lengthy EU legal battle over £880m competition fine.	Chipmaker disputed 2009 decision that it abused its market position in case dating back two decades
Cohere's multilingual model's dramatic improvement.	The Aya project, a standout initiative in multilingual language model training, has made impressive strides since its launch earlier this year. Much of its performance improvement is attributed to effective post-training strategies. Additionally, Aya can handle audio input and create images, all from non-English sources.
Introducing the analysis tool in Claude.ai.	Claude can now write and execute code as part of artifacts.
Gurman: Apple internally believes that it’s at least two years behind in AI development.	According to the latest edition of Mark Gurman’s Power On newsletter, some employees at Apple believe that the company is around two years behind in artificial intelligence development.
Perplexity is reportedly looking to fundraise at an $8B valuation.	AI search engine Perplexity is in fundraising talks and hopes to raise around $500 million at an $8 billion valuation, according to The Wall Street Journal.
Chinese humanoid robot is the 'fastest in the world' thanks to its trusty pair of sneakers.	The STAR1 robot can reach a top speed of 8 mph with the added help of a pair of sneakers.
From Rupert Murdoch to Thom Yorke: the growing backlash to AI.	Media mogul and leading artists join the fight to stop tech firms using creative works for free as training data
Talk to your plants? Now the first AI-powered garden will allow them to talk back.	Collaboration between leading garden designer and Microsoft to go on display at Chelsea Flower Show 2025

Resources

Link	description
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos.	This proposal introduces a new point-tracking model along with a semi-supervised training recipe that allows for the use of real videos without annotations during training. It generates pseudo-labels using readily available teacher models. This approach simplifies the architecture and training scheme, resulting in improved outcomes while utilizing 1000 times less data.
Meta's latest open source releases.	Meta has introduced a significant array of valuable research tools, including a speech-to-speech model, enhancements to SAM, and numerous other intriguing developments.
One-Step Diffusion via Shortcut Models.	Shortcut models represent a new category of consistency models that can produce continuous signals with minimal inference steps.
Zero-Shot 3D Visual Grounding.	VLM-Grounder is a novel approach to 3D visual grounding that addresses the shortcomings of conventional methods by leveraging vision-language models (VLMs) and 2D images.
DeepSeek's natively Multimodal model.	DeepSeek has developed and launched a powerful 1.3 billion parameter model capable of processing interleaved text and images for both generation and comprehension.
Meta Lingua.	Meta has developed an easy-to-use and research-friendly codebase that can replicate Llama 2 7B within 24 hours.
Embedding an Ethical Mind: Aligning Text-to-Image Synthesis via Lightweight Value Optimization.	LiVO (Lightweight Value Optimization) is an innovative approach designed to align Text-to-Image models with human values.
Easily hackable vision language model.	A simple and performant VLM implementation in pure PyTorch
Anthropic Quickstarts.	Anthropic Quickstarts provides developers with projects like a customer support agent and a financial data analyst to help them swiftly utilize the Anthropic API. These projects leverage Claude for natural language processing and incorporate interactive data visualization. Each quickstart comes with setup instructions and encourages contributions from the community.
BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities.	BiGR is an innovative image generation model that leverages compact binary latent codes to enhance both its generation and representation capabilities. It is the first model to integrate both generative and discriminative tasks within a unified framework. Key features of the model include binary tokenization and a distinctive entropy-ordered sampling technique, which contribute to its improved performance.
LongPiBench.	LongPiBench is a benchmark created to evaluate positional biases in large language models (LLMs) when handling long contexts. It focuses on identifying biases that stem from the spacing between multiple relevant pieces of information, providing a targeted way to assess how well models handle long-range dependencies in text.
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models.	CLaMP 2 is a contrastive model designed for aligning music and text. It uses contrastive learning techniques to match and relate musical elements with corresponding textual descriptions, enhancing the ability to process and generate music-related text in alignment with audio.
bitnet.cpp.	Microsoft has released an inference repository for its 1.58-bit models, which, when properly trained, are capable of running efficiently on consumer hardware. This development allows for more accessible deployment of advanced AI models without requiring high-end computational resources.
Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning.	Montessori-Instruct is a novel framework designed to generate synthetic data that aligns with a student language model's learning process. It adapts the data produced by the teacher model to fit the student's learning preferences by leveraging local data influence and Direct Preference Optimization (DPO), optimizing the training experience for the student model.
Stable Diffusion 3.5.	Stability AI has launched a new series of models featuring enhanced performance and faster speeds. These models come with built-in Diffusers support, allowing for immediate training capabilities.
3D-GANTex: 3D Face Reconstruction with StyleGAN3-based Multi-View Images and 3DDFA based Mesh Generation.	This paper presents a novel approach for estimating face texture and geometry from a single image by combining StyleGAN with 3D Morphable Models.
Moonshine.	Moonshine is a family of speech-to-text models optimized for fast and accurate automatic speech recognition (ASR) on resource-constrained devices. It is well-suited to real-time, on-device applications like live transcription and voice command recognition.
PocketPal AI.	PocketPal AI is a pocket-sized AI assistant powered by small language models (SLMs) that run directly on your phone. Designed for both iOS and Android, PocketPal AI lets you interact with various SLMs without the need for an internet connection.
Introducing the prompt() Function: Use the Power of LLMs with SQL!.	The costs of operating LLMs have dropped considerably, making it feasible to incorporate smaller models like GPT-4o-mini into SQL functions. MotherDuck's PROMPT() function simplifies tasks such as text generation, summarization, and structured data extraction using OpenAI models. It provides flexibility in balancing cost and performance, while also supporting bulk operations with improved concurrency for more efficient processing.
Anthropic Computer Use Demo.	A quick example of Claude 3.5 Sonnet's new computer use capabilities.
Introducing SynthID Text.	SynthID is a method for statistically watermarking generated text. It employs a pseudorandom function after the top-k and top-p sampling steps to embed a mark within the text. A probabilistic Bayesian approach is then used to detect whether the text has been watermarked, indicating it was produced by a language model.
Transformers.js v3: WebGPU Support, New Models & Tasks, and More….	Transformers.js is a JavaScript library designed to run machine learning models, and it now supports WebGPU, offering up to 100x faster performance than WASM in some cases. The latest version provides access to over 1,200 models, making it well-suited for edge and browser-based applications.
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages.	We present Pangea-7B, an open multilingual multimodal language model (MLLM) developed to address multilingual and multicultural challenges in visual understanding tasks. Pangea-7B is trained on PangeaIns, a comprehensive dataset consisting of 6 million instructions across 39 languages.
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree.	SAM2Long solves the "error accumulation" problem found in SAM 2's memory design by implementing a training-free strategy for video object segmentation.
Agent.exe.	A convenient wrapper for Anthropic's computer use system simplifies its usage and execution, making it more user-friendly and accessible.
TALoS: Enhancing Semantic Scene Completion via Test-time Adaptation on the Line of Sight.	TALoS is a method that enhances scene completion for autonomous vehicles by leveraging observations from different time points as supervision for making more accurate predictions.
OmniParser for Pure Vision Based GUI Agent.	Screenshot parsing tool for models to use digital interfaces.
Introducing quantized Llama models with increased speed and a reduced memory footprint.	Meta has optimized its 1B and 3B language models by applying quantization, achieving a 2-4x speed increase and reducing the model size by over 50% with minimal quality loss. This improvement is made possible by its quantization-aware training setup, allowing the models to adapt to lower precision effectively.
Joint Point Cloud Upsampling and Cleaning with Octree-based CNNs.	An effective and straightforward approach for upsampling and refining point clouds utilizes a modified octree-based 3D U-Net, known as OUNet.
ExecuTorch.	ExecuTorch supports on-device inference across mobile and edge devices, including wearables, embedded systems, and microcontrollers. It facilitates the efficient deployment of PyTorch models to edge environments and is compatible with various computing platforms, leveraging hardware capabilities like CPUs, NPUs, and DSPs. Comprehensive tutorials provide guidance on using ExecuTorch step-by-step.
Federated Transformer (FeT).	The Federated Transformer (FeT) is a novel framework aimed at enhancing both performance and privacy in Vertical Federated Learning (VFL) across multiple collaborating parties.
ADEM-VL.	ADEM-VL is an innovative vision-language model created to address hardware constraints found in current models.
Predicting Weight Loss with Machine Learning.	The author utilized a straightforward feedforward DNN model to monitor and forecast weight loss on a ketogenic diet. This model effectively captured the non-linear weight loss trends, fit a predictive function to the data, and visualized calorie metrics. For added insights, the Harris-Benedict Equation was applied to compare estimated calorie needs with actual weight loss.
Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent.	Google Gemini's AI Studio can accurately extract numerical data from video screen recordings of emails. This process leverages the cost-effective Gemini 1.5 Flash model, resulting in minimal expense. This innovative "video scraping" technique provides a practical alternative to conventional data extraction methods.

Perspectives

Link	description
Duolingo CEO Luis von Ahn wants you addicted to learning.	Duolingo's CEO, Luis von Ahn, talks about utilizing AI and gamification to improve language learning through features such as chat interactions with AI avatars and AI-generated video game-like adventures. The company has recently launched Duolingo Max, a premium subscription plan that provides AI-driven conversation practice, capitalizing on the lower costs and faster development associated with AI-generated content. Although AI has limitations in engagement, Duolingo prioritizes maintaining user motivation by balancing effective learning with gamified, entertaining experiences.
State of AI Report 2024.	The 2024 State of AI Report notes that foundational models are increasingly being integrated into practical applications, with OpenAI leading the way in significant revenue generation. Key developments include the alignment of performance among leading research labs, a growing emphasis on planning and reasoning in large language model (LLM) research, and extending foundational models into multimodal domains. Despite facing regulatory hurdles, AI companies have seen a surge in valuation, though questions about their long-term sustainability remain.
How gen AI can help doctors and nurses ease their administrative workloads.	Doctors and nurses spend nearly 28 hours a week on administrative tasks.
Elon Musk’s global political goals.	Over the weekend, Musk pledged to give away $1m a day to registered voters in battleground states in the US who sign his Pac’s petition in support of the First and Second Amendments. He awarded the first prize, a novelty check the size of a kitchen island, at a Pennsylvania rally on Saturday and the second on Sunday in Pittsburgh. He says he’ll keep doing it until the election on 5 November. Experts say that the stunt is potentially illegal.
The Second $100B AI Company.	This article forecasts that by 2034, emerging AI companies fueled by advancements in AI applications, particularly in consumer AI, will join OpenAI in exceeding a $100B market cap. While established tech giants currently dominate the AI infrastructure and model layers, the application layer offers significant potential for innovation and expansion, providing fertile ground for consumer AI to flourish. The prospects for large-scale success in consumer AI, especially in areas such as video creation, online shopping, and gaming, resemble the transformative impact seen in past tech revolutions like cloud computing and mobile technology.
Use Prolog to improve LLM's reasoning.	Current methods such as Chain-of-Thought (CoT) reasoning and the integration of programming languages like Prolog can enhance the reasoning abilities of LLMs, helping to mitigate the limitations of autoregressive models. The paper "Reliable Reasoning Beyond Natural Language" introduces a neurosymbolic approach that employs Prolog to translate requests into symbolic logic, enhancing both explainability and problem-solving capabilities. ProSLM, the model developed in this research, has shown substantial improvements on various datasets, highlighting the potential of combining Prolog with LLMs for tackling complex reasoning tasks.
AI watermarking must be watertight to be effective.	Scientists are closing in on a tool that can reliably identify AI-generated text without affecting the user’s experience. But the technology’s robustness remains a challenge.
AI scans RNA ‘dark matter’ and uncovers 70,000 new viruses.	Many are bizarre and live in salt lakes, hydrothermal vents, and other extreme environments.
Build an international AI ‘telescope’ to curb the power of big tech companies.	Artificial intelligence (AI) technologies have reached a crucial juncture. The vast computing clusters required to train the most advanced generative AI systems are available only to a few large corporations.
Was the Nobel prize for physics? Yes — not that it matters.	The award of the 2024 Nobel Prize in Physics to John Hopfield and Geoffrey Hinton for their groundbreaking research on artificial neural networks has caused consternation in some quarters. Surely this is computer science, not physics?
How I peer into the geometry behind computer vision.	Minh Ha Quang’s work at a Japanese AI research center aims to understand how machines extract image data from the real world.
AI Dreams: Microsoft @ 50, Chapter 1.	Microsoft's research on AI robustness led the company to invest billions in AI infrastructure, driving breakthroughs with partners such as OpenAI. This investment has played a key role in Microsoft's rapid growth in AI-powered products, highlighted by the success of GitHub Copilot. Despite facing competition and balancing sustainability goals, Microsoft remains committed to AI, with record capital expenditures on its AI and cloud infrastructure.
Future of Internet in the age of AI.	In this article, Cloudflare CEO Matthew Prince explores AI's influence on Internet infrastructure, emphasizing the need for AI-capable edge computing and local inference to minimize network latency. He underscores the significance of regionalization in AI services to address regulatory challenges and outlines Cloudflare's strategy of developing a connectivity-focused network. Cloudflare's goal is to enhance internet connectivity by making it faster, more secure, and more efficient, closely aligning its efforts with advancements in AI technologies.
How Jacob Collier helped shape the new MusicFX DJ.	Grammy-winning musician Jacob Collier has partnered with Google DeepMind and Google Labs to develop MusicFX DJ, an AI-driven music tool. The tool’s interface has been revamped to foster creativity, making it easy for users to tap into a "flow state" of artistic inspiration. MusicFX DJ is now available, featuring user-friendly controls suitable for all experience levels.
The AI Investment Boom.	The AI boom is spurring substantial US investments in data centers, computing infrastructure, and advanced hardware, with annual data center construction reaching an unprecedented $28.6 billion. This growth is driven by rising demand for high-powered computing resources essential for training and deploying sophisticated AI models. Although tech sector revenue is recovering, job growth is primarily centered on semiconductor manufacturing and infrastructure, shifting attention away from traditional programming roles.

Back to index

ML news: Week 14 - 20 October

Research

Link	description
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models.	Introduces a novel RAG method to address the challenges of imperfect retrieval augmentation and knowledge conflicts in LLMs. Astute RAG adaptively extracts critical information from the internal knowledge of LLMs, then iteratively merges this with external knowledge while maintaining source awareness. Its interactive consolidation mechanism enhances the integration of internal and external information by identifying consistent passages, detecting conflicting data, and filtering out irrelevant content.
ToolGen: Unified Tool Retrieval and Calling via Generation.	Incorporates tool knowledge directly into LLMs by encoding tools as unique tokens, allowing the model to generate tool calls and arguments, facilitating smooth tool invocation alongside natural language generation. Experiments involving over 47,000 tools demonstrate that ToolGen outperforms in both tool retrieval and autonomous task execution.
Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG.	Finds that in many long-context LLMs, output quality diminishes as the number of passages increases, with the performance decline attributed to retrieved hard negatives. The authors propose two methods to enhance long-context LLM-based RAG: retrieval reordering and RAG-specific tuning with intermediate reasoning to improve relevance identification. These approaches show marked improvements in both accuracy and robustness in long-context RAG performance.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.	Evaluates several state-of-the-art (SoTA) models using a benchmark built with symbolic templates that allow for a range of mathematical problems. The results show that LLMs display variability when answering different versions of the same questions, and their performance drops when numerical values in the questions are adjusted. As the complexity of the questions increases (e.g., adding more clauses), performance deteriorates significantly. The authors suggest that this decline in performance is likely due to a lack of logical reasoning capabilities in current LLMs.
Addition is All You Need for Energy-efficient Language Models.	Introduces an algorithm that approximates floating-point multiplication using integer addition operations, making it computationally less intensive than 8-bit floating-point arithmetic while achieving higher precision. The authors report that implementing the proposed L-Mul operation in tensor processing hardware could potentially reduce energy consumption by 95% for elementwise floating-point tensor multiplications and by 80% for dot product operations.
I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy.	Examines the interaction patterns of LLMs within a multi-agent setting involving a social hierarchy, specifically in a scenario where a guard and a prisoner interact, with the prisoner either seeking extra yard time or attempting to escape. The study finds that when power dynamics are present, LLMs struggle to maintain coherent conversations. Additionally, the authors highlight that agents' personas significantly influence their behaviors. Interestingly, even without explicit prompting, merely assigning roles to agents resulted in the emergence of anti-social behaviors.
Were RNNs All We Needed?	The paper revisits RNNs and demonstrates that removing the dependence on the previous hidden state from the input, forget, and update gates allows for efficient parallel training. This adjustment eliminates the need for architectures like LSTMs and GRUs to rely on backpropagation through time (BPTT). The authors introduce new variants, called minLSTMs and minGRUs, which are 175 times faster for sequences of length 512.
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations.	The study finds that "truthfulness" information in LLMs is concentrated in specific tokens, offering a way to improve error detection and address related challenges. The authors also suggest that the internal representations of LLMs can be used to predict the types of errors these models are prone to making.
Archon: An Architecture Search Framework for Inference-Time Techniques.	The paper presents a modular framework for constructing and optimizing LLMs by integrating various inference-time techniques. This approach redefines the task of LLM system design as a hyperparameter optimization problem. Tested on benchmarks like MT-Bench and CodeContests, the framework, named Archon, outperforms top models such as GPT-4o and Claude 3.5 Sonnet, achieving a 15.1% average accuracy improvement.
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning.	RATIONALYST is a model designed for process-supervision of reasoning, enabling it to generalize across a wide range of reasoning tasks. This is accomplished by pre-training on a dataset of 79k rationales from the Pile and a variety of reasoning datasets, with minimal human involvement. Fine-tuned from LLaMa-3-8B, the model achieves a 3.9% average accuracy improvement across seven reasoning benchmarks.
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation.	The paper introduces a unified framework to evaluate an LLM’s capability to provide factual responses, assess retrieval skills, and reason through the generation of final answers. The framework includes multi-hop questions that require combining information from multiple sources. It reports that state-of-the-art LLMs struggle with this task, achieving only 40% accuracy without retrieval. However, the proposed multi-step retrieval method improves performance to 66% accuracy.
Not All LLM Reasoners Are Created Equal.	Examines the grade-school math reasoning of LLMs using pairs of GSM8K problems chained so that the answer to the first is needed to solve the second. Most models show a significant gap between their accuracy on the individual questions and on the chained pairs, and the gap is more pronounced in smaller, more cost-efficient, and math-specialized models.
Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis.	Training generative models like GANs with limited data is challenging. Existing Implicit Maximum Likelihood Estimation (IMLE) methods suffer from poor alignment between the latent codes used during training and those used during inference. The proposed approach, RS-IMLE, modifies the prior distribution during training, resulting in better test-time performance and higher-quality image generation.
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models.	This study introduces a unified framework aimed at enhancing training stability in continuous-time consistency models, leading to substantial improvements in the performance of generative models.
DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection.	DARNet is an innovative model for auditory attention detection (AAD) that improves the decoding of brain signals, such as EEG, by integrating spatiotemporal and dual attention mechanisms.
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads.	DuoAttention is a framework designed to optimize memory usage and reduce latency in long-context large language models (LLMs) by selectively applying full key-value (KV) caching to only the most essential attention heads.
Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement.	Meta Decision Transformer (Meta-DT) aims to enhance generalization in reinforcement learning by integrating transformer-based sequential modeling with effective task representation learning.
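
For readers curious about the minGRU entry ("Were RNNs All We Needed?") in the table above, here is a minimal sketch of why dropping the hidden state from the gates makes parallel training possible. It assumes a single hidden unit with scalar weights; the names `w_z`, `w_h`, `min_gru_sequential`, and `min_gru_scan` are illustrative and not taken from the paper's code. With the gate depending only on the input, the update becomes the linear recurrence h_t = a_t * h_{t-1} + b_t, which can be evaluated with prefix products and prefix sums instead of a step-by-step loop. The paper reportedly uses a log-space parallel scan for numerical stability; the plain prefix form below is only for illustration and can underflow on long sequences.

```python
import math
from itertools import accumulate


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def min_gru_sequential(xs, w_z, w_h, h0=0.0):
    """Reference loop: h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t."""
    h, out = h0, []
    for x in xs:
        z = sigmoid(w_z * x)      # update gate depends on the input only
        h_tilde = w_h * x         # candidate state depends on the input only
        h = (1.0 - z) * h + z * h_tilde
        out.append(h)
    return out


def min_gru_scan(xs, w_z, w_h, h0=0.0):
    """Same recurrence written as prefix operations (scan-friendly form)."""
    z = [sigmoid(w_z * x) for x in xs]
    a = [1.0 - zt for zt in z]                     # decay coefficients a_t
    b = [zt * w_h * x for zt, x in zip(z, xs)]     # input contributions b_t
    # A_t = a_1 * ... * a_t (prefix product)
    A = list(accumulate(a, lambda p, q: p * q))
    # S_t = sum over k <= t of b_k / A_k (prefix sum)
    S = list(accumulate(bt / At for bt, At in zip(b, A)))
    # Closed form: h_t = A_t * (h_0 + S_t)
    return [At * (h0 + St) for At, St in zip(A, S)]


if __name__ == "__main__":
    xs = [0.5, -1.0, 2.0, 0.1, -0.3]
    print(min_gru_sequential(xs, w_z=0.7, w_h=-1.2))
    print(min_gru_scan(xs, w_z=0.7, w_h=-1.2))
```

Both functions compute the same hidden states. The second one only uses prefix products and sums, which are the kind of operations that parallel scan algorithms can spread across a sequence, whereas a standard GRU gate needs h_{t-1} before it can compute z_t.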

News

Link	description
AI gives voice to dead animals in Cambridge exhibition.	Creatures can converse and share their stories by voice or text through visitors’ mobile phones at Museum of Zoology
Three-armed robot conductor makes debut in Dresden.	German city’s Sinfoniker says the aim is not to replace humans but to play music human conductors would find impossible
Tesla’s value drops $60bn after investors fail to hail self-driving ‘Cybercab’.	Analysts criticize lack of detail about the ‘robotaxi’ showcased by CEO Elon Musk
Microsoft may have an audio-to-image generator in the works, new patent shows.	Microsoft has submitted a patent for an AI system that transforms live audio into images using large language models (LLMs). The system is intended to improve communication by creating real-time visuals from audio streams. Once developed, it could potentially be incorporated into Microsoft Teams through Copilot integration.
Australia’s spy chief warns AI will accelerate online radicalization.	Asio boss Mike Burgess says social media impact is a ‘step-change’ in the threat posed by extremism
Google to buy nuclear power for AI datacentres in ‘world first’ deal.	Tech company orders six or seven small nuclear reactors from California’s Kairos Power
Silicon Valley is debating if AI weapons should be allowed to decide to kill.	In late September, Shield AI co-founder Brandon Tseng swore that weapons in the U.S. would never be fully autonomous — meaning an AI algorithm would make the final decision to kill someone. “Congress doesn’t want that,” the defense tech founder told TechCrunch. “No one wants that.”
Zoom’s custom AI avatar tool may come with risks.	The upcoming feature, announced today at Zoom’s annual dev conference, will translate a video clip that users record of themselves into a digital clone — complete with a head, upper arms, and shoulders. Users will be able to type a script of what they want the digital double to say, and Zoom will generate audio that syncs with the avatar’s lip movements.
Generate Video (beta) on Firefly Web App.	During the Adobe MAX conference, Adobe revealed the extension of its Firefly series of creative generative AI models to include video.
OpenAI appoints international expansion boss.	OpenAI has named Oliver Jay as the head of its international expansion, with a focus on AI strategy and operations. The company also revealed the opening of a new APAC office in Singapore and is working on developing datasets for local languages. The o1 model, which incorporates "chain of thought" methods, is designed to improve AI accuracy.
Anthropic challenges OpenAI with affordable batch processing.	Anthropic has introduced a Message Batches API, enabling businesses to handle large data volumes at half the cost of standard API calls. Each batch can hold up to 10,000 queries, which are processed asynchronously within 24 hours, shifting AI processing from real-time to "right-time." The lower cost may encourage AI adoption among mid-sized companies but could draw attention away from advancing real-time AI capabilities.
OpenAI Projections Imply Losses Tripling To $14 Billion In 2026.	OpenAI projects losses to rise to $14 billion in 2026, with total losses reaching $44 billion by 2028.
AMD launches AI chip to rival Nvidia's Blackwell.	AMD has introduced the Instinct MI325X AI chip, targeting competition with Nvidia's leading data center GPUs.
Meta’s open AI hardware vision.	Meta unveiled its open AI hardware designs, including the Catalina rack and the enhanced Grand Teton platform, at the OCP Global Summit. Notably, training the Llama 3.1 405B model required 16,000 NVIDIA H100 GPUs, demonstrating Meta's robust scaling infrastructure. These open AI hardware systems are essential for driving further advancements in AI capabilities.
The New York Times warns AI search engine Perplexity to stop using its content.	The New York Times has sent a cease and desist letter to AI startup Perplexity, accusing the company of using its content without authorization for AI search purposes. Perplexity asserts that it does not scrape content for training but instead indexes web pages to provide factual information. The company is currently in discussions with publishers and seeks to resolve the matter by collaborating with the Times and other media organizations.
Decagon raises $65m Series B led by Bain Capital Ventures to bring total funding to $100m.	Decagon has secured $65 million in Series B funding to further develop its AI customer support agents, which are already utilized by companies such as Duolingo and Eventbrite to streamline customer interactions. These AI agents automate routine tasks, allowing customer support teams to focus on more strategic roles. The funding will be used to strengthen Decagon's engineering team and extend its AI solutions into new markets and industry sectors.
New high-quality AI video generator Pyramid Flow launches — and it’s fully open source!	The number of AI video generation models continues to grow with a new one, Pyramid Flow, launching this week and offering high-quality video clips up to 10 seconds in length — quickly, and all open source.
This three-person robotics startup is working with designer Yves Béhar to bring humanoids home.	Kind Humanoid's three-person team is developing a whimsical humanoid robot named Mona, specifically designed for home use rather than industrial applications. The team aims to conduct field tests with a dozen initial prototypes next year. Unlike many AI-driven robotics companies that focus on industrial markets and heavy fundraising, Kind prioritizes innovation and efficiency, setting its approach apart from competitors in the robotics space.
INTELLECT–1: Launching the First Decentralized Training of a 10B Parameter Model.	INTELLECT-1 is the first decentralized training run of a 10-billion-parameter model, designed to harness global contributions for open-source AGI development. It scales OpenDiLoCo to train large models across distributed devices, with innovations in bandwidth efficiency and fault tolerance. The new Prime framework further improves decentralized training by optimizing compute utilization, reaching a 98% utilization rate during the INTELLECT-1 training run.
Elon Musk Shows Off Tesla ‘Robotaxi’ That Drives Itself.	“You could fall asleep and wake up at your destination,” said Mr. Musk, Tesla’s C.E.O., but some experts are skeptical that such cars will be ferrying passengers soon.
ByteDance lays off hundreds of TikTok employees in the shift to AI content moderation.	ByteDance’s TikTok is laying off hundreds of employees, mainly in Malaysia, according to Reuters. The cuts come as the social network is increasingly turning to AI for content moderation. The cuts do not impact employees in the U.S.
Microsoft Artificial Intelligence VP Bubeck to Join OpenAI.	Microsoft Corp. said one of its artificial intelligence vice presidents, Sebastien Bubeck, is leaving to join OpenAI, where Microsoft is both the largest investor and a rival.
‘It’s not me, it’s just my face’: the models who found their likenesses had been used in AI propaganda.	London-based Synthesia’s technology was employed to make deepfake videos for authoritarian regimes
Amazon.com joins push for nuclear power to meet data center demand.	Company says it signed three agreements on developing small modular reactor nuclear power technology
Un Ministral, des Ministraux.	On the first anniversary of Mistral 7B, Mistral launched two advanced models designed for on-device and edge computing: Ministral 3B and Ministral 8B. Both sit in the sub-10-billion-parameter category, where they aim to offer superior knowledge, reasoning, and efficiency. They also support a context length of up to 128k and deliver faster inference.
Former Palantir CISO Dane Stuckey joins OpenAI to lead security.	Dane Stuckey, the former CISO of analytics firm Palantir, has joined OpenAI as its newest CISO, serving alongside OpenAI head of security Matt Knight.
Can AI really compete with human data scientists? OpenAI’s new benchmark puts it to the test.	OpenAI has introduced a new tool to measure artificial intelligence capabilities in machine learning engineering. The benchmark, called MLE-bench, challenges AI systems with 75 real-world data science competitions from Kaggle, a popular platform for machine learning contests.
Adobe’s AI video model is here, and it’s already inside Premiere Pro.	New beta tools allow users to generate videos from images and prompts and extend existing clips in Premiere Pro.
Customize Audio Overviews with Google's NotebookLM.	NotebookLM now enables users to customize their Audio Overview experience, providing greater control over the areas of focus and expertise of the AI hosts. Companies can apply for the new NotebookLM Business pilot program, which includes improved tools designed for professional applications.
Combining next-token prediction and video diffusion in computer vision and robotics.	A new method can train a neural network to sort corrupted data while anticipating next steps. It can make flexible plans for robots, generate high-quality video, and help AI agents navigate digital environments.
Nvidia just dropped a new AI model that crushes OpenAI’s GPT-4—no big launch, just big results.	Nvidia quietly unveiled a new artificial intelligence model on Tuesday that outperforms offerings from industry leaders OpenAI and Anthropic, marking a significant shift in the company’s AI strategy and potentially reshaping the competitive landscape of the field.
Invisible text that AI chatbots understand and humans can’t? Yep, it’s a thing.	A quirk in the Unicode standard harbors an ideal steganographic code channel.
Google supercharges Shopping tab with AI and personalized recommendation feed.	After bringing generative AI to Search in 2023, Google is supercharging its Shopping tab with the technology. The company announced on Tuesday that it will use AI to help users shop for products based on exactly what they’re looking for. It also launched a new scrollable feed of personalized, shoppable products.
Adobe’s Project Super Sonic uses AI to generate sound effects for your videos.	Adobe's Project Super Sonic leverages text-to-audio technology, object recognition, and voice input to create audio effects for video projects.
White House considers expanding Nvidia’s and AMD’s AI chip export limits to additional countries.	The Biden administration is contemplating limitations on AI chip sales from Nvidia and AMD to countries in the Persian Gulf, citing national security concerns.

Resources

Link	description
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering.	Introduces a new benchmark to assess machine learning agents' proficiency in machine learning engineering tasks. The benchmark consists of 75 Kaggle competitions focused on key MLE skills, including model training, dataset preparation, and experiment execution. OpenAI's o1-preview model, using the AIDE scaffolding, reaches at least bronze-medal level in 16.9% of the competitions.
Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System.	Presents a novel framework aimed at improving both communication efficiency and task effectiveness in LLM-based multi-agent systems through targeted LLM training. It introduces an iterative "generate, rank, select, and train" approach, enhanced by a reward function to optimize performance, token usage, and communication efficiency. The framework integrates Monte Carlo Tree Search-inspired techniques for DPO data generation, promoting diverse exploration. Experimental results show consistent improvements over single-agent baselines and standard multi-agent systems (MAS) using Llama 3 8B, achieving a 2.8x performance boost while utilizing fewer than 10% of tokens on tasks involving extensive information exchange.
Zyphra's Mamba 2 based model beats Mistral.	Introduces the first state space-style model that surpasses transformers at the 7B scale. It excels in understanding and generating long-context data, thanks to the linear time scaling of the Mamba 2 blocks, which significantly enhances its efficiency and performance.
OpenAI's Swarm.	OpenAI has introduced a lightweight framework designed to facilitate communication between agents. While it will not receive further updates, the framework could still offer valuable ideas and inspiration for future developments.
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.	EvolveDirector aims to develop a competitive text-to-image generation model using open, publicly available resources, avoiding the limitations imposed by proprietary models.
Rethinking the Evaluation of Visible and Infrared Image Fusion.	Researchers propose the Segmentation-oriented Evaluation Approach (SEA) to improve the evaluation of Visible and Infrared Image Fusion (VIF) techniques, which play a critical role in applications such as object detection and semantic segmentation.
A Gentle Introduction and Tutorial on Deep Generative Models in Transportation Research.	This tutorial gives a comprehensive overview of deep generative models and how they can be applied to transportation problems.
Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis.	Trans4D is a new framework developed to address the challenges of realistic 4D scene transitions, enhancing text-to-4D synthesis. It offers improved capabilities in generating coherent, dynamic 4D scenes from textual descriptions, making it more suitable for tasks that require accurate spatial and temporal scene transitions.
DocMTAgent.	DelTA, short for Document-levEL Translation Agent, is an online translation tool designed for handling document-level translations. It leverages a multi-level memory architecture to improve translation accuracy and coherence across larger texts, providing more context-aware translations compared to sentence-level models.
Fast Feedforward 3D Gaussian Splatting Compression.	Fast Compression of 3D Gaussian Splatting (FCGS) is a new model designed to eliminate the need for the slow, per-scene optimization required by earlier methods. Instead, FCGS achieves rapid compression using a quick feed-forward pass, reducing the processing time from minutes to just seconds. This significantly accelerates the compression process while maintaining high-quality results for 3D data.
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling.	OneRef presents an optimized framework for referring segmentation by integrating visual and language feature spaces within a unified transformer architecture.
SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction.	SmartPretrain offers a versatile, model-agnostic, and dataset-agnostic self-supervised learning framework designed to enhance motion prediction in autonomous vehicles.
UvA - An Introduction to Group Equivariant Deep Learning.	Course materials from the University of Amsterdam (UvA) introducing group equivariant deep learning, which builds the symmetries of geometric data into the network architecture.
Diffusion model simulating CS:GO.	An open-source replication of a diffusion model that generates visual simulations of a video game, using keyboard and mouse inputs to influence the output.
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs.	This study addresses the shortcomings of current alignment algorithms in large language models (LLMs), which tend to overfit to relative preferences and neglect response quality. The authors introduce reward-conditioned LLM policies and a novel data relabeling method that incorporates response quality, enabling the model to better generalize to optimal responses.
entropix.	Entropix is a tool designed to modify the sampling behavior of language models.
LoLCATs Blog Part 2: How to Linearize LLMs for Me and You.	Hazy Research has published another insightful post that delves into techniques for linearizing existing language models while maintaining much of their performance. This exploration highlights methods to simplify model architectures, making them more efficient, without significantly compromising their effectiveness in tasks like text generation and understanding.
TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control.	TextCtrl is a newly introduced diffusion-based method designed to enhance scene text editing. It achieves a balance between maintaining content accuracy and preserving the original style, ensuring that both the textual content and the visual appearance remain consistent during edits.
Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies.	iDP3 is an advanced 3D visuomotor policy designed to enable humanoid robots to autonomously navigate and perform tasks in a variety of real-world environments. This improved policy enhances the robot's ability to perceive and interact with its surroundings, making it more adaptable and efficient in complex and dynamic settings.
tabled.	Tabled is a small library for detecting and extracting tables. It uses Surya to find all the tables in a PDF, identifies the rows/columns, and formats cells into markdown, csv, or html.
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer.	HART is a cutting-edge visual generation model designed to produce high-quality 1024x1024 images, presenting a challenge to the capabilities of diffusion models. It enhances image reconstruction and reduces training costs by employing a hybrid tokenizer that integrates both discrete and continuous tokens, resulting in more efficient and effective image generation.
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention.	The Deformable Bi-level Routing Attention (DBRA) module is an innovation designed to enhance attention mechanisms in vision transformers. DeBiFormer, which is built upon DBRA, optimizes the selection of key-value pairs in the attention process, resulting in more efficient computations and better interpretability of queries within attention maps. This leads to improved performance and understanding of how the model attends to different parts of an image.
Six tips for going public with your lab’s software.	It’s not enough to write high-quality programs. If you want to make your apps public — and usable — you should also follow these steps.
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos.	CoTracker3 is a new point-tracking model that bridges the performance gap between synthetic and real video data by pseudo-labelling real videos in a semi-supervised training scheme.
A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration.	Researchers have developed a novel consistency-aware spot-guided Transformer designed to improve the efficiency and accuracy of point cloud registration.
Ditto - the simplest self-building coding agent.	Ditto is a user-friendly tool that generates a multi-file Flask application from simple natural language descriptions through a no-code interface. Using a simple LLM loop with a few tools, Ditto automates the coding process and (occasionally) turns your ideas into functional web applications, or at least tries and gets close.
F5 Text-to-Speech System.	F5-TTS is a non-autoregressive, zero-shot text-to-speech system featuring a flow-matching mel spectrogram generator and a diffusion transformer. Developed on the MLX framework, F5 outperforms earlier systems such as E2 TTS by incorporating ConvNeXT v2 blocks for improved text alignment, enabling high-quality speech generation in approximately 11 seconds on modern hardware.
Movie Gen Bench.	"Movie Gen Bench" is an evaluation benchmark designed to assess performance in both video (Video Bench) and audio (Audio Bench). It includes 1,003 prompts that encompass a v
