Reinforcement Learning: a brief overview

25 Mar

This document provides an overview of reinforcement learning (RL), covering maximizing expected utility, minimizing regret, episodic versus continual tasks, and model classes including partially observable Markov decision processes (POMDPs) and contextual bandits. It also discusses the exploration-exploitation tradeoff, reward functions, and software tools for implementing RL algorithms.
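
The exploration-exploitation tradeoff and the notion of regret mentioned above can be illustrated with a small sketch. This is not taken from the overview itself: it assumes a plain (non-contextual) Bernoulli bandit, an epsilon-greedy rule, and hypothetical names such as `run_epsilon_greedy` and `true_means`.

```python
import random


def run_epsilon_greedy(true_means, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a Bernoulli bandit (illustrative assumption, not the overview's code).

    Returns the per-arm value estimates, pull counts, and cumulative pseudo-regret,
    i.e. the summed gap between the best arm's mean and the chosen arm's mean.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    best_mean = max(true_means)
    regret = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update of the chosen arm's estimate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        regret += best_mean - true_means[arm]
    return estimates, counts, regret


if __name__ == "__main__":
    est, cnt, reg = run_epsilon_greedy([0.2, 0.5, 0.8])
    print("estimates:", est, "counts:", cnt, "regret:", reg)
```

A larger `epsilon` spends more pulls exploring, which adds regret per step but lowers the risk of locking onto a suboptimal arm early.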

DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

25 Mar

This paper presents DeepMesh, a novel framework for generating artist-like 3D meshes using reinforcement learning. It addresses challenges in auto-regressive mesh generation by introducing an efficient tokenization algorithm and aligning outputs with human preferences through Direct Preference Optimization (DPO). The proposed method demonstrates significant improvements in both precision and quality compared to existing techniques.
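
For readers unfamiliar with Direct Preference Optimization, the standard DPO objective for a single preference pair can be sketched as follows. This is the generic formulation, not DeepMesh's training code; the function name, argument names, and the value of `beta` are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained and under
    a frozen reference model. beta is a hypothetical temperature value.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); guard against overflow
    # for very negative margins, where the loss is approximately -margin.
    if margin < -30.0:
        return -margin
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # The policy prefers the chosen sample more than the reference does, so the
    # loss falls below log(2), the value at zero margin.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss depends only on how much more the policy favors the chosen sample over the rejected one relative to the reference model, so no separate reward model is needed.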

Transformers without Normalization

14 Mar

This work demonstrates that Transformers can be trained without normalization layers through a method called Dynamic Tanh (DyT), which adjusts input activation ranges and squashes extreme values. The findings challenge the conventional belief that normalization layers are essential for training modern neural networks, showing that DyT can match or exceed the performance of normalized counterparts across various tasks.
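
A minimal sketch of the idea, assuming DyT takes the element-wise form gamma * tanh(alpha * x) + beta, with a learnable scalar alpha and per-channel gamma and beta. The plain-Python lists and the parameter values below are illustrative assumptions, not the paper's implementation.

```python
import math


def dyt(x, alpha, gamma, beta):
    """Dynamic Tanh over one feature vector (illustrative sketch).

    alpha rescales the input range, tanh squashes extreme values into (-1, 1),
    and gamma / beta apply a per-channel affine transform afterwards.
    """
    return [g * math.tanh(alpha * xi) + b for xi, g, b in zip(x, gamma, beta)]


if __name__ == "__main__":
    activations = [-100.0, -1.0, 0.0, 1.0, 100.0]
    ones = [1.0] * len(activations)
    zeros = [0.0] * len(activations)
    # Extreme inputs saturate near -1 and 1; small inputs stay roughly linear.
    print(dyt(activations, alpha=0.5, gamma=ones, beta=zeros))
```

Unlike a normalization layer, this computes no statistics over the token or batch: each activation is transformed independently.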

All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

05 Mar

This paper explores the effectiveness of reinforcement learning (RL) in fine-tuning foundation models, particularly focusing on the advantages of a two-stage training process involving a reward model followed by an online RL procedure. It discusses various hypotheses regarding the performance gap between online and offline fine-tuning methods, providing theoretical and empirical insights into the complexities of learning policies and reward models.

When does a predictor know its own loss?

01 Mar

The article discusses the theoretical foundations of loss prediction in machine learning, exploring how well predictors can estimate their own loss on inputs. It establishes connections between loss prediction and multicalibration, providing insights into when a predictor can accurately assess its performance. The work includes empirical results demonstrating the relationship between loss prediction and multicalibration errors, and proposes methods for achieving efficient multicalibration across multiple loss functions.
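
To make "estimating one's own loss" concrete, here is a small sketch for binary outcomes under squared loss. It assumes a baseline in which the predictor treats its own output p as the true probability, so its self-predicted expected loss is p(1 - p). The function names are hypothetical and the setup is an illustration, not the article's method.

```python
def self_predicted_squared_loss(p):
    """Expected squared loss if the label really were Bernoulli(p): p * (1 - p)."""
    return p * (1.0 - p)


def loss_prediction_gap(preds, labels):
    """Mean difference between realized squared loss and self-predicted loss.

    A gap far from zero means the predictor's own confidence misjudges its loss
    on this sample of points.
    """
    realized = [(y - p) ** 2 for p, y in zip(preds, labels)]
    predicted = [self_predicted_squared_loss(p) for p in preds]
    return sum(r - q for r, q in zip(realized, predicted)) / len(preds)


if __name__ == "__main__":
    # Calibrated on this sample: the gap is zero.
    print(loss_prediction_gap([0.5, 0.5], [0, 1]))
    # Overconfident: it predicts 0.9 but every label is 0, so the gap is large.
    print(loss_prediction_gap([0.9, 0.9], [0, 0]))
```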

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

27 Feb

This paper introduces Cross-Layer Attention (CLA), a method designed to reduce the memory footprint of key-value (KV) caches in transformer models. By sharing KV activations across layers, CLA achieves significant reductions in memory usage while maintaining comparable accuracy to existing methods like Multi-Query Attention (MQA). The authors present extensive experiments demonstrating the effectiveness of CLA at both 1B and 3B parameter scales, highlighting its potential for improving the efficiency of large language models.
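
The memory effect of sharing KV activations across layers can be estimated with back-of-the-envelope arithmetic. The sketch below assumes the usual KV cache size formula and a hypothetical `sharing_factor` parameter (the number of consecutive layers that reuse one set of keys and values). The model dimensions in the usage example are made up for illustration and are not the paper's configurations.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2, sharing_factor=1):
    """Approximate KV cache size in bytes.

    sharing_factor=1 means every layer stores its own keys and values;
    sharing_factor=2 means each pair of adjacent layers shares one set.
    """
    kv_layers = -(-n_layers // sharing_factor)  # ceiling division
    # The factor 2 accounts for storing both keys and values.
    return 2 * kv_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    base = kv_cache_bytes(n_layers=24, n_kv_heads=1, head_dim=128, seq_len=4096)
    shared = kv_cache_bytes(n_layers=24, n_kv_heads=1, head_dim=128, seq_len=4096,
                            sharing_factor=2)
    print(base, shared, shared / base)
```

Because the layer count enters the formula independently of the number of KV heads, cross-layer sharing can be combined with head-sharing schemes such as MQA.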

Trading inference-time compute for adversarial robustness

22 Jan

This paper investigates the relationship between inference-time compute and adversarial robustness in reasoning models, particularly focusing on Large Language Models (LLMs). The authors conduct experiments demonstrating that increasing inference-time compute can significantly enhance the robustness of these models against various adversarial attacks without requiring adversarial training. They also explore new types of attacks and discuss limitations in current approaches to improving adversarial robustness.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

21 Jan

This paper introduces DeepSeek-R1, a model designed to enhance reasoning capabilities in large language models (LLMs) through reinforcement learning (RL). It discusses the development of two models, DeepSeek-R1-Zero and DeepSeek-R1, highlighting their performance on various reasoning tasks and the challenges faced during training. The paper emphasizes the importance of cold-start data and iterative RL fine-tuning in improving model performance, as well as the potential for distilling reasoning capabilities into smaller models.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

21 Jan

This paper discusses the development of DeepSeek LLMs, focusing on scaling laws, pre-training methodologies, and evaluation results across various benchmarks in both English and Chinese. It highlights the importance of data quality in model performance and outlines future directions for enhancing the capabilities of open-source language models.

The GAN is dead; long live the GAN! A Modern Baseline GAN

12 Jan

This paper introduces R3GAN, a new baseline GAN that features increased stability, leverages modern architectures, and does not require ad-hoc tricks that are commonplace in existing GAN models. The authors demonstrate that their approach achieves competitive performance across various datasets while simplifying the training process.

The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input

12 Jan

This paper introduces the FACTS Grounding leaderboard, a benchmark designed to evaluate the ability of language models to generate factually accurate long-form responses based on provided context. The benchmark includes various metrics and results from multiple judge models, aiming to improve the evaluation of factuality in language generation tasks.

A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models

12 Dec

This paper presents a novel training approach for large language models (LLMs) aimed at improving their performance in machine translation tasks. The proposed method, named Advanced Language Model-based Translator (ALMA), involves a two-stage fine-tuning process that utilizes monolingual data followed by high-quality parallel data. The results demonstrate significant improvements in translation performance compared to existing models, highlighting the potential of this new training paradigm.

Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents

11 Dec

This paper presents a pattern catalogue consisting of 18 architectural patterns for designing foundation model-based agents. It addresses challenges in goal-seeking and plan generation, providing guidance for practitioners through a systematic literature review and proposing a decision model for selecting appropriate patterns.

PaliGemma 2: A Family of Versatile VLMs for Transfer

07 Dec

PaliGemma 2 is an upgraded vision-language model that enhances transfer performance across various tasks by utilizing a family of models trained at different resolutions and sizes. It achieves state-of-the-art results in multiple OCR-related tasks and demonstrates significant improvements over its predecessor, PaliGemma.

Bottom-Up and Top-Down Analysis of Values, Agendas, and Observations in Corpora and LLMs

27 Nov

This paper presents a novel approach to analyzing large language models (LLMs) by extracting and assessing socio-cultural values from their outputs. It combines top-down and bottom-up methodologies to characterize value alignment and pluralism in both human-sourced and LLM-generated texts, demonstrating high accuracy in value extraction and resonance assessment.

Stable Flow: Vital Layers for Training-Free Image Editing

26 Nov

This paper presents Stable Flow, a training-free method for image editing that utilizes vital layers within diffusion models to perform various editing tasks. The authors propose an automatic method to identify these vital layers and demonstrate their effectiveness through qualitative and quantitative comparisons against existing methods. The study includes user evaluations and discusses limitations and potential applications of the proposed approach.

FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

26 Nov

This paper presents FocusLLaVA, a novel approach for visual token compression that enhances both efficiency and performance by leveraging visual and textual information. The method employs a coarse-to-fine strategy to remove visual redundancy while maintaining critical information relevant to user instructions. Extensive experiments validate its effectiveness across various multimodal benchmarks.

Logic Augmented Generation

26 Nov

This paper introduces Logic Augmented Generation (LAG), a novel approach that integrates Semantic Knowledge Graphs (SKGs) with Reactive Continuous Knowledge Graphs (RCKGs) to enhance logical consistency and facilitate the extraction of knowledge using Large Language Models (LLMs). The authors discuss the challenges faced by traditional SKGs in open-ended tasks and propose LAG as a solution that leverages the strengths of both SKGs and LLMs, particularly in domains requiring collective intelligence such as medical diagnostics and climate services.

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

26 Nov

This paper explores the mechanisms behind knowledge awareness and hallucinations in language models, utilizing sparse autoencoders to identify directions in the representation space that encode self-knowledge about entities. The findings reveal how these directions influence knowledge refusal behavior and attention regulation, contributing to a better understanding of language model operations and potential improvements in their reliability.

1 Introduction

26 Nov

This paper investigates the robustness of claims regarding zero-shot analogical reasoning in large language models (LLMs) by evaluating their performance on various analogy tasks compared to human performance. The study focuses on three domains: letter-string analogies, digit matrices, and story analogies, assessing both LLMs and humans on original and variant problems that require similar abstract reasoning but differ from training data. Results indicate that while LLMs show some capability for abstract reasoning, they often lack the robustness exhibited by humans, particularly when faced with variations in task structure.

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

26 Nov

The article introduces OpenScholar, a retrieval-augmented language model designed to assist in synthesizing scientific literature. It discusses the challenges faced in literature synthesis and presents ScholarQABench, a benchmark for evaluating models' abilities to synthesize information from multiple papers. The results demonstrate that OpenScholar outperforms existing systems and human experts in various tasks related to scientific literature review.

Generative Agent Simulations of 1,000 People

22 Nov

This article presents a novel generative agent architecture that simulates the attitudes and behaviors of over 1,000 real individuals through qualitative interviews and large language models. The study evaluates the predictive performance of these agents in replicating human behavior across various social science constructs, demonstrating significant improvements in accuracy and bias reduction compared to traditional demographic-based models.

Number it: Temporal Grounding Videos like Flipping Manga

19 Nov

This paper introduces Number-Prompt (NumPro), a method designed to enhance the video temporal grounding capabilities of Video Large Language Models (Vid-LLMs) by overlaying frame numbers onto video content. The study demonstrates that NumPro significantly improves the ability of Vid-LLMs to accurately map events to specific temporal boundaries, achieving state-of-the-art performance in both training-free and fine-tuned settings while maintaining robust general video comprehension.

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

19 Nov

This paper presents LLaVA-o1, a novel vision language model that performs structured, autonomous reasoning in multiple stages. By introducing four distinct stages—summary, caption, reasoning, and conclusion—LLaVA-o1 achieves a systematic reasoning process. The contributions include the creation of the LLaVA-o1-100k dataset with detailed reasoning annotations and the proposal of a stage-level beam search method for effective inference time scaling, establishing a new standard for multimodal reasoning in VLMs.

Convolutional Differentiable Logic Gate Networks

15 Nov

This paper introduces convolutional differentiable logic gate networks, which integrate various concepts from machine vision into differentiable logic gate networks. The authors propose residual initializations to reduce information loss in deeper networks and prevent vanishing gradients, enabling training of deeper networks than previously possible. They also introduce logical OR pooling, which improves training efficiency. The proposed architecture, LogicTreeNet, significantly reduces model sizes while improving accuracy compared to state-of-the-art models.
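
As a rough illustration of logical OR pooling, the sketch below pools non-overlapping windows by taking the maximum. On hard 0/1 activations this coincides with Boolean OR. Treating OR over relaxed values in [0, 1] as a maximum is an assumption made here for illustration, and the function name and window size are hypothetical rather than taken from the paper.

```python
def or_pool2d(grid, size=2):
    """Pool non-overlapping size x size windows of a 2D grid with max.

    For binary inputs, max over a window is exactly the logical OR of its entries.
    Rows or columns that do not fill a complete window are ignored.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


if __name__ == "__main__":
    activations = [
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
    ]
    print(or_pool2d(activations))
```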

Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces

12 Nov

The paper presents Dualformer, a Transformer model that integrates fast and slow reasoning modes to enhance performance in reasoning tasks while reducing computational costs. It employs structured trace dropping techniques during training to mimic human cognitive processes, achieving improved results in maze navigation and math problem-solving.

OmniGen: Unified Image Generation

09 Nov

The paper introduces OmniGen, a unified diffusion model for image generation that integrates various tasks such as text-to-image generation, image editing, and subject-driven generation into a single framework. It emphasizes simplicity and knowledge transfer across tasks, showcasing competitive performance against existing models while addressing limitations in current image generation methodologies.

On Memorization of Large Language Models in Logical Reasoning

02 Nov

This paper investigates the memorization behaviors of large language models (LLMs) in logical reasoning tasks, proposing a new metric to quantify memorization. The study reveals that while LLMs can memorize training examples, this does not impede their reasoning capabilities. Through various analyses, the authors explore the interplay between memorization and reasoning, providing insights into how LLMs learn to solve logical puzzles.

$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources

02 Nov

This paper investigates the feasibility of pre-training large models using academic resources, highlighting the challenges faced by researchers due to limited compute availability. It presents a survey of academic researchers' access to GPUs and empirically measures the time required to replicate various models on these resources. The findings suggest that with optimizations, it is possible for academic groups to train large models more efficiently than previously thought.

Do Large Language Models Solve Arithmetic with a Bag of Heuristics?

29 Oct

This paper investigates whether large language models (LLMs) solve arithmetic tasks through robust algorithms or memorization. It identifies a mechanism termed 'bag of heuristics' that combines various memorized rules to perform arithmetic reasoning, revealing insights into the internal workings of LLMs and their training processes.

CDChat: A Large Multimodal Model for Remote Sensing Change Description

29 Oct

This article presents CDChat, a large multimodal model designed to describe changes between remote sensing images. It highlights the limitations of existing models in this domain and introduces a new dataset for instruction-tuning that enhances performance in change description tasks. The study evaluates CDChat's effectiveness against other models, demonstrating its superior capabilities in accurately describing changes and counting change regions in bi-temporal images.

Continuous Speech Synthesis using per-token Latent Diffusion

29 Oct

This article introduces SALAD, a per-token latent diffusion model for zero-shot text-to-speech synthesis that operates on continuous representations. It builds upon existing techniques and demonstrates superior intelligibility while maintaining high speech quality and speaker similarity compared to traditional methods.

LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

29 Oct

This work presents the Large View Synthesis Model (LVSM), a transformer-based approach designed to minimize 3D inductive biases for scalable and generalizable novel view synthesis. Its two architectures, encoder-decoder and decoder-only, bypass physical-rendering-based 3D representations like NeRF and 3D Gaussian Splatting, allowing the model to learn priors directly from data. The decoder-only LVSM, with its minimal inductive biases, excels in scalability, zero-shot generalization, and rendering quality, while the encoder-decoder LVSM achieves faster inference due to its fully learned latent scene representation. Both models demonstrate superior performance across diverse benchmarks and mark a step towards general and scalable novel view synthesis in complex, real-world scenarios.

DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization

23 Oct

This paper presents DMDSpeech, a distilled diffusion-based text-to-speech model that achieves high-quality speech synthesis through direct metric optimization. By employing distribution matching distillation, the model generates speech efficiently while improving speaker similarity and intelligibility compared to state-of-the-art models.

AI can help humans find common ground in democratic deliberation

20 Oct

This research project evaluates a new AI-based approach to human collective deliberation, which involves using an AI system as a 'caucus mediator.' The study shows that group statements produced by the AI mediator, the Habermas Machine, won broad-based agreement from participants and were preferred to those written by human mediators. After interacting with the AI, groups were often less divided, converging to a common stance on social and political issues, demonstrating the potential of AI to enhance collective decision-making.

Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient

17 Oct

This paper introduces refined variants of the Local Learning Coefficient (LLC) to study the differentiation and specialization of attention heads in transformer language models during training. The findings reveal how attention heads evolve into distinct functional roles, analyze their specialization based on data types, and uncover a novel multigram circuit, contributing to a deeper understanding of model complexity and interpretability.

Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces

17 Oct

This paper presents Dualformer, a Transformer model that integrates both fast and slow reasoning modes to enhance performance in reasoning tasks while reducing computational costs. The model is trained using randomized reasoning traces, allowing it to adaptively switch between modes during inference. The effectiveness of Dualformer is demonstrated through experiments on maze navigation tasks and its application in fine-tuning large language models for math reasoning.

ToolGen: Unified Tool Retrieval and Calling via Generation

14 Oct

This paper introduces ToolGen, a framework that integrates tool retrieval and execution in large language models (LLMs) by embedding tool-specific virtual tokens into the model’s vocabulary. The framework transforms tool interaction into a generative task, allowing LLMs to efficiently retrieve and execute tools in real-world scenarios. ToolGen's three-stage training process enhances the capabilities of AI agents, setting a new benchmark for scalable and efficient tool usage.

SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

14 Oct

This paper introduces SFTMix, a novel approach to instruction tuning for large language models (LLMs) that leverages training dynamics to identify data subsets of varying confidence levels and incorporates a Mixup-based regularization technique. The proposed method aims to enhance instruction-following capabilities without relying on well-curated datasets, demonstrating significant improvements over conventional methods across various tasks and LLM families.

MatMamba: A Nested Matryoshka Structure on Mamba2 State Space Models

12 Oct

This work presents MatMamba, a way to impose a nested Matryoshka structure on a Mamba2 state space model. It combines the strengths of Mamba-style models (faster inference times, especially for longer sequences) with Matryoshka-style learning. A single MatMamba model contains hundreds of nested and accurate submodels that can be flexibly extracted for inference.

EVOLvE: Evaluating and Optimizing LLMs For Exploration

10 Oct

This work explores the in-context exploration capabilities of large language models (LLMs) in bandit environments, introducing BanditBench as a benchmark for evaluating their performance. The study finds that LLMs struggle with exploration when relying solely on raw interaction history, but performance improves significantly with inference-time support and algorithmic guidance. The authors propose methods to integrate optimal algorithms into LLMs through both algorithm-guided support and algorithm distillation, demonstrating that smaller models can outperform larger ones in decision-making tasks.

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

10 Oct

This article investigates the reasoning capabilities of large language models (LLMs) and introduces GSM-Symbolic, a novel benchmark designed to provide deeper insights into their mathematical reasoning abilities. The study reveals significant performance variability across different instantiations of the same question, challenging the reliability of current evaluations on GSM8K. It highlights that while LLMs show some robustness to changes in proper names, they are more sensitive to variations in numerical values, and their performance deteriorates as question complexity increases. The introduction of GSM-NoOp exposes critical flaws in LLMs' understanding of mathematical concepts, indicating that their reasoning resembles sophisticated pattern matching rather than true logical reasoning.

The Platonic Representation Hypothesis

09 Oct

This paper argues that representations in AI models, particularly deep networks, are converging towards a shared statistical model of reality. It surveys examples of convergence across different domains and modalities, hypothesizing that this trend is driven by scaling and performance improvements. The authors discuss the implications of this convergence, its limitations, and counterexamples, while also exploring how different models align with biological representations in the brain.

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

09 Oct

This article discusses the challenges and advancements in quantizing large language models (LLMs) to reduce their size while maintaining performance. It introduces Vector Post-Training Quantization (VPTQ) as a new method that aims to achieve state-of-the-art accuracy with extremely low-bit quantization, addressing issues related to traditional scalar quantization methods and exploring the benefits of vector quantization.

Contextual Document Embeddings

09 Oct

This paper presents two improvements to traditional biencoder models for generating embeddings, focusing on contextual training strategies and a new corpus-aware architecture that enhances performance in text retrieval tasks.

Were RNNs All We Needed?

07 Oct

This paper revisits traditional recurrent neural networks (RNNs) such as LSTMs and GRUs, proposing minimal versions that remove hidden state dependencies to enhance training efficiency. The authors demonstrate that these minimal models can be trained in parallel, achieving performance comparable to modern sequence models while maintaining computational efficiency.

Movie Gen: A Cast of Media Foundation Models

04 Oct

This article presents Movie Gen, a cast of foundation models developed by Meta that generates high-quality videos with various capabilities including text-to-video synthesis, video personalization, and precise video editing. The paper discusses the architecture, training methods, and performance benchmarks of these models, aiming to advance the field of media generation.

UniAudio: An Audio Foundation Model Toward Universal Audio Generation

01 Oct

This paper presents the UniAudio system, a unified model leveraging large language model techniques to generate various types of audio including speech, sounds, music, and singing. It aims to address the emergent needs in audio generation by providing a foundation model that can seamlessly support multiple audio generation tasks with competitive performance across them.

Grokking Through Compression: Unveiling Sudden Generalization via Minimal Description

01 Oct

The paper investigates the phenomenon of grokking in neural networks through the lens of Minimal Description Length (MDL), offering an information-theoretic perspective on sudden generalization. The authors propose a method to estimate and track MDL during training using weight pruning techniques. Experiments on modular arithmetic and permutation tasks reveal a strong connection between MDL transitions and grokking points, with varying dynamics across different tasks.

Emu3: Next-Token Prediction is All You Need

01 Oct

This paper introduces Emu3, a new series of multimodal models that excel at multimodal generation and perception through next-token prediction. By tokenizing images, text, and videos into a discrete space and training a single transformer from scratch, it eliminates reliance on diffusion and compositional methods while surpassing established task-specific models. The results provide compelling evidence that next-token prediction can serve as a powerful paradigm for multimodal models, scaling beyond language models and delivering state-of-the-art performance across diverse tasks, including challenging video generation.

Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI

30 Sep

The article critiques the prevailing 'bigger-is-better' paradigm in AI, arguing that this focus on scale leads to unsustainable practices, environmental concerns, and a concentration of power among a few large players. It emphasizes the need for a shift towards smaller, more efficient models that can address a wider range of applications without the excessive resource demands associated with large-scale systems.

MaskBit: Embedding-free Image Generation via Bit Tokens

25 Sep

This paper presents MaskBit, a novel embedding-free image generation model that utilizes bit tokens for class-conditional image generation. The authors systematically study and modernize the VQGAN architecture, leading to significant improvements in performance and accessibility. They demonstrate that their approach achieves state-of-the-art results on the ImageNet benchmark while addressing limitations of previous models.

ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback

25 Sep

The paper proposes a hybrid algorithm called ARES that alternates between reinforcement learning and supervised fine-tuning to enhance multi-modal rationale reasoning for tasks like ScienceQA and A-OKVQA. It leverages advanced AI feedback to improve the quality of generated rationales and inference accuracy, addressing challenges in traditional reinforcement learning methods.

LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench

25 Sep

This article evaluates the planning capabilities of large language models (LLMs) and introduces OpenAI's new Large Reasoning Model (LRM), o1, using the PlanBench benchmark. It discusses the performance improvements of o1 over previous models while highlighting ongoing challenges in achieving robust planning abilities.

Writing in the Margins: Better Inference Pattern for Long Context Retrieval

24 Sep

This paper introduces a new inference pattern called Writing in the Margins (WiM), which enhances the performance of Large Language Models (LLMs) in handling long input sequences for retrieval-oriented tasks. The WiM approach utilizes chunked prefill of the key-value cache to perform segment-wise inference, improving accuracy and efficiency without requiring fine-tuning. The study demonstrates significant performance boosts across various benchmarks, emphasizing the method's compatibility with existing transformer models and its potential for enhancing user experience through increased transparency in context processing.

QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

24 Sep

This study addresses challenges in text-to-music generation by introducing a quality-aware masked diffusion transformer (QA-MDT) that enhances music generation through improved data quality and alignment between textual descriptions and audio signals. The proposed method demonstrates state-of-the-art performance on benchmark datasets, showcasing its effectiveness in generating high-quality music from textual inputs.

Position: LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks

24 Sep

This position paper argues that while Large Language Models (LLMs) cannot independently perform planning or self-verification, they can serve as valuable cognitive tools within a framework called LLM-Modulo. This framework integrates LLMs with external model-based verifiers to enhance planning tasks, addressing misconceptions about LLM capabilities and proposing a structured approach to leverage their strengths effectively.

CLAIR_A: Leveraging Large Language Models to Judge Audio Captions

23 Sep

This paper introduces CLAIR_A, a simple and interpretable domain-specific LLM-based measure for audio captioning. The authors demonstrate that this approach aligns well with human judgments and is significantly more interpretable to downstream users. They aim to inspire further research into the alignment of LLMs with human judgment in various audio domains.

Score Forgetting Distillation: A Swift, Data-Free Method for Machine Unlearning in Diffusion Models

23 Sep

This paper introduces Score Forgetting Distillation (SFD), a novel approach to machine unlearning in diffusion models that allows for effective forgetting of undesirable information while maintaining generative capabilities. The method is data-free and demonstrates significant improvements in both performance metrics and generation speed compared to traditional methods.

Autoregressive + Chain of Thought ≃ Recurrent: Recurrence’s Role in Language Models’ Computability and a Revisit of Recurrent Transformer

23 Sep

This work analyzes the roles of autoregression and recurrence in neural models, demonstrating that recurrence enhances computational depth. It explains how Chain of Thought (CoT) approximates recurrence in Transformer-based autoregressive models and discusses the implications for model design and performance across various tasks.

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

20 Sep

This article analyzes the effectiveness of chain-of-thought (CoT) prompting in large language models, particularly focusing on its performance in mathematical and symbolic reasoning tasks. Through a meta-analysis of existing literature and experimental evaluations, the authors find that while CoT can enhance reasoning capabilities, its benefits are primarily observed in tasks requiring mathematical or logical reasoning, suggesting a need for more sophisticated approaches beyond traditional prompting methods.

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

17 Sep

This paper explores the effectiveness of the Chain of Thought (CoT) approach in enhancing the performance of transformer models, particularly in solving inherently serial problems. It provides a theoretical framework for understanding how CoT improves the expressiveness of transformers by enabling them to perform serial computations that are typically challenging for standard architectures. The authors present empirical results demonstrating the advantages of CoT in various tasks, alongside discussions on related works and future directions for research.

DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

17 Sep

This paper proposes an approach to domain shift in face anti-spoofing that separates the shift into measurable quality and style components. It introduces the Relative Quality loss, which integrates quality scores into the training loss and adds a quality prior to the classification model. It also presents DiffFAS, a generative model for high-fidelity cross-domain and cross-attack generation that addresses the lack of labeled data for novel attack types.

Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning

16 Sep

This paper introduces reflective augmentation (RefAug) for enhancing mathematical reasoning in language models. It incorporates reflection into training problems, complementing existing data augmentation techniques. Extensive experiments demonstrate RefAug's efficacy in improving both basic problem-solving skills and complex reflective reasoning tasks. The method is also validated in code generation tasks, showing scalability and robustness through various analyses.

V-STaR: Training Verifiers for Self-Taught Reasoners

16 Sep

Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.

Let’s Verify Step by Step

16 Sep

The article discusses the comparison between outcome supervision and process supervision in training large language models, particularly in mathematical reasoning tasks. It highlights the advantages of process supervision, such as providing precise feedback and improving AI alignment. The study finds that process supervision significantly outperforms outcome supervision, especially when scaled up, and introduces active learning to enhance data collection efficiency. The authors release a dataset, PRM800K, to support further research in this area.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

16 Sep

Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this gap through supervised fine-tuning on curated expert demonstrations often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks. We validate our approach in the WebShop environment, a simulated e-commerce platform, where it consistently outperforms behavior cloning and reinforced fine-tuning baselines and beats average human performance when equipped with the capability to do online search. In real-world booking scenarios, our methodology boosts the Llama-3 70B model's zero-shot performance from 18.6% to 81.7% success rate (a 340% relative increase) after a single day of data collection and further to 95.4% with online search. We believe this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more sophisticated and reliable decision-making in real-world settings.

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

16 Sep

The article introduces Quiet-STaR, a method for training language models to generate rationales at each token to improve predictions. It addresses challenges such as computational cost and the need for predicting beyond individual tokens. The approach shows improvements in zero-shot reasoning tasks without fine-tuning on specific datasets.

AWRaCLe: All-Weather Image Restoration using Visual In-Context Learning

12 Sep

Learning to Reason with LLMs

12 Sep

Prompt2Fashion: An automatically generated fashion dataset

12 Sep

Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models

12 Sep

ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

11 Sep

Sapiens: Foundation for Human Vision Models

11 Sep

Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

11 Sep

Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models

11 Sep

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

09 Sep

Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation

05 Sep

LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation

05 Sep

Atari-GPT: Investigating the Capabilities of Multimodal Large Language Models as Low-Level Policies for Atari Games

05 Sep

Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

05 Sep

WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

05 Sep

Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training

05 Sep

[summary] Enhancing Sound Source Localization via False Negative Elimination

02 Sep

Towards Real-world Event-guided Low-light Video Enhancement and Deblurring

02 Sep

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

02 Sep

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

02 Sep

Diffusion Models Are Real-Time Game Engines

02 Sep

Text2SQL is Not Enough: Unifying AI and Databases with TAG

29 Aug

Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress

29 Aug

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

29 Aug

T3M: Text Guided 3D Human Motion Synthesis from Speech

27 Aug

Open-Endedness is Essential for Artificial Superhuman Intelligence

27 Aug

Aurora: A Foundation Model of the Atmosphere

24 Aug

Pano2Room: Novel View Synthesis from a Single Indoor Panorama

24 Aug

[summary] ConvLoRA and AdaBN Based Domain Adaptation via Self-Training

23 Aug

[summary] A Survey on Benchmarks of Multimodal Large Language Models

21 Aug

[summary] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

21 Aug

[summary] SAM-UNet: Enhancing Zero-Shot Segmentation of SAM for Universal Medical Images

21 Aug

[summary] Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track

21 Aug

[summary] SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation

21 Aug

[summary] TurboEdit: Instant text-based image editing

20 Aug

[summary] MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

20 Aug

[summary] BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

19 Aug

[summary] DifuzCam: Replacing Camera Lens with a Mask and a Diffusion Model

19 Aug

[summary] Comparative Evaluation of 3D Reconstruction Methods for Object Pose Estimation

19 Aug

[summary] BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt Learning

19 Aug

[summary] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

19 Aug

[summary] Prompt Cache: Modular Attention Reuse for Low-Latency Inference

19 Aug

[summary] Prompt-Based Segmentation at Multiple Resolutions and Lighting Conditions using Segment Anything Model 2

18 Aug

[summary] LLM4DSR: Leveraging Large Language Model for Denoising Sequential Recommendation

18 Aug

[summary] Interpretable Graph Neural Networks for Heterogeneous Tabular Data

18 Aug

[summary] VIRUS-NeRF - Vision, InfraRed and UltraSonic based Neural Radiance Fields

18 Aug

[summary] SLCA++: Unleash the Power of Sequential Fine-tuning for Continual Learning with Pre-training

17 Aug

[summary] The Clever Hans Effect in Unsupervised Learning

17 Aug

[summary] Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images

17 Aug

[summary] Towards flexible perception with visual memory

17 Aug

[summary] Towards flexible perception with visual memory

16 Aug

[summary] VITA: Towards Open-Source Interactive Omni Multimodal LLM

15 Aug

[summary] UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

15 Aug

[summary] BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion

15 Aug

[summary] ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

15 Aug

[summary] How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model

15 Aug

[summary] HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction

14 Aug

[summary] Transformer Explainer: Interactive Learning of Text-Generative Models

14 Aug

[summary] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

14 Aug

[summary] UGrid: An Efficient-And-Rigorous Neural Multigrid Solver for Linear PDEs

14 Aug

[summary] In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

14 Aug

[summary] Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

13 Aug

[summary] Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation

13 Aug

[summary] 1.5-Pints Technical Report: Pretraining in Days, Not Months – Your Language Model Thrives on Quality Data

12 Aug

[summary] VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

12 Aug

[summary] Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

12 Aug

[summary] Medical SAM 2: Segment medical images as video via Segment Anything Model 2

12 Aug

[summary] AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning

12 Aug

[summary] Sampling for View Synthesis: From Local Light Field Fusion to Neural Radiance Fields and Beyond

12 Aug

[summary] L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection

12 Aug

[summary] Bias-Aware Low-Rank Adaptation: Mitigating Catastrophic Inheritance of Large Language Models

12 Aug

[summary] FORGE: Force-Guided Exploration for Robust Contact-Rich Manipulation under Uncertainty

12 Aug

[summary] The Ungrounded Alignment Problem

12 Aug

[summary] MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models

12 Aug

[summary] Tree Attention: Topology-Aware Decoding for Long-Context Attention on GPU Clusters

12 Aug

Trending AI Papers

Casts

Reinforcement Learning: a brief overview

DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Transformers without Normalization

All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

When does a predictor know its own loss?

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Trading inference-time compute for adversarial robustness.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek LLM Scaling Open-Source Language Models with Longtermism

The GAN is dead; long live the GAN! A Modern Baseline GAN

The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input

A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models

Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents

PaliGemma 2: A Family of Versatile VLMs for Transfer

Bottom-Up and Top-Down Analysis of Values, Agendas, and Observations in Corpora and LLMs

Stable Flow: Vital Layers for Training-Free Image Editing

FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

Logic Augmented Generation

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

1 Introduction

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

Generative Agent Simulations of 1,000 People

Number it: Temporal Grounding Videos like Flipping Manga

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Convolutional Differentiable Logic Gate Networks

Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces

OmniGen: Unified Image Generation

On Memorization of Large Language Models in Logical Reasoning

$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources

Do Large Language Models Solve Arithmetic with a Bag of Heuristics?

CDChat: A Large Multimodal Model for Remote Sensing Change Description

Continuous Speech Synthesis using per-token Latent Diffusion

LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization

AI can help humans find common ground in democratic deliberation

Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient

ToolGen: Unified Tool Retrieval and Calling via Generation

SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe
