ML News

Arxiv

Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers

  • This study investigates the relationship between memorization and generalization in large language models (LLMs).
  • Pre-training capacity-limited Transformer models from scratch on synthetic character-level tasks reveals a trade-off between memorization and generalization (a toy illustration of such a task follows this list).
  • Small models excel at extrapolating to unseen arithmetic cases but fail at memorization, whereas larger models memorize well but struggle with extrapolation.
  • Models of intermediate capacity likewise shift toward memorization rather than generalization.
  • When trained on both tasks together, no model size succeeds at extrapolation.
  • The findings suggest that pre-training may inherently prioritize one learning mode over the other.
  • By examining these dynamics in a controlled setting, the study shows how model capacity shapes learning behavior, with implications for the design and deployment of small language models.
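The paper's synthetic setup is only summarized above; the snippet below is a minimal sketch of what a character-level arithmetic task with separate memorization and extrapolation splits could look like. The task format, operand ranges, and split sizes are assumptions for illustration, not the authors' configuration.

```python
import random

# Hypothetical character-level addition task (assumed format, not the paper's).
def make_addition_example(lo, hi):
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    return f"{a}+{b}=", str(a + b)  # prompt / target, modeled one character at a time

# "Memorization" split: a fixed pool of small-operand sums seen repeatedly in pre-training.
train_pool = [make_addition_example(0, 99) for _ in range(5_000)]

# "Extrapolation" split: operand ranges never seen during pre-training; a small model
# that has learned the carrying rule can still solve these, a memorizing model cannot.
extrapolation_set = [make_addition_example(100, 999) for _ in range(1_000)]

print(train_pool[0], extrapolation_set[0])
```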


Arxiv

Feature Shift Localization Network

  • Feature shifts between data sources are common in many applications and introduce erroneous features into the data.
  • Localizing the shifted features is crucial for correcting or filtering the data and preserving the integrity of downstream analyses.
  • Detecting that a distribution shift has occurred is feasible, but localizing the features it originates from remains challenging (a classical per-feature baseline is sketched after this list for contrast).
  • Existing localization methods are either inaccurate or do not scale to large datasets.
  • This work introduces a new approach, the Feature Shift Localization Network (FSL-Net).
  • FSL-Net is a neural network designed to localize feature shifts quickly and accurately in large, high-dimensional datasets.
  • The network is trained on a diverse collection of datasets to learn their statistical properties and can localize shifts in new datasets without re-training.
  • The FSL-Net model and code are publicly available at https://github.com/AI-sandbox/FSL-Net.
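FSL-Net's own interface is not described in the summary, so the sketch below shows a classical per-feature baseline instead: run a two-sample test on every feature and flag those whose marginal distributions differ. This is a generic point of comparison, not the paper's network, which replaces such tests with a learned, training-free localizer.

```python
import numpy as np
from scipy.stats import ks_2samp

def localize_shifted_features(reference, query, alpha=0.01):
    """Flag features whose marginal distribution differs between two datasets
    using per-feature Kolmogorov-Smirnov tests (a classical baseline, not FSL-Net)."""
    n_features = reference.shape[1]
    pvals = np.array([ks_2samp(reference[:, j], query[:, j]).pvalue
                      for j in range(n_features)])
    return np.where(pvals < alpha / n_features)[0]  # Bonferroni-corrected threshold

rng = np.random.default_rng(0)
ref = rng.normal(size=(2_000, 50))
qry = rng.normal(size=(2_000, 50))
qry[:, 7] += 0.5                                    # inject a shift into feature 7
print(localize_shifted_features(ref, qry))          # expected to flag feature 7
```

Per-feature tests like this become slow and unreliable in high dimensions, which is the gap a learned localizer such as FSL-Net targets.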


Arxiv

Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs

  • The rapid scaling of large language models (LLMs) makes them difficult to deploy on resource-constrained devices.
  • This has driven growing interest in extremely low-bit quantization, such as 2-bit quantization.
  • Prior work has shown that 2-bit LLMs are Pareto-optimal over 4-bit models in accuracy and latency, particularly for pre-trained LLMs.
  • However, these advances in 2-bit quantization have not been extended to instruction-tuned models.
  • To bridge this gap, Unified Progressive Quantization (UPQ) is proposed: a framework that combines block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (Distill-QAT) for 2-bit instruction-tuned LLM quantization.
  • UPQ first quantizes FP16 instruction-tuned models to INT4 using block-wise PTQ to reduce quantization error before quantizing further to INT2 (a simplified sketch of both stages follows this list).
  • Distill-QAT is then applied so that the INT2 instruction-tuned LLM produces responses consistent with its original FP16 counterpart.
  • UPQ can quantize open-source instruction-tuned LLMs to 2 bits without relying on proprietary post-training data.
  • UPQ achieves state-of-the-art results on MMLU and IFEval, benchmarks commonly used to evaluate instruction-tuned LLMs.
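A minimal sketch of the two ideas, assuming simple symmetric per-channel quantization and a KL-based distillation loss; UPQ's actual block-wise PTQ procedure and Distill-QAT objective are more involved.

```python
import torch
import torch.nn.functional as F

def quantize_weights(w, bits):
    """Symmetric per-channel weight quantization (simplified stand-in for PTQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

w_fp16 = torch.randn(4096, 4096)
w_int4 = quantize_weights(w_fp16, bits=4)   # stage 1: quantize FP16 -> INT4 first
w_int2 = quantize_weights(w_int4, bits=2)   # stage 2: then push INT4 -> INT2

# Distill-QAT idea (sketch): train the INT2 student so its token distribution
# matches the FP16 teacher's, preserving instruction-following behaviour.
teacher_logits = torch.randn(8, 32_000)
student_logits = torch.randn(8, 32_000, requires_grad=True)
loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                F.log_softmax(teacher_logits, dim=-1),
                log_target=True, reduction="batchmean")
loss.backward()
```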


Arxiv

MetaTT: A Global Tensor-Train Adapter for Parameter-Efficient Fine-Tuning

  • MetaTT is a Tensor-Train (TT) adapter framework for global low-rank fine-tuning of pre-trained transformers.
  • Unlike LoRA, MetaTT uses a single shared TT that factorizes all transformer sub-modules, such as the query, key, value, projection, and feed-forward layers, by indexing structural axes.
  • For a given rank, MetaTT adds parameters proportional to the sum across modes, yielding a final adapter that is much more compressed than LoRA's (a back-of-the-envelope comparison follows this list).
  • Benchmarks on standard language-modeling tasks show that MetaTT substantially reduces the number of adapter parameters while matching LoRA's accuracy and outperforming other tensor-based methods.
  • The TT ansatz benefits from mature optimization routines, such as DMRG-style rank-adaptive minimization and Adam, making training simpler than with other tensor-factorization methods.
  • MetaTT also allows new modes to be appended cheaply, so adapters can be shared across multiple tasks without redesigning the core tensor.
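A rough parameter count illustrates why sharing one tensor train across layers and modules compresses the adapter. The hidden size, rank, and layer/module counts below are assumed values, and the TT count folds the boundary ranks in for simplicity; it is not the paper's reported configuration.

```python
# Illustrative parameter comparison (assumed sizes, not the paper's reported setups).
d_model, rank = 4096, 16          # hidden size and adapter rank
n_layers, n_modules = 32, 6       # e.g. q, k, v, o, up, down per layer (assumed)

# LoRA: an independent pair of (d x r) factors for every adapted weight matrix.
lora_params = n_layers * n_modules * 2 * d_model * rank

# A single shared tensor train over the axes (layer, module, d_in, d_out):
# each axis contributes one core, so parameters grow with the SUM of the mode
# sizes times the TT ranks rather than with the number of adapted matrices.
modes = [n_layers, n_modules, d_model, d_model]
tt_params = sum(rank * m * rank for m in modes)

print(f"LoRA adapter parameters:      {lora_params:,}")   # ~25.2M
print(f"Shared TT adapter parameters: {tt_params:,}")     # ~2.1M
```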


Arxiv

SensorLM: Learning the Language of Wearable Sensors

  • SensorLM is introduced as a family of sensor-language foundation models for understanding wearable sensor data with natural language.
  • Aligning and interpreting sensor data with language is difficult because real-world wearable data lacks paired, richly annotated sensor-text descriptions.
  • SensorLM uses a hierarchical caption-generation pipeline to extract statistical, structural, and semantic information from sensor data, creating the largest sensor-language dataset to date, with over 59.7 million hours of data from 103,000 individuals.
  • It extends multimodal pretraining architectures such as CLIP and CoCa (a generic sketch of CLIP-style alignment follows this list) and outperforms state-of-the-art methods in zero-shot recognition, few-shot learning, and cross-modal retrieval on human activity analysis and healthcare tasks.
  • SensorLM also demonstrates favorable scaling behavior, label efficiency, sensor captioning, and zero-shot generalization to new tasks.
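Since the summary says SensorLM builds on CLIP-style pretraining, a generic symmetric contrastive loss between paired sensor and text embeddings is sketched below. The embedding size and temperature are placeholders, and SensorLM's actual objectives (including its CoCa-style captioning head) go beyond this.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired sensor and text embeddings
    (generic CLIP-style sketch, not SensorLM's exact objective)."""
    sensor_emb = F.normalize(sensor_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = sensor_emb @ text_emb.t() / temperature   # batch x batch similarity matrix
    targets = torch.arange(len(logits))                # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_style_loss(torch.randn(32, 512), torch.randn(32, 512))
print(loss.item())
```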


Arxiv

CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model

  • Researchers introduce CodeBrain, an efficient EEG foundation model for capturing multi-scale brain dependencies.
  • CodeBrain addresses challenges that traditional EEG models face with heterogeneous channel configurations and task objectives.
  • CodeBrain is trained in two stages: a TFDual-Tokenizer for heterogeneous temporal and frequency tokenization, and EEGSSM for modeling dependencies.
  • The TFDual-Tokenizer quadratically expands the discrete representation space and offers interpretability through cross-domain token analysis.
  • EEGSSM combines a global convolution architecture with sliding-window attention to capture long-range and local dependencies efficiently.
  • EEGSSM reflects the brain's small-world topology better than fully connected Transformer models.
  • Training uses a masked self-supervised objective in which the model predicts token indices (a generic sketch of this objective follows this list).
  • Experiments on 10 public EEG datasets demonstrate CodeBrain's generalizability via linear probing.
  • CodeBrain offers biologically informed and interpretable EEG modeling, laying a foundation for future neuroscience research.
  • The code and pretrained weights for CodeBrain will be released in a future version.
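Masked prediction of discrete token indices is a standard self-supervised recipe; the sketch below shows its general shape with a placeholder vocabulary size and masking ratio, and a random tensor standing in for the EEGSSM encoder's output. None of these values reflect CodeBrain's actual settings.

```python
import torch
import torch.nn.functional as F

vocab_size, mask_ratio = 1024, 0.4                     # placeholder values
tokens = torch.randint(0, vocab_size, (8, 256))        # discrete EEG token indices
mask = torch.rand(tokens.shape) < mask_ratio           # positions to reconstruct

# Masked input that the encoder would consume; index `vocab_size` acts as [MASK].
inputs = tokens.masked_fill(mask, vocab_size)

# Stand-in for the encoder's per-position logits over the token vocabulary.
logits = torch.randn(8, 256, vocab_size, requires_grad=True)

# Cross-entropy only on the masked positions: the model must recover the original
# token indices from the surrounding temporal and frequency context.
loss = F.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```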


Arxiv

TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval

  • TRACE is a new multimodal retriever that grounds time-series data in textual context.
  • Dynamic data in domains such as weather, healthcare, and energy require effective interpretation and retrieval, yet existing time-series retrieval methods lack semantic grounding.
  • TRACE aligns time-series embeddings with textual context, linking linguistic descriptions with complex temporal patterns, and supports both Text-to-Timeseries and Timeseries-to-Text retrieval (a minimal retrieval sketch follows this list).
  • It enables fine-grained channel-level alignment and handles multi-channel signals effectively.
  • Hard negative mining is used to make retrieval semantically meaningful.
  • Retrieved context enriches downstream models, improving their predictive accuracy and interpretability.
  • TRACE also functions as a standalone encoder that can be task-specifically tuned for context-aware representations, achieving state-of-the-art performance on forecasting and classification tasks.
  • Extensive experiments across domains validate this dual utility: an encoder for downstream applications and a general-purpose retriever that enhances time-series models.
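Once both modalities live in a shared embedding space, cross-modal retrieval reduces to nearest-neighbour search. The sketch below assumes pre-computed embeddings and uses cosine similarity, with random vectors standing in for TRACE's aligned encoders.

```python
import numpy as np

def retrieve(query_emb, candidate_embs, k=5):
    """Return indices of the k most similar candidates by cosine similarity.
    Generic cross-modal retrieval step; TRACE supplies the aligned encoders."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

rng = np.random.default_rng(0)
series_embs = rng.normal(size=(10_000, 256))   # pre-encoded time-series corpus
text_query = rng.normal(size=256)              # encoded textual query
print(retrieve(text_query, series_embs))       # Text-to-Timeseries retrieval
```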


Arxiv

Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

  • Neural Processes (NPs) are models that predict the posterior predictive distribution of stochastic processes.
  • Modern NPs handle complex applications in areas such as geology, epidemiology, climate, and robotics.
  • Scalability has become crucial as NPs are applied to increasingly data-hungry problems.
  • A new architecture, Biased Scan Attention Transformer Neural Process (BSA-TNP), is proposed.
  • BSA-TNP introduces Kernel Regression Blocks (KRBlocks) and group-invariant attention biases.
  • BSA-TNP uses memory-efficient Biased Scan Attention (BSA) for scalability.
  • BSA-TNP matches or surpasses the accuracy of top models while training faster.
  • It exhibits translation invariance and can learn at multiple resolutions simultaneously.
  • BSA-TNP can model processes that evolve in space and time and supports high-dimensional fixed effects.
  • The model can perform inference with over 1M test points and 100K context points in under a minute on a single 24GB GPU.


Arxiv

Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search

  • Large Language Models (LLMs) often require guidance to perform well in complex environments.
  • A new framework enhances LLM agent planning through in-context learning.
  • It combines atomic fact augmentation with lookahead search to improve planning capability.
  • The agent extracts task-critical 'atomic facts' from its interaction trajectories.
  • These facts augment the prompts of the LLM-based components, leading to better decisions.
  • Planning uses a depth-limited lookahead search guided by the accumulated facts and interaction history (a schematic of this loop follows this list).
  • The approach improves the agent's understanding and decision-making without any weight updates.
  • A theoretical motivation links performance to the quality of the fact-based abstraction and the accuracy of the LLM's simulations.
  • Empirically, the agent shows improved performance and adaptability on interactive tasks such as TextFrozenLake and ALFWorld.
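The planning loop can be pictured as plain depth-limited search in which every proposal, simulation, and evaluation call is an LLM prompt conditioned on the accumulated atomic facts. The llm_propose, llm_simulate, and llm_value functions below are hypothetical stand-ins, and the toy state and actions are invented so the example runs.

```python
# Hypothetical stand-ins for the framework's LLM-based components; in the paper
# these are prompts over a language model conditioned on the extracted atomic facts.
def llm_propose(state, facts, history):  return ["left", "right", "wait"]
def llm_simulate(state, action, facts):  return state + len(action) % 3
def llm_value(state, facts, history):    return -abs(10 - state)

def lookahead_plan(state, facts, history, depth):
    """Depth-limited lookahead search guided by accumulated atomic facts."""
    if depth == 0:
        return llm_value(state, facts, history), None
    best_value, best_action = float("-inf"), None
    for action in llm_propose(state, facts, history):
        next_state = llm_simulate(state, action, facts)     # LLM used as a world model
        value, _ = lookahead_plan(next_state, facts, history + [action], depth - 1)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action

facts = ["tile (2,3) of the frozen lake is cracked"]         # extracted atomic facts
print(lookahead_plan(0, facts, [], depth=3))
```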


Arxiv

The Curious Language Model: Strategic Test-Time Information Acquisition

  • Decision-makers often lack enough information to decide confidently and can take actions to acquire what they need.
  • Different ways of acquiring information carry different costs, making it challenging to select actions that are both informative and cost-effective.
  • A heuristic-based policy called CuriosiTree is proposed for zero-shot information acquisition with large language models (LLMs).
  • CuriosiTree uses greedy tree search to estimate the expected information gain of each action and strategically selects actions that balance information gain against cost (a one-step simplification is sketched after this list).
  • Empirical validation in a clinical diagnosis simulation shows that CuriosiTree enables cost-effective integration of heterogeneous information sources.
  • CuriosiTree outperforms baseline strategies at selecting action sequences that lead to an accurate diagnosis.
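Stripped of the tree search, the core trade-off is choosing the next acquisition whose expected information gain best justifies its cost. The sketch below is a one-step, greedy simplification with made-up gain and cost numbers, not CuriosiTree's actual scoring.

```python
# Candidate information-gathering actions with assumed expected-gain and cost values.
actions = {
    "ask about symptoms": {"gain": 0.9, "cost": 1.0},
    "order blood panel":  {"gain": 1.6, "cost": 4.0},
    "order MRI":          {"gain": 2.0, "cost": 9.0},
}

def next_action(candidates, budget):
    """Pick the affordable action with the best expected gain per unit cost."""
    affordable = {a: v for a, v in candidates.items() if v["cost"] <= budget}
    if not affordable:
        return None
    return max(affordable, key=lambda a: affordable[a]["gain"] / affordable[a]["cost"])

print(next_action(actions, budget=5.0))   # -> "ask about symptoms"
```

CuriosiTree extends this idea by searching over sequences of acquisitions rather than scoring one step at a time.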


Arxiv

Multivariate Long-term Time Series Forecasting with Fourier Neural Filter

  • Multivariate long-term time series forecasting must capture temporal dependencies and spatial correlations simultaneously, which remains challenging.
  • Current approaches such as Transformers do not handle time-series properties like periodicity effectively.
  • The work introduces the Fourier Neural Filter (FNF) as a dedicated backbone and DBD as the architecture for spatio-temporal modeling.
  • FNF unifies local time-domain and global frequency-domain information processing within a single backbone and extends to spatial modeling (an illustrative frequency-domain filtering layer is sketched after this list).
  • DBD offers superior gradient flow and representation capacity.
  • Empirical evaluation across 11 public benchmark datasets spanning multiple domains demonstrates state-of-the-art performance.
  • These results are achieved without auxiliary techniques, indicating the potential for improved time-series modeling in scientific and industrial applications.
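To make "global frequency-domain processing" concrete, the layer below applies a learnable complex-valued filter in the Fourier domain of each series. This is a generic illustration of the idea, not FNF's actual block, and the shapes are placeholders.

```python
import torch

class FrequencyFilter(torch.nn.Module):
    """Learnable global filtering in the frequency domain (illustrative only;
    FNF combines this kind of global view with local time-domain processing)."""

    def __init__(self, seq_len, channels):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # Real and imaginary parts of a per-channel, per-frequency filter.
        self.weight = torch.nn.Parameter(torch.randn(channels, n_freq, 2))

    def forward(self, x):                          # x: (batch, channels, seq_len)
        spec = torch.fft.rfft(x, dim=-1)           # global view of each series
        filt = torch.view_as_complex(self.weight)
        return torch.fft.irfft(spec * filt, n=x.shape[-1], dim=-1)

x = torch.randn(8, 7, 96)                          # 7 variables, 96 time steps
print(FrequencyFilter(seq_len=96, channels=7)(x).shape)   # torch.Size([8, 7, 96])
```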


Arxiv

Multi-Task Reward Learning from Human Ratings

  • Reinforcement learning from human feedback (RLHF) is crucial for aligning model behavior with user goals.
  • Current RLHF methods oversimplify human decision-making by framing it as isolated tasks such as classification or regression.
  • The paper presents a new reinforcement learning method that considers multiple tasks jointly to better mimic human decision-making.
  • The method learns a reward function from human ratings in reward-free settings, striking a balance between classification and regression models.
  • This approach accounts for the uncertainty in human decision-making and allows the emphasis between strategies to adapt.
  • Experiments with synthetic human ratings show that the method outperforms existing rating-based RL techniques.
  • It even outperforms traditional RL approaches in certain scenarios.


Arxiv

LaDCast: A Latent Diffusion Model for Medium-Range Ensemble Weather Forecasting

  • LaDCast is a new global latent-diffusion framework introduced for medium-range ensemble weather forecasting.
  • It generates hourly ensemble forecasts in a learned latent space by compressing high-dimensional ERA5 reanalysis fields into a compact representation using an autoencoder.
  • A transformer-based diffusion model is employed to produce sequential latent updates with arbitrary hour initialization.
  • LaDCast incorporates Geometric Rotary Position Embedding (GeoRoPE) to consider the Earth's spherical geometry, a dual-stream attention mechanism for efficient conditioning, and sinusoidal temporal embeddings for capturing seasonal patterns.
  • The model achieves deterministic and probabilistic forecasting skill comparable to the European Centre for Medium-Range Weather Forecasts' IFS-ENS, without explicit perturbations.
  • LaDCast excels in tracking rare extreme events like cyclones, providing more accurate trajectory predictions compared to established models.
  • By operating in latent space, LaDCast significantly reduces storage and computational requirements, offering a practical approach to real-time kilometer-scale resolution forecasting.
  • The code and models for LaDCast are open-source, and training and evaluation pipelines are available at https://github.com/tonyzyl/ladcast.


Arxiv

FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models

  • Integrating Low-Rank Adaptation (LoRA) into federated learning enables efficient fine-tuning of Large Language Models (LLMs) without sharing local data.
  • Federated LoRA methods struggle to balance communication efficiency, model accuracy, and computational cost, especially across heterogeneous clients.
  • Existing approaches either rely on simplistic averaging of local adapters, which introduces noise; transmit large local adapters, which hurts communication efficiency; or require computationally expensive decompositions to build client-specific low-rank adapters.
  • The proposed FLoRIST framework achieves accurate aggregation without high communication or computational overhead by performing singular value decomposition on the stacked local adapters separately (a simplified single-matrix sketch follows this list).
  • FLoRIST operates in a compact intermediate space to represent the information from the local LoRAs and uses tunable singular value thresholding to select an optimal rank on the server, constructing global low-rank adapters shared by all clients.
  • Empirical evaluations across datasets and LLMs show that FLoRIST delivers superior communication efficiency with competitive performance in both homogeneous and heterogeneous setups.
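The aggregation idea, decompose the combined client updates and keep only the ranks that carry most of the spectrum, can be shown on a single matrix. The sketch below forms the averaged update explicitly and thresholds by retained spectral energy; FLoRIST itself avoids materializing the full matrix and decomposes the stacked B and A factors separately, so treat this as a simplification with assumed sizes.

```python
import numpy as np

def aggregate_lora(B_list, A_list, energy=0.95):
    """Aggregate clients' LoRA factors via SVD plus singular value thresholding
    (single-matrix simplification of a FLoRIST-style aggregation step)."""
    # Averaged low-rank update; formed explicitly here only for clarity.
    delta = sum(B @ A for B, A in zip(B_list, A_list)) / len(B_list)
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    # Keep the smallest rank whose singular values capture `energy` of the spectrum.
    r = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), energy)) + 1
    return U[:, :r] * s[:r], Vt[:r]                     # global B and A at rank r

clients = [(np.random.randn(768, 8), np.random.randn(8, 768)) for _ in range(5)]
B_glob, A_glob = aggregate_lora([b for b, _ in clients], [a for _, a in clients])
print(B_glob.shape, A_glob.shape)
```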


Arxiv

Policy-Based Trajectory Clustering in Offline Reinforcement Learning

  • Researchers introduce the novel task of clustering trajectories from offline reinforcement learning datasets, where each cluster center represents the policy that generated its trajectories.
  • The clustering objective is formulated as the KL-divergence between the offline trajectory distribution and a mixture of policy-induced distributions.
  • Two methods are proposed for the task: Policy-Guided K-means (PG-Kmeans) and the Centroid-Attracted Autoencoder (CAAE).
  • PG-Kmeans trains behavior-cloning policies and assigns trajectories to clusters based on the probability that each policy generated them (a schematic of the assignment step follows this list), while CAAE guides the latent representations of trajectories toward specific codebook entries for clustering.
  • The finite-step convergence of PG-Kmeans is proven theoretically, and policy-induced conflicts are identified as a key challenge in offline trajectory clustering.
  • Experimental validation on the D4RL dataset and custom GridWorld environments shows that PG-Kmeans and CAAE partition trajectories into meaningful clusters.
  • These methods offer a promising framework for policy-based trajectory clustering, applicable in offline RL and beyond.
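In a k-means-style loop, the assignment step gives each trajectory to the cluster whose behavior-cloning policy explains it best, after which each cluster's policy is refit on its assigned trajectories. The sketch below shows only the assignment step, with tabular toy policies invented for illustration.

```python
import numpy as np

def assign_trajectories(trajectories, policies):
    """Assign each trajectory to the cluster whose policy gives it the highest
    log-likelihood (schematic PG-Kmeans-style assignment step).
    `policies[k][state]` is a toy per-state action distribution."""
    labels = []
    for traj in trajectories:                 # traj = [(state, action), ...]
        loglik = [sum(np.log(pi[s][a]) for s, a in traj) for pi in policies]
        labels.append(int(np.argmax(loglik)))
    return labels

# Two toy policies over 2 states x 2 actions (invented numbers).
policies = [
    {0: [0.9, 0.1], 1: [0.8, 0.2]},           # cluster 0 prefers action 0
    {0: [0.1, 0.9], 1: [0.2, 0.8]},           # cluster 1 prefers action 1
]
trajs = [[(0, 0), (1, 0)], [(0, 1), (1, 1)]]
print(assign_trajectories(trajs, policies))    # -> [0, 1]
```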

