techminis
A naukri.com initiative

ML News

Arxiv | 2d

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

  • Researchers have developed a self-supervised approach named V-JEPA 2 to understand, predict, and plan in the physical world.
  • V-JEPA 2 was pre-trained on over 1 million hours of internet video data and achieves top performance in motion understanding and human action anticipation tasks.
  • Aligned with a large language model, V-JEPA 2 also delivers strong video question-answering performance at scale.
  • The researchers further demonstrate self-supervised robotic planning by training V-JEPA 2-AC on unlabeled robot videos and deploying it for object manipulation.
  • V-JEPA 2-AC can pick and place objects on Franka arms in different lab environments by planning toward image goals (a minimal planning loop of this kind is sketched below).
  • This is achieved without task-specific training, rewards, or robot data collection in the target environments.
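
A minimal sketch of what planning toward an image goal with a learned world model can look like, in the spirit of V-JEPA 2-AC but not the authors' implementation; encode and predict are hypothetical stand-ins for the video encoder and the action-conditioned predictor:

    # Hypothetical sketch: random-shooting MPC in latent space toward a goal image.
    # `encode` and `predict` are stand-ins, not the released V-JEPA 2-AC API.
    import numpy as np

    def plan_first_action(encode, predict, obs_img, goal_img,
                          horizon=10, samples=256, elites=16, iters=3, act_dim=7):
        z0, z_goal = encode(obs_img), encode(goal_img)
        mu, std = np.zeros((horizon, act_dim)), np.ones((horizon, act_dim))
        for _ in range(iters):                      # CEM-style refinement
            acts = mu + std * np.random.randn(samples, horizon, act_dim)
            costs = []
            for seq in acts:
                z = z0
                for a in seq:                       # roll the predictor forward
                    z = predict(z, a)
                costs.append(np.linalg.norm(z - z_goal))  # distance to goal latent
            elite = acts[np.argsort(costs)[:elites]]
            mu, std = elite.mean(axis=0), elite.std(axis=0)
        return mu[0]                                # execute first action, then replan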


Arxiv | 2d

A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

  • A new benchmark, Minimal Video Pairs (MVP), is introduced to assess the physical understanding abilities of video-language models.
  • Existing benchmarks can inflate scores because models exploit shortcut solutions based on superficial cues; MVP is designed to counter this.
  • MVP comprises 55K multiple-choice video QA examples related to physical world understanding from various video data sources.
  • The examples cover first-person egocentric and exocentric videos, robotic interaction data, and intuitive physics benchmarks.
  • To counter shortcut solutions, each sample in MVP comes with a minimal-change pair: a visually similar video whose correct answer is the opposite.
  • To receive credit, a model must answer both examples in the minimal-change pair correctly (see the scoring sketch after this list).
  • Human performance on MVP is 92.9%, while the best video-language model achieves 40.2% compared to random performance at 25%.
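
A minimal sketch of the paired scoring rule described above (my reading of the benchmark, not the authors' evaluation code):

    # Credit is given only when BOTH videos of a minimal-change pair are
    # answered correctly, which neutralizes shortcut guessing.
    def paired_accuracy(pairs, model):
        """pairs: list of ((vid_a, question, ans_a), (vid_b, question, ans_b))."""
        hits = sum(model(va, q) == aa and model(vb, q) == ab
                   for (va, q, aa), (vb, _, ab) in pairs)
        return hits / len(pairs)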


Arxiv | 2d

EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits

  • EditInspector is introduced as a benchmark to evaluate text-guided image edits by leveraging human annotations and a comprehensive framework.
  • The benchmark assesses text-guided edits based on various factors like accuracy, artifact detection, visual quality, and more.
  • Current state-of-the-art vision-language models struggle to evaluate edits comprehensively and frequently generate inaccurate descriptions of the changes.
  • Two novel methods proposed within EditInspector outperform existing models in artifact detection and difference caption generation.


Arxiv | 2d

Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

  • Chain-of-Action (CoA) is a new visuo-motor policy paradigm based on Trajectory Autoregressive Modeling.
  • Unlike conventional forward prediction, CoA generates an entire trajectory by reasoning backward from task-specific goals.
  • The process involves an action-level Chain-of-Thought (CoT) within a single autoregressive structure.
  • The first token in CoA represents a stable keyframe action encoding the task goals, with subsequent actions generated based on the initial keyframe and previously predicted actions.
  • This backward action reasoning enforces a global-to-local structure, where each local action is tightly constrained by the final goal.
  • To enhance action reasoning, CoA adds continuous action-token representations, dynamic stopping for variable-length trajectory generation, a reverse temporal ensemble, and multi-token prediction (a decoding sketch follows this list).
  • CoA demonstrates strong spatial generalization capabilities while maintaining a flexible and simple visuo-motor policy.
  • Empirical results show that CoA achieves state-of-the-art performance on 60 RLBench tasks and 8 real-world manipulation tasks.
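
An illustrative decoding loop for this goal-first, autoregressive scheme (names and structure are my assumptions, not the released model):

    # The first "token" is a keyframe action encoding the task goal; later
    # actions are decoded autoregressively, conditioned on the keyframe and on
    # everything predicted so far, with dynamic stopping.
    def chain_of_action(policy, obs, max_steps=64):
        keyframe = policy.predict_keyframe(obs)       # hypothetical goal-token head
        traj = [keyframe]
        while len(traj) < max_steps:
            action, stop = policy.predict_next(obs, traj)   # hypothetical decoder
            if stop:                                  # variable-length trajectories
                break
            traj.append(action)
        return traj[::-1]   # reasoned goal-to-start, executed start-to-goal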


Arxiv | 2d

Text-Aware Image Restoration with Diffusion Models

  • Image restoration methods often struggle to reconstruct textual regions accurately, resulting in text-image hallucination.
  • Text-Aware Image Restoration (TAIR) is introduced to recover visual contents and textual fidelity simultaneously.
  • SA-Text, a large-scale benchmark of scene images annotated with text instances, is presented.
  • A multi-task diffusion framework called TeReDiff integrates features from diffusion models into a text-spotting module.
  • Joint training of the two components yields rich text representations that are used to guide denoising.
  • Experiments show that the approach outperforms existing methods, improving text recognition accuracy.
  • Project page: https://cvlab-kaist.github.io/TAIR/


Arxiv | 2d

DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos

  • Researchers introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM) for dynamic 3D scene reconstruction from monocular posed videos.
  • DGS-LRM is a feed-forward method capable of predicting deformable 3D Gaussian splats for any dynamic scene, addressing the limitations of existing models for static scenes.
  • Challenges in developing a feed-forward model for dynamic scene reconstruction include a lack of training data and requirements for suitable 3D representations and training paradigms.
  • Key technical contributions of DGS-LRM include an enhanced synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision.
  • Additionally, it utilizes a per-pixel deformable 3D Gaussian representation that supports dynamic view synthesis and long-range 3D tracking.
  • DGS-LRM incorporates a large transformer network for real-time, generalizable dynamic scene reconstruction.
  • Extensive qualitative and quantitative experiments show that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods and surpasses state-of-the-art predictive dynamic reconstruction on real-world examples.
  • The model's predicted 3D deformation is accurate, enabling efficient long-range 3D tracking comparable to leading monocular video 3D tracking methods.


Arxiv | 2d

Artificial Intelligence for Science in Quantum, Atomistic, and Continuum Systems

  • Advances in artificial intelligence (AI) are driving new discoveries in natural sciences by enhancing our understanding of natural phenomena across various scales.
  • AI for science (AI4Science) is an emerging interdisciplinary research paradigm focused on applying AI to natural sciences.
  • This work provides a detailed account of AI for quantum, atomistic, and continuum systems, aiming to understand phenomena across different scales.
  • The subareas of quantum, atomistic, and continuum systems share common challenges, such as capturing physics first principles using deep learning methods.
  • Techniques for achieving equivariance to symmetry transformations are discussed, along with cross-cutting challenges such as explainability and uncertainty quantification (a toy equivariance check follows this list).
  • Resources for learning and education in AI for science are categorized to facilitate further understanding and community interest in the field.
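
As a toy illustration of the equivariance property mentioned above, the check below verifies f(Qx) = Q f(x) for a simple radial map under a random orthogonal transform (illustration only, not from the survey):

    import numpy as np

    def f(x):
        return x * np.linalg.norm(x)    # radial map: provably equivariant

    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal matrix
    x = rng.standard_normal(3)
    assert np.allclose(f(Q @ x), Q @ f(x))             # equivariance holds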


Arxiv | 2d

Feature Normalization Prevents Collapse of Non-contrastive Learning Dynamics

  • Contrastive learning is a framework where positive views are made similar and negative views are kept far apart in a data representation space.
  • Non-contrastive learning methods like BYOL and SimSiam eliminate negative examples to improve computational efficiency.
  • An analysis by Tian et al. showed that collapse of the learned representations is prevented when data augmentation is sufficiently stronger than regularization.
  • However, that analysis did not account for feature normalization, a key step applied before measuring the similarity of representations.
  • Excessively strong regularization combined with feature normalization may lead to undesired collapse of dynamics.
  • The study develops a new theory based on the cosine loss with feature normalization (sketched below), whose sixth-order dynamics prevent collapse.
  • This approach leads to stable equilibrium even when initial parameters could lead to collapsed solutions.
  • The research emphasizes the pivotal role of feature normalization in robustly preventing collapses in learning dynamics.
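
A generic cosine loss with feature normalization of the kind the analysis studies (a BYOL/SimSiam-style sketch, not the paper's code):

    import numpy as np

    def cosine_loss(z1, z2, eps=1e-8):
        # feature normalization: project both views onto the unit sphere
        z1 = z1 / (np.linalg.norm(z1, axis=-1, keepdims=True) + eps)
        z2 = z2 / (np.linalg.norm(z2, axis=-1, keepdims=True) + eps)
        return -(z1 * z2).sum(axis=-1).mean()   # maximize cosine similarity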


Arxiv | 2d

Byzantine-Resilient Decentralized Multi-Armed Bandits

  • In decentralized cooperative multi-armed bandits, agents exchange information about arms in order to minimize their regret.
  • Cooperating agents incur lower regret than agents selecting arms independently.
  • The study asks how to retain this cooperative advantage in the presence of Byzantine agents, who may communicate arbitrary or incorrect information.
  • The framework can model attackers in networks, offensive content instigators, or financial manipulators.
  • A decentralized resilient upper confidence bound (UCB) algorithm is developed to handle Byzantine agents.
  • The algorithm mixes information among neighboring agents and trims inconsistent extreme values (see the trimming sketch after this list).
  • Normal agents achieve regret of the same order as the UCB1 algorithm, outperforming the non-cooperative case.
  • Each agent needs at least 3f+1 neighbors, where f bounds the number of Byzantine agents in any agent's neighborhood.
  • Extensions to time-varying graphs and minimax lower bounds for achievable regret are established.
  • Experiments support the framework's effectiveness in practical applications.
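
A sketch of the trimming idea (a generic trimmed mean, my paraphrase rather than the paper's exact rule, which is what motivates the 3f+1 neighbor requirement):

    import numpy as np

    def trimmed_mix(own_estimate, neighbor_estimates, f):
        # drop the f largest and f smallest neighbor values so that up to f
        # Byzantine neighbors cannot drag the mixed estimate arbitrarily far
        vals = np.sort(np.asarray(neighbor_estimates))
        kept = vals[f:len(vals) - f] if f > 0 else vals
        return 0.5 * own_estimate + 0.5 * kept.mean()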


Arxiv | 2d

Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations

  • Modeling spatial-temporal interactions among neighboring agents is crucial for multi-agent problems like motion forecasting and crowd navigation.
  • Recent representations may not fully capture the causal relationships in agent interactions.
  • A metric-learning approach is introduced to enhance causal awareness by regularizing latent features with causal annotations (a toy regularizer of this kind is sketched after this list).
  • Experiments demonstrate that the proposed approach improves causal awareness and enhances out-of-distribution robustness.
  • A sim-to-real causal transfer method through cross-domain multi-task learning is proposed to apply these concepts in real-world scenarios.
  • Experiments on pedestrian datasets show significant performance improvements even without real-world causal annotations.
  • The research offers insights into challenges and solutions for developing causally-aware representations of multi-agent interactions.
  • The code for the approach is available at https://github.com/vita-epfl/CausalSim2Real.
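
A toy form such a regularizer could take (the function and its inputs are my assumptions, not the paper's loss): push the latent distance between a scene and its counterfactual toward the annotated causal-effect magnitude.

    import numpy as np

    def causal_metric_reg(z, z_cf, effect, margin=0.1):
        """z, z_cf: (B, D) embeddings of factual/counterfactual scenes;
        effect: (B,) annotated causal-effect magnitudes."""
        d = np.linalg.norm(z - z_cf, axis=-1)
        return np.maximum(np.abs(d - effect) - margin, 0).mean()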


Arxiv | 2d

Using Shapley interactions to understand how models use structure

  • Language models are being analyzed using Shapley Taylor interaction indices (STII) to understand how they represent internal structure.
  • Shapley interactions help measure how inputs in language and speech models work together to impact outputs beyond their independent influences.
  • The study examines how these interactions align with underlying linguistic structure: syntax, non-compositional semantics, and phonetic coarticulation.
  • Results indicate that autoregressive text models show interactions correlating with the syntactic proximity of inputs.
  • Both autoregressive and masked models encode nonlinear interactions in idiomatic phrases with non-compositional semantics.
  • For speech, models exhibit the phonetic interactions needed to extract discrete phonemic representations (a toy pairwise-interaction estimator is sketched below).
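
A Monte Carlo sketch of a pairwise interaction of this flavor (an illustrative simplification, not the paper's STII estimator): the second-order discrete derivative of the model output with respect to inputs i and j, averaged over random background contexts.

    import numpy as np

    def pair_interaction(f, x, baseline, i, j, n_samples=200, seed=0):
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(n_samples):
            mask = rng.random(len(x)) < 0.5     # random context S, excluding i, j
            mask[[i, j]] = False
            def value(on):                      # model output with S plus `on` present
                m = mask.copy()
                m[list(on)] = True
                return f(np.where(m, x, baseline))
            # inclusion-exclusion: joint effect minus the two solo effects
            total += value({i, j}) - value({i}) - value({j}) + value(set())
        return total / n_samples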


Arxiv | 2d

Share Secrets for Privacy: Confidential Forecasting with Vertical Federated Learning

  • Vertical federated learning (VFL) is utilized for time series forecasting in areas like healthcare and manufacturing.
  • Challenges include data privacy and overfitting on small, noisy datasets.
  • A novel framework called "Secret-shared Time Series Forecasting with VFL" (STV) is proposed to address these challenges.
  • STV features privacy-preserving algorithms for forecasting with SARIMAX and autoregressive trees on vertically-partitioned data.
  • Decentralized forecasting is achieved through secret sharing and multi-party computation (the underlying sharing primitive is sketched after this list).
  • Novel N-party algorithms for matrix multiplication and inversion enable exact parameter optimization with strong convergence guarantees and minimal tuning complexity.
  • STV's performance is evaluated on six datasets from various contexts, showing comparable forecasting accuracy to centralized approaches.
  • With exact optimization, STV surpasses centralized methods, including state-of-the-art models such as long short-term memory networks, by 23.81% in forecasting accuracy.
  • The scalability of STV is assessed by comparing communication costs of exact and iterative optimization.
  • The code and supplementary material for STV are accessible online at https://github.com/adis98/STV.
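
The additive secret-sharing primitive that such protocols build on, as a minimal sketch (the primitive only; STV's full MPC machinery for SARIMAX fitting is far more involved):

    import secrets

    P = 2**61 - 1   # public prime modulus

    def share(x, n_parties=3):
        # split x into random shares that sum to x mod P;
        # any n-1 of the shares reveal nothing about x
        shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
        shares.append((x - sum(shares)) % P)
        return shares

    def reconstruct(shares):
        return sum(shares) % P

    a, b = share(42), share(58)
    # each party adds its shares locally; only the sum is ever revealed
    assert reconstruct([ai + bi for ai, bi in zip(a, b)]) == 100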


Arxiv | 2d

PARAFAC2-based Coupled Matrix and Tensor Factorizations with Constraints

  • Data fusion models based on Coupled Matrix and Tensor Factorizations (CMTF) have been effective tools for joint analysis of data from multiple sources.
  • Recent advancements have integrated the more flexible PARAFAC2 model into CMTF models, enabling handling of irregular/ragged tensors and dynamic data with unaligned time profiles.
  • Existing PARAFAC2-based CMTF models have limitations in terms of possible regularizations and types of data coupling.
  • To address these limitations, a new algorithmic framework has been introduced in this paper for fitting PARAFAC2-based CMTF models using Alternating Optimization (AO) and the Alternating Direction Method of Multipliers (ADMM).
  • The proposed framework allows for imposing various constraints on all modes and linear couplings to other matrix-, CP- or PARAFAC2-models.
  • Experiments on simulated and real datasets have shown the utility and versatility of the proposed framework, highlighting its accuracy and efficiency compared to state-of-the-art methods.


Arxiv | 2d

The Remarkable Robustness of LLMs: Stages of Inference?

  • Large Language Models (LLMs) are surprisingly robust to structural interventions like deleting and swapping adjacent layers during inference.
  • LLMs retain 72-95% of their original top-1 prediction accuracy without any fine-tuning after interventions.
  • Performance degradation varies across layers: early and final layers see more degradation, while dropping middle layers has minimal impact.
  • The authors identify four stages of inference in LLMs: detokenization, feature engineering, prediction ensembling, and residual sharpening.
  • These stages reflect depth-dependent computation and appear across different model families and sizes (a minimal layer-deletion probe is sketched below).
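
A minimal probe of this kind, using GPT-2 as a stand-in (the paper spans several model families; this only illustrates the intervention, not their benchmark):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    del model.transformer.h[6]      # drop one middle transformer block
    model.config.n_layer -= 1       # keep the config consistent

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    print(tok.decode(logits[0, -1].argmax().item()))   # top-1 token often survives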


Arxiv | 2d

Electroencephalogram Emotion Recognition via AUC Maximization

  • This study focuses on addressing imbalanced datasets, particularly in the context of 'Liking' label detection in the DEAP dataset used in neuroscience, cognitive science, and medical diagnostics.
  • Prior research has typically focused on the more balanced arousal and valence labels, overlooking class-imbalance problems in which the minority class is the one that matters.
  • The study maximizes the area under the ROC curve (AUC) via numerical optimization to improve minority-class detection (a pairwise surrogate for AUC maximization is sketched after this list).
  • Comparisons are made between the proposed linear classifier approach and traditional models like logistic regression and support vector machines (SVM).
  • The new method significantly outperforms traditional models, boosting recall from 41.6% to 79.7% and increasing the F1-score from 0.506 to 0.632.
  • These findings demonstrate the effectiveness of AUC maximization through numerical optimization in dealing with imbalanced datasets to improve predictive accuracy for critical minority classes in unseen data.
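
A pairwise-hinge surrogate for AUC maximization with a linear scorer (a standard formulation consistent with the summary; the paper's exact optimizer is not reproduced here):

    import numpy as np

    def fit_auc(X_pos, X_neg, lr=0.1, steps=500, margin=1.0):
        # AUC is the fraction of correctly ranked positive/negative pairs,
        # so we take subgradient steps on mis-ranked (or barely ranked) pairs.
        n_pos, n_neg = len(X_pos), len(X_neg)
        w = np.zeros(X_pos.shape[1])
        for _ in range(steps):
            gaps = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]
            active = gaps < margin
            grad = -(active.sum(1) @ X_pos - active.sum(0) @ X_neg) / (n_pos * n_neg)
            w -= lr * grad
        return w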

