Yandex has released Yambda, the world’s largest publicly available dataset for recommender system research, containing nearly 5 billion anonymized user interaction events from Yandex Music.
The dataset aims to bridge the gap between academia and industry by giving researchers production-scale behavioral data on which to develop and evaluate recommender systems.
Recommender systems rely on massive behavioral data to provide personalized experiences, but access to such large, anonymized datasets has been limited.
Yambda addresses this challenge by offering 4.79 billion anonymized user interactions collected over a 10-month period, along with features such as audio embeddings and flags marking whether an interaction was organic or driven by a recommendation.
The dataset is provided in Apache Parquet format, making it accessible for researchers and developers using big data processing frameworks.
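Yandex introduces a Global Temporal Split evaluation strategy in Yambda, which splits training and test data at a single point in time so that temporal order is preserved and models are tested only on interactions that occur after the training period.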
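To make the idea of a global temporal split concrete, here is a minimal sketch in Python. It is not Yambda's own evaluation code: the `Event` record, its field names, and the `cutoff` parameter are assumptions made for illustration. The point it shows is that one timestamp divides every user's history, so no test event precedes any training event.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    # Hypothetical record layout; the real dataset's column names may differ.
    user_id: int
    item_id: int
    timestamp: int


def global_temporal_split(
    events: List[Event], cutoff: int
) -> Tuple[List[Event], List[Event]]:
    """Split all events at one global cutoff timestamp.

    Events strictly before the cutoff go to train; events at or after it
    go to test. The same cutoff applies to every user.
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    train = [e for e in ordered if e.timestamp < cutoff]
    test = [e for e in ordered if e.timestamp >= cutoff]
    return train, test


# Toy data, invented for this example.
events = [
    Event(1, 10, 50),
    Event(2, 11, 10),
    Event(1, 12, 130),
    Event(3, 10, 20),
    Event(2, 13, 100),
]
train, test = global_temporal_split(events, cutoff=100)
```

Unlike a per-user split such as leave-one-out, a user here may have events only in the training part or only in the test part, which is closer to what a deployed system encounters.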
Baseline models such as MostPop, DecayPop, and ItemKNN, among others, are included for benchmarking and for assessing the performance of new algorithms.
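As a rough illustration of the two simplest baselines named above, the sketch below ranks items by raw interaction count (a MostPop-style baseline) and by an exponentially time-decayed count (a DecayPop-style baseline). The exact definitions used in Yambda's benchmark may differ. The event tuples, the `half_life` parameter, and the function names are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical event layout: (user_id, item_id, timestamp).
EventTuple = Tuple[int, str, float]


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring items, breaking ties by item id."""
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


def most_pop(events: List[EventTuple], k: int) -> List[str]:
    """Rank items by how many interactions they received."""
    scores: Dict[str, float] = defaultdict(float)
    for _, item, _ in events:
        scores[item] += 1.0
    return top_k(scores, k)


def decay_pop(
    events: List[EventTuple], now: float, half_life: float, k: int
) -> List[str]:
    """Rank items by interaction count with exponential time decay.

    An interaction that is one half_life old contributes 0.5, one that is
    two half-lives old contributes 0.25, and so on.
    """
    scores: Dict[str, float] = defaultdict(float)
    for _, item, ts in events:
        scores[item] += 0.5 ** ((now - ts) / half_life)
    return top_k(scores, k)


# Toy data, invented for this example: "a" has more plays, "b" is more recent.
toy_events = [(1, "a", 0), (2, "a", 1), (3, "a", 2), (1, "b", 98), (2, "b", 99)]
by_count = most_pop(toy_events, k=2)
by_recency = decay_pop(toy_events, now=100, half_life=10, k=2)
```

On this toy data the two baselines disagree: the raw count favors the older item with more plays, while the decayed score favors the item with recent activity.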
Yambda's applicability extends beyond music streaming, serving as a benchmark for recommender systems in various domains like e-commerce and social networks.
My Wave, Yandex Music's personalized recommender system, uses deep neural networks to offer music suggestions tailored to each user's preferences.
Yandex emphasizes privacy in the dataset by anonymizing user data and omitting sensitive attributes, ensuring ethical use of the data for research purposes.