SensorLM is introduced as a family of sensor-language foundation models for understanding wearable sensor data with natural language.
Aligning and interpreting sensor data with language is challenging because real-world wearable data lack paired, richly annotated sensor-text descriptions.
SensorLM uses a hierarchical caption generation pipeline that captures statistical, structural, and semantic information from sensor data, which was used to curate the largest sensor-language dataset to date: over 59.7 million hours of data from 103,000 individuals.
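A minimal sketch of how a hierarchical captioning pipeline of this kind could be structured, moving from statistics to structure to semantics; the function names, channels, and thresholds below are illustrative assumptions, not the paper's actual rules:

```python
import numpy as np

def statistical_caption(signal: np.ndarray, name: str) -> str:
    """Summarize basic statistics of one sensor channel (illustrative)."""
    return (f"{name}: mean {signal.mean():.1f}, min {signal.min():.1f}, "
            f"max {signal.max():.1f} over {len(signal)} samples")

def structural_caption(signal: np.ndarray, name: str) -> str:
    """Describe the coarse trend within the window (illustrative heuristic)."""
    first, second = np.array_split(signal, 2)
    trend = "rising" if second.mean() > first.mean() else "falling or flat"
    return f"{name} is {trend} across the window"

def semantic_caption(heart_rate: np.ndarray, steps: np.ndarray) -> str:
    """Map signal patterns to a plausible activity description (illustrative rule)."""
    if steps.mean() > 80 and heart_rate.mean() > 120:
        return "the wearer is likely running"
    if steps.mean() > 20:
        return "the wearer is likely walking"
    return "the wearer is likely at rest"

def hierarchical_caption(heart_rate: np.ndarray, steps: np.ndarray) -> str:
    """Combine the three levels into one paired text description."""
    return "; ".join([
        statistical_caption(heart_rate, "heart rate"),
        statistical_caption(steps, "step count"),
        structural_caption(heart_rate, "heart rate"),
        semantic_caption(heart_rate, steps),
    ])

# Example: one 10-minute window of per-minute heart rate and step counts.
hr = np.array([72, 75, 90, 110, 128, 135, 132, 125, 100, 85], dtype=float)
st = np.array([0, 10, 40, 90, 110, 120, 115, 100, 30, 5], dtype=float)
print(hierarchical_caption(hr, st))
```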
It extends multimodal pretraining architectures such as CLIP and CoCa, outperforming state-of-the-art methods in zero-shot recognition, few-shot learning, and cross-modal retrieval across human activity analysis and healthcare tasks.
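For intuition, here is a toy sketch of the CLIP-style symmetric contrastive objective that such sensor-text pretraining builds on; the encoder architectures, dimensions, and tokenization below are placeholder assumptions rather than SensorLM's actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SensorTextCLIP(nn.Module):
    """Toy contrastive aligner between sensor windows and captions (illustrative)."""

    def __init__(self, sensor_dim=64, text_vocab=1000, embed_dim=128):
        super().__init__()
        # Placeholder encoders; a real model would use far larger backbones.
        self.sensor_encoder = nn.Sequential(nn.Linear(sensor_dim, 256), nn.ReLU(),
                                            nn.Linear(256, embed_dim))
        self.text_encoder = nn.EmbeddingBag(text_vocab, embed_dim)  # mean-pooled tokens
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, sensor_windows, caption_token_ids):
        # L2-normalized embeddings for both modalities.
        s = F.normalize(self.sensor_encoder(sensor_windows), dim=-1)
        t = F.normalize(self.text_encoder(caption_token_ids), dim=-1)
        # Symmetric InfoNCE loss: matching sensor/caption pairs sit on the diagonal.
        logits = self.logit_scale.exp() * s @ t.t()
        labels = torch.arange(len(s))
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

model = SensorTextCLIP()
sensors = torch.randn(8, 64)                # batch of flattened sensor windows
captions = torch.randint(0, 1000, (8, 16))  # batch of tokenized captions
loss = model(sensors, captions)
loss.backward()
```

Generative heads in the CoCa style would add a captioning decoder on top of the same aligned sensor representation.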
SensorLM also demonstrates favorable scaling behavior, label efficiency, sensor captioning, and zero-shot generalization to new tasks.
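Zero-shot generalization with aligned encoders typically amounts to embedding class descriptions as text prompts and matching them against the sensor embedding. A brief sketch under that assumption, using random placeholder embeddings and hypothetical class names purely for illustration:

```python
import numpy as np

def zero_shot_classify(sensor_embedding, prompt_embeddings, class_names):
    """Pick the class whose text-prompt embedding has highest cosine similarity."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = norm(prompt_embeddings) @ norm(sensor_embedding)
    return class_names[int(np.argmax(sims))]

# Hypothetical pre-computed embeddings from aligned sensor/text encoders.
classes = ["walking", "running", "sleeping"]
prompts = np.random.randn(3, 128)   # embeddings of prompts like "a person is walking"
window = np.random.randn(128)       # embedding of one wearable sensor window
print(zero_shot_classify(window, prompts, classes))
```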