Time series classification is a crucial task in healthcare and industry, yet the development of time series foundation models (TSFMs) has been held back by the scarcity of time series datasets.
The Time Vision Transformer (TiViT) framework converts time series into images so that Vision Transformers (ViTs) pretrained on large-scale image datasets can be reused for time series classification.
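
To make the time-series-to-image step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a univariate series that is min-max normalized, cut into overlapping segments, stacked into a 2D grid, and resized to a square three-channel image by nearest-neighbor sampling. The function name `series_to_image`, the parameters `patch_len`, `stride`, and `size`, and the choice of normalization and resizing are all illustrative assumptions.

```python
import numpy as np


def series_to_image(x, patch_len, stride, size=224):
    """Turn a 1D series into a (3, size, size) array with values in [0, 1].

    Hypothetical helper: normalization, segmentation, and resizing choices
    are assumptions made for this sketch.
    """
    x = np.asarray(x, dtype=np.float64)
    if x.ndim != 1 or x.shape[0] < patch_len:
        raise ValueError("expected a 1D series of at least patch_len samples")

    # Min-max normalize so that values can be read as pixel intensities.
    lo, hi = x.min(), x.max()
    x = (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

    # Cut the series into (possibly overlapping) segments and stack them
    # as rows of a 2D grid of shape (n_segments, patch_len).
    starts = np.arange(0, x.shape[0] - patch_len + 1, stride)
    grid = np.stack([x[s:s + patch_len] for s in starts])

    # Nearest-neighbor resize of the grid to a size x size image.
    rows = np.arange(size) * grid.shape[0] // size
    cols = np.arange(size) * grid.shape[1] // size
    image = grid[np.ix_(rows, cols)]

    # Replicate the grayscale image across three channels, as expected by
    # ViTs pretrained on RGB images.
    return np.repeat(image[None, :, :], 3, axis=0)
```

For example, `series_to_image(np.sin(np.linspace(0, 20, 512)), patch_len=32, stride=16)` returns an array of shape `(3, 224, 224)`, which a ViT would then split into its own 2D patches.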
A theoretical analysis of the 2D patching that ViTs apply to the resulting images shows that it can increase the number of label-relevant tokens and reduce sample complexity.
TiViT achieves state-of-the-art performance on time series classification benchmarks by using the hidden representations of large OpenCLIP models, and intermediate layers prove particularly effective for classification.
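
The use of intermediate-layer representations can be illustrated with a small sketch under stated assumptions. It presumes that the per-layer hidden states of a pretrained model have already been extracted into an array of shape `(n_layers, n_samples, n_tokens, dim)`. Extraction from an OpenCLIP model is not shown. Tokens are mean-pooled per layer, and a nearest-centroid classifier is used purely as a lightweight stand-in for whatever classifier is trained on the features. The helper names `pool_tokens`, `nearest_centroid_accuracy`, and `best_layer` are hypothetical.

```python
import numpy as np


def pool_tokens(hidden_states, layer):
    """Mean-pool the tokens of one layer: (L, N, T, D) -> (N, D)."""
    return hidden_states[layer].mean(axis=1)


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Accuracy of a nearest-centroid classifier (stand-in classifier)."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Euclidean distance from every test sample to every class centroid.
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == test_y).mean())


def best_layer(hidden_train, y_train, hidden_val, y_val):
    """Return the layer index with the highest validation accuracy."""
    scores = [
        nearest_centroid_accuracy(
            pool_tokens(hidden_train, layer), y_train,
            pool_tokens(hidden_val, layer), y_val,
        )
        for layer in range(hidden_train.shape[0])
    ]
    return int(np.argmax(scores)), scores
```

Selecting the layer on a held-out validation split, as sketched here, is one simple way to compare layers. It is an assumption of this example, not a description of the paper's protocol.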