Multimodal deep learning models have been successful in fusing text and imagery data.
However, far fewer models attempt to fuse time series data with text, imagery, or audio.
Fusing time series data with other modalities has practical applications across many industries.
Examples include forecasting river flow from historical data and satellite imagery; predicting patient mortality from vitals, imaging data, and doctors' notes; and forecasting product sales.