Using over 5,000 New Year’s resolution tweets from 2015, the article explores semantic structures behind human expression by finding the underlying stories in the data.
An OpenAI embedding model turns each tweet into a 1,536-dimensional vector, placing the tweets in a high-dimensional space that can then be explored.
PCA is a dimensionality reduction technique that gives a global view of the data by keeping the directions of greatest variance, whereas t-SNE and UMAP take a different approach by emphasizing local relationships.
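As a rough illustration of the PCA step, the sketch below projects a matrix of vectors onto its first two principal components using NumPy's SVD. The random `embeddings` matrix is only a placeholder standing in for the real tweet embeddings, and the helper name `pca_project` is hypothetical, not taken from the article.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each dimension
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)  # share of variance per component
    return coords, explained[:n_components]

# Placeholder data: 200 random vectors with the same width as the embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 1536))
coords, ratio = pca_project(embeddings)
```

Here `coords` holds one 2D point per input vector and `ratio` reports how much of the total variance each of the two axes retains. t-SNE and UMAP would need their own libraries and are not shown.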
Supervised projections such as Linear Discriminant Analysis explicitly align the projection with specific categories and reveal strikingly clear patterns. Cosmic graph visualization is used for the same purpose.
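To make the idea of a supervised projection concrete, here is a minimal NumPy sketch of Fisher's Linear Discriminant Analysis. It assumes each vector carries a category label; the synthetic `vectors` and `labels`, the helper name `lda_project`, and the small regularization term `reg` are all assumptions made for illustration, not details from the article.

```python
import numpy as np

def lda_project(X, y, n_components=2, reg=1e-6):
    """Project rows of X onto the top Fisher discriminant directions (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        Sw += centered.T @ centered
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    Sw += reg * np.eye(d)  # assumed ridge term to keep Sw invertible
    # Directions maximizing between-class relative to within-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    W = eigvecs[:, order].real
    return (X - mean_all) @ W

# Synthetic stand-in: three labeled groups in 10 dimensions
rng = np.random.default_rng(0)
centers = rng.normal(scale=3.0, size=(3, 10))
labels = np.repeat(np.arange(3), 50)
vectors = centers[labels] + rng.normal(size=(150, 10))
projected = lda_project(vectors, labels)
```

Unlike PCA, the axes here are chosen using the labels, so the categories tend to separate cleanly. With C categories, at most C - 1 discriminant directions carry information.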
Validation is necessary to ensure that the projections and the patterns they reveal aren't arbitrary. Semantic embeddings themselves are not random; they are usually structured by design.
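One possible way to check that a pattern is not arbitrary is a label-permutation test: measure how well the categories separate in the projection, then compare that score against the scores obtained after shuffling the labels. The sketch below is an illustrative assumption rather than the article's own validation procedure; the function names, the separation score, and the synthetic 2D points are all hypothetical.

```python
import numpy as np

def separation_score(coords, labels):
    """Between-class scatter divided by within-class scatter."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    overall = coords.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in np.unique(labels):
        pts = coords[labels == c]
        center = pts.mean(axis=0)
        between += len(pts) * np.sum((center - overall) ** 2)
        within += np.sum((pts - center) ** 2)
    return between / within

def permutation_p_value(coords, labels, n_permutations=200, seed=0):
    """Fraction of shuffled labelings that separate at least as well as the real one."""
    rng = np.random.default_rng(seed)
    observed = separation_score(coords, labels)
    null = np.array([
        separation_score(coords, rng.permutation(labels))
        for _ in range(n_permutations)
    ])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return observed, p_value

# Synthetic 2D projection with three genuinely distinct groups
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 40)
coords = centers[labels] + rng.normal(size=(120, 2))
observed, p_value = permutation_p_value(coords, labels)
```

A small `p_value` suggests the grouping is unlikely to be a product of chance. For a supervised projection, the projection itself should be refit on each shuffled labeling (or scored on held-out data), because fitting to labels can manufacture separation even from noise.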
Visualization is about inviting curiosity, sparking insights, and uncovering truths hidden in the multidimensional echo of the data.
Projecting 1,536 dimensions onto two is not just a technical challenge but a storytelling exercise, using machine learning to illuminate the human experience.
In short, projecting high-dimensional spaces helps people explore data from many different angles.