The article presents a BERT-based question recognition system that groups unlabeled questions into clusters, so no labeled training data is required.
The system loads a dataset of questions, cleans the text with regular expressions, and encodes it with the BERT natural language processing model to create embeddings.
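As a minimal sketch of the regular-expression cleaning step, the snippet below lowercases each question, removes URLs and stray punctuation, and collapses whitespace. The function name `clean_question`, the exact patterns, and the sample questions are assumptions for illustration; the article's own cleaning rules may differ.

```python
import re

def clean_question(text: str) -> str:
    """Normalize one question string before it is embedded (illustrative rules)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s?']", " ", text)  # keep letters, digits, '?' and apostrophes
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()

# Hypothetical sample questions, not taken from the article's dataset.
questions = [
    "How do I  reset my PASSWORD?!",
    "Visit https://example.com -- is it down?",
]
cleaned = [clean_question(q) for q in questions]
print(cleaned)
```

The cleaned strings would then be passed to the BERT encoder to obtain one embedding vector per question.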
The embeddings are then clustered using the K-means algorithm, after which each cluster is manually assigned a category for easy interpretation.
Next, the question features are reduced with PCA and plotted to visualize the clusters.
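The clustering and dimensionality-reduction steps can be sketched with scikit-learn as follows. Synthetic 768-dimensional vectors stand in for real BERT embeddings, and the number of clusters (3) is an assumption chosen for the example rather than a value from the article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Stand-in for BERT sentence embeddings: three synthetic groups in 768 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 768))
embeddings = np.vstack([c + rng.normal(size=(20, 768)) for c in centers])

# Group the embeddings into k clusters (k=3 is an assumption for this sketch).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Project to two dimensions purely for visualization.
pca = PCA(n_components=2, random_state=0)
points_2d = pca.fit_transform(embeddings)

print(labels.shape, points_2d.shape)
```

The two columns of `points_2d` can be drawn as a scatter plot colored by `labels` to inspect how well the clusters separate.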
The final category results are exported to CSV, and metrics are used to evaluate clustering quality.
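A hedged sketch of the evaluation and export step is shown below. The source does not name its metrics, so the silhouette score is used here only as one common choice; the questions and the placeholder embeddings are likewise made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical cleaned questions.
questions = [
    "how do i reset my password?",
    "where can i change my password?",
    "i forgot my password, what now?",
    "when will my order arrive?",
    "how do i track my order?",
    "why is my order delayed?",
]

# Placeholder embeddings: two well-separated synthetic groups, not real BERT output.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(3, 8)),
    rng.normal(loc=3.0, scale=0.1, size=(3, 8)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

# Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters.
score = silhouette_score(embeddings, labels)

# Collect results; to_csv returns a string here, or writes a file if given a path.
results = pd.DataFrame({"question": questions, "cluster": labels})
csv_text = results.to_csv(index=False)
print(f"silhouette: {score:.3f}")
```

Passing a file path as the first argument of `to_csv` writes the same content to disk instead of returning it.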
The article also offers insight into how this system can help evaluate product or customer success through feedback and guide work on existing issues.
Libraries like 're', 'pandas', and 'sklearn' are used for cleaning, data manipulation, and clustering.
The project also leverages a BERT natural language processing library, along with GPUs for fast processing.
A mapping from cluster labels to descriptive categories is applied, and a sample of questions is checked to verify that the assignments are accurate.
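The label-to-category mapping and the spot check might look like the sketch below. The category names, the `cluster_to_category` dictionary, and the sample rows are hypothetical; in practice the names would be chosen after reading questions from each cluster.

```python
import pandas as pd

# Hypothetical clustering output: one row per question with its cluster label.
results = pd.DataFrame({
    "question": [
        "how do i reset my password?",
        "where can i change my password?",
        "when will my order arrive?",
        "how do i track my order?",
    ],
    "cluster": [0, 0, 1, 1],
})

# Manually chosen names for each cluster (assumed for illustration).
cluster_to_category = {0: "Account access", 1: "Order status"}
results["category"] = results["cluster"].map(cluster_to_category)

# Spot check: draw one random question per category for manual review.
sample = results.groupby("category").sample(n=1, random_state=0)
print(sample[["category", "question"]])
```

Any cluster label missing from the dictionary would map to `NaN`, which makes unmapped clusters easy to detect before export.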
The goal is to extract the semantics of the text and simplify the mapping process for downstream applications.