The Problem¶
There are many categorization models available for labeling data. However, most supervised models require large training datasets. This limits their applicability to scenarios where an existing labeled dataset is available.
A straightforward alternative is implementing the popular K-means clustering algorithm. The algorithm categorizes data into groups, which can then be manually labeled and used to train a supervised model.
While effective, this approach still necessitates the manual process of labeling the clusters. This step can be circumvented by using a clustering model trained on a small labeled dataset to label the rest of the data. However, this would still be a semi-supervised approach, requiring some labeled data.
The Solution¶
The solution I propose involves using Large Language Models(LLMs). You may initially think that LLMs would be unsuitable for categorizing large datasets due to their context limitations. That is why we will not use the language models to categorize the data directly.
Instead, we'll employ the K-means clustering algorithm to group the data initially. Following that, we'll use the language model to categorize these groups. This hybrid approach allows us to categorize large datasets without needing a vast labeled dataset.
Note: While language models are not unsupervised, they don't require us to provide our labeled dataset, effectively solving our problem. Additionally, since the K-means clustering algorithm performs the majority of the work, it's reasonable to consider this solution as primarily unsupervised.