Data Preparation¶

In this section, we will prepare our data for the clustering algorithm. We'll start by loading our data and taking a quick look to understand its structure.

Sample Data¶

For this article, I will be using an example dataset from Kaggle. This dataset contains only 642 items, though the same concept was used in production on a 5-million-item dataset for an off-road vehicle parts company I worked with during my time at Publicis Sapient.

In [ ]:

# Import required libraries
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import pandas as pd
import openai


# Load the car parts data from a JSON file
with open('./parts_data.json', 'r') as f:
    car_parts_data = json.load(f)

Here is a small subset of the data¶

["Performance Battery", "Airbag sensors", "Brake roll", "Steering column", "Bell housing", "Liquefied petroleum gas ", "Knock sensor", "Shoe return spring", "Water pipe", "Center console ", "MAP sensor", "management system", "Type 2 connector", "Fan ditch", "Radiator shroud", "Air blower", "Outer door handle", "Steering arm", "Clutch pedal", "Fuel tank cover", "Cylinder head", "Valve housing", "Control arm", "rear main seal", "boot", "Hinge", "Tachometer ", "Speaker", "Camshaft follower", "A/C Compressor", "fuel filler cap", "air dam", "Leaf", "Transmission gear", "Brake booster hose", "Air intake manifold", "Fuel line", "Flywheel ring gear", "Distributor", "Gear stick ", "hatch", "Tyre", "Antenna cable", "recirculating ball", "Tire", "Hinges and springs", "Brake cooling duct", "gear shifter", "Oil pipe", "Radio and media player", "Front seat", "LPG ", "Steering rack ", "Washer", "Engine control", "Heat sleeving ", "Valve cover", "Battery Cable", "Battery tray", "Mounting", "Radiator pressure cap", "Hood release cable", "Transmission yoke", "Steering box", "Charger", "Bench seat", "A/C INNER PLATE", "Trailing arm", "Seat cover", "Water neck o-ring", "Exhaust flange gasket", "Camshaft fastener", "sprocket", "speed of meter sensor", "Brake pad", "rear side", "Cowl screen", "Starter solenoid", "Caliper", "Air spring", "locks", "Switch cover", ]
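The subset above contains trailing spaces and mixed casing (for example "Tachometer " and "LPG "). A minimal sketch of a tidy-up pass, assuming you want to collapse whitespace and drop case-insensitive duplicates before clustering; the helper name `normalize_part_names` is hypothetical and not part of the original notebook:

```python
def normalize_part_names(names):
    """Collapse whitespace and drop case-insensitive duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for name in names:
        tidy = " ".join(name.split())  # Trims the ends and collapses inner runs of spaces
        key = tidy.lower()
        if tidy and key not in seen:
            seen.add(key)
            cleaned.append(tidy)
    return cleaned


sample = ["Liquefied petroleum gas ", "LPG ", "Tachometer ", "tachometer", "Tyre", "Tire"]
print(normalize_part_names(sample))
```

Note that spelling variants such as "Tyre" and "Tire" survive this pass; merging those would need a synonym list, which is outside what this sketch covers.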

Feature Extraction¶

Before we can use any machine learning algorithms, we need to convert our text data into a numerical format. We'll use the TF-IDF (Term Frequency-Inverse Document Frequency) method for this purpose.

In our production dataset, we had many more text features, such as descriptions, reviews, and installation guides, which made TF-IDF really shine: its strength is down-weighting terms that are common across documents and highlighting the distinctive ones in large bodies of text. Without access to that dataset, I cannot show the full effect here, though this example should still give a good understanding.

In [ ]:

# Step 1: Text Preprocessing
# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')

# Step 2: Feature Extraction
# Compute TF-IDF features for the car parts
tfidf_matrix = tfidf_vectorizer.fit_transform(car_parts_data)
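To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on five part names. It assumes plain whitespace tokenization and no stop-word removal, which is simpler than what `TfidfVectorizer` does, and it uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents for its defaults. The function name `tfidf_vectors` is illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


parts = ["Front seat", "Bench seat", "Seat cover", "Valve cover", "Fuel tank"]
vectors = tfidf_vectors(parts)
for part, vec in zip(parts, vectors):
    print(part, {term: round(w, 2) for term, w in vec.items()})
```

In "Front seat", the term "front" ends up with a higher weight than "seat" because "seat" appears in three of the five names and is therefore less distinctive.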

Hyperparameter Tuning¶

We can then use the Silhouette score to help us choose the number of clusters, K, for our dataset.
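For intuition, the Silhouette value of a single point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the points in the nearest other cluster. The overall score is the mean over all points and ranges from -1 to 1. The sketch below computes it by hand on four toy 2-D points; it assumes Euclidean distance and at least two clusters, and the function names are illustrative.

```python
import math


def silhouette_of_point(index, points, labels):
    """Silhouette value s = (b - a) / max(a, b) for one point."""
    own = labels[index]
    dist_by_cluster = {}
    for j, (point, label) in enumerate(zip(points, labels)):
        if j == index:
            continue
        dist_by_cluster.setdefault(label, []).append(math.dist(points[index], point))
    if own not in dist_by_cluster:
        return 0.0  # Convention: a point alone in its cluster scores 0
    a = sum(dist_by_cluster[own]) / len(dist_by_cluster[own])
    b = min(sum(d) / len(d) for label, d in dist_by_cluster.items() if label != own)
    return (b - a) / max(a, b)


def mean_silhouette(points, labels):
    """Average the per-point silhouette values."""
    return sum(silhouette_of_point(i, points, labels) for i in range(len(points))) / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # Neighbors share a cluster
bad = mean_silhouette(points, [0, 1, 0, 1])   # Neighbors are split apart
print(round(good, 3), round(bad, 3))
```

The sensible grouping scores close to 1, while the grouping that splits neighbors apart scores below 0.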

In [ ]:

from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Try every cluster count from 2 to 59 and record its Silhouette score
K_range = range(2, 60)
sil_scores = []

for K in K_range:
    kmeans = KMeans(n_clusters=K, random_state=42).fit(tfidf_matrix)
    cluster_assignments = kmeans.labels_
    sil_score = silhouette_score(tfidf_matrix, cluster_assignments)
    sil_scores.append(sil_score)

# Plotting
plt.figure(figsize=(12, 6))
plt.plot(K_range, sil_scores, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs Number of Clusters')
plt.grid(True)
plt.show()

[Figure: Silhouette Score vs Number of Clusters]

The graph displays Silhouette scores for varying numbers of clusters. The score peaks at around 37 clusters, which suggests 37 as the best cluster count for this dataset. Scores decline after the peak, indicating that cluster quality diminishes as more clusters are added.
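Rather than reading the peak off the plot, the best K can also be picked programmatically. A minimal sketch; the helper name `best_cluster_count` is hypothetical and the scores below are made-up values for illustration, not the ones from the plot:

```python
def best_cluster_count(k_values, scores):
    """Return the K whose Silhouette score is highest."""
    best_k, _ = max(zip(k_values, scores), key=lambda pair: pair[1])
    return best_k


# Hypothetical scores for illustration only
example_k = range(2, 7)
example_scores = [0.11, 0.18, 0.24, 0.21, 0.19]
print(best_cluster_count(example_k, example_scores))
```

With the variables from the tuning cell, the call would be `best_cluster_count(K_range, sil_scores)`.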

Clustering¶

With our features ready, we can now move on to clustering the data. We'll use the K-means clustering algorithm. K-means is an iterative algorithm that divides n data points into K non-overlapping subgroups by repeatedly assigning each point to its nearest cluster center and then recomputing the centers.

In [ ]:

kmeans = KMeans(n_clusters=37, random_state=42)
                        
kmeans.fit(tfidf_matrix)

# Get the cluster assignments for each car part
cluster_assignments = kmeans.labels_
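For intuition about what `KMeans` does internally, here is a bare-bones sketch of the assign-then-update loop (Lloyd's algorithm) on toy 2-D points. It is a simplification under stated assumptions: it seeds the centroids with the first `k` points and runs a fixed number of iterations, whereas scikit-learn uses k-means++ initialization by default and stops on convergence.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]  # Naive init: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # Keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


points = [(0.0, 0.0), (5.0, 5.0), (0.5, 0.2), (5.2, 4.8), (0.1, 0.4), (4.9, 5.3)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The points near the origin end up in one cluster and the points near (5, 5) in the other.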

In [ ]:

# Combine the original car parts data with their corresponding cluster assignments
clustered_data = pd.DataFrame({
    'Car Part': car_parts_data,
    'Cluster': cluster_assignments
})

# Group the data by the cluster assignments and display the car parts in each cluster
cluster_groups = clustered_data.groupby('Cluster')['Car Part'].apply(list)

# Display the car parts in the first 3 clusters
for cluster, parts in cluster_groups.head(3).items():
    print(f"Cluster {cluster}:")
    print(", ".join(parts[:10]))  # Display first 10 parts in each cluster for brevity
    print("...")
    print()

Cluster 0:
Front seat, Bench seat, Seat cover, Seat track, Back seat, seat , Seat belt, Seat bracket, Bucket seat
...

Cluster 1:
Fuel tank cover, Hydrogen tank, Overflow tank, Fuel tank, Water tank
...

Cluster 2:
A/C INNER PLATE, registration plate lamp, Name plate, License plate bracket, number plate lamp , Brake backing plate, plate lamp, License plate lamp
...

Label Generation Using Language Models¶

As you can see, the clustering worked well. Each cluster now needs a label that describes what kind of data points it groups together. We'll use a language model to generate these labels automatically.

In [ ]:

# Initialize OpenAI API
# Note: openai.Completion is the legacy interface and requires openai<1.0
openai.api_key = "omitted"

def generate_cluster_label(cluster_items):
    # Prepare text prompt for the language model (first 5 items of the cluster only)
    prompt = f"These are some items in a cluster: {', '.join(cluster_items[:5])}. What would be a suitable label for this cluster?"

    # Query the language model to generate a label
    response = openai.Completion.create(engine="text-davinci-002", prompt=prompt, max_tokens=10)

    # Extract and return the label from the language model's response
    return response.choices[0].text.strip()

# Generate labels for each cluster
generated_labels = {}
for cluster, items in cluster_groups.items():
    label = generate_cluster_label(items)
    generated_labels[cluster] = label

# Update the DataFrame with generated labels
clustered_data['Generated Cluster Label'] = clustered_data['Cluster'].map(generated_labels)

# Display the DataFrame with generated labels
print(clustered_data.sample(10))

                  Car Part  Cluster  \
141          Radiator hose       25   
101             Wheel stud        6   
609       Rubber extruded         6   
294  Transmission computer       30   
268                Valance        6   
199       head cover parts       38   
307      Windshield washer       35   
215          driving wheel        6   
56             Valve cover       38   
499         RR Side Sensor       32   

                              Generated Cluster Label  
141                                  automotive hoses  
101                         Automotive Safety Systems  
609                         Automotive Safety Systems  
294                              Computer controllers
268                         Automotive Safety Systems  
199                           Valve and Switch Covers  
307                                  Auto maintenance  
215                         Automotive Safety Systems  
56                            Valve and Switch Covers  
499                                    engine sensors
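The sample above mixes label casing ("automotive hoses" next to "Automotive Safety Systems"), and raw model output can also carry stray quotes or a trailing period. A small sketch of how the prompt could be built and the returned text tidied without calling the API; both helper names are hypothetical and not part of the original notebook:

```python
def build_label_prompt(cluster_items, max_items=5):
    """Build the labeling prompt from the first few items of a cluster."""
    shown = ", ".join(item.strip() for item in cluster_items[:max_items])
    return (
        f"These are some items in a cluster: {shown}. "
        "What would be a suitable label for this cluster?"
    )


def normalize_label(raw_label):
    """Strip quotes, trailing periods, and extra spaces; capitalize the first letter."""
    label = raw_label.strip().strip("\"'").rstrip(".").strip()
    label = " ".join(label.split())
    return label[:1].upper() + label[1:]


print(build_label_prompt(["Front seat", "Bench seat", "seat "]))
print(normalize_label(' "automotive hoses." '))
```

Applying `normalize_label` to each generated label before storing it would keep the `Generated Cluster Label` column consistent.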

Analyzing the Label Generation¶

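Using OpenAI's GPT to label the clusters shows how well unsupervised learning and natural language processing complement each other. Where a cluster is coherent, the generated label is coherent too: cluster 38 (head cover parts, Valve cover) became "Valve and Switch Covers", and cluster 25 (Radiator hose) became "automotive hoses". Where a cluster is a grab bag, the label is less reliable: cluster 6 mixes Wheel stud, Rubber extruded, Valance, and driving wheel under "Automotive Safety Systems", so labels for mixed clusters are worth a spot check. Even so, automated label generation makes the clusters far easier to interpret and paves the way for applications where a human-readable summary of clustered data is beneficial.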

Summary¶

In this article, we tackled the problem of categorizing large datasets using an unsupervised approach. We started by transforming our text data into a numerical format using TF-IDF. Then, we used the K-means clustering algorithm to create clusters from our data. Finally, we employed a Language Model to automatically label these clusters.

This approach is highly scalable and can be used for categorizing large datasets without the need for a labeled training set. Future work could involve experimenting with different clustering algorithms and feature extraction methods.

Additionally, here is how I coded the animation shown at the start!¶

In [ ]:

import matplotlib.pyplot as plt
import matplotlib.animation as animation
import random

# Initialize plot
fig, ax = plt.subplots()
plt.axis('off')

# Sample data: car parts and their respective categories
data = {
    "sensors": ["Airbag sensors", "Transmission", "speed sensor"],
    "engine components": ["engine", "Diesel engine", "petrol engine", "gasoline engine"],
    "starting system": ["Starter", "Starter drive", "starter pinion", "Starter motor"]
}

# Flatten the data for the initial state and shuffle it
flattened_data = [item for sublist in data.values() for item in sublist]
random.shuffle(flattened_data)

# Initial coordinates: a grid with 4 items per row
initial_coords = [(0.2 + (i % 4) * 0.2, 0.8 - (i // 4) * 0.2) for i in range(len(flattened_data))]

# Final position and color of each part, keyed by part name so that the
# shuffled order still sends every part to its own cluster
cluster_centers = [(0.2, 0.7), (0.5, 0.2), (0.8, 0.7)]
cluster_colors = ['red', 'green', 'blue']
final_position = {}
final_color = {}
for (x, y), parts, color in zip(cluster_centers, data.values(), cluster_colors):
    y_start = y + len(parts) * 0.05 - 0.1
    for i, part in enumerate(parts):
        final_position[part] = (x, y_start - i * 0.1)
        final_color[part] = color
final_coords = [final_position[part] for part in flattened_data]
final_colors = [final_color[part] for part in flattened_data]

def update(frame):
    ax.clear()
    plt.axis('off')

    if frame <= 10:  # Frames 0-10: hold the initial shuffled grid
        for (x, y), part in zip(initial_coords, flattened_data):
            plt.text(x, y, part, ha='center', fontsize=8, color='black')
    elif frame <= 59:  # Frames 11-59: move each part toward its cluster
        t = (frame - 10) / 49  # Runs from about 0.02 at frame 11 to 1 at frame 59
        for (x1, y1), (x2, y2), part, color in zip(initial_coords, final_coords, flattened_data, final_colors):
            x = x1 + t * (x2 - x1)
            y = y1 + t * (y2 - y1)
            plt.text(x, y, part, ha='center', fontsize=8, color='black' if frame < 59 else color)
    elif frame <= 75:  # Frames 60-75: items in their cluster colors, no labels yet
        for (x, y), part, color in zip(final_coords, flattened_data, final_colors):
            plt.text(x, y, part, ha='center', fontsize=8, color=color)
    else:  # Frames 76-99: add a bold label above each cluster
        for (x, y), part, color in zip(final_coords, flattened_data, final_colors):
            plt.text(x, y, part, ha='center', fontsize=8, color=color)
        for (x, y), category, color in zip(cluster_centers, data.keys(), cluster_colors):
            plt.text(x, y + 0.25, category, ha='center', fontsize=12, fontweight='bold', color=color)

# 100 frames total: 11 hold frames, 49 transition frames, 16 colored frames, 24 labeled frames
ani = animation.FuncAnimation(fig, update, frames=100, repeat=False)

ani.save('final_animated_car_parts_clustering.gif', writer='pillow', fps=30)

How to Categorize Large Datasets Unsupervised

Written by Eric Detjen
