Clustering

K-means Clustering in Python using OpenAI

We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the Get_embeddings_from_dataset Notebook.

[2]
(1000, 1536)

1. Find the clusters using K-means

We show the simplest use of K-means. You can pick the number of clusters that fits your use case best.

[3]
/opt/homebrew/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
Cluster
,0    4.105691
,1    4.191176
,2    4.215613
,3    4.306590
,Name: Score, dtype: float64
[4]
Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')
Output

Visualization of clusters in a 2d projection. In this run, the green cluster (#1) seems quite different from the others. Let's see a few samples from each cluster.

2. Text samples in the clusters & naming the clusters

Let's show random samples from each cluster. We'll use gpt-4 to name the clusters, based on a random sample of 5 reviews from that cluster.

[6]
Cluster 0 Theme: The theme of these customer reviews is food products purchased on Amazon.
5, Loved these gluten free healthy bars, saved $$ ordering on Amazon:   These Kind Bars are so good and healthy & gluten free.  My daughter ca
1, Should advertise coconut as an ingredient more prominently:   First, these should be called Mac - Coconut bars, as Coconut is the #2
5, very good!!:   just like the runts<br />great flavor, def worth getting<br />I even o
5, Excellent product:   After scouring every store in town for orange peels and not finding an
5, delicious:   Gummi Frogs have been my favourite candy that I have ever tried. of co
----------------------------------------------------------------------------------------------------
Cluster 1 Theme: Pet food reviews
2, Messy and apparently undelicious:   My cat is not a huge fan. Sure, she'll lap up the gravy, but leaves th
4, The cats like it:   My 7 cats like this food but it is a little yucky for the human. Piece
5, cant get enough of it!!!:   Our lil shih tzu puppy cannot get enough of it. Everytime she sees the
1, Food Caused Illness:   I switched my cats over from the Blue Buffalo Wildnerness Food to this
5, My furbabies LOVE these!:   Shake the container and they come running. Even my boy cat, who isn't 
----------------------------------------------------------------------------------------------------
Cluster 2 Theme: All the reviews are about different types of coffee.
5, Fog Chaser Coffee:   This coffee has a full body and a rich taste. The price is far below t
5, Excellent taste:   This is to me a great coffee, once you try it you will enjoy it, this 
4, Good, but not Wolfgang Puck good:   Honestly, I have to admit that I expected a little better. That's not 
5, Just My Kind of Coffee:   Coffee Masters Hazelnut coffee used to be carried in a local coffee/pa
5, Rodeo Drive is Crazy Good Coffee!:   Rodeo Drive is my absolute favorite and I'm ready to order more!  That
----------------------------------------------------------------------------------------------------
Cluster 3 Theme: The theme of these customer reviews is food and drink products.
5, Wonderful alternative to soda pop:   This is a wonderful alternative to soda pop.  It's carbonated for thos
5, So convenient, for so little!:   I needed two vanilla beans for the Love Goddess cake that my husbands 
2, bot very cheesy:   Got this about a month ago.first of all it smells horrible...it tastes
5, Delicious!:   I am not a huge beer lover.  I do enjoy an occasional Blue Moon (all o
3, Just ok:   I bought this brand because it was all they had at Ranch 99 near us. I
----------------------------------------------------------------------------------------------------

It's important to note that clusters will not necessarily match what you intend to use them for. A larger amount of clusters will focus on more specific patterns, whereas a small number of clusters will usually focus on largest discrepencies in the data.