04 Hdbscan World


Clustering with HDBSCAN

Now that we've mapped our data to a lower-dimensional space with UMAP, we can begin clustering with HDBSCAN.

We install the library via pip install hdbscan.

[1]

Load the 2D Earth cities data.

[2]

Fit to the data with HDBSCAN.

[3]
[4]
(9083, 2)
[5]
array([[18.177887 ,  5.1304684],
       [18.70455  ,  5.9466434]])
[6]

Results are now stored in the labels_ attribute of the clusterer. Each point gets an integer cluster label, and -1 marks points that HDBSCAN treats as outliers.
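A quick way to summarize a label array like this is to count the distinct clusters and the outliers separately. The sketch below uses a short hypothetical list of labels rather than the real output, which is truncated above.

```python
from collections import Counter

# Hypothetical labels in the same format as clusterer.labels_
# (-1 means "outlier / no cluster").
labels = [-1, 0, 0, 1, -1, 2, 1, 0]

counts = Counter(labels)
n_noise = counts.pop(-1, 0)   # points not assigned to any cluster
n_clusters = len(counts)      # remaining keys are real cluster ids

print(n_clusters, n_noise)    # 3 2
```

With a real clusterer, the same counting can be applied to clusterer.labels_.tolist().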

[7]
array([ -1, 225, 227, ...,  -1,  76,  -1])
[8]

We can view the condensed tree plot, which shows how points drop out of clusters and how clusters split as the λ value (the inverse of the distance threshold) increases. This visualizes the process HDBSCAN uses to identify clusters: at a high level, it selects the branches of the tree with the largest area (those circled).

[9]
[Figure: condensed tree plot, y-axis "λ value"]

In this case the algorithm has chosen tiny clusters, so small that we cannot even see the circled clusters in the condensed tree plot (those red lines are circles...). By default, the minimum number of points needed to "create" a cluster is just 5. For a dataset of ~10K points, where we are aiming to produce ~6 continent clusters, this is very small.

We can therefore increase this min_cluster_size to return better results.

[12]

Already this looks much better. There is a clear distinction between most continents, with the exception of Europe and Asia, which belong to the same landmass and are therefore more difficult to separate. Interestingly, the outliers (-1) identified by HDBSCAN are all island nations. Again this makes sense, as they are more geographically isolated.

If we take a look at the condensed tree plot we can also see a clearer tree.

[13]
[Figure: condensed tree plot, y-axis "λ value"]
[14]

When compared against the unclustered data, it's interesting to see how well the algorithm works.

[15]