# Unsupervised Clustering

## About

Clustering is a standard technique in customer segmentation, anomaly detection, document organization, and preprocessing pipelines. This tutorial runs seven clustering algorithms on the Wine dataset, evaluates them with four metrics, and generates a full set of comparison visualizations — all in a single containerized job on Ocean Network.

No local environment setup. No dependency conflicts. Run it once, get the results.

***

### Algorithms Covered

| Algorithm                | Type            | Key characteristic                                                            |
| ------------------------ | --------------- | ----------------------------------------------------------------------------- |
| k-Means++                | Centroid-based  | Minimizes intra-cluster variance; $D^2$ initialization for better seeds       |
| Agglomerative (Ward)     | Hierarchical    | Builds a dendrogram; Ward linkage minimizes variance increase at each merge   |
| Agglomerative (Complete) | Hierarchical    | Merges by maximum pairwise distance; gives compact, similar-diameter clusters |
| BIRCH                    | Summarization   | Designed for large datasets; compresses data into a CF-tree before clustering |
| DBSCAN                   | Density-based   | Finds clusters of arbitrary shape; labels outliers explicitly as noise        |
| HDBSCAN                  | Density-based   | Hierarchy-aware extension of DBSCAN; no global density threshold required     |
| Affinity Propagation     | Message-passing | Points vote for exemplars; number of clusters emerges from the data           |
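
As a rough illustration of the table above, the seven algorithms can be set up with scikit-learn along the following lines. This is a minimal sketch, not the tutorial's `clustering.py`: the hyperparameters (`n_clusters=3`, `eps=2.5`, `min_samples=5`, `min_cluster_size=10`), the standardization step, and the `models` and `labels` names are all assumptions chosen for illustration. `HDBSCAN` is only available in scikit-learn 1.3 and later, so the sketch skips it on older versions.

```python
from sklearn.cluster import (
    AffinityPropagation,
    AgglomerativeClustering,
    Birch,
    DBSCAN,
    KMeans,
)
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

try:
    from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3
except ImportError:
    HDBSCAN = None

# Standardize the 13 Wine features so no single feature dominates distances.
X = StandardScaler().fit_transform(load_wine().data)

# Illustrative settings only; the tutorial script may use different values.
models = {
    "k-Means++": KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0),
    "Agglomerative (Ward)": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "Agglomerative (Complete)": AgglomerativeClustering(n_clusters=3, linkage="complete"),
    "BIRCH": Birch(n_clusters=3),
    "DBSCAN": DBSCAN(eps=2.5, min_samples=5),
    "Affinity Propagation": AffinityPropagation(random_state=0),
}
if HDBSCAN is not None:
    models["HDBSCAN"] = HDBSCAN(min_cluster_size=10)

# Every estimator exposes fit_predict, which returns one label per sample.
labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, y in labels.items():
    n_clusters = len(set(y) - {-1})  # -1 marks noise for DBSCAN and HDBSCAN
    n_noise = int((y == -1).sum())
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points")
```

The centroid-based, hierarchical, and BIRCH models take the number of clusters as an input, while DBSCAN, HDBSCAN, and Affinity Propagation infer it from the data, which is why the sketch counts clusters after fitting.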

***

### Evaluation Metrics

The script evaluates every algorithm on four metrics:

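* **Silhouette Score**: how close each point is to its own cluster compared with the nearest other cluster (higher is better, range −1 to +1)
* **Davies-Bouldin Index**: average ratio of within-cluster scatter to between-cluster separation (lower is better, minimum 0)
* **Calinski-Harabasz Index**: ratio of between-cluster to within-cluster dispersion (higher is better)
* **Adjusted Rand Index (ARI)**: agreement with the true Wine cultivar labels, corrected for chance (higher is better; 1 is perfect agreement, values near 0 are chance level)

The first three are internal metrics computed from the data and cluster labels alone; ARI is the only one that uses the ground-truth labels.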
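
The four metrics map directly onto functions in `sklearn.metrics`. The sketch below shows one way to compute them for a single clustering. The `evaluate` helper is hypothetical, and dropping noise points (label `-1`) before computing the internal metrics is an assumption of this sketch; the tutorial script may treat noise differently.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler


def evaluate(X, y_pred, y_true):
    """Hypothetical helper: return the four metrics for one clustering."""
    # ARI compares against the true labels and is defined for any labeling.
    scores = {"ari": adjusted_rand_score(y_true, y_pred)}

    # Assumption: exclude noise points (-1) from the internal metrics.
    mask = y_pred != -1
    n_clusters = len(set(y_pred[mask]))

    # The internal metrics need at least 2 clusters and fewer clusters than points.
    if 2 <= n_clusters < mask.sum():
        scores["silhouette"] = silhouette_score(X[mask], y_pred[mask])
        scores["davies_bouldin"] = davies_bouldin_score(X[mask], y_pred[mask])
        scores["calinski_harabasz"] = calinski_harabasz_score(X[mask], y_pred[mask])
    return scores


wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

scores = evaluate(X, y_pred, wine.target)
for metric, value in scores.items():
    print(f"{metric}: {value:.3f}")
```

Algorithms that return a single cluster, or only noise, get an ARI but no internal scores in this sketch, which keeps them from raising errors in a comparison loop.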

***

### Outputs

All plots are saved to `outputs/clustering/` and downloaded automatically when the job completes:

* Scatter plots of each algorithm's clustering in PCA-reduced 2D space
* Bar charts comparing all four metrics side by side
* Silhouette coefficient plots per algorithm
* Truncated dendrogram (Ward linkage)
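
Two of the plots listed above, the PCA scatter plot and the truncated Ward dendrogram, could be produced roughly as follows. This is a sketch under assumptions: the file names `kmeans_pca_scatter.png` and `ward_dendrogram.png`, the figure sizes, and the choice of `p=12` leaf nodes are hypothetical and not taken from the tutorial script.

```python
import os

import matplotlib

matplotlib.use("Agg")  # headless backend, suitable for a container job
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

out_dir = os.path.join("outputs", "clustering")
os.makedirs(out_dir, exist_ok=True)

X = StandardScaler().fit_transform(load_wine().data)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Scatter plot: cluster in the full feature space, project to 2D only for display.
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)
fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis", s=25)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("k-Means++ clusters in PCA space")
scatter_path = os.path.join(out_dir, "kmeans_pca_scatter.png")  # hypothetical name
fig.savefig(scatter_path, dpi=150, bbox_inches="tight")
plt.close(fig)

# Dendrogram: Ward linkage, truncated to the last 12 merged clusters.
Z = linkage(X, method="ward")
fig, ax = plt.subplots(figsize=(8, 4))
dendrogram(Z, truncate_mode="lastp", p=12, ax=ax)
ax.set_title("Truncated dendrogram (Ward linkage)")
ax.set_ylabel("Merge distance")
dendro_path = os.path.join(out_dir, "ward_dendrogram.png")  # hypothetical name
fig.savefig(dendro_path, dpi=150, bbox_inches="tight")
plt.close(fig)
```

Truncating with `truncate_mode="lastp"` keeps the plot readable for 178 samples by showing only the top of the hierarchy.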

***

### Hardware Requirements

| Resource | Requirement     |
| -------- | --------------- |
| GPU      | Not required    |
| Runtime  | Under 5 minutes |

This is a good workload for testing a CPU node on a free compute job before moving to GPU workloads.

***

### Run It on Ocean Network

1. **Clone the repo**

   ```bash
   git clone https://github.com/oceanprotocol/oncompute-tutorials
   ```
2. **Open the `Machine Learning Foundations and Introduction to LLMs/Clustering/` folder** in Ocean Orchestrator.
3. **Select a node** at [dashboard.oncompute.ai](https://dashboard.oncompute.ai/). A CPU node is sufficient; use **Start Compute Job** to run the job.
4. **Start the job** — the container installs scikit-learn, numpy, matplotlib, seaborn, and scipy, then runs `clustering.py`.
5. **Download results** — all plots and metric tables download to your `results/` folder when the job completes.

**Tutorial source:** [github.com/oceanprotocol/oncompute-tutorials/tree/main/Machine%20Learning%20Foundations%20and%20Introduction%20to%20LLMs/Clustering](https://github.com/oceanprotocol/oncompute-tutorials/tree/main/Machine%20Learning%20Foundations%20and%20Introduction%20to%20LLMs/Clustering)


