# Data Preprocessing & EDA

## About

Before any ML model can learn, data needs to be understood and cleaned. These two tutorials cover the full preprocessing pipeline, from first-look analysis through cleaning, feature engineering, and dimensionality reduction, applied to realistic, messy datasets.

Both run on CPU nodes and are good entry points for testing Ocean Orchestrator before moving to GPU workloads.

***

### Exploratory Data Analysis

Understanding a dataset before modeling is not optional. This tutorial walks through four foundational EDA steps (type inspection, descriptive statistics, distribution analysis, and correlation analysis), applied to a corporate financial dataset of 500 firms across six industries and eight countries.

The dataset is intentionally imperfect: \~40% of R\&D values are missing, revenue figures are heavily skewed, and several columns contain extreme outliers. These are the conditions you encounter in real financial data, as opposed to clean benchmark datasets.

**What the tutorial covers:**

* Identifying variable types (nominal, ordinal, interval, ratio) and why the distinction matters for choosing the right statistics
* Descriptive statistics: mean, median, standard deviation, and IQR, and when each is the right summary
* Distribution analysis: histograms, box plots, and recognizing skewness vs. symmetry
* Correlation: Pearson vs. Spearman, interpreting heatmaps, avoiding the correlation-causation trap

**Dataset columns:** company name, sector, country, founded year, employees, revenue (USD millions), profit margin, R\&D spending, debt ratio, listed status.
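The four steps listed above can be sketched with pandas on a small synthetic stand-in for the dataset. This is an illustration under stated assumptions, not the tutorial's code: the column names (`sector`, `employees`, `revenue`, `rd_spending`), the random values, and the three-sector list are all made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the firm dataset. Column names and values are
# illustrative assumptions, not the tutorial's real data.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "sector": rng.choice(["tech", "energy", "retail"], size=n),  # nominal
    "employees": rng.integers(10, 5000, size=n),                  # ratio
    "revenue": rng.lognormal(mean=3.0, sigma=1.0, size=n),        # ratio, right-skewed
})
# R&D spending loosely tied to revenue, with roughly 40% of values missing.
df["rd_spending"] = df["revenue"] * rng.uniform(0.01, 0.2, size=n)
df.loc[rng.random(n) < 0.4, "rd_spending"] = np.nan

# 1. Type inspection: dtypes and the share of missing values per column.
print(df.dtypes)
missing_share = df.isna().mean()

# 2. Descriptive statistics for a skewed column.
revenue = df["revenue"]
iqr = revenue.quantile(0.75) - revenue.quantile(0.25)
summary = {
    "mean": revenue.mean(),
    "median": revenue.median(),
    "std": revenue.std(),
    "iqr": iqr,
}

# 3. Distribution shape: positive skewness indicates a long right tail.
skewness = revenue.skew()

# 4. Correlation: Pearson (linear) vs. Spearman (rank-based, monotonic).
numeric = df[["employees", "revenue", "rd_spending"]]
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")
```

On right-skewed data like this, the mean sits above the median, which is one reason the median and IQR are often the more robust summary. Spearman correlation is similarly less sensitive to extreme values than Pearson.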

**Tutorial source:** [github.com/oceanprotocol/oncompute-tutorials/tree/main/Data%20Preprocessing%20Exploration%20and%20Statistical%20Inference/Data%20Types%20and%20Exploratory%20Analysis](https://github.com/oceanprotocol/oncompute-tutorials/tree/main/Data%20Preprocessing%20Exploration%20and%20Statistical%20Inference/Data%20Types%20and%20Exploratory%20Analysis)

***

### Data Cleaning, Feature Engineering & Dimensionality Reduction

Raw data is almost never model-ready. This tutorial covers the full preparation pipeline in the order a practitioner encounters it, applied to a multi-source employee dataset with messy string fields, duplicate rows, and misaligned joins across three separate files.

**What the tutorial covers:**

* **Cleaning**: lowercase normalization, whitespace and symbol stripping, duplicate detection and removal, cross-file join alignment
* **Missing data**: identifying missing value mechanisms (MCAR, MAR, MNAR), visualizing missingness patterns with `missingno`, imputation strategies (mean/median/mode, model-based)
* **Feature engineering**: encoding categorical variables (one-hot, ordinal, target encoding), scaling numerical features (standard, min-max, robust), interaction features
* **Dimensionality reduction**: PCA mechanics and variance explained, t-SNE for visualization, when to reduce and when not to
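A minimal sketch of the four stages above, using pandas and NumPy on made-up records. It shows one option per stage (median imputation, one-hot encoding, standard scaling, PCA via SVD) and is not the tutorial's implementation, which may rely on other tools such as scikit-learn. The two small tables, their column names, and the helper `clean_key` are hypothetical.

```python
import numpy as np
import pandas as pd

# Two made-up "files" with messy join keys. Names and values are hypothetical.
employees = pd.DataFrame({
    "name": [" Alice ", "BOB", "bob", "Carol#", "Dan"],
    "salary": [52000.0, 61000.0, 61000.0, np.nan, 75000.0],
    "tenure": [2.0, 5.0, 5.0, 7.0, 10.0],
})
departments = pd.DataFrame({
    "name": ["alice", "Bob ", "carol", "dan"],
    "dept": ["sales", "eng", "eng", "ops"],
})

def clean_key(s: pd.Series) -> pd.Series:
    """Lowercase, remove symbols, then strip surrounding whitespace."""
    return s.str.lower().str.replace(r"[^a-z0-9 ]", "", regex=True).str.strip()

# Cleaning: normalize keys, drop exact duplicate rows, then align the join.
employees["name"] = clean_key(employees["name"])
departments["name"] = clean_key(departments["name"])
employees = employees.drop_duplicates().reset_index(drop=True)
merged = employees.merge(departments, on="name", how="left", validate="many_to_one")

# Missing data: median imputation (one simple strategy among several).
merged["salary"] = merged["salary"].fillna(merged["salary"].median())

# Feature engineering: one-hot encode the category, standard-scale the numerics.
features = pd.get_dummies(
    merged[["salary", "tenure", "dept"]], columns=["dept"], dtype=float
)
numeric_cols = ["salary", "tenure"]
features[numeric_cols] = (
    features[numeric_cols] - features[numeric_cols].mean()
) / features[numeric_cols].std(ddof=0)

# Dimensionality reduction: PCA via SVD of the centered feature matrix.
X = features.to_numpy()
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
projected = X_centered @ components[:2].T  # keep the first two components
```

Passing `validate="many_to_one"` to `merge` makes pandas raise an error if the right-hand table has duplicate keys, which surfaces misaligned joins early instead of silently multiplying rows.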

**Tutorial source:** [github.com/oceanprotocol/oncompute-tutorials/tree/main/Data%20Preprocessing%20Exploration%20and%20Statistical%20Inference/Data%20Cleaning%20%26%20Transformation](https://github.com/oceanprotocol/oncompute-tutorials/tree/main/Data%20Preprocessing%20Exploration%20and%20Statistical%20Inference/Data%20Cleaning%20%26%20Transformation)

***

### Hardware Requirements

| Resource | Requirement          |
| -------- | -------------------- |
| GPU      | Not required         |
| Runtime  | Under 5 minutes each |

***

### Run It on Ocean Network

1. **Clone the repo**

   ```bash
   git clone https://github.com/oceanprotocol/oncompute-tutorials
   ```
2. **Open the tutorial folder** (`Data Types and Exploratory Analysis/` or `Data Cleaning & Transformation/`, both under `Data Preprocessing Exploration and Statistical Inference/`) in Ocean Orchestrator.
3. **Select a node** at [dashboard.oncompute.ai](https://dashboard.oncompute.ai/). A CPU node works for both tutorials; use **Start Compute Job** to run the job.
4. **Start the job.** The container runs the analysis script against the included dataset files.
5. **Download results.** Output charts, cleaned datasets, and summary statistics download to your `results/` folder when the job completes.

