---
title: "Regression Trees"
subtitle: "Advanced Numerical Data Analysis"
author: "Dr Muhammad Saufi"
date: last-modified
format: 
  html:
    toc: true
    toc-title: Contents
    toc-location: left
    toc-depth: 5
    toc-expand: 1
    number-sections: true
    code-fold: true
    code-summary: "Code"
    code-link: true
    theme:
      light: united
      dark: cyborg
    css: styles.css
    self-contained: true
editor: visual
include-after-body: "footer.html"
---

# Sum of Squared Errors (SSE) Formula

SSE measures the discrepancy between the observed data and a model's predictions. In decision tree regression, it is used to score a candidate split: the smaller the SSE after the split, the better the split fits the data.

$$
\text{SSE} = \sum_{j} \sum_{i \in c_j} (y_i - \mu_{c_j})^2
$$

-   $j$: Index of the group (node) produced by the split.
-   $i \in c_j$: Index of the data points that fall in the $j$-th group, $c_j$.
-   $y_i$: Observed value of the $i$-th data point.
-   $\mu_{c_j}$: Mean of the observed values $y_i$ in group $c_j$, which is also the tree's prediction for every point in that group.
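
The formula can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `sse_of_groups` and the toy numbers are hypothetical and are not taken from the table used in this document.

```python
from statistics import fmean


def sse_of_groups(groups):
    """Sum, over all groups, of squared deviations from each group's own mean."""
    total = 0.0
    for values in groups:
        mu = fmean(values)  # mean of the observed values in this group
        total += sum((y - mu) ** 2 for y in values)
    return total


# Toy data (hypothetical): two groups produced by some split
toy_groups = [[1.0, 3.0], [10.0, 10.0, 13.0]]
toy_sse = sse_of_groups(toy_groups)
print(toy_sse)  # 8.0  (2.0 from the first group + 6.0 from the second)
```

Each group must contain at least one value, since `fmean` raises an error on an empty sequence.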

# Example Calculation

![](Inputs/Decision%20Tree%20Table.png){fig-align="center"}

The tree splits the data on the condition "Year \> 2010".

## Splitting Data

![](Inputs/Decision%20Tree%20Nodes.png){fig-align="center"}

-   **Left Node (Year \> 2010)**: 2015, 2012, 2018, 2014, 2011
-   **Right Node (Year ≤ 2010)**: 2010, 2000, 2008
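
The split itself can be sketched as a simple filter. The target values below are the ones used in the node calculations in this document. How each value is paired with a year *within* a node is an assumption made for illustration, and the helper name `split_on_year` is hypothetical. The assumption does not change which values end up in which node.

```python
# (year, value) pairs. Pairing within each node is assumed, not read from the table.
records = [
    (2015, 0.35), (2012, 0.25), (2018, 0.40), (2014, 0.27), (2011, 0.26),
    (2010, 0.20), (2000, 0.15), (2008, 0.45),
]


def split_on_year(rows, threshold):
    """Return (left, right) target values for the rule year > threshold."""
    left = [value for year, value in rows if year > threshold]
    right = [value for year, value in rows if year <= threshold]
    return left, right


left_values, right_values = split_on_year(records, 2010)
print(len(left_values), len(right_values))  # 5 3
```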

### Left Node Data

Mean of Left Node:

$$
\mu_{left} = \frac{0.35 + 0.25 + 0.40 + 0.27 + 0.26}{5} = \frac{1.53}{5} = 0.306
$$

SSE for Left Node:

$$
\begin{aligned}
\text{SSE}_{left} &= (0.35 - 0.306)^2 + (0.25 - 0.306)^2 + (0.40 - 0.306)^2 + (0.27 - 0.306)^2 + (0.26 - 0.306)^2 \\
&= (0.044)^2 + (-0.056)^2 + (0.094)^2 + (-0.036)^2 + (-0.046)^2 \\
&= 0.001936 + 0.003136 + 0.008836 + 0.001296 + 0.002116 \\
&= 0.01732
\end{aligned}
$$
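
The left-node arithmetic can be checked with a short script. This is only a sketch, and the variable names are chosen here for illustration.

```python
from statistics import fmean

# Target values in the left node (Year > 2010), as used in the calculation above
left_values = [0.35, 0.25, 0.40, 0.27, 0.26]

mu_left = fmean(left_values)                              # node mean
sse_left = sum((y - mu_left) ** 2 for y in left_values)   # squared deviations

print(round(mu_left, 3), round(sse_left, 5))  # 0.306 0.01732
```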

### Right Node Data

Mean of Right Node:

$$
\mu_{right} = \frac{0.20 + 0.15 + 0.45}{3} = \frac{0.80}{3} \approx 0.267
$$

SSE for Right Node:

$$
\begin{aligned}
\text{SSE}_{right} &\approx (0.20 - 0.267)^2 + (0.15 - 0.267)^2 + (0.45 - 0.267)^2 \\
&= (-0.067)^2 + (-0.117)^2 + (0.183)^2 \\
&= 0.004489 + 0.013689 + 0.033489 \\
&\approx 0.05167
\end{aligned}
$$
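
Because the right-node mean is rounded to three decimals, it is worth cross-checking the result. One way, offered here as an optional check, uses the identity $\sum_i (y_i - \mu)^2 = \sum_i y_i^2 - n\mu^2$ with the unrounded mean $0.80/3$:

$$
\text{SSE}_{right} = (0.20^2 + 0.15^2 + 0.45^2) - 3\left(\frac{0.80}{3}\right)^2 = 0.265 - \frac{0.64}{3} \approx 0.05167
$$

The two results agree to five decimal places, so rounding the mean has no visible effect here.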

## Total SSE

$$
\text{SSE}_{total} = \text{SSE}_{left} + \text{SSE}_{right} = 0.01732 + 0.05167 = 0.06899
$$

## Summary

The total Sum of Squared Errors (SSE) after the split on "Year \> 2010" is **0.06899**. This value is the sum, over both nodes, of the squared deviations of each observation from its node mean. It is not itself a variance: each node contributes its variance multiplied by its number of data points. When candidate splits of the same data are compared, the split with the lowest SSE is preferred, because the data points within its nodes are the most homogeneous.
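
An SSE value is easiest to interpret relative to a baseline. The sketch below compares the SSE after the split with the SSE of the unsplit data (a single node with one overall mean), using the same eight target values as above. The helper name `node_sse` is hypothetical.

```python
from statistics import fmean


def node_sse(values):
    """Squared deviations of the values from their own mean."""
    mu = fmean(values)
    return sum((y - mu) ** 2 for y in values)


left_values = [0.35, 0.25, 0.40, 0.27, 0.26]  # Year > 2010
right_values = [0.20, 0.15, 0.45]             # Year <= 2010

sse_split = node_sse(left_values) + node_sse(right_values)
sse_root = node_sse(left_values + right_values)  # no split: one node, one mean
reduction = sse_root - sse_split

print(f"SSE before split: {sse_root:.5f}")
print(f"SSE after split:  {sse_split:.5f}")
print(f"Reduction:        {reduction:.5f}")
```

On this data the split lowers the SSE only slightly, from about 0.0719 to about 0.0690, so a tree-building procedure would normally compare this reduction against those of other candidate splits before choosing one.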