Classification And Clustering

Classification

The Perceptron Algorithm



Binary Classification Problem

Given $n$ points $\mathbf{x}_1,\dots,\mathbf{x}_n$, each assigned a label $l_i\in \{\pm 1\}$, we want to find a vector $\mathbf{w}$ and a threshold $t\in \mathbb{R}$ such that $$\begin{aligned} &\mathbf{w} \cdot \mathbf{x}_{\mathbf{i}}>t \quad \forall i : l_i=+1 \\ &\mathbf{w} \cdot \mathbf{x}_{\mathbf{i}} < t\quad \forall i : l_i=-1 \end{aligned} $$ If we write $$\widehat{\mathbf{x}}_{\mathbf{i}}=\left(\mathbf{x}_{\mathbf{i}}, 1\right), \qquad \widehat{\mathbf{w}}=\left(\mathbf{w}, -t\right)$$ then the problem is equivalent to finding a $\widehat{\mathbf{w}}$ such that $$\left(\widehat{\mathbf{w}} \cdot \widehat{\mathbf{x}}_{\mathbf{i}}\right) l_{i}>0 \quad 1 \leq i \leq n$$

Perceptron Algorithm

One of the earliest algorithms for classification problems is the Perceptron Algorithm.


Initialize : $$\mathbf{w} = 0$$ while $\exists\ \mathbf{x}_{\mathbf{i}}$ with $\sgn(\mathbf{w}\cdot\mathbf{x}_{\mathbf{i}})\ne l_{i} $, update $$\mathbf{w} \leftarrow \mathbf{w}+\mathbf{x}_{\mathbf{i}} l_{i}$$


In the augmented notation, the same update reads:

Initialize : $$\widehat{\mathbf{w}} = 0$$ while $\exists\ \widehat{\mathbf{x}}_{\mathbf{i}}$ with $\widehat{\mathbf{x}}_{\mathbf{i}} l_{i} \cdot \widehat{\mathbf{w}} \leq 0$, update $$\widehat{\mathbf{w}} \leftarrow \widehat{\mathbf{w}}+\widehat{\mathbf{x}}_{\mathbf{i}} l_{i}$$

As long as the points are linearly separable, i.e. a solution exists, the Perceptron Algorithm is guaranteed to stop and find a solution.
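As a concrete illustration, here is a minimal NumPy sketch of the augmented version of the algorithm; the function name, the epoch cap, and the data arrays are placeholders rather than part of the original statement.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron on augmented points x_hat = (x, 1); y holds labels in {-1, +1}."""
    n, p = X.shape
    X_hat = np.hstack([X, np.ones((n, 1))])   # append the constant coordinate
    w_hat = np.zeros(p + 1)                   # initialize w_hat = 0
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            # misclassified if l_i * (w_hat . x_hat_i) <= 0
            if y[i] * np.dot(w_hat, X_hat[i]) <= 0:
                w_hat += y[i] * X_hat[i]      # w_hat <- w_hat + l_i * x_hat_i
                updated = True
        if not updated:                       # no misclassified points: converged
            break
    return w_hat[:-1], -w_hat[-1]             # recover w and the threshold t
```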

Kernel Method

Kernel Function
When points in the $p$-dimensional space cannot be separated by a $(p-1)$-dimensional hyperplane, the Perceptron Algorithm will not find a solution (as none exists). One technique to address this is to map the points, via some function $$f:\mathbb{R}^p \to \mathbb{R}^{p+k},$$ into a higher dimensional space of dimension $p+k$ where they are separable by a hyperplane of dimension $p+k-1$, apply the Perceptron Algorithm there, and at the end translate the solution back to the original space $\mathbb{R}^p$.

For the algorithm, what really matters is the inner product computation in the higher dimensional space. Define the kernel function induced by $f$ as $$K(\mathbf{u},\mathbf{v}):=f(\mathbf{u})\cdot f(\mathbf{v})$$ Conversely (Mercer's condition), a symmetric function $K$ is of the form $$K(\mathbf{u},\mathbf{v})=f(\mathbf{u})\cdot f(\mathbf{v})$$ for some $f$ iff for every $g\in L^2$ $$\int \int K(\mathbf{u},\mathbf{v})g(\mathbf{u})g(\mathbf{v})\,d\mathbf{u}\,d\mathbf{v} \ge 0$$
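As an example of a kernel and its inducing map: the degree-2 polynomial kernel $K(\mathbf{u},\mathbf{v}) = (\mathbf{u}\cdot\mathbf{v} + 1)^2$ on $\mathbb{R}^2$ is induced by an explicit map $f:\mathbb{R}^2 \to \mathbb{R}^6$. A quick numerical check; the particular map written out below is one standard choice, not something from the text above.

```python
import numpy as np

def poly_kernel(u, v):
    """K(u, v) = (u . v + 1)^2, the degree-2 polynomial kernel."""
    return (np.dot(u, v) + 1) ** 2

def feature_map(u):
    """An explicit map f: R^2 -> R^6 that induces the kernel above."""
    u1, u2 = u
    return np.array([u1**2, u2**2, np.sqrt(2) * u1 * u2,
                     np.sqrt(2) * u1, np.sqrt(2) * u2, 1.0])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(u, v), np.dot(feature_map(u), feature_map(v)))
# both are 4.0 (equal up to floating point round-off)
```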
Perceptron Algorithm with Kernel Functions
To avoid explicitly transforming the points into the higher dimensional space, note that, in the earlier version of the algorithm, the weight vector $\mathbf{w}$ can be written as a linear combination of the points $$ \mathbf{w} = \sum_{i=1}^n \alpha_i l_i \mathbf{x}_i$$ where $\alpha_i$ is the number of times $\mathbf{x}_i$ has been misclassified. This allows us to generalize the algorithm by replacing the dot product with the kernel function: $$ \sgn(\mathbf{w}\cdot \mathbf{x}_i) \Longrightarrow \sgn\Bigl(\sum_{j=1}^n \alpha_j l_j K(\mathbf{x}_i,\mathbf{x}_j)\Bigr) $$ And we can write the algorithm as
Initialize : $$\alpha_i = 0 \quad \forall i$$ while $\exists\ \mathbf{x}_{\mathbf{i}}$ with $\sgn\bigl(\sum_{j=1}^n \alpha_j l_j K(\mathbf{x}_i,\mathbf{x}_j)\bigr)\ne l_{i} $, update $$\alpha_i \leftarrow \alpha_i + 1$$
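A minimal sketch of this kernelized version, assuming a kernel function K(u, v) is supplied (for instance the polynomial kernel above); the Gram-matrix precomputation and the epoch cap are implementation conveniences, not part of the algorithm statement.

```python
import numpy as np

def kernel_perceptron(X, y, K, max_epochs=100):
    """Kernel Perceptron: alpha[i] counts how often x_i was misclassified.

    X: array of points, y: array of labels in {-1, +1}, K: kernel function.
    """
    n = len(X)
    alpha = np.zeros(n)
    # precompute the Gram matrix G[i, j] = K(x_i, x_j)
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            # misclassified if l_i * sum_j alpha_j l_j K(x_i, x_j) <= 0
            if y[i] * np.dot(alpha * y, G[i]) <= 0:
                alpha[i] += 1
                updated = True
        if not updated:
            break
    return alpha
```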

Support Vector Machines

Support Vector Machines (SVM)

The primary goal in SVM classification is to find the optimal hyperplane $$\mathbf{w}^T \mathbf{x} + b = 0$$ that maximizes the margin between two classes. For linearly separable data, this reduces to solving the quadratic optimization problem:

$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 \quad \forall i$$

where $y_i \in \{-1, 1\}$ are class labels. The support vectors (the training samples closest to the hyperplane) lie on the margin boundaries $$\mathbf{w}^T \mathbf{x} + b = \pm 1$$ The distance between these two boundaries is $\frac{2}{\|\mathbf{w}\|}$, which is why minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ maximizes the margin.

For non-separable data, soft margin SVM introduces slack variables $\xi_i$ and a regularization parameter $C$:

$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

This balances margin width and classification error.
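At the optimum the slack variables take the value $\xi_i = \max\bigl(0,\ 1 - y_i(\mathbf{w}^T \mathbf{x}_i + b)\bigr)$, so the soft-margin problem can equivalently be written as the unconstrained hinge-loss objective

$$\min_{\mathbf{w}, b}\ \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i(\mathbf{w}^T \mathbf{x}_i + b)\bigr)$$

which makes explicit that a larger $C$ penalizes margin violations more heavily.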

Kernel Trick for Nonlinear Separation

When the data is not linearly separable in the original feature space, SVMs use kernel functions $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$: since the optimization and the decision function depend on the data only through inner products, the map $\phi$ into the higher dimensional space never needs to be computed explicitly.
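In practice a library solver handles the optimization. A minimal sketch using scikit-learn's SVC, with an illustrative toy dataset and placeholder hyperparameter values:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# toy data that is not linearly separable
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# RBF kernel handles the nonlinear boundary; C trades margin width vs. violations
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print(clf.score(X, y))      # training accuracy
print(len(clf.support_))    # number of support vectors
```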

Forest and Trees

Bagging (Bootstrap Aggregating)

At the core of Random Forest lies bagging, a resampling technique that generates multiple training datasets from the original data. For a dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, bagging creates $B$ bootstrap samples $D_1, D_2, \ldots, D_B$ by sampling $n$ instances with replacement from $D$. On average, each bootstrap sample contains about $1 - \frac{1}{e} \approx 63\%$ of the distinct points of the original dataset, leaving roughly a $\frac{1}{e}$ fraction out-of-bag.

For each bootstrap sample $D_b$, a decision tree $f_b(x)$ is trained. The final prediction is an aggregate of these individual trees: $$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$$ (an average for regression, a majority vote for classification).

This aggregation reduces variance by averaging out the noise inherent in individual trees.
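A minimal sketch of bagging, assuming scikit-learn decision trees as the base learners; the function names and the choice of $B$ are illustrative, and labels are assumed to lie in $\{-1, +1\}$ so the majority vote reduces to a sign.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=50, random_state=0):
    """Train B decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Aggregate by majority vote; assumes labels in {-1, +1} (ties map to 0)."""
    votes = np.array([t.predict(X) for t in trees])   # shape (B, n_samples)
    return np.sign(votes.sum(axis=0))
```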

Feature Randomization

Random Forest introduces additional randomness by restricting each split to a random subset of the features. For a dataset with $d$ features, only $k$ randomly chosen features are considered at each split, with the common defaults $k = \sqrt{d}$ (classification) and $k = \frac{d}{3}$ (regression). This randomness decorrelates the trees, further reducing ensemble variance at the cost of at most a small increase in bias.

The probability $p_{nj}$ of feature $j$ being available at node $n$ determines the level of randomness. When $p_{nj} \in (0, 1)$, different trees end up splitting on different features, so the correlation $\rho$ between trees decreases, leading to lower variance in the ensemble.
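To make the variance reduction concrete: for $B$ identically distributed trees, each with variance $\sigma^2$ and pairwise correlation $\rho$, a standard calculation gives

$$\operatorname{Var}\!\left(\frac{1}{B} \sum_{b=1}^{B} f_b(x)\right) = \rho\, \sigma^2 + \frac{1 - \rho}{B}\, \sigma^2$$

As $B$ grows the second term vanishes, so the ensemble variance is capped by $\rho\sigma^2$; decorrelating the trees via feature randomization lowers this cap.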

Feature Importance

Feature importance in Random Forest is often measured by the mean decrease in impurity across all trees. For feature $j$, the importance is:

$$ \text{Importance}(j) = \frac{1}{B} \sum_{b=1}^{B} \sum_{n \in T_b} \Delta \text{Impurity}_{n}(j) $$

where $\Delta \text{Impurity}_{n}(j)$ is the reduction in impurity at node $n$ due to splitting on feature $j$.
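A minimal scikit-learn sketch tying these pieces together, with a synthetic dataset and placeholder hyperparameters; max_features="sqrt" corresponds to $k = \sqrt{d}$, and the impurity-based importances are exposed as feature_importances_.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# B = 200 trees; each split considers sqrt(d) randomly chosen features
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X, y)

# mean decrease in impurity, averaged over the trees
print(rf.feature_importances_)
```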

XGBoost

Mathematical Foundations of XGBoost

XGBoost minimizes a loss function $L(y_i, F(x_i))$, where $F(x) = \sum_{m=1}^{M} f_m(x)$ is an ensemble of decision trees built additively. At iteration $m$, the algorithm approximates the loss around the current model $F_{m-1}$ with a second-order Taylor expansion:

$$ L(y_i, F_{m-1}(x_i) + f_m(x_i)) \approx L(y_i, F_{m-1}(x_i)) + g_i f_m(x_i) + \frac{1}{2} h_i f_m(x_i)^2 $$

where $g_i = \partial_{F_{m-1}(x_i)} L(y_i, F_{m-1}(x_i))$ and $h_i = \partial^2_{F_{m-1}(x_i)} L(y_i, F_{m-1}(x_i))$ are the first and second derivatives of the loss with respect to the current prediction. Dropping the constant term $L(y_i, F_{m-1}(x_i))$ and summing over the training instances, the objective at iteration $m$ becomes:

$$ \text{Obj}(m) = \sum_{i=1}^{n} \left[ g_i f_m(x_i) + \frac{1}{2} h_i f_m(x_i)^2 \right] + \Omega(f_m) $$

where $\Omega(f_m)$ is a regularization term that controls overfitting:

$$ \Omega(f_m) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 $$

with $T$ the number of leaves of $f_m$ and $w_j$ the weight (output value) of leaf $j$. This term penalizes trees with many leaves (via $\gamma T$) and large leaf weights (via $\lambda \sum w_j^2$).
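Writing $I_j$ for the set of instances routed to leaf $j$ (so that $f_m(x_i) = w_j$ for $i \in I_j$), the objective separates over the leaves, and minimizing the resulting quadratic in each $w_j$ gives the optimal leaf weight and objective value

$$ w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad \text{Obj}^{*}(m) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T $$

This closed form is what the split gain in the next subsection is built from.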

Feature Importance

Feature importance is measured by the gain in loss reduction when splitting on a feature. For a split of an instance set $I$ into left and right children $I_L$ and $I_R$, the gain is:

$$ \text{Gain} = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma $$

This gain is accumulated across all trees to rank feature contributions.
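A minimal sketch using the xgboost package's scikit-learn interface; the dataset and hyperparameter values are placeholders, with reg_lambda and gamma playing the roles of $\lambda$ and $\gamma$ above.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = XGBClassifier(
    n_estimators=200,    # number of boosting rounds M
    learning_rate=0.1,
    max_depth=4,
    reg_lambda=1.0,      # lambda: L2 penalty on leaf weights
    gamma=0.0,           # gamma: minimum gain required to make a split
)
model.fit(X, y)

print(model.feature_importances_)   # per-feature importance scores
```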

Logistic Regression

Logistic regression is often used as a baseline model for classification problems due to its simplicity. For a discussion of logistic regression, see the page Linear Model and Friends .

Classification Metrics

Precision, Recall, and F1 Score

In classification tasks, it is important to evaluate the performance of the model using various metrics. Three commonly used metrics are precision, recall, and F1 score.
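In terms of true positives (TP), false positives (FP), and false negatives (FN): $$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$ Precision measures how many of the predicted positives are correct, recall measures how many of the actual positives are found, and the F1 score is their harmonic mean.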

These metrics help in understanding the performance of a classification model, especially in cases where the class distribution is imbalanced.

AUC-ROC Curve

The ROC curve is the collection of points $(\text{FPR}, \text{TPR})$ obtained at various threshold values, where $\text{TPR} = \frac{TP}{TP + FN}$ is the true positive rate and $\text{FPR} = \frac{FP}{FP + TN}$ is the false positive rate. $$ \text{AUC-ROC} = \int_0^1 \text{TPR}(t)\, d \text{FPR}(t) = \int_0^1 \text{TPR}(\text{FPR}^{-1}(t))\, dt = \Pr(S_p > S_n) $$ where $S_p$ and $S_n$ are the scores of a randomly chosen positive and negative instance, respectively.
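The probabilistic interpretation can be checked numerically. A minimal sketch comparing scikit-learn's AUC with the fraction of positive-negative pairs in which the positive instance scores higher; the labels and scores below are randomly generated placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)            # labels in {0, 1}
scores = rng.normal(loc=y, scale=1.0)        # positives score higher on average

auc = roc_auc_score(y, scores)

# Pr(S_p > S_n): compare every positive score to every negative score
pos, neg = scores[y == 1], scores[y == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(auc, pairwise)   # the two values agree (up to ties)
```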

Clustering

K-Means Clustering

The k-Means Algorithm is one of the earliest methods developed for clustering problems, and it is still the most popular partitioning algorithm today.

  1. Choose the number of centroids $k$, and select a distance function $d$.
  2. Initialize $k$ starting centroids (randomly or with some deterministic criterion).
  3. Assign each point to the cluster of its closest centroid.
  4. Update each centroid to be the mean of the points assigned to its cluster.
  5. Repeat the previous two steps until convergence (a minimal sketch of these steps follows the list).
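A minimal NumPy sketch of these steps, using Euclidean distance and random initialization (the guard against a cluster becoming empty is omitted):

```python
import numpy as np

def kmeans(X, k, max_iters=100, random_state=0):
    """Lloyd's algorithm with Euclidean distance; returns (centroids, labels)."""
    rng = np.random.default_rng(random_state)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(max_iters):
        # assignment step: each point goes to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # converged
            break
        centroids = new_centroids
    return centroids, labels
```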


DBSCAN

  1. Parameter Selection: Choose $\epsilon$ (epsilon) and MinPts.
  2. Select a Starting Point: Begin with an arbitrary unvisited point.
  3. Examine the Neighborhood:
    • If fewer than MinPts neighbors, label as noise (temporarily).
    • If at least MinPts neighbors, mark as a core point and form a new cluster.
  4. Expand the Cluster:
    • Add all neighbors to the cluster.
    • Recursively expand for core point neighbors.
    • Mark non-core neighbors as border points.
  5. Repeat: Move to the next unvisited point and repeat steps 3-4.
  6. Finalize Clusters: Identify all clusters and, where applicable, reclassify noise points that fall within $\epsilon$ of a core point as border points (a short usage sketch follows this list).
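Rather than re-implementing the neighborhood expansion, a minimal sketch using scikit-learn's DBSCAN; the two-moons data and parameter values are illustrative placeholders.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius, min_samples is MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_        # cluster index per point; -1 marks noise
print(set(labels))
```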