Binary Classification Problem
Given $n$ points $\mathbf{x}_1,\cdots,\mathbf{x}_n$, each assigned a label $l_i\in \{\pm 1\}$, we need to find
a vector $\mathbf{w}$ and a threshold $t\in \mathbb{R}$ such that
$$\begin{aligned}
&\mathbf{w} \cdot \mathbf{x}_{\mathbf{i}}>t \quad \forall i\quad : \quad l_i=+1 \\
&\mathbf{w} \cdot \mathbf{x}_{\mathbf{i}} < t\quad \forall i\quad : \quad l_i=-1 \end{aligned} $$ If we write
$$\widehat{\mathbf{x}}_{\mathbf{i}}=\left(\mathbf{x}_{\mathbf{i}}, 1\right), \qquad \widehat{\mathbf{w}}=\left(\mathbf{w}, -t\right),$$ then the problem is equivalent to
finding a $\widehat{\mathbf{w}}$ such that $$\left(\widehat{\mathbf{w}} \cdot \widehat{\mathbf{x}}_{\mathbf{i}}\right)
l_{i}>0, \quad 1 \leq i \leq n.$$
Perceptron Algorithm
One of the earliest algorithms for classification problems is the Perceptron
Algorithm.
Initialize : $$\mathbf{w} = 0$$
while $\exists\ \mathbf{x}_{\mathbf{i}}$ with $\sgn(\mathbf{w} \cdot \mathbf{x}_{\mathbf{i}})\ne l_{i} $,
update $$\mathbf{w} \leftarrow {\mathbf{w}}+{\mathbf{x}}_{\mathbf{i}} l_{i}$$
Initialize : $$\widehat{\mathbf{w}} = 0$$
while $\exists\ \widehat{\mathbf{x}}_{\mathbf{i}}$ with $l_{i}\left(\widehat{\mathbf{w}} \cdot \widehat{\mathbf{x}}_{\mathbf{i}}\right) \leq 0$,
update $$\widehat{\mathbf{w}} \leftarrow \widehat{\mathbf{w}}+\widehat{\mathbf{x}}_{\mathbf{i}} l_{i}$$
As long as the points are linearly separable, i.e. a solution exists, the Perceptron Algorithm is
guaranteed to stop and find a solution.
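As a concrete illustration, here is a minimal sketch of the augmented version of the algorithm in Python with NumPy; the function name and toy data are invented for this example.

```python
import numpy as np

def perceptron(X, labels, max_iter=1000):
    """Augmented perceptron: find w_hat with (w_hat . x_hat_i) * l_i > 0 for all i."""
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 to each point
    w_hat = np.zeros(X_hat.shape[1])                    # initialize w_hat = 0
    for _ in range(max_iter):
        margins = labels * (X_hat @ w_hat)              # (w_hat . x_hat_i) * l_i
        misclassified = np.where(margins <= 0)[0]
        if len(misclassified) == 0:                     # all points classified correctly
            return w_hat
        i = misclassified[0]
        w_hat += labels[i] * X_hat[i]                   # update: w_hat <- w_hat + l_i * x_hat_i
    return w_hat                                        # may not converge if data is not separable

# toy example: two linearly separable clusters
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]])
labels = np.array([1, 1, -1, -1])
w_hat = perceptron(X, labels)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w_hat))  # should match labels
```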
Kernel Method
Kernel Function
When points in the $p$-dimensional space cannot be separated by a $(p-1)$-dimensional hyperplane, the Perceptron Algorithm is not
going to find a solution (as no solution exists). One technique to address this issue is to move to a higher
dimensional space of dimension $p+k$,
where the points are separable by a hyperplane of dimension $p+k-1$, via
some function $$f:\mathbb{R}^p \to \mathbb{R}^{p+k},$$
then apply the Perceptron Algorithm in the higher dimensional space, and at the end translate the solution back to the
original space $\mathbb{R}^p$.
What really matters is the inner product computation in the higher dimensional space. Define the
Kernel function induced by $f$
as $$K(\mathbf{u},\mathbf{v}):=f(\mathbf{u})\cdot f(\mathbf{v})$$
A kernel function $K$ is of the form
$$K(\mathbf{u},\mathbf{v}):=f(\mathbf{u})\cdot f(\mathbf{v})$$
for some $f$ if and only if $K$ is symmetric and, for any $g\in L^2$,
$$\int \int K(\mathbf{u},\mathbf{v})g(\mathbf{u})g(\mathbf{v})d\mathbf{u}d\mathbf{v} \ge 0$$
The polynomial kernel, $$(\mathbf{u} \cdot \mathbf{v}+c)^{q}$$ for $q \in \mathbb{N}$,
$c>0$.
The sigmoid kernel, $$\tanh (a(\mathbf{u} \cdot \mathbf{v})+b)$$ for $a, b \geq 0$.
The Gaussian kernel, $$e^{-\|\mathbf{u}-\mathbf{v}\|^{2} / 2 \sigma^{2}}$$ for $\sigma \neq
0$.
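These kernels are cheap to evaluate directly, without ever constructing $f$. A small sketch in Python (the function names are illustrative only):

```python
import numpy as np

def polynomial_kernel(u, v, c=1.0, q=2):
    # (u . v + c)^q
    return (np.dot(u, v) + c) ** q

def sigmoid_kernel(u, v, a=1.0, b=0.0):
    # tanh(a (u . v) + b)
    return np.tanh(a * np.dot(u, v) + b)

def gaussian_kernel(u, v, sigma=1.0):
    # exp(-||u - v||^2 / (2 sigma^2))
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(u, v), sigmoid_kernel(u, v), gaussian_kernel(u, v))
```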
Perceptron Algorithm with Kernel Functions
To avoid the explicit transformation of the points into the higher dimensional space, we note that, in our earlier
version of the algorithm, the
weight vector $\mathbf{w}$ can be written as a linear combination of the points
$$ \mathbf{w} = \sum_{i=1}^n \alpha_i l_i \mathbf{x}_i$$
where $\alpha_i$ is the number of times $\mathbf{x}_i$ has been misclassified. This allows us to generalize the
algorithm by replacing the dot product with the kernel function.
$$
\sgn(\mathbf{w}\cdot \mathbf{x}_i) \Longrightarrow \sgn(\sum_{j=1}^n \alpha_j l_j K(\mathbf{x}_i,\mathbf{x}_j))
$$
And we can write the algorithm as
Initialize : $$\alpha_i = 0 \quad \text{for all } i$$
while $\exists\ \mathbf{x}_{\mathbf{i}}$ with $\sgn(\sum_{j=1}^n \alpha_j l_j K(\mathbf{x}_i,\mathbf{x}_j))\ne
l_{i} $,
update $$\alpha_i \leftarrow \alpha_i + 1$$
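A minimal sketch of this kernelized version in Python, assuming a callable `kernel(u, v)` such as the Gaussian kernel above; the function name and toy data are illustrative:

```python
import numpy as np

def kernel_perceptron(X, labels, kernel, max_iter=1000):
    """Kernel perceptron: stores one counter alpha_i per training point."""
    n = X.shape[0]
    alpha = np.zeros(n)   # alpha_i = number of times x_i has been misclassified
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
    for _ in range(max_iter):
        updated = False
        for i in range(n):
            # prediction: sgn(sum_j alpha_j l_j K(x_i, x_j))
            if np.sign(np.sum(alpha * labels * K[i])) != labels[i]:
                alpha[i] += 1          # update alpha_i <- alpha_i + 1
                updated = True
        if not updated:                # no misclassified points left
            break
    return alpha

# XOR-like data, not linearly separable in R^2, but separable with an RBF kernel
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
labels = np.array([1, 1, -1, -1])
rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 0.5)
print(kernel_perceptron(X, labels, rbf))
```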
Support Vector Machines
Support Vector Machines (SVM)
The primary goal in SVM classification is to find the optimal hyperplane
$$\mathbf{w}^T \mathbf{x} + b = 0$$
that maximizes the margin between two classes. For linearly separable data, this reduces to solving the quadratic
optimization problem
$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^{2} \quad \text{subject to} \quad y_i\left(\mathbf{w}^T \mathbf{x}_i + b\right) \geq 1 \;\; \text{for all } i,$$
where $y_i \in \{-1, 1\}$ are class labels. The support vectors (critical training samples
closest to the hyperplane) define the margin boundaries
$$\mathbf{w}^T \mathbf{x} + b = \pm 1.$$
For non-separable data, soft margin SVM introduces slack variables $\xi_i$ and a regularization
parameter $C$:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{i} \xi_i \quad \text{subject to} \quad y_i\left(\mathbf{w}^T \mathbf{x}_i + b\right) \geq 1 - \xi_i, \;\; \xi_i \geq 0.$$
This balances margin width and classification error.
Kernel Trick for Nonlinear Separation
When data isn't linearly separable, SVMs use kernel functions $K(\mathbf{x}_i, \mathbf{x}_j) =
\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ to map features into higher dimensions without explicit computation.
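As a short usage sketch (assuming scikit-learn is installed; the parameter values are arbitrary), a soft margin SVM with an RBF (Gaussian) kernel can be fit on a non-separable toy dataset:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# non-linearly-separable toy data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# soft margin SVM with an RBF kernel; C controls the margin/error trade-off
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.n_support_)     # number of support vectors per class
print(clf.score(X, y))    # training accuracy
```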
Forest and Trees
Bagging (Bootstrap Aggregating)
At the core of Random Forest lies bagging, a resampling technique that generates multiple
training datasets from the original data. For a dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, bagging creates
$B$ bootstrap samples $D_1, D_2, \ldots, D_B$ by sampling $n$ instances with replacement from
$D$. Each bootstrap sample is expected to contain about $1-\frac{1}{e} \approx 63\%$ of the distinct instances of $D$; the remaining $\approx \frac{1}{e}$ of the instances are left out-of-bag.
For each bootstrap sample $D_b$, a decision tree $f_b(x)$ is trained. The final prediction is an aggregate of
these individual trees, e.g. by averaging for regression,
$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x),$$
or by majority vote for classification.
This aggregation reduces variance by averaging out the noise inherent in individual trees.
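A minimal sketch of bagging by hand, assuming scikit-learn's decision trees as the base learner; the helper name `bagging_predict` is invented for this example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X_train, y_train, X_test, B=100, seed=0):
    """Train B trees on bootstrap samples and average their predictions."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    preds = np.zeros((B, X_test.shape[0]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)              # sample n indices with replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                         # aggregate by averaging (regression)

# distinct fraction of a bootstrap sample is about 1 - 1/e ~ 0.632
idx = np.random.default_rng(0).integers(0, 10_000, size=10_000)
print(len(np.unique(idx)) / 10_000)
```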
Feature Randomization
Random Forest introduces feature randomness by selecting a random subset of features at each
split. For a dataset with $d$ features, each split considers a random subset of $k = \sqrt{d}$ features (classification)
or $k = \frac{d}{3}$ (regression). This randomness decorrelates the trees, further reducing variance while leaving the
bias largely unchanged.
The probability $p_{nj}$ of selecting the $j$-th feature at a node determines the level of randomness. If $p_{nj}
\in (0, 1)$, the correlation $\rho(x)$ between trees decreases, leading to lower variance in the ensemble.
Feature Importance
Feature importance in Random Forest is often measured by the mean decrease in impurity across all
trees. For feature $j$, the importance is
$$\text{Importance}(j) = \frac{1}{B} \sum_{b=1}^{B} \sum_{n \in T_b:\, n \text{ splits on } j} \Delta \text{Impurity}_{n}(j),$$
where $\Delta \text{Impurity}_{n}(j)$ is the reduction in impurity at node $n$ due to splitting on feature $j$.
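As a short sketch (assuming scikit-learn is installed), the following fits a Random Forest that considers $\sqrt{d}$ features per split and prints its impurity-based importances:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# max_features="sqrt" picks a random subset of ~sqrt(d) features at each split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X, y)

# mean-decrease-in-impurity importances, normalized to sum to 1
print(rf.feature_importances_)
```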
XGBoost
Mathematical Foundations of XGBoost
XGBoost minimizes the loss function $L(y_i, F(x_i))$, where $F(x) = \sum_{m=1}^{M} f_m(x)$ is the ensemble of
decision trees. At each iteration $m$, the algorithm approximates the loss using a second-order Taylor expansion
$$\mathcal{L}^{(m)} \approx \sum_{i=1}^{n}\left[L\left(y_i, \hat{y}_i^{(m-1)}\right) + g_i f_m(x_i) + \frac{1}{2} h_i f_m^2(x_i)\right] + \Omega(f_m),$$
where $\hat{y}_i^{(m-1)}$ is the current prediction, $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to $\hat{y}_i^{(m-1)}$, and $\Omega$ is a regularization term on the tree structure. The gradients and Hessians determine the gain of each candidate split, and this gain is accumulated across all trees to rank feature contributions.
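A short usage sketch, assuming the xgboost and scikit-learn packages are available; the hyperparameter values are arbitrary:

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# gradient boosted trees: each round fits f_m using the gradients/Hessians of the loss
model = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))

# gain-based feature importances, accumulated across all trees
print(model.get_booster().get_score(importance_type="gain"))
```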
Logistic Regression
Logistic regression is often used as a baseline model for classification problems, due to its simplicity.
For a discussion of Logistic Regression, check out the page Linear Model and Friends.
Classification Metrics
Precision, Recall, and F1 Score
In classification tasks, it is important to evaluate the performance of the model using various metrics. Three commonly used metrics are precision, recall, and F1 score.
Precision: The ratio of true positive predictions to the total predicted positives.
$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$
Recall: The ratio of true positive predictions to the total actual positives.
$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$
F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
$$ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$
These metrics help in understanding the performance of a classification model, especially in cases where the class distribution is imbalanced.
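A small sketch computing these three metrics by hand and comparing against scikit-learn (assumed available); the toy labels are invented:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# should agree with scikit-learn's implementations
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```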
AUC-ROC Curve
The ROC curve is the collection of points $(\text{FPR}, \text{TPR})$ obtained as the classification threshold varies.
$$
\text{AUC-ROC} = \int_0^1 \text{TPR}(t) d \text{FPR}(t) = \int_0^1 \text{TPR}(FPR^{-1}(t)) dt = \Pr(S_p > S_n)
$$
where $S_p$ and $S_n$ are the scores of the positive and negative instances, respectively.
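A small sketch illustrating the identity $\text{AUC-ROC} = \Pr(S_p > S_n)$ numerically, assuming scikit-learn for the reference value; the score distributions are invented:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
scores_pos = rng.normal(1.0, 1.0, size=500)   # scores S_p of positive instances
scores_neg = rng.normal(0.0, 1.0, size=500)   # scores S_n of negative instances

# Pr(S_p > S_n), estimated over all positive/negative pairs
pairwise = (scores_pos[:, None] > scores_neg[None, :]).mean()

y = np.concatenate([np.ones(500), np.zeros(500)])
s = np.concatenate([scores_pos, scores_neg])
print(pairwise, roc_auc_score(y, s))          # the two values agree (up to ties)
```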
Clustering
K-Means Clustering
The k-Means Algorithm is one of the earliest methods developed for clustering problems, and is still the most
popular partitioning algorithm today. Note that k-Means is very sensitive to the choice of the initial centroids.
Choose the number of centroids $k$, and select a distance function $d$.
Initialize $k$ starting centroids (randomly or with some deterministic criteria).
Assign each point to the closest centroid.
Update each centroid by recomputing the mean of its cluster.
Repeat the previous two steps until convergence.
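A minimal sketch of these steps in Python with NumPy, using the Euclidean distance; the function name and toy data are illustrative:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm for k-means with Euclidean distance."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]  # random starting centroids
    for _ in range(max_iter):
        # assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each centroid as the mean of its cluster (keep old centroid if cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return centroids, labels

# toy data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```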
Applications
Image compression
An image may contain thousands of colors, but many of them are similar. By clustering the colors, we can
represent the image with fewer colors, say 16, and still maintain the quality of the image.
Document clustering
Document clustering involves grouping documents with similar themes or topics together. This can be useful in
information retrieval systems, where you want to find documents that are similar to a given document.
Customer segmentation
Customer segmentation is the practice of dividing a company's customers into groups that reflect similarity
among
customers in each group. The goal is to identify high yield segments – that is, those segments that are likely
to
be profitable or that have growth potential.
DBSCAN
Parameter Selection: Choose $\epsilon$ (epsilon) and MinPts.
Select a Starting Point: Begin with an arbitrary unvisited point.
Examine the Neighborhood:
If fewer than MinPts neighbors, label as noise (temporarily).
If at least MinPts neighbors, mark as a core point and form a new cluster.
Expand the Cluster:
Add all neighbors to the cluster.
Recursively expand for core point neighbors.
Mark non-core neighbors as border points.
Repeat: Move to the next unvisited point and repeat steps 3-4.
Finalize Clusters: Identify all clusters and reclassify some noise points as border points if applicable.
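A short usage sketch with scikit-learn's DBSCAN implementation (assumed available); the parameter values are arbitrary:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius (epsilon), min_samples plays the role of MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                  # cluster index per point, -1 means noise
print(np.unique(labels))             # e.g. [-1, 0, 1] if some points are labeled noise
print((labels == -1).sum(), "noise points")
```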