About Course
This course is a continuation of Machine Learning I, where you learnt the fundamentals of machine learning with a significant focus on supervised learning. This course focuses on unsupervised machine learning, and introduces deep learning in the final week to bring you up to speed with the fundamental workings of the AI models currently taking the tech space by storm, such as ChatGPT and Stable Diffusion.
Here is a breakdown of what you will cover in this course:
- Week 6: Dimensionality Reduction Methods
- Week 7: Clustering
- Week 8: Time Series Analysis
- Week 9: Neural Networks
- Week 10: Capstone and Conclusion
Acknowledgements and Attribution
This course draws on Jake VanderPlas’ Python Data Science Handbook, Data Ranger’s playlist on Time Series Analysis, Andrew Ng’s tutorials on deep learning concepts, and MIT’s open course Introduction to Deep Learning. We have added videos to the course to help make harder concepts simpler to understand. Finally, there are notes by Chris Aloo and the Zindua technical team, shared on Slack or in the course resources.
Course Content
6.1 Principal Component Analysis
- Introducing Principal Component Analysis (00:00)
- PCA as dimensionality reduction (00:00)
- PCA for visualization: Hand-written digits (00:00)
- PCA as Noise Filtering (00:00)
- Principal Component Analysis Summary (00:00)
6.2 Advanced PCA
- KernelPCA (00:00)
- IncrementalPCA (00:00)
6.3 Manifold Learning
- Manifold Learning: “HELLO” (00:00)
- Multidimensional Scaling (MDS) (00:00)
- MDS as Manifold Learning (00:00)
- Nonlinear Manifolds: Locally Linear Embedding (00:00)
6.4 More Manifolds
- Isomap (00:00)
- t-SNE (00:00)
6.5 Dimensionality Reduction Weekly Project
In this project, you will perform dimensionality reduction on a second-hand car sales dataset. The goal is to prepare the dataset for ML models by reducing the feature space while retaining the most essential information.
Access the assignment here
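As a starting point, here is a minimal, hedged sketch of the kind of pipeline the project calls for. The file name, the numeric-column selection, and the 95% variance threshold are illustrative assumptions, not the official solution:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv('car_sales.csv')              # hypothetical dataset path
X = df.select_dtypes(include='number').dropna()  # numeric features only

# PCA is scale-sensitive, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# keep enough components to retain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```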
- Feature Selection Project (00:00)
- Submit Your Project (00:00)
7.1 k-Means Clustering
- Introducing k-Means (00:00)
- k-Means Algorithm: Expectation–Maximization (00:00)
- Example 1: k-means on digits (00:00)
- Example 2: k-means for color compression (00:00)
- k-Means Clustering
7.2 Gaussian Mixture Models
- Motivating GMM: Weaknesses of k-Means (00:00)
- Generalizing E–M: Gaussian Mixture Models (00:00)
- GMM as Density Estimation (00:00)
7.3 Kernel Density Estimation
In-Depth: Kernel Density Estimation
In the previous section, we covered Gaussian mixture models (GMM), which are a kind of hybrid between a clustering estimator and a density estimator. Recall that a density estimator is an algorithm which takes a D-dimensional dataset and produces an estimate of the D-dimensional probability distribution which that data is drawn from. The GMM algorithm accomplishes this by representing the density as a weighted sum of Gaussian distributions. Kernel density estimation (KDE) is in some senses an algorithm which takes the mixture-of-Gaussians idea to its logical extreme: it uses a mixture consisting of one Gaussian component per point, resulting in an essentially non-parametric estimator of density. In this section, we will explore the motivation and uses of KDE.
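In symbols: given N observations x_i and a kernel K with bandwidth h (for KDE, K is typically a standard Gaussian), the estimated density is ρ̂(x) = (1/N) Σ_i K_h(x − x_i), with K_h(u) = (1/h) K(u/h), so each observation contributes one small normalized bump to the total.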
We begin with the standard imports:
```python
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
```
Motivating KDE: Histograms
As already discussed, a density estimator is an algorithm which seeks to model the probability distribution that generated a dataset. For one dimensional data, you are probably already familiar with one simple density estimator: the histogram. A histogram divides the data into discrete bins, counts the number of points that fall in each bin, and then intuitively visualizes the results.
For example, let's create some data that is drawn from two normal distributions:
```python
def make_data(N, f=0.3, rseed=1):
    rand = np.random.RandomState(rseed)
    x = rand.randn(N)
    x[int(f * N):] += 5
    return x

x = make_data(1000)
```
We have previously seen that the standard count-based histogram can be created with the plt.hist() function. By specifying the density parameter of the histogram, we end up with a normalized histogram where the height of the bins does not reflect counts, but instead reflects probability density:

```python
hist = plt.hist(x, bins=30, density=True)
```
Notice that for equal binning, this normalization simply changes the scale on the y-axis, leaving the relative heights essentially the same as in a histogram built from counts. This normalization is chosen so that the total area under the histogram is equal to 1, as we can confirm by looking at the output of the histogram function:
```python
density, bins, patches = hist
widths = bins[1:] - bins[:-1]
(density * widths).sum()
```
1.0
One of the issues with using a histogram as a density estimator is that the choice of bin size and location can lead to representations that have qualitatively different features. For example, if we look at a version of this data with only 20 points, the choice of how to draw the bins can lead to an entirely different interpretation of the data! Consider this example:
```python
x = make_data(20)
bins = np.linspace(-5, 10, 10)

fig, ax = plt.subplots(1, 2, figsize=(12, 4),
                       sharex=True, sharey=True,
                       subplot_kw={'xlim': (-4, 9),
                                   'ylim': (-0.02, 0.3)})
fig.subplots_adjust(wspace=0.05)
for i, offset in enumerate([0.0, 0.6]):
    ax[i].hist(x, bins=bins + offset, density=True)
    ax[i].plot(x, np.full_like(x, -0.01), '|k',
               markeredgewidth=1)
```
On the left, the histogram makes clear that this is a bimodal distribution. On the right, we see a unimodal distribution with a long tail. Without seeing the preceding code, you would probably not guess that these two histograms were built from the same data: with that in mind, how can you trust the intuition that histograms confer? And how might we improve on this?
Stepping back, we can think of a histogram as a stack of blocks, where we stack one block within each bin on top of each point in the dataset. Let's view this directly:
```python
fig, ax = plt.subplots()
bins = np.arange(-3, 8)
ax.plot(x, np.full_like(x, -0.1), '|k', markeredgewidth=1)
for count, edge in zip(*np.histogram(x, bins)):
    for i in range(count):
        ax.add_patch(plt.Rectangle((edge, i), 1, 1, alpha=0.5))
ax.set_xlim(-4, 8)
ax.set_ylim(-0.2, 8)
```
The problem with our two binnings stems from the fact that the height of the block stack often reflects not on the actual density of points nearby, but on coincidences of how the bins align with the data points. This mis-alignment between points and their blocks is a potential cause of the poor histogram results seen here. But what if, instead of stacking the blocks aligned with the bins, we were to stack the blocks aligned with the points they represent? If we do this, the blocks won't be aligned, but we can add their contributions at each location along the x-axis to find the result. Let's try this:
```python
x_d = np.linspace(-4, 8, 2000)
density = sum((abs(xi - x_d) < 0.5) for xi in x)

plt.fill_between(x_d, density, alpha=0.5)
plt.plot(x, np.full_like(x, -0.1), '|k', markeredgewidth=1)
plt.axis([-4, 8, -0.2, 8])
```
The result looks a bit rough around the edges, but it is a far more robust reflection of the actual data characteristics than the standard histogram.
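Stacking a smooth kernel function, such as a Gaussian, on each point rather than a block, and summing the contributions, removes these rough edges: that is precisely kernel density estimation. The practical walkthrough is not reproduced in this excerpt, but as a minimal sketch, Scikit-Learn's KernelDensity estimator computes the same kind of estimate (the bandwidth of 1.0 here is an illustrative choice):

```python
from sklearn.neighbors import KernelDensity

# fit a Gaussian kernel density model to the 1-D data
kde = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde.fit(x[:, None])

# score_samples returns the log of the probability density
logprob = kde.score_samples(x_d[:, None])

plt.fill_between(x_d, np.exp(logprob), alpha=0.5)
plt.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
```

As with the histogram bin width, the bandwidth controls the bias-variance trade-off of the estimate and can be tuned by cross-validation.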
Example: KDE on a Sphere
One useful application of KDE is to data that lives on a sphere, such as latitude/longitude observations. The handbook's example estimates the geographic distributions of two South American mammals, Bradypus variegatus and Microryzomys minutus, from Scikit-Learn's species-distribution dataset. The code below assumes that setup has already been done: the observation coordinates are in latlon with a matching species label array, and a latitude/longitude grid X, Y has been built along with a land_reference raster in which -9999 marks ocean cells. We first restrict the evaluation grid to points on land:

```python
# keep only grid points on land (-9999 indicates ocean in land_reference)
land_mask = (land_reference > -9999).ravel()
xy = np.vstack([Y.ravel(), X.ravel()]).T
xy = np.radians(xy[land_mask])
```
With the grid in hand, we can fit a haversine-metric KDE for each species and plot the resulting density over a basemap:

```python
from mpl_toolkits.basemap import Basemap
from sklearn.neighbors import KernelDensity

# Create two side-by-side plots
fig, ax = plt.subplots(1, 2)
fig.subplots_adjust(left=0.05, right=0.95, wspace=0.05)

species_names = ['Bradypus Variegatus', 'Microryzomys Minutus']
cmaps = ['Purples', 'Reds']

for i, axi in enumerate(ax):
    axi.set_title(species_names[i])

    # plot coastlines with basemap
    m = Basemap(projection='cyl', llcrnrlat=Y.min(),
                urcrnrlat=Y.max(), llcrnrlon=X.min(),
                urcrnrlon=X.max(), resolution='c', ax=axi)
    m.drawmapboundary(fill_color='#DDEEFF')
    m.drawcoastlines()
    m.drawcountries()

    # construct a spherical kernel density estimate of the distribution
    kde = KernelDensity(bandwidth=0.03, metric='haversine')
    kde.fit(np.radians(latlon[species == i]))

    # evaluate only on the land: -9999 indicates ocean
    Z = np.full(land_mask.shape[0], -9999.0)
    Z[land_mask] = np.exp(kde.score_samples(xy))
    Z = Z.reshape(X.shape)

    # plot contours of the density
    levels = np.linspace(0, Z.max(), 25)
    axi.contourf(X, Y, Z, levels=levels, cmap=cmaps[i])
```
Compared to the simple scatter plot we initially used, this visualization paints a much clearer picture of the geographical distribution of observations of these two species.
Example: Not-So-Naive Bayes
This example looks at Bayesian generative classification with KDE, and demonstrates how to use the Scikit-Learn architecture to create a custom estimator.
In In Depth: Naive Bayes Classification, we took a look at naive Bayesian classification, in which we created a simple generative model for each class, and used these models to build a fast classifier. For Gaussian naive Bayes, the generative model is a simple axis-aligned Gaussian. With a density estimation algorithm like KDE, we can remove the "naive" element and perform the same classification with a more sophisticated generative model for each class. It's still Bayesian classification, but it's no longer naive.
The general approach for generative classification is this:
1. Split the training data by label.
2. For each set, fit a KDE to obtain a generative model of the data. This allows you, for any observation x and label y, to compute a likelihood P(x | y).
3. From the number of examples of each class in the training set, compute the class prior, P(y).
4. For an unknown point x, the posterior probability for each class is P(y | x) ∝ P(x | y) P(y). The class which maximizes this posterior is the label assigned to the point.
The algorithm is straightforward and intuitive to understand; the more difficult piece is couching it within the Scikit-Learn framework in order to make use of the grid search and cross-validation architecture.
This is the code that implements the algorithm within the Scikit-Learn framework; we will step through it following the code block:
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity


class KDEClassifier(BaseEstimator, ClassifierMixin):
    """Bayesian generative classification based on KDE

    Parameters
    ----------
    bandwidth : float
        the kernel bandwidth within each class
    kernel : str
        the kernel name, passed to KernelDensity
    """
    def __init__(self, bandwidth=1.0, kernel='gaussian'):
        self.bandwidth = bandwidth
        self.kernel = kernel

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        training_sets = [X[y == yi] for yi in self.classes_]
        self.models_ = [KernelDensity(bandwidth=self.bandwidth,
                                      kernel=self.kernel).fit(Xi)
                        for Xi in training_sets]
        self.logpriors_ = [np.log(Xi.shape[0] / X.shape[0])
                           for Xi in training_sets]
        return self

    def predict_proba(self, X):
        logprobs = np.array([model.score_samples(X)
                             for model in self.models_]).T
        result = np.exp(logprobs + self.logpriors_)
        return result / result.sum(1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), 1)]
```
The anatomy of a custom estimator
Let's step through this code and discuss the essential features:
```python
from sklearn.base import BaseEstimator, ClassifierMixin

class KDEClassifier(BaseEstimator, ClassifierMixin):
    """Bayesian generative classification based on KDE

    Parameters
    ----------
    bandwidth : float
        the kernel bandwidth within each class
    kernel : str
        the kernel name, passed to KernelDensity
    """
```
Each estimator in Scikit-Learn is a class, and it is most convenient for this class to inherit from the BaseEstimator class as well as the appropriate mixin, which provides standard functionality. For example, among other things, here the BaseEstimator contains the logic necessary to clone/copy an estimator for use in a cross-validation procedure, and ClassifierMixin defines a default score() method used by such routines. We also provide a doc string, which will be captured by IPython's help functionality (see Help and Documentation in IPython).
Next comes the class initialization method:
```python
def __init__(self, bandwidth=1.0, kernel='gaussian'):
    self.bandwidth = bandwidth
    self.kernel = kernel
```
This is the actual code that is executed when the object is instantiated with KDEClassifier(). In Scikit-Learn, it is important that initialization contains no operations other than assigning the passed values by name to self. This is due to the logic contained in BaseEstimator required for cloning and modifying estimators for cross-validation, grid search, and other functions. Similarly, all arguments to __init__ should be explicit: i.e. *args or **kwargs should be avoided, as they will not be correctly handled within cross-validation routines.
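A quick way to see what this buys us: BaseEstimator's get_params() introspects the __init__ signature, and clone() uses those parameters to rebuild a fresh, unfitted copy. A minimal check, assuming the KDEClassifier defined above:

```python
from sklearn.base import clone

model = KDEClassifier(bandwidth=2.0)
print(model.get_params())   # {'bandwidth': 2.0, 'kernel': 'gaussian'}

# clone() reconstructs an unfitted estimator from these parameters;
# this only works because __init__ does nothing but store them by name
model2 = clone(model)
```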
Next comes the fit() method, where we handle training data:
```python
def fit(self, X, y):
    self.classes_ = np.sort(np.unique(y))
    training_sets = [X[y == yi] for yi in self.classes_]
    self.models_ = [KernelDensity(bandwidth=self.bandwidth,
                                  kernel=self.kernel).fit(Xi)
                    for Xi in training_sets]
    self.logpriors_ = [np.log(Xi.shape[0] / X.shape[0])
                       for Xi in training_sets]
    return self
```
Here we find the unique classes in the training data, train a KernelDensity model for each class, and compute the class priors based on the number of input samples. Finally, fit() should always return self so that we can chain commands. For example:
label = model.fit(X, y).predict(X)
Notice that each persistent result of the fit is stored with a trailing underscore (e.g., self.logpriors_). This is a convention used in Scikit-Learn so that you can quickly scan the members of an estimator (using IPython's tab completion) and see exactly which members are fit to training data.
Finally, we have the logic for predicting labels on new data:
```python
def predict_proba(self, X):
    logprobs = np.array([model.score_samples(X)
                         for model in self.models_]).T
    result = np.exp(logprobs + self.logpriors_)
    return result / result.sum(1, keepdims=True)

def predict(self, X):
    return self.classes_[np.argmax(self.predict_proba(X), 1)]
```
Because this is a probabilistic classifier, we first implement predict_proba() which returns an array of class probabilities of shape [n_samples, n_classes]. Entry [i, j] of this array is the posterior probability that sample i is a member of class j, computed by multiplying the likelihood by the class prior and normalizing.
Finally, the predict() method uses these probabilities and simply returns the class with the largest probability.
Using our custom estimator
Let's try this custom estimator on a problem we have seen before: the classification of hand-written digits. Here we will load the digits, and compute the cross-validation score for a range of candidate bandwidths using the GridSearchCV meta-estimator (refer back to Hyperparameters and Model Validation):
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV

digits = load_digits()

bandwidths = 10 ** np.linspace(0, 2, 100)
grid = GridSearchCV(KDEClassifier(), {'bandwidth': bandwidths})
grid.fit(digits.data, digits.target)

scores = grid.cv_results_['mean_test_score']
```
Next we can plot the cross-validation score as a function of bandwidth:
```python
plt.semilogx(bandwidths, scores)
plt.xlabel('bandwidth')
plt.ylabel('accuracy')
plt.title('KDE Model Performance')

print(grid.best_params_)
print('accuracy =', grid.best_score_)
```
{'bandwidth': 7.0548023107186433}
accuracy = 0.966611018364
We see that this not-so-naive Bayesian classifier reaches a cross-validation accuracy of just over 96%; this is compared to around 80% for the naive Bayesian classification:
```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

cross_val_score(GaussianNB(), digits.data, digits.target).mean()
```
0.81860038035501381
One benefit of such a generative classifier is interpretability of results: for each unknown sample, we not only get a probabilistic classification, but a full model of the distribution of points we are comparing it to! If desired, this offers an intuitive window into the reasons for a particular classification that algorithms like SVMs and random forests tend to obscure.
If you would like to take this further, there are some improvements that could be made to our KDE classifier model:
- we could allow the bandwidth in each class to vary independently
- we could optimize these bandwidths not based on their prediction score, but on the likelihood of the training data under the generative model within each class (i.e. use the scores from KernelDensity itself rather than the global prediction accuracy); a sketch of this appears below
Finally, if you want some practice building your own estimator, you might tackle building a similar Bayesian classifier using Gaussian Mixture Models instead of KDE.
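As a minimal, hedged sketch of the second improvement (the helper name and bandwidth grid below are illustrative choices, not part of the handbook):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def best_bandwidth_for_class(Xi):
    """Hypothetical helper: tune the bandwidth on one class's data.

    KernelDensity.score() returns the total log-likelihood of held-out
    data, so GridSearchCV here optimizes the generative fit of each
    class model rather than the global prediction accuracy.
    """
    bandwidths = 10 ** np.linspace(-1, 1, 20)   # illustrative grid
    grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                        {'bandwidth': bandwidths})
    grid.fit(Xi)
    return grid.best_params_['bandwidth']
```

Running this once per training set inside fit() would give each class its own bandwidth, covering the first suggestion as well.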
- Motivating KDE: Histograms (00:00)
- Kernel Density Estimation in Practice (00:00)
- Example: KDE on a Sphere (00:00)
- Example: Not-So-Naive Bayes (00:00)
7.4 Hierarchical Clustering
Hierarchical Clustering Overview
Hierarchical Clustering is an unsupervised learning algorithm used to group similar objects into clusters. It builds a hierarchy of clusters in a tree-like structure called a dendrogram.
Types of Hierarchical Clustering
- Agglomerative (Bottom-Up): start with each data point as its own cluster, then iteratively merge the closest pairs of clusters until all points are in a single cluster or a stopping criterion is met.
- Divisive (Top-Down): start with all data points in a single cluster, then recursively split clusters until each cluster contains a single point or a stopping criterion is met.
Steps in Agglomerative Hierarchical Clustering
1. Calculate the distance matrix: compute the distance between every pair of data points (e.g., using Euclidean distance).
2. Merge the closest clusters: find the pair of clusters with the smallest distance and merge them.
3. Update the distance matrix to reflect the merge.
4. Repeat the merging process until only one cluster remains or the desired number of clusters is reached.
Distance Metrics
Common distance metrics used in hierarchical clustering:
- Euclidean distance
- Manhattan distance
- Cosine distance (1 − cosine similarity)
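As a quick illustration of how these metrics differ, here is a minimal sketch using SciPy (the vectors are made up):

```python
from scipy.spatial.distance import euclidean, cityblock, cosine

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

print(euclidean(a, b))   # straight-line distance: sqrt(1 + 4 + 9) ≈ 3.74
print(cityblock(a, b))   # Manhattan distance: 1 + 2 + 3 = 6
print(cosine(a, b))      # cosine *distance* = 1 - similarity = 0 (parallel vectors)
```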
Linkage Criteria
Different linkage criteria determine how the distance between two clusters is calculated:
- Single linkage: the minimum distance between points in the two clusters.
- Complete linkage: the maximum distance between points in the two clusters.
- Average linkage: the average distance between points in the two clusters.
- Ward's method: minimize the variance within each cluster (see the sketch after this list).
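To make the algorithm concrete, here is a minimal sketch using SciPy's hierarchical clustering routines on made-up data (the blob centers and the cluster count of 3 are arbitrary illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# made-up 2-D data: three loose blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

# agglomerative clustering with Ward's method;
# 'single', 'complete', and 'average' linkage work the same way
Z = linkage(X, method='ward')

# the dendrogram shows the full merge hierarchy
dendrogram(Z)
plt.xlabel('sample index')
plt.ylabel('merge distance')

# cut the tree to obtain a flat assignment into 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
```

Scikit-Learn's AgglomerativeClustering(n_clusters=3, linkage='ward') offers the same algorithm behind the familiar fit_predict interface.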
- Introduction to Hierarchical Clustering (00:00)
7.5 Clustering Weekly Project
- Customer Segmentation using Clustering (00:00)
- Submit Your Project (00:00)
8.1 Introduction to Time Series Analysis
- Intro to Time Series (23:40)
- Creating Time Series Objects (25:01)
- Working with Time Series Data (38:54)
- Intro to Timeseries
8.2 Time Series – Basic Time Series Models
- Picking the Correct Model (02:41)
- AR Model (54:10)
- MA Models (34:17)
- ARMA Model (42:04)
- ARIMA Model (45:44)
8.3 ARCH and GARCH Models
- ARCH Model (33:54)
- GARCH Model (13:49)
8.4 Time Series Forecasting in Python
- Auto ARIMA (27:44)
- Forecasting (45:47)
- Automobile Business Case (27:56)
8.5 Time Series Weekly Project
- Air Quality Forecasting (00:00)
- Submit Your Project (00:00)
9.1 Introduction to Neural Networks and Deep Learning
- Introduction to Deep Learning (58:12)
9.2 Convolutional Neural Networks
- Foundations of CNNs (55:15)
9.3 Sequential Models
- Recurrent Neural Networks (01:02:50)
9.4 Deep Generative Modelling
- Introduction to Deep Generative Modelling (59:52)
9.5 Deep Learning Weekly Project
- Global Wheat Detection (00:00)
- Submit Your Project (00:00)
10.0 Conclusion and Next Steps
- Capstone Project (00:00)
- Submit Your Capstone (00:00)