Big Data has been one of the most hyped buzzwords of the past few years. Big Data is data that is huge in size and grows exponentially with time. Dealing with big data isn't hard just because of its size; the complexity of the data is another factor that contributes to the difficulty. Performing data analysis, feature extraction, and similar tasks on such data is challenging, and this is where the art of data science comes in. Success here requires a good understanding of various mathematical and statistical techniques. In this post, let's explore the use of topology in data analysis and machine learning.
Topology for Data
Topology is the study of shapes and their properties. Topology doesn't care if you twist, bend, or shear a geometric object; it deals only with properties such as the number of loops or the number of connected components. You might have seen this weird equivalence between a cup and a donut.
Figure 1
They are considered equivalent because both of them have the same number of holes: exactly one. How is this useful for data analysis or machine learning? Let's look at a trivial classification example.
Figure 2
Using linear algorithms to classify this kind of data won't succeed. The shape of the data guides us to use nonlinear algorithms or to extract features that will help us.
Data has Shape, and the shape matters.
But drawing insights from data with a huge number of dimensions is non-trivial. Topology is a natural choice for studying shapes in higher dimensions, and it can be a powerful tool in your arsenal for tackling the feature engineering problems of complex data.
Feature Extraction using Topology
The goal of topological data analysis is to express the information contained in the data as a small number of highly informative parameters. The parameters most often used are the numbers of holes of each dimension in the data: a hole in dimension zero is a connected component, a hole in dimension one is a loop, a hole in dimension two is a void, and so on. For example, a circle has one connected component and one loop, while a hollow sphere has one connected component, no loops, and one void.
Figure 3
Persistent Homology
But how do we get these parameters for discrete data samples in a higher-dimensional space? This is where the concepts of simplicial complexes and persistent homology come into play.
Data points in space are treated as hyperspheres of a set radius instead of as points. An edge is drawn between every pair of data points whose spheres touch each other. The number of holes is then measured on the graph obtained from these connections. This graph, together with the higher-dimensional faces it encloses, is called a simplicial complex.
Figure 4
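As a rough illustration of this fixed-radius construction, here is a minimal sketch in plain NumPy/SciPy. The edges_at_radius helper is hypothetical (not from any TDA library), and it only counts zero-dimensional holes, i.e. connected components; counting loops and voids requires the full simplicial machinery that giotto-tda provides later in this post.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def edges_at_radius(points, radius):
    # Two spheres of the given radius touch when the distance
    # between their centers is at most 2 * radius.
    distances = squareform(pdist(points))
    adjacency = distances <= 2 * radius
    np.fill_diagonal(adjacency, False)
    return adjacency

# Sample 100 points on a circle and build the graph at one radius
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
circle = np.column_stack([np.cos(angles), np.sin(angles)])

adjacency = edges_at_radius(circle, radius=0.05)
n_components, _ = connected_components(csr_matrix(adjacency))
print(n_components)  # number of 0-dimensional holes at this radius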
Different radius values can reveal different structures in the data. Instead of using a single radius, we grow these hyperspheres and measure the parameters at intervals. We then consider the persistence of the features and use this persistence information as the representation of the data. This is called persistent homology. As the spheres grow, the graph eventually becomes fully connected.
Figure 5
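Continuing the sketch above (and reusing the hypothetical edges_at_radius helper and the circle sample), we can watch the zero-dimensional holes die as the radius grows: the many small components merge until only one remains.

# Count connected components at increasing radii
for radius in [0.01, 0.02, 0.05, 0.1, 0.5]:
    adjacency = edges_at_radius(circle, radius)
    n_components, _ = connected_components(csr_matrix(adjacency))
    print(f"radius={radius}: {n_components} components")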
Holes are created and destroyed at various resolutions during the growing process. The persistence information, i.e. the birth and death of each hole, is measured and represented as barcodes or persistence diagrams, shown below.
Figure 6
Now several features can be extracted from these representations and used for ML tasks. One good feature is persistence entropy. It is the Shannon entropy of the normalized bar lengths:

$E = -\sum_i p_i \log p_i, \qquad p_i = \frac{\ell_i}{\sum_j \ell_j}$

where $\ell_i$ is the length of the $i$-th bar (its death value minus its birth value) and the denominator is the sum of the lengths of all bars.
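Computed by hand, the formula looks like this. This is a minimal sketch assuming bars are given as (birth, death) pairs; giotto-tda's PersistenceEntropy transformer, used later, computes the same quantity per homology dimension.

import numpy as np

def persistence_entropy(bars):
    # bars: array of (birth, death) pairs
    lengths = bars[:, 1] - bars[:, 0]
    p = lengths / lengths.sum()
    return -np.sum(p * np.log(p))

bars = np.array([[0.0, 1.0], [0.2, 0.5], [0.1, 0.9]])
print(persistence_entropy(bars))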
Classifying 3D Shapes
Let's walk through an example of this process to gain a better understanding. We use giotto-tda, a high-performance topological machine learning toolkit in Python. It integrates really well with scikit-learn and is very intuitive to use.
Setup
!python -m pip install -U giotto-tda
!pip install openml
!pip install delayed
Data
We use the same data used in the giotto-tda tutorials. The data is loaded from Princeton's Computer Vision course.
from openml.datasets.functions import get_dataset

df = get_dataset('shapes').get_data(dataset_format='dataframe')[0]
df.head()
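The resulting dataframe holds the x, y and z coordinates of each point, along with a target column identifying the object the point belongs to.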
There are 4 classes of 3D objects in the data, with 10 samples for each class. Each object is represented by 400 points in 3D space.
We have to transform the data into point clouds to work with the library:
import numpy as np

point_clouds = np.asarray(
    [
        df.query("target == @shape")[["x", "y", "z"]].values
        for shape in df["target"].unique()
    ]
)
point_clouds.shape
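With 40 objects of 400 points each, this should be (40, 400, 3): one point cloud per object.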
Calculating Persistence Diagrams
from gtda.homology import VietorisRipsPersistence
from gtda.plotting import plot_diagram

# Track connected components, loops, and voids
homology_dimensions = [0, 1, 2]

persistence = VietorisRipsPersistence(
    metric="euclidean",
    homology_dimensions=homology_dimensions,
    n_jobs=6,
    collapse_edges=True,
)
persistence_diagrams = persistence.fit_transform(point_clouds)

# Example persistence diagram
plot_diagram(persistence_diagrams[10])
Figure 7
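Each diagram in persistence_diagrams is an array of (birth, death, homology dimension) triples, one per hole. Points far from the diagonal are the long-lived holes, which are the ones that say something meaningful about the shape.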
Persistence Entropy and Other Features
We can get the persistence entropy of each homology dimension using:
from gtda.diagrams import PersistenceEntropy

persistence_entropy = PersistenceEntropy(normalize=True)

# Calculate topological feature matrix
X = persistence_entropy.fit_transform(persistence_diagrams)
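X now has shape (40, 3): one entropy value per homology dimension for each of the 40 point clouds.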
Since we tracked only three homology dimensions, we get only three numbers for each point cloud. To increase the number of features, we can calculate other types of features. Following are some examples:
from gtda.diagrams import NumberOfPoints, Amplitude
from sklearn.pipeline import make_union

# Select a variety of metrics to calculate amplitudes
metrics = [
    {"metric": metric}
    for metric in ["bottleneck", "wasserstein", "landscape", "persistence_image"]
]

# Concatenate to generate 3 + 3 + (4 x 3) = 18 topological features
feature_union = make_union(
    PersistenceEntropy(normalize=True),
    NumberOfPoints(n_jobs=-1),
    *[Amplitude(**metric, n_jobs=-1) for metric in metrics],
)
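Each Amplitude transformer summarizes a diagram by its distance from the trivial (empty) diagram under a different metric, so the four metrics capture complementary aspects of the same persistence information.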
Classification Pipeline
Finally, we can put all these things together and build a classification model.
from gtda.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

steps = [
    ("persistence", VietorisRipsPersistence(
        metric="euclidean",
        homology_dimensions=homology_dimensions,
        n_jobs=6,
    )),
    ("features", feature_union),
    ("model", RandomForestClassifier(oob_score=True)),
]
pipeline = Pipeline(steps)

# One class label per point cloud. df['target'].unique() names each of
# the 40 objects individually, so it is not a valid label vector; here
# we assume the point clouds come out grouped by class, 10 per class.
labels = np.repeat(np.arange(4), 10)
pipeline.fit(point_clouds, labels)
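Since the forest was built with oob_score=True, one quick way to gauge accuracy without a held-out test set is to read the out-of-bag score from the fitted step (a sketch, assuming a recent scikit-learn):

# Out-of-bag accuracy of the fitted random forest
print(pipeline.named_steps["model"].oob_score_)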