Machine learning algorithms can take a long time to train on large, high-dimensional datasets. Dimensionality reduction techniques address this problem. When the input dimension is high, Principal Component Analysis (PCA) can be used to speed up training. It is a projection method that maps the data to a lower-dimensional space while retaining most of the variance in the original data.
In this article, we will discuss the basics of Principal Component Analysis (PCA) on matrices, with an implementation in Python. We then apply the technique alongside a classification method.
Dataset
The dataset can be downloaded from the following link. The dataset gives the details of breast cancer patients. It has 569 rows and 32 columns: an ID, the diagnosis, and 30 numeric features.
Let’s get started. Import all the libraries required for this project.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
Loading the dataset
dataset = pd.read_csv('cancerdataset.csv')
# Encode the diagnosis: malignant (M) as 1, benign (B) as 0
dataset["diagnosis"] = dataset["diagnosis"].map({'M': 1, 'B': 0})
# Drop the last column of the CSV
data = dataset.iloc[:, 0:-1]
data.head()
Store the independent variables (the features) in X and the dependent variable (the diagnosis) in y using the iloc method.
X = data.iloc[:, 2:].values  # feature columns, from the third column onward
y = data.iloc[:, 1].values   # second column, used as the target
Split the data into training and test sets in an 80:20 ratio.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
PCA Standardization
PCA can only be applied to numerical data, so it is important to convert all the data into a numerical format. We also need to standardize the data so that features measured in different units end up on the same scale.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit on the training set only, then reuse its mean and scale for the test set
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
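To make the standardization step concrete, here is a minimal NumPy sketch of the same z-score idea. The `standardize` helper and the synthetic `toy_train`/`toy_test` arrays are illustrative assumptions and are not part of the article's dataset or code.

```python
import numpy as np

def standardize(train, test):
    """Z-score both sets using the training mean and standard deviation."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)   # population std (ddof=0), which StandardScaler also uses
    std[std == 0] = 1.0       # avoid dividing by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# Synthetic stand-in data: two features on very different scales (assumption)
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(200, 2))
toy_test = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(50, 2))

toy_train_std, toy_test_std = standardize(toy_train, toy_test)
print(toy_train_std.mean(axis=0))  # close to 0 for each feature
print(toy_train_std.std(axis=0))   # close to 1 for each feature
```

The test set is scaled with the training statistics, so its mean and standard deviation will only be approximately 0 and 1.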
Covariance Matrix
Using the standardized data, we build the covariance matrix. It gives the covariance between each pair of features in the dataset, with each feature's variance on the diagonal. A negative value in the result below means the two features are inversely related: as one increases, the other tends to decrease.
# Covariance matrix of the training set
mean_vec = np.mean(X_train, axis=0)
cov_mat = (X_train - mean_vec).T.dot(X_train - mean_vec) / (X_train.shape[0] - 1)
# Covariance matrix of the test set, centered on its own mean
mean_vect = np.mean(X_test, axis=0)
cov_matt = (X_test - mean_vect).T.dot(X_test - mean_vect) / (X_test.shape[0] - 1)
print(cov_mat)
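As a quick check of the formula above, the following sketch compares the manual computation with `np.cov` on a small synthetic array. The `toy` data is an assumption for illustration, built so that its two columns move in opposite directions and produce a negative covariance.

```python
import numpy as np

# Synthetic data (assumption): column b is constructed to move opposite to column a
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = -0.8 * a + 0.2 * rng.normal(size=500)
toy = np.column_stack([a, b])

# Manual covariance: center the data, then divide by (n - 1)
centered = toy - toy.mean(axis=0)
manual_cov = centered.T @ centered / (toy.shape[0] - 1)

# np.cov expects variables in rows, hence the transpose
numpy_cov = np.cov(toy.T)

print(np.allclose(manual_cov, numpy_cov))  # the two computations agree
print(manual_cov[0, 1])                    # negative off-diagonal entry
```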
Eigen Decomposition on Covariance Matrix
Each eigenvector of the covariance matrix has a corresponding eigenvalue, and the sum of the eigenvalues equals the total variance in the dataset. The eigenvector with the largest eigenvalue points in the direction of maximum variance. The eigenvectors with the lowest eigenvalues capture the least variation in the dataset, so these components can be dropped.
cov_mat = np.cov(X_train.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
cov_matt = np.cov(X_test.T)
print(eig_vals)
print(eig_vecs)
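One way to decide which eigenvectors to drop is to sort the eigenvalues and look at the share of variance each one explains. The sketch below assumes a synthetic `toy` array and a hypothetical helper `explained_variance_ratio`. It swaps in `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix, instead of the `np.linalg.eig` call used in the article.

```python
import numpy as np

def explained_variance_ratio(cov):
    """Return eigenvalues and eigenvectors sorted by descending eigenvalue,
    plus each eigenvalue's share of the total variance."""
    eig_vals, eig_vecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eig_vals)[::-1]        # indices for descending order
    eig_vals = eig_vals[order]
    eig_vecs = eig_vecs[:, order]             # eigenvectors are the columns
    return eig_vals, eig_vecs, eig_vals / eig_vals.sum()

# Synthetic data (assumption): four features with very different spreads
rng = np.random.default_rng(2)
toy = rng.normal(size=(300, 4)) * np.array([5.0, 2.0, 1.0, 0.1])
toy_cov = np.cov(toy.T)

vals, vecs, ratio = explained_variance_ratio(toy_cov)
print(ratio)             # share of variance per component, largest first
print(np.cumsum(ratio))  # cumulative share, useful for choosing how many to keep
```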
We need to specify how many components we want to keep. Keeping 2 components reduces the dimension from the 30 feature columns in X (the 32 columns without the ID and diagnosis) to 2. The first and second principal components capture the most variance in the original dataset.
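The projection onto the leading components can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: `project_top_k` is a hypothetical helper, the arrays are random stand-ins for `X_train` and `X_test`, and `np.linalg.eigh` is used for the decomposition.

```python
import numpy as np

def project_top_k(train, test, k=2):
    """Project both sets onto the top-k eigenvectors of the training covariance."""
    mean = train.mean(axis=0)
    cov = np.cov((train - mean).T)
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    top = np.argsort(eig_vals)[::-1][:k]   # indices of the k largest eigenvalues
    W = eig_vecs[:, top]                   # projection matrix, shape (n_features, k)
    return (train - mean) @ W, (test - mean) @ W

# Random stand-in data (assumption): 6 features reduced to 2
rng = np.random.default_rng(3)
toy_train = rng.normal(size=(120, 6))
toy_test = rng.normal(size=(30, 6))

train_2d, test_2d = project_top_k(toy_train, toy_test, k=2)
print(train_2d.shape, test_2d.shape)  # (120, 2) (30, 2)
```

The test set is projected with the mean and eigenvectors learned from the training set, so that both sets live in the same reduced space before a classifier is trained on them.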