Have you ever worked on a dataset with more than a thousand features? How about over 50,000 features? I have, and let me tell you, it’s a very challenging task, especially if you don’t know where to start! Having a high number of variables is both a boon and a curse. It’s great that we have loads of data for analysis, but it is challenging due to size.
It’s not feasible to analyze each variable at a microscopic level. It might take us days or months to perform any meaningful analysis, and we’ll lose much time and money for our business! not to mention the amount of computational power this will take. We need a better way to deal with high-dimensional data to extract patterns and insights from it quickly. So, how do we approach such a dataset?
Of course, you can use dimensionality reduction techniques. This concept allows you to reduce the number of features in your dataset without losing much information and keep (or improve) the model’s performance. As you’ll see in this article, it’s a powerful way to deal with huge datasets.
This is a comprehensive guide to various dimensionality reduction techniques that can be used in practical scenarios. We will first understand this concept and why we should use it before diving into the 12 different techniques I have covered. Each technique has its implementation in Python to familiarize you with it.
We are generating a tremendous amount of data daily. 90% of the world’s data has been generated in the last 3-4 years! The numbers are truly mind-boggling. Below are just some of the examples of the kind of data being collected:
As data generation and collection keeps increasing, visualizing it and drawing inferences becomes more and more challenging. One of the most common ways of doing visualization is through charts. Suppose we have 2 variables: Age and Height. We can use a scatter or line plot between Age and Height and visualize their relationship easily:
Now consider a case where we have, say, 100 variables (p=100). We can have 100(100-1)/2 = 5000 different plots in this case. It does not make much sense to visualize each of them separately, right? In such cases where we have many variables, it is better to select a subset of these variables (p<<100) which captures as much information as the original set of variables.
Let us understand this with a simple example. Consider the below image:
Here we have weights of similar objects in Kg (X1) and Pound (X2). If we use both of these variables, they will convey similar information. So, it would make sense to use only one variable. We can convert the data from 2D (X1 and X2) to 1D (Y1) as shown below:
Similarly, we can reduce the p dimensions of the data into a subset of k dimensions (k<<n). This is called dimensionality reduction.
Here are some of the benefits of applying dimensionality reduction to a dataset:
It’s time to dive into the crux of this article—the various dimensionality reduction techniques! We will use the dataset from AV’s Practice Problem: Big Mart Sales III (register on this link and download the dataset from the data section).
Dimensionality reduction can be done in two different ways:
We will now look at various dimensionality reduction techniques and how to implement each in Python.
Suppose you’re given a dataset. What would be your first step? You would naturally want to explore the data first before building a model. While exploring the data, your dataset has some missing values. Now what? You will try to find out why these are missing values and then impute them or drop the variables entirely that have missing values (using appropriate methods).
What if we have too many missing values (more than 50%)? Should we impute the missing values or drop the variable? I would prefer to drop the variable since it will not have much information. However, this isn’t set in stone. We can set a threshold value; if the percentage of missing values in any variable exceeds that threshold, we will drop the variable.
Let’s implement this approach in Python.
Python Code:
<span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span>
First, let’s load the data:
Python Code:
First, let’s load the data:
# read the data
train=pd.read_csv("Train_UWu5bXk.csv")
Note: The path of the file should be added while reading the data.
Now, we will check the percentage of missing values in each variable. We can use .isnull().sum() to calculate this.
# checking the percentage of missing values in each variable
train.isnull().sum()/len(train)*100
As you can see in the above table, there aren’t too many missing values (just 2 variables have them, actually). We can impute the values using appropriate methods or set a threshold of, say, 20% and remove the variable having more than 20% missing values. Let’s look at how this can be done in Python:
# saving missing values in a variable
a = train.isnull().sum()/len(train)*100
# saving column names in a variable
variables = train.columns
variable = [ ]
for i in range(0,12):
if a[i]<=20: #setting the threshold as 20%
variable.append(variables[i])
So, the variables to be used are stored in “variable,” which contains only those features with missing values of less than 20%.
Consider a variable in our dataset where all the observations have the same value, say 1. If we use this variable, can it improve the model we will build? The answer is no because this variable will have zero variance.
So, we need to calculate the variance of each variable we are given. Then, we drop the variables with low variance compared to other variables in our dataset. As I mentioned above, this is because variables with a low variance will not affect the target variable.
Let’s first impute the missing values in the Item_Weight column using the median value of the known Item_Weight observations. For the Outlet_Size column, we will use the mode of the known Outlet_Size values to impute the missing values:
train['Item_Weight'].fillna(train['Item_Weight'].median(), inplace=True)
train['Outlet_Size'].fillna(train['Outlet_Size'].mode()[0], inplace=True)
Let’s check whether all the missing values have been filled:
train.isnull().sum()/len(train)*100
Voila! We are all set. Now let’s calculate the variance of all the numerical variables.
train.var()
As the above output shows, the variance of Item_Visibility is very low compared to the other variables. We can safely drop this column. This is how we apply a low-variance filter. Let’s implement this in Python:
numeric = train[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']]
var = numeric.var()
numeric = numeric.columns
variable = [ ]
for i in range(0,len(var)):
if var[i]>=10: #setting the threshold as 10%
variable.append(numeric[i+1])
The above code gives us a list of variables with a variance greater than 10.
The high correlation between two variables means they have similar trends and are likely to carry similar information. This can drastically reduce the performance of some models (linear and logistic regression models, for instance). We can calculate the correlation between independent numerical variables that are numerical in nature. Suppose the correlation coefficient crosses a certain threshold value. In that case, we can drop one of the variables (dropping a variable is highly subjective and should always be done keeping the domain in mind).
As a general guideline, we should keep those variables which show a decent or high correlation with the target variable.
Let’s perform the correlation calculation in Python. First, we will drop the dependent variable (Item_Outlet_Sales) and save the remaining variables in a new dataframe (df).
df=train.drop('Item_Outlet_Sales', 1)
df.corr()
Wonderful, we don’t have any variables with a high correlation in our dataset. Generally, if the correlation between a pair of variables is greater than 0.5-0.6, we should seriously consider dropping one of those variables.
Random Forest is one of the most widely used algorithms for feature selection. It comes packaged with in-built feature importance so you don’t need to program that separately. This helps us select a smaller subset of features.
We must convert the data into numeric form by applying one hot encoding, as Random Forest (Scikit-Learn Implementation) takes only numeric inputs. Let’s also drop the ID variables (Item_Identifier and Outlet_Identifier), as these are just unique numbers and hold no significant importance for us currently.
from sklearn.ensemble import RandomForestRegressor
df=df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1)
model = RandomForestRegressor(random_state=1, max_depth=10)
df=pd.get_dummies(df)
model.fit(df,train.Item_Outlet_Sales)
After fitting the model, plot the feature importance graph:
features = df.columns
importances = model.feature_importances_
indices = np.argsort(importances)[-9:] # top 10 features
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Based on the above graph, we can hand-pick the top-most features to reduce the dimensionality in our dataset. Alternatively, we can use the SelectFromModel of sklearn to do so. It selects the features based on the importance of their weights.
from sklearn.feature_selection import SelectFromModel
feature = SelectFromModel(model)
Fit = feature.fit_transform(df, train.Item_Outlet_Sales)
Follow the below steps to understand and use the ‘Backward Feature Elimination’ technique:
This method can be used when building Linear Regression or Logistic Regression models. Let’s look at its Python implementation:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn import datasets
lreg = LinearRegression()
rfe = RFE(lreg, 10)
rfe = rfe.fit_transform(df, train.Item_Outlet_Sales)
We need to specify the algorithm and number of features to select, and we get back the variables obtained from backward feature elimination. We can also check the ranking of the variables using the “rfe.ranking_” command.
This is the opposite process of the Backward Feature Elimination we saw above. Instead of eliminating features, we focus on finding the best features that enhance the model’s performance. This technique, known as feature extraction, operates as follows:
Let’s implement it in Python:
from sklearn.feature_selection import f_regression
ffs = f_regression(df,train.Item_Outlet_Sales )
This returns an array containing the F-values of the variables and the p-values corresponding to each F value. Refer to this link to learn more about F-values. For our purpose, we will select the variables having F-value greater than 10:
variable = [ ]
for i in range(0,len(df.columns)-1):
if ffs[0][i] >=10:
variable.append(df.columns[i])
This gives us the topmost variables based on the forward feature selection algorithm.
NOTE: Both Backward Feature Elimination and Forward Feature Selection are time-consuming and computationally expensive.They are practically only used on datasets with a small number of input variables.
The techniques we have seen so far are generally used when we do not have many variables in our dataset. These are more or less feature selection techniques. In the upcoming sections, we will work with the Fashion MNIST dataset, which consists of images of different types of apparel, e.g., T-shirts, trousers, bags, etc. The dataset can be downloaded from the “IDENTIFY THE APPAREL” practice problem.
The dataset has 70,000 images, 60,000 of which are in the training set and 10,000 test images. For the scope of this article, we will be working only on the training images. The train file is in a zip format. Once you extract the zip file, you will get a .csv file and a train folder that includes these 60,000 images. The corresponding label of each image can be found in the ‘train.csv’ file.
Suppose we have two variables: income and Education. These variables could potentially have a high correlation, as people with a higher education level tend to have significantly higher incomes, and vice versa.
In the Factor Analysis technique, variables are grouped based on their correlations. This means that variables within a group have high correlations with each other but low correlations with variables from other groups. Each group, called a factor, represents a set of related variables. These factors are fewer compared to the data’s original dimensions. However, they can be challenging to observe directly. Techniques like feature projection and embedded methods, such as Linear Discriminant Analysis (LDA), can help simplify these factors and integrate them into the linear modeling process.
Let’s first read in all the images contained in the train folder:
import pandas as pd
import numpy as np
from glob import glob
import cv2
images = [cv2.imread(file) for file in glob('train/*.png')]
NOTE: You must replace the path inside the glob function with the path of your train folder.
Now, we will convert these images into a numpy array format so that we can perform mathematical operations on them and plot them.
images = np.array(images)
images.shape
(60000, 28, 28, 3)
As you can see above, it’s a 3-dimensional array. We must convert it to 1-dimension as all the upcoming techniques only take 1-dimensional input. To do this, we need to flatten the images:
image = []
for i in range(0,60000):
img = images[i].flatten()
image.append(img)
image = np.array(image)
Let us now create a dataframe containing the pixel values of every individual pixel in each image and their corresponding labels (we will use the train.csv file for labels).
train = pd.read_csv("train.csv") # Give the complete path of your train.csv file
feat_cols = [ 'pixel'+str(i) for i in range(image.shape[1]) ]
df = pd.DataFrame(image,columns=feat_cols)
df['label'] = train['label']
Now, we will decompose the dataset using Factor Analysis:
from sklearn.decomposition import FactorAnalysis
FA = FactorAnalysis(n_components = 3).fit_transform(df[feat_cols].values)
Here, n_components will decide the number of factors in the transformed data. After transforming the data, it’s time to visualize the results:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(12,8))
plt.title('Factor Analysis Components')
plt.scatter(FA[:,0], FA[:,1])
plt.scatter(FA[:,1], FA[:,2])
plt.scatter(FA[:,2],FA[:,0])
Looks amazing, doesn’t it? We can see all the different factors in the above graph. Here, the x-axis and y-axis represent the values of decomposed factors. As I mentioned earlier, it is hard to observe these factors individually but we have successfully reduced the dimensions of our data.
PCA is a technique that helps us extract a new set of variables from an existing large set of variables. These newly extracted variables are called Principal Components Analysis. You can refer to this article to learn more about PCA. For your quick reference, below are some of the key points you should know about PCA before proceeding further:
Before moving further, we’ll randomly plot some of the images from our dataset:
plot some of the images from our dataset:
rndperm = np.random.permutation(df.shape[0])
plt.gray()
fig = plt.figure(figsize=(20,10))
for i in range(0,15):
ax = fig.add_subplot(3,5,i+1)
ax.matshow(df.loc[rndperm[i],feat_cols].values.reshape((28,28*3)).astype(float))
Let’s implement PCA using Python and transform the dataset:
from sklearn.decomposition import PCA
pca = PCA(n_components=4)
pca_result = pca.fit_transform(df[feat_cols].values)
In this case, n_components will decide the number of principal components in the transformed data. Let’s visualize how much variance has been explained using these 4 components. We will use explained_variance_ratio_ to calculate the same.
plt.plot(range(4), pca.explained_variance_ratio_)
plt.plot(range(4), np.cumsum(pca.explained_variance_ratio_))
plt.title("Component-wise and Cumulative Explained Variance")
In the above graph, the blue line represents the component-wise explained variance, while the orange line represents the cumulative explained variance. We can explain around 60% variance in the dataset using just four components. Let us now try to visualize each of these decomposed components:
import seaborn as sns
plt.style.use('fivethirtyeight')
fig, axarr = plt.subplots(2, 2, figsize=(12, 8))
sns.heatmap(pca.components_[0, :].reshape(28, 84), ax=axarr[0][0], cmap='gray_r')
sns.heatmap(pca.components_[1, :].reshape(28, 84), ax=axarr[0][1], cmap='gray_r')
sns.heatmap(pca.components_[2, :].reshape(28, 84), ax=axarr[1][0], cmap='gray_r')
sns.heatmap(pca.components_[3, :].reshape(28, 84), ax=axarr[1][1], cmap='gray_r')
axarr[0][0].set_title(
"{0:.2f}% Explained Variance".format(pca.explained_variance_ratio_[0]*100),
fontsize=12
)
axarr[0][1].set_title(
"{0:.2f}% Explained Variance".format(pca.explained_variance_ratio_[1]*100),
fontsize=12
)
axarr[1][0].set_title(
"{0:.2f}% Explained Variance".format(pca.explained_variance_ratio_[2]*100),
fontsize=12
)
axarr[1][1].set_title(
"{0:.2f}% Explained Variance".format(pca.explained_variance_ratio_[3]*100),
fontsize=12
)
axarr[0][0].set_aspect('equal')
axarr[0][1].set_aspect('equal')
axarr[1][0].set_aspect('equal')
axarr[1][1].set_aspect('equal')
plt.suptitle('4-Component PCA')
Each additional dimension we add to the Principal Component Analysis (PCA) technique captures less and less of the variance in the model. The first component is the most important one, followed by the second, then the third, and so on.
We can also use Singular Value Decomposition (SVD) to decompose our original dataset into its constituents, reducing dimensionality. To learn the mathematics behind SVD, refer to this article.
SVD decomposes the original variables into three constituent matrices. It is essentially used to remove redundant features from the dataset. It uses the concept of Eigenvalues and Eigenvectors to determine those three matrices. We will not go into the mathematics of it due to the scope of this article, but let’s stick to our plan, i.e., reducing the dimensions in our dataset.
Let’s implement SVD and decompose our original variables:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=3, random_state=42).fit_transform(df[feat_cols].values)
Let us visualize the transformed variables by plotting the first two principal components:
plt.figure(figsize=(12,8))
plt.title('SVD Components')
plt.scatter(svd[:,0], svd[:,1])
plt.scatter(svd[:,1], svd[:,2])
plt.scatter(svd[:,2],svd[:,0])
The above scatter plot shows us the decomposed components very neatly. As described earlier, there is not much correlation between these components.
Independent Component Analysis (ICA) is based on information theory and is one of the most widely used dimensionality reduction techniques. The major difference between PCA and ICA is that PCA looks for uncorrelated factors while ICA looks for independent factors.
If two variables are uncorrelated, there is no linear relation between them. If they are independent, they are not dependent on other variables. For example, a person’s age is independent of what that person eats or how much television he/she watches.
This algorithm assumes that the given variables are linear mixtures of some unknown latent variables. It also assumes that these latent variables are mutually independent, i.e., they are not dependent on other variables, and hence, they are called the independent components of the observed data.
Let’s compare PCA and ICA visually to get a better understanding of how they are different:
Here, image (a) represents the PCA results, while image (b) represents the ICA results on the same dataset.
The equation of PCA is x = Wχ.
Here,
Now, we must find an un-mixing matrix so that the components become as independent as possible. The most common method to measure the independence of components is Non-Gaussianity:
The above distribution is non-gaussian, which, in turn, makes the components independent. Let’s try to implement ICA in Python:
from sklearn.decomposition import FastICA
ICA = FastICA(n_components=3, random_state=12)
X=ICA.fit_transform(df[feat_cols].values)
Here, n_components will decide the number of components in the transformed data. We have transformed the data into 3 components using ICA. Let’s visualize how well it has transformed the data:
plt.figure(figsize=(12,8))
plt.title('ICA Components')
plt.scatter(X[:,0], X[:,1])
plt.scatter(X[:,1], X[:,2])
plt.scatter(X[:,2], X[:,0])
The data has been separated into different independent components, which can be seen very clearly in the above image. The X-axis and Y-axis represent the values of decomposed independent components.
Now, we shall look at some methods that reduce the data dimensions using projection techniques.
Linear Discriminant Analysis (LDA) is a way to determine the differences between groups of things. Imagine you have a bunch of fruits, some apples, and some oranges. LDA helps you determine what makes apples different from oranges.
Here’s how it works:
To start off, we need to understand what projection is. Suppose we have two vectors, vector a and vector b, as shown below:
We want to find the projection of a on b. Let the angle between a and b be ∅. The projection (a1) will look like:
a1 is the vector parallel to b. So, we can get the projection of a vector an on vector b using the below equation:
Here,
By projecting one vector onto the other, dimensionality can be reduced.
In projection techniques, multi-dimensional data analysis is represented by projecting its points onto a lower-dimensional space. Now, we will discuss different methods of projections:
Once upon a time, it was assumed that the Earth was flat. No matter where you go on Earth, it keeps looking flat (let’s ignore the mountains for a while). But if you keep walking in one direction, you will end up where you started. That wouldn’t happen if the Earth was flat. The Earth only looks flat because we are minuscule compared to its size.
These small portions, where the Earth appears flat, are manifolds. Combining these manifolds gives us a larger-scale view of the Earth, representing the original data. Similarly, in the case of an n-dimensional curve, these small flat pieces also represent manifolds. Combining these manifolds allows us to reconstruct the original n-dimensional curve. Now, let’s delve into the steps for projection onto manifolds, incorporating techniques such as backward feature elimination for optimization.
Let us understand the manifold projection technique with an example.
If a manifold is continuously differentiable to any order, it is known as a smooth or differentiable manifold. ISOMAP is an algorithm that aims to recover the full low-dimensional representation of a non-linear manifold. It assumes that the manifold is smooth.
It also assumes that for any pair of points on the manifold, the geodesic distance (shortest distance between two points on a curved surface) between the two points equals the Euclidean distance (shortest distance between two points on a straight line). Let’s first visualize the geodesic and Euclidean distance between a pair of points:
Here,
ISOMAP assumes both of these distances to be equal. Let’s now look at a more detailed explanation of this technique. As mentioned earlier, all these techniques work on a three-step approach. We will look at each of these steps in detail:
Let’s implement it in Python and get a clearer picture of what I’m talking about. We will perform a non-linear dimensionality reduction technique through Isometric Mapping. For visualization, we will only take a subset of our dataset as running it on the entire dataset will require a lot of time.
from sklearn import manifold
trans_data = manifold.Isomap(n_neighbors=5, n_components=3, n_jobs=-1).fit_transform(df[feat_cols][:6000].values)
Parameters used for non-linear dimensionality reduction technique:
Visualizing the transformed data:
plt.figure(figsize=(12,8))
plt.title('Decomposition using ISOMAP')
plt.scatter(trans_data[:,0], trans_data[:,1])
plt.scatter(trans_data[:,1], trans_data[:,2])
plt.scatter(trans_data[:,2], trans_data[:,0])
You can see above that the correlation between these components is very low. They are even less correlated than the components we obtained using SVD earlier!
So far, we have learned that PCA is a good choice for dimensionality reduction and visualization of datasets with many variables. But what if we could use something more advanced? What if we can easily search for patterns in a non-linear way? t-SNE is one such technique. There are mainly two types of approaches we can use to map the data points:
You can refer to this article to learn about t-SNE in more detail.
We will now implement it in Python and visualize the outcomes:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=3, n_iter=300).fit_transform(df[feat_cols][:6000].values)
n_components will decide the number of components in the transformed data. Time to visualize the transformed data:
plt.figure(figsize=(12,8))
plt.title('t-SNE components')
plt.scatter(tsne[:,0], tsne[:,1])
plt.scatter(tsne[:,1], tsne[:,2])
plt.scatter(tsne[:,2], tsne[:,0])
Here, you can see the different components that have been transformed using the powerful t-SNE technique.
t-SNE works very well on large datasets, but it also has its limitations, such as loss of large-scale information, slow computation time, and inability to represent very large datasets meaningfully. Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can preserve as much of the local and more of the global data structure as t-SNE, with a shorter runtime. Sounds intriguing, right?
Some of the key advantages of UMAP are:
This method uses the concept of k-nearest neighbor and optimizes the results using stochastic gradient descent. It first calculates the distance between the points in high-dimensional space, projects them onto low-dimensional space, and calculates the distance between points in this low-dimensional space. It then uses Stochastic Gradient Descent to minimize the difference between these distances. To get a more in-depth understanding of how UMAP works, check out this paper.
Refer here to see the documentation and installation guide of UMAP. We will now implement it in Python:
import umap
umap_data = umap.UMAP(n_neighbors=5, min_dist=0.3, n_components=3).fit_transform(df[feat_cols][:6000].values)
Here,
Let us visualize the transformation:
plt.figure(figsize=(12,8))
plt.title('Decomposition using UMAP')
plt.scatter(umap_data[:,0], umap_data[:,1])
plt.scatter(umap_data[:,1], umap_data[:,2])
plt.scatter(umap_data[:,2], umap_data[:,0])
The dimensions have been reduced, and now we can visualize the different transformed components. There is very little correlation between the transformed variables. Let’s compare the results from UMAP and t-SNE, especially focusing on how they affect machine learning models and independent variables.
In data mining, when we compare the components obtained from UMAP with those from t-SNE, we notice that the correlation between UMAP components is significantly lower than between t-SNE components. This suggests that UMAP generally yields better results in separating and understanding independent components.
As mentioned in UMAP’s GitHub repository, it often performs better than t-SNE at preserving aspects of the global structure of the data. This means that it can often provide a better “big picture” view of the data and preserve local neighbor relations.
Take a deep breath. We have covered quite a few of the dimensionality reduction techniques available. Let’s briefly summarize where each can be used.
This section will briefly summarize the use cases of each dimensionality reduction techniques we covered, including linear discriminant analysis. Understanding where and when to apply a particular technique can be beneficial, as it helps save time, effort, and computational power.
High dimensional data refers to datasets with a large number of features or variables, which means there are a lot of different pieces of information to consider. These datasets can be complicated because they contain so many data points or individual pieces of information. For example, imagine a dataset detailing thousands of products in a store. Each product has its own set of attributes like price, size, color, and more. Sorting through all this information can be like trying to untangle a big knot. Factor analysis can help make sense of high-dimensional spaces by finding patterns and relationships among the variables, simplifying the data, and making it easier to understand.
Lower-dimensional space, on the other hand, refers to a simplified version of higher-dimensional space. Instead of considering all the features or data points, we focus on a smaller number of dimensions or attributes. This simplification helps us understand the data better and can make it easier to analyze and interpret. Factor analysis, a method commonly used in data science, is often employed to reduce the dimensions of the data while retaining its essential information.
This is as comprehensive an article on dimensionality reduction techniques as you’ll find anywhere! I had a lot of fun writing it and found a few new ways of dealing with many variables I hadn’t used before (like UMAP).
Dealing with thousands and millions of features is a must-have skill for any data scientist or deep learning enthusiast. The amount of high-dimensional data we generate daily is unprecedented, and we must find different ways to use it. Dimensionality reduction techniques involves subspace and unsupervised learning methods, is a very useful way to do this and has worked wonders for me, both in a professional setting and in machine learning hackathons involving supervised learning and other learning techniques.
I look forward to your feedback and ideas in the comments section below.
A. Data reduction techniques simplify complex datasets by selecting important features, transforming data, handling missing data, and reducing dimensions. These techniques enhance efficiency and improve machine learning performance.
A. Data reduction techniques simplify complex datasets by selecting important features, transforming data, handling missing data, and reducing dimensions. These methods, including embedded techniques like Linear Discriminant Analysis (LDA), improve efficiency and enhance machine learning performance.
A. Yes, PCA is indeed a dimensionality reduction method. It simplifies data by transforming it into a lower-dimensional space while keeping crucial information intact. This dimension reduction saves storage space and makes handling and processing data easier, especially for machine learning algorithms.
A. Dimensionality reduction techniques is important because it helps in:
a. Reducing overfitting.
b. Improving model performance.
c. Decreasing computational time and resources.
d. Making data visualization easier.
A. Dimensionality reduction is crucial in machine learning because it simplifies models, improves computational efficiency, reduces storage space, and helps eliminate noise from irrelevant features. It also aids in visualizing data, leading to better insights and more effective models.
A. Dimensionality reduction is important in machine learning and predictive modeling because it enhances model performance by preventing overfitting, improves training speed, and reduces the complexity of models, making them easier to interpret and deploy.
A. To lift the curse of dimensionality, we use techniques like feature selection (choosing the most relevant features) and feature extraction (transforming data into lower dimensions), such as Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE).
A. Python is highly effective for machine learning and dimensionality reduction due to its robust ecosystem of libraries such as scikit-learn, TensorFlow, and Keras, which offer extensive data preprocessing, modeling, and evaluation tools.
A. The types of feature selection for dimensionality reduction include:
a.Filter Methods: Use statistical measures to score and select features (e.g., correlation, chi-square).
b.Wrapper Methods: Use model performance to evaluate feature subsets (e.g., recursive feature elimination).
c.Embedded Methods: Perform feature selection during the model training (e.g., Lasso regression).
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
Its really a wonderful article . Appreciate for sharing.
Hi, Thanks for the feedback. Glad you liked the article.
Wonder-full,In one word "awesome"..can you please publish this with R code please
Hi Sayam, Glad you liked the article. I am not very familiar with R however will let you know if i find any resource containing R codes.
Excellent article, have gone through it once and clarified my long standing doubts. Keeping it under favorites for another detailed read.
Thank you Om!
Hi Regarding removing features with low variance The magnitude of variance is dependent on the scale of the values, right? If that's the case, how can one compare variance between two features having a different order of magnitude (when choosing which one to drop) ?
Hi Clyton, Great point. We can normalize the variables and bring them down to same scale. Then there will not be any bias based on the scale of the values.
muy bueno el posting! graicas
Hi Jimmy, gracias por apreciar!
Thanks for you article, really wonderfull ! I would like to know in part 3.3, why we need to delete a variable if the correlation is superior to 0.6 ?
Hi Laurent, If the independent variables are correlated to each other, they will have almost same effect on the target variable. So we can only keep one variable out of all correlated variables which will help us to reduce the dimensions of our dataset and will also not affect the performance of our model.
Thanks a lot for writing such a comprehensive review!
Thank you Sahar
On random forest technique, you do: indices = np.argsort(importances[0:9]) # top 10 features but that takes only the first 9 elements of the model.feature_importances_ and sort those 9 elements. Shouldn't you take all elements, sort them and then take the 9 more important ? This way: indices = np.argsort(importance)[-9:]
Hi Walter, Great Point. Thanks for pointing it out. I have updated the same in the article.
Great article Pulkit!! An easy read :)
Thank you Joy
Congratulations and thanks for your generosity.
Thank you Renato
For dropping variables based on variance - Should it also not depend on the variable's scale? for eg a variable with mean 10 having sd 0.5 is not as much information as another variable having mean 0.1 and sd 0.5.
Hi Rohini, We can normalize the variables and bring them to same scale and then calculate their variance. Great Point!!!
Excellent article on DR. I've been a data analyst for 2 years and now I'm diving into Machine Learning and Data Science and this website has been a tremendous resource for me - Thank you!
Thanks for your great article As far as I know, PCA and feature selection both reduce dimension and their aims is to make machine learning algorithm run more smoothly. From my deduction, we should apply PCA after we do feature selection to get the best input for our algorithm(because we reduce the dimension of the data twice). However, I also google about this problem, but i got many different opinions. Some of them think the same as me, "first feature selection, then PCA", but some of them think it will not efficient if we try to apply both of these technique simultaneously. Think in the same way, should we try to combine all the technique you describe in the graph in the "sumary" section to achieve the best dimension reduction ? If not, when we apply each of the technique ? Thanks and best regards.
Hi Tony, PCA is one of the most widely used techniques for dealing with linear data. It divides the data into a set of components which try to explain as much variance as possible. PCA is a dimensional reduction technique and it performs well on the original data as well. So, there is no need to do feature selection before applying PCA. I have also explained in the summary section as to where you can use which dimensionality reduction technique. Please go through it and let me know if you need any clarifications.
Neatly written with superb content. Enjoyed a lot & really appreciate.
Hi, Where can I find the train data ( specially the dataset for factor analysis which seemed to be some png files) Thanks for your concise post. Thanks,
Hi Ray, The link to download the data is given in article at the end of section 3.6 You can download the data from this link as well. You first have to register for this problem and then download the data from Data section.
Hi, I am facing a problem when doing the missing value imputation. After replacing the NAs with median using fillna function the datatype of the Item_weight and Outlet_size changes to Object and there after when I do the variance, those variables are excluded because of being objec type. I tried to convert it back using astype function but its not working. Can someone help me? I am new to Python.
Hi Manjusha, The data type of Outlet_size variable is Object as it contains categories. So, you don't have to consider it while calculating the variance. Item_weight is float and after imputing the missing values with median, its data type will not change. Please share the code that you are using so that I can help you in a better way.
hello, can you please tell me what is the source for all this information?
Hi Emre, I have collected information for this article from different sources and research papers.
Hi Pulkit, Thanks for the response. I used the same code thats there in the article train['Item_Weight'].fillna(train["Item_Weight"].median,inplace = True) Will it be issue with the version of Python I am using? I am using Python 3.6
Hi Manjusha, Use the below given code to impute the missing values in item_weight column: train['Item_Weight'].fillna(train['Item_Weight'].median(),inplace = True)
Hi Pulkit, good work on the comprehensive article on dimensionality reduction techniques. as per my understanding, the techniques mentioned in the article seem to be only good for regression type of problems. i am wondering if they work for clustering and classification problems. In unsupervised clustering, there is no target variable to treat as a reference; in those cases what techniques work better? And how about classification. how about linear discriminant analysis (LDA) technique which is not mentioned in your article?
Nice article Pulkit. Had a question: Somewhere else, "dimreductions" is categorized as ML learning method. But reading your article, it seems more like 'preparation' of data for other supervised/unsupervised algorithms to actually learn. Can you please throw some light on this understanding of mine? Thanks again!
Brilliant article Really loved the in-depth analysis of each Dimensionality reduction technique along with graphs for each technique as well. Helped me a lot to understand what to use when. Thanks a lot!
Glad you liked it!
[…] The Ultimate Guide to 12 Dimensionality Reduction Techniques (with Python codes) – from Pulkit Sharma published in 2018 […]
Hi. Thanks for this amazing article! I have some question about line "train = pd.read_csv("train.csv")". Where I can find this 'train.csv'? I can't understand what is 'train.csv' in this case
Please i can see most of you are well informed on the topic, I am a final year in ghana and i have been asked to research on Dimensional Data reduction and bring up my proposal i literally dont know what my lecturer meant or whats me to do , Can some one help me
Thanks, great article, great job!
Wow, this is an amazing post! I've been struggling to wrap my head around dimensionality reduction techniques and this comprehensive list really helped me understand the differences between each method. The explanations and examples were clear and concise, making it easy to follow along. One question though - can you recommend a good dimensionality reduction technique for text data? I'm working on a project that involves natural language processing and I'm not sure which technique to use. Thanks again for this helpful post!