We have all become so engrossed in deep learning and machine learning algorithms, choosing between linear regression, logistic regression, or some other algorithm, that we have forgotten the basics of feature selection.
Feature Selection is the process of selecting the features which are relevant to a machine learning model. It means that you select only those attributes that have a significant effect on the model’s output.
Consider going to the departmental store to buy grocery items. A product carries a lot of information: product, category, expiry date, MRP, ingredients, and manufacturing details. All of these are features of the product. Normally, you check the brand, MRP, and expiry date before buying a product, while the ingredients and manufacturing details are not your concern. Therefore, brand, MRP, and expiry date are relevant features, and the ingredients and manufacturing details are irrelevant. This is how feature selection works.
In the real world, a dataset can have thousands of features, and some of them may be redundant, some correlated with each other, and some irrelevant to the model. If you use all the features in this scenario, training will take a lot of time and model accuracy may suffer. Therefore, feature selection becomes important in model building. There are many ways of doing feature selection, such as recursive feature elimination, genetic algorithms, and decision trees. However, I will show you the most basic and manual method used by data scientists: filtering with statistical tests.
Now that you have a basic understanding of feature selection, we will see how to implement various statistical tests on the data to select important features.
This article was published as a part of the Data Science Blogathon
Before going into the types of statistical tests and their implementation, it is necessary to understand the meanings of some terminologies.
You may refer to statisticswho.com for more information regarding these terminologies.
A statistical test is a way to decide whether the data support the null hypothesis or the alternate hypothesis. It tells you whether a sample and a population, or two or more samples, differ significantly. You can base the comparison on descriptive statistics such as the mean, median, mode, range, or standard deviation; however, we generally use the mean. Each test produces a test statistic, from which a p-value is computed. The p-value is then compared with a chosen significance level (commonly 0.05). If the p-value is greater than the significance level, you accept (more precisely, fail to reject) the null hypothesis; otherwise you reject it.
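In practice, the decision comes down to comparing the p-value returned by a test with a significance level chosen beforehand. A minimal sketch of that rule follows; the function name `decide` and the default level of 0.05 are assumptions made for this illustration, not part of any library.

```python
def decide(p_value, alpha=0.05):
    """Return the decision for a hypothesis test given its p-value.

    alpha is the significance level; 0.05 is a common convention,
    not a value fixed by any particular test.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(decide(0.03))
print(decide(0.20))
```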
The procedure for implementing each statistical test will be as follows:
Now that you have an understanding of feature selection and statistical tests, we can move on to the implementation of various statistical tests along with their meaning. Before that, I will show you the dataset that will be used to perform all the tests.
The dataset which I will be using is a loan prediction dataset which is taken from the Analytics Vidhya contest. You can also participate in the contest and download the dataset here.
First I imported all necessary Python modules and you can check out the data points here.
There are many features in the dataset such as Gender, Dependents, Education, Applicant Income, Loan Amount, and Credit History. We will use these features and check whether one feature affects another using several tests, i.e., Z-Test, Correlation test, ANOVA test, and Chi-square test.
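A minimal sketch of the setup, assuming the contest's training file has been downloaded as a CSV. The helper name `load_loan_data` and the file name `loan_train.csv` in the usage note are placeholders, not names taken from the contest page.

```python
from math import sqrt

import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import norm
# import seaborn as sns  # only needed for the chart drawn later


def load_loan_data(path_or_buffer):
    """Read the loan data and count the missing values in each column."""
    df = pd.read_csv(path_or_buffer)
    missing = df.isnull().sum()
    return df, missing
```

With the file in the working directory, `df, missing = load_loan_data("loan_train.csv")` would give the data frame used in the rest of the article, and `missing` shows which columns need filling before a test is applied.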
A Z-test is used to compare the means of two given samples and infer whether they are from the same distribution or not. We do not use a Z-test when the sample size is less than 30; a T-test is preferred in such cases.
A Z-Test may be a one-sample Z test or a two-sample Z test.
The one-sample Z-test determines whether the sample mean is statistically different from a known or hypothesized population mean. The two-sample Z-test compares the means of two independent samples.
We will implement a two-sample Z test.
Z statistic is denoted by
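For the two-sample case, the statistic that the twoSampZ function later in this section computes can be written as below. The symbols are chosen here for illustration: \bar{x}_1 and \bar{x}_2 are the sample means, s_1 and s_2 the sample standard deviations, n_1 and n_2 the sample sizes, and \Delta the hypothesized difference between the means (0 under the null hypothesis used in this article).

```latex
z = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```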
Please note that the two-sample Z-test we implement needs one categorical variable with exactly two categories and one continuous variable.
Here we will be using the Gender categorical variable and the ApplicantIncome continuous variable. Gender has 2 groups: male and female. Therefore the hypothesis will be:
Null Hypothesis: There is no significant difference between the mean Income of males and females.
Alternate Hypothesis: There is a significant difference between the mean Income of males and females.
M_mean=df.loc[df['Gender']=='Male','ApplicantIncome'].mean()
F_mean=df.loc[df['Gender']=='Female','ApplicantIncome'].mean()
M_std=df.loc[df['Gender']=='Male','ApplicantIncome'].std()
F_std=df.loc[df['Gender']=='Female','ApplicantIncome'].std()
no_of_M=df.loc[df['Gender']=='Male','ApplicantIncome'].count()
no_of_F=df.loc[df['Gender']=='Female','ApplicantIncome'].count()
The above code calculates the mean applicant income of males and of females, their standard deviations, and the number of male and female samples.
The twoSampZ function calculates the z statistic and p-value by passing in the parameters calculated above.
from math import sqrt
from scipy.stats import norm

def twoSampZ(X1, X2, mudiff, sd1, sd2, n1, n2):
    # standard error of the difference between the two sample means
    pooledSE = sqrt(sd1**2/n1 + sd2**2/n2)
    z = ((X1 - X2) - mudiff)/pooledSE
    pval = 2*(1 - norm.cdf(abs(z)))
    return round(z,3), pval

z,p= twoSampZ(M_mean,F_mean,0,M_std,F_std,no_of_M,no_of_F)
print('Z =', z)
print('p =', p)
Z = 1.828
p = 0.06759726635832197
if p<0.05:
    print("we reject null hypothesis")
else:
    print("we accept null hypothesis")
Since the p-value (about 0.068) is greater than 0.05, we accept the null hypothesis. Therefore, we conclude that there is no significant difference between the income of males and females.
A t-test is also used to compare the means of two given samples, like the Z-test. However, it is used when the sample size is less than 30. It assumes that the sample is normally distributed. It can also be one-sample or two-sample. For a one-sample test, the degrees of freedom are n-1, where n is the number of observations. In linear regression, the T-test is commonly used to determine the significance of individual coefficients (i.e., slopes) in the regression model.
It is denoted by
Besides the simple T-test, there is also a paired T-test which is used when the observations in one group are paired or matched with the observations in the other group.
It is implemented in the same way as the Z-test; the only condition is that the sample size should be less than 30. I have shown you the Z-test implementation. Now you can try your hand at the two-sample T-test.
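As a starting point for that exercise, here is a minimal sketch of a two-sample t-test using `scipy.stats.ttest_ind`. The income figures are invented for illustration and are not taken from the loan dataset. Passing `equal_var=False` selects Welch's variant, which does not assume equal variances in the two groups; that choice is an assumption of this sketch.

```python
import numpy as np
from scipy import stats

# Hypothetical small samples (fewer than 30 observations each)
group_a = np.array([5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036])
group_b = np.array([4006, 3510, 4887, 2600, 7660, 3365, 3717, 9560])

# Welch's t-test: compares the two means without assuming equal variances
result = stats.ttest_ind(group_a, group_b, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

print("t =", round(float(t_stat), 3), "p =", round(float(p_value), 3))
if p_value < 0.05:
    print("we reject null hypothesis")
else:
    print("we accept null hypothesis")
```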
A correlation test is a metric to evaluate the extent to which variables are associated with one another.
Please note that the variables must be continuous to apply the correlation test.
There are several measures used in correlation tests, e.g., covariance, the Pearson correlation coefficient, and the Spearman rank correlation coefficient.
We will use the Pearson correlation coefficient since it does not depend on the scale or units of the variables.
It is used to measure the linear correlation between 2 variables. It is denoted by:
Its values lie between -1 and 1.
If the value of r is 0, it means there is no relationship between variables X and Y.
If the value of r is between 0 and 1, it means there is a positive relation between X and Y, and its strength increases as r goes from 0 to 1. Positive relation means if the value of X increases, the value of Y also increases.
If the value of r is between -1 and 0, it means there is a negative relation between X and Y, and its strength decreases as r goes from -1 to 0. Negative relation means if the value of X increases, the value of Y decreases.
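To make the definition concrete, the coefficient can be computed directly from deviations about the mean and compared with NumPy's built-in result. This is a minimal sketch on made-up numbers; the function name `pearson_r` is introduced here for illustration only.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises in step with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises
r_manual = pearson_r(x, [2, 4, 5, 4, 5])
r_numpy = np.corrcoef(x, [2, 4, 5, 4, 5])[0, 1]

print(r_up, r_down, r_manual, r_numpy)
```

The first two cases sit at the two ends of the range, and the last two values should agree, since `np.corrcoef` computes the same quantity.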
Here we will be using two continuous variables or features: Loan Amount and Applicant Income. We will conclude whether there is a linear relation between Loan Amount and Applicant Income from the value of the Pearson correlation coefficient, and also draw a chart of the two.
There are some missing values in the LoanAmount column. First, we filled them with the mean value. Then we calculated the correlation coefficient.
df['LoanAmount']=df['LoanAmount'].fillna(df['LoanAmount'].mean())
pcc = np.corrcoef(df.ApplicantIncome, df.LoanAmount)
print(pcc)
[[1. 0.56562046]
[0.56562046 1. ]]
The values on the diagonal are the correlation of each feature with itself. The off-diagonal value of 0.56 indicates a moderate positive correlation between the two features.
We can also draw the chart as follows:
sns.lineplot(data=df,x='LoanAmount',y='ApplicantIncome')
ANOVA stands for Analysis of Variance. As the name suggests, it uses variance as its parameter to compare multiple independent groups. ANOVA can be one-way ANOVA or two-way ANOVA. One-way ANOVA is applied when there are three or more independent groups of a variable. We will implement the same in Python.
F-Test can be calculated by:
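The F statistic is the ratio of the variance between the group means to the variance within the groups. The minimal sketch below computes it by hand on invented numbers and checks the result against `scipy.stats.f_oneway`. The function name `f_statistic` and the three sample groups are assumptions made for this example.

```python
import numpy as np
from scipy import stats


def f_statistic(*groups):
    """One-way ANOVA F: mean square between groups / mean square within groups."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


g1 = [23, 25, 21, 27, 24]
g2 = [30, 28, 33, 31, 29]
g3 = [22, 20, 26, 24, 23]

F_manual = f_statistic(g1, g2, g3)
F_scipy, p_scipy = stats.f_oneway(g1, g2, g3)
print(F_manual, F_scipy, p_scipy)
```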
Here we will be using the Dependents categorical variable and the ApplicantIncome continuous variable. Dependents has 4 groups: 0, 1, 2, and 3+. Therefore the hypothesis will be:
Null Hypothesis: There is no significant difference between the mean Income among different groups of dependents.
Alternate Hypothesis: There is a significant difference between the mean Income among different groups of dependents.
First, we handled the missing values in the Dependents feature.
df['Dependents'].isnull().sum()
df['Dependents']=df['Dependents'].fillna('0')
After this, we created a data frame with the features Dependents and ApplicantIncome. Then with the help of scipy.stats library we calculated the F statistic and p-value.
from scipy import stats

df_anova = df[['ApplicantIncome','Dependents']]
grps = pd.unique(df.Dependents.values)
# one income series per Dependents group: '0', '1', '2', '3+'
d_data = {grp:df_anova['ApplicantIncome'][df_anova.Dependents == grp] for grp in grps}
F, p = stats.f_oneway(d_data['0'], d_data['1'], d_data['2'], d_data['3+'])
print('F ={},p={}'.format(F,p))
F =5.955112389949444,p=0.0005260114222572804
if p<0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")
Since the p-value is less than 0.05, we reject the null hypothesis. Therefore, we conclude that there is a significant difference between the income of several groups of Dependents.
This test is applied when you have two categorical variables from a population. It is used to determine whether there is a significant association or relationship between the two variables.
There are 2 types of chi-square tests: the chi-square goodness-of-fit test and the chi-square test for independence. We will implement the latter.
The degrees of freedom in the chi-square test are calculated as (n-1)*(m-1), where n and m are the numbers of rows and columns of the contingency table respectively.
It is denoted by:
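The statistic computed by the code later in this section sums, over every cell of the contingency table, the squared gap between the observed and expected counts relative to the expected count. In the notation below (chosen here for illustration), O_{ij} is the observed count in row i and column j, E_{ij} is the count expected if the two variables were independent, and N is the total number of observations.

```latex
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}
```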
We will be using two categorical features Gender and Loan Status and find whether there is an association between them using the chi-square test.
Null Hypothesis: There is no significant association between Gender and Loan Status features.
Alternate Hypothesis: There is a significant association between Gender and Loan Status features.
First, we retrieve the Gender and Loan_Status columns and form a matrix, which is also called a contingency table.
dataset_table=pd.crosstab(df['Gender'],df['Loan_Status'])
dataset_table
Loan_Status   N    Y
Gender
Female       37   75
Male         33  339
Then, we calculate observed and expected values using the above table.
observed=dataset_table.values
val2=stats.chi2_contingency(dataset_table)
expected=val2[3]
Then we calculate the chi-square statistic and p-value using the following code:
from scipy.stats import chi2
alpha=0.05
# degrees of freedom: (rows-1)*(columns-1)
ddof=(observed.shape[0]-1)*(observed.shape[1]-1)
chi_square=sum([(o-e)**2./e for o,e in zip(observed,expected)])
chi_square_statistic=chi_square[0]+chi_square[1]
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
print("chi-square statistic:-",chi_square_statistic)
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)
print('p-value:',p_value)
chi-square statistic:- 0.23697508750826923
Significance level: 0.05
Degree of Freedom: 1
p-value: 0.6263994534115932
if p_value<=alpha:
    print("Reject Null Hypothesis")
else:
    print("Accept Null Hypothesis")
Since the p-value is greater than 0.05, we accept the null hypothesis. We conclude that there is no significant association between the two features.
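The manual steps above can be cross-checked against `scipy.stats.chi2_contingency`, which returns the statistic, p-value, degrees of freedom, and expected counts in one call. Below is a minimal sketch on an invented 2x2 table; the counts are not from the loan dataset. Passing `correction=False` turns off Yates' continuity correction, which SciPy applies by default when there is one degree of freedom, so that the result matches the plain formula.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows are two groups, columns are N / Y
table = np.array([[20, 30],
                  [25, 45]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table, correction=False)

# Same statistic from the plain formula: sum of (observed - expected)^2 / expected
manual = ((table - expected) ** 2 / expected).sum()

print("chi-square statistic:", chi2_stat)
print("manual statistic:", manual)
print("degrees of freedom:", dof)
print("p-value:", p_value)
```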
So in this tutorial, we have discussed various statistical tests and their importance in data analysis and feature selection. We have seen the application of the Z-test, T-test, correlation test, ANOVA test, and Chi-square test, along with their implementation in Python. Besides these, there are various other statistical tests used by data scientists and statisticians. I encourage you to share some in the comments below!
Q1. When to use T-test over Z-test?
A. Use a z-test for large samples (n > 30) with known population standard deviation, and a t-test for small samples (n < 30) or unknown population standard deviation. The t-test is also suitable for large samples with unknown population standard deviation.
Q2. What is the difference between parametric and non-parametric tests?
A. Parametric tests make assumptions about the distribution of the data, such as that it follows a Gaussian distribution, while non-parametric tests do not rely on specific distributional assumptions. Parametric tests typically require continuous data and are more powerful when their assumptions are met, while non-parametric tests are more robust but less powerful, and are suitable for ordinal or non-normally distributed data.
Q3. What is a classifier in data analysis?
A. A classifier in data analysis is a model or algorithm used to categorize data points into predefined classes or categories based on their features or attributes. It’s commonly used in machine learning for tasks like text categorization or image recognition.
Q4. What do you mean by statistical hypothesis testing?
A. Statistical hypothesis testing is a method used to make inferences about a population based on sample data. It involves formulating null and alternative hypotheses, selecting a significance level, and using statistical tests to determine the likelihood of observing the sample data if the null hypothesis were true.
Q5. What is the test statistic used in McNemar’s test?
A. The test statistic used in McNemar’s test is typically denoted as χ² (chi-square). It assesses the difference between the discordant pairs in a matched-pairs design, comparing the frequencies of disagreement between two dependent categorical variables.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.