This article explores violin plots, a powerful visualization tool that combines box plots with density plots. It explains how these plots can reveal patterns in data, making them useful for data scientists and machine learning practitioners. The guide provides insights and practical techniques to use violin plots, enabling informed decision-making and confident communication of complex data stories. It also includes hands-on Python examples and comparisons.
This article was published as a part of the Data Science Blogathon.
As mentioned above, violin plots are a cool way to show data. They mix two other types of plots: box plots and density plots. The key concept behind a violin plot is kernel density estimation (KDE), a non-parametric way to estimate the probability density function (PDF) of a random variable. In violin plots, KDE smooths out the data points to provide a continuous representation of the data distribution.
KDE calculation involves the following key concepts:
A kernel function smooths out the data by assigning a weight to each data point based on its distance from a target point. The farther the point, the lower its weight. Gaussian kernels are the usual choice; however, other kernels, such as linear and Epanechnikov, can be used as needed.
Bandwidth determines the width of the kernel function and so controls the smoothness of the KDE. A large bandwidth oversmooths the data and hides real structure (underfitting), while a small bandwidth overfits the data, producing spurious peaks and valleys.
To compute the KDE, place a kernel on each data point and sum them to produce the overall density estimate.
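The three ideas above can be sketched by hand with NumPy. This is a minimal illustration that assumes a Gaussian kernel; the helper names `gaussian_kernel`, `kde`, and `count_peaks`, the toy bimodal data, and the two bandwidth values are all hypothetical choices made for the example, not part of any library.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(samples, grid, bandwidth):
    """Place one scaled kernel on each sample and average them on the grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # u[i, j] = scaled distance from grid point i to sample j
    u = (grid[:, None] - samples[None, :]) / bandwidth
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * bandwidth)

def count_peaks(density):
    """Count strict local maxima of a sampled curve."""
    mid = density[1:-1]
    return int(np.sum((mid > density[:-2]) & (mid > density[2:])))

rng = np.random.default_rng(0)
# Toy bimodal data: two clusters centred at -2 and 2 (illustrative only)
samples = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])
grid = np.linspace(-6, 6, 601)

narrow = kde(samples, grid, bandwidth=0.1)  # small bandwidth: wiggly curve
wide = kde(samples, grid, bandwidth=3.0)    # large bandwidth: oversmoothed

print("peaks with bandwidth 0.1:", count_peaks(narrow))
print("peaks with bandwidth 3.0:", count_peaks(wide))
```

With the small bandwidth the curve keeps at least one peak per cluster, usually with extra wiggles on top. With the large bandwidth the two clusters merge into a single bump, which is the underfitting case.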
Mathematically,
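A common textbook form of the estimator is given here. The notation is an assumption and may differ from the author's own:

```latex
\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right)
```

Here $x_1, \dots, x_n$ are the data points, $K$ is the kernel function, and $h > 0$ is the bandwidth.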
In violin plots, the KDE is mirrored and placed on both sides of the box plot, creating a violin-like shape. The three key components of violin plots are:
Placing these components together provides insights into the underlying shape of the data distribution, including multi-modality and outliers. Violin plots are especially helpful when you have complex data distributions spread across many groups or categories. They help identify patterns, anomalies, and potential areas of interest within the data. However, due to their complexity, they might be less intuitive for those unfamiliar with data visualization.
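To make the mirroring idea concrete, here is a rough sketch that builds a violin by hand with NumPy and Matplotlib. It assumes a Gaussian kernel and a fixed bandwidth of 0.4; the helper `gaussian_kde_1d`, the toy bimodal data, and the styling are illustrative choices, and Seaborn's own implementation may differ in its details.

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian_kde_1d(samples, grid, bandwidth):
    """Gaussian KDE evaluated on a grid (hypothetical helper for this sketch)."""
    u = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * u**2).sum(axis=1) / norm

rng = np.random.default_rng(1)
# Toy bimodal data: a box plot alone would hide the two clusters
values = np.concatenate([rng.normal(-2, 0.6, 150), rng.normal(2, 0.6, 150)])

grid = np.linspace(values.min() - 1, values.max() + 1, 300)
density = gaussian_kde_1d(values, grid, bandwidth=0.4)
# Rescale so the widest part of the violin spans 0.8 units in total
half_width = 0.4 * density / density.max()

q1, median, q3 = np.percentile(values, [25, 50, 75])

fig, ax = plt.subplots(figsize=(4, 6))
# Mirror the density on both sides of the centre line x = 0
ax.fill_betweenx(grid, -half_width, half_width, alpha=0.4)
# Minimal box-plot elements: interquartile range bar and median marker
ax.vlines(0, q1, q3, linewidth=6)
ax.plot(0, median, "o", color="white", zorder=3)
ax.set_xticks([])
ax.set_ylabel("Value")
ax.set_title("Hand-built violin")
```

Call `plt.show()` to display the figure. The two bulges in the outline come from the two clusters, while the interquartile bar on its own would give no hint of them.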
Violin plots are applicable in many cases, of which major ones are listed below:
Seaborn is a popular Python library with a built-in function, violinplot, for making violin plots. It is simple to use and allows for adjusting plot aesthetics, colors, and styles. To understand the strengths of violin plots, let us compare them with box and density plots using the same dataset.
First, we need to install the necessary Python libraries for creating these plots. By setting up libraries like Seaborn and Matplotlib, you’ll have the tools required to generate and customize your visualizations.
The command for this will be:
!pip install seaborn matplotlib pandas numpy
print('Importing Libraries...',end='')
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
print('Done')
# Create a sample dataset
np.random.seed(11)
data = pd.DataFrame({
'Category': np.random.choice(['A', 'B', 'C'], size=100),
'Value': np.random.randn(100)
})
To compare the plots, we generate a synthetic dataset with 100 samples. The code above creates a dataframe named data using the pandas library. The dataframe has two columns, Category and Value. Category contains random choices from ‘A’, ‘B’, and ‘C’, while Value contains random numbers drawn from a standard normal distribution (mean = 0, standard deviation = 1). The code sets a seed for reproducibility, so it generates the same random numbers on every run.
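The effect of the seed can be checked with a short sketch. The helper `make_data` is a hypothetical wrapper around the dataset code above, and the second seed value (12) is an arbitrary choice for comparison.

```python
import numpy as np
import pandas as pd

def make_data(seed):
    """Rebuild the synthetic dataset with a given seed (hypothetical helper)."""
    np.random.seed(seed)
    return pd.DataFrame({
        'Category': np.random.choice(['A', 'B', 'C'], size=100),
        'Value': np.random.randn(100),
    })

first = make_data(11)
second = make_data(11)   # same seed as the article's dataset
other = make_data(12)    # different seed, chosen arbitrarily

print("Same seed gives identical data:", first.equals(second))
print("Different seed gives identical data:", first.equals(other))
```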
Before diving into the visualizations, we’ll summarize the dataset. This step provides an overview of the data, including basic statistics and distributions, setting the stage for effective visualization.
# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
print(data.head())
# Get a summary of the dataset
print("\nDataset Summary:")
print(data.describe(include='all'))
# Display the count of each category
print("\nCount of each category in 'Category' column:")
print(data['Category'].value_counts())
# Check for missing values in the dataset
print("\nMissing values in the dataset:")
print(data.isnull().sum())
It is always a good practice to see the contents of the dataset. The above code displays the first five rows of the dataset to preview the data. Next, the code displays the basic data statistics such as count, mean, standard deviation, minimum and maximum values, and quartiles. We also check for missing values in the dataset, if any.
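Since the plots that follow compare categories, it can also help to look at the same statistics per category. The sketch below rebuilds the dataset so that it runs on its own; the variable name `per_category` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Recreate the synthetic dataset with the same seed as above
np.random.seed(11)
data = pd.DataFrame({
    'Category': np.random.choice(['A', 'B', 'C'], size=100),
    'Value': np.random.randn(100),
})

# Count, mean, standard deviation, and quartiles of Value for each category
per_category = data.groupby('Category')['Value'].describe()
print(per_category)
```

The 25%, 50%, and 75% columns are the quartiles that a box plot, and the inner box of a violin plot, draws for each category.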
The code below generates a visualization comprising violin, box, and density plots for the synthetic dataset we have generated. The plots show the distribution of values across the three categories in the dataset: A, B, and C. In the violin and box plots, Category is plotted on the x-axis and Value on the y-axis. In the density plot, Value is plotted on the x-axis and the corresponding density on the y-axis. The resulting figure places the three plot types side by side, which makes them easy to compare.
# Create plots
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Violin plot
sns.violinplot(x='Category', y='Value', data=data, ax=axes[0])
axes[0].set_title('Violin Plot')
# Box plot
sns.boxplot(x='Category', y='Value', data=data, ax=axes[1])
axes[1].set_title('Box Plot')
# Density plot
for category in data['Category'].unique():
    sns.kdeplot(data[data['Category'] == category]['Value'], label=category, ax=axes[2])
axes[2].set_title('Density Plot')
axes[2].legend(title='Category')
plt.tight_layout()
plt.show()
Output:
Data processing and visualization sit at the core of machine learning work. This is where violin plots come in handy: they show how features are distributed, which supports feature engineering and selection. They combine the strengths of box and density plots, revealing a dataset’s patterns, shape, and outliers in a single view. They are also versatile, working with numerical data on its own, grouped by category, or split into time periods. In short, by revealing hidden structures and anomalies, violin plots help data scientists communicate complex information, make decisions, and generate hypotheses effectively.
A. Violin plots help with feature understanding by unraveling the underlying form of the data distribution and highlighting trends and outliers. They efficiently compare various feature distributions, which makes feature selection easier.
A. Violin plots can handle large datasets, but you need to carefully adjust the KDE bandwidth and ensure plot clarity for very large datasets.
A. The data clusters and modes are represented using multiple peaks in a violin plot. This suggests the presence of distinct subgroups within the data.
A. Parameters such as color, width, and KDE bandwidth customization are available in Seaborn and Matplotlib libraries.
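As a hedged sketch of such customization, the example below sets the color, width, and inner markings with Seaborn's violinplot. The specific values (`'skyblue'`, `0.6`, `'quartile'`, `cut=0`) are arbitrary illustrative choices. The bandwidth keyword is left out on purpose because its name depends on the Seaborn version: older releases use `bw`, while version 0.13 and later use `bw_method` and `bw_adjust`.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Recreate the synthetic dataset with the same seed as above
np.random.seed(11)
data = pd.DataFrame({
    'Category': np.random.choice(['A', 'B', 'C'], size=100),
    'Value': np.random.randn(100),
})

fig, ax = plt.subplots(figsize=(6, 5))
sns.violinplot(
    x='Category', y='Value', data=data,
    order=['A', 'B', 'C'],   # fix the category order on the x-axis
    color='skyblue',         # one color for every violin
    width=0.6,               # narrower violins than the default
    inner='quartile',        # draw quartile lines instead of a mini box plot
    cut=0,                   # stop the density at the observed min and max
    ax=ax,
)
ax.set_title('Customized violin plot')
```

Call `plt.show()` to display the figure.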