Descriptive statistics can give you great insight into the data and its nature. With a single line of code, you can generate several summaries of the input data. The describe() function is used for this purpose; it lists 8 statistical properties of each numeric data attribute:
- Count
- Mean
- Standard deviation
- Min value
- 25th percentile
- 50th percentile
- 75th percentile
- Max value
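To see which statistics describe() reports, here is a minimal sketch on a small made-up dataframe. The variable name `toy` and its columns and values are illustrative assumptions, not part of the airquality data.

```python
import pandas as pd

# Hypothetical toy data, used only to show the shape of the output
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10.0, 20.0, 30.0, 40.0, 50.0]})

# describe() summarises every numeric column
summary = toy.describe()

# The index of the result names each statistic that was computed
stat_names = list(summary.index)
print(stat_names)
print(summary)
```

Each row of the result is one statistic and each column is one numeric attribute, so a single value can be looked up with `summary.loc["mean", "a"]`.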
Descriptive Statistics of Airquality dataset
In this section, we are going to use the describe() function to get the statistical insights of the input airquality dataset.
# Import the required library
import pandas as pd

# Read the CSV file into a dataframe
df = pd.read_csv('airquality.csv')

# View the data
df

Well, we have the input dataframe now. Let’s draw some statistical inferences from the data.
# Give the statistical summary of the data
df.describe()
As you can see, there is a lot to understand and analyze. The describe() function returns several important summaries of the data. From this output, you can make the following inferences about the data points in the Ozone column.
- The total count of the data points is 153.
- The mean of the Ozone data is about 42.
- The min value is 1 and the max value is 168 for the Ozone data points.
- The 25th percentile is 18, so a quarter of the values are at or below 18.
- The 50th percentile (the median) is 31.
- The 75th percentile is about 63.
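The individual figures in such a list can also be read programmatically instead of from the printed table. The sketch below uses a small hypothetical series named `values` (an assumption for illustration, not the Ozone column) and includes a quick ordering check, which is a handy way to catch a misread percentile.

```python
import pandas as pd

# Hypothetical readings, standing in for one numeric column
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], name="reading")

# describe() on a Series returns a Series indexed by statistic name
stats = values.describe()

count = stats["count"]
mean = stats["mean"]
q1, median, q3 = stats["25%"], stats["50%"], stats["75%"]

# Sanity check: the percentiles must lie in order between min and max
ordered = stats["min"] <= q1 <= median <= q3 <= stats["max"]
print(count, mean, q1, median, q3, ordered)
```

With the real data, the same lookups would be applied to `df["Ozone"].describe()`, assuming the column is named as in the output above.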
That’s fantastic. With a single line of code, you get a complete outline of the data using Python.
Skewness of the data using skew() function
Skew describes how far a distribution departs from a symmetric, Gaussian-like shape by leaning toward one direction or the other. Many machine learning algorithms assume a Gaussian distribution.
Analyzing the skew is an important part of the data preparation process, and it can eventually contribute to model accuracy.
Let’s see how we can find that using skew().
# Return the skewness of each column
df.skew()

Unnamed: 0    0.000000
Ozone         1.241796
Solar.R      -0.428045
Wind          0.347818
Temp         -0.377884
Month        -0.002391
Day           0.002652
dtype: float64
Things to consider –
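- A positive value indicates right skew (a longer tail on the right).
- A negative value indicates left skew (a longer tail on the left).
- Values closer to zero indicate a less skewed, more symmetric distribution.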
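The sign convention above can be checked on a few tiny samples. The three series below are hypothetical values chosen so that the tail direction is obvious; they are not taken from the airquality dataset.

```python
import pandas as pd

# Hypothetical samples chosen so the tail direction is obvious
right_tailed = pd.Series([1, 1, 2, 2, 3, 10])  # one large value stretches the right tail
left_tailed = pd.Series([1, 8, 9, 9, 10, 10])  # mirror image: one small value stretches the left tail
symmetric = pd.Series([1, 2, 3, 4, 5])         # evenly spread around the mean

right_skew = right_tailed.skew()
left_skew = left_tailed.skew()
sym_skew = symmetric.skew()

print(right_skew, left_skew, sym_skew)
```

The first value comes out positive, the second negative, and the third is zero up to floating-point error, which matches the three rules listed above.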
As you deal with large datasets, it’s hard to peek into the data and generate insights by eye. With the describe() function, you can easily get the statistical properties of the data.
But getting these numbers is not enough unless you take a moment to understand the behavior and distribution of the data.
Ask yourself as many questions as you can and try to answer them. Write down your observations. Think about the reasons behind each insight; this will gradually reveal more hidden patterns and behaviours, which will be helpful in later steps.