Descriptive statistics can give you great insight into the data and its nature. With a single line of code, you can generate several insights about the input data. The describe() function is used for this purpose; it lists 8 statistical properties of each numeric data attribute:

  • Count
  • Mean
  • Standard deviation
  • Min value
  • 25th percentile
  • 50th percentile
  • 75th percentile
  • Max value
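
To see where these eight numbers come from, here is a minimal sketch on a small made-up Series (the values are invented for illustration and are not taken from the airquality dataset). It computes each property with its own pandas method and compares the result with describe().

```python
import pandas as pd

# Hypothetical sample values, invented for illustration only
s = pd.Series([2, 4, 4, 5, 7, 9, 11, 14], name="sample")

# Each row of describe() computed by hand with the matching pandas method
manual = pd.Series({
    "count": float(s.count()),   # number of non-missing values
    "mean": s.mean(),
    "std": s.std(),              # sample standard deviation (ddof=1)
    "min": float(s.min()),
    "25%": s.quantile(0.25),
    "50%": s.quantile(0.50),     # the median
    "75%": s.quantile(0.75),
    "max": float(s.max()),
}, name="sample")

summary = s.describe()
print(summary)

# True when the hand-computed values agree with describe()
print((manual - summary).abs().max() < 1e-9)
```

The row labels of the describe() output (count, mean, std, min, 25%, 50%, 75%, max) map one-to-one onto the list above.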

Descriptive Statistics of Airquality dataset

In this section, we are going to use the describe() function to get the statistical insights of the input airquality dataset.

#Imports required library 
import pandas as pd
#Reads the dataframe
df = pd.read_csv('airquality.csv')
#View data
df

Well, we have the input dataframe now. Let’s draw some statistical inferences from the data.

#Gives the statistical insights over the data
df.describe()

As you can see, there is a lot to understand and analyze. The describe() function has returned tons of important insights into the data. With this, you can make the following inferences about the data points in the Ozone column.

  • Total count of the data points is 153.
  • The mean of the Ozone data is about 42.
  • The min value is 1 and max value is 168 for Ozone data points.
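  • The 25th percentile is 18, so a quarter of the Ozone values are at or below 18.
  • The 50th percentile (the median) is about 31.
  • The 75th percentile is about 63, well below the max value of 168.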

That’s fantastic. With a single line of code, you got a complete outline of the data using Python.
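
If you only need one or two of these numbers, you can pull them out of the describe() result directly, because it is itself a dataframe. The sketch below uses a tiny made-up frame (the column names mirror the airquality data, but the values are invented) and also shows that count ignores missing values.

```python
import numpy as np
import pandas as pd

# Invented values for illustration; not the real airquality data
demo = pd.DataFrame({
    "Ozone": [41.0, 36.0, np.nan, 18.0, 28.0],
    "Wind": [7.4, 8.0, 12.6, 11.5, 14.3],
})

stats = demo.describe()

# Rows of the result are statistics, columns are the data attributes
ozone_mean = stats.loc["mean", "Ozone"]
ozone_median = stats.loc["50%", "Ozone"]

# count skips NaN, so it can be smaller than the number of rows
ozone_count = stats.loc["count", "Ozone"]

print(ozone_mean, ozone_median, ozone_count, len(demo))
```

Comparing the count row with len(df) is a quick way to spot columns that contain missing values.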

Skewness of the data using skew() function

Skewness describes how far a distribution leans to one side or the other compared with a symmetric Gaussian (normal) distribution. Many machine learning algorithms assume a Gaussian distribution.

Analyzing the skew is an important part of the data preparation process, and handling it well can contribute to the model accuracy.

Let’s see how we can find that using skew().

#Returns the skewness of the data
df.skew()
Unnamed: 0    0.000000
Ozone         1.241796
Solar.R      -0.428045
Wind          0.347818
Temp         -0.377884
Month        -0.002391
Day           0.002652
dtype: float64

Things to consider –

  • A positive value means right skew (a longer tail on the right).
  • A negative value means left skew (a longer tail on the left).
  • Values closer to zero are less skewed.
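
The three rules above can be checked on small made-up Series (the values are invented for illustration only). This is a rough sketch, not part of the airquality analysis.

```python
import pandas as pd

# Invented values for illustration only
right_tail = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])         # one large value stretches the right tail
left_tail = pd.Series([-20, -4, -3, -3, -3, -2, -2, -1])  # mirror image of right_tail
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])              # evenly spread around the mean

print(right_tail.skew())  # positive: right skew
print(left_tail.skew())   # negative: left skew
print(symmetric.skew())   # zero (up to rounding): no skew
```

In the airquality output, Ozone has the largest positive value, so by the first rule it is the most right-skewed column.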

Final words

When you deal with large datasets, it’s hard to peek into the data and generate insights. With the describe() function, you can easily get the statistical properties of the data.

But getting the insights is not enough unless you take a moment to understand the behavior and distribution of the data.

Ask yourself as many questions as you can and try to answer them. Write down your observations. Think about the reasons behind those insights; this will gradually reveal more hidden patterns and behaviours that will be helpful in later stages of the process.
