R is both a programming language and an environment for statistical computing and graphics. The R language is very similar to the S language and can be regarded as a different implementation of it. This article shows how you can use R for statistics.

The R language is highly extensible. It offers a wide range of statistical techniques that are helpful in analysis. The major strengths of R lie in its thorough documentation, good graphics features, and strong analytical functions.

The Mean, Median and Mode in R for Statistics

In this section, let’s see how we can find the mean, median and mode of the data in the R programming language.

Mean:

The mean is the sum of the values divided by the number of values. In R, use the mean() function to get the mean of the values.

#Creating a vector 
df <- c(34,56,25,67,54,56,78,98,34,16,58)

#Returns the mean of the values in vector 
mean(df)
52.36364
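To make the definition concrete, here is a minimal sketch that computes the mean by hand with sum() and length() and compares it with mean(). The variable names (values, manual_mean, builtin_mean) are illustrative and not part of the original example.

```r
# Same numbers as in the example above
values <- c(34, 56, 25, 67, 54, 56, 78, 98, 34, 16, 58)

# Mean by definition: sum of the values divided by how many there are
manual_mean <- sum(values) / length(values)

# Built-in equivalent
builtin_mean <- mean(values)

# TRUE when both approaches agree
isTRUE(all.equal(manual_mean, builtin_mean))
```

If the vector contains missing values, mean(values, na.rm = TRUE) skips them instead of returning NA.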

Median:

The median is the middle value of the data once it is sorted. R offers the median() function to calculate it.

#Creates a vector
df <- c(34,56,25,67,54,56,78,98,34,16,58)

#Returns the median of data
median(df)
56
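The same result can be reproduced by sorting the values and picking the middle one. This is only a sketch of the idea, with illustrative variable names. In practice median() is the function to use.

```r
values <- c(34, 56, 25, 67, 54, 56, 78, 98, 34, 16, 58)

sorted_values <- sort(values)
n <- length(sorted_values)

# Odd n: take the single middle value.
# Even n: take the average of the two middle values.
if (n %% 2 == 1) {
  manual_median <- sorted_values[(n + 1) / 2]
} else {
  manual_median <- mean(sorted_values[c(n / 2, n / 2 + 1)])
}

manual_median
```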

Mode:

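The mode is the value that occurs most often in the data. If every value occurs equally often, the data has no mode. Note that base R's mode() function returns the storage type of an object, not the statistical mode, so the example below uses the mfv() (most frequent value) function from the modeest package.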

#Loads the modeest package (install it once with install.packages("modeest"))
library(modeest)
df <- c(34,56,25,67,54,56,78,98,34,16,58)

#Returns the most frequent value(s)
mfv(df)
 34 56
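If you would rather not install an extra package, the most frequent values can also be found with base R's table() function. This is a minimal sketch with illustrative variable names, not a replacement for mfv().

```r
values <- c(34, 56, 25, 67, 54, 56, 78, 98, 34, 16, 58)

# Count how often each distinct value occurs
counts <- table(values)

# Keep the value(s) whose count equals the highest count
modes <- as.numeric(names(counts)[counts == max(counts)])

modes
```

When every value occurs equally often, this sketch returns all of them, so treat that case as "no mode".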

Descriptive Statistics in R language

R includes many functions that help with statistical computations on your data. Descriptive statistics give quick insight into the shape and distribution of the data. For a numeric column, the summary() function in R reports up to 7 values: six summary statistics plus the number of NA's when the column has missing values.

  • Mean
  • Median
  • 1st Quartile
  • 3rd Quartile
  • Minimum value
  • Maximum value
  • NA’s

Descriptive Statistics in Action

R is a valuable tool for modern analytical problems. With a single function call, R gives you these 7 statistical properties for every numeric column of the input data.

In this section let’s see Descriptive statistics in Action!!!

Let’s load one of R’s built-in datasets, the airquality dataset.

#Importing the data
df <- datasets::airquality

#Prints the first 12 rows of the data
head(df, 12)
      Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      NA      NA 14.3   56     5   5
6      28      NA 14.9   66     5   6
7      23     299  8.6   65     5   7
8      19      99 13.8   59     5   8
9       8      19 20.1   61     5   9
10     NA     194  8.6   69     5  10
11      7      NA  6.9   74     5  11
12     16     256  9.7   69     5  12

Fantastic!!! Our data is now ready for some statistical inference. Let’s make use of the summary() function to get the statistical properties of this data.

#returns the statistical properties of data 
summary(df)
       Ozone            Solar.R              Wind             
 Min.   :  1.00      Min.   :  7.0      Min.   : 1.700    
 1st Qu.: 18.00      1st Qu.:115.8      1st Qu.: 7.400    
 Median : 31.50      Median :205.0      Median : 9.700   
 Mean   : 42.13      Mean   :185.9      Mean   : 9.958   
 3rd Qu.: 63.25      3rd Qu.:258.8      3rd Qu.:11.500   
 Max.   :168.00      Max.   :334.0      Max.   :20.700   
 NA's   :37          NA's   :7                     

     Temp             Month      
Min.   :56.00      Min.   :5.000  
1st Qu.:72.00      1st Qu.:6.000  
Median :79.00      Median :7.000  
Mean   :77.88      Mean   :6.993  
3rd Qu.:85.00      3rd Qu.:8.000  
Max.   :97.00      Max.   :9.000 

 Day      
 Min.   : 1.0  
 1st Qu.: 8.0  
 Median :16.0  
 Mean   :15.8  
 3rd Qu.:23.0  
 Max.   :31.0  

Inference Drawn:

  • The given data has missing values (NA) in 2 columns, i.e. Ozone and Solar.R.
  • The maximum temperature is 97 and the minimum is 56.
  • The maximum Ozone is 168 and the minimum is 1.
  • The maximum wind speed is 20.7 and the minimum is 1.7.
  • All values fall within plausible ranges, so apart from the missing values the data looks usable for further analysis.
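The missing-value counts reported by summary() can be cross-checked directly. The sketch below uses is.na() with colSums() to count NA's per column, and colMeans() with na.rm = TRUE to compute the means while skipping them. The variable names are illustrative.

```r
air <- datasets::airquality

# Number of missing values in each column
na_counts <- colSums(is.na(air))
na_counts

# Column means, ignoring missing values
col_means <- colMeans(air, na.rm = TRUE)
round(col_means, 2)
```

Without na.rm = TRUE, the means of Ozone and Solar.R would come back as NA.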

R for Statistics – Correlation between Attributes

With the help of the summary() function, we got up to 7 statistical properties of our data and saw that it is good to go for further analysis. Now let’s find the correlation between attributes. For this, we can use the cor() function in R.

#Importing data
df <- datasets::VADeaths
#Prints data
df
           Rural Male  Rural Female Urban Male Urban Female
  50-54       11.7          8.7       15.4          8.4
  55-59       18.1         11.7       24.3         13.6
  60-64       26.9         20.3       37.0         19.3
  65-69       41.0         30.9       54.6         35.1
  70-74       66.0         54.3       71.1         50.0

This is our data. Note that cor() only works on numeric data. If a column is stored as another type, such as character, you can convert it with the as.numeric() function first. VADeaths is already a numeric matrix, so no conversion is needed here.

Now, let’s apply the cor() function to get the correlation between all the attributes of the data.

#Returns the correlation between attributes
cor(df)
              Rural Male Rural Female Urban Male Urban Female
Rural Male    1.0000000    0.9979869  0.9841907    0.9934646
Rural Female  0.9979869    1.0000000  0.9739053    0.9867310
Urban Male    0.9841907    0.9739053  1.0000000    0.9918262
Urban Female  0.9934646    0.9867310  0.9918262    1.0000000

cor() uses the Pearson correlation by default, which is also the most commonly used method. The Pearson correlation lies between -1 and 1: 1 is a perfect positive linear relationship, -1 is a perfect negative linear relationship, and 0 means there is no linear relationship between the attributes.

In this data, you can see that all the attributes are highly positively correlated with each other.
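cor() also accepts a method argument for other correlation types and a use argument that controls how missing values are handled. The sketch below is illustrative: the choice of columns and the variable names are assumptions made for the example, not part of the original walkthrough.

```r
deaths <- datasets::VADeaths

# Pearson (the default) and Spearman rank correlation between two columns
pearson_r  <- cor(deaths[, "Rural Male"], deaths[, "Urban Male"], method = "pearson")
spearman_r <- cor(deaths[, "Rural Male"], deaths[, "Urban Male"], method = "spearman")

# With missing values, cor() returns NA unless told which rows to use
air <- datasets::airquality
ozone_temp_na <- cor(air$Ozone, air$Temp)
ozone_temp_r  <- cor(air$Ozone, air$Temp, use = "complete.obs")
```

Spearman correlation works on ranks, so it captures any monotonic relationship, while Pearson measures linear association only.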

R for statistics – The Quantile distribution

Quantiles are cut points that split the sorted data into parts, for example quartiles at 25%, 50% and 75%. They are one of the most useful tools in data analysis for understanding how the data is distributed.

Let’s see how we can find the quantile distribution of the data.

#Importing data
df <- datasets::iris
#Prints the first 12 rows of the data
head(df, 12)
      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
6            5.4         3.9          1.7         0.4     setosa
7            4.6         3.4          1.4         0.3     setosa
8            5.0         3.4          1.5         0.2     setosa
9            4.4         2.9          1.4         0.2     setosa
10           4.9         3.1          1.5         0.1     setosa
11           5.4         3.7          1.5         0.2     setosa
12           4.8         3.4          1.6         0.2     setosa
#Returns the quantile distribution  
quantile(iris$Sepal.Length)
0%  25%  50%  75% 100% 
 4.3  5.1  5.8  6.4  7.9 
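By default quantile() returns the minimum, the three quartiles and the maximum. Its probs argument lets you ask for other cut points, and the spread between the 25% and 75% quantiles is the interquartile range, also available as IQR(). The sketch below uses illustrative variable names.

```r
sepal_length <- datasets::iris$Sepal.Length

# 10th and 90th percentiles instead of the default quartiles
tails <- quantile(sepal_length, probs = c(0.1, 0.9))
tails

# Interquartile range: 75% quantile minus 25% quantile
iqr_manual  <- unname(quantile(sepal_length, 0.75) - quantile(sepal_length, 0.25))
iqr_builtin <- IQR(sepal_length)
```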

R for statistics – Skewness of the data

Skewness measures how asymmetric a distribution is, that is, how far it leans to one side compared with a symmetric distribution such as the Gaussian. In data preparation, it is important to know the skewness of the data. Let’s use the skewness() function for this purpose.

Let’s use the same iris dataset for this.

#Importing data
df <- datasets::iris
#Prints the first 12 rows of the data
head(df, 12)
      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
6            5.4         3.9          1.7         0.4     setosa
7            4.6         3.4          1.4         0.3     setosa
8            5.0         3.4          1.5         0.2     setosa
9            4.4         2.9          1.4         0.2     setosa
10           4.9         3.1          1.5         0.1     setosa
11           5.4         3.7          1.5         0.2     setosa
12           4.8         3.4          1.6         0.2     setosa

Now, we will find the skewness of columns in this dataset using the skewness() function.

This function is not part of base R, so we have to load the e1071 package first.

#Loads the e1071 package (install it once with install.packages("e1071"))
library(e1071)
#Returns the skewness of the sepal length
skewness(iris$Sepal.Length)
0.3086407
#Returns the skewness of the petal length
skewness(iris$Petal.Length)
-0.2694109

As you can see from the output, a positive skewness means the data is right-skewed and a negative skewness means it is left-skewed. A value near zero means the data is close to symmetric.
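To see where these numbers come from, here is a minimal base R sketch of a moment-based skewness estimate. The function name skew_b1 is hypothetical. It assumes that e1071's skewness() defaults to its "type 3" estimator, as described in that package's documentation. Other packages may use a slightly different formula and return slightly different values.

```r
# Hypothetical helper: moment-based skewness using base R only
skew_b1 <- function(x) {
  x <- x[!is.na(x)]                 # drop missing values
  n <- length(x)
  deviations <- x - mean(x)
  m2 <- sum(deviations^2) / n       # second central moment
  m3 <- sum(deviations^3) / n       # third central moment
  g1 <- m3 / m2^1.5                 # classic moment coefficient of skewness
  g1 * ((n - 1) / n)^1.5            # small-sample adjustment (assumed "type 3")
}

sepal_skew <- skew_b1(datasets::iris$Sepal.Length)
petal_skew <- skew_b1(datasets::iris$Petal.Length)
```

The sign of the third central moment decides the direction: large values far above the mean push it positive (right-skewed), and large values far below the mean push it negative (left-skewed).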

Wrapping Up

R is one of the most important tools for statistical programming and statistical inference. You can use the R language for many statistical purposes, as shown in this article.

R has plenty of built-in functions that give useful insights into the input data. You can get up to 7 statistical properties of the data with the summary() function.

By now, I hope you have a better understanding of R for statistics. That’s all for now. Happy R!!!

Further reading: R CRAN project for statistics
