In this series of articles, we’ll deepen our knowledge of the principal statistical concepts. This time we’re going to bring our attention to the measures of central tendency. Plus, we’ll learn how to calculate them in WebDataRocks.

What is the central tendency?

How would you describe a data set with a single value? The most common approach is to define a central position of your data distribution. This is what statisticians call the central tendency. As a core concept in statistics, the central tendency summarizes the entire data set, giving an idea of its typical value.

What are the measures of central tendency?

The arithmetic mean (or average) is the first measure that comes to one’s mind when talking about a center point in the data or its typical value. Nevertheless, there are also other measures that describe the central tendency more accurately in certain scenarios. 

This time we’ll break down the purposes of the three main measures to describe the central position within a data set, namely:

  • Mean
  • Median
  • Mode

Moreover, we’ll learn how to calculate them both by hand and using WebDataRocks Pivot Table. Afterwards, you can compare which way you like better.

Mean

The mean is the most common way to summarize a data set. You can use the mean with either discrete or continuous data. Yet, it’s mostly used with continuous data.

There are two important properties the mean has:

  • The calculation of the mean takes every data point in your data set into account.
  • The sum of deviations of each data point from the mean is always zero. 
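
The second property is easy to check numerically. Here is a minimal Python sketch; the sample values are made up for illustration, and exact fractions are used so that floating-point rounding doesn't blur the result.

```python
from fractions import Fraction

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 9, 10]

# Exact mean as a fraction: sum of the values divided by their count.
mean = Fraction(sum(data), len(data))

# Deviation of each data point from the mean.
deviations = [x - mean for x in data]

# Positive and negative deviations cancel each other out.
total_deviation = sum(deviations)
print(mean, total_deviation)
```

Whatever numbers you put into `data`, `total_deviation` should come out as zero.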

How to calculate the mean

To calculate the mean, add all numerical data points and divide their sum by the total number of data points.

Here’s a formula to calculate the mean with:
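
In standard notation, assuming we label the data points x_1, x_2, ..., x_n (these labels are our own choice), the mean can be written as:

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}
```

Here \bar{x} stands for the mean and n for the total number of data points.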

And here’s how you can calculate the mean in Pivot Table:

Use cases

If the data has a symmetrical distribution with a single peak (e.g., normal), the mean represents the center well since it coincides with the median and the mode. Otherwise, it may not be the most suitable way of representing a typical value of the data set.

And here’s why.

The main drawback of the mean as a central measure is its sensitivity to outliers. What does that mean in practice?

Let’s consider an example.

Imagine you’re conducting a salary survey for reporting an average income in the city. Let’s suppose there are 6 people living in this fictional city of ours.

This simplification will help us grasp the illustrated problem quickly. The following case can be easily generalized to any number of data points.

The salaries of the citizens are presented in the table as follows:

Citizen   1    2    3    4    5     6
Salary    2K   3K   5K   2K   50K   100K

As you can tell from the data, most people (four out of six) earn 5K or less. The mean of this data is 162K / 6 = 27K. Intuitively, this number doesn’t give an accurate portrait of the data, since four of the six citizens earn far less than that.

This case shows how one or more high numbers (in our case, 50K and 100K) can make the center seem higher than it really is. The same is true for extremely low values, which pull the mean down.
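
To see the effect in numbers, here is a small Python sketch that computes the mean of the salaries with and without the two highest values. The cut-off of 50K used to separate them is an assumption made for this illustration only.

```python
from statistics import mean

# Salaries of the six citizens from the table above, in units of 1K.
salaries = [2, 3, 5, 2, 50, 100]

# Assumption for this sketch: treat everything from 50K upwards as unusually high.
without_high_values = [s for s in salaries if s < 50]

mean_all = mean(salaries)                 # 162 / 6
mean_trimmed = mean(without_high_values)  # 12 / 4

print(mean_all, mean_trimmed)
```

Two values out of six are enough to move the mean from 3K to 27K.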

In statistical terminology, these two high values are called outliers. Outliers are extreme or anomalous values that differ significantly from the rest of the data points.

The data is said to be skewed by the outliers that contribute extremely high or low values compared to the magnitude of other data points. 

But which measure performs better at capturing the central tendency in such a case?

This is where the median comes in handy.

Median

The median is the middle value of the data once it’s sorted: half of the data points lie below it and half lie above it.

How to calculate the median

First, arrange the values in order from the least to the greatest. Next, select the data point which is located in the middle. This number is the median of your data set. 

This algorithm works when you have an odd number of data points.

If your data contains an even number of observations, instead of choosing one number, pick the two middlemost numbers and average them (i.e., find their mean) to obtain a single median value.
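
Both cases can be sketched as a short Python function. The helper name `median` and the sample inputs are hypothetical and used only for illustration.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)       # arrange from the least to the greatest
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd number of data points: take the single middle value.
        return ordered[middle]
    # Even number of data points: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_result = median([7, 1, 4])       # sorted: 1, 4, 7
even_result = median([7, 1, 4, 10])  # sorted: 1, 4, 7, 10
print(odd_result, even_result)
```

Sorting a copy with `sorted` leaves the caller's list untouched, which is usually what you want from a helper like this.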

Here’s an example of how we can find the median of the salaries data set:

2K 3K 5K 2K 50K 100K
  1. Sort the values in ascending order:
    2K 2K 3K 5K 50K 100K
  2. Find the central data point:

    Since we have two middle numbers, namely 3K and 5K, let’s sum them together and divide by 2: 

    (3K + 5K) / 2 = 4K

Congrats! You’ve just found the median value of the data set. Now we can compare it to the mean value, which is 27K, and conclude that the median of 4K gives a more accurate idea of the typical salary in this data set.
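
If you’d like to double-check the manual calculation, Python’s standard `statistics` module offers one quick way to do it. This is only a side check and not the WebDataRocks approach.

```python
import statistics

# The six salaries from the example, written out in full.
salaries = [2000, 3000, 5000, 2000, 50000, 100000]

salary_mean = statistics.mean(salaries)
salary_median = statistics.median(salaries)

print(salary_mean, salary_median)
```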

Here’s an alternative way of finding the median with Pivot Table:

Bonus

Sometimes it can be useful to visualize the distribution of the data with a box and whisker plot (or simply a box plot). This type of chart helps us see where the median is located. The median is also called the second quartile of the data set: about 50% of the data lies below it. Quartiles are an important concept in statistics. They are the three cut points that split the sorted data into four equal parts.
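
As a rough illustration, the quartiles of the salary data can be computed with `statistics.quantiles` from Python’s standard library (available since Python 3.8). Keep in mind that the first and third quartiles depend on the interpolation method used, so other tools may report slightly different values for them.

```python
import statistics

# The six salaries from the example, written out in full.
salaries = [2000, 3000, 5000, 2000, 50000, 100000]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(salaries, n=4)

# For this data, the second quartile matches the median found earlier.
print(q2, statistics.median(salaries))
```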

Here’s how we can identify where the center of the data lies by looking at the box plot:

A common rule of thumb is to use the median as the central value when the distribution of the data is not symmetric. Examples of symmetric distributions are the normal distribution, the logistic distribution, the uniform distribution, and the Cauchy distribution. For a symmetric distribution with a single peak and a well-defined mean (such as the normal or the logistic distribution), the mode, median, and mean coincide. The Cauchy distribution is an exception: it’s symmetric, but its mean is undefined.

Mode

The mode is the most common data point across all the observations. In other words, it’s the value that occurs most often.

The mode is rarely used with continuous data. As a rule, we find the mode in categorical data when we want to know which category is the most common. Here’s an example of the mode on a histogram:

Note that the mode is not necessarily unique. A data set can have two or more modes. In such a case, the data is said to have two or more peaks. The corresponding types of distributions are called bimodal or multimodal.
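
Here is a short Python sketch of finding the mode, including the case with more than one mode. It relies on `statistics.multimode` from the standard library (Python 3.8+); the list of colors is a made-up categorical sample.

```python
from statistics import multimode

# Hypothetical categorical data: "red" and "blue" both appear twice.
colors = ["red", "blue", "red", "green", "blue"]
color_modes = multimode(colors)

# The salaries from the earlier example: only 2000 appears more than once.
salaries = [2000, 3000, 5000, 2000, 50000, 100000]
salary_modes = multimode(salaries)

print(color_modes, salary_modes)
```

`multimode` always returns a list, so a single mode and several modes can be handled in the same way.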

The mode is not always the best way to represent the central tendency since it may lie quite far from the rest of the data points:

As you can see, this value doesn’t represent the data well since most of the values are clustered within the X to Y range.

Conclusion

Today we’ve figured out what the central tendency is, how it can be measured, as well as the pros and cons of calculating it using the mean, median, and mode.

We’ve also gained practical experience in calculating the central tendency measures with WebDataRocks Pivot Table and made sure that it can be done extremely quickly and efficiently.

What’s next?

Have you already checked our list of top Data Science and Analytics courses & specializations? This is where knowing the differences between the mean, mode, and median will come in handy.
