So, what is the distribution of a variable?
Analyzing the distribution of a variable means we do two things:
- We find out the spread of the distribution, that is, its range from the lowest value to the highest.
- We find out which individual values occur in the distribution. Just because a distribution goes from 1 to 50, we can't automatically assume that every whole number in between appears in our dataset.
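The two steps above can be sketched in a few lines of Python. This is only an illustration: the `values` list is invented for the example and does not come from any dataset in this course.

```python
# Hypothetical values, invented purely for illustration.
values = [1, 3, 3, 7, 12, 12, 12, 25, 50]

# Step 1: the spread, from the lowest value to the highest.
lowest = min(values)
highest = max(values)
value_range = highest - lowest

# Step 2: the individual values that actually occur, in ascending order.
distinct_values = sorted(set(values))

# Whole numbers inside the spread that never appear in the data.
present = set(values)
missing = [n for n in range(lowest, highest + 1) if n not in present]

print("Lowest:", lowest, "Highest:", highest, "Range:", value_range)
print("Distinct values:", distinct_values)
print("Whole numbers in the spread that are absent:", len(missing))
```

Even though these values run from 1 to 50, only six different numbers actually occur, which is why the spread alone does not tell us what is in the data.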
In this chapter, we will analyze simple distributions of a single numerical or categorical variable. It is possible to analyze more than one variable at a time, but we will not address that in this course.
Why would you need to understand the distribution of a variable? Well, let's imagine you have just taken a test. After you hand in your paper, you talk to others and try to figure out how you did. You know that the test is scored on a scale of 1 to 20. Your score is 15, but then you see other tests with scores of 9, 19, and so on. Is your score among the top in the class? Is it average? Or is it quite low?
To understand your own score, you need to see all the test scores. Another way of putting this is that you need to examine the distribution of the test scores. The most straightforward way to do this is to look at the class register and find out how many students got a better score. This would be tedious and time-consuming, so we will learn a better way to do it!
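To give a rough idea of what a less tedious approach might look like, here is a minimal Python sketch that tallies the scores instead of reading through the register by hand. It is one possible way, not necessarily the method the rest of this chapter teaches. The `class_scores` list is hypothetical: only the scores 15, 9, and 19 come from the example above, and the rest are made up.

```python
from collections import Counter

my_score = 15

# Hypothetical class scores on the 1-to-20 scale.
# Only 15, 9, and 19 appear in the text; the others are invented.
class_scores = [9, 19, 15, 12, 17, 15, 8, 20, 14, 11]

# Count how many times each score occurs.
frequency = Counter(class_scores)

# Count how many scores are strictly higher than yours.
better = sum(1 for score in class_scores if score > my_score)

# Print the tally from the highest score to the lowest.
for score in sorted(frequency, reverse=True):
    print(score, frequency[score])

print("Scores higher than", my_score, ":", better, "out of", len(class_scores))
```

With this made-up class, a single pass over the list shows how many students scored higher than 15 and how often each score occurs.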