We already know one chart that can be used to visualize a distribution: the bar chart. Our new chart is very similar to the bar chart; it even uses bars to encode data. Maybe instead of introducing another chart, we should use the one we already know. How will that work?
These are the first 30 values for the consumption variable:
5.28 0.45 10.60 7.80 7.84 8.15 4.23 10.52 12.10 1.98 9.19 0.01 8.41 14.44 10.22
6.76 1.33 3.95 4.54 5.99 7.52 10.80 4.55 4.16 4.75 2.20 6.15 8.40 1.67 0.50
In contrast to a categorical variable, a numerical variable like consumption gives us lots of different values. Some values are very close together, but none of the ones shown above appears more than once!
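The claim that no value repeats can be checked directly. The following is a minimal sketch in base R; the vector name values is made up for this illustration and holds only the 30 numbers listed above, not the full dataset.

```r
# The first 30 consumption values, copied from the listing above.
values <- c(5.28, 0.45, 10.60, 7.80, 7.84, 8.15, 4.23, 10.52, 12.10, 1.98,
            9.19, 0.01, 8.41, 14.44, 10.22, 6.76, 1.33, 3.95, 4.54, 5.99,
            7.52, 10.80, 4.55, 4.16, 4.75, 2.20, 6.15, 8.40, 1.67, 0.50)

length(values)          # 30 values in total
length(unique(values))  # 30 distinct values, so nothing repeats
anyDuplicated(values)   # 0 means "no duplicate found"

# How close do neighbouring values get once they are sorted?
min_gap <- min(diff(sort(values)))
min_gap                 # about 0.01 (for example 4.54 and 4.55)
```

So some values differ by as little as about 0.01, yet each one still counts as its own value.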
What does this mean for a bar chart? We can expect it will have lots of very short bars. Let's go ahead and create a bar chart for this data. Then we can look at it and assess whether it will be useful in analyzing the distribution of a numerical variable.
The first step in creating a bar chart is to count how many times each variable value appears in the dataset. We can use the count() function again for that:
tab <- count(alcohol_consumption, consumption)
You'll get something like this:
  consumption n
1        0.01 1
2        0.08 1
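To see what count() does with repeated values, here is a small sketch on made-up data. The data frame toy and its four numbers are invented for this illustration and are not taken from alcohol_consumption; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data: the value 0.01 appears twice on purpose.
toy <- data.frame(consumption = c(0.45, 0.01, 0.08, 0.01))

# One row per distinct value; the column n holds how often it occurs.
tab_toy <- count(toy, consumption)
tab_toy
```

In this toy table, 0.01 gets n = 2 and the other two values get n = 1. In the real data almost every value is distinct, so the n column consists almost entirely of ones.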
Because consumption is a numerical variable, we next have to change it to a categorical variable, so that every distinct value is treated as its own category and gets its own bar. We can do that using factor():
tab$consumption <- factor(tab$consumption)
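What factor() does to a numeric vector can be seen on a tiny example. This is a sketch in base R; the names x and f and the three numbers are made up for illustration.

```r
# Hypothetical numeric values, deliberately unsorted.
x <- c(0.45, 0.01, 0.08)

f <- factor(x)

class(x)    # "numeric"
class(f)    # "factor"
levels(f)   # "0.01" "0.08" "0.45": one category per distinct value, sorted
nlevels(f)  # 3
```

Each distinct number becomes a separate level, and the levels are stored as labels rather than as numbers. The distances between the values no longer play a role.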
Now we can finally draw a bar chart for this data:
ggplot(data = tab, aes(x = consumption, y = n)) + geom_col()
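The three steps can be put together into one script that runs on its own. This is a sketch under an assumption: since the full dataset is not shown here, the data frame below is a stand-in built from only the 30 values listed earlier. It assumes the dplyr and ggplot2 packages are installed.

```r
library(dplyr)
library(ggplot2)

# Stand-in for the real dataset: only the 30 values shown in the text.
alcohol_consumption <- data.frame(consumption = c(
  5.28, 0.45, 10.60, 7.80, 7.84, 8.15, 4.23, 10.52, 12.10, 1.98,
  9.19, 0.01, 8.41, 14.44, 10.22, 6.76, 1.33, 3.95, 4.54, 5.99,
  7.52, 10.80, 4.55, 4.16, 4.75, 2.20, 6.15, 8.40, 1.67, 0.50
))

# Step 1: count how often each value occurs.
tab <- count(alcohol_consumption, consumption)

# Step 2: turn the numeric column into a factor (one category per value).
tab$consumption <- factor(tab$consumption)

# Step 3: draw the bar chart.
p <- ggplot(data = tab, aes(x = consumption, y = n)) + geom_col()
print(p)

nrow(tab)     # 30: the chart has one bar per distinct value
range(tab$n)  # 1 1: every bar has height 1
```

With this stand-in data the chart has 30 bars that are all exactly one unit high, which matches the expectation of lots of very short bars.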