We already know one chart that can be used to visualize a distribution: the bar chart. Our new chart is very similar to the bar chart; it even uses bars to encode data. Maybe instead of introducing another chart, we should use the one we already know. How will that work?
These are the first 30 values for the consumption variable:
5.28 0.45 10.60 7.80 7.84 8.15 4.23 10.52 12.10 1.98 9.19 0.01 8.41 14.44 10.22
6.76 1.33 3.95 4.54 5.99 7.52 10.80 4.55 4.16 4.75 2.20 6.15 8.40 1.67 0.50
In contrast to a categorical variable, a numerical variable like consumption gives us lots of different values. Some values are very close together, but none of the ones shown above appears more than once!
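One quick way to check that claim is to look for duplicates among the 30 values listed above. This is a minimal sketch in base R; the vector name consumption_head is made up for illustration and only holds the values shown here, not the full dataset.

```r
# The 30 values shown above (an excerpt, not the full dataset)
consumption_head <- c(
  5.28, 0.45, 10.60, 7.80, 7.84, 8.15, 4.23, 10.52, 12.10, 1.98,
  9.19, 0.01, 8.41, 14.44, 10.22, 6.76, 1.33, 3.95, 4.54, 5.99,
  7.52, 10.80, 4.55, 4.16, 4.75, 2.20, 6.15, 8.40, 1.67, 0.50
)

# anyDuplicated() returns 0 when no value occurs more than once
anyDuplicated(consumption_head)

# Number of distinct values; equals length(consumption_head) here
length(unique(consumption_head))
```

Both results say the same thing: every one of the 30 values is unique, so each would get its own bar of height 1.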
What does this mean for a bar chart? We can expect it will have lots of very short bars. Let's go ahead and create a bar chart for this data. Then we can look at it and assess whether it will be useful in analyzing the distribution of a numerical variable.
The first step in creating a bar chart is to count how many times each variable value appears in the dataset. We can use the count() function again for that:
tab <- count(alcohol_consumption, consumption)
You'll get something like this:
  consumption n
1        0.01 1
2        0.08 1
Because consumption is a numeric variable, we next have to change it to a categorical variable. We can do that using factor():
tab$consumption <- factor(tab$consumption)
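To see what this conversion does, here is a small sketch on a made-up three-value vector (x and f are illustrative names, not part of the dataset). Each distinct number becomes its own category label, which is also why converting a factor straight back with as.numeric() returns the category positions rather than the original values.

```r
# A tiny numeric vector for illustration
x <- c(0.01, 0.08, 5.28)

# factor() turns each distinct value into a category (level)
f <- factor(x)

levels(f)      # "0.01" "0.08" "5.28"
is.numeric(f)  # FALSE: the values are now labels, not numbers

# Caution: as.numeric() on a factor gives the level codes 1, 2, 3
as.numeric(f)

# To recover the original numbers, go through character first
as.numeric(as.character(f))
```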
Now we can finally draw a bar chart for this data:
ggplot(data = tab, aes(x = consumption, y = n)) + geom_col()
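The three steps can be tried end to end without the full dataset. The sketch below assumes dplyr and ggplot2 are installed and uses only the 30 values listed earlier as stand-in data; sample_data and p are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Stand-in data: the 30 values listed earlier, not the full dataset
sample_data <- data.frame(consumption = c(
  5.28, 0.45, 10.60, 7.80, 7.84, 8.15, 4.23, 10.52, 12.10, 1.98,
  9.19, 0.01, 8.41, 14.44, 10.22, 6.76, 1.33, 3.95, 4.54, 5.99,
  7.52, 10.80, 4.55, 4.16, 4.75, 2.20, 6.15, 8.40, 1.67, 0.50
))

# Step 1: count how often each value appears
tab <- count(sample_data, consumption)

# Step 2: treat each distinct value as a category
tab$consumption <- factor(tab$consumption)

# Step 3: draw one bar per category, with height n
p <- ggplot(data = tab, aes(x = consumption, y = n)) + geom_col()
p
```

Since all 30 values are distinct, this draws 30 bars that all have height 1, which is the pattern of many very short bars anticipated above.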