To create a histogram, we have to divide our numerical variable's range into intervals called bins. How do we do this?
We can divide our numerical variable in one of two ways:
- Setting the number of intervals, and therefore the number of bars,
- Setting the interval length, and therefore the bars' width.
Both methods are connected: when you set the number of intervals for your variable's range, you automatically set how wide the bars representing those bins should be. Likewise, in setting interval length, you also set the number of intervals (bars) on your histogram.
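The link between the two settings is simple arithmetic: the interval length is the variable's range divided by the number of intervals. Here is a minimal sketch in base R, assuming a made-up vector called values (it does not come from any real dataset):

```r
# Hypothetical example data, chosen only for illustration
values <- c(1, 3, 4, 5, 6, 10)

# Width of the variable's range: max - min
value_range <- diff(range(values))   # 10 - 1 = 9

# Fixing the number of intervals determines the interval length
no_intervals <- 3
interval_length <- value_range / no_intervals   # 9 / 3 = 3

# Fixing the interval length determines the number of intervals
chosen_length <- 3
intervals_needed <- ceiling(value_range / chosen_length)   # 9 / 3 = 3

interval_length
intervals_needed
```

Whichever of the two numbers you pick, the other follows from the range of the data.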
When you use ggplot to create a histogram, it automatically calculates intervals for the dataset, sorts values into these intervals, and counts the interval frequencies. You don't have to do the calculations manually every time you create a histogram. However, by doing the entire process yourself now, you'll have a better understanding of what's going on "under the hood" of ggplot. You'll also get a better understanding of your dataset before you start plotting it.
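As a rough illustration of that automatic behavior, the sketch below asks ggplot2 for a histogram twice: once by number of bins and once by bin width. The values_df data frame and its value column are made up for this example; bins and binwidth are arguments of geom_histogram(), and layer_data() shows what ggplot2 computed.

```r
library(ggplot2)

# Hypothetical data frame, used only for illustration
values_df <- data.frame(value = c(1, 3, 4, 5, 6, 10))

# Set the number of intervals; ggplot2 works out the bar width
p_bins <- ggplot(values_df, aes(x = value)) +
  geom_histogram(bins = 3)

# Set the interval length; ggplot2 works out the number of bars
p_width <- ggplot(values_df, aes(x = value)) +
  geom_histogram(binwidth = 3)

# layer_data() returns the bins and counts ggplot2 computed for a plot
layer_data(p_bins)[, c("xmin", "xmax", "count")]
```

ggplot2 positions its bin edges by its own rules, so the boundaries it reports may differ from the ones you get when you do the division by hand.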
Let's start by seeing how we can split a numerical variable into a determined number of intervals in R. We'll split the numerical variable using the cut() function:
split <- cut(vector, breaks = no_intervals, include.lowest = TRUE)
- The first argument, vector, holds the numerical values that the cut function will sort into intervals.
- The breaks argument sets how this division should be done. We want to set the number of intervals, so we supply a single number here.
- The last argument, include.lowest, is technical; setting it to TRUE ensures that the lowest value in the vector is included in the division.
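When executed, this function returns a factor of the same length as the vector, with an interval label in place of each value. Suppose we have a vector with the values 1, 3, 4, 5, 6, 10 and we use this command to divide it into two intervals. When breaks is a single number, cut() splits the range of the data into intervals of equal length and moves the outer limits slightly outward, so the two intervals here are labeled [0.991,5.5] and (5.5,10] rather than with round numbers. The function returns a factor with the following labels: [0.991,5.5], [0.991,5.5], [0.991,5.5], [0.991,5.5], (5.5,10], (5.5,10]. Four values (1, 3, 4, 5) fall in the lower interval, so we get four [0.991,5.5] labels. Likewise, two values (6, 10) fall in the upper interval, so we get two (5.5,10] labels.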
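To see the labels R actually generates, you can run the example yourself. This sketch uses only base R; values is simply a name chosen for the example vector.

```r
# The example vector from the text
values <- c(1, 3, 4, 5, 6, 10)

# Divide the range of the data into two equal-width intervals
split <- cut(values, breaks = 2, include.lowest = TRUE)

# One interval label per original value
split

# The two interval labels R generated
levels(split)

# How many values received each label
table(split)
```

The table() call is only a quick check: the first four values share one label and the last two share the other.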
We can make a dataset from this new vector using this command:
split_df <- data.frame(interval = split)
Now we can use the count() function again to calculate how many values from each interval are in the split_df dataset. (The count() function requires you to specify a dataset as the first argument.)
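Putting the steps together might look like the sketch below. It assumes that count() here is the function from the dplyr package, and it reuses the small example vector from earlier; interval_counts is a name invented for this illustration.

```r
library(dplyr)

# The example vector, split into two intervals
values <- c(1, 3, 4, 5, 6, 10)
split <- cut(values, breaks = 2, include.lowest = TRUE)

# Wrap the factor in a data frame with a single column named interval
split_df <- data.frame(interval = split)

# Count how many rows fall into each interval;
# the dataset is the first argument, the column to count by is the second
interval_counts <- count(split_df, interval)
interval_counts
```

The result has one row per interval and a column n holding the frequencies, which are the bar heights a histogram would show.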
By the way, you don't have to use split as the vector name, split_df for the dataset name, and interval for the column name. These are arbitrarily chosen names; you can choose your own if you like.