Visualize your data
4. Prepare the data: set the number of intervals

Instruction

To create a histogram, we have to divide our numerical variable's range into intervals called bins. How do we do this?

We can divide our numerical variable in one of two ways:

  • Setting the number of intervals, and therefore the number of bars,
  • Setting the interval length, and therefore the bars' width.

Both methods are connected: when you set the number of intervals for your variable's range, you automatically set how wide the bars representing those bins should be. Likewise, in setting interval length, you also set the number of intervals (bars) on your histogram.
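The link between the two methods is simple arithmetic, and a short base R sketch can make it concrete. The vector x and the other names below are illustrative assumptions, not part of the lesson's dataset.

```r
# Hypothetical toy vector (not from the lesson's dataset)
x <- c(1, 3, 4, 5, 6, 10)

# Width of the whole range: max - min
value_range <- diff(range(x))

# Method 1: fix the number of intervals, and the width follows
n_intervals <- 3
width <- value_range / n_intervals

# Method 2: fix the interval length, and the number of intervals follows
chosen_width <- 3
n_from_width <- ceiling(value_range / chosen_width)

print(width)         # 3
print(n_from_width)  # 3
```

Here a range of 9 split into 3 intervals gives bars of width 3, and choosing a width of 3 gives 3 intervals.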

When you use ggplot to create a histogram, it automatically calculates intervals for the dataset, sorts values into these intervals, and counts the interval frequencies. You don't have to do the calculations manually every time you create a histogram. However, by doing the entire process yourself now, you'll have a better understanding of what's going on "under the hood" of ggplot. You'll also get a better understanding of your dataset before you start plotting it.

Let's start by seeing how we can split a numerical variable into a determined number of intervals in R. We'll split the numerical variable using the cut() function:

split <- cut(vector, breaks = no_intervals, include.lowest = TRUE)
  • The first argument, vector, holds the numerical values that cut() will sort into intervals.
  • The breaks argument sets how the division should be done. We want to set the number of intervals, so we supply that number here.
  • The last argument is technical: it makes the first interval closed on its left side, so a value equal to the lowest break point is still assigned to an interval.
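The effect of include.lowest is easiest to see when breaks is given as a vector of explicit break points, which is a different use of breaks from the single number used in this lesson. The following is a minimal base R sketch under that assumption; x and edges are illustrative names.

```r
# Hypothetical values whose minimum sits exactly on the lowest break point
x <- c(1, 5, 10)
edges <- c(1, 5, 10)  # explicit break points instead of a number of intervals

# Default: intervals are (1,5] and (5,10], so the value 1 belongs to neither
without_lowest <- cut(x, breaks = edges)

# include.lowest = TRUE closes the first interval on the left: [1,5]
with_lowest <- cut(x, breaks = edges, include.lowest = TRUE)

print(without_lowest)  # the first element is NA
print(with_lowest)     # the first element is [1,5]
```

When breaks is a single number, cut() already moves the outer limits slightly beyond the data, so the argument mainly affects how the first interval is labelled.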

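When executed, this function returns a factor of the same length as the vector, with each value replaced by the label of its interval. Given a single number for breaks, cut() divides the range of the data into that many equal-width intervals and then moves the two outer limits outward by 0.1% of the range. If we have a vector with the values 1, 3, 4, 5, 6, 10 and we use this command to divide it into two intervals, the inner break falls at 5.5 and the intervals are labelled [0.991,5.5] and (5.5,10]. The function returns a factor with the following labels: [0.991,5.5], [0.991,5.5], [0.991,5.5], [0.991,5.5], (5.5,10], (5.5,10]. Four values fell in the first interval, so we have four [0.991,5.5] labels. Likewise, two values fell in the second interval, so we got two (5.5,10] labels.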
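To check the behavior yourself, you can run the six-value vector through cut() directly. This is a minimal sketch using only base R; the name x is an illustrative assumption.

```r
# The six-value example vector
x <- c(1, 3, 4, 5, 6, 10)

# Ask for two equal-width intervals
split <- cut(x, breaks = 2, include.lowest = TRUE)

# The interval labels that R generated
print(levels(split))  # "[0.991,5.5]" "(5.5,10]"

# One label per original value, in the original order
print(as.character(split))
```

The lower edge is printed as 0.991 rather than 1 because cut() moves the outer limits slightly beyond the minimum and maximum of the data.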

We can make a dataset from this new vector using this command:

split_df <- data.frame(interval = split)

Now we can use the count() function again to calculate how many values from each interval are in the split_df dataset. The count() function takes the dataset as its first argument and the column to count by (here, interval) as its second.
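Put together on the six-value toy vector, the three steps might look like the sketch below. It assumes that count() is the function from the dplyr package and that dplyr is installed; x and counts are illustrative names.

```r
library(dplyr)  # assumption: count() comes from dplyr

# Toy vector standing in for a real numerical variable
x <- c(1, 3, 4, 5, 6, 10)

# Step 1: assign each value to one of two intervals
split <- cut(x, breaks = 2, include.lowest = TRUE)

# Step 2: wrap the factor in a data frame with a single column
split_df <- data.frame(interval = split)

# Step 3: count the rows per interval; the result has columns interval and n
counts <- count(split_df, interval)
print(counts)
```

The n column should hold 4 for the first interval and 2 for the second, which are the heights a two-bar histogram of this vector would have.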

By the way, you don't have to use split as the vector name, split_df for the dataset name, and interval for the column name. These are arbitrarily chosen names – you can choose your own if you like.

Exercise

We want to divide the consumption variable from the alcohol_consumption dataset into 10 intervals. Then we want to count how many values are in each interval.

To do that, pass alcohol_consumption$consumption as the first argument to cut(). Set the breaks argument to 10 and write the remaining two parts of the code as shown above.

When you're done, press the Run and Check Code button to check your code. Then notice how the intervals look and check whether there are ten of them.

Stuck? Here's a hint!

You should write:

split <- cut(alcohol_consumption$consumption, breaks = 10, include.lowest = TRUE)
split_df <- data.frame(interval = split)
count(split_df, interval)