Introduction to ggplot2


Show, don’t tell! Share data insights in stunning color and display with ggplot2, a wonderful R package for visualizing data.

ggplot2: Grammar of Graphics

The end goal of data analysis should be clear: beautiful data visualizations. We are visual beings, after all, and a picture often tells us far more than raw numbers! Among the many visualization tools, one in particular stands out: ggplot2, a free, open-source, and easy-to-use package that has become a favorite among many R programmers. This article explains the basics of ggplot2 and guides you in creating beautiful data visualizations.

Scenario: Game of Thrones—U.S. Viewers per Episode

A few weeks ago, the last episode of Game of Thrones, Season 7, aired. And I must say, it was nothing short of spectacular! Throughout the whole episode, my eyes were wide open: fire, ice, mysteries, intrigues, dragons, and unexpected twists. “This is probably the most-watched episode yet!” I thought to myself.

How many viewers saw the episode? Has this season had the greatest number of viewers so far? Is the number of viewers rising from season to season? I decided to search for some data and create graphics in order to answer these questions. Surfing the net, I came across a table listing the number of viewers for each episode of each season. This will serve as an excellent example for practicing with ggplot2. Let’s make some plots!

What Is ggplot2?

Developed in 2005, ggplot2 is one of the most famous R packages for statistical graphics. It’s very popular because it relies on an iterative mode of work, its syntax is easy to learn, and the package is admirably maintained and documented. “Iterative” means that each component (part) of the graph can be built separately with its own specific syntax (the so-called “grammar” of graphics). For example, plot type and aesthetics are two separate components, and each is defined with its own syntax. At any time we can save a plot as an object and then update its components when necessary.
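
As a rough sketch of this layer-by-layer way of working, the snippet below builds a plot in three separate steps. It uses R's built-in mtcars data set rather than the Game of Thrones data, and the object names (base_plot, with_points, with_title) are illustrative only.

```r
library(ggplot2)

# Start with data and aesthetic mappings only; nothing is drawn yet.
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# Add a geometric layer to the saved object.
with_points <- base_plot + geom_point()

# Add a title later without rewriting the earlier steps.
with_title <- with_points + ggtitle("Fuel economy vs. weight")

# Print the finished plot.
print(with_title)
```

Each intermediate object remains usable, so it is possible to branch off a different version of the plot from base_plot at any point.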

In the next section, we’ll move away from the theory and work through an example that illustrates some basic graph components and how they can be created with ggplot2 syntax. We’ll cover the following components:

  • data
  • aesthetics
  • geometrical shapes
  • scales, axes and legends
  • themes

Data

Before we even attempt a visualization, we must first ensure that we have well-prepared data. In our example, we’ve prepared a table (what’s known as a “data frame” in R) with the number of viewers for each episode of the Game of Thrones series. It has 70 total observations, each representing one episode of a season.

For Season 7, the data in the table look like this:


Season 7: Number of U.S. viewers per episode

Data from the net were saved in an Excel file, so we’ll import them into R and store them in a data frame object we’ll name GOT. Here is the code:

library(xlsx)
GOT <- read.xlsx("got_melt.xls",sheetIndex=1)
head(GOT)

Great! The first component has been defined. Now that we have our data, our next step is to define the aesthetics.
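
For readers who do not have the Excel file, here is a minimal sketch of the kind of data frame the rest of the article expects. The column names follow the ones used in this article; the Season labels and the viewer figures are placeholders for illustration, not real ratings, and GOT_demo is a hypothetical name. The factor() call is a precaution: if Episode is stored as plain text, R orders it alphabetically, which would put “Ep10” before “Ep2”.

```r
# Hypothetical stand-in for the GOT data frame; the viewer figures are
# placeholders for illustration, not real ratings.
GOT_demo <- data.frame(
  Season = rep(c("Season 1", "Season 2"), each = 3),
  Episode = rep(c("Ep1", "Ep2", "Ep10"), times = 2),
  Number_of_viewers = c(2.0, 2.1, 2.5, 3.0, 3.1, 3.5)
)

# Fix the episode order explicitly so that "Ep10" does not sort before "Ep2".
GOT_demo$Episode <- factor(GOT_demo$Episode, levels = paste0("Ep", 1:10))

# Inspect column types and the first rows before plotting.
str(GOT_demo)
head(GOT_demo)
```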

Aesthetic Mappings

Which data are to be presented, and how will we present them on the graph? What will we put on our axes, and how will the data be grouped? The answers to these questions will define our second component. We must define the variables to be represented along our horizontal and vertical axes and additional variables (if necessary) for color, shape, size, and group.

In our example, we will measure Episode along our horizontal axis and Number_of_viewers on the vertical. Because we want to distinguish episodes by season, we’ll group and color our data according to a third variable named Season.

Aesthetics are defined through the aes() function. This is the code for our example:

library(ggplot2)
# Map Episode to x and viewers to y; colour and group the data by Season.
plot <- ggplot(GOT, aes(x=Episode, y=Number_of_viewers, colour=Season, group=Season))
plot

Using this syntax, we have defined our first and second components. However, executing the R code above gives us an empty canvas: no data are drawn on the graph yet.


Empty plot—geometric shapes are not defined.

Why is the graph empty? Because we have not yet defined which geometric shapes will represent the data. We must specify what kind of plot we want to create.
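
One way to see this for yourself is to count the layers stored in the plot object. The sketch below uses the Orange data set that ships with R (one line per tree) as a stand-in for the GOT table; p and p_lines are illustrative names.

```r
library(ggplot2)

# Orange ships with R: one row per tree per measurement age.
p <- ggplot(Orange, aes(x = age, y = circumference, colour = Tree, group = Tree))

# No geom has been added yet, so the plot object holds no layers.
length(p$layers)

# Adding a geom gives ggplot2 something to draw.
p_lines <- p + geom_line()
length(p_lines$layers)
```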

Geometric Shapes

Our next step is to define the geometric shapes that will be used to represent our data. For example, with geom_point(), we will make a scatter plot; with geom_line(), we will create a line graph; and with geom_bar(), we will present a bar chart. (There are many different geometrical shapes, and I’m just mentioning the ones I use the most.)

Let’s go back to our example. Because a line graph is useful for displaying data that change continuously over time, we’ll use that type of graph to show changes in the number of viewers from season to season.

Let’s use geom_line() to make a line graph in our example:

plot <- plot + geom_line()
plot

We now have our initial visualization:


First version of the graph

Wouldn’t it be nice to add points to the graph? This would make the information easier to read, since the points would emphasize the specific number of viewers for each episode. We can do this with ease, since ggplot2 allows you to use different shapes on a single plot.

Let’s add points with geom_point():

plot <- plot + geom_point()
plot

Now the plot looks a little bit prettier:


Number of viewers of each episode
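
The two layers can also be added in a single expression. This is a small sketch on R's built-in Orange data set rather than the GOT table, and growth_plot is an illustrative name; the point size of 2 is an arbitrary choice.

```r
library(ggplot2)

# Both layers inherit the aesthetics declared in ggplot().
growth_plot <- ggplot(Orange, aes(x = age, y = circumference,
                                  colour = Tree, group = Tree)) +
  geom_line() +
  geom_point(size = 2)  # size here is a fixed setting, not a mapped aesthetic

print(growth_plot)
```

Layers are drawn in the order they are added, so here the points sit on top of the lines.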

Having defined the base components of our graph, let’s now turn to the task of making the graph more readable.

Scales, Axes and Legends

If we take a look at the graph above without reading the article, it is not clear what the graph is about. It is very important that the visualization is simple, readable, and communicates a clear message. Adding a title, changing or removing the title of the legend, and properly setting axes values and labels will make our plot much more readable. With these components, we are specifying how our aesthetics (x and y axes, etc.) should be presented on the graph.

Let’s make some changes:

plot <- plot +
  ggtitle("Game of Thrones Viewership per episode by Season") +
  scale_x_discrete("Episode number", labels=c("1","2","3","4","5","6","7","8","9","10")) +
  scale_y_continuous("Number of U.S. viewers in millions", breaks=seq(0,12,1)) +
  scale_colour_discrete(name="")
plot

The result is the following graph:


We accomplished the following:

  • scale_x_discrete() was used to change the x-axis title (“Episode number” instead of “Episode”) and its labels (“1, 2, …, 10” instead of “Ep1, …, Ep10”). Because the variable Episode is discrete (it has a finite set of values), we use scale_x_discrete() rather than scale_x_continuous().
  • Because the variable on the y-axis is continuous, the title and values on that axis are changed with scale_y_continuous(). The title is changed from “Number_of_viewers” to “Number of U.S. viewers in millions”, and the values on the y-axis are now the whole numbers “2, …, 12” (in increments of one).
  • The legend title is removed with scale_colour_discrete(). We use scale_colour_discrete() here because the legend relates to the discrete Season variable, which we mapped to colour and group in our aes() statement.
  • The plot title was added with ggtitle().
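
The same four functions can be tried on any data set. The sketch below applies them to R's built-in Orange data, with age converted to a factor so that it behaves like the discrete Episode variable; orange_demo, age_group, and scaled_plot are illustrative names, and the break positions are an arbitrary choice.

```r
library(ggplot2)

# Treat measurement age as a discrete category, like Episode in the article.
orange_demo <- transform(Orange, age_group = factor(age))

scaled_plot <- ggplot(orange_demo, aes(x = age_group, y = circumference,
                                       colour = Tree, group = Tree)) +
  geom_line() +
  ggtitle("Orange tree growth") +
  scale_x_discrete("Age in days") +
  scale_y_continuous("Trunk circumference in mm", breaks = seq(0, 250, 50)) +
  scale_colour_discrete(name = "")  # an empty name blanks the legend title

print(scaled_plot)
```

If age were left numeric, scale_x_continuous() would be the matching function for the x-axis instead.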

Themes

Wouldn’t you like to change the background color? Or emphasize the title? This is possible with themes. Themes are used to control details such as fonts, tick marks, panel strips, and the backgrounds of the entire canvas. Let’s bold our title and change the background color of our plot.

The code is as follows:

# Note: in ggplot2 3.4.0 and later, element_line() prefers linewidth= over size=.
plot <- plot + theme(plot.title = element_text(face="bold"),
  panel.background = element_rect(fill = "white",colour = "grey"),
  panel.grid.major = element_line(colour = "grey",size=0.1),
  panel.grid.minor = element_line(colour = "grey",size=0.1))

Here’s what we did:

  • With the plot.title = element_text(face="bold") argument, we made our title bold.
  • The background of our panel was changed to white, and its border was set to gray. This was accomplished with the panel.background argument, and the grid lines were restyled with panel.grid.major and panel.grid.minor.
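
Theme settings can also be stored in their own object and reused across plots. The following is a minimal sketch on R's built-in Orange data set; clean_theme and themed_plot are illustrative names, not something the package requires.

```r
library(ggplot2)

# Store the theme tweaks once so they can be added to any plot.
clean_theme <- theme(
  plot.title = element_text(face = "bold"),
  panel.background = element_rect(fill = "white", colour = "grey")
)

themed_plot <- ggplot(Orange, aes(x = age, y = circumference, group = Tree)) +
  geom_line() +
  ggtitle("Orange tree growth") +
  clean_theme

print(themed_plot)
```

Keeping the theme separate from the data and geoms makes it easier to give several plots a consistent look.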

Our final plot looks like this:


We’ve completed our visualization! Now we can easily draw some conclusions about the viewership of Game of Thrones. For example, we can clearly see from the graph that Season 7 had the most viewers. With each new season, viewership grows, and the last episode of Season 7 had more than 12 million viewers! Isn’t it great to read so much information from just one picture?

Summary

Of course, this is just scratching the surface of what we can do with ggplot2. The package gives you fine-grained control over each graph component, making it far easier to visualize data in a way that suits your needs.

Marija Ilic

Marija works as a data scientist in the banking industry. She specializes in big data platforms (Cloudera and Hadoop) with software and technologies such as Hive/Impala, Python and PySpark, Kafka, and R. Marija has an extensive background in DWH/ETL development in the banking industry. Her main interests are predictive modeling, real-time decision-making, and social network analysis. Outside of work, Marija enjoys listening to her favorite LPs on her old gramophone—and never grows tired of its soothing crackle.