Awesome! We're almost done with Part 5! Our emphasis in this part of the course was on missing values and factors. Let's summarize what we learned.
We discussed the concept of a missing value, which is represented by NA in R. Most operations involving NA, such as arithmetic and comparisons, return NA. Certain functions, such as min(), max(), and mean(), offer an optional na.rm argument to remove NAs from the calculations:
mean(houses$price, na.rm = TRUE)
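To make the effect of na.rm concrete, here is a minimal sketch. The prices vector is made up for illustration and simply stands in for houses$price.

```r
# Hypothetical prices; the third value is missing
prices <- c(250000, 310000, NA, 180000)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(prices)

# With na.rm = TRUE, the NA is dropped before the calculation
mean_without_na <- mean(prices, na.rm = TRUE)
max_without_na <- max(prices, na.rm = TRUE)

print(mean_with_na)      # NA
print(mean_without_na)   # mean of the three observed prices
print(max_without_na)    # 310000
```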
You can obtain a logical vector of TRUE/FALSE values indicating which values are missing from a vector by using the is.na() function. If you use the sum() function on this logical vector, you'll know how many values are missing from the original vector.
sum(is.na(houses$price))
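Counting works because R treats TRUE as 1 and FALSE as 0 when summing. The sketch below shows this on a small made-up vector (prices is hypothetical, not the course dataset).

```r
# Hypothetical prices with two missing values
prices <- c(250000, 310000, NA, 180000, NA)

# TRUE marks each position where the value is missing
missing <- is.na(prices)
print(missing)           # FALSE FALSE TRUE FALSE TRUE

# Summing the logical vector counts the TRUE values
n_missing <- sum(missing)
print(n_missing)         # 2

# which() returns the positions of the missing values
missing_positions <- which(missing)
print(missing_positions) # 3 5
```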
Since R doesn't know how to perform calculations with NA, we discussed another important topic: imputation methods. To impute a missing value means to replace it with a substitute value, such as the mean or median of the observed values.
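One simple way to impute is mean imputation, where each NA is replaced with the mean of the observed values. The sketch below assumes this method purely for illustration. It is only one of several possible approaches, and the prices vector is made up.

```r
# Hypothetical prices; the third value is missing
prices <- c(250000, 310000, NA, 180000)

# Compute the mean of the observed (non-missing) values
observed_mean <- mean(prices, na.rm = TRUE)

# Copy the vector so the original data stays untouched,
# then overwrite only the missing positions
prices_imputed <- prices
prices_imputed[is.na(prices_imputed)] <- observed_mean

print(prices_imputed)
```

Working on a copy keeps the original values available, which is handy if you later want to compare different imputation choices.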
Finally, we looked at factors, which allow you to specify categories of acceptable values. This allows you to limit the values in a column to a fixed set. You create a factor using the factor() function, like this:
houses$district_factor <- factor(houses$district)
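The following sketch shows what "limiting values to a fixed set" looks like in practice. The district names are made up for illustration. Assigning a value that is not one of the factor's levels produces NA along with a warning, which is suppressed here so the example runs quietly.

```r
# Hypothetical district names
district <- c("North", "South", "North", "East")

# By default, the levels are the sorted unique values
district_factor <- factor(district)
print(levels(district_factor))   # "East" "North" "South"

# "West" is not an existing level, so the assignment yields NA
# (R also emits a warning, suppressed here)
suppressWarnings(district_factor[2] <- "West")
print(district_factor)
```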
We also discussed factor "levels", which are just the categories of acceptable values that the variable can store. You can specify your own levels by using the optional levels argument in the factor() function.
houses$price_category <- factor(houses$price_category, levels = c("HIGH", "MEDIUM", "LOW"))
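To see why specifying levels matters, the sketch below compares the default level order with a custom one. The price_category values are made up for illustration.

```r
# Hypothetical price categories
price_category <- c("LOW", "HIGH", "MEDIUM", "LOW")

# Default: levels are sorted alphabetically
default_factor <- factor(price_category)
print(levels(default_factor))   # "HIGH" "LOW" "MEDIUM"

# Custom: levels follow the order you provide
custom_factor <- factor(price_category, levels = c("HIGH", "MEDIUM", "LOW"))
print(levels(custom_factor))    # "HIGH" "MEDIUM" "LOW"

# Summaries such as table() report counts in level order
category_counts <- table(custom_factor)
print(category_counts)
```

Any value that does not appear in the levels you supply becomes NA in the resulting factor, so it is worth checking for unexpected NAs after the conversion.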