Awesome! We're almost done with Part 5! Our emphasis in this part of the course was on missing values and factors. Let's summarize what we learned.
We discussed the concept of a missing value, which is represented by NA in R. Most operations involving NA return NA. Certain functions, such as mean(), offer an optional na.rm argument that removes NAs before the calculation:
mean(houses$price, na.rm = TRUE)
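As a minimal sketch of this behavior, assuming a small made-up houses data frame (the prices below are hypothetical, not the course dataset):

```r
# Hypothetical stand-in for the houses data frame; values are invented
houses <- data.frame(price = c(100, 200, NA, 400))

# Arithmetic with NA propagates the missing value
na_sum <- 5 + NA

# Without na.rm, a single NA makes the whole mean NA
mean_with_na <- mean(houses$price)

# With na.rm = TRUE, NAs are dropped before averaging the remaining values
mean_without_na <- mean(houses$price, na.rm = TRUE)
```

Here mean_with_na is NA, while mean_without_na is the average of the three observed prices.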
You can use the is.na() function to obtain a logical vector of TRUE and FALSE values, where TRUE marks each missing value in the original vector. Because TRUE counts as 1 and FALSE as 0, applying the sum() function to this logical vector tells you how many values are missing from the original vector.
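A short sketch of counting missing values, using a hypothetical prices vector invented for illustration:

```r
# Hypothetical vector with two missing entries
prices <- c(100, NA, 300, NA)

# TRUE where a value is missing, FALSE otherwise
missing <- is.na(prices)

# TRUE counts as 1 in arithmetic, so the sum is the number of NAs
n_missing <- sum(missing)
```

The same pattern works on a data frame column, for example sum(is.na(houses$price)).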
Since R can't produce a meaningful result from most calculations involving NA, we discussed another important topic: imputation methods. To impute a missing value means to replace it with a substitute value, for example the mean of the non-missing values.
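One possible approach is mean imputation, sketched below. The choice of the mean is an assumption made for illustration, and the prices vector is hypothetical:

```r
# Hypothetical vector with missing entries
prices <- c(100, NA, 300, NA)

# Mean of the observed (non-missing) values only
mean_price <- mean(prices, na.rm = TRUE)

# Copy the vector, then overwrite each NA with the mean
prices_imputed <- prices
prices_imputed[is.na(prices_imputed)] <- mean_price
```

Working on a copy keeps the original vector intact, so you can still tell which values were imputed.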
Finally, we looked at factors, which let you restrict a column to a fixed set of acceptable categories. You create a factor using the factor() function, like this:
houses$district_factor <- factor(houses$district)
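To see which categories R inferred, you can inspect the result with levels(). A small sketch with hypothetical district names:

```r
# Hypothetical district names, invented for illustration
districts <- c("North", "South", "North", "East")

# With no levels argument, factor() uses the sorted unique values
district_factor <- factor(districts)

# Inspect the inferred categories and how many there are
district_levels <- levels(district_factor)
n_districts <- nlevels(district_factor)
```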
We also discussed factor "levels", which are just the categories of acceptable values that the variable can store. You can specify your own levels by using the optional levels argument of the factor() function:
houses$price_category <- factor(houses$price_category, levels=c("HIGH", "MEDIUM", "LOW"))
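The sketch below shows what happens when a value falls outside the specified levels. The vector and the stray "CHEAP" entry are hypothetical:

```r
# Hypothetical categories; "CHEAP" is deliberately not an allowed level
price_category <- c("LOW", "HIGH", "MEDIUM", "LOW", "CHEAP")

# Levels are stored in the order given, not alphabetically
price_factor <- factor(price_category, levels = c("HIGH", "MEDIUM", "LOW"))
category_levels <- levels(price_factor)

# Values that match no level become NA
n_invalid <- sum(is.na(price_factor))
```

This is how a factor limits a column to a fixed set of values: anything outside the listed levels is turned into NA rather than kept as a new category.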