library(forcats)
email <- openintro::email
ff <- openintro::fastfood
nc <- openintro::ncbirths # note the name change
mtcars <- mtcars # comes with R
5 Working with Factors
In this lesson we will discuss ways to organize and deal with categorical data, also known as factor data types.
The email data set contains information on emails received by one of the OpenIntro authors during the first three months of 2012. See ?email for more details.
The fastfood data set from the openintro package describes nutrition amounts in 515 fast food items. See ?fastfood for more details.
The mtcars data set contains data from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. See ?mtcars for more details.
The goal of the forcats package is to provide a suite of useful tools that solve common problems with factors. Often in R there are multiple ways to accomplish the same task. Some examples in this lesson will show how to perform a certain task using base R functions, as well as functions from the forcats package.
5.1 What is a factor?
The term factor refers to a data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable corresponds to a limited number of categories, while a continuous variable can correspond to an infinite number of values.
An example of a categorical variable is the number variable in the email data set. This variable contains data on whether there was no number, a small number (under 1 million), or a big number in the content of the email.
5.2 Confirming the factor data type
First we should confirm that R sees number as a factor. We can use the class function we saw earlier, or str.
class(email$number)
[1] "factor"
We can use the levels() function to get to know factor variables.
levels(email$number)
[1] "none" "small" "big"
There are three levels: none, small, and big.
Let’s look at the variable restaurant from the fast food (ff) data set.
levels(ff$restaurant)
NULL
Wait - NULL? But this is a categorical variable.
class(ff$restaurant)
[1] "character"
There is a subtle difference between factor and character data types in R. Both hold categorical values (not numbers), but a factor stores a fixed set of levels in a defined order, while a character variable is plain text with no stored levels or ordering. If we want to specifically control the ordering of the levels, the data type must be factor.
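To make the distinction concrete, here is a minimal sketch using a small made-up vector. The object names `x_chr` and `x_fct` are hypothetical and not part of the lesson's data sets. It illustrates that a character vector has no levels, while a factor keeps a set of levels plus an integer code for each value.

```r
# A small made-up character vector (hypothetical example data)
x_chr <- c("small", "none", "big", "small")

# Character vectors have no levels
levels(x_chr)

# Converting to a factor creates levels, sorted alphabetically by default
x_fct <- factor(x_chr)
levels(x_fct)

# Internally, each value is stored as an integer code pointing at a level
as.integer(x_fct)
```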
5.3 Convert a character variable to factor
We can use the as.factor() function.
ff$restaurant <- as.factor(ff$restaurant)
levels(ff$restaurant)
[1] "Arbys" "Burger King" "Chick Fil-A" "Dairy Queen" "Mcdonalds"
[6] "Sonic" "Subway" "Taco Bell"
The forcats package has a similar function, as_factor(), that could also be used here.
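The two functions are similar but not identical in how they order levels. The sketch below uses a made-up vector `x` (hypothetical, not from the fastfood data) to compare them: base R's as.factor() sorts the levels alphabetically, while forcats' as_factor() keeps the order in which the values first appear.

```r
library(forcats)

# Made-up values, deliberately not in alphabetical order (hypothetical data)
x <- c("Subway", "Arbys", "Sonic", "Arbys")

# Base R: levels are sorted alphabetically
base_levels <- levels(as.factor(x))
base_levels

# forcats: levels follow the order of first appearance
forcats_levels <- levels(as_factor(x))
forcats_levels
```

Which behavior is preferable depends on whether the order of the raw data carries meaning.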
5.4 Convert a number variable to factor
Sometimes data are entered into the computer using numeric codes such as 0 and 1. These codes stand for categories, such as “no” and “yes”. Sometimes we want to analyze these binary variables in two ways:
- For many statistical analyses, the data need to be numeric 0/1.
- For many graphics, the data need to be a factor, “no/yes”.
table(mtcars$am) # view which values are present
0 1
19 13
class(mtcars$am) # confirm what R thinks the data type is
[1] "numeric"
R thinks this variable is numeric, but we know that’s not the case. We can use the function factor() to convert the numeric variable am to a factor, applying labels to convert 0 to “automatic” and 1 to “manual”.
mtcars$transmission_type <- factor(mtcars$am,
  labels=c("automatic", "manual"))
The labels must be listed in the same order (left to right) as the factor levels themselves. Look back at the order of columns in the table - it goes 0 then 1. Thus our labels need to go “automatic” then “manual”.
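One way to avoid relying on the implied ordering is to state the levels explicitly alongside the labels, so that each code is visibly paired with its label. The following is a sketch on a made-up vector `am` (a hypothetical stand-in for mtcars$am), not a replacement for the lesson's code.

```r
# Made-up 0/1 codes standing in for a variable like mtcars$am
am <- c(1, 0, 0, 1, 0)

# Pair each code with its label explicitly: 0 -> automatic, 1 -> manual
transmission <- factor(am,
                       levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Cross-tabulate old against new to confirm the recode
table(am, transmission, useNA = "always")
```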
We can confirm that the new variable was created correctly by creating a two-way contingency table of the old and new variables, using table(old variable, new variable).
table(mtcars$am, mtcars$transmission_type, useNA="always")
automatic manual <NA>
0 19 0 0
1 0 13 0
<NA> 0 0 0
Here we see that all the 0’s were recoded to automatic, and all the 1’s recoded to manual, and there are no new missing values. Success!
5.5 Factor (re)naming
What if the variable is already a factor, but has names we don’t prefer, and we want them to say something else? We can accomplish this both in base R and with the forcats package.
In base R, re-factor the variable and apply new labels.
email$my_new_number <- factor(email$number,
  labels=c("None", "<1M", "1M+"))
table(email$number, email$my_new_number, useNA="always")
None <1M 1M+ <NA>
none 549 0 0 0
small 0 2827 0 0
big 0 0 545 0
<NA> 0 0 0 0
With forcats, use the fct_recode("NEW" = "old") function.
email$my_forcats_number <- fct_recode(email$number,
"BIG" = "big",
"NONE" = "none",
"SMALL" = "small")
table(email$number, email$my_forcats_number, useNA="always")
NONE SMALL BIG <NA>
none 549 0 0 0
small 0 2827 0 0
big 0 0 545 0
<NA> 0 0 0 0
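In the base R version, the big level is now labeled 1M+, none is labeled None, and small is labeled <1M. In the forcats version, the same levels are labeled BIG, NONE, and SMALL.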
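Notice in the forcats table that the columns still run NONE, SMALL, BIG even though the "NEW" = "old" pairs were typed in a different order. The sketch below reproduces that behavior on a made-up factor `number` (hypothetical, built to have the same three levels as email$number).

```r
library(forcats)

# Made-up factor with the same three levels as email$number
number <- factor(c("small", "none", "big", "small"),
                 levels = c("none", "small", "big"))

# The "NEW" = "old" pairs are listed in a different order than the levels
recoded <- fct_recode(number,
                      "BIG"   = "big",
                      "NONE"  = "none",
                      "SMALL" = "small")

# Level order follows the original factor, not the order of the pairs
levels(recoded)
```

Because the pairs are matched by name, their order does not matter here, unlike the labels argument of factor(), which is matched by position.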
5.6 Factor ordering
Let’s look back at the variable restaurant from the fast food (ff) data set.
levels(ff$restaurant)
[1] "Arbys" "Burger King" "Chick Fil-A" "Dairy Queen" "Mcdonalds"
[6] "Sonic" "Subway" "Taco Bell"
When R creates a factor from character data, it orders the levels alphabetically by default, so beware! You may need to correct the ordering for other data sets.
We need to take control of these factors! We can do that by re-factoring the existing factor variable, but this time specifying the levels of the factor (since it already has labels). Say we decide to order the restaurants by putting all the places that sell burgers together.
Original Order: Arbys, Burger King, Chick Fil-A, Dairy Queen, Mcdonalds, Sonic, Subway, Taco Bell
Desired Order: Arbys, Burger King, Dairy Queen, Mcdonalds, Sonic, Chick Fil-A, Subway, Taco Bell
Because the assignment operator (<-) is not used here, these changes are not made to the variable in the ff data set. Examples later in this lesson demonstrate making an adjustment to a factor variable and saving that adjustment as a new variable in the data set.
Use the factor function again, and write out each factor level in the desired order. Make sure you are spelling each level correctly.
# results not saved
factor(ff$restaurant,
levels=c("Arbys", "Burger King", "Dairy Queen", "Mcdonalds", "Sonic",
"Chick Fil-A", "Subway", "Taco Bell")) |>
table()
Arbys Burger King Dairy Queen Mcdonalds Sonic Chick Fil-A
55 70 42 57 53 27
Subway Taco Bell
96 115
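Spelling matters because factor() does not warn when a value fails to match any of the listed levels; it quietly turns that value into NA. The sketch below shows this on a made-up vector `x` (hypothetical data), with one level deliberately misspelled.

```r
# Made-up restaurant values (hypothetical data)
x <- c("Arbys", "Mcdonalds", "Sonic", "Mcdonalds")

# "McDonalds" is deliberately misspelled relative to the data ("Mcdonalds")
bad <- factor(x, levels = c("Arbys", "McDonalds", "Sonic"))

# The unmatched values show up in the <NA> column, with no warning issued
table(bad, useNA = "always")
```

Comparing the NA count before and after re-factoring is a quick way to catch this kind of mistake.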
Using the fct_relevel function, you only need to specify the levels that you want to move.
# results not saved
ff$restaurant |>
fct_relevel("Arbys", "Burger King", "Dairy Queen", "Mcdonalds", "Sonic") |>
fct_count() # new function - acts like table
# A tibble: 8 × 2
f n
<fct> <int>
1 Arbys 55
2 Burger King 70
3 Dairy Queen 42
4 Mcdonalds 57
5 Sonic 53
6 Chick Fil-A 27
7 Subway 96
8 Taco Bell 115
The fct_relevel function also has nice shortcuts, for example to move only one or two items while keeping the rest of the ordering. See ?fct_relevel for other options.
ff$restaurant |>
fct_relevel("Subway", after = Inf) |> # move to end
fct_relevel("Taco Bell") |> # move to front
levels()
[1] "Taco Bell" "Arbys" "Burger King" "Chick Fil-A" "Dairy Queen"
[6] "Mcdonalds" "Sonic" "Subway"
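The after argument can be easier to see on a tiny example. This sketch uses a made-up four-level factor `f` (hypothetical) to show the three placements: front (the default), end (after = Inf), and a specific position (after = 1).

```r
library(forcats)

# Made-up factor with levels a, b, c, d (hypothetical data)
f <- factor(c("a", "b", "c", "d"))

# Default: the named level moves to the front
to_front <- levels(fct_relevel(f, "c"))

# after = Inf: the named level moves to the end
to_end <- levels(fct_relevel(f, "a", after = Inf))

# after = 1: the named level is placed after the first remaining level
to_second <- levels(fct_relevel(f, "d", after = 1))

to_front
to_end
to_second
```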
5.7 Decreasing number of levels
For analysis purposes, sometimes you want to work with a smaller number of factor levels. Let’s look at the restaurants that are included in the fastfood data set.
table(ff$restaurant)
Arbys Burger King Chick Fil-A Dairy Queen Mcdonalds Sonic
55 70 27 42 57 53
Subway Taco Bell
96 115
5.7.1 Combining multiple categories into one
Let’s combine all the sandwich places together, and all the burger joints together. I am going to save this new variable as restaurant_new.
The syntax for the fct_collapse function is new level = "old level", where the “old level” is in quotes. As always, it is good practice to create a two way table to make sure the code typed does what we expected it to do.
ff$restaurant_new <- fct_collapse(ff$restaurant,
BurgerJoint = c("Burger King", "Mcdonalds", "Sonic"),
Sandwich = c("Arbys", "Subway"))
table(ff$restaurant, ff$restaurant_new, useNA="always")
Sandwich BurgerJoint Chick Fil-A Dairy Queen Taco Bell <NA>
Arbys 55 0 0 0 0 0
Burger King 0 70 0 0 0 0
Chick Fil-A 0 0 27 0 0 0
Dairy Queen 0 0 0 42 0 0
Mcdonalds 0 57 0 0 0 0
Sonic 0 53 0 0 0 0
Subway 96 0 0 0 0 0
Taco Bell 0 0 0 0 115 0
<NA> 0 0 0 0 0 0
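In keeping with this lesson's habit of showing a base R route next to the forcats one, here is one possible base R sketch of the same idea. It relies on the fact that assigning the same name to several levels merges them. The factor `r` is made up (hypothetical), so the ff data set is left untouched.

```r
# Made-up restaurant factor (hypothetical data)
r <- factor(c("Arbys", "Burger King", "Subway", "Taco Bell", "Sonic"))

# Assigning the same name to several levels merges them into one level
levels(r)[levels(r) %in% c("Burger King", "Sonic")] <- "BurgerJoint"
levels(r)[levels(r) %in% c("Arbys", "Subway")] <- "Sandwich"

# Levels not mentioned (Taco Bell) are left as they were
table(r)
```

The fct_collapse version is arguably easier to read, since each new level and its old levels appear together in one call.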
5.7.2 Keeping the most frequent categories
Sometimes we only want to keep the most frequent categories and lump the uncommon factor levels together into “Other”. The fct_lump_n function keeps the n most frequent levels and lumps the rest together.
fct_lump_n(ff$restaurant, n=5) |>
fct_count()
# A tibble: 6 × 2
f n
<fct> <int>
1 Arbys 55
2 Burger King 70
3 Mcdonalds 57
4 Subway 96
5 Taco Bell 115
6 Other 122
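The name of the lumped category can be changed with the other_level argument. The sketch below uses a made-up factor `x` and the label "Rare" (both hypothetical choices for illustration) to show the lumping on a case small enough to count by hand.

```r
library(forcats)

# Made-up factor: "a" appears 5 times, "b" 3 times, "c" and "d" once each
x <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the 2 most frequent levels; lump the rest under a custom label
lumped <- fct_lump_n(x, n = 2, other_level = "Rare")

# The lumped level is placed last
fct_count(lumped)
```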
5.7.3 Removing categories entirely
Sometimes, you don’t even want to consider certain levels. This often occurs in survey data where the respondent provides an answer of “Refuse to answer” or the data is coded as the word “missing”. The word “missing” is fundamentally different from the NA code for a missing value.
For demonstration purposes, let’s get rid of the data from DQ. Who eats something other than ice cream at that place anyhow?
ff$restaurant[ff$restaurant == "Dairy Queen"] <- NA
table(ff$restaurant)
Arbys Burger King Chick Fil-A Dairy Queen Mcdonalds Sonic
55 70 27 0 57 53
Subway Taco Bell
96 115
Even though there are no records with the level Dairy Queen, the level itself is still there. R does not assume that a level should be removed just because no records use it. We use the function fct_drop to drop the levels with no records.
ff$restaurant <- fct_drop(ff$restaurant)
table(ff$restaurant)
Arbys Burger King Chick Fil-A Mcdonalds Sonic Subway
55 70 27 57 53 96
Taco Bell
115
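Base R offers droplevels() for the same purpose. The following sketch repeats the steps above on a made-up factor `r` (hypothetical data), in case forcats is not loaded.

```r
# Made-up factor with three levels (hypothetical data)
r <- factor(c("Arbys", "Dairy Queen", "Sonic", "Dairy Queen"))

# Set one level's records to missing; the empty level is still listed
r[r == "Dairy Queen"] <- NA
levels(r)

# droplevels() removes levels that no longer have any records
r <- droplevels(r)
levels(r)
```

Note that the records themselves remain in the data as NA; only the unused level name is removed.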
