5  Working with Factors

In this lesson we will discuss ways to organize and deal with categorical data, also known as factor data types.

🎓 Learning Objectives

After completing this lesson students will be able to

  • Convert a numeric variable to a factor variable.
  • Apply and change the labels of a factor.
  • Understand and control the ordering of the factor.
  • Combine multiple levels of a factor variable into one level.
  • Learn how to use the forcats package.

👉 Prepare
  1. Open your Math 130 R Project.
  2. Right click and “save as” this lesson’s [Quarto notes file] and save into your Math130/notes folder.
  3. In the Files pane, open this Quarto file and Render this file.

The email data set contains information on emails received by one of the OpenIntro authors for the first three months in 2012. See ?email for more details.

The fastfood data set from the openintro package describes nutrition amounts in 515 fast food items. See ?fastfood for more details.

The mtcars data set contains data from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. See ?mtcars for more details.

library(forcats)
email <- openintro::email
ff    <- openintro::fastfood
nc    <- openintro::ncbirths # note the name change
mtcars <- mtcars # comes with R

The goal of the forcats package is to provide a suite of useful tools that solve common problems with factors. Often in R there are multiple ways to accomplish the same task. Some examples in this lesson will show how to perform a certain task using base R functions, as well as functions from the forcats package.

5.1 What is a factor?

The term factor refers to a data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable corresponds to a limited number of categories, while a continuous variable can correspond to an infinite number of values.

An example of a categorical variable is the number variable in the email data set. This variable contains data on whether there was no number, a small number (under 1 million), or a big number in the content of the email.

5.2 Confirming the factor data type

First we should confirm that R sees number as a factor. We can use the class function we saw earlier, or str.

class(email$number)
[1] "factor"

We can use the levels() function to get to know factor variables.

levels(email$number)
[1] "none"  "small" "big"  

There are three levels: none, small, and big.

Character data types

Let’s look at the variable restaurant from the fast food (ff) data set.

levels(ff$restaurant)
NULL

Wait - NULL? But this is a categorical variable.

class(ff$restaurant)
[1] "character"

There is a subtle difference between factor and character data types in R. Both are categorical measures (not numbers), but factor variables have an assigned order and character variables do not. If we want to specifically control the ordering of the levels, the data type must be factor.
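As a minimal sketch of this difference, the toy vector below (hypothetical values, not taken from the lesson's data sets) is stored first as character and then as a factor with an explicitly chosen level order. Only base R functions are used.

```r
# Hypothetical toy data, not from the email or fastfood data sets
sizes_chr <- c("small", "big", "none", "small")

# A character vector has no levels attribute
levels(sizes_chr)   # NULL

# A factor stores a fixed set of levels in the order we specify
sizes_fct <- factor(sizes_chr, levels = c("none", "small", "big"))
levels(sizes_fct)   # "none" "small" "big"

# Internally each value is stored as the integer position of its level
as.integer(sizes_fct)   # 2 3 1 2
```

The integer codes are what let R sort, plot, and tabulate the values in the level order rather than in the order they happen to appear.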

5.3 Convert a character variable to factor

We can use the as.factor() function.

ff$restaurant <- as.factor(ff$restaurant)
levels(ff$restaurant)
[1] "Arbys"       "Burger King" "Chick Fil-A" "Dairy Queen" "Mcdonalds"  
[6] "Sonic"       "Subway"      "Taco Bell"  

The forcats package has a similar function as_factor() that could be used here also.
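One difference worth knowing about: for character input, as.factor() sorts the levels alphabetically, while as_factor() keeps them in order of first appearance. The sketch below shows this on a hypothetical toy vector (the names places, base_levels, and forcats_levels are made up for illustration).

```r
library(forcats)

# Hypothetical toy data
places <- c("Sonic", "Arbys", "Subway", "Arbys")

# Base R: levels are sorted alphabetically
base_levels <- levels(as.factor(places))
base_levels      # "Arbys" "Sonic" "Subway"

# forcats: levels follow the order of first appearance in the data
forcats_levels <- levels(as_factor(places))
forcats_levels   # "Sonic" "Arbys" "Subway"
```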

5.4 Convert a number variable to factor

Sometimes data are entered into the computer using numeric codes such as 0 and 1. These codes stand for categories, such as “no” and “yes”. Sometimes we want to analyze these binary variables in two ways:

  • For statistical analyses, the data must be numeric 0/1.
  • For many graphics, the data must be a factor, “no/yes”.
Example: What type of transmission does that car have?

The am variable from the mtcars data set records whether the car has an automatic or a manual transmission. However, the values are recorded as numeric codes: 0 for automatic and 1 for manual.

table(mtcars$am) # view which values are present

 0  1 
19 13 
class(mtcars$am) # confirm what R thinks the data type is
[1] "numeric"

R thinks this variable is numeric, but we know that’s not the case. We can use the function factor() to convert the numeric variable am to a factor, applying labels to convert 0 to “automatic” and 1 to “manual”.

mtcars$transmission_type <- factor(mtcars$am, 
                                   labels=c("automatic", "manual"))

The ordering of the labels argument must be in the same order (left to right) as the factor levels themselves. Look back at the order of columns in the table - it goes 0 then 1. Thus our labels need to go “automatic” then “manual”.
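One way to avoid relying on the default ordering is to spell out the levels argument alongside labels, so each code is paired with its label by position. This is a sketch on hypothetical toy codes (am_codes and with_extra are illustrative names), not a replacement for the code above.

```r
# Hypothetical toy codes, standing in for a 0/1 variable such as mtcars$am
am_codes <- c(1, 0, 0, 1, 0)

# Naming the levels explicitly pairs each code with its label by position
transmission <- factor(am_codes,
                       levels = c(0, 1),
                       labels = c("automatic", "manual"))
table(am_codes, transmission, useNA = "always")

# A code that is not listed in levels becomes NA rather than being mislabeled
with_extra <- factor(c(0, 1, 2),
                     levels = c(0, 1),
                     labels = c("automatic", "manual"))
with_extra
```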

Trust but verify

We can confirm that the new variable was created correctly by creating a two-way contingency table of the old and new variables with table(old variable, new variable).

table(mtcars$am, mtcars$transmission_type, useNA="always")
      
       automatic manual <NA>
  0           19      0    0
  1            0     13    0
  <NA>         0      0    0

Here we see that all the 0’s were recoded to automatic, and all the 1’s recoded to manual, and there are no new missing values. Success!

5.5 Factor (re)naming

What if the variable is already a factor, but has level names we don’t prefer and we want them to say something else? We can accomplish this both in base R and with the forcats package.

In base R, re-factor the variable and apply new labels.

email$my_new_number <- factor(email$number, 
                              labels=c( "None", "<1M","1M+"))
table(email$number, email$my_new_number, useNA="always")
       
        None  <1M  1M+ <NA>
  none   549    0    0    0
  small    0 2827    0    0
  big      0    0  545    0
  <NA>     0    0    0    0

With forcats, use the fct_recode() function, which takes "NEW" = "old" pairs.

email$my_forcats_number <- fct_recode(email$number, 
                                      "BIG" = "big", 
                                      "NONE" = "none", 
                                      "SMALL" = "small")

table(email$number, email$my_forcats_number, useNA="always")
       
        NONE SMALL  BIG <NA>
  none   549     0    0    0
  small    0  2827    0    0
  big      0     0  545    0
  <NA>     0     0    0    0

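In the base R version, the big level is now labeled 1M+, none is labeled None, and small is labeled <1M. In the fct_recode version, the levels are relabeled BIG, NONE, and SMALL, and they keep their original order even though the pairs were listed in a different order.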
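With fct_recode you only have to list the levels you want to rename; any level not mentioned is left as is. A minimal sketch on a hypothetical toy factor (number and number_renamed are illustrative names that mimic the levels of email$number):

```r
library(forcats)

# Hypothetical toy factor with the same levels as email$number
number <- factor(c("none", "small", "big", "small"),
                 levels = c("none", "small", "big"))

# Only "big" is renamed; "none" and "small" are left unchanged
number_renamed <- fct_recode(number, "1M+" = "big")
levels(number_renamed)   # "none" "small" "1M+"

# Trust but verify
table(number, number_renamed, useNA = "always")
```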

5.6 Factor ordering

Let’s look back at the variable restaurant from the fast food (ff) data set.

levels(ff$restaurant)
[1] "Arbys"       "Burger King" "Chick Fil-A" "Dairy Queen" "Mcdonalds"  
[6] "Sonic"       "Subway"      "Taco Bell"  

When a character variable is converted to a factor, R orders the levels alphabetically by default, so beware! You may need to correct the ordering for other data sets.

We need to take control of these factors! We can do that by re-factoring the existing factor variable, but this time specifying the levels of the factor (since it already has labels). Say we decide to order the restaurants by putting all the places that sell burgers together.

Original Order: Arbys, Burger King, Chick Fil-A, Dairy Queen, Mcdonalds, Sonic, Subway, Taco Bell

Desired Order: Arbys, Burger King, Dairy Queen, Mcdonalds, Sonic, Chick Fil-A, Subway, Taco Bell

The two examples below do not use the assignment operator (<-), so the reordering is displayed but not saved to the variable in the ff data set. Later examples in this lesson demonstrate making an adjustment to a factor variable and saving that adjustment as a new variable in the data set.

Use the factor function again, and write out each factor level in the desired order. Make sure you are spelling each level correctly.

# results not saved
factor(ff$restaurant, 
       levels=c("Arbys", "Burger King", "Dairy Queen", "Mcdonalds", "Sonic", 
                "Chick Fil-A", "Subway", "Taco Bell")) |> 
  table()

      Arbys Burger King Dairy Queen   Mcdonalds       Sonic Chick Fil-A 
         55          70          42          57          53          27 
     Subway   Taco Bell 
         96         115 
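Spelling matters because factor() does not warn about a value that matches none of the listed levels; it quietly turns that value into NA. A small sketch on hypothetical toy data (places and misspelled are illustrative names):

```r
# Hypothetical toy data
places <- c("Arbys", "Subway", "Taco Bell", "Subway")

# "Subwy" is a deliberate misspelling of "Subway"
misspelled <- factor(places, levels = c("Arbys", "Subwy", "Taco Bell"))

# Values that match no listed level silently become NA
table(places, misspelled, useNA = "always")
sum(is.na(misspelled))   # 2
```

This is another reason to check the result with a two-way table and useNA="always" after re-factoring.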

Using the fct_relevel function, you only need to specify the levels that you want to move.

# results not saved
ff$restaurant |> 
  fct_relevel("Arbys", "Burger King", "Dairy Queen", "Mcdonalds", "Sonic") |> 
  fct_count() # new function - acts like table
# A tibble: 8 × 2
  f               n
  <fct>       <int>
1 Arbys          55
2 Burger King    70
3 Dairy Queen    42
4 Mcdonalds      57
5 Sonic          53
6 Chick Fil-A    27
7 Subway         96
8 Taco Bell     115

The fct_relevel function also has nice shortcuts, for example to keep the rest of the ordering and move only one or two items. See ?fct_relevel for other options.

ff$restaurant |> 
  fct_relevel("Subway", after = Inf) |> # move to end
  fct_relevel("Taco Bell") |> # move to front
  levels()
[1] "Taco Bell"   "Arbys"       "Burger King" "Chick Fil-A" "Dairy Queen"
[6] "Mcdonalds"   "Sonic"       "Subway"     

5.7 Decreasing number of levels

For analysis purposes, sometimes you want to work with a smaller number of factor levels. Let’s look at the restaurants that are included in the fastfood data set.

table(ff$restaurant)

      Arbys Burger King Chick Fil-A Dairy Queen   Mcdonalds       Sonic 
         55          70          27          42          57          53 
     Subway   Taco Bell 
         96         115 

5.7.1 Combining multiple categories into one

Let’s combine all the sandwich and burger joints together. I am going to save this new variable as restaurant_new.

The syntax for the fct_collapse function is new level = "old level", where each “old level” is in quotes and several old levels can be combined with c(). As always, it is good practice to create a two-way table to make sure the code typed does what we expected it to do.

ff$restaurant_new <- fct_collapse(ff$restaurant, 
                                    BurgerJoint = c("Burger King", "Mcdonalds", "Sonic"), 
                                    Sandwich = c("Arbys", "Subway"))

table(ff$restaurant, ff$restaurant_new, useNA="always")
             
              Sandwich BurgerJoint Chick Fil-A Dairy Queen Taco Bell <NA>
  Arbys             55           0           0           0         0    0
  Burger King        0          70           0           0         0    0
  Chick Fil-A        0           0          27           0         0    0
  Dairy Queen        0           0           0          42         0    0
  Mcdonalds          0          57           0           0         0    0
  Sonic              0          53           0           0         0    0
  Subway            96           0           0           0         0    0
  Taco Bell          0           0           0           0       115    0
  <NA>               0           0           0           0         0    0
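For comparison, here is a rough base R sketch of the same idea on a hypothetical toy factor (places, places_new, and burger are illustrative names). It relies on the fact that assigning the same name to several levels merges them. This is an alternative technique, not a description of what fct_collapse does internally.

```r
# Hypothetical toy data
places <- factor(c("Arbys", "Sonic", "Subway", "Mcdonalds", "Taco Bell"))
places_new <- places   # work on a copy so the original is kept for checking

# Give both burger places the same level name, which merges them
burger <- c("Mcdonalds", "Sonic")
levels(places_new)[levels(places_new) %in% burger] <- "BurgerJoint"

# Trust but verify
table(places, places_new, useNA = "always")
```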

5.7.2 Keeping the most frequent categories

Sometimes we only want to keep the most frequent categories and lump the uncommon factor levels together into “Other”. The fct_lump_n function keeps the n most frequent levels and combines the rest.

fct_lump_n(ff$restaurant, n=5) |>
  fct_count()
# A tibble: 6 × 2
  f               n
  <fct>       <int>
1 Arbys          55
2 Burger King    70
3 Mcdonalds      57
4 Subway         96
5 Taco Bell     115
6 Other         122
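To see the lumping on a scale that is easy to count by hand, here is a sketch on a hypothetical toy factor (letters_fct, lumped, and lumped_rare are illustrative names). It also uses the other_level argument to rename the lumped category.

```r
library(forcats)

# Hypothetical toy data: "a" appears 3 times, "b" twice, "c" and "d" once
letters_fct <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Keep the 2 most frequent levels; everything else becomes "Other"
lumped <- fct_lump_n(letters_fct, n = 2)
fct_count(lumped)

# other_level changes the name of the lumped category
lumped_rare <- fct_lump_n(letters_fct, n = 2, other_level = "Rare")
levels(lumped_rare)   # "a" "b" "Rare"
```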

5.7.3 Removing categories entirely

Sometimes, you don’t even want to consider certain levels. This often occurs in survey data where the respondent provides an answer of “Refuse to answer” or the data is coded as the word “missing”. The word “missing” is fundamentally different than the NA code for a missing value.

For demonstration purposes, let’s get rid of the data from DQ. Who eats something other than ice cream at that place anyhow?

ff$restaurant[ff$restaurant == "Dairy Queen"] <- NA
table(ff$restaurant)

      Arbys Burger King Chick Fil-A Dairy Queen   Mcdonalds       Sonic 
         55          70          27           0          57          53 
     Subway   Taco Bell 
         96         115 

Even though there are no records with the level Dairy Queen, the level itself is still there. R does not assume that a named level should be removed just because no records have that level. We use the function fct_drop to drop the levels with no records.

ff$restaurant <- fct_drop(ff$restaurant)
table(ff$restaurant)

      Arbys Burger King Chick Fil-A   Mcdonalds       Sonic      Subway 
         55          70          27          57          53          96 
  Taco Bell 
        115
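The sketch below ties these two ideas together on hypothetical toy survey answers (answers and answers_clean are illustrative names): the word “missing” is not an NA until we recode it, and the empty level stays until it is dropped. It uses base R's droplevels(), which does a similar job to fct_drop.

```r
# Hypothetical toy survey answers; "missing" is an ordinary text value here
answers <- factor(c("yes", "no", "missing", "yes"))

sum(is.na(answers))   # 0: the word "missing" is not an NA

# Recode the word "missing" to a true missing value
answers[answers == "missing"] <- NA
sum(is.na(answers))   # 1
levels(answers)       # "missing" is still listed as a level

# droplevels() removes levels that no longer have any records
answers_clean <- droplevels(answers)
levels(answers_clean) # "no" "yes"
```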