<- read.table("data/email.txt", header=TRUE, sep="\t") email
9 Getting data into R
9.1 Introduction (Video)
In this lesson we will introduce ways to get data from external files into R, how R works with data, and how to use functions to explore your data frame.
After completing this lesson learners will be able to
- Import data that is stored in an external Excel, CSV or tab-delimited text file into R.
- Compare and contrast the three file types discussed in this lesson.
- Use functions to examine data objects to ensure data was imported correctly.
- Download the [09_import_notes.Rmd] R markdown file and save into your
Math130/notes
folder. - Install the following packages:
gridExtra
,RColorBrewer
andsjPlot
. - Right click each link and save the following data sets into your
Math130/data
folder.
:x: DO NOT RENAME ANYTHING If you download multiple copies of the data, do not keep the versions that have a (1) or some number in the name. Do not leave these files in your downloads folder.
9.2 File types (Video)
In this lesson we are only going to explore reading files that exist on your computer into R from three most commonly used data sources: A tab-delimited text file, A CSV file and an Excel file. The three different files we will be using have different file types, or extensions.
email.txt
is a.txt
or “text” file.NCBirths.csv
is a.csv
or “comma separated values” file.fatal-police-shootings-data.xlsx
is a.xlsx
or Excel file.
Each of these file types differ in the type of delimiter used. The delimiter is a character or symbol that separates columns of data from each other when stored in an external file. Recall back to the earlier lesson on data frames and matrices. Each column in the matrix represented data on a specific variable. Something had to tell R how to distinguish which values went with which variable.
There are two main types of delimiters we will consider in this class; comma and tab. That does not mean that data can’t be stored in other ways, these are just the two most common.
Each of these different data types requires a different function or mechanism to import the data into R. If you use the wrong mechanism, the data may not be read in correctly if at all.
9.3 General Importing data (Video)
To import data into R, we have to tell the program explicitly where to find the files you just downloaded. To do that we need to find the file’s path. The path is a programmatic way to direct your computer to the location of a file. It’s like the file’s address, or, where it lives on your computer.
In all the examples below, the path shown (data/
) is the path on MY computer. You will have to update this path to YOUR path on YOUR machine that points to your MATH130 folder.
You can find the path of a file by navigating to the desired file in your finder or browser window. Right click and copy the file to the clipboard, then right click and pasting into a R markdown file. This will show a path that looks something like this:
:///C:/Math 130/Data/Challenger.txt file
You will need to remove the file:///
part at the beginning before using this path. You will have three chances in this lab to practice this.
9.3.1 Special Instructions for Mac
- Navigate to the file or folder you wish to copy the path for Right-click (or Control+Click, or a Two-Finger click on trackpads) on the file or folder in the Mac Finder
- While in the right-click menu, hold down the OPTION key to reveal the “Copy (item name) as Pathname” option, which replaces the standard Copy option. Once selected, the file or folders path is now in the clipboard, ready to be pasted anywhere
Source: https://stackoverflow.com/questions/52695546/how-to-copy-path-of-a-file-in-mac-os
9.4 Checking the import was successful
The first thing you should always do after importing a data file is look at the raw data and ask yourself the following things:
- Were the variable names read in correctly?
- Is there the expected number of rows and columns?
- Are the data types for each variable as expected?
There are many ways this can be done. Here are three:
str()
to show you the structure of the data frame.head()
to show you the top 6 rowsdata[1:x, 1:x]
- when the data frame is too large for the above two to work well, we can use matrix notation to view only the first X columns, and first X rows.
Next we will read in three different data sets, each with unique file types, and look at each one to make sure it was read in correctly.
9.5 Text files
Text files are very simple files that have a .txt
file extension. Common delimiters include a space, a comma (,) or a tab. Uncommon delimiters could include a % or even a semi-colon. By opening the file directly on your computer (not in R) you can see what delimiter is being used.
We will use the read.table()
function that is in base R to read in any type of delimited file. A tab-delimited text file can be read in using "\t"
as the deliminator character. In this class you ALWAYS want to include header=TRUE
to signify that the data in the first row contains our column names.
Here we call the str()
or structure function to examine the data that was imported.
str(email)
'data.frame': 3921 obs. of 21 variables:
$ spam : int 0 0 0 0 0 0 0 0 0 0 ...
$ to_multiple : int 0 0 0 0 0 0 1 1 0 0 ...
$ from : int 1 1 1 1 1 1 1 1 1 1 ...
$ cc : int 0 0 0 0 0 0 0 1 0 0 ...
$ sent_email : int 0 0 0 0 0 0 1 1 0 0 ...
$ time : chr "2011-12-31 22:16:41" "2011-12-31 23:03:59" "2012-01-01 08:00:32" "2012-01-01 01:09:49" ...
$ image : int 0 0 0 0 0 0 0 1 0 0 ...
$ attach : int 0 0 0 0 0 0 0 1 0 0 ...
$ dollar : int 0 0 4 0 0 0 0 0 0 0 ...
$ winner : chr "no" "no" "no" "no" ...
$ inherit : int 0 0 1 0 0 0 0 0 0 0 ...
$ viagra : int 0 0 0 0 0 0 0 0 0 0 ...
$ password : int 0 0 0 0 2 2 0 0 0 0 ...
$ num_char : num 11.37 10.5 7.77 13.26 1.23 ...
$ line_breaks : int 202 202 192 255 29 25 193 237 69 68 ...
$ format : int 1 1 1 1 0 0 1 1 0 1 ...
$ re_subj : int 0 0 0 0 0 0 0 0 0 0 ...
$ exclaim_subj: int 0 0 0 0 0 0 0 0 0 0 ...
$ urgent_subj : int 0 0 0 0 0 0 0 0 0 0 ...
$ exclaim_mess: int 0 1 6 48 1 1 1 18 1 0 ...
$ number : chr "big" "small" "small" "small" ...
- The
email
data set has 3921 observations, and 21 variables. - The variable names were read in correctly, as evidenced by the words in the far right corner after the
$
. - Most variables are treated as numeric integers, except a few are factors. Including
time
. If we wanted to examine a time trend, we would need to convert the class into adatetime
What happens if you forget to include the arguments for
header
orsep
? Try that now and discuss what happened.
Forgetting both arguments
<- read.table("data/email.txt") email
Error in read.table("data/email.txt"): duplicate 'row.names' are not allowed
Forgetting the header
<- read.table("data/email.txt", sep="\t")
email head(email)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 spam to_multiple from cc sent_email time image attach dollar
2 0 0 1 0 0 2011-12-31 22:16:41 0 0 0
3 0 0 1 0 0 2011-12-31 23:03:59 0 0 0
4 0 0 1 0 0 2012-01-01 08:00:32 0 0 4
5 0 0 1 0 0 2012-01-01 01:09:49 0 0 0
6 0 0 1 0 0 2012-01-01 02:00:01 0 0 0
V10 V11 V12 V13 V14 V15 V16 V17
1 winner inherit viagra password num_char line_breaks format re_subj
2 no 0 0 0 11.37 202 1 0
3 no 0 0 0 10.504 202 1 0
4 no 1 0 0 7.773 192 1 0
5 no 0 0 0 13.256 255 1 0
6 no 0 0 2 1.231 29 0 0
V18 V19 V20 V21
1 exclaim_subj urgent_subj exclaim_mess number
2 0 0 0 big
3 0 0 1 small
4 0 0 6 small
5 0 0 48 small
6 0 0 1 none
Forgetting the separator
<- read.table("data/email.txt", header=TRUE) email
Error in read.table("data/email.txt", header = TRUE): duplicate 'row.names' are not allowed
head(email)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 spam to_multiple from cc sent_email time image attach dollar
2 0 0 1 0 0 2011-12-31 22:16:41 0 0 0
3 0 0 1 0 0 2011-12-31 23:03:59 0 0 0
4 0 0 1 0 0 2012-01-01 08:00:32 0 0 4
5 0 0 1 0 0 2012-01-01 01:09:49 0 0 0
6 0 0 1 0 0 2012-01-01 02:00:01 0 0 0
V10 V11 V12 V13 V14 V15 V16 V17
1 winner inherit viagra password num_char line_breaks format re_subj
2 no 0 0 0 11.37 202 1 0
3 no 0 0 0 10.504 202 1 0
4 no 1 0 0 7.773 192 1 0
5 no 0 0 0 13.256 255 1 0
6 no 0 0 2 1.231 29 0 0
V18 V19 V20 V21
1 exclaim_subj urgent_subj exclaim_mess number
2 0 0 0 big
3 0 0 1 small
4 0 0 6 small
5 0 0 48 small
6 0 0 1 none
9.6 CSV: Comma Separated Values
CSV is a fancy way of saying a text file with comma-separated values (i.e. CSV). This file type can also open in Excel–the icons even look similar. We could use read.table()
to import this data file, but read.csv()
is optimized to read in CSV files.
<- read.csv("data/NCbirths.csv", header=TRUE)
NCbirths head(NCbirths)
fage mage mature weeks premie visits marital gained weight
1 NA 13 younger mom 39 full term 10 married 38 7.63
2 NA 14 younger mom 42 full term 15 married 20 7.88
3 19 15 younger mom 37 full term 11 married 38 6.63
4 21 15 younger mom 41 full term 6 married 34 8.00
5 NA 15 younger mom 39 full term 9 married 27 6.38
6 NA 15 younger mom 38 full term 19 married 22 5.38
lowbirthweight gender habit whitemom
1 not low male nonsmoker not white
2 not low male nonsmoker not white
3 not low female nonsmoker white
4 not low male nonsmoker white
5 not low female nonsmoker not white
6 low male nonsmoker not white
- The
NCbirths
data set has 1000 rows and 13 columns. - Variable names are read in OK.
- Most the data is begin read in as factors and integers.
- There are some
NA
values for the variablefage
(fathers age).
9.7 Excel files
The best method to read in Excel files is from the readxl
package by Hadley Wickham. This package need to be installed first, and then can be simply loaded using the library()
function each time you start an R
session where you will be reading in this type of data.
The read_excel()
function is what we are going to use. Note the use of the underscore _
instead of a period .
between read
and excel
.
library(readxl)
<- read_excel("data/fatal-police-shootings-data.xlsx", sheet=1, col_names=TRUE)
police 1:10,1:5] police[
# A tibble: 10 × 5
id name date manner_of_death armed
<dbl> <chr> <dttm> <chr> <chr>
1 3 Tim Elliot 2015-01-02 00:00:00 shot gun
2 4 Lewis Lee Lembke 2015-01-02 00:00:00 shot gun
3 5 John Paul Quintero 2015-01-03 00:00:00 shot and Tasered unarmed
4 8 Matthew Hoffman 2015-01-04 00:00:00 shot toy weapon
5 9 Michael Rodriguez 2015-01-04 00:00:00 shot nail gun
6 11 Kenneth Joe Brown 2015-01-04 00:00:00 shot gun
7 13 Kenneth Arnold Buck 2015-01-05 00:00:00 shot gun
8 15 Brock Nichols 2015-01-06 00:00:00 shot gun
9 16 Autumn Steele 2015-01-06 00:00:00 shot unarmed
10 17 Leslie Sapp III 2015-01-06 00:00:00 shot toy weapon
- The variable
date
is adttm
or date-time variable. This meansR
recognizes it directly as a date, not some string of numbers. - Categorical variables such as
manner_of_death
andcity
are read in ascharacter
instead of factor. We may or may not want to change that later. - Numeric variables such as
age
are of typedbl
(double). This is similar tointeger
ornumeric
, so we are fine.