Data Frames

Reading Assignments

Why Data Frames?

  • Learning how to manipulate data frames is among the most important data computing skills in R.
  • Two primary approaches for manipulating data frames:
    • In base R, i.e. the “traditional” or “classic” approach \(\Leftarrow\) covered in this chapter
    • In tidyverse, i.e. a modern version \(\Leftarrow\) Discussed later in this class

Data Frames

  • A data frame is a special type of R list.
  • In most cases, a data frame is internally stored as a list of vectors or factors, columnwise.

Example:

#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Dimension of Data Frames

Check the Dimension of mtcars

dim(mtcars)
#> [1] 32 11
nrow(mtcars)
#> [1] 32
ncol(mtcars)
#> [1] 11

There are 32 rows and 11 columns in mtcars.

Structure

Show the structure or summary of mtcars

summary(mtcars)
#>       mpg             cyl             disp             hp       
#>  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
#>  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
#>  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
#>  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
#>  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
#>  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
#>       drat             wt             qsec             vs        
#>  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
#>  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
#>  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
#>  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
#>  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
#>  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
#>        am              gear            carb      
#>  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
#>  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
#>  Median :0.0000   Median :4.000   Median :2.000  
#>  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
#>  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
#>  Max.   :1.0000   Max.   :5.000   Max.   :8.000
str(mtcars)
#> 'data.frame':    32 obs. of  11 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ disp: num  160 160 108 258 360 ...
#>  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ qsec: num  16.5 17 18.6 19.4 17 ...
#>  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#>  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
attributes(mtcars)
#> $names
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"
#> 
#> $row.names
#>  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
#>  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
#>  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
#> [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
#> [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
#> [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
#> [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
#> [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
#> [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
#> [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
#> [31] "Maserati Bora"       "Volvo 142E"         
#> 
#> $class
#> [1] "data.frame"

Names

Column Names

names(mtcars)
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"
colnames(mtcars)
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"

Row Names

rownames(mtcars)
#>  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
#>  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
#>  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
#> [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
#> [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
#> [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
#> [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
#> [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
#> [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
#> [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
#> [31] "Maserati Bora"       "Volvo 142E"

Both Column and Row Names

dimnames(mtcars)
#> [[1]]
#>  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
#>  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
#>  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
#> [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
#> [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
#> [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
#> [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
#> [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
#> [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
#> [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
#> [31] "Maserati Bora"       "Volvo 142E"         
#> 
#> [[2]]
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"
attributes(mtcars)
#> $names
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"
#> 
#> $row.names
#>  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
#>  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
#>  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
#> [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
#> [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
#> [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
#> [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
#> [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
#> [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
#> [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
#> [31] "Maserati Bora"       "Volvo 142E"         
#> 
#> $class
#> [1] "data.frame"

Sample

head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
head(mtcars, 10)
#>                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
tail(mtcars)
#>                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
#> Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
tail(mtcars, 2)
#>                mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
#> Volvo 142E    21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

Selecting elements in Data Frames

Use data.frame dat as an example below.

dat <- data.frame(
  name = c('Leia', 'Luke', 'Han'),
  gender = c('female', 'male', 'male'),
  height = c(1.50, 1.72, 1.80),
  jedi = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

dat
#>   name gender height  jedi
#> 1 Leia female   1.50 FALSE
#> 2 Luke   male   1.72  TRUE
#> 3  Han   male   1.80 FALSE

Select Cell/Row/Column \(\Rightarrow\) similar to matrix operations

dataframe[RowIndex, ColIndex]

# select value in row 1 and column 1
dat[1,1]

# select values in these cells
dat[1:2,3:4] # Row 1 to 2, col 3 to 4
dat[2:3, c(1,4)] # Row 2 to 3, col 1 and 4
dat[-2, -3] # Not row 2 or col 3
dat[1, ] # selecting first row
dat[-2, ] # selecting rows except row 2
dat[, 3] # selecting third column
dat[, -1] # selecting columns except col 1
  • More Options to select columns

Five equivalent methods

mtcars$mpg # Method I
mtcars[,1] # Method II
mtcars[[1]] # Method III
mtcars[, "mpg"] # Method IV
mtcars[["mpg"]] # Method V
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4

Modifying Data Frames

Temporarily vs Permanently Modifying the Values

dat[, -1]
#>   gender height  jedi
#> 1 female   1.50 FALSE
#> 2   male   1.72  TRUE
#> 3   male   1.80 FALSE
dat
#>   name gender height  jedi
#> 1 Leia female   1.50 FALSE
#> 2 Luke   male   1.72  TRUE
#> 3  Han   male   1.80 FALSE

Q: Why the first column of dat was NOT deleted?

A: dat[, -1] only temporarily modifies the values. To make the permanently change, we should use dat <- dat[, -1]. That is, new data frame overwrites the old one. Of course, you can assign it to a new name too, like dat_new <- dat[, -1].

Similar Examples:

# vector of weights
weight <- c(49, 77, 85)

# adding weights vector to dat
dat <- cbind(dat, weight)
dat
#>   name gender height  jedi weight
#> 1 Leia female   1.50 FALSE     49
#> 2 Luke   male   1.72  TRUE     77
#> 3  Han   male   1.80 FALSE     85
Direct modifying the column commands
# Add new_column permanently
dat$new_column <- c('a', 'e', 'i')
dat
#>   name gender height  jedi weight new_column
#> 1 Leia female   1.50 FALSE     49          a
#> 2 Luke   male   1.72  TRUE     77          e
#> 3  Han   male   1.80 FALSE     85          i

# Delete new_column permanently
dat$new_column <- NULL
dat
#>   name gender height  jedi weight
#> 1 Leia female   1.50 FALSE     49
#> 2 Luke   male   1.72  TRUE     77
#> 3  Han   male   1.80 FALSE     85

Modifying Data Frames (continue)

Adding/Deleting columns

Adding Columns Method I:

dat$new_column <- c('a', 'e', 'i')
dat
#>   name gender height  jedi weight new_column
#> 1 Leia female   1.50 FALSE     49          a
#> 2 Luke   male   1.72  TRUE     77          e
#> 3  Han   male   1.80 FALSE     85          i

Adding Columns Method II:

# vector of weights
weight <- c(49, 77, 85)

# adding weights vector to dat
dat <- cbind(dat, weight)
dat
#>   name gender height  jedi weight new_column weight
#> 1 Leia female   1.50 FALSE     49          a     49
#> 2 Luke   male   1.72  TRUE     77          e     77
#> 3  Han   male   1.80 FALSE     85          i     85

Deleting Columns

dat$weight <- NULL
dat

Renaming a column

names(dat)
#> [1] "name"       "gender"     "height"     "jedi"       "weight"    
#> [6] "new_column" "weight"

# changing gender to sex
attributes(dat)$names[2] <- "sex"
# Equivalently,
names(dat)[2] <- "sex"
colnames(dat)[2] <- "Sex"

#names(dat)

Moving Columnns

reordered_names <- c("name", "jedi", "height", "weight", "sex")
dat <- dat[ ,reordered_names]
dat
#>   name  jedi height weight    sex
#> 1 Leia FALSE   1.50     49 female
#> 2 Luke  TRUE   1.72     77   male
#> 3  Han FALSE   1.80     85   male

Transforming Columns

Recall 1 kg = 2.20462 pounds (i.e. 1 pounds = 0.453592 kg)

dat$height <- dat$height * 100 # converting height to centimeters

dat[ ,"weight"] <- dat[ ,"weight"] * 2.20462 # converting weight to pounds

dat <- transform(dat, weight = weight * 0.453592) # Converting weight back to kgs

Creating Data Frames

Majority of the time, data frames are read from external data files, or are built-in data frames in R packages.

From time to time, you need to create some data table manually.

  • Option 1: data.frame()
dat <- data.frame(
  name = c('Anakin', 'Padme', 'Luke', 'Leia'),
  gender = c('male', 'female', 'male', 'female'),
  height = c(1.88, 1.65, 1.72, 1.50),
  weight = c(84, 45, 77, 49), 
  stringsAsFactors = TRUE
) 

dat
#>     name gender height weight
#> 1 Anakin   male   1.88     84
#> 2  Padme female   1.65     45
#> 3   Luke   male   1.72     77
#> 4   Leia female   1.50     49
str(dat)
#> 'data.frame':    4 obs. of  4 variables:
#>  $ name  : Factor w/ 4 levels "Anakin","Leia",..: 1 4 3 2
#>  $ gender: Factor w/ 2 levels "female","male": 2 1 2 1
#>  $ height: num  1.88 1.65 1.72 1.5
#>  $ weight: num  84 45 77 49

For comparison purpose


# If stringsAsFactors = FALSE
dat <- data.frame(
  name = c('Anakin', 'Padme', 'Luke', 'Leia'),
  gender = c('male', 'female', 'male', 'female'),
  height = c(1.88, 1.65, 1.72, 1.50),
  weight = c(84, 45, 77, 49), 
  stringsAsFactors = FALSE
) 

str(dat)
#> 'data.frame':    4 obs. of  4 variables:
#>  $ name  : chr  "Anakin" "Padme" "Luke" "Leia"
#>  $ gender: chr  "male" "female" "male" "female"
#>  $ height: num  1.88 1.65 1.72 1.5
#>  $ weight: num  84 45 77 49
  • Option 2: list() \(\Rightarrow\) data.frame()
# another way to create a basic data frame
lst <- list(
  name = c('Anakin', 'Padme', 'Luke', 'Leia'),
  gender = c('male', 'female', 'male', 'female'),
  height = c(1.88, 1.65, 1.72, 1.50),
  weight = c(84, 45, 77, 49)
)

tbl <- data.frame(lst, stringsAsFactors = TRUE)

tbl
#>     name gender height weight
#> 1 Anakin   male   1.88     84
#> 2  Padme female   1.65     45
#> 3   Luke   male   1.72     77
#> 4   Leia female   1.50     49
str(tbl)
#> 'data.frame':    4 obs. of  4 variables:
#>  $ name  : Factor w/ 4 levels "Anakin","Leia",..: 1 4 3 2
#>  $ gender: Factor w/ 2 levels "female","male": 2 1 2 1
#>  $ height: num  1.88 1.65 1.72 1.5
#>  $ weight: num  84 45 77 49

stringsAsFactors

  • = TRUE: Convert character vectors into factors (default option in older R)
  • = FALSE: Prevent data.frame() from converting strings into factors (default option in newer R)
  • Default option changed before and after 3.1.0 versions of R