Introduction to Tidyverse and Tibbles


Introduction to Tidyverse

  • Tidyverse is a collection of R packages designed for data science, including ggplot2, tibble, dplyr, tidyr, readr, stringr, etc.

  • ggplot2 \(\rightarrow\) for data visualization

  • dplyr, tidyr \(\rightarrow\) for data wrangling and data preparation

Workflow in data science, with Tidyverse – https://oliviergimenez.github.io/intro_tidyverse/#1
  • Data preparation can take \(70\%\) to \(80\%\) of your project time.
    • So wouldn’t it be nice if there were an intuitive and idiomatic way to wrangle data?

Install and Load Tidyverse

Please install and load the tidyverse.


install.packages("tidyverse") # Remember, you ONLY need to install it once

library(tidyverse) # but you'll need to load it every session you use it

Once you’re done, let’s start the journey!

Syntax Comparison: Base R vs Tidyverse

Let’s use the mpg tibble in ggplot2. Please refer to help(mpg) for the variable definitions.

library(tidyverse) 

str(mpg)
#> tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
#>  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
#>  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
#>  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
#>  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
#>  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
#>  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
#>  $ drv         : chr [1:234] "f" "f" "f" "f" ...
#>  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
#>  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
#>  $ fl          : chr [1:234] "p" "p" "p" "p" ...
#>  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
glimpse(mpg) # glimpse() is the dplyr counterpart of str()
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
#> $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
#> $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
#> $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
#> $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
#> $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
#> $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
#> $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
#> $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
#> $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
#> $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Example 1 What types of vehicles did dodge produce from the mpg data?

Base R syntax

unique(mpg[mpg$manufacturer == "dodge", "class"])
#> # A tibble: 3 × 1
#>   class  
#>   <chr>  
#> 1 minivan
#> 2 pickup 
#> 3 suv

Fine, but a little awkward.

Tidyverse syntax

mpg %>%
 filter(manufacturer == "dodge") %>%
 distinct(class) 
#> # A tibble: 3 × 1
#>   class  
#>   <chr>  
#> 1 minivan
#> 2 pickup 
#> 3 suv

Much, much nicer.

Tidyverse Syntax and Pipe Operator

The pipe operator %>% is used to build the pipeline.

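  • You can read %>% as "then".

  • It passes the result on its left into the first argument of the function on its right.

  • Shortcut to type %>% in RStudio: Ctrl + Shift + M (Cmd + Shift + M on macOS).

  • When a pipeline spans several lines, %>% has to come at the end of a line, not at the start of the next one.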
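
As a minimal sketch of the rule that the left-hand result becomes the first argument on the right, the example below compares a piped call with the equivalent nested call. The vector x is made up for illustration, and magrittr is loaded directly only to keep the example small (the tidyverse makes %>% available as well).

```r
library(magrittr)  # provides %>%

x <- c(4, 9, 16)   # hypothetical data, for illustration only

# x %>% f() is evaluated as f(x), so a pipeline reads left to right:
# "take x, then take square roots, then sum"
piped <- x %>%
  sqrt() %>%
  sum()

# The same computation written as nested calls reads inside out
nested <- sum(sqrt(x))

identical(piped, nested)

# Extra arguments stay in place; the piped value is inserted before them
rounded <- c(1.234, 5.678) %>% round(1)  # same as round(c(1.234, 5.678), 1)
```

Note that each %>% sits at the end of its line, which tells R that the expression continues.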

Check Missing Values

mpg %>% 
  dplyr::select(everything()) %>% # use everything() to select all variables
  summarize_all(~sum(is.na(.))) # summarize_all() affects every variable
#> # A tibble: 1 × 11
#>   manufacturer model displ  year   cyl trans   drv   cty   hwy    fl class
#>          <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1            0     0     0     0     0     0     0     0     0     0     0
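
The scoped verb summarize_all() still works, but it has been superseded by across() since dplyr 1.0.0. The sketch below shows one possible across() version of the same missing-value count, assuming dplyr 1.0.0 or later. The toy tibble is hypothetical and only serves to make the counts non-zero.

```r
library(dplyr)

# Hypothetical toy data with a few missing values
toy <- tibble(
  id    = 1:4,
  score = c(10, NA, 7, NA),
  grade = c("a", "b", NA, "d")
)

# Count the NAs in every column with across()
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# Base R alternative: returns a named vector instead of a one-row tibble
na_base <- colSums(is.na(toy))
na_base
```

Replacing toy with mpg should reproduce the all-zero row shown above.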

Wrangling & Graphing in Tidyverse

Chain the tidyverse pipeline into the ggplot function.

Example 2

Draw boxplots of engine displacement (variable displ in mpg) for three vehicle types of dodge.

mpg %>% ### Data wrangling part
 filter(manufacturer == "dodge") %>%
 ggplot(aes(x = class, y = displ)) + ### Data visualization part
 geom_boxplot() 

  • Or equivalently,
### Data wrangling part 
mpg_dodge <- mpg %>% 
  filter(manufacturer == "dodge")

### Data visualization part 
ggplot(mpg_dodge, aes(x = class, y = displ)) + 
  geom_boxplot() 

Tibbles

The tidyverse works mainly with tibbles rather than base data frames, so this is where we start.

A tibble is a data frame with different attributes and requirements. The tibble package, which is included in the tidyverse, provides support for tibbles.

Tibble vs Data Frame

  • Tibbles are enhanced data frames.
  • Tibbles are a table format provided by the tibble package, part of the core tidyverse.
  • Make working in the tidyverse a little easier.
  • We will compare tibble and data frame on the following aspects:
    • Creating
    • Coercion (i.e. data frame \(\longleftrightarrow\) tibble)
    • Difference in printing
    • Difference in subsetting
    • Difference in recycling rules
    • Difference in accepting row names

Tibble vs Data Frame – Creating (Import a CSV File)

Import Data as a Data frame

Recall: use read.csv() to read a CSV file into a data frame.

df_workshop <- read.csv("DS Workshop Participants List.csv")
class(df_workshop)
#> [1] "data.frame"
str(df_workshop)
#> 'data.frame':    16 obs. of  11 variables:
#>  $ Name                           : chr  "Dwayne Johnson" "Rihanna" "Ellen DeGeneres" "Will Smith" ...
#>  $ Gender                         : chr  "M" "F" "F" "M" ...
#>  $ Email.Address                  : chr  "Djohnson@illinois.edu" "Rihanna@illinois.edu" "Edegeneres@illinois.edu" "Wsmith@illinois.edu" ...
#>  $ Department                     : chr  "Statistics" "Economics" "Biology" "Electrical and Computer Engineering" ...
#>  $ Info.Source                    : chr  "Email" "Class" "Email" "Email" ...
#>  $ Class.Year                     : chr  "Undergraduate" "Graduate" "Undergraduate" "Undergraduate" ...
#>  $ Major                          : chr  "Statistics" "ECON" "Biology" "Electrical Engineering" ...
#>  $ Related.Courses.Taken          : chr  "STAT 207, MATH 220" "BUS 201, MATH 426" "STAT 207" "MATH 221, MATH 220" ...
#>  $ Programming.Language.Known     : chr  "R, SAS, Matlab" "Python, SAS" "R, Python" "R, Python, SQL" ...
#>  $ Willingness.to.be.the.Presenter: chr  "Y" "N" "N" "Y" ...
#>  $ DS.Years.of.Experience         : num  1 1 0 0 1 NA 2 0.5 0.5 NA ...

Import Data as a Tibble

  • Use read_csv() (from the readr package) to read a CSV file into a tibble.
tbl_workshop <- read_csv("DS Workshop Participants List.csv")
class(tbl_workshop)
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
glimpse(tbl_workshop) 
#> Rows: 16
#> Columns: 11
#> $ Name                              <chr> "Dwayne Johnson", "Rihanna", "Ellen …
#> $ Gender                            <chr> "M", "F", "F", "M", "F", "M", "M", "…
#> $ `Email Address`                   <chr> "Djohnson@illinois.edu", "Rihanna@il…
#> $ Department                        <chr> "Statistics", "Economics", "Biology"…
#> $ `Info Source`                     <chr> "Email", "Class", "Email", "Email", …
#> $ `Class Year`                      <chr> "Undergraduate", "Graduate", "Underg…
#> $ Major                             <chr> "Statistics", "ECON", "Biology", "El…
#> $ `Related Courses Taken`           <chr> "STAT 207, MATH 220", "BUS 201, MATH…
#> $ `Programming Language Known`      <chr> "R, SAS, Matlab", "Python, SAS", "R,…
#> $ `Willingness to be the Presenter` <chr> "Y", "N", "N", "Y", "Y", NA, "N", "N…
#> $ `DS Years of Experience`          <dbl> 1.0, 1.0, 0.0, 0.0, 1.0, NA, 2.0, 0.…
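
The two imports above differ in how they treat column names: read.csv() rewrites names into syntactic ones, while read_csv() keeps them as written in the file. The sketch below reproduces this on a small temporary file so it can be run without the workshop CSV. The file contents are invented for illustration, and show_col_types = FALSE assumes readr 2.0 or later.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(
  c("Name,Email Address,DS Years of Experience",
    "A,a@example.com,1.5",
    "B,b@example.com,"),
  path
)

df_demo  <- read.csv(path)                          # base R: data frame
tbl_demo <- read_csv(path, show_col_types = FALSE)  # readr: tibble

names(df_demo)   # spaces replaced by dots
names(tbl_demo)  # names kept as in the file

class(df_demo)
class(tbl_demo)
```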

Tibble vs Data Frame – Creating (Construct a Data Frame by Columns)

Data frame


data.frame('crazy name' = 1:3, 'not so crazy & name' = c("a", "b", "c"))
#>   crazy.name not.so.crazy...name
#> 1          1                   a
#> 2          2                   b
#> 3          3                   c

data.frame(x = 1:3, y = x + 2)
#> Error in eval(expr, envir, enclos): object 'x' not found

Tibble


tibble('crazy name' = 1:3, 'not so crazy & name' = c("a", "b", "c")) 
#> # A tibble: 3 × 2
#>   `crazy name` `not so crazy & name`
#>          <int> <chr>                
#> 1            1 a                    
#> 2            2 b                    
#> 3            3 c
# Tibbles allow non-syntactic variable names. To refer to these variables, surround them with backticks.

tibble(x = 1:3, y = x + 2) # tibble() lets a column refer to columns created earlier in the same call
#> # A tibble: 3 × 2
#>       x     y
#>   <int> <dbl>
#> 1     1     3
#> 2     2     4
#> 3     3     5
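
A short sketch of how the non-syntactic names can be used afterwards. The object names tb and df_kept are chosen for illustration only. The last part uses the check.names argument of data.frame(), which switches off the name repair seen in the data frame example.

```r
library(tibble)

tb <- tibble(`crazy name` = 1:3, `not so crazy & name` = c("a", "b", "c"))

# With $, a non-syntactic name needs backticks
tb$`crazy name`

# With [[ ]], the name is an ordinary string, so no backticks are needed
tb[["not so crazy & name"]]

# data.frame() can also keep such names if name checking is turned off
df_kept <- data.frame(`crazy name` = 1:3, check.names = FALSE)
names(df_kept)
```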

Tibble vs Data Frame – Coercion

Data Frame to Tibble

class(df_workshop)
#> [1] "data.frame"

class(as_tibble(df_workshop))
#> [1] "tbl_df"     "tbl"        "data.frame"

Tibble to Data Frame

class(tbl_workshop) 
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

class(as.data.frame(tbl_workshop))
#> [1] "data.frame"

Tibble vs Data Frame – Printing

Data Frame

df_workshop
#>                 Name Gender           Email.Address
#> 1     Dwayne Johnson      M   Djohnson@illinois.edu
#> 2            Rihanna      F    Rihanna@illinois.edu
#> 3    Ellen DeGeneres      F Edegeneres@illinois.edu
#> 4         Will Smith      M     Wsmith@illinois.edu
#> 5     Angelina Jolie      F     Ajolie@illinois.edu
#> 6  Cristiano Ronaldo      M   Cronaldo@illinois.edu
#> 7  Leonardo DiCaprio      M  Ldicaprio@illinois.edu
#> 8         Tom Cruise      M    Tcruise@illinois.edu
#> 9  Robert Downey Jr.      M  RDowneyJr@illinois.edu
#> 10       Celine Dion      F      Cdion@illinois.edu
#> 11             Adele      F      Adele@illinois.edu
#> 12   Serena Williams      F  Swilliams@illinois.edu
#> 13      Lionel Messi      M     Lmessi@illinois.edu
#> 14      Taylor Swift      F     Tswift@illinois.edu
#> 15     J. K. Rowling      F  JKRowling@illinois.edu
#> 16      LeBron James      M     Ljames@illinois.edu
#>                             Department        Info.Source        Class.Year
#> 1                           Statistics              Email     Undergraduate
#> 2                            Economics              Class          Graduate
#> 3                              Biology              Email     Undergraduate
#> 4  Electrical and Computer Engineering              Email     Undergraduate
#> 5                     Computer Science              Class     Undergraduate
#> 6                            Economics Friends/Colleagues Faculty and staff
#> 7                            Economics              Email     Undergraduate
#> 8                          Mathematics              Class     Undergraduate
#> 9   mechanical Science and Engineering              Class     Undergraduate
#> 10                             Biology Friends/Colleagues Faculty and staff
#> 11                          Statistics              Class     Undergraduate
#> 12                    Computer Science              Email     Undergraduate
#> 13                             Biology          Professor          Graduate
#> 14                             BIology              Flyer     Undergraduate
#> 15                             Finance              Email     Undergraduate
#> 16 Electrical and Computer Engineering              Flyer     Undergraduate
#>                     Major        Related.Courses.Taken
#> 1              Statistics           STAT 207, MATH 220
#> 2                    ECON            BUS 201, MATH 426
#> 3                 Biology                     STAT 207
#> 4  Electrical Engineering           MATH 221, MATH 220
#> 5        Computer Science      CS 173, CS 411, CS 210 
#> 6                    <NA>                         <NA>
#> 7               Economics  BUS 201, MATH 426, STAT 425
#> 8             Mathematics MATH 227, STAT 207, STAT 425
#> 9  Mechanical Engineering           MATH 220, STAT 425
#> 10                   <NA>                         <NA>
#> 11             Statistics MATH 220, MATH 426, STAT 425
#> 12       Computer Science               CS 225, CS 173
#> 13                Biology                     MATH 257
#> 14                Biology           MATH 257, MATH 426
#> 15                Finance             BUS 302, BUS 201
#> 16                     EE           MATH 227, STAT 425
#>    Programming.Language.Known Willingness.to.be.the.Presenter
#> 1              R, SAS, Matlab                               Y
#> 2                 Python, SAS                               N
#> 3                   R, Python                               N
#> 4              R, Python, SQL                               Y
#> 5                 Python, SAS                               Y
#> 6                        <NA>                            <NA>
#> 7     Java, C++, HTML, Matlab                               N
#> 8                      Matlab                               N
#> 9              Matlab, Python                               Y
#> 10                       <NA>                            <NA>
#> 11          R, Python, Matlab                               Y
#> 12                  R, Python                               Y
#> 13                  C++, JAVA                               N
#> 14                     Python                               Y
#> 15             R, Hadoop, SAS                               N
#> 16                     Matlab                               N
#>    DS.Years.of.Experience
#> 1                     1.0
#> 2                     1.0
#> 3                     0.0
#> 4                     0.0
#> 5                     1.0
#> 6                      NA
#> 7                     2.0
#> 8                     0.5
#> 9                     0.5
#> 10                     NA
#> 11                    1.0
#> 12                    0.5
#> 13                    2.0
#> 14                    0.0
#> 15                    1.5
#> 16                    0.5

Tibble

tbl_workshop
#> # A tibble: 16 × 11
#>    Name       Gender `Email Address` Department `Info Source` `Class Year` Major
#>    <chr>      <chr>  <chr>           <chr>      <chr>         <chr>        <chr>
#>  1 Dwayne Jo… M      Djohnson@illin… Statistics Email         Undergradua… Stat…
#>  2 Rihanna    F      Rihanna@illino… Economics  Class         Graduate     ECON 
#>  3 Ellen DeG… F      Edegeneres@ill… Biology    Email         Undergradua… Biol…
#>  4 Will Smith M      Wsmith@illinoi… Electrica… Email         Undergradua… Elec…
#>  5 Angelina … F      Ajolie@illinoi… Computer … Class         Undergradua… Comp…
#>  6 Cristiano… M      Cronaldo@illin… Economics  Friends/Coll… Faculty and… <NA> 
#>  7 Leonardo … M      Ldicaprio@illi… Economics  Email         Undergradua… Econ…
#>  8 Tom Cruise M      Tcruise@illino… Mathemati… Class         Undergradua… Math…
#>  9 Robert Do… M      RDowneyJr@illi… mechanica… Class         Undergradua… Mech…
#> 10 Celine Di… F      Cdion@illinois… Biology    Friends/Coll… Faculty and… <NA> 
#> 11 Adele      F      Adele@illinois… Statistics Class         Undergradua… Stat…
#> 12 Serena Wi… F      Swilliams@illi… Computer … Email         Undergradua… Comp…
#> 13 Lionel Me… M      Lmessi@illinoi… Biology    Professor     Graduate     Biol…
#> 14 Taylor Sw… F      Tswift@illinoi… BIology    Flyer         Undergradua… Biol…
#> 15 J. K. Row… F      JKRowling@illi… Finance    Email         Undergradua… Fina…
#> 16 LeBron Ja… M      Ljames@illinoi… Electrica… Flyer         Undergradua… EE   
#> # ℹ 4 more variables: `Related Courses Taken` <chr>,
#> #   `Programming Language Known` <chr>,
#> #   `Willingness to be the Presenter` <chr>, `DS Years of Experience` <dbl>

Tibble vs Data Frame – Subsetting

Recall Subsetting in Data Frame

df_workshop$Major # Returns a vector
#>  [1] "Statistics"             "ECON"                   "Biology"               
#>  [4] "Electrical Engineering" "Computer Science"       NA                      
#>  [7] "Economics"              "Mathematics"            "Mechanical Engineering"
#> [10] NA                       "Statistics"             "Computer Science"      
#> [13] "Biology"                "Biology"                "Finance"               
#> [16] "EE"
df_workshop[, "Major"] # Returns a vector
#>  [1] "Statistics"             "ECON"                   "Biology"               
#>  [4] "Electrical Engineering" "Computer Science"       NA                      
#>  [7] "Economics"              "Mathematics"            "Mechanical Engineering"
#> [10] NA                       "Statistics"             "Computer Science"      
#> [13] "Biology"                "Biology"                "Finance"               
#> [16] "EE"

Subsetting in Tibble – distinguish [] and [[]]

  • Using [] returns a tibble
tbl_workshop[,'Major'] 
#> # A tibble: 16 × 1
#>    Major                 
#>    <chr>                 
#>  1 Statistics            
#>  2 ECON                  
#>  3 Biology               
#>  4 Electrical Engineering
#>  5 Computer Science      
#>  6 <NA>                  
#>  7 Economics             
#>  8 Mathematics           
#>  9 Mechanical Engineering
#> 10 <NA>                  
#> 11 Statistics            
#> 12 Computer Science      
#> 13 Biology               
#> 14 Biology               
#> 15 Finance               
#> 16 EE
tbl_workshop['Major']
#> # A tibble: 16 × 1
#>    Major                 
#>    <chr>                 
#>  1 Statistics            
#>  2 ECON                  
#>  3 Biology               
#>  4 Electrical Engineering
#>  5 Computer Science      
#>  6 <NA>                  
#>  7 Economics             
#>  8 Mathematics           
#>  9 Mechanical Engineering
#> 10 <NA>                  
#> 11 Statistics            
#> 12 Computer Science      
#> 13 Biology               
#> 14 Biology               
#> 15 Finance               
#> 16 EE
  • Using [[]] or drop = TRUE returns a vector
tbl_workshop[['Major']]
#>  [1] "Statistics"             "ECON"                   "Biology"               
#>  [4] "Electrical Engineering" "Computer Science"       NA                      
#>  [7] "Economics"              "Mathematics"            "Mechanical Engineering"
#> [10] NA                       "Statistics"             "Computer Science"      
#> [13] "Biology"                "Biology"                "Finance"               
#> [16] "EE"
tbl_workshop[,'Major', drop = TRUE]
#>  [1] "Statistics"             "ECON"                   "Biology"               
#>  [4] "Electrical Engineering" "Computer Science"       NA                      
#>  [7] "Economics"              "Mathematics"            "Mechanical Engineering"
#> [10] NA                       "Statistics"             "Computer Science"      
#> [13] "Biology"                "Biology"                "Finance"               
#> [16] "EE"
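
One more subsetting difference that may be worth knowing: with $, a data frame accepts an abbreviated column name (partial matching), whereas a tibble requires the full name and returns NULL with a warning otherwise. A minimal sketch on made-up data:

```r
library(tibble)

# Hypothetical two-row example
df_demo  <- data.frame(Major = c("Statistics", "Biology"))
tbl_demo <- as_tibble(df_demo)

# Data frame: "Maj" is silently matched to the Major column
partial_df <- df_demo$Maj
partial_df

# Tibble: no partial matching; the result is NULL, with a warning
# about an unknown column (suppressed here to keep the output clean)
partial_tbl <- suppressWarnings(tbl_demo$Maj)
partial_tbl

# The full name works for both
tbl_demo$Major
```

The stricter behavior of tibbles makes typos in column names surface early instead of returning an unintended column.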

Tibble vs Data Frame – Recycling

Data frames – recycling rule is applied automatically

data.frame(x = 1:6, y = "STAT")
#>   x    y
#> 1 1 STAT
#> 2 2 STAT
#> 3 3 STAT
#> 4 4 STAT
#> 5 5 STAT
#> 6 6 STAT

data.frame(x = 1:6, y = "STAT", z = c("Y", "N"))
#>   x    y z
#> 1 1 STAT Y
#> 2 2 STAT N
#> 3 3 STAT Y
#> 4 4 STAT N
#> 5 5 STAT Y
#> 6 6 STAT N

Tibble – only values of length 1 are recycled

tibble(x = 1:6, y = "STAT")
#> # A tibble: 6 × 2
#>       x y    
#>   <int> <chr>
#> 1     1 STAT 
#> 2     2 STAT 
#> 3     3 STAT 
#> 4     4 STAT 
#> 5     5 STAT 
#> 6     6 STAT

tibble(x = 1:6, y = "STAT", z = c("Y", "N"))
#> Error in `tibble()`:
#> ! Tibble columns must have compatible sizes.
#> • Size 6: Existing data.
#> • Size 2: Column `z`.
#> ℹ Only values of size one are recycled.

# Correction
tibble(x = 1:6, y = "STAT", z = rep(c("Y", "N"), times = 3))
#> # A tibble: 6 × 3
#>       x y     z    
#>   <int> <chr> <chr>
#> 1     1 STAT  Y    
#> 2     2 STAT  N    
#> 3     3 STAT  Y    
#> 4     4 STAT  N    
#> 5     5 STAT  Y    
#> 6     6 STAT  N

Tibble vs Data Frame – Row Names

Data frames accept assigned row names

rownames(df_workshop) <- letters[1:16]
head(df_workshop)
#>                Name Gender           Email.Address
#> a    Dwayne Johnson      M   Djohnson@illinois.edu
#> b           Rihanna      F    Rihanna@illinois.edu
#> c   Ellen DeGeneres      F Edegeneres@illinois.edu
#> d        Will Smith      M     Wsmith@illinois.edu
#> e    Angelina Jolie      F     Ajolie@illinois.edu
#> f Cristiano Ronaldo      M   Cronaldo@illinois.edu
#>                            Department        Info.Source        Class.Year
#> a                          Statistics              Email     Undergraduate
#> b                           Economics              Class          Graduate
#> c                             Biology              Email     Undergraduate
#> d Electrical and Computer Engineering              Email     Undergraduate
#> e                    Computer Science              Class     Undergraduate
#> f                           Economics Friends/Colleagues Faculty and staff
#>                    Major   Related.Courses.Taken Programming.Language.Known
#> a             Statistics      STAT 207, MATH 220             R, SAS, Matlab
#> b                   ECON       BUS 201, MATH 426                Python, SAS
#> c                Biology                STAT 207                  R, Python
#> d Electrical Engineering      MATH 221, MATH 220             R, Python, SQL
#> e       Computer Science CS 173, CS 411, CS 210                 Python, SAS
#> f                   <NA>                    <NA>                       <NA>
#>   Willingness.to.be.the.Presenter DS.Years.of.Experience
#> a                               Y                      1
#> b                               N                      1
#> c                               N                      0
#> d                               Y                      0
#> e                               Y                      1
#> f                            <NA>                     NA

Tibbles don’t accept assigned row names

rownames(tbl_workshop) <- letters[1:16] # gives a warning: setting row names on a tibble is deprecated
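
If the row names carry information, one option is to store them in an ordinary column when converting to a tibble. The sketch below uses the rownames argument of as_tibble() together with has_rownames() and column_to_rownames() from the tibble package. The data and the column name "id" are made up for illustration.

```r
library(tibble)

# Hypothetical data frame with meaningful row names
df_scores <- data.frame(score = c(90, 85, 70), row.names = c("a", "b", "c"))

# Move the row names into a regular column called "id" during conversion
tbl_scores <- as_tibble(df_scores, rownames = "id")
tbl_scores

has_rownames(df_scores)   # the data frame has row names
has_rownames(tbl_scores)  # the tibble does not

# Going back: turn the "id" column into row names of a data frame again
df_back <- column_to_rownames(tbl_scores, var = "id")
rownames(df_back)
```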

More motivation

  • The tidyverse has become the industry standard in the R-using data science community.

  • The tidyverse in R is comparable to the numpy and pandas packages in Python.
