Lab 7 Data Manipulation and Visualization Practice
Load the packages
library(tidyverse)
library(nycflights13)
Question 1
We use the diamonds tibble from the ggplot2 package for this question. This tibble contains the prices and other attributes of almost 54,000 diamonds. There are 53940 rows and 10 columns. You may use ?diamonds to find more information about this data.
- [5 pts] Use mutate() to create a new variable, unitprice, that calculates the price per carat in the diamonds tibble.
- [10 pts] Which cut categories (Fair, Good, Very Good, Premium, Ideal) exhibit the least and the greatest variability in unit price? To explore this, create a boxplot of unitprice by cut quality. Additionally, use group_by() and dplyr::summarize() to calculate the standard deviation of unit price for each cut category. Comment on your findings.
- [5 pts] Which color category (D, E, F, G, H, I, J) has the most diamonds? Use count() to get the number of diamonds in each color category. In addition, answer this question by creating a bar plot of the variable color. Comment on your findings.
- [10 pts] Which combination of cut and color has the highest average unit price for diamonds larger than 1 carat? To answer this question, filter the diamonds dataset for diamonds larger than 1 carat, and select the unitprice, cut, and color columns. Use group_by() and dplyr::summarize() to calculate the average unit price, avgprice, for each cut and color combination. Then use arrange() to sort the results in descending order of average unit price. Comment on your findings.
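The filter, group, summarize, and sort pattern described in the last part can be sketched on a small made-up tibble. This is only an illustration under stated assumptions: the tibble toy and its columns grp, size, and value are hypothetical and are not part of diamonds, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, not the diamonds tibble
toy <- tibble(
  grp   = c("a", "a", "b", "b", "c"),
  size  = c(0.5, 1.5, 2.0, 3.0, 1.2),
  value = c(10, 20, 30, 50, 5)
)

toy_summary <- toy %>%
  filter(size > 1) %>%                    # keep rows above a threshold
  select(grp, value) %>%                  # keep only the columns needed
  group_by(grp) %>%                       # one group per category
  summarize(avg_value = mean(value)) %>%  # collapse each group to one row
  arrange(desc(avg_value))                # largest average first

print(toy_summary)
```

group_by() also accepts more than one column, in which case summarize() returns one row per combination of the grouping values.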
Question 2
Write the R code needed to reproduce the following graphs. The dataset is mpg from the ggplot2 package.
- [10 pts] [Hint: You may use geom_smooth(formula = y ~ x, method = "loess") to add the smoothed line(s) for these questions.]
- [10 pts] [Hint: You may use color = drv.]
Question 3
- [10 pts] Use the flights dataset from the nycflights13 package. Find the average departure delay by month.
- [10 pts] Visualize the average departure delay by month, as computed in the previous part, with a bar plot. Which three months have the worst average departure delays? [Hint: You may need to convert month to a factor using as.factor().]
- [10 pts] Currently, dep_time and sched_dep_time are convenient to look at but hard to compute with, because they're not really continuous numbers. Convert them to a more convenient representation: the number of minutes since midnight. Name the two new variables dep_time_minutes and sched_dep_time_minutes, respectively.
For example, 1755 corresponds to 5:55 pm. We can convert 1755 to minutes as \(17*60 + 55 = 1075\) minutes from midnight. You may consider using the integer division operator \(\%/\%\) and the modulus operator \(\%\%\).
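The arithmetic in the example above can be sketched in base R on a plain vector. The vector hhmm and its values are hypothetical and are not taken from flights; the sketch only shows how the two operators split a clock time in HHMM form into hours and minutes.

```r
# Hypothetical clock times in HHMM format (not taken from flights)
hhmm <- c(1755L, 5L, 930L, 2359L)

hours   <- hhmm %/% 100   # integer division drops the last two digits
minutes <- hhmm %% 100    # modulus keeps the last two digits

# Combine into minutes since midnight, e.g. 1755 -> 17 * 60 + 55 = 1075
minutes_since_midnight <- hours * 60 + minutes
print(minutes_since_midnight)
```

If a dataset records midnight as 2400, this formula gives 1440 rather than 0. Applying %% 1440 to the result is one way to handle that case.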