Assignment 4: dplyr, data manipulation
2022-11-16
To do yourself
Read Wickham, Hadley, et al. “Welcome to the Tidyverse.” Journal of Open Source Software, 2019, or Wickham, Hadley. “Tidy data.” Journal of Statistical Software, 2014
introverse - alternative documentation for commonly used functions and concepts in base R and in the tidyverse.
Data Manipulation Using R (& dplyr) by Ram Narasimhan, PDF slides
Data Manipulation with dplyr brief tutorial
Aggregating and analyzing data with dplyr tutorial by Data Carpentry
Introduction to dplyr for Faster Data Manipulation in R tutorial and a 40 min video Hands-on dplyr tutorial for faster data manipulation in R
Animations of tidyverse verbs using R, the tidyverse, and gganimate - visual explanation of dplyr operations
Reusing Tidyverse code - dplyr/tidyverse data manipulation lecture slides
dplyr
What is the difference between the read_xls() and read_xlsx() functions? What message do you get if you read an .xlsx file using the read_xls() function?
What does the skip argument do?
Do we need to refer to a sheet within an Excel file by number, or can we refer to it by sheet name instead?
What does the guess_max argument do?
What happens if columns in the Excel worksheet are of different lengths?
How would you write to an Excel file? Demonstrate saving the mtcars dataset to an Excel file.
Use the starwars dataset that is loaded with the tidyverse. Accomplish the following in one long string of pipes.
- Keep only observations with weight (the mass column) and height recorded. Also include the homeworld variable.
- Create a variable called bmi that calculates the character's BMI (search for the formula).
- Summarize the bmi variable, grouping observations by homeworld.
- Print this summary in decreasing order of average BMI.
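As a hedged illustration of the filter, mutate, group, summarize, and arrange pattern these steps describe, here is a minimal sketch on the built-in mtcars data. It is not a solution to the starwars question: the derived column hp_per_wt and the result name power_by_cyl are made up for this example.

```r
library(dplyr)

# Illustrative only: mtcars stands in for the assignment data.
# hp_per_wt and power_by_cyl are hypothetical names chosen for this sketch.
power_by_cyl <- mtcars %>%
  filter(!is.na(hp), !is.na(wt)) %>%        # keep rows with both values recorded
  select(cyl, hp, wt) %>%                   # keep only the columns used below
  mutate(hp_per_wt = hp / wt) %>%           # derive a new variable from two columns
  group_by(cyl) %>%                         # one group per cylinder count
  summarise(mean_hp_per_wt = mean(hp_per_wt), n = n()) %>%
  arrange(desc(mean_hp_per_wt))             # largest group average first

print(power_by_cyl)
```

The same chain of verbs applies to any data frame; only the column names and the formula inside mutate() change.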
Read the following data into R. The data come from the American Community Survey and give the population of three cities in Virginia for each year from 2009 to 2012.
cities <- data.frame(
  name = c("richmond", "norfolk", "charlottesville"),
  pop2009 = c(1202494, 236071, 191515),
  pop2010 = c(1235565, 242143, 197279),
  pop2011 = c(1248271, 241943, 199675),
  pop2012 = c(1260202, 243056, 210909)
)
In one long string of pipes, convert the data from wide (2009 to 2012 population values) to long format, naming the new column of populations pop, group by city, create a summary variable that is the ratio of the largest population value to the smallest population value for the city, and arrange by this ratio in decreasing order.
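The wide-to-long reshape followed by a grouped summary can be sketched as below. This is a minimal example assuming tidyr's pivot_longer() is the reshaping tool (the assignment does not name one). The sales table and all of its column names are hypothetical toy data, not the cities data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not the assignment's cities table.
sales <- data.frame(
  store = c("a", "b"),
  sales2020 = c(100, 50),
  sales2021 = c(150, 55)
)

sales_ratio <- sales %>%
  pivot_longer(
    cols = starts_with("sales"),   # the wide columns to stack
    names_to = "year",             # old column names go here
    names_prefix = "sales",        # strip the prefix so year reads "2020", "2021"
    values_to = "amount"           # stacked values go here
  ) %>%
  group_by(store) %>%
  summarise(max_to_min = max(amount) / min(amount)) %>%
  arrange(desc(max_to_min))

print(sales_ratio)
```

After pivot_longer() each store has one row per year, which is what lets group_by() and summarise() compute a per-store statistic across years.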
To submit on Canvas
Create an R Markdown document with headers, text, and code that answers the questions and visualizes the results. Submit both the Rmd file and the knitted PDF. Pay attention to code clarity, variable names, and comments.