class: center, middle, inverse, title-slide .title[ # R preliminaries ] .author[ ### Mikhail Dozmorov ] .institute[ ### Virginia Commonwealth University ] .date[ ### 08-28-2023 ] --- ## Summary of the previous class .panelset[ .panel[.panel-name[Objects] - Everything that exists in R is an **object** - Everything that happens in R is a **call to a function** - The assignment operator: ` <- `, preferred over ` = ` - Do not name objects with [Reserved Words in R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) like `if else while function for TRUE FALSE NULL Inf NaN NA` ] .panel[.panel-name[Object types] .can-edit[ - Scalars, `a <- 3.14` - Vectors, `b <- c(1, 2)` or `b <- c(1, "2")` - Vectors can be named, `names(b) <- c("First", "Second")`. Use `unname()` to remove. - Matrices, `mat <- matrix(data = 0, nrow = 2, ncol = 2)` - Data frames, `dat = data.frame(Column.1 = c(3, 1, 3), Column.2 = c("2", "3", "2"))` - Boolean, `TRUE` or `FALSE` - Factors, `factor(c("Cats", "Dogs"), levels = c("Dogs", "Cats"))` ] ] .panel[.panel-name[Subsetting] .can-edit[ - Access to elements using `[]` - Row/column indexes start from 1, `dat[1, 2]` - Columns in data frames can be accessed with `$`, `dat$Column.1` - Elements can be subsetted using Boolean indexes, `a[c(TRUE, FALSE)]` - Avoid spaces in column names, as well as numerical column names - If absolutely necessary, wrap column names in forward ticks, `dat$1` ] ] .panel[.panel-name[Auxillary functions] .can-edit[ - `class(a)`, `str(a)`, `is.character(a)`, `as.character(b)` - `dim(dat)`, `nrow(dat)`, `ncol(dat)`, `length(dat)`, `colnames(dat)`, `rownames(dat)` - `head()`, `tail()`, `summary()` - Use `?` on any function to get help, `?cor` - Type function name without parentheses to see its code, `cor` ] ] ] --- ## Sequences of elements ```r rep(c(1, 2, 3), 5) ``` ``` ## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 ``` ```r seq(from = 1, to = 20, by = 2) ``` ``` ## [1] 1 3 5 7 9 11 13 15 17 19 ``` ```r 1:5 ``` ``` ## [1] 1 2 3 4 5 ``` ```r 5:1 ``` ``` ## [1] 5 4 3 2 1 ``` --- ## Sequences of characters .panelset[ .panel[.panel-name[Sequence of characters] ```r "a":"e" ``` ] .panel[.panel-name[Results] ```r "a":"e" Error in "a":"e" : NA/NaN argument In addition: Warning messages: 1: NAs introduced by coercion 2: NAs introduced by coercion ``` ] .panel[.panel-name[Solutions] .can-edit[ ```r head(letters) ``` ``` ## [1] "a" "b" "c" "d" "e" "f" ``` ```r tail(LETTERS) ``` ``` ## [1] "U" "V" "W" "X" "Y" "Z" ``` ```r month.abb ``` ``` ## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec" ``` ```r month.name ``` ``` ## [1] "January" "February" "March" "April" "May" "June" ## [7] "July" "August" "September" "October" "November" "December" ``` ```r pi ``` ``` ## [1] 3.141593 ``` ] ] ] --- ## Lists - **Lists**: objects containing elements of different types - Each list element can be of different length ```r lst = list(A = rep(2, 5), B = seq(1:5), C = letters[1:10]) lst ``` ``` ## $A ## [1] 2 2 2 2 2 ## ## $B ## [1] 1 2 3 4 5 ## ## $C ## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" ``` ```r unlist(lst) ``` ``` ## A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 ## "2" "2" "2" "2" "2" "1" "2" "3" "4" "5" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" ``` --- ## Addressing elements in a list - Address any element as `lst[1]` (or, `lst["A"]`) ```r lst[1] ``` ``` ## $A ## [1] 2 2 2 2 2 ``` - Address _the content of any element_ as `lst[[1]]` (or, `lst[["A"]]`, `lst$A`) ```r lst[[1]] ``` ``` ## [1] 2 2 2 2 2 ``` --- ## Logical operators .can-edit[ ```r 3 < 4 & "a" == "b" ``` ] - Like arithmetic operations, logic statements follow the order of preference. Operators `>`, `==`, `!` etc. are evaluated before `&` and `|` - If in doubt, wrap your expressions in parentheses .can-edit[ ```r (3 < 4) & ("a" == "b") ``` ] --- ## Logical operators .panelset[ .panel[.panel-name[Evaluation] ```r 1 + 2 == 3 ``` ``` ## [1] TRUE ``` What do you think will happen if we evaluate `0.1 + 0.2 == 0.3`? ] .panel[.panel-name[Caveats] ```r 0.1 + 0.2 == 0.3 ``` ``` ## [1] FALSE ``` **Problem:** Computers represent numbers as binary (i.e. base 2) floating-points. [Read more](https://floating-point-gui.de/basic/) **Solution:** Use `all.equal()` for evaluating floats (i.e fractions). ```r all.equal(0.1 + 0.2, 0.3) ``` ``` ## [1] TRUE ``` ] ] --- ## Value matching - To see whether an object is contained within (i.e. matches one of) a list of items, use `%in%` ```r 5 %in% 1:10 ``` ``` ## [1] TRUE ``` ```r 1:10 %in% 5 ``` ``` ## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE ``` - Value matching can be useful to subset R objects ```r pvals <- c(0.05, 0.04, 0.09, 0.03, 0.12) pvals[pvals <= 0.05] ``` ``` ## [1] 0.05 0.04 0.03 ``` --- ## Comments R ignores everything after the `#` sign ```r # This line is a comment print("Hello, World!") # This will print the message, but the comment will be ignored ``` ``` ## [1] "Hello, World!" ``` --- ## Clean up your environment ```r z <- c(1, 2, 3) ls() ``` ``` ## [1] "lst" "pvals" "z" ``` ```r rm(z) # Remove one variable ls() ``` ``` ## [1] "lst" "pvals" ``` ```r # Remove everything from the environment rm(list = ls()) # Not the same as restarting R session ls() ``` ``` ## character(0) ``` --- ## Functions - A function is a set of statements organized together to perform a specific task - **Name** - the actual name of the function, e.g., `summary()`, `mean()` - **Arguments** - values passed to the functions. Argument-less functions exist - **Code** - actual code of the function - **Return value** - the result of the function's code execution ``` r read.csv(file="scores.csv") ``` `read.csv` is a function to import a CSV file, and `file` is an argument that specifies which file to import R has a large number of built-in functions, and the user can create their own functions --- ## Running functions - From the R console - type the function and hit Enter - One function at a time, not efficient - Using an `R` script - a text file that contains all your `R` functions/code - `R` scripts allow you to save, edit, reproduce and share your code - R scripts stored in files with `.R` extension - Run the whole script as `source("script_name.R")`, or, from command line, `Rscript script_name.R` - In RStudio, you can run individual lines, code chunks, or source whole scripts. Keyboard shortcuts are available --- ## Packages - All functions belong to *packages*. The `read.csv` function is in the `utils` package. - `R` comes with about 30 packages (called "base `R`"), but as of August 2023, there are about 20,000 CRAN packages and over 2,000 Bioconductor packages - Example: `ggplot2` is a popular package that adds functions for creating graphs in a different way than what base `R` provides - To use functions in a package, the package must be installed and loaded. (They're free) - You only _install_ a package once - You _load_ a package whenever you want to use its functions --- ## Package repositories - `CRAN` - Comprehensive R Archive Network - a collection of > 18,000 (September 2021) packages - `Bioconductor` - genomics-oriented free and open source project hosting > 2,000 specialized R packages (September 2021) - `GitHub` - code-hosting repository, packages for everyone and by everyone .small[ https://cran.r-project.org/web/packages/ https://www.bioconductor.org/ https://github.com/ ] --- ## Installing packages - `install.packages(“<package_name>”)` - installs packages from CRAN, e.g., `install.packages("BiocManager")` - `install.packages(“<package_name.tar.gz>”, repos = NULL)` - install from a tarball archive - `R CMD INSTALL <package_name.tar.gz>` - install from a command line - `remotes` package - installs R packages from GitHub, GitLab, Bitbucket, Bioconductor, or plain 'subversion' or 'git' repositories. E.g., `remotes::install_github("tidyverse/ggplot2")` - `BiocManager::install()` - Install or update Bioconductor, CRAN, or GitHub packages .small[ https://CRAN.R-project.org/package=BiocManager ] --- ## Installing packages - `AnVIL::install()` - Install package binaries, speeds up installation process - RStudio point-and-click interface .center[<img src="img/rstudio_snippets.png" height=400 >] --- ## Loading packages - `library()` will load the package, e.g., `library(readxl)` or `library("readxl")` - But, when installing packages, always use parentheses, e.g., `install.packages("readxl")` - `require()` will load the package and, if success, return TRUE. Useful in `if` statement, e.g. ``` r if (!require(ggplot2)) { install.packages("ggplot2") } ``` --- ## Loading packages - `library(package_name)` - load library to use its functions - `library()` vs. `require()` - `require()` _tries_ to load the package, returns TRUE or FALSE - `library()` just loads the package, fails if the package is not available - Use only `library(package_name)` .small[ https://yihui.name/en/2014/07/library-vs-require/ ] --- ## Using functions from other packages - You can access functions without loading the package using the `::` operator, e.g., `Hmisc::rcorr()` - Entering the function name without parentheses will output its code ``` r > data.frame function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors()) { data.row.names <- if (check.rows && is.null(row.names)) ... ``` - You can access internal functions of a package with the `:::` operator if you know their name --- ## Getting help - Use `?function_name` to get help on a function from a _loaded_ package. E.g., `?boxplot` (same as `help(boxplot)`) - Use `example(boxplot)` to see how the function can be used - Use `??function_name` to search for the function across all installed packages, even not loaded. E.g., `??ggplotly` - Get an overview of all functions in a package: `help(package = "dplyr")` --- ## Getting help - For many packages, you can also try the `vignette()` function, which will provide an introduction to a package and it's purpose through a series of helpful examples. E.g., `vignette("dplyr")` - Bioconductor packages have vignettes, short tutorials on package-specific tasks. Browse them, e.g., `browseVignettes(package = "limma")` - Some packages have interactive demos. List all such packages with `demo(package = .packages(all.available = TRUE))`, use as `demo("fibonacci", package = "future")` --- ## Useful ways of getting data in and out of R - Base functions: `read.table`, `read.csv`, `write.table`, `write.csv` - For fixed-width files, use `read.fwf` or `readr::read_fwf` funcitons - Tidyverse way, `readr` package: `read_table`, `read_csv`, `read_tsv`, `write_csv` ... - For reading/writing Excel files, use `readxl` and `writexl` packages, `read_xlsx`, `write_xlsx` functions - Remember that `.csv` is the preferred text-based format that opens in Excel - `data.table` way: `fread` and `fwrite` functions .small[https://readr.tidyverse.org/ https://readxl.tidyverse.org/ https://CRAN.R-project.org/package=writexl https://CRAN.R-project.org/package=data.table] <!-- ## The stringsAsFactors curse (for R before 4.0) - When creating data frames with `data.frame()` or reading data with `read.table()`, strings automatically converted to factors - This behind-the-scenes factor conversion can lead to unpredictable behaviors - Use `as.is = TRUE` in `read.table()` to avoid such conversion - Better yet, set `options(stringsAsFactors = FALSE)` at the beginning of your script files .small[https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/] --> --- ## Save/load R objects - `save()`, `load()` - saves/loads R **objects** to the specified file ``` r x <- stats::runif(20) y <- list(a = 1, b = TRUE, c = "oops") save(x, y, file = "xy.rda") load(file = "xy.rda") ``` - `saveRDS()`, `readRDS()` - saves/loads a __representation__ of the object ``` r x <- stats::runif(20) saveRDS(x, file = "x.rds") x2 <- readRDS(file = "x.rds") identical(x, x2, ignore.environment = TRUE) ``` .small[https://fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/] --- ## R datasets R contains many datasets (stored as data frames) that are built-in to the software ```r data() # All built-in datasets # ?trees data(trees) # Load a particular one head(trees) ``` ``` ## Girth Height Volume ## 1 8.3 70 10.3 ## 2 8.6 65 10.3 ## 3 8.8 63 10.2 ## 4 10.5 72 16.4 ## 5 10.7 81 18.8 ## 6 10.8 83 19.7 ``` --- ## Accessing data in datasets ```r attach(trees) # You can make R find variables in any data frame by adding the data frame to the search path search() # .GlobalEnv is your workspace and the package quantities are libraries ``` ``` ## [1] ".GlobalEnv" "trees" "package:xaringanthemer" ## [4] "package:stats" "package:graphics" "package:grDevices" ## [7] "package:utils" "package:datasets" "package:methods" ## [10] "Autoloads" "package:base" ``` ```r detach(trees) # To remove an object from the search path, use the detach() with(trees, mean(Height)) # Evaluate an R expression in an environment constructed from data, possibly modifying (a copy of) the original data ``` ``` ## [1] 76 ``` `attach()` can cause name overloads and other serious issues. Avoid it