class: center, middle, inverse, title-slide .title[ # Data organization best practices ] .author[ ### Mikhail Dozmorov ] .institute[ ### Virginia Commonwealth University ] .date[ ### 09-20-2023 ] --- ## Spreadsheets - Store data in text files, delimiter-separated - Spreadsheets may be used for data entry and storage - Analysis and visualization should happen outside of spreadsheets .small[Broman, Karl W, and Kara H. Woo. “[Data Organization in Spreadsheets](https://doi.org/10.7287/peerj.preprints.3183v1).” - Excel spreadsheet tips and best practices] --- ## General spreadsheet tips - Use a consistent data layout in multiple files - Your primary data file should contain just the data and nothing else: no calculations, no graphs - Enter one piece of information in one cell - Don't use formatting, e.g., coloring, to convey additional meaning to the cell value. Instead, add another column, e.g., "outlier" with a boolean `TRUE` or `FALSE` indicator variable - Use a consistent format for all dates, YYYY-MM-DD recommended - Use "Text" format where possible, to avoid autoreformatting, e.g., "SEPT9" will become "9/9/2020" displayed as "9-Sep" - Use "\`forward tick" trick to force text format --- ## General spreadsheet tips (cont.) - Fill in all cells. No empty cells - Use a consistent fixed code for any missing values, e.g. "NA". Don't use numerical values like 999 - Use consistent file names for similar datasets - Never include “final” in a file name - Save and distribute the data in plain text files - Use comma-delimited (CSV) format .small[http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/ https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates] --- ## Variable naming conventions - Avoid using spaces in column/row names. Use some filler instead, e.g., "." or "_" - Note that `read.csv()`, `read.table` and the like functions will silently replace spases with dots in column names ```r column_names_with_spaces <- c("First column", "Second column") make.names(column_names_with_spaces) ``` ``` ## [1] "First.column" "Second.column" ``` --- ## Spreadsheet horror stories <img src="http://www.eusprig.org/images/logo.jpg" height=100> - A public archive of spreadsheet “horror stories” - In 13 audits of real-world spreadsheets, an average of 88% spreadsheets contained errors (Panko, Raymond R. "[What we know about spreadsheet errors](https://www.igi-global.com/article/know-spreadsheet-errors/55750)." Journal of Organizational and End User Computing (JOEUC) 10, no. 2 (1998): 15-21.) .small[http://www.eusprig.org/horror-stories.htm] --- ## Project organization principles - One project = one folder - Create readable names for subfolders/code. E.g. `00_raw_data`, `01_raw_data_QC` etc. - Choose file names carefully. Don't put spaces and special characters in file names! - Be sure to get and keep any/all data and meta-data possible - Get the data in the most-raw form possible. Keep the original files, names intact. (gzipped) CSV Text format is the most preferable - Separate data from code. Use relative paths in code. Create multiple `README.md` and write notes - your future you will thank you .small[Convert Excel files to CSV https://github.com/dilshod/xlsx2csv] --- ## Another project organization idea ``` bash project/ | data/ | | processing_scripts | | raw/ | | proc/ | tools/ | | src/ | | bin/ | exps | | pipeline_scripts | | results/ | | analysis_scripts | | figures/ ``` --- ## And another project organization idea ``` bash project/ | 00_raw/ | 01_processed/ | 02_summary/ | 01_process.sh | 02_summary.sh | README.md ``` --- ## Folder organization - Your project folder dictates relative paths of all files and subfolders - Use relative paths, e.g., `read_csv("data/matrix.txt")` - Set your project folder as a variable; if a project is transferred to another computer, there will be only ove variable to change ``` r dirMain <-"/path/to/my/project" # On Windows, can use \\ setwd(dirMain) # Set working directory. Check ?getwd list.files(path = dirMain) $ List files ``` - You can construct explicit paths like `file.path(dirMain, "data/matrix.txt")` --- ## Project organization principles - **Script everything** - All analysis steps, including data cleaning (removal of outliers, correcting numbers, typos, renaming columns etc.) should be scripted - **Scalability and universality** - ask yourself a question, if the data are updated (e.g., additional subjects) or you find some artifact that needs fixing, can you just “press a button” to update? If you work on a similar project, can you reuse your existing scripts with minimal modifications? - **Document everything** - Text format, human readable. Explicitly tie files together. Have a plan to organize, store and make your work understandable by others --- ## File naming principles - Machine readable - Human readable - Plays well with default ordering .small[http://www.mnhs.org/preserve/records/electronicrecords/erfnaming.php] --- ## File naming: Machine readable - Regular expression and globbing friendly - Avoid spaces, punctuation, accented characters, case sensitivity - Easy to compute on - Deliberate use of delimiters, e.g. "-", "\_" (think `cut -d "_" -f1,3,5`, `grep Notes` commands) ``` bash 2018-05-16_Lecture_Slides_01.pdf 2018-05-16_Lecture_Notes_01.pdf 2018-05-16_Lecture_Slides_02.pdf 2018-05-16_Lecture_Notes_02.pdf ``` --- ## File naming: Human readable - Name contains info on content - Easy to figure out what something is, based on its name ``` bash 01_preprocessing.R 02_quality-control.R helper01_rename-files.sh helper02_merge-duplicates.py ``` --- ## File naming summary - Plays well with default ordering - Put something numeric first - Left pad numbers with zeros - Use the ISO 8601 standard for dates (YYYY-MM-DD _or_ YYMMDD) Good ``` bash 2018-05-16_Lecture_Slides_01.pdf 2018-05-16_Lecture_Notes_01.pdf ``` Bad ``` bash 10_final-figures.R 1_data-cleaning.R 2_quality-control.R ``` .small[http://en.wikipedia.org/wiki/ISO_8601] --- ## Data management - Save the raw data - Ensure that raw data are backed up in more than one location - Create the data you wish to see in the world - Create analysis-friendly data - Record all the steps used to process data - Anticipate the need to use multiple tables, and use a unique identifier for every record - Submit data to a reputable DOI - issuing repository so that others can access and cite it --- ## Data management - Data are cheap, time is expensive - A terabyte hard drive costs about US $50 retail, which means that 50 Gigabytes costs less than US $5 - How much of your time is needed to generate 50Gb of code? --- ## References - [Tips for organizing projects](http://kbroman.org/steps2rr/pages/organize.html), [Organizing data in spreadsheets](http://kbroman.org/dataorg/) by Karl Broman - [Data Organization in Spreadsheets, common mistakes](http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/) - [ProjectTemplate](http://projecttemplate.net/) - an R package for advanced project management, [GitHub](https://github.com/johnmyleswhite/ProjectTemplate) - Wilson, Greg, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, et al. "[Best Practices for Scientific Computing](http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745) ." PLoS Biology 2014 - Noble, William Stafford. “[A Quick Guide to Organizing Computational Biology Projects](https://doi.org/10.1371/journal.pcbi.1000424).” PLoS Computational Biology 5, no. 7 (July 2009) - Computational projects organization, folder structure, command line scripts, version control