Data Carpentry Ecology
The Jackson Laboratory
9am-12pm Tuesday - Friday, Dec 7-10

Workshop website: https://smcclatchy.github.io/2021-12-07-jackson/
Lesson website: https://datacarpentry.org/ecology-workshop/
Etherpad: https://pad.carpentries.org/2021-12-07-jackson

Welcome to Data Carpentry Ecology
We will use this Etherpad to share links and snippets of code, take notes, ask and answer questions, and whatever else comes to mind. The page displays a screen with three major parts:
* The left side holds today's notes: please edit these as we go along and take notes.
* The top right side shows the names of users who are logged in: please add your name and pick the color that best reflects your mood and personality.
* The bottom right is a real-time chat window for asking questions of the instructor and your fellow learners.

Instructors
Mary-Francis LaPorte, UC Davis
Kurt Showmaker, The Jackson Laboratory
Annajiat Alim Rasel, BRAC University

Helpers
Maddie Gastonguay, JAX, chicken noodle soup!
Meryl Brodsky, UT Austin
Sue McClatchy, The Jackson Laboratory

Please sign in for day 1 below (name, institution, favorite food).
DAN CORTES, JAX, ENCHILADAS
Terrence Turner, Regeneron, Steak
DIVYA JAIN, TENNESSEE STATE UNIVERSITY, PIZZA
CARLOS LEON, JAX, DRUNKEN NOODLES
Katie Janeczko, JAX, Chicken Wings
Maheshwor Thapa, JAX
Mimi Byun, UNT, pasta
Maria Laurent-Rice, Stillman College, seafood-conch
Faith Onomeh, Stillman College, Jollof Rice and Chicken.
Kurt Showmaker (Helper), JAX, beef tips with crawfish sauce
Mary-Francis LaPorte, UC Davis
Meryl Brodsky (Helper), UT Austin, Spaghetti & Marinara
Jiayuan Shi, JAX
Qianru Yu, JAX
Sarah Seneff, Stillman College, pasta
Kennedy-MyCal Davis, Stillman College, Chicken Tenders
LaDarien Rancher, Stillman College, Seafood Bread
Rob Pazdro, UGA, sauerkraut pizza
Annajiat Alim Rasel, Brac University, milk and honey

Data Organization in Spreadsheets
( https://datacarpentry.org/spreadsheet-ecology-lesson/ )

Resources
* Head First Excel: https://www.amazon.com/Head-First-Excel-learners-spreadsheets/dp/0596807694/
* LibreOffice
* Digital lab notebooks: https://www.repro4everyone.org/resources/intro-fsbyt-sxk8z

Exercise
* How many people have used spreadsheets in their research?
* How many people have accidentally done something that made them frustrated or sad?

Exercise
We're going to take a messy version of the survey data and describe how we would clean it up.
1. Download the data by clicking here to get it from FigShare: https://ndownloader.figshare.com/files/2252083
2. Open up the data in a spreadsheet program.
3. You can see that there are two tabs. Two field assistants conducted the surveys, one in 2013 and one in 2014, and they both kept track of the data in their own way in tabs 2013 and 2014 of the dataset, respectively. Now you're the person in charge of this project and you want to be able to start analyzing the data.
4. Identify what is wrong with this spreadsheet. Also think about the steps you would need to take to clean up the 2013 and 2014 tabs, and to put them all together in one spreadsheet.

Important: Do not forget our first piece of advice: create a new file (or tab) for the cleaned data, and never modify your original (raw) data.

After you go through this exercise, we'll discuss as a group what was wrong with this data and how you would fix it.
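Once the 2013 and 2014 tabs have been cleaned and exported as CSV files, they can be stacked into a single table in R (the language used later in this workshop). A minimal sketch, assuming hypothetical file names for the cleaned exports:

```r
library(tidyverse)

# Hypothetical file names for the cleaned-up tabs, exported as CSV
surveys_2013 <- read_csv("data/surveys_2013_clean.csv")
surveys_2014 <- read_csv("data/surveys_2014_clean.csv")

# Stack the two years into one data frame; columns match up by name,
# so both tabs need identical column headers after cleaning
surveys_all <- bind_rows(surveys_2013, surveys_2014)
write_csv(surveys_all, "data/surveys_combined.csv")
```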
Excel Exercise findings:
* comments after data in a column +1 +1
* the 2013 sheet could have easily combined the three tables by adding a column for species +1 +1 +1
* unit format for weight should be consistent, and comments kept separate; too much empty space; column headers should have _ between words +1 (better not to use spaces between words; R won't like that much)
* Date not consistently written +1 +1
* species & sex should be split into 2 columns in plot 4 +1
* Date formats are not consistent on sheet 2014 +1
* Data sets organized by species on one tab, by plot on the other
* eliminate letters in the weight category

Exercise
Challenge: pulling month, day and year out of dates
* Let's create a tab called dates in our data spreadsheet and copy the 'plot 3' table from the 2014 tab (that contains the problematic dates).
* Let's extract month, day and year from the dates in the Date collected column into new columns. For this we can use the following built-in Excel functions:
  * YEAR()
  * MONTH()
  * DAY()
* (Make sure the new columns are formatted as a number and not as a date.)

You can see that even though we expected the year to be 2014, the year is actually 2015. What happened here is that the field assistant who collected the data for year 2014 initially forgot to include their data for 'plot 3' in this dataset. They came back in 2015 to add the missing data into the dataset and entered the dates for 'plot 3' without the year. Excel automatically interpreted the year as 2015, the year the data was entered into the spreadsheet, not the year the data was collected. Thereby, the spreadsheet program introduced an error into the dataset without the field assistant realising.

Please download: https://github.com/datacarpentry/spreadsheet-ecology-lesson/blob/gh-pages/data/survey_sorting_exercise.xlsx?raw=true

Exercise: Why does sorting this help?
* Some of the entries have the scale grams in the tab and hence are not actually sorted correctly +1
* quickly identify erroneous entries and group them by nature +1
* increase in weight of different entries
* displayed by categories, it shows inconsistencies in entering data

Differences among CSV formats
https://donatstudios.com/CSV-An-Encoding-Nightmare
https://excelribbon.tips.net/T013672_Different_CSV_Formats.html
https://en.wikipedia.org/wiki/Newline#Representation
https://en.wikipedia.org/wiki/Byte_order_mark
https://unicode.org/

For those who had questions about electronic lab notebooks from yesterday's discussion, see this presentation from Reproducibility for Everyone: https://www.repro4everyone.org/resources/intro-fsbyt-sxk8z

Data Cleaning with OpenRefine
( https://datacarpentry.org/OpenRefine-ecology-lesson/ )

Download this data file to your computer: https://ndownloader.figshare.com/files/7823341
Download OpenRefine: https://openrefine.org/download.html

Faceting
Here we will use faceting to look for potential errors in data entry in the scientificName column.
1. Scroll over to the scientificName column.
2. Click the down arrow and choose Facet > Text facet.
3. In the left panel, you'll now see a box containing every unique value in the scientificName column along with a number representing how many times that value occurs in the column.
4. Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
5. Hover the mouse over one of the names in the facet list. You should see that you have an edit function available.
6. You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we'll use instead. We'll learn about these when we talk about clustering.

What do you notice about the scientific names?
* repeats +1
* typos or spelling differences +1

Exercise
1. Using faceting, find out how many years are represented in the census.
2. Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
3. Which years have the most and least observations?

Answers:
1. 26
2. text
3. most: 1997 +1+1+1; least: 1977 +1

https://stackoverflow.com/help/how-to-ask
https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/

Speed check, please use + (plus sign) below
okay: ++++++
go-faster: +
go-slower:

Metadata: https://figshare.com/articles/dataset/Portal_Project_Teaching_Database/1314459

FAIR and CARE principles
https://www.go-fair.org/fair-principles/
https://www.gida-global.org/care
https://www.scilifelab.se/event/fairpoints/
https://twitter.com/FAIR_Points

https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
https://stackoverflow.com/questions/tagged/openrefine

OpenRefine Limits
https://groups.google.com/g/openrefine/c/-loChQe4CNg/m/eroRAq9_BwAJ

Exercise
1. For a column you transformed to numbers, edit one or two cells, replacing the numbers with text (such as abc) or blank (no number or text). +1+1+1+1 (observation: a "not a valid number" message appears +1+1)
2. Use the pulldown menu to apply a numeric facet to the column you edited. The facet will appear in the left panel. +1+1+1+1+1
3. Notice that there are several checkboxes in this facet: Numeric, Non-numeric, Blank, and Error. Below these are counts of the number of cells in each category. You should see checks for Non-numeric and Blank if you changed some values. +1+1+1
4. Experiment with checking or unchecking these boxes to select subsets of your data. +1

Exercise:
1) sorted and made reordering permanent +1+1+1+1
2) extracted the steps to a plaintext file +1+1+1+1
3) applied those steps +1+1+1+1+1
4) exported project +1+1
5) exported CSV +1+1

More OpenRefine resources:
https://openrefine.org/download.html#:~:text=List%20of%20Extensions
https://github.com/OpenRefine/OpenRefine
https://groups.google.com/g/openrefine
https://datacarpentry.org/OpenRefine-ecology-lesson/06-resources/index.html

OpenRefine-related R packages
https://cran.r-project.org/web/packages/refinr/index.html
https://cran.r-project.org/web/packages/rrefine/vignettes/rrefine-vignette.html

Closing OpenRefine
Windows: To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window. To close this window and ensure OpenRefine exits properly, hold down Control and press C on your keyboard. This will save any last changes to your projects.
MacOS: To exit, close all your OpenRefine browser tabs, go back to the terminal window and press Command and Q to close it down.

Resources for the R lesson
Code: https://datacarpentry.org/R-ecology-lesson/code-handout.R
Data: https://doi.org/10.6084/m9.figshare.1314459

Introduction to R
https://datacarpentry.org/R-ecology-lesson/01-intro-to-r.html

install.packages(c("tidyverse", "hexbin", "patchwork", "RSQLite"))
library(tidyverse)
library(hexbin)
library(patchwork)
library(RSQLite)

getwd()
setwd("where you are saving your data")  # replace with your own path

Suggested project layout:
data-carpentry/
    data/
    raw_data/
    scripts/
    fig/

list.dirs()
dir.create("data")
dir.create("fig")

Getting help:
?mean
??kruskal
browseVignettes()
packageDescription('ggplot2')
sessionInfo()

# Creating an object
weight_kg <- 55

What are the values after each statement in the following?
mass <- 47.5            # mass? 47.5 +1+1
age <- 122              # age? 122 +1+1
mass <- mass * 2.0      # mass? 95 +1
age <- age - 20         # age? 102 +1
mass_index <- mass/age  # mass_index? 0.931 +1
1. We've seen that atomic vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?
   It seems that one type will overrule the others; character seems to be the dominant class.
2. What will happen in each of these examples? (hint: use class() to check the data type of your objects):
   num_char <- c(1, 2, 3, "a")             # character +1+1+1+1+1
   num_logical <- c(1, 2, 3, TRUE)         # numeric +1+1+1+1+1+1
   char_logical <- c("a", "b", "c", TRUE)  # character +1+1+1+1+1
   tricky <- c(1, 2, 3, "4")               # character +1+1+1+1+1
3. Why do you think it happens?
   I'm not sure if the quotes also have something to do with what class R thinks these are. +1
   Maybe it knows that characters are limited in use in functions as opposed to numbers.
4. How many values in combined_logical are "TRUE" (as a character) in the following example (reusing the 2 ..._logicals from above)? one
   combined_logical <- c(num_logical, char_logical)  # character +1
5. You've probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?
   logical -> integer -> numeric -> character

Speed test (put a plus one (+1) next to the statement you agree with):
too slow:
just right: +1+1+1+1+1+1
too fast:

### Challenge (optional)
## * Can you figure out why `"four" > "five"` returns `TRUE`?
## (Character strings are compared alphabetically, character by character:
##  "f" ties with "f", and "o" sorts after "i", so "four" > "five".)

1. Using this vector of heights in inches, create a new vector, heights_no_na, with the NAs removed.
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
> heights_no_na <- heights[!is.na(heights)]
> heights_no_na
 [1] 63 69 60 65 68 61 70 61 59 64 69 63 63 72 65 64 70 63 65

2. Use the function median() to calculate the median of the heights vector.
> median(heights, na.rm = TRUE)
[1] 64

3. Use R to figure out how many people in the set are taller than 67 inches.
> sum(heights > 67, na.rm = TRUE)
[1] 6
(Note: length(heights[heights > 67]) would also count the NAs, because subsetting with a logical vector that contains NA returns NA elements.)

Starting with Data
https://datacarpentry.org/R-ecology-lesson/02-starting-with-data.html

Download data:
download.file(url = "https://ndownloader.figshare.com/files/2292169",
              destfile = "data_raw/portal_data_joined.csv")

Have downloaded data?
Yes: +1+1+1+1+1+1
No:

Based on the output of str(surveys), can you answer the following questions?
* What is the class of the object surveys?
* Create a data.frame (surveys_200) containing only the data in row 200 of the surveys dataset.
  surveys_200 <- surveys[200, ]
  (head(surveys, 200) would instead return the first 200 rows)
* Notice how nrow() gave you the number of rows in a data.frame?
* Use that number to pull out just that last row in the data frame.
* Compare that with what you see as the last row using tail() to make sure it's meeting expectations.
* Pull out that last row using nrow() instead of the row number.
* Create a new data frame (surveys_last) from that last row.
* Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named surveys_middle.
* Combine nrow() with the - notation above to reproduce the behavior of head(surveys), keeping just the first through 6th rows of the surveys dataset.
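The nrow() challenge above can be sketched as follows (one possible set of answers, assuming surveys has been read from data_raw/portal_data_joined.csv):

```r
library(tidyverse)
surveys <- read_csv("data_raw/portal_data_joined.csv")

n <- nrow(surveys)                   # total number of rows

surveys_last <- surveys[n, ]         # last row, via nrow() instead of a literal row number
tail(surveys, 1)                     # compare: should show the same row

surveys_middle <- surveys[n %/% 2, ] # the row in the middle (integer division)

surveys[-(7:n), ]                    # drop rows 7..n, reproducing head(surveys) (rows 1-6)
```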
Make a vector notation of what you're able to complete when done:
YES: (1:3)
NO: (4)

# Challenge
# Change the columns taxa and genus in the surveys data frame into a factor.
# Using the functions you learned before, can you find out...
# How many rabbits were observed?
# How many different genera are in the genus column?

# as of now, to make a clean data frame:
surveys <- read_csv("data_raw/portal_data_joined.csv")

Data Wrangling cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Next, we're going to learn some of the most common dplyr functions:
* select(): subset columns
* filter(): subset rows on conditions
* mutate(): create new columns by using information from other columns
* group_by() and summarize(): create summary statistics on grouped data
* arrange(): sort results
* count(): count discrete values

Challenge
Using pipes, subset the surveys data to include animals collected before 1995 and retain only the columns year, sex, and weight.
surveys %>% filter(year < 1995) %>% select(year, sex, weight) %>% head()
surveys %>%
  filter(year < 1995) %>%
  select(year, sex, weight)

Challenge
Create a new data frame from the surveys data that meets the following criteria: contains only the species_id column and a new column called hindfoot_cm containing the hindfoot_length values (currently in mm) converted to centimeters. In this hindfoot_cm column, there are no NAs and all values are less than 3. Hint: think about how the commands should be ordered to produce this data frame!
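One way to sketch the final challenge (the order matters: mutate() must create hindfoot_cm before it can be filtered on, and select() comes last so the other columns are still available along the way):

```r
surveys_hindfoot_cm <- surveys %>%
  filter(!is.na(hindfoot_length)) %>%             # drop the NAs first
  mutate(hindfoot_cm = hindfoot_length / 10) %>%  # convert mm to cm
  filter(hindfoot_cm < 3) %>%                     # keep only values under 3 cm
  select(species_id, hindfoot_cm)                 # only the requested columns
```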