Data Carpentry Ecology
The Jackson Laboratory
9am-12pm Tuesday - Friday, Dec 7-10

Workshop website: https://smcclatchy.github.io/2021-12-07-jackson/
Lesson website: https://datacarpentry.org/ecology-workshop/
Etherpad: https://pad.carpentries.org/2021-12-07-jackson

Welcome to Data Carpentry Ecology
We will use this Etherpad to share links and snippets of code, take notes, ask and answer questions, and whatever else comes to mind. The page displays a screen with three major parts:
* The left side holds today's notes: please edit these as we go along and take notes.
* The top right side shows the names of users who are logged in: please add your name and pick the color that best reflects your mood and personality.
* The bottom right is a real-time chat window for asking questions of the instructor and your fellow learners.

Instructors
Mary-Francis LaPorte, UC Davis
Kurt Showmaker, The Jackson Laboratory
Annajiat Alim Rasel, BRAC University

Helpers
Maddie Gastonguay, JAX, chicken noodle soup!
Meryl Brodsky, UT Austin
Sue McClatchy, The Jackson Laboratory

Please sign in for day 1 below (name, institution, favorite food).
DAN CORTES, JAX, ENCHILADAS
Terrence Turner, Regeneron, Steak
DIVYA JAIN, TENNESSEE STATE UNIVERSITY, PIZZA
CARLOS LEON, JAX, DRUNKEN NOODLES
Katie Janeczko, JAX, Chicken Wings
Maheshwor Thapa, JAX
Mimi Byun, UNT, pasta
Maria Laurent-Rice, Stillman College, seafood-conch
Faith Onomeh, Stillman College, Jollof Rice and Chicken.
Kurt Showmaker (Helper), JAX, beef tips with crawfish sauce
Mary-Francis LaPorte, UC Davis
Meryl Brodsky (Helper), UT Austin, Spaghetti & Marinara
Jiayuan Shi, JAX
Qianru Yu, JAX
Sarah Seneff, Stillman College, pasta
Kennedy-MyCal Davis, Stillman College, Chicken Tenders
LaDarien Rancher, Stillman College, Seafood Bread
Rob Pazdro, UGA, sauerkraut pizza
Annajiat Alim Rasel, Brac University, milk and honey

Data Organization in Spreadsheets
( https://datacarpentry.org/spreadsheet-ecology-lesson/ )

Resources
* Head First Excel: https://www.amazon.com/Head-First-Excel-learners-spreadsheets/dp/0596807694/
* LibreOffice
* Digital lab notebooks: https://www.repro4everyone.org/resources/intro-fsbyt-sxk8z

Exercise
* How many people have used spreadsheets in their research?
* How many people have accidentally done something that made them frustrated or sad?

Exercise
We're going to take a messy version of the survey data and describe how we would clean it up.
1. Download the data by clicking here to get it from FigShare: https://ndownloader.figshare.com/files/2252083
2. Open up the data in a spreadsheet program.
3. You can see that there are two tabs. Two field assistants conducted the surveys, one in 2013 and one in 2014, and they both kept track of the data in their own way in tabs 2013 and 2014 of the dataset, respectively. Now you're the person in charge of this project and you want to be able to start analyzing the data.
4. Identify what is wrong with this spreadsheet. Also think about the steps you would need to take to clean up the 2013 and 2014 tabs, and to put them all together in one spreadsheet.

Important: Do not forget our first piece of advice: create a new file (or tab) for the cleaned data, and never modify your original (raw) data.

After you go through this exercise, we'll discuss as a group what was wrong with this data and how you would fix it.
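Once the 2013 and 2014 tabs have been cleaned and exported as CSV files, they can be stacked into a single table in R (the language used later in this workshop). A minimal sketch, assuming hypothetical file names for the cleaned exports:

```r
library(tidyverse)

# Hypothetical file names for the cleaned-up tabs, exported as CSV
surveys_2013 <- read_csv("data/surveys_2013_clean.csv")
surveys_2014 <- read_csv("data/surveys_2014_clean.csv")

# Stack the two years into one data frame; columns match up by name,
# so both tabs need identical column headers after cleaning
surveys_all <- bind_rows(surveys_2013, surveys_2014)
write_csv(surveys_all, "data/surveys_combined.csv")
```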
Excel Exercise findings:
* comments after data in a column +1 +1
* the 2013 sheet could have easily combined the three tables by adding a column for species +1 +1 +1
* unit format for weight should be consistent, and comments kept separate; too much empty space; column headers should have _ between words +1 (better not to use spaces between words; R won't like that much)
* Date not consistently written +1 +1
* species & sex should be split into 2 columns in plot 4 +1
* Date formats are not consistent on sheet 2014 +1
* Data sets organized by species on one tab, by plot on the other
* eliminate letters in the weight category

Exercise
Challenge: pulling month, day and year out of dates
* Let's create a tab called dates in our data spreadsheet and copy the 'plot 3' table from the 2014 tab (that contains the problematic dates).
* Let's extract month, day and year from the dates in the Date collected column into new columns. For this we can use the following built-in Excel functions:
  * YEAR()
  * MONTH()
  * DAY()
* (Make sure the new columns are formatted as a number and not as a date.)

You can see that even though we expected the year to be 2014, the year is actually 2015. What happened here is that the field assistant who collected the data for year 2014 initially forgot to include their data for 'plot 3' in this dataset. They came back in 2015 to add the missing data into the dataset and entered the dates for 'plot 3' without the year. Excel automatically interpreted the year as 2015, the year the data was entered into the spreadsheet, not the year the data was collected. Thereby, the spreadsheet program introduced an error into the dataset without the field assistant realising.

Please download: https://github.com/datacarpentry/spreadsheet-ecology-lesson/blob/gh-pages/data/survey_sorting_exercise.xlsx?raw=true

Exercise: Why does sorting this help?
* Some of the entries have the scale grams in the tab and hence are not actually sorted correctly +1
* quickly identify erroneous entries and group them by nature +1
* increase in weight of different entries
* displayed by categories, it shows inconsistencies in entering data

Differences among CSV formats
https://donatstudios.com/CSV-An-Encoding-Nightmare
https://excelribbon.tips.net/T013672_Different_CSV_Formats.html
https://en.wikipedia.org/wiki/Newline#Representation
https://en.wikipedia.org/wiki/Byte_order_mark
https://unicode.org/

For those who had questions about electronic lab notebooks from yesterday's discussion, see this presentation from Reproducibility for Everyone: https://www.repro4everyone.org/resources/intro-fsbyt-sxk8z

Data Cleaning with OpenRefine
( https://datacarpentry.org/OpenRefine-ecology-lesson/ )

Download this data file to your computer: https://ndownloader.figshare.com/files/7823341
Download OpenRefine: https://openrefine.org/download.html

Faceting
Here we will use faceting to look for potential errors in data entry in the scientificName column.
1. Scroll over to the scientificName column.
2. Click the down arrow and choose Facet > Text facet.
3. In the left panel, you'll now see a box containing every unique value in the scientificName column along with a number representing how many times that value occurs in the column.
4. Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
5. Hover the mouse over one of the names in the facet list. You should see that you have an edit function available.
6. You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we'll use instead. We'll learn about these when we talk about clustering.

What do you notice about the scientific names?
* repeats +1
* typos or spelling differences +1

Exercise
1. Using faceting, find out how many years are represented in the census.
2. Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
3. Which years have the most and least observations?

Answers:
1. 26
2. text
3. most: 1997 +1+1+1; least: 1977 +1

https://stackoverflow.com/help/how-to-ask
https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/

Speed check, please use + (plus sign) below
okay: ++++++
go-faster: +
go-slower:

Metadata: https://figshare.com/articles/dataset/Portal_Project_Teaching_Database/1314459

FAIR and CARE principles
https://www.go-fair.org/fair-principles/
https://www.gida-global.org/care
https://www.scilifelab.se/event/fairpoints/
https://twitter.com/FAIR_Points

https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
https://stackoverflow.com/questions/tagged/openrefine

OpenRefine Limits
https://groups.google.com/g/openrefine/c/-loChQe4CNg/m/eroRAq9_BwAJ

Exercise
1. For a column you transformed to numbers, edit one or two cells, replacing the numbers with text (such as abc) or blank (no number or text). +1+1+1+1 (observation: a "not a valid number" message appears +1+1)
2. Use the pulldown menu to apply a numeric facet to the column you edited. The facet will appear in the left panel. +1+1+1+1+1
3. Notice that there are several checkboxes in this facet: Numeric, Non-numeric, Blank, and Error. Below these are counts of the number of cells in each category. You should see checks for Non-numeric and Blank if you changed some values. +1+1+1
4. Experiment with checking or unchecking these boxes to select subsets of your data. +1

Exercise:
1) sorted and made reordering permanent +1+1+1+1
2) extracted the steps to a plaintext file +1+1+1+1
3) applied those steps +1+1+1+1+1
4) exported project +1+1
5) exported CSV +1+1

More OpenRefine resources:
https://openrefine.org/download.html#:~:text=List%20of%20Extensions
https://github.com/OpenRefine/OpenRefine
https://groups.google.com/g/openrefine
https://datacarpentry.org/OpenRefine-ecology-lesson/06-resources/index.html

OpenRefine-related R packages
https://cran.r-project.org/web/packages/refinr/index.html
https://cran.r-project.org/web/packages/rrefine/vignettes/rrefine-vignette.html

Closing OpenRefine
Windows: To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window. To close this window and ensure OpenRefine exits properly, hold down Control and press C on your keyboard. This will save any last changes to your projects.
MacOS: To exit, close all your OpenRefine browser tabs, go back to the terminal window and press Command and Q to close it down.

Resources for the R lesson
Code: https://datacarpentry.org/R-ecology-lesson/code-handout.R
Data: https://doi.org/10.6084/m9.figshare.1314459

Introduction to R
https://datacarpentry.org/R-ecology-lesson/01-intro-to-r.html

install.packages(c("tidyverse", "hexbin", "patchwork", "RSQLite"))
library(tidyverse)
library(hexbin)
library(patchwork)
library(RSQLite)

getwd()
setwd("where you are saving your data")  # replace with your own path

Suggested project layout:
data-carpentry/
    data/
    raw_data/
    scripts/
    fig/

list.dirs()
dir.create("data")
dir.create("fig")

Getting help:
?mean
??kruskal
browseVignettes()
packageDescription('ggplot2')
sessionInfo()

# Creating an object
weight_kg <- 55

What are the values after each statement in the following?
mass <- 47.5            # mass? 47.5 +1+1
age <- 122              # age? 122 +1+1
mass <- mass * 2.0      # mass? 95 +1
age <- age - 20         # age? 102 +1
mass_index <- mass/age  # mass_index? 0.931 +1
1. We've seen that atomic vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?
   It seems that one type will overrule the others; character seems to be the dominant class.
2. What will happen in each of these examples? (hint: use class() to check the data type of your objects):
   num_char <- c(1, 2, 3, "a")             # character +1+1+1+1+1
   num_logical <- c(1, 2, 3, TRUE)         # numeric +1+1+1+1+1+1
   char_logical <- c("a", "b", "c", TRUE)  # character +1+1+1+1+1
   tricky <- c(1, 2, 3, "4")               # character +1+1+1+1+1
3. Why do you think it happens?
   I'm not sure if the quotes also have something to do with what class R thinks these are. +1
   Maybe it knows that characters are limited in use in functions as opposed to numbers.
4. How many values in combined_logical are "TRUE" (as a character) in the following example (reusing the 2 ..._logicals from above)? one
   combined_logical <- c(num_logical, char_logical)  # character +1
5. You've probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?
   logical -> integer -> numeric -> character

Speed test (put a plus one (+1) next to the statement you agree with):
too slow:
just right: +1+1+1+1+1+1
too fast:

### Challenge (optional)
## * Can you figure out why `"four" > "five"` returns `TRUE`?
## (Character strings are compared alphabetically, character by character:
##  "f" ties with "f", and "o" sorts after "i", so "four" > "five".)

1. Using this vector of heights in inches, create a new vector, heights_no_na, with the NAs removed.
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
> heights_no_na <- heights[!is.na(heights)]
> heights_no_na
 [1] 63 69 60 65 68 61 70 61 59 64 69 63 63 72 65 64 70 63 65

2. Use the function median() to calculate the median of the heights vector.
> median(heights, na.rm = TRUE)
[1] 64

3. Use R to figure out how many people in the set are taller than 67 inches.
> sum(heights > 67, na.rm = TRUE)
[1] 6
(Note: length(heights[heights > 67]) would also count the NAs, because subsetting with a logical vector that contains NA returns NA elements.)

Starting with Data
https://datacarpentry.org/R-ecology-lesson/02-starting-with-data.html

Download data:
download.file(url = "https://ndownloader.figshare.com/files/2292169",
              destfile = "data_raw/portal_data_joined.csv")

Have downloaded data?
Yes: +1+1+1+1+1+1
No:

Based on the output of str(surveys), can you answer the following questions?
* What is the class of the object surveys?
* Create a data.frame (surveys_200) containing only the data in row 200 of the surveys dataset.
  surveys_200 <- surveys[200, ]
  (head(surveys, 200) would instead return the first 200 rows)
* Notice how nrow() gave you the number of rows in a data.frame?
* Use that number to pull out just that last row in the data frame.
* Compare that with what you see as the last row using tail() to make sure it's meeting expectations.
* Pull out that last row using nrow() instead of the row number.
* Create a new data frame (surveys_last) from that last row.
* Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named surveys_middle.
* Combine nrow() with the - notation above to reproduce the behavior of head(surveys), keeping just the first through 6th rows of the surveys dataset.
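The nrow() challenge above can be sketched as follows (one possible set of answers, assuming surveys has been read from data_raw/portal_data_joined.csv):

```r
library(tidyverse)
surveys <- read_csv("data_raw/portal_data_joined.csv")

n <- nrow(surveys)                   # total number of rows

surveys_last <- surveys[n, ]         # last row, via nrow() instead of a literal row number
tail(surveys, 1)                     # compare: should show the same row

surveys_middle <- surveys[n %/% 2, ] # the row in the middle (integer division)

surveys[-(7:n), ]                    # drop rows 7..n, reproducing head(surveys) (rows 1-6)
```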
Make a vector notation of what you're able to complete when done:
YES: (1:3)
NO: (4)

# Challenge
# Change the columns taxa and genus in the surveys data frame into a factor.
# Using the functions you learned before, can you find out...
# How many rabbits were observed?
# How many different genera are in the genus column?

# as of now, to make a clean data frame:
surveys <- read_csv("data_raw/portal_data_joined.csv")

Data Wrangling cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Next, we're going to learn some of the most common dplyr functions:
* select(): subset columns
* filter(): subset rows on conditions
* mutate(): create new columns by using information from other columns
* group_by() and summarize(): create summary statistics on grouped data
* arrange(): sort results
* count(): count discrete values

Challenge
Using pipes, subset the surveys data to include animals collected before 1995 and retain only the columns year, sex, and weight.
surveys %>% filter(year < 1995) %>% select(year, sex, weight) %>% head()
surveys %>%
  filter(year < 1995) %>%
  select(year, sex, weight)

Challenge
Create a new data frame from the surveys data that meets the following criteria: contains only the species_id column and a new column called hindfoot_cm containing the hindfoot_length values (currently in mm) converted to centimeters. In this hindfoot_cm column, there are no NAs and all values are less than 3. Hint: think about how the commands should be ordered to produce this data frame!
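One way to sketch the final challenge (the order matters: mutate() must create hindfoot_cm before it can be filtered on, and select() comes last so the other columns are still available along the way):

```r
surveys_hindfoot_cm <- surveys %>%
  filter(!is.na(hindfoot_length)) %>%             # drop the NAs first
  mutate(hindfoot_cm = hindfoot_length / 10) %>%  # convert mm to cm
  filter(hindfoot_cm < 3) %>%                     # keep only values under 3 cm
  select(species_id, hindfoot_cm)                 # only the requested columns
```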