Welcome to The Carpentries Etherpad!

This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

 ----------------------------------------------------------------------------
S We will use this collaborative document for chatting, taking notes, and sharing URLs and bits of code. 


Data Organization in Spreadsheets for Ecologist
If you haven't already, we'll be working with this data set, which you will want to download this data file to your computer: https://ndownloader.figshare.com/files/2252083 
Lesson are found at https://datacarpentry.org/spreadsheet-ecology-lesson/04-quality-control.html

Goals for this lesson:

Resources:
Stanford file naming conventions - https://guides.library.stanford.edu/data-best-practices (?)

Instructors:
Carolyn Jackson
Paria

Helpers:
Robert McCowen (he/him), Kansas State University Libraries
Erin Antognoli (she/her), USDA NAL

Students:

Day 1: Spreadhseet lesson / OpenRefine:
Gwendolyn's Notes: 
- specify when characteristics are separated, aka two types of data in one spreadsheet column
- Find and replace to remove dual labeling from above^
See text to columns under the tab settings in excel for "deconcatenate"
Top of excel is known as the "formula bar"
Dates are very problematice. They are not interoperable. PROBLEMATIC to use. 
- Format dates as number (YYYY-MM-DD) and format your corresponding cells to display a number instead of a date conversion. Otherwise it may reformat depending on your OS. 
Exporting your excel sheet as a file that a program can use: 
- no pdf
- Does a program recognize comma separated values? If not, then different filetype. 
- Comma separated values (csv) do not support multiple sheets. Save as individual sheets. The code of mulitple sheets is not interoperable. 
- Libre Office, follows a standard that is open source. 

Clean 2013 Data
1. colbined all of 2013 data sets into one table
2. created acolum for Species and added species to that colum
3. Delete repetitive header rows

Date express 

Accounting for missing information


Day 1: OpenRefine lesson:
Correct file for Open Refine: https://datacarpentry.org/OpenRefine-ecology-lesson/data/Portal_rodents_19772002_simplified.csv
OpenRefine Lessons are available at https://datacarpentry.org/OpenRefine-ecology-lesson/index.html#setup
Java based - does not work in Edge browser
Open source software
Does not alter your original data file - creates a new copy/version

Make sure white space and trim is NOT checked for this lesson
column header arrows > facet > text facet will group similar values on  the left
- can edit directly in the view but there is an easier way
Facets
Facets are intended to group together common values and OpenRefine limits the number of values allowed in a single facet to ensure the software does not perform slowly or run out of memory. If you create a facet where there are many unique values (for example, a facet on a ‘book title’ column in a data set that has one row per book) the facet created will be very large and may either slow down the application, or OpenRefine will not create the facet.

Exercise questions:
1. Using faceting, find out how many years are represented in the census? 16
2. Is the column formatted as Number, Date, or Text? text column
3. Exercise: transform period, plot, record ID to numbers.
Can use Undo/Redo to undo changes

Scatterplot is a good way to compare variables
Split contents of a column
4. Look at the data in the column coordinates and split these values to obtain latitude and longitude. Make sure that the option for Guess cell type is checked and that Remove this column is not.
What type of data OpenRefine assigns to the new colunms? (numbers)

Edit cells > common transforms > Trim leading and trailing white spaces takes away extra spaces
Then when we split columns with space delimiter it results in just 2 columns 

5. Try to change the name of the scientificName 2 column to species. What problem do you encounter? How can you fix the problem?
Clustering finds similar values so they can be normalized
Filtering works on a subset of data

6. What scientific names are selected? text filter "bai" found 2 different values with 35 records total

Sorting > various options to view content

7. Sort the data by plot. What year(s) were observations recorded for plot 1 in this filtered dataset?
- Sort by year, month and day in some order. Be creative: try sorting as numbers or text, and in reverse order (largest to smallest or z to a).

Reconcile data - normalize the data
https://openrefine.org/docs/manual/reconciling
When choosing the source to reconcile against you need to have the same amount of granularity

Exporting data
OpenRefine saves changes to data - can extract steps and replicate the same steps on another dataset
- undo/redo > extract > select steps you want > paste into plain text to use later
(open a new file, create a project, undo/redo > apply > paste > perform operation)
Can export the entire project - creates a zip folder

Strength of OpenRefine - repeatable steps on similarly structured datasets/files

  1. Let us suppose we want to split the scientificName column into separate columns, one for genus and one for species.
2. Click the down arrow next to the scientificName column. Choose Edit Column > Split into several columns...
3. In the pop-up, in the Separator box, replace the comma with a space (the box will look empty when you’re done).
4. important! Uncheck the box that says Remove this column.
5. Click OK. You should get some new columns called scientificName 1, scientificName 2, scientificName 3, and scientificName 4.
6. Notice that in some cases these newly created columns are empty (you can check by text faceting the column). Why? What do you think we can do to fix it?

Look at the data in the column coordinates and split these values to obtain latitude and longitude. Make sure that the option for Guess cell type is checked and that Remove this column is not. Rename the new columns.

*** in the lesson on clustering  there is an error ***
#6 The text facet of scientificName will update to show the new summary of the column. It will have now 7 options should say 10 options https://datacarpentry.org/OpenRefine-ecology-lesson/05-filter-exclude-sort.html

Full documentation on clustering can be found at the OpenRefine Clustering Methods In-depth page of the OpenRefine manual.https://openrefine.org/docs/technical-reference/clustering-in-depth

The OpenRefine manual provides more information about the reconciliation feature.


DAY 2: SQL Lesson
https://datacarpentry.org/sql-ecology-lesson/01-sql-basic-queries.html
Instructor: Robert McCowen
Relational databases
 
Spreadsheet: Table
Columns in a spreadsheet: Fields
Key: ties each observation to another table - this is what makes a database relational
SQL: Structured Query Language - pulls out only the data we need

SQLLite Browswer - designed to let us open the database, understand the structure, browse the database

* is a wildcard - will return everything

can name a column instead of using *

SELECT year
FROM surveys;

SELECT DISTINCT - gives you unique values from a field
LIMIT - returns only the amount that follows this statement - remove this to see all results
Adding more than one field to the DISTINCT query returns all unique combinations

Can do math/calculations, but need to know/understand your dataset
Find specific values in records using WHERE = 'value here'

Find a range of values using greater than or less than
Combine the two - find records after 2000 where species ID is DM

Can nest criteria

Another way to write this query to help group logic:
Best to go step by step to make sure your individual queries work and then add more steps after that.
Can add notes to your query that will not affect the query by using --before each statement to explain the purpose of that portion of the query

Can save queries as SQL files - button right above the query tab

Can sort with ORDER BY and ASC / DSC
It matters what order we execute statements in SQL - generally you want to write in this order:
1. SELECT
2. FROM
3. WHERE
4. ORDER
You can write on a single line but shouldn't because it's not as human readable and understandable 
You can break multiple where species down on separate lines as well

COUNT records to find more about proportions - good way to find out more about the database contents

Returns total records and sum total of all the weights collected:
Final grouped query:
JOINS- attach 2 tables by the same field to combine fields from both tables 
DAY 3 - R (part 1)
Instructor: Robert McCowen

SQL lets you ask questions, R lets you import and manipulate date
R is extensible - meaning anyone can write a new package to pull into R to allow new capabilities 
Use this code to install packages:
install.packages(c("tidyverse", "hexbin", "patchwork", "RSQLite"))
- you only need to do this once

Data files: https://doi.org/10.6084/m9.figshare.1314459

Use "Create a project" button to create a new project in a new directory
File > New Project > New Directory > New Project > Browse to your working directory > give your new project a name like Carpentries-Intro-R-Ecology > Click "Create Project" button
- everything will go to the same place and stay together in this directory
# Use document + icon to create a new script or go to File > New File > R Script

# Use the help tab to get documentation on a function 
# Can also use a ? before the function name to get help, example: ?mean

# assignment operator <- assigns name to an object - creates a variable, which is called an object in R

weight_kg <- 55

# shortcut for <- is alt - (windows), option - (mac)

# use = in a function, example: mean(x = c(1, 2, 3))

# Definition of a vector: A vector is substantially a list of variables, and the simplest data structure in R. A vector consists of a collection of numbers, arithmetic expressions, logical values or character strings for example.

# don't use special characters or spaces in variable names only - or _ can be used

# create a vector and see the feedback (printed in console) with ( ) around the code
(weight_kg <- 55)

# or you can enter the vector name to see the value
weight_kg

# can do math on vectors
2.2 * weight_kg

body_length_cm <- 4.8

body_length_cm / 2.4

weight_kg <- 57.5

# create weight_lb object with math on weight_kg
weight_lb <- 2.2 * weight_kg

# what happens to weight_lb value if we change weight_kg?
# nothing

# use # to comment - won't run as code and lets you leave notes in your code for future you and others

# run code with ctrl enter (windows), command enter (mac)

weight_kg <- 57.5
weight_lb <- 2.2 * weight_kg

# save your script as you go! test-script

# clear your console and environment with the broom icon

# start a new R script with + document icon

# functions
# mean( ) fuction
mean(c(1, 2, 3))

# assign an object and use a fuction at the same time
# square root function sqrt( )
square_root_variable <- sqrt(16)

# functions use arguments 
# round ( x , digits = 0)
round(3.1415927) # default output rounds to the whole number

# add an argument to tell the number of digits to round to
round(3.1415927, 2)  # output is 3.14

# can use argument labels or not if you are entering the values in the default order
# can change the order of the arguments if you use the argument labels
round(digits = 2, x = 3.1415927) # rounds to 3.14
round(2, 3.1415927) # sees the value as 2, and rounds it to 3.1415927 digits

# vectors
# create the vector "weights" using the c( ) function
weights <- c(50, 60, 65, 82)

# R starts with 1, meaning that the first element of the vector is considered in the first place, not the zero place

# can create vectors with characters
animals <- c("mouse", "rat", "dog")

# note that the arguments are separated by commas in the c( ) function
# can use double or single quotes - just make sure to use the same one throughout the vector

# vectors with multiple items have a length
# use length( ) function to find the length of a vector

# vectors have a class - can only have one class per vector
# use class( ) function to find class of vector
weights_char <- c("50", "60", "65", "82") 
class(weights_char) # class is character because numbers are in quotes

# str( ) function to find the structure of the data - great first step when importing a dataset!
str(weights)

# objects, variable, vectors are different, but we often blend them together
# we can assign objects, "Robert" to variables: first_name
first_name<-"Robert"

# object is the thing, and the variable is the alias/name

# add values to vectors with c( )
c(weights, 90)

# mixing object times in a vector - R will make assumptions!
c("Robert", "Jerry", TRUE, 87)

str(c("Robert", "Jerry", TRUE, 87)) # R sees this as a character vector and not see the other value types

# a list will allow you to mix different kinds of values

# subsetting values in vectors by position
animals[2] # will pull the second value "rat" from the vector

# can subset more than on value at a time
animals[2:3] # pulls "rat" and "dog"

animals[c(3,2)]

animals_subset <- animals[c(3,2)]

# reorder values
animals[c(1,2,3,2,1)]

weight_g <- c(21, 34, 39, 54, 55)

weight_g[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
# output: [1] 21 54 55

# adding comparisons < > ==
weight_g[weight_g > 50]
# output: [1] 54 55

weight_g > 50
# output: FALSE FALSE FALSE  TRUE  TRUE

weight_g[weight_g > 30 & weight_g < 50]
# output: [1] 34 39

weight_g[weight_g < 0]
# output: numeric(0)

# missing data
heights <- c(2, 4, 4, NA, 6)

mean(heights) # output NA
max(heights) # output NA
min(heights) # output NA

# use !is.na( ) function to remove NAs
heights[!is.na(heights)]
output: [1] 2 4 4 6

!is.na(heights)
output: [1]  TRUE  TRUE  TRUE FALSE  TRUE

# In Files tab, create a new folder (click the new folder icon) and call it data_raw

# grab the dataset with the following code and put it in the data_raw folder
download.file(url = "https://ndownloader.figshare.com/files/2292169", destfile = "data_raw/portal_data_joined.csv")

# load in tidyverse with the library( ) function
library(tidyverse)

# read in the dataset to load it into R with the read_csv( ) function
surveys <- read_csv("data_raw/portal_data_joined.csv")

# click blue arrow to see columns/vectors in dataset, see variable names and data types per vector/column

# inspect the data table - see the first 6 rows with head( ) function
head(surveys)

# can click "surveys" name in the environment pane to open the data table in another tab

# another way to open the data table in another tab
View(surveys)

# columns in a data table are vectors with names and the values in an individual column have the same data type, but each column can be a different data type from another

# inspect the data table with str( ) - great first step when loading in your data!
str(surveys)

dim(surveys) # output is number of rows and columns

# subsetting by position in a data table
surveys[1, ]  # first row, every column
surveys[, 1]  # first column, every row
surveys[c(1, 2, 3)]  # colums 1, 2, 3, all rows
surveys[-1, ]  # removes the first row, all columns
surveys[c(1:6),] # first 6 rows, all columns
surveys[-(1:100),]  # removed the first 100 rows, all columns

# can call a vector/column by names with dataframeName$columnName 
surveys$sex

# factors - replacing the values in the sex column with the results of the factor( ) function
surveys$sex <- factor(surveys$sex)

# In environment pane, can see the sex column has Factors with 2 levels
# will order the factors in alphbetical order

levels(surveys$sex)
# output: [1] "F" "M"

nlevels(surveys$sex)
# output [1] 2

unordered <- c("low", "low", "medium", "high", "high", "high")

factor(unordered)

factor(unordered, levels = c("low", "medium", "high"))

# use as.character( ) function to return to character
# use as.number( ) to convert to numbers

summary(surveys$sex)

plot(surveys$sex)

# example of factoring numerical data
sample_data <- c("0", "1", "2", "2", "2")
factor(sample_data)
#output: 
[1] 0 1 2 2 2
Levels: 0 1 2

as.numeric(factor(sample_data))
#output: [1] 1 2 3 3 3

# make sure to understand what is happening to your factored data

# dates
# load in lubridate library
library(lubridate)

# ymd( ) function to create a date object

# assign to a variable name
my_date <- ymd("2023-06-01")
my_date <- ymd(paste(2023, 6, 1, sep = "-"))

str(my_date)  # output:  Date[1:1], format: "2023-06-01"

ymd(paste(surveys$year, 
          surveys$month, 
          surveys$day,
          sep = "-"))

# create date column
surveys$date <- ymd(paste(surveys$year, 
                          surveys$month, 
                          surveys$day,
                          sep = "-"))

# inspect the new column
summary(surveys$date)

------------------------------

Day 4 - R (part 2)
Instructor: Robert McCowen

https://datacarpentry.org/R-ecology-lesson/03-dplyr.html

Cheat sheets: 
https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf
https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf

Open project from last time
- File > Open Project > navigate to your folder where the package is

> require(tidyverse)
- do this twice to make sure the package is installed and working (the second time will not give any return errors if it's correct)

Enable the source pane console window through File > New File > R script OR the green + button > R Script
Working in the source pane allows us to update and edit code and re-run

> view(surveys) 

subset columns
> surveys$month
> surveys[2]
> surveys[[2]]

# surveys[c("month", "day", "year")]
> select(surveys, month, day, year)

> select(surveys, -record_id, -species_id)
- select all except the noted columns with minus in front of them

# Filter based on a column value
> filter(surveys, year == 1995)

# Pipe
> surveys2 <- filter(surveys, weight <= 5)
> surveys_small <- select( filter(surveys, weight <= 5), species_id, sex, weight)
> surveys_small2 <- surveys %>% filter(weight <= 5)

# create surveys_small2 by filtering for wght <= 5 g
# and grabbing only species ID, sex, and weight vars
> surveys_small2 <- surveys %>% 

# Mutate verb
> surveys$months_new <- surveys$month + 100




#shows 13 variables instead of 14

# Group by function - in this case group by sex and summarize by mean weight (remove NA values)
# Making graphs in R
# Loading data
# Our data files were not updated so we used "portal_data_joined.csv instead of the file indicated in the online lesson

# GG Plot 2 - creates visualizations 
# More human readable
# Transparency (in theory - Erin's isn't working!)

#Color (cheat sheet of R colors http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)
# Note: make sure colors are colorblind-friendly if creating for publishing

#Color coded for each species - too many species to make this understandable but want to simplify it
  surveys_boxplot <- ggplot(
    data = surveys_complete,
    mapping = aes(x = species_id, y = weight)
  )

  surveys_boxplot +
    geom_boxplot(outlier.shape = NA)  +
    geom_jitter(alpha = 0.3, color = "purple")
    
    #overlap of the bar and points switched fromabove to make it more readable
      surveys_boxplot +
    geom_jitter(alpha = 0.1, color = "purple") +
    geom_boxplot(outlier.shape = NA)  
    
    ### Time series data
    
 yearly_counts <- surveys_complete %>%
  timeseries_plot <- ggplot(data = yearly_counts, 
                            mapping = aes(x = year, y = n, group = genus))
  
  timeseries_plot +
    geom_line()
Another way to write that ^^ that is easier to understand:

  surveys_complete %>%
    count(year, genus) %>%
    ggplot(aes(x = year, y = n, group = genus)) +
    geom_line()
    
    # Setting color by genus
surveys_complete %>%
    count(year, genus) %>%
    ggplot(aes(x = year, y = n, color = genus)) +
    geom_line()
    
    # Drop certain genus values from the visualization
    surveys_complete %>%
    filter(!(genus %in% c("Rodent", "Lizard", "Sparrow"))) %>%
    count(year, genus) %>%
    ggplot(aes(x = year, y = n, color = genus)) +
    geom_line()
    
    ## Faceting - makes a graph for each genus value - easier to compare trends between variables
  surveys_complete %>%
    count(year, genus) %>%
    filter(n > 10) %>%
    ggplot(aes(x = year, y = n)) +
    geom_line() +
    facet_wrap(facet = vars(genus))
    
     
     
      keep_these <- c("Chaetodipus", "Dipodomys", "Neotoma", "Onchomys", "Perognathus", "Peromyscus", "Reithrondontomys")
  
  surveys_complete %>%
    count(year, genus) %>%
    filter(genus %in% keep_these) %>%
    ggplot(aes(x = year, y = n)) +
    geom_line() +
    facet_wrap(facet = vars(genus))

  surveys_complete %>%
    count(year, genus, sex) %>%
    filter(genus %in% keep_these) %>%
    ggplot(aes(x = year, y = n, color = sex)) +
    geom_line() +
    facet_wrap(facet = vars(genus))
    
    facet_grid(rows = vars(genus))
    
    facet_grid(cols = vars(genus))
    
    theme_light()
    theme_bw()
    theme_minimal()
    theme_void()
    
    
Thank you, everyone, for attending the AgNIC Data Carpentries Workshop. Please take a minute to give us some feedback by completing our post-workshop survey at https://carpentries.typeform.com/to/UgVdRQ?slug=2023-05-18-AgNIC_online. I'll also send an email with this link.