Welcome to The Carpentries Etherpad!
This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).
Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/
----------------------------------------------------------------------------
S We will use this collaborative document for chatting, taking notes, and sharing URLs and bits of code.
Data Organization in Spreadsheets for Ecologist
If you haven't already, we'll be working with this data set, which you will want to download this data file to your computer: https://ndownloader.figshare.com/files/2252083
Lesson are found at https://datacarpentry.org/spreadsheet-ecology-lesson/04-quality-control.html
Goals for this lesson:
- Implementing best practices for using spreadsheets and data formatting
- Addressing common mistakes
- Handling dates in spreadsheets
- QA/QC features and data manipulation factors
- Export data from spreadsheets to other programs
Resources:
Stanford file naming conventions - https://guides.library.stanford.edu/data-best-practices (?)
Instructors:
Carolyn Jackson
Paria
Helpers:
Robert McCowen (he/him), Kansas State University Libraries
Erin Antognoli (she/her), USDA NAL
Students:
Day 1: Spreadhseet lesson / OpenRefine:
Gwendolyn's Notes:
- specify when characteristics are separated, aka two types of data in one spreadsheet column
- Find and replace to remove dual labeling from above^
See text to columns under the tab settings in excel for "deconcatenate"
Top of excel is known as the "formula bar"
Dates are very problematice. They are not interoperable. PROBLEMATIC to use.
- Format dates as number (YYYY-MM-DD) and format your corresponding cells to display a number instead of a date conversion. Otherwise it may reformat depending on your OS.
Exporting your excel sheet as a file that a program can use:
- no pdf
- Does a program recognize comma separated values? If not, then different filetype.
- Comma separated values (csv) do not support multiple sheets. Save as individual sheets. The code of mulitple sheets is not interoperable.
- Keep in mind, that since this is a fact, formulas cannot cross mulitple sheets.
- Libre Office, follows a standard that is open source.
Clean 2013 Data
1. colbined all of 2013 data sets into one table
2. created acolum for Species and added species to that colum
3. Delete repetitive header rows
Date express
- Have something standard and notate your date choice - YYYY-MM-DD is a common format for date
- Can express as a number rather than a date format (format cell as a number rather than a date)
- Can also separate out Year, Month, and Day for easier sorting
Accounting for missing information
- Do not use zero - it affects averages and other calculations (leaving blank is preferable to zero or NA)
- Document your choices so it's understood
Day 1: OpenRefine lesson:
Correct file for Open Refine: https://datacarpentry.org/OpenRefine-ecology-lesson/data/Portal_rodents_19772002_simplified.csv
OpenRefine Lessons are available at https://datacarpentry.org/OpenRefine-ecology-lesson/index.html#setup
Java based - does not work in Edge browser
Open source software
Does not alter your original data file - creates a new copy/version
Make sure white space and trim is NOT checked for this lesson
column header arrows > facet > text facet will group similar values on the left
- can edit directly in the view but there is an easier way
Facets
Facets are intended to group together common values and OpenRefine limits the number of values allowed in a single facet to ensure the software does not perform slowly or run out of memory. If you create a facet where there are many unique values (for example, a facet on a ‘book title’ column in a data set that has one row per book) the facet created will be very large and may either slow down the application, or OpenRefine will not create the facet.
Exercise questions:
1. Using faceting, find out how many years are represented in the census? 16
2. Is the column formatted as Number, Date, or Text? text column
- - Everything coming into OpenRefine is brought in as Text
- - can change this by using the arrow > edit cells > common transforms > number (in this case)
3. Exercise: transform period, plot, record ID to numbers.
- Can all columns be transformed to numbers? - it changed all of them
- - what happens to data that is not a number in these cells?
- - can use numeric facet to find outliers in values that should be numbers
Can use Undo/Redo to undo changes
Scatterplot is a good way to compare variables
- - finds numeric column values and creates scatterplots from them - click on one to open
- - Can click and drag to select specific data within the square
Split contents of a column
- - Edit column > split column (do not check "remove this column")
- - Used "space" as a delimiter but it created 6 columns (lots of spaces) - can remove extra spaces when you import your file
4. Look at the data in the column coordinates and split these values to obtain latitude and longitude. Make sure that the option for Guess cell type is checked and that Remove this column is not.
What type of data OpenRefine assigns to the new colunms? (numbers)
Edit cells > common transforms > Trim leading and trailing white spaces takes away extra spaces
Then when we split columns with space delimiter it results in just 2 columns
5. Try to change the name of the scientificName 2 column to species. What problem do you encounter? How can you fix the problem?
- - we already have a species column - we get an error (unless capitalization is different and then no error)
Clustering finds similar values so they can be normalized
- - first create the facet (text facet)
- - Choose "cluster" in the facet
- - Keying function: metaphone3 > merge selected and close
Filtering works on a subset of data
- - first bring up the text facet > choose "include" on the right as you hover over a value (can choose multiple)
- - arrow > text filter > type values
- - can select "case sensitive" to narrow down further
6. What scientific names are selected? text filter "bai" found 2 different values with 35 records total
Sorting > various options to view content
7. Sort the data by plot. What year(s) were observations recorded for plot 1 in this filtered dataset?
- Sort by year, month and day in some order. Be creative: try sorting as numbers or text, and in reverse order (largest to smallest or z to a).
Reconcile data - normalize the data
- - Wikidata and Encyclopedia of Life
- - Remove all your sorting first
- - arrow > reconcile > start reconciling
https://openrefine.org/docs/manual/reconciling
When choosing the source to reconcile against you need to have the same amount of granularity
Exporting data
OpenRefine saves changes to data - can extract steps and replicate the same steps on another dataset
- undo/redo > extract > select steps you want > paste into plain text to use later
(open a new file, create a project, undo/redo > apply > paste > perform operation)
Can export the entire project - creates a zip folder
Strength of OpenRefine - repeatable steps on similarly structured datasets/files
- Let us suppose we want to split the scientificName column into separate columns, one for genus and one for species.
2. Click the down arrow next to the scientificName column. Choose Edit Column > Split into several columns...
3. In the pop-up, in the Separator box, replace the comma with a space (the box will look empty when you’re done).
4. important! Uncheck the box that says Remove this column.
5. Click OK. You should get some new columns called scientificName 1, scientificName 2, scientificName 3, and scientificName 4.
6. Notice that in some cases these newly created columns are empty (you can check by text faceting the column). Why? What do you think we can do to fix it?
Look at the data in the column coordinates and split these values to obtain latitude and longitude. Make sure that the option for Guess cell type is checked and that Remove this column is not. Rename the new columns.
*** in the lesson on clustering there is an error ***
#6 The text facet of scientificName will update to show the new summary of the column. It will have now 7 options should say 10 options https://datacarpentry.org/OpenRefine-ecology-lesson/05-filter-exclude-sort.html
Full documentation on clustering can be found at the OpenRefine Clustering Methods In-depth page of the OpenRefine manual.https://openrefine.org/docs/technical-reference/clustering-in-depth
The OpenRefine manual provides more information about the reconciliation feature.
DAY 2: SQL Lesson
https://datacarpentry.org/sql-ecology-lesson/01-sql-basic-queries.html
Instructor: Robert McCowen
Relational databases
Spreadsheet: Table
Columns in a spreadsheet: Fields
Key: ties each observation to another table - this is what makes a database relational
SQL: Structured Query Language - pulls out only the data we need
SQLLite Browswer - designed to let us open the database, understand the structure, browse the database
* is a wildcard - will return everything
can name a column instead of using *
SELECT year
FROM surveys;
SELECT DISTINCT - gives you unique values from a field
- SELECT DISTINCT species_id
- FROM surveys
- LIMIT 10;
LIMIT - returns only the amount that follows this statement - remove this to see all results
Adding more than one field to the DISTINCT query returns all unique combinations
- - all species that appeared in each year, for instance
Can do math/calculations, but need to know/understand your dataset
- SELECT year, month, day, ROUND(weight/1000.0, 2)
- FROM surveys;
- - returns weights converted from grams to kilograms
- - In this case, weights are too small to convert in integers - drops everything after the decimal point by default unless we add a decimal at the end and ROUND
Find specific values in records using WHERE = 'value here'
- SELECT *
- FROM surveys
- WHERE species_id='DM';
Find a range of values using greater than or less than
- SELECT *
- FROM surveys
- WHERE year >=2000;
- - this only works with integers - if text, use quotes
Combine the two - find records after 2000 where species ID is DM
- SELECT *
- FROM surveys
- WHERE (year >=2000) AND (species_id='DM');
Can nest criteria
- SELECT *
- FROM surveys
- WHERE (year >= 2000) AND ( (species_id= 'DM') OR (species_id= 'DO') );
Another way to write this query to help group logic:
- SELECT *
- FROM surveys
- WHERE (year >= 2000) AND (species_id IN ('DM','DO', 'DS') );
Best to go step by step to make sure your individual queries work and then add more steps after that.
Can add notes to your query that will not affect the query by using --before each statement to explain the purpose of that portion of the query
- --Get post-2000 data on Dipodomys specieas
- -- Data is in the surveys table and we want all matching records
- SELECT *
- FROM surveys
- --sampling date is in the year field
- WHERE (year >= 2000)
- --species ids for Diopomys are represented as DM, DO, DS
- AND (species_id IN ('DM','DO', 'DS') );
Can save queries as SQL files - button right above the query tab
Can sort with ORDER BY and ASC / DSC
- SELECT *
- FROM species
- ORDER BY taxa ASC, genus ASC, species ASC;
It matters what order we execute statements in SQL - generally you want to write in this order:
1. SELECT
2. FROM
3. WHERE
4. ORDER
- Example:
- SELECT *
- FROM species
- WHERE
- ORDER BY
You can write on a single line but shouldn't because it's not as human readable and understandable
You can break multiple where species down on separate lines as well
COUNT records to find more about proportions - good way to find out more about the database contents
- SELECT COUNT(*)
- FROM surveys;
Returns total records and sum total of all the weights collected:
- SELECT COUNT(*), SUM(weight)
- FROM surveys;
Final grouped query:
- SELECT
- --Group by species ID
- species_id,
- --total number of records
- COUNT(*),
- --Sum of biomass in kg for each species
- ROUND( SUM(weight)/1000.0, 3),
- --mean mass in g for each species
- ROUND( AVG(weight), 3)
- FROM surveys
- --ensure calculated fields are grouped
- GROUP BY species_id
- ORDER BY COUNT(species_id) DESC;
- SELECT
- species_id,
- COUNT(*)
- FROM surveys
- GROUP BY species_id
- ORDER BY COUNT(species_id) DESC;
- Find the highest/largest value of a field:
- SELECT MAX(year) AS last_surveyed_year
- FROM surveys;
- Find species ID with count greater than 10 - allows you to filter out values too small to be useful
- SELECT species_id, COUNT(species_ID)
- FROM surveys
- GROUP BY species_id
- HAVING COUNT(species_ID) >10;
- Can limit records by species field and count specific species
- SELECT species_id, COUNT(species_ID)
- FROM surveys
- WHERE species_id IN('DM', 'DS', 'DO')
- GROUP BY species_id
- HAVING COUNT(species_ID) >10;
- Create an alias for a column header using an "as" statement
- SELECT species_id, COUNT(species_ID) AS number_of_observations
- FROM surveys
- GROUP BY species_id
- HAVING COUNT(species_ID) >10;
- Limit to 100 and order by number of observations:
- SELECT species_id, COUNT(species_ID)AS number_of_observations
- FROM surveys
- GROUP BY species_id
- HAVING COUNT(species_ID) >100
- ORDER BY number_of_observations DESC;
- Missing data always noted as NULL
- SELECT AVG(weight) AS mean_mass_g
- FROM surveys
- WHERE
- year = 2000 AND (month > 4) AND (month < 10)
- AND species_id = 'PE';
- Did not work out well:
- SELECT SUM(weight), COUNT(*), SUM(weight)/COUNT(*)
- FROM surveys
- WHERE
- year = 2000 AND (month > 4) AND (month < 10)
- AND species_id = 'PE';
- Updated to ignore records with NULL values:
- SELECT SUM(weight), COUNT(weight), SUM(weight)/COUNT(weight)
- FROM surveys
- WHERE
- year = 2000 AND (month > 4) AND (month < 10)
- AND species_id = 'PE';
- != equals NOT but will not include NULL values -
- SELECT COUNT(*)
- FROM surveys
- WHERE year = 2000 AND (month > 4 AND month < 10)
- AND sex != 'M' OR sex IS NULL;
- SELECT COUNT(*)
- FROM surveys
- WHERE
- year = 2000 AND (month > 4 AND month < 10)
- AND sex IS NULL;
- To replace null values and clean data:
- SELECT COALESCE(sex, 'unknown') AS sex_corrected, COUNT(*) AS num_obs
- FROM surveys
- WHERE
- year = 2000 AND (month > 4 AND month < 10)
- GROUP BY sex_corrected;
JOINS- attach 2 tables by the same field to combine fields from both tables
- SELECT*
- FROM surveys
- JOIN species
- ON surveys.species_id = species.species_id;
- Can also use thsi which will not duplicate the species_id field:
- SELECT*
- FROM surveys
- JOIN species
- USING (species_id);
- Join 2 tables to pull in specific fields from each:
- SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
- FROM surveys
- JOIN species
- USING (species_id);
- Unique observations:
- SELECT DISTINCT surveys.year, surveys.month, surveys.day, species.genus, species.species
- FROM surveys
- JOIN species
- USING (species_id);
- Inner join: when you put the data from the tables together, you only get results that exist in both tables (species_id exists in both tables and anything that doesn't have this value drops from the results)
- Outer join (left and right) - allows you to keep null values (count is higher in this because it includes null values)
- SELECT COUNT(*)
- FROM surveys
- LEFT OUTER JOIN species
- USING (species_id);
- SELECT species_id, COUNT(*)
- FROM surveys
- JOIN species
- USING (species_id)
- GROUP BY species_id;
- ----------------------------------------
DAY 3 - R (part 1)
Instructor: Robert McCowen
SQL lets you ask questions, R lets you import and manipulate date
R is extensible - meaning anyone can write a new package to pull into R to allow new capabilities
- - packages already exist for most functions
Use this code to install packages:
install.packages(c("tidyverse", "hexbin", "patchwork", "RSQLite"))
- you only need to do this once
Data files: https://doi.org/10.6084/m9.figshare.1314459
Use "Create a project" button to create a new project in a new directory
File > New Project > New Directory > New Project > Browse to your working directory > give your new project a name like Carpentries-Intro-R-Ecology > Click "Create Project" button
- everything will go to the same place and stay together in this directory
# Use document + icon to create a new script or go to File > New File > R Script
# Use the help tab to get documentation on a function
# Can also use a ? before the function name to get help, example: ?mean
# assignment operator <- assigns name to an object - creates a variable, which is called an object in R
weight_kg <- 55
# shortcut for <- is alt - (windows), option - (mac)
# use = in a function, example: mean(x = c(1, 2, 3))
# Definition of a vector: A vector is substantially a list of variables, and the simplest data structure in R. A vector consists of a collection of numbers, arithmetic expressions, logical values or character strings for example.
# don't use special characters or spaces in variable names only - or _ can be used
# create a vector and see the feedback (printed in console) with ( ) around the code
(weight_kg <- 55)
# or you can enter the vector name to see the value
weight_kg
# can do math on vectors
2.2 * weight_kg
body_length_cm <- 4.8
body_length_cm / 2.4
weight_kg <- 57.5
# create weight_lb object with math on weight_kg
weight_lb <- 2.2 * weight_kg
# what happens to weight_lb value if we change weight_kg?
# nothing
# use # to comment - won't run as code and lets you leave notes in your code for future you and others
# run code with ctrl enter (windows), command enter (mac)
weight_kg <- 57.5
weight_lb <- 2.2 * weight_kg
# save your script as you go! test-script
# clear your console and environment with the broom icon
# start a new R script with + document icon
# functions
# mean( ) fuction
mean(c(1, 2, 3))
# assign an object and use a fuction at the same time
# square root function sqrt( )
square_root_variable <- sqrt(16)
# functions use arguments
# round ( x , digits = 0)
round(3.1415927) # default output rounds to the whole number
# add an argument to tell the number of digits to round to
round(3.1415927, 2) # output is 3.14
# can use argument labels or not if you are entering the values in the default order
# can change the order of the arguments if you use the argument labels
round(digits = 2, x = 3.1415927) # rounds to 3.14
round(2, 3.1415927) # sees the value as 2, and rounds it to 3.1415927 digits
# vectors
# create the vector "weights" using the c( ) function
weights <- c(50, 60, 65, 82)
# R starts with 1, meaning that the first element of the vector is considered in the first place, not the zero place
# can create vectors with characters
animals <- c("mouse", "rat", "dog")
# note that the arguments are separated by commas in the c( ) function
# can use double or single quotes - just make sure to use the same one throughout the vector
# vectors with multiple items have a length
# use length( ) function to find the length of a vector
# vectors have a class - can only have one class per vector
# use class( ) function to find class of vector
weights_char <- c("50", "60", "65", "82")
class(weights_char) # class is character because numbers are in quotes
# str( ) function to find the structure of the data - great first step when importing a dataset!
str(weights)
# objects, variable, vectors are different, but we often blend them together
# we can assign objects, "Robert" to variables: first_name
first_name<-"Robert"
# object is the thing, and the variable is the alias/name
# add values to vectors with c( )
c(weights, 90)
# mixing object times in a vector - R will make assumptions!
c("Robert", "Jerry", TRUE, 87)
str(c("Robert", "Jerry", TRUE, 87)) # R sees this as a character vector and not see the other value types
# a list will allow you to mix different kinds of values
# subsetting values in vectors by position
animals[2] # will pull the second value "rat" from the vector
# can subset more than on value at a time
animals[2:3] # pulls "rat" and "dog"
animals[c(3,2)]
animals_subset <- animals[c(3,2)]
# reorder values
animals[c(1,2,3,2,1)]
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
# output: [1] 21 54 55
# adding comparisons < > ==
weight_g[weight_g > 50]
# output: [1] 54 55
weight_g > 50
# output: FALSE FALSE FALSE TRUE TRUE
weight_g[weight_g > 30 & weight_g < 50]
# output: [1] 34 39
weight_g[weight_g < 0]
# output: numeric(0)
# missing data
heights <- c(2, 4, 4, NA, 6)
mean(heights) # output NA
max(heights) # output NA
min(heights) # output NA
# use !is.na( ) function to remove NAs
heights[!is.na(heights)]
output: [1] 2 4 4 6
!is.na(heights)
output: [1] TRUE TRUE TRUE FALSE TRUE
# In Files tab, create a new folder (click the new folder icon) and call it data_raw
# grab the dataset with the following code and put it in the data_raw folder
download.file(url = "https://ndownloader.figshare.com/files/2292169", destfile = "data_raw/portal_data_joined.csv")
# load in tidyverse with the library( ) function
library(tidyverse)
# read in the dataset to load it into R with the read_csv( ) function
surveys <- read_csv("data_raw/portal_data_joined.csv")
# click blue arrow to see columns/vectors in dataset, see variable names and data types per vector/column
# inspect the data table - see the first 6 rows with head( ) function
head(surveys)
# can click "surveys" name in the environment pane to open the data table in another tab
# another way to open the data table in another tab
View(surveys)
# columns in a data table are vectors with names and the values in an individual column have the same data type, but each column can be a different data type from another
# inspect the data table with str( ) - great first step when loading in your data!
str(surveys)
dim(surveys) # output is number of rows and columns
# subsetting by position in a data table
surveys[1, ] # first row, every column
surveys[, 1] # first column, every row
surveys[c(1, 2, 3)] # colums 1, 2, 3, all rows
surveys[-1, ] # removes the first row, all columns
surveys[c(1:6),] # first 6 rows, all columns
surveys[-(1:100),] # removed the first 100 rows, all columns
# can call a vector/column by names with dataframeName$columnName
surveys$sex
# factors - replacing the values in the sex column with the results of the factor( ) function
surveys$sex <- factor(surveys$sex)
# In environment pane, can see the sex column has Factors with 2 levels
# will order the factors in alphbetical order
levels(surveys$sex)
# output: [1] "F" "M"
nlevels(surveys$sex)
# output [1] 2
unordered <- c("low", "low", "medium", "high", "high", "high")
factor(unordered)
factor(unordered, levels = c("low", "medium", "high"))
# use as.character( ) function to return to character
# use as.number( ) to convert to numbers
summary(surveys$sex)
plot(surveys$sex)
# example of factoring numerical data
sample_data <- c("0", "1", "2", "2", "2")
factor(sample_data)
#output:
[1] 0 1 2 2 2
Levels: 0 1 2
as.numeric(factor(sample_data))
#output: [1] 1 2 3 3 3
# make sure to understand what is happening to your factored data
# dates
# load in lubridate library
library(lubridate)
# ymd( ) function to create a date object
# assign to a variable name
my_date <- ymd("2023-06-01")
my_date <- ymd(paste(2023, 6, 1, sep = "-"))
str(my_date) # output: Date[1:1], format: "2023-06-01"
ymd(paste(surveys$year,
surveys$month,
surveys$day,
sep = "-"))
# create date column
surveys$date <- ymd(paste(surveys$year,
surveys$month,
surveys$day,
sep = "-"))
# inspect the new column
summary(surveys$date)
------------------------------
Day 4 - R (part 2)
Instructor: Robert McCowen
https://datacarpentry.org/R-ecology-lesson/03-dplyr.html
Cheat sheets:
https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf
https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf
Open project from last time
- File > Open Project > navigate to your folder where the package is
> require(tidyverse)
- do this twice to make sure the package is installed and working (the second time will not give any return errors if it's correct)
Enable the source pane console window through File > New File > R script OR the green + button > R Script
- Ctrl+Shft+N
- read_csv - Load data using read_csv function
- > surveys <- read_csv("data_raw/portal_data_joined.csv")
- then use contol/enter if in the top window (regular enter if in the console window)
Working in the source pane allows us to update and edit code and re-run
- useful if we make errors in writing or to change another variable
- can leave notes using the # symbol
> view(surveys)
- to see the table in a new tab in the source pane
subset columns
> surveys$month
> surveys[2]
- or > select(surveys, month)
> surveys[[2]]
# surveys[c("month", "day", "year")]
> select(surveys, month, day, year)
> select(surveys, -record_id, -species_id)
- select all except the noted columns with minus in front of them
# Filter based on a column value
> filter(surveys, year == 1995)
- == is the symbol for equal to
# Pipe
> surveys2 <- filter(surveys, weight <= 5)
> surveys_small <- select( filter(surveys, weight <= 5), species_id, sex, weight)
- hard to read this line, can use the pipe to fix this
> surveys_small2 <- surveys %>% filter(weight <= 5)
- surveys_small2 is the same as surveys2
#ctrl + shift + m (PC shortcut for %>%)
# create surveys_small2 by filtering for wght <= 5 g
# and grabbing only species ID, sex, and weight vars
> surveys_small2 <- surveys %>%
- > filter(weight <= 5) %>%
- > select(species_id, sex, weight)
# Mutate verb
> surveys$months_new <- surveys$month + 100
- creates a new column in the surveys dataset called "months_new"
- # surveys$months_new <- surveys$month + 100
- > surveys %>%
- > mutate(weight_kg = weight / 1000)
- makes it clear this is a mathematical equation, more human readable
- #create a weight in kg and pounds in new columbs
- > surveys %>%
- > mutate(weight_kg = weight / 1000,
- > weight_lb = weight_kg * 2.2)
- head()
- > surveys %>%
- > mutate(weight_kg = weight / 1000,
- > weight_lb = weight_kg * 2.2) %>%
#shows 13 variables instead of 14
- > surveys <- surveys %>% select(-months_new)
# Group by function - in this case group by sex and summarize by mean weight (remove NA values)
- > surveys %>%
- > group_by(sex) %>%
- > summarize(mean_weight = mean(weight, na.rm = TRUE))
- OR for the same value and result
- > surveys %>%
- > group_by(sex) %>%
- > summarize(mean_weight = weight %>% mean(na.rm = TRUE))
- # To separate by species ID and sex
- > surveys %>%
- > group_by(sex, species_id) %>%
- > summarize(mean_weight = weight %>% mean(na.rm = TRUE))
- # Find missing values for weight
- > surveys %>% filter(weight %>% is.na())
- > surveys %>%
- > filter(!is.na(weight)) %>%
- > group_by(sex, species_id) %>%
- > summarize(mean_weight = weight %>% mean())
- #Create multiple output columns by summarizing values
- surveys %>%
- filter(!is.na(weight)) %>%
- group_by(sex, species_id) %>%
- summarize(mean_weight = weight %>% mean(),
- min_weight = weight %>% min(),
- max_weight = weight %>% max()) %>%
- arrange(min_weight) %>%
- head()
- # Same function but descending order
- surveys %>%
- filter(!is.na(weight)) %>%
- group_by(sex, species_id) %>%
- summarize(mean_weight = weight %>% mean(),
- min_weight = weight %>% min(),
- max_weight = weight %>% max()) %>%
- arrange(desc(min_weight)) %>%
- head()
- # How many organisms do we have
- surveys %>% count(sex)
- # Group by and summarize by creating a new variable
- surveys %>%
- group_by(sex) %>%
- # Counting unique values - a more clear way to write this
- surveys %>%
- group_by(sex) %>%
- summarize( count_organisms = n())
- # Sort
- surveys %>% count(sex, sort = TRUE)
- OR
- surveys %>%
- group_by(sex) %>%
- summarize( count_organisms = n()) %>%
- arrange(sex)
- surveys %>%
- count(sex, species) %>%
- arrange(species, desc(n))
- ### reshaping data ###
- # Pivot_wider and Pivot_longer
- # tidy data
- surveys_gw <- surveys %>%
- filter(!is.na(weight)) %>%
- group_by(plot_id, genus) %>%
- summarize(mean_weight = mean(weight))
- # pivot_wider
- # human readable data
- # for each plot make a column for mean weight for each genus - easier to read and makes more sense
- # no longer tidy data - a row of mean weights but without column names we don't know what they represent
- surveys_wide <- surveys_gw %>%
- pivot_wider(names_from = genus, values_from = mean_weight)
- # makes a table with genus and mean weight columns
- surveys_long <- surveys_wide %>%
- pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)
- ---
- View an object in your source pane
- > view(surveys_gw)
- or click on the content in your Environment
- Remove an object from your environment without clearing everything:
- > rm(surveys_gw)
- There are no prompts or undos so be certain when you remove something because you can't undo it
- But can recreate content by going back in your source pane and using the same code (if it isn't too extensive)
- ---
# Making graphs in R
# Loading data
# Our data files were not updated so we used "portal_data_joined.csv instead of the file indicated in the online lesson
- library(tidyverse)
- surveys_complete <- read_csv("data_raw/portal_data_joined.csv")
# GG Plot 2 - creates visualizations
- surveys_plot <- ggplot(
- data = surveys_complete,
-
- mapping = aes(x = weight, y = hindfoot_length)
- ) +
- geom_point()
# More human readable
- surveys_plot <- ggplot(
- data = surveys_complete,
-
- mapping = aes(x = weight, y = hindfoot_length)
- )
- surveys_plot +
- geom_point()
# Transparency (in theory - Erin's isn't working!)
- surveys_plot +
- geom_point(alpha = 0.1)
#Color (cheat sheet of R colors http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)
- surveys_plot +
- geom_point(alpha = 0.1, color = "purple")
# Note: make sure colors are colorblind-friendly if creating for publishing
#Color coded for each species - too many species to make this understandable but want to simplify it
- surveys_plot +
- geom_point(alpha = 0.1, aes(color = species_id))
surveys_boxplot <- ggplot(
data = surveys_complete,
mapping = aes(x = species_id, y = weight)
)
surveys_boxplot +
geom_boxplot(outlier.shape = NA) +
geom_jitter(alpha = 0.3, color = "purple")
#overlap of the bar and points switched fromabove to make it more readable
surveys_boxplot +
geom_jitter(alpha = 0.1, color = "purple") +
geom_boxplot(outlier.shape = NA)
### Time series data
yearly_counts <- surveys_complete %>%
timeseries_plot <- ggplot(data = yearly_counts,
mapping = aes(x = year, y = n, group = genus))
timeseries_plot +
geom_line()
Another way to write that ^^ that is easier to understand:
surveys_complete %>%
count(year, genus) %>%
ggplot(aes(x = year, y = n, group = genus)) +
geom_line()
# Setting color by genus
surveys_complete %>%
count(year, genus) %>%
ggplot(aes(x = year, y = n, color = genus)) +
geom_line()
# Drop certain genus values from the visualization
surveys_complete %>%
filter(!(genus %in% c("Rodent", "Lizard", "Sparrow"))) %>%
count(year, genus) %>%
ggplot(aes(x = year, y = n, color = genus)) +
geom_line()
## Faceting - makes a graph for each genus value - easier to compare trends between variables
surveys_complete %>%
count(year, genus) %>%
filter(n > 10) %>%
ggplot(aes(x = year, y = n)) +
geom_line() +
facet_wrap(facet = vars(genus))
keep_these <- c("Chaetodipus", "Dipodomys", "Neotoma", "Onchomys", "Perognathus", "Peromyscus", "Reithrondontomys")
surveys_complete %>%
count(year, genus) %>%
filter(genus %in% keep_these) %>%
ggplot(aes(x = year, y = n)) +
geom_line() +
facet_wrap(facet = vars(genus))
surveys_complete %>%
count(year, genus, sex) %>%
filter(genus %in% keep_these) %>%
ggplot(aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(facet = vars(genus))
facet_grid(rows = vars(genus))
facet_grid(cols = vars(genus))
theme_light()
theme_bw()
theme_minimal()
theme_void()
Thank you, everyone, for attending the AgNIC Data Carpentries Workshop. Please take a minute to give us some feedback by completing our post-workshop survey at https://carpentries.typeform.com/to/UgVdRQ?slug=2023-05-18-AgNIC_online. I'll also send an email with this link.