Welcome to The Carpentries Etherpad! This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; it is not for general-purpose use (for that, try https://etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

----------------------------------------------------------------------------

We will use this collaborative document for chatting, taking notes, and sharing URLs and bits of code.

Data Organization in Spreadsheets for Ecologists

If you haven't already, download the data file we will be working with to your computer: https://ndownloader.figshare.com/files/2252083
Lessons are found at https://datacarpentry.org/spreadsheet-ecology-lesson/04-quality-control.html

Goals for this lesson:
* Implementing best practices for using spreadsheets and data formatting
* Addressing common mistakes
* Handling dates in spreadsheets
* QA/QC features and data manipulation factors
* Exporting data from spreadsheets to other programs

Resources:
Stanford file naming conventions - https://guides.library.stanford.edu/data-best-practices (?)

Instructors:
Carolyn Jackson
Paria

Helpers:
Robert McCowen (he/him), Kansas State University Libraries
Erin Antognoli (she/her), USDA NAL

Students:

Day 1: Spreadsheet lesson / OpenRefine:

Gwendolyn's Notes:
- Specify when characteristics are separated, i.e., two types of data in one spreadsheet column
- Use Find and Replace to remove the dual labeling noted above
- See Text to Columns under the Data tab in Excel to "deconcatenate" a column
- The bar at the top of Excel is known as the "formula bar"
- Dates are very PROBLEMATIC to use; they are not interoperable
- Format dates as a number (YYYY-MM-DD) and format the corresponding cells to display a number instead of a date. Otherwise the dates may reformat depending on your OS.

Exporting your Excel sheet as a file that another program can use:
- No PDF
- Does the program recognize comma-separated values? If not, use a different file type.
- Comma-separated values (CSV) files do not support multiple sheets - save each sheet individually. A multi-sheet workbook is not interoperable, and for the same reason formulas cannot cross multiple sheets.
- LibreOffice follows a standard that is open source

Clean 2013 Data
1. Combined all of the 2013 data sets into one table
2. Created a column for Species and added species values to that column
3. Deleted repetitive header rows

Expressing dates
* Have something standard and notate your date choice - YYYY-MM-DD is a common format
* Can express dates as a number rather than a date format (format the cell as a number rather than a date)
* Can also separate out Year, Month, and Day into their own columns for easier sorting

Accounting for missing information
* Do not use zero - it affects averages and other calculations (leaving the cell blank is preferable to zero or NA)
* Document your choices so they're understood

Day 1: OpenRefine lesson:

Correct file for OpenRefine: https://datacarpentry.org/OpenRefine-ecology-lesson/data/Portal_rodents_19772002_simplified.csv
OpenRefine lessons are available at https://datacarpentry.org/OpenRefine-ecology-lesson/index.html#setup

- Java based - does not work in the Edge browser
- Open-source software
- Does not alter your original data file - creates a new copy/version
- Make sure the whitespace-trimming options are NOT checked on import for this lesson
- Column header arrows > Facet > Text facet will group similar values on the left - you can edit directly in that view, but there is an easier way

Facets
Facets are intended to group together common values, and OpenRefine limits the number of values allowed in a single facet to ensure the software does not perform slowly or run out of memory. If you create a facet with many unique values (for example, a facet on a 'book title' column in a data set that has one row per book), the facet will be very large and may either slow down the application, or OpenRefine will not create it.

Exercise questions:
1. Using faceting, find out how many years are represented in the census? 16
2. Is the column formatted as Number, Date, or Text? Text
   - Everything coming into OpenRefine is brought in as Text - you can change this using the arrow > Edit cells > Common transforms > To number (in this case)
3. Exercise: transform period, plot, and record ID to numbers. Can all columns be transformed to numbers?
   - It changed all of them - what happens to data that is not a number in these cells?
   - Can use a numeric facet to find outliers in values that should be numbers

Can use Undo/Redo to undo changes

Scatterplot is a good way to compare variables
- Finds numeric column values and creates scatterplots from them - click on one to open
- Can click and drag to select specific data within the square

Split the contents of a column
- Edit column > Split into several columns (do not check "Remove this column")
- Used "space" as the delimiter, but it created 6 columns (lots of extra spaces)
- Can remove extra spaces when you import your file

4. Look at the data in the coordinates column and split these values to obtain latitude and longitude. Make sure that the option Guess cell type is checked and that Remove this column is not. What type of data does OpenRefine assign to the new columns? (numbers)
   - Edit cells > Common transforms > Trim leading and trailing whitespace takes away the extra spaces
   - Then when we split the column with a space delimiter, it results in just 2 columns

5. Try to change the name of the scientificName 2 column to species. What problem do you encounter? How can you fix the problem?
   - We already have a species column, so we get an error (unless capitalization is different, and then no error)

Clustering finds similar values so they can be normalized
- First create the facet (Text facet)
- Choose "Cluster" in the facet
- Keying function: metaphone3 > Merge selected & close

Filtering works on a subset of data
- First bring up the text facet > choose "include" on the right as you hover over a value (can choose multiple)
- Arrow > Text filter > type values
- Can select "case sensitive" to narrow down further

6. What scientific names are selected? A text filter on "bai" found 2 different values with 35 records total

Sorting > various options to view content
7. Sort the data by plot. What year(s) were observations recorded for plot 1 in this filtered dataset?
   - Sort by year, month, and day in some order. Be creative: try sorting as numbers or text, and in reverse order (largest to smallest or z to a).

Reconcile data - normalizes the data against a source such as Wikidata or Encyclopedia of Life
- Remove all your sorting first
- Arrow > Reconcile > Start reconciling
- Can add a new service > link for Wikidata is http://wikidata.reconci.link/en/api
- https://eol.org/api/reconciliation (Encyclopedia of Life - for Taxon)
- Use the "country" field
- https://openrefine.org/docs/manual/reconciling
- When choosing the source to reconcile against, you need to have the same amount of granularity

Exporting data
OpenRefine saves the changes you make to data - you can extract the steps and replicate them on another dataset:
- Undo/Redo > Extract > select the steps you want > paste into plain text to use later
- (open the new file, create a project, Undo/Redo > Apply > paste > Perform operations)
- Can export the entire project - creates a zip folder
- Strength of OpenRefine: repeatable steps on similarly structured datasets/files

Splitting the scientificName column:
1. Let us suppose we want to split the scientificName column into separate columns, one for genus and one for species.
2. Click the down arrow next to the scientificName column. Choose Edit column > Split into several columns...
3. In the pop-up, in the Separator box, replace the comma with a space (the box will look empty when you're done).
4. Important! Uncheck the box that says Remove this column.
5. Click OK. You should get some new columns called scientificName 1, scientificName 2, scientificName 3, and scientificName 4.
6. Notice that in some cases these newly created columns are empty (you can check by text faceting the column). Why? What do you think we can do to fix it?

Look at the data in the coordinates column and split these values to obtain latitude and longitude. Make sure that the option Guess cell type is checked and that Remove this column is not. Rename the new columns.

*** In the lesson on clustering there is an error *** In #6, "The text facet of scientificName will update to show the new summary of the column. It will have now 7 options" should say 10 options: https://datacarpentry.org/OpenRefine-ecology-lesson/05-filter-exclude-sort.html

Full documentation on clustering can be found at the OpenRefine Clustering Methods In-depth page of the OpenRefine manual: https://openrefine.org/docs/technical-reference/clustering-in-depth
The OpenRefine manual provides more information about the reconciliation feature.
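The trim-then-split steps from exercise 4 can also be sketched outside OpenRefine. A minimal Python sketch (illustrative only - the sample coordinate strings below are made up, but mimic the lesson's "latitude longitude" layout):

```python
# Sketch of OpenRefine's "Trim leading and trailing whitespace" followed by
# "Split into several columns" on a space delimiter, in plain Python.
# The sample rows are invented for illustration.
rows = ["  32.2833  -109.0833 ", "32.25 -109.08"]

def split_coordinates(value):
    # Trim leading/trailing whitespace (Edit cells > Common transforms > Trim)
    cleaned = value.strip()
    # split() with no argument collapses runs of spaces, so we always get
    # exactly two columns instead of six
    latitude, longitude = cleaned.split()
    # "Guess cell type" in OpenRefine similarly turns the pieces into numbers
    return float(latitude), float(longitude)

print([split_coordinates(r) for r in rows])
# [(32.2833, -109.0833), (32.25, -109.08)]
```

This mirrors why trimming first matters: without it, splitting on a single space produces extra empty columns.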
DAY 2: SQL Lesson
https://datacarpentry.org/sql-ecology-lesson/01-sql-basic-queries.html
Instructor: Robert McCowen

Relational databases
- Spreadsheet: Table
- Columns in a spreadsheet: Fields
- Key: ties each observation to another table - this is what makes a database relational
- SQL: Structured Query Language - pulls out only the data we need
- DB Browser for SQLite - designed to let us open the database, understand the structure, and browse the data

* is a wildcard - will return everything:
SELECT * FROM surveys;

Can name a column instead of using *:
SELECT year FROM surveys;

SELECT DISTINCT gives you the unique values from a field:
SELECT DISTINCT species_id FROM surveys LIMIT 10;
LIMIT returns only the number of rows that follows it - remove it to see all results
Adding more than one field to the DISTINCT query returns all unique combinations - all species that appeared in each year, for instance

Can do math/calculations, but you need to know/understand your dataset:
SELECT year, month, day, ROUND(weight/1000.0, 2) FROM surveys;
- Returns weights converted from grams to kilograms
- In this case the weights are too small to convert as integers - integer division drops everything after the decimal point unless we add a decimal (1000.0) and ROUND

Find specific values in records using WHERE = 'value here':
SELECT * FROM surveys WHERE species_id = 'DM';

Find a range of values using greater than or less than:
SELECT * FROM surveys WHERE year >= 2000;
- Unquoted comparisons like this only work with numbers - text values need quotes

Combine the two - find records from 2000 on where the species ID is DM:
SELECT * FROM surveys WHERE (year >= 2000) AND (species_id = 'DM');

Can nest criteria:
SELECT * FROM surveys WHERE (year >= 2000) AND ((species_id = 'DM') OR (species_id = 'DO'));

Another way to write this query to help group the logic:
SELECT * FROM surveys WHERE (year >= 2000) AND (species_id IN ('DM', 'DO', 'DS'));

Best to go step by step to make sure your individual queries work, and then add more steps after that.
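These queries can also be tried outside DB Browser, for example with Python's built-in sqlite3 module. A minimal sketch (the table and rows below are invented stand-ins for the lesson's surveys table, not the real Portal data):

```python
import sqlite3

# In-memory stand-in for the lesson's surveys table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE surveys (year INTEGER, species_id TEXT, weight REAL)")
conn.executemany(
    "INSERT INTO surveys VALUES (?, ?, ?)",
    [(1999, "DM", 41.0), (2001, "DM", 43.0), (2002, "DO", 52.0), (2001, "NL", 120.0)],
)

# The combined WHERE ... AND ... IN query from the notes above
rows = conn.execute(
    "SELECT * FROM surveys "
    "WHERE (year >= 2000) AND (species_id IN ('DM', 'DO', 'DS'))"
).fetchall()
print(rows)
# [(2001, 'DM', 43.0), (2002, 'DO', 52.0)]
```

Only the post-2000 DM/DO rows survive the filter; the 1999 record and the NL record are excluded, exactly as in DB Browser.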
Can add notes to your query that will not affect it by using -- before each statement to explain the purpose of that portion of the query:
--Get post-2000 data on Dipodomys species
--Data is in the surveys table and we want all matching records
SELECT * FROM surveys
--sampling date is in the year field
WHERE (year >= 2000)
--species ids for Dipodomys are represented as DM, DO, DS
AND (species_id IN ('DM', 'DO', 'DS'));

Can save queries as SQL files - button right above the query tab

Can sort with ORDER BY and ASC / DESC:
SELECT * FROM species ORDER BY taxa ASC, genus ASC, species ASC;

The order of clauses matters in SQL - generally you want to write in this order:
1. SELECT
2. FROM
3. WHERE
4. ORDER BY
Example:
SELECT * FROM species WHERE species = 'sp.' ORDER BY taxa ASC, genus ASC;
You can write a query on a single line, but shouldn't, because it's not as human-readable and understandable. You can also break multiple WHERE conditions down onto separate lines.

COUNT records to find out more about proportions - a good way to learn more about the database contents:
SELECT COUNT(*) FROM surveys;

Returns the total records and the sum of all the weights collected:
SELECT COUNT(*), SUM(weight) FROM surveys;

Final grouped query:
SELECT
--Group by species ID
species_id,
--total number of records
COUNT(*),
--Sum of biomass in kg for each species
ROUND(SUM(weight)/1000.0, 3),
--mean mass in g for each species
ROUND(AVG(weight), 3)
FROM surveys
--ensure calculated fields are grouped
GROUP BY species_id
ORDER BY COUNT(species_id) DESC;

SELECT species_id, COUNT(*)
FROM surveys
GROUP BY species_id
ORDER BY COUNT(species_id) DESC;

Find the highest/largest value of a field:
SELECT MAX(year) AS last_surveyed_year FROM surveys;

Find species IDs with a count greater than 10 - allows you to filter out values too small to be useful:
SELECT species_id, COUNT(species_id)
FROM surveys
GROUP BY species_id
HAVING COUNT(species_id) > 10;

Can limit records by the species field and count specific species:
SELECT species_id, COUNT(species_id)
FROM surveys
WHERE species_id IN ('DM', 'DS', 'DO')
GROUP BY species_id
HAVING COUNT(species_id) > 10;

Create an alias for a column header using an AS statement:
SELECT species_id, COUNT(species_id) AS number_of_observations
FROM surveys
GROUP BY species_id
HAVING COUNT(species_id) > 10;

Limit to counts over 100 and order by the number of observations:
SELECT species_id, COUNT(species_id) AS number_of_observations
FROM surveys
GROUP BY species_id
HAVING COUNT(species_id) > 100
ORDER BY number_of_observations DESC;

Missing data is always noted as NULL

SELECT AVG(weight) AS mean_mass_g
FROM surveys
WHERE year = 2000 AND (month > 4) AND (month < 10) AND species_id = 'PE';

Did not work out well (COUNT(*) counts rows even when weight is NULL):
SELECT SUM(weight), COUNT(*), SUM(weight)/COUNT(*)
FROM surveys
WHERE year = 2000 AND (month > 4) AND (month < 10) AND species_id = 'PE';

Updated to ignore records with NULL values:
SELECT SUM(weight), COUNT(weight), SUM(weight)/COUNT(weight)
FROM surveys
WHERE year = 2000 AND (month > 4) AND (month < 10) AND species_id = 'PE';

!= means NOT equal, but it will not include NULL values:
SELECT COUNT(*)
FROM surveys
WHERE year = 2000 AND (month > 4 AND month < 10) AND (sex != 'M' OR sex IS NULL);

SELECT COUNT(*)
FROM surveys
WHERE year = 2000 AND (month > 4 AND month < 10) AND sex IS NULL;

To replace NULL values and clean the data:
SELECT COALESCE(sex, 'unknown') AS sex_corrected, COUNT(*) AS num_obs
FROM surveys
WHERE year = 2000 AND (month > 4 AND month < 10)
GROUP BY sex_corrected;

JOINs - attach 2 tables by the same field to combine fields from both tables:
SELECT * FROM surveys JOIN species ON surveys.species_id = species.species_id;

Can also use this, which will not duplicate the species_id field:
SELECT * FROM surveys JOIN species USING (species_id);

Join 2 tables to pull in specific fields from each:
SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
FROM surveys
JOIN species USING (species_id);

Unique observations:
SELECT DISTINCT surveys.year, surveys.month, surveys.day, species.genus, species.species
FROM surveys
JOIN species USING (species_id);

Inner join: when you put the data from the tables together, you only get results that exist in both tables (species_id exists in both tables, and anything that doesn't have this value drops from the results)

Outer join (left and right) - allows you to keep NULL values (the count is higher here because it includes NULL values):
SELECT COUNT(*) FROM surveys LEFT OUTER JOIN species USING (species_id);

SELECT species_id, COUNT(*)
FROM surveys
JOIN species USING (species_id)
GROUP BY species_id;

----------------------------------------

DAY 3 - R (part 1)
Instructor: Robert McCowen

SQL lets you ask questions; R lets you import and manipulate data

R is extensible - anyone can write a new package to pull into R to add new capabilities - packages already exist for most functions

Use this code to install packages (you only need to do this once):
install.packages(c("tidyverse", "hexbin", "patchwork", "RSQLite"))

Data files: https://doi.org/10.6084/m9.figshare.1314459

Use the "Create a project" button to create a new project in a new directory:
File > New Project > New Directory > New Project > Browse to your working directory > give your new project a name like Carpentries-Intro-R-Ecology > click the "Create Project" button
- Everything will go to the same place and stay together in this directory

# Use the document + icon to create a new script, or go to File > New File > R Script
# Use the Help tab to get documentation on a function
# Can also use a ? before the function name to get help, example: ?mean

# the assignment operator <- assigns a name to an object - creates a variable, which is called an object in R
weight_kg <- 55
# shortcut for <- is Alt+- (Windows), Option+- (Mac)
# use = in a function, example: mean(x = c(1, 2, 3))

# Definition of a vector: a vector is essentially a list of values, and the simplest data structure in R.
A vector consists of a collection of numbers, arithmetic expressions, logical values, or character strings, for example.

# don't use special characters or spaces in variable names - only . or _ can be used

# create a vector and see the feedback (printed in the console) by putting ( ) around the code
(weight_kg <- 55)
# or you can enter the vector name to see the value
weight_kg

# can do math on vectors
2.2 * weight_kg
body_length_cm <- 4.8
body_length_cm / 2.4
weight_kg <- 57.5

# create a weight_lb object with math on weight_kg
weight_lb <- 2.2 * weight_kg
# what happens to the weight_lb value if we change weight_kg? nothing

# use # to comment - won't run as code and lets you leave notes in your code for future you and others
# run code with Ctrl+Enter (Windows), Command+Enter (Mac)
weight_kg <- 57.5
weight_lb <- 2.2 * weight_kg

# save your script as you go! (e.g., test-script)
# clear your console and environment with the broom icon
# start a new R script with the + document icon

# functions
# mean( ) function
mean(c(1, 2, 3))

# assign an object and use a function at the same time
# square root function sqrt( )
square_root_variable <- sqrt(16)

# functions use arguments
# round(x, digits = 0)
round(3.1415927)  # default output rounds to the whole number
# add an argument to tell it the number of digits to round to
round(3.1415927, 2)  # output is 3.14
# can use argument labels or not if you are entering the values in the default order
# can change the order of the arguments if you use the argument labels
round(digits = 2, x = 3.1415927)  # rounds to 3.14
round(2, 3.1415927)  # sees the value as 2, and rounds it to 3.1415927 digits

# vectors
# create the vector "weights" using the c( ) function
weights <- c(50, 60, 65, 82)
# R indexing starts at 1, meaning the first element of the vector is in position 1, not position 0

# can create vectors with characters
animals <- c("mouse", "rat", "dog")
# note that the arguments are separated by commas in the c( ) function
# can use double or single quotes - just make sure to use the same one throughout the vector

# vectors with multiple items have a length
# use the length( ) function to find the length of a vector
# vectors have a class - a vector can only have one class
# use the class( ) function to find the class of a vector
weights_char <- c("50", "60", "65", "82")
class(weights_char)  # class is character because the numbers are in quotes

# the str( ) function shows the structure of the data - a great first step when importing a dataset!
str(weights)

# objects, variables, and vectors are different, but we often blend the terms together
# we can assign objects, like "Robert", to variables such as first_name
first_name <- "Robert"
# the object is the thing, and the variable is its alias/name

# add values to vectors with c( )
c(weights, 90)

# mixing object types in a vector - R will make assumptions!
c("Robert", "Jerry", TRUE, 87)
str(c("Robert", "Jerry", TRUE, 87))
# R sees this as a character vector and does not keep the other value types
# a list will allow you to mix different kinds of values

# subsetting values in vectors by position
animals[2]  # pulls the second value, "rat", from the vector
# can subset more than one value at a time
animals[2:3]  # pulls "rat" and "dog"
animals[c(3, 2)]
animals_subset <- animals[c(3, 2)]
# reorder (and repeat) values
animals[c(1, 2, 3, 2, 1)]

weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, FALSE, TRUE, TRUE)]  # output: [1] 21 54 55

# adding comparisons < > ==
weight_g[weight_g > 50]  # output: [1] 54 55
weight_g > 50  # output: FALSE FALSE FALSE TRUE TRUE
weight_g[weight_g > 30 & weight_g < 50]  # output: [1] 34 39
weight_g[weight_g < 0]  # output: numeric(0)

# missing data
heights <- c(2, 4, 4, NA, 6)
mean(heights)  # output: NA
max(heights)  # output: NA
min(heights)  # output: NA
# use the !is.na( ) function to remove NAs
heights[!is.na(heights)]  # output: [1] 2 4 4 6
!is.na(heights)  # output: [1] TRUE TRUE TRUE FALSE TRUE

# In the Files tab, create a new folder (click the new folder icon) and call it data_raw
# grab the dataset with the following code and put it in the data_raw folder
download.file(url = "https://ndownloader.figshare.com/files/2292169",
              destfile = "data_raw/portal_data_joined.csv")

# load tidyverse with the library( ) function
library(tidyverse)

# read in the dataset to load it into R with the read_csv( ) function
surveys <- read_csv("data_raw/portal_data_joined.csv")
# click the blue arrow to see the columns/vectors in the dataset, with variable names and data types per vector/column

# inspect the data table - see the first 6 rows with the head( ) function
head(surveys)
# can click the "surveys" name in the Environment pane to open the data table in another tab
# another way to open the data table in another tab:
View(surveys)

# columns in a data table are vectors with names; the values in an individual column have the same data type, but each column can be a different data type from the others

# inspect the data table with str( ) - a great first step when loading your data!
str(surveys)
dim(surveys)  # output is the number of rows and columns

# subsetting by position in a data table
surveys[1, ]  # first row, every column
surveys[, 1]  # first column, every row
surveys[c(1, 2, 3)]  # columns 1, 2, 3, all rows
surveys[-1, ]  # removes the first row, all columns
surveys[c(1:6), ]  # first 6 rows, all columns
surveys[-(1:100), ]  # removes the first 100 rows, all columns

# can call a vector/column by name with dataframeName$columnName
surveys$sex

# factors - replacing the values in the sex column with the results of the factor( ) function
surveys$sex <- factor(surveys$sex)
# In the Environment pane, you can see the sex column now has a Factor with 2 levels
# factors are ordered alphabetically by default
levels(surveys$sex)  # output: [1] "F" "M"
nlevels(surveys$sex)  # output: [1] 2

unordered <- c("low", "low", "medium", "high", "high", "high")
factor(unordered)
factor(unordered, levels = c("low", "medium", "high"))

# use the as.character( ) function to convert back to character
# use as.numeric( ) to convert to numbers
summary(surveys$sex)
plot(surveys$sex)

# example of factoring numerical data
sample_data <- c("0", "1", "2", "2", "2")
factor(sample_data)  # output: [1] 0 1 2 2 2  Levels: 0 1 2
as.numeric(factor(sample_data))  # output: [1] 1 2 3 3 3
# make sure you understand what is happening to your factored data

# dates
# load the lubridate library
library(lubridate)
# the ymd( ) function creates a date object
# assign it to a variable name
my_date <- ymd("2023-06-01")
my_date <- ymd(paste(2023, 6, 1, sep = "-"))
str(my_date)  # output: Date[1:1], format: "2023-06-01"
ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))
# create a date column
surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))
# inspect the new column
summary(surveys$date)

------------------------------

Day 4 - R (part 2)
Instructor: Robert McCowen
https://datacarpentry.org/R-ecology-lesson/03-dplyr.html

Cheat sheets:
https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf
https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf

Open the project from last time - File > Open Project > navigate to your project folder
require(tidyverse) - run this to make sure the package is installed and working (a second run will not return any errors if it's correct)

Open a new script in the source pane through File > New File > R Script, OR the green + button > R Script, OR Ctrl+Shift+N

# Load the data using the read_csv function
surveys <- read_csv("data_raw/portal_data_joined.csv")
Then use Ctrl+Enter to run it from the source pane (regular Enter if in the console)

Working in the source pane allows us to update and edit code and re-run it - useful if we make errors in writing or want to change a variable. Can leave notes using the # symbol.

view(surveys)  # see the table in a new tab in the source pane

# subset columns
surveys$month
surveys[2]  # or: select(surveys, month)
surveys[[2]]
surveys[c("month", "day", "year")]  # or:
select(surveys, month, day, year)
select(surveys, -record_id, -species_id)  # select all except the noted columns - the minus operator skips a column

# Filter based on a column value (== is the symbol for "equal to")
filter(surveys, year == 1995)

# Pipe
surveys2 <- filter(surveys, weight <= 5)
surveys_small <- select(filter(surveys, weight <= 5), species_id, sex, weight)
# this nested line is hard to read - we can use the pipe to fix it
surveys_small2 <- surveys %>% filter(weight <= 5)
# surveys_small2 is the same as surveys2
# Ctrl+Shift+M is the PC shortcut for %>%

# create surveys_small2 by filtering for weight <= 5 g
# and grabbing only the species ID, sex, and weight variables
surveys_small2 <- surveys %>%
  filter(weight <= 5) %>%
  select(species_id, sex, weight)

# Mutate verb
surveys$months_new <- surveys$month + 100  # creates a new column in the surveys dataset called "months_new"
# the mutate version makes it clear this is a mathematical operation - more human readable:
surveys %>%
  mutate(weight_kg = weight / 1000)

# create weight in kg and in pounds in new columns
surveys %>%
  mutate(weight_kg = weight / 1000,
         weight_lb = weight_kg * 2.2)

surveys %>%
  mutate(weight_kg = weight / 1000,
         weight_lb = weight_kg * 2.2) %>%
  head()

# shows 13 variables instead of 14
surveys <- surveys %>% select(-months_new)

# Group by - in this case group by sex and summarize by mean weight (removing NA values)
surveys %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
# OR, for the same result:
surveys %>%
  group_by(sex) %>%
  summarize(mean_weight = weight %>% mean(na.rm = TRUE))

# To separate by species ID and sex
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = weight %>% mean(na.rm = TRUE))

# Find missing values for weight
surveys %>% filter(weight %>% is.na())

surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = weight %>% mean())

# Create multiple output columns by summarizing values
surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = weight %>% mean(),
            min_weight = weight %>% min(),
            max_weight = weight %>% max()) %>%
  arrange(min_weight) %>%
  head()

# Same, but in descending order
surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = weight %>% mean(),
            min_weight = weight %>% min(),
            max_weight = weight %>% max()) %>%
  arrange(desc(min_weight)) %>%
  head()

# How many organisms do we have?
surveys %>% count(sex)

# Counting unique values by group - a clearer way to write this
surveys %>%
  group_by(sex) %>%
  summarize(count_organisms = n())

# Sort
surveys %>% count(sex, sort = TRUE)
# OR
surveys %>%
  group_by(sex) %>%
  summarize(count_organisms = n()) %>%
  arrange(sex)

surveys %>%
  count(sex, species) %>%
  arrange(species, desc(n))

### Reshaping data ###
# pivot_wider and pivot_longer
# tidy data
surveys_gw <- surveys %>%
  filter(!is.na(weight)) %>%
  group_by(plot_id, genus) %>%
  summarize(mean_weight = mean(weight))

# pivot_wider - human-readable data
# for each plot, make a column of mean weight for each genus - easier to read and makes more sense to a person
# no longer tidy data - a row of mean weights, but without the column names we wouldn't know what they represent
surveys_wide <- surveys_gw %>%
  pivot_wider(names_from = genus, values_from = mean_weight)

# pivot_longer - makes a table with genus and mean_weight columns again
surveys_long <- surveys_wide %>%
  pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)

---
View an object in your source pane with view(surveys_gw), or click on it in your Environment
Remove an object from your environment without clearing everything: rm(surveys_gw)
There are no prompts or undos, so be certain when you remove something, because you can't undo it. But you can recreate the content by going back in your source pane and re-running the same code (if it isn't too extensive).
---

# Making graphs in R
# Loading data
# Our data files were not updated, so we used "portal_data_joined.csv" instead of the file indicated in the online lesson
library(tidyverse)
surveys_complete <- read_csv("data_raw/portal_data_joined.csv")

# ggplot2 - creates visualizations
surveys_plot <- ggplot(
  data = surveys_complete,
  mapping = aes(x = weight, y = hindfoot_length)
) +
  geom_point()

# More human readable:
surveys_plot <- ggplot(
  data = surveys_complete,
  mapping = aes(x = weight, y = hindfoot_length)
)
surveys_plot + geom_point()

# Transparency (in theory - Erin's isn't working!)
surveys_plot + geom_point(alpha = 0.1)

# Color (cheat sheet of R colors: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)
surveys_plot + geom_point(alpha = 0.1, color = "purple")
# Note: make sure colors are colorblind-friendly if creating figures for publishing

# Color coded for each species - too many species to make this understandable; we want to simplify it
surveys_plot + geom_point(alpha = 0.1, aes(color = species_id))

surveys_boxplot <- ggplot(
  data = surveys_complete,
  mapping = aes(x = species_id, y = weight)
)
surveys_boxplot +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(alpha = 0.3, color = "purple")
# the box and points overlap - layer order switched from above to make it more readable:
surveys_boxplot +
  geom_jitter(alpha = 0.1, color = "purple") +
  geom_boxplot(outlier.shape = NA)

### Time series data
yearly_counts <- surveys_complete %>% count(year, genus)
timeseries_plot <- ggplot(data = yearly_counts,
                          mapping = aes(x = year, y = n, group = genus))
timeseries_plot + geom_line()

# Another way to write that ^^ that is easier to understand:
surveys_complete %>%
  count(year, genus) %>%
  ggplot(aes(x = year, y = n, group = genus)) +
  geom_line()

# Setting color by genus
surveys_complete %>%
  count(year, genus) %>%
  ggplot(aes(x = year, y = n, color = genus)) +
  geom_line()

# Drop certain genus values from the visualization
surveys_complete %>%
  filter(!(genus %in% c("Rodent", "Lizard", "Sparrow"))) %>%
  count(year, genus) %>%
  ggplot(aes(x = year, y = n, color = genus)) +
  geom_line()

## Faceting - makes a graph for each genus value - easier to compare trends between variables
surveys_complete %>%
  count(year, genus) %>%
  filter(n > 10) %>%
  ggplot(aes(x = year, y = n)) +
  geom_line() +
  facet_wrap(facets = vars(genus))

keep_these <- c("Chaetodipus", "Dipodomys", "Neotoma", "Onychomys",
                "Perognathus", "Peromyscus", "Reithrodontomys")
surveys_complete %>%
  count(year, genus) %>%
  filter(genus %in% keep_these) %>%
  ggplot(aes(x = year, y = n)) +
  geom_line() +
  facet_wrap(facets = vars(genus))

surveys_complete %>%
  count(year, genus, sex) %>%
  filter(genus %in% keep_these) %>%
  ggplot(aes(x = year, y = n, color = sex)) +
  geom_line() +
  facet_wrap(facets = vars(genus))

# Other layout and theme options:
facet_grid(rows = vars(genus))
facet_grid(cols = vars(genus))
theme_light()
theme_bw()
theme_minimal()
theme_void()

Thank you, everyone, for attending the AgNIC Data Carpentries Workshop. Please take a minute to give us some feedback by completing our post-workshop survey at https://carpentries.typeform.com/to/UgVdRQ?slug=2023-05-18-AgNIC_online. I'll also send an email with this link.