Welcome to The Carpentries Etherpad!

This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

*Workshop Links:
*Post workshop Survey:  https://ucsb.co1.qualtrics.com/jfe/form/SV_9AJwIjSYRmQFutM
RStudio Image:  https://carpentryworkshop.lsit.ucsb.edu/
- you may need to do an extra install for palmerpenguins with install.packages("palmerpenguins")

Workshop Website: https://ucsbcarpentry.github.io/2022-07-26-ucsb-R/
Workshop Zoom: https://ucsb.zoom.us/j/83786617934?pwd=c0VlRFB1WEZ6UXlFeVpWQ1hhM0g3dz09

Workshop Lessons; https://ucsbcarpentry.github.io/CustomDC-R/
daily survey: https://forms.gle/cKHHpR9dW1WFo2Yb6

Instructor Scripts
Manipulating, Analyzing, and Exporting Data Part 1:  https://github.com/UCSBCarpentry/2022-07-26-ucsb-R/blob/gh-pages/code/dplyr_pt_1.R
Manipulating, Analyzing, and Exporting Data Part 2:  https://github.com/UCSBCarpentry/2022-07-26-ucsb-R/blob/gh-pages/code/dplyr_pt_2.R
Time Series Visualization:  https://github.com/UCSBCarpentry/2022-07-26-ucsb-R/blob/gh-pages/code/vis_time_series.R
 ----------------------------------------------------------------------------
*Introductions
Name, ucsb/other institution affiliation, favorite icecream

Christopher Kibler, UCSB Geography, dark chocolate
Dave Hunter, Westmont College Math/CS, Cookie Dough
Emily Fox, UCSB Sociology, mint chip
Laurie Van De Werfhorst, UCSB Bren School, vanilla w/cherries
Angie Torrico, UCSB undergrad: environmental studies, strawberry
Amanda Maheras | UCSB Molecular, Cellular and Developmental Biology PhD Student |  Black Cherry
Edwin Gao, UCSB PSTAT & Econ, strawberry
Mackenzie Taradalsky, UCSB Econ & Philosophy, mint chip
Pippa Lin, UCSB Statistics and Probability Undergrad, white chocolate
Jennifer Rink | UCSB PSTAT Undergrad | Chocolate Macadamia 
Lucero Torres Ojeda | UCSB Environmental Studies Undergrad | Rocky Road
Gabriel Franco, Geology, raspberry
Lizette Rivera, UCSB Economics 
Denis Lomov, UCSB, 1 year PhD in Political Science, pineapple
Adam Parison/UCSB Classics/Peanut butter 
Yoobin Won, UCSB PSTAT, matcha
Justin Gutierrez, UCSB Biology, peanut butter 
Abhishek Sharma, UCSB Mechanical Engineering, Butterscotch
Matthew Rosen, UCSB Pre-Bio Undergrad, Strawberry
Kaitlyn Deen, UCSB Psych & Brain Sciences, Chocolate with Oreo
Joyce Chen, UCSB math&stats undergrad
Natalia Almanza UCSB economics , cookies n cream 
Madison Avila, Sociology, chocolate  
Fatima Gonzalez, UCSB Psych & Brain Sciences, chocolate
Martha Garcia, UCSB Psych and Brain Sciences undergrad, chocolate
Dong Li UCSB Bren School


*Setup packages - you'll need these for the workshop
library(tidyverse)
library(hexbin)
library(patchwork)
library(RSQLite)

or compile  install.packages(c("tidyverse", "hexbin", "patchwork", "RSQLite"))
- but this may take a while

install.packages("palmerpenguins")
library(palmerpenguins)

For Pipes
ctrl-shift-m on linux/windows
cmd-shit-m on mac

Dataframe = spreadsheet form workable in Rstudio/R. We will be working with these during th workshop 

head(data) will give us the first few rows of a dataset
view(data) will pop up a new tab with the complete dataset

NA values are indication that some data may be missing
is.na to set missing-ness (useful if we want to find NAs or omit them)

summary(data) tabulate all the individual records in the dataset and spit out some summary statistics 

[      ] vs (    ) 
[      ] = used for vectors and dataframes, with dataframes to specify [ rows, columns ]
(      ) = for calling a function in R


*DAY 1 Whiteboard Content:
Data Wrangling 
"Reproducibility"

select() = selects columns, can unselect columns with a minus (-)
filter() =  filters rows

Pipe
%>% = percent sign, greater than, percent sign; you may read it as "and then" or "pipe to"
		* example: select() %>% filter() = select and then filter

mutate() = creates new columns

Order for Challenge:
1. new column (mutate)
2. filter out
3. select columns

*Challenge
Create a new data frame from the penguins data that meets the following criteria: contains only the species column and a new column called flipper_length_cm containing the length of the penguin flipper values (currently in mm) converted to centimeters. In this flipper_length_mm column, there are no NAs and all values are less than 200. Hint: think about how the commands should be ordered to produce this data frame!
*

Here's another R ecology lesson that covers SQL: https://datacarpentry.org/R-ecology-lesson/ 

Dave's script from Tuesday is here: https://math.westmont.edu/dc/episode04a.R


*Day 2 Notes
*
*Review of Day 1
- select, filter, mutate: 
- dealing with NAs

Zoom chat notes:
	* The ! reverses TRUE and FALSE values, so is.na(value) tells you if a value is NA and !is.na(value) tells you if a value is not NA.
	* https://www.rstudio.com/resources/cheatsheets/
	* The dplyr and ggplot cheat sheets cover the topics that we'll be learning today
	* Question: What did summarize do in this analysis?
	* Answer: Summarize made a new data frame instead of adding columns to the old data frame. The new data frame has three rows (one for each observed value of sex).
	* Question: Where did the other 2 penguins go, given that there were 11 penguins with NA values?
	* Answer: They did not have any measurements at all, so we removed them at the beginning. The other nine penguins had some measurements.
	* If you prefer a dark theme, you can change it in Tools > Global Options > Appearance > Editor Theme
	* Question: What does geom_point() do?
	* Answer: It creates a scatter plot with the input parameters in the ggplot() function.
	* Any aes() parameters in the first line of code will automatically be applied to all of the individual layers
	* In this case, adding the color to the geom_point() makes it so only the points change color
	* In theory, you can skip the aes() call in the first line, but then you have to add it to each layer individually
	* It's almost always advisable to set your x and y parameters in the first line
	* There are literally hundreds of parameters in ggplot. The best way to approach ggplot is to learn the structure and theory behind it, and then google the names of the specific parameters you need.

## Challenge:
  
# 1. How many penguins are in each island surveyed?
#   
penguins %>% 
  group_by(island) %>% 
  summarise(num_penguins = n()) %>%
  view()
# 2. Use group_by() and summarize() to find the mean, 
#    min, and max bill length for each species (using species). 
#    Also add the number of observations (hint: see ?n).
# 
# 3. What was the heaviest animal measured in each year? 
#    Return the columns year, island, species, and body_mass_g.

*Dave's script is available here:
*https://math.westmont.edu/dc/episode04b.R

*Visualization using ggplot

penguins_plot <- ggplot(data = penguins_comp, mapping = aes(x = body_mass_g, y = flipper_length_mm)) 


*Challenge:
Scatter plots can be useful exploratory tools for small datasets. For data sets with large numbers of observations, such as the surveys_complete data set, overplotting of points can be a limitation of scatter plots. One strategy for handling such settings is to use hexagonal binning of observations. The plot space is tessellated into hexagons. Each hexagon is assigned a color based on the number of observations that fall within its boundaries. To use hexagonal binning with ggplot2, first install the R package hexbin from CRAN:
*install.packages("hexbin")library(hexbin)
Then use the geom_hex() function:
*penguins_plot +
*  geom_hex()
What are the relative strengths and weaknesses of a hexagonal bin plot compared to a scatter plot? Examine the above scatter plot and compare it with the hexagonal bin plot that you created.


ggplot cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf

*Challenge:
Boxplots are useful summaries, but hide the shape of the distribution. For example, if there is a bimodal distribution, it would not be observed with a boxplot. An alternative to the boxplot is the violin plot (sometimes known as a beanplot), where the shape (of the density of points) is drawn.
Replace the box plot with a violin plot; see geom_violin(). Try making a new plot to explore the distribution of another variable within each species.
Some suggested explorations:
	* Create boxplot for flipper_length_mm. Overlay the boxplot layer on a jitter layer to show actual measurements.
	* Add color to the data points on your boxplot according to the plot from which the observation was located (island). Hint: Check the class for island.


Whiteboard:
Good plots address questions such as ...
DISTRIBUTION
RELATIONSHIP
COMPOSITION

Great book: The Visual Display of Quantitative information, by Tufte

This works:

penguins_raw_subset <- penguins_raw %>%
    select("Island", "Species", "Date Egg") %>%
    filter(year(penguins_raw[["Date Egg"]]) == 2008) %>%
    rename(date_egg = "Date Egg")

daily_counts <- penguins_raw_subset %>%
  count(date_egg, Species)

ggplot(data = daily_counts, aes(x = date_egg, y = n, color = Species)) +
  geom_line()
*
Or better yet:

penguins_raw_subset <- penguins_raw %>%
    select("Island", "Species", "Date Egg") %>%
    rename(date_egg = "Date Egg") %>%
    filter(year(date_egg) == 2008)
    
daily_counts <- penguins_raw_subset %>%
  count(date_egg, Species)

ggplot(data = daily_counts, aes(x = date_egg, y = n, color = Species)) +
  geom_line()