Welcome to The Carpentries Etherpad! This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

Links to CMU Libraries trainings and events:
* Workshop Series: https://www.library.cmu.edu/about/publications/news/upcoming-workshops
* https://cmu.libcal.com/calendar/events?cid=9148&t=d&d=0000-00-00&cal=9148&inc=0
* Open Science & Data Collaborations Program: https://www.library.cmu.edu/datapub/open-science#:~:text=This%20program%20provides%20services%20and,as%20research%20consultants%20and%20collaborators.
----------------------------------------------------------------------------
Please sign in with your name, [institution], [department], [position] (faculty, staff, undergrad, graduate student, post-doc, etc.) and [pronouns]:
* Sarah Young, Carnegie Mellon University, University Libraries, faculty, she/her
* Emma Slayton, Carnegie Mellon University, University Libraries, Faculty, She/Her
* Hannah Gunderman, Carnegie Mellon University, University Libraries, Faculty + Helper, She/they
* Tanaya Seth, Carnegie Mellon University, Helper, She/Her
* Luling Huang, Carnegie Mellon University, Helper, He/His
* Ann Marie Mesco, Carnegie Mellon University Libraries, Digitization/Data Deposits, Staff, She/Her - Helper
* Dominic Bordelon, University of Pittsburgh Libraries - Digital Scholarship Services, librarian, helper, he/him
* Chi-Hsiang Wang, CSIRO Australia, Research Scientist, he/his, co-instructor
* Rohit Goswami, University of Iceland, Doctoral Researcher, he/his, co-instructor
* Julia Poepping, Carnegie Mellon University, Information Systems, Assoc Director Partnership Development, Special Faculty, She/Her
* Melanie Gainey, Carnegie Mellon University Libraries, Faculty, she/her
* Kuldeep Singh, DDMLab, CMU, Research Associate, staff, he/his
* Aaron Detwiler, CMU, Software Engineering Institute, Project Manager, he/his
* Keela Thomson, CMU, Social and Decision Sciences; Psychology, PhD student, she/her
* Ziyang Li, CMU, Computational Biology, graduate student, she/her
* John Chin, CMU, Center for International Relations and Politics, CIRP Research Coordinator, He/His
* Matthew Ko, CMU, Heinz College MSPPM-DC, Graduate Student, he/his
* Sien Tam, CMU, Computational Biology, Graduate Student, she/her
* Mandy Lanyon, CMU, Social and Decision Sciences, Research Associate II/Lab Manager, she/her
* Nobuyuki Fukawa, Missouri University of Science and Technology, faculty in marketing
* AJ Liu, CMU, Tepper, undergrad student, she/her
* Jordyn Gilliard, CMU, Grad student, International Relations, she/her
* Sabrina Tsai, CMU, Computational Biology, Graduate Student, she/her
* Ruhi Naik, CMU, Computational Biology, Graduate Student, she/her
* Divakaran Liginlal, Dietrich College, he/his
* Arun Thomas, CMU, Grad Student, Information Systems
* Kafui Godzi, CMU, Heinz College MSPPM, Graduate Student, She/Her

Lessons:

Ecology Lesson: https://datacarpentry.org/spreadsheet-ecology-lesson/
 - Data source for Excel lesson: https://ndownloader.figshare.com/files/2252083
 - Excel sheet: https://github.com/datacarpentry/spreadsheet-ecology-lesson/blob/gh-pages/data/survey_sorting_exercise.xlsx?raw=true

OpenRefine Lesson: https://datacarpentry.org/OpenRefine-ecology-lesson/
 - Data source for OpenRefine: https://ndownloader.figshare.com/files/7823341
 - Issues with the software: download it from https://openrefine.org/download.html, and to open the interface itself, point your browser at http://127.0.0.1:3333
 - Lesson practice 1: https://datacarpentry.org/OpenRefine-ecology-lesson/01-working-with-openrefine/index.html
 - Lesson practice 2: https://datacarpentry.org/OpenRefine-ecology-lesson/02-filter-exclude-sort/index.html
 - Lesson practice 3: https://datacarpentry.org/OpenRefine-ecology-lesson/03-numbers/index.html

Introduction to R Lesson: https://datacarpentry.org/R-ecology-lesson/index.html

To set up (https://datacarpentry.org/R-ecology-lesson/00-before-we-start.html):
1. Start RStudio.
2. Under the File menu, click on New Project. Choose New Directory, then New Project.
3. Enter a name for this new folder (or "directory"), and choose a convenient location for it. This will be your working directory for the rest of the day (e.g., ~/data-carpentry).
4. Click on Create Project.
5. Download the code handout, place it in your working directory and rename it (e.g., data-carpentry-script.R).
6. (Optional) Set Preferences to 'Never' save workspace in RStudio.

Lesson Materials: https://datacarpentry.org/R-ecology-lesson/01-intro-to-r.html
Code Handout for R: https://datacarpentry.org/R-ecology-lesson/code-handout.R
Manipulating Data in R: https://datacarpentry.org/R-ecology-lesson/03-dplyr.html

--------------------------------------------------------------------------------------------------
*DAY 1
Feel free to put questions here! Best for questions that don't need an immediate answer, but might be useful for others to know.
* Q: Where can I find the information/lessons we are working on?
    * A: See Lessons above ^
* Q: now()-today() gave me 1/0/1900 13:59. I didn't get a number like you have shown?
    * Addressed in Zoom room
* Q: Do we have access to the PPT presentation (the one the instructor is using) somewhere? Thanks for your help!
    * I believe we will be able to share the PPT after this session, as well as a recording of the lecture.
    * JK, here it is: https://github.com/datacarpentry/spreadsheet-ecology-lesson/blob/gh-pages/data/survey_sorting_exercise.xlsx?raw=true
    * Also, https://datacarpentry.org/spreadsheet-ecology-lesson/
* Q: One of the things that pushed me to sign up for this workshop was that I have some large data sets that crash Excel when I try to use Excel for cleaning. I was under the impression that I can/should do the cleaning in R to get around this, but it sounds like the recommendation is actually to tidy in Excel before putting the data in R. Any tips for how to effectively handle large data in Excel, or alternate spreadsheet programs that are better equipped for large files? Thank you!! [[follow up: thanks!!! -Keela]]
    * A: In fact, just about everything Sichong presented here could also be tidied up in R (i.e., with the R package suite 'tidyverse'). I think we will learn some of the functions on day 3.
* Q: Neither of the links that were sent would open OpenRefine. Please advise.
    * Hmm. Try https://openrefine.org/download.html to download OpenRefine. Does that work? If you have downloaded the program, it should open within a browser.
    * It gave me this message: “openrefine-mac-3.4.1.dmg” can’t be opened because it is from an unidentified developer.
    * Hold down the Ctrl key as you open it (or right-click the OpenRefine logo and click 'Open').
    * I think it is finally opening. I feel very behind at this point though.
    * All the things she has been doing are in the lesson: https://datacarpentry.org/OpenRefine-ecology-lesson/01-working-with-openrefine/index.html It should have the instructions to open the data and do a facet. And if you have any questions specific to the workshop, please feel free to ask! We can try to get you caught up.
    * I have it downloaded, but how do I open the file? Sorry.
    * No worries, hopefully Sarah can help.
    * I think it is opening now. Thank you!
* Q: How can you change the data type of the whole column?
    * We can change a column by going to its drop-down menu, then Edit cells > Common transforms > To number (for example).
* Q: Is it okay to leave numbers as text? Is this something we should do with all number fields?
    * Hannah G. likes to make sure all numbers are also coded as number fields, just in case. But just a personal pref - anyone else have other thoughts?
    * Advantages are the ability to use a numeric facet to find numbers within a range, and smaller file size, but with the potential loss of data.
    * Depends on how you will work with that data, either in OpenRefine or other tools. If you are computing averages, means and medians, or making scatter plots, having it as a numeric field may be helpful. Still, for end purposes it might not make sense to apply those analyses to the data, in which case it might not be worth changing it to numeric.
    * Another use case for transforming text columns into numeric columns: if you need to apply a "scatterplot facet," you need to have at least two numeric columns.
* Q: For ngram fingerprint, do people typically play around with the size?
    * If you are not getting a lot of matches, you can for sure play around with it!
* Q: Are there any situations where it would not work to use underscores in variable names or the data itself? Or is that pretty universally safe? [[Thank you!]]
    * I would say they are universally safe. That is true across a lot of code types. In fact, in file names or data it is often safe to have underscores, I think (may be personal preference).
* Q: For data cleaning purposes, what would be the difference between Excel and OpenRefine? What are some unique features only available in OpenRefine, but not in Excel? Thank you!
    * Great question, I will ask Sarah to expand on this.
    * In addition to the JSON script and open source nature of OpenRefine, I think the merging algorithms are a really powerful feature not available in Excel.

Questions about R
* Q: So an .R file is equivalent to a .do file in STATA?
    * Yes!
* Q: I have a little yellow bar saying "Packages DBI, dplyr, and 3 others are required but not installed". Should I install them for this workshop?
    * We will install them later in the lesson
* Q: I missed it at the end--what did Debby save and how?
    * You can press the little disk/save icon that is right above the script. Or, you can use the keyboard shortcut Ctrl-S (S for Save).

*DAY 2
Put questions here! Best for questions that don't need an immediate answer, but might be useful for others to know.
* Q: If we have two functions with the same name but in different packages, which one will run?
    * It will depend on the order in which they are loaded: one will "mask" the other, so a base function gets overshadowed by anything with the same name loaded with library().
* Q: Do we need to specify the package name when calling it?
    * You can use ::, which specifies which package to use.
    * So concretely, suppose there is a library("fakeread") which has a function read.csv (normally defined in utils). Now running read.csv will call fakeread::read.csv. We can still use the "masked" version by being explicit: utils::read.csv
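The masking answer above can be tried out without installing anything. This is only a rough sketch: "fakeread" is not a real package, so here a function defined in the global environment stands in for it (an assumption made for illustration), and the small CSV text is made up.

```r
# Stand-in for the hypothetical "fakeread" package: a read.csv defined in the
# global environment masks utils::read.csv, much like a later-loaded package would.
read.csv <- function(...) {
  "fake read.csv was called"
}

# The unqualified name finds the masking definition first.
masked_result <- read.csv("anything.csv")

# The package-qualified name still reaches the original function.
# (Made-up toy data, passed as text so no file is needed.)
csv_text <- "species,count\nrat,3\nduck,5"
real_result <- utils::read.csv(text = csv_text)

print(masked_result)
print(real_result)

# Remove the stand-in so read.csv refers to utils::read.csv again.
rm(read.csv)
```

With real packages the same idea applies: R prints a message about masked objects when a package is attached, and pkg::function always picks a specific version.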
* Q: Not sure I understand the results of %in%? It returned False for Rat and True for Duck?
    * No, it compares each element of the first vector to the second. So animals is c("mouse", "rat", "dog") and of these only the second is present in c("rat", "duck", "cat"). So this is essentially:
        * Is "mouse" anywhere in c("rat","duck","cat")? FALSE
        * Is "rat" anywhere in c("rat","duck","cat")? TRUE
        * Is "dog" anywhere in c("rat","duck","cat")? FALSE
    * Hence c(FALSE, TRUE, FALSE) was the output
* Q: Should we update packages to be current?
    * Recommended practice says yes!
* Q: When I try to create a new R Markdown file, it says I need to update the rmarkdown package. When I do this, it isn't working though.
    * Answered in Zoom
* Q: I missed how Chi-Hsiang opened the R Markdown file (the pretty, formatted version) to compare with what the code looks like [[I don't think I have the correct vocabulary to describe what I mean, but I am asking not about how to create a new blank R Markdown file, but instead how to get to the static, non-code version of it. For a moment, Chi-Hsiang compared the "code" version to what it looks like as the "pretty" static formatted version with charts and stuff]]
    * That would be the Knit button at the top of the R Markdown window. So once you have saved your R Markdown file, you can hit "Knit" and it will create a nice document with all of the formatting.
    * Oh, yes, that was it! Thanks!
* Q: Also, when installing tidyverse, why was it written in quotes? -- install.packages("tidyverse") instead of install.packages(tidyverse)
    * Without quotes, it is not a string and R would look for an object called tidyverse, which doesn't exist
    * The "" describe a string, because install.packages looks up that string on CRAN, the R package repository
    * Yes, you need to have install.packages("tidyverse") in quotation marks. When we load the library you do not need the quotation marks (see the reason above).
* Q: Does # define the heading size?
    * Yes, Kuldeep! Based on the number of # characters, it will define the size of the text
* Q: Getting:
    * Error in download.file(url = "https://ndownloader.figshare.com/files/2292169", :
    * cannot open destfile '../data_raw/portal_data_joined.csv', reason 'No such file or directory'
    * This may have to do with your directory line. You need to make sure you are specifying the folder where the download should go, and that the destination folder structure actually exists.
    * Okay, I took off a . to make it ./data_raw/portal_data_joined.csv and it seems to have gone through
    * Note that . means the current directory, and .. means the directory above this
        * .. # If you are one level below
        * . # You are here
        * data_raw # This is the folder you are putting things in
    * On a related note, it is often easier to navigate with your file browser and copy the absolute path
* Q: Sorry, I missed something else! Why does it say Ctrl + Alt + i on line 22-- what does that do? [[thanks!!]]
    * That is a shortcut for "Insert chunk (Sweave and Knitr)". On a Mac, that is Cmd+Option+I
    * R includes a powerful and flexible system (Sweave) for creating dynamic reports and reproducible research using LaTeX. Sweave enables the embedding of R code within LaTeX documents to generate a PDF file that includes narrative and analysis, graphics, code, and the results of computations.
    * https://support.rstudio.com/hc/en-us/articles/200552056-Using-Sweave-and-knitr
* Q: Can you explain the $ between survey and taxa/genus?
    * $ indicates the column of interest, so survey$genus is calling the genus column of the survey dataset.
* Q: I'm not always clear when and why I'm using the bottom left vs bottom right window.
    * Bottom Left: Console, which is used to see what comes out after executing code. You can also type in R code (after > ) and let the code be executed here. Use case: To quickly test a line/block of code.
    * Bottom Right: Can be used to:
        * Install/update packages
        * See what's in the current working directory (i.e., what files are there in the folder?)
        * Read documentation
        * View plots
    * Upper left (if this is what you meant): This is the place where you write the R code (can edit as a text file). If you execute a line of code here, the results will be shown in the console (bottom left). When and why: Edit and save the .R file for record.

## Questions To Be Answered

What is the output of as.numeric(sexs)? Why?

```code
# sex: a factor built from character values, with default level ordering
sex <- as.factor(c("male", "female", "female", "male"))
# sexs: the same values, but with the level order given explicitly
sexs <- factor(sex, levels = c("male", "female"))
```

https://jamboard.google.com/d/1m8iWyzJXNsPf0q9iD2qGY3E2tVYal1AS2NN73ggcb00/edit?usp=sharing

Again (factors)? + +
Need time?

### Thinking through the Challenge

1. Change the columns taxa and genus in the surveys data frame into a factor.
2. Using the functions you learned before, can you find out...
    * How many rabbits were observed?
    * How many different genera are in the genus column?
        * answer: 26? I did:
            * genus_factor <- as.factor(surveys$genus)
            * nlevels(genus_factor) +++
            * summary(genus_factor)
* What do we need?
    * Where are the rabbits? `names` `unique`
    * How many rabbits? `factor` `summary` somehow?

Is composition clear?
- Yes
- No
- Again YES

*Challenge
Using pipes, subset the surveys data to include animals collected before 1995 and retain only the columns year, sex, and weight.

### Thinking through the Challenge

1. Select the columns
2. Filter them

Step by step, with intermediate variables:

    selectedCols <- select(surveys, year, weight, sex)
    filteredCols <- filter(selectedCols, year < 1995)

The same with pipes:

    surveys %>% select(year, weight, sex) %>% filter(year < 1995)

## 2 Pros and 2 cons for each

# Why does mutate not seem to affect the dataset?
- Does the fn only store the result like pseudo data? +++++

* Q: I see the instructor use the console (bottom box) and the top box in RStudio. What is the difference between these two boxes? Do you use the top box just to define data?
    - They're connected in the sense that they use the same R session
    - So the finer point is that the R script needs to be *sourced* (or its lines run) for its code to enter the R console

Script File: https://filepush.co/scZS/portalTest.R

*DAY 3
* Did anyone get a blank plot when they ran the current command? I have weight and hindfoot_length showing, but nothing in it.
    --- Did you run the ggplot command or the one with a geom_?
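A small sketch of the blank-plot situation asked about above. It assumes the ggplot2 package is installed, and it uses a made-up toy data frame (toy_surveys is a hypothetical name, not the workshop's surveys_complete) so it runs on its own.

```r
# Assumes ggplot2 is installed; loading it (or the whole tidyverse) is what
# makes the ggplot() function available.
library(ggplot2)

# Made-up toy values, only to have something to plot.
toy_surveys <- data.frame(
  weight = c(40, 48, 29, 46, 36),
  hindfoot_length = c(35, 37, 33, 36, 34)
)

# ggplot() alone sets up the axes but has no layer to draw: a blank plot.
blank_plot <- ggplot(data = toy_surveys,
                     mapping = aes(x = weight, y = hindfoot_length))

# Adding a geom_ layer is what actually draws the data.
point_plot <- blank_plot + geom_point()
```

Typing blank_plot in the console should show only the labelled axes, while point_plot shows the points as well.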
* I keep getting this message: Error in ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) : could not find function "ggplot"
    --- Load the package first: library("tidyverse")

*List of R Resources
https://datacarpentry.org/R-ecology-lesson/

*Tidyverse
- Tidyverse Documentation: https://tidyverse.org
- See all the packages: https://www.tidyverse.org/packages/
- R For Data Science (Tidyverse!): https://r4ds.had.co.nz/
- Cheatsheets: https://rstudio.com/resources/cheatsheets/

*Graphics
- R Graph Gallery: www.r-graph-gallery.com
- From Data to Viz: https://www.data-to-viz.com
- R Graphics: https://r-graphics.org/
- Data Visualization: A Practical Introduction: https://socviz.co/
- Shiny (interactive visualization): https://shiny.rstudio.com

*Statistical Modeling
- "ISLR": http://faculty.marshall.usc.edu/gareth-james/ISL/
- "ESL": https://web.stanford.edu/~hastie/ElemStatLearn/
- Packages: Basic regression is built in. lme4 (mixed modeling), afex (ANOVA)
- Tidymodels: tidymodels.org
- Teacups, Giraffes, and Statistics: https://tinystats.github.io/teacups-giraffes-and-statistics

*Other Topics
- Bookdown (books, many of which are about R): https://bookdown.org
- Advanced R: https://adv-r.hadley.nz/ (for programmers)
- R Markdown:
    - R Markdown Workshop: https://ulyngs.github.io/rmarkdown-workshop-2019/