Welcome to The Carpentries Etherpad!
This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org/ ).
Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/
----------------------------------------------------------------------------.
General Links:
Workshop site: https://ucsbcarpentry.github.io/2020-05-29-UCSB-R/
- *contains setup instructions, curriculum, daily schedule, link to this document, workshop survey links
UCSB Carpentry Home Page: https://ucsbcarpentry.github.io/
- *contains list of all upcoming and past workshops
UCSB Carpentry List-serve: https://groups.google.com/a/library.ucsb.edu/forum/#!forum/carpentry
- *sign-in with a ucsb.edu email address and click "join" to be updated about all future carpentry workshops
Workshop links:
May 29th Dataframe for ggplot2 episode: https://drive.google.com/file/d/1vXc6Pr3ujIu4-YNV-W2udX89PgrqOs7y/view?usp=sharing
Setup instructions: https://datacarpentry.org/r-socialsci/setup.html
- *Note: The latest version of R is 4.0.0 (out April 2020). Please use version 3.6.3 for this workshop as there are some issues still being worked out with the workshop's compatibility with v. 4.0.0
Feedback form: https://forms.gle/SGvWoTw1avh2Ueci7
- *please give feedback during breaks and/or at the end of the day
Pre-workshop survey: https://carpentries.typeform.com/to/wi32rS?slug=2020-05-29-UCSB-R
- *please complete before the workshop begins
Post-workshop survey: https://carpentries.typeform.com/to/UgVdRQ?slug=2020-05-29-UCSB-R
- *please complete after the workshop ends
R Resources:
https://www.r-graph-gallery.com/
http://www.cookbook-r.com/
https://www.youtube.com/watch?v=h29g21z0a68 "plotting anything with ggplot"
https://ggplot2.tidyverse.org/reference/ggtheme.html
Workshop Notes:
*answered questions, helpful comments and links from the chat, and exercise questions will be posted here
-----------------------------------------------------------------------------------
How do I rename myself in zoom?
- right-click on your name and a popup will open letting you rename yourself
May 22nd
Question:
How to find easy resources online to assist with R related troubleshooting?
Suggestions:
Slack - Eco Data Science at UCSB (they have a specific channel)
Website: https://support.rstudio.com/hc/en-us/sections/200096803-Troubleshooting
Website: Stack Overflow
Googling your question with "R" as the first word is pretty effective, e.g., "R transpose matrix"
Q: How do I add a new script window in R?
A: Click the green plus sign icon at the top far left and select "R Script" in the dropdown; this opens a new script window. To create a new R project, go to the File menu and select 'New Project'.
You can collapse your panes/sections in R - Jon collapsed his script so we could see the contents of his console instead
To clear your console, use Ctrl+L or the broom ('sweep') button in the top right corner of the console.
Link to download the file https://ndownloader.figshare.com/files/11492171
Q: Are projects transferable between R versions?
A: Projects are RStudio-specific; as long as the RStudio versions are compatible, the projects should be as well. In some cases a function used in your project gets updated, and that might break your code in a later R version.
There may be errors between versions, but projects should be OK overall. Packages loaded in your code might be affected by the R version and throw a warning.
Q: What does "wb" stand for?
A: Write binary: the file is written in binary mode. There are other modes (r, rb, w) in addition to wb.
Q: Error in download.file("https://ndownloader.figshare.com/files/11492171", :
cannot open destfile 'data/SAFI_clean.csv', reason 'Invalid argument'
A: destfile = 'data/SAFI_clean.csv'. You need the equals sign between destfile and the file name.
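A minimal sketch of the full call, for reference. The helper name download_safi is made up for illustration; dir.create() and download.file() are base R. It assumes the "data" folder may not exist yet, which is another common reason the destination file cannot be opened.

```r
# Hypothetical helper: creates the folder first, then downloads into it.
download_safi <- function(dest_dir = "data") {
  # download.file() cannot write into a folder that does not exist
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, "SAFI_clean.csv")
  download.file("https://ndownloader.figshare.com/files/11492171",
                destfile = dest, mode = "wb")
  dest  # return the path that was written
}

# download_safi()  # uncomment to run; needs an internet connection
```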
Q: Why do I get a ton of things downloaded when I install Tidyverse?
A: Those other things you see are the collection of packages in the tidyverse. It includes tidyr, readr, ggplot2 and others; rather than making us download these smaller packages individually, tidyverse bundles them into one collection.
As you begin working with R and Rstudio, this is something to keep in mind: When R or Rstudio gets a new update, this doesn't necessarily mean the packages and libraries we use in R are also updated. Some packages may be updated after R which is also the reason why we were on the fence to ask users to install R 4.0 since it only came out recently. I currently have R 3.5.1 and had no issues with Tidyverse, but from looking at the chat, we know there are issues with some learners in different versions.
Q: Is there a reason you would want to comment in a destructive area like the console? Once you clear the console or relaunch RStudio your “notes” will disappear.
A: No, in general commenting in a transient place like the console is not helpful.
Q: Is there a reason to prefer <- over = as the assignment operator? I was playing around and saw that = works as well
A: We will discuss pipes in a later lesson, but good question.
Use <- for assignment, and = to name arguments inside function calls (where <- does not do the same thing).
(https://stackoverflow.com/questions/1741820/what-are-the-differences-between-and-assignment-operators-in-r)
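A small sketch of the difference, using made-up variable names (a, b, digits_used). It is only meant to show what happens inside a function call, not to cover every case discussed in the Stack Overflow thread.

```r
a <- 3.1415   # the usual assignment operator
b = 3.1415    # also works at the top level
identical(a, b)

# Inside a call, = names an argument and creates no variable:
round(a, digits = 2)

# Inside a call, <- creates a variable as a side effect and passes the
# value by position, which is rarely what you want:
round(a, digits_used <- 2)
digits_used
```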
Tip: arguments in functions
unnamed = mandatory. named = optional.
round(3.1415, 3) vs. round(x = 3.1415, digits = 3)
both work, but explicitly stating the argument allows you to put them in any order. The first example (unnamed) is more common.
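As a quick check of the tip above, the three calls below should give the same result; only the variable names (positional, named, reordered) are made up.

```r
positional <- round(3.1415, 3)              # arguments matched by position
named <- round(x = 3.1415, digits = 3)      # arguments matched by name
reordered <- round(digits = 3, x = 3.1415)  # named arguments in any order

identical(positional, named)
identical(named, reordered)
```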
Tip: use help tab in lower right window to look up functions and their characteristics (i.e. arguments and outputs)
Q: Related to Missing Data - What about blank data in datasets?
A: Blank data should be treated as NA when you read the dataset into R (this will be discussed later in the lesson).
It is often the case that when loading the data you can specify how to deal with empty data cells, either replace with NA or 0 or some other value that makes sense and you choose.
Tip: ! in R indicates logical negation (NOT)
Q: How do we do greater than request/function?
A: > is the greater than operator
Q: What is the shortcut to paste previous lines?
A: Use the up arrow. Press it repeatedly to go further back through previous commands.
Other alternative solutions for the exercise shared by learners:
There are many ways to write code, and still get to the same solution
length(rooms[rooms > 2 & !is.na(rooms)])
sum(rooms[!is.na(rooms)] > 2)
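To try both solutions side by side, here is a sketch with a made-up rooms vector (the values are for illustration only). A third variant using the na.rm argument of sum() is added for comparison.

```r
# Example vector, made up for illustration; NA marks a missing answer
rooms <- c(1, 2, 1, 1, NA, 3, 1, 3, 2, 1, 1, 8, 3, 1, NA, 1)

# Keep the values above 2 that are not missing, then count them
count_by_length <- length(rooms[rooms > 2 & !is.na(rooms)])

# Drop the missing values first, then sum the TRUEs
count_by_sum <- sum(rooms[!is.na(rooms)] > 2)

# Let sum() skip the NAs itself
count_with_na_rm <- sum(rooms > 2, na.rm = TRUE)

c(count_by_length, count_by_sum, count_with_na_rm)
```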
Q: What if it just gives you the same input in the output?
I ran library(tidy verse) and I just got the same thing on the console (note there shouldn't be a space in 'tidyverse')
A: If it doesn't say anything else, it means it was a success.
Could somebody remind me how Li brought up this dialog box?
To load the dataset, click on the File menu in RStudio, then Import Dataset, then choose text. Or double click on the file in the Files pane and choose 'import file' from the context menu.
Q: How to get the dataset loaded into R?
A: Click on the File menu in RStudio, then Import Dataset, then choose text, or double click on the file in the Files pane and choose 'import file' from the context menu.
Tip: to import the text data you can copy the code from the code preview. Once you do that, remember to change the object name to interviews and the null value to NULL in capital letters.
Q: What if my “NULL” substitution isn’t working?
A: You want NAs. You're telling the import function to convert "NULL" strings into NA markers. Just to clarify, the reason why we want NA markers instead of "NULL" strings is that R has built-in support for NAs (e.g., na.rm arguments), but it has no special support for dealing with strings like "NULL".
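A sketch of the idea using base R's read.csv() on a tiny made-up table, so it runs without any file. The column names mimic the SAFI data but the values are invented. With readr, the equivalent argument of read_csv() is na = "NULL".

```r
# Tiny made-up CSV; the text NULL stands for a missing value
csv_text <- "key_ID,village,memb_assoc
1,God,NULL
2,Chirodzo,yes"

# Without telling R, NULL is read as ordinary text
with_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings, the same cell becomes a real NA marker
with_na <- read.csv(text = csv_text, na.strings = "NULL",
                    stringsAsFactors = FALSE)

with_strings$memb_assoc[1]
is.na(with_na$memb_assoc[1])
```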
Tip: There are different shortcuts you can use. To find your keyboard shortcut, hover the mouse over the Run button.
Q. How to return different columns that are not next to each other?
A. For non-consecutive columns you have to list them individually.
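For example, with a small made-up data frame (toy and its columns are invented for illustration), non-consecutive columns can be listed inside c(), either by position or by name:

```r
# Made-up data frame for illustration
toy <- data.frame(id = 1:3,
                  village = c("God", "Ruaca", "Chirodzo"),
                  rooms = c(1, 3, 2),
                  years = c(4, 9, 15))

by_position <- toy[, c(1, 3)]        # first and third column
by_name <- toy[, c("id", "rooms")]   # same columns, listed by name

identical(by_position, by_name)
```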
Exercise 1 (5 mins):
1. Create a data frame (interviews_100) containing only the data in row 100 of the dataset.
interviews_100 <- interviews[100,]
2. Create a data frame (interviews_last) containing only the last row in the data frame using nrow().
interviews_last <- interviews[nrow(interviews), ]
3. Create a data frame (interviews_middle) that extract the row that is in the middle of the data frame.
interviews_middle <- interviews[(nrow(interviews) + 1) / 2, ]
or: interviews_middle <- interviews[floor(nrow(interviews) / 2), ]
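The same indexing pattern can be tried on a made-up five-row data frame (toy is invented here, it is not the workshop data), where the middle row is easy to check by eye:

```r
# Made-up data frame with an odd number of rows
toy <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))

toy_last <- toy[nrow(toy), ]               # last row: row 5
toy_middle <- toy[(nrow(toy) + 1) / 2, ]   # middle row: row 3

toy_last
toy_middle
```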
If you get the error "no existing Lubridate", go to the Packages tab -> Install -> type in "lubridate" (all lowercase).
------------------
May 29th
Q. I cannot access my screen. My entire screen is filled with the instructor's screen share and I can’t get out of that.
A1. Go to 'View Options' at the top of the Zoom window and select 'Exit Full Screen' in the dropdown.
A2. Hit the ESC key and you will escape out of Zoom's full-screen mode.
A3. Command + F to exit the full screen
If you get a warning message on your console, you can ignore it and continue. Error messages have to be fixed before proceeding.
Q. How often do you recommend updating your version of RStudio? How many versions 'behind' do you generally start running into problems?
A1. I think that is going to be totally dependent on what is included in each R update and how often they come out
A2. This depends on the packages you typically use. Versions take the form X.Y.Z: changes in X can break old scripts that you have written, changes in Y might affect syntax, and Z is often used just for bug fixes in very specific areas.
A3. Personally, I'm at R 3.5. The reason I don't update to the newest version the moment it comes out is that the packages may not be updated yet, and fresh releases may have some bugs. Updating once a year to the second most recent version is my way of making sure everything keeps running. On Macs, there are the different operating systems to consider too (High Sierra and above are OK, but I've heard of funky things on Catalina); on my Windows machine I don't need to think about that.
If you get an error Error: 'data/SAFI_clean.cvs' does not exist in current working directory ('c:user/Hunter/Documents/R/Workshop project'), make sure that SAFI_clean.csv is in the data subfolder and that the path you pass to read_csv() matches it exactly (note that the extension is .csv, not .cvs).
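One way to check a path before reading it is file.exists(). The sketch below works in a temporary folder with a made-up one-line file, so it does not touch your project; in your own session you would check "data/SAFI_clean.csv" relative to getwd().

```r
# Build a throwaway "data" folder with a small stand-in file
demo_dir <- file.path(tempdir(), "data")
dir.create(demo_dir, showWarnings = FALSE)
good_path <- file.path(demo_dir, "SAFI_clean.csv")
writeLines("key_ID,village", good_path)

# Same name with the extension mistyped
typo_path <- file.path(demo_dir, "SAFI_clean.cvs")

file.exists(good_path)   # the file is there
file.exists(typo_path)   # the mistyped name is not
list.files(demo_dir)     # shows what the folder really contains
```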
One tip to organize your script is to hit Enter twice, type a hash sign (#), and add notes so that you keep track of your actions.
A tibble is a version of a data frame. When you print a tibble, it does not print the whole data frame, only the first rows and the columns that fit on screen. This is helpful for checking the data you might want to transform and for checking the correct names of the variables.
Q. Does it only work (expression to select) with columns or is it just columns because we specified columns?
A. For selecting rows you have to use the expression
filter(interviews, village == "God")
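To see the difference between picking columns and picking rows, here is a sketch on a made-up data frame (toy and its values are invented; the column names follow the workshop data). It assumes dplyr is installed.

```r
library(dplyr)

# Made-up data for illustration
toy <- data.frame(key_ID = 1:4,
                  village = c("God", "Ruaca", "God", "Chirodzo"),
                  no_membrs = c(3, 7, 10, 4))

cols_only <- select(toy, village, no_membrs)   # select() picks columns
rows_only <- filter(toy, village == "God")     # filter() picks rows

cols_only
rows_only
```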
Advantage of Tidyverse is the operator called Pipe that allows you to string together commands to get a flow of results.
Shortcut for Pipes: %>%
Windows: Ctrl+Shift+M
Mac: Cmd+Shift+M
For full list of shortcuts, please check: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts
Exercise (Using Pipes)
There was an issue of not recognizing the "memb_assoc" that was fixed by switching the order, filtering and then selecting.
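A sketch of why the order matters, on a made-up data frame (toy is invented; it assumes dplyr is installed). Once select() has dropped memb_assoc, a later filter() can no longer see it, so the reversed pipeline fails.

```r
library(dplyr)

# Made-up data for illustration
toy <- data.frame(village = c("God", "Ruaca", "God", "Chirodzo"),
                  memb_assoc = c("yes", "no", "yes", NA),
                  no_membrs = c(3, 7, 10, 4))

# Filter first, while memb_assoc is still present, then select
result <- toy %>%
  filter(memb_assoc == "yes") %>%
  select(village, no_membrs)

# Select first: memb_assoc is gone, so filter() raises an error.
# try() is used here only so the script keeps running.
reversed <- try(
  toy %>%
    select(village, no_membrs) %>%
    filter(memb_assoc == "yes"),
  silent = TRUE
)

result
inherits(reversed, "try-error")
```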
If the tibble isn't popping up, it is because you assigned it to a variable; in that case it won't show up automatically.
Tip: A good practice is to always ungroup() after you used the group_by function. Why?
- Avoids potential unintended errors due to the grouping.
- Makes pipes more readable by explicitly pointing out places where the data is being operated on according to groups.
- Avoids future situations of loading an .Rdata object and not realizing a grouping has been applied.
You can check for grouping using str (structure). It compactly displays the internal structure of an R object. It is a diagnostic function and an alternative to summary.
Tip: Use the RStudio Help menu to find cheatsheets for the packages you are using. They include very helpful info.
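Besides str(), dplyr has helpers that report grouping directly. The sketch below uses a made-up data frame (toy is invented) and assumes dplyr is installed; is_grouped_df() and group_vars() are dplyr functions.

```r
library(dplyr)

# Made-up data for illustration
toy <- data.frame(village = c("God", "Ruaca", "God", "Chirodzo"),
                  no_membrs = c(3, 7, 10, 4))

# The grouping stays attached to the result of mutate()
grouped <- toy %>%
  group_by(village) %>%
  mutate(mean_members = mean(no_membrs))

is_grouped_df(grouped)   # still grouped
group_vars(grouped)      # which columns define the groups

# ungroup() removes the grouping and keeps the data
ungrouped <- grouped %>% ungroup()
is_grouped_df(ungrouped)
```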
Q: Are NA, NULL and empty values the same in R?
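A: Not exactly. When you import a dataset with read_csv(), empty cells are read as NA, so empty values and NA end up as the same thing. NULL is a different thing in R; if your file marks missing values with the text NULL, you have to tell R to treat it as NA.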
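A few base R lines that show how NA, NULL and an empty string behave once they are inside R (the variable names are made up for illustration):

```r
with_na <- c(1, NA, 3)       # NA holds a place in the vector
with_null <- c(1, NULL, 3)   # NULL disappears when combined

length(with_na)
length(with_null)

empty_string <- ""           # an empty string is a value, not NA
is.na(empty_string)

is.na(NA)
is.null(NULL)
is.null(NA)
```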
If you get an error stating that the function “pivot_wider” could not be found, try to load tidyverse again library(tidyverse)
Q: Could the way this table is now set up be potentially problematic depending on how your study interprets an observation?
A: Not necessarily, because there are unique identifiers that can still connect values to the correct observation. There is a ‘group_by(id)’ dependency.
Q: Why are you able to do this using the "plain" interviews data if the items are only separated in the interviews_plotting data?
A:
Code to create interview plotting:
interviews_plotting <- interviews %>%
  ## pivot wider by items_owned
  separate_rows(items_owned, sep = ";") %>%
  mutate(items_owned_logical = TRUE) %>%
  pivot_wider(names_from = items_owned,
              values_from = items_owned_logical,
              values_fill = list(items_owned_logical = FALSE)) %>%
  rename(no_listed_items = `NA`) %>%
  ## pivot wider by months_lack_food
  separate_rows(months_lack_food, sep = ";") %>%
  mutate(months_lack_food_logical = TRUE) %>%
  pivot_wider(names_from = months_lack_food,
              values_from = months_lack_food_logical,
              values_fill = list(months_lack_food_logical = FALSE)) %>%
  ## add some summary columns
  mutate(number_months_lack_food = rowSums(select(., Jan:May))) %>%
  mutate(number_items = rowSums(select(., bicycle:car)))
To download the file for the plotting episode with ggplot
https://drive.google.com/file/d/1vXc6Pr3ujIu4-YNV-W2udX89PgrqOs7y/view?usp=sharing
Make sure to create a folder for data-output first, otherwise you will get an error with the code
Tip: Always make sure the ggplot() call ends with two closing parentheses, one for aes() and one for ggplot(), before running it, otherwise you will get an error
Tip: To keep your syntax organized across several lines, always put the plus (+) at the end of the line before hitting Enter, otherwise you will get an error
"Starting point code" for lesson:
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_point()
Q: Why specifying the color into geom_jitter and not in ggplot?
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_jitter(alpha = 0.5, color = "blue")
A: When you specify color inside the mapping, aes(), ggplot treats it as a grouping variable. So before, when we did aes(color = "blue"), it assigned the group "blue" to all the data, and aes(color = village) gives each village a color. A fixed color for all points goes outside aes(), as in geom_jitter(color = "blue").
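The three variants can be compared on a made-up data frame (toy is invented; the column names follow the workshop data). This is only a sketch and assumes ggplot2 is installed; print each object to see the plot.

```r
library(ggplot2)

# Made-up data for illustration
toy <- data.frame(no_membrs = c(3, 7, 10, 4),
                  number_items = c(2, 5, 8, 3),
                  village = c("God", "Ruaca", "God", "Chirodzo"))

base_plot <- ggplot(data = toy, aes(x = no_membrs, y = number_items))

# Fixed color: set outside aes(), every point is drawn in blue
p_fixed <- base_plot + geom_jitter(alpha = 0.5, color = "blue")

# Mapped color: set inside aes(), one color per village plus a legend
p_mapped <- base_plot + geom_jitter(aes(color = village), alpha = 0.5)

# Constant inside aes(): treated as a single group labelled "blue",
# drawn in the default palette color rather than in blue
p_constant <- base_plot + geom_jitter(aes(color = "blue"), alpha = 0.5)
```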
Q: Why did we need to add randomness?
A: Multiple points would be plotted on top of each other otherwise, making it difficult to identify data points.
Q: Is there a reason the jitter points are in different positions between those two examples? Is it because of the default ‘jitter’ in geom_jitter?
A: 'Jittering' involves randomness, so every time you run it, it will be a little different. In case you don't want it to look different each time (say you're hoping to get the same plot over and over again for some purpose), you can run set.seed(1) first, where 1 can be any integer.
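The effect of set.seed() can be seen without plotting. The sketch below uses base R's jitter() as a stand-in for geom_jitter(); the variable names are made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)

set.seed(1)
first <- jitter(x)    # random noise added to x

set.seed(1)
second <- jitter(x)   # same seed, so the same noise

third <- jitter(x)    # no reset, so the noise differs

identical(first, second)
identical(first, third)
```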
Error!
with tidyverse after reloading several times - can no longer use read_csv
Message:
Error in read_csv("dataoutput/interviews_plotting.csv") :
could not find function "read_csv"
Solution:
install.packages("tidyverse")
and then library(tidyverse)
Q: Is there a reason we are putting boxplot before jitter now?
A: It's your preference which one you want on top. ggplot2 draws layers in the order they are added, so the geom that comes last is drawn on top.
Error! define percent wall type
Message:
percent_wall_type <-interviews_plotting %>%
filter(respondent_wall_type != "cement") %>%
group_by(village) %>%
mutate(percent = n / sum(n) * 100) %>%
ungroup()
Solution: Missing the count line
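A sketch of the pipeline with the count() step in place, run on a made-up data frame (toy is invented; it assumes dplyr is installed). count() is what creates the column n that the mutate() line uses; without it there is no n column to divide.

```r
library(dplyr)

# Made-up data for illustration
toy <- data.frame(
  village = c("God", "God", "God", "Ruaca", "Ruaca"),
  respondent_wall_type = c("burntbricks", "muddaub", "muddaub",
                           "cement", "sunbricks")
)

percent_wall_type <- toy %>%
  filter(respondent_wall_type != "cement") %>%
  count(village, respondent_wall_type) %>%   # adds the column n
  group_by(village) %>%
  mutate(percent = n / sum(n) * 100) %>%     # share within each village
  ungroup()

percent_wall_type
```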
You can improve the look of your plots using different themes included in this cheatsheet - https://ggplot2.tidyverse.org/reference/ggtheme.html