pwdWelcome to The Carpentries Etherpad! This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents. Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org). Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/ Main workshop Centre for Regenrative Medicine: May 30th - June 10th, 2021 (nadinebestard.github.io) Instances https://docs.google.com/spreadsheets/d/1QryoRXUKy-7NoshWXC4w8ruDpIcgmU_blLoLf0OzQ6Q/edit#gid=0 ---------------------------------------------------------------------------- Exercise 1 Use the -l option for the ls command to display more information for each item in the directory. What is one piece of additional information this long format gives you that you don’t see with the bare ls comma Finding Hidden directories First navigate to the shell_data directory. There is a hidden directory within this directory. Explore the options for ls to find out how to see hidden directories. List the contents of the directory and identify the name of the text file in that directory. Hint: hidden files and folders in Unix start with . (dot), for example .my_hidden_directory Navigation Navigate to your home directory. (1 step) From there, list the contents of the untrimmed_fastq directory. 2nd step -only one command Do each of the following tasks from your current directory using a single ls command for each: 1. List all of the files in /usr/bin that start with the letter ‘c’. 2. List all of the files in /usr/bin that contain the letter ‘a’. 3. List all of the files in /usr/bin that end with the letter ‘o’. Bonus: List all of the files in /usr/bin that contain the letter ‘a’ or the letter ‘c’. Hint: The bonus question requires a Unix wildcard that we haven’t talked about yet. Try searching the internet for information about Unix wildcards to find what you need to solve the bonus problem. History Find the line number in your history for the command that listed all the .sh files in /usr/bin. Rerun that command. Cat files 1. Print out the contents of the ~/shell_data/untrimmed_fastq/SRR097977.fastq file. What is the last line of the file? 2. From your home directory, and without changing directories, use one short command to print the contents of all of the files in the ~/shell_data/untrimmed_fastq directory. SEarch files : less SRR097977.fastq What are the next three nucleotides (characters) after the first instance of TTTTT? Starting in the shell_data/untrimmed_fastq/ directory, do the following: 1. Make sure that you have deleted your backup directory and all files it contains. 2. Create a backup of each of your FASTQ files using cp. (Note: You’ll need to do this individually for each of the two FASTQ files. We haven’t learned yet how to do this with a wildcard.) 3. Use a wildcard to move all of your backup files to a new backup directory. 4. Change the permissions on all of your backup files to be write-protected. Search with GREP 1. Search for the sequence GNATNACCACTTCC in the SRR098026.fastq file. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match. 2. Search for the sequence AAGTT in both FASTQ files. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match. How many sequences are there in SRR098026.fastq? Remember that every sequence is formed by four lines. How many sequences in SRR098026.fastq contain at least 3 consecutive Ns? Print the file prefix of all of the .txt files in our current directory. *DAY 2 (1st June) Exercise Open README.txt and add the date to the top of the file and save the file. We want the script to tell us when it’s done. 1. Open bad-reads-script.sh and add the line echo "Script finished!" after the grep command and save the file. 2. Run the updated script. Download data *mkdir -p ~/dc_workshop/data/untrimmed_fastq/ *cd ~/dc_workshop/data/untrimmed_fastq * *curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_1.fastq.gz *curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz *curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz *curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz *curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz *curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz Bonus exercise How big are the files? (Hint: Look at the options for the ls command to see how to show file sizes.) Trimmomatic ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/NexteraPE-PE.fa trimmomatic PE SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz \ * SRR2589044_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz \ * SRR2589044_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz \ * SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 REference genome * curl -L -o data/ref_genome/ecoli_rel606.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz Subset of data *curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248 *tar xvf sub.tar.gz *mv sub/ ~/dc_workshop/data/trimmed_fastq_small Final script *set -e *cd ~/dc_workshop/results * *genome=~/dc_workshop/data/ref_genome/ecoli_rel606.fasta * *bwa index $genome * *mkdir -p sam bam * *for fq1 in ~/dc_workshop/data/trimmed_fastq_small/*_1.trim.sub.fastq * do * echo "working with file $fq1" * * base=$(basename $fq1 _1.trim.sub.fastq) * echo "base name is $base" * * fq1=~/dc_workshop/data/trimmed_fastq_small/${base}_1.trim.sub.fastq * fq2=~/dc_workshop/data/trimmed_fastq_small/${base}_2.trim.sub.fastq * sam=~/dc_workshop/results/sam/${base}.aligned.sam * bam=~/dc_workshop/results/bam/${base}.aligned.bam * * bwa mem $genome $fq1 $fq2 > $sam * samtools view -S -b $sam > $bam * * * done *R - day 6th June What will be the value of each variable after each statement in the following program? *mass <- 47.5 age <- 122 mass <- mass * 2.3 age <- age - 20 Compare mass to age ( is mass larger than age)? Install the following packages: ggplot2, dplyr, gapminder Download the gapminder data from here: https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv 1. Download the file (right mouse click on the link above -> “Save link as” / “Save file as”, or click on the link and after the page loads, press Ctrl+S or choose File -> “Save page as”) 2. Make sure it’s saved under the name gapminder_data.csv 3. Save the file in the data/ folder within your project. We will load and inspect these data later. Our fake data *coat,weight,likes_string calico,2.1,1 black,5.0,0 tabby,3.2,1 *logical -> integer -> numeric -> complex -> character * *Exercises data types and structures Make a vector with the numbers 1 through 26. Multiply the vector by 2 Is there a factor in our cats data.frame? what is its name? Try using ?read.csv to figure out how to keep text columns as character vectors instead of factors; then write a command or two to show that the factor in cats is actually a character vector when loaded in this way. Create a list of length two containing a character vector for each of the sections in this part of the workshop: * Data types * Data structures Populate each character vector with the names of the data types and data structures we’ve seen so far.crea # Subsetting data *x <- c(5.4, 6.2, 7.1, 4.8, 7.5) names(x) <- c('a', 'b', 'c', 'd', 'e') Write a subsetting command to return the values in x that are greater than 4 and less than 7. *f <- factor(c("a", "a", "b", "c", "c", "d")) my_list <- list(tilte= "CRM workshop", vector = 1:10, data = head(mtcars)) Using your knowledge of both list and vector subsetting, extract the number 2 from my_list. Hint: the number 2 is contained within the “vector” item in the list # matrix set.seed(1) *m <- matrix(rnorm(6*4), ncol=4, nrow=6) Let’s imagine that 1 cat year is equivalent to 7 human years. 1. Create a vector called human_age by multiplying cats$age by 7. 2. Convert human_age to a factor. 3. Convert human_age back to a numeric vector using the as.numeric() function. Now divide it by 7 to get the original ages back. Explain what happened. *download.file(url = "https://github.com/Bioconductor/bioconductor-teaching/raw/master/data/GSE96870/rnaseq.csv", * destfile = here("data/rnaseq.csv")) rna <- read.csv(file=here("data/rnaseq.csv", , stringsAsFactors = FALSE) Download this file, import as an R object Based on the output of str(rna), can you answer the following questions? * What is the class of the object rna? * How many rows and how many columns are in this object? * How many genes (as defined by the gene variable) have been measured in this experiment? Fix each of the following common data frame subsetting errors: Extract observations collected for the year 1957 gapminder[gapminder$year = 1957,] Extract all columns except 1 through to 4 gapminder[,-1:4] Extract the rows where the life expectancy is longer the 80 years gapminder[gapminder$lifeExp > 80] Extract the first row, and the fourth and fifth columns (continent and lifeExp). gapminder[1, 4, 5] Advanced: extract rows that contain information for the years 2002 and 2007 gapminder[gapminder$year == 2002 | 2007,] 1. Create a data.frame (rna_200) containing only the data in row 200 of the rna dataset. 2. Notice how nrow() gave you the number of rows in a data.frame? * Use that number to pull out just that last row in the inital rna data frame. * Compare that with what you see as the last row using tail() to make sure it’s meeting expectations. * Pull out that last row using nrow() instead of the row number. * Create a new data frame (rna_last) from that last row. * *# Day 4 "https://carpentries-incubator.github.io/bioc-intro/" Lesson 6 Challenge 1 Change the arguments bins or binwidth of geom_histogram() to change the number or width of the bins. Modify the example so that the figure shows how life expectancy has changed over time: *ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point() Hint: the gapminder dataset has a column called “year”, which should appear on the x-axis. * In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “continent” column. What trends do you see in the data? Are they what you expected? Switch the order of the point and line layers from the previous example. What happened? Modify the color and size of the points on the point layer in the previous example. Hint: do not use the aes function. Subset the dataframe to only include the 25 countries from america. Generate boxplots to compare life expectancy between the different continents during the available years. * Rename y axis as Life Expectancy. * Remove x axis labels. Extra info - how to add plots to a single figure. box_plot_lifeexp <- ggplot(gapminder, aes(x = continent, y = lifeExp)) + geom_boxplot(aes(fill = continent)) + ylab("Life Expectancy") + theme(legend.position = "none") violin <- ggplot(gapminder, aes(x = continent, y = lifeExp)) + geom_violin(aes(fill = continent)) + ylab("Life Expectancy") + theme(legend.position = "none") install.packages("patchwork") library(gridExtra) library(patchwork) library(ggplot2) # with pathcwork box_plot_lifeexp + violin box_plot_lifeexp / (violin + rna_plot ) # with gridExtra grid.arrange(box_plot_lifeexp, violin, ncol = 2) Subset the dataset to only keep european countries. extra: and from less than 1956 Write a single command (which can span multiple lines and includes pipes) that will produce a data frame that has the African values for lifeExp, country and year, but not for other Continents. How many rows does your data frame have and why? Calculate the average life expectancy per country. Extra: Which has the longest average life expectancy and which has the shortest average life expectancy? * Rmarckdown Add code chunks to: * Load the ggplot2 package * Read the gapminder data * Create the plot you liked the most Convert the document to a webpage. Use chunk options to control the size of a figure and to hide the code. DAY 5 # Workflow we will follow "https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html" # RNAseq workflow package "https://www.bioconductor.org/packages/release/workflows/html/rnaseqGene.html" BiocManager::install("DESeq2") library(DESeq2) ```{r} filenames <- here("data", paste0(sampleTable$Run, "_subset.bam")) ``` ```{r} # extract the data from airway data("airway") # save as a summarised experiment se <- airway ``` colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) Plot pcadata, with colour representing the treatment (dex), and shape the cell line (cell) with ggplot. Add the variance in the labels