*Data Carpentry Genomics Workshop
University of Kansas Online Workshop via Zoom
Tuesday, January 10 - Thursday, January 12, 2023
9:00 am - 4:00 pm CST (UTC-6) all days
Lunch break each day 12:00 noon - 1:00 pm
Course website: https://kulibraries.github.io/2023-01-10-ku-dc-online/
This etherpad: https://pad.carpentries.org/2023-01-10-ku-dc-online

Welcome to The Carpentries Etherpad! This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents. Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org). Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

*Connection instructions for your AMI instance:
To connect via shell, at your terminal prompt, type:
ssh dcuser@(your AMI instance name).compute-1.amazonaws.com
Use the password included in your email
To connect to RStudio, in your browser address bar, type:
(your AMI instance name).compute-1.amazonaws.com:8787
Your username and password are the ones included in your email

----- Scroll to the bottom for FRESH notes -----

*Please fill out the pre-workshop survey located on the course website: https://kulibraries.github.io/2023-01-10-ku-dc-online/#surveys

*Tuesday, January 10
Attendees (Name / department or affiliation)
* Nikki Dyanne Realubit / KU - EEB
* Justin M. Bernstein / KU Center for Genomics
* Natalie Ford / KU - EEB
* Sharif K Tusuubira / KU EEB
* Brandon Fagen / Zanders Lab, Stowers Institute
* Camila G. Meneses / KU - EEB
* Kervens Accilien / KU - MB
* Vanna Hay / KU - Anthropology
* Haylee Nedblake / KU - EEB
* Justin Van Goor / KU - EEB
* Christian Supsup / KU - EEB
* Anjali Gupta / KU - EEB
* Surbhi Kumawat
* Srivatsan Parthasarathy
* Mark Herr / KU - EEB
* Arezoo Fani / KU - EEB
* Kayla Clouse / KU - EEB
* Meridia Jane Bryant / KU - EEB
Instructors (Name / department or affiliation)
* Boryana Koseva / Children's Mercy Research Institute
* Caroline Kisielinski / KU Anthropology
Helpers (Name / department or affiliation)
* Amanda Katzer / KU Ecology & Evolutionary Biology
* Elizabeth Everman / KU Molecular Biosciences / Host
* Tami Albin / KU Libraries
* Jocelyn Colella / KU Ecology & Evolutionary Biology + Biodiversity Institute
* Joel Swift / KBS
* Martina Dalikova / KU Ecology & Evolutionary Biology

*Project Organization and Management for Genomics
Project Organization Reference Page: https://datacarpentry.org/organization-genomics/reference.html
Full Lesson Material: https://datacarpentry.org/organization-genomics/
1. Metadata - data that accompanies the sequencing data
a. Types of data people have generated with their samples:
- fastq files
- voucher information (identification / museum accession ID)
- sample concentration (DNA or RNA)
- locality information for collected samples
- tissue type
- barcode information
- quality scores
b. Where people record the data from (a):
- elab notebook
- csv files
- handwritten field notes
c. The way we organize spreadsheets may be friendly to humans, but not to computers
d. Record everything (especially the things you think you will remember later!) and make multiple copies
2. Tidy Data
a. Every observation/sample goes in a row
- each sample has a unique identifier
- sample IDs should not contain any spaces; use dashes or underscores instead
b. Every variable/characteristic goes in a column
- variable names should be detailed enough to understand what you are recording
- only one type of data per variable (no combining variables in a column)
c. Potential problems that we see with this spreadsheet: https://github.com/datacarpentry/organization-genomics/raw/gh-pages/files/Ecoli_metadata_composite_messy.xlsx
- nonuniformity of data
- white spaces in some variable names
- inconsistency of variable names (Run vs run - capitalized vs not)
- redundancy
- multiple tables in one spreadsheet
- missing data (only some samples have sequencing depth recorded)
d. A couple of rules:
- Leave raw data raw; make copies if you need to subset.
- Put each observation or sample in its own row.
- Put all your variables in columns - the things that vary between samples, like 'strain' or 'DNA-concentration'.
- Have column names be explanatory, but without spaces. Use '-', '_' or camel case instead of a space (ThisIsAnExampleOfCamelCase). For instance 'library-prep-method' or 'LibraryPrep' is better than 'library preparation method' or 'prep', because computers interpret spaces in particular ways.
- Export the cleaned data to a text-based format like CSV (comma-separated values). This ensures that anyone can use the data, and it is required by most data repositories. A tidy example is sketched below.
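As a quick illustration of those rules, here is a minimal sketch of a machine-friendly metadata file; the sample IDs, column names, and values are hypothetical, not from the workshop data:

cat > sample_metadata.csv <<'EOF'
sample_id,tissue_type,dna_conc_ng_ul,collection_locality
KU-0001,liver,25.3,Douglas-County-KS
KU-0002,muscle,18.7,Douglas-County-KS
EOF

Each row is one sample, each column is one variable, and none of the names contain spaces.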
*Introduction to the Command Line for Genomics
Introduction to the Command Line Reference Page: https://datacarpentry.org/shell-genomics/reference.html
Command quick reference: https://github.com/kulibraries/swc-workshop-helps/blob/master/command-handout.md
Full lesson material: https://datacarpentry.org/shell-genomics/
Useful Unix/Linux command reference: Unix and Linux Visual Quickstart Guide, 5th edition: http://www.worldcat.org/oclc/921910373
It's also possible to do an internet search for what you want to do in the shell and find lots of advice. You can also find the manual page online for any command by searching for the command plus "man page", e.g. ls man page
1. Every sequencing facility is different, so find out what information they need before they receive samples
2. Open your CLI (command line interface)
a. mac: applications -> utilities -> terminal
b. windows: right click on git bash to open
3. Why do we use bash?
a. our desktops/laptops are usually not able to handle processing next-gen data, so we use remote servers/computers, which are often only accessible via a CLI
b. software for analysis is often only accessible via a CLI - only a few programs have a GUI (graphical user interface)
c. helps prevent human error
d. helps us be efficient (or lazy!) and avoid typing the same commands over and over - automation of commands
4. Amazon instances
a. How to log in:
ssh dcuser@YOURAMAZONINSTANCEINFO
hit enter
type 'yes' when the following message appears:
The authenticity of host 'ec2-34-236-190-128.compute-1.amazonaws.com (34.236.190.128)' can't be established. ECDSA key fingerprint is SHA256:zM1C7+o+wEDkfnSe0nu2c0YKAhxKGZLXZAzkNtLVe4E. Are you sure you want to continue connecting (yes/no/[fingerprint])?
type in the password: data4Carp (you will not see the password as you type it)
b. 'clear' gives you a blank slate screen for the terminal
clear
you will see (base) dcuser@yourinstance:~$
5. Navigate the filesystem
a. commands - programs that do something for us and produce output
b. Check where we are in the filesystem - our working directory
pwd
c. What are the contents of our folder?
ls
d. move from directory to directory (folder = directory here)
cd DirectoryName
cd shell_data
then type ls to see what is in that folder
e. use ls to check what is a directory, program, or file
ls -F
name/ is a directory
name* is a program
name is a file
f. get the manual
man CommandOfInterest
man ls
- Challenge: find the flag that will give us additional information (AKA long format) and test it!
man brings you into another program, so quit with q to get back to the command shell
ls -l
What ls -l tells you: permissions, owner of the file, size of the file, date of last modification
g. Can give multiple options at a time
ls -F -a
- -a will show everything, even hidden folders (.foldername is a hidden folder)
1. ../ is one directory up in the filesystem
2. ./ means this directory
h. go into the untrimmed_fastq folder
cd shell_data/untrimmed_fastq
- Challenge: Look at what files are in the untrimmed_fastq folder
1. command to use: ls
2. you should see 2 fastq files
i. Go back to your home directory (without having to type the whole address out)
cd ~
- the tilde (~) is the shorthand name for your home directory
j. tab complete - allows you to complete a folder or file name given sufficient uniqueness
cd shell_data/untrimmed_fastq/ (press tab at some point while typing shell_data, and again while typing untrimmed_fastq)
- conditions:
1. sufficient uniqueness in the name - will autocomplete up to the next character that distinguishes between the folders/files
- hitting double tab once you reach the ambiguous part of a name lists the options that match
- double tab lists what starts with what you have typed (file names when using ls, command names otherwise)
k. echo $PS1
- PS1 is a variable; $ means "the value of"
- setting PS1='$ ' changes your prompt, but this is not permanent
l. cd .. will take you to the directory above your current directory
m. ls can show you the contents of your current directory OR of a directory that you specify
cd shell_data
ls untrimmed_fastq
- see contents of home directory: How do I display the contents of my home directory while I am in shell_data/?
ls ~
ls ..
ls /home/dcuser
- Full path vs relative path (a short sketch follows the challenges below):
1. full path: starting at the root (/), the full address to the folder - you can be in any directory and will be given the contents of the directory of interest
2. relative path: given where I am right now, give me the contents of this folder (..)
- ls .. will give you different responses depending on what your current working directory is
- some instances may have a program called 'tree' installed, which lets you see your directory structure
tree shell_data
n. Combining multiple flags to list a folder
ls -F -a -l untrimmed_fastq
ls -Fal untrimmed_fastq
ls -Fal
- Challenge: What are the contents of the hidden directory in shell_data?
ls -Fal .hidden/
What is the full path of the .hidden directory?
/home/dcuser/shell_data/.hidden/
Using the filesystem diagram below, if pwd displays /Users/thing, what will ls ../backup display?
1. ../backup: No such file or directory
2. 2012-12-01 2013-01-08 2013-01-27
3. 2012-12-01/ 2013-01-08/ 2013-01-27/
4. original pnas_final pnas_sub
Correct answer is 4
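A short sketch of the full-vs-relative-path distinction, using this lesson's filesystem and assuming your working directory is /home/dcuser/shell_data:

ls /home/dcuser/shell_data/untrimmed_fastq    # full path: works from any working directory
ls untrimmed_fastq                            # relative path: only works from inside shell_data

Both commands list the same folder; only the starting point differs.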
CUSTOMISING YOUR COMMAND PROMPT
You may have a prompt (the characters to the left of the cursor) that looks different from the $ character used here. If you would like to change your prompt to match the example prompt, first type the command:
echo $PS1
into your shell, followed by pressing the Enter key. This will print the bash special characters that currently define your prompt. To change the prompt to a $ (followed by a space), enter the command:
PS1='$ '
Your window should now look like our example in this lesson. To change back to your original prompt, type the output of the earlier echo $PS1 command (this will be different depending on the original configuration) between the quotes in the following command:
PS1=""
For example, if the output of echo $PS1 was \u@\h:\w $ , then type those characters between the quotes: PS1="\u@\h:\w $ ". Alternatively, you can reset your original prompt by exiting the shell and opening a new session. This isn't necessary to follow along (in fact, your prompt may have other helpful information you want to know about). This is up to you!
----------------------------------------
mac workaround for connectivity dropping:
1. completely log out of the amazon instance
2. while in your home directory: nano .ssh/config
3. add the line: ServerAliveInterval 50
4. ctrl-X and yes to save
5. log back into the amazon instance: ssh dcuser@AmazonInstance
----------------------------------------
6. Working with files
a. wildcard *
- *.fastq means anything ending in .fastq
- example used in a command: ls /usr/bin/*.sh
- placement of the wildcard matters!
- Exercises: Do each of the following tasks from your current directory using a single ls command for each:
1. List all of the files in /usr/bin that start with the letter 'c'.
ls /usr/bin/c*
2. List all of the files in /usr/bin that contain the letter 'a'.
ls /usr/bin/*a*
3. List all of the files in /usr/bin that end with the letter 'o'.
ls /usr/bin/*o
Using the 'up arrow' on your keyboard will bring back the last command you ran
b. Access previously run commands:
- up arrow
- ctrl-c (kill a command; this is why this windows keyboard shortcut doesn't copy in a CLI!)
- ctrl-r (reverse search)
- history
1. you can run a command again with !lineNumber
c. Examine our files
- print an entire file to the terminal
cat FileName
- the less command shows a read-only version
less FileName
1. Commands to navigate through 'less':
- space to go down, b to go up
- g to go to the beginning
- G to go to the end
- q to quit
2. worthwhile to go through the manual for less!
- look at only the top few lines of a file with 'head'
head FileName
1. the default shows only the first 10 lines; head -n 1 FileName will show you just the first line
- look at only the bottom few lines of a file with 'tail'
tail FileName
1. the default shows only the last 10 lines; tail -n 1 FileName will show you just the last line
d. Fastq format (an illustrative record is sketched below)
- first line will always be @ followed by information about the read
- second line will be the nucleotides
- third will be + and may repeat the same info as line 1
- fourth line will be quality score information
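A sketch of one four-line fastq record; the file name, read name, and values are illustrative, not copied from the lesson files:

head -n 4 example.fastq
@SRR000001.1 HWUSI-EAS1599_1:2:1:0:968 length=35
ATNNCCACTTCCAGTGCTGANNNNNNNGGGATCCA
+SRR000001.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!''*((((***+))%%%++)(%%%%).1***-+*'

The fourth line has one quality character per base on the second line.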
e. Copying files
cp FileToCopy NameOfCopy
f. Making a new directory
mkdir NameOfNewDirectory
g. Move files
mv FileToMove DirectoryToMoveItTo
h. Change the permissions of a file (plus the command to remove a file)
- provides extra security so we don't accidentally delete it
- see the permissions with ls -l
1. the first 3 characters denote permissions for the user who created the file
2. the second 3 and third 3 denote permissions for the group and all others, respectively
- group 2: within your user group
- group 3: everyone else
3. the characters stand for Read-Write-Execute (r-w-x)
- remove the ability to write:
chmod -w FileName
- after changing permissions, there is an extra safeguard against removing (rm FileName): it will ask if you are sure you want to remove the file
1. rm will not remove a directory by default (to remove a directory, use the -r flag)
i. Search within files without opening them
grep PatternOfInterest FileName
1. will look for strings, or can look for regular expressions
- example: look for reads with a run of Ns
grep NNNNNNNNNN SRR098026.fastq
- want to know the entire 4-line chunk that matches a pattern:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
1. B refers to before; A refers to after; the number after B/A is the number of lines to include
- Grep Exercises:
Search for the sequence GNATNACCACTTCC in the SRR098026.fastq file. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match.
Solution:
grep -B1 "GNATNACCACTTCC" SRR098026.fastq
example output:
@SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
GNATNACCACTTCCAGTGCTGANNNNNNNGGGAT
Search for the sequence AAGTT in both FASTQ files. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match.
Solutions:
grep -B1 AAGTT SRR098026.fastq SRR097977.fastq
grep -B1 AAGTT *.fastq
These will both search the files, but in different orders
- grep output prints to the screen by default. To redirect the output, use '>' followed by a file name for bash to create and store the output in:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
j. word count (wc)
wc FileName
- output is: lines words characters FileName
- wc -l FileName will give just the number of lines
- How many sequences in SRR098026.fastq contain at least 3 consecutive Ns?
grep NNN SRR098026.fastq > bad_reads.txt
wc -l bad_reads.txt
Alternative:
grep NNN SRR098026.fastq | wc -l
k. > writes a file, >> appends to a file
- If you give the same output file name using '>' in multiple commands, the file will be overwritten:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
wc -l bad_reads.txt
- output: 537 bad_reads.txt
grep -B1 -A2 NNNNNNNNNN SRR097977.fastq > bad_reads.txt
wc -l bad_reads.txt
- output now shows: 0 bad_reads.txt
- use '>>' to keep both results by appending the second:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
wc -l bad_reads.txt
grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt
wc -l bad_reads.txt
- output now shows: 537 bad_reads.txt
Alternative way to write it:
grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.txt
wc -l bad_reads.txt
l. Pipes |
- the | character is typed with shift plus the backslash (\) key
- link commands together into one line by taking the output of the first command and feeding it into the next
1. reduces the number of intermediate output files
- example that combines grep and wc:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l
- modified grep command with a pipe:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | grep -v '^--' > bad_reads.txt
1. the -v flag means the inverse of the match: pull out everything that doesn't match the pattern
2. ^ anchors the search to the beginning of the line
- you can have multiple pipes in the same line! (see the sketch below)
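A small sketch of chaining more than one pipe, using the two lesson files; this counts the reads containing at least 3 consecutive Ns across both files at once:

cat *.fastq | grep NNN | wc -l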
m. Loops
- For loop
1. repeats a command for each item in a list (each cycle through the loop is an iteration)
2. Example with the variable foo:
foo=abc
echo foo is $foo
- $ means "the value of" the variable
- you can remove a variable with unset VariableName
echo foo is ${foo}efg
- this does not actually reassign the variable to include efg; it is still only abc
pwd
for filename in *.fastq
do
head -n 2 ${filename}
done
3. re-write the loop so there is an output file instead of printing to the terminal:
for filename in *.fastq
do
head -n 2 ${filename} >> seq_info.txt
done
4. ctrl-c to kill a loop if it gets stuck!!
- basename in for loops
1. strips a file name of the extra parts that you don't need/want
2. command: basename InputFile PartToRemove
basename SRR097977.fastq .fastq
3. example of basename in a for loop:
for filename in *.fastq
do
name=$(basename ${filename} .fastq)
echo ${name}
done
- Loop exercise: Print the file prefix of all of the .txt files in our current directory.
Solution:
for filename in *.txt
do
name=$(basename ${filename} .txt)
echo ${name}
done
Alternative way to write the same loop (one line):
for filename in *.txt; do name=$(basename ${filename} .txt); echo ${name}; done
7. Writing scripts
a. nano - a text editor that can read and write files in the terminal
nano FileName
1. FileName can be a new file that doesn't exist yet, or a file that already exists
- not the only text editor out there
- Navigation:
1. the top tells you that you are in nano and shows the file name
2. a key for what you can do is at the bottom (^ means ctrl in nano)
3. the cursor is where you write
- Exercise: Open README.txt, add your name to the top of the file, and save the file.
b. Create a script that pulls out the bad reads
nano bad-reads-script.sh
- .sh indicates that it is a bash script
grep -B1 -A2 -h NNNNNNNNNN *.fastq | grep -v '^--' > scripted_bad_reads.txt
- written within the nano text editor
ctrl-x to get out of nano, and save changes with y
bash bad-reads-script.sh
- runs the script
less scripted_bad_reads.txt
c. change the script's permissions to make it executable and bypass needing to call 'bash'
chmod +x bad-reads-script.sh
./bad-reads-script.sh
- this now runs the file without calling bash first
8. Getting data onto a remote server
a. download data
- command: wget
1. downloads data from a website
- command: curl
1. displays webpages or data without downloading them
2. can download the data too (a curl equivalent of the wget command in (c) is sketched below)
b. What programs do you have access to?
- command: which ProgramName
which curl
which wget
c. download using wget
wget ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt
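A sketch of the equivalent download with curl; the -O flag saves the file under its remote name, and the URL is the same one used with wget above:

curl -O ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt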
SURVEY TO PROVIDE FEEDBACK FOR DAY 1: https://forms.gle/YGgkM6r5Y5y2WBzDA

*Introduction to Cloud Computing for Genomics
Introduction to Cloud Computing Reference Page: https://datacarpentry.org/cloud-genomics/reference.html
Full Lesson Materials: https://datacarpentry.org/cloud-genomics/

*Wednesday, January 11
Attendees (Name / department or affiliation)
* Sharif K Tusuubira / KU EEB
* Natalie Ford / KU EEB
* Nikki Dyanne Realubit / KU EEB
* Justin M. Bernstein / KU Center for Genomics
* Anjali Gupta / KU EEB
* Arezoo Fani / KU EEB
* Surbhi Kumawat
* Justin Van Goor / KU EEB/KBS
* Haylee Nedblake / KU EEB
* Srivatsan Parthasarathy
* Meridia Jane Bryant / KU EEB
* Brandon Fagen / Zanders Lab, Stowers Institute
* Vanna Hay / KU Anthropology
* Camila G. Meneses / KU EEB
* Jess Pfannenstiel / KU MB
* Kervens Accilien / KU MB
* Christian Supsup / KU EEB
* Kayla Clouse / KU EEB
Instructors (Name / department or affiliation)
* Boryana Koseva / Children's Mercy Research Institute
* Caroline Kisielinski / KU Anthropology
Helpers (Name / department or affiliation)
* Amanda Katzer / KU Ecology & Evolutionary Biology
* Elizabeth Everman / KU MB
* Tami Albin / KU Libraries
* Martina Dalikova / KU EEB
* Joel Swift / KBS
* Jocelyn Colella / KU EEB, Biodiversity Institute

*Data Wrangling and Processing for Genomics
Data Wrangling Reference Page: https://datacarpentry.org/wrangling-genomics/reference.html
Full Lesson Materials: https://datacarpentry.org/wrangling-genomics/
1. Downloading data
a. use curl commands to download data from the European Nucleotide Archive
- create a script in the home directory using nano (nano download_data.sh) containing:
mkdir -p ~/dc_workshop/data/untrimmed_fastq/
cd ~/dc_workshop/data/untrimmed_fastq/
plus the curl commands listed below
1. the -p flag for mkdir creates the parent directories as necessary
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz
- get out of nano: ctrl-o to save, ctrl-x to exit
- sh download_data.sh starts downloading those files
1. the script opens a new shell when it runs (not always the case on every remote server)
- there is a copy of the data in a hidden directory (if your script isn't downloading the data quickly):
ls dc_workshop/data/untrimmed_fastq/
cp ~/.backup/untrimmed_fastq/*.fastq.gz ~/dc_workshop/data/untrimmed_fastq/
2. Bioinformatic workflows
a. there are many, many workflows, because there are many ways to do each step
- mixing and matching tools is possible thanks to standards in data formatting
cd dc_workshop/data/untrimmed_fastq/
ls
gunzip SRR2584863_1.fastq.gz
head -n 4 SRR2584863_1.fastq
1. first line contains info about the sequence
2. second line is the actual sequence
3. third line is a placeholder + sign
4. fourth line is the quality score for the sequence
5. you will always see these 4 lines in the fastq format
b. Finding bad reads
- use quality scores, but these are difficult for humans to read
- fastqc can easily read these quality scores and give you some stats about them
1. Is it installed? If installed, this returns a path to the executable:
which fastqc
2. The help flag is -h (you often see this flag with many programs, but not all):
fastqc -h
3. Re-zip the file we unzipped previously; fastqc can work with either zipped or unzipped files.
gzip SRR2584863_1.fastq
ls -l (plus a flag that will properly display file sizes; search for it with man ls)
ls -lh
4. Run fastqc on the files:
fastqc SRR258*
5. Output
- you still have each of the starting files, but now also an .html and a .zip file for each
- the html file is a report, but it cannot be opened on the remote server, so you have to download it
- Download the html files (command: scp SOURCE DESTINATION):
Open a second terminal window for your local machine
cd ~/Desktop
mkdir -p dc_workshop/data/untrimmed_fastq
cd dc_workshop/data/untrimmed_fastq
scp dcuser@YourAmazonInstanceID:~/dc_workshop/data/untrimmed_fastq/*.html .
Then enter your password
Navigate to your desktop folder and open a report in a browser
- html report:
1. Encoding - important to know how the quality scores are encoded
2. Total sequences in the file
3. Also gives sequence length and GC content
4. Per base sequence quality graph. The y-axis is the quality score. You want to be in the green - 20 or above.
- a tail-off in quality at the end of the read is normal, and those bases tend to be trimmed later
5. Per base sequence content - often some craziness at the first few bases; doesn't normally affect analyses
6. Sequence Duplication Levels - worth looking at
7. Adapter Content - will be trimmed away in the next steps
8. Examples can be found here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- the zip files contain a number of text files with data
1. Exercise: Write a loop to unzip your files
for zipFile in *.zip
do
unzip ${zipFile}
done
2. ls -F shows that the unzipped outputs are directories. They contain:
- 2 directories
- 4 files, including summary.txt
look at summary.txt with less (less SRR2584863_1_fastqc/summary.txt); it contains the pass/warn/fail for the different categories
- Housekeeping of output:
mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads
mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads
cd ~/dc_workshop/results/fastqc_untrimmed_reads
ls
6. Summarize what you have done in fastqc
mkdir -p ~/dc_workshop/docs
cat *_fastqc/summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
cd ../../docs
1. Exercise: Using what you have learned about grep, can you find how many tests failed?
- recipe for grep: grep Pattern FileName
grep ^FAIL fastqc_summaries.txt | wc -l
c. Trimming and Filtering - Trimmomatic
1. is it on the machine?
which trimmomatic
2. The manual/help gives usage for paired-end (PE) or single-end (SE) reads:
trimmomatic -h
3. It will not modify your original files. It reads them in, modifies, and writes separate output files.
4. Run (recipe):
trimmomatic PE -threads 4 SRR2584863_1.fastq.gz SRR2584863_2.fastq.gz \
SRR2584863_1.trimmed.fastq SRR2584863_1un.trimmed.fastq \
SRR2584863_2.trimmed.fastq SRR2584863_2un.trimmed.fastq \
ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:20
Set up and run it:
cd ~/dc_workshop/data/untrimmed_fastq/
cp ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/NexteraPE-PE.fa .
head -n 5 NexteraPE-PE.fa
trimmomatic PE SRR2584863_1.fastq.gz SRR2584863_2.fastq.gz SRR2584863_1.trimmed.fastq SRR2584863_1un.trimmed.fastq SRR2584863_2.trimmed.fastq SRR2584863_2un.trimmed.fastq ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:20
- ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 means: the adapter file, then the seed mismatches, palindrome clip threshold, and simple clip threshold
- SLIDINGWINDOW:4:20 looks at 4 bp at a time and checks that the average quality score is at least 20; where it drops below 20, the read is cut. It then moves to the next window.
5. Output:
- Input Read Pairs: ### Both Surviving: ### Forward Only Surviving: ### Reverse Only Surviving: ### Dropped: ###
- 6 files (3 for each half of the paired reads):
1. raw data
2. paired trimmed data
3. unpaired trimmed data
- File sizes are a good place to sanity check that trimmomatic worked:
ls -lh SRR2584863_*
1. trimmed files should be smaller than the raw files. Double check that you aren't comparing zipped to unzipped file sizes!
6. Write a loop to process all of the samples in one go:
for infile in *_1.fastq.gz
do
base=$(basename ${infile} _1.fastq.gz)
trimmomatic PE ${base}_1.fastq.gz ${base}_2.fastq.gz \
${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz \
${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
done
7. Adapter files that are commonly used with Trimmomatic:
ls ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/
8. Housekeeping: organizing the files
mkdir ../trimmed_fastq
rm *trimmed.fastq
mv *.trim.* ../trimmed_fastq/
cd ../trimmed_fastq/
ls
- Run fastqc (again!) on the trimmed files, as sketched below
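A sketch of that re-run from inside trimmed_fastq/, assuming the trimmed outputs were moved there as above; fastqc accepts the gzipped files directly, and the report handling works the same as before:

fastqc *.fastq.gz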
3. Align to a reference genome
a. download the reference genome
mkdir -p data/ref_genome
curl -L -o data/ref_genome/ecoli_rel606.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz
ls data/ref_genome/
gunzip data/ref_genome/ecoli_rel606.fasta.gz
b. download a set of trimmed fastq files
curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
- curl flags:
-L means location: follow redirects
-o means output: write to a file instead of stdout
- When looking at a giant manual page, it can be helpful to search for the option you want to find. You can search in less (and in man pages) by typing a forward slash followed by the text you're looking for; for example, search for /-o. You can then jump to the next instance of the pattern by typing 'n', or the previous occurrence by typing 'N'.
tar xvf sub.tar.gz
ls sub
mv sub/ ~/dc_workshop/data/trimmed_fastq_small
cd
c. make a series of directories
mkdir -p results/sam results/bam results/bcf results/vcf
d. index the reference genome
- allows the aligner to quickly find matches to the reference
- program: BWA (Burrows-Wheeler Aligner)
bwa index data/ref_genome/ecoli_rel606.fasta
e. Align using bwa mem (MEM = maximal exact match)
- format: bwa mem ReferenceGenomeFasta InputFastq1 InputFastq2 > OutputSAMfile
1. SAM = sequence alignment/map
bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam
if you made the small directory and then moved the sub/ folder into it, your command might look like this:
bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/sub/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/sub/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam
f. SAM/BAM output
- contains information about each individual read and how it aligns to the reference genome
1. SAM file format documentation: https://samtools.github.io/hts-specs/SAMv1.pdf
- BAM files are binary alignment/map files and are smaller than SAM
1. Compress a SAM into a BAM:
samtools view -S -b results/sam/SRR2584866.aligned.sam > results/bam/SRR2584866.aligned.bam
2. Sort the BAM file, here by coordinate (there are several ways to sort, e.g. by name or by coordinate):
samtools sort -o results/bam/SRR2584866.aligned.sorted.bam results/bam/SRR2584866.aligned.bam
3. Stats with samtools:
samtools flagstat results/bam/SRR2584866.aligned.sorted.bam
- Can tell you:
1. the count of reads that passed and failed their QC metrics
2. % of reads that mapped to the genome
3. % of reads that are properly paired
4. Variant Calling
a. Look at where nucleotides in your samples differ from the reference (usually SNPs)
b. bcftools
- BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF.
- https://samtools.github.io/bcftools/bcftools.html
- Calculate read coverage at each position:
bcftools mpileup -O b -o results/bcf/SRR2584866_raw.bcf \
-f data/ref_genome/ecoli_rel606.fasta results/bam/SRR2584866.aligned.sorted.bam
ls results/bcf/
- Detect the variants at single-nucleotide positions using bcftools call (a quick way to count them is sketched below):
bcftools call --ploidy 1 -m -v -o results/vcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf
ls results/vcf/
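A quick sanity-check sketch using commands from earlier in the workshop: VCF header lines start with #, so the inverse match (-v) plus wc -l counts only the variant records:

grep -v "^#" results/vcf/SRR2584866_variants.vcf | wc -l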
- Filter the variants and clean them up with vcfutils.pl
1. a perl script
vcfutils.pl varFilter results/vcf/SRR2584866_variants.vcf > results/vcf/SRR2584866_final_variants.vcf
ls results/vcf
- Look into the VCF file:
less -S results/vcf/SRR2584866_final_variants.vcf
c. Assess the VCF files (visualization)
- prep:
1. index the bam file
samtools index results/bam/SRR2584866.aligned.sorted.bam
- tview in samtools: an alignment viewer
samtools tview results/bam/SRR2584866.aligned.sorted.bam data/ref_genome/ecoli_rel606.fasta
1. line 1 = position; line 2 = reference; line 3 = consensus sequences
- a period indicates a match to the reference
2. Shortcuts/navigation:
- arrow keys to scroll
- space jumps pages
- help window: ?
- go to a specific location: g
- exit the program: q
3. Challenge: What variant is present at position 4377265? What is the canonical nucleotide in that position?
Solution: the canonical (reference) nucleotide is A, and the variant is G
- IGV (Integrative Genomics Viewer)
1. opens on your local computer
2. Transfer files to your local machine using scp
open a new terminal window (do not sign into the amazon instance)
mkdir ~/Desktop/files_for_igv
cd ~/Desktop/files_for_igv/
scp dcuser@AMAZON-INSTANCE:~/dc_workshop/results/bam/SRR2584866.aligned.sorted.bam ~/Desktop/files_for_igv
scp dcuser@AMAZON-INSTANCE:~/dc_workshop/results/bam/SRR2584866.aligned.sorted.bam.bai ~/Desktop/files_for_igv
scp dcuser@AMAZON-INSTANCE:~/dc_workshop/data/ref_genome/ecoli_rel606.fasta ~/Desktop/files_for_igv
scp dcuser@AMAZON-INSTANCE:~/dc_workshop/results/vcf/SRR2584866_final_variants.vcf ~/Desktop/files_for_igv
3. Download the IGV software from the Broad Institute: https://software.broadinstitute.org/software/igv/download
unzip the file and then open it
4. Load in the reference genome file
Genomes > Genome from file > Desktop > files_for_igv > ecoli_rel606.fasta
5. Load in the sorted BAM and the VCF
File > Load from File > Desktop > files_for_igv > SRR2584866.aligned.sorted.bam
File > Load from File > Desktop > files_for_igv > SRR2584866_final_variants.vcf
6. Navigation
- at top right you can zoom in and out
- each bar across the top of the plot shows the allele for each locus
- a light blue bar is a homozygous variant
- "I" is an indel
- grey arrow bars indicate coverage
5. Log out of the amazon instance
exit
a. you can log out of your local terminal with exit too

Please give us your feedback for today here: https://forms.gle/YGgkM6r5Y5y2WBzDA

*Thursday, January 12
Attendees (Name / department or affiliation)
* Arezoo Fani / KU - EEB
* Natalie Ford / KU EEB
* Kervens Accilien / KU MB
* Justin Van Goor / KU EEB/KBS
* Jessica Pfannenstiel / KU MB
* Haylee Nedblake / KU EEB
* Surbhi Kumawat
* Brandon Fagen / Zanders Lab, Stowers Institute
* Camila G. Meneses / KU EEB
* Kayla Clouse / KU EEB
* Anjali Gupta / KU EEB
* Meridia Jane Bryant / KU EEB
* Vanna Hay / KU Anthropology
* Christian Supsup / KU EEB
* Justin M. Bernstein / KU Center for Genomics
* Sharif Tusuubira / KU EEB
* Nikki Dyanne Realubit / KU EEB
Instructors (Name / department or affiliation)
* Boryana Koseva / Children's Mercy Research Institute
* Caroline Kisielinski / KU Anthropology
Helpers (Name / department or affiliation)
* Amanda Katzer / KU EEB
* Elizabeth Everman / KU MB
* Tami Albin / KU Libraries
* Martina Dalikova / KU EEB
* Joel Swift / KBS
* Jocelyn Colella / KU EEB, Biodiversity Institute

A helpful resource for shell scripting: Software Carpentry Shell lesson, section 6, Shell Scripts: https://swcarpentry.github.io/shell-novice/06-script/index.html

*Data Wrangling and Processing for Genomics (cont.)
1. Overview of steps:
a. fastqc to look at the quality of the data
b. Trimmomatic to trim lower quality reads
c. fastqc again to look at the difference in quality metrics
d. Index the reference genome (only needs to be done once!)
e. Align the trimmed reads to the (indexed) reference genome
f. Convert the alignments from a SAM file to a BAM file
g. Sort the BAM file (samtools was used to do this)
h. Calculate read coverage of positions in the genome
i. Detect SNPs
j. Filter and report SNPs in a VCF
2. Shell Script
a. create a new directory to keep everything tidy and organized
mkdir -p ~/dc_workshop/scripts
cd dc_workshop/scripts/
b. create a new shell script
- command: touch; creates the file, but with no contents
touch read_qc.sh
nano read_qc.sh
set -e
- ensures that if an error occurs, the script will exit instead of continuing on to the next step
cd ~/dc_workshop/data/untrimmed_fastq/
echo "Running FastQC..."
- echo is the bash version of print, so this prints the message to the terminal
fastqc *.fastq*
mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
echo "Saving FastQC results to Results folder..."
mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/
mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/
cd ~/dc_workshop/results/fastqc_untrimmed_reads/
echo "Unzipping FastQC archives..."
for filename in *.zip
do
unzip ${filename}
done
echo "Saving summary..."
cat *_fastqc/summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
- write the script out with ctrl-o, then enter to save it under the script's file name
- exit nano with ctrl-x
- look at the script with: cat read_qc.sh
- run the script with: bash read_qc.sh
c. if you run the same script multiple times, it will usually replace the previous outputs
- unzip will ask if you want to replace files. For this script, type A to replace All for each of the archives.
d. Benefits of having the entire first steps in a shell script:
- documentation of how you did your analysis
e. Check on your finished script & add an echo statement to the end of the script
ls -l ~/dc_workshop/docs
nano read_qc.sh
echo "I am done!"
ctrl-x to exit nano, y, then hit enter to save to the file name
f. Download a script for the entire pipeline:
cd ~/dc_workshop/scripts/
curl -O https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/run_variant_calling.sh
ls
cat run_variant_calling.sh
- What the script does (condensed in the sketch below):
1. set -e, to have the script exit if there is an error instead of continuing through the commands
2. changes directory to dc_workshop/results
3. stores the genome path in a variable
4. indexes the genome with BWA
5. makes some directories, using -p
6. Sets lots of variables and filepaths (in the for loop):
- print which file you are working with
- use basename to extract the sample name
- print the basename to the terminal
- set the full paths to the fastq files
- anticipate where the sam file will be saved and its name
- anticipate where the bam file will be saved and its name
- anticipate where the sorted bam file will be saved and its name
- anticipate where the bcf will be saved and its name
- anticipate where the vcf will be saved and its name
- anticipate where the final variants vcf will be saved and its name
7. Does the work with those variables (in the same for loop as (6)):
- bwa mem is run with the genome and fastq files to make the SAM file
- the SAM is converted to a BAM using samtools
- the BAM file is sorted with samtools
- the BAM file is indexed with samtools so downstream software can work with it much faster
- the raw bcf file is created from the genome and the sorted BAM with bcftools mpileup
- variants are called with bcftools call
- the variants are filtered with vcfutils.pl (a perl script)
*tip: always use curly brackets when referencing a variable you have set, ${variable} (e.g., ${fq1})
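A condensed sketch of the loop those two lists describe, using the paths and commands from this lesson; the downloaded run_variant_calling.sh remains the authoritative version:

# reference genome stored in a variable (step 3)
genome=~/dc_workshop/data/ref_genome/ecoli_rel606.fasta
for fq1 in ~/dc_workshop/data/trimmed_fastq_small/*_1.trim.sub.fastq
do
    # extract the sample name and announce it (step 6)
    base=$(basename ${fq1} _1.trim.sub.fastq)
    echo "working with sample ${base}"
    # anticipate every input and output path up front (step 6)
    fq2=~/dc_workshop/data/trimmed_fastq_small/${base}_2.trim.sub.fastq
    sam=~/dc_workshop/results/sam/${base}.aligned.sam
    bam=~/dc_workshop/results/bam/${base}.aligned.bam
    sorted_bam=~/dc_workshop/results/bam/${base}.aligned.sorted.bam
    raw_bcf=~/dc_workshop/results/bcf/${base}_raw.bcf
    variants=~/dc_workshop/results/vcf/${base}_variants.vcf
    final_variants=~/dc_workshop/results/vcf/${base}_final_variants.vcf
    # align, convert, sort, index, call, and filter (step 7)
    bwa mem ${genome} ${fq1} ${fq2} > ${sam}
    samtools view -S -b ${sam} > ${bam}
    samtools sort -o ${sorted_bam} ${bam}
    samtools index ${sorted_bam}
    bcftools mpileup -O b -o ${raw_bcf} -f ${genome} ${sorted_bam}
    bcftools call --ploidy 1 -m -v -o ${variants} ${raw_bcf}
    vcfutils.pl varFilter ${variants} > ${final_variants}
done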
g. Exercise:
cp run_variant_calling.sh run_variant_calling_on_full_fastq.sh
- Use nano to modify run_variant_calling_on_full_fastq.sh so that it runs on the full fastq files (~/dc_workshop/data/trimmed_fastq) instead of the small fastq files.
- Solution:
1. In the for loop, change the filepath to read: ~/dc_workshop/data/trimmed_fastq/*_1.trim.fastq.gz (unlike the small files, the full files have no ".sub" in their names)
2. Change the suffix passed to basename to _1.trim.fastq.gz
3. Change the filepaths for the variables fq1 & fq2 to ~/dc_workshop/data/trimmed_fastq/${base}_1.trim.fastq.gz and ~/dc_workshop/data/trimmed_fastq/${base}_2.trim.fastq.gz
4. Add echo statements between the commands to make notes for yourself
5. In the bwa mem line, add the flag -t 4, which allows 4 threads to be used
- Add the comment: # -t 4 is used to speed up the alignment by utilizing 4 threads
*BONUS EXERCISE* Annotate your script!
- add echo statements
- use # for any comments; anything after the # is ignored
# This is an example of a correctly formatted comment
- Run the script!
bash run_variant_calling_on_full_fastq.sh

*RNA-seq Analysis Pipeline: Overview
SLIDES: https://github.com/ereverman/DC_Genomics_2023
- Note that these slides are more detailed than those that were shared in the workshop, but the information about the pipeline is included.
1. The pipeline will depend on what your data are and what your question is
a. e.g., quantify how active a gene is
2. For the most part, bioinformatic pipelines start the same way
a. the first step is looking at the quality, and trimming low quality nucleotides from the reads
3. RNA-seq pipelines have reads that came from mature mRNAs
4. Steps:
a. quality control: low quality reads, poly(A) tails
b. alignment to a genome, and quantification
- There are a number of programs that will align, but the principles behind how they work vary. Which one you choose is often field and lab dependent.
- Alignment is different from what we did with genomic data (DNA vs mRNA).
1. mRNA doesn't have introns, so there are fairly large chunks that will not align contiguously to the genome (gapped alignment). BWA does not work for aligning RNA-seq data to a reference genome.
- some software requires a reference genome, other software requires a reference transcriptome. Some tools do not align at all (pseudoalignment, for example) and skip straight to quantification.
c. Look for differences in expression
- All packages (or most!) for this are in R
5. Example packages for a pipeline (Boryana's favorites; a rough sketch of how they chain together follows this list):
a. Quality Control
- trim and filter with fastp (https://github.com/OpenGene/fastp)
- MultiQC can assemble all of the reports from fastp into one (https://multiqc.info/)
b. Alignment to genome and quantification
- Salmon to index the reference transcriptome and to quantify (https://combine-lab.github.io/salmon/)
- Kallisto is an alternative to Salmon (https://pachterlab.github.io/kallisto/manual)
c. Differential Expression Analysis (normalize and run the DE analysis)
- DESeq2 (https://bioconductor.org/packages/devel/bioc/manuals/DESeq2/man/DESeq2.pdf)
- Sleuth is an alternative that is from the developers of Kallisto (https://pachterlab.github.io/sleuth/manual)
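A rough sketch of how those tools might chain together for one paired-end sample. The file names (sample_1.fastq.gz, transcriptome.fa, etc.) are hypothetical, and the flags shown are common usage rather than a recipe from this workshop, so check each tool's documentation before running:

# QC + trimming with fastp (hypothetical paired input files)
fastp -i sample_1.fastq.gz -I sample_2.fastq.gz -o sample_1.trim.fastq.gz -O sample_2.trim.fastq.gz
# collect all QC reports found in the current directory into one report
multiqc .
# index the reference transcriptome once, then quantify this sample
salmon index -t transcriptome.fa -i salmon_index
salmon quant -i salmon_index -l A -1 sample_1.trim.fastq.gz -2 sample_2.trim.fastq.gz -o sample_quant
# the resulting quant files are then imported into R for DESeq2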
*Introduction to R and RStudio for Genomics
Introduction to R and RStudio Reference Page: https://datacarpentry.org/genomics-r-intro/reference.html
Full Lesson Materials: https://datacarpentry.org/genomics-r-intro/
Beginning R Markdown Cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
1. Sign into RStudio
a. go to AmazonInstance:8787 in a browser
username: dcuser
password: data4Carp
2. Create a project
File > New Project > New Directory > New Project
Directory name: dc_genomics_r
Create project as a subdirectory of: home
3. Create an R script
File > New File > R Script
File > Save
File name: genomics_r_basics
4. Basic commands (write them in the script)
a. show the working directory
getwd()
b. leave comments with #
# this command shows the working directory
c. change/set our working directory
setwd(dir = "/home/dcuser/dc_genomics_r")
d. the round command rounds a number
round(3.14)
- ask R what the command round does
?round()
args(round) # outputs the arguments that round takes
5. R basics
a. variables = objects
- create an object: ObjectName <- ObjectValue
a <- 1
- rename/remove objects:
1. avoid whitespace and dashes in object names
2. avoid object names that are similar to program names in your environment
gene_name <- 'tp53'
rm(gene_name) # remove the object
- data types of variables: numeric, character, and logical
- Exercise: Using the mode() command, find the value type of each of the following variables:
chromosome_name <- 'chr02'
od_600_value <- 0.47
chr_position <- '1001701'
spock <- TRUE
pilot <- Earhart
b. Mathematical functions
(1 + (5 ** 0.5))/2
c. Vectors: collections of values of the same type (numeric, character, logical)
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")
mode(snp_genes) # character vector is the output
length(snp_genes) # get the length of the vector
- Manipulating vectors:
# create 3 new vectors
snps <- c("rs53576", "rs1815739", "rs6152", "rs1799971")
snp_chromosomes <- c("3", "11", "X", "6")
snp_positions <- c(8762685, 66560624, 67545785, 154039662)
# retrieve values from a vector
snps[3] # get the 3rd value in the snps vector
snps[1:3] # get the first 3 values in the snps vector
snps[c(1,3,4)] # get the first, third, and fourth values
snps[c(1:3, 4)]
- Add, modify, and delete values in vectors
snp_genes <- c(snp_genes, "CYP1A1", "APOA5") # append to the end
snp_genes[-6] # will remove whatever is at the sixth position
snp_genes[6] <- "APOA5" # add a value in the sixth position
d. Subset values (logical subsetting)
snp_positions[snp_positions > 100000000]
snp_positions > 100000000
e. Other common operations
# look for missing data
is.na(snp_genes)
# find if a specific value is present in a vector using %in%
c("ACTN3", "APOA5") %in% snp_genes
6. Reading in data
a. read.csv function
variants <- read.csv("/home/dcuser/r_data/combined_tidy_vcf.csv")
7. Manipulate the variants dataframe
summary(variants) # summarize each of the variables in the dataframe
subset <- data.frame(variants[, c(1:3, 6)]) # [rows, columns]
str(subset) # structure function
8. Factors: vectors that are specialized for categorical data
alt_alleles <- subset$ALT # pull out the ALT column in the subset dataframe
head(alt_alleles) # look at the top of the object
snps <- c(alt_alleles[alt_alleles == "A"],
alt_alleles[alt_alleles == "T"],
alt_alleles[alt_alleles == "C"],
alt_alleles[alt_alleles == "G"])
factor_snps <- factor(snps) # make snps a factor object instead of a character object
summary(factor_snps) # gives raw counts of the different factor levels
plot(factor_snps)
ordered_factor_snps <- factor(factor_snps, levels = names(sort(table(factor_snps)))) # order the factor levels
plot(ordered_factor_snps)
9. Further subset the variants dataframe
# only want to look at one sample:
SRR2584863_variants <- variants[variants$sample_id == "SRR2584863", ]
dim(SRR2584863_variants) # get the dimensions of the subset dataframe
summary(SRR2584863_variants)
10. Make R read data as a particular type
typeof(snp_chromosomes) # ask R what type of data this vector holds
snp_chromosomes_2 <- c(3, 11, "X", 6)
typeof(snp_chromosomes_2)
# force the data to be numeric
snp_chromosomes_2 <- as.numeric(snp_chromosomes_2)
snp_chromosomes_2 # as.numeric forces the "X" to become an NA
typeof(snp_positions) # calls this a double
snp_positions_2 <- c("8762685", "66560624", "67545785", "154039662")
typeof(snp_positions_2) # now a character vector
snp_positions_2 <- as.numeric(snp_positions_2)
typeof(snp_positions_2) # back to a double
# where forcing things goes terribly wrong
as.numeric(variants$sample_id) # the entire column turns into NAs
11. Write data out to a new csv file
write.csv(SRR2584863_variants, file = "../dc_workshop/data/SRR2584863_variants.csv")
12. Import data from Excel
File > Import Dataset > From Excel > Choose File
13. Install packages
install.packages("BiocManager") # for the Bioconductor repository (https://www.bioconductor.org/)
BiocManager::version()
BiocManager::install("vcfR")
install.packages("dplyr") # package that provides ways to manipulate data; part of the tidyverse (https://www.tidyverse.org/)
install.packages("readr")
library("dplyr")
library("readr")
14. Use dplyr and readr to modify a data frame
select(variants, sample_id, REF, ALT, DP) # select columns out of a data frame
select(variants, -CHROM) # select all the columns except CHROM
filter(variants, sample_id == "SRR2584863") # pull out rows from the data frame
# chain rows and columns together with the pipe operator (%>%), which comes from the magrittr package loaded via dplyr
variants %>%
filter(sample_id == "SRR2584863") %>%
select(REF, ALT, DP)
# assign the result an object name
SRR2584863_variants <- variants %>%
filter(sample_id == "SRR2584863") %>%
select(REF, ALT, DP)
SRR2584863_variants %>% slice(1:6) # show the first 6 rows
SRR2584863_variants %>% slice(10:25) # show rows 10-25
# compute a probability from the Phred-scaled QUAL score and store it in a new column
variants %>% mutate(POLPROB = 1 - (10 ^ -(QUAL/10)))
15. ggplot for graphing
a. lets you create complex plots because everything is in a separate layer that stacks on top of the others
install.packages("tidyverse")
library("tidyverse")
install.packages("ggplot2")
library(ggplot2)
variants = read.csv("https://raw.githubusercontent.com/naupaka/vcfr-for-data-carpentry-draft/main/output/combined_tidy_vcf.csv")
glimpse(variants) # take a sneak peek at the data
# first create a plot template, then tell ggplot what type of representation to use for the data - in this case, a scatterplot:
ggplot(data = variants, aes(x = POS, y = DP)) +
geom_point()
coverage_plot <- ggplot(data = variants, aes(x = POS, y = DP))
coverage_plot + geom_point(alpha = 0.5, color = 'blue')
ggplot(data = variants, aes(x = POS, y = DP, color = sample_id)) +
geom_point(alpha = 0.5)

We welcome your feedback for today's workshop: https://forms.gle/YGgkM6r5Y5y2WBzDA