*Data Carpentry Genomics Workshop
University of Kansas Online Workshop via Zoom
Tuesday, January 10 - Thursday, January 12, 2023
9:00 am - 4:00 pm CST (UTC-6) all days
Lunch break each day 12:00 noon - 1:00 pm
Course website: https://kulibraries.github.io/2023-01-10-ku-dc-online/
This etherpad: https://pad.carpentries.org/2023-01-10-ku-dc-online

Welcome to The Carpentries Etherpad! This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents. Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org). Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

*Connection instructions for your AMI instance:
To connect via shell, at your terminal prompt, type:
ssh dcuser@(your AMI instance name).compute-1.amazonaws.com
Use the password included in your email
To connect to RStudio, in your browser address bar, type:
(your AMI instance name).compute-1.amazonaws.com:8787
Your username and password are the ones included in your email

----- Scroll to the bottom for FRESH notes -----

*Please fill out the pre-workshop survey located on the course website: https://kulibraries.github.io/2023-01-10-ku-dc-online/#surveys

*Tuesday, January 10
Attendees (Name / department or affiliation)
* Nikki Dyanne Realubit / KU - EEB
* Justin M. Bernstein / KU Center for Genomics
* Natalie Ford / KU - EEB
* Sharif K Tusuubira / KU EEB
* Brandon Fagen / Zanders Lab, Stowers Institute
* Camila G. Meneses / KU - EEB
* Kervens Accilien / KU - MB
* Vanna Hay / KU - Anthropology
* Haylee Nedblake / KU - EEB
* Justin Van Goor / KU - EEB
* Christian Supsup / KU - EEB
* Anjali Gupta / KU - EEB
* Surbhi Kumawat
* Srivatsan Parthasarathy
* Mark Herr / KU - EEB
* Arezoo Fani / KU - EEB
* Kayla Clouse / KU - EEB
* Meridia Jane Bryant / KU - EEB
Instructors (Name / department or affiliation)
* Boryana Koseva / Children's Mercy Research Institute
* Caroline Kisielinski / KU Anthropology
Helpers (Name / department or affiliation)
* Amanda Katzer / KU Ecology & Evolutionary Biology
* Elizabeth Everman / KU Molecular Biosciences / Host
* Tami Albin / KU Libraries
* Jocelyn Colella / KU Ecology & Evolutionary Biology + Biodiversity Institute
* Joel Swift / KBS
* Martina Dalikova / KU Ecology & Evolutionary Biology

*Project Organization and Management for Genomics
Project Organization Reference Page: https://datacarpentry.org/organization-genomics/reference.html
Full Lesson Material: https://datacarpentry.org/organization-genomics/
1. Metadata - data that accompanies the sequencing data
a. Types of data people have generated with their samples:
- fastq files
- voucher information (identification / museum accession ID)
- sample concentration (DNA or RNA)
- locality information for collected samples
- tissue type
- barcode information
- quality scores
b. Where people record the data from (a):
- elab notebook
- csv files
- handwritten field notes
c. The way we organize spreadsheets may be friendly to humans, but not to computers
d. Record everything (especially the things you think you will remember later!) and make multiple copies
2. Tidy Data
a. Every observation/sample goes in a row
- each sample has a unique identifier
- sample IDs should not contain any spaces; use dashes or underscores instead
b. Every variable/characteristic goes in a column
- variable names should be detailed enough to understand what you are recording
- only one type of data per variable (no combining variables in a column)
c. Potential problems that we see with this spreadsheet: https://github.com/datacarpentry/organization-genomics/raw/gh-pages/files/Ecoli_metadata_composite_messy.xlsx
- nonuniformity of data
- white spaces in some variable names
- inconsistency of variable names (Run vs run - capitalized vs not)
- redundancy
- multiple tables in one spreadsheet
- missing data (only some samples have sequencing depth recorded)
d. A couple of rules:
- Leave raw data raw; make copies if you need to subset.
- Put each observation or sample in its own row.
- Put all your variables in columns - the things that vary between samples, like 'strain' or 'DNA-concentration'.
- Have column names be explanatory, but without spaces. Use '-', '_' or camel case instead of a space (ThisIsAnExampleOfCamelCase). For instance 'library-prep-method' or 'LibraryPrep' is better than 'library preparation method' or 'prep', because computers interpret spaces in particular ways.
- Export the cleaned data to a text-based format like CSV (comma-separated values). This ensures that anyone can use the data, and it is required by most data repositories. A tidy example is sketched below.
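As a quick illustration of those rules, here is a minimal sketch of a machine-friendly metadata file; the sample IDs, column names, and values are hypothetical, not from the workshop data:

cat > sample_metadata.csv <<'EOF'
sample_id,tissue_type,dna_conc_ng_ul,collection_locality
KU-0001,liver,25.3,Douglas-County-KS
KU-0002,muscle,18.7,Douglas-County-KS
EOF

Each row is one sample, each column is one variable, and none of the names contain spaces.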
*Introduction to the Command Line for Genomics
Introduction to the Command Line Reference Page: https://datacarpentry.org/shell-genomics/reference.html
Command quick reference: https://github.com/kulibraries/swc-workshop-helps/blob/master/command-handout.md
Full lesson material: https://datacarpentry.org/shell-genomics/
Useful Unix/Linux command reference: Unix and Linux Visual Quickstart Guide, 5th edition: http://www.worldcat.org/oclc/921910373
It's also possible to do an internet search for what you want to do in the shell and find lots of advice. You can also find the manual page online for any command by searching for the command plus "man page", e.g. ls man page
1. Every sequencing facility is different, so find out what information they need before they receive samples
2. Open your CLI (command line interface)
a. mac: applications -> utilities -> terminal
b. windows: right click on git bash to open
3. Why do we use bash?
a. our desktops/laptops are usually not able to handle processing next-gen data, so we use remote servers/computers, which are often only accessible via a CLI
b. software for analysis is often only accessible via a CLI - only a few programs have a GUI (graphical user interface)
c. helps prevent human error
d. helps us be efficient (or lazy!) and avoid typing the same commands over and over - automation of commands
4. Amazon instances
a. How to log in:
ssh dcuser@YOURAMAZONINSTANCEINFO
hit enter
type 'yes' when the following message appears:
The authenticity of host 'ec2-34-236-190-128.compute-1.amazonaws.com (34.236.190.128)' can't be established. ECDSA key fingerprint is SHA256:zM1C7+o+wEDkfnSe0nu2c0YKAhxKGZLXZAzkNtLVe4E. Are you sure you want to continue connecting (yes/no/[fingerprint])?
type in the password: data4Carp (you will not see the password as you type it)
b. 'clear' gives you a blank slate screen for the terminal
clear
you will see (base) dcuser@yourinstance:~$
5. Navigate the filesystem
a. commands - programs that do something for us and produce output
b. Check where we are in the filesystem - our working directory
pwd
c. What are the contents of our folder?
ls
d. move from directory to directory (folder = directory here)
cd DirectoryName
cd shell_data
then type ls to see what is in that folder
e. use ls to check what is a directory, program, or file
ls -F
name/ is a directory
name* is a program
name is a file
f. get the manual
man CommandOfInterest
man ls
- Challenge: find the flag that will give us additional information (AKA long format) and test it!
man brings you into another program, so quit with q to get back to the command shell
ls -l
What ls -l tells you: permissions, owner of the file, size of the file, date of last modification
g. Can give multiple options at a time
ls -F -a
- -a will show everything, even hidden folders (.foldername is a hidden folder)
1. ../ is one directory up in the filesystem
2. ./ means this directory
h. go into the untrimmed_fastq folder
cd shell_data/untrimmed_fastq
- Challenge: Look at what files are in the untrimmed_fastq folder
1. command to use: ls
2. you should see 2 fastq files
i. Go back to your home directory (without having to type the whole address out)
cd ~
- the tilde (~) is the shorthand name for your home directory
j. tab complete - allows you to complete a folder or file name given sufficient uniqueness
cd shell_data/untrimmed_fastq/ (press tab at some point while typing shell_data, and again while typing untrimmed_fastq)
- conditions:
1. sufficient uniqueness in the name - will autocomplete up to the next character that distinguishes between the folders/files
- hitting double tab once you reach the ambiguous part of a name lists the options that match
- double tab lists what starts with what you have typed (file names when using ls, command names otherwise)
k. echo $PS1
- PS1 is a variable; $ means "the value of"
- setting PS1='$ ' changes your prompt, but this is not permanent
l. cd .. will take you to the directory above your current directory
m. ls can show you the contents of your current directory OR of a directory that you specify
cd shell_data
ls untrimmed_fastq
- see contents of home directory: How do I display the contents of my home directory while I am in shell_data/?
ls ~
ls ..
ls /home/dcuser
- Full path vs relative path (a short sketch follows the challenges below):
1. full path: starting at the root (/), the full address to the folder - you can be in any directory and will be given the contents of the directory of interest
2. relative path: given where I am right now, give me the contents of this folder (..)
- ls .. will give you different responses depending on what your current working directory is
- some instances may have a program called 'tree' installed, which lets you see your directory structure
tree shell_data
n. Combining multiple flags to list a folder
ls -F -a -l untrimmed_fastq
ls -Fal untrimmed_fastq
ls -Fal
- Challenge: What are the contents of the hidden directory in shell_data?
ls -Fal .hidden/
What is the full path of the .hidden directory?
/home/dcuser/shell_data/.hidden/
Using the filesystem diagram below, if pwd displays /Users/thing, what will ls ../backup display?
1. ../backup: No such file or directory
2. 2012-12-01 2013-01-08 2013-01-27
3. 2012-12-01/ 2013-01-08/ 2013-01-27/
4. original pnas_final pnas_sub
Correct answer is 4
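A short sketch of the full-vs-relative-path distinction, using this lesson's filesystem and assuming your working directory is /home/dcuser/shell_data:

ls /home/dcuser/shell_data/untrimmed_fastq    # full path: works from any working directory
ls untrimmed_fastq                            # relative path: only works from inside shell_data

Both commands list the same folder; only the starting point differs.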
CUSTOMISING YOUR COMMAND PROMPT
You may have a prompt (the characters to the left of the cursor) that looks different from the $ character used here. If you would like to change your prompt to match the example prompt, first type the command:
echo $PS1
into your shell, followed by pressing the Enter key. This will print the bash special characters that currently define your prompt. To change the prompt to a $ (followed by a space), enter the command:
PS1='$ '
Your window should now look like our example in this lesson. To change back to your original prompt, type the output of the earlier echo $PS1 command (this will be different depending on the original configuration) between the quotes in the following command:
PS1=""
For example, if the output of echo $PS1 was \u@\h:\w $ , then type those characters between the quotes: PS1="\u@\h:\w $ ". Alternatively, you can reset your original prompt by exiting the shell and opening a new session. This isn't necessary to follow along (in fact, your prompt may have other helpful information you want to know about). This is up to you!
----------------------------------------
mac workaround for connectivity dropping:
1. completely log out of the amazon instance
2. while in your home directory: nano .ssh/config
3. add the line: ServerAliveInterval 50
4. ctrl-X and yes to save
5. log back into the amazon instance: ssh dcuser@AmazonInstance
----------------------------------------
6. Working with files
a. wildcard *
- *.fastq means anything ending in .fastq
- example used in a command: ls /usr/bin/*.sh
- placement of the wildcard matters!
- Exercises: Do each of the following tasks from your current directory using a single ls command for each:
1. List all of the files in /usr/bin that start with the letter 'c'.
ls /usr/bin/c*
2. List all of the files in /usr/bin that contain the letter 'a'.
ls /usr/bin/*a*
3. List all of the files in /usr/bin that end with the letter 'o'.
ls /usr/bin/*o
Using the 'up arrow' on your keyboard will bring back the last command you ran
b. Access previously run commands:
- up arrow
- ctrl-c (kill a command; this is why this windows keyboard shortcut doesn't copy in a CLI!)
- ctrl-r (reverse search)
- history
1. you can run a command again with !lineNumber
c. Examine our files
- print an entire file to the terminal
cat FileName
- the less command shows a read-only version
less FileName
1. Commands to navigate through 'less':
- space to go down, b to go up
- g to go to the beginning
- G to go to the end
- q to quit
2. worthwhile to go through the manual for less!
- look at only the top few lines of a file with 'head'
head FileName
1. the default shows only the first 10 lines; head -n 1 FileName will show you just the first line
- look at only the bottom few lines of a file with 'tail'
tail FileName
1. the default shows only the last 10 lines; tail -n 1 FileName will show you just the last line
d. Fastq format (an illustrative record is sketched below)
- first line will always be @ followed by information about the read
- second line will be the nucleotides
- third will be + and may repeat the same info as line 1
- fourth line will be quality score information
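A sketch of one four-line fastq record; the file name, read name, and values are illustrative, not copied from the lesson files:

head -n 4 example.fastq
@SRR000001.1 HWUSI-EAS1599_1:2:1:0:968 length=35
ATNNCCACTTCCAGTGCTGANNNNNNNGGGATCCA
+SRR000001.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!''*((((***+))%%%++)(%%%%).1***-+*'

The fourth line has one quality character per base on the second line.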
e. Copying files
cp FileToCopy NameOfCopy
f. Making a new directory
mkdir NameOfNewDirectory
g. Move files
mv FileToMove DirectoryToMoveItTo
h. Change the permissions of a file (plus the command to remove a file)
- provides extra security so we don't accidentally delete it
- see the permissions with ls -l
1. the first 3 characters denote permissions for the user who created the file
2. the second 3 and third 3 denote permissions for the group and all others, respectively
- group 2: within your user group
- group 3: everyone else
3. the characters stand for Read-Write-Execute (r-w-x)
- remove the ability to write:
chmod -w FileName
- after changing permissions, there is an extra safeguard against removing (rm FileName): it will ask if you are sure you want to remove the file
1. rm will not remove a directory by default (to remove a directory, use the -r flag)
i. Search within files without opening them
grep PatternOfInterest FileName
1. will look for strings, or can look for regular expressions
- example: look for reads with a run of Ns
grep NNNNNNNNNN SRR098026.fastq
- want to know the entire 4-line chunk that matches a pattern:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
1. B refers to before; A refers to after; the number after B/A is the number of lines to include
- Grep Exercises:
Search for the sequence GNATNACCACTTCC in the SRR098026.fastq file. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match.
Solution:
grep -B1 "GNATNACCACTTCC" SRR098026.fastq
example output:
@SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
GNATNACCACTTCCAGTGCTGANNNNNNNGGGAT
Search for the sequence AAGTT in both FASTQ files. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match.
Solutions:
grep -B1 AAGTT SRR098026.fastq SRR097977.fastq
grep -B1 AAGTT *.fastq
These will both search the files, but in different orders
- grep output prints to the screen by default. To redirect the output, use '>' followed by a file name for bash to create and store the output in:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
j. word count (wc)
wc FileName
- output is: lines words characters FileName
- wc -l FileName will give just the number of lines
- How many sequences in SRR098026.fastq contain at least 3 consecutive Ns?
grep NNN SRR098026.fastq > bad_reads.txt
wc -l bad_reads.txt
Alternative:
grep NNN SRR098026.fastq | wc -l
k. > writes a file, >> appends to a file
- If you give the same output file name using '>' in multiple commands, the file will be overwritten:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
wc -l bad_reads.txt
- output: 537 bad_reads.txt
grep -B1 -A2 NNNNNNNNNN SRR097977.fastq > bad_reads.txt
wc -l bad_reads.txt
- output now shows: 0 bad_reads.txt
- use '>>' to keep both results by appending the second:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
wc -l bad_reads.txt
grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt
wc -l bad_reads.txt
- output now shows: 537 bad_reads.txt
Alternative way to write it:
grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.txt
wc -l bad_reads.txt
l. Pipes |
- the | character is typed with shift plus the backslash (\) key
- link commands together into one line by taking the output of the first command and feeding it into the next
1. reduces the number of intermediate output files
- example that combines grep and wc:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l
- modified grep command with a pipe:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | grep -v '^--' > bad_reads.txt
1. the -v flag means the inverse of the match: pull out everything that doesn't match the pattern
2. ^ anchors the search to the beginning of the line
- you can have multiple pipes in the same line! (see the sketch below)
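A small sketch of chaining more than one pipe, using the two lesson files; this counts the reads containing at least 3 consecutive Ns across both files at once:

cat *.fastq | grep NNN | wc -l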
m. Loops
- For loop
1. repeats a command for each item in a list (each cycle through the loop is an iteration)
2. Example with the variable foo:
foo=abc
echo foo is $foo
- $ means "the value of" the variable
- you can remove a variable with unset VariableName
echo foo is ${foo}efg
- this does not actually reassign the variable to include efg; it is still only abc
pwd
for filename in *.fastq
do
head -n 2 ${filename}
done
3. re-write the loop so there is an output file instead of printing to the terminal:
for filename in *.fastq
do
head -n 2 ${filename} >> seq_info.txt
done
4. ctrl-c to kill a loop if it gets stuck!!
- basename in for loops
1. strips a file name of the extra parts that you don't need/want
2. command: basename InputFile PartToRemove
basename SRR097977.fastq .fastq
3. example of basename in a for loop:
for filename in *.fastq
do
name=$(basename ${filename} .fastq)
echo ${name}
done
- Loop exercise: Print the file prefix of all of the .txt files in our current directory.
Solution:
for filename in *.txt
do
name=$(basename ${filename} .txt)
echo ${name}
done
Alternative way to write the same loop (one line):
for filename in *.txt; do name=$(basename ${filename} .txt); echo ${name}; done
7. Writing scripts
a. nano - a text editor that can read and write files in the terminal
nano FileName
1. FileName can be a new file that doesn't exist yet, or a file that already exists
- not the only text editor out there
- Navigation:
1. the top tells you that you are in nano and shows the file name
2. a key for what you can do is at the bottom (^ means ctrl in nano)
3. the cursor is where you write
- Exercise: Open README.txt, add your name to the top of the file, and save the file.
b. Create a script that pulls out the bad reads
nano bad-reads-script.sh
- .sh indicates that it is a bash script
grep -B1 -A2 -h NNNNNNNNNN *.fastq | grep -v '^--' > scripted_bad_reads.txt
- written within the nano text editor
ctrl-x to get out of nano, and save changes with y
bash bad-reads-script.sh
- runs the script
less scripted_bad_reads.txt
c. change the script's permissions to make it executable and bypass needing to call 'bash'
chmod +x bad-reads-script.sh
./bad-reads-script.sh
- this now runs the file without calling bash first
8. Getting data onto a remote server
a. download data
- command: wget
1. downloads data from a website
- command: curl
1. displays webpages or data without downloading them
2. can download the data too (a curl equivalent of the wget command in (c) is sketched below)
b. What programs do you have access to?
- command: which ProgramName
which curl
which wget
c. download using wget
wget ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt
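A sketch of the equivalent download with curl; the -O flag saves the file under its remote name, and the URL is the same one used with wget above:

curl -O ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt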
SURVEY TO PROVIDE FEEDBACK FOR DAY 1: https://forms.gle/YGgkM6r5Y5y2WBzDA

*Introduction to Cloud Computing for Genomics
Introduction to Cloud Computing Reference Page: https://datacarpentry.org/cloud-genomics/reference.html
Full Lesson Materials: https://datacarpentry.org/cloud-genomics/

*Wednesday, January 11
Attendees (Name / department or affiliation)
* Sharif K Tusuubira / KU EEB
* Natalie Ford / KU EEB
* Nikki Dyanne Realubit / KU EEB
* Justin M. Bernstein / KU Center for Genomics
* Anjali Gupta / KU EEB
* Arezoo Fani / KU EEB
* Surbhi Kumawat
* Justin Van Goor / KU EEB/KBS
* Haylee Nedblake / KU EEB
* Srivatsan Parthasarathy
* Meridia Jane Bryant / KU EEB
* Brandon Fagen / Zanders Lab, Stowers Institute
* Vanna Hay / KU Anthropology
* Camila G. Meneses / KU EEB
* Jess Pfannenstiel / KU MB
* Kervens Accilien / KU MB
* Christian Supsup / KU EEB
* Kayla Clouse / KU EEB
Instructors (Name / department or affiliation)
* Boryana Koseva / Children's Mercy Research Institute
* Caroline Kisielinski / KU Anthropology
Helpers (Name / department or affiliation)
* Amanda Katzer / KU Ecology & Evolutionary Biology
* Elizabeth Everman / KU MB
* Tami Albin / KU Libraries
* Martina Dalikova / KU EEB
* Joel Swift / KBS
* Jocelyn Colella / KU EEB, Biodiversity Institute

*Data Wrangling and Processing for Genomics
Data Wrangling Reference Page: https://datacarpentry.org/wrangling-genomics/reference.html
Full Lesson Materials: https://datacarpentry.org/wrangling-genomics/
1. Downloading data
a. use curl commands to download data from the European Nucleotide Archive
- create a script in the home directory using nano (nano download_data.sh) containing:
mkdir -p ~/dc_workshop/data/untrimmed_fastq/
cd ~/dc_workshop/data/untrimmed_fastq/
plus the curl commands listed below
1. the -p flag for mkdir creates the parent directories as necessary
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz
- get out of nano: ctrl-o to save, ctrl-x to exit
- sh download_data.sh starts downloading those files
1. the script opens a new shell when it runs (not always the case on every remote server)
- there is a copy of the data in a hidden directory (if your script isn't downloading the data quickly):
ls dc_workshop/data/untrimmed_fastq/
cp ~/.backup/untrimmed_fastq/*.fastq.gz ~/dc_workshop/data/untrimmed_fastq/
2. Bioinformatic workflows
a. there are many, many workflows, because there are many ways to do each step
- mixing and matching tools is possible thanks to standards in data formatting
cd dc_workshop/data/untrimmed_fastq/
ls
gunzip SRR2584863_1.fastq.gz
head -n 4 SRR2584863_1.fastq
1. first line contains info about the sequence
2. second line is the actual sequence
3. third line is a placeholder + sign
4. fourth line is the quality score for the sequence
5. you will always see these 4 lines in the fastq format
b. Finding bad reads
- use quality scores, but these are difficult for humans to read
- fastqc can easily read these quality scores and give you some stats about them
1. Is it installed? If installed, this returns a path to the executable:
which fastqc
2. The help flag is -h (you often see this flag with many programs, but not all):
fastqc -h
3. Re-zip the file we unzipped previously; fastqc can work with either zipped or unzipped files.
gzip SRR2584863_1.fastq
ls -l (plus a flag that will properly display file sizes; search for it with man ls)
ls -lh
4. Run fastqc on the files:
fastqc SRR258*
5. Output
- you still have each of the starting files, but now also an .html and a .zip file for each
- the html file is a report, but it cannot be opened on the remote server, so you have to download it
- Download the html files (command: scp SOURCE DESTINATION):
Open a second terminal window for your local machine
cd ~/Desktop
mkdir -p dc_workshop/data/untrimmed_fastq
cd dc_workshop/data/untrimmed_fastq
scp dcuser@YourAmazonInstanceID:~/dc_workshop/data/untrimmed_fastq/*.html .
Then enter your password
Navigate to your desktop folder and open a report in a browser
- html report:
1. Encoding - important to know how the quality scores are encoded
2. Total sequences in the file
3. Also gives sequence length and GC content
4. Per base sequence quality graph. The y-axis is the quality score. You want to be in the green - 20 or above.
- a tail-off in quality at the end of the read is normal, and those bases tend to be trimmed later
5. Per base sequence content - often some craziness at the first few bases; doesn't normally affect analyses
6. Sequence Duplication Levels - worth looking at
7. Adapter Content - will be trimmed away in the next steps
8. Examples can be found here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- the zip files contain a number of text files with data
1. Exercise: Write a loop to unzip your files
for zipFile in *.zip
do
unzip ${zipFile}
done
2. ls -F shows that the unzipped outputs are directories. They contain:
- 2 directories
- 4 files, including summary.txt
look at summary.txt with less (less SRR2584863_1_fastqc/summary.txt); it contains the pass/warn/fail for the different categories
- Housekeeping of output:
mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads
mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads
cd ~/dc_workshop/results/fastqc_untrimmed_reads
ls
6. Summarize what you have done in fastqc
mkdir -p ~/dc_workshop/docs
cat *_fastqc/summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
cd ../../docs
1. Exercise: Using what you have learned about grep, can you find how many tests failed?
- recipe for grep: grep Pattern FileName
grep ^FAIL fastqc_summaries.txt | wc -l
c. Trimming and Filtering - Trimmomatic
1. is it on the machine?
which trimmomatic
2. The manual/help gives usage for paired-end (PE) or single-end (SE) reads:
trimmomatic -h
3. It will not modify your original files. It reads them in, modifies, and writes separate output files.
4. Run (recipe):
trimmomatic PE -threads 4 SRR2584863_1.fastq.gz SRR2584863_2.fastq.gz \
SRR2584863_1.trimmed.fastq SRR2584863_1un.trimmed.fastq \
SRR2584863_2.trimmed.fastq SRR2584863_2un.trimmed.fastq \
ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:20
Set up and run it:
cd ~/dc_workshop/data/untrimmed_fastq/
cp ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/NexteraPE-PE.fa .
head -n 5 NexteraPE-PE.fa
trimmomatic PE SRR2584863_1.fastq.gz SRR2584863_2.fastq.gz SRR2584863_1.trimmed.fastq SRR2584863_1un.trimmed.fastq SRR2584863_2.trimmed.fastq SRR2584863_2un.trimmed.fastq ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:20
- ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 means: the adapter file, then the seed mismatches, palindrome clip threshold, and simple clip threshold
- SLIDINGWINDOW:4:20 looks at 4 bp at a time and checks that the average quality score is at least 20; where it drops below 20, the read is cut. It then moves to the next window.
5. Output:
- Input Read Pairs: ### Both Surviving: ### Forward Only Surviving: ### Reverse Only Surviving: ### Dropped: ###
- 6 files (3 for each half of the paired reads):
1. raw data
2. paired trimmed data
3. unpaired trimmed data
- File sizes are a good place to sanity check that trimmomatic worked:
ls -lh SRR2584863_*
1. trimmed files should be smaller than the raw files. Double check that you aren't comparing zipped to unzipped file sizes!
6. Write a loop to process all of the samples in one go:
for infile in *_1.fastq.gz
do
base=$(basename ${infile} _1.fastq.gz)
trimmomatic PE ${base}_1.fastq.gz ${base}_2.fastq.gz \
${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz \
${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
done
7. Adapter files that are commonly used with Trimmomatic:
ls ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/
8. Housekeeping: organizing the files
mkdir ../trimmed_fastq
rm *trimmed.fastq
mv *.trim.* ../trimmed_fastq/
cd ../trimmed_fastq/
ls
- Run fastqc (again!) on the trimmed files, as sketched below
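A sketch of that re-run from inside trimmed_fastq/, assuming the trimmed outputs were moved there as above; fastqc accepts the gzipped files directly, and the report handling works the same as before:

fastqc *.fastq.gz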
3. Align to a reference genome
a. download the reference genome
mkdir -p data/ref_genome
curl -L -o data/ref_genome/ecoli_rel606.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz
ls data/ref_genome/
gunzip data/ref_genome/ecoli_rel606.fasta.gz
b. download a set of trimmed fastq files
curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
- curl flags:
-L means location: follow redirects
-o means output: write to a file instead of stdout
- When looking at a giant manual page, it can be helpful to search for the option you want to find. You can search in less (and in man pages) by typing a forward slash followed by the text you're looking for; for example, search for /-o. You can then jump to the next instance of the pattern by typing 'n', or the previous occurrence by typing 'N'.
tar xvf sub.tar.gz
ls sub
mv sub/ ~/dc_workshop/data/trimmed_fastq_small
cd
c. make a series of directories
mkdir -p results/sam results/bam results/bcf results/vcf
d. index the reference genome
- allows the aligner to quickly find matches to the reference
- program: BWA (Burrows-Wheeler Aligner)
bwa index data/ref_genome/ecoli_rel606.fasta
e. Align using bwa mem (MEM = maximal exact match)
- format: bwa mem ReferenceGenomeFasta InputFastq1 InputFastq2 > OutputSAMfile
1. SAM = sequence alignment/map
bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam
if you made the small directory and then moved the sub/ folder into it, your command might look like this:
bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/sub/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/sub/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam
f. SAM/BAM output
- contains information about each individual read and how it aligns to the reference genome
1. SAM file format documentation: https://samtools.github.io/hts-specs/SAMv1.pdf
- BAM files are binary alignment/map files and are smaller than SAM
1. Compress a SAM into a BAM:
samtools view -S -b results/sam/SRR2584866.aligned.sam > results/bam/SRR2584866.aligned.bam
2. Sort the BAM file, here by coordinate (there are several ways to sort, e.g. by name or by coordinate):
samtools sort -o results/bam/SRR2584866.aligned.sorted.bam results/bam/SRR2584866.aligned.bam
3. Stats with samtools:
samtools flagstat results/bam/SRR2584866.aligned.sorted.bam
- Can tell you:
1. the count of reads that passed and failed their QC metrics
2. % of reads that mapped to the genome
3. % of reads that are properly paired
4. Variant Calling
a. Look at where nucleotides in your samples differ from the reference (usually SNPs)
b. bcftools
- BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF.
- https://samtools.github.io/bcftools/bcftools.html
- Calculate read coverage at each position:
bcftools mpileup -O b -o results/bcf/SRR2584866_raw.bcf \
-f data/ref_genome/ecoli_rel606.fasta results/bam/SRR2584866.aligned.sorted.bam
ls results/bcf/
- Detect the variants at single-nucleotide positions using bcftools call (a quick way to count them is sketched below):
bcftools call --ploidy 1 -m -v -o results/vcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf
ls results/vcf/
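A quick sanity-check sketch using commands from earlier in the workshop: VCF header lines start with #, so the inverse match (-v) plus wc -l counts only the variant records:

grep -v "^#" results/vcf/SRR2584866_variants.vcf | wc -l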
- Filter the variants and clean them up with vcfutils.pl
1. a perl script
vcfutils.pl varFilter results/vcf/SRR2584866_variants.vcf > results/vcf/SRR2584866_final_variants.vcf
ls results/vcf
- Look into the VCF file:
less -S results/vcf/SRR2584866_final_variants.vcf
c. Assess the VCF files (visualization)
- prep:
1. index the bam file
samtools index results/bam/SRR2584866.aligned.sorted.bam
- tview in samtools: an alignment viewer
samtools tview results/bam/SRR2584866.aligned.sorted.bam data/ref_genome/ecoli_rel606.fasta
1. line 1 = position; line 2 = reference; line 3 = consensus sequences
- a period indicates a match to the reference
2. Shortcuts/navigation:
- arrow keys to scroll
- space jumps pages
- help window: ?
- go to a specific location: g
- exit the program: q
3. Challenge: What variant is present at position 4377265? What is the canonical nucleotide in that position?
Solution: the canonical (reference) nucleotide is A, and the variant is G
- IGV (Integrative Genomics Viewer)
1. opens on your local computer
2. Transfer files to your local machine using scp
open a new terminal window (do not sign into the amazon instance)
mkdir ~/Desktop/files_for_igv
cd ~/Desktop/files_for_igv/
scp dcuser@AMAZON-INSTANCE:~/dc_workshop/results/bam/SRR2584866.aligned.sorted.bam ~/Desktop/files_for_igv
scp dcuser@AMAZON-INSTANCE:~/dc_workshop/results/bam/SRR2584866.aligned.sorted.bam.bai ~/Desktop/files_for_igv
scp dcuser@AMAZON-INSTANCE:~/dc_workshop/data/ref_genome/ecoli_rel606.fasta ~/Desktop/files_for_igv
scp dcuser@AMAZON-INSTANCE:~/dc_workshop/results/vcf/SRR2584866_final_variants.vcf ~/Desktop/files_for_igv
3. Download the IGV software from the Broad Institute: https://software.broadinstitute.org/software/igv/download
unzip the file and then open it
4. Load in the reference genome file
Genomes > Genome from file > Desktop > files_for_igv > ecoli_rel606.fasta
5. Load in the sorted BAM and the VCF
File > Load from File > Desktop > files_for_igv > SRR2584866.aligned.sorted.bam
File > Load from File > Desktop > files_for_igv > SRR2584866_final_variants.vcf
6. Navigation
- at top right you can zoom in and out
- each bar across the top of the plot shows the allele for each locus
- a light blue bar is a homozygous variant
- "I" is an indel
- grey arrow bars indicate coverage
5. Log out of the amazon instance
exit
a. you can log out of your local terminal with exit too

Please give us your feedback for today here: https://forms.gle/YGgkM6r5Y5y2WBzDA

*Thursday, January 12
Attendees (Name / department or affiliation)
* Arezoo Fani / KU - EEB
* Natalie Ford / KU EEB
* Kervens Accilien / KU MB
* Justin Van Goor / KU EEB/KBS
* Jessica Pfannenstiel / KU MB
* Haylee Nedblake / KU EEB
* Surbhi Kumawat
* Brandon Fagen / Zanders Lab, Stowers Institute
* Camila G. Meneses / KU EEB
* Kayla Clouse / KU EEB
* Anjali Gupta / KU EEB
* Meridia Jane Bryant / KU EEB
* Vanna Hay / KU Anthropology
* Christian Supsup / KU EEB
* Justin M. Bernstein / KU Center for Genomics
* Sharif Tusuubira / KU EEB
* Nikki Dyanne Realubit / KU EEB
Instructors (Name / department or affiliation)
* Boryana Koseva / Children's Mercy Research Institute
* Caroline Kisielinski / KU Anthropology
Helpers (Name / department or affiliation)
* Amanda Katzer / KU EEB
* Elizabeth Everman / KU MB
* Tami Albin / KU Libraries
* Martina Dalikova / KU EEB
* Joel Swift / KBS
* Jocelyn Colella / KU EEB, Biodiversity Institute

A helpful resource for shell scripting: Software Carpentry Shell lesson, section 6, Shell Scripts: https://swcarpentry.github.io/shell-novice/06-script/index.html

*Data Wrangling and Processing for Genomics (cont.)
1. Overview of steps:
a. fastqc to look at the quality of the data
b. Trimmomatic to trim lower quality reads
c. fastqc again to look at the difference in quality metrics
d. Index the reference genome (only needs to be done once!)
e. Align the trimmed reads to the (indexed) reference genome
f. Convert the alignments from a SAM file to a BAM file
g. Sort the BAM file (samtools was used to do this)
h. Calculate read coverage of positions in the genome
i. Detect SNPs
j. Filter and report SNPs in a VCF
2. Shell Script
a. create a new directory to keep everything tidy and organized
mkdir -p ~/dc_workshop/scripts
cd dc_workshop/scripts/
b. create a new shell script
- command: touch; creates the file, but with no contents
touch read_qc.sh
nano read_qc.sh
set -e
- ensures that if an error occurs, the script will exit instead of continuing on to the next step
cd ~/dc_workshop/data/untrimmed_fastq/
echo "Running FastQC..."
- echo is the bash version of print, so this prints the message to the terminal
fastqc *.fastq*
mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
echo "Saving FastQC results to Results folder..."
mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/
mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/
cd ~/dc_workshop/results/fastqc_untrimmed_reads/
echo "Unzipping FastQC archives..."
for filename in *.zip
do
unzip ${filename}
done
echo "Saving summary..."
cat *_fastqc/summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
- write the script out with ctrl-o, then enter to save it under the script's file name
- exit nano with ctrl-x
- look at the script with: cat read_qc.sh
- run the script with: bash read_qc.sh
c. if you run the same script multiple times, it will usually replace the previous outputs
- unzip will ask if you want to replace files. For this script, type A to replace All for each of the archives.
d. Benefits of having the entire first steps in a shell script:
- documentation of how you did your analysis
e. Check on your finished script & add an echo statement to the end of the script
ls -l ~/dc_workshop/docs
nano read_qc.sh
echo "I am done!"
ctrl-x to exit nano, y, then hit enter to save to the file name
f. Download a script for the entire pipeline:
cd ~/dc_workshop/scripts/
curl -O https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/run_variant_calling.sh
ls
cat run_variant_calling.sh
- What the script does (condensed in the sketch below):
1. set -e, to have the script exit if there is an error instead of continuing through the commands
2. changes directory to dc_workshop/results
3. stores the genome path in a variable
4. indexes the genome with BWA
5. makes some directories, using -p
6. Sets lots of variables and filepaths (in the for loop):
- print which file you are working with
- use basename to extract the sample name
- print the basename to the terminal
- set the full paths to the fastq files
- anticipate where the sam file will be saved and its name
- anticipate where the bam file will be saved and its name
- anticipate where the sorted bam file will be saved and its name
- anticipate where the bcf will be saved and its name
- anticipate where the vcf will be saved and its name
- anticipate where the final variants vcf will be saved and its name
7. Does the work with those variables (in the same for loop as (6)):
- bwa mem is run with the genome and fastq files to make the SAM file
- the SAM is converted to a BAM using samtools
- the BAM file is sorted with samtools
- the BAM file is indexed with samtools so downstream software can work with it much faster
- the raw bcf file is created from the genome and the sorted BAM with bcftools mpileup
- variants are called with bcftools call
- the variants are filtered with vcfutils.pl (a perl script)
*tip: always use curly brackets when referencing a variable you have set, ${variable} (e.g., ${fq1})
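A condensed sketch of the loop those two lists describe, using the paths and commands from this lesson; the downloaded run_variant_calling.sh remains the authoritative version:

# reference genome stored in a variable (step 3)
genome=~/dc_workshop/data/ref_genome/ecoli_rel606.fasta
for fq1 in ~/dc_workshop/data/trimmed_fastq_small/*_1.trim.sub.fastq
do
    # extract the sample name and announce it (step 6)
    base=$(basename ${fq1} _1.trim.sub.fastq)
    echo "working with sample ${base}"
    # anticipate every input and output path up front (step 6)
    fq2=~/dc_workshop/data/trimmed_fastq_small/${base}_2.trim.sub.fastq
    sam=~/dc_workshop/results/sam/${base}.aligned.sam
    bam=~/dc_workshop/results/bam/${base}.aligned.bam
    sorted_bam=~/dc_workshop/results/bam/${base}.aligned.sorted.bam
    raw_bcf=~/dc_workshop/results/bcf/${base}_raw.bcf
    variants=~/dc_workshop/results/vcf/${base}_variants.vcf
    final_variants=~/dc_workshop/results/vcf/${base}_final_variants.vcf
    # align, convert, sort, index, call, and filter (step 7)
    bwa mem ${genome} ${fq1} ${fq2} > ${sam}
    samtools view -S -b ${sam} > ${bam}
    samtools sort -o ${sorted_bam} ${bam}
    samtools index ${sorted_bam}
    bcftools mpileup -O b -o ${raw_bcf} -f ${genome} ${sorted_bam}
    bcftools call --ploidy 1 -m -v -o ${variants} ${raw_bcf}
    vcfutils.pl varFilter ${variants} > ${final_variants}
done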
g. Exercise:
cp run_variant_calling.sh run_variant_calling_on_full_fastq.sh
- Use nano to modify run_variant_calling_on_full_fastq.sh so that it runs on the full fastq files (~/dc_workshop/data/trimmed_fastq) instead of the small fastq files.
- Solution:
1. In the for loop, change the filepath to read: ~/dc_workshop/data/trimmed_fastq/*_1.trim.fastq.gz (unlike the small files, the full files have no ".sub" in their names)
2. Change the suffix passed to basename to _1.trim.fastq.gz
3. Change the filepaths for the variables fq1 & fq2 to ~/dc_workshop/data/trimmed_fastq/${base}_1.trim.fastq.gz and ~/dc_workshop/data/trimmed_fastq/${base}_2.trim.fastq.gz
4. Add echo statements between the commands to make notes for yourself
5. In the bwa mem line, add the flag -t 4, which allows 4 threads to be used
- Add the comment: # -t 4 is used to speed up the alignment by utilizing 4 threads
*BONUS EXERCISE* Annotate your script!
- add echo statements
- use # for any comments; anything after the # is ignored
# This is an example of a correctly formatted comment
- Run the script!
bash run_variant_calling_on_full_fastq.sh

*RNA-seq Analysis Pipeline: Overview
SLIDES: https://github.com/ereverman/DC_Genomics_2023
- Note that these slides are more detailed than those that were shared in the workshop, but the information about the pipeline is included.
1. The pipeline will depend on what your data are and what your question is
a. e.g., quantify how active a gene is
2. For the most part, bioinformatic pipelines start the same way
a. the first step is looking at the quality, and trimming low quality nucleotides from the reads
3. RNA-seq pipelines have reads that came from mature mRNAs
4. Steps:
a. quality control: low quality reads, poly(A) tails
b. alignment to a genome, and quantification
- There are a number of programs that will align, but the principles behind how they work vary. Which one you choose is often field and lab dependent.
- Alignment is different from what we did with genomic data (DNA vs mRNA).
1. mRNA doesn't have introns, so there are fairly large chunks that will not align contiguously to the genome (gapped alignment). BWA does not work for aligning RNA-seq data to a reference genome.
- some software requires a reference genome, other software requires a reference transcriptome. Some tools do not align at all (pseudoalignment, for example) and skip straight to quantification.
c. Look for differences in expression
- All packages (or most!) for this are in R
5. Example packages for a pipeline (Boryana's favorites; a rough sketch of how they chain together follows this list):
a. Quality Control
- trim and filter with fastp (https://github.com/OpenGene/fastp)
- MultiQC can assemble all of the reports from fastp into one (https://multiqc.info/)
b. Alignment to genome and quantification
- Salmon to index the reference transcriptome and to quantify (https://combine-lab.github.io/salmon/)
- Kallisto is an alternative to Salmon (https://pachterlab.github.io/kallisto/manual)
c. Differential Expression Analysis (normalize and run the DE analysis)
- DESeq2 (https://bioconductor.org/packages/devel/bioc/manuals/DESeq2/man/DESeq2.pdf)
- Sleuth is an alternative that is from the developers of Kallisto (https://pachterlab.github.io/sleuth/manual)
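A rough sketch of how those tools might chain together for one paired-end sample. The file names (sample_1.fastq.gz, transcriptome.fa, etc.) are hypothetical, and the flags shown are common usage rather than a recipe from this workshop, so check each tool's documentation before running:

# QC + trimming with fastp (hypothetical paired input files)
fastp -i sample_1.fastq.gz -I sample_2.fastq.gz -o sample_1.trim.fastq.gz -O sample_2.trim.fastq.gz
# collect all QC reports found in the current directory into one report
multiqc .
# index the reference transcriptome once, then quantify this sample
salmon index -t transcriptome.fa -i salmon_index
salmon quant -i salmon_index -l A -1 sample_1.trim.fastq.gz -2 sample_2.trim.fastq.gz -o sample_quant
# the resulting quant files are then imported into R for DESeq2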
*Introduction to R and RStudio for Genomics
Introduction to R and RStudio Reference Page: https://datacarpentry.org/genomics-r-intro/reference.html
Full Lesson Materials: https://datacarpentry.org/genomics-r-intro/
Beginning R Markdown Cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
1. Sign into RStudio
a. go to AmazonInstance:8787 in a browser
username: dcuser
password: data4Carp
2. Create a project
File > New Project > New Directory > New Project
Directory name: dc_genomics_r
Create project as a subdirectory of: home
3. Create an R script
File > New File > R Script
File > Save
File name: genomics_r_basics
4. Basic commands (write them in the script)
a. show the working directory
getwd()
b. leave comments with #
# this command shows the working directory
c. change/set our working directory
setwd(dir = "/home/dcuser/dc_genomics_r")
d. the round command rounds a number
round(3.14)
- ask R what the command round does
?round()
args(round) # outputs the arguments that round takes
5. R basics
a. variables = objects
- create an object: ObjectName <- ObjectValue
a <- 1
- rename/remove objects:
1. avoid whitespace and dashes in object names
2. avoid object names that are similar to program names in your environment
gene_name <- 'tp53'
rm(gene_name) # remove the object
- data types of variables: numeric, character, and logical
- Exercise: Using the mode() command, find the value type of each of the following variables:
chromosome_name <- 'chr02'
od_600_value <- 0.47
chr_position <- '1001701'
spock <- TRUE
pilot <- Earhart
b. Mathematical functions
(1 + (5 ** 0.5))/2
c. Vectors: collections of values of the same type (numeric, character, logical)
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")
mode(snp_genes) # character vector is the output
length(snp_genes) # get the length of the vector
- Manipulating vectors:
# create 3 new vectors
snps <- c("rs53576", "rs1815739", "rs6152", "rs1799971")
snp_chromosomes <- c("3", "11", "X", "6")
snp_positions <- c(8762685, 66560624, 67545785, 154039662)
# retrieve values from a vector
snps[3] # get the 3rd value in the snps vector
snps[1:3] # get the first 3 values in the snps vector
snps[c(1,3,4)] # get the first, third, and fourth values
snps[c(1:3, 4)]
- Add, modify, and delete values in vectors
snp_genes <- c(snp_genes, "CYP1A1", "APOA5") # append to the end
snp_genes[-6] # will remove whatever is at the sixth position
snp_genes[6] <- "APOA5" # add a value in the sixth position
d. Subset values (logical subsetting)
snp_positions[snp_positions > 100000000]
snp_positions > 100000000
e. Other common operations
# look for missing data
is.na(snp_genes)
# find if a specific value is present in a vector using %in%
c("ACTN3", "APOA5") %in% snp_genes
6. Reading in data
a. read.csv function
variants <- read.csv("/home/dcuser/r_data/combined_tidy_vcf.csv")
7. Manipulate the variants dataframe
summary(variants) # summarize each of the variables in the dataframe
subset <- data.frame(variants[, c(1:3, 6)]) # [rows, columns]
str(subset) # structure function
8. Factors: vectors that are specialized for categorical data
alt_alleles <- subset$ALT # pull out the ALT column in the subset dataframe
head(alt_alleles) # look at the top of the object
snps <- c(alt_alleles[alt_alleles == "A"],
alt_alleles[alt_alleles == "T"],
alt_alleles[alt_alleles == "C"],
alt_alleles[alt_alleles == "G"])
factor_snps <- factor(snps) # make snps a factor object instead of a character object
summary(factor_snps) # gives raw counts of the different factor levels
plot(factor_snps)
ordered_factor_snps <- factor(factor_snps, levels = names(sort(table(factor_snps)))) # order the factor levels
plot(ordered_factor_snps)
9. Further subset the variants dataframe
# only want to look at one sample:
SRR2584863_variants <- variants[variants$sample_id == "SRR2584863", ]
dim(SRR2584863_variants) # get the dimensions of the subset dataframe
summary(SRR2584863_variants)
10. Make R read data as a particular type
typeof(snp_chromosomes) # ask R what type of data this vector holds
snp_chromosomes_2 <- c(3, 11, "X", 6)
typeof(snp_chromosomes_2)
# force the data to be numeric
snp_chromosomes_2 <- as.numeric(snp_chromosomes_2)
snp_chromosomes_2 # as.numeric forces the "X" to become an NA
typeof(snp_positions) # calls this a double
snp_positions_2 <- c("8762685", "66560624", "67545785", "154039662")
typeof(snp_positions_2) # now a character vector
snp_positions_2 <- as.numeric(snp_positions_2)
typeof(snp_positions_2) # back to a double
# where forcing things goes terribly wrong
as.numeric(variants$sample_id) # the entire column turns into NAs
11. Write data out to a new csv file
write.csv(SRR2584863_variants, file = "../dc_workshop/data/SRR2584863_variants.csv")
12. Import data from Excel
File > Import Dataset > From Excel > Choose File
13. Install packages
install.packages("BiocManager") # for the Bioconductor repository (https://www.bioconductor.org/)
BiocManager::version()
BiocManager::install("vcfR")
install.packages("dplyr") # package that provides ways to manipulate data; part of the tidyverse (https://www.tidyverse.org/)
install.packages("readr")
library("dplyr")
library("readr")
14. Use dplyr and readr to modify a data frame
select(variants, sample_id, REF, ALT, DP) # select columns out of a data frame
select(variants, -CHROM) # select all the columns except CHROM
filter(variants, sample_id == "SRR2584863") # pull out rows from the data frame
# chain rows and columns together with the pipe operator (%>%), which comes from the magrittr package loaded via dplyr
variants %>%
filter(sample_id == "SRR2584863") %>%
select(REF, ALT, DP)
# assign the result an object name
SRR2584863_variants <- variants %>%
filter(sample_id == "SRR2584863") %>%
select(REF, ALT, DP)
SRR2584863_variants %>% slice(1:6) # show the first 6 rows
SRR2584863_variants %>% slice(10:25) # show rows 10-25
# compute a probability from the Phred-scaled QUAL score and store it in a new column
variants %>% mutate(POLPROB = 1 - (10 ^ -(QUAL/10)))
15. ggplot for graphing
a. lets you create complex plots because everything is in a separate layer that stacks on top of the others
install.packages("tidyverse")
library("tidyverse")
install.packages("ggplot2")
library(ggplot2)
variants = read.csv("https://raw.githubusercontent.com/naupaka/vcfr-for-data-carpentry-draft/main/output/combined_tidy_vcf.csv")
glimpse(variants) # take a sneak peek at the data
# first create a plot template, then tell ggplot what type of representation to use for the data - in this case, a scatterplot:
ggplot(data = variants, aes(x = POS, y = DP)) +
geom_point()
coverage_plot <- ggplot(data = variants, aes(x = POS, y = DP))
coverage_plot + geom_point(alpha = 0.5, color = 'blue')
ggplot(data = variants, aes(x = POS, y = DP, color = sample_id)) +
geom_point(alpha = 0.5)

We welcome your feedback for today's workshop: https://forms.gle/YGgkM6r5Y5y2WBzDA