2020-10-21-nord

Welcome to The Carpentries Etherpad!

This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).

Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

----------------------------------------------------------------------------
Participants
Endre Sebestyén - Semmelweis University

Lisa Tietze - PhD candidate, NTNU (analyze own data set)
Eirini Tsirvouli - PhD @ NTNU (analyze own data, maybe apply to be trained as a Data Carpentry trainer)
Iva Pitelkova, Head Engineer at Tromsø Museum, UiT (genome skimming on plant chloroplasts)
May Khider - PhD candidate @ NTNU dept. Biotechnology (would like to be more comfortable with programming software and apply knowledge from workshop to my own project)
Hannah Schweitzer - UiT The Arctic University of Norway
Jonathan Bramsiepe UiO - Plant RNAseq and DNAseq datasets
Katrine Bjerkan - UiO- Analysis of own data and future work- both plant RNAseq and DNAseq
Guy Hindley - PhD Candidate, UiO - NORMENT - Psychiatric genetics department. Feel more comfortable handling genomic data (psych GWAS sumstats predominantly)
Abel Gizaw - Postdoc at UiO
Abush Zinaw- Visiting PhD student at UiO
Didac Vidal Pineiro - Researcher at UiO
Amit Sharma, Researcher, NTNU
Bisa: Researcher at FBA, Nord University, Bodø
Alexandra Jonsson - PhD at UiO. Single cell transcriptomics in cod
Nur : researcher at Norwegian Institute of Public Health
Prabin, NMBU
Juline, Nord
Mingyi, UiO, genome methylation analysis

Data Carpentry Genomics Workshop

https://datacarpentry.org/genomics-workshop/setup.html

Part 1: Project Organization and Management for Genomics
https://datacarpentry.org/organization-genomics/

Sequencing experiments done: ++++++_++
Sent RNA with sample info to sequencing facilities and got back fastq files; some basic bioinformatic analysis was done at the sequencing facility
Plant DNA for barcoding: we needed to include the plant taxon ID and the concentration of the samples together with our samples. We now do in-house sequencing of aDNA amplicon libraries and also genome skimming on plant chloroplasts
I have sent DNA fragments and plasmids for sequencing, never genomes
I have done 16S/18S targeted sequencing in house on our own MiSeq, and I have sent out metagenomics sequencing to a NovaSeq. Sending out samples for metagenomics requires concentrated DNA; I have analyzed all metagenomes in house. I have never done eukaryotic genomes.
Metadata: how and when samples were prepared for sequencing, which conditions samples were grown in (if applicable) or where they were taken from/isolated, what type of data (RNA/DNA, which organism)

Planning for NGS Projects
https://datacarpentry.org/organization-genomics/02-project-planning/index.html

Formatting problems in spreadsheets and how to deal with them
https://datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/index.html

A Quick Guide to Organizing Computational Biology Projects
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424

A bigger collection of papers, and tutorials on how to start with bioinformatics projects, what to learn, etc
https://github.com/esebesty/bioinf_starter_pack

NCBI SRA
https://www.ncbi.nlm.nih.gov/sra/

EBI ENA
https://www.ebi.ac.uk/ena/browser/home

Introduction to the Command Line for Genomics

Instructor
Luca Di Stasio

Helpers
Tadeu
Kari
Ali

Participants

Eirini Tsirvouli
Abel Gizaw
Lisa Tietze
Jonathan
Abush Zinaw
Fleur
Katrine Bjerkan
Bisa
amit
Hannah
May
Prabin
Nur
Guy
Iva -

Windows +++++++++
Mac++
Linux +

Both windows and linux (two computers) OK+
mac+

How often do you use the shell?
never used +++++
once a year+++++
once a month+
once a week +

every day++
never

Amazon instances:

INSTRUCTIONS
substitute the links below for the ec2.... links in the example

username: dcuser
pw: data4Carp

shell access
$ ssh dcuser@ec2-12-345-678-90.compute-1.amazonaws.com

RStudio server
visit ec2-12-345-678-90.compute-1.amazonaws.com:8787 in your browser, use same username/pw

Same user/pw for all instances.

Learner Instances
ec2-3-237-39-84.compute-1.amazonaws.com - Eirini Tsirvouli
ec2-18-210-23-240.compute-1.amazonaws.com - Abel Gizaw
ec2-35-175-223-241.compute-1.amazonaws.com - Lisa Tietze
ec2-35-172-203-120.compute-1.amazonaws.com - Jonathan
ec2-3-236-114-246.compute-1.amazonaws.com - Abush Zinaw
ec2-3-226-122-33.compute-1.amazonaws.com - Fleur
ec2-3-236-56-11.compute-1.amazonaws.com - Katrine Bjerkan
ec2-18-204-56-16.compute-1.amazonaws.com - Bisa
ec2-3-236-118-202.compute-1.amazonaws.com - amit
ec2-3-227-211-96.compute-1.amazonaws.com - Hannah
ec2-3-236-85-209.compute-1.amazonaws.com - May
ec2-3-236-85-109.compute-1.amazonaws.com - Prabin
ec2-3-237-13-115.compute-1.amazonaws.com - Nur
ec2-35-175-120-232.compute-1.amazonaws.com - Guy
ec2-3-226-244-206.compute-1.amazonaws.com - Adrian
ec2-3-235-75-241.compute-1.amazonaws.com - Alexandra
ec2-3-238-112-102.compute-1.amazonaws.com - Iva
ec2-35-172-230-97.compute-1.amazonaws.com
ec2-3-238-124-61.compute-1.amazonaws.com
ec2-3-235-174-82.compute-1.amazonaws.com - Benedicte
ec2-3-85-241-133.compute-1.amazonaws.com - Elisa
ec2-34-229-132-87.compute-1.amazonaws.com
ec2-3-238-84-202.compute-1.amazonaws.com
ec2-3-236-252-88.compute-1.amazonaws.com
ec2-18-206-16-63.compute-1.amazonaws.com - Mingyi
ec2-3-215-22-202.compute-1.amazonaws.com
ec2-3-236-249-145.compute-1.amazonaws.com - Ali
ec2-100-26-176-28.compute-1.amazonaws.com - Endre
ec2-3-236-45-0.compute-1.amazonaws.com - Kari
ec2-3-234-208-96.compute-1.amazonaws.com - Tadeu

/home/dcuser/shell_data/sra_metadata
we want to go to dcuser
how can we do that?

..

cd ../..
cd .. and then cd .. again (run cd .. twice; note the space between cd and ..)

cd /home/dcuser

$ cd ..
$ pwd
/home/dcuser/shell_data
$ cd ..
$ pwd
/home/dcuser
$
cd ../../

Guy$ pwd
/home/dcuser/shell_data
Guy$ cd /home/dcuser
Guy$ pwd
/home/dcuser
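
The answers above reach the same directory in two ways: by climbing with a relative path (`..`) or by jumping with an absolute path. A minimal sketch of both, assuming a throwaway sandbox directory (created here with `mktemp`, not part of the workshop instance) that mirrors the `/home/dcuser/shell_data/sra_metadata` layout:

```shell
# Hypothetical sandbox that mirrors the directory layout from the exercise.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/home/dcuser/shell_data/sra_metadata"

# Relative path: each ".." climbs one directory level.
cd "$sandbox/home/dcuser/shell_data/sra_metadata"
cd ../..
relative_result=$(pwd)

# Absolute path: works the same from any starting directory.
cd "$sandbox/home/dcuser/shell_data/sra_metadata"
cd "$sandbox/home/dcuser"
absolute_result=$(pwd)

echo "$relative_result"
echo "$absolute_result"
```

Both `echo` lines should print the same path ending in `home/dcuser`.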

what's the difference between > and >> ?
ls > myFolderContent.txt
ls -l >> myFolderContent.txt

> creates the file (or overwrites it if it already exists) and writes the list into it, while >> appends the new content to the end of the file.
I am not sure I see the difference
ls -l
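
For anyone who does not see the difference yet, here is a small sketch that may help. It assumes a scratch directory and a hypothetical file name `demo.txt`, and counts lines after each step:

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

echo "first" > demo.txt      # ">" creates demo.txt
echo "second" > demo.txt     # ">" again: the old content is replaced
overwrite_lines=$(wc -l < demo.txt | tr -d ' ')

echo "third" >> demo.txt     # ">>" keeps the old content and adds a line
append_lines=$(wc -l < demo.txt | tr -d ' ')

cat demo.txt
echo "after >: $overwrite_lines line, after >>: $append_lines lines"
```

After the two `>` commands the file holds only "second"; after the `>>` command it holds "second" and "third".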

Feedback
What went well
What should be improved

Introduction to the Command Line for Genomics

Instructor
Luca Di Stasio

Helpers
Tadeu
Kari
Ali

Participants

Eirini
Benedicte Garmann-Johnsen
Jonathan
Katrine
May
Bisa
Lisa
Fleur
Abel Gizaw
Guy
Alexandra
Abush
amit
Hannah
Iva
Adrian

mv - what does it do?
rename firstScript.sh to test.sh
mv firstScript.sh test.sh
rename, move file in to different directory... mv firstScript.sh test.sh
mv myfirstscript.sh test.sh
mv myfirstScript.sh test.sh
$mv ~/scripts/myfirstScript.sh ~/scripts/test.sh - move and change name of file

$ mv myfirstScript.sh test.sh
$ ls
firstExample.txt   myFolderContent.txt mytext.text
firstExample.txtĈ myHistory.txt        test.sh
mv is used to move or rename a file or folder: a new name in the same directory renames it, a different directory moves it.
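
A short sketch of the two uses of `mv`, run in a scratch directory. The `archive` folder is a hypothetical name used only for illustration:

```shell
# Scratch directory with a scripts folder and one empty script.
workdir=$(mktemp -d)
cd "$workdir"
mkdir scripts
touch scripts/myfirstScript.sh

# Same directory, new name: this is a rename.
mv scripts/myfirstScript.sh scripts/test.sh

# Different directory, same name: this is a move.
mkdir archive
mv scripts/test.sh archive/

ls archive
```

Unlike `cp`, `mv` does not leave a copy behind, so `scripts` is empty at the end.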

create a folder named "scriptsBackup" in dcuser
copy (cp) test.sh into scriptsBackup
$ mkdir scriptsBackup
$ cp test.sh scriptsBackup
ls ~, then mkdir ~/scriptsBackup and cp test.sh ~/scriptsBackup/
mkdir scriptsBackup
cp test.sh scriptsBackup
mkdir scriptsBackup to create the scriptsBackup directory, then cp scripts/test.sh scriptsBackup
mkdir ScriptsBackup
cp ~/scripts/test.sh ~/ScriptsBackup/

$ cd ..
$ pwd
/home/dcuser
$ ls
R r_data scripts shell_data
$ mkdir scriptsBackup
$ ls
$ cp ~/scripts/test.sh ~/scriptsBackup
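
The backup exercise can be sketched as follows, assuming a scratch directory standing in for the home folder and a one-line placeholder `test.sh`:

```shell
# Scratch directory standing in for /home/dcuser.
workdir=$(mktemp -d)
cd "$workdir"
mkdir scripts
echo 'echo "hello"' > scripts/test.sh   # placeholder script content

# -p avoids an error if scriptsBackup already exists.
mkdir -p scriptsBackup

# cp leaves the original in place and writes a copy into the folder.
cp scripts/test.sh scriptsBackup/

ls scripts scriptsBackup
```

Both folders should list `test.sh`, because `cp` keeps the original.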
for each file in ../shell_data/untrimmed_fastq/ read the first 2 lines and save them to a file called seq_info.txt
for filename in *.fastq; do head -n 2 ${filename} >> seq_info.txt; done

$ for filename in *.fastq
> do
> head -n 2 ${filename} >> seq_info.txt
> done
$ cat seq_info.txt

for filename in *.fastq; do head -n 2 ${filename}; done > seq_info.txt

for filename in *.fastq; do head -n 2 ${filename} >> seq_info.txt; done; ls
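
The answers above redirect in two places: `>>` inside the loop, or `>` once after `done`. A sketch of how they differ when the loop is run twice, assuming two tiny made-up FASTQ files (`sampleA.fastq`, `sampleB.fastq`) in a scratch directory:

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two made-up single-read FASTQ files (placeholder data).
printf '@read1\nACGT\n+\nIIII\n' > sampleA.fastq
printf '@read2\nTTGA\n+\nIIII\n' > sampleB.fastq

# Variant 1: redirect once, after the loop. Rerunning replaces the file.
for filename in *.fastq; do head -n 2 "${filename}"; done > seq_info.txt
for filename in *.fastq; do head -n 2 "${filename}"; done > seq_info.txt
once_lines=$(wc -l < seq_info.txt | tr -d ' ')

# Variant 2: append inside the loop. Rerunning adds the lines again.
for filename in *.fastq; do head -n 2 "${filename}" >> seq_info_append.txt; done
for filename in *.fastq; do head -n 2 "${filename}" >> seq_info_append.txt; done
append_lines=$(wc -l < seq_info_append.txt | tr -d ' ')

echo "redirect after done: $once_lines lines"
echo "append inside loop:  $append_lines lines"
```

With `>>` it is worth deleting the old output file before rerunning, otherwise the results accumulate.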

create a script "dataAnalysis.sh" in the folder scripts
for each file in ../shell_data/untrimmed_fastq/ read the first 2 lines and save them to a file called seq_info.txt, the file seq_info.txt is in the same ../shell_data/untrimmed_fastq/ folder
nano dataAnalysis.sh (for filename in *.fastq; do head -n 2 ${filename} >> seq_info.txt; done)
chmod +x dataAnalysis.sh
./dataAnalysis.sh
cd ~/shell_data/untrimmed_fastq
for filename in *.fastq
do
head -n 2 ${filename} >> seq_info.txt
done
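
A possible end-to-end version of the steps above (write the script, make it executable, run it from the data folder). This is only a sketch: it writes the script with a here-document instead of nano so it can run unattended, and uses a scratch directory with made-up FASTQ files:

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/scripts" "$workdir/untrimmed_fastq"

# Made-up single-read FASTQ files (placeholder data).
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/untrimmed_fastq/sampleA.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$workdir/untrimmed_fastq/sampleB.fastq"

# Write the script (in the workshop this was typed in nano).
cat > "$workdir/scripts/dataAnalysis.sh" <<'EOF'
#!/usr/bin/env bash
# Save the first two lines of every FASTQ file in the current directory.
for filename in *.fastq
do
    head -n 2 "${filename}" >> seq_info.txt
done
EOF

# Make it executable, then run it from the folder that holds the data.
chmod +x "$workdir/scripts/dataAnalysis.sh"
cd "$workdir/untrimmed_fastq"
"$workdir/scripts/dataAnalysis.sh"

cat seq_info.txt
```

Because the script uses the relative pattern `*.fastq`, it only finds files when it is started from the folder that contains them.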

/home/dcuser/shell_data/untrimmed_fastq
$ for filename in *.fastq; do head -n 2 $filename >> seq_info.text; done

$ cat seq_info.text
@SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
TATTCTGCCATAATGAAATTCGCCACTTGTTAGTGT
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
$ ls
seq_info.text SRR097977.fastq SRR098026.fastq

/home/dcuser
$ cd scripts
$ ls
dataAnalysis.sh   firstExample.txtĈ    myHistory.txt scriptsBackup
firstExample.txt myFolderContent.txt mytext.text    test.sh

How can I now move the seq_info.text to dataAnalysis.sh???

$ nano dataAnalysis.sh
$ mv dataAnalysis.sh ~/scripts

nano dataAnalysis.sh (for filename in *.fastq; do head -n 2 ${filename} >> seq_info.txt; done)
mv ~/scripts/dataAnalysis.sh ~/shell_data/untrimmed_fastq/

nano dataAnalysis.sh (for filename in ~/shell_data/untrimmed_fastq/*.fastq; do head -n 2 ${filename} >> ~/shell_data/untrimmed_fastq/seq_info.txt; done)

seq_info.txt should be saved in the folder ~/results
write a single line of code that creates the folder ~/results, executes the script, and verifies the result of the script

mkdir ~/results && bash ~/scripts/dataAnalysis.sh && cat ~/results/seq_info.txt

nano ~/scripts/dataAnalysis.sh (for filename in ~/shell_data/untrimmed_fastq/*.fastq; do head -n 2 ${filename} >>~/results/seq_info.txt; done)
mkdir ~/results && ~/scripts/dataAnalysis.sh && cat ~/results/seq_info.txt
mkdir results && cd results && bash ~/scripts/dataAnalysis.sh && ls && cat seq_info.txt

Guy$ nano ~/scripts/dataAnalysis.sh
cd ~/shell_data/untrimmed_fastq
for filename in *.fastq
do
head -n 2 ${filename} >> ~/results/seq_info.txt
done
Guy$ mkdir ~/results && ~/scripts/dataAnalysis.sh && cat ~/results/seq_info.txt
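
In the one-liners above, `&&` runs the next command only if the previous one succeeded. A sketch of the whole chain in a scratch directory follows. Passing the input folder and output file as arguments is a hypothetical variation, not what was done in the workshop, and the FASTQ files are made up:

```shell
base=$(mktemp -d)
mkdir -p "$base/scripts" "$base/untrimmed_fastq"

# Made-up single-read FASTQ files (placeholder data).
printf '@read1\nACGT\n+\nIIII\n' > "$base/untrimmed_fastq/sampleA.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$base/untrimmed_fastq/sampleB.fastq"

cat > "$base/scripts/dataAnalysis.sh" <<'EOF'
#!/usr/bin/env bash
# Usage (hypothetical): dataAnalysis.sh INPUT_DIR OUTPUT_FILE
input_dir=$1
output_file=$2
for filename in "$input_dir"/*.fastq
do
    head -n 2 "${filename}" >> "$output_file"
done
EOF
chmod +x "$base/scripts/dataAnalysis.sh"

# One line: create the folder, run the script, check the result.
mkdir -p "$base/results" && "$base/scripts/dataAnalysis.sh" "$base/untrimmed_fastq" "$base/results/seq_info.txt" && cat "$base/results/seq_info.txt"
```

Plain `mkdir` fails if the folder already exists, which stops the `&&` chain; `mkdir -p` lets the line be rerun.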

ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt

Data Wrangling and Processing for Genomics
https://datacarpentry.org/wrangling-genomics/
https://datacarpentry.org/genomics-workshop/setup.html

Assessing Read Quality
https://datacarpentry.org/wrangling-genomics/02-quality-control/index.html

Sequencing read quality check

fastqc code and documentation
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Guy$ tail -n 4 SRR2584863_1.fastq
CTGCAATACCACGCTGATCTTTCACATGATGTAAGAAAAGTGGGATCAGCAAACCGGGTGCTGCTGTGGCTAGTTGCAGCAAACCATGCAGTGAACCCGCCTGTGCTTCGCTATAGCCGTGACTGATGAGGATCGCCGGAAGCCAGCCAA
+
CCCFFFFFHHHHGJJJJJJJJJHGIJJJIJJJJIJJJJIIIIJJJJJJJJJJJJJIIJJJHHHHHFFFFFEEEEEDDDDDDDDDDDDDDDDDCDEDDBDBDDBDDDDDDDDDBDEEDDDD7@BDDDDDD>AA>?B?<@BDD@BDC?BDA?

tail -n 4 SRR2584863_1.fastq

tail -n 4 *863_1.fastq

$cat ~/dc_workshop/docs/fastqc_summaries.txt

$ cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
$ ls
$ cat ~/dc_workshop/docs/fastqc_summaries.txt

grep "FAIL" fastqc_summaries.txt | sort

cat fastqc_summaries.txt | sort
How can you sort so that it kicks everything out of the list that has a PASS?
Check "man grep". grep has a -v option that returns every line that does NOT match a pattern; you can then pipe the result to sort
for example: grep -v PASS fastqc_summaries.txt | sort
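
A small sketch of the filter on a made-up summary file. The four lines below are placeholder content in the tab-separated style of FastQC `summary.txt` files, not real workshop results:

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Placeholder summary lines: status, module name, file name.
printf 'PASS\tBasic Statistics\tsample_1.fastq\n' > fastqc_summaries.txt
printf 'FAIL\tPer base sequence quality\tsample_1.fastq\n' >> fastqc_summaries.txt
printf 'WARN\tSequence Length Distribution\tsample_1.fastq\n' >> fastqc_summaries.txt
printf 'PASS\tAdapter Content\tsample_1.fastq\n' >> fastqc_summaries.txt

# -v inverts the match: keep every line that does NOT contain PASS, then sort.
not_pass=$(grep -v PASS fastqc_summaries.txt | sort)
echo "$not_pass"

# -c counts matching lines instead of printing them.
fail_count=$(grep -c FAIL fastqc_summaries.txt)
echo "FAIL lines: $fail_count"
```

Only the FAIL and WARN lines remain, with FAIL first after sorting.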

$cat ~/dc_workshop/docs/fastqc_summaries.txt

Trimming and Filtering
https://datacarpentry.org/wrangling-genomics/03-trimming/index.html

Data Wrangling and Processing for Genomics - part 2
https://datacarpentry.org/wrangling-genomics/

Instructor
Endre Sebestyén

Helpers
Tadeu
Kari
Ali

Participants
Lisa Tietze
Hannah
May
Jonathan
Bisa
Berihun
Abel Gizaw
Benedicte
Fleur
Iva
Abush
amit
Adrian
Alexandra
Katrine
Eirini
Nur

Variant Calling Workflow
https://datacarpentry.org/wrangling-genomics/04-variant_calling/index.html

BWA documentation
http://bio-bwa.sourceforge.net/bwa.shtml

The SAM/BAM file format specification
https://samtools.github.io/hts-specs/SAMv1.pdf

samtools documentation
http://www.htslib.org/doc/samtools.html

bcftools documentation
http://www.htslib.org/doc/bcftools.html

What is the real name of the reference genome?

head data/ref_genome/ecoli_rel606.fasta
>CP000819.1 Escherichia coli B str. REL606, complete genome

CP000819.1 Escherichia coli B str. REL606, complete genome
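
The name comes from the FASTA header, which is the line starting with `>`. One way to pull it out is sketched below on a tiny stand-in file: the header is the one shown above, while the sequence lines are placeholders rather than the real genome:

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in FASTA: real header from the notes, placeholder sequence lines.
printf '>CP000819.1 Escherichia coli B str. REL606, complete genome\n' > demo.fasta
printf 'ACGTACGTACGT\nACGTACGTACGT\n' >> demo.fasta

# Header lines start with ">"; -c counts them (one per sequence).
header=$(grep "^>" demo.fasta)
sequence_count=$(grep -c "^>" demo.fasta)

# Strip the leading ">" to get the name itself.
name=${header#>}
echo "$name"
echo "sequences in file: $sequence_count"
```

Using `grep "^>"` rather than `head` also works when a file holds more than one sequence.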

How many variants are there in the vcf file?
Use grep and wc
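
One way to follow the hint is sketched below on a made-up VCF. The file name `demo.vcf` and its three records are placeholders; on the workshop instance you would point the command at your own VCF file instead. Header lines in a VCF start with `#`, so everything else is a variant record:

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Made-up VCF: two meta lines, one column header, three placeholder records.
printf '##fileformat=VCFv4.2\n##source=placeholder\n' > demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> demo.vcf
printf 'CP000819.1\t100\t.\tT\tG\t50\t.\t.\n' >> demo.vcf
printf 'CP000819.1\t200\t.\tG\tT\t50\t.\t.\n' >> demo.vcf
printf 'CP000819.1\t300\t.\tA\tC\t50\t.\t.\n' >> demo.vcf

# Drop every line starting with "#", then count what is left.
variant_count=$(grep -v "^#" demo.vcf | wc -l | tr -d ' ')
echo "variants: $variant_count"
```

`grep -vc "^#" demo.vcf` gives the same count without `wc`.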

IGV download
http://software.broadinstitute.org/software/igv/download

IGV setup done
Endre
Lisa
Kari
Iva
Tadeu
Abel
Jonathan
Katrine
May
Fleur

IGV files to download
https://www.dropbox.com/sh/hhmh6j4b212b5dd/AADIfAdTsfo5rs5I_nWeuM1pa?dl=0

Intro to Cloud computing:
https://datacarpentry.org/cloud-genomics/03-verifying-instance/index.html

Introduction to Cloud Computing for Genomics
https://datacarpentry.org/cloud-genomics/

Post workshop survey
https://carpentries.typeform.com/to/UgVdRQ?slug=2020-10-21-nord-online