Welcome to The Carpentries Etherpad!
This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org).
Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/
----------------------------------------------------------------------------
Participants
- Endre Sebestyén - Semmelweis University
- Lisa Tietze - PhD candidate, NTNU (analyze own data set)
- Eirini Tsirvouli - PhD @ NTNU (analyze own data, maybe apply to be trained as a Data Carpentry trainer)
- Iva Pitelkova, Head Engineer at Tromsø Museum, UiT (genome skimming on plant chloroplasts)
- May Khider - PhD candidate @ NTNU dept. Biotechnology (would like to be more comfortable with programming software and apply knowledge from workshop to my own project)
- Hannah Schweitzer - UiT The Arctic University of Norway
- Jonathan Bramsiepe UiO - Plant RNAseq and DNAseq datasets
- Katrine Bjerkan - UiO - Analysis of own data and future work, both plant RNAseq and DNAseq
- Guy Hindley - PhD Candidate, UiO - NORMENT - Psychiatric genetics department. Feel more comfortable handling genomic data (psych GWAS sumstats predominantly)
- Abel Gizaw - Postdoc at UiO
- Abush Zinaw - Visiting PhD student at UiO
- Didac Vidal Pineiro - Researcher at UiO
- Amit Sharma, Researcher, NTNU
- Bisa: Researcher at FBA, Nord University, Bodø
- Alexandra Jonsson - PhD at UiO. Single cell transcriptomics in cod
- Nur : researcher at Norwegian Institute of Public Health
- Prabin, NMBU
- Juline, Nord
- Mingyi, UiO, genome methylation analysis
Data Carpentry Genomics Workshop
https://datacarpentry.org/genomics-workshop/setup.html
Part 1: Project Organization and Management for Genomics
https://datacarpentry.org/organization-genomics/
Sequencing experiments done: ++++++_++
Sent RNA with sample info to sequencing facilities and got back fastq files; some basic bioinformatic analysis was done at the facility
Plant DNA for barcoding: we needed to include the plant taxon ID together with our samples, plus the concentration of the samples. We now do in-house sequencing of aDNA amplicon libraries and also genome skimming on plant chloroplasts
I have sent DNA fragments and plasmids for sequencing, never genomes
I have done 16S/18S targeted sequencing in house on our own MiSeq, and I have sent out metagenomics sequencing to a NovaSeq. For sending out metagenomics you need concentrated DNA; I have analyzed all metagenomes in house. I have never done eukaryotic genomes.
Metadata: how samples were prepared for sequencing and when, which conditions samples were grown in (if applicable) or where they were taken from/isolated, what type of data (RNA/DNA, which organism)
Planning for NGS Projects
https://datacarpentry.org/organization-genomics/02-project-planning/index.html
Formatting problems in spreadsheets and how to deal with them
https://datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/index.html
A Quick Guide to Organizing Computational Biology Projects
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424
A bigger collection of papers, and tutorials on how to start with bioinformatics projects, what to learn, etc
https://github.com/esebesty/bioinf_starter_pack
NCBI SRA
https://www.ncbi.nlm.nih.gov/sra/
EBI ENA
https://www.ebi.ac.uk/ena/browser/home
Introduction to the Command Line for Genomics
Instructor
Luca Di Stasio
Helpers
Tadeu
Kari
Ali
Participants
- Eirini Tsirvouli
- Abel Gizaw
- Lisa Tietze
- Jonathan
- Abush Zinaw
- Fleur
- Katrine Bjerkan
- Bisa
- amit
- Hannah
- May
- Prabin
- Nur
- Guy
- Iva
Windows +++++++++
Mac++
Linux +
Both windows and linux (two computers) OK+
mac+
How often do you use the shell?
never used +++++
once a year+++++
once a month+
once a week +
every day++
never
Amazon instances:
INSTRUCTIONS
substitute the links below for the ec2.... links in the example
username: dcuser
pw: data4Carp
shell access
$ ssh dcuser@ec2-12-345-678-90.compute-1.amazonaws.com
RStudio server
visit ec2-12-345-678-90.compute-1.amazonaws.com:8787 in your browser, use same username/pw
Same user/pw for all instances.
Learner Instances
ec2-3-237-39-84.compute-1.amazonaws.com - Eirini Tsirvouli
ec2-18-210-23-240.compute-1.amazonaws.com - Abel Gizaw
ec2-35-175-223-241.compute-1.amazonaws.com - Lisa Tietze
ec2-35-172-203-120.compute-1.amazonaws.com - Jonathan
ec2-3-236-114-246.compute-1.amazonaws.com - Abush Zinaw
ec2-3-226-122-33.compute-1.amazonaws.com - Fleur
ec2-3-236-56-11.compute-1.amazonaws.com - Katrine Bjerkan
ec2-18-204-56-16.compute-1.amazonaws.com - Bisa
ec2-3-236-118-202.compute-1.amazonaws.com - amit
ec2-3-227-211-96.compute-1.amazonaws.com - Hannah
ec2-3-236-85-209.compute-1.amazonaws.com - May
ec2-3-236-85-109.compute-1.amazonaws.com - Prabin
ec2-3-237-13-115.compute-1.amazonaws.com - Nur
ec2-35-175-120-232.compute-1.amazonaws.com - Guy
ec2-3-226-244-206.compute-1.amazonaws.com - Adrian
ec2-3-235-75-241.compute-1.amazonaws.com - Alexandra
ec2-3-238-112-102.compute-1.amazonaws.com - Iva
ec2-35-172-230-97.compute-1.amazonaws.com
ec2-3-238-124-61.compute-1.amazonaws.com
ec2-3-235-174-82.compute-1.amazonaws.com - Benedicte
ec2-3-85-241-133.compute-1.amazonaws.com - Elisa
ec2-34-229-132-87.compute-1.amazonaws.com
ec2-3-238-84-202.compute-1.amazonaws.com
ec2-3-236-252-88.compute-1.amazonaws.com
ec2-18-206-16-63.compute-1.amazonaws.com - Mingyi
ec2-3-215-22-202.compute-1.amazonaws.com
ec2-3-236-249-145.compute-1.amazonaws.com - Ali
ec2-100-26-176-28.compute-1.amazonaws.com - Endre
ec2-3-236-45-0.compute-1.amazonaws.com - Kari
ec2-3-234-208-96.compute-1.amazonaws.com - Tadeu
/home/dcuser/shell_data/sra_metadata
we want to go to dcuser
how can we do that?
..
cd ../..
cd .. then cd .. again (run cd .. twice; note the space between cd and ..)
cd /home/dcuser
$ cd ..
$ pwd
/home/dcuser/shell_data
$ cd ..
$ pwd
/home/dcuser
$
cd ../../
Guy$ pwd
/home/dcuser/shell_data
Guy$ cd /home/dcuser
Guy$ pwd
/home/dcuser
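The answers above fall into two groups: relative moves (cd .. twice, or cd ../..) and an absolute path (cd /home/dcuser). A minimal sketch that reproduces both in a throwaway directory; demo_home and the tree under it are made up for illustration and only mimic the layout on the instance.

```shell
# Build a throwaway tree that mimics dcuser/shell_data/sra_metadata.
# demo_home is a hypothetical stand-in for /home/dcuser.
# (cd ... && pwd -P) resolves symlinks so the paths below compare cleanly.
demo_home="$(cd "$(mktemp -d)" && pwd -P)"
mkdir -p "$demo_home/shell_data/sra_metadata"

# Relative: go up two levels in one command.
cd "$demo_home/shell_data/sra_metadata"
cd ../..
after_relative="$(pwd -P)"

# Absolute: jump straight to the target, wherever we are.
cd "$demo_home/shell_data/sra_metadata"
cd "$demo_home"
after_absolute="$(pwd -P)"

echo "$after_relative"
echo "$after_absolute"
```

Both echo lines print the same directory. The relative form depends on where you start; the absolute form does not.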
what's the difference between > and >> ?
ls > myFolderContent.txt
ls -l >> myFolderContent.txt
> creates the file (or overwrites it if it already exists) and writes the listing into it, while >> appends new content to the end of the file.
I am not sure I see the difference
ls -l
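A small sketch that may make the difference visible. It uses echo instead of ls so the file content is predictable; demo.txt is a made-up file name in a throwaway directory.

```shell
# Work in a throwaway directory so no real files are touched.
cd "$(mktemp -d)"

echo "first"  > demo.txt    # > creates demo.txt (or empties it first)
echo "second" > demo.txt    # > again: "first" is overwritten
lines_after_overwrite=$(wc -l < demo.txt)

echo "third" >> demo.txt    # >> appends: "second" is kept
lines_after_append=$(wc -l < demo.txt)

cat demo.txt
```

After the two > commands the file holds one line; after the >> it holds two ("second" and "third").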
Feedback
What went well
What should be improved
Introduction to the Command Line for Genomics
Instructor
Luca Di Stasio
Helpers
Tadeu
Kari
Ali
Participants
- Eirini
- Benedicte Garmann-Johnsen
- Jonathan
- Katrine
- May
- Bisa
- Lisa
- Fleur
- Abel Gizaw
- Guy
- Alexandra
- Abush
- amit
- Hannah
- Iva
- Adrian
mv - what does it do?
rename firstScript.sh to test.sh
mv firstScript.sh test.sh
rename, move file in to different directory... mv firstScript.sh test.sh
mv myfirstscript.sh test.sh
mv myfirstScript.sh test.sh
$ mv ~/scripts/myfirstScript.sh ~/scripts/test.sh - move and change name of file
$ mv myfirstScript.sh test.sh
$ ls
firstExample.txt myFolderContent.txt mytext.text
firstExample.txtĈ myHistory.txt test.sh
mv is used to move or rename a file or a folder.
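Renaming and moving are the same operation with a different target. A minimal sketch in a throwaway directory; the archive folder is a made-up name for illustration.

```shell
cd "$(mktemp -d)"
mkdir scripts
touch scripts/myfirstScript.sh

# Same directory, new name: a rename.
mv scripts/myfirstScript.sh scripts/test.sh

# Different directory, same name: a move.
mkdir archive
mv scripts/test.sh archive/

ls scripts archive
```

Unlike cp, mv leaves nothing behind at the old location.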
create a folder named "scriptsBackup" in dcuser
copy (cp) test.sh into scriptsBackup
$ mkdir scriptsBackup
$ cp test.sh scriptsBackup
ls ~, then mkdir ~/scriptsBackup and cp test.sh ~/scriptsBackup/
mkdir scriptsBackup
cp test.sh scriptsBackup
mkdir scriptsBackup to create the scriptsBackup directory, then cp scripts/test.sh scriptsBackup
mkdir ScriptsBackup
cp ~/scripts/test.sh ~/ScriptsBackup/
$ cd ..
$ pwd
/home/dcuser
$ ls
R r_data scripts shell_data
$ mkdir scriptsBackup
$ ls
$ cp ~/scripts/test.sh ~/scriptsBackup
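The mkdir and cp steps above can be sketched as follows. demo_home is a hypothetical stand-in for /home/dcuser so the example does not touch the real home directory.

```shell
demo_home="$(mktemp -d)"   # hypothetical stand-in for /home/dcuser
mkdir "$demo_home/scripts"
echo 'echo hello' > "$demo_home/scripts/test.sh"

# -p: no error if the directory already exists.
mkdir -p "$demo_home/scriptsBackup"

# The trailing slash on the target makes cp fail if scriptsBackup is
# missing, instead of creating a regular file called scriptsBackup.
cp "$demo_home/scripts/test.sh" "$demo_home/scriptsBackup/"

ls "$demo_home/scripts" "$demo_home/scriptsBackup"
```

test.sh now exists in both folders, because cp copies and leaves the original in place.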
for each file in ../shell_data/untrimmed_fastq/ read the first 2 lines and save them to a file called seq_info.txt
for filename in *.fastq; do head -n 2 ${filename} >> seq_info.txt; done
$ for filename in *.fastq
> do
> head -n 2 ${filename} >> seq_info.txt
> done
$ cat seq_info.txt
for filename in *.fastq; do head -n 2 ${filename}; done > seq_info.txt
for filename in *.fastq; do head -n 2 ${filename} >> seq_info.txt; done; ls
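The answers above use two placements of the redirect: >> inside the loop, or > once after done. A minimal sketch comparing them on two tiny made-up FASTQ files (the file names and reads are invented for illustration):

```shell
cd "$(mktemp -d)"
# Two tiny made-up FASTQ files (hypothetical names and reads).
printf '@read1\nACGT\n+\nIIII\n' > a.fastq
printf '@read2\nTTGA\n+\nIIII\n' > b.fastq

# Variant 1: append inside the loop. Running it twice doubles the file.
for filename in *.fastq
do
    head -n 2 "$filename" >> seq_info_append.txt
done

# Variant 2: redirect the whole loop once. Rerunning it overwrites.
for filename in *.fastq
do
    head -n 2 "$filename"
done > seq_info_once.txt

cat seq_info_once.txt
```

On a first run both files have the same content. A lone > inside the loop would instead empty the file on every iteration, so only the last file's lines would survive.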
create a script "dataAnalysis.sh" in the folder scripts
for each file in ../shell_data/untrimmed_fastq/ read the first 2 lines and save them to a file called seq_info.txt, the file seq_info.txt is in the same ../shell_data/untrimmed_fastq/ folder
nano dataAnalysis.sh (for filename in *.fastq; do head -n 2 ${filename} >> seq_info.txt; done)
chmod +x dataAnalysis.sh
./dataAnalysis.sh
cd ~/shell_data/untrimmed_fastq
for filename in *.fastq
do
head -n 2 ${filename} >> seq_info.txt
done
/home/dcuser/shell_data/untrimmed_fastq
$ for filename in *.fastq; do head -n 2 $filename >> seq_info.text; done
$ cat seq_info.text
@SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
TATTCTGCCATAATGAAATTCGCCACTTGTTAGTGT
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
$ ls
seq_info.text SRR097977.fastq SRR098026.fastq
/home/dcuser
$ cd scripts
$ ls
dataAnalysis.sh firstExample.txtĈ myHistory.txt scriptsBackup
firstExample.txt myFolderContent.txt mytext.text test.sh
How can I now move the seq_info.text to dataAnalysis.sh???
$ nano dataAnalysis.sh
$ mv dataAnalysis.sh ~/scripts
nano dataAnalysis.sh (for filename in *.fastq; do head -n 2 ${filename} >> seq_info.txt; done)
mv ~/scripts/dataAnalysis.sh ~/shell_data/untrimmed_fastq/
nano dataAnalysis.sh (for filename in ~/shell_data/untrimmed_fastq/*.fastq; do head -n 2 ${filename} >> ~/shell_data/untrimmed_fastq/seq_info.txt; done)
seq_info.txt should be saved in the folder ~/results
write in a single line of code: create the folder ~/results, execute the script, and verify the result of the script
mkdir ~/results && bash ~/scripts/dataAnalysis.sh && cat ~/results/seq_info.txt
nano ~/scripts/dataAnalysis.sh (for filename in ~/shell_data/untrimmed_fastq/*.fastq; do head -n 2 ${filename} >>~/results/seq_info.txt; done)
mkdir ~/results && ~/scripts/dataAnalysis.sh && cat ~/results/seq_info.txt
mkdir results && cd results && bash ~/scripts/dataAnalysis.sh && ls && cat seq_info.txt
Guy$ nano ~/scripts/dataAnalysis.sh
cd ~/shell_data/untrimmed_fastq
for filename in *.fastq
do
head -n 2 ${filename} >> ~/results/seq_info.txt
done
Guy$ mkdir ~/results && ~/scripts/dataAnalysis.sh && cat ~/results/seq_info.txt
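The exercise can be sketched end to end as below. This is not the instructor's solution, just one way under stated assumptions: demo_home stands in for /home/dcuser, a.fastq is a made-up input file, and the script takes the base directory as its first argument (a choice made here so the sketch never writes to the real home directory).

```shell
# demo_home is a hypothetical stand-in for /home/dcuser.
demo_home="$(mktemp -d)"
mkdir -p "$demo_home/scripts" "$demo_home/shell_data/untrimmed_fastq"
printf '@read1\nACGT\n+\nIIII\n' > "$demo_home/shell_data/untrimmed_fastq/a.fastq"

# Write the script. The quoted 'EOF' keeps $1 and $filename literal.
cat > "$demo_home/scripts/dataAnalysis.sh" <<'EOF'
#!/bin/bash
# Usage: dataAnalysis.sh BASE_DIR  (BASE_DIR plays the role of ~)
base="$1"
for filename in "$base"/shell_data/untrimmed_fastq/*.fastq
do
    head -n 2 "$filename" >> "$base/results/seq_info.txt"
done
EOF

# One line: each step runs only if the previous one succeeded.
mkdir "$demo_home/results" && bash "$demo_home/scripts/dataAnalysis.sh" "$demo_home" && cat "$demo_home/results/seq_info.txt"
```

Because of &&, a failing mkdir (for example when results already exists) stops the chain before the script runs. Running the script with bash avoids the need for chmod +x; with chmod +x it can be called directly, as in the answers above.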
ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt
Data Wrangling and Processing for Genomics
https://datacarpentry.org/wrangling-genomics/
https://datacarpentry.org/genomics-workshop/setup.html
Assessing Read Quality
https://datacarpentry.org/wrangling-genomics/02-quality-control/index.html
Sequencing read quality check
fastqc code and documentation
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Guy$ tail -n 4 SRR2584863_1.fastq
CTGCAATACCACGCTGATCTTTCACATGATGTAAGAAAAGTGGGATCAGCAAACCGGGTGCTGCTGTGGCTAGTTGCAGCAAACCATGCAGTGAACCCGCCTGTGCTTCGCTATAGCCGTGACTGATGAGGATCGCCGGAAGCCAGCCAA
+
CCCFFFFFHHHHGJJJJJJJJJHGIJJJIJJJJIJJJJIIIIJJJJJJJJJJJJJIIJJJHHHHHFFFFFEEEEEDDDDDDDDDDDDDDDDDCDEDDBDBDDBDDDDDDDDDBDEEDDDD7@BDDDDDD>AA>?B?<@BDD@BDC?BDA?
tail -n 4 SRR2584863_1.fastq
tail -n 4 *863_1.fastq
$ cat ~/dc_workshop/docs/fastqc_summaries.txt
$ cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
$ ls
$ cat ~/dc_workshop/docs/fastqc_summaries.txt
grep "FAIL" fastqc_summaries.txt | sort
cat fastqc_summaries.txt | sort
How can you sort so that it kicks everything out of the list that has a PASS?
Check "man grep". grep has a -v option that prints every line that does NOT match the pattern; you can then pipe the result to sort
for example: grep -v PASS fastqc_summaries.txt | sort
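A minimal sketch of both filters on a made-up summary file. The three lines only imitate the tab-separated layout of FastQC's summary.txt; sample_1.fastq is an invented name.

```shell
cd "$(mktemp -d)"
# A made-up three-line summary in the tab-separated layout of summary.txt.
printf 'PASS\tBasic Statistics\tsample_1.fastq\n'          >  fastqc_summaries.txt
printf 'WARN\tPer base sequence content\tsample_1.fastq\n' >> fastqc_summaries.txt
printf 'FAIL\tPer base sequence quality\tsample_1.fastq\n' >> fastqc_summaries.txt

# Keep only the failures.
grep FAIL fastqc_summaries.txt > only_fail.txt

# Drop every PASS line, then sort what is left (FAIL and WARN).
grep -v PASS fastqc_summaries.txt | sort > not_pass.txt

cat not_pass.txt
```

Note that grep reads either a file argument or its standard input, not both: when a file name is given, anything piped into grep is ignored.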
$ cat ~/dc_workshop/docs/fastqc_summaries.txt
Trimming and Filtering
https://datacarpentry.org/wrangling-genomics/03-trimming/index.html
Data Wrangling and Processing for Genomics - part 2
https://datacarpentry.org/wrangling-genomics/
Instructor
Endre Sebestyén
Helpers
Tadeu
Kari
Ali
Participants
Lisa Tietze
Hannah
May
Jonathan
Bisa
Berihun
Abel Gizaw
Benedicte
Fleur
Iva
Abush
amit
Adrian
Alexandra
Katrine
Eirini
Nur
Variant Calling Workflow
https://datacarpentry.org/wrangling-genomics/04-variant_calling/index.html
BWA documentation
http://bio-bwa.sourceforge.net/bwa.shtml
The SAM/BAM file format specification
https://samtools.github.io/hts-specs/SAMv1.pdf
samtools documentation
http://www.htslib.org/doc/samtools.html
bcftools documentation
http://www.htslib.org/doc/bcftools.html
What is the real name of the reference genome?
head data/ref_genome/ecoli_rel606.fasta
>CP000819.1 Escherichia coli B str. REL606, complete genome
CP000819.1 Escherichia coli B str. REL606, complete genome
How many variants are there in the vcf file?
Use grep and wc
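One possible way to combine the two commands, sketched on a tiny made-up VCF. Only the chromosome name CP000819.1 comes from the reference above; demo.vcf, the positions and the alleles are invented for illustration.

```shell
cd "$(mktemp -d)"
# A tiny made-up VCF: two header lines and three variant records.
printf '##fileformat=VCFv4.2\n'                          >  demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> demo.vcf
printf 'CP000819.1\t100\t.\tC\tT\t50\t.\t.\n'            >> demo.vcf
printf 'CP000819.1\t200\t.\tG\tA\t50\t.\t.\n'            >> demo.vcf
printf 'CP000819.1\t300\t.\tA\tG\t50\t.\t.\n'            >> demo.vcf

# Header lines start with #; every other line is one variant record.
n_variants=$(grep -v '^#' demo.vcf | wc -l)
echo "$n_variants"
```

On the workshop data, the same pipeline would be pointed at the real .vcf file instead of demo.vcf. grep -vc '^#' gives the same count without wc.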
IGV download
http://software.broadinstitute.org/software/igv/download
IGV setup done
Endre
Lisa
Kari
Iva
Tadeu
Abel
Jonathan
Katrine
May
Fleur
IGV files to download
https://www.dropbox.com/sh/hhmh6j4b212b5dd/AADIfAdTsfo5rs5I_nWeuM1pa?dl=0
Intro to Cloud computing:
https://datacarpentry.org/cloud-genomics/03-verifying-instance/index.html
Introduction to Cloud Computing for Genomics
https://datacarpentry.org/cloud-genomics/
Post workshop survey
https://carpentries.typeform.com/to/UgVdRQ?slug=2020-10-21-nord-online