Do Not Merge -> Suggestions + questions nr 1. #92

Open
wants to merge 11 commits into
base: master
75 changes: 75 additions & 0 deletions disk_areas.md
@@ -0,0 +1,75 @@

Disk areas on Abel that are relevant for your work at the Veterinary Institute

# Summary

|directory|path|what|characteristics|
|:--------|:-------------------|:---------|:--------------|
|`$HOME`|`usit/abel/u1/username`|"home-home" area|500 GB, login area, backed up, NOT for data processing|
|**work**|`/work/projects/nn9305k`|common resources (e.g. software, databases)|10 TB, faster disk, closer connection to the nodes|
|**project**|`/projects/nn9305k`|project data area (backed up; raw data, results)|30 TB, slower disk|
|`$SCRATCH`|`/work/jobs/JOBID.d`|compute nodes: **FOR PROCESSING DATA = RUNNING JOBS**|fastest|
|`$USERWORK`|`/work/users/username`|sharing data among different jobs = **staging**|not backed up, files removed after 45 days|

1. **$HOME**: "home-home" directory (on the login nodes): 500 GB, backed up regularly, which is slow and costly.
   - If you need to store files that do not require backup, put them in a `nobackup` directory.
   - Recommended for: software, job configurations, job preparation, debugging ...
   - DO NOT USE it to run jobs -> use `$SCRATCH`.

QQQ: -> **login nodes**: no big jobs, but small test jobs are tolerated. -> Processes taking more than 30 min are killed automatically => so in home-home

**work area**: `/work/projects/nn9305k` will only be used for:
- programs and the libraries they need to work
- public reference data, template files, adapter files ...

2. **project area**: `/projects/nn9305k`, which contains:
   - project data, including raw sequencing files (zipped/tarred and unzipped/untarred)
   - the data analyses that are performed
   - this is where the work will NOW be done.

QQQ - [ ] -> **so our (vetinst users) `/work/projects/nn9305k/home/username` (incl. all subfolders) should be moved here**

3. **$SCRATCH**: when you start a job, a temporary directory on SCRATCH is automatically created. It is automatically deleted when the job finishes. This allows jobs to run faster without interfering with the work of others. To monitor the status of your jobs, you can access it at `/work/jobs/JOBID.d`.

4. **$USERWORK**: if files are needed for more than one job, they are staged here. Files are deleted after 45 days, and there is no backup. Access: `/work/users/username`
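
The way these areas fit together in a job could look like the following minimal sketch. The `#SBATCH` values, the helper name `run_on_scratch`, the file names (`reads.fastq`, `line_count.txt`) and the line-counting "analysis" are placeholders chosen for illustration, not taken from the Abel documentation. The `mktemp -d` fallbacks are only there so that the sketch can be tried outside a job, where `$SCRATCH` and `$USERWORK` are not set; a real job script on Abel needs the additional setup described in the [Abel User Guide].

```shell
#!/bin/bash
#SBATCH --account=nn9305k       # project account (assumed to match the project name)
#SBATCH --time=00:10:00         # placeholder wall time
#SBATCH --mem-per-cpu=1G        # placeholder memory request

# Outside a SLURM job these variables are unset, so fall back to temporary directories.
SCRATCH="${SCRATCH:-$(mktemp -d)}"
USERWORK="${USERWORK:-$(mktemp -d)}"

# Hypothetical helper: copy the input to $SCRATCH, process it there,
# then copy the result to $USERWORK so that a later job can reuse it.
run_on_scratch() {
    local input="$1"
    local name
    name="$(basename "$input")"
    cp "$input" "$SCRATCH/" || return 1
    # Placeholder analysis: count the lines of the input file.
    wc -l < "$SCRATCH/$name" | tr -d ' ' > "$SCRATCH/line_count.txt" || return 1
    cp "$SCRATCH/line_count.txt" "$USERWORK/"
}

# Demonstration with a small hypothetical input file.
demo_input="$(mktemp -d)/reads.fastq"
printf '@read1\nACGT\n+\nIIII\n' > "$demo_input"
run_on_scratch "$demo_input"
cat "$USERWORK/line_count.txt"
```

The point of the design is that all reading and writing during processing happens on `$SCRATCH`, and only the files that must survive the job are copied out before it ends.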

For more details, see [Managing Data on Abel] and, more generally, the [Abel User Guide]. You can also look at [Computer Ressources at CEES].
#### [Workflow] between areas

<img src="https://docs.google.com/drawings/d/e/2PACX-1vSY_KCj3fubTH1zk6ZkOL6eLhoOOuAbp4bfu1YkOAvkadHhPfbuZrsepwHCUpEqwr45Zqt2hlEoCwVk/pub?w=960&amp;h=720">

NB: **SLURM** is the queue system and scheduler daemon (a program/process that always runs in the background on Abel). It optimizes the use of computing resources and schedules when jobs are run on $SCRATCH.

NB: There is also a **version of NCBI databases** hosted on Abel: at `/work/databases/bio/ncbi`

QQQ: I thought we could also have a summary checklist -> what to do (easier for beginners at least, I guess). Example below:
# MEMO: Guidelines:

```
QQQ
*this part should be checked and modified after moving* -> NEED to be sure the areas are ok before
- [ ] the README should
- [ ] and New_user.txt
```

**New users**
- [ ] README in `/work/projects/nn9305k/`
- [ ] `/work/projects/nn9305k/samplefiles/new_user.txt` - follow instructions

**Everybody**
- [ ] read NEWS regularly: `/work/projects/nn9305k/`

#### All **project directories** should be organized as follows:

- [ ] README: copy `README_datafile` from the samplefiles directory, fill it in, and rename it to `README_tarfilename`
- [ ] restructure the directory into:
  - [ ] rawdata
  - [ ] analysis
  - [ ] scripts
  - [ ] logs
- [ ] please fill in the data registry file: `/projects/nn9305k/sys/DataRegistry.txt`
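
The layout above could be created with a small helper such as the following sketch. The function name `make_project_skeleton` and the project name `my_project` are hypothetical, and the README is created empty here rather than copied from `README_datafile`. The demonstration runs in a temporary directory; on Abel the target would be a directory under `/projects/nn9305k`.

```shell
#!/bin/bash
# Hypothetical helper: create the recommended sub-directories for one project.
make_project_skeleton() {
    local project_dir="$1"
    mkdir -p "$project_dir"/rawdata "$project_dir"/analysis \
             "$project_dir"/scripts "$project_dir"/logs || return 1
    # Placeholder README; the guidelines say to copy README_datafile from samplefiles instead.
    touch "$project_dir/README"
}

# Demonstration in a temporary directory (not on the real project area).
demo_root="$(mktemp -d)"
make_project_skeleton "$demo_root/my_project"
ls "$demo_root/my_project"
```

Because `mkdir -p` does not fail on existing directories, the helper can be rerun on a project that already has part of the structure.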

[Abel User Guide]:https://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/
[Managing Data on Abel]:https://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/data.html
[Workflow]:https://docs.google.com/drawings/d/e/2PACX-1vSY_KCj3fubTH1zk6ZkOL6eLhoOOuAbp4bfu1YkOAvkadHhPfbuZrsepwHCUpEqwr45Zqt2hlEoCwVk/pub?w=960&h=720
[Computer Ressources at CEES]:https://github.com/uio-cees/hpc/wiki/Computer-resources
136 changes: 136 additions & 0 deletions evfi-suggestions1.md
@@ -0,0 +1,136 @@

# course itself - videos introduction + Lectures.pdf

Q: Questions

## Small things...

**Suggestions:**

> Maybe add the complete reference list (end of this file), ready to paste, in the order used in the course. I checked all links in the PDFs to see whether they are still valid.
> All cited articles (except one, which is not free; I asked for it) were found and sorted in a folder for the course. If you want, I can put them all in your course folder (this saves others from having to search for the articles if they are interested). (I just love to have a look at articles cited in courses, as they are chosen for their quality/representativeness.)

- About "the building blocks: DNA ..." at the end of the first course: not sure what to think about it. It is very simple and clear, but THEN there is a sudden jump to more complicated concepts that have not been presented. Maybe too limited? (e.g. Arvind also speaks of libraries for **transposons** ...)
- so maybe a short background refresher (problem: it turns more into a genetics course), or possibly provide references to good review papers if needed:
- mobile elements / "junk"
- methylation -> Thomas: presentation about methylation detection (PacBio) - epigenetics
- RNAs: kinds and functions ... exome sequencing

- Bias, method/platform specific: maybe you cover that later in the course (I think it would be good to know the potential problems and pitfalls -> also important to properly understand articles). A mention is ok... Arvind speaks a bit about optical duplicates in course 3 (new machines with 2-color dyes) -> how will this influence analyses?

**For each lecture:**
- Lecture1.pdf: Karch et. al. EMBO Mol. Med. 20**12** (typo)

- Lecture2.pdf: in case you want to update your links
- p 22 link: <https://www.sequencing.uio.no/forms/guidelines-submission-form-(illumina).pdf> they moved it to https
- p 23 link: <http://www.micronautomata.com/bioinformatics> does not seem to respond - retry...
- p 31: Illumina - video link (see below)

- Lecture 2 video (Thomas):
- library: define earlier - maybe move the terms to know (p 33) earlier in the presentation
- Illumina: ca. 50 min into the video
Q: when watching the video, I was wondering:
- how is the fragment flipped over for paired-end reads when the first sequencing pass is finished? Is the flip automatic once DNA fragment amplification is finished? I did not really get the **bridge amplification** (paired ends) at first - thought: easier with a video
- > Suggestion: show the Illumina video that explains the bridge amplification? <https://www.youtube.com/watch?v=fCd6B5HRaZ8> But I find the older video somehow clearer, and it is also shorter (it also explains read indexes 1 and 2): <https://www.youtube.com/watch?v=HMyCqWhwB8E>

Q: MinION - can get a high-quality consensus from continuous sequencing -> not sure I see how the same sequence gets sequenced many times? (each pore reads 1 DNA molecule, so with many pores ... some read the same DNA molecule -> consensus from many pores? Do I understand this right?)

- p57 (was just wondering what were the different ending parts - as not yet explained : ex. P5-index2-Rd1SP)

- Lecture3.pdf: p3 link changed <https://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/miseq/indexed-sequencing-overview-guide-15057455-04.pdf>
- so indexes - short tagging sequences -> representing the 2 complementary directions (old school: Forward and Reverse) of the sequenced DNA fragment.

- Lecture4.pdf (Karin): p 43: changed link to <http://www.cbcb.umd.edu/research/assembly_primer>
Q: - BLAST: k-mer word -> optimal choice (why not choose 36/17?) - comparing the whole genome or part of the genome? Do we have an idea about the query sequence?
- Ouch! *Burrows-Wheeler transform (at the end of the day) - coming back to that later on*

- Ok - some clarification on assembly and the choice of word length ... Q: from experience, let's say WGS - how does genome structure (several chromosome pairs, e.g. also with gene families, or plasmid DNA in bacteria) affect assembly? Think bacteria: if HGT, most likely plasmid ...

Q: **5 days course** - 5th day lecture - no video? Going through the resources?

### additional questions - to be sure I understand
- Q: Curious - whole genome amplification: does it work? Tested before making libraries (we did try; some ok, some not) - applicable? Surely biased for some taxa - any experience?
- Q: depth and coverage:
- so for a de novo genome -> no idea of coverage -> estimate it from what is expected for closely related genomes? => yes, need an example reference talking about ... works ok for bacteria (what about plants - polyploidy?)
- depth -> homogeneity ... -> discriminate different OTUs (useful, or unreliable?)

- Q: good enough overlap between reads (assembly) -> Q: choose word length ~17+ plus some mismatches - is there a way to determine the optimum?

- Q: 454: not in use (as not presented) - need to read a bit more. Maybe worth a short presentation or a link, to be able to understand any specificities when reading papers.
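
For the depth question above, a common rule of thumb is that the expected average depth is the number of reads times the read length, divided by the genome size. A minimal sketch, assuming the genome size is borrowed from a closely related reference; the function name `expected_depth` and the numbers are hypothetical and not taken from the course:

```shell
#!/bin/bash
# expected_depth N_READS READ_LENGTH GENOME_SIZE
# Prints the expected average depth (N_READS * READ_LENGTH / GENOME_SIZE).
expected_depth() {
    awk -v n="$1" -v l="$2" -v g="$3" 'BEGIN { printf "%.1f\n", n * l / g }'
}

# Example: 1 million reads of 150 bp on an assumed 5 Mb bacterial genome.
expected_depth 1000000 150 5000000   # prints 30.0
```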
--------------------
# ? ideas if not already suggested
- Introduction to Genomic Data Science - <https://courses.edx.org/courses/course-v1:UCSanDiegoX+CSE181.1x+1T2017/course/> - <https://stepik.org/> can help train programming with python: learn about sliding window, some text algorithms, charging machines ...

# REFERENCE LIST: READY TO PASTE

The only one in bold = not found -> asked for it (ResearchGate)

#### Lecture 1

Did not include the historical papers

Lex Nederbragt blog: https://flxlexblog.wordpress.com/2016/07/08/developments-in-high-throughput-sequencing-july-2016-edition/

Stratton, Michael R., Peter J. Campbell, and P. Andrew Futreal. “The Cancer Genome.” Nature 458, no. 7239 (April 2009): 719–24. https://doi.org/10.1038/nature07943.

Nasir, Arshan, Kyung Mo Kim, and Gustavo Caetano-Anolles. “Giant Viruses Coexisted with the Cellular Ancestors and Represent a Distinct Supergroup along with Superkingdoms Archaea, Bacteria and Eukarya.” BMC Evolutionary Biology 12, no. 1 (August 24, 2012): 156. https://doi.org/10.1186/1471-2148-12-156.

Kujiraoka, Manabu, Makoto Kuroda, Koji Asai, Tsuyoshi Sekizuka, Kengo Kato, Manabu Watanabe, Hiroshi Matsukiyo, et al. “Comprehensive Diagnosis of Bacterial Infection Associated with Acute Cholecystitis Using Metagenomic Approach.” Frontiers in Microbiology 8 (April 20, 2017). https://doi.org/10.3389/fmicb.2017.00685.

Tyson, Gene W., Jarrod Chapman, Philip Hugenholtz, Eric E. Allen, Rachna J. Ram, Paul M. Richardson, Victor V. Solovyev, Edward M. Rubin, Daniel S. Rokhsar, and Jillian F. Banfield. “Community Structure and Metabolism through Reconstruction of Microbial Genomes from the Environment.” Nature 428, no. 6978 (March 2004): 37–43. https://doi.org/10.1038/nature02340.

Karch, Helge, Erick Denamur, Ulrich Dobrindt, B. Brett Finlay, Regine Hengge, Ludgers Johannes, Eliora Z. Ron, Tone Tønjum, Philippe J. Sansonetti, and Miguel Vicente. “The Enemy within Us: Lessons from the 2011 European Escherichia Coli O104:H4 Outbreak.” EMBO Molecular Medicine 4, no. 9 (September 4, 2012): 841–48. https://doi.org/10.1002/emmm.201201662.

Rohde, Holger, Junjie Qin, Yujun Cui, Dongfang Li, Nicholas J. Loman, Moritz Hentschke, Wentong Chen, et al. “Open-Source Genomic Analysis of Shiga-Toxin–Producing E. Coli O104:H4.” New England Journal of Medicine 365, no. 8 (August 25, 2011): 718–24. https://doi.org/10.1056/NEJMoa1107643.

Hendriksen, Rene S., Lance B. Price, James M. Schupp, John D. Gillece, Rolf S. Kaas, David M. Engelthaler, Valeria Bortolaia, et al. “Population Genetics of Vibrio Cholerae from Nepal in 2010: Evidence on the Origin of the Haitian Outbreak.” MBio 2, no. 4 (September 1, 2011): e00157-11. https://doi.org/10.1128/mBio.00157-11.

**Falush, Daniel. “Bacterial Genomics: Microbial GWAS Coming of Age.” Nature Microbiology 1, no. 5 (May 2016): 16059. https://doi.org/10.1038/nmicrobiol.2016.59.**

Earle, Sarah G., Chieh-Hsi Wu, Jane Charlesworth, Nicole Stoesser, N. Claire Gordon, Timothy M. Walker, Chris C. A. Spencer, et al. “Identifying Lineage Effects When Controlling for Population Structure Improves Power in Bacterial Association Studies.” Nature Microbiology 1 (April 4, 2016): 16041. https://doi.org/10.1038/nmicrobiol.2016.41. (with supplement)

Valenzuela-Muñoz, Valentina, Jacqueline Chavez-Mardones, and Cristian Gallardo-Escárate. “RNA-Seq Analysis Evidences Multiple Gene Responses in Caligus Rogercresseyi Exposed to the Anti-Salmon Lice Drug Azamethiphos.” Aquaculture 446 (September 1, 2015): 156–66. https://doi.org/10.1016/j.aquaculture.2015.05.011.

#### Lecture 2
See Lecture 1:

Stratton, Michael R., Peter J. Campbell, and P. Andrew Futreal. “The Cancer Genome.” Nature 458, no. 7239 (April 2009): 719–24. https://doi.org/10.1038/nature07943.

Guidelines:
Norwegian Sequencing Centre. “Guidelines for Completion of Illumina Sample Submission Form,” n.d. https://www.sequencing.uio.no/forms/samplesubmissionforms.html. https://www.sequencing.uio.no.

Articles:
Metzker, Michael L. “Sequencing in Real Time.” Nature Biotechnology 27, no. 2 (February 2009): 150–51. https://doi.org/10.1038/nbt0209-150.

Rhoads, Anthony, and Kin Fai Au. “PacBio Sequencing and Its Applications.” Genomics, Proteomics & Bioinformatics 13, no. 5 (October 2015): 278–89. https://doi.org/10.1016/j.gpb.2015.08.002.

Muinck, Eric J. de, Pål Trosvik, Gregor D. Gilfillan, Johannes R. Hov, and Arvind Y. M. Sundaram. “A Novel Ultra High-Throughput 16S rRNA Gene Amplicon Sequencing Library Preparation Method for the Illumina HiSeq Platform.” Microbiome 5, no. 1 (July 6, 2017): 68. https://doi.org/10.1186/s40168-017-0279-1.

Macosko, Evan Z., Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, et al. “Highly Parallel Genome-Wide Expression Profiling of Individual Cells Using Nanoliter Droplets.” Cell 161, no. 5 (May 21, 2015): 1202–14. https://doi.org/10.1016/j.cell.2015.05.002.

#### Lecture 3
Schurch, Nicholas J., Pietá Schofield, Marek Gierliński, Christian Cole, Alexander Sherstnev, Vijender Singh, Nicola Wrobel, et al. “How Many Biological Replicates Are Needed in an RNA-Seq Experiment and Which Differential Expression Tool Should You Use?” RNA (New York, N.Y.) 22, no. 6 (2016): 839–51. https://doi.org/10.1261/rna.053959.115.

Liu, Yuwen, Jie Zhou, and Kevin P. White. “RNA-Seq Differential Expression Studies: More Sequence or More Replication?” Bioinformatics (Oxford, England) 30, no. 3 (February 1, 2014): 301–4. https://doi.org/10.1093/bioinformatics/btt688.

Busby, Michele A., Chip Stewart, Chase A. Miller, Krzysztof R. Grzeda, and Gabor T. Marth. “Scotty: A Web Tool for Designing RNA-Seq Experiments to Measure Differential Gene Expression.” Bioinformatics 29, no. 5 (March 1, 2013): 656–57. https://doi.org/10.1093/bioinformatics/btt015.

Sims, David, Ian Sudbery, Nicholas E. Ilott, Andreas Heger, and Chris P. Ponting. “Sequencing Depth and Coverage: Key Considerations in Genomic Analyses.” Nature Reviews. Genetics 15, no. 2 (February 2014): 121–32. https://doi.org/10.1038/nrg3642.

Conesa, Ana, Pedro Madrigal, Sonia Tarazona, David Gomez-Cabrero, Alejandra Cervera, Andrew McPherson, Michał Wojciech Szcześniak, et al. “A Survey of Best Practices for RNA-Seq Data Analysis.” Genome Biology 17, no. 1 (January 26, 2016): 13. https://doi.org/10.1186/s13059-016-0881-8.



#### Lecture 4

Shafee, Thomas, and Rohan Lowe. “Eukaryotic and Prokaryotic Gene Structure.” WikiJournal of Medicine 4, no. 1 (2017). https://doi.org/10.15347/wjm/2017.002.

Langmead, Ben. “Introduction to the Burrows-Wheeler Transform and FM Index,” n.d., 12. http://www.cs.jhu.edu/~langmea/resources/bwt_fm.pdf

Miller, Jason R., Sergey Koren, and Granger Sutton. “Assembly Algorithms for Next-Generation Sequencing Data.” Genomics 95, no. 6 (June 2010): 315–27. https://doi.org/10.1016/j.ygeno.2010.03.001.

P43, 45, 51 Link: http://www.cbcb.umd.edu/research/assembly_primer

Schatz, Michael C., Arthur L. Delcher, and Steven L. Salzberg. “Assembly of Large Genomes Using Second-Generation Sequencing.” Genome Research 20, no. 9 (September 2010): 1165–73. https://doi.org/10.1101/gr.101360.109.

Roberts, Richard J., Mauricio O. Carneiro, and Michael C. Schatz. “The Advantages of SMRT Sequencing.” Genome Biology 14, no. 6 (July 3, 2013): 405. https://doi.org/10.1186/gb-2013-14-6-405.
34 changes: 15 additions & 19 deletions working_with_hpc.md
@@ -53,15 +53,13 @@ are.
There are three main storage locations that you need to know about on
abel.

* Your home area `usit/abel/u1/username`. When you log in, you will automatically land in what
is called your _home area_. You will commonly not use this location much.

* The work-project area, `/work/projects/nn9305k`. This is where the
* Your home area `usit/abel/u1/username`. When you log in, you will automatically land in what is called your **home area**. You will commonly not use this location much.

* The work area, `/work/projects/nn9305k`. This is where the
Veterinary Institute does its work on abel. Think of it as one of
the common ressources areas on the Veterinary Institute area.

* The last one is the project's backup area. This is
`/projects/nn9305k`.
the common resource areas of the Veterinary Institute. Typically it contains programs and public reference data.

* The last one is the project's backup area, `/projects/nn9305k`. This is where the data associated with bioinformatics projects at the VetInst (raw sequencing data, data analyses, your reasoning for analyses...) are stored.

There is also a fourth area that will be discussed below.

@@ -173,12 +171,19 @@ computers, while ensuring that the data is not corrupted on the way.
This command has many options, as do other commands under unix. One set
of options that is commonly used with `rsync` is `-rauPW`.

### checking that file transfer completed without incident:
--------

**Task**
Have a look at the wikipedia page for rsync. Can you figure out the
syntax for rsync?

--------

### checking that file transfer completed without incident:
You can either:
> - rerun `rsync` with the same options: if the transfer was successful, nothing will be synchronized (same content).
> - use **hash** programs that generate a code based on file content (for both the original and the transferred/copied file). If both codes are identical, the content of the files is identical, i.e. the file transfer was successful.
> use for ex. `md5sum file_origin`and `md5sum file_transfered` and compare codes.
>   for example, run `md5sum file_origin` and `md5sum file_transfered` and compare the codes.

> Better to [automate] the process of checking if you have many files
> - create file `XX_md5sum.txt`
@@ -187,15 +192,6 @@ You can either:
> - `md5sum -c XX_md5sum.txt`
> - remove your temporary file `XX_md5sum.txt` if test passed
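
The automated check described above could look like the following minimal sketch. It assumes GNU `md5sum` is available; the directories, the file names (`sample1.fastq`, `sample2.fastq`, `files_md5sum.txt`) and the use of `cp` in place of the real transfer are stand-ins chosen for illustration.

```shell
#!/bin/bash
# Stand-ins for the origin and destination of a transfer (hypothetical names).
origin="$(mktemp -d)"
dest="$(mktemp -d)"
printf 'ACGT\n' > "$origin/sample1.fastq"
printf 'TTGA\n' > "$origin/sample2.fastq"

# 1. Record one checksum per file at the origin.
(cd "$origin" && md5sum -- *.fastq > files_md5sum.txt)

# 2. Transfer the data and the checksum list (cp stands in for rsync here).
cp "$origin"/*.fastq "$origin/files_md5sum.txt" "$dest/"

# 3. Verify at the destination: md5sum -c prints OK or FAILED per file
#    and returns a non-zero exit status if any file differs.
if (cd "$dest" && md5sum -c files_md5sum.txt); then
    transfer_status="ok"
    rm "$dest/files_md5sum.txt"   # remove the temporary list once the check passed
else
    transfer_status="failed"
fi
echo "transfer check: $transfer_status"
```

Running `md5sum` from inside each directory keeps the paths in the checksum list relative, so the same list can be checked at the destination without editing.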

--------

**Task**
Have a look at the wikipedia page for rsync. Can you figure out the
syntax for rsync?

--------


### wget

wget is very useful for getting data from a website to either your local
19 changes: 0 additions & 19 deletions working_with_hpc.md~

This file was deleted.