## What is going to happen?
Files on the cluster's main storage `/data/gpfs-1` aka. `/fast` will move to a new file system.
That includes users' home directories, work directories, and work-group directories.
Once files have been moved to their new locations, `/fast` will be retired.

## Why is this happening?
`/fast` is based on high-performance, proprietary hardware (DDN) & file system (GPFS).
The company selling it has terminated support, which also means buying replacement parts will become increasingly difficult.

## The new storage
There are *two* file systems set up to replace `/fast`, named *Tier 1* and *Tier 2* after their difference in I/O speed:

- **Tier 1** is faster than `/fast` ever was, but it only has about 75 % of its usable capacity.
- **Tier 2** is not as fast, but much larger, almost 3 times the current usable capacity.

The **Hot storage** Tier 1 is reserved for large files requiring frequent random access.
Tier 2 (**Warm storage**) should be used for everything else.
Both file systems are based on the open-source, software-defined [Ceph](https://ceph.io/en/) storage platform and differ in the type of drives used.
Tier 1 or Cephfs-1 uses NVMe SSDs and is optimized for performance, while Tier 2 or Cephfs-2 uses traditional hard drives and is optimized for cost.

So these are the three equivalent terminologies currently in use:
- Cephfs-1 = Tier 1 = Hot storage
- Cephfs-2 = Tier 2 = Warm storage

### Snapshots and Mirroring
Snapshots are incremental copies of the state of the data at a particular point in time.
They provide safety against various "Oops, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files.

Depending on the location and Tier, Cephfs utilizes snapshots differently.
Snapshots of some parts of Tier 1 and Tier 2 are also mirrored into a separate fire compartment within the data center to provide an additional layer of security.

| Tier | Location | Path | Retention policy | Mirrored |
|:-----|:-------------------------|:-----------------------------|:--------------------------------|---------:|
User access to the snapshots is documented here: https://hpc-docs.cubi.bihealth.
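
For illustration, a hedged example of recovering a file from a snapshot, assuming the cluster exposes the standard CephFS `.snap` mechanism (the linked documentation is authoritative; `<group>`, `<snapshot>`, and the file path are placeholders):

```bash
# List the snapshots available for a group directory (names depend on the retention policy).
ls /data/cephfs-2/mirrored/groups/<group>/.snap/

# Copy a single lost file back from a chosen snapshot into the live directory tree.
cp /data/cephfs-2/mirrored/groups/<group>/.snap/<snapshot>/<path/to/file> <path/to/file>
```
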
| 2 | Group mirrored | `/data/cephfs-2/mirrored/groups/<group>` | 4 TB |
| 2 | Group unmirrored | `/data/cephfs-2/unmirrored/groups/<group>` | On request |
| 2 | Project mirrored | `/data/cephfs-2/mirrored/projects/<project>` | On request |
| 2 | Project unmirrored | `/data/cephfs-2/unmirrored/projects/<project>` | Individual |

There are no quotas on the number of files.

The full list of symlinks is:
- HPC web portal cache: `ondemand`
- General purpose caching: `.cache` & `.local`
- Containers: `.singularity` & `.apptainer`
- Programming language libraries & registries: `R`, `.theano`, `.cargo`, `.cpan`, `.cpanm`, & `.npm`
- Others: `.ncbi`, `.vs`
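
A quick way to check where these entries point in your own home directory (a sketch; which entries exist depends on your usage):

```bash
# -d keeps ls from descending into the link targets; adjust the list to the entries you actually have.
ls -ld "$HOME"/.cache "$HOME"/.local "$HOME"/R

# Fully resolve a single entry.
readlink -f "$HOME"/.cache
```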

!!! warning

### Example use cases

Space on Tier 1 is limited.
Your colleagues, other cluster users, and admins will be very grateful if you use it only for files you actively need to perform read/write operations on.
This means main project storage should probably always be on Tier 2 with workflows to stage subsets of data onto Tier 1 for analysis.

These examples are based on our experience of processing diverse NGS datasets.
Your mileage may vary, but there is a basic principle that remains true for all projects.

#### DNA sequencing (WES, WGS)

Typical Whole Genome Sequencing data of a human sample at 100x coverage requires about 150 GB of storage, Whole Exome Sequencing between 6 and 30 GB.
These large files require considerable I/O resources for processing, in particular for the mapping step.
A prudent workflow for this kind of analysis would therefore be the following (a command-line sketch follows the list):

1. For one sample in the cohort, subsample its raw data files (`fastqs`) from the Tier 2 location to Tier 1. [`seqtk`](https://github.com/lh3/seqtk) is your friend!
2. Test, improve & check your processing scripts on those smaller files.
3. Once you are happy with the scripts, copy the complete `fastq` files from Tier 2 to Tier 1. Run your scripts on the whole dataset, and copy the results (`bam` or `cram` files) back to Tier 2.
4. **Remove raw data & bam/cram files from Tier 1**, unless the downstream processing of mapped files (variant calling, structural variants, ...) can be done immediately.
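
A minimal command-line sketch of steps 1-4, assuming `<tier1-work>` stands for your Tier 1 work directory; the `raw/` and `cram/` sub-folders of the Tier 2 project and the mapping script are placeholders for this example:

```bash
# Step 1: stage a small subsample of one sample's raw data onto Tier 1 (the same -s seed keeps read pairs in sync).
mkdir -p <tier1-work>/wgs_test && cd <tier1-work>/wgs_test
seqtk sample -s42 /data/cephfs-2/unmirrored/projects/<project>/raw/sample1_R1.fastq.gz 100000 | gzip > test_R1.fastq.gz
seqtk sample -s42 /data/cephfs-2/unmirrored/projects/<project>/raw/sample1_R2.fastq.gz 100000 | gzip > test_R2.fastq.gz

# Steps 2-3: develop your mapping script on the test files, then copy the full fastqs over and run it for real.
cp /data/cephfs-2/unmirrored/projects/<project>/raw/sample1_R?.fastq.gz .
bash run_mapping.sh            # hypothetical script producing sample1.cram
cp sample1.cram /data/cephfs-2/unmirrored/projects/<project>/cram/

# Step 4: free the Tier 1 space again.
rm sample1_R?.fastq.gz test_R?.fastq.gz sample1.cram
```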

!!! tip
    Don't forget to use your `scratch` area for transient operations, for example to sort your `bam` file after mapping.
    You can find more information on how to efficiently set up your temporary directory [here](../best-practice/temp-files.md).
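
    For example, a minimal sketch, assuming `~/scratch` points to your scratch area:

    ```bash
    # Write the temporary sort chunks to scratch instead of the current working directory.
    mkdir -p "$HOME/scratch/tmp"
    samtools sort -T "$HOME/scratch/tmp/sample1" -@ 8 -o sample1.sorted.bam sample1.bam
    ```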

#### bulk RNA-seq

Analysis of RNA expression datasets is typically a long and iterative process, where the data must remain accessible for a significant period.
However, there is usually no need to keep raw data files and mapping results available once the gene & transcript counts have been generated.
The count files are much smaller than the raw data or the mapped data, so they can live longer on Tier 1.

A typical workflow would be:

1. Copy your `fastq` files from Tier 2 to Tier 1.
2. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1.
3. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example.
4. Save expression levels (`R` objects) and the output of `salmon`, `STAR`, or any mapper/aligner of your choice to Tier 2.
5. **Remove raw data, bam & count files from Tier 1.**
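
A hedged sketch of steps 1, 2, and 4-5 using `salmon` (the index location, sample names, and Tier 2 sub-folders are placeholders):

```bash
# Step 1: stage the raw data onto Tier 1.
mkdir -p <tier1-work>/rnaseq && cd <tier1-work>/rnaseq
cp /data/cephfs-2/mirrored/projects/<project>/fastq/sampleA_R?.fastq.gz .

# Step 2: quantify against a pre-built salmon index, keeping the results on Tier 1 while you work.
salmon quant -i <salmon-index> -l A \
    -1 sampleA_R1.fastq.gz -2 sampleA_R2.fastq.gz \
    -p 8 -o quants/sampleA

# Steps 4-5: archive the quantification output on Tier 2 and remove the raw data from Tier 1.
cp -r quants /data/cephfs-2/mirrored/projects/<project>/
rm sampleA_R?.fastq.gz
```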

!!! tip
    If using `STAR`, don't forget to use your `scratch` area for transient operations.
    You can find more information on how to efficiently set up your temporary directory [here](../best-practice/temp-files.md).

#### scRNA-seq

The analysis workflows of bulk RNA & single-cell datasets are conceptually similar:
large raw files need to be processed once, and only the outcome of the processing (the gene count matrices) is required for downstream analysis.
Therefore, a typical workflow would be:

1. Copy your `fastq` files from Tier 2 to Tier 1.
2. Perform raw data QC (for example with `fastqc`).
3. Get the count matrix, e.g. using `Cell Ranger` or `alevin-fry`, perform count matrix QC, and store the results on Tier 1.
4. **Remove raw data, bam & count files from Tier 1.**
5. Perform downstream analysis with `seurat`, `scanpy`, or `Loupe Browser`.
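
For example, with `Cell Ranger` (a sketch; the transcriptome reference, sample ID, and Tier 2 sub-folders are placeholders):

```bash
# Steps 1 and 3: stage the fastqs on Tier 1 and run cellranger count there.
mkdir -p <tier1-work>/scrnaseq/fastq && cd <tier1-work>/scrnaseq
cp -r /data/cephfs-2/mirrored/projects/<project>/fastq/sampleA fastq/
cellranger count --id=sampleA \
    --transcriptome=<refdata-gex-folder> \
    --fastqs=fastq/sampleA \
    --localcores=16 --localmem=64

# Step 4: keep the filtered count matrix on Tier 2, then clear Tier 1.
mkdir -p /data/cephfs-2/mirrored/projects/<project>/counts/sampleA
cp -r sampleA/outs/filtered_feature_bc_matrix /data/cephfs-2/mirrored/projects/<project>/counts/sampleA/
rm -r fastq/sampleA sampleA
```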

#### Machine learning
