diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md
index f8f6f3e3e..b8b978967 100644
--- a/bih-cluster/docs/storage/storage-migration.md
+++ b/bih-cluster/docs/storage/storage-migration.md
@@ -1,22 +1,22 @@
 ## What is going to happen?
-Files on the cluster's main storage `/data/gpfs-1` aka. `/fast` will move to a new filesystem.
-That includes users' home directories, work directories, and workgroup directories.
+Files on the cluster's main storage `/data/gpfs-1`, a.k.a. `/fast`, will move to a new file system.
+That includes users' home directories, work directories, and work-group directories.
 Once files have been moved to their new locations, `/fast` will be retired.
 
 ## Why is this happening?
-`/fast` is based on a high performance proprietary hardware (DDN) & filesystem (GPFS).
+`/fast` is based on high-performance proprietary hardware (DDN) & file system (GPFS).
 The company selling it has terminated support which also means buying replacement parts will become increasingly difficult.
 
 ## The new storage
-There are *two* filesystems set up to replace `/fast`, named *Tier 1* and *Tier 2* after their difference in I/O speed:
+There are *two* file systems set up to replace `/fast`, named *Tier 1* and *Tier 2* after their difference in I/O speed:
 
 - **Tier 1** is faster than `/fast` ever was, but it only has about 75 % of its usable capacity.
 - **Tier 2** is not as fast, but much larger, almost 3 times the current usable capacity.
 
-The **Hot storage** Tier 1 is reserved for large files, requiring frequent random access.
+The **Hot storage** Tier 1 is reserved for large files requiring frequent random access.
 Tier 2 (**Warm storage**) should be used for everything else.
-Both filesystems are based on the open-source, software-defined [Ceph](https://ceph.io/en/) storage platform and differ in the type of drives used.
-Tier 1 or Cephfs-1 uses NVME SSDs and is optimized for performance, Tier 2 or Cephfs-2 used traditional hard drives and is optimised for cost.
+Both file systems are based on the open-source, software-defined [Ceph](https://ceph.io/en/) storage platform and differ in the type of drives used.
+Tier 1 or Cephfs-1 uses NVMe SSDs and is optimized for performance, while Tier 2 or Cephfs-2 uses traditional hard drives and is optimized for cost.
 
 So these are the three terminologies in use right now:
 
 - Cephfs-1 = Tier 1 = Hot storage
@@ -24,10 +24,10 @@ So these are the three terminologies in use right now:
 
 ### Snapshots and Mirroring
 Snapshots are incremental copies of the state of the data at a particular point in time.
-They provide safety against various "Oups, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files.
+They provide safety against various "Oops, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files.
 
-Depending on the location and Tier, Cephfs utilises snapshots in differ differently.
-Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire compartment within the datacenter to provide an additional layer of security.
+Depending on the location and Tier, Cephfs utilizes snapshots differently.
+Snapshots of some Tier 1 and Tier 2 locations are also mirrored into a separate fire compartment within the data center to provide an additional layer of security.
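+
+For example, recovering an accidentally deleted file looks roughly like this. This is a minimal sketch assuming the standard CephFS `.snap` mechanism; the home directory path, snapshot name, and file name are hypothetical placeholders, so check the snapshot access documentation linked below for the authoritative procedure:
+
+```bash
+# Each CephFS directory exposes its snapshots in a hidden ".snap" subdirectory.
+# List the snapshots available for your home directory (hypothetical path):
+ls /data/cephfs-1/home/users/$USER/.snap/
+
+# Recover a deleted file by copying it back out of a snapshot:
+cp /data/cephfs-1/home/users/$USER/.snap/<snapshot-name>/lost_file.txt ~/
+```
+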
 | Tier | Location                  | Path                          | Retention policy                 | Mirrored |
 |:-----|:--------------------------|:------------------------------|:---------------------------------|---------:|
@@ -51,7 +51,7 @@ User access to the snapshots is documented here: https://hpc-docs.cubi.bihealth.
 | 2 | Group mirrored | `/data/cephfs-2/mirrored/groups/` | 4 TB |
 | 2 | Group unmirrored | `/data/cephfs-2/unmirrored/groups/` | On request |
 | 2 | Project mirrored | `/data/cephfs-2/mirrored/projects/` | On request |
-| 2 | Project unmirrored | `/data/cephfs-2/unmirrored/projects/` | On request |
+| 2 | Project unmirrored | `/data/cephfs-2/unmirrored/projects/` | Individual |
 
 There are no quotas on the number of files.
 
@@ -81,7 +81,7 @@ The full list of symlinks is:
 - HPC web portal cache: `ondemand`
 - General purpose caching: `.cache` & `.local`
 - Containers: `.singularity` & `.apptainer`
-- Programming languanges libraries & registry: `R`, `.theano`, `.cargo`, `.cpan`, `.cpanm`, & `.npm`
+- Programming language libraries & registries: `R`, `.theano`, `.cargo`, `.cpan`, `.cpanm`, & `.npm`
 - Others: `.ncbi`, `.vs`
 
 !!! warning
@@ -125,57 +125,57 @@ The full list of symlinks is:
 
 ### Example use cases
+Space on Tier 1 is limited.
+Your colleagues, other cluster users, and admins will be very grateful if you use it only for files you actively need to perform read/write operations on.
+This means main project storage should generally live on Tier 2, with workflows staging subsets of the data onto Tier 1 for analysis.
+
 These examples are based on our experience of processing diverse NGS datasets.
 Your mileage may vary but there is a basic principle that remains true for all projects.
 
-!!! note
-    **Keep on Tier 1 only files in active use.**
-    The space on Tier 1 is limited, your colleagues and other cluster users will be
-    grateful if you remove from Tier 1 the files you don't immediatly need.
-
 #### DNA sequencing (WES, WGS)
-The typical Whole Genome Sequencing of human sample at 100x coverage takes about 150GB storage.
-For Whole Exome Sequencing, the data typically takes between 6 to 30 GB.
-These large files require considerable computing resources for processing, in particular for the mapping step.
-Therefore, for mapping it may be useful to follow a prudent workflow, such as:
+Typical Whole Genome Sequencing data of a human sample at 100x coverage requires about 150 GB of storage; Whole Exome Sequencing data takes between 6 and 30 GB.
+These large files require considerable I/O resources for processing, in particular for the mapping step.
+A prudent workflow for this kind of analysis would therefore be the following (a command-line sketch follows the list):
 
 1. For one sample in the cohort, subsample its raw data files (`fastqs`) from the Tier 2 location to Tier 1.
    [`seqtk`](https://github.com/lh3/seqtk) is your friend!
 2. Test, improve & check your processing scripts on those smaller files.
 3. Once you are happy with the scripts, copy the complete `fastq` files from Tier 2 to Tier 1.
-   Run the your scripts on the whole dataset, and copy the results (`bam` or `cram` files) back to Tier 2.
-4. **Remove the raw data & bam/cram files from Tier 1**, unless the downstream processing of mapped files (variant calling, structural variants, ...) can be done immediatly.
+   Run your scripts on the whole dataset, and copy the results (`bam` or `cram` files) back to Tier 2.
+4. **Remove raw data & bam/cram files from Tier 1**, unless the downstream processing of mapped files (variant calling, structural variants, ...) can be done immediately.
 
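+A minimal command-line sketch of this workflow; the group paths, sample names, and the mapping script are hypothetical placeholders:
+
+```bash
+# Hypothetical Tier 1 / Tier 2 locations of your group.
+TIER2=/data/cephfs-2/unmirrored/groups/my-group/raw_data
+TIER1=/data/cephfs-1/work/groups/my-group/mapping
+mkdir -p "$TIER1"
+
+# Step 1: subsample ~1 million read pairs with seqtk
+# (using the same seed, -s100, keeps the two mate files in sync).
+seqtk sample -s100 "$TIER2/sample1_R1.fastq.gz" 1000000 | gzip > "$TIER1/sub_R1.fastq.gz"
+seqtk sample -s100 "$TIER2/sample1_R2.fastq.gz" 1000000 | gzip > "$TIER1/sub_R2.fastq.gz"
+
+# Step 3: once the scripts work, stage the full files, map, and save the results.
+cp "$TIER2"/sample1_R?.fastq.gz "$TIER1"/
+bash map_sample.sh "$TIER1/sample1_R1.fastq.gz" "$TIER1/sample1_R2.fastq.gz"  # your pipeline
+cp "$TIER1/sample1.cram" "$TIER2"/
+
+# Step 4: free the Tier 1 space.
+rm "$TIER1"/sample1_R?.fastq.gz "$TIER1/sample1.cram"
+```
+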
 !!! tip
-    Don't forget to use your `scratch` area for transient operation, for example to sort your `bam` file after mapping.
-    You find more information how efficiently to set up temporary directory [here](../best-practice/temp-files.md)
+    Don't forget to use your `scratch` area for transient operations, for example to sort your `bam` file after mapping.
+    More information on how to efficiently set up your temporary directory can be found [here](../best-practice/temp-files.md).
 
 #### bulk RNA-seq
 
-Analysis of RNA expression datasets are typically a long and iterative process, where the data must remain accessible for a significant period.
-However, there is usually not need to keep raw data files and mapping results available, once the genes & transcripts counts have been generated.
-The count files are much smaller than the raw data or the mapped data, so they can live longer on Tier 1. A typical workflow would be:
+Analysis of RNA expression datasets is typically a long and iterative process, where the data must remain accessible for a significant period.
+However, there is usually no need to keep raw data files and mapping results available once the gene & transcript counts have been generated.
+The count files are much smaller than the raw data or the mapped data, so they can live longer on Tier 1.
+
+A typical workflow would be:
 
-1. Copy your `fastq` files from Tier 2 to Tier 1,
-2. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1,
-3. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example,
-4. Save the expression levels `R` objects and the output of `salmon`, `STAR` (or any mapper/aligner of your choice) to Tier 2,
-5. **Remove the raw data, bam & count files from Tier 1**
+1. Copy your `fastq` files from Tier 2 to Tier 1.
+2. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1.
+3. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example.
+4. Save expression levels (`R` objects) and the output of `salmon`, `STAR`, or any mapper/aligner of your choice to Tier 2.
+5. **Remove raw data, bam & count files from Tier 1.**
 
 !!! tip
-    If using `STAR`, don't forget to use your `scratch` area for transient operation.
-    You find more information how efficiently to set up temporary directory [here](../best-practice/temp-files.md)
+    If using `STAR`, don't forget to use your `scratch` area for transient operations.
+    More information on how to efficiently set up your temporary directory can be found [here](../best-practice/temp-files.md).
 
 #### scRNA-seq
 
-The analysis workflow of bulk RNA & single cell dataset is conceptually similar:
-the large raw files need to be processed only once, and only the outcome of the processing (the gene counts matrix) is required for downstream analysis.
-Therefore, a typical workflow would be:
+The analysis workflows for bulk RNA & single-cell datasets are conceptually similar:
+Large raw files need to be processed only once, and only the outcome of the processing (gene count matrices) is required for downstream analysis.
+Therefore, a typical workflow would be (a command-line sketch follows the list):
 
-1. Copy your `fastq` files from Tier 2 to Tier 1,
-2. Perform raw data QC (for example with `fastqc`),
-3. Get the count matrix, for example using `Cell Ranger` or `alevin-fry`, perform count matrix QC and store the results on Tier 1,
-4. **Remove the raw data, bam & count files from Tier 1**
-5. Downstream analysis with for example `seurat`, `scanpy` or `Loupe Browser`.
+1. Copy your `fastq` files from Tier 2 to Tier 1.
+2. Perform raw data QC (for example with `fastqc`).
+3. Get the count matrix, e.g. using `Cell Ranger` or `alevin-fry`, perform count matrix QC, and store the results on Tier 1.
+4. Copy the count matrices to Tier 2, then **remove raw data, bam & count files from Tier 1.**
+5. Downstream analysis with `seurat`, `scanpy`, or `Loupe Browser`.
 
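+A minimal sketch of steps 1-4; all paths, sample names, and the reference transcriptome location are hypothetical, and `cellranger` flags may differ between versions:
+
+```bash
+# Hypothetical Tier 1 / Tier 2 locations of your group.
+TIER2=/data/cephfs-2/unmirrored/groups/my-group/scrnaseq
+TIER1=/data/cephfs-1/work/groups/my-group/scrnaseq
+
+# Step 1: stage the raw data onto Tier 1.
+mkdir -p "$TIER1/fastq" "$TIER1/qc"
+cp "$TIER2"/fastq/sample1_* "$TIER1/fastq/"
+
+# Step 2: raw data QC.
+fastqc --outdir "$TIER1/qc" "$TIER1"/fastq/*.fastq.gz
+
+# Step 3: generate the count matrix on Tier 1 (output lands in ./sample1/).
+cd "$TIER1"
+cellranger count --id=sample1 \
+    --transcriptome=/path/to/refdata-gex \
+    --fastqs="$TIER1/fastq" \
+    --sample=sample1
+
+# Step 4: keep the count matrix on Tier 2, then free the Tier 1 space.
+mkdir -p "$TIER2/counts"
+cp -r "$TIER1/sample1/outs/filtered_feature_bc_matrix" "$TIER2/counts/sample1"
+rm -r "$TIER1/fastq" "$TIER1/sample1"
+```
+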
#### Machine learning