## What is going to happen?
Files on the cluster's main storage `/data/gpfs-1` aka. `/fast` will move to a new file system.
That includes users' home directories, work directories, and work-group directories.
Once files have been moved to their new locations, `/fast` will be retired.

## Why is this happening?
`/fast` is based on high-performance, proprietary hardware (DDN) & file system (GPFS).
The company selling it has terminated support, which also means buying replacement parts will become increasingly difficult.

## The new storage
There are *two* file systems set up to replace `/fast`, named *Tier 1* and *Tier 2* after their difference in I/O speed:

- **Tier 1** is faster than `/fast` ever was, but it only has about 75 % of its usable capacity.
- **Tier 2** is not as fast, but much larger, almost 3 times the current usable capacity.

The **Hot storage** Tier 1 is reserved for large files requiring frequent random access.
Tier 2 (**Warm storage**) should be used for everything else.
Both file systems are based on the open-source, software-defined [Ceph](https://ceph.io/en/) storage platform and differ in the type of drives used.
Tier 1 or Cephfs-1 uses NVMe SSDs and is optimized for performance, while Tier 2 or Cephfs-2 uses traditional hard drives and is optimized for cost.

So these are the three equivalent terminologies currently in use:
- Cephfs-1 = Tier 1 = Hot storage
- Cephfs-2 = Tier 2 = Warm storage

### Snapshots and Mirroring
Snapshots are incremental copies of the state of the data at a particular point in time.
They provide safety against various "Oops, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files.

Depending on the location and Tier, Cephfs utilizes snapshots differently.
Snapshots of some parts of Tier 1 and Tier 2 are also mirrored into a separate fire compartment within the data center to provide an additional layer of security.

| Tier | Location | Path | Retention policy | Mirrored |
|:-----|:-------------------------|:-----------------------------|:--------------------------------|---------:|
User access to the snapshots is documented here: https://hpc-docs.cubi.bihealth.
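
For illustration, a hedged example of recovering a file from a snapshot, assuming the cluster exposes the standard CephFS `.snap` mechanism (the linked documentation is authoritative; `<group>`, `<snapshot>`, and the file path are placeholders):

```bash
# List the snapshots available for a group directory (names depend on the retention policy).
ls /data/cephfs-2/mirrored/groups/<group>/.snap/

# Copy a single lost file back from a chosen snapshot into the live directory tree.
cp /data/cephfs-2/mirrored/groups/<group>/.snap/<snapshot>/<path/to/file> <path/to/file>
```
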
| 2 | Group mirrored | `/data/cephfs-2/mirrored/groups/<group>` | 4 TB |
| 2 | Group unmirrored | `/data/cephfs-2/unmirrored/groups/<group>` | On request |
| 2 | Project mirrored | `/data/cephfs-2/mirrored/projects/<project>` | On request |
| 2 | Project unmirrored | `/data/cephfs-2/unmirrored/projects/<project>` | Individual |

There are no quotas on the number of files.

The full list of symlinks is:
- HPC web portal cache: `ondemand`
- General purpose caching: `.cache` & `.local`
- Containers: `.singularity` & `.apptainer`
- Programming language libraries & registries: `R`, `.theano`, `.cargo`, `.cpan`, `.cpanm`, & `.npm`
- Others: `.ncbi`, `.vs`
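
A quick way to check where these entries point in your own home directory (a sketch; which entries exist depends on your usage):

```bash
# -d keeps ls from descending into the link targets; adjust the list to the entries you actually have.
ls -ld "$HOME"/.cache "$HOME"/.local "$HOME"/R

# Fully resolve a single entry.
readlink -f "$HOME"/.cache
```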

!!! warning

### Example use cases

Space on Tier 1 is limited.
Your colleagues, other cluster users, and admins will be very grateful if you use it only for files you actively need to perform read/write operations on.
This means main project storage should probably always be on Tier 2 with workflows to stage subsets of data onto Tier 1 for analysis.

These examples are based on our experience of processing diverse NGS datasets.
Your mileage may vary, but there is a basic principle that remains true for all projects.

#### DNA sequencing (WES, WGS)

Typical Whole Genome Sequencing data of a human sample at 100x coverage requires about 150 GB of storage, Whole Exome Sequencing between 6 and 30 GB.
These large files require considerable I/O resources for processing, in particular for the mapping step.
A prudent workflow for this kind of analysis would therefore be the following (a command-line sketch follows the list):

1. For one sample in the cohort, subsample its raw data files (`fastqs`) from the Tier 2 location to Tier 1. [`seqtk`](https://github.com/lh3/seqtk) is your friend!
2. Test, improve & check your processing scripts on those smaller files.
3. Once you are happy with the scripts, copy the complete `fastq` files from Tier 2 to Tier 1. Run your scripts on the whole dataset, and copy the results (`bam` or `cram` files) back to Tier 2.
4. **Remove raw data & bam/cram files from Tier 1**, unless the downstream processing of mapped files (variant calling, structural variants, ...) can be done immediately.
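
A minimal command-line sketch of steps 1-4, assuming `<tier1-work>` stands for your Tier 1 work directory; the `raw/` and `cram/` sub-folders of the Tier 2 project and the mapping script are placeholders for this example:

```bash
# Step 1: stage a small subsample of one sample's raw data onto Tier 1 (the same -s seed keeps read pairs in sync).
mkdir -p <tier1-work>/wgs_test && cd <tier1-work>/wgs_test
seqtk sample -s42 /data/cephfs-2/unmirrored/projects/<project>/raw/sample1_R1.fastq.gz 100000 | gzip > test_R1.fastq.gz
seqtk sample -s42 /data/cephfs-2/unmirrored/projects/<project>/raw/sample1_R2.fastq.gz 100000 | gzip > test_R2.fastq.gz

# Steps 2-3: develop your mapping script on the test files, then copy the full fastqs over and run it for real.
cp /data/cephfs-2/unmirrored/projects/<project>/raw/sample1_R?.fastq.gz .
bash run_mapping.sh            # hypothetical script producing sample1.cram
cp sample1.cram /data/cephfs-2/unmirrored/projects/<project>/cram/

# Step 4: free the Tier 1 space again.
rm sample1_R?.fastq.gz test_R?.fastq.gz sample1.cram
```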

!!! tip
    Don't forget to use your `scratch` area for transient operations, for example to sort your `bam` file after mapping.
    You can find more information on how to efficiently set up your temporary directory [here](../best-practice/temp-files.md).
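
    For example, a minimal sketch, assuming `~/scratch` points to your scratch area:

    ```bash
    # Write the temporary sort chunks to scratch instead of the current working directory.
    mkdir -p "$HOME/scratch/tmp"
    samtools sort -T "$HOME/scratch/tmp/sample1" -@ 8 -o sample1.sorted.bam sample1.bam
    ```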

#### bulk RNA-seq

Analysis of RNA expression datasets is typically a long and iterative process, where the data must remain accessible for a significant period.
However, there is usually no need to keep raw data files and mapping results available once the gene & transcript counts have been generated.
The count files are much smaller than the raw data or the mapped data, so they can live longer on Tier 1.

A typical workflow would be:

1. Copy your `fastq` files from Tier 2 to Tier 1.
2. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1.
3. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example.
4. Save expression levels (`R` objects) and the output of `salmon`, `STAR`, or any mapper/aligner of your choice to Tier 2.
5. **Remove raw data, bam & count files from Tier 1.**
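
A hedged sketch of steps 1, 2, and 4-5 using `salmon` (the index location, sample names, and Tier 2 sub-folders are placeholders):

```bash
# Step 1: stage the raw data onto Tier 1.
mkdir -p <tier1-work>/rnaseq && cd <tier1-work>/rnaseq
cp /data/cephfs-2/mirrored/projects/<project>/fastq/sampleA_R?.fastq.gz .

# Step 2: quantify against a pre-built salmon index, keeping the results on Tier 1 while you work.
salmon quant -i <salmon-index> -l A \
    -1 sampleA_R1.fastq.gz -2 sampleA_R2.fastq.gz \
    -p 8 -o quants/sampleA

# Steps 4-5: archive the quantification output on Tier 2 and remove the raw data from Tier 1.
cp -r quants /data/cephfs-2/mirrored/projects/<project>/
rm sampleA_R?.fastq.gz
```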

!!! tip
    If using `STAR`, don't forget to use your `scratch` area for transient operations.
    You can find more information on how to efficiently set up your temporary directory [here](../best-practice/temp-files.md).

#### scRNA-seq

The analysis workflows of bulk RNA & single-cell datasets are conceptually similar:
large raw files need to be processed once, and only the outcome of the processing (the gene count matrices) is required for downstream analysis.
Therefore, a typical workflow would be:

1. Copy your `fastq` files from Tier 2 to Tier 1.
2. Perform raw data QC (for example with `fastqc`).
3. Get the count matrix, e.g. using `Cell Ranger` or `alevin-fry`, perform count matrix QC, and store the results on Tier 1.
4. **Remove raw data, bam & count files from Tier 1.**
5. Perform downstream analysis with `seurat`, `scanpy`, or `Loupe Browser`.
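
For example, with `Cell Ranger` (a sketch; the transcriptome reference, sample ID, and Tier 2 sub-folders are placeholders):

```bash
# Steps 1 and 3: stage the fastqs on Tier 1 and run cellranger count there.
mkdir -p <tier1-work>/scrnaseq/fastq && cd <tier1-work>/scrnaseq
cp -r /data/cephfs-2/mirrored/projects/<project>/fastq/sampleA fastq/
cellranger count --id=sampleA \
    --transcriptome=<refdata-gex-folder> \
    --fastqs=fastq/sampleA \
    --localcores=16 --localmem=64

# Step 4: keep the filtered count matrix on Tier 2, then clear Tier 1.
mkdir -p /data/cephfs-2/mirrored/projects/<project>/counts/sampleA
cp -r sampleA/outs/filtered_feature_bc_matrix /data/cephfs-2/mirrored/projects/<project>/counts/sampleA/
rm -r fastq/sampleA sampleA
```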

#### Machine learning
