From df8ba58af4efc7aab9d9594039349c4fbe595866 Mon Sep 17 00:00:00 2001 From: Thomas Sell Date: Thu, 29 Feb 2024 18:12:58 +0100 Subject: [PATCH 01/12] Introduce Ceph storage and migration plan --- bih-cluster/docs/storage/storage-migration.md | 175 ++++++++++++++++++ bih-cluster/mkdocs.yml | 1 + 2 files changed, 176 insertions(+) create mode 100644 bih-cluster/docs/storage/storage-migration.md diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md new file mode 100644 index 000000000..564338391 --- /dev/null +++ b/bih-cluster/docs/storage/storage-migration.md @@ -0,0 +1,175 @@ +## What is going to happen? +Files on the cluster's main storage (`/data/gpfs-1` aka. `/fast`) will move to a new filesystem. +That includes users' home directories, work directories, and workgroup directories. +Once files have been moved to their new locations, `/fast` will be retired. + +## Why is this happening? +`/fast` is based on a high performance proprietary hardware (DDN) & filesystem (GPFS). +The company selling it has terminated support which also means buying replacement parts will become increasingly difficult. + +## The new storage +There are *two* filesystems set up to replace `/fast`, named *Tier 1* and *Tier 2* after their difference in I/O speed: + +- **Tier 1** is faster than `/fast` ever was, but it only has about 75 % of its usable capacity. +- **Tier 2** is not as fast, but much larger, almost 3 times the current usable capacity. + +The **Hot storage** Tier 1 is reserved for large files, requiring frequent random access. +Tier 2 (**Warm storage**) should be used for everything else. +Both filesystems are based on the open-source, software-defined [Ceph](https://ceph.io/en/) storage platform and differ in the type of drives used. +Tier 1 or Cephfs-1 uses NVME SSDs and is optimized for performance, Tier 2 or Cephfs-2 used traditional hard drives and is optimised for cost. + +So these are the three terminologies in use right now: +- Cephfs-1 = Tier 1 = Hot storage +- Cephfs-2 = Tier 2 = Warm storage + +### Snapshots and Mirroring +Snapshots are incremental copies of the state of the data at a particular point in time. +They provide safety against various "Oups, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files. + +Depending on the location and Tier, Cephfs utilises snapshots in differ differently. +Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire compartment within the datacenter to provide an additional layer of security. 
+ +| Tier | Location | Path | Retention policy | Mirrored | +|-----:|:-------------------------|:-----------------------------|:--------------------------------|:---------| +| 1 | User homes | `/data/cephfs-1/home/users/` | Hourly for 48 h, daily for 14 d | yes | +| 1 | Group/project work | `/data/cephfs-1/work/` | Four times a day, daily for 5 d | yes | +| 1 | Group/project scratch | `/data/cephfs-1/scratch/` | Daily for 3 d | no | +| 2 | Group/project mirrored | `/data/cephfs-2/mirrored/` | Daily for 30 d, weekly for 16 w | yes | +| 2 | Group/project unmirrored | `/data/cephfs-2/unmirrored/` | Daily for 30 d, weekly for 16 w | no | + +User access to the snapshots is documented here: https://hpc-docs.cubi.bihealth.org/storage/accessing-snapshots + +### Quotas + +| Tier | Function | Path | Default Quota | +|-----:|:---------|:-----|--------------:| +| 1 | User home | `/data/cephfs-1/home/users/` | 1 GB | +| 1 | Group work | `/data/cephfs-1/work/groups/` | 1 TB | +| 1 | Group scratch | `/data/cephfs-1/scratch/groups/` | 10 TB | +| 1 | Projects work | `/data/cephfs-1/work/projects/` | individual | +| 1 | Projects scratch | `/data/cephfs-1/scratch/projects/` | individual | +| 2 | Group mirrored | `/data/cephfs-2/mirrored/groups/` | 4 TB | +| 2 | Group unmirrored | `/data/cephfs-2/unmirrored/groups/` | On request | +| 2 | Project mirrored | `/data/cephfs-2/mirrored/projects/` | On request | +| 2 | Project unmirrored | `/data/cephfs-2/unmirrored/projects/` | On request | + +There are no quotas on the number of files. + +## New file locations +Naturally, paths are going to change after files move to their new location. +Due to the increase in storage quality options, there will be some more more folders to consider. + +### Users +- Home on Tier 1: `/data/cephfs-1/home/users/` +- Work on Tier 1: `/data/cephfs-1/work/groups//users/` +- Scratch on Tier 1: `/data/cephfs-1/scratch/groups//users/` + +[!IMPORTANT] +User work & scratch spaces are now part of the user's group folder. +This means, groups should coordinate internally to distribute their allotted quota evenly among users. + +The implementation is done _via_ symlinks created by default when the user account is moved to its new destination. + +| Symlink named | Points to | +|:--------------|:----------| +| `/data/cephfs-1/home/users//work` | `/data/cephfs-1/work/groups//users/` | +| `/data/cephfs-1/home/users//scratch` | `/data/cephfs-1/scratch/groups//users/` | + +Additional symlinks are created from the user's home directory to avoid storing large files (R packages for example) in their home. +The full list of symlinks is: + +- HPC web portal cache: `ondemand` +- General purpose caching: `.cache` & `.local` +- Containers: `.singularity` & `.apptainer` +- Programming languanges libraries & registry: `R`, `.theano`, `.cargo`, `.cpan`, `.cpanm`, & `.npm` +- Others: `.ncbi`, `.vs` + +[!IMPORTANT] +Automatic symlink creation will **not create a symlink to any conda installation**. + +### Groups +- Work on Tier 1: `/data/cephfs-1/work/groups/` +- Scratch on Tier 1: `/data/cephfs-1/scratch/groups/` +- Mirrored work on Tier 2: `/data/cephfs-2/mirrored/groups/` + +[!NOTE] +Un-mirrored work space on Tier 2 is available on request. + +### Projects +- Work on Tier 1: `/data/cephfs-1/work/projects/` +- Scratch on Tier 1: `/data/cephfs-1/scratch/projects/` + +[!NOTE] +Tier 2 work space (mirrored & un-mirrored) is available on request. + +## Recommended practices +### Data locations +#### Tiers +- Tier 1: Designed for many I/O operations. 
Store files here which are actively used by your compute jobs. +- Tier 2: Big, cheap storage. Fill with files not in active use. +- Tier 2 mirrored: Extra layer of security. Longer term storage of invaluable data. + +#### Folders +- Home: Persistent storage for configuration files, templates, generic scripts, & small documents. +- Work: Persistent storage for conda environments, R packages, data actively processed or analyzed. +- Scratch: Non-persistent storage for temporary or intermediate files, caches, etc. Automatically deleted after 14 days. + +### Project life cycle +1. Import the raw data on Tier 2 for validation (checksums, …) +2. Stage raw data on Tier 1 for QC & processing. +3. Save processing results to Tier 2 and validate copies. +4. Continue analysis on Tier 1. +5. Save analysis results on Tier 2 and validate copies. +6. Reports & publications can remain on Tier 2. +7. After publication (or the end of the project), files on Tier 1 can be deleted. + +### Example use cases +#### DNA sequencing (WES, WGS) + +#### bulk RNA-seq + +#### scRNA-seq + +#### Machine learning + +## Data migration process from old `fast/` to CephFS +1. Administrative preparations + 1. HPC-Access registration (PIs will receive in invite mail) + 2. PIs need to assign a delegate. + 3. PI/Delegate needs to add group and the projects once. + 4. New Tier 1 & 2 resources will be allocated. +2. Users & group home directories are moved by HPC admin (big bang). +3. All directories on `/fast` set to read-only, that is: + - `/fast/home/users/` & `/fast/work/users/` + - `/fast/home/groups/` & `/fast/work/groups/` +4. Work data migration is done by the users (Tier 2 is primary target, Tier 1 staging when needed). + +Best practice and/or tools will be provided. + +[!NOTE] +The users' `work` space will be moved to the group's `work` space. + +## Technical details on the new infrastructure +### Tier 1 +- Fast & expensive (flash drive based), mounted on `/data/cephfs-1` +- Currently 12 Nodes with 10 × 14 TB NVME/SSD each installed + - 1.68 PB raw storage + - 1.45 PB erasure coded (EC 8:2) + - 1.23 PB usable (85 %, ceph performance limit) +- For typical CUBI use case 3 to 5 times faster I/O then the old DDN +- Two more nodes in purchasing process +- Example of flexible extension: + - Chunk size: 45 kE for one node with 150 TB, i.e. ca. 300 E/TB + +### Tier 2 +- Slower but more affordable (spinning HDDs), mounted on `/data/cephfs-2` +- Currently ten nodes with 52 HDDs slots plus SSD cache installed, per node ca. 40 HDDs with 16 to 18 TB filled, i.e. + - 6.6 PB raw + - 5.3 PB erasure coded (EC 8:2) + - 4.5 PB usable (85 %; Ceph performance limit) +- Nine more nodes in purchasing process with 5+ PB +- Very Flexible Extension possible: + - ca. 
50 Euro per TB, 100 Euro mirrored, starting at small chunk sizes + +### Tier 2 mirror +Similar hardware and size duplicate (another 10 nodes, 6+ PB) in separate fire compartment diff --git a/bih-cluster/mkdocs.yml b/bih-cluster/mkdocs.yml index a7cea208c..9fe075a2c 100644 --- a/bih-cluster/mkdocs.yml +++ b/bih-cluster/mkdocs.yml @@ -105,6 +105,7 @@ nav: - "Episode 3": first-steps/episode-3.md - "Episode 4": first-steps/episode-4.md - "Storage": + - "Storage Migration": storage/storage-migration.md - "Accessing Snapshots": storage/accessing-snapshots.md - "Querying Quotas": storage/querying-storage.md - "Storage Locations": storage/storage-locations.md From 87832d9b0c992382df77297b562a9952bb3274a0 Mon Sep 17 00:00:00 2001 From: Thomas Sell Date: Thu, 29 Feb 2024 18:28:29 +0100 Subject: [PATCH 02/12] consistency and styling --- bih-cluster/docs/storage/storage-migration.md | 34 +++++++++---------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md index 564338391..0df903b83 100644 --- a/bih-cluster/docs/storage/storage-migration.md +++ b/bih-cluster/docs/storage/storage-migration.md @@ -1,5 +1,5 @@ ## What is going to happen? -Files on the cluster's main storage (`/data/gpfs-1` aka. `/fast`) will move to a new filesystem. +Files on the cluster's main storage `/data/gpfs-1` aka. `/fast` will move to a new filesystem. That includes users' home directories, work directories, and workgroup directories. Once files have been moved to their new locations, `/fast` will be retired. @@ -30,7 +30,7 @@ Depending on the location and Tier, Cephfs utilises snapshots in differ differen Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire compartment within the datacenter to provide an additional layer of security. | Tier | Location | Path | Retention policy | Mirrored | -|-----:|:-------------------------|:-----------------------------|:--------------------------------|:---------| +|:-----|:-------------------------|:-----------------------------|:--------------------------------|---------:| | 1 | User homes | `/data/cephfs-1/home/users/` | Hourly for 48 h, daily for 14 d | yes | | 1 | Group/project work | `/data/cephfs-1/work/` | Four times a day, daily for 5 d | yes | | 1 | Group/project scratch | `/data/cephfs-1/scratch/` | Daily for 3 d | no | @@ -42,8 +42,8 @@ User access to the snapshots is documented here: https://hpc-docs.cubi.bihealth. ### Quotas | Tier | Function | Path | Default Quota | -|-----:|:---------|:-----|--------------:| -| 1 | User home | `/data/cephfs-1/home/users/` | 1 GB | +|:-----|:---------|:-----|--------------:| +| 1 | User home | `/data/cephfs-1/home/users/` | 1 GB | | 1 | Group work | `/data/cephfs-1/work/groups/` | 1 TB | | 1 | Group scratch | `/data/cephfs-1/scratch/groups/` | 10 TB | | 1 | Projects work | `/data/cephfs-1/work/projects/` | individual | @@ -64,9 +64,9 @@ Due to the increase in storage quality options, there will be some more more fol - Work on Tier 1: `/data/cephfs-1/work/groups//users/` - Scratch on Tier 1: `/data/cephfs-1/scratch/groups//users/` -[!IMPORTANT] -User work & scratch spaces are now part of the user's group folder. -This means, groups should coordinate internally to distribute their allotted quota evenly among users. +!!! warning + User work & scratch spaces are now part of the user's group folder. + This means, groups should coordinate internally to distribute their allotted quota evenly among users. 
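To help with that coordination, a quick look at how the group's allotment is currently used can be taken with standard tools. This is only a sketch: the group name is a placeholder, and whether the cluster exposes CephFS recursive statistics to regular users is an assumption.

```sh
# Per-user usage inside the group work folder (portable, but slow on large trees)
du -sh /data/cephfs-1/work/groups/<group>/users/*

# If CephFS recursive accounting is exposed, this reports the folder's total size instantly
getfattr -n ceph.dir.rbytes /data/cephfs-1/work/groups/<group>
```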
The implementation is done _via_ symlinks created by default when the user account is moved to its new destination. @@ -84,23 +84,23 @@ The full list of symlinks is: - Programming languanges libraries & registry: `R`, `.theano`, `.cargo`, `.cpan`, `.cpanm`, & `.npm` - Others: `.ncbi`, `.vs` -[!IMPORTANT] -Automatic symlink creation will **not create a symlink to any conda installation**. +!!! warning + Automatic symlink creation will not create a symlink to any conda installation. ### Groups - Work on Tier 1: `/data/cephfs-1/work/groups/` - Scratch on Tier 1: `/data/cephfs-1/scratch/groups/` - Mirrored work on Tier 2: `/data/cephfs-2/mirrored/groups/` -[!NOTE] -Un-mirrored work space on Tier 2 is available on request. +!!! note + Un-mirrored work space on Tier 2 is available on request. ### Projects - Work on Tier 1: `/data/cephfs-1/work/projects/` - Scratch on Tier 1: `/data/cephfs-1/scratch/projects/` -[!NOTE] -Tier 2 work space (mirrored & un-mirrored) is available on request. +!!! note + Tier 2 work space (mirrored & un-mirrored) is available on request. ## Recommended practices ### Data locations @@ -132,7 +132,7 @@ Tier 2 work space (mirrored & un-mirrored) is available on request. #### Machine learning -## Data migration process from old `fast/` to CephFS +## Data migration process from old `/fast` to CephFS 1. Administrative preparations 1. HPC-Access registration (PIs will receive in invite mail) 2. PIs need to assign a delegate. @@ -146,10 +146,10 @@ Tier 2 work space (mirrored & un-mirrored) is available on request. Best practice and/or tools will be provided. -[!NOTE] -The users' `work` space will be moved to the group's `work` space. +!!! note + The users' `work` space will be moved to the group's `work` space. -## Technical details on the new infrastructure +## Technical details about the new infrastructure ### Tier 1 - Fast & expensive (flash drive based), mounted on `/data/cephfs-1` - Currently 12 Nodes with 10 × 14 TB NVME/SSD each installed From 367d274c6cf4a452090c4db589eb9488213484f6 Mon Sep 17 00:00:00 2001 From: Eric Blanc Date: Mon, 4 Mar 2024 17:34:31 +0100 Subject: [PATCH 03/12] docs: NGS examples (no machine learning yet) --- bih-cluster/docs/storage/storage-migration.md | 47 +++++++++++++++++++ 1 file changed, 47 insertions(+) diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md index 0df903b83..4cf93ff4d 100644 --- a/bih-cluster/docs/storage/storage-migration.md +++ b/bih-cluster/docs/storage/storage-migration.md @@ -124,12 +124,59 @@ The full list of symlinks is: 7. After publication (or the end of the project), files on Tier 1 can be deleted. ### Example use cases + +These examples are based on our experience of processing diverse NGS datasets. +Your mileage may vary but there is a basic principle that remains true for all projects. + +!!! note + **Keep on Tier 1 only files in active use.** + The space on Tier 1 is limited, your colleagues and other cluster users will be + grateful if you remove from Tier 1 the files you don't immediatly need. + #### DNA sequencing (WES, WGS) +The typical Whole Genome Sequencing of human sample at 100x coverage takes about 150GB storage. +For Whole Exome Sequencing, the data typically takes between 6 to 30 GB. +These large files require considerable computing resources for processing, in particular for the mapping step. +Therefore, for mapping it may be useful to follow a prudent workflow, such as: + +1. 
For one sample in the cohort, subsample its raw data files (`fastqs`) from the Tier 2 location to Tier 1. [`seqtk`](https://github.com/lh3/seqtk) is your friend! +2. Test, improve & check your processing scripts on those smaller files. +3. Once you are happy with the scripts, copy the complete `fastq` files from Tier 2 to Tier 1. Run the your scripts on the whole dataset, and copy the results (`bam` or `cram` files) back to Tier 2. +4. **Remove the raw data & bam/cram files from Tier 1**, unless the downstream processing of mapped files (variant calling, structural variants, ...) can be done immediatly. + +!!! tip + Don't forget to use your `scratch` area for transient operation, for example to sort your `bam` file after mapping. + You find more information how efficiently to set up temporary directory [here](../best-practice/temp-files.md) + #### bulk RNA-seq +Analysis of RNA expression datasets are typically a long and iterative process, where the data must remain accessible for a significant period. +However, there is usually not need to keep raw data files and mapping results available, once the genes & transcripts counts have been generated. +The count files are much smaller than the raw data or the mapped data, so they can live longer on Tier 1. A typical workflow would be: + +1. Copy your `fastq` files from Tier 2 to Tier 1, +2. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1, +3. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example, +4. Save the expression levels `R` objects and the output of `salmon`, `STAR` (or any mapper/aligner of your choice) to Tier 2, +5. **Remove the raw data, bam & count files from Tier 1** + +!!! tip + If using `STAR`, don't forget to use your `scratch` area for transient operation. + You find more information how efficiently to set up temporary directory [here](../best-practice/temp-files.md) + #### scRNA-seq +The analysis workflow of bulk RNA & single cell dataset is conceptually similar: +the large raw files need to be processed only once, and only the outcome of the processing (the gene counts matrix) is required for downstream analysis. +Therefore, a typical workflow would be: + +1. Copy your `fastq` files from Tier 2 to Tier 1, +2. Perform raw data QC (for example with `fastqc`), +3. Get the count matrix, for example using `Cell Ranger` or `alevin-fry`, perform count matrix QC and store the results on Tier 1, +4. **Remove the raw data, bam & count files from Tier 1** +5. Downstream analysis with for example `seurat`, `scanpy` or `Loupe Browser`. 
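All three workflows above follow the same stage-in, process, stage-out pattern. The sketch below illustrates it with `rsync`; the project name `my-project` and the use of un-mirrored Tier 2 project space are assumptions, so adapt the paths to your own allocation.

```sh
# Hypothetical project paths; adjust to your own group or project allocation
TIER1=/data/cephfs-1/work/projects/my-project
TIER2=/data/cephfs-2/unmirrored/projects/my-project

# 1. Stage the raw data onto the fast Tier 1 storage
rsync -av "$TIER2/raw/" "$TIER1/raw/"

# 2. Run the processing (Slurm jobs, Snakemake, ...) writing into $TIER1/results

# 3. Copy the results back to Tier 2; -c compares checksums instead of timestamps
rsync -avc "$TIER1/results/" "$TIER2/results/"

# 4. Remove the staged copies from Tier 1 once the transfer has been verified
rm -rf "$TIER1/raw" "$TIER1/results"
```

The same pattern scales down to a single subsampled test file and up to a whole cohort for the production run.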
+ #### Machine learning ## Data migration process from old `/fast` to CephFS From 7bb0f2da688e234c51ad91637ab90ae1317bcf02 Mon Sep 17 00:00:00 2001 From: Thomas Sell Date: Fri, 8 Mar 2024 19:43:34 +0100 Subject: [PATCH 04/12] tier 1 work is not mirrored --- bih-cluster/docs/storage/storage-migration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md index 4cf93ff4d..f8f6f3e3e 100644 --- a/bih-cluster/docs/storage/storage-migration.md +++ b/bih-cluster/docs/storage/storage-migration.md @@ -32,7 +32,7 @@ Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire | Tier | Location | Path | Retention policy | Mirrored | |:-----|:-------------------------|:-----------------------------|:--------------------------------|---------:| | 1 | User homes | `/data/cephfs-1/home/users/` | Hourly for 48 h, daily for 14 d | yes | -| 1 | Group/project work | `/data/cephfs-1/work/` | Four times a day, daily for 5 d | yes | +| 1 | Group/project work | `/data/cephfs-1/work/` | Four times a day, daily for 5 d | no | | 1 | Group/project scratch | `/data/cephfs-1/scratch/` | Daily for 3 d | no | | 2 | Group/project mirrored | `/data/cephfs-2/mirrored/` | Daily for 30 d, weekly for 16 w | yes | | 2 | Group/project unmirrored | `/data/cephfs-2/unmirrored/` | Daily for 30 d, weekly for 16 w | no | From 07aff876b1f76d4223639cf0cb7fe3b2d35deb38 Mon Sep 17 00:00:00 2001 From: Thomas Sell Date: Fri, 8 Mar 2024 20:55:05 +0100 Subject: [PATCH 05/12] typography & style pass --- bih-cluster/docs/storage/storage-migration.md | 76 +++++++++---------- 1 file changed, 38 insertions(+), 38 deletions(-) diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md index f8f6f3e3e..b8b978967 100644 --- a/bih-cluster/docs/storage/storage-migration.md +++ b/bih-cluster/docs/storage/storage-migration.md @@ -1,22 +1,22 @@ ## What is going to happen? -Files on the cluster's main storage `/data/gpfs-1` aka. `/fast` will move to a new filesystem. -That includes users' home directories, work directories, and workgroup directories. +Files on the cluster's main storage `/data/gpfs-1` aka. `/fast` will move to a new file system. +That includes users' home directories, work directories, and work-group directories. Once files have been moved to their new locations, `/fast` will be retired. ## Why is this happening? -`/fast` is based on a high performance proprietary hardware (DDN) & filesystem (GPFS). +`/fast` is based on a high performance proprietary hardware (DDN) & file system (GPFS). The company selling it has terminated support which also means buying replacement parts will become increasingly difficult. ## The new storage -There are *two* filesystems set up to replace `/fast`, named *Tier 1* and *Tier 2* after their difference in I/O speed: +There are *two* file systems set up to replace `/fast`, named *Tier 1* and *Tier 2* after their difference in I/O speed: - **Tier 1** is faster than `/fast` ever was, but it only has about 75 % of its usable capacity. - **Tier 2** is not as fast, but much larger, almost 3 times the current usable capacity. The **Hot storage** Tier 1 is reserved for large files, requiring frequent random access. Tier 2 (**Warm storage**) should be used for everything else. -Both filesystems are based on the open-source, software-defined [Ceph](https://ceph.io/en/) storage platform and differ in the type of drives used. 
-Tier 1 or Cephfs-1 uses NVME SSDs and is optimized for performance, Tier 2 or Cephfs-2 used traditional hard drives and is optimised for cost. +Both file systems are based on the open-source, software-defined [Ceph](https://ceph.io/en/) storage platform and differ in the type of drives used. +Tier 1 or Cephfs-1 uses NVME SSDs and is optimized for performance, Tier 2 or Cephfs-2 used traditional hard drives and is optimized for cost. So these are the three terminologies in use right now: - Cephfs-1 = Tier 1 = Hot storage @@ -24,10 +24,10 @@ So these are the three terminologies in use right now: ### Snapshots and Mirroring Snapshots are incremental copies of the state of the data at a particular point in time. -They provide safety against various "Oups, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files. +They provide safety against various "Ops, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files. -Depending on the location and Tier, Cephfs utilises snapshots in differ differently. -Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire compartment within the datacenter to provide an additional layer of security. +Depending on the location and Tier, Cephfs utilizes snapshots in differ differently. +Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire compartment within the data center to provide an additional layer of security. | Tier | Location | Path | Retention policy | Mirrored | |:-----|:-------------------------|:-----------------------------|:--------------------------------|---------:| @@ -51,7 +51,7 @@ User access to the snapshots is documented here: https://hpc-docs.cubi.bihealth. | 2 | Group mirrored | `/data/cephfs-2/mirrored/groups/` | 4 TB | | 2 | Group unmirrored | `/data/cephfs-2/unmirrored/groups/` | On request | | 2 | Project mirrored | `/data/cephfs-2/mirrored/projects/` | On request | -| 2 | Project unmirrored | `/data/cephfs-2/unmirrored/projects/` | On request | +| 2 | Project unmirrored | `/data/cephfs-2/unmirrored/projects/` | individual | There are no quotas on the number of files. @@ -81,7 +81,7 @@ The full list of symlinks is: - HPC web portal cache: `ondemand` - General purpose caching: `.cache` & `.local` - Containers: `.singularity` & `.apptainer` -- Programming languanges libraries & registry: `R`, `.theano`, `.cargo`, `.cpan`, `.cpanm`, & `.npm` +- Programming languages libraries & registry: `R`, `.theano`, `.cargo`, `.cpan`, `.cpanm`, & `.npm` - Others: `.ncbi`, `.vs` !!! warning @@ -125,57 +125,57 @@ The full list of symlinks is: ### Example use cases +Space on Tier 1 is limited. +Your colleagues, other cluster users, and admins will be very grateful if you use it only for files you actively need to perform read/write operations on. +This means main project storage should probably always be on Tier 2 with workflows to stage subsets of data onto Tier 1 for analysis. + These examples are based on our experience of processing diverse NGS datasets. Your mileage may vary but there is a basic principle that remains true for all projects. -!!! note - **Keep on Tier 1 only files in active use.** - The space on Tier 1 is limited, your colleagues and other cluster users will be - grateful if you remove from Tier 1 the files you don't immediatly need. - #### DNA sequencing (WES, WGS) -The typical Whole Genome Sequencing of human sample at 100x coverage takes about 150GB storage. 
-For Whole Exome Sequencing, the data typically takes between 6 to 30 GB. -These large files require considerable computing resources for processing, in particular for the mapping step. -Therefore, for mapping it may be useful to follow a prudent workflow, such as: +Typical Whole Genome Sequencing data of a human sample at 100x coverage requires about 150 GB of storage, Whole Exome Sequencing between 6 and 30 GB. +These large files require considerable I/O resources for processing, in particular for the mapping step. +A prudent workflow for these kind of analysis would therefore be the following: 1. For one sample in the cohort, subsample its raw data files (`fastqs`) from the Tier 2 location to Tier 1. [`seqtk`](https://github.com/lh3/seqtk) is your friend! 2. Test, improve & check your processing scripts on those smaller files. 3. Once you are happy with the scripts, copy the complete `fastq` files from Tier 2 to Tier 1. Run the your scripts on the whole dataset, and copy the results (`bam` or `cram` files) back to Tier 2. -4. **Remove the raw data & bam/cram files from Tier 1**, unless the downstream processing of mapped files (variant calling, structural variants, ...) can be done immediatly. +4. **Remove raw data & bam/cram files from Tier 1**, unless the downstream processing of mapped files (variant calling, structural variants, ...) can be done immediatly. !!! tip - Don't forget to use your `scratch` area for transient operation, for example to sort your `bam` file after mapping. - You find more information how efficiently to set up temporary directory [here](../best-practice/temp-files.md) + Don't forget to use your `scratch` area for transient operations, for example to sort your `bam` file after mapping. + More information on how to efficiently set up your temporary directory [here](../best-practice/temp-files.md). #### bulk RNA-seq Analysis of RNA expression datasets are typically a long and iterative process, where the data must remain accessible for a significant period. -However, there is usually not need to keep raw data files and mapping results available, once the genes & transcripts counts have been generated. -The count files are much smaller than the raw data or the mapped data, so they can live longer on Tier 1. A typical workflow would be: +However, there is usually no need to keep raw data files and mapping results available once the gene & transcripts counts have been generated. +The count files are much smaller than the raw data or the mapped data, so they can live longer on Tier 1. + +A typical workflow would be: -1. Copy your `fastq` files from Tier 2 to Tier 1, -2. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1, -3. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example, -4. Save the expression levels `R` objects and the output of `salmon`, `STAR` (or any mapper/aligner of your choice) to Tier 2, -5. **Remove the raw data, bam & count files from Tier 1** +1. Copy your `fastq` files from Tier 2 to Tier 1. +2. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1. +3. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example. +4. Save expression levels (`R` objects) and the output of `salmon`, `STAR`, or any mapper/aligner of your choice to Tier 2. +5. **Remove raw data, bam & count files from Tier 1.** !!! 
tip - If using `STAR`, don't forget to use your `scratch` area for transient operation. - You find more information how efficiently to set up temporary directory [here](../best-practice/temp-files.md) + If using `STAR`, don't forget to use your `scratch` area for transient operations. + More information on how to efficiently set up your temporary directory [here](../best-practice/temp-files.md) #### scRNA-seq The analysis workflow of bulk RNA & single cell dataset is conceptually similar: -the large raw files need to be processed only once, and only the outcome of the processing (the gene counts matrix) is required for downstream analysis. +Large raw files need to be processed once and only the outcome of the processing (gene counts matrices) are required for downstream analysis. Therefore, a typical workflow would be: -1. Copy your `fastq` files from Tier 2 to Tier 1, -2. Perform raw data QC (for example with `fastqc`), -3. Get the count matrix, for example using `Cell Ranger` or `alevin-fry`, perform count matrix QC and store the results on Tier 1, -4. **Remove the raw data, bam & count files from Tier 1** -5. Downstream analysis with for example `seurat`, `scanpy` or `Loupe Browser`. +1. Copy your `fastq` files from Tier 2 to Tier 1. +2. Perform raw data QC (for example with `fastqc`). +3. Get the count matrix, e. g. using `Cell Ranger` or `alevin-fry`, perform count matrix QC and store the results on Tier 1. +4. **Remove raw data, bam & count files from Tier 1.** +5. Downstream analysis with `seurat`, `scanpy`, or `Loupe Browser`. #### Machine learning From d3884f6ecf7a51889fa1c9b0dad055a744fd96d8 Mon Sep 17 00:00:00 2001 From: Thomas Sell Date: Sun, 10 Mar 2024 21:39:05 +0100 Subject: [PATCH 06/12] separated introdution to file system from migration doc --- bih-cluster/docs/storage/storage-locations.md | 257 ++++++++---------- bih-cluster/docs/storage/storage-migration.md | 69 +---- bih-cluster/mkdocs.yml | 4 +- 3 files changed, 123 insertions(+), 207 deletions(-) diff --git a/bih-cluster/docs/storage/storage-locations.md b/bih-cluster/docs/storage/storage-locations.md index e9d1c7028..7d30c89c6 100644 --- a/bih-cluster/docs/storage/storage-locations.md +++ b/bih-cluster/docs/storage/storage-locations.md @@ -1,152 +1,123 @@ # Storage and Volumes: Locations - -On the BIH HPC cluster, there are three kinds of entities: users, groups (*Arbeitsgruppen*), and projects. -Each user, group, and project has a central folder for their files to be stored. - -## For the Impatient - -### Storage Locations - -Each user, group, and project directory consists of three locations (using `/fast/users/muster_c` as an example here): - -- `/fast/users/muster_c/work`: - Here, you put your large data that you need to keep. - Note that there is no backup or snapshots going on. -- `/fast/users/muster_c/scratch`: - Here, you put your large temporary files that you will delete after a short time anyway. - **Data placed here will be automatically removed 2 weeks after last modification.** -- `/fast/users/muster_c` (and all other sub directories): - Here you put your programs and scripts and very important small data. - By default, you will have a soft quota of 1GB (hard quota of 1.5GB, 7 days grace period). - However, we create snapshots of this data (every 24 hours) and this data goes to a backup. 
- -You can check your current usage using the command `bih-gpfs-report-quota user $USER` - -### Do's and Don'ts - -First and foremost: - -- **DO NOT place any valuable data in `scratch` as it will be removed within 2 weeks.** - -Further: - -- **DO** set your `TMPDIR` environment variable to `/fast/users/$USER/scratch/tmp`. -- **DO** add `mkdir -p /fast/users/$USER/scratch/tmp` to your `~/.bashrc` and job script files. -- **DO** try to prefer creating fewer large files over many small files. -- **DO NOT** create multiple copies of large data. - For sequencing data, in most cases you should not need more than raw times the size of the raw data (raw data + alignments + derived results). - -## Introduction - -This document describes the third iteration of the file system structure on the BIH HPC cluster. -This iteration was made necessary by problems with second iteration which worked well for about two years but is now reaching its limits. +This document describes the forth iteration of the file system structure on the BIH HPC cluster. +It was made necessary because the previous file system was no longer supported by the manufacturer and we since switched to distributed [Ceph](https://ceph.io/en/) storage. +For now, the third-generation file system is still mounted at `/fast`. ## Organizational Entities - There are the following three entities on the cluster: -1. normal user accounts ("natural people") -2. groups *(Arbeitsgruppen)* with on leader and an optional delegate -3. projects with one owner and an optional delegate. - -Their purpose is described in the document "User and Group Management". - -## Storage/Data Tiers - -The files fall into one of three categories: - -1. **Home** data are programs and scripts of which there is relatively few but which is long-lived and very important. - Loss of home data requires to redo manual work (like programming). - -2. **Work** data is data of potential large size and has a medium life time and important. - Examples are raw sequencing data and intermediate results that are to be kept (e.g., a final, sorted and indexed BAM file). - Work data can time-consuming actions to be restored, such as downloading large amounts of data or time-consuming computation. - -3. **Scratch** data is data that is temporary by nature and has a short life-time only. - Examples are temporary files (e.g., unsorted BAM files). - Scratch data is created to be removed eventually. - -## Snapshots, Backups, Archive - -- **A snapshot** stores the state of a data volume at a given time. - File systems like GPFS implement this in a copy-on-write manner, meaning that for a snapshot and the subsequent "live" state, only the differences in data need to be store.d - Note that there is additional overhead in the meta data storage. - -- **A backup** is a copy of a data set on another physical location, i.e., all data from a given date copied to another server. - Backups are made regularly and only a small number of previous ones is usually kept. - -- **An archive** is a single copy of a single state of a data set to be kept for a long time. - Classically, archives are made by copying data to magnetic tape for long-term storage. - -## Storage Locations - -This section describes the different storage locations and gives an overview of their properties. 
- -### Home Directories - -- **Location** `/fast/{users,groups,projects}/` (except for `work` and `scratch` sub directories) -- the user, group, or project home directory -- meant for documents, scripts, and programs -- default quota for data: default soft quota of 1 GB, hard quota of 1.5 GB, grace period of 7 days -- quota can be increased on request with short reason statement -- default quota for metadata: 10k files soft, 12k files hard -- snapshots are regularly created, see Section \ref{snapshot-details} -- nightly incremental backups are created, the last 5 are kept -- *Long-term strategy:* - users are expected to manage data life time independently and use best practice for source code and document management best practice (e.g., use Git). - When users/groups leave the organization or projects ends, they are expected to handle data storage and cleanup on their own. - Responsibility to enforce this is with the leader of a user's group, the group leader, or the project owner, respectively. - -### Work Directories - -- **Location** `/fast/{users,groups,projects}//work` -- the user, group, or project work directory -- meant for larger data that is to be used for a longer time, e.g., raw data, final sorted BAM file -- default quota for data: default soft quota of 1 TB, hard quota of 1.1 TB, grace period of 7 days -- quota can be increased on request with short reason statement -- default quota for metadata: 2 Mfile soft, 2.2M files hard -- no snapshots, no backup -- *Long-term strategy:* - When users/groups leave the organization or projects ends, they are expected to cleanup unneeded data on their own. - HPC IT can provide archival services on request. - Responsibility to enforce this is with the leader of a user's group, the group leader, or the project owner, respectively. - -### Scratch Directories - -- **Location** `/fast/{users,groups,projects}//scratch` -- the user, group, or project scratch directory -- **files will be removed 2 weeks after their creation** -- meant for temporary, potentially large data, e.g., intermediate unsorted or unmasked BAM files, data downloaded from the internet for trying out etc. -- default quota for data: default soft quota of 200TB, hard quota of 220TB, grace period of 7 days -- quota can be increased on request with short reason statement -- default quota for metadata: 2M files soft, 2.2M files hard -- no snapshots, no backup -- *Long-term strategy:* - as data on this volume is not to be kept for longer than 2 weeks, the long term strategy is to delete all files. - -## Snapshot Details - -Snapshots are made every 24 hours. -Of these snapshots, the last 7 are kept, then one for each day. - -## Backup Details - -Backups of the snapshots is made nightly. -The backups of the last 7 days are kept. - -## Archive Details - -BIH HPC IT has some space allocated on the MDC IT tape archive. -User data can be put under archive after agreeing with head of HPC IT. -The process is as describe in Section \ref{sop-data-archival}. +1. **Users** *(natural people)* +2. **Groups** *(Arbeitsgruppen)* with on leader and an optional delegate +3. **Projects** with one owner and an optional delegate + +Each user, group, and project can have storage folders in different locations. + +## Data Types and storage Tiers +Files stored on the HPC fall into one of three categories: + +1. **Home** folders store programs, scripts, and user config which are generally long-lived and very important files. +Loss of home data requires to redo manual work (like programming). + +2. 
**Work** folders store data of potentially large size which has a medium life time and is important. +Examples are raw sequencing data and intermediate results that are to be kept (e. g. sorted and indexed BAM files). +Work data requires time-consuming actions to be restored, such as downloading large amounts of data or long-running computation. + +3. **Scratch** folder store temporary files with a short life-time. +Examples are temporary files (e. g. unsorted BAM files). +Scratch data is created to be removed eventually. + +Ceph storage comes in two types which differ in their I/O speed, total capacity, and cost. +They are called **Tier 1** and **Tier 2** and sometimes **hot storage** and **warm storage**. +In the HPC filesystem they are mounted in `/data/cephfs-1` and `/data/cephfs-2`. +Tier 1 storage is fast, relatively small, expensive, and optimized for performance. +Tier 2 storage is slow, big, cheap, and built for keeping large files for longer times. +Storage quotas are imposed in these locations to restrict the maximum size of folders. + +### Home directories +**Location:** `/data/cephfs-1/home/` + +Only users have home directories on Tier 1 storage. +This is the starting point when starting a new shell or SSH session. +Important config files are stored here as well as analysis scripts and small user files. +Home folders have a strict storage quota of 1 GB. + +### Work directories +**Location:** `/data/cephfs-1/work/` + +Groups and projects have work directories on Tier 1 storage. +User home folders contain a symlink to their respective group's work folder. +Files shared within a group/project are stored here as long as they are in active use. +Work folders are generally limited to 1 TB per group. +Project work folders are allocated on an individual basis. + +### Scratch space +**Location:** `/data/cephfs-1/scratch/` + +Groups and projects have scratch space on Tier 1 storage. +User home folders contain a symlink to their respective group's scratch space. +Meant for temporary, potentially large data e. g. intermediate unsorted or unmasked BAM files, data downloaded from the internet etc. +**Files in scratch will be automatically removed 2 weeks after their creation.** +Scratch space is generally limited to 10 TB per group. +Projects are allocated scratch on an individual basis. + +### Tier 2 storage +**Location:** `/data/cephfs-2/` + +Groups and projects can be allocated additional storage on the Tier 2 system. +File quotas here can be significantly larger as it is much cheaper and more abundant than Tier 1. + +### Overview + +| Tier | Function | Path | Default Quota | +|:-----|:-----------------|:---------------------------------------------|--------------:| +| 1 | User home | `/data/cephfs-1/home/users/` | 1 GB | +| 1 | Group work | `/data/cephfs-1/work/groups/` | 1 TB | +| 1 | Group scratch | `/data/cephfs-1/scratch/groups/` | 10 TB | +| 1 | Projects work | `/data/cephfs-1/work/projects/` | individual | +| 1 | Projects scratch | `/data/cephfs-1/scratch/projects/` | individual | +| 2 | Group | `/data/cephfs-2/mirrored/groups/` | On request | +| 2 | Project | `/data/cephfs-2/mirrored/projects/` | On request | + +## Snapshots and Mirroring +Snapshots are incremental copies of the state of the data at a particular point in time. +They provide safety against various "Ops, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files. +Depending on the location and Tier, CephFS creates snapshots in different frequencies and retention plans. 
+User access to the snapshots is documented in [this document](https://hpc-docs.cubi.bihealth.org/storage/accessing-snapshots). + +| Location | Path | Retention policy | Mirrored | +|:-------------------------|:-----------------------------|:--------------------------------|---------:| +| User homes | `/data/cephfs-1/home/users/` | Hourly for 48 h, daily for 14 d | yes | +| Group/project work | `/data/cephfs-1/work/` | Four times a day, daily for 5 d | no | +| Group/project scratch | `/data/cephfs-1/scratch/` | Daily for 3 d | no | +| Group/project mirrored | `/data/cephfs-2/mirrored/` | Daily for 30 d, weekly for 16 w | yes | +| Group/project unmirrored | `/data/cephfs-2/unmirrored/` | Daily for 30 d, weekly for 16 w | no | + +Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire compartment within the data center. +This provides an additional layer of security i. e. physical damage to the servers. ## Technical Implementation - As a quick (very) technical note: -There exists a file system `fast`. -This file system has three independent file sets `home`, `work`, `scratch`. -On each of these file sets, there is a dependent file set for each user, group, and project below directories `users`, `groups`, and `projects`. -`home` is also mounted as `/fast_new/home` and for each user, group, and project, the entry `work` links to the corresponding fileset in `work`, the same for scratch. -Automatic file removal from `scratch` is implemented using GPFS ILM. -Quotas are implemented on the file-set level. +### Tier 1 +- Fast & expensive (flash drive based), mounted on `/data/cephfs-1` +- Currently 12 Nodes with 10 × 14 TB NVME SSD each + - 1.68 PB raw storage + - 1.45 PB erasure coded (EC 8:2) + - 1.23 PB usable (85 %, ceph performance limit) +- For typical CUBI use case 3 to 5 times faster I/O then the old DDN +- Two more nodes in purchasing process +- Example of flexible extension: + - Chunk size: 45.000 € for one node with 150 TB, i. e. ca. 300 €/TB + +### Tier 2 +- Slower but more affordable (spinning HDDs), mounted on `/data/cephfs-2` +- Currently ten nodes with 52 HDDs slots plus SSD cache installed, per node ca. 40 HDDs with 16 to 18 TB filled, i.e. + - 6.6 PB raw + - 5.3 PB erasure coded (EC 8:2) + - 4.5 PB usable (85 %; Ceph performance limit) +- Nine more nodes in purchasing process with 5+ PB +- Very Flexible Extension possible: + - ca. 50 € per TB, 100 € mirrored, starting at small chunk sizes + +### Tier 2 mirror +Similar hardware and size duplicate (another 10 nodes, 6+ PB) in separate fire compartment. diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md index b8b978967..baa9a4f2d 100644 --- a/bih-cluster/docs/storage/storage-migration.md +++ b/bih-cluster/docs/storage/storage-migration.md @@ -1,3 +1,4 @@ +# Migration from old GPFS to new CephFS ## What is going to happen? Files on the cluster's main storage `/data/gpfs-1` aka. `/fast` will move to a new file system. That includes users' home directories, work directories, and work-group directories. @@ -10,50 +11,19 @@ The company selling it has terminated support which also means buying replacemen ## The new storage There are *two* file systems set up to replace `/fast`, named *Tier 1* and *Tier 2* after their difference in I/O speed: -- **Tier 1** is faster than `/fast` ever was, but it only has about 75 % of its usable capacity. +- **Tier 1** is faster than `/fast` ever was, but it only has about 75 % of its usable capacity. 
- **Tier 2** is not as fast, but much larger, almost 3 times the current usable capacity. -The **Hot storage** Tier 1 is reserved for large files, requiring frequent random access. +The **Hot storage** Tier 1 is reserved for files requiring frequent random access, user homes, and scratch. Tier 2 (**Warm storage**) should be used for everything else. Both file systems are based on the open-source, software-defined [Ceph](https://ceph.io/en/) storage platform and differ in the type of drives used. Tier 1 or Cephfs-1 uses NVME SSDs and is optimized for performance, Tier 2 or Cephfs-2 used traditional hard drives and is optimized for cost. So these are the three terminologies in use right now: -- Cephfs-1 = Tier 1 = Hot storage -- Cephfs-2 = Tier 2 = Warm storage - -### Snapshots and Mirroring -Snapshots are incremental copies of the state of the data at a particular point in time. -They provide safety against various "Ops, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files. - -Depending on the location and Tier, Cephfs utilizes snapshots in differ differently. -Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire compartment within the data center to provide an additional layer of security. - -| Tier | Location | Path | Retention policy | Mirrored | -|:-----|:-------------------------|:-----------------------------|:--------------------------------|---------:| -| 1 | User homes | `/data/cephfs-1/home/users/` | Hourly for 48 h, daily for 14 d | yes | -| 1 | Group/project work | `/data/cephfs-1/work/` | Four times a day, daily for 5 d | no | -| 1 | Group/project scratch | `/data/cephfs-1/scratch/` | Daily for 3 d | no | -| 2 | Group/project mirrored | `/data/cephfs-2/mirrored/` | Daily for 30 d, weekly for 16 w | yes | -| 2 | Group/project unmirrored | `/data/cephfs-2/unmirrored/` | Daily for 30 d, weekly for 16 w | no | - -User access to the snapshots is documented here: https://hpc-docs.cubi.bihealth.org/storage/accessing-snapshots - -### Quotas - -| Tier | Function | Path | Default Quota | -|:-----|:---------|:-----|--------------:| -| 1 | User home | `/data/cephfs-1/home/users/` | 1 GB | -| 1 | Group work | `/data/cephfs-1/work/groups/` | 1 TB | -| 1 | Group scratch | `/data/cephfs-1/scratch/groups/` | 10 TB | -| 1 | Projects work | `/data/cephfs-1/work/projects/` | individual | -| 1 | Projects scratch | `/data/cephfs-1/scratch/projects/` | individual | -| 2 | Group mirrored | `/data/cephfs-2/mirrored/groups/` | 4 TB | -| 2 | Group unmirrored | `/data/cephfs-2/unmirrored/groups/` | On request | -| 2 | Project mirrored | `/data/cephfs-2/mirrored/projects/` | On request | -| 2 | Project unmirrored | `/data/cephfs-2/unmirrored/projects/` | individual | - -There are no quotas on the number of files. +- Cephfs-1 = Tier 1 = Hot storage = `/data/cephfs-1` +- Cephfs-2 = Tier 2 = Warm storage = `/data/cephfs-2` + +There are no more quotas on the number of files. ## New file locations Naturally, paths are going to change after files move to their new location. @@ -195,28 +165,3 @@ Best practice and/or tools will be provided. !!! note The users' `work` space will be moved to the group's `work` space. 
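Until the official tooling is announced, a manual transfer of group work data could look like the sketch below. The group name `ag-doe` and the chosen destination folder on Tier 2 are purely illustrative assumptions.

```sh
# Dry run first (-n) to see what would be transferred
rsync -avhn /fast/work/groups/ag-doe/ /data/cephfs-2/unmirrored/groups/ag-doe/fast-work/

# If the output looks right, repeat without -n and compare the trees afterwards
rsync -avh /fast/work/groups/ag-doe/ /data/cephfs-2/unmirrored/groups/ag-doe/fast-work/
diff -qr /fast/work/groups/ag-doe /data/cephfs-2/unmirrored/groups/ag-doe/fast-work
```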
- -## Technical details about the new infrastructure -### Tier 1 -- Fast & expensive (flash drive based), mounted on `/data/cephfs-1` -- Currently 12 Nodes with 10 × 14 TB NVME/SSD each installed - - 1.68 PB raw storage - - 1.45 PB erasure coded (EC 8:2) - - 1.23 PB usable (85 %, ceph performance limit) -- For typical CUBI use case 3 to 5 times faster I/O then the old DDN -- Two more nodes in purchasing process -- Example of flexible extension: - - Chunk size: 45 kE for one node with 150 TB, i.e. ca. 300 E/TB - -### Tier 2 -- Slower but more affordable (spinning HDDs), mounted on `/data/cephfs-2` -- Currently ten nodes with 52 HDDs slots plus SSD cache installed, per node ca. 40 HDDs with 16 to 18 TB filled, i.e. - - 6.6 PB raw - - 5.3 PB erasure coded (EC 8:2) - - 4.5 PB usable (85 %; Ceph performance limit) -- Nine more nodes in purchasing process with 5+ PB -- Very Flexible Extension possible: - - ca. 50 Euro per TB, 100 Euro mirrored, starting at small chunk sizes - -### Tier 2 mirror -Similar hardware and size duplicate (another 10 nodes, 6+ PB) in separate fire compartment diff --git a/bih-cluster/mkdocs.yml b/bih-cluster/mkdocs.yml index 294448a14..78ab31d8e 100644 --- a/bih-cluster/mkdocs.yml +++ b/bih-cluster/mkdocs.yml @@ -113,11 +113,11 @@ nav: - "Episode 3": first-steps/episode-3.md - "Episode 4": first-steps/episode-4.md - "Storage": + - "Storage Locations": storage/storage-locations.md + - "Automated Cleanup": storage/scratch-cleanup.md - "Storage Migration": storage/storage-migration.md - "Accessing Snapshots": storage/accessing-snapshots.md - "Querying Quotas": storage/querying-storage.md - - "Storage Locations": storage/storage-locations.md - - "Automated Cleanup": storage/scratch-cleanup.md - "Cluster Scheduler": - slurm/overview.md - slurm/background.md From 9c86bcd0096426e9e22f4504eca80506d93a8f67 Mon Sep 17 00:00:00 2001 From: Eric Blanc Date: Mon, 11 Mar 2024 14:08:41 +0100 Subject: [PATCH 07/12] docs: fixed typos, added machine learning example & link to migration --- bih-cluster/docs/storage/storage-locations.md | 6 +++-- bih-cluster/docs/storage/storage-migration.md | 24 ++++++++++++------- 2 files changed, 20 insertions(+), 10 deletions(-) diff --git a/bih-cluster/docs/storage/storage-locations.md b/bih-cluster/docs/storage/storage-locations.md index 7d30c89c6..e62d07db6 100644 --- a/bih-cluster/docs/storage/storage-locations.md +++ b/bih-cluster/docs/storage/storage-locations.md @@ -3,11 +3,13 @@ This document describes the forth iteration of the file system structure on the It was made necessary because the previous file system was no longer supported by the manufacturer and we since switched to distributed [Ceph](https://ceph.io/en/) storage. For now, the third-generation file system is still mounted at `/fast`. +**The old, third-generation filesystem will be decommissioned soon, please consult the [document describing the migration process](storage-migration.md)!** + ## Organizational Entities There are the following three entities on the cluster: -1. **Users** *(natural people)* -2. **Groups** *(Arbeitsgruppen)* with on leader and an optional delegate +1. **Users** *(real people)* +2. **Groups** *(Arbeitsgruppen)* with one leader and an optional delegate 3. **Projects** with one owner and an optional delegate Each user, group, and project can have storage folders in different locations. 
diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md index baa9a4f2d..c95bf7882 100644 --- a/bih-cluster/docs/storage/storage-migration.md +++ b/bih-cluster/docs/storage/storage-migration.md @@ -20,6 +20,7 @@ Both file systems are based on the open-source, software-defined [Ceph](https:// Tier 1 or Cephfs-1 uses NVME SSDs and is optimized for performance, Tier 2 or Cephfs-2 used traditional hard drives and is optimized for cost. So these are the three terminologies in use right now: + - Cephfs-1 = Tier 1 = Hot storage = `/data/cephfs-1` - Cephfs-2 = Tier 2 = Warm storage = `/data/cephfs-2` @@ -27,7 +28,7 @@ There are no more quotas on the number of files. ## New file locations Naturally, paths are going to change after files move to their new location. -Due to the increase in storage quality options, there will be some more more folders to consider. +Due to the increase in storage quality options, there will be some more folders to consider. ### Users - Home on Tier 1: `/data/cephfs-1/home/users/` @@ -36,7 +37,7 @@ Due to the increase in storage quality options, there will be some more more fol !!! warning User work & scratch spaces are now part of the user's group folder. - This means, groups should coordinate internally to distribute their allotted quota evenly among users. + This means, groups should coordinate internally to distribute their allotted quota according to each user's needs. The implementation is done _via_ symlinks created by default when the user account is moved to its new destination. @@ -104,7 +105,7 @@ Your mileage may vary but there is a basic principle that remains true for all p #### DNA sequencing (WES, WGS) -Typical Whole Genome Sequencing data of a human sample at 100x coverage requires about 150 GB of storage, Whole Exome Sequencing between 6 and 30 GB. +Typical Whole Genome Sequencing data of a human sample at 100x coverage requires about 150 GB of storage, Whole Exome Sequencing files occupy between 6 and 30 GB. These large files require considerable I/O resources for processing, in particular for the mapping step. A prudent workflow for these kind of analysis would therefore be the following: @@ -126,10 +127,11 @@ The count files are much smaller than the raw data or the mapped data, so they c A typical workflow would be: 1. Copy your `fastq` files from Tier 2 to Tier 1. -2. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1. -3. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example. -4. Save expression levels (`R` objects) and the output of `salmon`, `STAR`, or any mapper/aligner of your choice to Tier 2. -5. **Remove raw data, bam & count files from Tier 1.** +2. Perform raw data quality control, and store the outcome on Tier 2. +3. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1. +4. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example. +5. Save expression levels (`R` objects) and the output of `salmon`, `STAR`, or any mapper/aligner of your choice to Tier 2. +6. **Remove raw data, bam & count files from Tier 1.** !!! tip If using `STAR`, don't forget to use your `scratch` area for transient operations. @@ -142,13 +144,19 @@ Large raw files need to be processed once and only the outcome of the processing Therefore, a typical workflow would be: 1. Copy your `fastq` files from Tier 2 to Tier 1. -2. 
Perform raw data QC (for example with `fastqc`). +2. Perform raw data QC, and store the results on Tier 2. 3. Get the count matrix, e. g. using `Cell Ranger` or `alevin-fry`, perform count matrix QC and store the results on Tier 1. 4. **Remove raw data, bam & count files from Tier 1.** 5. Downstream analysis with `seurat`, `scanpy`, or `Loupe Browser`. #### Machine learning +There is no obvious workflow that covers most used cases for machine learning. +However, + +- Training might be done on scratch where data access is quick and data size not as constrained as on work space. But files will disappear after 14 days. +- Some models can be updated with new data, without needing to keep the whole dataset on Tier 1. + ## Data migration process from old `/fast` to CephFS 1. Administrative preparations 1. HPC-Access registration (PIs will receive in invite mail) From db2f847c98db7ade7c92e6b7d0e457b7bd166ed6 Mon Sep 17 00:00:00 2001 From: Thomas Sell Date: Mon, 11 Mar 2024 19:36:44 +0100 Subject: [PATCH 08/12] moved accessing snapshots into main storage document --- bih-cluster/docs/storage/storage-locations.md | 49 +++++++++++++------ bih-cluster/mkdocs.yml | 1 - 2 files changed, 33 insertions(+), 17 deletions(-) diff --git a/bih-cluster/docs/storage/storage-locations.md b/bih-cluster/docs/storage/storage-locations.md index e62d07db6..d661ff05c 100644 --- a/bih-cluster/docs/storage/storage-locations.md +++ b/bih-cluster/docs/storage/storage-locations.md @@ -1,9 +1,9 @@ # Storage and Volumes: Locations This document describes the forth iteration of the file system structure on the BIH HPC cluster. It was made necessary because the previous file system was no longer supported by the manufacturer and we since switched to distributed [Ceph](https://ceph.io/en/) storage. -For now, the third-generation file system is still mounted at `/fast`. -**The old, third-generation filesystem will be decommissioned soon, please consult the [document describing the migration process](storage-migration.md)!** +!!! warning + For now, the old, third-generation file system is still mounted at `/fast`. **It will be decommissioned soon, please consult [this document describing the migration process](storage-migration.md)!** ## Organizational Entities There are the following three entities on the cluster: @@ -17,15 +17,15 @@ Each user, group, and project can have storage folders in different locations. ## Data Types and storage Tiers Files stored on the HPC fall into one of three categories: -1. **Home** folders store programs, scripts, and user config which are generally long-lived and very important files. -Loss of home data requires to redo manual work (like programming). +1. **Home** folders store programs, scripts, and user config i. e. long-lived and very important files. +Loss of this data requires to redo manual work (like programming). 2. **Work** folders store data of potentially large size which has a medium life time and is important. -Examples are raw sequencing data and intermediate results that are to be kept (e. g. sorted and indexed BAM files). +Examples are raw sequencing data and intermediate results that are to be kept (e. g. sorted and indexed BAM files). Work data requires time-consuming actions to be restored, such as downloading large amounts of data or long-running computation. 3. **Scratch** folder store temporary files with a short life-time. -Examples are temporary files (e. g. unsorted BAM files). +Examples are temporary files (e. g. unsorted BAM files). 
Scratch data is created to be removed eventually. Ceph storage comes in two types which differ in their I/O speed, total capacity, and cost. @@ -70,21 +70,20 @@ File quotas here can be significantly larger as it is much cheaper and more abun ### Overview -| Tier | Function | Path | Default Quota | -|:-----|:-----------------|:---------------------------------------------|--------------:| -| 1 | User home | `/data/cephfs-1/home/users/` | 1 GB | -| 1 | Group work | `/data/cephfs-1/work/groups/` | 1 TB | -| 1 | Group scratch | `/data/cephfs-1/scratch/groups/` | 10 TB | -| 1 | Projects work | `/data/cephfs-1/work/projects/` | individual | -| 1 | Projects scratch | `/data/cephfs-1/scratch/projects/` | individual | -| 2 | Group | `/data/cephfs-2/mirrored/groups/` | On request | -| 2 | Project | `/data/cephfs-2/mirrored/projects/` | On request | +| Tier | Function | Path | Default Quota | +|:-----|:----------------|:---------------------------------------------|--------------:| +| 1 | User home | `/data/cephfs-1/home/users/` | 1 GB | +| 1 | Group work | `/data/cephfs-1/work/groups/` | 1 TB | +| 1 | Group scratch | `/data/cephfs-1/scratch/groups/` | 10 TB | +| 1 | Project work | `/data/cephfs-1/work/projects/` | individual | +| 1 | Project scratch | `/data/cephfs-1/scratch/projects/` | individual | +| 2 | Group | `/data/cephfs-2/mirrored/groups/` | On request | +| 2 | Project | `/data/cephfs-2/mirrored/projects/` | On request | ## Snapshots and Mirroring Snapshots are incremental copies of the state of the data at a particular point in time. They provide safety against various "Ops, did I just delete that?" scenarios, meaning they can be used to recover lost or damaged files. Depending on the location and Tier, CephFS creates snapshots in different frequencies and retention plans. -User access to the snapshots is documented in [this document](https://hpc-docs.cubi.bihealth.org/storage/accessing-snapshots). | Location | Path | Retention policy | Mirrored | |:-------------------------|:-----------------------------|:--------------------------------|---------:| @@ -97,6 +96,24 @@ User access to the snapshots is documented in [this document](https://hpc-docs.c Some parts of Tier 1 and Tier 2 snapshots are also mirrored into a separate fire compartment within the data center. This provides an additional layer of security i. e. physical damage to the servers. +### Accessing snapshots +To access snapshots, simply navigate to the `.snap/` sub-folder of the respective location. +You will find one sub-folder for every snapshot created and in them a complete replica of the folder respective folder at the time of snapshot creation. 
+ +For example: + +- `/data/cephfs-1/home/.snap//users//` +- `/data/cephfs-1/work/.snap//groups//` +- `/data/cephfs-2/unmirrored/.snap//projects//` + +Here is a simple example of how to restore a file: + +```sh +$ cd /data/cephfs-2/unmirrored/.snap/scheduled-2024-03-11-00_00_00_UTC/ +$ ls groups/cubi/ +$ cp groups/cubi/important_file.txt /data/cephfs-2/unmirrored/groups/cubi/ +``` + ## Technical Implementation As a quick (very) technical note: diff --git a/bih-cluster/mkdocs.yml b/bih-cluster/mkdocs.yml index 49fe3ed8a..cd91f71ef 100644 --- a/bih-cluster/mkdocs.yml +++ b/bih-cluster/mkdocs.yml @@ -118,7 +118,6 @@ nav: - "Storage Locations": storage/storage-locations.md - "Automated Cleanup": storage/scratch-cleanup.md - "Storage Migration": storage/storage-migration.md - - "Accessing Snapshots": storage/accessing-snapshots.md - "Querying Quotas": storage/querying-storage.md - "Cluster Scheduler": - slurm/overview.md From 49c75eee45d1e0fa17d32d555090857984424813 Mon Sep 17 00:00:00 2001 From: Thomas Sell Date: Mon, 11 Mar 2024 20:00:29 +0100 Subject: [PATCH 09/12] update many things --- bih-cluster/docs/storage/querying-storage.md | 5 +- bih-cluster/docs/storage/scratch-cleanup.md | 162 +----------------- bih-cluster/docs/storage/storage-locations.md | 41 +++-- bih-cluster/docs/storage/storage-migration.md | 4 +- bih-cluster/mkdocs.yml | 4 +- 5 files changed, 36 insertions(+), 180 deletions(-) diff --git a/bih-cluster/docs/storage/querying-storage.md b/bih-cluster/docs/storage/querying-storage.md index a4d112886..d9a7bf004 100644 --- a/bih-cluster/docs/storage/querying-storage.md +++ b/bih-cluster/docs/storage/querying-storage.md @@ -1,8 +1,9 @@ # Querying Storage Quotas -!!! info "More Convenient OnDemand Portal" +!!! info "Outdated" - You can also see your quotas in the [Open OnDemand Portal Quotas application](../ondemand/quotas.md). + This document is only valid for the old, third-generation file system and will be removed soon. + Quotas of our new CephFS storage are communicated via the [HPC Access](https://hpc-access.cubi.bihealth.org/) web portal. As described elsewhere, all data in your user, group, and project volumes is subject to quotas. This page quickly shows how to query for the current usage of data volume and file counts for your user, group, and projects. diff --git a/bih-cluster/docs/storage/scratch-cleanup.md b/bih-cluster/docs/storage/scratch-cleanup.md index 3c4cd20d6..9d74c4aa9 100644 --- a/bih-cluster/docs/storage/scratch-cleanup.md +++ b/bih-cluster/docs/storage/scratch-cleanup.md @@ -1,159 +1,7 @@ -# Automated Cleanup of Scratch Volumes +# Automated Cleanup of Scratch -The `scratch` volumes are automatically cleaned up nightly with the following mechanism. +The `scratch` space is automatically cleaned up nightly with the following mechanism. -- The `scratch` volume of each user, group, and project is crawled to identifies files that are older than two weeks. -- These files are moved into a directory that is named by the current date `YYYY-MM-DD` into the same sub folders as it was below `scratch`. -- These scratch directories are removed after two more weeks. 
- -The cleanup code is a (probably updated) version of the following code: - -```python -#!/usr/bin/env python3 -"""Tool to cleanup the scratch directories on BIH HPC.""" - -import argparse -import datetime -import logging -import os -from os.path import join, dirname, normpath, islink, exists, isdir -import pathlib -import re -import shutil -import sys - -#: The maximum age of files (by mtime) before they are put into the thrash bin. -MAX_AGE_SCRATCH = datetime.timedelta(days=14) - -#: The maximum age of trash cans to keep around. -MAX_AGE_TRASH = datetime.timedelta(days=14) - -#: Name of trash can directory. -TRASH_DIR_NAME = "BIH_TRASH" - - -def setup_trash(args, now): - """Setup the trash directory.""" - trash_path = join(args.scratch_dir, TRASH_DIR_NAME, now.strftime("%Y-%m-%d")) - logging.debug("Today's trash dir is %s", trash_path) - logging.debug(" - mkdir -p %s", trash_path) - logging.debug(" - chown root:root %s", dirname(trash_path)) - logging.debug(" - chmod 0755 %s", dirname(trash_path)) - logging.debug(" - chown root:root %s", trash_path) - logging.debug(" - chmod 0755 %s", trash_path) - if not args.dry_run: - # Ensure that today's trash dir either does not exist or is a directory. - if exists(dirname(trash_path)) and not isdir(dirname(trash_path)): - os.unlink(dirname(trash_path)) - elif exists(trash_path) and not isdir(trash_path): - os.unlink(dirname(trash_path)) - os.makedirs(trash_path, mode=0o775, exist_ok=True) - os.chown(dirname(trash_path), 0, 0) - os.chmod(dirname(trash_path), 0o755) - os.chown(trash_path, 0, 0) - os.chmod(trash_path, 0o755) - return trash_path - -def move_files(args, now, trash_path): - """Move files into the trash directory.""" - logging.debug(" Walking %s recursively", args.scratch_dir) - for root, dirs, files in os.walk(args.scratch_dir): - # Do not go into trash directory itself. - if root == args.scratch_dir: - dirs[:] = [d for d in dirs if d != TRASH_DIR_NAME] - # Only create output directory once. - dir_created = False - # Walk files. - for name in files: - path = join(root, name) - logging.debug(" considering %s", path) - age = now - datetime.datetime.fromtimestamp(os.lstat(path).st_mtime) - if age > MAX_AGE_SCRATCH: - local = path[len(args.scratch_dir) + 1 :] - dest = join(trash_path, local) - logging.debug(" - chown root:root %s", path) - logging.debug(" - chmod 0644 %s", path) - if not args.dry_run: - os.lchown(path, 0, 0) - if not islink(path): - os.chmod(path, 0o644) - if not dir_created: - dir_created = True - logging.debug(" - mkdir -p %s", dirname(dest)) - if not args.dry_run: - os.makedirs(dirname(dest), mode=0o775, exist_ok=True) - logging.debug(" - %s -> %s", path, dest) - if not args.dry_run: - os.rename(path, dest) - - -def empty_trash(args, now): - """Empty the trash cans.""" - - def log_error(_func, path, exc_info): - logging.warn("could not delete path %s: %s", path, exc_info) - - logging.debug(" Emptying the trash cans...") - trash_base = join(args.scratch_dir, TRASH_DIR_NAME) - if os.path.exists(trash_base): - entries = os.listdir(trash_base) - else: - entries = [] - for entry in entries: - if re.match(r"^\d\d\d\d-\d\d-\d\d", entry): - path = join(args.scratch_dir, TRASH_DIR_NAME, entry) - logging.info(" considering %s", path) - folder_date = datetime.datetime.strptime(entry, "%Y-%m-%d") - if now - folder_date > MAX_AGE_TRASH: - logging.info(" - rm -rf %s", path) - if not args.dry_run: - shutil.rmtree(path, onerror=log_error) - logging.debug(" ... 
done emptying the trash cans.") - - -def run(args): - """Actually execute the scratch directory cleanup.""" - now = datetime.datetime.now() # use one start time - logging.info("Starting to cleanup %s...", args.scratch_dir) - trash_path = setup_trash(args, now) - move_files(args, now, trash_path) - empty_trash(args, now) - logging.info("... done with cleanup.") - - -def main(argv=None): - """Main entry point into the program.""" - parser = argparse.ArgumentParser() - parser.add_argument("scratch_dir", metavar="SCRATCH_DIR", help="Path to the scratch directory") - parser.add_argument("--verbose", "-v", dest="verbosity", default=1, action="count") - parser.add_argument("--quiet", "-q", dest="quiet", default=False, action="store_true") - parser.add_argument( - "--dry-run", - "-n", - dest="dry_run", - default=False, - action="store_true", - help="Do not actually perform the actions", - ) - - if not shutil.rmtree.avoids_symlink_attacks: - raise RuntimeError("Cannot execute with rmtree on unsafe platforms!") - - args = parser.parse_args(argv) - args.scratch_dir = normpath(args.scratch_dir) - - logging.basicConfig(format="%(asctime)-15s %(levelname)3.3s %(message)s") - logger = logging.getLogger() - if args.quiet: - logger.setLevel(logging.WARNING) - elif args.verbosity == 1: - logger.setLevel(logging.INFO) - elif args.verbosity > 1: - logger.setLevel(logging.DEBUG) - - return run(args) - - -if __name__ == "__main__": - sys.exit(main()) -``` \ No newline at end of file +1. Daily snapshots of the `scratch` folder are created and retained for 3 days. +2. Files which were not modified for the last 14 days are removed. +3. Erroneously deleted files can be manually retrieved from the snapshots. diff --git a/bih-cluster/docs/storage/storage-locations.md b/bih-cluster/docs/storage/storage-locations.md index d661ff05c..24efc5464 100644 --- a/bih-cluster/docs/storage/storage-locations.md +++ b/bih-cluster/docs/storage/storage-locations.md @@ -2,7 +2,7 @@ This document describes the forth iteration of the file system structure on the BIH HPC cluster. It was made necessary because the previous file system was no longer supported by the manufacturer and we since switched to distributed [Ceph](https://ceph.io/en/) storage. -!!! warning +!!! warning "Important" For now, the old, third-generation file system is still mounted at `/fast`. **It will be decommissioned soon, please consult [this document describing the migration process](storage-migration.md)!** ## Organizational Entities @@ -31,12 +31,15 @@ Scratch data is created to be removed eventually. Ceph storage comes in two types which differ in their I/O speed, total capacity, and cost. They are called **Tier 1** and **Tier 2** and sometimes **hot storage** and **warm storage**. In the HPC filesystem they are mounted in `/data/cephfs-1` and `/data/cephfs-2`. -Tier 1 storage is fast, relatively small, expensive, and optimized for performance. -Tier 2 storage is slow, big, cheap, and built for keeping large files for longer times. + +- Tier 1 storage is fast, relatively small, expensive, and optimized for performance. +- Tier 2 storage is slow, big, cheap, and built for keeping large files for longer times. + Storage quotas are imposed in these locations to restrict the maximum size of folders. +Amount and utilization of quotas is communicated via the [HPC Access](https://hpc-access.cubi.bihealth.org/) web portal. ### Home directories -**Location:** `/data/cephfs-1/home/` +Location: `/data/cephfs-1/home/` Only users have home directories on Tier 1 storage. 
This is the starting point when starting a new shell or SSH session. @@ -44,7 +47,7 @@ Important config files are stored here as well as analysis scripts and small use Home folders have a strict storage quota of 1 GB. ### Work directories -**Location:** `/data/cephfs-1/work/` +Location: `/data/cephfs-1/work/` Groups and projects have work directories on Tier 1 storage. User home folders contain a symlink to their respective group's work folder. @@ -53,21 +56,24 @@ Work folders are generally limited to 1 TB per group. Project work folders are allocated on an individual basis. ### Scratch space -**Location:** `/data/cephfs-1/scratch/` +Location: `/data/cephfs-1/scratch/` Groups and projects have scratch space on Tier 1 storage. User home folders contain a symlink to their respective group's scratch space. Meant for temporary, potentially large data e. g. intermediate unsorted or unmasked BAM files, data downloaded from the internet etc. -**Files in scratch will be automatically removed 2 weeks after their creation.** Scratch space is generally limited to 10 TB per group. Projects are allocated scratch on an individual basis. +**Files in scratch will be [automatically removed](scratch-cleanup.md) 2 weeks after their creation.** ### Tier 2 storage -**Location:** `/data/cephfs-2/` +Location: `/data/cephfs-2/` Groups and projects can be allocated additional storage on the Tier 2 system. File quotas here can be significantly larger as it is much cheaper and more abundant than Tier 1. +!!! note + Tier 2 storage is not mounted on the HPC login nodes. + ### Overview | Tier | Function | Path | Default Quota | @@ -115,28 +121,29 @@ $ cp groups/cubi/important_file.txt /data/cephfs-2/unmirrored/groups/cubi/ ``` ## Technical Implementation -As a quick (very) technical note: - ### Tier 1 -- Fast & expensive (flash drive based), mounted on `/data/cephfs-1` +- Fast & expensive +- mounted on `/data/cephfs-1` - Currently 12 Nodes with 10 × 14 TB NVME SSD each - 1.68 PB raw storage - 1.45 PB erasure coded (EC 8:2) - - 1.23 PB usable (85 %, ceph performance limit) + - 1.23 PB usable (85 %, Ceph performance limit) - For typical CUBI use case 3 to 5 times faster I/O then the old DDN - Two more nodes in purchasing process - Example of flexible extension: - Chunk size: 45.000 € for one node with 150 TB, i. e. ca. 300 €/TB ### Tier 2 -- Slower but more affordable (spinning HDDs), mounted on `/data/cephfs-2` -- Currently ten nodes with 52 HDDs slots plus SSD cache installed, per node ca. 40 HDDs with 16 to 18 TB filled, i.e. - - 6.6 PB raw +- Slower but more affordable +- mounted on `/data/cephfs-2` +- Currently 10 nodes with 52 HDDs slots and SSD cache (~40 HDDs per node with 16–18 TB capacity) + - 6.6 PB raw storage - 5.3 PB erasure coded (EC 8:2) - 4.5 PB usable (85 %; Ceph performance limit) -- Nine more nodes in purchasing process with 5+ PB +- More nodes in purchasing process - Very Flexible Extension possible: - ca. 50 € per TB, 100 € mirrored, starting at small chunk sizes ### Tier 2 mirror -Similar hardware and size duplicate (another 10 nodes, 6+ PB) in separate fire compartment. +- Similar in hardware and size (10 nodes, 6+ PB) +- Stored in separate fire compartment. 
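
In addition to the [HPC Access](https://hpc-access.cubi.bihealth.org/) portal mentioned above, CephFS itself can report usage and quota figures through extended attributes. This is only a sketch: the attribute names below are the standard CephFS ones, but whether the cluster's client mounts expose them is an assumption, and `<group>` is a placeholder for an actual group folder name.

```sh
# Recursive size and file count of a folder, as tracked by CephFS
$ getfattr -n ceph.dir.rbytes /data/cephfs-1/work/groups/<group>
$ getfattr -n ceph.dir.rfiles /data/cephfs-1/work/groups/<group>

# Quota limit set on the folder (the attribute is absent if no quota is set)
$ getfattr -n ceph.quota.max_bytes /data/cephfs-1/work/groups/<group>

# On kernel CephFS mounts, df reports a quota-limited folder with the quota as its total size
$ df -h /data/cephfs-1/work/groups/<group>
```

If these attributes are not readable on a given node, the HPC Access portal remains the authoritative source for quota information.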
diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md index c95bf7882..9cfa531a0 100644 --- a/bih-cluster/docs/storage/storage-migration.md +++ b/bih-cluster/docs/storage/storage-migration.md @@ -35,7 +35,7 @@ Due to the increase in storage quality options, there will be some more folders - Work on Tier 1: `/data/cephfs-1/work/groups//users/` - Scratch on Tier 1: `/data/cephfs-1/scratch/groups//users/` -!!! warning +!!! warning "Important" User work & scratch spaces are now part of the user's group folder. This means, groups should coordinate internally to distribute their allotted quota according to each user's needs. @@ -55,7 +55,7 @@ The full list of symlinks is: - Programming languages libraries & registry: `R`, `.theano`, `.cargo`, `.cpan`, `.cpanm`, & `.npm` - Others: `.ncbi`, `.vs` -!!! warning +!!! warning "Important" Automatic symlink creation will not create a symlink to any conda installation. ### Groups diff --git a/bih-cluster/mkdocs.yml b/bih-cluster/mkdocs.yml index cd91f71ef..c3cf69a6b 100644 --- a/bih-cluster/mkdocs.yml +++ b/bih-cluster/mkdocs.yml @@ -116,9 +116,9 @@ nav: - "Episode 4": first-steps/episode-4.md - "Storage": - "Storage Locations": storage/storage-locations.md - - "Automated Cleanup": storage/scratch-cleanup.md - - "Storage Migration": storage/storage-migration.md + - "Scratch Cleanup": storage/scratch-cleanup.md - "Querying Quotas": storage/querying-storage.md + - "Storage Migration": storage/storage-migration.md - "Cluster Scheduler": - slurm/overview.md - slurm/background.md From 81d41c4963e67d9c2c2f0333254aa18af4a9e7fe Mon Sep 17 00:00:00 2001 From: Thomas Sell Date: Tue, 12 Mar 2024 19:46:15 +0100 Subject: [PATCH 10/12] updates --- .../docs/storage/accessing-snapshots.md | 37 ---------- bih-cluster/docs/storage/migration-faq.md | 52 ++++++++++++++ bih-cluster/docs/storage/storage-locations.md | 27 ++++---- bih-cluster/docs/storage/storage-migration.md | 67 ++++++++----------- bih-cluster/mkdocs.yml | 1 + 5 files changed, 96 insertions(+), 88 deletions(-) delete mode 100644 bih-cluster/docs/storage/accessing-snapshots.md create mode 100644 bih-cluster/docs/storage/migration-faq.md diff --git a/bih-cluster/docs/storage/accessing-snapshots.md b/bih-cluster/docs/storage/accessing-snapshots.md deleted file mode 100644 index 4da629945..000000000 --- a/bih-cluster/docs/storage/accessing-snapshots.md +++ /dev/null @@ -1,37 +0,0 @@ -# Accessing Snapshots and Backups - -By now you have probably read that your home directory has strict quotas in place. -You pay this price in usability by the fact that snapshots and backups exist. - -## Snapshot - -Every night, a snapshot is created of all user, group, and project home directories. -The snapshots are placed in the following locations. - -- `/fast/home/.snapshots/$SNAPSHOT/users/$USER` -- `/fast/home/.snapshots/$SNAPSHOT/groups/$GROUP` -- `/fast/home/.snapshots/$SNAPSHOT/projects/$PROJECT` - -The snapshot name `$SNAPSHOT` simply is the 3-letter abbreviation of the week day (i.e., Mon, Tue, Thu, Fri, Sat, Sun). - -The snapshots contains the state of the home directories at the time that they were made. -This also includes the permissions. -That is you can simply retrieve the state of your home directory of last Tuesday (if the directory already existed back then) at: - -- `/fast/home/.snapshots/Tue/users/$USER` - -e.g., - -```bash -$ ls /fast/home/.snapshots/Tue/users/$USER -... 
-``` - -## Backups - -There are very few cases where backups will help you more than snapshots. -As with snapshots, there is a 7-day rotation for backups. -The backups fully reflect the snapshots and thus everything that is in the backups is also in the snapshots. - -The only time that you will need backups is when the GPFS and its snapshots are damaged. -This protects the files in your home directory against technical errors and other catastrophes such as fire or water damage of the data center. diff --git a/bih-cluster/docs/storage/migration-faq.md b/bih-cluster/docs/storage/migration-faq.md new file mode 100644 index 000000000..f9f12d97b --- /dev/null +++ b/bih-cluster/docs/storage/migration-faq.md @@ -0,0 +1,52 @@ +# Data Migration Tips and tricks +Please use `hpc-transfer-1` and `hpc-transfer-2` for moving large amounts of files. +This not only leaves the compute notes available for actual computation, but also has now risk of your jobs being killed by Slurm. +You should also use `tmux` to not risk connection loss during long running transfers. + +## Useful commands + +1. Define source and target location and copy contents. +```sh +$ SOURCE=/data/gpfs-1/work/projects/my_project/ +$ TARGET=/data/cephfs-2/unmirrored/projects/my-project/ +$ rsync -ah --stats --progress --dry-run $SOURCE $TARGET +``` + +2. Remove the `--dry-run` flag to perform the actual copy process. +3. If you are happy with how things are, add the `--remove-source-files` flag to `rsync`. +4. Check if all files are gone from the SOURCE folder and delete: +```sh +$ find $SOURCE -type f | wc -l +$ rm -r $SOURCE +``` + +!!! Warning + When defining your source location, do not use the `*` wildcard character. + This will not match hidden (dot) files and leave them behind. + +## Conda environments +Conda environment tend to not react well when the folder they are stored in is moved from its original location. +There are numerous ways to move the state of your environments, which are described [here](https://www.anaconda.com/blog/moving-conda-environments). + +A simple way we can recommend is this: + +1. Export all environments prior to the move. +```sh +#!/bin/bash +for env in $(ls .miniforge/envs/) +do + conda env export -n $env -f $env.yml +done +``` + +2. Re-create them after the move: +```sh +$ conda env create -f environment.yml +``` + +!!! Note + If you already moved your home folder, you can still activate your old environments like this: + + ```sh + $ conda activate /fast/home/users/your-user/path/to/conda/envs/env-name-here + ``` \ No newline at end of file diff --git a/bih-cluster/docs/storage/storage-locations.md b/bih-cluster/docs/storage/storage-locations.md index 24efc5464..d89b2357b 100644 --- a/bih-cluster/docs/storage/storage-locations.md +++ b/bih-cluster/docs/storage/storage-locations.md @@ -68,23 +68,26 @@ Projects are allocated scratch on an individual basis. ### Tier 2 storage Location: `/data/cephfs-2/` -Groups and projects can be allocated additional storage on the Tier 2 system. -File quotas here can be significantly larger as it is much cheaper and more abundant than Tier 1. +This is where big files go when they are not in active use. +Groups are allocated 10 TB of Tier 2 storage by default. +File quotas here can be significantly larger as space is much cheaper and more abundant than on Tier 1. !!! note - Tier 2 storage is not mounted on the HPC login nodes. + Tier 2 storage is currently not accessible from HPC login nodes. 
### Overview -| Tier | Function | Path | Default Quota | -|:-----|:----------------|:---------------------------------------------|--------------:| -| 1 | User home | `/data/cephfs-1/home/users/` | 1 GB | -| 1 | Group work | `/data/cephfs-1/work/groups/` | 1 TB | -| 1 | Group scratch | `/data/cephfs-1/scratch/groups/` | 10 TB | -| 1 | Project work | `/data/cephfs-1/work/projects/` | individual | -| 1 | Project scratch | `/data/cephfs-1/scratch/projects/` | individual | -| 2 | Group | `/data/cephfs-2/mirrored/groups/` | On request | -| 2 | Project | `/data/cephfs-2/mirrored/projects/` | On request | +| Tier | Function | Path | Default Quota | +|:-----|:----------------|:-----------------------------------------------|--------------:| +| 1 | User home | `/data/cephfs-1/home/users/` | 1 GB | +| 1 | Group work | `/data/cephfs-1/work/groups/` | 1 TB | +| 1 | Group scratch | `/data/cephfs-1/scratch/groups/` | 10 TB | +| 1 | Project work | `/data/cephfs-1/work/projects/` | On request | +| 1 | Project scratch | `/data/cephfs-1/scratch/projects/` | On request | +| 2 | Group | `/data/cephfs-2/unmirrored/groups/` | 10 TB | +| 2 | Project | `/data/cephfs-2/unmirrored/projects/` | On request | +| 2 | Group | `/data/cephfs-2/mirrored/groups/` | On request | +| 2 | Project | `/data/cephfs-2/mirrored/projects/` | On request | ## Snapshots and Mirroring Snapshots are incremental copies of the state of the data at a particular point in time. diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md index 9cfa531a0..585dfd194 100644 --- a/bih-cluster/docs/storage/storage-migration.md +++ b/bih-cluster/docs/storage/storage-migration.md @@ -4,6 +4,13 @@ Files on the cluster's main storage `/data/gpfs-1` aka. `/fast` will move to a n That includes users' home directories, work directories, and work-group directories. Once files have been moved to their new locations, `/fast` will be retired. +Simultaneously we will move towards a more unified naming scheme for project and group folder names. +From now on, all such folders names shall be in [kebab-case](https://en.wikipedia.org/wiki/Letter_case#Kebab_case). +This is Berlin after all. + +Detailed communication about the move will be communicated via the cluster mailinglist and the [user forum](https://hpc-talk.cubi.bihealth.org/). +For technical help, please consult the [Data Migration Tips and tricks](migration-faq.md). + ## Why is this happening? `/fast` is based on a high performance proprietary hardware (DDN) & file system (GPFS). The company selling it has terminated support which also means buying replacement parts will become increasingly difficult. @@ -24,7 +31,7 @@ So these are the three terminologies in use right now: - Cephfs-1 = Tier 1 = Hot storage = `/data/cephfs-1` - Cephfs-2 = Tier 2 = Warm storage = `/data/cephfs-2` -There are no more quotas on the number of files. +More information about CephFS can be found [here](http://localhost:8000/bih-cluster/storage/storage-locations/). ## New file locations Naturally, paths are going to change after files move to their new location. @@ -36,42 +43,24 @@ Due to the increase in storage quality options, there will be some more folders - Scratch on Tier 1: `/data/cephfs-1/scratch/groups//users/` !!! warning "Important" - User work & scratch spaces are now part of the user's group folder. - This means, groups should coordinate internally to distribute their allotted quota according to each user's needs. 
- -The implementation is done _via_ symlinks created by default when the user account is moved to its new destination. - -| Symlink named | Points to | -|:--------------|:----------| -| `/data/cephfs-1/home/users//work` | `/data/cephfs-1/work/groups//users/` | -| `/data/cephfs-1/home/users//scratch` | `/data/cephfs-1/scratch/groups//users/` | - -Additional symlinks are created from the user's home directory to avoid storing large files (R packages for example) in their home. -The full list of symlinks is: + User `work` & `scratch` spaces are now part of the user's group folder. + This means, groups need to coordinate internally to distribute their allotted quota according to each user's needs. -- HPC web portal cache: `ondemand` -- General purpose caching: `.cache` & `.local` -- Containers: `.singularity` & `.apptainer` -- Programming languages libraries & registry: `R`, `.theano`, `.cargo`, `.cpan`, `.cpanm`, & `.npm` -- Others: `.ncbi`, `.vs` +The implementation is done _via_ symlinks created by default when the user account is moved to its new destination: -!!! warning "Important" - Automatic symlink creation will not create a symlink to any conda installation. +- `~/work -> /data/cephfs-1/work/groups//users/` +- `~/scratch -> /data/cephfs-1/scratch/groups//users/` ### Groups - Work on Tier 1: `/data/cephfs-1/work/groups/` - Scratch on Tier 1: `/data/cephfs-1/scratch/groups/` -- Mirrored work on Tier 2: `/data/cephfs-2/mirrored/groups/` - -!!! note - Un-mirrored work space on Tier 2 is available on request. +- Tier 2 storage: `/data/cephfs-2/unmirrored/groups/` +- Mirrored space on Tier 2 is available on request. ### Projects - Work on Tier 1: `/data/cephfs-1/work/projects/` - Scratch on Tier 1: `/data/cephfs-1/scratch/projects/` - -!!! note - Tier 2 work space (mirrored & un-mirrored) is available on request. +- Tier 2 storage is available on request. ## Recommended practices ### Data locations @@ -81,20 +70,20 @@ The full list of symlinks is: - Tier 2 mirrored: Extra layer of security. Longer term storage of invaluable data. #### Folders -- Home: Persistent storage for configuration files, templates, generic scripts, & small documents. -- Work: Persistent storage for conda environments, R packages, data actively processed or analyzed. -- Scratch: Non-persistent storage for temporary or intermediate files, caches, etc. Automatically deleted after 14 days. +- Home: Configuration files, templates, generic scripts, & small documents. +- Work: Conda environments, R packages, data actively processed or analyzed. +- Scratch: Non-persistent storage for temporary or intermediate files, caches, etc. ### Project life cycle -1. Import the raw data on Tier 2 for validation (checksums, …) +1. Import raw data on Tier 2 for validation (checksums, …) 2. Stage raw data on Tier 1 for QC & processing. -3. Save processing results to Tier 2 and validate copies. +3. Save processing results to Tier 2. 4. Continue analysis on Tier 1. -5. Save analysis results on Tier 2 and validate copies. +5. Save analysis results on Tier 2. 6. Reports & publications can remain on Tier 2. -7. After publication (or the end of the project), files on Tier 1 can be deleted. +7. After publication (or the end of the project), files on Tier 1 should be deleted. -### Example use cases +## Example use cases Space on Tier 1 is limited. Your colleagues, other cluster users, and admins will be very grateful if you use it only for files you actively need to perform read/write operations on. 
@@ -103,7 +92,7 @@ This means main project storage should probably always be on Tier 2 with workflo These examples are based on our experience of processing diverse NGS datasets. Your mileage may vary but there is a basic principle that remains true for all projects. -#### DNA sequencing (WES, WGS) +### DNA sequencing (WES, WGS) Typical Whole Genome Sequencing data of a human sample at 100x coverage requires about 150 GB of storage, Whole Exome Sequencing files occupy between 6 and 30 GB. These large files require considerable I/O resources for processing, in particular for the mapping step. @@ -118,7 +107,7 @@ A prudent workflow for these kind of analysis would therefore be the following: Don't forget to use your `scratch` area for transient operations, for example to sort your `bam` file after mapping. More information on how to efficiently set up your temporary directory [here](../best-practice/temp-files.md). -#### bulk RNA-seq +### bulk RNA-seq Analysis of RNA expression datasets are typically a long and iterative process, where the data must remain accessible for a significant period. However, there is usually no need to keep raw data files and mapping results available once the gene & transcripts counts have been generated. @@ -137,7 +126,7 @@ A typical workflow would be: If using `STAR`, don't forget to use your `scratch` area for transient operations. More information on how to efficiently set up your temporary directory [here](../best-practice/temp-files.md) -#### scRNA-seq +### scRNA-seq The analysis workflow of bulk RNA & single cell dataset is conceptually similar: Large raw files need to be processed once and only the outcome of the processing (gene counts matrices) are required for downstream analysis. @@ -149,7 +138,7 @@ Therefore, a typical workflow would be: 4. **Remove raw data, bam & count files from Tier 1.** 5. Downstream analysis with `seurat`, `scanpy`, or `Loupe Browser`. -#### Machine learning +### Machine learning There is no obvious workflow that covers most used cases for machine learning. However, diff --git a/bih-cluster/mkdocs.yml b/bih-cluster/mkdocs.yml index c3cf69a6b..6068608e8 100644 --- a/bih-cluster/mkdocs.yml +++ b/bih-cluster/mkdocs.yml @@ -119,6 +119,7 @@ nav: - "Scratch Cleanup": storage/scratch-cleanup.md - "Querying Quotas": storage/querying-storage.md - "Storage Migration": storage/storage-migration.md + - "Migration FAQ": "storage/migration-faq.md" - "Cluster Scheduler": - slurm/overview.md - slurm/background.md From 4a1c24ed0f5964dcff769d4080a5b8a3630db336 Mon Sep 17 00:00:00 2001 From: Eric Blanc Date: Thu, 14 Mar 2024 15:31:00 +0100 Subject: [PATCH 11/12] docs: fixed typos & added note to hashdeep --- bih-cluster/docs/storage/migration-faq.md | 7 +++++-- bih-cluster/docs/storage/storage-migration.md | 4 ++-- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/bih-cluster/docs/storage/migration-faq.md b/bih-cluster/docs/storage/migration-faq.md index f9f12d97b..7232af9a2 100644 --- a/bih-cluster/docs/storage/migration-faq.md +++ b/bih-cluster/docs/storage/migration-faq.md @@ -1,6 +1,6 @@ # Data Migration Tips and tricks Please use `hpc-transfer-1` and `hpc-transfer-2` for moving large amounts of files. -This not only leaves the compute notes available for actual computation, but also has now risk of your jobs being killed by Slurm. +This not only leaves the compute notes available for actual computation, but also has no risk of your jobs being killed by Slurm. 
You should also use `tmux` to not risk connection loss during long running transfers.
 
 ## Useful commands
@@ -24,6 +24,9 @@ $ rm -r $SOURCE
     When defining your source location, do not use the `*` wildcard character.
     This will not match hidden (dot) files and leave them behind.
 
+!!! Note
+    Paranoid users may want to consider using `hashdeep` to ensure that all files were successfully copied.
+
 ## Conda environments
 Conda environment tend to not react well when the folder they are stored in is moved from its original location.
 There are numerous ways to move the state of your environments, which are described [here](https://www.anaconda.com/blog/moving-conda-environments).
@@ -49,4 +52,4 @@ $ conda env create -f environment.yml
 
     ```sh
     $ conda activate /fast/home/users/your-user/path/to/conda/envs/env-name-here
-    ```
\ No newline at end of file
+    ```
diff --git a/bih-cluster/docs/storage/storage-migration.md b/bih-cluster/docs/storage/storage-migration.md
index 585dfd194..096054717 100644
--- a/bih-cluster/docs/storage/storage-migration.md
+++ b/bih-cluster/docs/storage/storage-migration.md
@@ -117,7 +117,7 @@ A typical workflow would be:
 
 1. Copy your `fastq` files from Tier 2 to Tier 1.
 2. Perform raw data quality control, and store the outcome on Tier 2.
-3. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 1.
+3. Get expression levels, for example using `salmon` or `STAR`, and store the results on Tier 2.
 4. Import the expression levels into `R`, using `tximport` and `DESeq2` or `featureCounts` & `edgeR`, for example.
 5. Save expression levels (`R` objects) and the output of `salmon`, `STAR`, or any mapper/aligner of your choice to Tier 2.
 6. **Remove raw data, bam & count files from Tier 1.**
@@ -134,7 +134,7 @@ Therefore, a typical workflow would be:
 
 1. Copy your `fastq` files from Tier 2 to Tier 1.
 2. Perform raw data QC, and store the results on Tier 2.
-3. Get the count matrix, e. g. using `Cell Ranger` or `alevin-fry`, perform count matrix QC and store the results on Tier 1.
+3. Get the count matrix, e. g. using `Cell Ranger` or `alevin-fry`, perform count matrix QC and store the results on Tier 2.
 4. **Remove raw data, bam & count files from Tier 1.**
 5. Downstream analysis with `seurat`, `scanpy`, or `Loupe Browser`.
 
From b21b6cd9c6aafb80a53cc7eadfb0b80b2a6dc853 Mon Sep 17 00:00:00 2001
From: Thomas Sell
Date: Fri, 15 Mar 2024 19:20:11 +0100
Subject: [PATCH 12/12] mention checksums

---
 bih-cluster/docs/storage/migration-faq.md | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/bih-cluster/docs/storage/migration-faq.md b/bih-cluster/docs/storage/migration-faq.md
index 7232af9a2..6115c8239 100644
--- a/bih-cluster/docs/storage/migration-faq.md
+++ b/bih-cluster/docs/storage/migration-faq.md
@@ -12,9 +12,16 @@ $ TARGET=/data/cephfs-2/unmirrored/projects/my-project/
 $ rsync -ah --stats --progress --dry-run $SOURCE $TARGET
 ```
 
-2. Remove the `--dry-run` flag to perform the actual copy process.
-3. If you are happy with how things are, add the `--remove-source-files` flag to `rsync`.
-4. Check if all files are gone from the SOURCE folder and delete:
+2. Remove the `--dry-run` flag to start the actual copying process.
+3. Perform a second `rsync` to check if all files were successfully transferred.
+   Paranoid users might want to add the `--checksum` flag to `rsync` or use `hashdeep`.
+   Please note the flag `--remove-source-files` which will do exactly as the name suggests,
+   but leaves empty directories behind.
+```sh +$ rsync -ah --stats --remove-source-files --dry-run $SOURCE $TARGET +``` +4. Again, remove the `--dry-run` flag to start the actual deletion. +5. Check if all files are gone from the SOURCE folder and remove the empty directories: ```sh $ find $SOURCE -type f | wc -l $ rm -r $SOURCE @@ -22,10 +29,7 @@ $ rm -r $SOURCE !!! Warning When defining your source location, do not use the `*` wildcard character. - This will not match hidden (dot) files and leave them behind. - -!!! Note - Paranoid users may want to consider using `hashdeep` to ensure that all files were successfully copied. + It will not match hidden (dot) files and leave them behind. ## Conda environments Conda environment tend to not react well when the folder they are stored in is moved from its original location.