
GenePattern Docker Architecture

liefeld edited this page Feb 28, 2018 · 20 revisions

Introduction

Experience over the last decade has shown that managing the deployment of modules on GenePattern servers has been much more time consuming than desired. Essentially this is a job of package management. The problem arises because the collection of modules is very large (>250), the modules use a wide variety of programming languages and language versions (>25), and many have multiple module versions which may have different library or OS dependencies. In addition, the GenePattern production servers use load balancing systems, so this complexity needs to be propagated onto compute cluster nodes. Since the nodes need to provide different compute environments for different modules, this has necessitated the development of complex "run-with-env" scripts which use a variety of methods (dotkits, setting PATHs and variables) to dynamically set up the compute environment a module needs before the module is run. A final twist is that on the current Broad-hosted public system, some of the modules use libraries that are tied to a specific version of Linux (CentOS), which means the dynamic setup can ONLY work on nodes running an old version of CentOS, of which there are very few, and they are dying.

Container systems (e.g. Docker, Singularity) can help us manage this complexity. Essentially, a container is a lightweight virtual machine (VM) that has some arbitrary selection of libraries and software packages installed. The exact contents of a container are defined by its creator using a file defining the environment (e.g. a 'Dockerfile' for Docker) which specifies the base operating system and what packages and files are to be installed or copied into the container. The container thus provides an encapsulation of a compute environment that is isolated from the host operating system upon which the container is run.

The GenePattern team have determined that by replacing the current run-with-env scripts with containers we can greatly simplify the management of GenePattern servers, and at the same time improve the reproducibility of GenePattern modules.

Also, while Docker has some security issues that prevent it from being used in some data centers, the Dockerfile and Docker containers are a de facto standard at the moment and can be run in other container systems, such as Singularity, that are more secure and more readily accepted by the managers of data centers.

Finally, another aspect that will be referenced below is that the code in the current GenePattern systems has always assumed that there is a shared file system between the GenePattern server and the compute nodes.

Some vocabulary

  • Compute Environment - The set of things (OS, libraries, code) that a module requires to be in place for its execution.
  • Container - A lightweight virtual machine with software packages and libraries installed.
  • Container System - Software that runs containers. Examples are Docker, Singularity, and AWS Batch.
  • Docker - The best-known container system. Sometimes used in place of the generic container system name (like we commonly say Kleenex instead of tissue). It has a flaw in that the process that runs containers does so as root, and thus a maliciously crafted container may be able to get root access to the system running the container.
  • Library - Two possible meanings. In the context of a container or OS, it means a low-level compiled DLL or its equivalent that makes system level features available to anything running on the container or OS. In the context of a module it may mean the same thing, or it might refer to a collection of programming-language specific software such as an R or Python module.
  • Singularity - A container system that runs Docker containers in a secure way.

An optimization

Since we have many (250+) modules that use a much smaller number of compute environments (~25), as a boot-strapping optimization we will allow a limited amount of container re-use between modules. So instead of initially making a separate container for each of the 100 modules using Java 7, we will make a single Java 7 container and pass in the module-specific code at runtime. Ideally we would have a separate, unique container for each module with the module code inside it (and a new container version for each module version), but that is too much work to start with. We will, however, try to make module-specific containers for any new modules in the future.

The implication is that we need to pass the module's directory and an <R_Library> or <python_library> directory as well as the module input files.
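As a minimal sketch of this implication, the following models what a shared container has to be given at launch time. The names `JobSpec` and `paths_to_provide`, and all paths and image names used with them, are hypothetical and are not part of GenePattern.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobSpec:
    """Everything a shared (not module-specific) container needs at runtime."""
    image: str        # shared environment image, e.g. one Java 7 image
    module_dir: str   # module code, passed in rather than baked into the image
    library_dir: str  # the <R_Library> or <python_library> directory
    input_files: List[str] = field(default_factory=list)


def paths_to_provide(job: JobSpec) -> List[str]:
    """Return every host path that must be made visible to the container."""
    # Order is preserved and duplicates are dropped, so a path that plays
    # two roles is only listed once.
    seen: List[str] = []
    for path in [job.module_dir, job.library_dir, *job.input_files]:
        if path not in seen:
            seen.append(path)
    return seen
```

With a module-specific container, `module_dir` and `library_dir` would already be inside the image and only the input files would remain.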

The Desired end point(s)

We'll come back to the things that need to happen or be built for the following scenarios to work, but these are some of our desired end goals (we may add others later) for the conversion of GenePattern to dockerized modules.

We want to support the following three deployment architectures:

  1. Local Mac GP server + Docker
  2. Shared server + Singularity
  3. GenePattern on AWS + AWS Batch

Each of these goals requires some variation in how the container is launched (docker, singularity, AWS Batch) and how the files are provisioned to the container.

Common functionality - Different Implementations

For the containerized modules to be able to run, there are a few things that must be done, but each must be done differently on each deployment architecture. The implications of these varied implementations are discussed below.

a. Transfer files

  • input/output
  • module code ( contents)
  • module libraries

b. Launch jobs

c. Check job status

d. Retrieve outputs on the GenePattern server
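One way to picture "common functionality, different implementations" is a single interface with one method per step (a to d) and one subclass per deployment architecture. This is only an illustrative sketch: `ContainerBackend`, `SharedFilesystemBackend` and `run_job` are hypothetical names, not existing GenePattern classes, and the stand-in backend just records what it would do.

```python
from abc import ABC, abstractmethod


class ContainerBackend(ABC):
    """Hypothetical interface: one method per common step (a-d)."""

    @abstractmethod
    def transfer_files(self, job_dir: str) -> None: ...

    @abstractmethod
    def launch(self, job_dir: str) -> str: ...

    @abstractmethod
    def status(self, job_id: str) -> str: ...

    @abstractmethod
    def retrieve_outputs(self, job_dir: str) -> None: ...


class SharedFilesystemBackend(ContainerBackend):
    """Stand-in for an architecture where the job directory is mounted directly."""

    def __init__(self) -> None:
        self.log = []

    def transfer_files(self, job_dir: str) -> None:
        self.log.append(("mount", job_dir))

    def launch(self, job_dir: str) -> str:
        self.log.append(("launch", job_dir))
        return "job-1"  # placeholder job id

    def status(self, job_id: str) -> str:
        return "DONE"  # placeholder: a synchronous job is done once launch returns

    def retrieve_outputs(self, job_dir: str) -> None:
        pass  # no-op: outputs were written straight into the job directory


def run_job(backend: ContainerBackend, job_dir: str) -> str:
    """Drive steps a-d in order, independent of the architecture."""
    backend.transfer_files(job_dir)
    job_id = backend.launch(job_dir)
    state = backend.status(job_id)
    backend.retrieve_outputs(job_dir)
    return state
```

The server-side code that calls `run_job` would then not need to know which container system sits behind the backend.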

Local Mac

GenePattern is running on a MacBook Pro. Modules run via containers on Docker on the same machine. Local file system mounted to containers.

a. Mount local file systems to the container (jobResults, tasklib, library path, or just once for the server root directory)

b. docker run -v localpath:localpath

c. Job runs synchronously

d. noop since jobResults is where the files were written
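The launch step for the local case can be sketched as a function that assembles the `docker run` argument list. Each host path is mounted at the same path inside the container, matching the `localpath:localpath` form above. The function name, the image name and the example paths are assumptions for illustration, and `--rm` (remove the container on exit) is an addition not mentioned in the text.

```python
from typing import List


def docker_run_command(image: str, mounts: List[str], module_cmd: List[str]) -> List[str]:
    """Build a `docker run` argument list with same-path volume mounts."""
    cmd = ["docker", "run", "--rm"]
    for path in mounts:
        # -v host_path:container_path, using the identical path on both sides
        cmd += ["-v", f"{path}:{path}"]
    return cmd + [image] + list(module_cmd)
```

A list of arguments like this could be passed to `subprocess.run`, which blocks until the container exits and so fits the synchronous behaviour described in step c.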

Shared GenePattern server

GenePattern running on a shared server. Modules run via containers on Singularity, dispatched to compute nodes. NFS drive mounted to containers.

a. Install GP server on an NFS disk. Mount NFS root directory to GP Server and compute nodes. Mount this file system to the container

b.

c.

d. noop since jobResults on NFS is where the files were written
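The launch step is not spelled out for this architecture. As one possibility only, the container invocation on a compute node could use `singularity exec` with a `--bind` mount of the NFS root and a `docker://` image reference. How the command is dispatched to a compute node is outside this sketch, and the function name and example values are hypothetical.

```python
from typing import List


def singularity_exec_command(image: str, nfs_root: str, module_cmd: List[str]) -> List[str]:
    """Build a `singularity exec` argument list that binds the NFS root."""
    return [
        "singularity", "exec",
        "--bind", f"{nfs_root}:{nfs_root}",  # same path inside and outside the container
        f"docker://{image}",                 # run an image built from a Dockerfile
    ] + list(module_cmd)
```

Binding the NFS root at the same path means file paths generated by the GP server remain valid inside the container.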

GenePattern on AWS + Batch

GenePattern running on an AWS server. Modules run via AWS Batch job submissions, dispatched to AWS Batch managed compute clusters. Files staged to/from container (on compute nodes) via AWS S3.

a. GP server does an "aws s3 sync path/to/jobdir s3://somebucket/path/to/jobdir"

b. "aws batch submit-job ..."

c. "aws batch describe-jobs ..."

d. "aws s3 sync s3://somebucket/path/to/jobdir path/to/jobdir"
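The four server-side steps can be sketched as functions that build the corresponding AWS CLI argument lists. The function names, bucket, queue and job definition values are placeholders. The `submit-job` options shown (`--job-name`, `--job-queue`, `--job-definition`) are only the basic ones, and a real submission would likely need more.

```python
from typing import List


def s3_uri(bucket: str, job_dir: str) -> str:
    """Mirror the local job directory path under the bucket."""
    return f"s3://{bucket}/{job_dir.lstrip('/')}"


def stage_in(bucket: str, job_dir: str) -> List[str]:
    # a. copy the job directory from the GP server to S3
    return ["aws", "s3", "sync", job_dir, s3_uri(bucket, job_dir)]


def submit(job_name: str, job_queue: str, job_definition: str) -> List[str]:
    # b. submit the job to AWS Batch
    return ["aws", "batch", "submit-job",
            "--job-name", job_name,
            "--job-queue", job_queue,
            "--job-definition", job_definition]


def describe(job_id: str) -> List[str]:
    # c. poll the job status using the id returned by submit-job
    return ["aws", "batch", "describe-jobs", "--jobs", job_id]


def stage_out(bucket: str, job_dir: str) -> List[str]:
    # d. copy results from S3 back to the GP server
    return ["aws", "s3", "sync", s3_uri(bucket, job_dir), job_dir]
```

Steps a and d are the same `sync` with source and destination swapped, which keeps the S3 layout identical to the job directory on the server.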

Unlike the first two deployment architectures, there is no file system shared between the GenePattern server and the container here, so we need something more between steps b and c to bring the files from S3 into the container.
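The source does not say what this extra step looks like. One possible shape, offered purely as an assumption, is a wrapper that runs inside the container: it pulls the job directory from S3, runs the module, and pushes the results back. The sketch assumes the AWS CLI is available inside the container. `container_wrapper_steps` is a hypothetical name.

```python
from typing import List


def container_wrapper_steps(
    s3_job_uri: str, local_job_dir: str, module_cmd: List[str]
) -> List[List[str]]:
    """Return the commands a wrapper inside the container would run, in order."""
    return [
        ["aws", "s3", "sync", s3_job_uri, local_job_dir],  # bring inputs into the container
        list(module_cmd),                                  # run the module itself
        ["aws", "s3", "sync", local_job_dir, s3_job_uri],  # push outputs back to S3
    ]
```

With the results pushed back to S3 by the wrapper, step d on the GP server can then sync them into jobResults.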