GenePattern Docker Architecture

liefeld edited this page Feb 28, 2018 · 20 revisions

Introduction

Experience over the last decade has shown that managing the deployment of modules on GenePattern servers is much more time-consuming than desired. Essentially, this is a package management job. The problem is that the collection of modules is very large (>250), the modules use a wide variety of programming languages and language versions (>25), and many have multiple versions that may differ in their library or OS dependencies. In addition, the GenePattern production servers use load balancing systems, so this complexity must be propagated onto the compute cluster nodes. Because the nodes need different compute environments for different modules, this has required the development of complex "run-with-env" scripts, which use a variety of methods (dotkits, setting PATHs and variables) to dynamically set up the compute environment a module needs before it is run.
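To make the "setting PATHs and variables" approach concrete, here is a minimal Python sketch of what such a wrapper might do. It is not the actual run-with-env script: the module identifier, directory paths, and the `MODULE_ENVS` table are hypothetical and exist only for illustration.

```python
import os
import subprocess

# Hypothetical per-module environment table; the module id and paths are
# illustrative assumptions, not real GenePattern configuration.
MODULE_ENVS = {
    "ExampleModule:1": {
        "path_prepend": ["/opt/example/R-3.1/bin"],
        "variables": {"R_LIBS": "/opt/example/R-3.1/site-library"},
    },
}


def build_env(module_id, base_env=None):
    """Return a copy of the environment adjusted for the given module."""
    env = dict(os.environ if base_env is None else base_env)
    spec = MODULE_ENVS[module_id]
    # Prepend module-specific directories so they take priority over defaults.
    existing = [env["PATH"]] if env.get("PATH") else []
    env["PATH"] = os.pathsep.join(spec["path_prepend"] + existing)
    env.update(spec["variables"])
    return env


def run_with_env(module_id, command):
    """Run the command under the module's environment; return its exit code."""
    return subprocess.run(command, env=build_env(module_id)).returncode
```

Every module version needs its own entry in a table like this, and every cluster node needs the referenced directories installed, which is where the maintenance burden comes from.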

Container systems (e.g. Docker, Singularity) can help us manage this complexity. Essentially, a container can be thought of as a lightweight virtual machine (VM) that has some arbitrary selection of libraries and software packages installed. The exact contents of a container are defined by its creator in a file describing the environment (e.g. a 'Dockerfile' for Docker), which specifies the base operating system and which packages and files are to be installed or copied into the container. The container thus encapsulates a compute environment that is isolated from the host operating system on which the container is run.
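The three ingredients named above (base operating system, packages to install, files to copy) map onto the Dockerfile instructions `FROM`, `RUN`, and `COPY`. The following Python sketch renders a Dockerfile from those ingredients. The function name, the example image, the package, and the file paths are assumptions chosen for illustration, and the `apt-get` line assumes a Debian or Ubuntu base image.

```python
def render_dockerfile(base_image, packages, files):
    """Render Dockerfile text from a base image, packages, and files to copy."""
    lines = [f"FROM {base_image}"]
    if packages:
        # Assumes a Debian/Ubuntu base image, where apt-get is available.
        lines.append(
            "RUN apt-get update && apt-get install -y " + " ".join(packages)
        )
    for src, dest in files:
        # Each (src, dest) pair copies a file from the build context.
        lines.append(f"COPY {src} {dest}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical module environment: an R runtime plus one script.
    print(
        render_dockerfile(
            "ubuntu:16.04",
            ["r-base"],
            [("run_module.R", "/module/run_module.R")],
        )
    )
```

In practice a Dockerfile is usually written by hand; generating it here only shows how little information is needed to pin down a module's compute environment.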

The GenePattern team have determined that by replacing the current run-with-env scripts with containers we can greatly simplify the management of GenePattern servers, and at the same time improve the reproducibility of GenePattern modules.

Also, while Docker has some security issues that prevent it from being used in some data centers, the Dockerfile and Docker containers are a de facto standard at the moment and can be run in other container systems such as Singularity, which are more secure and more readily accepted by the managers of data centers.
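As a rough illustration of that portability, the sketch below builds the command line for running the same image under either system. It assumes the image has been pushed to a Docker registry, so that Singularity can fetch it through a `docker://` URI. The function name and the image name are hypothetical.

```python
def container_command(system, image, command):
    """Build the argument list to run a command inside the image."""
    if system == "docker":
        # --rm removes the container once the command exits.
        return ["docker", "run", "--rm", image] + list(command)
    if system == "singularity":
        # docker:// asks Singularity to pull the image from a Docker registry.
        return ["singularity", "exec", f"docker://{image}"] + list(command)
    raise ValueError(f"unsupported container system: {system}")


if __name__ == "__main__":
    image = "example/module-image:1"  # hypothetical image name
    module_cmd = ["Rscript", "/module/run_module.R"]
    for system in ("docker", "singularity"):
        print(" ".join(container_command(system, image, module_cmd)))
```

Only the launcher changes between the two systems; the image definition and the module command stay the same.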

Some vocabulary

  • Compute Environment - The set of things (OS, libraries, code) that a module requires to be in place for its execution.
  • Container - A lightweight virtual machine with software packages and libraries installed.
  • Container System - Software that runs containers, such as Docker, Singularity, or AWS Batch.
  • Docker - The best-known container system. Its name is sometimes used in place of the generic term "container system" (much as we commonly say Kleenex instead of tissue). It has a flaw in that the process that runs containers does so as root, and thus a maliciously crafted container may be able to get root access to the system running the container.
  • Library - Two possible meanings. In the context of a container or OS, it means a low-level compiled DLL or its equivalent that makes system-level features available to anything running on the container or OS. In the context of a module, it may mean the same thing, or it may refer to a collection of programming-language-specific software such as an R or Python module.
  • Singularity - A container system that runs Docker containers in a secure way.

The Big Picture