GenePattern Docker Architecture
Experience over the last decade has shown that managing the deployment of modules on GenePattern servers has been much more time consuming than desired. Essentially this is a job of package management. The problem arises because the collection of modules is very large (>250), the modules use a wide variety of programming languages and language versions (>25), and many have multiple module versions which may have different library or OS dependencies. In addition, the GenePattern production servers use load balancing systems, so this complexity needs to be propagated onto compute cluster nodes. Since the nodes need to provide different compute environments for different modules, this has necessitated the development of complex "run-with-env" scripts which use a variety of methods (dotkits, setting PATHs and variables) to dynamically set up the compute environment a module needs before it is run. A final twist is that on the current Broad-hosted public system, some of the modules use libraries that are tied to a specific version of Linux (CentOS), which means the dynamic setup can ONLY work on nodes running an old version of CentOS, of which there are very few, and they are dying.
Container systems (e.g. Docker, Singularity) can help us manage this complexity. Essentially a container can be thought of as a lightweight virtual machine (VM) that has some arbitrary selection of libraries and software packages installed. The exact contents of a container are defined by its creator using a file defining the environment (e.g. a 'Dockerfile' for Docker) which specifies the base operating system and what packages and files are to be installed or copied into the container. The container thus provides an encapsulation of a compute environment that is isolated from the host operating system upon which the container is run.
The GenePattern team have determined that by replacing the current run-with-env scripts with containers we can greatly simplify the management of GenePattern servers, and at the same time improve the reproducibility of GenePattern modules.
Also, while Docker has some security issues that prevent it from being used in some data centers, the Dockerfile and Docker containers are a de facto standard at the moment and can be run in other container systems such as Singularity, which are more secure and more readily accepted by the managers of data centers.
Finally, another aspect that will be referenced below is that in the current GenePattern systems, the code has always assumed that there is a shared file system between the GenePattern server and the compute nodes.
- Compute Environment - The set of things (OS, libraries, code) that a module requires to be in place for its execution.
- Container - A lightweight virtual machine with software packages and libraries installed.
- Container System - Software that runs containers, such as Docker, Singularity or AWS Batch.
- Docker - The best-known container system. Sometimes used in place of the generic container system name (like we commonly say Kleenex instead of tissue). It has a flaw in that the process that runs containers does so as root, and thus a maliciously crafted container may be able to get root access to the system running the container.
- Library - Two possible meanings. In the context of a container or OS, it means a low-level compiled DLL or its equivalent that makes system level features available to anything running on the container or OS. In the context of a module it may mean the same thing, or it might refer to a collection of programming-language specific software such as an R or Python module.
- Singularity - A container system that runs Docker containers in a secure way.
Since we have many (250+) modules that use a much smaller number of compute environments (~25), as a boot-strapping optimization, we will allow a limited amount of container re-use between modules. So instead of initially making a separate container for each of the 100 modules using Java 7, we will make a single Java 7 container and pass in the module-specific code at runtime. Ideally we would have a separate, unique container for each module with the module code inside it (and a new container version for each module version), but that is too much work to start with. We will, however, try to make module-specific containers for any new modules in the future.
The implication is that we need to pass the module's directory and an <R_Library> or <python_library> directory as well as the module input files.
We'll come back to the things that need to happen or be built for the following scenarios to work, but these are some of our desired end goals (we may add others later) for the conversion of GenePattern to dockerized modules.
We want to support the following deployment architectures:
- local Mac GP server + Docker
- shared server + Singularity
- GenePattern on AWS + AWS Batch
- (stretch goal) GenePattern modules called directly from a GPNB (no GenePattern server)
Each of these goals requires some variation in how the container is launched (docker, singularity, AWS Batch) and how the files are provisioned to the container.
In addition we would like, if possible, to allow the use of third-party (i.e. module author) containers that do not have to be customized for use with GenePattern. This is fairly simple for architectures 1, 2 and 4 but is more problematic for architecture 3 (AWS Batch).
For the containerized modules to run, there are a few things that must be done, but done differently on each deployment architecture. The implications of these varied implementations are discussed below.
a. Transfer files
- input/output
- module code ( contents)
- module libraries
b. Launch jobs
c. Check Job Status
d. Retrieve outputs on the GenePattern server
GenePattern is running on a MacBook Pro. Modules run via containers on Docker on the same machine. The local file system is mounted to the containers.
a. Mount local file systems to the container (jobResults, tasklib, library path or just once for the server root directory)
b. docker run -v localpath:localpath
c. Job runs synchronously
d. noop since jobResults is where the files were written
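As a rough sketch of step b, the following builds a "docker run" command line that mounts each local directory at the identical path inside the container. The image name, the directory paths and the use of bash to run a command script are placeholder assumptions for illustration, not values taken from a GenePattern installation.

```python
def docker_run_argv(image, mount_dirs, working_dir, command):
    """Build a `docker run` argv that mounts each host dir at the same path."""
    argv = ["docker", "run", "--rm"]
    for path in mount_dirs:
        # -v localpath:localpath keeps paths identical inside and outside
        # the container, so the module command line needs no rewriting.
        argv += ["-v", f"{path}:{path}"]
    # Run the module from the job directory so outputs land in jobResults.
    argv += ["-w", working_dir, image]
    argv += list(command)
    return argv


if __name__ == "__main__":
    # Hypothetical image and paths, for illustration only.
    print(" ".join(docker_run_argv(
        "example/module-image:1.0",
        ["/gp/jobResults/42", "/gp/taskLib/ExampleModule"],
        "/gp/jobResults/42",
        ["bash", "exec.sh"],
    )))
```

Because the job directory is mounted rather than copied, anything the module writes there is already visible to the server, which is why step d is a no-op.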
GenePattern running on a shared server. Modules run via containers on Singularity, dispatched to compute nodes. NFS drive mounted to containers.
a. Install the GP server on an NFS disk. Mount the NFS root directory on the GP server and on the compute nodes. Mount this file system to the container.
b. TBD
c. TBD
d. noop since jobResults on NFS is where the files were written
GenePattern running on an AWS server. Modules run via AWS Batch job submissions, dispatched to AWS Batch managed compute clusters. Files staged to/from container (on compute nodes) via AWS S3.
a. GP server does an "aws s3 sync path/to/jobdir s3://somebucket/path/to/jobdir"
b. "aws batch submit-job ..."
c. "aws batch describe-jobs ..."
d. "aws s3 sync s3://somebucket/path/to/jobdir path/to/jobdir"
Unlike the first 2 deployment architectures, we need something more between steps b and c to bring the files from S3 into the container. Currently this is handled by the scripts stored in the "container_scripts" directory of https://github.com/genepattern/docker-aws-common-scripts. The key script (runS3OnBatch.sh) is the entry point for running a module inside a container. It is run inside of a container with the following parameters:
- TASKLIB
- INPUT_FILES_DIR
- S3_ROOT
- WORKING_DIR
- EXECUTABLE
- Optionally GP_METADATA_DIR is provided via an environment variable
This script does the following. First, it syncs the TASKLIB, INPUT_FILES_DIR and GP_METADATA_DIR from the defined S3 bucket to the identical paths within the container. This should result in the module code and inputs being copied into the container. Next, it runs the EXECUTABLE, which should be a script containing the module's command line. Finally, it syncs the local TASKLIB and WORKING_DIR back to S3 so that the GenePattern server can access any generated output files (and any changes to TASKLIB, which some modules make).
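The sequence described above can be sketched as an ordered list of commands. This is not the contents of runS3OnBatch.sh; it is a minimal illustration that assumes each directory is an absolute path, that the S3 location of a directory is S3_ROOT followed by that same path, and that the EXECUTABLE is run with bash.

```python
def s3_batch_plan(s3_root, tasklib, input_files_dir, working_dir,
                  executable, gp_metadata_dir=None):
    """Return the commands, in order, for one module run inside a container."""
    root = s3_root.rstrip("/")

    # 1. Sync module code, inputs and (optionally) metadata in from S3.
    inbound = [tasklib, input_files_dir]
    if gp_metadata_dir:
        inbound.append(gp_metadata_dir)
    plan = [["aws", "s3", "sync", root + d, d] for d in inbound]

    # 2. Run the script holding the module's command line.
    #    Assumption: the script is invoked with bash.
    plan.append(["bash", executable])

    # 3. Sync TASKLIB and WORKING_DIR back so the server can see outputs.
    plan += [["aws", "s3", "sync", d, root + d] for d in (tasklib, working_dir)]
    return plan


if __name__ == "__main__":
    # Hypothetical bucket and paths, for illustration only.
    for step in s3_batch_plan("s3://somebucket", "/gp/taskLib/Example",
                              "/gp/inputs/42", "/gp/jobResults/42",
                              "/gp/jobResults/42/exec.sh"):
        print(" ".join(step))
```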
In some cases, containers have additional directories that need to be synced. For things common to a container, the script looks for files called /usr/local/bin/runS3Batch_prerun_custom.sh and /usr/local/bin/runS3Batch_postrun_custom.sh and runs them before and after the EXECUTABLE, respectively, to do additional setup. This is commonly used to sync additional R libraries in from S3. If a module requires customization beyond that of its container, the current implementation does this via additional commands in the $EXECUTABLE. It is not yet clear whether this should be abstracted out into a more explicit separation.
GenePattern Notebook calls modules directly and executes them via direct Docker calls. This could involve dispatching them to Docker running on a local laptop or, in the case of the GPNB Repository/Server, calling the same Docker swarm used to run the GPNB kernels and having it run the module in a Docker container. The idea here is to investigate the eventual transition to a serverless GenePattern, but also to reduce data transfer by moving the execution (module) closer to the data (in a GPNB or GPNB server).
a. GPNB server creates a job dir inside its container. Data files are either linked into there or a script is written there that the container will execute to gather input data. GPNB then does an "aws s3 sync path/to/jobdir s3://somebucket/path/to/jobdir".
b. GPNB calls "docker run ..." or "AWS batch submit". Note that if we are using the GPNB on AWS, we can use the same script as the batch deployment architecture (runS3OnBatch.sh) to retrieve data etc from S3 whether we use Batch or not.
c. GPNB either waits for synchronous execution to complete or makes local docker calls to check status.
d. "aws s3 sync s3://somebucket/path/to/jobdir path/to/jobdir " to get result files which the container writes to the S3 bucket.
This deployment type has not yet been prototyped.
As noted above, the containers we develop are built with the https://github.com/genepattern/docker-aws-common-scripts contents and the AWS CLI installed. This works for AWS and S3 but would require additional scripts and new container releases if we wanted to use a different cloud storage provider (e.g. Google or Microsoft).
One option might be to simply ensure the containers all include wget or curl. A generic entrypoint could then be passed the URL of the data retrieval scripts, which it would retrieve and run. Different scripts could be used for different data sources. The downside to this approach is that the execution environment within the container would become more dynamic and variable, and thus there is a higher QA burden and a higher chance of failing at reproducibility. Also, mechanisms for passing authentication tokens may be needed (AWS does this by passing roles to the containers it launches). On the other hand, it is relatively straightforward.
A second option would be to simply create additional container entry point scripts (similar to https://github.com/genepattern/docker-aws-common-scripts/blob/master/container_scripts/runS3OnBatch.sh) for each different execution environment. The downside of this is the need to re-release every container (and possibly every module) when the scripts are added or changed. This is the downside of a static (vs dynamic) approach.
A third option that has been investigated seems to have the potential for a reasonable compromise. We could create wrapper containers that contain only the (static) machinery for data transfer, plus Docker itself. Execution would then consist of running the transfer container, which would do the inbound transfer and then run the module container itself. When the module completes, the transfer container would write the results back. In addition to letting the 'transfer' containers be static (and thus testable and repeatable), this would obviate the need for module containers to have any GenePattern or AWS specific contents at all. The selection and configuration of the 'wrapper containers' would be up to the process calling the module (i.e. an individual GenePattern server or a GPNB Jupyter).
It is possible that this same idea could be accomplished using "Docker Compose" or Docker "data volumes" but this is not clear yet to this author.
Prototyping thus far shows that we can have a docker container on AWS batch launch another container on the same compute node. To run a module it requires some modifications to the current GP->AWS interface to allow it to function effectively.
It has been prototyped in 2 ways. The first approach uses the outer container to do S3 syncs to a local mounted volume on the compute node and then mounts the same local volume to the second (module) container. The second approach also does an S3 sync to a local directory, but then starts the module container in a sleep loop, uses "docker cp" to move data from the local drive into the container, executes the module, and returns the data to the local drive for syncing back to S3. This second approach is more complicated and will not be the choice for production, but it has a benefit: after the execution of the module, we have a running container that has all libraries loaded and requires no additional dynamic changes to be complete. We can then use the outer container to save this container and push a copy into the AWS ECR (prototyped and working) so that we have a completely usable container for the module for later re-use.
- generate exec.sh script
- generate lists of directories to send to/from S3
- sync directories from GP server to S3
- call aws batch with parameters (for the module and for the Docker-in-Docker, or DinD, wrapper)
- poll for job completion
- sync directories back from S3
- read exit code, link outputs, stderr, stdout back into GP server
- Create local directories
- sync data from S3 into local directories
- run module container
- mount local directories
- send exec.sh as command line
- sync local directories back to S3
- be able to run module, ignorant of GenePattern or AWS Batch, on mounted drives
- include bash to run exec.sh
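The submit and poll steps on the GenePattern server side could look roughly like the sketch below. The "aws batch submit-job" and "aws batch describe-jobs" commands are the ones named earlier; the function names are ours, the job name, queue and job definition are placeholders, and treating SUCCEEDED and FAILED as the only terminal states reflects the AWS Batch job lifecycle rather than anything taken from the current GP->AWS interface.

```python
import json

# AWS Batch jobs end in one of these two states.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def submit_job_argv(job_name, job_queue, job_definition, command):
    """Build the `aws batch submit-job` argv for one module run."""
    overrides = json.dumps({"command": list(command)})
    return ["aws", "batch", "submit-job",
            "--job-name", job_name,
            "--job-queue", job_queue,
            "--job-definition", job_definition,
            "--container-overrides", overrides]


def describe_jobs_argv(job_id):
    """Build the `aws batch describe-jobs` argv used while polling."""
    return ["aws", "batch", "describe-jobs", "--jobs", job_id]


def job_state(describe_jobs_output):
    """Extract (status, container exit code) from describe-jobs JSON output."""
    jobs = json.loads(describe_jobs_output).get("jobs", [])
    if not jobs:
        return "UNKNOWN", None
    job = jobs[0]
    # exitCode is only present once the container has actually run.
    return job.get("status"), job.get("container", {}).get("exitCode")


def is_finished(status):
    """True once polling can stop and outputs can be synced back from S3."""
    return status in TERMINAL_STATES
```

A polling loop would call describe-jobs at some interval, stop once is_finished returns true, and then sync the job directories back from S3.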
The wrapper needs the following information for its command line:
- S3 root location
- the name of the executable script, exec.sh (expected to be in GP_metadata_dir unless the script name is a constant)
- List of directories to copy from S3 into the local compute node
- it will "mkdir /dirpath" and then "aws s3 sync /dirpath /dirpath"
- the directory paths are needed, not just the S3 URLs, so that it can create the local dirs before syncing. It cannot use the exact path in all cases because it also needs to know the local paths in order to mount them into the internal container
- Tasklib, R Libraries, GP_MetaDataDir, working dir
- List of Directories to sync back after execution (if different from the inputs, this may be unnecessary)
- Module name and version or LSID (only for saving to the ECR) and/or a container tag to save a completed container as
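One way the wrapper's command line could be parsed is sketched below. Every flag name here (--s3-root, --exec-script, --sync-in, --sync-out, --save-tag) is hypothetical, as is the assumption that each directory is an absolute path stored under the S3 root at that same path; the sketch only shows how the listed pieces of information fit together.

```python
import argparse


def build_parser():
    """Parser for a hypothetical wrapper (outer container) entry point."""
    p = argparse.ArgumentParser(description="Sync dirs, run module container")
    p.add_argument("--s3-root", required=True, help="S3 root location")
    p.add_argument("--exec-script", default="exec.sh",
                   help="script holding the module command line")
    p.add_argument("--sync-in", nargs="+", required=True,
                   help="directories to copy from S3 to the compute node")
    p.add_argument("--sync-out", nargs="+", default=None,
                   help="directories to sync back (defaults to --sync-in)")
    p.add_argument("--save-tag", default=None,
                   help="optional tag for saving the completed container")
    return p


def inbound_steps(args):
    """Create each local dir, then sync it in from S3."""
    root = args.s3_root.rstrip("/")
    steps = []
    for d in args.sync_in:
        steps.append(["mkdir", "-p", d])
        steps.append(["aws", "s3", "sync", root + d, d])
    return steps


def outbound_dirs(args):
    """Directories to sync back; the inbound list unless told otherwise."""
    return args.sync_out if args.sync_out is not None else args.sync_in
```

The same directory list can then be reused as the set of volume mounts for the inner module container, which is why local paths, and not only S3 URLs, have to be passed in.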
For the STDOUT and STDERR, these could just be set inside the exec.sh script. That will capture only the module's own output.
STDOUT and STDERR for the aws sync currently go to CloudWatch. If this is not adequate, we should capture them in stdout and stderr files separate from those generated by the module.
For the module return code, capture it in exec.sh and write it to a file in the GP_metadata_dir. The DinD container script's return code is not likely to be interesting and may not be catchable in any case. The container's return code is available via the batch status messages.
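As an illustration of how the server might generate such an exec.sh, the sketch below renders a small bash script that redirects the module's stdout and stderr to files and writes the return code into the metadata directory. The function name and the file names (stdout.txt, stderr.txt, exit_code.txt) are assumptions made for this example, not the names GenePattern uses.

```python
import shlex


def render_exec_sh(module_argv, metadata_dir,
                   stdout_name="stdout.txt", stderr_name="stderr.txt",
                   exit_code_name="exit_code.txt"):
    """Render exec.sh text that captures module output and return code."""
    exit_path = metadata_dir.rstrip("/") + "/" + exit_code_name
    lines = [
        "#!/bin/bash",
        # Redirect only the module's own output; sync logs go elsewhere.
        "{} > {} 2> {}".format(shlex.join(module_argv),
                               shlex.quote(stdout_name),
                               shlex.quote(stderr_name)),
        # Save the return code before any other command overwrites $?.
        "rc=$?",
        'echo "$rc" > {}'.format(shlex.quote(exit_path)),
        'exit "$rc"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical module command line and metadata path.
    print(render_exec_sh(["java", "-jar", "module.jar", "input.gct"],
                         "/gp/jobResults/42/.gp_metadata"), end="")
```

Writing the code to a file means the server can read it after the sync back from S3 without depending on the wrapper's own exit status.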