Skip to content

Latest commit

 

History

History
345 lines (275 loc) · 17.9 KB

File metadata and controls

345 lines (275 loc) · 17.9 KB

Description

This module creates a startup script that will execute a list of runners in the order they are specified. The runners are copied to a GCS bucket at deployment time and then copied into the VM as they are executed after startup.

Each runner receives the following attributes:

  • destination: (Required) The name of the file at the destination VM. If an absolute path is provided, the file will be copied to that path, otherwise the file will be created in a temporary folder and deleted once the startup script runs.

  • type: (Required) The type of the runner, one of the following:

    • shell: The runner is a shell script and will be executed once copied to the destination VM.

    • ansible-local: The runner is an ansible playbook and will run on the VM with the following command line flags:

      ansible-playbook --connection=local --inventory=localhost, \
        --limit localhost <<DESTINATION>>
    • data: The data or file specified will be copied to <<DESTINATION>>. No action will be performed after the data is staged. This data can be used by subsequent runners or simply made available on the VM for later use.

  • content: (Optional) Content to be uploaded and, if type is either shell or ansible-local, executed. Must be defined if source is not.

  • source: (Optional) A path to the file or data you want to upload. Must be defined if content is not. The source path is relative to the deployment group directory. To ensure correctness of path use ghpc_stage function, that would copy referenced file to the deployment group directory. For example:

    source: $(ghpc_stage("path/to/file"))

    For more examples with context, see the example blueprint snippet. To reference any other source file, an absolute path must be used.

  • args: (Optional) Arguments to be passed to shell or ansible-local runners. For shell runners, these will be passed as arguments to the script when it is executed. For ansible-local runners, they will be appended to a list of default arguments that invoke ansible-playbook on the localhost. Thereforeargs should not include any arguments that alter this behavior, such as --connection, --inventory, or --limit.

Runner dependencies

ansible-local runners require Ansible to be installed in the VM before running. To support other playbook runners in the Cluster Toolkit, we install version 2.11 of ansible-core as well as the larger package of collections found in ansible version 4.10.0.

If an ansible-local runner is found in the list supplied to this module, a script to install Ansible will be prepended to the list of runners. This behavior can be disabled by setting var.prepend_ansible_installer to false. This script will do the following at VM startup:

  • Install system-wide python3 if not already installed using system package managers (yum, apt-get, etc)
  • Install python3-distutils system-wide in debian and ubuntu based environments. This can be a missing dependency on system installations of python3 for installing and upgrading pip.
  • Install system-wide pip3 if not already installed and upgrade pip3 if the version is not at least 18.0.
  • Install and create a virtual environment located at /usr/local/ghpc-venv.
  • Install ansible into this virtual environment if the current version of ansible is not version 2.11 or higher.

To use the virtual environment created by this script, you can activate it by running the following command on the VM:

source /usr/local/ghpc-venv/bin/activate

You may also need to provide the correct python interpreter as the python3 binary in the virtual environment. This can be done by adding the following flag when calling ansible-playbook:

-e ansible_python_interpreter=/usr/local/ghpc-venv/bin/activate

NOTE: ansible-playbook and other ansible command line tools will only be accessible from the command line (and in your PATH variable) after activating this environment.

Staging the runners

Runners will be uploaded to a GCS bucket. This bucket will be created by this module and named as ${var.deployment_name}-startup-scripts-${random_id}. VMs using the startup script created by this module will pull the runners content from a GCS bucket and therefore must have access to GCS.

NOTE: To ensure access to GCS, set the following OAuth scope on the instance using the startup scripts: https://www.googleapis.com/auth/devstorage.read_only.

This is set as a default scope in the vm-instance, schedMD-slurm-on-gcp-login-node and schedMD-slurm-on-gcp-controller modules

Tracking startup script execution

For more information on how to use startup scripts on Google Cloud Platform, please refer to this document.

To debug startup scripts from a Linux VM created with startup script generated by this module:

sudo DEBUG=1 google_metadata_script_runner startup

To view outputs from a Linux startup script, run:

sudo journalctl -u google-startup-scripts.service

Monitoring Agent Installation

This startup-script module has several options for installing a Google monitoring agent. There are two relevant settings: install_stackdriver_agent and install_cloud_ops_agent.

The Stackdriver Agent also called the Legacy Cloud Monitoring Agent provides better performance under some HPC workloads. While official documentation recommends using the Cloud Ops Agent, it is recommended to use install_stackdriver_agent when performance is important.

Stackdriver Agent Installation

If an image or machine already has Cloud Ops Agent installed and you would like to instead use the Stackdriver Agent, the following script will remove the Cloud Ops Agent and install the Stackdriver Agent.

# Remove Cloud Ops Agent
sudo systemctl stop google-cloud-ops-agent.service
sudo systemctl disable google-cloud-ops-agent.service
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --uninstall
sudo bash add-google-cloud-ops-agent-repo.sh --remove-repo

# Install Stackdriver Agent
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh --also-install
curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh
sudo bash add-logging-agent-repo.sh --also-install
sudo service stackdriver-agent start

Cloud Ops Agent Installation

If an image or machine already has the Stackdriver Agent installed and you would like to instead use the Cloud Ops Agent, the following script will remove the Stackdriver Agent and install the Cloud Ops Agent.

# UnInstall Stackdriver Agent

sudo systemctl stop stackdriver-agent.service
sudo systemctl disable stackdriver-agent.service
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo dpkg --configure -a
sudo bash add-monitoring-agent-repo.sh --uninstall
sudo bash add-monitoring-agent-repo.sh --remove-repo

# Install ops-agent

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
sudo service google-cloud-ops-agent start

As a reminder, this should be in a startup script, which should run on all Compute nodes via the compute_startup_script on the controller.

Testing Installation

You can test if one of the agents is running using the following commands:

# For Cloud Ops Agent
$ sudo systemctl is-active google-cloud-ops-agent"*"
active
active
active
active

# For Legacy Monitoring and Logging Agents
$ sudo service stackdriver-agent status
stackdriver-agent is running           [  OK  ]
$ sudo service google-fluentd status
google-fluentd is running              [  OK  ]

For official documentation see troubleshooting docs:

Example

- id: startup
  source: modules/scripts/startup-script
  settings:
    runners:
      # Some modules such as filestore have runners as outputs for convenience:
      - $(homefs.install_nfs_client_runner)
      # These runners can still be created manually:
      # - type: shell
      #   destination: "modules/filestore/scripts/install_nfs_client.sh"
      #   source: "modules/filestore/scripts/install_nfs_client.sh"
      - type: ansible-local
        destination: "modules/filestore/scripts/mount.yaml"
        source: "modules/filestore/scripts/mount.yaml"
      - type: data
        source: /tmp/foo.tgz
        destination: /tmp/bar.tgz
      - type: shell
        destination: "decompress.sh"
        content: |
          #!/bin/sh
          echo $2
          tar zxvf /tmp/$1 -C /
        args: "bar.tgz 'Expanding file'"

- id: compute-cluster
  source: modules/compute/vm-instance
  use: [homefs, startup]

In the above example, a new GCS bucket is created to upload the startup-scripts. But in the case where the user wants to reuse existing GCS bucket or folder, they are able to do so by using the gcs_bucket_path as shown in the below example

- id: startup
  source: modules/scripts/startup-script
  settings:
    gcs_bucket_path: gs://user-test-bucket/folder1/folder2
    install_stackdriver_agent: true

- id: compute-cluster
  source: modules/compute/vm-instance
  use: [startup]

License

Copyright 2023 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

Name Version
terraform >= 1.3
google >= 3.83
local >= 2.0.0
random ~> 3.0

Providers

Name Version
google >= 3.83
local >= 2.0.0
random ~> 3.0

Modules

No modules.

Resources

Name Type
google_storage_bucket.configs_bucket resource
google_storage_bucket_iam_binding.viewers resource
google_storage_bucket_object.scripts resource
local_file.debug_file resource
random_id.resource_name_suffix resource

Inputs

Name Description Type Default Required
ansible_virtualenv_path Virtual environment path in which to install Ansible string "/usr/local/ghpc-venv" no
bucket_viewers Additional service accounts or groups, users, and domains to which to grant read-only access to startup-script bucket (leave unset if using default Compute Engine service account) list(string) [] no
configure_ssh_host_patterns If specified, it will automate ssh configuration by:
- Defining a Host block for every element of this variable and setting StrictHostKeyChecking to 'No'.
Ex: "hpc*", "hpc01*", "ml*"
- The first time users log-in, it will create ssh keys that are added to the authorized keys list
This requires a shared /home filesystem and relies on specifying the right prefix.
list(string) [] no
debug_file Path to an optional local to be written with 'startup_script'. string null no
deployment_name Name of the HPC deployment, used to name GCS bucket for startup scripts. string n/a yes
docker Install and configure Docker
object({
enabled = optional(bool, false)
world_writable = optional(bool, false)
daemon_config = optional(string, "")
})
{
"enabled": false
}
no
enable_docker_world_writable DEPRECATED: use var.docker bool null no
gcs_bucket_path The GCS path for storage bucket and the object, starting with gs://. string null no
http_no_proxy Domains for which to disable http_proxy behavior. Honored only if var.http_proxy is set string ".google.com,.googleapis.com,metadata.google.internal,localhost,127.0.0.1" no
http_proxy Web (http and https) proxy configuration for pip, apt, and yum/dnf and interactive shells string "" no
install_ansible Run Ansible installation script if either set to true or unset and runner of type 'ansible-local' are used. bool null no
install_cloud_ops_agent Warning: Consider using install_stackdriver_agent for better performance. Run Google Ops Agent installation script if set to true. bool false no
install_cloud_rdma_drivers If true, will install and reload Cloud RDMA drivers. Currently only supported on Rocky Linux 8. bool false no
install_docker DEPRECATED: use var.docker. bool null no
install_stackdriver_agent Run Google Stackdriver Agent installation script if set to true. Preferred over ops agent for performance. bool false no
labels Labels for the created GCS bucket. Key-value pairs. map(string) n/a yes
local_ssd_filesystem Create and mount a filesystem from local SSD disks (data will be lost if VMs are powered down without enabling migration); enable by setting mountpoint field to a valid directory path.
object({
fs_type = optional(string, "ext4")
mountpoint = optional(string, "")
permissions = optional(string, "0755")
})
{
"fs_type": "ext4",
"mountpoint": "",
"permissions": "0755"
}
no
prepend_ansible_installer DEPRECATED. Use install_ansible=false to prevent ansible installation. bool null no
project_id Project in which the HPC deployment will be created string n/a yes
region The region to deploy to string n/a yes
runners List of runners to run on remote VM.
Runners can be of type ansible-local, shell or data.
A runner must specify one of 'source' or 'content'.
All runners must specify 'destination'. If 'destination' does not include a
path, it will be copied in a temporary folder and deleted after running.
Runners may also pass 'args', which will be passed as argument to shell runners only.
list(map(string)) [] no

Outputs

Name Description
compute_startup_script script to load and run all runners, as a string value. Targets the inputs for the slurm controller.
controller_startup_script script to load and run all runners, as a string value. Targets the inputs for the slurm controller.
startup_script script to load and run all runners, as a string value.