This module creates a startup script that will execute a list of runners in the order they are specified. The runners are copied to a GCS bucket at deployment time and then copied into the VM as they are executed after startup.
Each runner receives the following attributes:
- destination: (Required) The name of the file at the destination VM. If an absolute path is provided, the file will be copied to that path; otherwise the file will be created in a temporary folder and deleted once the startup script runs.
- type: (Required) The type of the runner, one of the following:
  - shell: The runner is a shell script and will be executed once copied to the destination VM.
  - ansible-local: The runner is an Ansible playbook and will run on the VM with the following command line flags:
    ansible-playbook --connection=local --inventory=localhost, \
      --limit localhost <<DESTINATION>>
  - data: The data or file specified will be copied to <<DESTINATION>>. No action will be performed after the data is staged. This data can be used by subsequent runners or simply made available on the VM for later use.
- content: (Optional) Content to be uploaded and, if type is either shell or ansible-local, executed. Must be defined if source is not.
- source: (Optional) A path to the file or data you want to upload. Must be defined if content is not. The source path is relative to the deployment group directory. To ensure the path is correct, use the ghpc_stage function, which copies the referenced file to the deployment group directory. For example:
    source: $(ghpc_stage("path/to/file"))
  For more examples with context, see the example blueprint snippet. To reference any other source file, an absolute path must be used.
- args: (Optional) Arguments to be passed to shell or ansible-local runners. For shell runners, these will be passed as arguments to the script when it is executed. For ansible-local runners, they will be appended to a list of default arguments that invoke ansible-playbook on the localhost. Therefore args should not include any arguments that alter this behavior, such as --connection, --inventory, or --limit.
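The way args combine with each runner type can be sketched in shell. This is only an illustrative sketch, not the module's actual implementation: the helper name runner_command and the mount_point variable are hypothetical, and placing args after the playbook path is an assumption.

```shell
#!/bin/sh
# Illustrative sketch only; runner_command is a hypothetical helper,
# not part of the startup-script module.
# Usage: runner_command TYPE DESTINATION [ARGS]
# Prints the command line that would be run for the given runner.
runner_command() {
  rtype="$1"
  destination="$2"
  args="${3:-}"
  case "$rtype" in
    shell)
      # Shell runners: args are passed to the script itself.
      cmd="$destination"
      ;;
    ansible-local)
      # Default flags documented for ansible-local runners; user args are
      # appended after them (their exact position is an assumption).
      cmd="ansible-playbook --connection=local --inventory=localhost, --limit localhost $destination"
      ;;
    data)
      # Data runners are only staged; nothing is executed.
      return 0
      ;;
    *)
      echo "unknown runner type: $rtype" >&2
      return 1
      ;;
  esac
  if [ -n "$args" ]; then
    cmd="$cmd $args"
  fi
  printf '%s\n' "$cmd"
}

# Example with a hypothetical extra variable passed through args.
runner_command ansible-local /tmp/mount.yaml "-e mount_point=/home"
```

Because the default flags always come first, an args value that repeats --connection, --inventory, or --limit would conflict with them, which is why those flags should be left out.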
ansible-local runners require Ansible to be installed in the VM before running. To support other playbook runners in the Cluster Toolkit, we install version 2.11 of ansible-core as well as the larger package of collections found in ansible version 4.10.0.
If an ansible-local runner is found in the list supplied to this module, a script to install Ansible will be prepended to the list of runners. This behavior can be disabled by setting var.install_ansible to false (var.prepend_ansible_installer is deprecated).
This script will do the following at VM startup:
- Install system-wide python3 if not already installed using system package managers (yum, apt-get, etc.).
- Install python3-distutils system-wide in Debian- and Ubuntu-based environments. This can be a missing dependency on system installations of python3 for installing and upgrading pip.
- Install system-wide pip3 if not already installed and upgrade pip3 if the version is not at least 18.0.
- Install and create a virtual environment located at /usr/local/ghpc-venv.
- Install ansible into this virtual environment if the current version of ansible is not version 2.11 or higher.
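The "at least version X" checks in the steps above can be sketched as a small shell helper. This is a minimal sketch, not the installer's actual code: version_at_least is a hypothetical name, and the comparison relies on sort -V from GNU coreutils.

```shell
#!/bin/sh
# Illustrative sketch only; version_at_least is a hypothetical helper.
# Returns 0 when version $2 is greater than or equal to the minimum $1.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_at_least() {
  minimum="$1"
  current="$2"
  # The smaller of the two versions sorts first.
  lowest="$(printf '%s\n%s\n' "$minimum" "$current" | sort -V | head -n 1)"
  [ "$lowest" = "$minimum" ]
}

# Example: decide whether pip3 needs an upgrade (18.0 is the minimum named above).
if version_at_least 18.0 9.0.3; then
  echo "pip3 is recent enough"
else
  echo "pip3 should be upgraded"
fi
```

A version sort is used because a plain string comparison would rank 9.0.3 above 18.0.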
To use the virtual environment created by this script, you can activate it by running the following command on the VM:
source /usr/local/ghpc-venv/bin/activate
You may also need to tell Ansible to use the python3 binary in the virtual environment as its Python interpreter. This can be done by adding the following flag when calling ansible-playbook:
-e ansible_python_interpreter=/usr/local/ghpc-venv/bin/python3
NOTE: ansible-playbook and other ansible command line tools will only be accessible from the command line (and in your PATH variable) after activating this environment.
Runners will be uploaded to a GCS bucket. This bucket will be created by this module and named ${var.deployment_name}-startup-scripts-${random_id}. VMs using the startup script created by this module will pull the runners' content from this GCS bucket and therefore must have access to GCS.
NOTE: To ensure access to GCS, set the following OAuth scope on the instance using the startup scripts: https://www.googleapis.com/auth/devstorage.read_only. This is set as a default scope in the vm-instance, schedMD-slurm-on-gcp-login-node and schedMD-slurm-on-gcp-controller modules.
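One way to confirm that a VM has a suitable scope is to inspect the scopes reported by the metadata server. The sketch below is illustrative only: has_gcs_read_scope is a hypothetical helper, and treating the broader devstorage.read_write, devstorage.full_control, and cloud-platform scopes as sufficient is an assumption beyond the read_only scope named above.

```shell
#!/bin/sh
# Illustrative sketch only; has_gcs_read_scope is a hypothetical helper.
# Reads one OAuth scope URL per line on stdin and succeeds if any of them
# permits reading from Cloud Storage.
has_gcs_read_scope() {
  while IFS= read -r scope || [ -n "$scope" ]; do
    case "$scope" in
      */auth/devstorage.read_only|*/auth/devstorage.read_write|*/auth/devstorage.full_control|*/auth/cloud-platform)
        return 0
        ;;
    esac
  done
  return 1
}

# On a VM, the scopes can be fetched from the metadata server:
#   curl -s -H "Metadata-Flavor: Google" \
#     "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes" \
#     | has_gcs_read_scope && echo "GCS read access is in scope"
```

Scopes only bound what the VM's service account may request; the service account still needs permission to read the bucket itself.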
For more information on how to use startup scripts on Google Cloud Platform, please refer to this document.
To debug startup scripts from a Linux VM created with a startup script generated by this module, run:
sudo DEBUG=1 google_metadata_script_runner startup
To view outputs from a Linux startup script, run:
sudo journalctl -u google-startup-scripts.service
This startup-script module has several options for installing a Google monitoring agent. There are two relevant settings: install_stackdriver_agent and install_cloud_ops_agent.
The Stackdriver Agent, also called the Legacy Cloud Monitoring Agent, provides better performance under some HPC workloads. While official documentation recommends using the Cloud Ops Agent, it is recommended to use install_stackdriver_agent when performance is important.
If an image or machine already has Cloud Ops Agent installed and you would like to instead use the Stackdriver Agent, the following script will remove the Cloud Ops Agent and install the Stackdriver Agent.
# Remove Cloud Ops Agent
sudo systemctl stop google-cloud-ops-agent.service
sudo systemctl disable google-cloud-ops-agent.service
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --uninstall
sudo bash add-google-cloud-ops-agent-repo.sh --remove-repo
# Install Stackdriver Agent
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh --also-install
curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh
sudo bash add-logging-agent-repo.sh --also-install
sudo service stackdriver-agent start
If an image or machine already has the Stackdriver Agent installed and you would like to instead use the Cloud Ops Agent, the following script will remove the Stackdriver Agent and install the Cloud Ops Agent.
# Uninstall Stackdriver Agent
sudo systemctl stop stackdriver-agent.service
sudo systemctl disable stackdriver-agent.service
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo dpkg --configure -a
sudo bash add-monitoring-agent-repo.sh --uninstall
sudo bash add-monitoring-agent-repo.sh --remove-repo
# Install ops-agent
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
sudo service google-cloud-ops-agent start
As a reminder, this should be in a startup script, which should run on all compute nodes via the compute_startup_script on the controller.
You can test if one of the agents is running using the following commands:
# For Cloud Ops Agent
$ sudo systemctl is-active google-cloud-ops-agent"*"
active
active
active
active
# For Legacy Monitoring and Logging Agents
$ sudo service stackdriver-agent status
stackdriver-agent is running [ OK ]
$ sudo service google-fluentd status
google-fluentd is running [ OK ]
For official documentation see troubleshooting docs:
- id: startup
  source: modules/scripts/startup-script
  settings:
    runners:
      # Some modules such as filestore have runners as outputs for convenience:
      - $(homefs.install_nfs_client_runner)
      # These runners can still be created manually:
      # - type: shell
      #   destination: "modules/filestore/scripts/install_nfs_client.sh"
      #   source: "modules/filestore/scripts/install_nfs_client.sh"
      - type: ansible-local
        destination: "modules/filestore/scripts/mount.yaml"
        source: "modules/filestore/scripts/mount.yaml"
      - type: data
        source: /tmp/foo.tgz
        destination: /tmp/bar.tgz
      - type: shell
        destination: "decompress.sh"
        content: |
          #!/bin/sh
          echo $2
          tar zxvf /tmp/$1 -C /
        args: "bar.tgz 'Expanding file'"
- id: compute-cluster
  source: modules/compute/vm-instance
  use: [homefs, startup]
In the above example, a new GCS bucket is created to hold the startup scripts. To reuse an existing GCS bucket or folder instead, set gcs_bucket_path as shown in the example below:
- id: startup
  source: modules/scripts/startup-script
  settings:
    gcs_bucket_path: gs://user-test-bucket/folder1/folder2
    install_stackdriver_agent: true
- id: compute-cluster
  source: modules/compute/vm-instance
  use: [startup]
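A gcs_bucket_path value such as the one above consists of a bucket name followed by an optional folder prefix. The following sketch shows one way to validate and split such a path in shell; split_gcs_path is a hypothetical helper for illustration and is not how the module itself parses the setting.

```shell
#!/bin/sh
# Illustrative sketch only; split_gcs_path is a hypothetical helper.
# Prints the bucket name and the folder prefix of a gs:// path on two lines.
split_gcs_path() {
  path="$1"
  case "$path" in
    gs://?*) ;;
    *)
      echo "gcs_bucket_path must start with gs://" >&2
      return 1
      ;;
  esac
  rest="${path#gs://}"   # strip the scheme
  bucket="${rest%%/*}"   # text before the first slash
  prefix=""
  case "$rest" in
    */*) prefix="${rest#*/}" ;;   # text after the first slash, if any
  esac
  printf '%s\n%s\n' "$bucket" "$prefix"
}

split_gcs_path gs://user-test-bucket/folder1/folder2
```

For the example path, the first line printed is the bucket (user-test-bucket) and the second is the folder prefix (folder1/folder2).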
Copyright 2023 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Name | Version |
---|---|
terraform | >= 1.3 |
google | >= 3.83 |
local | >= 2.0.0 |
random | ~> 3.0 |
Name | Version |
---|---|
google | >= 3.83 |
local | >= 2.0.0 |
random | ~> 3.0 |
No modules.
Name | Type |
---|---|
google_storage_bucket.configs_bucket | resource |
google_storage_bucket_iam_binding.viewers | resource |
google_storage_bucket_object.scripts | resource |
local_file.debug_file | resource |
random_id.resource_name_suffix | resource |
Name | Description | Type | Default | Required |
---|---|---|---|---|
ansible_virtualenv_path | Virtual environment path in which to install Ansible | string | "/usr/local/ghpc-venv" | no |
bucket_viewers | Additional service accounts or groups, users, and domains to which to grant read-only access to startup-script bucket (leave unset if using default Compute Engine service account) | list(string) | [] | no |
configure_ssh_host_patterns | If specified, it will automate ssh configuration by: (1) defining a Host block for every element of this variable and setting StrictHostKeyChecking to 'No' (e.g. "hpc*", "hpc01*", "ml*"); (2) creating ssh keys the first time users log in and adding them to the authorized keys list. This requires a shared /home filesystem and relies on specifying the right prefix. | list(string) | [] | no |
debug_file | Path to an optional local file to be written with 'startup_script'. | string | null | no |
deployment_name | Name of the HPC deployment, used to name GCS bucket for startup scripts. | string | n/a | yes |
docker | Install and configure Docker | object({ | { | no |
enable_docker_world_writable | DEPRECATED: use var.docker | bool | null | no |
gcs_bucket_path | The GCS path for storage bucket and the object, starting with gs://. | string | null | no |
http_no_proxy | Domains for which to disable http_proxy behavior. Honored only if var.http_proxy is set | string | ".google.com,.googleapis.com,metadata.google.internal,localhost,127.0.0.1" | no |
http_proxy | Web (http and https) proxy configuration for pip, apt, and yum/dnf and interactive shells | string | "" | no |
install_ansible | Run Ansible installation script if set to true, or if unset and a runner of type 'ansible-local' is used. | bool | null | no |
install_cloud_ops_agent | Warning: Consider using install_stackdriver_agent for better performance. Run Google Ops Agent installation script if set to true. | bool | false | no |
install_cloud_rdma_drivers | If true, will install and reload Cloud RDMA drivers. Currently only supported on Rocky Linux 8. | bool | false | no |
install_docker | DEPRECATED: use var.docker. | bool | null | no |
install_stackdriver_agent | Run Google Stackdriver Agent installation script if set to true. Preferred over ops agent for performance. | bool | false | no |
labels | Labels for the created GCS bucket. Key-value pairs. | map(string) | n/a | yes |
local_ssd_filesystem | Create and mount a filesystem from local SSD disks (data will be lost if VMs are powered down without enabling migration); enable by setting mountpoint field to a valid directory path. | object({ | { | no |
prepend_ansible_installer | DEPRECATED. Use install_ansible=false to prevent ansible installation. | bool | null | no |
project_id | Project in which the HPC deployment will be created | string | n/a | yes |
region | The region to deploy to | string | n/a | yes |
runners | List of runners to run on remote VM. Runners can be of type ansible-local, shell or data. A runner must specify one of 'source' or 'content'. All runners must specify 'destination'. If 'destination' does not include a path, it will be copied in a temporary folder and deleted after running. Runners may also pass 'args', which will be passed as argument to shell runners only. | list(map(string)) | [] | no |
Name | Description |
---|---|
compute_startup_script | script to load and run all runners, as a string value. Targets the inputs for the slurm controller. |
controller_startup_script | script to load and run all runners, as a string value. Targets the inputs for the slurm controller. |
startup_script | script to load and run all runners, as a string value. |