Add more information re. configuring production sites (#508)
* add lots of info to production docs

* Production docs tweaks from review

Co-authored-by: Scott Davidson <49713135+sd109@users.noreply.github.com>

* add prod docs comment re login FIPs

---------

Co-authored-by: Scott Davidson <49713135+sd109@users.noreply.github.com>
sjpb and sd109 authored Jan 8, 2025
1 parent 50fc320 commit 781c2d4
Showing 1 changed file with 145 additions and 5 deletions: docs/production.md
# Production Deployments

This page contains some brief notes about differences between the default/demo
configuration (as described in the main [README.md](../README.md)) and
production-ready deployments.

- Get it agreed up front what the cluster names will be. Changing this later
requires instance deletion/recreation.

- At least three environments should be created:
- `site`: site-specific base environment
- `production`: production environment
- `staging`: staging environment

A `dev` environment should also be created if required; alternatively this
can be left until later.

These can all be produced using the cookiecutter instructions, but the
`production` and `staging` environments will need their
`environments/$ENV/ansible.cfg` files modified so that they point to the
`site` environment:

```ini
inventory = ../common/inventory,../site/inventory,inventory
```

- To avoid divergence of configuration all possible overrides for group/role
vars should be placed in `environments/site/inventory/group_vars/all/*.yml`
unless the value really is environment-specific (e.g. DNS names for
`openondemand_servername`).
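
For example (the values here are illustrative, and `openondemand_auth` is used
as a hypothetical example of a site-wide setting):

```yaml
# environments/site/inventory/group_vars/all/openondemand.yml:
# site-wide setting, shared by staging and production:
openondemand_auth: basic_pam

# environments/production/inventory/group_vars/all/openondemand.yml:
# genuinely environment-specific, so set per environment:
openondemand_servername: ondemand.example.org
```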

- Where possible hooks should also be placed in `environments/site/hooks/`
and referenced from the `site` and `production` environments, e.g.:

```yaml
# environments/production/hooks/pre.yml:
- name: Import parent hook
import_playbook: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/../site/hooks/pre.yml"
```

- OpenTofu configurations should be defined in the `site` environment and used
as a module from the other environments. This can be done with the
cookiecutter-generated configurations:
- Delete the *contents* of the cookiecutter-generated `terraform/` directories
from the `production` and `staging` environments.
- Create a `main.tf` in those directories which uses `site/terraform/` as a
[module](https://opentofu.org/docs/language/modules/), e.g.:

```terraform
...
module "cluster" {
source = "../../site/terraform/"

cluster_name = "foo"
...
}
```

Note that:
- Environment-specific variables (`cluster_name`) should be hardcoded
into the module block.
- Environment-independent variables (e.g. maybe `cluster_net` if the
same is used for staging and production) should be set as *defaults*
in `environments/site/terraform/variables.tf`, and then don't need to
be passed in to the module.
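
To illustrate the split (variable names other than `cluster_name` are
hypothetical):

```terraform
# environments/site/terraform/variables.tf (sketch):

variable "cluster_name" {
  type = string
  # No default: environment-specific, hardcoded in each module block.
}

variable "cluster_net" {
  type    = string
  # Default here: environment-independent, so the calling module
  # does not need to pass it.
  default = "cluster-network" # hypothetical network name
}
```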
- Vault-encrypt secrets. Running the `generate-passwords.yml` playbook creates
a secrets file at `environments/$ENV/inventory/group_vars/all/secrets.yml`.
To ensure staging environments are a good model for production this should
generally be moved into the `site` environment. It should be encrypted
using [Ansible vault](https://docs.ansible.com/ansible/latest/user_guide/vault.html)
and then committed to the repository.
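The workflow might look like this (paths follow the layout above; how the
vault password is supplied is a site decision):

```shell
mv environments/$ENV/inventory/group_vars/all/secrets.yml \
   environments/site/inventory/group_vars/all/secrets.yml
ansible-vault encrypt environments/site/inventory/group_vars/all/secrets.yml
# the file now starts with a $ANSIBLE_VAULT;1.1;AES256 header and is safe to commit
git add environments/site/inventory/group_vars/all/secrets.yml
```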
- Ensure created instances have accurate/synchronised time. For VM instances
this is usually provided by the hypervisor, but if not (or for bare metal
instances) it may be necessary to configure or proxy `chronyd` via an
environment hook.
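A minimal sketch of such a hook (the NTP server name is illustrative, and
editing `/etc/chrony.conf` directly is just one possible approach):

```yaml
# environments/site/hooks/post.yml (sketch):
- hosts: cluster
  become: true
  tasks:
    - name: Point chronyd at a site NTP server
      ansible.builtin.lineinfile:
        path: /etc/chrony.conf
        line: "server ntp.example.org iburst"
      notify: Restart chronyd
  handlers:
    - name: Restart chronyd
      ansible.builtin.service:
        name: chronyd
        state: restarted
```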
- The cookiecutter provided OpenTofu configurations define resources for home and
state volumes. The former may not be required if the cluster's `/home` is
provided from an external filesystem (or Manila). In any case, in at least
the production environment, and probably also in the staging environment,
the volumes should be manually created and the resources changed to [data
resources](https://opentofu.org/docs/language/data-sources/). This ensures that even if the cluster is deleted via tofu, the
volumes will persist.

For a development environment, having volumes under tofu control via volume
resources is usually appropriate as there may be many instantiations
of this environment.
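
For example, the state volume might be changed from a resource to a data
source like this (resource and variable names are illustrative; the volume
must be created manually before applying):

```terraform
# Previously a resource, which `tofu destroy` would delete:
# resource "openstack_blockstorage_volume_v3" "state" {
#   name = "${var.cluster_name}-state"
#   size = var.state_volume_size
# }

# Now a data source referencing the manually-created volume:
data "openstack_blockstorage_volume_v3" "state" {
  name = "${var.cluster_name}-state"
}
```

References elsewhere in the configuration then change from
`openstack_blockstorage_volume_v3.state.id` to
`data.openstack_blockstorage_volume_v3.state.id`.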

- Enable `etc_hosts` templating:

```yaml
# environments/site/inventory/groups:
[etc_hosts:children]
cluster
```

- Configure Open OnDemand - see [specific documentation](openondemand.README.md).

- Modify `environments/site/terraform/nodes.tf` to provide fixed IPs for at least
the control node, and (if not using FIPs) the login node(s):

```terraform
resource "openstack_networking_port_v2" "control" {
...
fixed_ip {
subnet_id = data.openstack_networking_subnet_v2.cluster_subnet.id
ip_address = var.control_ip_address
}
}
```

Note the variable `control_ip_address` is new.

Using fixed IPs will require either using admin credentials or policy changes.
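
A declaration for this new variable might look like (type and description are
a sketch):

```terraform
# environments/site/terraform/variables.tf:
variable "control_ip_address" {
  type        = string
  description = "Fixed IP to assign to the control node port"
}
```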

- If floating IPs are required for login nodes, modify the OpenTofu configurations
appropriately.

- Enable persisting login node hostkeys so users do not get SSH host key
  warnings when login nodes are reimaged:

```yaml
# environments/site/inventory/groups:
[persist_hostkeys:children]
login
```
Also configure NFS to export the state directory to these hosts:

```yaml
# environments/common/inventory/group_vars/all/nfs.yml:
nfs_configurations:
# ... potentially, /home definition from common environment
- comment: Export state directory to login nodes
nfs_enable:
server: "{{ inventory_hostname in groups['control'] }}"
clients: "{{ inventory_hostname in groups['login'] }}"
nfs_server: "{{ nfs_server_default }}"
nfs_export: "/var/lib/state"
nfs_client_mnt_point: "/var/lib/state"
```
See [issue 506](https://github.com/stackhpc/ansible-slurm-appliance/issues/506).

- Consider whether mapping of baremetal nodes to ironic nodes is required. See
[PR 485](https://github.com/stackhpc/ansible-slurm-appliance/pull/485).

- Note [PR 473](https://github.com/stackhpc/ansible-slurm-appliance/pull/473)
may help identify any site-specific configuration.
