
Development notes

Steve Brasier edited this page May 12, 2022 · 28 revisions

Interfaces

What information is required as input to the cluster/nodes.

Groups:

  • login
  • compute
  • control
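As a rough illustration of the groups above, a YAML-format Ansible inventory might look like the following. The host names and the cluster name "mycluster" are hypothetical, and nesting a mycluster_compute group under compute is an assumption based on the "{{ openhpc_cluster_name }}_compute" group that the default partition definition expects.

```yaml
# Sketch only: host names and the cluster name "mycluster" are hypothetical.
all:
  children:
    login:
      hosts:
        mycluster-login-0:
    control:
      hosts:
        mycluster-control-0:
    compute:
      children:
        # Assumed to match "{{ openhpc_cluster_name }}_compute"
        mycluster_compute:
          hosts:
            mycluster-compute-0:
            mycluster-compute-1:
```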

Group/host vars:

Odd things

  • For smslabs, the control node needs to know the login node's private IP, because openondemand_servername is defined using it in group_vars/all/openondemand.yml (we use a SOCKS proxy for access). More generally, grafana (default: control) needs to know the external address of openondemand (default: login).

The full list for the everything cluster is shown below.

Note that api_address and internal_address for hosts both default to inventory_hostname.
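Expressed as Ansible variables, that default might look like the sketch below. Where the appliance actually defines these is not stated here, so the file location is an assumption, and the override address is purely illustrative.

```yaml
# Assumed defaults, e.g. in a group_vars file applying to all hosts.
api_address: "{{ inventory_hostname }}"
internal_address: "{{ inventory_hostname }}"

# Hypothetical host_vars override for a host whose inventory name
# is not resolvable on the internal network:
# internal_address: 10.0.0.10
```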

  • openhpc_cluster_name: Cluster name. No default, must be set.

    • Required for all openhpc hosts here, but I think this is over-broad: probably only the control host and db host (openhpc.enable=database) should require it.
    • NB: slurm.conf templating assumes this is only done on a single controller!
  • openhpc_slurm_control_host: Slurmctld address. Default in common:all:openhpc = {{ groups['control'] | first }}.

    • NB: maybe should use .internal_address?
    • Required for all openhpc hosts. It is needed for delegate_to, so must be an inventory_hostname. It is also used as the address of the Slurm controller, which is really overloading it.
    • Note Slurm assumes slurmdbd.conf and slurm.conf are in the same directory; how does this work configless?
    • For slurmd nodes, we could rewrite /etc/sysconfig/slurmd using cloud-config's write_files.
  • openhpc_slurm_partitions: Partition definitions. Default in common:all:openhpc is a single 'compute' partition. NB: requires group "{{ openhpc_cluster_name }}_compute" in environment inventory. Could check groups during validation?

    • Host requirements & comments as above (but for control only)
  • nfs_server: Default in common:all:nfs is nfs_server_default -> "{{ hostvars[groups['control'] | first ].internal_address }}". Required for all clients.

    • For client nodes, could rewrite fstab (done by https://github.com/stackhpc/ansible-role-cluster-nfs/blob/master/tasks/nfs-clients.yml) using cloud-config's mount module.
  • elasticsearch_address: Default in common:all:defaults is {{ hostvars[groups['opendistro'].0].api_address }}. Required for filebeat and grafana hosts.

output.elasticsearch:
  hosts: ["{{ elasticsearch_address }}:9200"]
  protocol: "https"
  ssl.verification_mode: none
  username: "admin"
  password: "{{ vault_elasticsearch_admin_password }}"

(docs). It looks like these settings support environment variables, so this could potentially be set using a systemd unit file fragment. The current systemd unit file is in the appliance: ansible/roles/filebeat/templates/filebeat.service.j2
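A minimal sketch of the environment-variable idea, using Filebeat's ${VAR} expansion in its config file. ELASTICSEARCH_ADDRESS is a hypothetical variable name, not something the appliance currently defines.

```yaml
# Sketch: ELASTICSEARCH_ADDRESS is a hypothetical environment variable name.
# Filebeat expands ${VAR} references from its own process environment.
output.elasticsearch:
  hosts: ["${ELASTICSEARCH_ADDRESS}:9200"]
  protocol: "https"
  ssl.verification_mode: none
  username: "admin"
  password: "{{ vault_elasticsearch_admin_password }}"
```

The variable would have to be present in the environment of the Filebeat process itself, for example via an Environment= line in a systemd drop-in. If Filebeat runs inside a container, it would also need passing through to the container.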

  • prometheus_address: Default in common:all:defaults is {{ hostvars[groups['prometheus'].0].api_address }}. Required for prometheus and grafana hosts - link

  • openondemand_address: Default in common:all:defaults is {{ hostvars[groups['openondemand'].0].api_address if groups['openondemand'] | count > 0 else '' }}. Required for prometheus host - NB this should probably be in prometheus group vars.

  • grafana_address: Default in common:all:grafana is {{ hostvars[groups['grafana'].0].api_address }}. Required for grafana host link.

    • This should probably be moved to common:all:defaults in line with other service endpoints
  • openondemand_servername: Non-functional default '', must be set. Required for the openondemand host and the grafana host link when both grafana and openondemand exist (which they do for everything). NB this probably requires either a) a FIP or b) a fixed IP when using a SOCKS proxy. In the latter case, the control host needs to have the login node's fixed IP available.

  • All the secrets in environment:all:secrets - see secret role's defaults:

    • grafana, elasticsearch, mysql (x2) passwords (all potentially depending on group placement)
    • vault_openhpc_mungekey -> openhpc_munge_key (for all openhpc nodes)
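Several of the variables above have no usable default, and the default partition depends on a group existing in the inventory. A pre-flight check for this could be sketched as below. This is an illustration using ansible.builtin.assert, not the appliance's actual validation, and the play layout is an assumption.

```yaml
# Sketch of pre-flight checks; not the appliance's actual validation.
- name: Validate required cluster inputs
  hosts: all
  gather_facts: false
  tasks:
    - name: Check openhpc_cluster_name is set
      ansible.builtin.assert:
        that:
          - openhpc_cluster_name is defined
          - openhpc_cluster_name | length > 0
        fail_msg: "openhpc_cluster_name has no default and must be set"
      run_once: true

    - name: Check the default partition's group exists in the inventory
      ansible.builtin.assert:
        that:
          - (openhpc_cluster_name ~ '_compute') in groups
        fail_msg: "Group {{ openhpc_cluster_name }}_compute is missing from the inventory"
      run_once: true

    - name: Check openondemand_servername is set on openondemand hosts
      ansible.builtin.assert:
        that:
          - openondemand_servername | default('') | length > 0
        fail_msg: "openondemand_servername must be set"
      when: inventory_hostname in groups.get('openondemand', [])
```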

Running install tasks only

Which roles can we ONLY run the install tasks from, to build a cluster-independent(*)/no-config image?

In-appliance roles:

  • basic_users: n/a
  • block_devices: n/a
  • filebeat: n/a but downloads Docker container at service start.
  • grafana-dashboards: Downloads grafana dashboards
  • grafana-datasources: n/a
  • hpctests: n/a but required packages are installed as part of openhpc_default_packages.
  • opendistro: n/a but downloads Docker container at service start.
  • openondemand:
  • passwords: n/a
  • podman: prereqs.yml does package installs.

Out of appliance roles:

  • stackhpc.nfs: main.yml (https://github.com/stackhpc/ansible-role-cluster-nfs/blob/master/tasks/main.yml) installs packages.
  • stackhpc.openhpc: Required packages and openhpc_packages (see above) are installed in install.yml, but this requires the openhpc_slurm_service fact set from main.yml.
  • cloudalchemy.node_exporter:
    • install.yml does binary download from GitHub but also propagation. Could pre-download it and use node_exporter_binary_local_dir, but install.yml still needs running as it does user creation too.
    • selinux.yml also does package installations
  • cloudalchemy.blackbox-exporter: Currently unused.
  • cloudalchemy.prometheus: install.yml. Same comments as for cloudalchemy.node_exporter above.
  • cloudalchemy.alertmanager: Currently unused.
  • cloudalchemy.grafana: install.yml does package updates.
  • geerlingguy.mysql: setup-RedHat.yml does package updates BUT needs variables.yml running to load appropriate variables.
  • jriguera.configdrive: Unused, should be deleted.
  • osc.ood: See openondemand above.

(*) It's not really cluster-independent, as which features are turned on where may vary.
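One way to run only a role's install tasks is include_role with tasks_from. The sketch below uses cloudalchemy.node_exporter as the example; the play layout and target hosts are assumptions.

```yaml
# Sketch: run only a role's install tasks, e.g. during an image build.
- name: Install node_exporter without configuring it
  hosts: all
  become: true
  tasks:
    - name: Run only install.yml from the role
      ansible.builtin.include_role:
        name: cloudalchemy.node_exporter
        tasks_from: install.yml
```

The role's defaults and vars are still loaded this way, but facts set by other task files in the role's main.yml are not. That is the problem noted above for stackhpc.openhpc, where install.yml needs the openhpc_slurm_service fact.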