Monitor GPUs available in Skynet and Ibex

escorciav/kaust-cluster-status

This project displays the GPUs available in the Ibex and IVUL clusters.

It was born as a solo project to greedily and manually pick the best cluster to run experiments on.

Usage

Requirements: Python 3, pandas

  1. Log in to the cluster of interest (ibex/skynet)

  2. Launch servers

    1. @skynet

      conda activate cluster_status
      FLASK_PORT=5000; FLASK_APP=server.py flask run --port=$FLASK_PORT
    2. @ibex

      python gdragon.py

  3. (Optional) Make server accessible if ports are blocked

    Should be as simple as ssh -vR 8000:localhost:5000 [user-name]@[server]

Setup

  1. Install miniconda or anaconda.

    Skip this step if you already have one of them installed.

  2. Create the environment

    conda env create -f environment-x86_64.yml

That's all. Don't forget to activate the environment before running any program.

conda activate cluster-status

Documentation

Currently, all the heavy lifting is done in cluster.py. This module simply retrieves the status of the cluster from SLURM via subprocess/shell calls. We recommend navigating the code starting from server.py to get an idea of what is going on. The most important functions are partially documented. You can also reach out to us and contribute more documentation.
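As an illustration of the subprocess approach described above, here is a minimal sketch of a helper that runs a command and returns its output. The name `run_command` is hypothetical and is not taken from cluster.py; the actual module may be organized differently.

```python
import shlex
import subprocess


def run_command(command: str) -> str:
    """Run a command without a shell and return its stdout as text."""
    result = subprocess.run(
        shlex.split(command),  # split like a POSIX shell would, honoring quotes
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

On a machine where SLURM is installed, `run_command('sinfo -o "%n %P %T %G"')` would return the raw sinfo table as a string.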

Cluster info

Grab info about nodes

sinfo -o "%n %A %D %P %T %c %z %m %d %w %f %G"

TODO: explain what all those %? fields mean. Low priority in favor of using this.
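Until that TODO is addressed, the sketch below lists what each specifier in the command above means according to the sinfo man page, and shows one way to parse the whitespace-separated output into records. `parse_sinfo` and `SAMPLE_OUTPUT` are hypothetical names, and the sample text is made up for illustration; the exact header names may differ between SLURM versions.

```python
from __future__ import annotations

# Meaning of each sinfo format specifier used in the command above.
SINFO_FIELDS = {
    "%n": "node hostname",
    "%A": "number of nodes by state, as allocated/idle",
    "%D": "number of nodes",
    "%P": "partition name ('*' marks the default partition)",
    "%T": "node state, extended form",
    "%c": "number of CPUs per node",
    "%z": "sockets:cores:threads per node",
    "%m": "memory per node, in megabytes",
    "%d": "temporary disk space per node, in megabytes",
    "%w": "scheduling weight of the node",
    "%f": "features associated with the node",
    "%G": "generic resources (gres) associated with the node",
}

# Made-up sample; real output comes from the cluster.
SAMPLE_OUTPUT = """\
HOSTNAMES NODES(A/I) NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE GRES
gpu-node-1 1/0 1 batch mixed 40 2:20:1 384000 0 1 (null) gpu:4
gpu-node-2 0/1 1 batch idle 40 2:20:1 384000 0 1 (null) gpu:8
"""


def parse_sinfo(output: str) -> list[dict[str, str]]:
    """Turn sinfo's table into one dict per row, keyed by the header names."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    columns = lines[0].split()
    # Assumes no field contains spaces, which holds for the specifiers above.
    return [dict(zip(columns, row.split())) for row in lines[1:]]
```

A list of dicts like this can be passed directly to `pandas.DataFrame` if a table is preferred.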

Available GPUs

Behind the scenes, this combines the cluster info with the output of squeue -o "%u %i %t %b %N"

TODO: explain what all those %? fields mean. Low priority in favor of using this.
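In the squeue command above, %u is the user name, %i the job id, %t the job state in compact form, %b the generic resources requested by the job, and %N the nodes allocated to it. The following is a minimal sketch of how the two sources could be combined; it is not the logic in cluster.py. The names `count_gpus` and `available_gpus` are hypothetical, and the sketch assumes that each running job sits on a single, explicitly named node (no `node[01-04]` ranges) and that every GPU entry carries an explicit count.

```python
from __future__ import annotations

import re
from collections import Counter

# Matches entries such as "gpu:4", "gpu:v100:2", "gres:gpu:1" or "gpu:4(S:0-1)".
_GPU_PATTERN = re.compile(r"^(?:gres[:/])?gpu(?::[^:(]+)?:(\d+)(?:\(|$)")


def count_gpus(gres: str) -> int:
    """Sum the GPU counts found in a comma-separated gres string."""
    total = 0
    for entry in gres.split(","):
        match = _GPU_PATTERN.match(entry.strip())
        if match:
            total += int(match.group(1))
    return total


def available_gpus(
    node_gres: dict[str, str], jobs: list[dict[str, str]]
) -> dict[str, int]:
    """Subtract the GPUs held by running jobs from each node's total."""
    used: Counter[str] = Counter()
    for job in jobs:
        if job["state"] == "R":  # "R" is squeue's compact code for RUNNING
            used[job["node"]] += count_gpus(job["gres"])
    return {node: count_gpus(gres) - used[node] for node, gres in node_gres.items()}
```

Here `node_gres` would map each hostname to its %G value from sinfo, and each job dict would hold the %t, %b and %N values from squeue under the keys `state`, `gres` and `node`.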

Extras

  1. Show reservation

scontrol show reservation | grep -A 3 GROUP_IVUL

  2. List node info

scontrol -o show node

New request protocol - Help wanted

Implement the features described in issues #9 and #5.

  • Get users

    sacctmgr list users --noheader format=User%-20

  • Get gres list

    scontrol show config | grep -e "GresTypes"

  • Get partitions list

    scontrol show partitions | grep PartitionName

  • List of unavailable nodes

    sinfo -N --states=DOWN,DRAIN,DRAINED,DRAINING -o "%N" --noheader

  • List of all nodes, or of the nodes in a given partition

    sinfo -h -o %n

    sinfo -h -p $partition_list -o %n

  • Extract computer info

    scontrol show nodes --oneliner --detail | sed 's/\s/\n/g' | grep -e "NodeName=" -e "Gres=" -e "GresUsed" -e "CfgTRES=" -e "AllocTRES=" -e "Partitions="

  • List jobs

    scontrol show jobs --oneliner --detail | grep "JobState=RUNNING" | sed 's/\s/\n/g' | grep -e "JobId" -e "NumNodes" -e "ArrayJobId" -e "ArrayTaskId" -e "JobName" -e "UserId" -e "StartTime" -e "Partition" -e "^Nodes=" -e "CPU_IDs" -e "Mem=" -e "Gres=" -e "TRES=" -e "TresPerNode="

    Tested on ibex. Gres did not work; instead, @escorciav found TresPerNode or GRES_IDX.
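The sed/grep pipelines above split each `--oneliner` record on whitespace and keep selected `Key=Value` pairs. A rough Python equivalent is sketched below; `parse_oneliner` and `SAMPLE_LINE` are hypothetical names, and the sample record is made up. Unlike grep, the key filter here matches whole key names exactly, and the first occurrence of a repeated key wins.

```python
from __future__ import annotations

# Made-up record in the style of `scontrol show nodes --oneliner`.
SAMPLE_LINE = (
    "NodeName=gpu-node-1 Arch=x86_64 Gres=gpu:4 "
    "CfgTRES=cpu=40,mem=384000M,gres/gpu=4 "
    "AllocTRES=cpu=8,gres/gpu=2 Partitions=batch"
)


def parse_oneliner(line: str, keys: set[str] | None = None) -> dict[str, str]:
    """Extract Key=Value pairs from one scontrol --oneliner record."""
    record: dict[str, str] = {}
    for token in line.split():
        # Split on the first "=" only, so values like "cpu=40,..." stay intact.
        key, separator, value = token.partition("=")
        if not separator:
            continue  # fragment of a value that contained spaces; ignored here
        if keys is None or key in keys:
            record.setdefault(key, value)  # keep the first occurrence of a key
    return record
```

For example, `parse_oneliner(SAMPLE_LINE, keys={"NodeName", "Gres", "AllocTRES"})` keeps only those three fields of the record.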

Credits to situpf
