We would like to catch situations such as linear memory growth or spikes at the end of a job early, well before they start to cause problems (i.e. job failures) in production.
The workflow could be something like:
- SimpleMemoryCheck monitors RSS and reports some growth metrics in the framework job report.
- WM propagates these metrics to monitoring.
- Monitoring raises alerts if certain patterns start to occur.
- The operator responding to an alert opens a CMSSW GitHub issue.
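
As a rough illustration of what "growth metrics" might mean, here is a minimal sketch in Python. It is not SimpleMemoryCheck code and does not use any CMSSW or WM API: the function name `rss_growth_metrics`, the `tail_fraction` parameter, and the output keys are all hypothetical. It assumes the job provides a list of positive RSS samples in MB taken at regular intervals, fits a least-squares slope to flag linear growth, and compares the peak in the final part of the job against the median of the earlier samples to flag an end-of-job spike.

```python
from statistics import median


def rss_growth_metrics(rss_mb, tail_fraction=0.1):
    """Summarize regularly spaced RSS samples (MB). Hypothetical helper."""
    n = len(rss_mb)
    if n < 2:
        raise ValueError("need at least two RSS samples")

    # Least-squares slope of RSS against sample index (MB per sample).
    mean_x = (n - 1) / 2
    mean_y = sum(rss_mb) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(rss_mb))
    slope = sxy / sxx

    # End-of-job spike: peak of the last samples relative to the median
    # of everything before them. Always leave at least one earlier sample.
    tail_len = min(n - 1, max(1, int(n * tail_fraction)))
    head = rss_mb[:-tail_len]
    tail = rss_mb[-tail_len:]
    tail_spike_ratio = max(tail) / median(head)

    return {
        "slope_mb_per_sample": slope,
        "tail_spike_ratio": tail_spike_ratio,
        "peak_mb": max(rss_mb),
    }


# Example inputs (made-up numbers): steady growth, and a flat job
# that jumps in its final samples.
growing = rss_growth_metrics([100 + 2 * i for i in range(50)])
spiking = rss_growth_metrics([500] * 19 + [900])
```

Monitoring could then alert on thresholds over these values, for example a slope that stays positive across many jobs of the same workflow, or a tail ratio well above 1. The thresholds themselves would need tuning against real production data.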