This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Roadmap

Scarlett Li edited this page Jun 22, 2020 · 54 revisions

OpenPAI Roadmap

We typically look out 6 to 12 months and establish topics we want to work on. As we go, we learn, and our assessment of some of the listed topics changes. Thus, we may add or drop topics along the way.

We describe some initiatives as "investigations", which means our goal in the next few months is to better understand the problem and potential solutions before scheduling actual feature work. Once an investigation is done, we will update our plan, either deferring the initiative or committing to it.

The up-to-date backlog is tracked in the Backlog issue. Iteration plans can be found in Iteration Plans.

As always, we will listen to your feedback and adapt our plans if needed.

Themes

Our roadmap covers the following themes:

User themes

  • Easy to use: Simple, just-enough UX for data scientists, researchers and students
  • Easy to debug: An easy-to-understand, easy-to-use experience for debugging training jobs

Admin and Ops themes

  • Easy to manage: Smooth installation, upgrade and maintenance experience for IT admins
  • Easy to operate: Easy-to-use metrics for Operations to understand resource usage

Easy to use

  • Provide filters for jobs list page #302 @Gerhut
  • Refine Job detail page
    • 1st round refine #2211 @sunqinzheng @qfyin
    • Tensorboard direct link option in job page
  • Make left navigation tree items configurable for IT admins
  • Instead of setting the GPU/memory/core manually and separately, PAI exposes simple SKUs that are available in this PAI instance for the user to select #2062 @debuggy @qfyin
  • Improve user onboarding efficiency #2349:
    • Provide better admin setup guidance and user usage guides in the UI/command.
    • Provide better documentation and best practices for admins and users on how to use NFS as PAI's storage
    • data/code/model: NFS and HDFS usage in the portal
    • job submission: Clear and simple job workflow navigation
    • experiment results: Simple and easy to find User Logs for experiments
    • diagnostic: Clear and useful information for diagnostics (self-services)
  • Accurate Job Status, Failure Reason (Category), and useful Resolutions #2326
  • A user home page that provides a consolidated view of all of 'my' own related work.
  • As a PAI user, I need to better understand how resources will/are/were allocated for my jobs, so that I know how to better compile my job scripts. Related Issues: #2062, #1943, #1989, #1777, #1819, #1904, #1995, #1968
  • Job list view shows GPU and Task counts
  • As an ops, I need to know the best practices for using PAI VCs, and I need better manageability of VCs. Related issues: #2073, #906
  • Provide better documentation, examples and best practices for checkpointing, so that a job can retry from its most recent progress
  • PAIShare and Marketplace scenarios

Easy to debug

  • Better Job Debugging Experience for End Users #2210 @ydye
    • Job debugging reservation when a job fails due to user error. #2213 @ydye
    • An option for users to decide whether to enable debugging reservation for a job. #2214 @ydye
    • An approach to collect information about the container that is reserved for job debugging. #2215
    • Display the debugging reservation status of a job in the webportal #2216
    • An approach to notify users when their jobs are in debugging reservation #2217
    • Provide detailed information when the job container exits. #2218

Easy to manage

  • Team-wise storage management support #2204 @ydye @wangdian
  • Backward-compatible upgrading #2212 @hao1939
  • Role-based access control
    • User account integration with AAD #1663
  • A complete story for storage support
  • Persistent logs for jobs
  • GPU scheduling with priority
  • A complete story about VC bonus, (Per) job preemption choice. #2340
  • As a running service, we should not expose too much info to unknown users before login.
  • PAI everywhere
    • Run PAI on an existing Kubernetes cluster
  • Support to allocate resources for VC by quantity instead of percentage
  • HA support for OpenPAI
  • Cluster/Machine auto maintenance

Easy to operate

  • Detect and alert for unhealthy GPU #2192 @mzmssg @xudifsd
  • Provide the ability to query all the jobs in a Node in PAI Web Portal #2128 @xudifsd
  • Display resource utilization per VC/queue metrics in Grafana #2208 @xudifsd
  • Ability to generate reports for cluster/vc/resources/users/jobs usage #2127
  • Detect and alert on low-utilization jobs (https://github.com/Microsoft/pai/issues/2127)
  • GPU status summary
  • List all the GPUs' utilization for 1 machine
  • As an ops, I need the capability to batch create/update/delete user accounts and share with users through their email. Related Issues: #2078, #2085, #921

Better foundation

In addition to the above themes, fundamental architecture improvements are needed to support all of these features and experiences:

  • End to end job event tracking/logging support
  • Support automatic hyperparameter tuning, or running a job as an (NNI) Experiment with sub jobs
  • Easy-to-customize Webportal