Roadmap
We typically look 6 to 12 months ahead and establish the topics we want to work on. As we go, we learn, and our assessment of some of the listed topics changes, so we may add or drop topics along the way.
We describe some initiatives as "investigations", which means our goal in the next few months is to better understand the problem and potential solutions before scheduling actual feature work. Once an investigation is done, we will update our plan, either deferring the initiative or committing to it.
The up-to-date backlog is tracked in the Backlog issue. Iteration plans can be found in Iteration Plans.
As always, we will listen to your feedback and adapt our plans if needed.
Our roadmap covers the following themes:
- Easy to use: A simple, just-enough UX for data scientists, researchers, and students
- Easy to debug: An easy-to-understand, easy-to-use experience for debugging training jobs
- Easy to manage: A smooth installation, upgrade, and maintenance experience for IT admins
- Easy to operate: Easy-to-use metrics that help Operations understand resource usage
- Provide filters for jobs list page #302 @Gerhut
- Refine Job detail page
- 1st round refine #2211 @sunqinzheng @qfyin
- Tensorboard direct link option in job page
- Make left navigation tree items configurable for IT admins
- Instead of having users set GPU/memory/core manually and separately, PAI exposes the simple SKUs available in this PAI instance for users to select #2062 @debuggy @qfyin
- Improve user onboarding efficiency #2349:
- Provide better admin setup guidance and user guides in the UI/command.
- Provide better documentation and best practices for admins and users on how to use NFS as PAI's storage
- data/code/model: NFS and HDFS usage in the portal
- job submission: Clear and simple job workflow navigations
- experiment results: Simple and easy to find User Logs for experiments
- diagnostic: Clear and useful information for diagnostics (self-services)
- Accurate Job Status, Failure Reason (Category), and useful Resolutions #2326
- A user home page that provides a consolidated view of all of 'my' own related work.
- As a PAI user, I need to better understand how resources will/are/were allocated for my jobs, so that I know how to better compile my job scripts. Related Issues: #2062, #1943, #1989, #1777, #1819, #1904, #1995, #1968
- Job list view shows GPU and Task counts
- As an ops, I need to know the best practices for using PAI VCs and have better manageability of VCs. Related issues: #2073, #906
- Provide better documentation, examples, and best practices for checkpointing, so that a job can retry from its most recent progress
- PAIShare and Marketplace scenarios
- Better Job Debugging Experience for End Users #2210 @ydye
- Team wise storage management support #2204 @ydye @wangdian
- Backward compatible upgrading #2212 @hao1939
- Role based access control
- User account integration with AAD #1663
- A complete story for storage support
- Persistent logs for jobs
- GPU scheduling with priority
- A complete story about VC bonus and (per-)job preemption choice. #2340
- As a running service, we should not expose too much info to unknown users before login.
- PAI everywhere
- Run PAI on an existing Kubernetes cluster
- Support allocating resources for a VC by quantity instead of percentage
- HA support for OpenPAI
- Cluster/Machine auto maintenance
- Detect and alert for unhealthy GPU #2192 @mzmssg @xudifsd
- Provide the ability to query all the jobs on a node in the PAI Web Portal #2128 @xudifsd
- Display resource utilization metrics per VC/queue in Grafana #2208 @xudifsd
- Ability to generate reports for cluster/vc/resources/users/jobs usage [#2127]
- Detect and alert on low-utilization jobs (https://github.com/Microsoft/pai/issues/2127)
- GPU status summary
- List the utilization of all GPUs on one machine
- As an ops, I need the capability to batch create/update/delete user accounts and share with users through their email. Related Issues: #2078, #2085, #921
In addition to the above themes, fundamental architecture improvements are needed to support all of these features and experiences:
- End to end job event tracking/logging support
- Support automatic hyperparameter tuning, or running a job as an (NNI) experiment with sub-jobs
- Easy-to-customize web portal
If you have any questions or concerns about this wiki, please open an OpenPAI issue directly.