Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented visualizations for the NIM metrics #3612

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

LinoyBitan1
Copy link

@LinoyBitan1 LinoyBitan1 commented Jan 2, 2025

Description

JIRA - https://issues.redhat.com/browse/NVPE-31

  • Implemented visualizations for the NIM metrics. the graphs added to Nim tab-

    1 - Graph for KV Cache usage over time
    2 - Line graph for Running, Waiting and Max Request Count
    3 - Line graph with Total Prompt Token Count and Total Generation Token Count
    4 - Area chart with Time to First Token
    5 - Area chart with Time per Output Token
    6 - Donut chart with Success Request and Failed Requests

  • odh-model-controller creates a ConfigMap next to the InferenceService called -metrics-dashboard with a data indicating if metrics are available for the runtime and if so, the ConfigMap will also contain a JSON object specifying the queries required for the graphs mentioned above.

  • odh-dashboard grabs the above-mentioned ConfigMap based on its name, and if metrics are available for the runtime, it will parse the JSON object and create the required graphs.

NIM Metrics tab-

image
NIM Metrics graphs-

image
image
image
image
image

How Has This Been Tested?

Use the following doc -
NIM Metrics_ frontend.pdf

Test Impact

  • Cypress tests added
  • Unit tests of utility functions
  • Tests of new hooks to follow after.

Request review criteria:

Self checklist (all need to be checked):

  • The developer has manually tested the changes and verified that the changes work
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has added tests or explained why testing cannot be added (unit or cypress tests for related changes)

If you have UI changes:

  • Included any necessary screenshots or gifs if it was a UI change.
  • Included tags to the UX team if it was a UI/UX change.

After the PR is posted & before it merges:

  • The developer has tested their solution on a cluster by using the image produced by the PR to main

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress This PR is in WIP state label Jan 2, 2025
Copy link
Contributor

openshift-ci bot commented Jan 2, 2025

Hi @LinoyBitan1. Thanks for your PR.

I'm waiting for a opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test The openshift bot needs to label PRs from non members to avoid strain on the CI label Jan 2, 2025
@andrewballantyne
Copy link
Member

FYI @emilys314

@LinoyBitan1 LinoyBitan1 changed the title Add nim metrics Implemented visualizations for the NIM metrics Jan 5, 2025
@LinoyBitan1
Copy link
Author

This is our (Matan Talvi and Linoy Bitan) first PR.

@LinoyBitan1 LinoyBitan1 marked this pull request as ready for review January 5, 2025 12:26
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress This PR is in WIP state label Jan 5, 2025
@emilys314
Copy link
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test The openshift bot needs `ok-to-test` to allow non member PRs to run the tests. and removed needs-ok-to-test The openshift bot needs to label PRs from non members to avoid strain on the CI labels Jan 7, 2025
@emilys314
Copy link
Contributor

Can you run the linter / formatter? The CI checks will not pass unless it's following the linting rules See https://github.com/opendatahub-io/odh-dashboard/blob/main/CONTRIBUTING.md#linter-testing
vs code can show the errors. Or run npm run test:lint

@emilys314
Copy link
Contributor

Can you also add in the PR description the jira item this is associated with?

@LinoyBitan1
Copy link
Author

LinoyBitan1 commented Jan 8, 2025

done

@LinoyBitan1 LinoyBitan1 closed this Jan 8, 2025
@LinoyBitan1 LinoyBitan1 reopened this Jan 8, 2025
@LinoyBitan1
Copy link
Author

Can you run the linter / formatter? The CI checks will not pass unless it's following the linting rules See https://github.com/opendatahub-io/odh-dashboard/blob/main/CONTRIBUTING.md#linter-testing vs code can show the errors. Or run npm run test:lint

The linter and formatter were run, and we've added the updates in the commit along with a few changes in the tests. Let us know if there's anything else needed!

Comment on lines 42 to 39
const isKServeNIMEnabled = isProjectNIMSupported(currentProject);

const modelMetricsSupported =
modelMetricsEnabled && (modelMesh || kserveMetricsEnabled) && !isKServeNIMEnabled;
const modelMetricsSupported = modelMetricsEnabled && (modelMesh || kserveMetricsEnabled);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another linter error since isKserveNIMEnabled is no longer being used
https://github.com/opendatahub-io/odh-dashboard/actions/runs/12686647659/job/35374595034?pr=3612#step:9:48

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We performed the needed changes.

Copy link

codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 83.80952% with 51 lines in your changes missing coverage. Please review.

Project coverage is 85.82%. Comparing base (3ea00fd) to head (eebe460).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...end/src/api/prometheus/kservePerformanceMetrics.ts 45.16% 34 Missing ⚠️
...metrics/kserve/content/NIMCurrentRequestsGraph.tsx 80.00% 3 Missing ⚠️
...pts/metrics/kserve/useNimMetricsGraphDefinition.ts 84.21% 3 Missing ⚠️
...s/modelServing/screens/metrics/MetricsPageTabs.tsx 62.50% 3 Missing ⚠️
...etrics/kserve/content/NIMRequestsOutcomesGraph.tsx 84.61% 2 Missing ⚠️
...ontend/src/api/prometheus/NimPerformanceMetrics.ts 98.63% 1 Missing ⚠️
.../src/concepts/metrics/kserve/NimMetricsContext.tsx 96.00% 1 Missing ⚠️
...pts/metrics/kserve/content/NIMTokensCountGraph.tsx 92.30% 1 Missing ⚠️
...ts/metrics/kserve/content/NimPerformanceGraphs.tsx 95.65% 1 Missing ⚠️
frontend/src/concepts/metrics/kserve/utils.ts 75.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3612      +/-   ##
==========================================
- Coverage   85.82%   85.82%   -0.01%     
==========================================
  Files        1419     1433      +14     
  Lines       32644    32957     +313     
  Branches     9167     9222      +55     
==========================================
+ Hits        28018    28284     +266     
- Misses       4626     4673      +47     
Files with missing lines Coverage Δ
frontend/src/concepts/metrics/kserve/const.ts 100.00% <100.00%> (ø)
...ts/metrics/kserve/content/NIMKVCacheUsageGraph.tsx 100.00% <100.00%> (ø)
...rics/kserve/content/NIMTimeForFirstTokenGraphs.tsx 100.00% <100.00%> (ø)
...rics/kserve/content/NIMTimePerOutputTokenGraph.tsx 100.00% <100.00%> (ø)
...cepts/metrics/kserve/content/NimMetricsContent.tsx 100.00% <100.00%> (ø)
...ages/modelServing/screens/metrics/MetricsChart.tsx 89.24% <100.00%> (+0.61%) ⬆️
...s/modelServing/screens/metrics/nim/ModelGraphs.tsx 100.00% <100.00%> (ø)
...es/modelServing/screens/metrics/nim/NimMetrics.tsx 100.00% <100.00%> (ø)
...nd/src/pages/modelServing/screens/metrics/types.ts 100.00% <100.00%> (ø)
...rving/screens/metrics/useMetricsPageEnabledTabs.ts 100.00% <100.00%> (ø)
... and 12 more

... and 2 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ea00fd...eebe460. Read the comment docs.

Copy link
Contributor

@emilys314 emilys314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to add a link to the metrics view in the model list for NIM projects, currently there is no way to get there? frontend/src/pages/modelServing/screens/global/InferenceServiceTableRow.tsx

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it even possible to deploy a NIM model using modelmesh? I assume it's not so isModelMesh(model) ? <ModelMeshMetrics /> and the entire file is not needed.

@@ -69,7 +69,7 @@ export const blankDashboardCR: DashboardConfig = {
disableServingRuntimeParams: false,
disableConnectionTypes: false,
disableStorageClasses: false,
disableNIMModelServing: true,
disableNIMModelServing: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentional? I don't see in the JIRA or PR description that we should be changing this. Can you revert this?

@@ -6,8 +6,10 @@ const useModelMetricsEnabled = (): [modelMetricsEnabled: boolean] => {
).status;
const biasMetricsAreaAvailable = useIsAreaAvailable(SupportedArea.BIAS_METRICS).status;

const nimMetricsAreaAvailable = useIsAreaAvailable(SupportedArea.BIAS_METRICS).status;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const nimMetricsAreaAvailable = useIsAreaAvailable(SupportedArea.BIAS_METRICS).status;
const nimMetricsAreaAvailable = useIsAreaAvailable(SupportedArea.NIM_MODEL).status;

I think you intended to check NIM_MODEL?

if (performanceMetricsAreaAvailable) {
enabledTabs.push(MetricsTabKeys.PERFORMANCE);
}
if (biasMetricsAreaAvailable) {
enabledTabs.push(MetricsTabKeys.BIAS);
}
if (nimMetricsAreaAvailable) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably shouldn't show the NIM tab for models / projects that are not NIM. If you make a standard kserve deployment it will be a blank page
image

Copy link

@mtalvi mtalvi Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We changed the disableNIMModelServing flag back to true. In order to test the tab isolation, the flag needs to be changed to false.

};

export const useFetchKserveKVCacheUsageData = (
metricsDef: KserveMetricGraphDefinition,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metricsDef: KserveMetricGraphDefinition,
metricsDef: KserveMetricGraphDefinition | NimMetricGraphDefinition,

I don't think we need to duplicate files (frontend/src/api/prometheus/NimPerformanceMetrics.ts and frontend/src/api/prometheus/kservePerformanceMetrics.ts) with just a different metricsDef type. Can we remove one of them and union the metricsDef type to contain both?

@@ -189,7 +182,8 @@ const MetricsChart: React.FC<MetricsChartProps> = ({
{hasSomeData ? (
<Chart
ariaTitle={title}
containerComponent={containerComponent}
// containerComponent={containerComponent}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: remove the old commented out code

Copy link
Contributor

openshift-ci bot commented Jan 14, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign manosnoam for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@emilys314
Copy link
Contributor

@LinoyBitan1 LinoyBitan1 force-pushed the add-nim-metrics branch 3 times, most recently from d36865b to 859f097 Compare January 15, 2025 13:04
Comment on lines 38 to 39
//check availability of NIM metrics
const nimMetricsAreaAvailable = useIsAreaAvailable(SupportedArea.NIM_MODEL).status;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed because const isNIMAvailable = servingPlatformStatuses.kServeNIM.enabled; will already check it for you and return false if the area is not available

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We think that it is needed because that just says if NIM is available or not available in the cluster but not by project. We've verified it with @olavtar .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useIsAreaAvailable(SupportedArea.NIM_MODEL).status; is only checking if it's it's available in the cluster too.

const isNIMAvailable = servingPlatformStatuses.kServeNIM.enabled; is the check for cluster wide nim enablement. And const isKServeNIMEnabled = project ? isProjectNIMSupported(project) : false is the one checking for nim project enablement.

useServingPlatformStatuses() for nim checks const { isNIMAvailable } = React.useContext(NIMAvailabilityContext); and that has

  const isNIMModelServingAvailable = useIsAreaAvailable(SupportedArea.NIM_MODEL).status;

  const fetchNIMAvailability = React.useCallback(async () => {
    if (!isNIMModelServingAvailable) {
      return false;
    }

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are totally right!

const servingPlatformStatuses = useServingPlatformStatuses();
const isNIMAvailable = servingPlatformStatuses.kServeNIM.enabled;
const { projects } = React.useContext(ProjectsContext);
const project = projects.find(byName(model.metadata.namespace)) ?? null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const project = projects.find(byName(model.metadata.namespace)) ?? null;
const project = projects.find(byName(model.metadata.namespace));

small nitpick but ?? null is not needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test The openshift bot needs `ok-to-test` to allow non member PRs to run the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants