Support LNC with trn2 #159

Open: 3 commits to merge into base branch main

@movence movence commented Jan 9, 2025

LNC (Logical Neuron Cores), a concept that represents multiple Neuron cores as a single Neuron core, is a new feature supported on trn2. It requires a new volume mount so that neuron-monitor can read the LNC configuration.

Description of changes:

  • Bump neuron-monitor image to 1.3.0
  • Add new volume mount for /opt
  • Update GOMEMLIMIT to 320MiB

Tested by deploying the changes to a test cluster:

# describe neuron monitor (some truncated)

Containers:
  neuron-monitor:
    Image:         public.ecr.aws/neuron/neuron-monitor:1.3.0
    Image ID:      public.ecr.aws/neuron/neuron-monitor@sha256:[sha]
    Port:          8000/TCP
    Host Port:     0/TCP
    ...
    State:          Running
      Started:      Thu, 09 Jan 2025 13:03:44 -0500
    Ready:          True
    Limits:
      cpu:     500m
      memory:  256Mi
    Requests:
      cpu:     256m
      memory:  128Mi
    Environment:
      NODE_NAME:    (v1:spec.nodeName)
      PATH:        /usr/local/bin:/usr/bin:/bin:/opt/aws/neuron/bin
      GOMEMLIMIT:  320MiB
    Mounts:
      /etc/amazon-cloudwatch-observability-neuron-cert/ from neurontls (ro)
      /etc/neuron-monitor-config from neuron-monitor-config (rw)
      /opt-aws from aws-config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hxmsx (ro)
    Volumes:
      aws-config:
        Type:          HostPath (bare host directory volume)
        Path:          /opt/aws
        HostPathType:


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@mounchin mounchin left a comment

@@ -36,7 +36,7 @@ spec:
   - name: PATH
     value: /usr/local/bin:/usr/bin:/bin:/opt/aws/neuron/bin
   - name: GOMEMLIMIT
-    value: 160MiB
+    value: 320MiB
The default limit itself is lower than this -

Should we update that as well?
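
To put numbers on this concern, here is a minimal sketch that compares the new GOMEMLIMIT with the container memory limit shown in the describe output above. It assumes that "the default limit" refers to that 256Mi container limit, which the comment does not state explicitly. The `to_bytes` helper is hypothetical and written only for this illustration.

```python
# Binary unit multipliers. GOMEMLIMIT uses suffixes such as "MiB";
# Kubernetes resource quantities use suffixes such as "Mi".
UNITS = {
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '320MiB' or '256Mi' to bytes (hypothetical helper)."""
    # Check longer suffixes first so the match is unambiguous.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * UNITS[suffix]
    return int(quantity)  # no suffix: treat as a plain byte count


gomemlimit = to_bytes("320MiB")      # new GOMEMLIMIT value in this PR
container_limit = to_bytes("256Mi")  # Limits.memory from the describe output (assumed to be the "default limit")

# Negative headroom means the Go soft limit sits above the container's hard limit.
headroom = container_limit - gomemlimit
print(f"headroom: {headroom // 2**20} MiB")
```

Under that assumption the headroom is negative: the Go runtime's soft memory limit would sit 64 MiB above the container's hard limit, so the container could be OOM-killed before the soft limit has any effect.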
