[Track] Call for Machine/Device Sponsorship #1906

Open
3 tasks
sunya-ch opened this issue Jan 8, 2025 · 7 comments
Labels: discussion, help wanted, kind/feature

@sunya-ch
Collaborator

sunya-ch commented Jan 8, 2025

What would you like to be added?

Currently, we have only a limited number and variety of machines and sensors for training the power models we provide in the model database.
We have an idea to open a call for machine/device sponsorship, with proper recognition for sponsors.

Here are the tasks I can list so far:

  • write down the procedure and guidelines for providing a machine (which information needs manual input, how to integrate the provided self-hosted machine into the workflow, and how to secure the process)
  • confirm with the CNCF what we are allowed to give back as appreciation (official recognition), so that we stay compliant
  • promote the call

I can take care of the first task but still need help with the rest.

Why is this needed?

As several research works have noted, power consumption behavior varies with several factors.
Using the right power model to predict power consumption on a machine that has no power meter is critical to the precision of the reported values.

@sunya-ch added the kind/feature, help wanted, and discussion labels Jan 8, 2025
@sunya-ch self-assigned this Jan 8, 2025
@SamYuan1990
Collaborator

I will lay out my points in execution order and summarize them with a diagram.

First, we need to think about how the audience triggers the pipeline. This is not about the integration level, which is a separate topic to discuss later; it is about where the GHA pipeline starts. Ideally the audience just clicks a button on the Actions page and the pipeline runs. Maybe we can offer that by publishing a kepler-action on the GHA marketplace?

One step further, we need to set up the prerequisites for Kepler, for example eBPF, RAPL...., or just a container runtime if the audience gives us a clean instance. In some cases the audience gives us a kubeconfig file instead, which lets us skip setting up the container runtime and start from Prometheus, or the audience may already have Prometheus installed.

The pipeline is triggered by GHA and runs on a customer-provided agent. In other words, the audience should install a GHA agent on their proxy instance or on an instance connected to GHA. This matters because the instance usually sits inside a data center or in the audience's environment behind a firewall, or there may be some other reason why a GHA agent can't be installed on it.

So the question for the trigger is whether the audience can use a GHA agent (a customer runner provided by them) to start the pipeline, or just uses the default GHA runner.
Then we come to this: the audience provides us an instance, which is either a GHA runner itself or reachable from a GHA runner. Do we need to install k8s and build the Kepler stack from zero?
CI/CD operations are mostly idempotent, but do we need to build the Kepler stack from zero every time once the instance has been provided?

As an alternative, the user may just provide us a k8s cluster. Luckily we then don't need to set up k8s from zero, but a similar question comes up: do we need to build the metal-ci stack (such as Tekton or Prometheus) from zero on k8s, or do we ask the provider to set it up once?

After the run, how does the audience send the results (maybe both the Kepler validation result and the Kepler server model, if they are going to train) back to us?

I suppose we need to define some key steps or nodes here, and separate what is the audience's behavior from what is our pipeline code's behavior.
If our code does not meet the requirements, we can open some issues and fix them before going public.

@SamYuan1990
Collaborator

image

@sunya-ch
Collaborator Author

sunya-ch commented Jan 8, 2025

@SamYuan1990 Thank you so much.

> I suppose we need to define some key steps or nodes here, and separate what is the audience's behavior from what is our pipeline code's behavior. If our code does not meet the requirements, we can open some issues and fix them before going public.

Yes, definitely.

We should start by defining the objective and the actors.
Recalling our previous discussion, we originally thought of calling only for model results (and training data if possible).
Another idea is to call for machines to plug into our pipeline.

| condition | call for trained model | call for machine |
|---|---|---|
| who triggers the training pipeline | contributor (locally) | Kepler workflow (like metal-ci) |
| contributor provides | power model (and training data if possible) | self-hosted runner, with a token to access it or a token to create it (if on demand) |
| how to contribute | by PR pushed to kepler-model-db directly | by PR pushed to metal-ci; Kepler should then automatically update kepler-model-db on every Kepler version update |
| concern | contributor needs to keep the model updated on Kepler version changes | GHA token sharing, as you mentioned above; Kepler needs an agreement, governance, and compliance to use a donated machine |

Personally, I would also like to go the call-for-machine way, but we need to get through at least the concerns listed in the table.

@SamYuan1990
Collaborator

@sunya-ch or @rootfs, I am not sure whether the following looks good as a draft for the call-for article.


Must read/sign:

As someone donating a machine to contribute Kepler validation results or a Kepler model to kepler-model-db, please complete the following:

Security

  • Token permission: GHA token sharing as you mentioned above

Donation agreement and governance

  • Kepler needs an agreement, governance, and compliance to use a donated machine....

How it is going to work:

The mini scope:

- [ ] Are you going to train the model, or do you just want to validate?
- [ ] Kepler validation and Kepler model training run on a k8s cluster with:
	- [ ] a power API such as RAPL... Validation compares two sides: on one side, Kepler's output, as provided by the Kepler community repo at....; on the other side, the machine-specific measurement, which:
		- [ ] may be hard for community developers to get, as you are the owner of the machine.
		- [ ] is better made reusable, so we encourage you to contribute the scripts back to the community via PR.
	- [ ] Kepler needs eBPF, cgroup, and other system/k8s resources, so please make sure your k8s cluster allows a Kepler pod to pass through and get real data. Or are you going to use a VM solution?
		- [ ] Are you going to use Kepler on BM?
		- [ ] Are you going to use Kepler with the model server?
		- [ ] Are you going to use Kepler without an existing model?
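
The eBPF, cgroup, and RAPL items above could be checked by the donor before anything is installed. Here is a minimal sketch in Python, assuming the usual Linux sysfs locations. The function name `check_host_prerequisites` and the choice of checks are illustrative only and are not taken from Kepler; its real requirements may differ.

```python
from pathlib import Path


def check_host_prerequisites(root: str = "/") -> dict[str, bool]:
    """Report which kernel interfaces are visible under `root`.

    The set of checks is an assumption for illustration, not Kepler's
    official requirement list.
    """
    base = Path(root)
    return {
        # Intel RAPL energy counters exposed through the powercap framework.
        "rapl_powercap": (base / "sys/class/powercap/intel-rapl").exists(),
        # Marker file of the cgroup v2 unified hierarchy.
        "cgroup_v2": (base / "sys/fs/cgroup/cgroup.controllers").exists(),
        # Kernel BTF type information, commonly needed by CO-RE eBPF programs.
        "kernel_btf": (base / "sys/kernel/btf/vmlinux").exists(),
    }


if __name__ == "__main__":
    for name, present in check_host_prerequisites().items():
        print(f"{name}: {'found' if present else 'missing'}")
```

Inside a pod, the same check only reflects the host if the relevant host paths are mounted into the container, which is the pass-through question raised above.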

After the mini scope, let's consider integration with GHA:

- [ ] How do I integrate with GHA? Check with your network admin or sysadmin about network and security.
	- [ ] Is it possible to make an instance a GHA agent and just run Kepler metal CI on that specific agent?
	- [ ] Or am I going to find a bridge server to do it?
		- [ ] If it's a bridge server, how does this server communicate with the k8s cluster where I am going to test Kepler?
		- [ ] If it's a bridge server, can I just use GHA's free VM and pass a kubeconfig through? (This sounds like a security concern, but in fact the key to interacting with k8s is the kubeconfig file, isn't it?)

Here, on the config-specific side, we have an overview of how to integrate Kepler metal CI with GHA as the pipeline trigger, reusing the community pipeline as much as possible while still accounting for machine specifics.

- [ ] For the specific machine (as a worker node of the k8s cluster), are Kepler's prerequisites ready, and does pass-through work?
- [ ] From the k8s cluster's point of view, what are we going to pre-install and what are we going to leave to the community?
	- [ ] Prometheus
	- [ ] Tekton
	- [ ] ...
- [ ] Do we need to change/customize the Kepler deployment to match our specific node label?
- [ ] Do we need to change/customize/specify the Kepler image label?

For example, Kepler metal CI can build the stack from zero, so by default it installs Prometheus on k8s every time. But if Prometheus already exists on your k8s cluster, how should we integrate with it and skip the Prometheus install? (You may want this, at least to save time or network bandwidth by not building Prometheus each time.)
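
One way to picture the "skip what already exists" idea is a small planning step that turns the donor's answers into a list of setup steps. The sketch below is only an assumption about how this could look: `DonatedEnvironment`, `plan_setup_steps`, and the step names are hypothetical and do not correspond to existing metal-ci code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DonatedEnvironment:
    """What the donor already provides (hypothetical fields)."""
    has_kubernetes: bool = False  # donor supplies a cluster or kubeconfig
    has_prometheus: bool = False  # Prometheus is already installed
    has_tekton: bool = False      # Tekton is already installed


def plan_setup_steps(env: DonatedEnvironment) -> list[str]:
    """Return the setup steps still needed, skipping what already exists."""
    steps = []
    if not env.has_kubernetes:
        steps.append("install-kubernetes")
    if not env.has_prometheus:
        steps.append("install-prometheus")
    if not env.has_tekton:
        steps.append("install-tekton")
    # Kepler itself is deployed on every run in this sketch.
    steps.append("deploy-kepler")
    return steps


if __name__ == "__main__":
    env = DonatedEnvironment(has_kubernetes=True, has_prometheus=True)
    print(plan_setup_steps(env))
```

Because the plan depends only on the declared environment, running it twice with the same answers gives the same steps, which fits the idempotency concern raised earlier in the thread.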

Finally, once the run succeeds, these questions come up:

- [ ] How do I upload the test results back to the Kepler community?
	- [ ] Where are the test result files? On the bridge instance? On a random GHA instance? Somewhere else?
- [ ] How do I upload the model training result as part of kepler-model-db?
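
Wherever the result files end up, a manifest with checksums would let maintainers confirm that what arrives in a PR is what the donated machine produced. The following is a rough sketch under that assumption. The manifest layout, `build_result_manifest`, and the `machine` field are hypothetical and are not a format defined by kepler-model-db.

```python
import hashlib
import json
from pathlib import Path


def build_result_manifest(result_dir: str, machine_label: str) -> str:
    """Return a JSON manifest listing each result file with its SHA-256 digest."""
    root = Path(result_dir)
    files = {
        # Use POSIX-style relative paths so the manifest is portable.
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
    manifest = {"machine": machine_label, "files": files}
    return json.dumps(manifest, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Example: describe the files in the current directory.
    print(build_result_manifest(".", "example-machine"))
```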

@sunya-ch
Collaborator Author

@SamYuan1990 Thank you for the draft. It is very useful. I think it covers many items that should be in the agreement letter.
I am concerned that some items, with their multiple choices, might confuse the donor and would also be difficult for maintainers to support. If possible, I would like to keep it as simple as the way @rootfs contributes the token for the spot instances and the Equinix machine.

@rootfs I think we first need advice from the CNCF or the Linux Foundation on whether we can call for sponsorship for the project and what the process and agreement should be. We can then add our Kepler-specific requests, as Sam mentioned above, on top of that.

@rootfs
Contributor

rootfs commented Jan 22, 2025

I would love to have setups with ARM, GPU (NVIDIA, AMD, Intel), and the latest CPUs from Intel/AMD to help us validate Kepler on these platforms.

@SamYuan1990
Collaborator

If someone donates an instance, how are we going to use it?
Do we install Kepler and the Kepler model server, and then train a model on the instance?
When should the trained result go back to model-server-db? How do we validate the result on the specific instance, with or without help from the donor?
