Introduce Node Lifecycle WG #8396

Open
wants to merge 1 commit into
base: master
from

Conversation

atiratree
Member

No description provided.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/community-management area/slack-management Issues or PRs related to the Slack Management subproject labels Mar 24, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and ardaguclu March 24, 2025 12:17
@k8s-ci-robot k8s-ci-robot added committee/steering Denotes an issue or PR intended to be handled by the steering committee. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/contributor-experience Categorizes an issue or PR as relevant to SIG Contributor Experience. do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Mar 24, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Mar 24, 2025
@atiratree atiratree changed the title Introduce Node Lifecycle WG WIP: Introduce Node Lifecycle WG Mar 24, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2025
@atiratree
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2025
@rthallisey

Looks like I'm not a member of kubernetes org anymore. I was a few years back, but didn't keep up with contributions recently. You can remove me as a lead and I can reapply after some contributions to this WG.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Mar 24, 2025
@atiratree
Member Author

We have had impactful conversations with Ryan about this group and its goals. He has experience with cluster maintenance and I look forward to his participation in the WG.

@marquiz
Contributor

marquiz commented Mar 25, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from marquiz March 25, 2025 17:09
@selansen

@atiratree, I would like to be part of this WG. Please include me as well.

@evrardjp

I have written some PoCs that might interest this WG; sign me up.

@evrardjp

/cc

@k8s-ci-robot
Contributor

@evrardjp: GitHub didn't allow me to request PR reviews from the following users: evrardjp.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kaushik229

/cc

@k8s-ci-robot
Contributor

@kaushik229: GitHub didn't allow me to request PR reviews from the following users: kaushik229.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc

Contributor

@elmiko elmiko left a comment

I like how this is going and am excited to see the WG formed. Thank you, @atiratree!

@humblec
Contributor

humblec commented Apr 16, 2025

I have been exploring the APIs in this area and would like to help with this initiative. With that in mind, @atiratree, I would like to be part of this WG.

@atiratree atiratree force-pushed the wg-node-lifecycle branch 2 times, most recently from 43ff1f5 to c627543 Compare April 17, 2025 15:41
@atiratree
Member Author

Thank you all for your interest!

Just to be on the same page for all visitors, this WG is open to everyone and we will announce the weekly meetings on the dev@kubernetes.io mailing list as soon as the group is formed.

If you are interested in helping us organize or lead this group, please message me on Slack to discuss.

...) and other scenarios to use the new unified node draining approach.
- Explore possible scenarios behind the reason why the node was terminated/drained/killed and how to
track and react to each of them. Consider past discussions/historical perspective
(e.g. "tombstones").
Member

In SIG Node this week, we saw a presentation on node readiness from @ajaysundark, and we have also been discussing the idea of node capabilities, which @pravk03 (among others) is looking into. These seem like related but slightly different topics, both from each other and from the existing taints/tolerations functionality. Can we include these in the scope of this WG?

Member

+1 to adding Node Readiness as one of the goals. I can only speak from the storage perspective, but one long-standing problem is making sure that pods which require a CSI driver are scheduled only onto nodes where that CSI driver is available.

Currently, we don't take this into account, so too many pods that require storage can be scheduled to a node. They then remain stuck and require manual intervention to clean up. I have added my thoughts here - kubernetes/kubernetes#131208 (comment)

Contributor

I agree that we should consider node readiness here, and I also want to call out the risk of scope creep for this WG. At the outset, I think we should be very clear about what our definition of done is. This proposal began as a way to close the GNS KEP, and while I agree the expanded scope of reworking node API semantics would properly fix the problem, I want to be wary of biting off more than the community can collectively chew.

Agreed, we should explicitly mention Node Readiness.

@haircommander, I share your concern. We're not certain that some of these deliverables are solvable, so the language is a little open-ended right now. I think we will need to provide clarity as time goes on.

Member Author

Good idea, I have added both of these initiatives to the WG.


@ajaysundark ajaysundark Apr 29, 2025

Thanks for including node readiness in the scope of wg-node-lifecycle.
I have the proposal ready for feedback: https://docs.google.com/document/d/11i2_rewvcbQkFFq1BIwHa7lgIefgZ8Mak-QMZjP7fFs/edit?tab=t.0#heading=h.jtg1hwbt8shd. Happy to collaborate to align on the direction. Please include me in the relevant channels and discussions.

Member Author

Cool, I will take a look and we can add it to the WG agenda.

Btw, the link is wrong above but correct in the enhancements issue.

@@ -60,6 +60,12 @@ subprojects, and resolve cross-subproject technical issues and decisions.
- [@kubernetes/sig-cli-test-failures](https://github.com/orgs/kubernetes/teams/sig-cli-test-failures) - Test Failures and Triage
- Steering Committee Liaison: Paco Xu 徐俊杰 (**[@pacoxu](https://github.com/pacoxu)**)

## Working Groups

The following [working groups][working-group-definition] are sponsored by sig-cli:
Contributor

Is SIG CLI the sponsor, or SIG Node?

It should be SIG Node. We'll fix it in the next update.

Member Author

We are adding this to all the stakeholder SIGs and they are in favor of participating (or at least consulting) in this effort. Btw this file is generated.

@atiratree atiratree force-pushed the wg-node-lifecycle branch 3 times, most recently from 89b5a33 to 6c47dcc Compare April 28, 2025 18:42
@atiratree atiratree changed the title WIP: Introduce Node Lifecycle WG Introduce Node Lifecycle WG Apr 29, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2025
Co-authored-by: Ryan Hallisey <rhallisey@nvidia.com>
@atiratree
Member Author

We have just opened voting for the WG meetings. PTAL https://groups.google.com/a/kubernetes.io/g/dev/c/q37DZHUpemA/m/g1vDescLEAAJ

@jonathan-innis

The Karpenter maintainer team has been looking forward to seeing something like this get rolling in the community! We've been keeping an eye on kubernetes/enhancements#4212 and kubernetes/enhancements#4563 for a while now and would love to see some form of these things go into the project!

Would love to join and make sure some folks from the project participate!

Labels
area/community-management area/slack-management Issues or PRs related to the Slack Management subproject cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. committee/steering Denotes an issue or PR intended to be handled by the steering committee. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/contributor-experience Categorizes an issue or PR as relevant to SIG Contributor Experience. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Needs Triage