Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(emissary): signal SIGINT/SIGTERM in windows correctly #13693

Merged
merged 2 commits into from
Oct 9, 2024

Conversation

mweibel
Copy link
Contributor

@mweibel mweibel commented Oct 2, 2024

Motivation

emissary tries to send a signal but os/Process.Signal only supports sending SIGKILL and returns an error for all other cases.

Modifications

Using code found in hcsshim this changes signal handling in emissary for windows by translating SIGINT and SIGTERM to their appropriate windows signal and sending it to the process.

Verification

custom build of argoexec and tested on several windows workflows as well as manual testing on a windows machine.

@agilgur5 agilgur5 added area/executor area/windows Windows Container support labels Oct 2, 2024
@mweibel mweibel force-pushed the windows-kill branch 3 times, most recently from 0238e38 to e43d4e1 Compare October 3, 2024 20:47
emissary tries to send a signal but `os/Process.Kill` only supports
sending SIGKILL and returns an error for all other cases.

Using code found in hcsshim this changes signal handling in emissary for
windows by translating SIGINT and SIGTERM to their appropriate windows
signal and sending it to the process.

Signed-off-by: Michael Weibel <michael@helio.exchange>
@mweibel
Copy link
Contributor Author

mweibel commented Oct 3, 2024

this should be soon ready, waiting for the CI run to finish.

tested using a custom build of argoexec. Finally windows runs exit within a timely manner when k8s sends a sigterm 🎉

@mweibel mweibel marked this pull request as ready for review October 4, 2024 07:10
@mweibel
Copy link
Contributor Author

mweibel commented Oct 4, 2024

@agilgur5 this is ready to review now.

@juliev0
Copy link
Contributor

juliev0 commented Oct 6, 2024

Just to confirm I have the right context - basically, this is the emissary "argoexec" binary on the main container which initially is sending a SIGTERM to the subprocess, and then when instructed by the "wait" container after some timeout, sends a SIGKILL, right?

So, you're saying that for every shutdown, the SIGTERM command is getting an error back so it's essentially always timing out and having to send the SIGKILL, resulting in non-graceful shutdown?

@juliev0
Copy link
Contributor

juliev0 commented Oct 6, 2024

This is impressive but also wondering if we can avoid such low level logic, including the use of unsafe.Pointer. I was curious what A/I would generate if I asked it how to send a CTRL_SHUTDOWN_EVENT to a subprocess. It gave me this:

func sendCtrlShutdownEvent(pid int) {
	// Get the process handle
	hProcess, err := syscall.OpenProcess(syscall.PROCESS_ALL_ACCESS, false, uint32(pid))
	if err != nil {
		fmt.Println("Error opening process:", err)
		return
	}
	defer syscall.CloseHandle(hProcess)

	// Send the CTRL_SHUTDOWN_EVENT
	const ctrlShutdownEvent = 0x00000001
	if err := syscall.GenerateConsoleCtrlEvent(ctrlShutdownEvent, uint32(pid)); err != nil {
		fmt.Println("Error sending CTRL_SHUTDOWN_EVENT:", err)
	}
}

I haven't honestly vetted it, and it may be incorrect, but could I bother you to check it/compare with what you have?

@mweibel
Copy link
Contributor Author

mweibel commented Oct 7, 2024

Thanks @juliev0 for the review!

Just to confirm I have the right context - basically, this is the emissary "argoexec" binary on the main container which initially is sending a SIGTERM to the subprocess, and then when instructed by the "wait" container after some timeout, sends a SIGKILL, right?

So, you're saying that for every shutdown, the SIGTERM command is getting an error back so it's essentially always timing out and having to send the SIGKILL, resulting in non-graceful shutdown?

Yes, that's correct. Only SIGKILL is processed by os.Signal on windows. Signalling SIGTERM returns an EWINDOWS instead. Emissary ignores return values from osspecific.Kill and that's why this never surfaced.

The behavior on windows is indeed that emissary receives SIGTERM/SIGINT and logs "forwarding signal terminated" but the underlying process never receives this signal. Emissary then exits successfully but the underlying process continues until the pod is SIGKILL'ed.

This is impressive but also wondering if we can avoid such low level logic, including the use of unsafe.Pointer. I was curious what A/I would generate if I asked it how to send a CTRL_SHUTDOWN_EVENT to a subprocess

The low-level logic was extracted from hcsshim and generated by go-winio. I did that mostly because I knew they'd need that functionality too and probably know what they're doing (unlike me, I'm no windows dev at all, I just happen to have to solve issues we have with it). I totally understand what you mean though and will try the code you got from AI as soon as I get to it. I also dug a little deeper into Go's standard library on how they do it and might try if there's a way to replicate it.

@mweibel
Copy link
Contributor Author

mweibel commented Oct 7, 2024

I tested the suggestion you copied from AI - it doesn't work just like that (e.g. syscall.GenerateConsoleCtrlEvent is not a function which exists but instead we need to load it from kernel32 similar to other functions I loaded).

Code which "worked":

var (
	kernel32              = syscall.NewLazyDLL("kernel32.dll")
	procGenerateCtrlEvent = kernel32.NewProc("GenerateConsoleCtrlEvent")
)

// sendCtrlEvent sends a control event (such as CTRL_C_EVENT or CTRL_BREAK_EVENT) to a process group.
// The eventType can be syscall.CTRL_C_EVENT or syscall.CTRL_BREAK_EVENT.
func sendCtrlEvent(eventType uint32, pid uint32) error {
	r1, _, err := procGenerateCtrlEvent.Call(uintptr(eventType), uintptr(pid))
	if r1 == 0 {
		return fmt.Errorf("failed to send control event: %v", err)
	}
	return nil
}

However, GenerateConsoleCtrlEvent can't send a CTRL_C_SHUTDOWN event which is what SIGTERM roughly translates to (or only under very specific circumstances) and we can't use it therefore. I tried it but unfortunately the signal didn't arrive at the subprocess.

Therefore we have to stick with the code in the PR. About unsafe.Pointer usage: checking other syscall code it's mostly used there and there's not really a way around it (or at least I know not enough windows for that). It's also pretty safe in a way that the unsafe pointers we use there (sa and threadID) are both nil - so we don't really use it ;)

Let me know your thoughts @juliev0.

@juliev0
Copy link
Contributor

juliev0 commented Oct 8, 2024

Thanks for trying different things. Okay, if this seems like the only way, then we can do that. And I recognize that if this code is copied from a trusted source, we can have some confidence.

Let me check this again when I get time in the next couple days.

@juliev0 juliev0 merged commit 7a33720 into argoproj:main Oct 9, 2024
30 checks passed
@mweibel
Copy link
Contributor Author

mweibel commented Oct 10, 2024

thanks @juliev0! Will this be part of the next 3.5.x release?

@juliev0
Copy link
Contributor

juliev0 commented Oct 10, 2024

You can request that here: #11997

isubasinghe pushed a commit that referenced this pull request Oct 30, 2024
Signed-off-by: Michael Weibel <michael@helio.exchange>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/executor area/windows Windows Container support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants