TLS - Random requests stalling & occasional segfault [raspi aarch64] #64
Comments
Changed nothing and got a different error after spam refreshing...
... not what I was hoping to see, honestly. Edit: adding
There aren't any line numbers or files with these, I'm not omitting them :( |
Right, so, here's what I got before I get some zzz's. https://github.com/Vemahk/zap-issue-64 It has 3 commits: the first to start, the second to get the https sample in there (which does not have the issue), and the third to break it (adding .public_folder w/ favicon). My conclusion so far is it either...
Not terribly sure to be honest. Either way, I'm kinda tired and am gonna go to sleep. Will probably be staring at this tomorrow some, too. Edit: well, I also went and made the change directly to the https example in zap, ran it from zap, and it worked just fine. ... what? So could it be a build issue? I'm so confused at this point 😢 |
So I ended up being able to work around the issue by implementing a (crude) custom .public_folder check. So it's looking like this issue is specifically with facil's .public_folder implementation? |
Please describe the system on which you're seeing the above errors: Linux / macOS / ..? What architecture? Also, what's the status now with your custom public folder implementation? Do you still see stalling / segfaults? Were the segfaults real segmentation faults or "just" crashes / exits? |
Running After implementing the custom public_folder implementation, I could no longer cause the stalling and segfaulting to occur. I believe it was actually a segfault:
I did get a couple other memory related errors from malloc and free, though, on different runs:
|
Well, I just had a test run with my public_folder changes that crashed with the The stalling also doesn't seem to be happening with cURL, just my browser, so that might also be another bad lead T.T |
Thx for the update. I have never seen public-folder-related issues in zap before, but I haven't used public folders extensively. Just for "fun", have you tried disabling openssl to see whether the erroneous behavior still gets triggered? Some ideas for reproducibility and gathering more info: you could use some (comptime) consts to configure your example (debug / release modes, with/without openssl, with builtin/custom public_folder, etc.), add a curl-based rapid-fire client script to try to trigger the behavior, and report the constellations that others could try on their systems. It seems like your crypto libs are strange, or something somewhere passes in rubbish data. If it's browser-only: which browser? From what machine? Maybe all of that can help. |
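A curl-based rapid-fire client along the lines suggested above could look roughly like the sketch below. The function names `is_stall` and `rapid_fire`, the 2-second stall threshold, and the 30-second timeout are assumptions chosen for illustration, not anything taken from zap or from this thread. It sends requests one after another, flags any that take longer than the threshold (the stalls reported in this issue were around 6 seconds), and counts outright failures.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; names and defaults are assumptions.

# is_stall SECONDS THRESHOLD: succeeds when SECONDS exceeds THRESHOLD.
# awk is used because curl reports fractional seconds.
is_stall() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t > limit) }'
}

# rapid_fire URL COUNT [THRESHOLD]: send COUNT sequential requests, print a
# line for each stalled or failed request, then print a one-line summary.
rapid_fire() {
  local url="$1"
  local count="$2"
  local threshold="${3:-2}"
  local i=1
  local stalls=0
  local failures=0
  local elapsed
  while [ "$i" -le "$count" ]; do
    # -k accepts self-signed certs, -s hides the progress meter,
    # -w prints the total request time in seconds.
    if elapsed=$(curl -k -s -o /dev/null --max-time 30 -w '%{time_total}' "$url"); then
      if is_stall "$elapsed" "$threshold"; then
        stalls=$((stalls + 1))
        echo "request $i stalled: ${elapsed}s"
      fi
    else
      failures=$((failures + 1))
      echo "request $i failed"
    fi
    i=$((i + 1))
  done
  echo "done: $count requests, $stalls stalled, $failures failed"
}
```

Running something like `rapid_fire https://<host>:4443/ 500` against each build configuration would give a comparable stall and failure count per configuration.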
Yes. More specifically, I was working on my project prior to the TLS changes, and I never saw this. Even on 0.2.5 without TLS enabled, there was no stalling or segfaulting
It's possible they're borked somehow. Though I should note: sometimes the segfault reports a different .so than libcrypto. Including, but not limited to,
Brave connecting over the (local) network. My raspberry pi is terminal only, so I'm developing/running from the pi and connecting to it from my main windows PC using Brave. I'll tinker some more with the reproducibility ideas. Every time I've had a theory so far, it's kinda had a few holes in it. I'm not convinced of the public_folder stuff anymore 😬, though it did seem to alleviate the stalling phenomenon. Thanks for the ideas. |
BTW here is my implementation taken from my
const std = @import("std");
const zap = @import("zap");

const Self = @This();

allocator: std.mem.Allocator,
endpoint: zap.SimpleEndpoint,
debug: bool = true,
settings: Settings,
www_root_cage: []const u8,
frontend_dir_absolute: []const u8,

fn log(self: *Self, comptime fmt: []const u8, args: anytype) void {
    if (self.debug) {
        std.debug.print("[frontend endpoint] - " ++ fmt, args);
    }
}

pub const Settings = struct {
    allocator: std.mem.Allocator,
    endpoint_path: []const u8,
    www_root: []const u8,
    index_html: []const u8,
};

pub fn init(settings: Settings) !Self {
    // endpoint path = frontend_dir
    if (settings.endpoint_path[0] != '/') {
        return error.FrontendDirMustStartWithSlash;
    }

    var ret: Self = .{
        .allocator = settings.allocator,
        .endpoint = zap.SimpleEndpoint.init(.{
            .path = settings.endpoint_path,
            .get = getFrontend,
            .post = null,
            .put = null,
            .delete = null,
        }),
        .settings = settings,
        .frontend_dir_absolute = undefined,
        .www_root_cage = undefined,
    };

    // remember abs path of www_root
    ret.www_root_cage = try std.fs.cwd().realpathAlloc(ret.allocator, settings.www_root);
    std.log.info("Frontend: using www_root: {s}", .{ret.www_root_cage});

    // check for endpoint_path within www_root_cage
    var root_dir = try std.fs.cwd().openDir(ret.www_root_cage, .{});
    // the directory handle is only needed during init
    defer root_dir.close();

    // try to find the frontend subdir = endpoint_path without leading /
    const frontend_dir_stat = try root_dir.statFile(settings.endpoint_path[1..]);
    if (!(frontend_dir_stat.kind == .directory)) {
        return error.NotADirectory;
    }

    // create frontend_dir_absolute for later
    ret.frontend_dir_absolute = try root_dir.realpathAlloc(ret.allocator, settings.endpoint_path[1..]);
    std.log.info("Frontend: using frontend root: {s}", .{ret.frontend_dir_absolute});

    // check if frontend_dir_absolute starts with www_root_absolute
    // to avoid weird linking leading to
    if (!std.mem.startsWith(u8, ret.frontend_dir_absolute, ret.www_root_cage)) {
        return error.FrontendDirNotInRootDir;
    }
    return ret;
}

pub fn deinit(self: *Self) void {
    self.allocator.free(self.frontend_dir_absolute);
    self.allocator.free(self.www_root_cage);
}

pub fn getEndpoint(self: *Self) *zap.SimpleEndpoint {
    return &self.endpoint;
}

fn getFrontend(e: *zap.SimpleEndpoint, r: zap.SimpleRequest) void {
    // recover the enclosing struct from the embedded endpoint field
    const self = @fieldParentPtr(Self, "endpoint", e);
    self.getFrontendInternal(r) catch |err| {
        // report handler failures as 500 Internal Server Error
        r.sendError(err, 500);
    };
}

fn getFrontendInternal(self: *Self, r: zap.SimpleRequest) !void {
    var fn_buf: [2048]u8 = undefined;
    if (r.path) |p| {
        var html_path: []const u8 = undefined;
        var is_root: bool = false;

        // check if we have to serve index.html
        if (std.mem.eql(u8, p, self.settings.endpoint_path)) {
            html_path = self.settings.index_html;
            is_root = true;
        } else if (p.len == self.settings.endpoint_path.len + 1 and p[p.len - 1] == '/') {
            html_path = self.settings.index_html;
            is_root = true;
        } else {
            // no
            html_path = p;
        }

        if (std.fmt.bufPrint(&fn_buf, "{s}{s}", .{ self.www_root_cage, html_path })) |fp| {
            // now check if the absolute path starts with the frontend cage
            if (std.mem.startsWith(u8, fp, self.frontend_dir_absolute)) {
                try r.setHeader("Cache-Control", "no-cache");
                try r.sendFile(fp);
                return;
            } // else 404 below
        } else |err| {
            std.debug.print("Error: {}\n", .{err});
            // continue with 404 below
        }
    }

    r.setStatus(.not_found);
    try r.setHeader("Cache-Control", "no-cache");
    try r.sendBody("<html><body><h1>404 - File not found</h1></body></html>");
}
I've used that successfully in production in many projects, with thousands of clients and hundreds of concurrent clients. |
One more thing: openssl, the command, can be used as a server. You could check whether you see similar behavior with that. Serving the current dir:
$ openssl s_server -accept 4443 -cert mycert.pem -key mykey.pem -WWW
(...and it's bedtime for me) |
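For anyone trying the `openssl s_server` comparison without an existing certificate, a throwaway self-signed pair can be generated first. This is only a sketch: the helper name `make_test_cert`, the RSA-2048 key, the one-day validity, and the `CN=localhost` subject are assumptions for local testing. The file names match the `mycert.pem` / `mykey.pem` used in the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; key size, validity, and subject are
# assumptions suitable only for local testing.

# make_test_cert DIR: write a self-signed certificate (mycert.pem) and an
# unencrypted private key (mykey.pem) into DIR.
make_test_cert() {
  openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout "$1/mykey.pem" -out "$1/mycert.pem" \
    -days 1 -subj "/CN=localhost" 2>/dev/null
}

# Usage, from the directory that should be served:
#   make_test_cert .
#   openssl s_server -accept 4443 -cert mycert.pem -key mykey.pem -WWW
#   curl -k https://localhost:4443/index.html
```

With `-WWW`, `s_server` serves files relative to the current directory, so the same favicon or static files used with zap can be requested from it for comparison.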
Closing this since it seems more like a raspy / aarch64 / FS kind of problem. Feel free to re-open. |
I don't like necroing a thread, but I'm seeing the exact same issue, only while using TLS for our web server. I've reproduced this on our live VPS: Ubuntu 22.04.4 LTS, openssl version 3.0.10, as well as my local machine: Arch Linux with openssl version 3.3.0 (both are x86_64), zig version 0.12.0
As well as the previously described errors. Additional info: It seems to crash only when using multiple threads (with just 1 worker). After limiting the server to 1 thread and 1 worker, I could no longer cause a crash. (It also seems like fewer threads = less likely to crash? This is difficult to verify.) I will soon attempt to create a minimal repro so we can further investigate what may be causing this problem. |
This is most interesting. Please keep investigating. I've had stuff in production for months, with relatively high loads but never more threads than cores: like 16 or fewer worker threads on the server (fast enough) per app. I just recalled that I did some load tests for the SYCL app last year, so I might port those to the current version and use them as a basis for more elaborate tests (note to self). |
It's just about the end of the day Friday for me, so I'll be working on that min-repro Monday morning. We're using zig 0.12.0 on our servers just FYI, I'll keep you posted. |
Thx. I'm on my way to SYCL in Milan, so I'll remain in CET. I doubt I'll have much time next week as it's a conference week. Just fyi |
I was able to recreate it with the https example simply by changing this:
zap.start(.{
    .threads = 200,
    .workers = 1,
});
I then ran a bit of bash to flood the endpoint with curl:
i=1
while true; do
    curl -k -X 'GET' 'https://0.0.0.0:4443' || break
    echo "$i"
    i=$((i+1))
    sleep 0.05
done
First time trying, |
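The loop above sends one request at a time. Since the crash appears tied to the thread count, a variant that keeps several requests in flight may be worth trying; whether that actually makes the crash more likely has not been verified in this thread. In the sketch below, `flood_parallel` and the `CURL_BIN` override are hypothetical names introduced for illustration, and `xargs -P` is assumed to be available (it is in GNU and BSD xargs).

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; names and defaults are assumptions.

# flood_parallel URL TOTAL JOBS: issue TOTAL requests with up to JOBS running
# at once, then print how many responses came back per HTTP status code.
# curl reports status 000 when no HTTP response was received at all.
flood_parallel() {
  local url="$1"
  local total="$2"
  local jobs="$3"
  # CURL_BIN lets a different curl binary (or a stub) be substituted.
  local curl_bin="${CURL_BIN:-curl}"
  seq "$total" |
    xargs -P "$jobs" -I {} \
      "$curl_bin" -k -s -o /dev/null -w '%{http_code}\n' "$url" |
    sort | uniq -c
}
```

For example, `flood_parallel https://0.0.0.0:4443 2000 50` prints one line per status code with its count, so a run that dies partway shows up as a block of `000` entries.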
Should we make a new issue to track this problem, or perhaps reopen this one? |
Yes pls. You choose |
Created #107 |
Yo. I'm still looking into this and am going to start digging into zap and facil to see if I can't see what's going on, but I wanted to drop what I'm seeing here first so I don't lose it. I'm also going to try and set up a mini version of this in my efforts to narrow down what's happening... I'll be back here to update with more info, but until then...
(reproducible version @ https://github.com/Vemahk/zap-issue-64)
After plugging in TLS and getting the certs squared away, I immediately noticed maybe a third of the requests made to the server would stall for about 6 seconds, and there was an accompanying "UUID error". The client gets the right response after the stall, but it looks like on the server side, it fails and retries internally:
Above, the first request to / was replayed 6 seconds later. The UUID error I think is just late to the party, and the /time call snuck in.
I looked around in zap and found that the UUID error is probably coming from fio.c:3035, which looks to be something to do with actually writing to the response stream...
Some extra notes: which requests stall seems fairly random. I can spam refresh and some make it through, some stall. Eventually, I segfaulted it by continuing to spam the refresh from the browser:
Which is when I thought I should come back here and start writing things down.
The segfault happened inside libcrypto, which makes me think that it could possibly be something screwy with the version of libcrypto that I have? Geez, I hope not.
Anyway, gonna go back to investigating. I'll come back to update with anything else I find.