Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 BUG: Unexplainable behavior with x-dev-env - worker hangs after first request in GH Actions #7635

Open
theKashey opened this issue Dec 27, 2024 · 3 comments
Labels
bug Something that isn't working regression Break in existing functionality as a result of a recent change

Comments

@theKashey
Copy link

theKashey commented Dec 27, 2024

Which Cloudflare product(s) does this pertain to?

Wrangler

What versions are you using?

3.86.0 [Wrangler]

What operating system and version are you using?

Mac Sonoma 14.7.1

Please provide a link to a minimal reproduction

No response

Describe the Bug

We were trying to update wrangler for quite a while and here are some very weird problems we've experienced.

  • first of all - everything works (almost) fine on local and in the Cloudflare. The only problem is Playwright running tests in GitHubActions (via mcr.microsoft.com/playwright:v1.48.1-focal)
  • locally problem can be reproduced by changing worker code - eventually after another reload it will stop processing requests, but will accept connection (so socket will hang). Any other change will restore the functionality
  • everything works fine on 3.78.10
  • any version bump will require --x-dev-env=false, or it would not work
  • as this flag was removed at 3.86.1 any newer version dont work.
  • unless you use --log-level=debug which somehow resolve the problem (tests are passing 9 out of 10)

We run 7 workers in parallel, with the last one - "gateway" starting after all. Here are logs from the area of interest

v3.78.10

[WebServer] [1] [GATEWAY]  ⛅️ wrangler 3.78.10 (update available 3.95.0)
[WebServer] [1] [GATEWAY] --------------------------------------------------------
[WebServer] [1] [GATEWAY] Using vars defined in .dev.vars
[WebServer] [1] [GATEWAY] Your worker has access to the following bindings:
[WebServer] [1] [GATEWAY] - Services:
[WebServer] [1] [GATEWAY]   - IMPRESSIONS: thanks-impressions
[WebServer] [1] [GATEWAY]   - IMPRESSIONS-V2: thanks-impressions-v2
[WebServer] [1] [GATEWAY]   - PROMOTIONS: thanks-promotion-service
[WebServer] [1] [GATEWAY]   - SUBSCRIPTIONS: thanks-subscription-service
[WebServer] [1] [GATEWAY]   - NODE_SERVICES: thanks-node-services
[WebServer] [1] [GATEWAY]   - TEMPLATE: thanks-template-service
[WebServer] [1] [GATEWAY] [wrangler:inf] Ready on http://localhost:8081
[WebServer] [1] [GATEWAY] ⎔ Starting local server...
[WebServer] [1] [GATEWAY] @service:gateway@ : [get]/ : dev,US,IAD
[WebServer] [1] [GATEWAY] [wrangler:inf] GET / 400 Bad Request (598ms)
/// ⬆️ Playwright probing URL and ⬇️ running tests
Running 18 tests using 1 worker
[WebServer] [1] [GATEWAY] [wrangler:inf] GET /v1/widget.js 200 OK (28ms)
/// ⬆️ working

v3.86.0 and --x-dev-env=false

[WebServer] [1] [GATEWAY]  ⛅️ wrangler 3.86.0 (update available 3.99.0)
[WebServer] [1] [GATEWAY] -------------------------------------------------------
[WebServer] [1] [GATEWAY] Using vars defined in .dev.vars
[WebServer] [1] [GATEWAY] Your worker has access to the following bindings:
[WebServer] [1] [GATEWAY] - Services:
[WebServer] [1] [GATEWAY]   - IMPRESSIONS: thanks-impressions [not connected]
[WebServer] [1] [GATEWAY]   - IMPRESSIONS-V2: thanks-impressions-v2 [not connected]
[WebServer] [1] [GATEWAY]   - PROMOTIONS: thanks-promotion-service [not connected]
[WebServer] [1] [GATEWAY]   - SUBSCRIPTIONS: thanks-subscription-service [not connected]
[WebServer] [1] [GATEWAY]   - NODE_SERVICES: thanks-node-services [not connected]
[WebServer] [1] [GATEWAY]   - TEMPLATE: thanks-template-service [not connected]
[WebServer] [1] [GATEWAY] 
[WebServer] [1] [GATEWAY] Service bindings & durable object bindings connect to other `wrangler dev` processes running locally, with their connection status indicated by [connected] or [not connected]. For more details, refer to https://developers.cloudflare.com/workers/runtime-apis/bindings/service-bindings/#local-development
[WebServer] [1] [GATEWAY] [wrangler:inf] Ready on http://localhost:8081
[WebServer] [1] [WebServer] [GATEWAY] ⎔ Starting local server...
[WebServer] [1] [GATEWAY] [wrangler:inf] GET / 400 Bad Request (497ms)
/// ⬆️ Playwright probing URL and ⬇️ running tests
Running 27 tests using 1 worker
[WebServer] [1] [GATEWAY] ⎔ Reloading local server...
[WebServer] [1] [GATEWAY] [wrangler:inf] GET /v1/widget.js 200 OK (24ms)
/// ⬆️ working

3.86.1 (without -x-dev-env)

[WebServer] [1] [GATEWAY]  ⛅️ wrangler 3.86.1 (update available 3.99.0)
[WebServer] [1] [GATEWAY] -------------------------------------------------------
[WebServer] [1] [GATEWAY] Your worker has access to the following bindings:
[WebServer] [1] [GATEWAY] - Services:
[WebServer] [1] [GATEWAY]   - IMPRESSIONS: thanks-impressions [not connected]
[WebServer] [1] [GATEWAY]   - IMPRESSIONS-V2: thanks-impressions-v2 [not connected]
[WebServer] [1] [GATEWAY]   - PROMOTIONS: thanks-promotion-service [not connected]
[WebServer] [1] [GATEWAY]   - SUBSCRIPTIONS: thanks-subscription-service [not connected]
[WebServer] [1] [GATEWAY]   - NODE_SERVICES: thanks-node-services [not connected]
[WebServer] [1] [GATEWAY]   - TEMPLATE: thanks-template-service [not connected]
[WebServer] [1] [GATEWAY] 
[WebServer] [1] [GATEWAY] Service bindings & durable object bindings connect to other `wrangler dev` processes running locally, with their connection status indicated by [connected] or [not connected]. For more details, refer to https://developers.cloudflare.com/workers/runtime-apis/bindings/service-bindings/#local-development
[WebServer] [1] [GATEWAY] 
[WebServer] [1] [GATEWAY] [wrangler:inf] Ready on http://localhost:8081
[WebServer] [1] [GATEWAY] Using vars defined in .dev.vars
[WebServer] [1] [GATEWAY] - Services:
[WebServer] [1] [GATEWAY]   - IMPRESSIONS: thanks-impressions [connected]
[WebServer] [1] [GATEWAY]   - IMPRESSIONS-V2: thanks-impressions-v2 [connected]
[WebServer] [1] [GATEWAY]   - PROMOTIONS: thanks-promotion-service [connected]
[WebServer] [1] [GATEWAY]   - SUBSCRIPTIONS: thanks-subscription-service [connected]
[WebServer] [1] [GATEWAY]   - NODE_SERVICES: thanks-node-services [connected]
[WebServer] [1] [GATEWAY]   - TEMPLATE: thanks-template-service [connected]
[WebServer] [1] [GATEWAY] ⎔ Starting local server...
[WebServer] [1] [GATEWAY] [wrangler:inf] GET / 400 Bad Request (502ms)
/// ⬆️ Playwright probing URL and ⬇️ running tests
Running 27 tests using 1 worker
/// 😵 hangs

It does not matter which URL to try. It will handle the first request, but none other.
"hangs" mean that Playwright can establish a connection to get a resource (widget.js) but will never get anything.

v3.99.0

totally equal to 3.86.1

v3.99.0 and --log-level=debug

output is too verbose, but I found an interesting difference between adding debug for everybody or just gateway

  • everybody
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000003, method: 'Network.enable', result: {} }
[WebServer] [1] [SERVICES] [@theway/template-service] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000011}
[WebServer] [1] [SERVICES] [@theway/template-service] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000011, result: { id: '7296e37d6767447e' } }
[WebServer] [1] [SERVICES] [@theway/impressions-service-v2] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000011}
[WebServer] [1] [SERVICES] [@theway/impressions-service-v2] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000011, result: { id: '66815b1649719b4f' } }
/// ⬆️ Playwright probing URL and ⬇️ running tests
Running 27 tests using 1 worker
[WebServer] [1] [SERVICES] [@theway/node-service] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000011}
[WebServer] [1] [SERVICES] [@theway/node-service] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000011, result: { id: '85b82413505896e2' } }
[WebServer] [1] [GATEWAY] @service:gateway@ : [get]/v1/widget.js : dev,US,IAD 
⬆️⬆️ its working! ⬆️⬆️
  • just gateway
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000003, method: 'Network.enable', result: {} }
/// ⬆️ Playwright probing URL and ⬇️ running tests
Running 27 tests using 1 worker
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (1ms) 
                                                              ⬆️ What is that? ⬆️
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000004}
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000004, result: { id: 'd413491f66c9cc25' } }
⬆️ "no reaction" ⬆️
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000005} 
⬆️ that is already another request ⬆️, there would be "no reaction" as well

Adding logs makes test just a little slower, so we are ok to processed with this solution, but I dont really understand what can be a problem here

My main consern here is a call to /cdn-cgi/ProxyWorker/pause
In a working scenario the last message related to ProxyWorker is play

[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/play 204 No Content (14ms)

We had a long history with wrangler flakes in CI, so it runs quite "reliable"

  • first run a "dummy" worker, waits for healthcheck
  • later starts all leaf services, waits for healthcheck
  • then starts the gateway, waits for healthcheck

Every service will eventually produce the /cdn-cgi/ProxyWorker/pause message, and it's somehow a matter of luck of the last message is /cdn-cgi/ProxyWorker/play

In the "good scenario" this is what I can see in the logs right after gateway start

[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] Ready on http://localhost:8081
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (6ms)
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (1ms)
[WebServer] [1] [WebServer] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (1ms)
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (2ms)
[WebServer] [1] [SERVICES] [@theway/promotion-service] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000011}
[WebServer] [1] [SERVICES] [@theway/promotion-service] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000011, result: { id: '1107c602f204561f' } }
[WebServer] [1] [GATEWAY] RemoteRuntimeController teardown beginning...
[WebServer] [1] [GATEWAY] RemoteRuntimeController teardown complete
[WebServer] [1] [GATEWAY] ⎔ Starting local server...
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (1ms)
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] handleProxyControllerIncomingMessage {"type":"reloadStart"}
[WebServer] [1] [GATEWAY] ✘ [ERROR] workerd/server/server.c++:4017: info: Inspector is listening
[WebServer] [1] [GATEWAY] 
[WebServer] [1] [GATEWAY] 
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (0ms)
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/play 204 No Content (14ms)

A log of pause requests, but the very last one is play.

Please provide any relevant error logs

No response

@theKashey theKashey added the bug Something that isn't working label Dec 27, 2024
@github-project-automation github-project-automation bot moved this to Untriaged in workers-sdk Dec 27, 2024
@irvinebroque
Copy link
Contributor

The only problem is Playwright running tests in GitHubActions

Can you describe more about your setup here? Are you running wrangler dev and then running real-browser integration tests with Playwright, telling Playwright to access the local dev server that wrangler dev exposes? (localhost)

Alternatively, it seems like you could:

  1. Publish a new version of your Worker (without deploying it to production)
  2. Run your integration tests against a Preview URL

...such that what you're testing is actually running on Cloudflare, using real resources, closest to what you'll see in production, etc.?

@theKashey
Copy link
Author

theKashey commented Dec 28, 2024

Are you running wrangler dev and then running real-browser integration tests with Playwright, telling Playwright to access the local dev server that wrangler dev exposes? (localhost)

Correct. We are spinning up a local stack and running top-level tests against it. It's better to say - we work E2E tests in a way we work during the local dev.
And it works without any issues on localhost, and with so many different issues at GitHubActions. Starting with IPv6 not really working there, so we had to pin everything to 127.0.0.1 in order for connections to be resolved (see cypress-io/github-action#803) and ending as "freezing" reported here.

Alternatively, it seems like you could:

While it's possible to deploy all workers to preview, its unclear:

  • how to handle multiple PRs. We use different environments (local, staging, prod), but we dont use them for PRs as a few ongoing PRs overriding a single shared environment is a thing.
  • how to connect multiple workers in the same formation. Likely I can autogen a part of wrangler.toml to handle this.
  • and cleanup such formation once CI is done

There are more questions than answers the current documentation or examples provide.

I would be more than happy to use "the real CloudFlare" for the "real testing", and likely today I already can do that for the merge queue (deploy to staging, test). But I need to better understand some nuances first.

PS: I've also tried #7251. Unfortunately it does not work as well:

  • the command hangs before starting server
  • debug logs shows that multiple services are entering "teardown" right after first worker init
  • the last message after all workers init is that Proxy cannot reuse disposed service

I do wonder if this feature is yet unstable, or there is something very wrong in my configuration.

@emily-shen emily-shen added the regression Break in existing functionality as a result of a recent change label Jan 2, 2025
@emily-shen
Copy link
Contributor

Hiya, we're looking into the main issue with x-dev-env, but also very interested in the issues you're having with #7251. If you have the time, would you be able to create a separate issue with a minimal repro that demonstrates the issues you're having? we can double check your setup, and its definitely a feature we'd like to be stable...! 🫠

thank you! 🧡

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that isn't working regression Break in existing functionality as a result of a recent change
Projects
Status: Untriaged
Development

No branches or pull requests

3 participants