RequestHandler Timed Out But Actually Browser Error #2785

Open
1 task
danielcrabtree opened this issue Dec 28, 2024 · 0 comments
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@danielcrabtree

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/browser (BrowserCrawler)

Issue description

I've been sporadically getting the error message "requestHandler timed out after 130 seconds" on some crawls.

It turns out that my requestHandler is not timing out at all. In fact, my requestHandler code is not even being called.

I'm using PlaywrightCrawler. The cause is that BrowserCrawler's _runRequestHandler is delegated to BasicCrawler's _runTaskFunction, which adds a timeout to the _runRequestHandler call. Normally that call might be the user's requestHandler, but in the case of PlaywrightCrawler it's the wrapping BrowserCrawler's _runRequestHandler.

So in my case, something is periodically going wrong with the creation of the page in the browser pool, or possibly with the navigation or cookies, but I had no way of knowing that from the logs. The requestHandler timeout should be reserved for user-level code. It should not also cover problems in the browser pool, because reporting those as a requestHandler timeout is confusing.
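To illustrate why the message is misleading, here is a simplified model of the situation. It is not Crawlee's actual code: `withTimeout`, `wrappedHandler`, `hangingBrowserSetup`, and `runTask` are hypothetical names, and the timeout is in milliseconds to keep the example quick. A single timeout wraps both the browser setup and the user's handler, so a hang during setup is reported as a requestHandler timeout even though the user's handler never ran.

```typescript
// Simplified model only; every name here is hypothetical, not a Crawlee API.

// Rejects with `message` if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: () => Promise<T>, ms: number, message: string): Promise<T> {
    return new Promise<T>((resolve, reject) => {
        const timer = setTimeout(() => reject(new Error(message)), ms);
        work().then(
            (value) => { clearTimeout(timer); resolve(value); },
            (err) => { clearTimeout(timer); reject(err); },
        );
    });
}

let userHandlerCalled = false;

// Stands in for the user's requestHandler.
async function userRequestHandler(): Promise<void> {
    userHandlerCalled = true;
}

// Stands in for browser-level work (opening a page, navigating) that hangs.
function hangingBrowserSetup(): Promise<void> {
    return new Promise<void>(() => {}); // never settles
}

// Stands in for the wrapping handler: browser setup first, then user code.
async function wrappedHandler(): Promise<void> {
    await hangingBrowserSetup();
    await userRequestHandler();
}

// One timeout around the whole wrapper, labelled as a requestHandler timeout.
async function runTask(timeoutMs: number): Promise<string> {
    try {
        await withTimeout(wrappedHandler, timeoutMs, `requestHandler timed out after ${timeoutMs} ms`);
        return 'ok';
    } catch (err) {
        return (err as Error).message;
    }
}
```

Calling `runTask(50)` in this model resolves to the requestHandler timeout message while `userHandlerCalled` stays `false`, which matches what I see in my logs.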

As a separate issue, periodically (and seemingly at random) one of the calls in BrowserCrawler's _runRequestHandler that runs before the user's requestHandler hangs indefinitely. I suspect it is the call that opens the Playwright page, but I haven't been able to verify that. I think page creation and the subsequent page loading activities need their own timeout. It could be much shorter than the requestHandler timeout, as I expect these calls are usually relatively fast. If one of these timeouts occurs, the crawler should reset the browser, abandon the session, or do something similar before retrying the request.
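A rough sketch of what I have in mind, under stated assumptions: `PhaseTimeoutError`, `runPhase`, `processRequest`, and the `retireBrowser` hook are all hypothetical names for illustration, not existing Crawlee APIs. Each phase gets its own timeout and its own error label, and a timeout during page setup triggers a cleanup hook before the error propagates to whatever retries the request.

```typescript
// Hypothetical sketch; none of these names are Crawlee APIs.

class PhaseTimeoutError extends Error {
    readonly phase: string;
    constructor(phase: string, ms: number) {
        super(`${phase} timed out after ${ms} ms`);
        this.name = 'PhaseTimeoutError';
        this.phase = phase;
    }
}

// Runs `work` under a timeout whose error names the phase that stalled.
function runPhase<T>(phase: string, ms: number, work: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
        const timer = setTimeout(() => reject(new PhaseTimeoutError(phase, ms)), ms);
        work().then(
            (value) => { clearTimeout(timer); resolve(value); },
            (err) => { clearTimeout(timer); reject(err); },
        );
    });
}

interface Phases<P> {
    openPage: () => Promise<P>;              // browser-level setup
    userHandler: (page: P) => Promise<void>; // the user's requestHandler
    retireBrowser: () => Promise<void>;      // assumed cleanup hook
}

async function processRequest<P>(
    phases: Phases<P>,
    pageTimeoutMs: number,
    handlerTimeoutMs: number,
): Promise<void> {
    let page: P;
    try {
        // Short, separate timeout for the browser-level phase.
        page = await runPhase('browser page setup', pageTimeoutMs, phases.openPage);
    } catch (err) {
        // Only a setup timeout resets the browser; other errors pass through.
        if (err instanceof PhaseTimeoutError) await phases.retireBrowser();
        throw err;
    }
    // The requestHandler timeout now covers user-level code only.
    await runPhase('requestHandler', handlerTimeoutMs, () => phases.userHandler(page));
}
```

With this shape, a hang while opening the page would surface as "browser page setup timed out" rather than as a requestHandler timeout, and the cleanup hook is the place where the browser or session could be discarded before the retry.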

Code sample

No response

Package version

3.12.1

Node.js version

20

Operating system

No response

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@danielcrabtree danielcrabtree added the bug Something isn't working. label Dec 28, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Dec 28, 2024