Page.goto: net::ERR_INVALID_ARGUMENT #327

Open
junior-g opened this issue Nov 12, 2024 · 3 comments

junior-g commented Nov 12, 2024

I am getting the following error with my basic Scrapy + Playwright spider:

Request: <GET https://www.croma.com/robots.txt> (resource type: document)
2024-11-12 17:41:00 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www.croma.com/robots.txt>: Page.goto: net::ERR_INVALID_ARGUMENT at https://www.croma.com/robots.txt
Call log:
navigating to "https://www.croma.com/robots.txt", waiting until "load"
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 2013, in _inlineCallbacks
    result = context.run(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1253, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 379, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 432, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 461, in _download_request_with_page
    response, download = await self._get_response_and_download(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 563, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/async_api/_generated.py", line 8818, in goto
    await self._impl_obj.goto(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_page.py", line 524, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: net::ERR_INVALID_ARGUMENT at https://www.croma.com/robots.txt
Call log:
navigating to "https://www.croma.com/robots.txt", waiting until "load"

Request  ------  <GET https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880>
Request Headers:  <CaseInsensitiveDict: {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7', 'Accept-Language': 'en-US,en;q=0.9', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36', 'Upgrade-Insecure-Requests': '1', 'Accept-Encoding': 'gzip, deflate, br, zstd', 'Connection': 'keep-alive', 'Host': 'www.croma.com', 'Sec-Ch-Ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"', 'Sec-Ch-Ua-Mobile': '?0', 'Sec-Ch-Ua-Platform': '"macOS"', 'Sec-Fetch-Dest': 'document', 'Sec-Fetch-User': '?1'}>
2024-11-12 17:41:00 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 2 (2 for all contexts)
2024-11-12 17:41:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880> (resource type: document)
2024-11-12 17:41:00 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 2013, in _inlineCallbacks
    result = context.run(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1253, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 379, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 432, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 461, in _download_request_with_page
    response, download = await self._get_response_and_download(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 563, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/async_api/_generated.py", line 8818, in goto
    await self._impl_obj.goto(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_page.py", line 524, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: net::ERR_INVALID_ARGUMENT at https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880
Call log:
navigating to "https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880", waiting until "load"
# CromaSpider
import scrapy


class CromaSpider(scrapy.Spider):
    name = "croma"
    allowed_domains = ["www.croma.com"]
    start_urls = ["https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880"]

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        print("Response parsing")
        print(
            response.xpath(
                "/html/body/main/div/div[3]/div/div[1]/div[2]/div[1]/div/div/div/div[3]/div/ul/li"
            ).get()
        )
# middleware.py
request.meta["playwright"] = True
request.meta["playwright_include_page"] = True

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy_2_crawl_service.middlewares.Scrapy2CrawlServiceDownloaderMiddleware": 543,
}
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
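scrapy-playwright also requires the asyncio Twisted reactor; assuming the project follows the tutorial linked below, settings.py would also contain:

# required by scrapy-playwright; it does not work with the default Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"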

I am following this - https://scrapeops.io/python-scrapy-playbook/scrapy-playwright/

Why am I getting this error?

(edited to adjust formatting)

elacuesta (Member) commented Nov 12, 2024

I'm sorry, I cannot reproduce:

# test.py
import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
        },
        "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
        "LOG_LEVEL": "INFO",
    }

    def start_requests(self):
        yield scrapy.Request(
            url="URL",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.screenshot(path="croma.png")
        await page.close()
        print("Response parsing")
        print(response.xpath("//h1/text()").get())
$ scrapy runspider test.py
...
2024-11-12 16:50:44 [scrapy.core.engine] INFO: Spider opened
2024-11-12 16:50:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-11-12 16:50:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-11-12 16:50:44 [scrapy-playwright] INFO: Starting download handler
2024-11-12 16:50:44 [scrapy-playwright] INFO: Starting download handler
2024-11-12 16:50:49 [scrapy-playwright] INFO: Launching browser chromium
2024-11-12 16:50:49 [scrapy-playwright] INFO: Browser chromium launched
Response parsing
iFFALCON Q73 126 cm (50 inch) 4K Ultra HD QLED Google TV with Dolby Audio (2023 model) 
2024-11-12 16:50:52 [scrapy.core.engine] INFO: Closing spider (finished)
...

Note that I had to set a custom User-Agent, otherwise I was getting 403 status responses.

Versions used:

$ playwright --version       
Version 1.48.0


$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.42


$ scrapy version -v                   
Scrapy       : 2.11.2
lxml         : 5.2.2.0
libxml2      : 2.12.6
cssselect    : 1.2.0
parsel       : 1.9.1
w3lib        : 2.2.1
Twisted      : 24.3.0
Python       : 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
pyOpenSSL    : 24.2.1 (OpenSSL 3.3.1 4 Jun 2024)
cryptography : 43.0.0
Platform     : Linux-6.5.0-45-generic-x86_64-with-glibc2.35

junior-g (Author) commented

@elacuesta thanks for the quick reply. Yes, it works when I run the project separately, but there is one more issue: when I set

"PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": True,
        },

it prints None instead of the title. Why is that?

elacuesta (Member) commented Nov 14, 2024

I see. I suppose the site could be detecting and blocking headless browsers; I'm seeing the same behavior with standalone Playwright:

import asyncio

from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(URL)
        await page.screenshot(path="page.png")
        print(await page.locator("//h1").text_content())
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

This prints iFFALCON Q73 126 cm (50 inch) 4K Ultra HD QLED Google TV with Dolby Audio (2023 model); however, with headless=True I get a 403 response with "Access Denied".
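For anyone hitting the same block, an untested sketch of a common first attempt is to override the headless browser's default user agent (which otherwise advertises HeadlessChrome); it may well not be enough for this site:

# untested sketch: override the default headless user agent, which sometimes
# avoids trivial headless detection; it may not be enough for this site
import asyncio

from playwright.async_api import async_playwright

UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
)

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        # new_page accepts context options such as user_agent
        page = await browser.new_page(user_agent=UA)
        await page.goto(URL)  # same placeholder URL as above
        print(await page.locator("//h1").text_content())
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())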
