Page.goto: net::ERR_INVALID_ARGUMENT #327

Open
junior-g opened this issue Nov 12, 2024 · 3 comments

junior-g commented Nov 12, 2024

I am getting the following error with my basic Scrapy + Playwright spider:

Request: <GET https://www.croma.com/robots.txt> (resource type: document)
2024-11-12 17:41:00 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www.croma.com/robots.txt>: Page.goto: net::ERR_INVALID_ARGUMENT at https://www.croma.com/robots.txt
Call log:
navigating to "https://www.croma.com/robots.txt", waiting until "load"
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 2013, in _inlineCallbacks
    result = context.run(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1253, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 379, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 432, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 461, in _download_request_with_page
    response, download = await self._get_response_and_download(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 563, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/async_api/_generated.py", line 8818, in goto
    await self._impl_obj.goto(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_page.py", line 524, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: net::ERR_INVALID_ARGUMENT at https://www.croma.com/robots.txt
Call log:
navigating to "https://www.croma.com/robots.txt", waiting until "load"

Request  ------  <GET https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880>
Request Headers:  <CaseInsensitiveDict: {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7', 'Accept-Language': 'en-US,en;q=0.9', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36', 'Upgrade-Insecure-Requests': '1', 'Accept-Encoding': 'gzip, deflate, br, zstd', 'Connection': 'keep-alive', 'Host': 'www.croma.com', 'Sec-Ch-Ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"', 'Sec-Ch-Ua-Mobile': '?0', 'Sec-Ch-Ua-Platform': '"macOS"', 'Sec-Fetch-Dest': 'document', 'Sec-Fetch-User': '?1'}>
2024-11-12 17:41:00 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 2 (2 for all contexts)
2024-11-12 17:41:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880> (resource type: document)
2024-11-12 17:41:00 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 2013, in _inlineCallbacks
    result = context.run(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1253, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 379, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 432, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 461, in _download_request_with_page
    response, download = await self._get_response_and_download(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 563, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/async_api/_generated.py", line 8818, in goto
    await self._impl_obj.goto(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_page.py", line 524, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: net::ERR_INVALID_ARGUMENT at https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880
Call log:
navigating to "https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880", waiting until "load"
# CromaSpider
import scrapy


class CromaSpider(scrapy.Spider):
    name = "croma"
    allowed_domains = ["www.croma.com"]
    start_urls = ["https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880"]

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        print("Response parsing")
        print(
            response.xpath(
                "/html/body/main/div/div[3]/div/div[1]/div[2]/div[1]/div/div/div/div[3]/div/ul/li"
            ).get()
        )
# middleware.py
request.meta["playwright"] = True
request.meta["playwright_include_page"] = True

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy_2_crawl_service.middlewares.Scrapy2CrawlServiceDownloaderMiddleware": 543,
}
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
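scrapy-playwright also requires the asyncio Twisted reactor; assuming the project follows the tutorial linked below, settings.py would also contain:

# required by scrapy-playwright; it does not work with the default Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"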

I am following this - https://scrapeops.io/python-scrapy-playbook/scrapy-playwright/

Why am I getting this error?

(edited to adjust formatting)

elacuesta (Member) commented Nov 12, 2024

I'm sorry, I cannot reproduce:

# test.py
import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
        },
        "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
        "LOG_LEVEL": "INFO",
    }

    def start_requests(self):
        yield scrapy.Request(
            url="URL",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.screenshot(path="croma.png")
        await page.close()
        print("Response parsing")
        print(response.xpath("//h1/text()").get())
$ scrapy runspider test.py
...
2024-11-12 16:50:44 [scrapy.core.engine] INFO: Spider opened
2024-11-12 16:50:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-11-12 16:50:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-11-12 16:50:44 [scrapy-playwright] INFO: Starting download handler
2024-11-12 16:50:44 [scrapy-playwright] INFO: Starting download handler
2024-11-12 16:50:49 [scrapy-playwright] INFO: Launching browser chromium
2024-11-12 16:50:49 [scrapy-playwright] INFO: Browser chromium launched
Response parsing
iFFALCON Q73 126 cm (50 inch) 4K Ultra HD QLED Google TV with Dolby Audio (2023 model) 
2024-11-12 16:50:52 [scrapy.core.engine] INFO: Closing spider (finished)
...

Note that I had to set a custom User-Agent, otherwise I was getting 403 status responses.

Versions used:

$ playwright --version       
Version 1.48.0


$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.42


$ scrapy version -v                   
Scrapy       : 2.11.2
lxml         : 5.2.2.0
libxml2      : 2.12.6
cssselect    : 1.2.0
parsel       : 1.9.1
w3lib        : 2.2.1
Twisted      : 24.3.0
Python       : 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
pyOpenSSL    : 24.2.1 (OpenSSL 3.3.1 4 Jun 2024)
cryptography : 43.0.0
Platform     : Linux-6.5.0-45-generic-x86_64-with-glibc2.35

junior-g (Author) commented

@elacuesta thanks for the quick reply. Yes, it works when I run the project separately, but there is one more issue: when I set

"PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": True,
        },

it prints None instead of the title. Why is that?

elacuesta (Member) commented Nov 14, 2024

I see. I suppose the site could be detecting and blocking headless browsers; I'm seeing the same behavior with standalone Playwright:

import asyncio

from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(URL)
        await page.screenshot(path="page.png")
        print(await page.locator("//h1").text_content())
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

This prints iFFALCON Q73 126 cm (50 inch) 4K Ultra HD QLED Google TV with Dolby Audio (2023 model); however, with headless=True I get a 403 response with "Access Denied".
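For anyone hitting the same block, an untested sketch of a common first attempt is to override the headless browser's default user agent (which otherwise advertises HeadlessChrome); it may well not be enough for this site:

# untested sketch: override the default headless user agent, which sometimes
# avoids trivial headless detection; it may not be enough for this site
import asyncio

from playwright.async_api import async_playwright

UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
)

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        # new_page accepts context options such as user_agent
        page = await browser.new_page(user_agent=UA)
        await page.goto(URL)  # same placeholder URL as above
        print(await page.locator("//h1").text_content())
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())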
