
How do I change the browser to Firefox? #772

Open
chung1912 opened this issue Oct 28, 2024 · 8 comments

Comments

@chung1912

How do I change the browser to Firefox?

@SwapnilSonker
Contributor

@chung1912 What kind of issue are you having, specifically? I'd like to know more about it.

@SwapnilSonker
Contributor

@VinciGit00 Are there any details about this issue? If so, I'd like to work on it.

@VinciGit00
Collaborator

@SwapnilSonker Please add just Firefox there, via the docloader.

@SwapnilSonker
Contributor

PR: #848
@VinciGit00, have a look at it and let me know if anything else needs to be done.

@PeriniM
Collaborator

PeriniM commented Jan 12, 2025

Hey @chung1912, from the new version v1.36 you can change it with:

graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "temperature": 0,
        "model_tokens": 4096
    },
    "loader_kwargs": {
        "backend": "selenium",
        "browser_name": "firefox"
    },
    "verbose": True,
    "headless": False
}
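
For reference, a minimal sketch of how that config might be wired into a graph (assuming SmartScraperGraph as used later in this thread; the prompt and source values below are placeholders):

from scrapegraphai.graphs import SmartScraperGraph

# graph_config is the dictionary shown above; prompt and source are placeholder values
smart_scraper_graph = SmartScraperGraph(
    prompt="List the main headings on the page",
    source="https://example.com",
    config=graph_config,
)
result = smart_scraper_graph.run()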

PeriniM closed this as completed Jan 12, 2025
@chung1912
Author

Hey @PeriniM, I tried the config above (v1.36, with "backend": "selenium" and "browser_name": "firefox"), but I get:

AttributeError: 'ChromiumLoader' object has no attribute 'ascrape_selenium'
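
For context, this error is consistent with the getattr-based dispatch discussed further below: the loader looks up a method named ascrape_<backend>, and for backend "selenium" there is no ascrape_selenium method. A simplified, runnable sketch of the pattern (not the real ChromiumLoader):

# simplified stand-in for the loader's backend dispatch
class LoaderSketch:
    def __init__(self, backend):
        self.backend = backend

    async def ascrape_playwright(self, url, browser_name="chromium"):
        return "<html>...</html>"

    def pick_scraper(self):
        # mirrors getattr(self, f"ascrape_{self.backend}") in lazy_load
        return getattr(self, f"ascrape_{self.backend}")

LoaderSketch("playwright").pick_scraper()  # works
LoaderSketch("selenium").pick_scraper()    # AttributeError: no attribute 'ascrape_selenium'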

PeriniM reopened this Jan 12, 2025
@SwapnilSonker
Contributor

@chung1912 Looking into it; I'll most likely fix the bug.

@mark-antal-csizmadia

mark-antal-csizmadia commented Jan 20, 2025

Hey! Great package, good experience using it.

I think the issue is that in ChromiumLoader's lazy_load and alazy_load methods, when getattr is used to look up the scraping function for a given backend (scraping_fn), that scraping_fn is always called with only the url, and the browser_name parameter is never passed from the ChromiumLoader's self namespace. You can see it here.

So I guess, in the code below:

def lazy_load(self) -> Iterator[Document]:
    """
    Lazily load text content from the provided URLs.

    This method yields Documents one at a time as they're scraped,
    instead of waiting to scrape all URLs before returning.

    Yields:
        Document: The scraped content encapsulated within a Document object.
    """
    scraping_fn = (
        self.ascrape_with_js_support
        if self.requires_js_support
        else getattr(self, f"ascrape_{self.backend}")
    )

    for url in self.urls:
        html_content = asyncio.run(scraping_fn(url))
        metadata = {"source": url}
        yield Document(page_content=html_content, metadata=metadata)

the scraping_fn = ( self.ascrape_with_js_support if self.requires_js_support else getattr(self, f"ascrape_{self.backend}") ) line should be changed so that, for some of the scraping functions such as ascrape_playwright (but not ascrape_undetected_chromedriver), browser_name is also passed (so it can, for instance, be set to firefox), as sketched below.
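
A minimal sketch of that kind of change (a hypothetical helper, not the library's code; it assumes the backend-specific scraper is a coroutine that accepts browser_name as a keyword, like ascrape_playwright):

import inspect
from functools import partial


def bind_browser_name(scraping_fn, browser_name):
    """Hypothetical helper: forward browser_name only to scrapers whose signature accepts it."""
    if "browser_name" in inspect.signature(scraping_fn).parameters:
        return partial(scraping_fn, browser_name=browser_name)
    return scraping_fn

# inside lazy_load, the existing dispatch line could then be wrapped as:
# scraping_fn = bind_browser_name(scraping_fn, self.browser_name)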

Great work, thanks a lot!

Edit

A possible temporary solution while the fix is on the way is to patch the scrapegraphai.docloaders.chromium.ChromiumLoader.lazy_load function yourself. This function is the intermediary between the backend-specific scraper functions and your code. Looking at the code, the firefox browser is meant to work with both the selenium and playwright backends; I'll use playwright in my code below. So, for instance, if you add a new file called my_patch.py with the code below:

""" This is a patch for the ScrapeGraphAI library. As soon as a fix is in the main library, this file can be removed. 
See more at: https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/772
Particularly, note the comment: https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/772#issuecomment-2603320393
"""
import asyncio
from functools import partial
from typing import Iterator
from langchain_core.documents import Document


def lazy_load_patched(self) -> Iterator[Document]:
    """
    Patches https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/31087937bef20eadcb83e28688077eff13ed2780/scrapegraphai/docloaders/chromium.py#L438
    so that self.browser_name is passed to ascrape_playwright.
    See discussion at https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/772#issuecomment-2603320393.
    """
    scraping_fn = (
        self.ascrape_with_js_support
        if self.requires_js_support
        else getattr(self, f"ascrape_{self.backend}")
    )
    
    # the patch: the partial function is used to pass the browser_name argument
    # to ascrape_playwright
    if self.backend == "playwright":
        scraping_fn = partial(scraping_fn, browser_name=self.browser_name)
    # end of patch

    for url in self.urls:
        html_content = asyncio.run(scraping_fn(url))
        metadata = {"source": url}
        yield Document(page_content=html_content, metadata=metadata)

and then, whenever you call the scraper (for instance in a main.py script), you patch the mentioned method as shown below:

from unittest.mock import patch

from my_patch import lazy_load_patched
from scrapegraphai.graphs import SmartScraperGraph

# your code: define prompt, source, schema, etc.

graph_config = {
    # your config such as llm name, temperature, headless, etc.
    "loader_kwargs": {
        "backend": "playwright",
        "browser_name": "firefox"
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt=prompt,
    source=source,
    config=graph_config,  # note: pass the graph_config defined above
    schema=schema
)

with patch('scrapegraphai.docloaders.chromium.ChromiumLoader.lazy_load', new=lazy_load_patched):
    some_data = smart_scraper_graph.run()

This should patch the original code so that the firefox browser can be used with the playwright backend. Feel free to modify the code to accommodate your needs.

I hope this helps! Happy hacking!
