Selenium automation page load extremely slow for Premium posts #24

Open
reidben opened this issue Dec 12, 2024 · 1 comment

reidben commented Dec 12, 2024

Nice work, thanks for this library. I'm hitting a performance issue: pages take a very long time to load in Selenium / Edge, though they do load eventually. Here are my stats after nearly an hour of running:

4%|███████▉ | 16/387 [43:24<18:02:09, 175.01s/it]

Line 434 (self.driver.get(url)) is where the time goes. The Edge automation window just shows a spinning wheel on the tab, and the page apparently finishes loading eventually.

Do you think Substack has implemented anti-bot measures? I've tested my network connection, which is very fast, and the pages load fine in BeautifulSoup. Apologies if someone has raised this already.
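One way to narrow this down is to time the Selenium call and a plain HTTP fetch of the same URL side by side. The sketch below is only a timing helper; the names timed and slowest are hypothetical and not part of this project, and the commented usage assumes a driver object and the requests library are available.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so a timeout still shows up.
        timings[label] = time.perf_counter() - start


def slowest(timings: Dict[str, float]) -> str:
    """Return the label with the largest recorded duration."""
    return max(timings, key=timings.get)


# Hypothetical usage (driver, requests and url are assumptions):
#   timings = {}
#   with timed("plain_http", timings):
#       requests.get(url)
#   with timed("selenium", timings):
#       driver.get(url)
#   print(timings, "slowest:", slowest(timings))
```

If the plain fetch is fast and only driver.get is slow, the browser or driver setup is a more likely suspect than the network.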

@jontutcher

Hey @reidben - are you running a MacBook with Apple silicon? This might be a slowdown caused by using an Intel/x64 version of Edge.
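A quick way to check which architecture the Python interpreter itself reports is the standard platform module. This is a rough sketch: the function name describe_arch and its labels are made up for illustration, and it only describes the interpreter, not the browser or driver binary.

```python
import platform


def describe_arch(system: str, machine: str) -> str:
    """Classify the platform strings reported by the Python interpreter."""
    if system == "Darwin" and machine == "arm64":
        # Native Apple silicon interpreter.
        return "apple-silicon-native"
    if system == "Darwin" and machine == "x86_64":
        # Either an Intel Mac, or an x64 interpreter running under Rosetta.
        return "mac-intel-or-rosetta"
    return "other"


if __name__ == "__main__":
    print(describe_arch(platform.system(), platform.machine()))
```

On an Apple silicon machine, a result of mac-intel-or-rosetta would suggest that at least part of the toolchain is running as translated x64 code.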

I've had a bit of success switching it out for the Chrome (arm64) driver in substack_scraper.py. To do this, you need Chrome installed and the corresponding chromedriver binary in /usr/local/bin (download from: https://developer.chrome.com/docs/chromedriver/downloads).

Just in case it helps - I only started using this project yesterday and am not familiar enough to offer this up as a proper solution!

-        options = EdgeOptions()
-        if headless:
-            options.add_argument("--headless")
-        if edge_path:
-            options.binary_location = edge_path
-        if user_agent:
-            options.add_argument(f'user-agent={user_agent}')  # Pass this if running headless and blocked by captcha
-
-        if edge_driver_path:
-            service = Service(executable_path=edge_driver_path)
-        else:
-            service = Service(EdgeChromiumDriverManager().install())
+        # options = EdgeOptions()
+        # if headless:
+        #     options.add_argument("--headless")
+        # if edge_path:
+        #     options.binary_location = edge_path
+        # if user_agent:
+        #     options.add_argument(f'user-agent={user_agent}')  # Pass this if running headless and blocked by captcha
+        #
+        # if edge_driver_path:
+        #     service = Service(executable_path=edge_driver_path)
+        # else:
+        #     service = Service(EdgeChromiumDriverManager().install())
+        #
+        # self.driver = webdriver.Edge(service=service, options=options)
+
+        options = webdriver.ChromeOptions()
+        self.driver = webdriver.Chrome(options=options)
 
-        self.driver = webdriver.Edge(service=service, options=options)
         self.login()
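
The diff above drops the headless and user_agent handling that the Edge version had. Below is a minimal sketch of the same switch that keeps both, assuming chromedriver is on the PATH as described above (or Selenium 4.6 or later, which can locate the driver on its own). The helper names chrome_arguments and make_chrome_driver are hypothetical and not part of substack_scraper.py.

```python
from typing import List, Optional


def chrome_arguments(headless: bool = False,
                     user_agent: Optional[str] = None) -> List[str]:
    """Build the Chrome command-line arguments, mirroring the Edge options."""
    args: List[str] = []
    if headless:
        args.append("--headless")
    if user_agent:
        # Pass this if running headless and blocked by captcha.
        args.append(f"--user-agent={user_agent}")
    return args


def make_chrome_driver(headless: bool = False,
                       user_agent: Optional[str] = None):
    """Create a Chrome WebDriver; Selenium is imported lazily on purpose."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless, user_agent):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

In the scraper's constructor this would be used as self.driver = make_chrome_driver(headless, user_agent), followed by the existing self.login() call.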
