This repository has been archived by the owner on Dec 28, 2023. It is now read-only.

Retry #18

Open
wants to merge 16 commits into
base: master
Changes from 6 commits
26 changes: 26 additions & 0 deletions README.md
@@ -57,8 +57,19 @@ Crawlera middleware won't be able to handle them.

The endpoint of a specific Crawlera instance

* `CRAWLERA_FETCH_ON_ERROR` (type `enum.Enum` - `crawlera_fetch.OnError`,
default `OnError.Raise`)

What to do if an error occurs while downloading or decoding a response. Possible values are:
* `OnError.Raise` (raise a `crawlera_fetch.CrawleraFetchException` exception)
* `OnError.Warn` (log a warning and return the raw upstream response)
* `OnError.Retry` (retry the failed request, up to `CRAWLERA_FETCH_RETRY_TIMES` times -
Requires Scrapy>=2.5)

* `CRAWLERA_FETCH_RAISE_ON_ERROR` (type `bool`, default `True`)

**_Deprecated, please use `CRAWLERA_FETCH_ON_ERROR`_**

Whether or not the middleware will raise an exception if an error occurs while downloading
or decoding a response. If `False`, a warning will be logged and the raw upstream response
will be returned upon encountering an error.
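The precedence between the deprecated boolean and the new enum setting can be sketched as follows. This is a minimal, self-contained illustration rather than the middleware's own code: `OnError` is a local stand-in for `crawlera_fetch.OnError`, `resolve_on_error` is a hypothetical helper, and a plain mapping stands in for Scrapy's settings object. It assumes, following the middleware changes in this pull request, that the new setting wins when both are present and that `OnError.Raise` is the default.

```python
from enum import Enum
from typing import Any, Mapping


class OnError(Enum):
    # Local stand-in mirroring the values of crawlera_fetch.OnError
    Warn = "warn"
    Raise = "raise"
    Retry = "retry"


def resolve_on_error(settings: Mapping[str, Any]) -> OnError:
    """Hypothetical helper: pick the error action from a settings mapping."""
    action = None
    # The deprecated boolean is read first ...
    if "CRAWLERA_FETCH_RAISE_ON_ERROR" in settings:
        raise_on_error = bool(settings["CRAWLERA_FETCH_RAISE_ON_ERROR"])
        action = OnError.Raise if raise_on_error else OnError.Warn
    # ... but the enum setting takes precedence when both are present.
    if "CRAWLERA_FETCH_ON_ERROR" in settings:
        action = settings["CRAWLERA_FETCH_ON_ERROR"]
    # With neither setting present, errors raise an exception.
    return action if action is not None else OnError.Raise


print(resolve_on_error({"CRAWLERA_FETCH_RAISE_ON_ERROR": False}))
```

In a real project the equivalent configuration would presumably be a single line in `settings.py`, such as `CRAWLERA_FETCH_ON_ERROR = OnError.Retry`, with `OnError` imported from `crawlera_fetch`.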
@@ -76,6 +87,21 @@ Crawlera middleware won't be able to handle them.
Default values to be sent to the Crawlera Fetch API. For instance, set to `{"device": "mobile"}`
to render all requests with a mobile profile.

* `CRAWLERA_FETCH_SHOULD_RETRY` (type `Optional[Union[Callable, str]]`, default `None`)


I would use True as default because most of the cases you want to retry. Every API can fail, uncork can fail, spider should retry.

Requirement of 2.5 might be limiting for some users, we don't support this stack in Scrapy Cloud in Zyte at the moment so this would have to wait for release of stack and would force all uncork users to migrate to 2.5. Is there some way to make it compatible with all scrapy not just 2.5?

Member Author


Hey @pawelmhm:

> I would use True as default because most of the cases you want to retry. Every API can fail, uncork can fail, spider should retry.

CRAWLERA_FETCH_SHOULD_RETRY receives a callable (or the name of a callable within the spider) to be used to determine if a request should be retried. Perhaps it could be named differently, I'm open to suggestions. CRAWLERA_FETCH_ON_ERROR is the setting to determine what to do with errors. I made OnError.Warn the default, just to keep backward-compatibility, but perhaps OnError.Retry can be a better default.

> Requirement of 2.5 might be limiting for some users, we don't support this stack in Scrapy Cloud in Zyte at the moment so this would have to wait for release of stack and would force all uncork users to migrate to 2.5. Is there some way to make it compatible with all scrapy not just 2.5?

AFAIK, you should be able to use 2.5 with a previous stack, by updating the requirements file. The 2.5 requirement is because of scrapy/scrapy#4902. I wanted to avoid code duplication but I guess I can just use the upstream function if available and fall back to copying the implementation.
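The "use the upstream function if available, otherwise fall back" idea mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not the code that ended up in this pull request: `fallback_retry_request` and `retry_or_fallback` are hypothetical names, and the fallback is deliberately simplified (no stats, no priority adjustment, no logging) compared to Scrapy's `get_retry_request`.

```python
from typing import Any, Optional

try:
    # Available in Scrapy 2.5 and later
    from scrapy.downloadermiddlewares.retry import get_retry_request
except ImportError:
    get_retry_request = None


def fallback_retry_request(request: Any, max_retry_times: int) -> Optional[Any]:
    """Hypothetical, simplified stand-in used when Scrapy is older than 2.5."""
    retry_times = request.meta.get("retry_times", 0) + 1
    if retry_times > max_retry_times:
        return None  # retries exhausted, give up
    new_request = request.copy()
    new_request.meta["retry_times"] = retry_times
    new_request.dont_filter = True  # let the dupefilter accept the repeated URL
    return new_request


def retry_or_fallback(
    request: Any, spider: Any, reason: str, max_retry_times: int
) -> Optional[Any]:
    """Prefer the upstream helper, fall back to the local copy otherwise."""
    if get_retry_request is not None:
        return get_retry_request(
            request,
            spider=spider,
            reason=reason,
            max_retry_times=max_retry_times,
        )
    return fallback_retry_request(request, max_retry_times)
```

With this shape the middleware would not need to disable retries on older Scrapy versions; the cost is a small amount of duplicated logic that has to be kept in line with upstream.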


@pawelmhm pawelmhm Apr 12, 2021


Thanks for the explanation. Is there any scenario where you don't need retries? In my experience it is very rare not to want to retry internal server errors, timeouts, or bans.

> AFAIK, you should be able to use 2.5 with a previous stack, by updating the requirements file. The 2.5 requirement is because of

I think some projects are still using old Scrapy versions and would have to update to a recent one, which is extra work and effort for them. If they are stuck on an old version such as 1.6 or 1.7, updating to 2.5 might not be straightforward.

But my main point, after thinking about this, is: why do we actually need a custom retry? Why can't we handle this in the retry middleware by default, like all other HTTP error codes? There are 2 use cases mentioned by Taras in the issue on GH, but I'm not convinced by them, and after talking with a developer of the Fetch API I hear they plan to change the behavior to return 500 and 503 HTTP status codes instead of a 200 HTTP status code with an error code in the response body.
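To make the last point concrete: Scrapy's stock retry middleware decides by HTTP status code, so an error delivered as a 200 response with an error code in the JSON body has to be detected by inspecting the body. The sketch below shows one way to do that. `extract_fetch_error` is a hypothetical helper, the `"invalid-json"` marker and the sample payloads are made up for illustration, and the two keys checked are the ones the middleware in this pull request reads.

```python
import json
from typing import Optional


def extract_fetch_error(body: str) -> Optional[str]:
    """Hypothetical helper: return the error code embedded in a JSON body, if any."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "invalid-json"  # illustrative marker, not an API value
    if not isinstance(payload, dict):
        return "invalid-json"
    # Same keys the middleware in this pull request inspects
    return payload.get("crawlera_error") or payload.get("error_code") or None


# Illustrative payloads only
print(extract_fetch_error('{"crawlera_error": "timeout"}'))
print(extract_fetch_error('{"original_status": 200, "body": "ok"}'))
```

If the API did switch to 500 and 503 status codes, a check like this would become unnecessary for those cases and the standard status-code based retry logic would apply.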


**_Requires Scrapy>=2.5_**

A boolean callable that determines whether a request should be retried by the middleware.
If the setting value is a `str`, an attribute by that name will be looked up on the spider
object doing the crawl. The callable should accept the following arguments:
`response: scrapy.http.response.Response, request: scrapy.http.request.Request, spider: scrapy.spiders.Spider`.
If the return value evaluates to `True`, the request will be retried by the middleware.

* `CRAWLERA_FETCH_RETRY_TIMES` (type `Optional[int]`, default `None`)

The maximum number of times a request should be retried.
If `None`, the value is taken from the `RETRY_TIMES` setting.
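As a rough illustration of the two retry settings above, the following sketch shows a callable with the documented `response, request, spider` arguments, the lookup by name on the spider, and the fallback to `RETRY_TIMES`. All helper names (`should_retry`, `resolve_should_retry`, `resolve_retry_times`) and the retry policy itself are assumptions made for the example; plain `SimpleNamespace` objects and a `dict` stand in for Scrapy's spider, response, and settings objects.

```python
from types import SimpleNamespace
from typing import Any, Callable, Mapping, Optional


def should_retry(response: Any, request: Any, spider: Any) -> bool:
    """Hypothetical policy: retry on server errors and on empty bodies."""
    return response.status >= 500 or not response.body


def resolve_should_retry(value: Any, spider: Any) -> Optional[Callable[..., bool]]:
    """A callable is used as is; a str is looked up as an attribute on the spider."""
    if isinstance(value, str):
        value = getattr(spider, value, None)
    return value if callable(value) else None


def resolve_retry_times(settings: Mapping[str, Any]) -> int:
    # Fall back to RETRY_TIMES (Scrapy's default is 2) when the setting is unset
    return int(settings.get("CRAWLERA_FETCH_RETRY_TIMES") or settings.get("RETRY_TIMES", 2))


# Usage with stand-in objects
spider = SimpleNamespace(should_retry=should_retry)
check = resolve_should_retry("should_retry", spider)
bad_response = SimpleNamespace(status=503, body=b"")
print(check(response=bad_response, request=None, spider=spider))
print(resolve_retry_times({"RETRY_TIMES": 3}))
```

Passing the name as a string keeps the settings file free of imports from the spider module, at the cost of the name only being validated once the spider is available.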

### Spider attributes

* `crawlera_fetch_enabled` (type `bool`, default `False`)
127 changes: 113 additions & 14 deletions crawlera_fetch/middleware.py
@@ -4,11 +4,13 @@
import logging
import os
import time
import warnings
from enum import Enum
from typing import Optional, Type, TypeVar
from typing import Callable, Optional, Type, TypeVar, Union

import scrapy
from scrapy.crawler import Crawler
from scrapy.exceptions import ScrapyDeprecationWarning
from scrapy.http.request import Request
from scrapy.http.response import Response
from scrapy.responsetypes import responsetypes
@@ -18,6 +20,11 @@
from scrapy.utils.reqser import request_from_dict, request_to_dict
from w3lib.http import basic_auth_header

try:
from scrapy.downloadermiddlewares.retry import get_retry_request
except ImportError:
get_retry_request = None


logger = logging.getLogger("crawlera-fetch-middleware")

@@ -34,6 +41,12 @@ class DownloadSlotPolicy(Enum):
Default = "default"


class OnError(Enum):
Warn = "warn"
Raise = "raise"
Retry = "retry"


class CrawleraFetchException(Exception):
pass

@@ -74,11 +87,56 @@ def _read_settings(self, spider: Spider) -> None:
self.download_slot_policy = settings.get(
"CRAWLERA_FETCH_DOWNLOAD_SLOT_POLICY", DownloadSlotPolicy.Domain
)

self.raise_on_error = settings.getbool("CRAWLERA_FETCH_RAISE_ON_ERROR", True)

self.default_args = settings.getdict("CRAWLERA_FETCH_DEFAULT_ARGS", {})

# what to do when an error happens?
self.on_error_action = None # type: Optional[OnError]
if "CRAWLERA_FETCH_RAISE_ON_ERROR" in settings:
warnings.warn(
"CRAWLERA_FETCH_RAISE_ON_ERROR is deprecated, "
"please use CRAWLERA_FETCH_ON_ERROR instead",
category=ScrapyDeprecationWarning,
stacklevel=2,
)
if settings.getbool("CRAWLERA_FETCH_RAISE_ON_ERROR"):
self.on_error_action = OnError.Raise
else:
self.on_error_action = OnError.Warn
if "CRAWLERA_FETCH_ON_ERROR" in settings:
self.on_error_action = settings.get("CRAWLERA_FETCH_ON_ERROR")
if self.on_error_action == OnError.Retry and get_retry_request is None:
logger.warning("Cannot retry on error, Scrapy>=2.5 required")
self.on_error_action = None
if self.on_error_action is None:
self.on_error_action = OnError.Raise

# should retry?
self.should_retry = settings.get("CRAWLERA_FETCH_SHOULD_RETRY")
if self.should_retry is not None:
if get_retry_request is None:
logger.warning("Retry feature disabled, Scrapy>=2.5 required")
self.should_retry = None
elif isinstance(self.should_retry, str):
try:
self.should_retry = getattr(spider, self.should_retry)
except AttributeError:
logger.warning(
"Could not find a '%s' callable on the spider - retries are disabled",
self.should_retry,
)
self.should_retry = None
elif not isinstance(self.should_retry, Callable): # type: ignore
logger.warning(
"Invalid type for retry function: expected Callable"
" or str, got %s - retries are disabled",
type(self.should_retry),
)
self.should_retry = None
if self.should_retry is not None:
self.retry_times = settings.getint("CRAWLERA_FETCH_RETRY_TIMES")
if not self.retry_times:
self.retry_times = settings.getint("RETRY_TIMES")

def spider_opened(self, spider):
try:
spider_attr = getattr(spider, "crawlera_fetch_enabled")
@@ -163,6 +221,21 @@ def process_request(self, request: Request, spider: Spider) -> Optional[Request]
request.meta[META_KEY] = crawlera_meta
return request.replace(url=self.url, method="POST", body=body_json)

def _get_retry_request(
self,
request: Request,
reason: Union[Exception, str],
stats_base_key: str,
) -> Optional[Request]:
return get_retry_request(
request=request,
reason=reason,
stats_base_key=stats_base_key,
spider=self.crawler.spider,
max_retry_times=self.retry_times,
logger=logger,
)

def process_response(self, request: Request, response: Response, spider: Spider) -> Response:
if not self.enabled:
return response
@@ -193,11 +266,17 @@ def process_response(self, request: Request, response: Response, spider: Spider)
response.status,
message,
)
if self.raise_on_error:
if self.on_error_action == OnError.Raise:
raise CrawleraFetchException(log_msg)
else:
elif self.on_error_action == OnError.Warn:
logger.warning(log_msg)
return response
elif self.on_error_action == OnError.Retry:
return self._get_retry_request(
request=request,
reason=message,
stats_base_key="crawlera_fetch/retry/error",
)
Gallaecio marked this conversation as resolved.

try:
json_response = json.loads(response.text)
@@ -213,14 +292,22 @@ def process_response(self, request: Request, response: Response, spider: Spider)
exc.lineno,
exc.colno,
)
if self.raise_on_error:
if self.on_error_action == OnError.Raise:
raise CrawleraFetchException(log_msg) from exc
else:
elif self.on_error_action == OnError.Warn:
logger.warning(log_msg)
return response
elif self.on_error_action == OnError.Retry:
return self._get_retry_request(
request=request,
reason=exc,
stats_base_key="crawlera_fetch/retry/error",
)

server_error = json_response.get("crawlera_error") or json_response.get("error_code")
original_status = json_response.get("original_status")
self.stats.inc_value("crawlera_fetch/response_status_count/{}".format(original_status))

server_error = json_response.get("crawlera_error") or json_response.get("error_code")
request_id = json_response.get("id") or json_response.get("uncork_id")
if server_error:
message = json_response.get("body") or json_response.get("message")
@@ -237,13 +324,17 @@ def process_response(self, request: Request, response: Response, spider: Spider)
message,
request_id or "unknown",
)
if self.raise_on_error:
if self.on_error_action == OnError.Raise:
raise CrawleraFetchException(log_msg)
else:
elif self.on_error_action == OnError.Warn:
logger.warning(log_msg)
return response

self.stats.inc_value("crawlera_fetch/response_status_count/{}".format(original_status))
elif self.on_error_action == OnError.Retry:
return self._get_retry_request(
request=request,
reason=server_error,
stats_base_key="crawlera_fetch/retry/error",
)

crawlera_meta["upstream_response"] = {
"status": response.status,
@@ -260,14 +351,22 @@ def process_response(self, request: Request, response: Response, spider: Spider)
url=json_response["url"],
body=resp_body,
)
return response.replace(
response = response.replace(
cls=respcls,
request=original_request,
headers=json_response["headers"],
url=json_response["url"],
body=resp_body,
status=original_status or 200,
)
if self.should_retry is not None:
if self.should_retry(response=response, request=request, spider=spider):
return self._get_retry_request(
request=request,
reason="should-retry",
stats_base_key="crawlera_fetch/retry/should-retry",
)
return response

def _set_download_slot(self, request: Request, spider: Spider) -> None:
if self.download_slot_policy == DownloadSlotPolicy.Domain: