- Removed official support for Python 2.7, 3.4, 3.5 and 3.6, and added official support for Python 3.9, 3.10 and 3.11.
- Deprecated SplashJsonResponse.body_as_unicode(), to be replaced by SplashJsonResponse.text.
- Removed calls to obsolete to_native_str, removed in Scrapy 2.8.
Security bug fix:
If you use HttpAuthMiddleware (i.e. the http_user and http_pass spider attributes) for Splash authentication, any non-Splash request will expose your credentials to the request target. This includes robots.txt requests sent by Scrapy when the ROBOTSTXT_OBEY setting is set to True.
Use the new SPLASH_USER and SPLASH_PASS settings instead to set your Splash authentication credentials safely.
- Responses now expose the HTTP status code and headers from Splash as response.splash_response_status and response.splash_response_headers (#158)
- The meta argument passed to the scrapy_splash.request.SplashRequest constructor is no longer modified (#164)
- Website responses with 400 or 498 as HTTP status code are no longer handled as the equivalent Splash responses (#158)
- Cookies are no longer sent to Splash itself (#156)
- scrapy_splash.utils.dict_hash now also works with obj=None (225793b)
- Our test suite now includes integration tests (#156) and tests can be run in parallel (6fb8c41)
- There’s a new ‘Getting help’ section in the README.rst file (#161, #162), the documentation about SPLASH_SLOT_POLICY has been improved (#157) and a typo has been fixed (#121)
- Made some internal improvements (ee5000d, 25de545, 2aaa79d)
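As a rough illustration of the security fix above, the Splash credentials move out of the spider attributes and into the project settings. The credential values below are placeholders, and the commented-out spider is hypothetical; only the SPLASH_USER and SPLASH_PASS setting names come from the changelog.

```python
# settings.py (sketch): the values are placeholders, not real credentials.
SPLASH_USER = "splash-user"
SPLASH_PASS = "splash-pass"

# The pattern to move away from is setting http_user / http_pass on the
# spider for Splash authentication, since HttpAuthMiddleware may then
# attach them to non-Splash requests as well:
#
# class MySpider(scrapy.Spider):      # hypothetical spider
#     http_user = "splash-user"
#     http_pass = "splash-pass"
```

If the spider also needs HTTP authentication for the target website, that is a separate concern from the Splash credentials shown here.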
- fixed issue with response type detection.
- Scrapy 1.0.x support is back;
- README updates.
- SPLASH_COOKIES_DEBUG setting allows logging of cookies sent to and received from Splash in the cookies request/response fields. It is similar to Scrapy's builtin COOKIES_DEBUG, but works for Splash requests;
- README cleanup.
- Warning about HTTP methods is no longer logged for non-Splash requests.
- SplashAwareDupeFilter and splash_request_fingerprint are improved: they now canonicalize URLs and take URL fragments into account;
- cache_args value fingerprints are now calculated faster.
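To make the dupefilter change above more concrete, here is a minimal standard-library sketch of a fingerprint that canonicalizes a URL and optionally keeps its fragment. It is not scrapy-splash's actual implementation; the helper names canonical_url and fingerprint, the normalization rules and the example URLs are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, keep_fragment):
    """Normalize a URL: lowercase host, sorted query, optional fragment."""
    parts = urlsplit(url)
    # Sort query parameters so that their order does not affect the result.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    fragment = parts.fragment if keep_fragment else ""
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", query, fragment)
    )


def fingerprint(url, keep_fragment):
    """Hash the canonical form of the URL (hypothetical helper)."""
    canonical = canonical_url(url, keep_fragment)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


u1 = "http://example.com/page?b=2&a=1#section-1"
u2 = "http://example.com/page?a=1&b=2#section-2"

# Dropping fragments treats both URLs as duplicates; keeping them does not.
duplicates_without_fragment = fingerprint(u1, False) == fingerprint(u2, False)
duplicates_with_fragment = fingerprint(u1, True) == fingerprint(u2, True)
```

Keeping the fragment matters for JavaScript-rendered pages, where two URLs that differ only in the fragment may render different content.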
The cache_args SplashRequest argument and the request.meta['splash']['cache_args'] key save network traffic and disk storage by not storing duplicate Splash arguments in disk request queues and not sending them to Splash multiple times. This feature requires Splash 2.1+.
To upgrade from v0.4, enable SplashDeduplicateArgsMiddleware in settings.py:

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
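The request.meta['splash']['cache_args'] key described above could be populated roughly as follows. This is a sketch with a plain dict only: the helper name build_splash_meta and the Lua script are illustrative assumptions, and it presumes the 'execute' endpoint with a lua_source argument.

```python
# Placeholder Lua script; a large argument repeated across many requests
# is the typical candidate for cache_args.
LUA_SOURCE = """
function main(splash)
    splash:go(splash.args.url)
    return splash:html()
end
"""


def build_splash_meta(lua_source):
    """Build a request.meta dict that marks lua_source as cacheable.

    Hypothetical helper, not part of scrapy-splash.
    """
    return {
        "splash": {
            "endpoint": "execute",
            "args": {"lua_source": lua_source},
            # Names of entries in "args" that should not be stored or
            # sent repeatedly.
            "cache_args": ["lua_source"],
        }
    }


meta = build_splash_meta(LUA_SOURCE)
```

In a project with the scrapy-splash middlewares enabled, a dict like this would be passed as the meta argument of a request; the cache_args argument of SplashRequest is the shorter route to the same result.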
- SplashFormRequest class is added; it is a variant of FormRequest which uses Splash;
- Splash parameters are no longer stored in request.meta twice; this change should decrease disk queues data size;
- SplashMiddleware now increases request priority when rescheduling the request; this should decrease disk queue data size and help with stale cookie problems.
Package is renamed from scrapyjs to scrapy-splash.
The easiest way to upgrade is to replace scrapyjs imports with scrapy_splash and update settings.py with new defaults (check the README).
There are many new helpers to handle JavaScript rendering transparently; the recommended way is now to use scrapy_splash.SplashRequest instead of request.meta['splash']. Please make sure to read the README if you're upgrading from scrapyjs: you may be able to drop some code from your project, especially if you want to access response HTML or handle cookies and headers.
- new SplashRequest class; it can be used as a replacement for scrapy.Request to provide a better integration with Splash;
- added support for POST requests;
- SplashResponse, SplashTextResponse and SplashJsonResponse handle Splash responses transparently, taking care of response.url, response.body, response.headers and response.status. SplashJsonResponse gives access to decoded response JSON data as response.data.
- cookie handling improvements: it is possible to handle Scrapy and Splash cookies transparently; current cookiejar is exposed as response.cookiejar;
- headers are passed to Splash by default;
- URLs with fragments are handled automatically when using SplashRequest;
- logging is improved: SplashRequest.__repr__ shows both requested URL and Splash URL;
- in case of Splash HTTP 400 errors the response is logged by default;
- an issue with dupefilters is fixed: previously the order of keys in JSON request body could vary, making requests appear as non-duplicates;
- it is now possible to pass custom headers to Splash server itself;
- test coverage reports are enabled.
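The dupefilter issue mentioned in the list above (JSON key order making identical requests look different) can be illustrated with a small standard-library sketch. This is not the code used by scrapy-splash; the helper name body_fingerprint and the example parameters are assumptions.

```python
import hashlib
import json


def body_fingerprint(params):
    """Hash a JSON-serializable dict independently of key order.

    Hypothetical helper, not part of scrapy-splash.
    """
    # sort_keys=True makes the serialized form deterministic.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


a = {"url": "http://example.com", "wait": 0.5}
b = {"wait": 0.5, "url": "http://example.com"}

# Without sorting, the two serialized bodies differ even though the
# parameters are the same.
naive_same = json.dumps(a) == json.dumps(b)
# With sorted keys, both requests produce the same fingerprint.
same = body_fingerprint(a) == body_fingerprint(b)
```

Hashing a canonical serialization rather than the raw request body is what lets a duplicate filter recognize the two requests as the same.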
- Scrapy 1.0 and 1.1 support;
- Python 3 support;
- documentation improvements;
- project is moved to https://github.com/scrapy-plugins/scrapy-splash.
Fixed fingerprint calculation for non-string meta values.
Initial release