From 427e6697711e3683a292571307eadb6ff546ef9f Mon Sep 17 00:00:00 2001
From: Madison Bahmer
Date: Thu, 21 May 2015 17:08:57 -0400
Subject: [PATCH] Minor typos in documentation

This should be the last commit before 1.0
---
 docs/README.md               | 2 +-
 docs/index.rst               | 2 --
 docs/topics/advanced.rst     | 4 ++--
 docs/topics/crawler.rst      | 4 ++--
 docs/topics/kafkamonitor.rst | 6 +++---
 docs/topics/overview.rst     | 2 --
 6 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/docs/README.md b/docs/README.md
index 92dac86f..da245a94 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -13,4 +13,4 @@ Serving on http://127.0.0.1:8000
 ...
 ```
 
-in order to view the documentation as you live edit it on your machine. Note that the `default` theme is overridden by readthedocs when uploading, so don't mind that the local documentation is different from what you see online.
+You will now be able to view the documentation as you live edit it on your machine. Note that the `default` theme is overridden by readthedocs when uploading, so don't mind that the local documentation is different from what you see online.
diff --git a/docs/index.rst b/docs/index.rst
index e0d9527e..eabb9b0a 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -8,8 +8,6 @@ Scrapy Cluster |version| Documentation
 
 This documentation provides everything you need to know about the Scrapy based distributed crawling project, Scrapy Cluster.
 
-.. note:: As of 5/15/15 an official tagged release is getting close, we are just trying to ensure everything works easily for new users and iron out any linger documentation issues. Thank you for your patience and interest!
-
 .. toctree::
    :hidden:
    :maxdepth: 1
diff --git a/docs/topics/advanced.rst b/docs/topics/advanced.rst
index b1e90b60..3be152d8 100644
--- a/docs/topics/advanced.rst
+++ b/docs/topics/advanced.rst
@@ -46,7 +46,7 @@ As a friendly reminder, the following processes should be monitored:
 Scrapy Cluster Response Time
 ----------------------------
 
-The Scrapy Cluster Response time is dependent on two factors:
+The Scrapy Cluster Response time is dependent on a number of factors:
 
 - How often the Kafka Monitor polls for new messages
 
@@ -59,7 +59,7 @@ The Scrapy Cluster Response time is dependent on two factors:
 
 With the Kafka Monitor constantly monitoring the topic, there is very little latency for getting a request into the system. The bottleneck occurs mainly in the core Scrapy crawler code.
 
-The more crawlers you have running and spread across the cluster, the lower the average response time will be for a crawler to receive a request. For example if a single spider goes idle for 5 seconds, you would expect a your maximum response time to be 5 seconds, the minimum response time to be 0 seconds, but on average your response time should be 2.5 seconds for one spider. As you increase the number of spiders in the system the likelihood that one spider is polling also increases, and the cluster performance will go up.
+The more crawlers you have running and spread across the cluster, the lower the average response time will be for a crawler to receive a request. For example if a single spider goes idle and then polls every 5 seconds, you would expect a your maximum response time to be 5 seconds, the minimum response time to be 0 seconds, but on average your response time should be 2.5 seconds for one spider. As you increase the number of spiders in the system the likelihood that one spider is polling also increases, and the cluster performance will go up.
 
 The final bottleneck in response time is how quickly the request can be conducted by Scrapy, which depends on the speed of the internet connection(s) you are running the Scrapy Cluster behind. This final part is out of control of the Scrapy Cluster itself.
 
diff --git a/docs/topics/crawler.rst b/docs/topics/crawler.rst
index 4a88b854..1c268922 100644
--- a/docs/topics/crawler.rst
+++ b/docs/topics/crawler.rst
@@ -46,7 +46,7 @@ Each crawl job that is submitted to the cluster is given a priority, and for eve
    :alt: Breath First
    :align: center
 
-As you can see above. the initial seed url generates 4 new links. Since we are using a priority based queue, the spiders continue to pop from the highest priority crawl request, and then decrease the priority for level deep they are from the parent request. Any new links are fed back into the same exact queue mechanism but with a lower priority to allow for the equal levels links to be crawled first.
+As you can see above. the initial seed url generates 4 new links. Since we are using a priority based queue, the spiders continue to pop from the highest priority crawl request, and then decrease the priority for level deep they are from the parent request. Any new links are fed back into the same exact queue mechanism but with a lower priority to allow for the equal leveled links to be crawled first.
 
 When a spider encounters a link it has already seen, the duplication filter based on the request’s ``crawlid`` will filter it out. The spiders will continue to traverse the resulting graph generated until they have reached either their maximum link depth or have exhausted all possible urls.
 
@@ -144,7 +144,7 @@ redis\_spider.py
 
 A base class that extends the default Scrapy Spider so we can crawl continuously in cluster mode. All you need to do is implement the ``parse`` method and everything else is taken care of behind the scenes.
 
-.. note:: There is a method within this class called ``reconstruct_headers()`` that is very important you take advantage of! The issue we ran into was that we were dropping data in our headers fields when encoding the item into json. The Scrapy shell didn’t see this issue, print statements couldn’t find it, but it boiled down to the python list being treated as a single element. We think this may be a formal defect in Python 2.7 but have not made an issue yet as the bug needs much more testing.*
+.. note:: There is a method within this class called ``reconstruct_headers()`` that is very important you take advantage of! The issue we ran into was that we were dropping data in our headers fields when encoding the item into json. The Scrapy shell didn’t see this issue, print statements couldn’t find it, but it boiled down to the python list being treated as a single element. We think this may be a formal defect in Python 2.7 but have not made an issue yet as the bug needs much more testing.
 
 link\_spider.py
 ^^^^^^^^^^^^^^^
diff --git a/docs/topics/kafkamonitor.rst b/docs/topics/kafkamonitor.rst
index e590f63f..ca6c6891 100644
--- a/docs/topics/kafkamonitor.rst
+++ b/docs/topics/kafkamonitor.rst
@@ -37,7 +37,7 @@ Design Considerations
 
 The design of the Kafka Monitor stemmed from the need to define a format that allowed for the creation of crawls in the crawl architecture from any application. If the application could read and write to the kafka cluster then it could write messages to a particular kafka topic to create crawls.
 
-Soon enough those same applications wanted the ability to retrieve and stop their crawls from that same interface, so we decided to make a dynamic interface that could support all of the request needs, but utilize the same base code. In the future this base code could expanded to handle any different style of request, as long as there was an validation of the request and a place to send the result to.
+Soon enough those same applications wanted the ability to retrieve and stop their crawls from that same interface, so we decided to make a dynamic interface that could support all of the request needs, but utilize the same base code. In the future this base code could expanded to handle any different style of request, as long as there was a validation of the request and a place to send the result to.
 
 From our own internal debugging and ensuring other applications were working properly, a utility program was also created on the side in order to be able to interact and monitor the kafka messages coming through. This dump utility can be used to monitor any of the Kafka topics within the cluster.
 
@@ -86,7 +86,7 @@ Example Crawl Requests:
 
     python kafka-monitor.py feed '{"url": "http://www.dmoz.org/", "appid":"testapp", "crawlid":"abc123", "maxdepth":2, "priority":90}' -s settings_crawling.py
 
-- Submits a dmoz.org crawl spidering 2 links deep with a high priority
+- Submits a dmoz.org crawl spidering 2 levels deep with a high priority
 
 ::
 
@@ -160,7 +160,7 @@ Optional:
 
 - **priority:** The priority of which to given to the url to be crawled. The Spiders will crawl the highest priorities first.
 
-- **allowed_domains:** A list of domains that the crawl should stay within. For example, putting [ "cnn.com" ] will only continue to crawl links of that domain.
+- **allowed_domains:** A list of domains that the crawl should stay within. For example, putting ``[ "cnn.com" ]`` will only continue to crawl links of that domain.
 
 - **allow_regex:** A list of regular expressions to apply to the links to crawl. Any hits within from any regex will allow that link to be crawled next.
 
diff --git a/docs/topics/overview.rst b/docs/topics/overview.rst
index 972ab6c7..4cde84c7 100644
--- a/docs/topics/overview.rst
+++ b/docs/topics/overview.rst
@@ -7,8 +7,6 @@ The goal is to distribute seed URLs among many waiting spider instances, whose r
 
 The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of of the crawl requests.
 
-.. note:: As of 4/27/15 an official tagged release is getting close, we are just trying to consolidate documentation and ensure everything works easily for new users. Thank you for your patience and interest! If you would like to jump right in anyways, the :doc:`./quickstart` guide is complete.
-
 Dependencies
 ------------