UI for displaying information about Cluster #25
Comments
something like this? https://django-dynamic-scraper.readthedocs.io/en/latest/introduction.html |
Not a web interface for generating spiders, but a web interface for visualizing the API data generated from working with the cluster. The REST services wrapper that interacts with Kafka would be able to use the same APIs as Kafka does, so the data returned could be ingested into a primitive series of pages that display info about your scraping cluster. For example:
Basically, anything you can get from the Kafka API you should be able to see when visiting the UI. |
I think there are several choices (e.g. Airflow, NiFi) with a good UI, but I think NiFi may be the best bet, since for now Airflow supports only RabbitMQ, not Kafka. |
Apache NiFi is nice but is not the UI I am looking for. Picture a very plain UI like the ones for Hadoop, HBase, Spark, Storm, or another popular open source Apache project. It tells basic information about the cluster, and allows primitive manipulation of the controls within. The Rest services in #24 would be built via Flask, with a basic Angular (or different framework) front end. The UI doesn't actually care what is behind the scenes; it just interacts with the rest server. The rest server then interacts with the cluster. It helps keep things separated and abstract. |
Pushing this back to 1.3, with the focus on getting a solid rest service (#24) for 1.2 |
ok! Will the rest service also tackle the problem of crawling the same website (i.e. news or ecommerce sites) over time, such that same/duplicated items will not be outputted? |
The rest service does not take care of that. It sounds like you would need a customized spider to do better duplicate detection, or to modify the very vanilla RFPDupeFilter to handle detection of duplicates at a site-wide scale. That class prevents duplicates from being crawled, but you may want to add similar logic to the item pipeline to drop items that share whatever footprint criteria your filtering needs call for. If you have further questions about that please move the conversation to Gitter, as this issue is for UI conversation. |
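To make the item pipeline suggestion above concrete, here is a minimal sketch of fingerprint-based dropping. It is not code from scrapy-cluster: `DedupePipeline`, `item_fingerprint`, and the default field names are hypothetical, and `DuplicateItem` is a local stand-in so the snippet runs without Scrapy installed (in a real pipeline you would raise `scrapy.exceptions.DropItem` instead). The seen-set lives in memory, so a real cluster would need shared storage such as Redis.

```python
import hashlib
import json


class DuplicateItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


def item_fingerprint(item, fields):
    """Hash only the fields that define 'the same item' for your site."""
    subset = {name: item.get(name) for name in fields}
    canonical = json.dumps(subset, sort_keys=True, default=str)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class DedupePipeline:
    """Drops items whose fingerprint has already been seen by this process."""

    def __init__(self, fields=("url", "title")):  # hypothetical field names
        self.fields = tuple(fields)
        self.seen = set()  # in-memory only; not shared between crawler nodes

    def process_item(self, item, spider=None):
        fingerprint = item_fingerprint(item, self.fields)
        if fingerprint in self.seen:
            raise DuplicateItem("duplicate item: %s" % fingerprint)
        self.seen.add(fingerprint)
        return item
```

Choosing which fields go into the fingerprint is the site-specific part: a news site might key on the article URL, while a shop might key on SKU plus price so that price changes still get emitted.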
The Spark UI is a nice functional interface for handling Spark jobs. Something similar for scrapy-cluster would be a good addition to the project. Potentially a central interface for scheduling tasks to be sent to the cluster through 'rest' and also viewing Kafka, Redis and Crawler stats/metrics? @madisonb, what core features would you like to see in the UI? A lightweight UI could be created with Flask and some static HTML templates, or a more extensible UI with Angular/React. I would be interested in contributing and could start building something in another week or so. |
@damienkilgannon Thanks for your interest! I have a bit more work to do on the Rest passthrough endpoint (Kibana logs) but it is pretty much ready to go. I built a special branch of the docs if you are interested here before it gets merged into the My vision for the UI would be to utilize the existing Kafka Monitor API return values to make something like you are saying above -> display basic information about the cluster, view spiders, backlogs, and do some basic interaction. Maybe even a raw JSON page for those who have custom setups. One thing I do have a question about is "Can we make the UI independent of the Rest service?" I imagine the interaction being like:
While UI is not my specialty, I would like to have unit testing and best practices applied here. If someone doesn't like the UI built, they can still use the rest endpoint and build their own - so I would like to leave them decoupled. I would suggest initially to look at the Kafka Monitor API docs here (since that is just translated into Rest) and we can iterate on things. I also don't want to build something that is trying to emulate what we do with Kibana. Kibana is awesome and our UI should complement it or help assist those that don't have access to the more complex ELK stack. |
Yes, that sounds good. I definitely think keeping it decoupled from the rest app is the best way forward. Testing and best practices are fair enough; maybe we can look at mocking the rest API for testing of the UI. I will mock/sketch up the UI design I am thinking about. It is easier to compare notes and understand the project's needs that way. |
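One way to keep the UI decoupled and still testable, sketched here under assumptions rather than taken from the project: the UI talks to the rest service through a small client whose transport can be swapped for a fake in unit tests. `RestClient`, `FakeTransport`, the `/feed` path, and the port in the usage note are all hypothetical names chosen for illustration.

```python
import json
import urllib.request


def http_post_json(url, payload):
    """Default transport: POST a JSON body and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


class RestClient:
    """The only piece of the UI that knows how to reach the rest service."""

    def __init__(self, base_url, transport=http_post_json):
        self.base_url = base_url.rstrip("/")
        self.transport = transport

    def feed(self, payload):
        # "/feed" is an assumed endpoint path, not confirmed by this thread.
        return self.transport(self.base_url + "/feed", payload)


class FakeTransport:
    """Test double: records calls and returns a canned reply, no network."""

    def __init__(self, reply):
        self.reply = reply
        self.calls = []

    def __call__(self, url, payload):
        self.calls.append((url, payload))
        return self.reply
```

In a UI test you would build `RestClient("http://localhost:5343", transport=FakeTransport({...}))` and assert on `calls`, so the UI suite never needs a running cluster.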
Something like this? @madisonb |
Those look like good first mockups; you certainly have got my wheels turning on what we would really like to see in a UI! Keep in mind we can generate priorities for work for version
Lots of thoughts to follow.
Template/Theme
Overview/Landing Page
I actually like the idea of moving the "Submit Job" UI to the landing page, and since this is the raw tool for the cluster I would like to expose all configurable options (perhaps at an "Advanced Submit Job" screen). Keeping it simple is great, I say go with
Active Jobs
If you were to click on this list item, you would be taken to a sub-job page to view the 'high' and 'low' priority for every domain, and the total count for each domain. I picture a table/list view with all of this information available to the user. Within each sub-job view, there is a button to
Stats
Or, we could use everything if we go with the sub-pages, it makes for an easy tree breakdown of
I'm not sure if we want to reinvent D3 charts, or Kibana for that matter, but we have some nice numbers to work with - keep in mind the stats are somewhat dynamic, so hardcoding values like
Output Raw JSON
That gives us the following sitemap breakdown:
After writing all of this up, I see things that are needed, but probably won't get to until
What are your thoughts? It seems like a lot - and it is a lot. But any help is appreciated! |
@madisonb great feedback, thanks. That first swipe at the UI design has got us on the same page now. Clearly, there is loads of functionality that can be incorporated in the UI as you mentioned above, and I think it should all be included as long as the features don't make the design or code base overcomplicated and hamper the potential for future extensibility. I will redo the mock UI design based on your feedback and evaluate what could be left until 1.3. Will be back in a few days. |
@madisonb I have made some adjustments to your sitemap, but it follows the same idea:
The landing/overview page calls 'rest' to make a stats:all request to populate the top row with an overview of cluster stats (the request is made on a page refresh). 'rest' is then called again to submit a job using the 'basic job submit' form on the landing page. And finally, at the end of the landing page the user can initiate a 'rest' call to query active jobs based on appid. All of this follows on from your earlier description and breakdown. The top right corner has a link to the readthedocs site. Styling will be kept clean and simple, with options for the user to customize to a certain extent. See a new mockup of the landing page using bare bootstrap styling. In regards to redis-monitor, kafka-monitor and crawler stats, I think it's a good idea to keep them exposed on the main nav bar. They would probably be the features which contain the most valuable info for a user, and quick access would be nice. I would think a simple page refresh, and even a refresh button to fire off a call to 'rest' to populate the stats on these pages, would be suitable. Nothing fancy for v1.2 on presentation of stats, just raw data to get started. Let me know your thoughts. I can probably get started building this later in the week? |
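As a rough illustration of the two landing-page calls described above, the helpers below build the request bodies the UI would hand to 'rest'. The field names (`stats`, `appid`, `uuid`, `url`, `crawlid`) are an assumption inferred from the "stats:all" wording and the mention of appid in this thread; check them against the Kafka Monitor API docs before relying on them. The function names are hypothetical.

```python
import uuid


def stats_request(appid, stats="all"):
    """Body for the overview row, i.e. the 'stats:all' request (assumed shape)."""
    return {"stats": stats, "appid": appid, "uuid": uuid.uuid4().hex}


def crawl_request(url, appid, crawlid=None):
    """Body for the 'basic job submit' form (assumed shape)."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    # Generate a crawl id when the form leaves it blank.
    return {"url": url, "appid": appid, "crawlid": crawlid or uuid.uuid4().hex}
```

Keeping payload construction in plain functions like these means the form validation can be unit tested without a browser or a running rest service.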
Mockup looks great! I finally was able to tidy up the rest component so now that is merged into the EDIT: Full speed ahead |
Hey @madisonb, so what I have ended up doing is creating a Kafka producer which periodically sends stats requests at a preset frequency, and I have then created a consumer which is continuously listening for the responses to these requests. The consumer will validate stats messages and write them to a file. The file will be overwritten each time a new stats response is received, thus acting as a cache for the latest stats from the cluster. The UI then loads these most recent stats; I have yet to decide the best way to get the UI to reload/refresh the stats. Trying to keep things as simple and straightforward as possible. Will commit some code in the coming days to discuss further. |
Created a pull request for this; the first pull request is going to provide a very simple UI, with the goal of extending its functionality in the future. This UI will serve two key tasks: provide an update on current cluster status (redis connected, kafka connected) and a form to submit a crawl request to the cluster. |
You guys are my kind of peeps! For a while now I have been putting off creating a project with the same scope, though I was only just now introduced to cluster... Just earlier today I started the build process after getting tired of scp'ing and ssh'ing into my production cloud account to run my data mining projects... Running the spider "headless"... detached really, subprocessing etc... Got some time off so here I go!! About to give cluster a spin, see if it turns me off of scrapyd, cheers! |
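The file-as-cache idea above could look something like the sketch below. This is not the code from the pull request: the function names and the validation rule are hypothetical, and writing to a temporary file followed by `os.replace` is a design choice added here so the UI never reads a half-written file, not something the comment describes.

```python
import json
import os
import tempfile


def is_valid_stats(message):
    """Hypothetical check; the real consumer's validation rules are not shown here."""
    return isinstance(message, dict) and "stats" in message


def write_latest_stats(path, message):
    """Overwrite the cache file with the newest stats response."""
    if not is_valid_stats(message):
        return False
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(message, handle)
        os.replace(tmp_path, path)  # swap in the complete file in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True


def read_latest_stats(path):
    """What the UI would call on page load; None until the first response lands."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

With this split, the consumer loop only ever calls `write_latest_stats`, and the UI only ever calls `read_latest_stats`, so a page refresh or refresh button is enough to pick up new numbers.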
Hey guys! What is the situation with this one? |
@villeristi both #174 and #116 are in a partial working state to get the UI into the main branches. |
We need a small stand-alone web UI that ties in with the rest components in #24 to visualize the data generated by the cluster. You should also be able to submit API requests to the cluster.
Preferably this web UI and the rest services are bundled together and deployed as a single running process.