This is my first time using the Go programming language, so I thought I would write a simple web crawler.
The crawler accepts a single domain and gathers a map of the site along with the assets on each page.
Overview:
- Accepts a single domain
- Does not crawl subdomains
- Obeys robots.txt (if one can be found)
- Examines the Content-Type header of the http.Get response and discards anything whose Content-Type is not 'text/html'
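The Content-Type check above can be sketched as follows. This is a minimal illustration, not the project's actual code; the `isHTML` and `fetchHTML` names are my own, and the real crawler may structure this differently. `mime.ParseMediaType` is used so that parameters such as `; charset=utf-8` do not defeat the comparison.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
)

// isHTML reports whether a Content-Type header value denotes an HTML page.
// mime.ParseMediaType strips parameters such as "; charset=utf-8".
func isHTML(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	return err == nil && mediaType == "text/html"
}

// fetchHTML performs the GET and discards non-HTML responses.
func fetchHTML(url string) (*http.Response, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	if !isHTML(resp.Header.Get("Content-Type")) {
		resp.Body.Close()
		return nil, fmt.Errorf("skipping %s: Content-Type is not text/html", url)
	}
	return resp, nil
}

func main() {
	fmt.Println(isHTML("text/html; charset=utf-8")) // true
	fmt.Println(isHTML("image/png"))                // false
}
```

Comparing the parsed media type rather than the raw header string avoids false negatives on the common `text/html; charset=utf-8` form.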
Future Work:
- refactor crawl into goroutines so multiple pages can be fetched concurrently
- research more idiomatic testing practices in Go
- refactor Page into its own package