gocrawl

This is my first time using the Go programming language, so I thought I would write a simple web crawler.

The crawler accepts a single domain and builds a map of the site, including all assets of each page.

Overview:

Accepts a single domain
Does not crawl subdomains
Obeys robots.txt (if one can be found)
Examines the Content-Type header of each http.Get response and discards any response whose Content-Type is not text/html
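
The two filtering rules above (no subdomains, HTML only) can be sketched roughly as follows. This is a minimal illustration and not the code in this repository: the helper names `isHTML` and `sameHost` are hypothetical, and it assumes links have already been resolved to absolute URLs.

```go
package main

import (
	"fmt"
	"mime"
	"net/url"
	"strings"
)

// isHTML (hypothetical helper) reports whether a Content-Type header value
// names text/html, ignoring parameters such as "; charset=utf-8".
func isHTML(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		// A missing or malformed header is treated as "not HTML".
		return false
	}
	return mediaType == "text/html"
}

// sameHost (hypothetical helper) reports whether an absolute link points at
// exactly the given domain. Subdomains such as blog.example.com do not match
// example.com. Relative links have no host and must be resolved first.
func sameHost(domain, link string) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	return strings.EqualFold(u.Hostname(), domain)
}

func main() {
	fmt.Println(isHTML("text/html; charset=utf-8"))
	fmt.Println(isHTML("image/png"))
	fmt.Println(sameHost("example.com", "http://example.com/about"))
	fmt.Println(sameHost("example.com", "http://blog.example.com/post"))
}
```

In a real crawl, the string passed to `isHTML` would come from `resp.Header.Get("Content-Type")` on the `http.Get` response.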

Future Work:

Refactor crawl into a goroutine so that more than one crawl can run at a time
Obey robots.txt
Research more idiomatic testing practices in Go
Refactor Page into its own package
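
For the goroutine item, one possible shape is sketched below. It is only an assumption about how the refactor might look: `crawlAll` and the `fetch` callback are hypothetical names, and the fake fetcher stands in for the real crawl so that the example runs without network access.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll (hypothetical) runs fetch for every URL in its own goroutine and
// collects the results in a map keyed by URL.
func crawlAll(urls []string, fetch func(string) string) map[string]string {
	var (
		mu      sync.Mutex     // guards results, since maps are not safe for concurrent writes
		wg      sync.WaitGroup // lets the caller wait for every goroutine to finish
		results = make(map[string]string, len(urls))
	)
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			body := fetch(u)
			mu.Lock()
			results[u] = body
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return results
}

func main() {
	// Stand-in for the real crawl function.
	fake := func(u string) string { return "fetched " + u }
	urls := []string{"http://example.com/", "http://example.com/about"}
	for u, body := range crawlAll(urls, fake) {
		fmt.Println(u, "=>", body)
	}
}
```

This starts one goroutine per URL, which is fine for a handful of pages. A larger crawl would probably want a bounded worker pool so the target site is not flooded with requests.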