Wikipedia crawler example using Serverless
To run this repo you'll need to:
- Install Node.js https://nodejs.org/
- Install Serverless https://serverless.com/framework/docs/getting-started/
- Install a Python distribution, e.g. Anaconda https://www.anaconda.com/download/
- Install your favorite Python IDE, e.g. VSCode https://code.visualstudio.com/; support for JavaScript, Node.js, YAML etc. is a plus.
- Install Git https://git-scm.com/
- Register on AWS https://aws.amazon.com/ and install the AWS CLI; you should not exceed the free tier.
- For Windows 10 users it is useful to have Ubuntu for Windows https://tutorials.ubuntu.com/tutorial/tutorial-ubuntu-on-windows#0
- For building deployment packages it is good to have Docker installed (https://www.docker.com/get-started) along with the Serverless Python Requirements plugin (https://www.npmjs.com/package/serverless-python-requirements)
To install the required plugins type:
npm install
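Note that npm install reads package.json and installs the Serverless plugins listed there, presumably including serverless-python-requirements mentioned above. For Serverless to pick a plugin up it also has to be listed in serverless.yml; a minimal sketch (the actual file in this repo may look different):
plugins:
  - serverless-python-requirements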
To fix the Docker Toolbox daemon error, set a number of environment variables, as follows:
SET DOCKER_TLS_VERIFY=1
SET DOCKER_HOST=tcp://192.168.99.100:2376
SET DOCKER_CERT_PATH=%USERPROFILE%\.docker\machine\machines\default
SET DOCKER_MACHINE_NAME=default
SET COMPOSE_CONVERT_WINDOWS_PATHS=true
%USERPROFILE% should point to your home directory (you can check it using echo %USERPROFILE%); if it is not set correctly, replace it manually with your home directory in the DOCKER_CERT_PATH value given above.
The description is taken from https://www.mydatahack.com/resolving-docker-deamon-is-not-running-error-from-command-prompt/
To check the package content use:
sls info
To deploy the functions to AWS Lambda use the following command:
sls deploy -v
To test the wiki function you can invoke it in local mode:
sls invoke local -f wiki -d "{\"lang\": \"pl\"}"
or with the test file:
sls invoke local -f wiki -p tests/wiki_test.json
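The test file presumably holds an event payload like the inline one above, e.g.:
{"lang": "pl"}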
Similarly, you can invoke the deployed function on AWS Lambda:
sls invoke -f wiki -p tests/wiki_test.json -l
Note that without the -l flag the logs will not be shown.
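For orientation, a handler consuming such an event might look roughly like the sketch below; the function body and the use of Special:Random are purely illustrative and not necessarily what this repo's handler does:
import urllib.request

def wiki(event, context):
    # the event selects the Wikipedia language edition, e.g. {"lang": "pl"}
    lang = event.get("lang", "en")
    # Special:Random exists in every language edition, so this works for any lang
    url = f"https://{lang}.wikipedia.org/wiki/Special:Random"
    with urllib.request.urlopen(url) as response:
        page = response.read().decode("utf-8")
    # return a small, JSON-serializable summary
    return {"lang": lang, "page_length": len(page)}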
To see the logs from the deployed wiki function use the following command:
sls logs -f wiki
There are occasionally issues with the serverless-python-requirements plugin when using Docker, especially on Windows. You can use the plugin with the option dockerizePip: false and a local Python (in this case the python3.7 executable) for zip creation; on Windows it is best to use Ubuntu for Windows, as the target image on AWS is Linux. If you want to use a Docker build, which is best if you use packages requiring compilation (e.g. numpy), set dockerizePip: true, or on Windows dockerizePip: non-linux.
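For reference, these options live in the custom section of serverless.yml; a sketch of the relevant fragment (the actual file may differ):
custom:
  pythonRequirements:
    # true: always build the requirements zip inside Docker
    # false: use the local Python (here python3.7) instead
    # non-linux: use Docker only when the host OS is not Linux, e.g. on Windows
    dockerizePip: non-linux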
Alternatively, you can containerize your function using Docker as a web service, in this case using Flask. This can be used in many container orchestration systems, such as Kubernetes. Note that the service is simpler and just returns the crawled data; you need to handle adding the data yourself, either in the crawler service or as a separate service. See app.py and Dockerfile for details.
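For orientation, a minimal Flask wrapper of this kind might look like the sketch below; the endpoint, the query parameter and the crawl stub are illustrative assumptions, see the actual app.py for what the service really does:
from flask import Flask, jsonify, request

app = Flask(__name__)

def crawl(lang):
    # placeholder for the actual crawling logic in this repo
    return {"lang": lang, "status": "ok"}

@app.route("/")
def crawl_endpoint():
    # pick the Wikipedia language edition from the query string
    lang = request.args.get("lang", "en")
    return jsonify(crawl(lang))

if __name__ == "__main__":
    # the docker run example below maps port 8080
    app.run(host="0.0.0.0", port=8080)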
To build the service on your machine use the build command, e.g.:
docker build -t wiki-crawler .
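The build is driven by the provided Dockerfile; for a Flask service of this kind it might look roughly like the sketch below (the actual file may differ):
FROM python:3.7-slim
WORKDIR /app
# install Python dependencies first to make better use of the Docker build cache
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
# the service listens on port 8080, matching the run command below
EXPOSE 8080
CMD ["python", "app.py"]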
To run the service use the run command, e.g.:
docker run -p 8080:8080 wiki-crawler
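Once the container is running, you can query it, e.g. with curl; the path and the lang parameter follow the sketch above and may differ from the actual app.py:
curl "http://localhost:8080/?lang=pl"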
If it works correctly, you can push the image to a repository of your choosing; see the documentation for details.
Recently a number of providers have added an option to run Docker containers in a serverless fashion. One of them is Google Cloud Run, which is effectively a managed version of Knative. It allows for easy shipping of Docker containers, scales them up and down (even to zero), and allows for event triggering; see the respective products' documentation for details.
Before deployment you need to build your image and ship it to a container registry, like the Google Cloud one.
To use Google Cloud you need to register and create an account. You also need the Google Cloud SDK; see the documentation for details.
Build the Docker image locally using the naming scheme below:
docker build -t gcr.io/$PROJECT_ID/wiki-crawler .
Here PROJECT_ID is your Google Cloud project ID; you can get it via the Google Cloud SDK:
export PROJECT_ID=$(gcloud config get-value project)
You can test the image locally the same way as before; note the name change:
docker run -p 8080:8080 gcr.io/$PROJECT_ID/wiki-crawler
Before pushing the image to the registry you may need to set up Docker authentication via the command below:
gcloud auth configure-docker
You should now be able to push the image to the registry using the following command:
docker push gcr.io/$PROJECT_ID/wiki-crawler
To deploy to Cloud Run you need to use gcloud; see the quick start manual for details. To deploy the pushed image use the command below; it will ask you a few questions about the deployment.
gcloud beta run deploy wiki-crawler --image gcr.io/$PROJECT_ID/wiki-crawler
Alternatively, you can use the provided cloudbuild.yaml file, which is a Cloud Build config definition; see the documentation for details. To submit the build, use the command below.
gcloud builds submit
To change the default substitution variables, e.g. the service name, type:
gcloud builds submit --substitutions=_SERVICE_NAME=new-wiki-crawler
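For orientation, such a config typically defines the default substitutions and the build, push and deploy steps; the sketch below is only an approximation of the provided file:
substitutions:
  _SERVICE_NAME: wiki-crawler

steps:
  # build the image
  - name: gcr.io/cloud-builders/docker
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/$_SERVICE_NAME', '.']
  # push it to the container registry
  - name: gcr.io/cloud-builders/docker
    args: ['push', 'gcr.io/$PROJECT_ID/$_SERVICE_NAME']
  # deploy the pushed image to Cloud Run
  - name: gcr.io/cloud-builders/gcloud
    args: ['beta', 'run', 'deploy', '$_SERVICE_NAME', '--image', 'gcr.io/$PROJECT_ID/$_SERVICE_NAME']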