Ayush Ranjan edited this page Jan 10, 2020 · 2 revisions

Terminology

Grading Stage

A grading stage consists of a Docker image and zero or more environment variables, some of which may be templates. For example, you can have an image named autograder and a templated environment variable STUDENT_ID that specifies which student to run the autograder for. A grading stage fails if the container returns a non-zero exit code or if the container times out.
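A minimal sketch of a stage and its failure rule is shown below. The GradingStage class, the stage_failed helper, and the $STUDENT_ID placeholder syntax (borrowed from Python's string.Template) are assumptions made for illustration. They are not Broadway's actual config format or template syntax.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class GradingStage:
    # Hypothetical model of a stage: an image plus templated env vars.
    image: str
    env: dict = field(default_factory=dict)

    def render_env(self, values: dict) -> dict:
        """Fill the placeholders in each env value with per-job values."""
        return {key: Template(val).substitute(values) for key, val in self.env.items()}


def stage_failed(exit_code: int, timed_out: bool) -> bool:
    """A stage fails on a non-zero exit code or on a timeout."""
    return timed_out or exit_code != 0


stage = GradingStage(image="autograder", env={"STUDENT_ID": "$STUDENT_ID"})
rendered = stage.render_env({"STUDENT_ID": "student42"})
```

Here rendered holds the concrete environment for one student, which is what a single grading job would pass to its container.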

Config

Grading Pipeline

A grading pipeline consists of one or more grading stages, which are run sequentially in the order specified in the array.

Grading Job

A grading job is an instance of a grading pipeline. For example, you can specify one grading pipeline to grade a student, and there will be one instance of this pipeline for each student you specify. A grading job fails if any of its grading stages fails: the job is aborted at the first stage failure (subsequent stages are not executed) and is marked as failed.
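The abort-on-first-failure behaviour can be sketched as follows. The run_grading_job function and its run_stage callback are hypothetical names. In a real grader, run_stage would start the container and apply the exit-code and timeout rule.

```python
from typing import Callable, List, Sequence, Tuple


def run_grading_job(
    stages: Sequence[str],
    run_stage: Callable[[str], bool],
) -> Tuple[bool, List[str]]:
    """Run stages in order; stop at the first stage that fails.

    Returns (job_succeeded, stages_that_were_executed).
    """
    executed = []
    for stage in stages:
        executed.append(stage)
        if not run_stage(stage):
            # Subsequent stages are skipped and the job is marked as failed.
            return False, executed
    return True, executed


# Example: the second stage fails, so the third one never runs.
ok, executed = run_grading_job(["setup", "grade", "report"], lambda s: s != "grade")
```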

Grading Run

Usually represents the grading of a single assignment. Consists of the following:

  • Pre-processing job: This is optional. This is executed before any of the other jobs are scheduled for this run. If this fails, the grading run is marked as failed and none of the other jobs are scheduled. If not defined, student jobs are scheduled right away.
  • Student grading jobs: Consists of many grading jobs which will be executed simultaneously. These jobs are distributed across the grading machines. Ideally, each student job grades one student. The grading run is unaffected by the failure of any of these jobs (since a student's code might break things, cause the containers to time out, etc.).
  • Post-processing job: This is optional. This is executed after all student jobs have finished. The grading run is marked as failed if this job fails.
  • Environment variables: These environment variables are injected into every container that is run for a grading job for this assignment.

Config

Worker Nodes

The Broadway graders (worker nodes) communicate with the API by making requests. All the alive worker nodes communicating with the API form the grading cluster. We keep track of alive worker nodes using a heartbeat protocol. The size of the grading cluster can be scaled up or down as required. The worker nodes are currently not capable of listening for requests, so the API cannot notify them when events occur (for example, when a grading job is ready to run). As a result, all communication is initiated by the grader. The worker nodes are responsible for:

  • Polling the API for grading jobs and running them. Workers check periodically with the API whether a grading job is available. If there is a job in the queue, it is sent back immediately as the response to the request.
  • Sending the results of a grading job back to the API once the job has finished executing. Once the API receives the results, it marks the job as succeeded or failed and schedules the next batch of jobs appropriately. For example, if the API receives the results for the pre-processing job and the job was successful, it then schedules the student jobs.
  • Send periodic heartbeats. The API merely updates the worker node's state here.
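The poll, execute, and report cycle above can be sketched as a simple loop. The poll, execute, and report callables are hypothetical stand-ins for the HTTP requests and container execution, and max_polls exists only so the sketch terminates. Heartbeats are left out. They would be sent on a separate timer.

```python
import time
from typing import Callable, Optional


def worker_loop(
    poll: Callable[[], Optional[str]],
    execute: Callable[[str], bool],
    report: Callable[[str, bool], None],
    poll_interval: float = 0.0,
    max_polls: int = 3,
) -> int:
    """Poll for jobs, run each one, and report its result back.

    Returns the number of jobs handled.
    """
    handled = 0
    for _ in range(max_polls):
        job = poll()                 # the worker always initiates the request
        if job is None:              # queue was empty; wait and try again
            time.sleep(poll_interval)
            continue
        report(job, execute(job))    # send the result back to the API
        handled += 1
    return handled


# Example with an in-memory queue standing in for the API.
pending = ["job-1", "job-2"]
results = []
handled = worker_loop(
    poll=lambda: pending.pop(0) if pending else None,
    execute=lambda job: True,
    report=lambda job, ok: results.append((job, ok)),
)
```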

Failure Detection

The API expects a heartbeat every HEARTBEAT_INTERVAL seconds (specified in the config). If a worker node does not send a heartbeat in 2 * HEARTBEAT_INTERVAL seconds, the API declares it dead. The API checks for dead worker nodes every HEARTBEAT_INTERVAL seconds using periodic callbacks. In case the grader crashes while executing a grading job, the API will declare it dead (since it will stop receiving heartbeats) and will mark that grading job as failed.
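A minimal sketch of the periodic dead-worker check is given below. The function name, the dictionaries used for bookkeeping, and the value of HEARTBEAT_INTERVAL are assumptions. Treating a worker as dead only when strictly more than 2 * HEARTBEAT_INTERVAL seconds have passed is also an assumption about the boundary case.

```python
HEARTBEAT_INTERVAL = 10  # seconds; hypothetical value, normally read from the config


def reap_dead_workers(last_heartbeat, running_job, job_state, now,
                      interval=HEARTBEAT_INTERVAL):
    """Declare silent workers dead and fail the jobs they were running.

    last_heartbeat: worker id -> timestamp of the last heartbeat
    running_job:    worker id -> id of the job it is executing (if any)
    job_state:      job id -> state string
    """
    dead = sorted(w for w, seen in last_heartbeat.items()
                  if now - seen > 2 * interval)
    for worker in dead:
        del last_heartbeat[worker]
        job = running_job.pop(worker, None)
        if job is not None:
            job_state[job] = "failed"   # the worker crashed mid-job
    return dead


# Example: w2 has been silent for 25 seconds, which exceeds 2 * 10.
last_heartbeat = {"w1": 100.0, "w2": 75.0}
running_job = {"w2": "job-7"}
job_state = {"job-7": "running"}
dead = reap_dead_workers(last_heartbeat, running_job, job_state, now=100.0)
```

In the API this check would be invoked from a periodic callback every HEARTBEAT_INTERVAL seconds.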

Job Queue

Scheduling is done by pushing grading jobs onto a queue once they are ready to be run. The graders poll this queue periodically. The status code for the poll request is set to QUEUE_EMPTY_CODE if the queue is empty. The following properties have to be satisfied for a grading job to be on the queue:

  • A student-job is on the queue if and only if either:
    • no pre-processing job exists for the grading run
    • the pre-processing job exists for the grading run and has already been executed and marked as succeeded
  • The post-processing job is on the queue if and only if all student jobs have been executed and marked as succeeded or failed.

The queue allows for concurrent runs from multiple courses. The job queue can be populated with grading jobs belonging to different grading runs as long as the above properties hold true.
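The queue properties above can be sketched with one small scheduler object per grading run, all sharing a single queue. The RunScheduler class and its method names are hypothetical, and the sketch assumes every run has at least one student job.

```python
from collections import deque


class RunScheduler:
    """Decides when the jobs of one grading run may enter the shared queue."""

    def __init__(self, queue, student_jobs, pre_job=None, post_job=None):
        self.queue = queue
        self.student_jobs = list(student_jobs)
        self.pre_job = pre_job
        self.post_job = post_job
        self.pending_students = set(self.student_jobs)

    def start(self):
        if self.pre_job is not None:
            self.queue.append(self.pre_job)        # student jobs wait for it
        else:
            self.queue.extend(self.student_jobs)   # no pre-processing job

    def on_result(self, job, succeeded):
        if self.pre_job is not None and job == self.pre_job:
            if succeeded:                          # on failure nothing else is scheduled
                self.queue.extend(self.student_jobs)
            return
        if job in self.pending_students:
            # Success or failure both count as "finished" for student jobs.
            self.pending_students.discard(job)
            if not self.pending_students and self.post_job is not None:
                self.queue.append(self.post_job)


queue = deque()
run = RunScheduler(queue, ["s1", "s2"], pre_job="pre", post_job="post")
run.start()
after_start = list(queue)          # only the pre-processing job
queue.clear()
run.on_result("pre", True)
after_pre = list(queue)            # both student jobs
queue.clear()
run.on_result("s1", False)
run.on_result("s2", True)
after_students = list(queue)       # the post-processing job
```

Because each scheduler only appends jobs whose preconditions hold, several of them can feed the same queue without interfering with one another.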

Authentication

All requests are authorized through auth tokens which are passed in the header of the requests.

A cluster token is determined when the API is started. This cluster token is used to authenticate requests from the graders. Hence all the worker endpoints are authenticated with the cluster token. This cluster token has to be handed to the graders to start them.

On the other hand, each course can specify a list of tokens to authenticate requests pertaining to their course. They can use any of these to make requests for themselves. This prevents courses from making requests for each other and starting grading runs for each other.
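The two kinds of token checks could look roughly like this. The function names and the course_tokens mapping are assumptions for illustration, and the sketch does not show how the token is read from the request header. hmac.compare_digest is used so that the comparison time does not depend on where the strings first differ.

```python
import hmac


def token_matches(presented: str, allowed) -> bool:
    """True if the presented token equals any of the allowed tokens."""
    return any(
        hmac.compare_digest(presented.encode(), token.encode())
        for token in allowed
    )


def authorize_worker(presented: str, cluster_token: str) -> bool:
    # Worker endpoints accept only the cluster token.
    return token_matches(presented, [cluster_token])


def authorize_course(presented: str, course_id: str, course_tokens: dict) -> bool:
    # Course endpoints accept only tokens registered for that same course.
    return token_matches(presented, course_tokens.get(course_id, []))


course_tokens = {"cs101": ["alpha", "beta"], "cs202": ["gamma"]}
```

With this mapping, a request carrying "gamma" is accepted for cs202 but rejected for cs101, which is what stops one course from starting runs for another.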

Design Considerations

File sharing between stages

All containers (representing grading stages) in a grading job can share files among themselves using the /job/ directory inside the container. Before the start of any grading job, we create a temporary directory on the local file system (it is completely destroyed once the job is over) and mount it into every container of that grading job at the path /job/. So, for example, if the first container writes to /job/file.txt, subsequent containers will see that file at the same path.
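One way to picture the shared directory is the sketch below, which creates a temporary directory and builds a docker run command line for each stage with that directory bind-mounted at /job/. The image names and the docker_commands helper are hypothetical, and the commands are only constructed here, not executed. This is not necessarily how the grader invokes Docker.

```python
import os
import tempfile


def docker_commands(images, job_dir):
    """Build one `docker run` invocation per stage, all sharing job_dir."""
    return [
        ["docker", "run", "--rm", "-v", f"{job_dir}:/job/", image]
        for image in images
    ]


with tempfile.TemporaryDirectory() as job_dir:
    existed_during_job = os.path.isdir(job_dir)
    commands = docker_commands(["stage-one", "stage-two"], job_dir)
    # Each command would be run in order (for example with subprocess.run);
    # a file written to /job/file.txt by stage-one is visible to stage-two.

# Leaving the `with` block deletes the directory and everything in it.
exists_after_job = os.path.exists(job_dir)
```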

Publishing grades at the same time

A course might want to publish grades for all students at the same time, as opposed to releasing each student's grade as soon as their grading job finishes. Grading runs can sometimes take a long time, and it would be unfair to students who get their results much later than others.

A course can build a service which is responsible for collecting results and publishing them. The pre-processing job can register the grading run with that service. Each student grading job can generate a result (in any form) and post it to the service. The post-processing job can then signal the service to publish the grades.
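As a rough illustration of such a service, the in-memory sketch below holds results back until it is told to publish. The GradePublisher class and its method names are hypothetical. A real service would sit behind HTTP endpoints and persist its data.

```python
class GradePublisher:
    """Collects per-student results and releases them all at once."""

    def __init__(self):
        self._collecting = {}   # run id -> {student id: result}
        self.published = {}     # run id -> {student id: result}

    def register(self, run_id):
        # Called from the pre-processing job.
        self._collecting[run_id] = {}

    def submit(self, run_id, student_id, result):
        # Called from each student grading job.
        self._collecting[run_id][student_id] = result

    def publish(self, run_id):
        # Called from the post-processing job; results become visible together.
        self.published[run_id] = self._collecting.pop(run_id)
        return self.published[run_id]


service = GradePublisher()
service.register("mp1")
service.submit("mp1", "alice", 95)
service.submit("mp1", "bob", 80)
visible_before = dict(service.published)
grades = service.publish("mp1")
```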

Automation

We suggest that courses build a CLI which can generate the required configs and call Broadway endpoints to trigger runs at certain times. They could also look at Broadway On-Demand, an open-source web app built around the Broadway API. It lets students request their own autograder (AG) runs and lets the course staff view the grading job logs and apply various AG run policies.

Admin Role

Note that this design entails an admin role. Only the admin has access to the MongoDB instance, API instance, cluster token, and all course tokens. The admin is responsible for:

  • Collecting the list of auth tokens for every course.
  • Starting the API with the course tokens.
  • Extracting the cluster token.
  • Starting the graders using the cluster token and adding them to the grading cluster.
  • Maintaining the grading cluster and keeping track of its health.
  • Scaling the grading cluster size based on system load, demand, and traffic.