Otter-Grader is a lightweight open-source command-line tool for developing and grading Jupyter Notebook assignments at scale. It enables instructors to produce an assignment and its autograder from just a single notebook.
Otter was developed by Christopher Pyles while he was working with Data Science Undergraduate Studies at UC Berkeley. Since its pilot in 2020, Otter has been adopted by instructors at a wide variety of institutions, from a university in Japan to a high school in North Carolina, and has been deployed in courses with enrollments ranging from 15 to 1500+.
Attendees will find our talk particularly useful if they’ve created notebooks for educational purposes and/or worked with grading infrastructure such as nbgrader or Gradescope.
Part 1: Authoring Assignments
We’ll start by demonstrating how to author assignment notebooks in Python using Otter.
One of the reasons Otter is so convenient is that an entire assignment and its autograder can be developed in a single “source” notebook. That notebook contains exposition, the solution code that students need to produce, inline autograder tests, and other metadata. After creating a source notebook, a single run of the otter assign command-line tool produces a student-facing version of the notebook. In this version, students see only the skeleton code their instructor wants them to start with (rather than the solution), and instead of the nitty-gritty details of every autograder test, they see only calls to the function grader.check, which reports the test cases their code fails for a given question.
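To make the source-to-student transformation concrete, here is a minimal Python sketch of the idea. It is not Otter's implementation: the `# BEGIN SOLUTION` / `# END SOLUTION` markers are modeled loosely on Otter's style, and `strip_solution` and `check` are hypothetical helpers. `check` is a toy stand-in for `grader.check` that runs predicate-style test cases against a student's namespace and reports the ones that fail.

```python
# Hypothetical, simplified markers; Otter's real source-notebook syntax is richer.
BEGIN, END = "# BEGIN SOLUTION", "# END SOLUTION"


def strip_solution(source: str) -> str:
    """Replace each marked solution block with an indented `...` placeholder."""
    out, in_solution = [], False
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == BEGIN:
            in_solution = True
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}...")  # skeleton placeholder students fill in
        elif stripped == END:
            in_solution = False
        elif not in_solution:
            out.append(line)  # keep non-solution lines verbatim
    return "\n".join(out)


def check(name, tests, env):
    """Toy stand-in for grader.check: run (description, predicate) pairs
    against the student's namespace `env` and report the failures."""
    failed = []
    for description, predicate in tests:
        try:
            ok = predicate(env)
        except Exception:  # an error raised by student code counts as a failure
            ok = False
        if not ok:
            failed.append(description)
    if failed:
        print(f"{name} failed:", *failed, sep="\n  ")
    else:
        print(f"{name} passed!")
    return failed


# Instructor's source cell -> student-facing skeleton.
source_cell = """def square(x):
    # BEGIN SOLUTION
    return x * x
    # END SOLUTION
"""
print(strip_solution(source_cell))

# A student's (buggy) attempt, checked against two test cases.
student_env = {}
exec("def square(x):\n    return x + x", student_env)
q1_tests = [
    ("square(3) == 9", lambda env: env["square"](3) == 9),
    ("square(0) == 0", lambda env: env["square"](0) == 0),
]
check("q1", q1_tests, student_env)
```

In Otter itself, students simply call `grader.check` with a question name; the test cases are supplied by Otter rather than passed in by hand as they are here.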
Part 2: Releasing and Collecting Assignments
In addition to creating a student-facing assignment notebook, otter assign also generates a portable autograder.zip file that instructors can run to compute grades. This autograder can be run anywhere that pip install otter-grader can be run: most commonly in a Docker container on a personal computer, or on Gradescope, a popular online grading platform.
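Conceptually, running the autograder boils down to executing each question's test cases against a submission and aggregating points. The sketch below shows only the aggregation step; `QuestionResult` and `score` are hypothetical names, and it assumes, for illustration, that a question's points are split evenly across its test cases rather than reflecting any particular Otter configuration.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    name: str
    points: float  # points available for the question
    passed: int    # number of test cases the submission passed
    total: int     # number of test cases for the question


def score(results):
    """Return per-question scores and the total, assuming (hypothetically)
    that each question's points are split evenly across its test cases."""
    per_question = {
        r.name: r.points * r.passed / r.total if r.total else 0.0
        for r in results
    }
    return per_question, sum(per_question.values())


per_question, total = score([
    QuestionResult("q1", points=2, passed=2, total=2),
    QuestionResult("q2", points=3, passed=1, total=3),
])
print(per_question, total)
```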
We will demonstrate a common workflow in use at multiple institutions, which involves:
- Hosting student-facing notebooks on GitHub.
- Providing students with nbgitpuller links that open the relevant assignment on an institution-hosted JupyterHub server.
- Configuring Gradescope to automatically run all autograder tests and provide feedback upon submission (or, alternatively, autograding submissions locally in a Docker container).
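nbgitpuller links are ordinary URLs with `repo`, `branch`, and `urlpath` query parameters on the hub's `git-pull` endpoint, so generating one per assignment is easy to script. The helper below is a sketch: the hub URL, repository, and file path in the example are placeholders, and it assumes the repository is cloned into a folder named after the repo, which is nbgitpuller's default.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub: str, repo: str, path: str,
                     branch: str = "main", interface: str = "lab") -> str:
    """Build an nbgitpuller link that, when opened, clones or updates `repo`
    on `hub` and opens `path` (relative to the repo root)."""
    repo_name = repo.rstrip("/").split("/")[-1].removesuffix(".git")
    # JupyterLab and the classic notebook UI use different URL prefixes.
    prefix = "lab/tree" if interface == "lab" else "tree"
    query = urlencode({
        "repo": repo,
        "branch": branch,
        "urlpath": f"{prefix}/{repo_name}/{path}",
    })
    return f"{hub.rstrip('/')}/hub/user-redirect/git-pull?{query}"


print(nbgitpuller_link(
    "https://hub.example.edu",                      # placeholder hub
    "https://github.com/example-course/materials",  # placeholder repo
    "hw01/hw01.ipynb",
))
```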
All of this will be illustrated from the perspective of both an instructor and a student.
Part 3: Adaptations, Shortcomings, and Future Plans
A common data science workflow is to use notebooks for exploration but to write permanent code in an IDE. In one of our courses, we promote this workflow by distributing assignments as notebooks containing the question prompts, while requiring students to submit their work in .py files. To support this use case, we wrote a wrapper around Otter that takes a notebook containing all problem descriptions, solutions, and test code, and generates a student-facing .py file containing skeleton function definitions. Because the wrapper generates our multi-format assignments from a single source document, it significantly reduces the likelihood of errors. We will start the third part of the talk by discussing the motivation behind this type of assignment, how we used nbconvert to support the adaptation, and how much easier this adaptation makes creating and editing these assignments compared with our prior, pre-Otter solution.
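The core of such a wrapper can be sketched in a few lines. This is not our actual wrapper, which is built around nbconvert; to stay self-contained, the sketch reads the notebook's JSON structure directly and uses Python's `ast` module (Python 3.9+ for `ast.unparse`) to replace each function body with its docstring and `...`. The `export` cell tag used to select cells is a hypothetical convention.

```python
import ast


def skeletonize(source: str) -> str:
    """Keep each function's signature and docstring; replace its body with `...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node, clean=False)
            body = [ast.Expr(ast.Constant(value=doc))] if doc is not None else []
            body.append(ast.Expr(ast.Constant(value=Ellipsis)))
            node.body = body
    return ast.unparse(ast.fix_missing_locations(tree))


def notebook_to_skeleton(nb: dict) -> str:
    """Concatenate code cells tagged "export" (a hypothetical tag) and skeletonize them."""
    chunks = []
    for cell in nb["cells"]:
        tags = cell.get("metadata", {}).get("tags", [])
        if cell["cell_type"] == "code" and "export" in tags:
            src = cell["source"]
            # nbformat stores a cell's source as a string or a list of lines.
            chunks.append(src if isinstance(src, str) else "".join(src))
    return skeletonize("\n\n".join(chunks)) + "\n"


nb = {"cells": [
    {"cell_type": "markdown", "metadata": {}, "source": "## Question 1"},
    {"cell_type": "code", "metadata": {"tags": ["export"]},
     "source": ['def add(a, b):\n', '    """Return the sum of a and b."""\n', '    return a + b\n']},
    {"cell_type": "code", "metadata": {}, "source": "assert add(1, 2) == 3"},
]}
print(notebook_to_skeleton(nb))
```

Working on the syntax tree rather than on raw text means multi-line signatures and decorators survive intact, and the generated file is guaranteed to be syntactically valid Python.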
Then, more broadly, we will discuss shortcomings of Otter that have been identified by other instructors. Some shortcomings are pedagogical:
- One may argue that Otter’s presentation of test cases encourages students to “guess and check” their work.
- Additionally, depending on the domain, it can be difficult to craft autograder tests when students’ implementations vary significantly. (For instance, in cases where random sampling is involved.)
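A small illustration of the random-sampling problem: the two functions below are equally valid ways to compute resampled (bootstrap) means, but they draw from the random number generator differently, so even with the same seed they will typically produce different values. A test comparing against one exact expected value would therefore reject one of them, while a tolerance check on a summary statistic accepts both. The function names and the tolerance are illustrative choices, not Otter features.

```python
import random


def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)


def bootstrap_means_choices(data, n_resamples, seed):
    """Resample with replacement using Random.choices."""
    rng = random.Random(seed)
    return [mean(rng.choices(data, k=len(data))) for _ in range(n_resamples)]


def bootstrap_means_randrange(data, n_resamples, seed):
    """Same procedure, but drawing indices one at a time with Random.randrange."""
    rng = random.Random(seed)
    return [mean(data[rng.randrange(len(data))] for _ in data)
            for _ in range(n_resamples)]


def close_to_sample_mean(means, data, tol):
    """Tolerance-based test: the average resampled mean should be near the data's mean."""
    return abs(mean(means) - mean(data)) <= tol


data = list(range(100))
a = bootstrap_means_choices(data, 1000, seed=42)
b = bootstrap_means_randrange(data, 1000, seed=42)
print(close_to_sample_mean(a, data, tol=1.0), close_to_sample_mean(b, data, tol=1.0))
```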
Other shortcomings are infrastructural:
- The current Otter metadata syntax isn’t supported in third-party platforms like Google Colab.
- Otter does not support question randomization; for example, it can’t create “versions” of an assignment with questions in different orders, which can limit its usefulness for exams.
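For reference, the kind of versioning described above can be sketched as one seeded shuffle per version, so each version's question order is reproducible. This is not an Otter feature; `make_versions` and the version labels are hypothetical.

```python
import random
from string import ascii_uppercase


def make_versions(questions, n_versions, base_seed=0):
    """Return {label: question order}, one reproducible shuffle per version."""
    if n_versions > len(ascii_uppercase):
        raise ValueError("this sketch supports at most 26 versions")
    versions = {}
    for i in range(n_versions):
        order = list(questions)
        random.Random(base_seed + i).shuffle(order)  # one seeded RNG per version
        versions[f"Version {ascii_uppercase[i]}"] = order
    return versions


for label, order in make_versions(["q1", "q2", "q3", "q4"], n_versions=3).items():
    print(label, order)
```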
To conclude, we will summarize recent updates made to Otter and discuss planned future directions, including how we plan to address some of the aforementioned shortcomings.