Testing unit tests for Python code assignments in CodeGrade
Guides
February 15, 2021

Testing the Tests: Autograding student unit tests in Python assignments (using code coverage and mutation testing)

All computer scientists have to learn how to write effective unit tests at some point in their (academic) careers. Because of that, most computer science degrees now teach this skill in some form or another. Sometimes this happens in introductory courses, to lay a good foundation, but more often we see it in advanced Software Development courses.

For this guide, I have researched the best ways to autograde student unit tests for Python assignments and will explain step by step how you can automatically assess pytest unit tests in CodeGrade. The theory and principles covered also apply to other programming languages, and guides for those (e.g. Java’s JUnit) will follow soon. This blog explains how to grade unit tests; click here to learn how to grade Python code by using automated (unit) tests in CodeGrade.

Unit Test assessment metrics

There are multiple ways to assess unit tests; two industry-standard metrics will be covered in this guide: code coverage and mutation testing. They serve different purposes and differ in complexity.

  • Code Coverage is the most common metric to assess unit test quality. It simply measures what percentage of the source code is exercised by the unit tests you have written. Different measures can be used to calculate this, for instance the percentage of subroutines called or the percentage of instructions executed by the tests.
  • Mutation Testing is a more advanced metric to assess unit test quality that goes beyond simple code coverage. During mutation testing, a tool makes small changes to your code (known as mutations) that should break its behavior. It then checks whether each mutation makes your unit tests fail (what we want if our unit tests are correct and complete) or not (meaning our unit tests are incorrect or incomplete).

You may now be wondering why we would go as advanced as mutation testing to assess a simple unit test assignment. The answer is that even though the code coverage metric measures how many lines or instructions our unit tests cover, it does not tell us how well those lines are actually tested. Unit tests are code too: they can contain bugs or typos, or can miss certain edge cases. Mutation testing is very good at exposing exactly such bugs and omissions, making it a very useful metric to assess the quality of unit tests for educational purposes. Of course, a combination of both metrics is even more effective and useful to assess your students!
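To make this concrete, consider the small, hypothetical example below (the function and tests are made up for illustration and are not part of the example assignment). The first test already achieves 100% line coverage, yet a typical mutant that changes `>=` into `>` would survive it; only the added boundary test kills that mutant.

-!- CODE language-python -!-# Code under test: a single statement, so one test yields 100% line coverage
def is_adult(age):
    return age >= 18

# This test covers the line but never checks the boundary, so a mutant
# that changes '>=' into '>' would still pass it (the mutant survives)
def test_is_adult():
    assert is_adult(30)

# This boundary test kills that mutant: with '>', is_adult(18) is False
def test_is_adult_boundary():
    assert is_adult(18)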

Code Coverage Assessment for Python using pytest-cov

As it is by far the easiest to set up, we will start by assessing the student unit tests on code coverage in AutoTest. This is easy to autograde, as the resulting coverage percentage can simply be converted to the points we give our students.

Pytest, one of the most common unit testing frameworks for Python and supported out of the box by CodeGrade, has a really nifty package to calculate code coverage called pytest-cov. We can install this in the Global Setup script together with installing cg-pytest, using `cg-pytest install pytest-cov`. It is now very easy to run our student unit tests while also reporting on their test coverage: in the unit test step of one of your AutoTest categories, just add `-- --cov=. --cov-report xml:cov.xml` to your usual command. This will generate a coverage report in XML format called `cov.xml`. Your final command for your unit test step will become:

-!- CODE language-console -!-cg-pytest run -- --cov=. --cov-report xml:cov.xml $FIXTURES/test_sudoku.py

Now, all we have to do is parse the coverage report and use the percentage to give our students points. Luckily, CodeGrade’s Capture Points step is exactly what we need for this. I wrote a small script called `print_cov.py` and uploaded it as a fixture. Running it in the Capture Points step will print the coverage rate, which is then used to calculate the number of points earned for this test.
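The exact contents of `print_cov.py` are not listed here, but a minimal sketch could look like this, assuming the standard coverage.py XML format in which the root `<coverage>` element carries the overall line coverage in its `line-rate` attribute:

-!- CODE language-python -!-import xml.etree.ElementTree as ET

# Parse the XML coverage report generated by pytest-cov
tree = ET.parse('cov.xml')

# The root <coverage> element stores the overall line coverage as a
# fraction between 0 and 1; print it for the Capture Points step
print(float(tree.getroot().get('line-rate')))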

Capturing the Code Coverage in CodeGrade

Mutation Testing for Python using mutmut

For most basic assignments, code coverage will be a good metric to easily and quickly assess the completeness of the unit tests that students write. For more advanced assignments, however, adding mutation testing can greatly improve the assessment by also grading the quality and correctness of the tests themselves.

After researching Python packages that do mutation testing, I found that mutmut is by far the easiest and best supported one. Additionally, I found a great article by Opensource.com, which shares some simple example code that you can use to try out mutmut and which I also use for this example assignment.
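The commands below assume a file `angle.py` for students to write tests against. As a purely hypothetical stand-in (not the actual code from the Opensource.com article), it could look something like this:

-!- CODE language-python -!-# angle.py: hypothetical stand-in for the example assignment code
def angle_type(degrees):
    """Classify an angle as acute, right, or obtuse."""
    if degrees < 90:
        return "acute"
    if degrees == 90:
        return "right"
    return "obtuse"

A student’s submitted `test_angle.py` would then contain pytest tests for this function.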

We can install mutmut in the same way as we did with pytest-cov above, in the Global Setup Script while installing cg-pytest: `cg-pytest install pytest-cov mutmut`. Afterwards, in the Per-student setup script, we move the unit tests submitted by the students to a test folder (you can also enforce this location using Hand In Requirements), using the command `mkdir tests; mv test_angle.py tests/test_angle.py;`. Note that pytest only discovers test files whose names start with `test_`; you can enforce this naming using Hand In Requirements as well.
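Putting the setup commands from this section together, the two scripts look like this:

-!- CODE language-console -!-# Global Setup script: install pytest-cov and mutmut alongside cg-pytest
cg-pytest install pytest-cov mutmut

# Per-student setup script: move the student's tests into a tests/ folder
mkdir tests; mv test_angle.py tests/test_angle.py;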

We are now ready to run mutmut. This can be done in a simple Run Program step using the command: `mutmut run --paths-to-mutate angle.py --tests-dir tests/ --runner "python3.7 -m pytest -x --assert=plain" || true`. In this command, we specify `angle.py` (uploaded by the student) as the file to mutate and point to the tests directory we set up in the Per-student setup script. We also select a runner to make sure Python 3.7, the version we installed pytest for, is used. We finish the command with `|| true`, which ensures the exit code is always successful, so the Run Program step always awards its point.

After running mutmut, we can use a Capture Points step again to capture the percentage of mutations killed. I use the following command to generate the `junitxml` report, write it to `mutmut.xml`, and then again use a simple Python script to parse and print it:

-!- CODE language-console -!-mutmut junitxml --suspicious-policy=ignore --untested-policy=ignore > mutmut.xml && python3 $FIXTURES/print_mutmut_score.py

-!- CODE language-python -!-import xml.etree.ElementTree as ET

# Parse the JUnit XML report generated by 'mutmut junitxml'
tree = ET.parse('mutmut.xml')
root = tree.getroot()

# Every mutant is reported as a test case; surviving mutants count as failures
failures = int(root.get('failures'))
tests = int(root.get('tests'))

# Print the fraction of mutants killed by the student's tests
print((tests - failures) / tests)

CodeGrade Mutation Test autograder category

Teaching unit testing for Python effectively

The tools explained in this guide will help you set up effective automatic grading for your Python assignments in which students have to create unit tests themselves. With CodeGrade’s continuous feedback, students can submit their code and unit tests many times and improve after getting instant feedback.

You can make this process even more effective by adding custom descriptions to your tests, as you can see I also did in the last screenshot in this guide. Many of the steps we covered use relatively large and hard-to-understand commands, which can be rather confusing for students. We have recently added the possibility to customize test descriptions in CodeGrade, which allows you to rename your tests to something that will help your students understand them better, for example:

Mutation Test Score: The percentage of mutations that got caught by your unit tests. 0% means none were caught and 100% means all were caught. (see chapter 3)


I hope this guide has helped you set up assignments in which you grade your students’ unit tests. As I mentioned, these metrics are applicable to many other programming languages, which will be covered in new guides very soon. Would you like to learn more about code coverage or mutation testing in CodeGrade? Or would you like to apply this to another language? Feel free to email me at support@codegrade.com and I’d be more than happy to help you out!

Devin Hillenius

Co-founder, Product Expert
Devin is co-founder and Product Expert at CodeGrade. During his Computer Science studies and work as a TA at the University of Amsterdam, he developed CodeGrade together with his co-founders to make their lives easier. Devin supports instructors with their programming courses, focusing on both their pedagogical needs and innovative technical possibilities. He also hosts CodeGrade's monthly webinar.
