MCQs in the age of AI

I shared this guide on Writing Good Multiple Choice Test Questions on social media, and the first response I got on both LinkedIn and BlueSky was whether MCQs hold up in the age of AI. As first year lead for many years, I’ve put a huge amount of thought into this problem, so if you’re interested in all those thoughts, buckle up.

I am assuming that the long-form version of the question “do MCQs hold up in the age of AI?” is really “does the academic integrity of online open-book MCQs hold up in the age of AI?” And the answer to that question is quite categorically no, but let me take a little detour through what we did when we moved online for Covid, because it also serves as a defence of MCQs, which is honestly the real point of this post.

High-stakes MCQ exams

In 2020, our Level 1 exam was an 80-question MCQ exam that students had 60 minutes to complete. It was worth 40% of their final course grade, with the rest coming from an essay and an engagement portfolio of low-stakes data skills, weekly quizzes, and research participation. In the first semester of 2020, we ran the exam online using much the same questions that we’d used when the exam was in person, and it was a disaster. The distribution that had held for years crumbled. The average grade was a high A.

Now before anyone calls me a demon for decrying students performing well, I’ve got no issue with assessments where the grades skew high because they’ve all learned a lot and studied hard, but this was not that. The issue was that we’d given them factual MCQs that could be easily Googled, for example:

Which of the following best describes a between-subjects design in experimental research?

  1. The same participants take part in all conditions of the experiment.
  2. Different groups of participants experience only one condition each.
  3. Participants are observed in their natural environment without intervention.
  4. The experiment includes both repeated measures and independent groups.

So we burned down the exam questions and started again. I’m continually amazed by my team; they’re wonderful people who are truly excellent at their jobs and care deeply about learning and teaching, and I can’t overstate the work they put into this. We consulted the literature on best practice for MCQs (Blake Harvard has a great blog about how to make more effective MCQs) and rewrote all the questions so that they required students to apply their knowledge and, crucially, could not be easily answered by Google at the time. For example, the factual MCQ above became the applied MCQ below:

A researcher designs an experiment where participants either drink caffeinated coffee or decaffeinated coffee and then take part in a reaction time test. Based on the information provided, what is the design of this experiment?

  1. Between-subjects
  2. Within-subjects
  3. Mixed-design
  4. Case study

The distribution returned to a peak of a low to middling B / 2:1, but more than that, I think we created a much, much better exam, one that measured their knowledge and their ability to apply it, rather than their memory. They could have all their notes with them, but if they hadn’t put in the time to understand what they were being taught, it wouldn’t help much. One of my key frustrations with the discussion over exams is the default assumption that they are all bad. Let’s say it again for the people at the back: “EXAMS ARE NOT BAD, BAD EXAMS ARE BAD”. If you give students a 100% closed-book exam that only tests how well they can remember a bunch of names and dates, with no thought to accessibility, and little in the way of learning support throughout the semester, then sure, exams are shit.

But rote memorisation is not the same as knowledge. And students need knowledge. You can’t critically evaluate something you don’t know. There is a very robust literature on the impact of prior knowledge on learning new knowledge and skills, and for many introductory courses (like first year psychology), building that knowledge base so that students can go on and do more interesting things is the point. Please stop asking me if I’d like to replace my first year exam with a podcast or experiential learning. I am happy to die on the hill that a) “authentic assessment” as a response to every question about assessment and feedback has become a hollow, meaningless slogan for people without true expertise in effective learning and teaching (see also: active learning) and b) having a broad knowledge base of psychological concepts and theories is an authentic part of being a psychologist.

And this is actually my main concern when it comes to AI. Yes, academic integrity is important, but for me, the fact that they won’t learn anything is what keeps me up at night. If there were some way they could use AI to cheat but still develop all that core knowledge, then I would honestly care less. But what I’ve seen over the last few years is a slow creep of students who are unable to tackle those more interesting tasks of analysing, evaluating, and creating, because either they, or their education system, skipped over developing core knowledge. There is also unquestionably a workload component to MCQ exams. I have 600-700 students on my course each year, and MCQ exams provide an effective and efficient method of testing their knowledge acquisition. That’s important.

What was the point of this blog post again? Oh right. If you have a high-stakes summative open-book MCQ exam, it needs to go back in person, rather than in the bin. Whether it’s susceptible to AI is an entirely different discussion to whether it’s still a useful assessment.

Low-stakes MCQs

You might think I’ve written enough but I also want to discuss the issue of low-stakes MCQs, because we’ve done a lot of work on those as well.

In our first semester Level 1 course, we had two types of low-stakes MCQs. First, there were weekly MCQs related to the lecture content, and these originally carried a mark for participation (5% in total): there’s also a robust literature on the impact of practice testing and distributed practice on learning, and the quizzes supported students to study continuously. Second, we had an open-book MCQ about data skills and programming in R that was worth 5% of their grade and arrived in week 6 of term. Pre-AI, this MCQ was very, very effective at identifying students who had been keeping up with their data skills work and those who had disengaged / were trying to cram. Post-AI, pointless.

But again, the issue here isn’t academic integrity. These are low-stakes assessments where we expected the grades to skew high. I don’t care if they all get As; I care if they’re learning. So our response has been to change the grading scheme rather than the MCQs themselves. This year, students get two attempts at each MCQ (I must acknowledge that I stole this idea from Dr. Carolina Kuepper-Tetzel, Learning Scientist and all-round excellent work wife). On the first attempt, they’re instructed not to use any notes and instead treat it as a pure test of their knowledge. After the first attempt, they see which questions they got right and wrong, and then on the second attempt, they can use whatever they want to improve their score.

By doing this, they still get the boost that comes from practice testing, and they also get feedback on how their learning is progressing. But the second attempt means that they still get the grade, and again, it’s low-stakes and the assessment load of the course has been built to withstand a high skew on these components.
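If it helps to see the mechanics, here is a minimal sketch in R of how marking works under the two-attempt scheme. This is illustrative only: the data, column names, and the 20-point scale are my assumptions for the example, not the course’s actual marking code. The closed-notes first attempt is kept purely as feedback on learning; the open-book second attempt is the one that counts.

```r
# Illustrative data: one row per student, scores out of 20 on each attempt.
# All names and numbers here are made up for the example.
quiz <- data.frame(
  student  = c("A", "B", "C", "D"),
  attempt1 = c(12, 7, 15, 9),   # closed-notes first attempt (knowledge check)
  attempt2 = c(18, 16, 19, 14)  # open-book second attempt (counts for the grade)
)

# The grade comes from the second attempt only.
quiz$grade <- quiz$attempt2 / 20 * 100

# Comparing the two distributions shows how learning is progressing; a
# first-attempt distribution that looks just like the open-book one is the
# kind of pattern discussed in the next paragraph.
summary(quiz$attempt1)
summary(quiz$attempt2)
```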

More importantly, I hope that what we’re doing is implicitly and explicitly teaching students about effective study strategies and the difference between assessment for learning and assessment for grades, and that we’re targeting motivation rather than just banging on about misconduct. Of course, students could still use AI to complete the first attempt, but particularly when it comes to the data skills MCQ, it’s going to be very easy to spot them in the first-attempt distribution (very few students used to get full marks on that test). I’m not going to use this as some sort of integrity test (see again: not a demon), but I can use it to make the point that if these grades are your own work, this is amazing, but if they represent the use of AI, here’s why you’re only hurting yourself.

Emily Nordmann
Senior Lecturer in Psychology

I am a teaching-focused Senior Lecturer and conduct research into the relationship between learning, student engagement, and technology.