Thursday, November 15, 2018

Validity and Reliability


I’ve been slowly working my way through Making Good Progress? The Future of Assessment for Learning by Daisy Christodoulou. It’s one of the clearest expositions of the differences between formative and summative assessment, and of how teachers can improve their implementation of both. While the book is geared towards grade school, as a college instructor I still found it enlightening. The book makes excellent points with well-chosen examples, and is very carefully and clearly written. It draws from the work of Dylan Wiliam, who has written the equally excellent (but longer and more detailed) Embedded Formative Assessment.


Today’s post focuses on Chapter 3 of Christodoulou’s book, “Making Valid Inferences”. I will pose this as a series of questions, with quotations from the book in italics in answer to each question.

What is the key purpose of summative assessments?

[Summative] judgements have to be shared and consistent across different schools, different teachers and different pupils [students].

How does summative assessment differ from formative assessment?

The purpose of a formative assessment, by contrast, is to give teachers and pupils information that forms the basis for successful action in improving performance… [to] give them a better idea about what they should do next.

Can you use the same assessment tool for both formative and summative assessment?

It is possible… for example, when a teacher administers a paper from a previous year in order to help students prepare for an exam… evidence from the test can be used to produce a shared meaning, such as a grade, and some useful consequences, such as identifying certain areas of relative strength or weakness that a pupil needs to work on.

[But] different purposes pull assessments in different directions… for example, in a classroom discussion a pupil may display ‘a frown of puzzlement’ …[which] tells the teachers something extremely useful and allows them to adjust their teaching accordingly… however [it] is of much less value in providing a shared summative meaning. The reverse is also true. Knowing that a pupil got a grade ‘A’ on a formal test is an accurate shared meaning, but it provides a teacher with relatively little information that will change their teaching.

Why is designing such dual-purpose assessments tricky, and what trade-offs does it involve?

The purpose to which an assessment is going to be put does impact on its design, which makes it harder to simplistically ‘repurpose’ assessments.

To bring home her argument, Christodoulou defines two key terms: Validity and Reliability.

In an assessment displaying high validity, we should be able to say something about the student’s knowledge in the domain being tested (assuming a standard set of learning outcomes for the same course). “Validity refers not to a test or assessment itself but to the inferences we make based on the test results.” If the student earned an ‘A’ on a standardized math test, can we infer something about the student’s effective numeracy capabilities? What would a ‘B’ or ‘C’ tell us? So, what we infer about the student’s knowledge in the assessed domain is the key point here. We can’t practically test the entire domain, particularly at more advanced levels, so the inference is typically made from a sample of the domain, which further complicates the issue. If a different set of test questions was used that sampled the domain in a roughly similar way, would the student perform similarly?

That’s where Reliability comes in. Is the measure reliable? Yes, if students taking a different version of the same exam show similar performance. If the exam was taken at a different time in a different room, performance should be similar. If the exam was graded by someone different, the result should be similar. In practice, this is a thorny issue. The issue of different graders could be reduced by using multiple-choice questions rather than open-ended ones. Making different versions of an exam that ‘test the same’ is tricky. And every individual student is different – one might be having a bad day, one might be approaching a food coma having just eaten lunch, one might be affected more by noise or lighting.
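
To make that concrete, here is a little Python sketch of my own (not from the book, and with entirely invented numbers): simulate a class of students with some true ability, give them two noisy versions of the same exam, and take the correlation between the two sets of scores as a parallel-forms reliability estimate.

# A toy sketch of a parallel-forms reliability check. The ability scale,
# noise levels, and class size are all made-up assumptions; the point is
# simply that reliability can be estimated as the correlation between
# scores on two versions of "the same" exam.
import numpy as np

rng = np.random.default_rng(0)
n_students = 200

# Assume each student has a stable "true" ability on a 0-100 scale.
true_ability = rng.normal(70, 12, n_students)

# Each exam form measures that ability plus form-specific noise
# (question sampling, bad days, grader wobble, and so on).
form_a = true_ability + rng.normal(0, 6, n_students)
form_b = true_ability + rng.normal(0, 6, n_students)

# Correlation between the two forms: one common reliability estimate.
reliability = np.corrcoef(form_a, form_b)[0, 1]
print(f"Parallel-forms reliability estimate: {reliability:.2f}")

With the variances assumed above, the estimate lands around 0.8; increase the form-specific noise and it drops accordingly.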

What sort of error bars are we looking at when we try to measure reliability? Christodoulou provides some data (and plenty of references). I recommend reading her book to see what this looks like, but I can tell you that it’s tricky. She also usefully divides assessments into two types – a Quality model and a Difficulty model. In the Quality model, “students perform tasks and marker judges how well they performed”. This is how one might assess a more open-ended piece of work, say an essay or an artistic performance. In the Difficulty model, “students answer a series of questions with increasing difficulty”. This is what one might see in math and science exams.
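
To put a rough number on those error bars, psychometricians often convert a reliability coefficient into a standard error of measurement using SEM = SD × √(1 − reliability). The sketch below uses illustrative values of my own choosing, not figures from Christodoulou’s data.

# A rough sense of the "error bars": the standard psychometric formula
# SEM = SD * sqrt(1 - reliability) turns a reliability coefficient into a
# standard error of measurement. The SD and reliability below are
# illustrative numbers, not figures from the book.
import math

score_sd = 12.0       # spread of scores across the cohort
reliability = 0.85    # a fairly respectable reliability coefficient

sem = score_sd * math.sqrt(1 - reliability)

# A roughly 68% band (plus or minus one SEM) around an observed score.
observed = 78
low, high = observed - sem, observed + sem
print(f"SEM = {sem:.1f} points; a score of {observed} plausibly reflects {low:.0f}-{high:.0f}")

Even with a reliability of 0.85, the one-SEM band around a score of 78 spans roughly 73 to 83 – easily wide enough to straddle a grade boundary.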

There is a further complication. The relationship between Validity and Reliability is asymmetric. A reliable assessment may show poor validity. Christodoulou provides the example of doing away with writing assignments and essays, and reducing everything you’re trying to measure to multiple-choice questions on grammar. But if you want to know whether a student can write effectively, the multiple-choice exam has poor validity even if it might show high reliability. On the other hand, a test with low reliability makes validity increasingly questionable. If the student performs significantly differently on a different day or with a different version of the same exam, what can you infer from the results? It’s highly unclear.
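
Here is one last toy simulation of that asymmetry (again my own construction, with made-up parameters): a grammar-only multiple-choice test that is almost perfectly consistent across retakes, yet only loosely related to the writing ability we actually care about.

# A toy illustration of "reliable but not valid": a grammar-only
# multiple-choice test that agrees with itself almost perfectly across
# retakes, yet says much less about the writing ability we actually want
# to infer. All of the numbers below are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 300

writing_ability = rng.normal(0, 1, n)                              # what we want to measure
grammar_knowledge = 0.4 * writing_ability + rng.normal(0, 1, n)    # related, but only weakly

# The multiple-choice grammar test has very little measurement noise...
mcq_take_1 = grammar_knowledge + rng.normal(0, 0.1, n)
mcq_take_2 = grammar_knowledge + rng.normal(0, 0.1, n)

# ...so it is highly reliable, but its validity for writing is modest.
print("Reliability (retake correlation):",
      round(np.corrcoef(mcq_take_1, mcq_take_2)[0, 1], 2))
print("Correlation with writing ability:",
      round(np.corrcoef(mcq_take_1, writing_ability)[0, 1], 2))

In this setup the retake correlation comes out near 0.99 while the correlation with writing ability hovers around 0.4: highly reliable, and not very valid as a measure of writing.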

How does one go about designing such assessments? What do grades signify and how can we use them productively? We will tackle this and more in a subsequent post.
