I’ve been slowly
working my way through Making Good Progress? The Future of Assessment for
Learning by Daisy Christodoulou. It’s one of the clearest expositions of the differences between formative and summative assessment, and of how teachers can improve their implementation of both. While
the book is geared towards grade school, as a college instructor I still found it
enlightening. The book makes excellent points with well-chosen examples, and is
very carefully and clearly written. It draws from the work of Dylan Wiliam, who
has written the equally excellent (but longer and more detailed) Embedded Formative Assessment.
Today’s post
focuses on Chapter 3 of Christodoulou’s book, “Making Valid Inferences”. I will
pose this as a series of questions, with quotations from the book in italics in
answer to each question.
What is the key
purpose of summative assessments?
[Summative]
judgements have to be shared and consistent across different schools, different
teachers and different pupils [students].
How does summative
assessment differ from formative assessment?
The purpose of
a formative assessment, by contrast, is to give teachers and pupils information
that forms the basis for successful action in improving performance… [to] give
them a better idea about what they should do next.
Can you use the same
assessment tool for both formative and summative assessment?
It is possible…
for example, when a teacher administers a paper from a previous year in order
to help students prepare for an exam… evidence from the test can be used to
produce a shared meaning, such as a grade, and some useful consequences, such
as identifying certain areas of relative strength or weakness that a pupil
needs to work on.
[But] different
purposes pull assessments in different directions… for example, in a classroom
discussion a pupil may display ‘a frown of puzzlement’ …[which] tells the
teachers something extremely useful and allows them to adjust their teaching
accordingly… however [it] is of much less value in providing a shared summative
meaning. The reverse is also true. Knowing that a pupil got a grade ‘A’ on a
formal test is an accurate shared meaning, but it provides a teacher with
relatively little information that will change their teaching.
Why does designing such dual-purpose assessments involve tricky trade-offs?
The purpose to
which an assessment is going to be put does impact on its design, which makes
it harder to simplistically ‘repurpose’ assessments.
To bring home her
argument, Christodoulou defines two key terms: Validity and Reliability.
In an assessment
displaying high validity, we should be able to say something about the
student’s knowledge in the domain being tested (assuming a standard set of
learning outcomes for the same course). “Validity refers not to a test or
assessment itself but to the inferences we make based on the test results.” If
the student earned an ‘A’ on a standardized math test, can we infer something
about the student’s effective numeracy capabilities? What would a ‘B’ or ‘C’
tell us? So, what we infer about the student’s knowledge in the assessed domain
is the key point here. We can’t practically test the entire domain,
particularly at more advanced levels, so the inference is typically made on a
sample of the domain, which further complicates the issue. If a different set of
test questions was used that sampled the domain in a roughly similar way, would
the student perform similarly?
That’s where Reliability
comes in. Is the measure reliable? Yes, if students taking a different version of the same exam show similar performance. If the exam is taken at a different time in a different room, performance should be similar. If the exam is graded by someone different, the result should be similar. Practically,
this is a thorny issue. The issue of different graders could be reduced by
using multiple choice questions rather than open-ended ones. Making different
versions of an exam that ‘test the same’ is tricky. And every individual
student is different – one might be having a bad day, one might be approaching
food coma just having eaten lunch, one might be affected more by noise or
lighting.
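To make this a little more concrete, here is a rough sketch of my own (not from the book, and with entirely made-up numbers): one common way to quantify reliability is parallel-forms reliability, the correlation between scores on two versions of an exam that sample the same domain.

    # A rough sketch (mine, not the book's; all numbers invented): estimate
    # parallel-forms reliability as the correlation between scores on two
    # versions of an exam that sample the same domain.
    import numpy as np

    rng = np.random.default_rng(0)
    n_students = 200

    # Hypothetical "true" knowledge of the domain for each student.
    true_ability = rng.normal(70, 10, n_students)

    # Each exam version samples the domain imperfectly, so each score is the
    # true ability plus version-specific measurement noise (question choice,
    # graders, the student's day, the room...).
    noise_sd = 8.0
    form_a = true_ability + rng.normal(0, noise_sd, n_students)
    form_b = true_ability + rng.normal(0, noise_sd, n_students)

    # Reliability estimate: how strongly the two versions agree.
    reliability = np.corrcoef(form_a, form_b)[0, 1]
    print(f"Estimated parallel-forms reliability: {reliability:.2f}")

Shrink the noise (better question sampling, more consistent grading, calmer testing conditions) and the correlation climbs toward 1; the sources of variation listed above drag it toward 0.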
What sort of error
bars are we looking at when we try to measure reliability? Christodoulou
provides some data (and plenty of references). I recommend reading her book to
see what this looks like, but I can tell you that it’s tricky. She also
usefully divides up assessments into two types – a Quality model and a
Difficulty model. In the Quality model, “students perform tasks and [the] marker judges how well they performed”. This is how one might assess a more open-ended
piece of work, say an essay or an artistic performance. In the Difficulty
model, “students answer a series of questions with increasing difficulty”. This
is what one might see in math and science exams.
There is a further
complication. The relationship between Validity and Reliability is asymmetric.
A reliable assessment may show poor validity. Christodoulou provides the
example of doing away with writing assignments and essays, and reducing
everything you’re trying to measure to multiple-choice questions on grammar.
But if you want to know whether a student can write effectively, the
multiple-choice exam has poor validity even if it might show high reliability.
On the other hand, a test with low reliability makes validity increasingly
questionable. If the student performs significantly differently on a different
day or with a different version of the same exam, what can you infer from the
results? It’s highly unclear.
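To see the asymmetry in a toy simulation of my own (invented numbers, not the book’s data): a grammar-only multiple-choice test can agree with itself almost perfectly across sittings, yet correlate only weakly with the writing ability we actually want to infer.

    # A toy simulation (mine, with invented numbers): a highly reliable test
    # can still support weak inferences about the skill we care about.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 300

    # What we really want to infer: writing ability.
    writing_skill = rng.normal(0, 1, n)

    # In this sketch, grammar knowledge is only loosely related to writing.
    grammar_knowledge = 0.3 * writing_skill + rng.normal(0, 1, n)

    # Two sittings of a grammar-only multiple-choice test: tiny measurement
    # noise, so the test agrees with itself almost perfectly (high reliability)...
    mcq_sitting_1 = grammar_knowledge + rng.normal(0, 0.1, n)
    mcq_sitting_2 = grammar_knowledge + rng.normal(0, 0.1, n)
    print("reliability (test vs. itself):",
          round(float(np.corrcoef(mcq_sitting_1, mcq_sitting_2)[0, 1]), 2))

    # ...but it says little about writing, so inferences about writing ability
    # from this test have poor validity.
    print("validity proxy (test vs. writing):",
          round(float(np.corrcoef(mcq_sitting_1, writing_skill)[0, 1]), 2))

High self-agreement does nothing to rescue the inference if the test is measuring the wrong thing.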
How does one go
about designing such assessments? What do grades signify and how can we use
them productively? We will tackle this and more in a subsequent post.