Monday, October 23, 2023

Tests and Polls

I had not considered that polling has a remarkable amount in common with school exams and tests. This epiphany comes from reading Measuring Up by Daniel Koretz, an education professor at Harvard. His area of expertise is the impact of high-stakes testing in American schools at the K-12 level. I’ve only read the first three chapters and there is already lots of meat for me to chew on. Koretz begins the prologue by noting the ubiquity of testing, and that “achievement testing seems reassuringly straightforward and commonsensical: we give students tasks to perform, see how they do on them, and thereby judge how successful they or their schools are.”

 

Then he proceeds to dismantle the notion. From what I gather the point of the book is to give the reader a sense of the complex enterprise of achievement testing, and discuss why “test scores are widely misunderstood and misused”. And that misuse has far-reaching ramifications. Koretz knows that he’s wading into a minefield. He thinks that test scores can be valuable but only if one understands their strengths and limitations. The problem with us human beings is that we want quick and easy answers. We quickly gravitate to numbers and things we can count – surprising, perhaps, given the general phobia that many have towards math.

 

The fundamental issue, according to Koretz, is that “test scores do not provide a direct and complete measure of educational achievement. Rather, they are incomplete measures, proxies for the more comprehensive measures that we would ideally use but that are generally unavailable to us.” Why are they incomplete? First, “tests can measure only a subset of the goals of education”. Second, “tests are generally very small samples of behavior that we use to make estimates of students’ mastery of very large domains of knowledge and skill.” No surprises here.

 

So how is testing like polling? When conducting a poll we’re using the responses from a thousand or so people as a proxy to get a sense of how millions of people might answer those questions. Getting a representative sample is therefore crucial to polling. In the same way, when constructing a test, our hope is that by testing a limited yet representative sample of the subject matter we are getting a measure (of sorts) of the larger range of knowledge and/or skills we want our students to have learned. Just as it isn’t feasible to poll everybody, it isn’t feasible to ask questions to test every tidbit of knowledge covered in class.
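
For the numerically inclined, here is a minimal sketch of the polling idea in Python. Everything in it is an assumption made up for illustration: the population of one million voters, the 52% figure, and the helper name `estimate_share` are mine, not from Koretz’s book. It simply shows that a simple random sample of a thousand can stand in for the whole, with some sampling error.

```python
import random

def estimate_share(population, sample_size, seed=0):
    """Estimate the fraction of True values from a simple random sample.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)                      # fixed seed so the run is repeatable
    sample = rng.sample(population, sample_size)   # sampling without replacement
    return sum(sample) / sample_size

# Made-up population: 1,000,000 voters, 52% of whom favor some measure.
population = [True] * 520_000 + [False] * 480_000

true_share = sum(population) / len(population)     # what we would get by asking everyone
poll_estimate = estimate_share(population, 1_000)  # what a poll of 1,000 suggests

print(f"true share:    {true_share:.3f}")
print(f"poll estimate: {poll_estimate:.3f}")
```

The estimate will typically land within a few percentage points of the true share. The exam analogy is that a handful of well-chosen questions plays the role of the thousand respondents, and the whole course plays the role of the million voters.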

 

There’s an additional piece I hadn’t considered. In a poll for, say, voting outcomes, what you care about is what it tells you about how millions might have responded. You’re not actually concerned about the impact of the thousand people you polled – their specific votes won’t change the outcome of a large election. Similarly, in a test, we shouldn’t be worried about how students answered a particular question; what we care about is “the larger set of knowledge and skills it represents”. That being said, if all my G-Chem students bombed Lewis structures on an exam, I’d be concerned that they haven’t learned one of the most important things in the course.

 

Koretz also argues for the importance of standardization in exams. What it means is uniformity: “examinees face the same tasks, administered in the same manner and scored in the same way… to avoid irrelevant factors that might distort comparisons among individuals.” How do you go about doing this? By thinking carefully about what you’re asking on the test and whether your questions allow you to differentiate examinees by their knowledge and skills in the subject matter. Koretz emphasizes that you are not creating such differences by your test, rather you are “revealing differences that already existed”. Another point that struck home for me is when Koretz argues that validity “is the single most important criterion for evaluating achievement testing”. It’s not the validity of the tests themselves we’re concerned about, but rather the validity of what we are inferring from the test scores.

 

What about the incompleteness of the test with respect to the goals of education? Koretz cites the work of E. F. Lindquist, a giant in test construction; versions of his tests are used all over the country. Lindquist argues that “only some of these goals of education are amenable to standardized testing” and therefore one should never use only test scores to draw larger conclusions about students’ abilities. Furthermore, what instructors focus on in the classroom, being a proxy for the broader goal of education, can be far removed from that goal. The utility of what students are learning today may in some cases be revealed only years down the road and in an oblique fashion such that one doesn’t realize that the schooling provided the foundational knowledge or skill.

 

All this makes me think about how I construct my exams. I do try to ensure that the questions I ask are representative of the larger domain area. Can I validly draw the inference that a student who does well on my exams has a good grasp of the domain area? To some extent, but I’m not sure I have the quantitative data to prove it. There is no ‘clean’ experiment I’ve carried out to separate out any confounding variables. Do my exams differentiate the students? Yes, to some extent. Failing is rare in my classes, but there can be quite a number of C’s (and a few D’s) depending on the class and the year. There can also be quite a number of A’s and B’s, especially if I’m teaching the Honors section – the students are indeed academically stronger. Because my class sizes are small (not exceeding 40), there can be fluctuations in the distribution from one year to the next. But by and large, my grades have stayed steady over the years.

 

Reading Measuring Up reminded me of the incompleteness of exams. In G-Chem and P-Chem, my exams are the lion’s share of the grade. I think exams are reasonably good proxies in these classes, but not necessarily for other classes I’ve taught (labs, special topics, research methods). It made me stop to think about how I weight different aspects of my class and how I think about what a student’s grade in the class represents. I recognize that my interpretation of these matters may differ significantly from that of my colleagues, especially those outside the natural sciences. And that’s okay. I’m looking forward to reading the rest of Koretz’s book!
