Thursday, December 19, 2019

Grade Norming


“I grade on an absolute scale. There is no curve. This means that potentially everyone in the class could get an A. This is unlikely to happen. It also means that potentially everyone could fail. This is even less likely.”

That’s what I tell my students on the first day of class, in every single one of my classes – because the law of small numbers applies when one teaches in a liberal arts college with small-ish class sizes. My students know what this absolute scale is score-wise because it’s explicitly stated in the syllabus. Mostly, they are heartened by there not being a curve; they likely focus on the part where they might score A’s and not where they might fail. The reality at semester’s end is, in my intro-level chemistry courses, something resembling a normal distribution (with fatter tails). The mean and median grades depend on the size of the class and average student interest and ability. (In physical chemistry, the distribution is often bimodal.)

Larger institutions with huge introductory lecture courses might grade on a curve. In other countries, professors may not have the final say on the final grades because the university administration may impose grade norming. This sounds like anathema to faculty at U.S. institutions, but the issue is more complex than at first glance. I, for one, have no plans to change my grading policies. (I tell students I don’t assign their grades; rather students earn their grades.) That being said, it’s not because I’m absolutist. Rather I’m in a system that gives faculty autonomy over grades and I have particular ideas as to what constitutes A work, B work, C work, and so on. Also, the law of small numbers. (For more on the purpose of final grades, see here.)

All the above is to preface the hullabaloo this week in higher education circles over a white paper published by the National Bureau of Economic Research (NBER). The somewhat cryptic title is “Equilibrium Grade Inflation with Implications for Female Interest in STEM Majors”. I wouldn’t have noticed it if not for the Inside Higher Ed (IHE) article “Grading for STEM Equity”, with the provocative lede:

Study suggests that professors should standardize their grading curves, saying it’s an efficient way to boost women’s enrollment in STEM.

That definitely catches eyeballs, particularly since the article opens with:

Harsher grading policies in [STEM] courses disproportionately affect women – because women value good grades significantly more than men do, according to [NBER paper]. What to do? The study’s authors suggest instituting grading policies that equalize average grades across classes, such as curving all courses around a B grade. Beyond helping close STEM’s gender gap, they wrote, such a policy change would boost overall enrollment in STEM classes.

The IHE article is short, yet does a good job summarizing the main research points. Cue the extensive comments section: some comments are thoughtful; others clearly indicate their writers have not read the NBER paper (and skimmed the IHE article too quickly). None of this is surprising.

The paper itself is quite interesting. The suggestion of grade norm(aliz)ing around a B average comes from building an economic model based on extensive data from the University of Kentucky, and then applying counterfactuals to examine how students might sort themselves differently into majors and their associated classes. One can quibble with the model parameters; for example, I thought the professor utility function they applied was much too simplistic. Overall, though, I felt that they had reasonable justification for their model (from my non-expert point of view). I recommend reading the paper if you’re interested in the details. (I read the actual NBER Dec 2019 article, but if you’re trying to avoid a paywall, searching the article title will reveal earlier working copies that are somewhat close to the final version.)

The data is interesting. Unsurprising was that STEM classes were associated with lower grades. (The authors grouped Economics, Finance, Accounting, and Data Sciences with the standard STEM areas.) Average grades were 2.94 and 3.27 for STEM and non-STEM respectively. For women, these averages were 3.00 and 3.37, i.e., women score better than men in both STEM and non-STEM. One confounding factor is that STEM classes were on average twice as large as non-STEM classes (80 versus 40 students) – likely due to those large intro STEM classes. Also unsurprising was that self-reported outside-of-class study time was 40% higher in STEM classes. More shocking were the actual averages of 3.37 and 2.45 hrs/week for STEM and non-STEM respectively. That’s very low! While self-reporting is always suspect, that’s 20,000 students, and the over- and under-estimations might cancel out. We also know from other longitudinal studies that study hours have decreased steadily over the years, and the U of K numbers are not out-of-whack for the present decade.

Looking at details more closely, larger classes do indeed show inverse correlation with grades. Classes with more women have higher average grades. Classes with more women have higher study hours. And then the kicker: Classes with higher grades show less self-reported study time. The authors note that “grade inflation may have negative consequences for learning.”

The meat of the NBER paper is the model they build whereby “grading policies influence enrollment decisions directly because students value grades but also indirectly through incentivizing (costly) study effort.” Each course is assigned a payoff based on a student’s preference for the course, how much time he/she is willing to study, and an expected grade based on such effort. Students sort themselves into courses and receive potential grades that depend on academic preparation, study effort, professor “grading policy”, among other things. There’s a bunch of math and the model is parameterized.
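To make that structure concrete, here is a minimal toy sketch of such a model – a student choosing a course and a study effort to maximize a payoff built from course preference, valued grades, and effort cost. Every function name and every number below is my own illustrative assumption; this is not the paper’s actual parameterization or math.

```python
# Toy sketch of the enrollment model described above. All parameters
# (preferences, grade returns per study hour, effort costs) are made up
# for illustration and are NOT the NBER paper's estimated values.

def expected_grade(prep, effort, return_per_hour, base):
    """Expected GPA given academic preparation and weekly study hours, capped at 4.0."""
    return min(4.0, base + prep + return_per_hour * effort)

def payoff(preference, grade_weight, prep, effort, return_per_hour, base, effort_cost):
    """Student utility: taste for the course, plus valued grade, minus study cost."""
    grade = expected_grade(prep, effort, return_per_hour, base)
    return preference + grade_weight * grade - effort_cost * effort

# A hypothetical student choosing between a "STEM" and a "non-STEM" course,
# where STEM has a lower base grade but a higher return on study effort:
courses = {
    "STEM":     dict(preference=0.2, return_per_hour=0.10, base=2.4),
    "non-STEM": dict(preference=0.5, return_per_hour=0.05, base=3.0),
}
grade_weight, prep, effort_cost = 0.8, 0.3, 0.05

best, best_payoff = None, float("-inf")
for name, c in courses.items():
    for effort in range(0, 11):  # candidate study hours per week
        u = payoff(c["preference"], grade_weight, prep, effort,
                   c["return_per_hour"], c["base"], effort_cost)
        if u > best_payoff:
            best, best_payoff = (name, effort), u

print(best)  # with these assumed numbers: ("non-STEM", 0)
```

With these particular (assumed) numbers, the student sorts into the higher-grading non-STEM course and studies the minimum – which is the flavor of sorting-plus-effort trade-off the model is built to capture, with grade weights and effort costs allowed to differ across students.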

Some interesting things come out of the model: Women study a third more than men. Doubling study effort leads to larger grade increases in STEM than in non-STEM; the extremes are Engineering (0.37 grade increase) and Management & Marketing (0.13 grade increase). There’s a likely-to-be-controversial table showing the ability weights of women being lower in STEM areas, with Chemistry & Physics at the bottom of the pack. Expected GPAs for both women and men are also lowest in Chemistry & Physics (and lower in STEM overall); interestingly, however, stronger students tend to sort towards STEM. This is not because men are necessarily better; women still earn higher grades in STEM, but they also earn higher grades in non-STEM and tend to flock there. Women study more regardless.

The modeling of professor preferences is interesting. STEM professors prefer lower average grades and higher workloads than non-STEM. Hmm… I wonder if that’s true of me compared to my non-science colleagues. The model also suggests that “both STEM and non-STEM professors prefer to give out higher grades with lower workloads in upper-division classes.” Hmm… that’s definitely not true for me workload-wise because my standard upper-division class is Physical Chemistry – considered the hardest and least liked by our majors. I might prefer to give higher grades, but I don’t actually end up doing so. The average grade in my P-Chem classes is slightly lower than in my G-Chem classes, but not by much. The model assumes professors prefer smaller classes (true, I think), but the weighting factor in the model leads to lower grades in STEM classes even though there is higher demand from students. That’s eerie. I don’t think I subconsciously give lower grades in larger classes – students earn their grades! – but I don’t disagree with the trend. I see it in my own classes. I’d like to think it’s because I’m more effective at helping a larger proportion of individual students (who need the extra help) in a smaller class. Time taken up by students in office hours doesn’t change substantially with class size (but it does change a little).

After building and parameterizing their model, the researchers can start testing counterfactuals and examining how these affect the so-called STEM gap – that women disproportionately choose non-STEM areas. The three largest factors that narrow the gap are equalizing non-grade preferences, equalizing grade preferences, and grade norming around a B average. There isn’t much an institution can do about the first two areas. While much outreach has been done to encourage more women into STEM, non-grade preferences remain – not necessarily good or bad, just different. I’m not sure what, if anything, can be done to equalize grade preferences between men and women. I’m certainly not going to ask women to lower their expectations and study less. That leaves grade norming to a B average. The model suggests that this would actually make a difference to the STEM gap, and it’s one that an institution could implement. Mind you, this is grade norming across all areas, i.e., STEM classes would have their grade norms moved up, while non-STEM classes would have their grade norms moved down. I’m not sure you’d get sufficient faculty buy-in to do this in the U.S., while institutions in other countries might already do this. Interestingly, one of the counterfactuals that has little effect is having more women faculty in STEM. I’m not going to comment any further on that one.

I don’t like the idea of norming to a B average. I don’t like grade norming at all. If a student shows they understand roughly 75-80% of the material, then I think they deserve a B. (My B range is pegged at 70-84%.) If the average student shows less (as determined by exams, homework, quizzes, etc.), then the average student shouldn’t be earning a B. Then again, I’m the one writing the exams and setting the level of difficulty. If I made my exams “easier”, the average would go up. The question is: What is “average”? In a Chronicle of Higher Education article twenty years ago, refreshing for its candor, a Dartmouth professor writes that “we imagine our students to be at a mythical Average U., and give the grades that they would get there.”
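An absolute scale like this is just a fixed score-to-letter mapping. A minimal sketch, with the caveat that only the B range (70-84%) is stated in this post – the other cutoffs below are my own illustrative assumptions, not the actual syllabus:

```python
# Sketch of an absolute (no-curve) grading scale. Only the B range
# (70-84%) comes from the post; the A, C, and D cutoffs are assumed
# for illustration.

def letter_grade(percent):
    """Map a course percentage to a letter grade on a fixed scale."""
    if percent >= 85:   # assumed A cutoff
        return "A"
    if percent >= 70:   # B range stated in the post: 70-84%
        return "B"
    if percent >= 55:   # assumed C cutoff
        return "C"
    if percent >= 40:   # assumed D cutoff
        return "D"
    return "F"

print(letter_grade(78))  # "B" – a student showing ~75-80% understanding
```

The point of the fixed mapping is exactly the no-curve promise: the grade depends only on the student’s own score, never on how classmates did.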

Maybe that’s what I’m subconsciously doing. I think C is average, and that the average student in my average class is slightly above that average (i.e., at the C+/B- borderline). When I have a stronger, smaller, more motivated class, the average goes up. Not because of bias, I don’t think. I’ve tested this unsystematically by occasionally recycling final exam questions.

And now that this post is four pages long and I’m starting to wade into the phenomenon of grade inflation, I think I should hold my flood of thoughts for the moment. You can wait eagerly (or not) for my next post!
