Reflecting on the season of standardized testing
And we’re back from a 3-week hiatus brought on by a guys’ trip, general busyness, and a dip in motivation. But now we’re back!
Most of my job in late spring and early summer of every year involves analyzing data from standardized tests, and the analyses I’m doing range from back-of-the-napkin estimates of pass rates to careful, rigorous statistical modeling looking for effects of some instructional decision or program. Plus updating dashboards – there’s lots of that.
This is to say that during this window, my work-brain is mostly noodling through standardized test results. And I figured that I’d serve up a very lukewarm take based on some conversations I’ve been having recently with folks.
First, some disclaimers and context. I’m mostly talking about state standardized tests here – federally mandated end-of-year assessments in most core content area courses (e.g. math, English language arts, various science and social studies courses). In Virginia, these are the Standards of Learning (SOL) assessments. Other states have comparable tests. Some of these reflections may also apply to other sorts of standardized assessments, like the SAT, but those aren’t the focus here.
I’m also going to use Virginia and our tests as examples throughout, because those are the ones I’m most familiar with, but I imagine other states and their tests are very similar.
Anyway. My hopefully very common-sensical take is that, when we draw conclusions or make inferences from tests like the SOLs, we have to simultaneously consider that:
- Each test is (probably) a good measure of content mastery, and
- Each test is a single point-in-time measure
Regarding the first point – that each test is probably a good measure of content mastery. I mean, the federal government provides guidelines for the development of these tests, and Virginia contracts with Pearson – a very large company that can pay to employ very smart and capable psychometricians – to create and administer the SOL tests. There’s quite a bit I don’t trust about the Virginia Department of Education (at least in its current state), but end-of-year assessments are so heavily regulated and there’s so much money involved in them that it would be incredibly difficult for VDOE to regularly create objectively bad tests. My only real hangup – and why I say that they’re “probably” good tests – is that VDOE doesn’t publish its technical reports describing test development and psychometric properties, so it’s hard to be certain about the quality of these tests. If you want these reports, you have to request them. Which is kinda bonkers.
What do I mean by a “good test” or a “bad test,” though? There are lots of ways to go about answering this, but one of the most straightforward ways comes from classical test theory (CTT). CTT is predicated on a very obvious but very ingenious axiom – that, when a person takes a test, their test score (X) is the sum of their “true score” (T) and some amount of error (E):
X = T + E
Examples abound here, particularly because “error” is loosely defined and can mean any number of things. Say I overachieved on an algebra test because I guessed correctly on several multiple choice questions. Well, then my error term (E) is positive and is inflating my test score, X (my algebra proficiency according to the test), relative to my true score, T (my actual algebra proficiency). Say I underachieved because I ate a bowl of Lucky Charms that morning and had a stomach ache. Well then my error term is negative and is deflating my test score, relative to my true score. Say the actual questions used to measure proficiency in algebra are imperfect measures – maybe some are word problems that also require reading proficiency. This gets factored into the error term, too.
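If it helps to see that decomposition with actual numbers, here’s a minimal sketch in Python. Everything in it is made up purely to illustrate the X = T + E bookkeeping (the 0–100 scale, the size of the guessing bonus, the stomach-ache penalty):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical student: "true" algebra proficiency on a 0-100 scale
true_score = 72.0

# Made-up sources of error, just for illustration
lucky_guesses = 4.0             # correct guesses inflate the score (positive error)
stomach_ache = -6.0             # a bad morning deflates it (negative error)
item_noise = rng.normal(0, 2)   # imperfect items, e.g. hidden reading demands

error = lucky_guesses + stomach_ache + item_noise
observed_score = true_score + error  # X = T + E

print(f"T = {true_score}, E = {error:.1f}, X = {observed_score:.1f}")
```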
When I first encountered this in a course on psychological measurement in grad school, it felt so obvious that I couldn’t believe we were spending several weeks on it. But it lays the foundation for this whole framework where, if you consider a test score as partitionable into true score and error, and if you can actually quantify X, T, and E, then the whole endeavor of test development becomes an exercise in minimizing the error.
Getting back to the idea of a “good” or “bad” test, though. If we’re embracing a purely psychometric definition of a good test, then we want the test that minimizes the proportion of score variance attributable to error, where again “error” is anything that isn’t the thing we’re trying to measure (e.g. proficiency in algebra). In the real world, there are other things that matter, like test length and test format, but we’ll ignore those for now. Writing a high-quality test that measures a student’s proficiency in algebra, one that minimizes the implicit error in that measure, is largely a statistics problem. Lots of teachers write very good tests, but none that I know of are conducting rigorous psychometric analyses of their tests, putting them through multiple rounds of revision and peer review, and statistically minimizing their error. To be clear, I don’t think you need to do all of this to write a very good test, but to write one that is scalable and empirically high-quality, and that truly minimizes error, you do.
This isn’t meant to be a dig at teachers. Teachers do a billion different things, and one of those billion things is creating tests. Test developers do basically one thing.
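For what it’s worth, if you want to put a number on “the proportion of a score attributable to error,” the standard CTT quantity is reliability: the share of score variance that comes from true scores rather than error. Here’s a rough simulation sketch comparing a carefully built test to a noisier one. The score scale and error sizes are assumptions I made up, not anything from real SOL data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 10_000

# Assumed distributions, purely for illustration
true_scores = rng.normal(500, 50, n_students)       # T across students
errors_careful = rng.normal(0, 15, n_students)      # E for a carefully built test
errors_noisy = rng.normal(0, 50, n_students)        # E for a much noisier test

for label, errors in [("careful test", errors_careful), ("noisy test", errors_noisy)]:
    observed = true_scores + errors                   # X = T + E
    reliability = true_scores.var() / observed.var()  # Var(T) / Var(X)
    print(f"{label}: reliability ~ {reliability:.2f}, "
          f"so ~{1 - reliability:.0%} of score variance is error")
```

In this toy setup, the careful test’s scores are mostly true score, while the noisy test’s scores are about half error. Test development, in the CTT framing, is the work of pushing that error share down.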
So anyway, I tend to trust the results from state standardized tests like the SOL, at least in general.
That said, even if we do consider these tests to be good measures of student proficiency in a given subject, they’re measures at a single point in time. A standardized test score isn’t some eternal truth handed down from God. Tests measure what’s happening at one point in time, and as we get farther away from that one point in time, it becomes less and less valid to assume the test score is still an accurate measure of current proficiency. This seems very obvious, and if you apply this same logic elsewhere, it seems even more so. Imagine saying to someone, “well, I weighed 180 pounds 9 months ago, so that’s probably what I weigh now.”
And yet, because there’s so much pomp and circumstance around standardized testing, I often see people cling to these results as if they’re the only data points worth considering, without acknowledging that they certainly have a shelf life.
If we revisit the earlier equation, X = T + E, the problem is that the recorded score X is frozen at test day while the student’s true proficiency keeps changing: they learn things, forget things, grow. So if we keep treating an old X as a measure of current proficiency, the absolute value of the error term tends to increase as time passes.
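To picture that, here’s one more illustrative sketch. The drift rate is entirely hypothetical; the point is just that if true proficiency keeps moving while the recorded score stays frozen, the error in reading that old score as a measure of current proficiency tends to grow:

```python
import numpy as np

rng = np.random.default_rng(7)
n_students = 10_000

# Score on test day: X = T + E, on a made-up scale
true_at_test = rng.normal(500, 50, n_students)
score = true_at_test + rng.normal(0, 15, n_students)

# Let true proficiency drift after test day (learning, forgetting, life)
drift_per_month = 10  # purely hypothetical rate
for months in [0, 3, 6, 9]:
    drift = rng.normal(0, drift_per_month * np.sqrt(months), n_students)
    true_now = true_at_test + drift
    stale_error = score - true_now  # error if we treat the old X as today's measure
    print(f"{months} months later: typical |error| is about {np.abs(stale_error).mean():.0f} points")
```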
So I suppose my very very very tepid take here is that we should take standardized test scores seriously, but less so the older they are.
If you’re enjoying reading these weekly posts, please consider subscribing to the newsletter by entering your email in the box below. It’s free, and you’ll get new posts to your email every Friday morning.