A meditation on test scoring
Illustrator: Eric Hanson
As a public school English teacher, I observe standardized testing season each year with a sort of grim fascination. So this is it, I think as I pace around my silent classroom, peering over kids’ shoulders at articles about parasailing. Line graphs tracking the rainfall in Tulsa. Parts of speech. Functions of x. These are the 34 questions that will determine some aspect of my students’ futures, as well as our school’s yearly progress report, my own teacher report card, and soon, possibly, my salary. I wince at my young charges’ careless mistakes. I see eyes rove. I know who came in yawning, who’s feeling sick, whose brother is back in jail. So many variables (input) producing a single-digit score (output). Functions of x.
Last month, several weeks after those long quiet mornings, I got another glimpse inside the testing industrial complex. Instead of reporting to my own school and teaching my 8th-grade writing classes, I reported to the gymnasium of a middle school uptown, along with 100-odd other New York City teachers, to score 5th-grade New York State math exams. I was there in place of a colleague, a special education teacher—as are the vast majority of the educators pulled out of schools to grade the state tests—who had already spent most of the prior week scoring reading tests and begged me to take her spot in “the warehouse.” (I should note that most standardized test scoring is done not by teachers, but by temp workers—who must possess at least a four-year college degree and are paid a low hourly rate.)
I arrived on the Upper East Side that Tuesday morning in a chilly May downpour, already on my second cup of coffee. Students with overstuffed backpacks jostled through the school’s double doors ahead of me, squealing about the rain. Inside, I headed toward the open gym door, where I could see 20 round tables, bedecked in red and blue plastic tablecloths. Clusters of damp-haired teachers were settling into the folding chairs, chatting quietly, leafing through newspapers. They were mostly young, in their 30s—some older and distinguished looking—with bright eyes, sensible footwear, books. This was no temp crew; these seemed like people who should be off teaching children. I took my seat among them.
After a spirited welcome sermon from the site manager, we scorers had to be trained. This involved several steps, starting with deciphering three pages of guidelines with stipulations like:
In all questions that provide response spaces for two numerical answers and require work to be shown for both parts, if one correct numerical answer is provided but no work is shown in either part, the score is zero. If two correct numerical answers are provided but no work is shown in either part, the score is one.
A series of boxes labeled “Holistic Rubric” informed us that a two-point response “may reflect some misunderstanding of the underlying mathematical concepts,” while a one-point response “demonstrates only a limited understanding” thereof. Semantically, we were in the deep end.
Around me sat dozens of focused teachers, brows furrowed, eyes and lips and pencils moving silently through the subclauses. Rereading, circling, raising hands to ask “What if . . .” questions. I wondered, “What if all this brainpower could be devoted to the subject of actual teaching and learning?” Not that it never is; not that we don’t all pour tremendous energy into our own classrooms and schools. But the Department of Ed has never plucked me from my classroom and asked me to do that kind of precise, intense thinking—for three full days—about how kids learn best. I was struck by the sight of us all, serious and alert, poring over these scoring guidelines like legal briefs. What could these minds do if given a different task?
When we finished reading, we were asked to peruse several scored sample problems (three-pointers, two-pointers, one-pointers, zeroes), then to try scoring a few sample questions on our own. At that point, and from then on, we were encouraged to cross-check with our neighbors, to debate, and together to refer back to the rubric. Throughout the day, I found this social aspect of the grading process to be crucial; in fact, it may be the only thing guaranteeing any consistency in the final scores. In teams of two, four, sometimes eight, we were able to find some consensus about whether a child’s response fulfilled the task in a much more consistent way than we might have achieved alone. (It’s interesting to consider how often students are prohibited, or at least discouraged, from engaging in that type of collective thinking, in favor of silent individual work. Even in some “cooperative learning” classrooms, student collaboration still feels teacher-driven or scripted. But that’s another story.)
Enjoyable as it was, though, our social working process was vital precisely because of one scary fact: So much about these tests is subjective. I alluded earlier to the “input variables” that make standardized exams a less than reliable measure of a child’s intelligence—what subject matter appears in the passages, how the questions are worded, the amount of sleep or breakfast a child had that day. But last week I saw how subjective things are even on the output end. Was that a decimal or a stray mark? Is this a four or a nine? We dutifully passed tests around the table, weighing students’ responses against the sample questions, squinting at crossed-out work, shaking our heads.
“But he got it!”
“But he wrote add when he meant multiply.”
“But he did multiply!”
Our table was especially diligent; we spent 20 minutes debating a question on one of our last tests for the day and, after getting conflicting opinions from three site coordinators, ended up calling the question in to the city testing headquarters. (Final verdict: full credit.) The higher-ups’ response made us realize that we may have scored several other students’ answers too harshly, so we asked the site manager to reopen the boxes of tests we had turned in. The other round tables in the gym were empty—it looked like an extremely boring prom had just ended—as we unshouldered our bags and sat back down to make sure we hadn’t denied any children their rightful points.
The results of our search: Three other students’ responses got bumped up to full credit, and we found two full tests and one half test that hadn’t been checked at all.
So much focus and energy and intelligence in that room. So much riding on those numbers. And even a table of overachievers couldn’t get through day one without a few glaring mistakes and a whole lot of gray.