Apologies to Sandra Cisneros

How ETS' computer-based writing assessment misses the mark

By Maja Wilson

Illustrator: Katherine Streeter

At 27 years old, I was quickly becoming an English-teacher stereotype: Frizzy hair escaped my hastily secured bun, and my eyes had gone slightly wild from reading pile after pile of student papers.

It was exam time during my first year of teaching high school writing courses. I’d been determined not to turn the mandatory finals into a multiple choice test, so I’d assigned a final revision and had given an in-class reflective writing assignment during the exam hour. But I had only one work day to read and grade the papers and reflective writings, calculate and submit semester grades, and prepare for my next round of courses. After toiling alone in my classroom, I had moved my piles of papers to the teachers’ lounge in an increasingly futile attempt to preserve my sanity.

Another teacher appeared, weighed down with her own stack of exams. “It’s a lot of work, isn’t it?” she empathized. She understood! But as she turned and began to feed her stack of papers into the Scantron machine on the counter, my wan smile faded. “Sometimes I have to run these through two or three times to make sure there are no mistakes,” she said, shaking her head, clearly not noticing that my eyes were now boring into the back of her skull.

Five years later, I still haven’t given students in my writing courses a multiple-choice final exam. I’d like to chalk up my resistance to idealism and strength of character, but it’s partly an issue of logistics. I’ve got to keep the students occupied for an entire hour and a half. How many multiple-choice questions can you devise about the steps of the writing process? Secretly, though, I’ve held out hope; if technological progress can send a probe to Mars and pack 15,000 of my favorite songs into an iPod, why can’t it devise a way to grade student writing?

Free Trial from ETS

The truth is, the technology to grade student writing has been around for a while. In the late 1990s, the Graduate Management Admission Test (GMAT) began using Educational Testing Service’s (ETS) e-rater, a computer-based grading program, to score applicants’ essays along with one human reader; even my state’s standardized testing program was considering a computer grading program the following year.

But when it comes to technology, I’m usually a bit behind the times. So I didn’t begin to seriously consider computer writing assessment until this summer, when I received an email offering a free trial of ETS’s Criterion — a web-based grading program.

I’d had no idea how widespread computer grading systems already were; in 2004, Criterion was scoring and responding to 10,000 student essays a week, according to an article in AI by Jill Burstein, a researcher at ETS who developed Criterion. ETS’s website claimed that “hundreds of thousands of students” were now using Criterion.

I quickly registered for my free trial, and browsed Criterion’s website while I waited for the email confirming my guest password. ETS pointed out that its grading tools “only process information; they cannot read and analyze it.” When someone sits down to grade a paper, the first thing they do is read the paper; how can a computer program claim to evaluate writing when computers can’t “read”? Although Criterion’s promises of accurate scores and “immediate feedback” for students appealed to my dream of organizing my file cabinets on teacher work days, I remained skeptical. If I were to embrace ETS’s Criterion, I needed first to compare it with what I wanted for my students and what I knew about writing pedagogy and assessment.

Reading involves a complex set of interactions among language, experience, emotion, and thought that I suspected were at the heart of writing assessment. For example, today I finished reading The Year of Magical Thinking, in which Joan Didion chronicles her thoughts and memories during the year following her husband’s death. In the last chapter, this sentence — which recalls a trip Didion took with her husband to Indonesia and Malaysia — caught in my throat: “Some of the islands that were there then would now be gone, just shallows.” My entire being responded to the metaphor of loss in the image of the disappearing islands, as I layered Didion’s story of grief and my own experiences of loss onto those 14 words. Many a grammatically correct sentence had failed to inspire or move me like this one had; my assessment of its power rested firmly on the experiences and ideas that I brought to the reading as well as the images and rhythms of language Didion had built in earlier chapters. Simply put, my assessment depended on my ability to read. Without the ability to read, what can Criterion do?

How Criterion “Reads”

Criterion uses formulas to compare a student’s performance to exemplars, or pre-scored student papers. It then rates the student’s work based on these comparisons. For example, if the sentence length ratios and instances of repetitious words in Emily’s essay match the data gathered from the set of pre-scored C- essays, Emily will receive a C- for style. And how does Criterion define “Style”? According to Burstein, Criterion finds errors in “Style” by looking for “the use of passive sentences, as well as very long or very short sentences within the essay. Another feature of potentially undesirable style that the system detects is the presence of overly repetitious words.”
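
To make concrete how little reading this kind of analysis requires, here is a minimal Python sketch of a surface-level style check in the spirit of Burstein's description. It is not ETS's code. The function name, the sentence-length cutoffs, the repetition threshold, and the stopword list are all assumptions invented for illustration, and the sketch leaves out both passive-sentence detection and the comparison against pre-scored exemplars.

```python
import re
from collections import Counter

# Hypothetical cutoffs, invented for this illustration only.
SHORT_SENTENCE = 4    # fewer words than this counts as "very short"
LONG_SENTENCE = 40    # more words than this counts as "very long"
REPEAT_SHARE = 0.05   # share of content words above which a word is "repetitious"
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
             "is", "it", "my", "i", "that", "was"}

def surface_style_flags(text):
    """Count surface 'style' features. Nothing here reads for meaning."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]

    # Lowercased content words, with common function words dropped.
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    counts = Counter(words)
    total = len(words)

    # A word is flagged if it appears at least three times and
    # makes up more than REPEAT_SHARE of the content words.
    repeated = sorted(w for w, n in counts.items()
                      if n >= 3 and n / total > REPEAT_SHARE)

    return {
        "short_sentences": sum(1 for n in lengths if n < SHORT_SENTENCE),
        "long_sentences": sum(1 for n in lengths if n > LONG_SENTENCE),
        "repeated_words": repeated,
    }
```

Called on a few lines of deliberately repetitive prose, a function like this returns only tallies: how many sentences are short, how many are long, which words recur. A writer who repeats a word for rhythm and a writer who repeats it from carelessness would get the same flags.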

In other words, when e-rater looks at Didion’s last chapter, it isn’t interested in grief or loss or metaphor. It isn’t interested in what Didion wrote in chapter 2 about her fear of her daughter’s death and how this fear — combined with her husband’s death and my own losses — creates a hum underneath the sentence about disappearing islands. Criterion bypasses the reading process by analyzing and responding to the surface of the writing.

I was leery of Criterion’s reliance on pre-scored essays because I’ve often been pleasantly surprised at a student’s unique approach to a writing assignment — an approach that may break my preconceived visions of quality. Last week, Tyler wrote a run-on sentence that blew my mind; if it hadn’t moved me so deeply, my green pen (Is green the new red?) would have had a field day with it. But Tyler was illustrating his tendency to zone in and out of the world around him and the world of his mind, resulting in a kind of muddy image-soup that could only be captured in a half-page sentence. It is absolutely true that — like Criterion — I have images in my head of previous students’ work when I read new papers. But I like to think that my mind isn’t as rigid as a computer; I can leave the exemplars behind when I’m swept away on the current of beautiful writing that doesn’t fit the mold. My ability to read — with all the complex interactions involved — allows me this freedom.

Despite my suspicions that e-rater couldn’t grade as well as a human being, research from links on Criterion’s website suggested that human readers and Criterion’s scoring program agreed on scores more often than humans agreed with each other. The logic was dizzying; e-rater scored more like humans than humans did? But the human readers against whom e-rater was compared had been trained to use the same scoring rubric used by Criterion. I’d become suspicious of rubrics’ ability to capture what I value about students’ writing. When I used them, I struggled to narrow my reactions to students’ writing into the language and categories of the rubric. Some writers clearly followed directions too well, writing horribly boring but mechanically correct and organized essays that scored high, while some writers took huge risks, producing engaging and often beautiful work that scored low because the rubric didn’t value the risks and unique approaches they’d taken. I wondered if the researchers’ implication that e-rater scored more like humans than humans did had been inverted. Did e-rater grade like a human, or had rubrics forced humans to grade like a computer?

And to be honest, I was less interested in Criterion’s claim to grade student writing and more interested in its promise to give “immediate feedback” not only on grammar and mechanics, but also on style, organization, and development. Feedback was critical to my teaching of writing; if I did nothing but grade, I’d have plenty of time to organize those files. But I wanted for my students the kind of feedback I’d experienced from interested readers I’d had throughout my life — from my father, my friend Sarah, and from the participants in writing workshops I’d taken as an adult. None of these readers had based their reactions on criteria drawn from the surface of my work. Instead, they’d let my words roll around in their minds and then told me how they had been affected. When I tried to provide this kind of feedback for my students, I not only saw their writing improve, but I saw them investing in the writing process with a commitment I’d never seen when I focused on grading. Like Linda Christensen, who describes her refusal to grade students’ papers in “My Dirty Little Secret” (Rethinking Schools, Vol. 19, No. 2), I’d decided to give full credit for a paper once the student had revised and we were both satisfied with it, no matter how many revisions it took.

I saw the benefits of this decision with Jessica, who came to my class convinced that she was a horrible writer. She handed in a short piece of descriptive writing about her favorite place, her couch, and she told me that she hated it and that I should just throw it out after I failed it. I wasn’t engaged at all in her piece until I read one line that Jessica wrote about how she sat with her grandmother on the couch every day. I pointed out this line to Jessica, telling her how much I liked it and how it made me wonder about her grandmother and how they’d come to see the couch as a kind of common ground. Perhaps she really wanted to write about her grandmother, using her favorite place on the couch as a frame for describing their relationship. Later that week, Jessica handed in another draft of the paper, this time smiling as she handed it in. After I read it out loud to the class — anonymously, as she’d requested — she raised her hand and proudly and loudly demanded her paper back so that she could show it to her grandmother.

Testing Criterion

How would I test Criterion’s feedback? My guest username and password had arrived, and as I considered the empty textbox into which I was invited to type an essay, I briefly considered submitting one of my students’ papers. But then I thought of Sandra Cisneros’ “My Name” from The House on Mango Street, a wonderful model for narrative and descriptive writing used by writing teachers all over the country. Every time I read “My Name” to my students — and I always read it out loud because I love to play with the rhythm Cisneros creates — my students and I find different things to love. Last year, Maria lit up when I read, “It is the Mexican records my father plays on Sunday mornings when he is shaving, songs like sobbing.” She began to talk about the Mexican music her own father played, and how those songs had always felt vaguely sad. This year, when I read it, Tom stopped me after this passage about Esperanza’s namesake, her grandmother: “I would’ve liked to have known her, a wild horse of a woman, so wild she wouldn’t marry. Until my great-grandfather threw a sack over her head and carried her off. Just like that, as if she were a fancy chandelier. That’s the way he did it.” Tom, the class clown, frowned and called out, “He didn’t really carry her off! That’s crazy!” Ayame said, “It’s a metaphor! It means she really didn’t want to marry.” When Tom protested that she shouldn’t have agreed to marry him, the class was off onto a heated conversation about gender and culture that I hadn’t planned but wasn’t about to shut down. Since “My Name” ends as Esperanza fantasizes about the names she could choose for herself, “Esperanza as Lisandra or Maritza or Zeze the X. Yes. Something like Zeze the X will do,” I often invite my students to consider the names that might fit them better than the names they’ve been given. And sometimes, we simply read it, savor it quietly and meditatively, and begin to write about our own lives.

I wondered what feedback Criterion would give a writer such as Cisneros. I typed “My Name” into the blank textbox and clicked “Submit.”

Criterion delivered perfectly on its first promise; it created a printout of responses about grammar, mechanics, style, organization, and development in less than thirty seconds. But Criterion was not overjoyed by Cisneros’ writing; the first thing that struck me about the feedback was that it offered no praise. Praise is important in my own feedback because it helps students become better writers. When I sit down to write, about 90 percent of what spills out is not suitable for public consumption. Part of becoming a good writer is the ability to recognize the 10 percent with potential and the courage to purge the rest. When I praise part of a student’s paper, I am pointing out the standard they have set for themselves. But Criterion cannot recognize what writers do well. It only defines and describes deficits, and its feedback to Cisneros merely pointed out “mistakes” in “My Name” — repetitious word use, use of fragments, and problems with organization and development. What would happen to “My Name” if Criterion’s immediate feedback had shaped it?

“Improving” Cisneros

Criterion had highlighted words and phrases that might need work, and directed me to the Writer’s Handbook — provided by Criterion via links. With profuse apologies to Cisneros, I set out to revise “My Name” in the hopes of finding out what effect Criterion might have on our students’ revision and writing.

I began by examining Criterion’s comments on Cisneros’ grammar. It highlighted “A muddy color.” “My great grandmother.” and “Esperanza.” Criterion pointed out that these were fragments, requiring a subject and a predicate. Although changing the fragments to complete sentences changed the rhythm of “My Name,” I tried to preserve Cisneros’ syntax. For example, “My great grandmother.” became “I would have liked to have known my great grandmother, a wild horse of a woman.” In the Mechanics section, Criterion had highlighted “the one nobody sees,” pointing out an extra article. This suggestion wasn’t a mandate, so I left the extra article intact. So far, so good.

Incorporating Criterion’s comments about style proved a bigger challenge. First, I replaced Cisneros’ word repetitions. I also took the program’s advice and combined short sentences into long ones: the second, third, and fifth sentences became one streamlined sentence: “In Spanish, my name, which is Esperanza, has a muddy sound and means too many letters, sadness, and waiting.”

Organization and development caused me the most consternation. Criterion asked if the first three sentences were the introduction, then directed me to the Writer’s Handbook for suggestions on how to “capture the reader’s interest, provide background information about your topic, and present your thesis sentence.” I settled on “general background” and wrote about how examining the meaning of one’s name in different languages may lead to insights. The new improved introduction set me up nicely for my next task: to write a clear thesis statement that would “organize, predict, control, and define the essay.”

The sentence about the records played by Esperanza’s father — highlighted and identified as the thesis — surely didn’t meet the criteria for a good thesis. It was back to the Writer’s Handbook, where I learned that perhaps I needed to define the terms used and prepare the reader for the points that would be made later. To reorganize the rest of Cisneros’ writing into supporting material with strong topic sentences, I had to think long and hard about the point of Cisneros’ loosely associative narrative. But I managed to make all of these associations clear in the end.

By the time I got around to taking Criterion’s suggestion to transform the end of “My Name,” it was easy to write a conclusion that “reminds the reader about your thesis, stresses the importance of the ideas you have developed, and leaves the reader with thought-provoking ideas.” It was time to resubmit the essay to Criterion.

Names mean different things in different languages. While many people never consider the origin of their names, comparing the different meanings of one’s name can lead to interesting insights. An insight is some kind of deep understanding one comes to about an issue — global or personal. When I began to reflect on the meaning of my name in different languages, I not only began to understand personal truths about my family legacy, about the nature of gender and culture, and about myself, but I discovered a plan to grow in the future.

The negative associations I have with my name in Spanish reflect my family’s legacy and make me question if I should accept this legacy. In Spanish, my name, which is Esperanza, has a muddy sound and means too many letters, sadness, and waiting. The muddy sadness that I associate with my name in the Spanish language reminds me of the Mexican records my father plays on Sunday mornings when he is shaving, songs similar to sobbing. Why were the songs so sad, and how did this sadness affect me and my family?

The sadness of this legacy began because Mexican men don’t like their women strong. This caused sadness for my great grandmother, for whom I am named. While my great grandmother was a wild horse of a woman when she was young, my great grandfather forced her to marry him against her will and she spent the rest of her life looking out of the window, the way that so many women do, waiting. I wonder if she was sorry because she couldn’t be all the things she wanted to be, or if she made the best of her life. In Spanish, the name Esperanza is made out of something soft, like silver, but I don’t want to be soft, like my great grandmother was forced to become. I don’t want to inherit her place by the window, a sad place that was dictated by her culture and her gender.

I am not sure that my name fits me. When I think about the meaning of my name in English, I am torn between the culture of my family and the culture in which I find myself here in the United States. My name in English means hope. While I hope that my life in the United States will be different than my great grandmother’s life, I do not like everything about the culture here. I know this because I do not like the way the other students at school pronounce my name. At school they pronounce my name strangely, as if the syllables are made out of tin and hurt the roof of their mouths. My reaction to the way they say my name shows that I am not altogether comfortable here. Still, my dislike of my name in Spanish and my great grandmother’s role shows that I am not comfortable with my family’s culture.

Perhaps I don’t fit in anywhere and must create my own identity. If I were to name myself, I would choose a strong-sounding name. With this strong name that I chose for myself, I would reclaim my identity, which is lost between two worlds right now. I would choose a name such as Lisandra or Maritza or Zeze the X. Yes, the strong sound of Zeze the X will do. Those who are struggling with their own identity would be wise to consider the meaning of their name in different languages and to rename themselves, as I have done.

New and Improved?

My rewrite avoided most of Criterion’s criticism. When Criterion highlighted my introduction and asked if it provided background information, I was able to answer affirmatively. In addition, my essay was 270 words longer, and ETS research said longer was better. I was also pleased that I’d managed to turn Cisneros’ essay into five paragraphs; according to ETS, Criterion was based on a “five paragraph essay strategy for developing writers.”

But did my revisions make “My Name” a better piece of writing? I did what Criterion couldn’t do and re-read Cisneros’ original work and my own revision. This reading rendered the question ridiculous. True, Cisneros’ work wasn’t an argumentative, expository piece. Was my test fair? A halfhearted disclaimer on Criterion’s website simultaneously claimed not to “stifle creative writing” and to be better suited to grade writing done for standardized tests. But Criterion also claimed to be able to grade and respond to narrative and descriptive work, which Cisneros’ piece certainly was.

I also questioned the distinction between “creative writing” and expository and argumentative writing. Much of my own argumentative writing relied on what I considered to be creative descriptions and narrative, and I found that my students argued and informed more effectively when they adopted creative approaches to their investigations and writing. Most of the argumentative and informative writing I enjoyed reading in The Atlantic Monthly or in bestsellers such as Bill Bryson’s A Short History of Nearly Everything demonstrated this creative, reflective, and layered approach. Distinguishing between creative writing and other kinds of writing represented a kind of delusion that allowed Criterion to get away with reshaping all writing into the five-paragraph essay — a form I’d never enjoyed reading, writing, or teaching — simply because it is easier to grade.

But I didn’t dislike the five-paragraph essay just because I found it boring. I disliked it because I mistrusted its intentions. The five-paragraph essay, along with the computer programs developed to evaluate it, separated thought from language. The thought embodied by the words no longer mattered as long as the thesis sentence was in the right place and the three main points were captured in topic sentences and the conclusion was safely summative. Teaching the five-paragraph essay turned writing into a formula to be memorized and spit out again and again. It turned students into factory workers who cranked out essays that all looked alike. It turned me into a quality control manager slamming my template down in perfect rhythm over every essay and looking for deviations from the norm. This process didn’t honor my own mind as a teacher or my students’ minds as writers.

But the standardization of writing pedagogy and assessment doesn’t just encourage formulaic teaching and writing; it creates a pyramid scheme with students and teachers at the bottom and corporations such as ETS at the top. Corporations make a fortune every year telling overburdened and underpaid teachers how to teach — and selling products to help students pass the tests. ETS, with its commitment to standardized testing and its subsequent development and marketing of Criterion, is poised to capitalize on this phenomenon.

If the considerable time and effort that have been put into computer grading systems that produce standardized writers had been put into reducing instructors’ loads, perhaps we wouldn’t be so desperately looking to testing corporations for solutions to problems that they have created. But the corporations that profit from the perception of our incompetence can’t let that happen. If they trusted teachers to teach and if they trusted students to think and question, they’d be out of a job.

Will resisting Criterion’s claims thwart ETS’s attempt to turn writing from an art to a test-taking skill? Maybe, maybe not. But for the sake of Cisneros and all of the students in my class whose writing has the power to change all of our lives, I’m not buying into this one.

Maja Wilson (maja384@yahoo.com) lives and teaches in Ludington, Mich. She is the author of Rethinking Rubrics in Writing Assessment, Heinemann Books (2006). She expects not to be sent any more free trials from ETS.