Is It Fair and Accurate for AI to Grade Standardized Tests?

Texas is turning over some of the scoring process of its high-stakes standardized tests to robots.

News outlets have detailed the Texas Education Agency’s rollout of a natural language processing program, a form of artificial intelligence, to score the written portion of standardized tests administered to students in third grade and up.

Like many AI-related projects, the idea started as a way to cut the cost of hiring humans.

Texas needed a way to score far more written responses on the State of Texas Assessments of Academic Readiness, or STAAR, after a new law mandated that at least 25 percent of questions be open-ended, rather than multiple choice, starting in the 2022-23 school year.

Officials have said that the auto-scoring system will save the state millions of dollars that otherwise would have been spent on contractors hired to read and score written responses — with only 2,000 scorers needed this spring compared to 6,000 at the same time last year.

Using technology to score essays is nothing new. Written responses for the GRE, for example, have long been scored by computers. A 2019 investigation by Vice found that at least 21 states use natural language processing to grade students’ written responses on standardized tests.

Still, some educators and parents felt blindsided by the news that essays written by K-12 students would be auto-graded. Clay Robison, a Texas State Teachers Association spokesperson, says that many teachers learned of the change through media coverage.

“I know the Texas Education Agency didn’t involve any of our members to ask what they thought about it,” he says, “and apparently they didn’t ask many parents either.”

Because of the consequences low test scores can have for students, schools and districts, the shift to using technology to grade standardized test responses raises concerns about equity and accuracy.

Officials have been eager to stress that the system does not use generative artificial intelligence like the widely known ChatGPT. Rather, the natural language processing program was trained using 3,000 written responses submitted during past tests and has parameters it will use to assign scores. A quarter of the scores awarded will be reviewed by human scorers.

“The whole concept of formulaic writing being the only thing this engine can score for is not true,” Chris Rozunick, director of the assessment development division at the TEA, told the Houston Chronicle.

The Texas Education Agency did not respond to EdSurge’s request for comment.

Equity and Accuracy

One question is whether the new system will fairly grade the writing of children who are bilingual or who are learning English. About 20 percent of Texas public school students are English learners, according to federal data, although not all of them are yet old enough to sit for the standardized test.

Rocio Raña is the CEO and co-founder of LangInnov, a company that uses automated scoring for its language and literacy assessments for bilingual students and is working on another one for writing. She’s spent much of her career thinking about how education technology and assessments can be improved for bilingual children.

Raña is not against the idea of using natural language processing on student assessments. She recalls that one of her own graduate school entrance exams was graded by a computer when she came to the U.S. 20 years ago as a student.

What raised a red flag for Raña is that, based on publicly available information, it doesn’t appear that Texas developed the program over what she would consider a reasonable timeline of two to five years — which she says would be ample time to test and fine-tune a program’s accuracy.

She also says that natural language processing and other AI programs tend to be trained on writing from people who are monolingual, white and middle-class, which is certainly not the profile of many students in Texas. More than half of the state’s students are Latino, according to state data, and 62 percent are considered economically disadvantaged.

“As an initiative, it’s a good thing, but maybe they went about it in the wrong way,” she says. “‘We want to save money’ — that should never be done with high-stakes assessments.”

Raña says the process should involve not just developing an automated grading system over time, but deploying it slowly to ensure it works for a diverse student population.

“[That] is challenging for an automated system,” she says. “What always happens is it’s very discriminatory for populations that don’t conform to the norm, which in Texas are probably the majority.”

Kevin Brown, executive director of the Texas Association of School Administrators, says a concern he’s heard from administrators is about the rubric the automated system will use for grading.

“If you have a human grader, it used to be in the rubric that was used in the writing assessment that originality in the voice benefitted the student,” he says. “Any writing that can be graded by a machine might incentivize machine-like writing.”

Rozunick of the TEA told the Texas Tribune that the system “does not penalize students who answer differently, who are really giving unique answers.”

In theory, any bilingual or English learner students who use Spanish could have their written responses flagged for human review, which would assuage fears that the system would give them lower scores.

Raña says that would be a form of discrimination, with bilingual children’s essays graded differently than those of children who write only in English.

It also struck Raña as odd that after adding more open-ended questions to the test, something that creates more room for creativity from students, Texas will have most of their responses read by a computer rather than a person.

The auto-grading program was first used to score essays from a smaller group of students who took the STAAR standardized test in December. Brown says school administrators have told him they saw a spike in the number of students who received zeros on their written responses.

“Some individual districts have been alarmed at the number of zeros that students are getting,” Brown says. “Whether it’s attributable to the machine grading, I think that’s too early to determine. The larger question is about how to accurately communicate to the families, where a child might have written an essay and gotten a zero on it, how to explain it. It’s a difficult thing to try to explain to somebody.”

A TEA spokesperson confirmed to the Dallas Morning News that previous versions of the STAAR test only gave zeros to responses that were blank or nonsensical, and the new rubric allows for zeros based on content.

High Stakes

Concerns about the possible consequences of using AI to grade standardized tests in Texas can’t be understood without also understanding the state’s school accountability system, says Brown.

The Texas Education Agency distills a wide swath of data, including results from the STAAR test, into a single letter grade of A through F for each district and school. It’s a system that feels out of touch to many, Brown says, and the stakes are high. The exam, along with the annual preparation for it, was described by one writer as “an anxiety-ridden circus for kids.”

The TEA can take over any school district that has five consecutive Fs, as it did in the fall with the massive Houston Independent School District. The takeover was triggered by the failing letter grades of just one out of its 274 schools, and both the superintendent and elected board of directors were replaced with state appointees. Since the takeover, there’s been seemingly nonstop news of protests over controversial changes at the “low-performing” schools.

“The accountability system is a source of consternation for school districts and parents because it just doesn’t feel like it connects sometimes to what’s actually happening in the classroom,” Brown says. “So any time I think you make a change in the assessment, because accountability [system] is a blunt force, it makes people overly concerned about the change. Especially in the absence of clear communication about what it is.”

Robison says that his organization, which represents teachers and school staff, advocates abolishing the STAAR test altogether. The addition of an opaque, automated scoring system isn’t helping state education officials build trust.

“There’s already a lot of mistrust over the STAAR and what it purports to represent and accomplish,” Robison says. “It doesn’t accurately measure student achievement, and there’s lots of suspicion that this will deepen the mistrust because of the way most of us were surprised by this.”