Calculating ratios for different types of pronouns in civil rights speeches. Counting the frequency and proximity of vowel sounds, consonant sounds and rhymes in rap music.
Those are just two examples of the projects students have taken on in teacher Peter Nilsson’s “Distant Reading” course at Deerfield Academy, an independent boarding school in western Massachusetts.
What is “distant reading”? It’s the analysis of a large body of literary texts using some sort of algorithm (it can also involve data visualization). The name stands in opposition to “close reading,” in which, generally, one text, or narrow passages of that text, is pored over in detail. But scholars can employ a bit of both in any given project, and can apply the tools of distant reading to a single work of literature, says Mark Marino, the director of communication at the Electronic Literature Organization, a group that focuses on the writing, reading and publishing of literature in the digital age.
Nilsson says that at its core, the course he created with three of his fellow teachers (another English teacher and two computer science teachers) is about using mathematical software (in this case, Wolfram Mathematica) to analyze texts, such as Donald Trump’s Wikipedia page and Tolstoy’s “War and Peace,” and gain fresh insights.
“Software today enables you to analyze text mathematically very easily,” Nilsson says. “Once you have the ability to do that, then you have the ability to explore texts in really fascinating ways,” such as “text mining,” which includes analyzing how frequently certain words appear.
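A word-frequency count of this kind can be sketched in a few lines. The sketch below uses Python rather than the Wolfram Language the Deerfield course relied on, and the function name `word_frequencies`, the sample sentence and the pronoun groupings are all illustrative assumptions, not material from the course.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word appears in text."""
    # Treat runs of letters (with optional internal straight apostrophes) as words.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)


# Invented sample text, used only for illustration.
sample = "We march today. We will not turn back. I know we can win."
counts = word_frequencies(sample)

# One possible pronoun ratio: first-person plural vs. first-person singular.
plural = counts["we"] + counts["us"] + counts["our"]
singular = counts["i"] + counts["me"] + counts["my"]

print(counts.most_common(3))
print(plural / singular)
```

The same counts could feed other comparisons, such as the pronoun ratios in speeches mentioned at the top of this story. A real project would need to guard against a zero count in the denominator.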
Zero Programming Experience
Keeping in mind that most of their students had zero programming experience, Nilsson and his fellow teachers structured the course around five different projects.
The first lasted about three days, and it got the students to start thinking quantitatively about words. They explored and played around with the Google Ngram Viewer, a search engine that can show the use of certain phrases and parts of speech in books. Then, the instructors introduced students to Wolfram Language, the programming language used in Wolfram Mathematica. Students started with two warm-up projects where they experienced writing in the programming language.
In the first year of the course, all students analyzed Shakespeare’s “Hamlet” and a few texts of their choice. For the second warm-up assignment, they analyzed style and usage in their own high school writing (from papers they still had on their laptops), comparing it to that of literary authors. In the second year of the course, both warm-up projects consisted of students comparing their own writing to literary selections they chose.
Nilsson says although Wolfram Language is intuitive, there is still some syntax that requires time to figure out. The educators thought of this part of the course as akin to “an immersion classroom for a foreign language,” where one learns a new language solely in that language.
“We just said, look, here’s some things you can do,” Nilsson says. “And then we said, now you do it. And they imitated mostly, but with their own data, so they had different results—results that were fueled by their own interests.”
Finally, the students had to complete two independently designed projects. Now familiar with some of the capabilities of the software, they chose their own topics (such as social networks in the Bible and home-city newspapers’ coverage of their sports teams versus that of rival city newspapers), asked questions and used the tools to answer them.
Open Source Tools
Nilsson points out that he is not the first person to venture into this type of computational approach to literature. Take the Electronic Literature Organization, the group mentioned earlier.
There’s also the Stanford Literary Lab, where Mark Algee-Hewitt serves as director. An example of what the lab is working on? A project on Harry Potter fanfiction. Graduate students, as well as some undergraduate students, work in the lab alongside faculty. The undergraduates working in the lab include English and computer science majors.
At Stanford, undergraduates can also pick from several classes that merge computer science and the humanities.
Algee-Hewitt teaches one of those classes. In the course, called “Literary Text Mining,” students learn how to use natural language processing, coding and statistics to find new information about literature and then interpret it, making literary arguments. Typically, Algee-Hewitt uses a corpus of short stories around a theme, such as detective stories or children’s stories. At certain times, he also has students analyze additional texts.
“Last year, when they learned topic modeling (and discussed what a topic was from either a computational or literary perspective), we used popular young adult fiction to illustrate that a topic could be anything from ‘growing up’ to ‘horses,’ depending on the scale of what we think a ‘topic’ is,” Algee-Hewitt adds. “We used poetry to look at authorship attribution because poets tend to be more sparse in their language choices.”
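Which attribution method the class used isn’t specified. One common starting point, shown here only as a minimal Python sketch, is to compare how often texts use a fixed set of function words. The word list, the function names and the sample snippets below are all invented for illustration and are not drawn from Algee-Hewitt’s course.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny list of function words; real studies use many more.
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "a", "i", "my"]


def profile(text):
    """Return the relative frequency of each function word in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_author(unknown_text, known_texts):
    """Pick the author whose known text has the closest function-word profile."""
    target = profile(unknown_text)
    return max(
        known_texts,
        key=lambda author: cosine_similarity(target, profile(known_texts[author])),
    )


# Invented snippets standing in for two poets' known work.
known = {
    "poet_a": "the sea and the sky and the stone of the shore",
    "poet_b": "i walk to my door, i turn to my window, i sing to my heart",
}
guess = most_similar_author("the wind and the rain of the hills", known)
print(guess)
```

With texts this short the result is only a toy. Every choice here (which words to count, how to measure similarity) is the kind of decision a researcher has to make deliberately.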
For their homework assignments, students get to select a collection of texts they’d like to work on. “We work on the class corpus throughout the quarter, and then when they have homework assignments, they use their own corpus. So, they see how to do the method on something that everyone is familiar with in class—and then on their own, they do it again with texts that they are interested in.”
As in Nilsson’s class, students enrolled in Algee-Hewitt’s course don’t need to have computer science backgrounds. In fact, he says the vast majority of them are humanities students who have no CS knowledge whatsoever. They just need the willingness to learn throughout the ten weeks of the class.
Sometimes Algee-Hewitt teaches the class in the programming language R, other times in Python, and sometimes in both. Either way, he says he always goes with an open-source programming environment for this particular class, which is very methodology-focused. He explains that he wants students to understand “from the ground up” exactly what they’re doing. A danger of tools, he adds, is that they can make a lot of decisions for you.
“I want students to be very aware of all the decisions that go into every aspect,” Algee-Hewitt says.
Unstructured Play in Learning
Algee-Hewitt says using computer science to analyze literature can be “really liberating” for students, giving them a set of tools to help them interact with their mostly digital world. As an example, he points to a student in his literary text mining class last year who did her project on apologies given by famous men who were caught up in the Me Too movement. Algee-Hewitt says she approached him after the class asking for help expanding her project and getting the research published.
“She feels like superwoman now, because she has the ability to do these really intense analyses of text in a way that she has never thought of before,” Algee-Hewitt says.
The class ran for two years at Deerfield, each time with two sections. The first year, each section was taught by a paired team of a computer science teacher and an English teacher. The second year, Nilsson taught one of the sections on his own. “Distant Reading” is currently on hiatus, as Nilsson is on sabbatical, but he plans to offer it again.
At the end of the day, Nilsson and the other teachers who built the elective didn’t want students to only learn the syntax of the programming language and some of the functions.
“We realized over the course of teaching this that we want them to learn question formulation, problem decomposition and argumentation,” Nilsson says. They also wanted students to experience unstructured play in learning, structured experimentation and persistence.
“One of the things that we’re going for,” he says, “is helping them understand that they can mix disciplines—that they can think computationally about texts, that they can apply computer science skills and thinking to other areas of life than what they might think.”