
Introducing Text Analytics to Undergraduates

I teach a methods course called Research Methods in Digital Humanities that’s geared toward our digital humanities majors and technology and humanities graduate students. Our basic course materials include the useful Digital_Humanities text from MIT Press, Python tutorials from Codecademy, and a few videos on our class YouTube channel. My goals for the course are to introduce students to research in digital humanities through a variety of case studies, hands-on labs, and readings of transmedia projects. One of our hands-on labs is based on John Laudun’s activity for introducing undergraduates to computational methods for textual analysis (see his excellent blog post for more info).

The lab asks students to use computational tools to compare two or more texts along a number of dimensions: e.g., structure, length, themes, and word choice. They can work on this part in groups or individually. Then, using the results they’ve generated, they must write individual blog posts discussing their findings.

While Laudun uses one text – Richard Connell’s “The Most Dangerous Game” – I require students to use two or more texts they find on Project Gutenberg. I do so for a few reasons:

First, earlier in the semester I ask them to contribute to a data census in order to learn about metadata and data reuse. Requiring them to use Project Gutenberg ensures that they get some practice re-using data, and they spend their time analyzing rather than collecting data. It also allows them to choose texts they are actually interested in. As I learned the first time I walked through this exercise, texts I think are going to be engaging often aren’t. My excitement about comparing Jane Austen’s Emma to the film adaptations Emma (1996) and Clueless (1995) was met with blank stares for 45 minutes.

Second, comparing texts rather than analyzing a single text seems to help students understand why we do text analysis at all. They begin to see why we care about features such as structure and word choice when they see that authors have made different choices.

Third, comparing texts requires them to do most of these activities twice, and repetition helps them learn. Because most of my students have never used Python or Project Gutenberg or done any text analysis, it’s especially important to give them a chance to practice. When they do the analysis for the first text, it’s rough. They fail a lot – they get results that don’t make sense, they forget to change some setting in the code, whatever. Doing it again allows them to end on a high note – the second or third or nth time, the analysis goes smoothly, and they can see that they’ve made progress.

I assume you can install and run Python. I’ve forked Laudun’s Useful Python Scripts for Text repository. Clone the repo, run <code>pip install -r requirements.txt</code>, and you should be ready to get started.

Changes to Useful Python Scripts for Text
I’ve made a number of modifications to Laudun’s original scripts, and here’s some more info about what I changed and why.

First, I’m working my way through each file and editing it to use a main() function and an if __name__ == "__main__": main() call. Stack Overflow has a few good posts about why to do this. See What does `if __name__ == "__main__":` do?, for instance. The gist is that declaring and then calling a main function separates the function definitions from the code that should execute. It also means that the code in main() runs only when you run the file as a standalone script (i.e., not when you import it into other programs).
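Here’s a minimal sketch of that pattern. It isn’t code from the repository: count_words and the sample sentence are names and data I made up for illustration.

```python
def count_words(text):
    """Return the number of whitespace-separated tokens in text."""
    return len(text.split())


def main():
    """Runs only when this file is executed directly as a script."""
    # Hypothetical sample text, standing in for a file you would load.
    sample = "It was the best of times, it was the worst of times"
    print(count_words(sample))


if __name__ == "__main__":
    # True when you run `python thisfile.py`, False when another
    # program does `import thisfile`, so importing stays side-effect free.
    main()
```

Another program can then import the file and call count_words() on its own text without triggering the print in main().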

Second, I’m dividing the scripts into sets of functions. Why? In CPython, code inside a function generally runs faster than the same code at the top level of a script, because local variables are looked up more quickly than globals. Again, Stack Overflow has more info on why. Also, functions are cleaner than scripts and can be reused in other programs. If you’ve looked at any of my older code on GitHub, you know I used to write straight scripts all the time too. I’ve seen the function light.

Third, I’m changing the way stats.py counts lines, paragraphs, words, etc. to accommodate Project Gutenberg texts. In Laudun’s original code, each line was a paragraph, but Project Gutenberg texts have blank lines between paragraphs and multiple lines within paragraphs.
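To make the difference concrete, here’s a rough sketch of counting that treats blank lines as paragraph breaks. This is not the actual stats.py code; text_stats is a hypothetical name, and I’m assuming that any run of one or more blank lines separates two paragraphs.

```python
import re


def text_stats(text):
    """Count non-blank lines, paragraphs, and words in a plain-text string.

    Assumption: paragraphs are separated by one or more blank lines,
    as in Project Gutenberg plain-text files.
    """
    # Normalize Windows line endings so the blank-line pattern matches.
    text = text.replace("\r\n", "\n")
    lines = [line for line in text.split("\n") if line.strip()]
    # A paragraph break is a newline, optional whitespace, then a newline.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    words = text.split()
    return {
        "lines": len(lines),
        "paragraphs": len(paragraphs),
        "words": len(words),
    }
```

Given a text with two paragraphs, the first wrapped across two lines, this reports three lines but only two paragraphs. A line-per-paragraph count would report three paragraphs (or four, if it also counted the blank line).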

Fourth, eventually I will fix the wordcloud functions. They are based on word_cloud from amueller (read more on his blog), but they don’t work out of the box for me (or Laudun, apparently). For now, I have students use Wordle to make their word clouds. Adeline Koh also recommends some other word cloud activities and tools in her Hybrid Pedagogy article about introducing digital humanities to undergrads. She doesn’t discuss the Python and computational approaches that Laudun and I use, though.

