Textual Analysis: The “Text” Part

Another marvelously instructive post from HASTAC scholar Tobias Hrynick, Department of History, Fordham University. Here he outlines some systematic tips and guidelines for creating a digitized version of a text. 

The digital analysis of text constitutes the core of the digital humanities. It is here that Roberto Busa and his team began, and though subsequent scholars have expanded somewhat, exploring the possibilities of digital platforms for applying geographic analysis or presenting scholarship to wider audiences, the humanists’ interest in text has ensured the growth of a healthy trunk directly up from the root, along with the subsequent branches.

Necessary for all such projects at the outset, however, is the creation of a machine-readable text, on which digital analytical tools can be brought to bear. This process is generally more tedious than difficult, but it is nevertheless fundamental to digital scholarship, and a degree of nuance can be applied to it. What follows is a basic introduction to some of the appropriate techniques, intended to highlight useful tools, including some (such as AntConc and Juxta) which also have powerful and user-friendly analytic functionality.

Acquiring Text

The appropriate way of acquiring a machine-readable text file (generally a .txt file, or some format which can be easily converted to .txt), and the difficulty involved in doing so, vary according to several factors. Often, a digital version of the text will already exist, provided the text is old enough that its copyright has expired, or new enough that it was published digitally. Google Books, Project Gutenberg, and Archive.org all maintain substantial databases of free digital material. These texts, however, are all prone to errors – Google Books and Archive.org texts are generally created with a process of scanning and automated processing that is likely to produce just as many errors as performing this process yourself.

Such automated processing is called Optical Character Recognition (OCR). If you are working from a print book, it also requires a great deal of labor-intensive scanning – though a purpose-built book scanner with a v-shaped cradle will speed the work considerably, and a pair of headphones will do a great deal to make the process more bearable.

Once you have .pdf or other image files of all the relevant text pages, these files can be processed by one of a number of OCR software packages. Unfortunately, while freeware OCR software does exist, most of the best software is paid. Adobe Acrobat (not to be confused with the freely available Adobe Reader) is the most common, but another program, ABBYY FineReader, deserves special mention for its additional flexibility, particularly with more complicated page layouts, and for its free trial version.

As a quick glance through the .html version of any Archive.org book will confirm, the outcome of an OCRing process is far from a clean copy. If a clean copy is required, you will need to expend considerable effort editing the text.
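
Some of that editing can be mechanized before a human proofreader ever sees the file. As a rough sketch (plain Python, standard library only; the function name and sample string are invented for the example), the following fixes two artifacts almost every OCRed page contains – words hyphenated across line breaks, and hard line wraps inside paragraphs:

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Apply mechanical clean-up to raw OCR output before hand-editing."""
    # Re-join words hyphenated across a line break: "consider-\nable" -> "considerable"
    text = re.sub(r"-\n(\w)", r"\1", raw)
    # Join hard line wraps into continuous paragraphs,
    # keeping blank lines as paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces left behind by columns and margins
    text = re.sub(r" {2,}", " ", text)
    return text.strip()

raw = "This process requires consider-\nable effort to  edit\nthe resulting text.\n\nA new paragraph."
print(clean_ocr_text(raw))
```

None of this replaces proofreading, but it removes the most mechanical category of errors in a single pass.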

The other option is to simply re-type a given text in print or image format into a text editor – both Apple and Windows machines come with native text-editors, but if you are typing at length into such an editor, you might prefer a product like Atom or Notepad++. Neither of these platforms provides any crucial additional functionality, but both offer tabbed displays, which can be useful for editing multiple files in parallel; line numbers, which are helpful for quickly referencing sections of text; and a range of display options, which can make looking at the screen for long periods of time more pleasant. Alternately, you can choose to type out text in a word processor and then copy and paste it into a plain-text editor.

Assuming there is no satisfactory digital version of your text already available, the choice between scanning and OCRing on the one hand, and manually retyping on the other, should be made with the following factors in mind:

  1. How long is your text?

This is important for two reasons. First, the longer a text is, the more the time advantage of OCR comes into play. Second, the longer a text is, the more acceptable individual errors within it become, which can sometimes make the time-consuming process of editing OCRed text by hand less critical. Digital textual analysis is best at making the kind of broad arguments about a text in which a large sample size can insulate against the effects of particular errors.

The problem with this argument in favor of OCR is that it assumes the errors produced will be essentially random. When OCR systems make mistakes, however, they are likely to make the same mistake over and over again – particularly common are confusions between the letters i, n, m, and r, and various combinations thereof – and such errors are likely to cascade across the whole text file. A human typist might make more errors over the course of a text, especially a long text in a clear type-face, but the human's errors are likely to be more random, which a large sample size can more easily render irrelevant.

That said, OCR should still generally be favored for longer texts. While automated errors can skew your results more severely than human ones, they are also more amenable to automated correction, as will be discussed in the next section.
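
To give a sense of what such automated correction looks like in practice, here is a minimal Python sketch. The substitution table is entirely hypothetical – in a real project you would build it from the recurring misreadings you actually find in your own file:

```python
import re

# Hypothetical substitution table: recurring OCR misreadings mapped
# back to the intended words (here, "rn" misread as "m").
CORRECTIONS = {
    "modem": "modern",
    "comer": "corner",
}

def correct_recurring_errors(text: str, corrections: dict) -> str:
    # \b ensures only whole words are replaced, so "album" is left alone
    for wrong, right in corrections.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text

sample = "The modem reader turns the comer of the page."
print(correct_recurring_errors(sample, CORRECTIONS))
# -> The modern reader turns the corner of the page.
```

The obvious caveat: "modem" and "comer" are real words, so a table like this can introduce errors of its own – which is why building it from a word list of your particular text is safer than applying a generic list.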

  2. What is the quality of your print or image version?

Several features of a text which might cause a human reader to stumble only momentarily will cripple an OCR system's ability to render good text. Some such problems include:

  • A heavily worn type-face.
  • An unusual type-face (such as Fraktur).
  • Thin pages, with ink showing through from the opposite side.

If your text or image has any of these features, you can always try OCRing to check the extent of the problem, but it is wise to prepare yourself for disappointment and typing.

  3. How do you want to analyze the text?

Different kinds of study demand different textual qualities. Would you like to know how many times the definite article occurs relative to the indefinite article in the works of different writers? You probably don’t need a terribly high-quality file to make such a study feasible. Do you want to create a topic model (a study of which words tend to occur together)? Accuracy is important, but a fair number of mistakes might be acceptable if you have a longer text. Do you intend to make a digital critical edition highlighting differences between successive printings of nineteenth-century novels? You will require scrupulous accuracy. None of these possibilities totally precludes OCRing, especially for longer texts, but if you choose to OCR, expect a great deal of post-processing, and if the text is relatively short, you might be better served simply retyping it.
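
The first of these studies is simple enough to sketch in a few lines of Python (the function name and sample sentence are invented for illustration):

```python
import re
from collections import Counter

def article_ratio(text: str) -> float:
    """Ratio of the definite article to the indefinite articles."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    definite = words["the"]
    indefinite = words["a"] + words["an"]
    return definite / indefinite if indefinite else float("inf")

print(article_ratio("The cat sat on a mat near the door, by an open window."))
# -> 1.0
```

A study like this tolerates a noisy file: a handful of misread articles barely moves the ratio across a long text.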

Cleaning Text

Once you have produced a digital text, either manually or automatically, there are several steps you can take to help reduce any errors you may have inadvertently introduced. Ultimately, there is no substitute for reading, slowly and out loud, by an experienced proof-reader. A few automated approaches, however, can help to limit the labor for this proof-reader or, if the required text quality is not high, eliminate the necessity altogether.

  1. Quick and Dirty: Quickly correcting the most prominent mistakes in an OCRed text file.

One good way of correcting some of the most blatant errors which may have been introduced, particularly the recurring errors which are common in the OCRing process, is with the use of concordancing software – software which generates a list of all the words which occur in a text. One such program is AntConc, which is available for free download, and which contains a package of useful visualization tools as well.

Once you have produced a text file, you can load it using AntConc and click on the tab labeled Word List. This will produce a list of all the words occurring in the text, listed in order of frequency. Read through this list, noting down any non-words, or words whose presence in the text would be particularly surprising. Once you have noted down all the obvious and likely mistakes, you can correct them using the Find and Find and Replace tools in your preferred text editor.
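
If you later want to script this step, the core of such a word list is only a few lines of Python (a rough sketch of the same idea, not a replacement for AntConc's interface; the sample string is invented):

```python
import re
from collections import Counter

def word_list(text: str) -> list:
    """Rough equivalent of AntConc's Word List tab: words ranked by frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

# Non-words in the list ("tbe" here) are candidates for Find and Replace.
sample = "The scanner read tbe word the as tbe twice."
print(word_list(sample)[:3])
# -> [('the', 2), ('tbe', 2), ('scanner', 1)]
```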

This method of correction is far from fool-proof. Some subtle substitutions of one plausible word for another will likely remain. This is, however, a good way of quickly eliminating the most glaring errors from your text file.

A similar effect can be achieved using the spell-check and grammar-check functions of a word processor, but there are several reasons the concordance method is generally preferable. First, reading through the list of words present in the text will tend to draw your attention to words which are real, but unlikely to be accurate readings in the context of the text – words which would be overlooked by spelling and grammar checks. Second, a concordancer will present all the variants of a given word which occur in the text – through alternate spelling, use of synonyms, or varying grammatical forms (singular vs. plural, past vs. future) – which might be significant for your analysis.

  2. Slow and Clean: Cross-Collating Multiple Text Files

A more extreme way of correcting digitized text is to produce multiple versions and to collate them together. Because similar errors are so often repeated across OCR versions, comparing two OCR versions is of limited use (although if you have access to more than one OCR program, it might be worth trying). Comparing two hand-typed versions, or a hand-typed version and an OCRed version, is of greater potential use, since these are much less likely to contain identical errors.

Cross-comparison of two documents can be accomplished even using the merge document tools on Microsoft Word. A somewhat more sophisticated tool which can accomplish the same task is Juxta. This is an online platform (available also as a downloadable application), which is designed primarily to help produce editions from multiple varying manuscripts or editions, but which is just as effective as a way of highlighting errors which were introduced in the process of digitization.
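
To see what collation does under the hood, Python's standard difflib module can produce a similar word-level comparison (a minimal sketch, not a substitute for Juxta's visual interface; the sample transcriptions are invented):

```python
import difflib

def collate(version_a: str, version_b: str) -> list:
    """List the word-level differences between two transcriptions."""
    a, b = version_a.split(), version_b.split()
    diffs = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op != "equal":
            diffs.append(f"{op}: {' '.join(a[i1:i2])!r} vs {' '.join(b[j1:j2])!r}")
    return diffs

typed = "It was the best of times, it was the worst of times"
ocred = "It was the best of tirnes, it was the worst of times"
print(collate(typed, ocred))
```

In practice you would read the two versions from files and review each reported difference against the original page.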

This process is a relatively thorough way of identifying errors in digitized text, and can even identify variations that might escape the attention of human proofreaders. The major weakness of the technique, however, is that it requires you to go through the effort of producing multiple different versions, ideally including one human-typed version. If you need a scrupulously corrected digital text, however, it is a powerful tool in your belt, and in the event that multiple digital versions of your text have already been produced, it is an excellent way of using them in concert with one another – another strength of the Juxta platform is that you can collate together many different versions of the same text at once.

Conclusion

Once you have a digitized and cleaned version of the text in which you are interested, a world of possibilities opens up. At a minimum, you should be able to use computer search functions to quickly locate relevant sections within the text, while at maximum you might choose to perform complex statistical analysis, using a coding language like R or Python.

A good way to start exploring some of the possibilities of digital textual analysis is to go back and consider some of the tools associated with AntConc other than its ability to concordance a text. AntConc can be used to visualize occurrences of a word or phrase throughout a text, and to identify words which frequently occur together. Another useful tool for beginners interested in text analysis is Voyant, which creates topic models – visualizations of words which frequently occur together in the text, which can help to highlight key topics.

Imagining Digital Pedagogy at Fordham

This is your life:

You just finished teaching your American History class. You slam-dunked a lecture on the transcontinental railroad’s influence on national commerce, communication, and territorial expansion. Students nodded, took vigorous notes, and were eager to participate in a lively discussion following your lecture. It was a good class. You think to yourself: tweed blazers with elbow-patches do help you scrutinize the past and question mainstream ideas more effectively. As you make a note to add more iron-on patches to your shopping cart on Amazon, you see a particularly eager student waiting to catch your attention after class.

This student–probably two weeks shy of declaring a history major–stays behind to tell you about her family’s connection to the U.S. railroad industry. As you wipe the dry-erase board clean, she draws insightful connections between your lecture and her family’s experience in Tennessee. Apparently, this student’s family owned a company that helped establish, build, and expand railroad lines in the region in the 1880s. She’s excited about the connection. She wants to understand her family’s influence on railroad growth in a broader historical context. She’s eager to use the research tools you’ve helped her cultivate. You know, there might just be elbow-patches in her future.

You give a passing nod to the frazzled composition instructor who teaches in the room after you; he’s carrying a stack of freshly graded three-paragraph essays and looks tired. In the hall, you continue talking to the student, asking leading questions, and giving insights–just as you begin to encourage her to explore the topic in her final paper, you realize: “I don’t want to read that.”

Let me rephrase. It’s not a question of what you want, exactly. You care about the student’s development as a writer, and you don’t question her ability to make a convincing historical argument. Rather, this student’s project presents a genre problem. An 8-page research essay on a Tennessee railroad, regional geography, and national commerce could indeed be compelling (hell, I’d read it). Academic prose, however, might not be the most appropriate genre for communicating geographical expansion over time; papers are an inherently limited, linear format. This research is perfectly suited for something more dynamic–like a digital map.

Anelise H. Shrout, Postdoctoral Fellow in Digital Studies, shared an experience similar to this in her workshop on Digital Pedagogy on Friday, October 16th. In this session, Shrout encouraged an interdisciplinary group of Fordham graduate students and staff to thoughtfully integrate digital assignments into undergraduate courses.

Not only are some assignments better suited for digital media, but, according to Shrout, an online publication platform will give student work a life beyond the classroom. Student research doesn’t have to be limited to a conversation at the dry-erase board or a document, stapled with one-inch margins. For example, if the aforementioned student created a Neatline map that tracked the growth of her family’s railroad over time, she could share her final product with her family and circulate it to people within the region of influence. Encouraging students to share the fruits of their research with people outside of academia might just spark intellectual curiosity and critical thinking in the vast elsewhere incorporated by the internet. Believe me, as a kid who grew up with spotty dial-up in the middle of nowhere, I know access and exposure to quality humanistic work can be transformative. And, yes, I’ll go there: if we are truly committed to “the discovery of Wisdom and the transmission of Learning” as our Jesuit mission would suggest, incorporating digital pedagogy can do a world of good.

Bringing computer power to old questions does not water-down the values humanists hold dear. Instead, digital innovation can help breathe new life into our teaching and research. As Shrout puts it, computers can help free up brain space for us and give us more mental energy to tackle big questions. Why not help our students understand humanistic inquiry through, against, and alongside the digital media that binds many of our social networks together?

Throughout the workshop, Shrout offered useful insights on evaluation and implementation of digital projects based on her extensive experience. She warned teachers that the guidelines need to be clear and evaluation must be explicit and fair. Even if you free yourself from the mountain of three-paragraph essays, you face new obstacles of evaluation. As someone who has enthusiastically embraced digital research and pedagogy, I’m with Shrout–I think these obstacles are worth taking on.

And in case you missed it, she offered several good avenues for the hows of digital pedagogy. I challenge you to take from this grab-bag of stellar digital tools (ranked from easiest implementation to most complex):

Post by: Christy L. Pottroff

Exciting Spring Events!

After a hiatus last semester, the Fordham Graduate Student Digital Humanities Group is back with a bang.  We’ve got a great list of events coming up, and two series going on.

FGSDH Events
Rose Hill Campus, 2pm-3pm
February 4: Debates in the Digital Humanities
February 25: Digital Pedagogy
March 25: Building and Maintaining an Online Profile
April 18: Wikipedia Edit-A-Thon

Topics in Digital Mapping Events
Lincoln Center Campus, Workshops 3-5pm, Meet & Greet 2-3pm
February 11: Thinking about Time with Maps: Timelines/Palladio
March 4: Georectifying/MapWarper
April 15: Intro to CartoDB

Presentation on “Digital Humanities” Graduate Course at Pratt – 12/4/13

Last week, I presented to the Fordham Graduate Student Digital Humanities Group on the course I have been taking during the Fall 2013 semester at the Pratt Institute. While the class is taught in a Library Science Masters program, the professor (Chris Sula) and the bulk of readings and discussion are not library-specific. Below is a link to my presentation, which includes hyperlinks to several of the resources used in the class.

My part of the discussion was to show how a graduate-level course specifically on Digital Humanities can be structured. The benefit of the way this class was laid out (as well as the assignments required) has been the focus on learning how this emerging field works socially, theoretically, and practically. This means that we did not focus on learning specific tools, although we were briefly introduced to and encouraged to play with several. Instead, we focused on what Digital Humanities research looks like; how DH is being adopted within and across the humanities; how to start, manage, and preserve projects; and how to integrate thinking about the user into a project’s development.

After laying out this model, the group discussed whether such a course would be possible or appropriate to initiate at Fordham. Our discussion brought up a variety of concerns and ideas of how DH fits into the Fordham graduate experience – with respect to both research and teaching. There was enthusiasm for creating a Research Methods course for humanists (ex: for English and History students) to teach and discuss both traditional and DH methods of research. The thirst for integrating DH methods and traditional research was a promising result of this meeting.

Thanks to everyone who attended. We look forward to hosting some great events in Spring 2014!

– Kristen Mapes

Tomorrow (Dec. 4), 12:30-2:00pm, Dealy 115 – Talk & Discussion led by Kristen Mapes on Digital Humanities Class

Please join us tomorrow, Dec. 4, from 12:30-2:00pm in Dealy 115. Kristen Mapes will speak about taking “Digital Humanities” as a graduate level course at the Pratt Institute.

Topics to be discussed: What topics are covered? How are they addressed? What is the value of taking a DH-specific class rather than simply incorporating DH into pre-existing classes?

This will be an informal conversation about Digital Humanities as a course topic and the graduate student perspective on learning about DH in a formal way. Come to hear and discuss (and eat cookies) tomorrow at 12:30 in Dealy 115!

See you there!