If someone says “gothic” to you, do you think of the lush rolling countryside or a sunny day? Chances are you don’t. Most people – myself included – associate that word with the dark, mysterious, and even frightening. Maybe you picture ornate stone architecture of a castle with gargoyles. Or perhaps foreboding skies rolling over said castle. Or very morose, pale people wearing black capes and veils. Or vampires with all of the above. This post details my first experience with the digital humanities.
About a year ago, Caroline Winter, a PhD student at the University of Victoria, emailed me with a question. She had assembled a corpus of 134 works of European Gothic literature that had been written or translated into English, ranging from the 18th century to the early 20th. Caroline had a hunch that gothic literature was more vibrant than most people thought, but lacked the quantitative background to analyze her large data set. Could I write a short script to count and analyze color words within her corpus? This post details my first experience with the digital humanities — applying simple computational tools programmers use every day to the data relevant to traditional humanities disciplines.
Originally a quick Python project for a Sunday afternoon, my journey attempting to answer Caroline’s question eventually turned into a talk at PyCon. Through some pretty straightforward counting and matching techniques, we were able to find several interesting patterns that challenged my gloomy picture of “gothic”. To follow along using Phantom of the Opera as an example text, take a look at the companion Jupyter Notebook on Github.
Beyond black & white
The first step in the project was to define which color words we were looking for. The challenge here was that both the vocabulary used to describe color and the actual coloring of objects themselves were different in the gothic era than in present day.
Rather than guess about historical color words, we turned to the Oxford English Dictionary’s Historical Thesaurus (hereafter the Historical Thesaurus). It lists color words used in English and primarily in Europe, the year of each one’s first recorded use, and its color family.
After adding some html color names based on color grouping to our csv file of the original data set I read a csv file with the Historical Thesaurus data into a short function and eliminated everything that came into usage after 1914, since it’s not clear from the data when words fell out of usage.
def id_color_words(): “”” Gets color words from the csv file and puts them into a dict where key = word and value = (hex value, color family); removes pre-1914 color words. “”” color_word_dict = {} modern_color_words = [] color_data = csv.DictReader(open(‘./color_names.csv’), delimiter=”,”, quotechar='”‘) for row in color_data: return color_word_dict, modern_color_words |
ACCURATE ANALYSIS
This gave us a dictionary of 980 pre-WWI color words ranging from the familiar, like blue (first usage in 1300), crimson (1416), or jet (1607), to the uncommon, like corbeau (1810, dark green), damask (1598, pink) or ochroid (1897, pale yellow). There were also some instances where the way words were categorized reflected a historical state of familiar things. For example, ‘glass’ is categorized as a greyish green, not pale blue or clear as we may think of it today.
Now we knew what we were looking for, but generating an accurate analysis was about more than simply counting these color words.
‘rose’ != ‘rose’ != ‘rose’
English is a tricky language, with many words that sound the same meaning different things and many words that look the same meaning different things depending on their context. ‘Rose’ is a great example: it can be a noun, adjective, or verb, as demonstrated in the gif below.
So which words should we count? Should every word on the list be included?
To make this decision, we needed to write more code to parse our corpus and look at the results.
I used the function below to get the text ready for analysis. It does three things. First, it reads in the .txt file for the work we’re analyzing. Then, the function removes the “gristle” of stop words, punctuation, and upper case letters. Finally, it leverages the pos_tag function from the Natural Language Toolkit (NLTK) to tag each remaining word as a part of speech (noun, verb, adjective, etc.).
FUNCTION
def process_text(filename): “”” This function generates a list of tokens with punctuation stopwords, and spaces removed for the whole text. It also applies NLTK’s part of speech tagging function to determine if words are nouns, adjectives, verbs, etc. “”” text_tokens = [] # create a list of the words and punctuation we want to remove before analyzing with open(filename) as text: return text_tokens |
This function outputs the whole text that looks like this – as you can see the NLTK pos_tag doesn’t look like it gets the part of speech correct every time, but it’s pretty close.
Next, we needed to isolate the color words from the text and do some analysis of the context to make sure there weren’t any glaring issues in the data we were generating. Here Caroline’s literature background was extremely helpful in identifying what looked inaccurate, and I went off to pull out the context of the suspicious words so she could make a final call.
- Isabella, a yellowish color that was also the name of a couple of characters in our corpus;
- Imperial, a purple color that in the texts actually meant the political structure, not the color; and
- Angry, sometimes used to describe a red-pink flushed color, but was used more often as an emotion word than a color word.
Looking at the preliminary data, we also found that we got some combinations of color words, like “rose” followed by “red” or “milky” followed by “white”. While the Historical Thesaurus covers common combinations of these when they’re joined with a “-” (e.g. “rose-red”) we decided to isolate those examples in the output of the find_color_words to help us determine if we wanted to exclude those samples from the final analysis.
Analysis & Visualization – the (really) fun part
With adjustments made to the color word list, we can run the tagged text through the find_color_words function below and see both the concurrent color words and the full list. To do this, the code below leverages Python’s itertools with a couple of helper functions: pairwise and is_color_word.
def pairwise(iterable): “”” Returns a zip object, consisting of tuple pairs of words where the second word of tuple #1 will be the first word of tuple #2, e.g. [(‘room’, ‘NN’), (‘perfume’, ‘NN’), (‘lady’, ‘NN’), (‘black’, ‘JJ’)] from our `processed` variable becomes: [((‘room’, ‘NN’), (‘perfume’, ‘NN’)) ((‘perfume’, ‘NN’), (‘lady’, ‘NN’)) ((‘lady’, ‘NN’), (‘black’, ‘JJ’)) ((‘black’, ‘JJ’), (‘contents’, ‘NNS’)] “”” a, b = tee(iterable) next(b, None) return zip(a, b) def is_color_word(word, color_dict): def find_color_words(t, color_dict): for o in t: for p, n in pairwise(t): return color_words, concurrent_color_words |
Template: Here’s what we get from this function.First, a list of all of the identified color words in the text and their tag, like this:[(‘yellow’, ‘JJ’), (‘black’, ‘JJ’), (‘mourning’, ‘NN’), (‘rose-red’, ‘JJ’), (‘lily-white’, ‘JJ’), (‘black’, ‘JJ’), (‘black’, ‘JJ’), (‘black’, ‘JJ’), (‘white’, ‘JJ’), (‘yellow’, ‘NN’), (‘plum’, ‘NN’), (‘glass’, ‘NN’), (‘red’, ‘JJ’), (‘coral’, ‘JJ’), (‘pink’, ‘NN’), (‘iron’, ‘NN’), (‘glass’, ‘NN’), (‘pink’, ‘JJ’), (‘candid’, ‘JJ’), (‘blue’, ‘JJ’)]Second, we get a list of tuples containing the color words that were adjectives or nouns closely followed by another adjective or noun in the original text. From The Phantom of the Opera, we get examples like:((‘glass’, ‘NN’), (‘champagne’, ‘NN’)) ((‘pink’, ‘NN’), (‘white’, ‘JJ’)) ((‘gold’, ‘NN’), (‘purple’, ‘NN’)) ((‘water’, ‘NN’), (‘bluey’, ‘NN’))In most cases we didn’t think one of these took anything away from or obscured the other; in fact their close association often painted a clearer picture of color texture. So we left both words in.From this you can get some summary stats, like what percentage of all uncommon words in the text were color words (Phantom is 0.9%), and what proportion are nouns versus adjectives (Phantom is 52-47).But the really fun part is using those HTML color groups to plot the use of color in the text.The Jupyter Notebook contains a couple of examples with matplotlib that are really straightforward to implement, like this bar chart showing the colors used in The Phantom of the Opera. Kite created a Github repository here where you can access the code from this and other posts on their blog.There are many of interesting options for visualizing this data. The original talk included a website, built with the Django framework, ChartJS, and lots of CSS – online here – where we visualized each
book as a series of color blocks in their order of appearance.Even with the limitations of HTML color applied to a broad palette,you’ll see that a lot of the books are not as dark and gloomy as their “gothic” label might lead you to believe. This makes sense: the supernatural is a strong theme in Gothic literature, but so is contrasting it with the beauty of the natural world that was considered both a haven and a dwindling reality during the dawn of the industrial revolution.Note: This Article Originally Appeared On Kite