I, too, am tired of hearing about Hillary’s use of a private email server. On the other hand, it led to a pretty neat data set to unpack: a dump of emails she’s sent and received.
I played around with this data set a bit and was particularly interested in how different groups of people interacted with Hillary. Did men use shorter sentences than women, for example? Did her staffers send one-liners while ambassadors sent lengthy emails? Did she have interesting relationships with people we might not be familiar with?
That’s obviously a lot of questions to answer, but I’m starting to chip away at them, in part for a project I’m doing for my natural language processing class. Lesson learned along the way: visualizing text is hard. I found that the usual text visualizations out there, such as word clouds or circle packing, were too reductionist for some of the results I have, like topic models or k-means clusters.
For a simple representation to start, I generated a scatter plot using mpld3 (created by my NLP professor for another assignment), which renders interactive matplotlib graphs in the browser. It’s clunky to navigate (you need to switch to zoom mode, drag a rectangle over the part of the graph you want to zoom in on, then switch back to cursor mode to mouse over words), but as a first step it’s interesting to see which words appear together.
I’m using an n-gram analysis to figure out differences in word usage between genders (spoiler: there isn’t a significant difference).
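For anyone curious what an n-gram comparison looks like in practice, here is a minimal sketch, not the actual project script. It assumes the emails have already been split into two lists of body text by sender group; the function names and the toy emails are made up for illustration. It only computes raw frequency differences and does not run any significance test.

```python
from collections import Counter
import re


def ngrams(text, n=2):
    """Return the n-grams (as tuples) of the lowercased word tokens in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(emails, n=2):
    """Count n-grams across a list of email bodies."""
    counts = Counter()
    for body in emails:
        counts.update(ngrams(body, n))
    return counts


def relative_freq_diff(counts_a, counts_b):
    """For each n-gram: its share of group A's n-grams minus its share of group B's."""
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    keys = set(counts_a) | set(counts_b)
    return {k: counts_a[k] / total_a - counts_b[k] / total_b for k in keys}


# Hypothetical toy data standing in for two groups of senders.
group_a = ["Please call me when you land", "Please print this"]
group_b = ["Call me tomorrow", "Will do, thanks"]

diff = relative_freq_diff(ngram_counts(group_a), ngram_counts(group_b))
# Largest positive values lean toward group A, largest negative toward group B.
for gram, delta in sorted(diff.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(" ".join(gram), round(delta, 3))
```

Comparing shares rather than raw counts keeps a group that simply sent more email from dominating the result.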
The k-means clustering between genders was slightly more interesting:
(threshold: 10 occurrences per word, clusters: 25, window size: 3). Side note: I need to increase the number of clusters, and perhaps the threshold as well.
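To make those three parameters concrete, here is a rough pure-Python sketch of one way they could fit together. This is my guess at a reasonable setup rather than a copy of the project code: I’m assuming the threshold drops rare words, the window defines which neighbors count as context, and the clusters come from plain k-means over those context counts. The names `context_vectors` and `kmeans` are hypothetical.

```python
from collections import Counter
import random


def context_vectors(sentences, min_count=10, window=3):
    """Map each frequent word to counts of frequent words within `window` tokens of it."""
    freq = Counter(w for s in sentences for w in s)
    vocab = sorted(w for w, c in freq.items() if c >= min_count)
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            if w not in index:
                continue  # word is below the occurrence threshold
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i and s[j] in index:
                    vectors[w][index[s[j]]] += 1.0
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns {word: cluster_id}. Needs k <= number of words."""
    words = sorted(vectors)
    rng = random.Random(seed)
    centroids = [list(vectors[w]) for w in rng.sample(words, k)]
    assignment = {}
    for _ in range(iterations):
        # Assign each word to its nearest centroid.
        for w in words:
            assignment[w] = min(range(k), key=lambda c: sq_dist(vectors[w], centroids[c]))
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical toy corpus: tokenized sentences, with a low threshold so it runs.
toy = [["send", "the", "memo"], ["print", "the", "memo"]] * 5
clusters = kmeans(context_vectors(toy, min_count=5, window=3), k=2)
print(clusters)
```

On a real corpus you would call it with `min_count=10`, `window=3` and `k=25` to match the settings above, and a library implementation of k-means would be much faster than this loop.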
I’m still slightly stumped on how to represent this well. In the meantime, I’ll be working on creating an interactive to show some of the LDA topic modeling analysis I did as well as finding a clean way to display vocabulary differences between different categories of people.
This is obviously a work in progress — if you’d like to follow along, I’m actively working on this for the next month on GitHub (forgive the messy Python scripts).