What does Hillary Clinton’s inbox look like?

I, too, am tired of hearing about Hillary’s use of a private email server. On the other hand, it led to a pretty neat data set to unpack: a dump of emails she’s sent and received.

I played around with this data set a bit and was particularly interested in how different groups of people interacted with Hillary. Did men use shorter sentences than women, for example? Did her staffers send one-liners while ambassadors sent lengthy emails? Did she have interesting relationships with people we might not be familiar with?

That's obviously a lot of questions to answer, but I'm starting to chip away at some of them, in part for a project for my natural language processing class. Lesson learned along the way: visualizing text is hard. The usual text visualizations out there, such as word clouds or circle packing, turned out to be too reductionist for some of the output I have, like topic models or k-means clusters.

For a simple representation to start, I generated a scatter plot with mpld3 (created by my NLP professor for another assignment), which renders interactive matplotlib graphs in the browser. It's clunky to navigate (you need to switch to zoom mode, drag a rectangle over the portion of the graph you want to zoom in on, then switch back to cursor mode to hover over words), but as a first step it's interesting to see which words appear together.

I’m using an n-gram analysis to figure out differences in word usage between genders (spoiler: there isn’t a significant difference).
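For anyone curious what an n-gram comparison between two groups can look like, here is a minimal sketch. It is an illustration under assumptions, not the script from the repo: the function names (`ngrams`, `ngram_frequencies`, `biggest_gaps`) are made up, tokenization is a naive lowercase-and-split, and the two groups are toy sentences rather than real emails.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(texts, n=2):
    """Relative frequency of each n-gram across a list of raw strings."""
    counts = Counter()
    for text in texts:
        # Naive tokenization (assumption): lowercase, split on whitespace.
        counts.update(ngrams(text.lower().split(), n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def biggest_gaps(freq_a, freq_b, top=5):
    """n-grams whose relative frequency differs most between two groups."""
    grams = set(freq_a) | set(freq_b)
    gap = {g: freq_a.get(g, 0.0) - freq_b.get(g, 0.0) for g in grams}
    return sorted(gap.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]

# Toy stand-ins for email bodies grouped by sender (not real data).
group_a = ["please call me", "please call the office"]
group_b = ["call me today", "see the memo"]

freq_a = ngram_frequencies(group_a, n=2)
freq_b = ngram_frequencies(group_b, n=2)
gaps = biggest_gaps(freq_a, freq_b)
```

Comparing relative rather than raw frequencies keeps a group that simply wrote more emails from dominating the list. Whether a gap is actually significant would still need a proper statistical test on top of this.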

The k-means clustering between genders was slightly more interesting:

Parameters: a threshold of 10 occurrences per word, 25 clusters, and a window size of 3. Side note: I need to increase the number of clusters, and perhaps the threshold as well.

Female k-clusters

Male k-clusters
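Here is a minimal sketch of how those three parameters (occurrence threshold, number of clusters, window size) can fit together. It assumes the window is a co-occurrence window around each word and that words are clustered on their raw co-occurrence counts; the post doesn't say how the word vectors were actually built, so treat this as one plausible reading. The names `context_vectors` and `kmeans` are hypothetical, and the toy data uses much smaller values than 10, 25, and 3.

```python
import random
from collections import Counter

def context_vectors(sentences, min_count=10, window=3):
    """Map each frequent word to counts of frequent words within `window` tokens."""
    counts = Counter(w for s in sentences for w in s)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            if w not in index:
                continue  # below the occurrence threshold
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i and s[j] in index:
                    vectors[w][index[s[j]]] += 1.0
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns {word: cluster index}."""
    words = sorted(vectors)
    rng = random.Random(seed)
    # Start from k randomly chosen words as the initial centroids.
    centroids = [list(vectors[w]) for w in rng.sample(words, k)]
    assignment = {}
    for _ in range(iterations):
        assignment = {
            w: min(range(k), key=lambda c: sq_dist(vectors[w], centroids[c]))
            for w in words
        }
        for c in range(k):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy tokenized "emails" (not real data); "memo" falls below the threshold.
sentences = (
    [["meeting", "schedule", "call"]] * 3
    + [["benghazi", "libya", "report"]] * 3
    + [["memo"]]
)
vectors = context_vectors(sentences, min_count=3, window=3)
clusters = kmeans(vectors, k=2)
```

With random initialization the result depends on the seed, so in practice it is worth running several restarts and raising `k` until the clusters stop lumping unrelated words together.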

I’m still slightly stumped on how to represent this well. In the meantime, I’ll be working on an interactive graphic to show some of the LDA topic modeling analysis I did, as well as on finding a clean way to display vocabulary differences between different categories of people.

This is obviously a work in progress. If you’d like to follow along, I’ll be actively working on this on GitHub for the next month (forgive the messy Python scripts).