Folders and Topics

I’ve mentioned previously that finding relevant documents in the folders in the boxes at NARA can be tricky because of the nomenclature used to delineate folders, usually places or people. Historical research doesn’t neatly conform to these divisions, yet most historians adhere to them because of constraints on how much time you can spend at the archive.

In an effort to challenge these divisions, I’ve used topic modeling to try to come up with an alternative way to divide these folders. Topic modeling has rightfully come under strong critique in recent years because of the assumptions it makes about language and structures in texts. When I first tried topic modeling, I felt like I was doing magic, but I was also deeply suspicious because the results were so varied.

Yet, even with these drawbacks, topic modeling works fairly well on diplomatic cables because they are written in a much more systematic way than poetry or literature.

I’m hoping to eventually embed this topic model on this page so anyone can click and investigate these topics, but for now here are some screenshots of the topics that I generated from box 9.

First half of Box 9 Topic Model

Second half of Box 9 Topic Model

In this second set of images, I’ve clicked on topic number 7, the ‘congo’ topic cluster. Topic modeling lets you see which documents have the highest assignment for each topic, which helps rethink the divisions of folders within this box.

First half of Congo Cluster Topic Model

Second half of Congo Cluster Topic Model
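
For anyone curious how "highest assignment for each topic" can turn into an alternative set of folder divisions, here is a minimal sketch in plain Python. It is not the code behind the screenshots above. The document names, the three topics, and every weight in `doc_topics` are hypothetical values invented for illustration; a real topic model would produce these weights from the text of the cables.

```python
from collections import defaultdict

# Hypothetical document-topic weights: each entry is one document, and each
# value in its list is the share of that document assigned to a topic.
# These names and numbers are invented for illustration only.
doc_topics = {
    "folder_A/cable_01": [0.05, 0.80, 0.15],
    "folder_A/cable_02": [0.70, 0.10, 0.20],
    "folder_B/cable_03": [0.10, 0.65, 0.25],
    "folder_C/cable_04": [0.20, 0.30, 0.50],
}


def top_documents(doc_topics, topic, n=2):
    """Return the n documents with the highest weight for one topic."""
    ranked = sorted(
        doc_topics.items(), key=lambda item: item[1][topic], reverse=True
    )
    return [(name, weights[topic]) for name, weights in ranked[:n]]


def group_by_dominant_topic(doc_topics):
    """Regroup documents by their single highest-weighted topic."""
    groups = defaultdict(list)
    for name, weights in doc_topics.items():
        dominant = max(range(len(weights)), key=weights.__getitem__)
        groups[dominant].append(name)
    return dict(groups)


# Documents most strongly associated with topic 1, regardless of folder.
print(top_documents(doc_topics, topic=1))

# An alternative "folder" structure: one group per dominant topic.
print(group_by_dominant_topic(doc_topics))
```

In this toy example the two documents that rank highest for topic 1 sit in different folders, which is the kind of cross-folder connection the topic view is meant to surface. Grouping by a single dominant topic is a simplification, since a cable can belong substantially to several topics at once.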