⚡️ This lesson has a notebook with all the code that you can download here. Be sure to download the datasets and install the required Python libraries as well.

In the last few classes, we have started to get deeper into text analysis, learning about topics like unsupervised versus supervised modeling, as well as tokenization and vectorization. In this class, we will continue to build on these concepts, learning about more advanced text analysis techniques and how they can help us study cultural data.

From the last class, the assignment was to try to find some patterns in the Humanist Listserv dataset, and specifically to see if we could identify any discourses that were distinctive of the early internet era versus the later web 2.0 era.

To do this, we'll first try TF-IDF to see if we can identify any distinctive words or phrases associated with these two time periods. We'll start by loading the Humanist Listserv dataset and using the TfidfVectorizer from scikit-learn to create a TF-IDF matrix, then use that matrix to identify the most distinctive words or phrases for each time period.

Here’s the code to do this:

import pandas as pd
import altair as alt
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the Humanist Listserv dataset
humanist_vols = pd.read_csv("web_scraped_humanist_listserv_volumes.csv")

# Save our texts to a list
documents = humanist_vols.volume_text.tolist()

# Create a vectorizer, ignoring terms that appear in more than 80% of documents
vectorizer = TfidfVectorizer(max_df=.8)
# Fit the vectorizer to our documents
transformed_documents = vectorizer.fit_transform(documents)

# Convert the sparse matrix to a dense array so we can loop over each document
transformed_documents_as_array = transformed_documents.toarray()

# Get the dates for each volume
dates = humanist_vols.inferred_start_year.tolist()

# Create an empty list to store our results
tfidf_results = []

# Loop through each document and get the top terms
for counter, doc in enumerate(transformed_documents_as_array):
    # Zip together the terms and the scores
    tf_idf_tuples = list(zip(vectorizer.get_feature_names_out(), doc))
    # Sort the terms by score
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)
    # Add the date to the dataframe
    one_doc_as_df['inferred_start_year'] = dates[counter]
    # Append the dataframe to our list
    tfidf_results.append(one_doc_as_df)
# Concatenate all the dataframes together
tfidf_df = pd.concat(tfidf_results)
# Sort the dataframe by score
tfidf_df = tfidf_df.sort_values(by=['score'], ascending=False)
# Get the top ten terms for each year
top_terms = tfidf_df.groupby('inferred_start_year').apply(lambda x: x.sort_values('score', ascending=False).head(10)).reset_index(drop=True)
# Convert the inferred_start_year to a datetime
top_terms['inferred_start_year'] = pd.to_datetime(top_terms['inferred_start_year'], format='%Y')

If we print out the first few rows of the top_terms DataFrame, we should see the following output:

       term     score  inferred_start_year
0       num   0.61525  1987-01-01 00:00:00
1  utorepas  0.519124  1987-01-01 00:00:00
2    bitnet  0.232467  1987-01-01 00:00:00
3       vax  0.137863  1987-01-01 00:00:00
4    prolog  0.122248  1987-01-01 00:00:00

The full top_terms DataFrame contains the top ten terms for each year in the Humanist dataset. We can also visualize the results using Altair with the following code:

selection = alt.selection_point(fields=['term'], bind='legend')
chart = alt.Chart(top_terms).mark_bar().encode(
    y='score',
    x='inferred_start_year:T',
    color=alt.Color('term', legend=alt.Legend(title='Term', orient='right', symbolLimit=len(top_terms['term'].unique()), columns=5), scale=alt.Scale(scheme='tableau20')),
    tooltip=['term', 'score', 'year(inferred_start_year)'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_params(selection).properties(
    title='Top 10 Terms by TF-IDF Score in Humanist Volumes'
)
chart

Now clicking on a term in the legend will highlight that term in the chart, and hovering over a bar shows its values. This visualization starts to give us a sense of which terms are most distinctive for each year in the Humanist dataset. Many of these terms are either numbers like years, which would appear uniquely in each volume, or era-specific terms like “utorepas” or “bitnet” that are associated with the early internet. We can also see the rise of digital humanities terms in the later volumes.

To get at what is most distinctive about the early internet era versus the later web 2.0 era, we can alter our corpus by grouping the volumes into periods, concatenating the text for each period into a single document, and then using the TfidfVectorizer to build a new TF-IDF matrix over those period documents. Here’s the code to do this:

# Group the volumes by period
humanist_vols['period'] = pd.cut(humanist_vols['inferred_start_year'], bins=[float('-inf'), 2000, 2010, 2020], labels=['early_internet', 'web_2.0', 'contemporary'])
# Concatenate the volumes in each period into one document per period
period_documents = humanist_vols.groupby('period')['volume_text'].apply(' '.join)
# Create a vectorizer
vectorizer = TfidfVectorizer(max_df=.8)
# Fit the vectorizer to our period documents
transformed_documents = vectorizer.fit_transform(period_documents.tolist())
# Convert the sparse matrix to a dense array so we can loop over each period
transformed_documents_as_array = transformed_documents.toarray()
# Get the period labels in the same order as the documents
periods = period_documents.index.tolist()
# Create an empty list to store our results
tfidf_results = []
# Loop through each document and get the top terms
for counter, doc in enumerate(transformed_documents_as_array):
    # Zip together the terms and the scores
    tf_idf_tuples = list(zip(vectorizer.get_feature_names_out(), doc))
    # Sort the terms by score
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)
    # Add the period to the dataframe
    one_doc_as_df['period'] = periods[counter]
    # Append the dataframe to our list
    tfidf_results.append(one_doc_as_df)
# Concatenate all the dataframes together
tfidf_df = pd.concat(tfidf_results)
# Sort the dataframe by score
tfidf_df = tfidf_df.sort_values(by=['score'], ascending=False)
# Get the top thirty terms for each period
top_terms = tfidf_df.groupby('period').apply(lambda x: x.sort_values('score', ascending=False).head(30)).reset_index(drop=True)

Now we can visualize the top terms for each period using Altair with the following code:

top_terms['period'] = top_terms['period'].astype(str)
selection = alt.selection_point(fields=['term'], bind='legend')
chart = alt.Chart(top_terms).mark_bar().encode(
    y='score',
    x=alt.X('period', sort=['early_internet', 'web_2.0', 'contemporary'], axis=alt.Axis(title='Period')),
    color=alt.Color('term', legend=alt.Legend(title='Term', orient='right', symbolLimit=len(top_terms['term'].unique()), columns=5), scale=alt.Scale(scheme='tableau20')),
    tooltip=['term', 'score', 'period'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_params(selection).properties(
    title='Top 30 Terms by TF-IDF Score in Humanist Volumes by Period'
)
chart

So now we are seeing which terms are most distinctive for each period, and we could start researching each term to discover whether it is associated with particular technologies or discourses. We could also start to experiment with the hyper-parameters of the TfidfVectorizer to see if we can get more meaningful results.

Hyper-parameters are the settings that we can adjust when we create a model or algorithm. In the TfidfVectorizer, for example, the max_df parameter sets the maximum document frequency a term can have and still be included in the TF-IDF matrix, min_df sets the minimum document frequency a term must have, and ngram_range sets the range of n-grams (single words, two-word phrases, and so on) to include. Adjusting these hyper-parameters can produce quite different results from our TF-IDF analysis.
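For instance, here is a minimal sketch of a vectorizer with these hyper-parameters adjusted; the particular values are arbitrary starting points for experimentation, not recommendations:

# A sketch of a vectorizer with adjusted hyper-parameters; these values
# are arbitrary starting points, not recommendations
tuned_vectorizer = TfidfVectorizer(
    max_df=0.5,         # ignore terms that appear in more than half the volumes
    min_df=2,           # ignore terms that appear in fewer than two volumes
    ngram_range=(1, 2)  # include both single words and two-word phrases
)
tuned_documents = tuned_vectorizer.fit_transform(documents)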

Unsupervised and Supervised Text Analysis

So far, we have been looking at how to identify distinctive terms in a corpus of text data using TF-IDF, and while this is a powerful algorithm, it is only the tip of the iceberg when it comes to exploring textual data. For example, we haven’t yet explored which words tend to appear alongside words related to technology or the internet, or how different documents might group together in the corpus. To do this we can use a technique called clustering to group similar documents together based on the words they contain.

Clustering is an unsupervised machine learning technique that groups similar data points together based on their features. In class we discussed the difference between unsupervised and supervised machine learning, but as a refresher: in unsupervised machine learning we don’t have labeled data, so we are trying to find patterns in the data without knowing in advance what those patterns are; in supervised machine learning we have labeled data, so we are trying to predict the labels based on the features of the data.

The key differences: unsupervised methods, like clustering and topic modeling, take unlabeled data and try to discover structure within it, while supervised methods, like classification and regression, learn from labeled examples in order to predict the labels of new data.

Both approaches have tradeoffs and can be useful in different contexts: unsupervised machine learning is useful when we don’t have labeled data and want to explore the patterns in it, while supervised machine learning is useful when we do have labels and want to predict them from the features of the data. In the case of our Humanist listserv dataset, we don’t have labeled data, so unsupervised machine learning is a natural place to start.
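As a concrete example, here is a minimal clustering sketch using KMeans from scikit-learn; it assumes the humanist_vols DataFrame from earlier, and the choice of three clusters is arbitrary, chosen to loosely mirror our three periods:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Re-vectorize so we are clustering one document per volume
vectorizer = TfidfVectorizer(max_df=0.8)
volume_tfidf = vectorizer.fit_transform(humanist_vols['volume_text'])

# Group the volumes into three clusters (an arbitrary number)
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
humanist_vols['cluster'] = kmeans.fit_predict(volume_tfidf)

# Check which years fall into each cluster
print(humanist_vols.groupby('cluster')['inferred_start_year'].agg(['min', 'max', 'count']))

If the clusters roughly track our periods, that is a hint that the vocabulary of the listserv really does shift over time.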

Topic Modeling

While there are a number of clustering algorithms that we could use to group similar documents together, one of the most popular unsupervised machine learning techniques for text data is topic modeling. Topic modeling identifies groups of words that frequently appear together in the corpus and assigns them to topics. Each topic is a group of related words, and each document in the corpus is assigned a distribution over the topics.

We can use scikit-learn and the LatentDirichletAllocation class to perform topic modeling on our Humanist listserv dataset. Here’s the code to do this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
# Preprocess the text
vectorizer = TfidfVectorizer(max_df=0.8)
tfidf = vectorizer.fit_transform(humanist_vols['volume_text'])

# Perform topic modeling
lda = LatentDirichletAllocation(n_components=len(humanist_vols), max_iter=20, random_state=0)
lda.fit(tfidf)

# Get the top words for each topic
top_words = vectorizer.get_feature_names_out()
topic_words = {}
for topic, comp in enumerate(lda.components_):
    word_idx = np.argsort(comp)[::-1][:10]
    topic_words[topic] = [top_words[i] for i in word_idx]

# Print the top words for each topic
for topic, words in topic_words.items():
    print(f"Topic #{topic}: {', '.join(words)}")

This should give us the following output:

Topic #0: astra, na, pali, wesolowski, oikawa, hurd, voorhis, texpert, makrolog, ps2
Topic #1: xxx, ruhc, kis, hums, riao97, riao, tiiap, bellagio, wipo, nea
Topic #2: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #3: 2003, 2002, tocs, 0558, mivu, digicult, oesi, m0, eprints, odell
Topic #4: epas, aaisp, ceth, etext, 8080, netscape, fuer, dsu, hull, 908
Topic #5: 2008, barracuda, esmtp, 2002, nz, outbound, messagelabs, helo, ccreegan, asg
Topic #6: 2014, dhsi, crowdsourcing, dhoxss, cerch, computationalists, ahrc, attachments, fishwick
Topic #7: e9, e5, f4, ef, 3db7, frankel, e1, 3d3d, f0, mw99
Topic #8: 2004, qs, neach, uottawa, jodi, artfl, tambovtsev, unicode, rs, pmc
Topic #9: num, deleted, ninch, htm, cest, wlm, xml, 2784, amico, website
Topic #10: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #11: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #12: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #13: 2014, spf, lemme, helo, forks, dah, pacling, vspace, neder, centerline
Topic #14: 2016, matlit, livingstone, fã¼r, worldcis, uned, dsh, hcomp, linhd, 2017
Topic #15: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #16: 2017, upei, derivational, tandfonline, derimo2017, 2018, aclc, hiatt, yisr20, loi
Topic #17: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #18: 2006, ubiquity, fqs, 1007, doi, mccarty_at_kcl, 2980, 3dx, arundel, google
Topic #19: digitalhumanities, onlinehome, s16382816, joyent, archiver, postfix, woodward, bounces, dhhumanist, php
Topic #20: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #21: num, bitnet, utorepas, vax, cst, tlg, kraft, cdt, wordperfect, ccat
Topic #22: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #23: 2007, 2008, wmccarty, fludd, infinitum, 1617, ahrc, jiia, tl, ichim07
Topic #24: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #25: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #26: 2011, a0, fractional, interedition, pocos, mw2012, kflc, jascha, utpjournals, dh2012
Topic #27: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #28: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #29: 2018, mailing_list_multi, fish, benzon, fatal, mathophobia, flaws, yisr20
Topic #30: 1998, ftp, gopher, marchand, cti, 5801, ippe, tact, ___, brians
Topic #31: žikovský, chaffee, chainâ, chainsaw, chaining, chaines, chained, chainbi_at_lycos, chainbi, chaim
Topic #32: cultivate, ucita, gratias, agere, dchamber, evtoc, primitives, www10, evelyn, virilio

You’ll notice that I’m setting the number of topics to be the same as the number of volumes in the Humanist dataset. This is because the LatentDirichletAllocation algorithm requires us to specify the number of topics in advance, and in this case, we are trying to identify a topic for each volume in the dataset. Notice, though, that many of the resulting topics are duplicates of one another, which suggests this number is too high. We could try setting the number of topics to a smaller number to see if we can identify broader topics in the dataset, or change our TF-IDF parameters to see if that leads to more distinctive words being surfaced.
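Here is a minimal sketch of what a smaller model might look like; it reuses humanist_vols from above, picks ten topics as an arbitrary starting point, and swaps in a CountVectorizer, since LDA is conventionally fit on raw term counts rather than TF-IDF weights:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Raw term counts rather than TF-IDF weights, which is what LDA expects
count_vectorizer = CountVectorizer(max_df=0.8, min_df=2)
counts = count_vectorizer.fit_transform(humanist_vols['volume_text'])

# Ten topics is an arbitrary starting point; try several values and compare
lda_small = LatentDirichletAllocation(n_components=10, max_iter=20, random_state=0)
lda_small.fit(counts)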

Classifiers

Alternatively, we could use supervised machine learning techniques to predict the time period of each volume in the Humanist dataset based on the words they contain. This would involve training a classifier on a subset of the data where we know the time period of each volume and then using the classifier to predict the time period of the remaining volumes. This is a common approach in text analysis when we have labeled data and want to predict the labels based on the features of the data.

In our case, we might want to use the period labels we created earlier and train a classifier to predict the period of each volume from its TF-IDF features. We can then inspect the trained classifier to see which terms are most distinctive of each period. Here’s the code to do this:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Create a vectorizer
vectorizer = TfidfVectorizer(max_df=0.8)
# Fit the vectorizer to our documents
transformed_documents = vectorizer.fit_transform(humanist_vols['volume_text'])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_documents, humanist_vols['period'], test_size=0.2, random_state=0)

# Train a logistic regression classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict the time period of the test set
y_pred = clf.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

Which gives us the following results:

                precision    recall  f1-score   support

  contemporary       1.00      0.33      0.50         3
early_internet       1.00      1.00      1.00         2
       web_2.0       0.50      1.00      0.67         2

      accuracy                           0.71         7
     macro avg       0.83      0.78      0.72         7
  weighted avg       0.86      0.71      0.69         7

This report tells us how well our classifier is doing at predicting the time period of each volume in the Humanist dataset. With an accuracy of 0.71, the classifier is correctly predicting the time period about 71% of the time, though with only seven volumes in the test set these scores are quite noisy. You’ll also notice that it is doing better on the early internet and web 2.0 periods than on the contemporary period, which might be because the contemporary period is more similar to the web 2.0 period than to the early internet period.
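One way to check that hunch is a confusion matrix, which shows which true periods are being mistaken for which predicted periods; a minimal sketch, reusing y_test and y_pred from above:

from sklearn.metrics import confusion_matrix

# Rows are true periods, columns are predicted periods
labels = clf.classes_
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))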

We could also extract the terms most associated with each period using the following code:

# Get the coefficients for each term; the rows of clf.coef_ follow clf.classes_,
# which here is alphabetical: contemporary, early_internet, web_2.0
coefficients = clf.coef_
# Get the terms
terms = vectorizer.get_feature_names_out()
# Create a dataframe of the terms and coefficients
terms_df = pd.DataFrame({'term': terms, 'contemporary': coefficients[0], 'early_internet': coefficients[1], 'web_2.0': coefficients[2]})
# Get the top thirty terms for each period
top_terms = terms_df.melt(id_vars='term', var_name='period', value_name='coefficient').sort_values(by='coefficient', ascending=False).groupby('period').head(30)

This gives us the following output:

term period coefficient
num early_internet 1.21206
digitalhumanities contemporary 1.01717
2006 web_2.0 0.480645
onlinehome contemporary 0.474791
s16382816 contemporary 0.474791
2002 web_2.0 0.456101
joyent web_2.0 0.356053
bitnet early_internet 0.341151
2008 web_2.0 0.307246
dhhumanist contemporary 0.296768
archiver contemporary 0.285431
cest contemporary 0.270183
2003 web_2.0 0.250799
woodward web_2.0 0.244343
2007 web_2.0 0.241049
esmtp web_2.0 0.220966
2009 web_2.0 0.209739
barracuda web_2.0 0.196746
2015 contemporary 0.195885
deleted early_internet 0.192321
postfix contemporary 0.191936
ninch web_2.0 0.180353
ubiquity web_2.0 0.175565
ftp early_internet 0.174169
1007 web_2.0 0.170139
2016 contemporary 0.165034
doi web_2.0 0.161822
2017 contemporary 0.149745
utorepas early_internet 0.147523
2014 contemporary 0.133621
gopher early_internet 0.130811
listmember_interface contemporary 0.124616
htm web_2.0 0.123987
php contemporary 0.123323
fqs web_2.0 0.120207
ppp contemporary 0.119089
vax early_internet 0.119023
spam contemporary 0.113921
3dx web_2.0 0.110824
mccarty_at_kcl web_2.0 0.110779
autolearn contemporary 0.108205
all_trusted contemporary 0.108185
10009 contemporary 0.108185
epas early_internet 0.106211
nz web_2.0 0.105384
bayes_00 contemporary 0.105316
arundel web_2.0 0.103139
xml web_2.0 0.0991835
spamassassin contemporary 0.0983372
bounces contemporary 0.0963914
7848 web_2.0 0.0939354
userid contemporary 0.0937591
infobits web_2.0 0.0908213
messagelabs web_2.0 0.0893617
aaisp web_2.0 0.0881107
outbound web_2.0 0.0877741
2013 contemporary 0.0827581
vhost contemporary 0.0808539
wlm web_2.0 0.079479
wikipedia web_2.0 0.0789063
asg web_2.0 0.0786985
tlg early_internet 0.0785648
qs early_internet 0.0773452
2784 web_2.0 0.0771147
cst early_internet 0.0748274
uribl_blocked contemporary 0.0744614
ichim99 early_internet 0.0687415
kraft early_internet 0.0599564
beenthere contemporary 0.0591908
membership_form contemporary 0.0587342
listmember contemporary 0.0564102
ham contemporary 0.0561662
cdt early_internet 0.0557935
ccat early_internet 0.0549989
spf contemporary 0.0546195
ippe early_internet 0.0529816
pst early_internet 0.0520612
wordperfect early_internet 0.0517592
cti early_internet 0.0487855
marchand early_internet 0.0480226
coombs early_internet 0.0474053
xxx early_internet 0.0466514
brownvm early_internet 0.0463937
cni early_internet 0.04483
ceth early_internet 0.0447585
bene early_internet 0.0445725
brians early_internet 0.0445527
osher early_internet 0.044098
neach early_internet 0.0438512
wordcruncher early_internet 0.0428789

We can then visualize the top terms for each period using Altair with the following code:

# visualize top terms
top_terms['period'] = top_terms['period'].astype(str)
selection = alt.selection_point(fields=['term'], bind='legend')

# Define the sort order for the periods
period_order = ['early_internet', 'web_2.0', 'contemporary']

chart = alt.Chart(top_terms).mark_bar().encode(
    x=alt.X('period', sort=period_order, axis=alt.Axis(title='Period')),
    y=alt.Y('coefficient:Q'),
    color=alt.Color('term', legend=alt.Legend(title='Term', orient='right', symbolLimit=len(top_terms['term'].unique()), columns=5), scale=alt.Scale(scheme='tableau20')),
    tooltip=['term', 'coefficient', 'period'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_params(selection).properties(
    title='Top 30 Terms by Coefficient in Logistic Regression Model by Period'
)
chart

Now we have officially experimented with two types of machine learning! We could also try other classifiers like RandomForestClassifier or an SVM, experiment with different hyper-parameters for the TfidfVectorizer, or try different feature selection techniques to see if any of these produce different results. There are many ways to experiment with text analysis, and the more you experiment, the more you’ll learn about the data and the algorithms.
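For example, here is a minimal sketch of swapping in a random forest, reusing the train/test split from above; the number of trees is an arbitrary default rather than a tuned value:

from sklearn.ensemble import RandomForestClassifier

# Train a random forest on the same TF-IDF features (100 trees is an
# arbitrary default, not a tuned value)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0)
rf_clf.fit(X_train, y_train)

# Compare its performance to the logistic regression above
print(classification_report(y_test, rf_clf.predict(X_test)))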

Resources

We have only briefly touched on the possibilities of text analysis in this lesson. There are many more techniques and algorithms that you can use to explore textual data. Here are some resources to help you learn more about text analysis:

  • Thomas Jurczyk, “Clustering with Scikit-Learn in Python,” Programming Historian 10 (2021), https://doi.org/10.46430/phen0094.
  • Matthew J. Lavin, “Regression Analysis with Scikit-Learn (part 1 - Linear),” Programming Historian 11 (2022), https://doi.org/10.46430/phen0099.
  • Matthew J. Lavin, “Regression Analysis with Scikit-Learn (part 2 - Logistic),” Programming Historian 11 (2022), https://doi.org/10.46430/phen0100.
  • John R. Ladd, “Understanding and Using Common Similarity Measures for Text Analysis,” Programming Historian 9 (2020), https://doi.org/10.46430/phen0089.
  • Matthew J. Lavin, “Analyzing Documents with TF-IDF,” Programming Historian 8 (2019), https://doi.org/10.46430/phen0082.
