# Advanced Text Analysis Notebook

This notebook is a space for experimenting with text analysis methods. The goal is to see how each transformation works and how it can reveal patterns in unstructured text data.

## Load Libraries

First, try annotating each import with a short comment on what the library does and a link to its documentation.

In [1]:
import pandas as pd
import altair as alt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import numpy as np

## Load Datasets

Both datasets are available in this Google Drive folder: [https://drive.google.com/drive/folders/1zNTPmQm_HtR1zeCpRAp3XSDBsZuZq0U5?usp=sharing](https://drive.google.com/drive/folders/1zNTPmQm_HtR1zeCpRAp3XSDBsZuZq0U5?usp=sharing). Download them and load them into the notebook, then see if you can understand the structure of the data.

In [None]:
# Load the Humanist Listserv and Pudding datasets
humanist_vols = pd.read_csv("web_scraped_humanist_listserv_volumes.csv")
pudding_data = pd.read_csv("categorized_pudding_public_scripts.csv")
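
One way to get a feel for the structure is to look at the shape, column types, first rows, and missing values of each DataFrame. The sketch below uses a tiny stand-in DataFrame so it runs without the CSV files. The column names mirror the ones used later in this notebook (`inferred_start_year`, `volume_text`), but `sample_vols` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for humanist_vols; the values are invented for illustration
sample_vols = pd.DataFrame({
    "inferred_start_year": [1987, 1988, 1989],
    "volume_text": ["first volume text", "second volume text", None],
})

# Number of rows and columns
print(sample_vols.shape)
# Data type of each column
print(sample_vols.dtypes)
# First few rows
print(sample_vols.head())
# Count of missing values per column
missing_counts = sample_vols.isna().sum()
print(missing_counts)
```

Swapping `sample_vols` for `humanist_vols` or `pudding_data` runs the same checks on the real data.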

## TF-IDF Experiments

This is the code from our lesson in class. Try running it and see if you can understand what it does. Then modify it to see if you can get different results. The documentation explains what each parameter does: [https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). In particular, try out the `max_df`, `min_df`, and `ngram_range` parameters.

In [None]:
#save our texts to a list
documents = humanist_vols.volume_text.tolist()

#Create a vectorizer
vectorizer = TfidfVectorizer(max_df=.8)
# Fit the vectorizer to our documents
transformed_documents = vectorizer.fit_transform(documents)

# Convert the sparse matrix to a dense array so we can loop over one row per document
transformed_documents_as_array = transformed_documents.toarray()

# Get the dates for each volume
dates = humanist_vols.inferred_start_year.tolist()

# Create an empty list to store our results
tfidf_results = []

# Loop through each document and get the top terms
for counter, doc in enumerate(transformed_documents_as_array):
    # Zip together the terms and the scores
    tf_idf_tuples = list(zip(vectorizer.get_feature_names_out(), doc))
    # Sort the terms by score
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)
    # Add the date to the dataframe
    one_doc_as_df['inferred_start_year'] = dates[counter]
    # Append the dataframe to our list
    tfidf_results.append(one_doc_as_df)
# Concatenate all the dataframes together
tfidf_df = pd.concat(tfidf_results)
# Sort the dataframe by score
tfidf_df = tfidf_df.sort_values(by=['score'], ascending=False)
# Get the top ten terms for each year (tfidf_df is already sorted by score, so head(10) keeps the highest)
top_terms = tfidf_df.groupby('inferred_start_year').head(10).reset_index(drop=True)
# Convert the inferred_start_year to a datetime
top_terms['inferred_start_year'] = pd.to_datetime(top_terms['inferred_start_year'], format='%Y')

In [None]:
selection = alt.selection_point(fields=['term'], bind='legend')
chart = alt.Chart(top_terms).mark_bar().encode(
    y='score',
    x='inferred_start_year:T',
    color=alt.Color('term', legend=alt.Legend(title='Term', orient='right', symbolLimit=len(top_terms['term'].unique()), columns=5), scale=alt.Scale(scheme='tableau20')),
    tooltip=['term', 'score', 'year(inferred_start_year)'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_params(selection).properties(
    title='Top 10 Terms by TF-IDF Score in Humanist Volumes'
)
# Display the chart
chart

### Periodization of Humanist Code

In [None]:
# Group the volumes by period
humanist_vols['period'] = pd.cut(humanist_vols['inferred_start_year'], bins=[float('-inf'), 2000, 2010, 2020], labels=['early_internet', 'web_2.0', 'contemporary'])
# Join the text of all volumes in each period into one document per period
period_texts = humanist_vols.groupby('period', observed=True)['volume_text'].apply(' '.join)
# Create a vectorizer
vectorizer = TfidfVectorizer(max_df=.8)
# Fit the vectorizer to our documents
transformed_documents = vectorizer.fit_transform(period_texts.tolist())
# Convert the sparse matrix to a dense array, one row per period
transformed_documents_as_array = transformed_documents.toarray()
# Get the period labels in the same order as the rows of the matrix
periods = period_texts.index.tolist()
# Create an empty list to store our results
tfidf_results = []
# Loop through each document and get the top terms
for counter, doc in enumerate(transformed_documents_as_array):
    # Zip together the terms and the scores
    tf_idf_tuples = list(zip(vectorizer.get_feature_names_out(), doc))
    # Sort the terms by score
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)
    # Add the period to the dataframe
    one_doc_as_df['period'] = periods[counter]
    # Append the dataframe to our list
    tfidf_results.append(one_doc_as_df)
# Concatenate all the dataframes together
tfidf_df = pd.concat(tfidf_results)
# Sort the dataframe by score
tfidf_df = tfidf_df.sort_values(by=['score'], ascending=False)
# Get the top thirty terms for each period (tfidf_df is already sorted by score, so head(30) keeps the highest)
top_terms = tfidf_df.groupby('period').head(30).reset_index(drop=True)
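
The periods above come from `pd.cut`, and the bin edges decide which period a boundary year lands in. The sketch below applies the same bins and labels to a handful of invented years to show how the edges behave. The `years` series is made up for illustration and does not come from the dataset.

```python
import pandas as pd

# Invented years for illustration
years = pd.Series([1987, 2000, 2001, 2010, 2011, 2020, 2021])

# Same bins and labels as the periodization cell
periods = pd.cut(
    years,
    bins=[float("-inf"), 2000, 2010, 2020],
    labels=["early_internet", "web_2.0", "contemporary"],
)

# Bins are closed on the right by default, so 2000 falls in early_internet,
# and any year after 2020 is left without a period (NaN)
print(pd.DataFrame({"year": years, "period": periods}))
```

If the data includes years after 2020, extending the last bin edge (for example to `float('inf')`) would keep those volumes in the `contemporary` period.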

In [None]:
top_terms['period'] = top_terms['period'].astype(str)
selection = alt.selection_point(fields=['term'], bind='legend')
chart = alt.Chart(top_terms).mark_bar().encode(
    y='score',
    x=alt.X('period', sort=['early_internet', 'web_2.0', 'contemporary'], axis=alt.Axis(title='Period')),
    color=alt.Color('term', legend=alt.Legend(title='Term', orient='right', symbolLimit=len(top_terms['term'].unique()), columns=5), scale=alt.Scale(scheme='tableau20')),
    tooltip=['term', 'score', 'period'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_params(selection).properties(
    title='Top 30 Terms by TF-IDF Score in Humanist Volumes by Period'
)
chart

## Topic Modeling Experiments

Try running the code from our lesson in class and see if you can understand what it does. Then modify it to see if you can get different results. The documentation explains what each parameter does: [https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html). In particular, try out `n_components` and different TF-IDF parameters. Also try using the `pudding_data` dataset to see if you can get different results.

In [None]:
# Preprocess the text
vectorizer = TfidfVectorizer(max_df=0.8)
tfidf = vectorizer.fit_transform(humanist_vols['volume_text'])

# Perform topic modeling (n_components sets the number of topics; here it is one per volume)
lda = LatentDirichletAllocation(n_components=len(humanist_vols), max_iter=20, random_state=0)
lda.fit(tfidf)

# Get the top words for each topic
top_words = vectorizer.get_feature_names_out()
topic_words = {}
for topic, comp in enumerate(lda.components_):
    word_idx = np.argsort(comp)[::-1][:10]
    topic_words[topic] = [top_words[i] for i in word_idx]

# Print the top words for each topic
for topic, words in topic_words.items():
    print(f"Topic #{topic}: {', '.join(words)}")

## Classification Experiments

Try out the code from our lesson in class and see if you can understand what it does. Then modify it to see if you can get different results. The code uses logistic regression, and the documentation explains what each parameter does: [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). In particular, try out the TF-IDF parameters. Also try using the `gender_category` column in the `pudding_data` dataset to see if you can get different results.

In [None]:
# Create a vectorizer
vectorizer = TfidfVectorizer(max_df=0.8)
# Fit the vectorizer to our documents
transformed_documents = vectorizer.fit_transform(humanist_vols['volume_text'])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_documents, humanist_vols['period'], test_size=0.2, random_state=0)

# Train a logistic regression classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict the time period of the test set
y_pred = clf.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

In [None]:
# Get the coefficients for each term
coefficients = clf.coef_
# Get the terms
terms = vectorizer.get_feature_names_out()
# Create a dataframe of the terms and coefficients, one column per class
# (clf.classes_ gives the label for each row of clf.coef_)
terms_df = pd.DataFrame(coefficients.T, columns=clf.classes_)
terms_df.insert(0, 'term', terms)
# Get the top terms for each period
top_terms = terms_df.melt(id_vars='term', var_name='period', value_name='coefficient').sort_values(by='coefficient', ascending=False).groupby('period').head(10)

In [None]:
# visualize top terms
top_terms['period'] = top_terms['period'].astype(str)
selection = alt.selection_point(fields=['term'], bind='legend')

# Define the sort order for the periods
period_order = ['early_internet', 'web_2.0', 'contemporary']

chart = alt.Chart(top_terms).mark_bar().encode(
    x=alt.X('period', sort=period_order, axis=alt.Axis(title='Period')),
    y=alt.Y('coefficient:Q'),
    color=alt.Color('term', legend=alt.Legend(title='Term', orient='right', symbolLimit=len(top_terms['term'].unique()), columns=5), scale=alt.Scale(scheme='tableau20')),
    tooltip=['term', 'coefficient', 'period'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_params(selection).properties(
    title='Top 10 Terms by Coefficient in Logistic Regression Model by Period'
)
chart