Notebook Assignment
First Assignment
In class, we worked on getting the text from the Humanist listserv through web scraping.
The final piece of this assignment is to take the data from the plain text files and save it to a dataframe in Pandas. Going forward we’ll be using this dataset to do some initial text analysis.
If you feel comfortable with the assignment, feel free to use scripts, notebooks, or a combination of both.
- With our existing code (available here), we have access to the HTML and the text from each of the emails in the first volume. Try reworking your code to use this URL https://humanist.kdl.kcl.ac.uk/Archives/Converted_Text/ to get all the volumes without having to scrape each volume individually. Think about what data you want to get and what you don't need to keep. Do we want to keep the HTML, or do we just want the text? (Hint: try googling the `get_text()` or `.text` methods in Beautiful Soup.) One possible approach is sketched after this list.
- Now that we have the data from the webpage, we need to decide what other metadata we want to include with it. What information can we get from the `url` variable? How would you get the years for each volume? (Hint: check out the `split()` method in Python.)
- Finally, we have all the data we want to store, so now we have to decide how to persist it. If we have metadata and data for each volume, what data structure would be best suited? A list or a dictionary? What about a list of dictionaries? How would we add data from each volume to a variable that lives outside of the two for loops?
- Say we got our metadata and data saved to a new variable; the final piece is to save this to a new dataframe. Read this Stack Overflow answer for how to save our variable to a dataframe: https://stackoverflow.com/questions/20638006/convert-list-of-dictionaries-to-a-pandas-dataframe/53831756#53831756. Try to find the references to `from_records()`, `from_dict()`, and `orient='columns'` in the answer, and try to save our variable to a new dataframe called `humanist_vols`.
- Finally, so we don't have to web scrape every time, try using the following code: `humanist_vols.to_csv('web_scraped_humanist_listserv.csv')`. What does this do to our data? You can read about the `to_csv()` method here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv. (The second sketch after this list shows both of these final steps together.)
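Here is a minimal sketch of how the scraping and metadata steps might fit together. It is not the assignment's answer: it assumes the Converted_Text index page lists links to plain-text volume files whose names contain the volume years (e.g. `humanist.1987-1988.txt`), so check the actual page and adjust the link filtering and the `split()` index to match what you see.

```python
import requests
from bs4 import BeautifulSoup

url = 'https://humanist.kdl.kcl.ac.uk/Archives/Converted_Text/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

volumes = []  # lives outside the loop so data from every volume accumulates here
for link in soup.find_all('a'):
    href = link.get('href')
    # assumption: volume files end in .txt and hrefs are relative file names;
    # verify this against the real index page
    if href and href.endswith('.txt'):
        volume_url = url + href
        volume_response = requests.get(volume_url)
        volume_soup = BeautifulSoup(volume_response.text, 'html.parser')
        text = volume_soup.get_text()  # keep only the text, drop any markup
        # assumed file name pattern 'humanist.1987-1988.txt':
        # split('.') puts the years at index 1
        dates = href.split('.')[1]
        volumes.append({'volume_url': volume_url, 'volume_dates': dates, 'volume_text': text})
```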
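Once the list of dictionaries exists, saving it to a dataframe and then to disk is short. This continues from the hypothetical `volumes` variable in the sketch above:

```python
import pandas as pd

# each dictionary becomes a row, each key becomes a column
humanist_vols = pd.DataFrame.from_records(volumes)

# write the dataframe to a CSV file so we can reload it later with
# pd.read_csv() instead of re-scraping; note that by default to_csv()
# also writes the row index as an extra first column
humanist_vols.to_csv('web_scraped_humanist_listserv.csv')
```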
Second Assignment
Let's take a deeper dive into the datasets underlying The Pudding's "Film Dialogue" article https://pudding.cool/2017/03/film-dialogue/. So far we have worked with the film scripts, and now we will bring in additional data from The Pudding website.
- Create a new Jupyter notebook and read in the three datasets from the GitHub repository https://github.com/matthewfdaniels/scripts/. Take a look at the documentation in the repository and discuss what you think each file contains. (A loading sketch appears after this list.)
- Once you've loaded the data into the notebook, discuss what data you think the columns contain and check if there's any missing data.
- Try to answer the following questions:
  - How could we tell if the amount of dialogue was increasing over time in movies? How might this influence the assessment about the breakdown of gender dialogue?
  - How could we test if there was any relationship between a film's gross value and the amount of dialogue in the film?
To answer these questions you'll need to merge, aggregate, and calculate some basic stats for these datasets; one possible starting point is sketched below.
As a bonus, try creating a plot to visualize the answer to each of these questions.
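Here is a sketch of one way to load and inspect the data. The file names and the encoding are assumptions about what the repository contains, so confirm them against the repository's README before relying on this:

```python
import pandas as pd

base_url = 'https://raw.githubusercontent.com/matthewfdaniels/scripts/master/'

# assumed file names; check the repository's README for the real ones.
# encoding='ISO-8859-1' is a fallback guess in case the files aren't UTF-8
characters = pd.read_csv(base_url + 'character_list5.csv', encoding='ISO-8859-1')
mapping = pd.read_csv(base_url + 'character_mapping.csv', encoding='ISO-8859-1')
metadata = pd.read_csv(base_url + 'meta_data7.csv', encoding='ISO-8859-1')

# inspect columns, dtypes, and missing values for each dataset
for name, df in [('characters', characters), ('mapping', mapping), ('metadata', metadata)]:
    print(f'--- {name} ---')
    df.info()                # column names, dtypes, non-null counts
    print(df.isna().sum())   # missing values per column
```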
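And a sketch of how the merge and aggregation might look. The column names used here (`script_id`, `words`, `year`, `gross`) are assumptions about the files' headers, so rename them to match what you actually find; you may also need to clean the gross column (convert it to a number, drop missing values) before computing a correlation.

```python
# total words of dialogue per film, then join on the film-level metadata
dialogue_per_film = characters.groupby('script_id')['words'].sum().reset_index()
films = dialogue_per_film.merge(metadata, on='script_id')

# Question 1: is dialogue increasing over time? Average words per film by year
words_by_year = films.groupby('year')['words'].mean()
print(words_by_year)

# Question 2: relationship between gross and amount of dialogue
print(films[['gross', 'words']].corr())

# Bonus: quick plots of each answer
words_by_year.plot(title='Average words of dialogue per film by year')
films.plot.scatter(x='gross', y='words', title='Gross vs. dialogue')
```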
Submitting Assignments
When you're ready to submit your assignment, you will need to push it up to GitHub.
Because we are now working with larger datasets, we may hit GitHub's large file limitation. To keep our data from being pushed up, we can use something called a `.gitignore` file. This is a file that tells Git which files to ignore when we commit and push our work.
- The first step is to create the `.gitignore` file. You can use either `touch` or `ni` to create the file (on Mac and Windows respectively). You should create the file in your IS310-final-project directory.
- Once you've created the file, add the following lines to it:

```
*.csv
.ipynb_checkpoints/
```

Save the file and run `git status` in your terminal. You should no longer see those files listed.
- Now you can push your data to GitHub. Details on GitHub Workflows are available here in case you forgot.