
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email [email protected]


Exploring Metadata and Pre-Processing

Description of methods in this notebook: This notebook shows how to explore and pre-process the metadata of a dataset using Pandas.

The following processes are described:

  • Importing a CSV file containing the metadata for a given dataset ID
  • Creating a Pandas dataframe to view the metadata
  • Pre-processing your dataset by filtering out unwanted texts
  • Exporting a list of relevant IDs to a CSV file
  • Visualizing the metadata of your pre-processed dataset by the number of documents/year and pages/year

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Take me to the Research Version of this notebook ->

Difficulty: Intermediate

Completion time: 45 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: CSV file

Libraries Used: tdm_client, pandas

Research Pipeline: None


Import your dataset

We'll use the tdm_client library to automatically retrieve the metadata for a dataset. We can retrieve the metadata as a CSV file using the get_metadata method.

Enter a dataset ID in the next code cell.

If you don't have a dataset ID, you can use the default dataset ID supplied in the next code cell.

In [ ]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

Next, import the tdm_client and pass the dataset_id as an argument to the get_metadata method.

In [ ]:
# Import the `tdm_client`
import tdm_client

# Pull in our dataset CSV using
# The .get_metadata() method downloads the CSV file for our metadata
# to the /data folder and returns a string for the file name and location
# dataset_metadata will be a string containing that file name and location
dataset_metadata = tdm_client.get_metadata(dataset_id)
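
The returned value is a plain string, so you can print it to confirm where the metadata CSV was saved. A minimal check:

In [ ]:
# Print the path of the downloaded metadata CSV
print(dataset_metadata)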

We are ready to import pandas for our analysis and create a dataframe. We will use the read_csv() function to create our dataframe from the CSV file.

In [ ]:
# Import pandas 
import pandas as pd

# Create our dataframe
df = pd.read_csv(dataset_metadata)

We can confirm the size of our dataset using the len() function on our dataframe.

In [ ]:
original_document_count = len(df)
print(f'Total original documents: {original_document_count}')

Now let's take a look at the data in our dataframe df. We will set pandas to show all columns using set_option() then get a preview using head().

In [ ]:
# Set the pandas option to show all columns
# Setting None shows all columns
# To show fewer columns, replace None with an integer
pd.set_option("display.max_columns", None) 

# Show the first five rows of our dataframe
# To show a different number of preview rows
# Pass an integer into the .head()
df.head() 

Here are descriptions for the metadata types found in each column:

  • id: a unique item ID (in JSTOR, this is a stable URL)
  • title: the title for the item
  • isPartOf: the larger work that holds this title (for example, a journal title)
  • publicationYear: the year of publication
  • doi: the digital object identifier for an item
  • docType: the type of document (for example, article or book)
  • provider: the source or provider of the dataset
  • datePublished: the publication date in yyyy-mm-dd format
  • issueNumber: the issue number for a journal publication
  • volumeNumber: the volume number for a journal publication
  • url: a URL for the item and/or the item's metadata
  • creator: the author or authors of the item
  • publisher: the publisher for the item
  • language: the language or languages of the item (eng is the ISO 639 code for English)
  • pageStart: the first page number of the print version
  • pageEnd: the last page number of the print version
  • placeOfPublication: the city of the publisher
  • wordCount: the number of words in the item
  • pageCount: the number of print pages in the item
  • outputFormat: what data is available (unigrams, bigrams, trigrams, and/or full-text)
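
Before filtering, it can help to see how these columns were actually parsed. A minimal sketch using pandas' info() method for a summary of non-null counts and data types, and value_counts() to tally a categorical column such as language:

In [ ]:
# Summarize column names, non-null counts, and data types
df.info()

# Tally how many documents carry each language code
df['language'].value_counts()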

Filtering out columns using Pandas

If there are any columns you would like to remove from your analysis, you can drop them with:

df = df.drop(['column_name1', 'column_name2', ...], axis=1)

In [ ]:
# Drop each of these named columns
# axis=1 specifies we are dropping columns
# axis=0 would specify to drop rows
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished'], axis=1)

# Show the first five rows of our updated dataframe
df.head()
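
Alternatively, when the list of columns to keep is shorter than the list to drop, you can select the columns you want directly. A sketch, assigned to a new variable here so it does not disturb the steps below:

In [ ]:
# An alternative to .drop(): select only the columns to keep
df_keep = df[['id', 'title', 'creator', 'publicationYear', 'wordCount']]
df_keep.head()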

Filtering out rows with Pandas

Now that we have filtered out unwanted metadata columns, we can begin filtering out any texts that may not match our research interests. Let's examine the first and last ten rows of the dataframe to see if we can identify texts that we would like to remove. We are looking for patterns in the metadata that could help us remove many texts at once.

In [ ]:
# Preview the first ten items in the dataframe
# Can you identify patterns to select rows to remove?
df.head(10)
In [ ]:
# Preview the last ten items in the dataframe
# Can you identify patterns to select rows to remove?
df.tail(10)
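
One way to surface such patterns programmatically is to tally repeated values; paratextual sections often share identical titles. A minimal sketch using value_counts():

In [ ]:
# Count the most frequently repeated titles
# Recurring titles often mark paratextual sections
df['title'].value_counts().head(10)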

Remove all rows without data for a particular column

For example, we may wish to remove any texts that do not have authors. (In the case of journals, this may be helpful for removing paratextual sections such as the table of contents, indices, etc.) The column of interest in this case is creator.
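
Before dropping anything, you can count how many rows would be removed by tallying the missing values in that column. A minimal check:

In [ ]:
# Count the rows with no value in the 'creator' column
df['creator'].isna().sum()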

In [ ]:
# Remove all texts without an author
df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'
In [ ]:
# Print the total original documents followed by the current number
print(f'Total original documents: {original_document_count}')
print(f'Total current documents: {len(df)}')

Remove rows based on the content of a particular column

We can also remove texts depending on whether we do (or do not) want a particular value in a column. Here are a few examples.

In [ ]:
# Remove all items with a particular title
# Change title to desired column
# Change `Review Article` to your undesired title

df = df[df.title != 'Review Article']
print(f'Total current documents: {len(df)}')
In [ ]:
# Keep only items with a particular language
# Change language to desired column
# Change 'eng' to your desired language

df = df[df.language == 'eng'] # Change to another language code for other languages
print(f'Total current documents: {len(df)}')
In [ ]:
# Remove all items with fewer than 1500 words
# Change wordCount to desired column
# Change '> 1500' to your desired expression to evaluate

df = df[df.wordCount > 1500]
print(f'Total current documents: {len(df)}')
In [ ]:
# Print the total original documents followed by the current number
print(f'Total original documents: {original_document_count}')
print(f'Total current documents: {len(df)}')
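
These row filters can also be combined into a single expression with boolean operators: & means "and", | means "or", and each condition needs its own parentheses. A sketch equivalent to applying the language and word-count filters above in one step:

In [ ]:
# Combine multiple row filters in a single expression
# Each condition must be wrapped in parentheses
df = df[(df.language == 'eng') & (df.wordCount > 1500)]
print(f'Total current documents: {len(df)}')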

Take a final look at your dataframe to make sure the current texts fit your research goals. In the next step, we will save the IDs of your pre-processed dataset.

In [ ]:
# Preview the first 50 rows of your dataset
# If all the items look good, move to the next step.
df.head(50)

Saving a list of IDs to a CSV file

In [ ]:
# Write the column "id" to a CSV file called `pre-processed_###.csv` where ### is the `dataset_id`
df["id"].to_csv('data/pre-processed_' + dataset_id + '.csv')

Download the "pre-processed_###.csv" file (where ### is the dataset_id) for future analysis. You can use this file in combination with the dataset ID to automatically filter your texts and reduce the processing time of your analyses.
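
To verify the export, you can read the file back into pandas. A minimal check, assuming the file was written to the data/ folder as above:

In [ ]:
# Read the exported IDs back in to confirm the file was written
id_check = pd.read_csv('data/pre-processed_' + dataset_id + '.csv')
id_check.head()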


Visualizing the Pre-Processed Data

In [ ]:
# For displaying plots
%matplotlib inline
In [ ]:
# Group the data by publication year and plot the number of documents per year as a bar chart
df.groupby(['publicationYear'])['id'].agg('count').plot.bar(title='Documents by year', figsize=(20, 5), fontsize=12); 

# Read more about Pandas dataframe plotting here: 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

And now let's look at the total page counts by year.

In [ ]:
# Group the data by publication year and plot the sum of page counts per year as a bar chart

df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12);
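
To aggregate by decade instead of by year, you can derive the decade from publicationYear with integer division. A minimal sketch:

In [ ]:
# Integer division maps each year to its decade (e.g. 1953 -> 1950)
df.groupby(df['publicationYear'] // 10 * 10)['pageCount'].agg('sum').plot.bar(title='Pages by decade', figsize=(20, 5), fontsize=12);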