Smithsonian Open Access allows us to download, share, and reuse millions of the Smithsonian's images and data from across its 19 museums, nine research centers, libraries, archives, and the National Zoo.

This notebook introduces how to explore the repository and create a CSV dataset. In addition, this example applies computer vision methods based on face detection, a technique that has gained relevance in fields such as photography and marketing.

The Open Access API requires an API key to access the endpoints. Please register at https://api.data.gov/signup/ to get a key.

Setting things up

In [ ]:
import requests
import csv
import json
import pandas as pd
import cv2
import os
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw, ImageFont, ImageOps
from io import BytesIO

Global configuration

In this section, we set our api_key, the query text used to search and retrieve records, and the number of records to retrieve.

In [ ]:
api_key = 'YOUR_API_KEY' # add your own api_key here
q = 'theodore roosevelt' # querystring
rows = '100' # number of records to retrieve
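
To avoid publishing the key together with the notebook, we could instead read it from an environment variable. A minimal sketch, assuming the key has been exported under the (hypothetical) name SMITHSONIAN_API_KEY:

In [ ]:
# Read the API key from the environment if available (variable name is hypothetical)
api_key = os.environ.get('SMITHSONIAN_API_KEY', api_key)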

Accessing the Smithsonian repository

Please visit https://edan.si.edu/openaccess/apidocs/#api-search-search for more information.

In [ ]:
url = 'https://api.si.edu/openaccess/api/v1.0/search'
r = requests.get(url, params = {'q': q, 'start':'0', 'rows': rows, 'api_key': api_key })
print(r.url)
response = r.text
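
The search endpoint is paged through the start and rows parameters. If the query matches more records than fit in one page, we could walk the full result set with a loop like the following sketch, which assumes the total hit count is reported in response['rowCount'] as described in the API documentation:

In [ ]:
# Sketch: page through all results for the query (assumes a 'rowCount' field)
all_rows = []
start = 0
while True:
    page = requests.get(url, params = {'q': q, 'start': str(start), 'rows': rows, 'api_key': api_key}).json()
    batch = page['response']['rows']
    all_rows.extend(batch)
    start += int(rows)
    if not batch or start >= page['response']['rowCount']:
        break
print(len(all_rows), 'records fetched')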

Creating a CSV file

In [ ]:
csv_file = open('si_records.csv', 'w', newline='')
csv_out = csv.writer(csv_file, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
csv_out.writerow(['id','title', 'date', 'media_usage', 'data_source', 'dimensions', 'sitter', 'type', 'medium', 'artist', 'manifestUrl', 'imageUrl'])

Reading the results and retrieving the metadata

In [ ]:
results = json.loads(response)

for r in results['response']['rows']:
    print(r['id'] + ' ' + r['title'])
    # print(r)  # uncomment to inspect the full record

    # getting the identifiers of the records to access the IIIF manifests
    try:
        media = r['content']['descriptiveNonRepeating']['online_media']['media']
        for item in media:
            idsId = item['idsId']
            print(idsId)

            # retrieving the manifest
            iiifUrl = 'https://ids.si.edu/ids/manifest/' + idsId
            iiifItemResponse = requests.get(iiifUrl)

            imageUrl = 'https://ids.si.edu/ids/iiif/' + idsId + '/full/full/0/default.jpg'
            print(imageUrl)

            iiifItem = json.loads(iiifItemResponse.text)

            # retrieving metadata
            title = date = licence = datasource = dimensions = sitter = typem = medium = artist = ''

            for m in iiifItem['metadata']:
                if m['label'] == 'Title':
                    title = m['value']
                elif m['label'] == 'Date':
                    date = m['value']
                elif m['label'] == 'Media Usage':
                    licence = m['value']
                elif m['label'] == 'Data Source':
                    datasource = m['value']
                elif m['label'] == 'Dimensions':
                    dimensions = m['value']
                elif m['label'] == 'Sitter':
                    sitter = m['value']
                elif m['label'] == 'Type':
                    typem = m['value']
                elif m['label'] == 'Medium':
                    medium = m['value']
                elif m['label'] == 'Artist':
                    artist = m['value']

            csv_out.writerow([idsId, title, date, licence, datasource, dimensions, sitter, typem, medium, artist, iiifUrl, imageUrl])

    except (KeyError, json.JSONDecodeError) as e:
        # records without online media or with malformed manifests are skipped
        print('An exception occurred:', e)

# close the file so all rows are flushed to disk before we read it back
csv_file.close()

Create some summary data

We can use Pandas to give us a quick overview of the dataset.

In [ ]:
# Load the CSV file we just created.
# This puts the data in a Pandas DataFrame
df = pd.read_csv('si_records.csv')

Have a peek

In [ ]:
df

How many items?

In [ ]:
# How many items?
len(df)

Exploring the artists

In [ ]:
# Get unique values
artists = pd.unique(df['artist'].str.split('|', expand=True).stack()).tolist()
for a in sorted(artists):
    print(a)

How often is each name used?

In [ ]:
# Split the artist column and count frequencies
artist_counts = (df['artist'].str.split('|')
                 .apply(lambda x: pd.Series(x).value_counts())
                 .sum()
                 .astype('int')
                 .sort_values(ascending=False)
                 .to_frame()
                 .reset_index(level=0))
# Add column names
artist_counts.columns = ['name', 'count']
# Display with horizontal bars
display(artist_counts.style.bar(subset=['count'], color='#d65f5f').set_properties(subset=['count'], **{'width': '300px'}))
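
Since matplotlib is already imported, the same counts can also be drawn as a regular bar chart:

In [ ]:
# Plot the 20 most frequent artists as horizontal bars
artist_counts.head(20).set_index('name')['count'].plot.barh(figsize=(8, 8))
plt.gca().invert_yaxis()  # most frequent at the top
plt.xlabel('number of records')
plt.show()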

Creating a list of unique types

In [ ]:
# Get unique values (dropping missing ones)
types = pd.unique(df['type'].dropna()).tolist()
for t in sorted(types, key=str.lower):
    print(t)
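
As with the artists, we can count how often each type occurs:

In [ ]:
# Frequency of each object type
df['type'].value_counts()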

Face detection with OpenCV

Face detection is a computer vision technology that locates human faces in digital images and videos. Let's try to identify the faces in our set of portraits.

The Open Source Computer Vision Library (OpenCV) is an open source computer vision and machine learning software library. In this example, images are treated as standard NumPy arrays of pixel values.

Read in the image using the imread function. We will use a sample portrait ('smithsonian-example.jpg') for demonstration purposes.

In [ ]:
test_image = cv2.imread('smithsonian-example.jpg')

plt.imshow(test_image)

The type and shape of the array.

In [ ]:
print(type(test_image))
print(test_image.shape)  # (height, width, channels)

Regarding the color, we might have expected a brightly colored image, but we obtained a bluish one. That happens because OpenCV and matplotlib use different orderings of the primary colors.

While OpenCV reads images as BGR, matplotlib expects RGB. To avoid this issue, we convert the image to the order matplotlib expects using the cvtColor function.

In [ ]:
rgb_image = cv2.cvtColor(test_image, cv2.COLOR_BGR2RGB)
plt.imshow(rgb_image)

In [ ]:
# Define a helper function that converts a BGR image to RGB
def convertToRGB(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

Haar cascade files

OpenCV comes with many pre-trained classifiers, for instance for smiles, eyes, and faces.
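
The pre-trained XML files ship with the opencv-python package, so we can list what is available via cv2.data.haarcascades:

In [ ]:
# List the Haar cascade files bundled with opencv-python
for f in sorted(os.listdir(cv2.data.haarcascades)):
    print(f)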

Loading the frontal face classifier

In [ ]:
# assumes a local copy of the cascade file in an 'opencv/' folder
haar_cascade_face = cv2.CascadeClassifier('opencv/haarcascade_frontalface_default.xml')
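
If the XML file is missing, OpenCV fails silently and detection simply finds nothing, so it is worth checking that the classifier actually loaded. As a fallback we can point to the copy bundled with opencv-python:

In [ ]:
# Fall back to the bundled cascade file if the local copy was not found
if haar_cascade_face.empty():
    haar_cascade_face = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')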

Face detection

We use the detectMultiScale method of the classifier. It returns a list of rectangles, each given by coordinates (x, y, w, h), around the detected faces. Since the classifier works on single-channel images, we first convert the test image to grayscale.

In [ ]:
# convert to grayscale, since the classifier works on single-channel images
test_image_gray = cv2.cvtColor(test_image, cv2.COLOR_BGR2GRAY)

# scaleFactor: image shrink step per scale; minNeighbors: detections needed to keep a face
faces_rects = haar_cascade_face.detectMultiScale(test_image_gray, scaleFactor = 1.2, minNeighbors = 5)

# Let us print the no. of faces found
print('Faces found: ', len(faces_rects))

Our next step is to loop over all the coordinates returned and draw rectangles around them using OpenCV. We will draw green rectangles with a thickness of 2.

In [ ]:
for (x, y, w, h) in faces_rects:
    cv2.rectangle(test_image, (x, y), (x+w, y+h), (0, 255, 0), 2)

Finally, we display the image to check whether the faces have been detected correctly.

In [ ]:
plt.imshow(convertToRGB(test_image))
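
If we want to keep a copy of the annotated image, OpenCV can write it back to disk. Note that cv2.imwrite expects the BGR ordering that test_image still uses (the output filename below is just an example):

In [ ]:
# Save the image with the detection rectangles drawn on it
cv2.imwrite('smithsonian-example-faces.jpg', test_image)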

Let's do face detection for all the images

First we download all the portraits. This may take a while due to the size and quality of the images.

In [ ]:
os.makedirs('is-images', exist_ok=True)

for index, row in df.iterrows():
    print(index, row['imageUrl'])
    response = requests.get(row['imageUrl'])
    img = Image.open(BytesIO(response.content))
    img.save('is-images/m-{}.jpg'.format(row['id']), quality=90)
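
Since we only need the images for face detection, we could also request smaller renditions through the IIIF Image API, whose !w,h size syntax scales an image to fit inside the given box while preserving the aspect ratio. A sketch:

In [ ]:
# Derive a smaller IIIF rendition from a full-size image URL
smallUrl = df['imageUrl'][0].replace('/full/full/', '/full/!800,800/')
print(smallUrl)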

Finally, we process all the images

In [ ]:
grid_rows = 20  # renamed so we don't shadow the 'rows' search parameter above
files = [f for f in os.listdir('is-images') if f.endswith('.jpg')]

fig = plt.figure(figsize=(100,300))
for num, fname in enumerate(files):
    img = cv2.imread('is-images/' + fname)

    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces_rects = haar_cascade_face.detectMultiScale(img_gray, scaleFactor = 1.2, minNeighbors = 5)

    for (x, y, w, h) in faces_rects:
        cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)

    # convert image to RGB and show image
    img_face = convertToRGB(img)

    plt.subplot(grid_rows, 5, num+1)
    plt.axis('off')
    plt.imshow(img_face)
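
Finally, we could record how many faces the classifier finds in each portrait. A sketch that reuses the detector above and summarizes the counts with Pandas:

In [ ]:
# Count detected faces per downloaded image
face_counts = {}
for fname in files:
    img = cv2.imread('is-images/' + fname)
    if img is None:
        continue
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    rects = haar_cascade_face.detectMultiScale(gray, scaleFactor = 1.2, minNeighbors = 5)
    face_counts[fname] = len(rects)

pd.Series(face_counts).sort_values(ascending=False).head(10)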