In [1]:
import json
import csv
import requests
from pprint import pprint

Picmaglia Data Import

Often, when building applications with Datamaglia, a large amount of user data needs to be imported through the Datamaglia API. This is typically the case when integrating Datamaglia with an existing application, where we haven't been posting user information and actions to the Datamaglia API as they occur.

This notebook will show how we can use the Datamaglia API to import data from our Picmaglia application. We have three CSV files that we have exported from our database: users.csv (simply a list of usernames), out_pic.csv (pictures, including some metadata about which user created them and the location where they were taken), and out_like.csv (a list of username, picture pairs where the user has liked the picture). We would like to import this data into Datamaglia so that we can start generating recommended pictures for users, based on the photos they have previously liked. This import will be done in five steps:

  1. Configure the application in the management console
  2. Define helpers and constants
  3. Import users
  4. Import pictures
  5. Import user-LIKES->picture actions

Management Console Configuration

We first must define our data model in the Datamaglia Management Console. Ultimately we are interested in users liking photos, so our data model looks like this:

Picmaglia Admin

Note the configuration of the Subgraph here, person(id)-[:likes]->content(url). This specifies the types of the data objects we will be inserting into the Datamaglia API and which user actions we will use to generate recommendations. For more information on the concept of a Subgraph and on configuring Datamaglia data models, see our Getting Started Tutorial.

Define helpers and constants

Here we will define some helper functions and some constants. These are documented below, but pay special attention to the URLs that we define.

In [2]:
# Subgraph ID from the management console.
SUBGRAPH_ID = "6044a63406e748bc9c1cd54c1a77f4da"

# Our API key, also from the management console, used to authenticate our requests.
API_KEY = "f88edf839cad4600b139bed5d6184efb"

# Specify Auth-Token header
HEADERS = {'Auth-Token': API_KEY, 'Content-Type': 'application/json'}

# Base URL for Datamaglia API
BASE_URL = 'https://api.datamaglia.com/v1{}'

# URL for inserting sources (users)
SRC_URL = BASE_URL.format('/subgraphs/' + SUBGRAPH_ID + '/data/sources/')

# URL for inserting targets (pictures)
TARGET_URL = BASE_URL.format('/subgraphs/' + SUBGRAPH_ID + '/data/targets')

# URL for inserting relationships (user -[likes]-> photo)
REL_URL = BASE_URL.format('/subgraphs/' + SUBGRAPH_ID + '/data/relationships/')
In [3]:
# helper function to iterate through a list in chunks of a specified size
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))
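
As a quick sanity check, we can run chunker on a small example list (this cell is purely illustrative and not part of the import itself):

In [ ]:
# chunker yields successive slices of the sequence; the last chunk may be shorter
print list(chunker(['a', 'b', 'c', 'd', 'e'], 2))
[['a', 'b'], ['c', 'd'], ['e']]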

Importing Users

Our users are defined in data/users.csv as a simple list of usernames, one username per row. We will load this file and make POST requests to the Datamaglia API to add these users as Sources in our Subgraph.

In [4]:
# init empty list to hold our users
users = []

# use csv.DictReader to parse the file and append to our users list
with open('data/users.csv') as f:
    for line in csv.DictReader(f):
        users.append(line)
        
print users[0:2]
[{'users': 'jacubert'}, {'users': 'namirte_stevens'}]

Next we define a function that takes a list of users and creates the structure that will be serialized to JSON and sent as the body of our POST request to insert these users into the Datamaglia API. Note that subgraphs/{subgraphId}/sources takes an entities JSON array for batch inserts.
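
For example, the serialized request body for the first two users we loaded would look like this (a quick preview cell, separate from the import itself):

In [ ]:
# Preview the JSON body that createSources (defined below) will send
print json.dumps({'entities': [{'id': u['users']} for u in users[0:2]]}, indent=2)
{
  "entities": [
    {
      "id": "jacubert"
    },
    {
      "id": "namirte_stevens"
    }
  ]
}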

In [5]:
def createSources(users):
    # list comprehension to create a dict for each user in the users list of the form {id: username}
    entities = [
        {
            'id': user['users']
        } for user in users
    ]
    payload = {'entities': entities} # This payload will be serialized to JSON and sent with our request
    resp = requests.post(SRC_URL, headers=HEADERS, data=json.dumps(payload)) # Make the POST request
    print resp # Should be 204

Here we call the createSources function with the usernames we loaded previously, in batches of 100. Note that the slice below resumes from index 7000; when starting from scratch, iterate over the full users list.

In [8]:
for chunk in chunker(users[7000:], 100):
    createSources(chunk)
<Response [204]>
<Response [204]>
<Response [204]>
...
<Response [204]>
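
Note that createSources simply prints each response. For a long-running import it would be wise to check the status code and retry failed batches. Below is a minimal sketch; the retry-with-backoff policy is our own choice for illustration, not something prescribed by the Datamaglia API:

In [ ]:
import time

def createSourcesWithRetry(users, retries=3):
    entities = [{'id': user['users']} for user in users]
    payload = json.dumps({'entities': entities})
    for attempt in range(retries):
        resp = requests.post(SRC_URL, headers=HEADERS, data=payload)
        if resp.status_code == 204:
            return  # batch accepted
        time.sleep(2 ** attempt)  # back off before retrying
    print 'Giving up on batch of {} users (last status: {})'.format(len(users), resp.status_code)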

Importing Pictures

Next we import our pictures, which represent the Target piece of our Subgraph. These are the items that we will recommend to users based on photos they have previously liked. This piece is very similar to the Importing Users section above; however, here we demonstrate how we can store arbitrary properties for a given Source or Target.

In [9]:
# init empty pictures list
pictures = []

# read csv file and add each picture dict to the pictures list
with open('data/out_pic.csv') as f:
    for line in csv.DictReader(f):
        pictures.append(line)
        
pprint(pictures[0:2])
[{'lat': '27.74613857589044',
  'lon': '-15.585136413574219',
  'text': 'my travel',
  'url': 'https://ppcdn.500px.org/95503699/7cf68738631dba1d1b946b3e0a90ab21264c238f/3.jpg?v=11',
  'user': 'danno_80'},
 {'lat': '-34.829728',
  'lon': '19.984637',
  'text': 'Another shot from a shipwreck at the coast of Cape Agulhas in South Africa.',
  'url': 'https://ppcdn.500px.org/95503635/39614c9852eadb4c5dd63a75a764eb7df3a5e6c8/3.jpg?v=10',
  'user': 'AndreasKunz1'}]

Next we define the createTargets function, which will create the JSON structure for our pictures and make the POST request to the Datamaglia API. Note that we can define a properties list of key/value pairs. This data will be returned to us when we generate recommendations. A common use case is storing the data necessary to render client views for recommendations, without making a separate request to hydrate the recommended objects from the client backend.

In [10]:
def createTargets(pictures):
    entities = [
        {
            'id': pic['url'],
            'properties': [
                {'key': 'lat', 'value': pic['lat']},
                {'key': 'lon', 'value': pic['lon']},
                {'key': 'text', 'value': pic['text']},
                {'key': 'user', 'value': pic['user']}
            ]
        } for pic in pictures
    ]
    payload = {'entities': entities}
    resp = requests.post(TARGET_URL, headers=HEADERS, data=json.dumps(payload))
    print resp # Should be 204
In [ ]:
for chunk in chunker(pictures, 100):
    createTargets(chunk)
<Response [204]>
<Response [204]>

Importing Relationships

Now that we have created our Sources and Targets using the Datamaglia API, we can post the relationships, or actions, that define our user preferences. These relationships form the data that will be used to generate recommendations.

In [28]:
likes = []
with open('data/out_like.csv') as f:
    for line in csv.DictReader(f):
        likes.append(line)
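
As with users and pictures, it is worth a quick look at the parsed rows; the createLikes function below assumes each row has user and pic columns:

In [ ]:
# Sanity-check the first few rows of the likes export
pprint(likes[0:2])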
In [29]:
def createLikes(likes):
    entities = [
        {
            'weight': 0,
            'source': like['user'],
            'target': like['pic']
        } for like in likes
    ]
    payload = {'entities': entities}
    resp = requests.post(REL_URL, headers=HEADERS, data=json.dumps(payload))
    print resp # Should be 204
In [30]:
for chunk in chunker(likes, 100):
    createLikes(chunk)
<Response [204]>
<Response [204]>

We've now imported all our data and are ready to query for recommendations!
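
To close the loop, a recommendation request for one of our users might look something like the sketch below. The endpoint path and response shape here are assumptions for illustration only; consult the Datamaglia API reference for the actual recommendations endpoint:

In [ ]:
# Hypothetical example: the '/recommendations/' path below is an assumption,
# not taken from the Datamaglia docs.
REC_URL = BASE_URL.format('/subgraphs/' + SUBGRAPH_ID + '/recommendations/')
resp = requests.get(REC_URL + 'jacubert', headers=HEADERS)
pprint(resp.json())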