import json
import csv
import requests
from pprint import pprint
Often, when building applications with Datamaglia, a large amount of user data needs to be imported through the Datamaglia API. This is typically the case when integrating Datamaglia with an existing application, where we haven't been posting user information and actions to the Datamaglia API as they occur.
This notebook will show how we can use the Datamaglia API to import data from our Picmaglia application. We have three CSV files exported from our database: users.csv
(simply a list of usernames, one per row), out_pic.csv
(pictures, including some metadata about which user created them and the location where they were taken), and out_like.csv
(a list of username, picture pairs where the user has liked the picture). We would like to import this data into Datamaglia so that we can start generating recommended pictures for users, based on the photos they have previously liked. This import will be done in five steps: defining the data model, setting up helpers and constants, importing users, importing pictures, and importing likes.
We first must define our data model in the Datamaglia Management Console. Ultimately we are interested in users liking photos, so our data model looks like this:
Note the configuration of the Subgraph here, person(id)-[:likes]->content(url)
. This specifies the types of the data objects we will be inserting into the Datamaglia API and which user actions we will use to generate recommendations. For more information on the concept of a Subgraph and configuring Datamaglia data models, see our Getting Started Tutorial.
Here we will define some helper functions and some constants. These are documented below, but pay special attention to the URLs that we define.
# Subgraph ID from the management console
SUBGRAPH_ID = "6044a63406e748bc9c1cd54c1a77f4da"
# Our API key, also from the management console, used to authenticate our requests
API_KEY = "f88edf839cad4600b139bed5d6184efb"
# Specify Auth-Token header
HEADERS = {'Auth-Token': API_KEY, 'Content-Type': 'application/json'}
# Base URL for Datamaglia API
BASE_URL = 'https://api.datamaglia.com/v1{}'
# URL for inserting sources (users)
SRC_URL = BASE_URL.format('/subgraphs/' + SUBGRAPH_ID + '/data/sources/')
# URL for inserting targets (pictures)
TARGET_URL = BASE_URL.format('/subgraphs/' + SUBGRAPH_ID + '/data/targets')
# URL for inserting relationships (user -[likes]-> photo)
REL_URL = BASE_URL.format('/subgraphs/' + SUBGRAPH_ID + '/data/relationships/')
# Helper function to iterate through a list in chunks of a specified size
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
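As a quick sanity check (not part of the import itself), chunking a list of five items into chunks of two yields three chunks, with the remainder in the last one:

```python
# Helper from above: yields successive slices of `seq` of length `size`
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# Split five items into chunks of two; the last chunk holds the remainder
chunks = list(chunker(['a', 'b', 'c', 'd', 'e'], 2))
print(chunks)  # [['a', 'b'], ['c', 'd'], ['e']]
```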
Our users are defined in data/users.csv
as a simple list of usernames, one username per row. We will load this file and make POST
requests to the Datamaglia API to add these users as Sources in our Subgraph.
# Init an empty list to hold our users
users = []
# Use csv.DictReader to parse the file and append each row to our users list
with open('data/users.csv') as f:
    for line in csv.DictReader(f):
        users.append(line)
print(users[0:2])
[{'users': 'jacubert'}, {'users': 'namirte_stevens'}]
Next we define a function that takes a list of users and creates the structure that will be serialized to JSON and sent as the body of our POST request to insert these users into the Datamaglia API. Note that subgraphs/{subgraphId}/data/sources
takes an entities
JSON array for batch inserts.
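For reference, the serialized body for a batch of two users looks like this (the usernames here are made up for illustration):

```python
import json

# Build the payload for two illustrative users, mirroring createSources below
entities = [{'id': 'alice'}, {'id': 'bob'}]
payload = {'entities': entities}
print(json.dumps(payload))  # {"entities": [{"id": "alice"}, {"id": "bob"}]}
```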
def createSources(users):
    # List comprehension to create a dict of the form {id: username} for each user
    entities = [
        {
            'id': user['users']
        } for user in users
    ]
    payload = {'entities': entities}  # This payload will be serialized to JSON and sent with our request
    resp = requests.post(SRC_URL, headers=HEADERS, data=json.dumps(payload))  # Make the POST request
    print(resp)  # Should be a 204 status
Here we call the createSources
function with the usernames we loaded previously, in batches of 100.
for chunk in chunker(users[7000:], 100):
    createSources(chunk)
<Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]> <Response [204]>
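Printing each response works for a one-off import, but for larger jobs it helps to check status codes and collect failures for retry. Here is a minimal sketch (our own helper, not part of the Datamaglia client); the post function is injected so the batching logic can be exercised without the network:

```python
def post_in_batches(items, batch_size, post_fn):
    """Post items in fixed-size batches; return the indices of failed batches."""
    failed = []
    for i, start in enumerate(range(0, len(items), batch_size)):
        batch = items[start:start + batch_size]
        status = post_fn(batch)  # e.g. a wrapper returning resp.status_code
        if status != 204:
            failed.append(i)
    return failed

# Stubbed post function: pretend the second of three batches fails
statuses = iter([204, 500, 204])
failed = post_in_batches(list(range(250)), 100, lambda batch: next(statuses))
print(failed)  # [1]
```

In the real import, `post_fn` would wrap `requests.post(...)` and return `resp.status_code`, and the failed batches could simply be re-posted.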
Next we import our pictures, which represent the Target piece of our Subgraph. These are the items that we will be recommending to users based on photos they have previously liked. This is very similar to the Importing Users section above, but here we demonstrate how to store arbitrary properties for a given Source or Target.
# Init an empty pictures list
pictures = []
# Read the CSV file and add each picture dict to the pictures list
with open('data/out_pic.csv') as f:
    for line in csv.DictReader(f):
        pictures.append(line)
pprint(pictures[0:2])
[{'lat': '27.74613857589044', 'lon': '-15.585136413574219', 'text': 'my travel', 'url': 'https://ppcdn.500px.org/95503699/7cf68738631dba1d1b946b3e0a90ab21264c238f/3.jpg?v=11', 'user': 'danno_80'}, {'lat': '-34.829728', 'lon': '19.984637', 'text': 'Another shot from a shipwreck at the coast of Cape Agulhas in South Africa.', 'url': 'https://ppcdn.500px.org/95503635/39614c9852eadb4c5dd63a75a764eb7df3a5e6c8/3.jpg?v=10', 'user': 'AndreasKunz1'}]
Next we define the createTargets
function, which will create the JSON structure for our pictures and make the POST request to the Datamaglia API. Note that we can define a properties
list of key, value
pairs. This data will be returned to us when we generate recommendations. A common use case for this is storing the data necessary to render client views for recommendations, without making a separate request to hydrate the recommended objects from the client backend.
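The properties structure is just the picture's fields flattened into key/value pairs. A small convenience helper (our own, not part of the API) makes that explicit; the sample record below is abbreviated from the data shown above:

```python
def to_properties(record, keys):
    """Convert selected fields of a dict into a list of {'key', 'value'} pairs."""
    return [{'key': k, 'value': record[k]} for k in keys]

pic = {'url': 'https://example.com/1.jpg', 'lat': '27.7', 'lon': '-15.5',
       'text': 'my travel', 'user': 'danno_80'}
props = to_properties(pic, ['lat', 'lon', 'text', 'user'])
print(props[0])  # {'key': 'lat', 'value': '27.7'}
```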
def createTargets(pictures):
    entities = [
        {
            'id': pic['url'],
            'properties': [
                {'key': 'lat', 'value': pic['lat']},
                {'key': 'lon', 'value': pic['lon']},
                {'key': 'text', 'value': pic['text']},
                {'key': 'user', 'value': pic['user']}
            ]
        } for pic in pictures
    ]
    payload = {'entities': entities}
    resp = requests.post(TARGET_URL, headers=HEADERS, data=json.dumps(payload))
    print(resp)  # Should be 204
for chunk in chunker(pictures, 100):
    createTargets(chunk)
<Response [204]> <Response [204]>
Now that we have created our Sources and Targets using the Datamaglia API, we can post the relationships, or actions, that define our users' preferences. These relationships form the data that will be used for generating recommendations.
likes = []
with open('data/out_like.csv') as f:
    for line in csv.DictReader(f):
        likes.append(line)
def createLikes(likes):
    entities = [
        {
            'weight': 0,
            'source': like['user'],
            'target': like['pic']
        } for like in likes
    ]
    payload = {'entities': entities}
    resp = requests.post(REL_URL, headers=HEADERS, data=json.dumps(payload))
    print(resp)  # Should be 204
for chunk in chunker(likes, 100):
    createLikes(chunk)
<Response [204]> <Response [204]>
We've now imported all our data and are ready to query for recommendations!
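A recommendation request then follows the same URL-building pattern as the inserts. The path below is purely illustrative (check the Datamaglia API reference for the real recommendations endpoint and its parameters); the username is one from our imported data:

```python
BASE_URL = 'https://api.datamaglia.com/v1{}'
SUBGRAPH_ID = '6044a63406e748bc9c1cd54c1a77f4da'

# Hypothetical recommendations URL for a given source (user) -- illustration
# only; consult the API reference for the actual path and query parameters
REC_URL = BASE_URL.format(
    '/subgraphs/' + SUBGRAPH_ID + '/recommendations/sources/' + 'jacubert')
print(REC_URL)
# resp = requests.get(REC_URL, headers=HEADERS)  # then inspect resp.json()
```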