Complex and Nested Data

In this notebook, I wanna try to go through a similar process as this Spark notebook on processing complex and nested data https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html

In [37]:
data =  [
 (0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp" :1475600496 }"""),
 (1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp" :1475600498 }"""),
 (2, """{"device_id": 2, "device_type": "sensor-ipad", "ip": "88.36.5.1", "cca3": "ITA", "cn": "Italy", "temp": 18, "signal": 25, "battery_level": 5, "c02_level": 1372, "timestamp" :1475600500 }"""),
 (3, """{"device_id": 3, "device_type": "sensor-inest", "ip": "66.39.173.154", "cca3": "USA", "cn": "United States", "temp": 47, "signal": 12, "battery_level": 1, "c02_level": 1447, "timestamp" :1475600502 }"""),
(4, """{"device_id": 4, "device_type": "sensor-ipad", "ip": "203.82.41.9", "cca3": "PHL", "cn": "Philippines", "temp": 29, "signal": 11, "battery_level": 0, "c02_level": 983, "timestamp" :1475600504 }"""),
(5, """{"device_id": 5, "device_type": "sensor-istick", "ip": "204.116.105.67", "cca3": "USA", "cn": "United States", "temp": 50, "signal": 16, "battery_level": 8, "c02_level": 1574, "timestamp" :1475600506 }"""),
(6, """{"device_id": 6, "device_type": "sensor-ipad", "ip": "220.173.179.1", "cca3": "CHN", "cn": "China", "temp": 21, "signal": 18, "battery_level": 9, "c02_level": 1249, "timestamp" :1475600508 }"""),
(7, """{"device_id": 7, "device_type": "sensor-ipad", "ip": "118.23.68.227", "cca3": "JPN", "cn": "Japan", "temp": 27, "signal": 15, "battery_level": 0, "c02_level": 1531, "timestamp" :1475600512 }"""),
(8 ,""" {"device_id": 8, "device_type": "sensor-inest", "ip": "208.109.163.218", "cca3": "USA", "cn": "United States", "temp": 40, "signal": 16, "battery_level": 9, "c02_level": 1208, "timestamp" :1475600514 }"""),
(9,"""{"device_id": 9, "device_type": "sensor-ipad", "ip": "88.213.191.34", "cca3": "ITA", "cn": "Italy", "temp": 19, "signal": 11, "battery_level": 0, "c02_level": 1171, "timestamp" :1475600516 }"""),
(10,"""{"device_id": 10, "device_type": "sensor-igauge", "ip": "68.28.91.22", "cca3": "USA", "cn": "United States", "temp": 32, "signal": 26, "battery_level": 7, "c02_level": 886, "timestamp" :1475600518 }"""),
(11,"""{"device_id": 11, "device_type": "sensor-ipad", "ip": "59.144.114.250", "cca3": "IND", "cn": "India", "temp": 46, "signal": 25, "battery_level": 4, "c02_level": 863, "timestamp" :1475600520 }"""),
(12, """{"device_id": 12, "device_type": "sensor-igauge", "ip": "193.156.90.200", "cca3": "NOR", "cn": "Norway", "temp": 18, "signal": 26, "battery_level": 8, "c02_level": 1220, "timestamp" :1475600522 }"""),
(13, """{"device_id": 13, "device_type": "sensor-ipad", "ip": "67.185.72.1", "cca3": "USA", "cn": "United States", "temp": 34, "signal": 20, "battery_level": 8, "c02_level": 1504, "timestamp" :1475600524 }"""),
(14, """{"device_id": 14, "device_type": "sensor-inest", "ip": "68.85.85.106", "cca3": "USA", "cn": "United States", "temp": 39, "signal": 17, "battery_level": 8, "c02_level": 831, "timestamp" :1475600526 }"""),
(15, """{"device_id": 15, "device_type": "sensor-ipad", "ip": "161.188.212.254", "cca3": "USA", "cn": "United States", "temp": 27, "signal": 26, "battery_level": 5, "c02_level": 1378, "timestamp" :1475600528 }"""),
(16, """{"device_id": 16, "device_type": "sensor-igauge", "ip": "221.3.128.242", "cca3": "CHN", "cn": "China", "temp": 10, "signal": 24, "battery_level": 6, "c02_level": 1423, "timestamp" :1475600530 }"""),
(17, """{"device_id": 17, "device_type": "sensor-ipad", "ip": "64.124.180.215", "cca3": "USA", "cn": "United States", "temp": 38, "signal": 17, "battery_level": 9, "c02_level": 1304, "timestamp" :1475600532 }"""),
(18, """{"device_id": 18, "device_type": "sensor-igauge", "ip": "66.153.162.66", "cca3": "USA", "cn": "United States", "temp": 26, "signal": 10, "battery_level": 0, "c02_level": 902, "timestamp" :1475600534 }"""),
(19, """{"device_id": 19, "device_type": "sensor-ipad", "ip": "193.200.142.254", "cca3": "AUT", "cn": "Austria", "temp": 32, "signal": 27, "battery_level": 5, "c02_level": 1282, "timestamp" :1475600536 }""")]

Now let's through this in XND

In [38]:
import xnd
eventsDS = xnd.xnd(data)
eventsDS.type
Out[38]:
ndt("20 * (int64, string)")

Looks good! Let's see if we can make that a struct instead of a tuple.

In [39]:
eventsDS = xnd.xnd(data, type="20 * {id: int64, device: string}")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-1f72a7c7018c> in <module>()
----> 1 eventsDS = xnd.xnd(data, type="20 * {id: int64, device: string}")

/usr/local/miniconda3/envs/numba-xnd/lib/python3.6/site-packages/xnd/__init__.py in __new__(cls, value, type, dtype, levels, typedef, dtypedef)
    125         else:
    126             type = typeof(value)
--> 127         return super().__new__(cls, type=type, value=value)
    128 
    129     def __repr__(self):

TypeError: xnd: expected dict, not 'tuple'

No, looks like we can't do this.

Are memory of tuple and struct the same? Can we cast one to the other in XND?

Now let's parse the JSON. Ideally we could do this in XND, but let's use Python for now for JSON parsing.

Do we have a way of mapping a function (json.loads) over a column in xnd? Like can I get a `20 stringfrom a20 (int64, string)`?

In [40]:
import json

eventsFromJSONDF = xnd.xnd([{"id": i.value, "json": json.loads(s.value)} for (i, s) in eventsDS])
eventsFromJSONDF.type
Out[40]:
ndt("20 * {id : int64, json : {device_id : int64, device_type : string, ip : string, cca3 : string, cn : string, temp : int64, signal : int64, battery_level : int64, c02_level : int64, timestamp : int64}}")

This works :) Now let's get a couple of columns.

Can we "flatten" a struct in xnd somehow? Like this in spark: https://stackoverflow.com/questions/38753898/how-to-flatten-a-struct-in-a-spark-dataframe. Would this best be done by numba compiling this Python code?

In [41]:
jsDF = xnd.xnd([{
    "id": r['id'].value,
    "device_type": r['json']["device_type"].value,
    "ip": r['json']["ip"].value,
    "cca3": r['json']["cca3"].value
} for r in eventsFromJSONDF])
jsDF
Out[41]:
xnd([{'id': 0, 'device_type': 'sensor-ipad', 'ip': '68.161.225.1', 'cca3': 'USA'},
     {'id': 1, 'device_type': 'sensor-igauge', 'ip': '213.161.254.1', 'cca3': 'NOR'},
     {'id': 2, 'device_type': 'sensor-ipad', 'ip': '88.36.5.1', 'cca3': 'ITA'},
     {'id': 3, 'device_type': 'sensor-inest', 'ip': '66.39.173.154', 'cca3': 'USA'},
     {'id': 4, 'device_type': 'sensor-ipad', 'ip': '203.82.41.9', 'cca3': 'PHL'},
     {'id': 5, 'device_type': 'sensor-istick', 'ip': '204.116.105.67', 'cca3': 'USA'},
     {'id': 6, 'device_type': 'sensor-ipad', 'ip': '220.173.179.1', 'cca3': 'CHN'},
     {'id': 7, 'device_type': 'sensor-ipad', 'ip': '118.23.68.227', 'cca3': 'JPN'},
     {'id': 8, 'device_type': 'sensor-inest', 'ip': '208.109.163.218', 'cca3': 'USA'},
     ...],
    type='20 * {id : int64, device_type : string, ip : string, cca3 : string}')

OK that seems to work!

Is there a way to pretty print this in a table?

Now let's do some filering:

In [50]:
eventsDS.type
Out[50]:
ndt("20 * (int64, string)")

Can we infer types from existing xnd types? For example, xnd([xnd(1), xnd(2)]) Then we don't have to get .value when we are making xnd from other xnds

We do the filtering in Python instead of xnd.

In [61]:
devices = [json.loads(device.value) for (_, device) in eventsDS]
devicesDF = xnd.xnd([device for device in devices if device['temp'] > 10 and device['signal'] > 15])
devicesDF.type
Out[61]:
ndt("14 * {device_id : int64, device_type : string, ip : string, cca3 : string, cn : string, temp : int64, signal : int64, battery_level : int64, c02_level : int64, timestamp : int64}")

How do we get a nice plot of this like we can in spark?

One option is to convert it to pandas dataframe... Lemme put this on my todolist. Make NumPy and Pandas DF wrappers for XND so you can use them in Altair or Dask.

In [71]:
devicesDF[0]['signal']
Out[71]:
xnd(23, type='int64')
In [80]:
devicesUSDF = xnd.xnd(sorted([device.value for device in devicesDF], key=lambda d: (d['signal'], d['temp'])))
devicesUSDF.short_value(10)
Out[80]:
[{'device_id': 8,
  'device_type': 'sensor-inest',
  'ip': '208.109.163.218',
  'cca3': 'USA',
  'cn': 'United States',
  'temp': 40,
  'signal': 16,
  'battery_level': 9,
  'c02_level': 1208,
  ...: ...},
 {'device_id': 5,
  'device_type': 'sensor-istick',
  'ip': '204.116.105.67',
  'cca3': 'USA',
  'cn': 'United States',
  'temp': 50,
  'signal': 16,
  'battery_level': 8,
  'c02_level': 1574,
  ...: ...},
 {'device_id': 17,
  'device_type': 'sensor-ipad',
  'ip': '64.124.180.215',
  'cca3': 'USA',
  'cn': 'United States',
  'temp': 38,
  'signal': 17,
  'battery_level': 9,
  'c02_level': 1304,
  ...: ...},
 {'device_id': 14,
  'device_type': 'sensor-inest',
  'ip': '68.85.85.106',
  'cca3': 'USA',
  'cn': 'United States',
  'temp': 39,
  'signal': 17,
  'battery_level': 8,
  'c02_level': 831,
  ...: ...},
 {'device_id': 6,
  'device_type': 'sensor-ipad',
  'ip': '220.173.179.1',
  'cca3': 'CHN',
  'cn': 'China',
  'temp': 21,
  'signal': 18,
  'battery_level': 9,
  'c02_level': 1249,
  ...: ...},
 {'device_id': 1,
  'device_type': 'sensor-igauge',
  'ip': '213.161.254.1',
  'cca3': 'NOR',
  'cn': 'Norway',
  'temp': 30,
  'signal': 18,
  'battery_level': 6,
  'c02_level': 1413,
  ...: ...},
 {'device_id': 13,
  'device_type': 'sensor-ipad',
  'ip': '67.185.72.1',
  'cca3': 'USA',
  'cn': 'United States',
  'temp': 34,
  'signal': 20,
  'battery_level': 8,
  'c02_level': 1504,
  ...: ...},
 {'device_id': 0,
  'device_type': 'sensor-ipad',
  'ip': '68.161.225.1',
  'cca3': 'USA',
  'cn': 'United States',
  'temp': 25,
  'signal': 23,
  'battery_level': 8,
  'c02_level': 917,
  ...: ...},
 {'device_id': 2,
  'device_type': 'sensor-ipad',
  'ip': '88.36.5.1',
  'cca3': 'ITA',
  'cn': 'Italy',
  'temp': 18,
  'signal': 25,
  'battery_level': 5,
  'c02_level': 1372,
  ...: ...},
 ...]

OK I don't really feel like going through the rest of this notebook and replicating all of it.

A few other things I noticed was that it supports Map types. We could support these as tuples of keys and values in XND, if we are ok with O(N) value lookups.

Conclusions

There are definately some things that are missing if we compare using xnd to spark here. These are:

  1. Map types
  2. converting between tuples and structs
  3. Taking "column". Like if we have a list of structs, converting this to a list of a certain value of that struct. This is also like flattening a list of struct into a struct of lists.
  4. Filtering and sorting
  5. Displaying Xnd objects that resemble dataframes as dataframes, by converting to pandas or subclassing or acting like pandas Dataframe.
  6. Serializing to/from JSON

I think that 6. (JSON) could be handled by libxnd, since it really has to do with doing effecient creation of xnd structures. It would be harder to do that in a higher level library and get good performance.

For 5. (viewing as DF) I think this could be handled by a higher level library that uses XND. So it uses XND as data backing but provides a user facing API on top of it.

For 1, 2, 3, 4, I am not sure where these should live. It would be nice if we could do filtering inside XND, but maybe this belongs in a higher level library? I can see array abstraction library being useful for this, but then that library wouldn't just be array abstractions but more like nested data abstractions.

In [ ]: