This is a quick experiment in visualising format identification results from DROID using a Jupyter Notebook.
First, we load some example results.
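from io import StringIO

import pandas as pd
import requests

# Example DROID export from the demystify test corpus
url = 'https://raw.githubusercontent.com/exponential-decay/demystify/master/opf-test-corpus-test-output/opf-test-corpus-droid-analysis.csv'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
# keep_default_na=False keeps blank cells as empty strings instead of NaN
df = pd.read_csv(StringIO(response.text), keep_default_na=False)
df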
| | ID | PARENT_ID | URI | FILE_PATH | NAME | METHOD | STATUS | SIZE | TYPE | EXT | LAST_MODIFIED | EXTENSION_MISMATCH | SHA1_HASH | FORMAT_COUNT | PUID | MIME_TYPE | FORMAT_NAME | FORMAT_VERSION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | file:////10.1.4.222/gda/archives-sample-files/... | \\10.1.4.222\gda\archives-sample-files\opf-for... | format-corpus | | Done | | Folder | | 2014-02-28T15:49:11 | False | | | | | | |
| 1 | 3 | 2 | file:////10.1.4.222/gda/archives-sample-files/... | \\10.1.4.222\gda\archives-sample-files\opf-for... | video | | Done | | Folder | | 2014-02-28T15:48:47 | False | | | | | | |
| 2 | 4 | 3 | file:////10.1.4.222/gda/archives-sample-files/... | \\10.1.4.222\gda\archives-sample-files\opf-for... | Quicktime | | Done | | Folder | | 2014-02-28T15:48:59 | False | | | | | | |
| 3 | 5 | 4 | file:////10.1.4.222/gda/archives-sample-files/... | \\10.1.4.222\gda\archives-sample-files\opf-for... | apple-intermediate-codec.mov | Signature | Done | 319539 | File | mov | 2014-02-18T16:58:16 | False | d097cf36467373f52b974542d48bec134279fa3f | 1 | x-fmt/384 | video/quicktime | Quicktime | |
| 4 | 6 | 4 | file:////10.1.4.222/gda/archives-sample-files/... | \\10.1.4.222\gda\archives-sample-files\opf-for... | animation.mov | Signature | Done | 1020209 | File | mov | 2014-02-18T16:58:16 | False | edb5226b963f449ce58054809149cb812bdf8c0a | 1 | x-fmt/384 | video/quicktime | Quicktime | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 394 | 396 | 395 | file:////10.1.4.222/gda/archives-sample-files/... | \\10.1.4.222\gda\archives-sample-files\opf-for... | InDesign | | Done | | Folder | | 2014-02-28T15:49:11 | False | | | | | | |
| 395 | 397 | 396 | file:////10.1.4.222/gda/archives-sample-files/... | \\10.1.4.222\gda\archives-sample-files\opf-for... | Neddy_Flyer_ft_HeatherRyan.jpg | Signature | Done | 1620612 | File | jpg | 2014-02-18T16:58:08 | False | 884de50cb1c052c0e10bef306850ee995d965175 | 1 | fmt/41 | image/jpeg | Raw JPEG Stream | |
| 396 | 398 | 396 | file:////10.1.4.222/gda/archives-sample-files/... | \\10.1.4.222\gda\archives-sample-files\opf-for... | Neddy_Flyer_HeatherRyan.pdf | Signature | Done | 59106 | File | | 2014-02-18T16:58:08 | False | 9e19b76e8364c840945bc380ab5f98f00a23ab80 | 1 | fmt/17 | application/pdf | Acrobat PDF 1.3 - Portable Document Format | 1.3 |
| 397 | 399 | 396 | file:////10.1.4.222/gda/archives-sample-files/... | \\10.1.4.222\gda\archives-sample-files\opf-for... | Neddy_Flyer_HeatherRyan.indd | Signature | Done | 1503232 | File | indd | 2014-02-18T16:58:08 | False | d9211fe38e79f34fb7a043fe34df59527fb6e179 | 1 | fmt/196 | | Adobe InDesign Document | CS |
| 398 | 400 | 396 | file:////10.1.4.222/gda/archives-sample-files/... | \\10.1.4.222\gda\archives-sample-files\opf-for... | Neddy_Flyer_README_HeatherRyan.md.rtf | Signature | Done | 1210 | File | rtf | 2014-02-18T16:58:08 | False | 3665b0c1457f996359939746752fe86f2025b68d | 1 | fmt/50 | application/rtf, text/rtf | Rich Text Format | 1.5-1.6 |
399 rows × 18 columns
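The `keep_default_na=False` option used when loading is easy to overlook, so here is a minimal sketch of what it changes. The two-row CSV below is a hypothetical extract with only a few of DROID's columns, not the real corpus output; it just shows that blank cells (such as the PUID of a folder) are kept as empty strings rather than converted to `NaN`.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row extract in a DROID-like layout: one folder, one file.
sample_csv = (
    "ID,NAME,TYPE,PUID,MIME_TYPE\n"
    "2,format-corpus,Folder,,\n"
    "5,animation.mov,File,x-fmt/384,video/quicktime\n"
)

# Default parsing: blank cells become NaN.
with_nan = pd.read_csv(StringIO(sample_csv))

# With keep_default_na=False: blank cells stay as empty strings.
with_blanks = pd.read_csv(StringIO(sample_csv), keep_default_na=False)

print(with_nan["PUID"].isna().sum())      # number of PUID cells parsed as NaN
print((with_blanks["PUID"] == "").sum())  # number of PUID cells kept as ""
```

Keeping blanks as plain strings means every column has one consistent value for "nothing recorded", which tends to be simpler to group and plot.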
Now that we have the data, we can explore ways to visualise it.
Here's a simple bar chart that counts the entries for each PUID and colours the bars by TYPE.
import altair as alt

# Count rows per PUID, sorted by descending count and coloured by TYPE
alt.Chart(df).mark_bar().encode(
    x=alt.X('PUID', sort='-y'),
    y='count()',
    color='TYPE',
    tooltip=['TYPE', 'PUID', 'FORMAT_NAME', 'FORMAT_VERSION', 'count()']
).interactive()
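To check the numbers behind a chart like the one above, the same aggregation can be sketched directly in pandas. The small DataFrame here is hypothetical stand-in data (the column names match the DROID export, the rows do not), and `counts` is just an illustrative variable name.

```python
import pandas as pd

# Hypothetical rows standing in for the DROID results loaded earlier.
sample = pd.DataFrame({
    "TYPE": ["Folder", "File", "File", "File"],
    "PUID": ["", "x-fmt/384", "x-fmt/384", "fmt/41"],
})

# One row per (TYPE, PUID) pair with its number of occurrences,
# largest first, which mirrors y='count()' with sort='-y'.
counts = (
    sample.groupby(["TYPE", "PUID"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
print(counts)
```

Running the same `groupby` on `df` instead of `sample` should give the values shown in the chart's tooltips.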
Alternatively, we can group the results by MIME type and use colour to show the various PUIDs associated with each one.
# Count rows per MIME type; each colour segment is a distinct PUID (legend hidden)
alt.Chart(df).mark_bar().encode(
    x=alt.X('MIME_TYPE', sort='-y'),
    y='count()',
    color=alt.Color('PUID', legend=None),
    tooltip=['TYPE', 'MIME_TYPE', 'PUID', 'FORMAT_NAME', 'FORMAT_VERSION', 'count()']
).interactive()
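Because the legend is hidden, it can be hard to tell from the chart alone how many PUIDs sit behind each MIME type. A rough pandas sketch of that summary follows; the rows are hypothetical sample data, and `puids_per_mime`, `files`, and `distinct_puids` are names chosen only for this illustration.

```python
import pandas as pd

# Hypothetical rows: three PDFs across two PUIDs, and one JPEG.
sample = pd.DataFrame({
    "MIME_TYPE": ["application/pdf", "application/pdf", "image/jpeg", "application/pdf"],
    "PUID": ["fmt/17", "fmt/18", "fmt/41", "fmt/17"],
})

# For each MIME type: how many rows it has and how many different PUIDs appear.
puids_per_mime = (
    sample.groupby("MIME_TYPE")["PUID"]
    .agg(files="count", distinct_puids="nunique")
    .sort_values("files", ascending=False)
)
print(puids_per_mime)
```

A MIME type with more than one distinct PUID corresponds to a bar with several colour segments in the chart.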