Analyzing a simple Crowdcrafting application with Enki

PyBossa provides a small package, Enki, for the statistical analysis of any PyBossa application.

The package lets you statistically analyze all the task runs contributed to your application, for example to the Video Pattern Recognition application in Crowdcrafting.

To use it, the first step is to install it.

NOTE: it takes some minutes to compile all the required libraries, so be patient.

In [4]:
!pip install enki
Requirement already satisfied (use --upgrade to upgrade): enki in ./env/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): pybossa-client in ./env/lib/python2.7/site-packages (from enki)
Requirement already satisfied (use --upgrade to upgrade): pandas in ./env/lib/python2.7/site-packages (from enki)
Requirement already satisfied (use --upgrade to upgrade): requests>=0.13.0 in ./env/lib/python2.7/site-packages (from pybossa-client->enki)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in ./env/lib/python2.7/site-packages (from pandas->enki)
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in ./env/lib/python2.7/site-packages (from pandas->enki)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.6.1 in ./env/lib/python2.7/site-packages (from pandas->enki)
Requirement already satisfied (use --upgrade to upgrade): six in ./env/lib/python2.7/site-packages (from python-dateutil->pandas->enki)
Cleaning up...

After installing the package you can import it:

In [5]:
import enki

And then, you can start analyzing the application with the following command:

In [6]:
e = enki.Enki(api_key='private', endpoint='http://crowdcrafting.org', app_short_name='vimeo')

As you can see, the api_key argument is your private API key, but you don't need a valid one just to read from the API.

Then, you can get all the completed tasks and their associated task_runs in order to start analyzing them:

In [7]:
e.get_all()

The previous command downloads all the tasks and task_runs, and creates four attributes on e:

  • e.tasks: a list of the application tasks
  • e.task_runs: a dictionary that maps each completed task.id to the list of its task_runs
  • e.tasks_df: a Pandas data frame for analyzing them
  • e.task_runs_df: a dictionary of data frames for analyzing the task_runs per task

Let's have a look at e.tasks:

In [8]:
e.tasks
Out[8]:
[pybossa.Task(256071),
 pybossa.Task(256072),
 pybossa.Task(256073),
 pybossa.Task(256074),
 pybossa.Task(256075)]

And now to e.task_runs:

In [9]:
e.task_runs
Out[9]:
{256071: [pybossa.TaskRun(81768),
  pybossa.TaskRun(82458),
  pybossa.TaskRun(88027),
  pybossa.TaskRun(94595),
  pybossa.TaskRun(96020),
  pybossa.TaskRun(109055),
  pybossa.TaskRun(122053),
  pybossa.TaskRun(134030),
  pybossa.TaskRun(145662),
  pybossa.TaskRun(147257),
  pybossa.TaskRun(154250),
  pybossa.TaskRun(163983),
  pybossa.TaskRun(163987),
  pybossa.TaskRun(165169),
  pybossa.TaskRun(175363),
  pybossa.TaskRun(179023),
  pybossa.TaskRun(189571),
  pybossa.TaskRun(196359),
  pybossa.TaskRun(196736),
  pybossa.TaskRun(209642),
  pybossa.TaskRun(210179),
  pybossa.TaskRun(210652),
  pybossa.TaskRun(211688),
  pybossa.TaskRun(214622),
  pybossa.TaskRun(220670),
  pybossa.TaskRun(226763),
  pybossa.TaskRun(229376),
  pybossa.TaskRun(229411),
  pybossa.TaskRun(232332),
  pybossa.TaskRun(235082)],
 256072: [pybossa.TaskRun(81769),
  pybossa.TaskRun(88028),
  pybossa.TaskRun(94596),
  pybossa.TaskRun(122054),
  pybossa.TaskRun(145673),
  pybossa.TaskRun(147291),
  pybossa.TaskRun(154267),
  pybossa.TaskRun(163984),
  pybossa.TaskRun(163988),
  pybossa.TaskRun(165170),
  pybossa.TaskRun(175364),
  pybossa.TaskRun(179024),
  pybossa.TaskRun(189576),
  pybossa.TaskRun(196360),
  pybossa.TaskRun(196737),
  pybossa.TaskRun(209643),
  pybossa.TaskRun(210180),
  pybossa.TaskRun(210653),
  pybossa.TaskRun(211689),
  pybossa.TaskRun(214623),
  pybossa.TaskRun(229377),
  pybossa.TaskRun(229378),
  pybossa.TaskRun(229379),
  pybossa.TaskRun(229412),
  pybossa.TaskRun(232333),
  pybossa.TaskRun(235083),
  pybossa.TaskRun(235126),
  pybossa.TaskRun(236173),
  pybossa.TaskRun(236825),
  pybossa.TaskRun(257037)],
 256073: [pybossa.TaskRun(81770),
  pybossa.TaskRun(94597),
  pybossa.TaskRun(122055),
  pybossa.TaskRun(154274),
  pybossa.TaskRun(163985),
  pybossa.TaskRun(163989),
  pybossa.TaskRun(165171),
  pybossa.TaskRun(175365),
  pybossa.TaskRun(179026),
  pybossa.TaskRun(189587),
  pybossa.TaskRun(196361),
  pybossa.TaskRun(196739),
  pybossa.TaskRun(209644),
  pybossa.TaskRun(210181),
  pybossa.TaskRun(210654),
  pybossa.TaskRun(211690),
  pybossa.TaskRun(214624),
  pybossa.TaskRun(229413),
  pybossa.TaskRun(232334),
  pybossa.TaskRun(235084),
  pybossa.TaskRun(235127),
  pybossa.TaskRun(257038),
  pybossa.TaskRun(257840),
  pybossa.TaskRun(262876),
  pybossa.TaskRun(346996),
  pybossa.TaskRun(358431),
  pybossa.TaskRun(399260),
  pybossa.TaskRun(416175),
  pybossa.TaskRun(435963),
  pybossa.TaskRun(586064)],
 256074: [pybossa.TaskRun(81771),
  pybossa.TaskRun(94598),
  pybossa.TaskRun(122056),
  pybossa.TaskRun(154281),
  pybossa.TaskRun(163986),
  pybossa.TaskRun(163990),
  pybossa.TaskRun(165172),
  pybossa.TaskRun(179027),
  pybossa.TaskRun(189616),
  pybossa.TaskRun(196362),
  pybossa.TaskRun(196740),
  pybossa.TaskRun(209645),
  pybossa.TaskRun(210655),
  pybossa.TaskRun(211691),
  pybossa.TaskRun(214625),
  pybossa.TaskRun(232335),
  pybossa.TaskRun(235128),
  pybossa.TaskRun(257039),
  pybossa.TaskRun(262877),
  pybossa.TaskRun(347003),
  pybossa.TaskRun(358433),
  pybossa.TaskRun(399261),
  pybossa.TaskRun(416176),
  pybossa.TaskRun(435968),
  pybossa.TaskRun(586111),
  pybossa.TaskRun(589078),
  pybossa.TaskRun(592392),
  pybossa.TaskRun(592873),
  pybossa.TaskRun(593926),
  pybossa.TaskRun(596473)],
 256075: [pybossa.TaskRun(82060),
  pybossa.TaskRun(94599),
  pybossa.TaskRun(122057),
  pybossa.TaskRun(154315),
  pybossa.TaskRun(163991),
  pybossa.TaskRun(165173),
  pybossa.TaskRun(189643),
  pybossa.TaskRun(196363),
  pybossa.TaskRun(196741),
  pybossa.TaskRun(209646),
  pybossa.TaskRun(210656),
  pybossa.TaskRun(211692),
  pybossa.TaskRun(214626),
  pybossa.TaskRun(235871),
  pybossa.TaskRun(257040),
  pybossa.TaskRun(262878),
  pybossa.TaskRun(347043),
  pybossa.TaskRun(358436),
  pybossa.TaskRun(416177),
  pybossa.TaskRun(435973),
  pybossa.TaskRun(586160),
  pybossa.TaskRun(589079),
  pybossa.TaskRun(592393),
  pybossa.TaskRun(592874),
  pybossa.TaskRun(592875),
  pybossa.TaskRun(592876),
  pybossa.TaskRun(592877),
  pybossa.TaskRun(592878),
  pybossa.TaskRun(593927),
  pybossa.TaskRun(596550)]}
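
To make the shape of e.task_runs more concrete, here is a minimal sketch that counts the contributions per task. It uses a plain dictionary as a stand-in: the name task_runs_by_task, the helper runs_per_task and the truncated lists of ids are illustrative assumptions, not something Enki provides, so the snippet runs without network access.

```python
# Stand-in for e.task_runs: task id -> list of task run ids.
# Only the first three runs of two tasks from the output above are kept.
task_runs_by_task = {
    256071: [81768, 82458, 88027],
    256072: [81769, 88028, 94596],
}


def runs_per_task(task_runs):
    """Return a dict mapping each task id to its number of task runs."""
    return dict((task_id, len(runs)) for task_id, runs in task_runs.items())


counts = runs_per_task(task_runs_by_task)
for task_id in sorted(counts):
    print("task %s has %s task runs" % (task_id, counts[task_id]))
```

The same helper should work on the real e.task_runs, since len() only needs the lists of pybossa.TaskRun objects; in the output above, each of the five tasks has 30 task runs.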

OK, so the data are ready to be analyzed ;-) As I said before, we have a data frame per task, so it is really easy to analyze the answers contributed by our volunteers.

Enki uses the Pandas package, so it is really easy to statistically analyze the answers. For example, let's get an overview of the answers for the first task of the application from its data frame:

NOTE: Enki explodes the PyBossa task_run.info field if it is a dictionary, that is, it expands each key into its own column. In this case, the info field of each Vimeo task_run holds a dictionary with this structure: task_run.info = {'answer': 'Yes'}, so the following command automatically shows a column named answer.
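
The idea behind this "exploding" step can be sketched in plain Python. This is not Enki's actual implementation, just an illustration under the assumption that each task run is a dictionary with an info field; the function name explode_info is hypothetical.

```python
def explode_info(task_run):
    """Return a copy of task_run where the keys of a dict-valued 'info'
    field are promoted to top-level keys. Other 'info' values are left alone."""
    row = dict(task_run)
    info = row.get('info')
    if isinstance(info, dict):
        for key, value in info.items():
            # Do not overwrite fields that already exist (e.g. 'id')
            row.setdefault(key, value)
    return row


# Two task runs taken from the table below, reduced to a few fields
runs = [
    {'id': 81768, 'task_id': 256071, 'info': {'answer': 'No'}},
    {'id': 88027, 'task_id': 256071, 'info': {}},
]
rows = [explode_info(r) for r in runs]
print(rows[0].get('answer'))  # the answer promoted from info
print(rows[1].get('answer'))  # None: an empty info yields no answer
```

Task runs with an empty info dictionary get no answer key, which matches the rows that show NaN in the answer column of the data frame.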

In [10]:
e.task_runs_df[e.tasks[0].id]
Out[10]:
answer app_id calibration created finish_time id info link links task_id timeout user_id user_ip
81768 No 598 None 2013-06-18T09:08:19.075539 2013-06-18T09:08:19.075560 81768 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 3 None
82458 Yes 598 None 2013-06-24T12:52:21.573111 2013-06-24T12:52:21.573128 82458 {u'answer': u'Yes'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 125.22.43.12
88027 NaN 598 None 2013-06-28T15:00:07.695365 2013-06-28T15:00:07.695384 88027 {} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 1138 None
94595 No 598 None 2013-07-16T10:54:44.987977 2013-07-16T10:54:44.987993 94595 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 46.162.161.63
96020 Yes 598 None 2013-07-21T01:28:44.488405 2013-07-21T01:28:44.488419 96020 {u'answer': u'Yes'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 201.250.176.37
109055 No 598 None 2013-08-03T11:01:53.143951 2013-08-03T11:01:53.143968 109055 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 1258 None
122053 Yes 598 None 2013-08-05T11:31:44.318728 2013-08-05T11:31:44.318744 122053 {u'answer': u'Yes'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 194.81.199.55
134030 NotKnown 598 None 2013-08-07T09:28:09.741601 2013-08-07T09:28:09.741625 134030 {u'answer': u'NotKnown'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 1323 None
145662 NaN 598 None 2013-08-08T16:15:34.834110 2013-08-08T16:15:34.834130 145662 {} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 79.158.104.138
147257 No 598 None 2013-08-08T17:46:54.688632 2013-08-08T17:46:54.688650 147257 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 1484 None
154250 No 598 None 2013-08-09T09:13:07.940880 2013-08-09T09:13:07.940898 154250 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 1590 None
163983 NotKnown 598 None 2013-08-11T01:47:58.604305 2013-08-11T01:47:58.604327 163983 {u'answer': u'NotKnown'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 189.71.6.185
163987 Yes 598 None 2013-08-11T02:01:00.579848 2013-08-11T02:01:00.579867 163987 {u'answer': u'Yes'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 1659 None
165169 No 598 None 2013-08-11T15:23:08.467515 2013-08-11T15:23:08.467533 165169 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 1379 None
175363 NaN 598 None 2013-08-13T20:32:30.328999 2013-08-13T20:32:30.329018 175363 {} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 99.243.63.115
179023 NotKnown 598 None 2013-08-14T11:33:29.942327 2013-08-14T11:33:29.942344 179023 {u'answer': u'NotKnown'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 1824 None
189571 No 598 None 2013-08-15T17:51:32.663308 2013-08-15T17:51:32.663325 189571 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 82.234.143.184
196359 NotKnown 598 None 2013-08-16T20:08:36.783964 2013-08-16T20:08:36.783985 196359 {u'answer': u'NotKnown'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 82.228.190.168
196736 No 598 None 2013-08-16T23:41:46.782379 2013-08-16T23:41:46.782396 196736 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 2024 None
209642 No 598 None 2013-08-20T07:48:59.958100 2013-08-20T07:48:59.958119 209642 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 88.186.11.35
210179 No 598 None 2013-08-20T15:50:45.972478 2013-08-20T15:50:45.972499 210179 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 129.11.77.198
210652 Yes 598 None 2013-08-20T22:23:01.771939 2013-08-20T22:23:01.771963 210652 {u'answer': u'Yes'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 2134 None
211688 Yes 598 None 2013-08-21T15:53:44.000784 2013-08-21T15:53:44.000801 211688 {u'answer': u'Yes'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 2147 None
214622 NaN 598 None 2013-08-22T12:09:41.845525 2013-08-22T12:09:41.845541 214622 {} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 1869 None
220670 Yes 598 None 2013-08-23T10:12:17.192103 2013-08-23T10:12:17.192124 220670 {u'answer': u'Yes'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 61.115.201.241
226763 No 598 None 2013-08-27T11:39:25.651255 2013-08-27T11:39:25.651273 226763 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 129.31.25.80
229376 No 598 None 2013-08-29T16:03:31.663031 2013-08-29T16:03:31.663049 229376 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 2199 None
229411 No 598 None 2013-08-29T23:08:07.813845 2013-08-29T23:08:07.813865 229411 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 177.158.76.105
232332 No 598 None 2013-09-06T08:58:02.237716 2013-09-06T08:58:02.237731 232332 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None 2237 None
235082 No 598 None 2013-09-13T08:25:04.359900 2013-09-13T08:25:04.359918 235082 {u'answer': u'No'} <link rel='self' title='taskrun' href='http://... [<link rel='parent' title='app' href='http://c... 256071 None NaN 192.128.3.241

As we have a column with all the answers, called answer, let's analyze it for our current application, Vimeo.

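NOTE: The possible answers that the volunteers can provide are: Yes, No, or I don't know (stored as NotKnown, as the table above shows). The describe() method of the answer column detects them and reports the most voted answer, together with the number of answers and of distinct values.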

In [11]:
e.task_runs_df[e.tasks[0].id]['answer'].describe()
Out[11]:
count     26
unique     3
top       No
freq      15
dtype: object
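
To see where these four numbers come from, the following sketch recomputes them with the standard library only, using the 30 values of the answer column shown in the table above (None stands in for NaN). It mirrors what the summary reports, not how Pandas computes it internally.

```python
from collections import Counter

# The answer column of task 256071, in table order; None marks a missing answer
answers = [
    'No', 'Yes', None, 'No', 'Yes', 'No', 'Yes', 'NotKnown', None, 'No',
    'No', 'NotKnown', 'Yes', 'No', None, 'NotKnown', 'No', 'NotKnown', 'No', 'No',
    'No', 'Yes', 'Yes', None, 'Yes', 'No', 'No', 'No', 'No', 'No',
]

valid = [a for a in answers if a is not None]  # missing answers are ignored
votes = Counter(valid)
top, freq = votes.most_common(1)[0]  # most voted answer and its number of votes

summary = {'count': len(valid), 'unique': len(votes), 'top': top, 'freq': freq}
print(summary)
```

The count is 26 rather than 30 because the four task runs with an empty info field have no answer.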

And now you can iterate over each task and get the most voted answer from the users with the following snippet of code:

In [12]:
for t in e.tasks:
    desc = e.task_runs_df[t.id]['answer'].describe()
    print("The top answer for task.id %s is %s" % (t.id, desc['top']))
The top answer for task.id 256071 is No
The top answer for task.id 256072 is Yes
The top answer for task.id 256073 is Yes
The top answer for task.id 256074 is Yes
The top answer for task.id 256075 is Yes

If you want to create some charts and graphics, you will need to install matplotlib:

In [13]:
!pip install matplotlib
Requirement already satisfied (use --upgrade to upgrade): matplotlib in ./env/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.5 in ./env/lib/python2.7/site-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in ./env/lib/python2.7/site-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): tornado in ./env/lib/python2.7/site-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): pyparsing>=1.5.6,!=2.0.0 in ./env/lib/python2.7/site-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): nose in ./env/lib/python2.7/site-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): six in ./env/lib/python2.7/site-packages (from python-dateutil->matplotlib)
Cleaning up...

Now we count the values, which gives us a Pandas Series, and plot it to see the distribution of answers for the first task:

In [15]:
s = e.task_runs_df[e.tasks[0].id]['answer'].value_counts()
s.plot(kind='bar', rot=0)
Out[15]:
<matplotlib.axes.AxesSubplot at 0x3ee4c50>