Analyzing a simple Crowdcrafting application with Enki

PyBossa provides a very simple and interesting package for analyzing any PyBossa application statistically: Enki.

The package allows you to statistically analyze all the task runs contributed to your application, for example to the Video Pattern Recognition application in Crowdcrafting.

To use it, first install the package.

NOTE: compiling all the required libraries takes a few minutes, so be patient.

In [4]:

!pip install enki

Requirement already satisfied (use --upgrade to upgrade): enki in ./env/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): pybossa-client in ./env/lib/python2.7/site-packages (from enki)
Requirement already satisfied (use --upgrade to upgrade): pandas in ./env/lib/python2.7/site-packages (from enki)
Requirement already satisfied (use --upgrade to upgrade): requests>=0.13.0 in ./env/lib/python2.7/site-packages (from pybossa-client->enki)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in ./env/lib/python2.7/site-packages (from pandas->enki)
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in ./env/lib/python2.7/site-packages (from pandas->enki)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.6.1 in ./env/lib/python2.7/site-packages (from pandas->enki)
Requirement already satisfied (use --upgrade to upgrade): six in ./env/lib/python2.7/site-packages (from python-dateutil->pandas->enki)
Cleaning up...

After installing the package you can import it:

In [5]:

import enki

And then, you can start analyzing the application with the following command:

In [6]:

e = enki.Enki(api_key='private', endpoint='http://crowdcrafting.org', app_short_name='vimeo')

As you can see, api_key is your private key, but reading from the API does not require a valid one.

Then you can get all the completed tasks and their associated task_runs in order to start analyzing them:

In [7]:

e.get_all()

The previous command downloads all the tasks and task_runs, and creates four attributes on e:

e.tasks: a list of the application tasks
e.task_runs: a dictionary where each key is a completed task.id and each value is the list of its associated task_runs
e.tasks_df: a Pandas data frame for analyzing them
e.task_runs_df: a dictionary of data frames for analyzing the task_runs per task
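
How the first two attributes relate can be sketched with plain Python stand-ins. This is only an illustration, not Enki code: the integers below replace the pybossa.Task and pybossa.TaskRun objects, the IDs are copied from the outputs shown in this notebook, and the names tasks, task_runs and runs_per_task are local to the sketch.

```python
# Stand-in data: integers replace the pybossa.Task / pybossa.TaskRun objects.
tasks = [256071, 256072]
task_runs = {
    256071: [81768, 82458, 88027],
    256072: [81769, 88028],
}

# Only completed tasks are keys of task_runs, so .get() with an empty
# default keeps the count safe for a task that has no entry yet.
runs_per_task = {task_id: len(task_runs.get(task_id, [])) for task_id in tasks}

for task_id in tasks:
    print("task %s has %s task_runs" % (task_id, runs_per_task[task_id]))
```

With a real Enki object the same loop would presumably read from e.tasks and e.task_runs, using t.id as the key.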

Let's have a look at e.tasks:

In [8]:

e.tasks

Out[8]:

[pybossa.Task(256071),
 pybossa.Task(256072),
 pybossa.Task(256073),
 pybossa.Task(256074),
 pybossa.Task(256075)]

And now to e.task_runs:

In [9]:

e.task_runs

Out[9]:

{256071: [pybossa.TaskRun(81768),
  pybossa.TaskRun(82458),
  pybossa.TaskRun(88027),
  pybossa.TaskRun(94595),
  pybossa.TaskRun(96020),
  pybossa.TaskRun(109055),
  pybossa.TaskRun(122053),
  pybossa.TaskRun(134030),
  pybossa.TaskRun(145662),
  pybossa.TaskRun(147257),
  pybossa.TaskRun(154250),
  pybossa.TaskRun(163983),
  pybossa.TaskRun(163987),
  pybossa.TaskRun(165169),
  pybossa.TaskRun(175363),
  pybossa.TaskRun(179023),
  pybossa.TaskRun(189571),
  pybossa.TaskRun(196359),
  pybossa.TaskRun(196736),
  pybossa.TaskRun(209642),
  pybossa.TaskRun(210179),
  pybossa.TaskRun(210652),
  pybossa.TaskRun(211688),
  pybossa.TaskRun(214622),
  pybossa.TaskRun(220670),
  pybossa.TaskRun(226763),
  pybossa.TaskRun(229376),
  pybossa.TaskRun(229411),
  pybossa.TaskRun(232332),
  pybossa.TaskRun(235082)],
 256072: [pybossa.TaskRun(81769),
  pybossa.TaskRun(88028),
  pybossa.TaskRun(94596),
  pybossa.TaskRun(122054),
  pybossa.TaskRun(145673),
  pybossa.TaskRun(147291),
  pybossa.TaskRun(154267),
  pybossa.TaskRun(163984),
  pybossa.TaskRun(163988),
  pybossa.TaskRun(165170),
  pybossa.TaskRun(175364),
  pybossa.TaskRun(179024),
  pybossa.TaskRun(189576),
  pybossa.TaskRun(196360),
  pybossa.TaskRun(196737),
  pybossa.TaskRun(209643),
  pybossa.TaskRun(210180),
  pybossa.TaskRun(210653),
  pybossa.TaskRun(211689),
  pybossa.TaskRun(214623),
  pybossa.TaskRun(229377),
  pybossa.TaskRun(229378),
  pybossa.TaskRun(229379),
  pybossa.TaskRun(229412),
  pybossa.TaskRun(232333),
  pybossa.TaskRun(235083),
  pybossa.TaskRun(235126),
  pybossa.TaskRun(236173),
  pybossa.TaskRun(236825),
  pybossa.TaskRun(257037)],
 256073: [pybossa.TaskRun(81770),
  pybossa.TaskRun(94597),
  pybossa.TaskRun(122055),
  pybossa.TaskRun(154274),
  pybossa.TaskRun(163985),
  pybossa.TaskRun(163989),
  pybossa.TaskRun(165171),
  pybossa.TaskRun(175365),
  pybossa.TaskRun(179026),
  pybossa.TaskRun(189587),
  pybossa.TaskRun(196361),
  pybossa.TaskRun(196739),
  pybossa.TaskRun(209644),
  pybossa.TaskRun(210181),
  pybossa.TaskRun(210654),
  pybossa.TaskRun(211690),
  pybossa.TaskRun(214624),
  pybossa.TaskRun(229413),
  pybossa.TaskRun(232334),
  pybossa.TaskRun(235084),
  pybossa.TaskRun(235127),
  pybossa.TaskRun(257038),
  pybossa.TaskRun(257840),
  pybossa.TaskRun(262876),
  pybossa.TaskRun(346996),
  pybossa.TaskRun(358431),
  pybossa.TaskRun(399260),
  pybossa.TaskRun(416175),
  pybossa.TaskRun(435963),
  pybossa.TaskRun(586064)],
 256074: [pybossa.TaskRun(81771),
  pybossa.TaskRun(94598),
  pybossa.TaskRun(122056),
  pybossa.TaskRun(154281),
  pybossa.TaskRun(163986),
  pybossa.TaskRun(163990),
  pybossa.TaskRun(165172),
  pybossa.TaskRun(179027),
  pybossa.TaskRun(189616),
  pybossa.TaskRun(196362),
  pybossa.TaskRun(196740),
  pybossa.TaskRun(209645),
  pybossa.TaskRun(210655),
  pybossa.TaskRun(211691),
  pybossa.TaskRun(214625),
  pybossa.TaskRun(232335),
  pybossa.TaskRun(235128),
  pybossa.TaskRun(257039),
  pybossa.TaskRun(262877),
  pybossa.TaskRun(347003),
  pybossa.TaskRun(358433),
  pybossa.TaskRun(399261),
  pybossa.TaskRun(416176),
  pybossa.TaskRun(435968),
  pybossa.TaskRun(586111),
  pybossa.TaskRun(589078),
  pybossa.TaskRun(592392),
  pybossa.TaskRun(592873),
  pybossa.TaskRun(593926),
  pybossa.TaskRun(596473)],
 256075: [pybossa.TaskRun(82060),
  pybossa.TaskRun(94599),
  pybossa.TaskRun(122057),
  pybossa.TaskRun(154315),
  pybossa.TaskRun(163991),
  pybossa.TaskRun(165173),
  pybossa.TaskRun(189643),
  pybossa.TaskRun(196363),
  pybossa.TaskRun(196741),
  pybossa.TaskRun(209646),
  pybossa.TaskRun(210656),
  pybossa.TaskRun(211692),
  pybossa.TaskRun(214626),
  pybossa.TaskRun(235871),
  pybossa.TaskRun(257040),
  pybossa.TaskRun(262878),
  pybossa.TaskRun(347043),
  pybossa.TaskRun(358436),
  pybossa.TaskRun(416177),
  pybossa.TaskRun(435973),
  pybossa.TaskRun(586160),
  pybossa.TaskRun(589079),
  pybossa.TaskRun(592393),
  pybossa.TaskRun(592874),
  pybossa.TaskRun(592875),
  pybossa.TaskRun(592876),
  pybossa.TaskRun(592877),
  pybossa.TaskRun(592878),
  pybossa.TaskRun(593927),
  pybossa.TaskRun(596550)]}

OK, so the data are ready to be analyzed ;-) As I said before, we have a data frame per task, so it is really easy to analyze the results of the contributed answers by our volunteers.

Enki uses the Pandas package, so it is really easy to statistically analyze the answers. For example, let's get an overview of the answers for the first task of the application within the data frame:

NOTE: Enki explodes the PyBossa task_run.info field if it is a dictionary. In this case, the info field of each Vimeo task_run holds a dictionary with this structure: task_run.info = {'answer': 'Yes'}, so the following command will automatically show a column named answer.
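
What "exploding" means here can be sketched in plain Python. This is not Enki's actual implementation, only the idea: the helper explode_info is hypothetical, the two records are trimmed copies of rows shown in this notebook, and the choice to never overwrite an existing field is an assumption of the sketch.

```python
def explode_info(task_run):
    """Copy a task_run record and promote the keys of a dict-valued
    'info' field to top-level columns (hypothetical helper)."""
    row = dict(task_run)
    info = row.get("info")
    if isinstance(info, dict):
        for key, value in info.items():
            # Assumption: existing fields such as 'id' are never overwritten.
            row.setdefault(key, value)
    return row

records = [
    {"id": 81768, "task_id": 256071, "info": {"answer": "No"}},
    {"id": 88027, "task_id": 256071, "info": {}},
]
rows = [explode_info(r) for r in records]

print(rows[0]["answer"])
print("answer" in rows[1])
```

The second record has an empty info dictionary, so it gains no answer key; in a data frame that gap shows up as NaN.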

In [10]:

e.task_runs_df[e.tasks[0].id]

Out[10]:

	answer	app_id	calibration	created	finish_time	id	info	link	links	task_id	timeout	user_id	user_ip
81768	No	598	None	2013-06-18T09:08:19.075539	2013-06-18T09:08:19.075560	81768	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	3	None
82458	Yes	598	None	2013-06-24T12:52:21.573111	2013-06-24T12:52:21.573128	82458	{u'answer': u'Yes'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	125.22.43.12
88027	NaN	598	None	2013-06-28T15:00:07.695365	2013-06-28T15:00:07.695384	88027	{}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	1138	None
94595	No	598	None	2013-07-16T10:54:44.987977	2013-07-16T10:54:44.987993	94595	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	46.162.161.63
96020	Yes	598	None	2013-07-21T01:28:44.488405	2013-07-21T01:28:44.488419	96020	{u'answer': u'Yes'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	201.250.176.37
109055	No	598	None	2013-08-03T11:01:53.143951	2013-08-03T11:01:53.143968	109055	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	1258	None
122053	Yes	598	None	2013-08-05T11:31:44.318728	2013-08-05T11:31:44.318744	122053	{u'answer': u'Yes'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	194.81.199.55
134030	NotKnown	598	None	2013-08-07T09:28:09.741601	2013-08-07T09:28:09.741625	134030	{u'answer': u'NotKnown'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	1323	None
145662	NaN	598	None	2013-08-08T16:15:34.834110	2013-08-08T16:15:34.834130	145662	{}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	79.158.104.138
147257	No	598	None	2013-08-08T17:46:54.688632	2013-08-08T17:46:54.688650	147257	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	1484	None
154250	No	598	None	2013-08-09T09:13:07.940880	2013-08-09T09:13:07.940898	154250	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	1590	None
163983	NotKnown	598	None	2013-08-11T01:47:58.604305	2013-08-11T01:47:58.604327	163983	{u'answer': u'NotKnown'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	189.71.6.185
163987	Yes	598	None	2013-08-11T02:01:00.579848	2013-08-11T02:01:00.579867	163987	{u'answer': u'Yes'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	1659	None
165169	No	598	None	2013-08-11T15:23:08.467515	2013-08-11T15:23:08.467533	165169	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	1379	None
175363	NaN	598	None	2013-08-13T20:32:30.328999	2013-08-13T20:32:30.329018	175363	{}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	99.243.63.115
179023	NotKnown	598	None	2013-08-14T11:33:29.942327	2013-08-14T11:33:29.942344	179023	{u'answer': u'NotKnown'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	1824	None
189571	No	598	None	2013-08-15T17:51:32.663308	2013-08-15T17:51:32.663325	189571	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	82.234.143.184
196359	NotKnown	598	None	2013-08-16T20:08:36.783964	2013-08-16T20:08:36.783985	196359	{u'answer': u'NotKnown'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	82.228.190.168
196736	No	598	None	2013-08-16T23:41:46.782379	2013-08-16T23:41:46.782396	196736	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	2024	None
209642	No	598	None	2013-08-20T07:48:59.958100	2013-08-20T07:48:59.958119	209642	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	88.186.11.35
210179	No	598	None	2013-08-20T15:50:45.972478	2013-08-20T15:50:45.972499	210179	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	129.11.77.198
210652	Yes	598	None	2013-08-20T22:23:01.771939	2013-08-20T22:23:01.771963	210652	{u'answer': u'Yes'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	2134	None
211688	Yes	598	None	2013-08-21T15:53:44.000784	2013-08-21T15:53:44.000801	211688	{u'answer': u'Yes'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	2147	None
214622	NaN	598	None	2013-08-22T12:09:41.845525	2013-08-22T12:09:41.845541	214622	{}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	1869	None
220670	Yes	598	None	2013-08-23T10:12:17.192103	2013-08-23T10:12:17.192124	220670	{u'answer': u'Yes'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	61.115.201.241
226763	No	598	None	2013-08-27T11:39:25.651255	2013-08-27T11:39:25.651273	226763	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	129.31.25.80
229376	No	598	None	2013-08-29T16:03:31.663031	2013-08-29T16:03:31.663049	229376	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	2199	None
229411	No	598	None	2013-08-29T23:08:07.813845	2013-08-29T23:08:07.813865	229411	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	177.158.76.105
232332	No	598	None	2013-09-06T08:58:02.237716	2013-09-06T08:58:02.237731	232332	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	2237	None
235082	No	598	None	2013-09-13T08:25:04.359900	2013-09-13T08:25:04.359918	235082	{u'answer': u'No'}	<link rel='self' title='taskrun' href='http://...	[<link rel='parent' title='app' href='http://c...	256071	None	NaN	192.128.3.241

As we have a column with all the answers, called answer, let's analyze it for our current application, Vimeo.

NOTE: The possible answers that the volunteers can provide are: Yes, No, or I don't know (stored as NotKnown in the data). Calling describe() on the column finds the most voted answer and shows the relevant information.

In [11]:

e.task_runs_df[e.tasks[0].id]['answer'].describe()

Out[11]:

count     26
unique     3
top       No
freq      15
dtype: object
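
The four fields of this summary can be reproduced by hand, which makes their meaning explicit. The sketch below uses only the standard library and the answer tallies read off Out[10] (15 No, 7 Yes, 4 NotKnown, 4 empty); it mirrors what describe() reports for this column rather than how Pandas computes it.

```python
from collections import Counter

# Answers for task 256071 as tallied from Out[10];
# None marks the task_runs whose info field was empty.
answers = ["No"] * 15 + ["Yes"] * 7 + ["NotKnown"] * 4 + [None] * 4

# Missing answers are left out of every statistic.
valid = [a for a in answers if a is not None]
counts = Counter(valid)
top, freq = counts.most_common(1)[0]

summary = {
    "count": len(valid),    # non-missing answers
    "unique": len(counts),  # distinct answer values
    "top": top,             # most voted answer
    "freq": freq,           # votes received by the top answer
}
print(summary)
```

Note that count is 26, not 30: the four task_runs without an answer are excluded.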

And now you can iterate over each task and get the most voted answer from the users with the following snippet of code:

In [12]:

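for t in e.tasks:
    # describe() on the answer column reports count, unique, top and freq
    desc = e.task_runs_df[t.id]['answer'].describe()
    # The parenthesized form runs on both Python 2 and Python 3
    print("The top answer for task.id %s is %s" % (t.id, desc['top']))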

The top answer for task.id 256071 is No
The top answer for task.id 256072 is Yes
The top answer for task.id 256073 is Yes
The top answer for task.id 256074 is Yes
The top answer for task.id 256075 is Yes
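
The top answer alone does not say how strong the agreement was, or whether two answers tied. A minimal sketch of a stricter check follows, in plain Python; vote_summary is a hypothetical helper, not part of Enki or Pandas, and the sample data are the tallies for task 256071 taken from Out[10].

```python
from collections import Counter

def vote_summary(answers):
    """Return (winning answers, share of valid votes).
    More than one winner means a tie (hypothetical helper)."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return [], 0.0
    counts = Counter(valid)
    best = max(counts.values())
    # Sorted so that tied winners come back in a stable order.
    winners = sorted(a for a, n in counts.items() if n == best)
    return winners, best / float(len(valid))

answers = ["No"] * 15 + ["Yes"] * 7 + ["NotKnown"] * 4 + [None] * 4
winners, share = vote_summary(answers)
print("winners: %s, share of valid votes: %.2f" % (winners, share))
```

Here No wins with 15 of 26 valid votes. A task whose share is close to one half, or that returns several winners, may deserve more task_runs before its result is trusted.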

If you want to create some charts and graphics, you will need to install matplotlib:

In [13]:

!pip install matplotlib

Requirement already satisfied (use --upgrade to upgrade): matplotlib in ./env/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.5 in ./env/lib/python2.7/site-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in ./env/lib/python2.7/site-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): tornado in ./env/lib/python2.7/site-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): pyparsing>=1.5.6,!=2.0.0 in ./env/lib/python2.7/site-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): nose in ./env/lib/python2.7/site-packages (from matplotlib)
Requirement already satisfied (use --upgrade to upgrade): six in ./env/lib/python2.7/site-packages (from python-dateutil->matplotlib)
Cleaning up...

Now we count the values, create a Pandas Series and plot it to see the distribution of answers for the first task:

In [15]:

s = e.task_runs_df[e.tasks[0].id]['answer'].value_counts()
s.plot(kind='bar', rot=0)

Out[15]:

<matplotlib.axes.AxesSubplot at 0x3ee4c50>