PyBossa provides a very simple and interesting package for analyzing any PyBossa application statistically: Enki.
The package lets you statistically analyze all the task runs contributed to your application, for example the Video Pattern Recognition application in Crowdcrafting.
To use it, first install it.
NOTE: it takes some minutes to compile all the required libraries, so be patient.
!pip install enki
After installing the package you can import it:
import enki
And then, you can start analyzing the application with the following command:
e = enki.Enki(api_key='private', endpoint='http://crowdcrafting.org', app_short_name='vimeo')
As you can see, api_key is your private API key, but you don't need a valid one just to read from the API.
Then, you can get all the completed tasks and their associated task_runs in order to start analyzing them:
e.get_all()
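The previous command downloads all the tasks and task_runs, and creates 4 variables on the e object. Three of them are used in the rest of this notebook: e.tasks, e.task_runs and e.task_runs_df.
Let's have a look at e.tasks: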
e.tasks
And now at e.task_runs:
e.task_runs
OK, so the data are ready to be analyzed ;-) We have one data frame per task, so it is really easy to analyze the answers contributed by our volunteers.
Enki uses the Pandas package, so all the usual data frame methods are available. For example, let's get an overview of the answers for the first task of the application within the data frame:
NOTE: Enki explodes the PyBossa task_run.info field into separate columns if it is a dictionary. In this case, the info field of the Vimeo task_runs holds a dictionary with this structure: task_run.info = {'answer': 'Yes'}, so the following command will automatically show a column named answer.
e.task_runs_df[e.tasks[0].id]
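To make the idea of "exploding" the info field more concrete, here is a minimal sketch using plain Pandas. It is not Enki's actual implementation, and the task_runs records below are made-up data that only imitate the shape of PyBossa task runs.

```python
import pandas as pd

# Hypothetical task runs, shaped like PyBossa records with a dict in "info".
task_runs = [
    {"id": 1, "task_id": 10, "info": {"answer": "Yes"}},
    {"id": 2, "task_id": 10, "info": {"answer": "No"}},
    {"id": 3, "task_id": 10, "info": {"answer": "Yes"}},
]

df = pd.DataFrame(task_runs)

# Turn each key of the info dictionary into its own column,
# keeping the same row index so the join lines up.
info_columns = pd.DataFrame(df["info"].tolist(), index=df.index)
exploded = df.drop(columns="info").join(info_columns)

print(exploded)
```

After this step the answer is available as a regular column, which is what makes the column-based analysis in the following cells possible.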
As we have a column with all the answers, called answer, let's analyze it for our current application, Vimeo.
NOTE: The possible answers that the volunteers can provide are: Yes, No, or I don't know. Because the column holds text values, describe() reports the number of answers (count), the number of distinct answers (unique), the most voted answer (top) and how many votes it received (freq).
e.task_runs_df[e.tasks[0].id]['answer'].describe()
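If you want to see what describe() returns for a text column without downloading anything, the following small sketch runs it on a hand-made Series. The answers are invented for illustration; only the Pandas calls are real.

```python
import pandas as pd

# Made-up answers for a single task (not real Crowdcrafting data).
answers = pd.Series(["Yes", "Yes", "No", "I don't know", "Yes"], name="answer")

desc = answers.describe()
print(desc)

# The individual fields can be read by label.
print("most voted answer:", desc["top"])
print("votes for it:", desc["freq"])
```

Note that when two answers are tied for first place, top is just one of them, so it is worth checking freq against count before trusting the result.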
And now you can iterate over each task and get the most voted answer from the users with the following snippet of code:
for t in e.tasks:
    desc = e.task_runs_df[t.id]['answer'].describe()
    print("The top answer for task.id %s is %s" % (t.id, desc['top']))
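Printing is fine for a quick look, but you may prefer to collect the results in a table. The sketch below does that with a hypothetical stand-in for e.task_runs_df (a dictionary mapping task ids to data frames, which is an assumption based on how it is indexed above) and also computes the share of volunteers who agreed with the top answer. The agreement column is my own addition, not something Enki provides.

```python
import pandas as pd

# Hypothetical stand-in for e.task_runs_df: task id -> data frame of answers.
task_runs_df = {
    1: pd.DataFrame({"answer": ["Yes", "Yes", "No"]}),
    2: pd.DataFrame({"answer": ["No", "No", "No", "Yes"]}),
}

rows = []
for task_id, frame in task_runs_df.items():
    desc = frame["answer"].describe()
    rows.append({
        "task_id": task_id,
        "top": desc["top"],
        "votes": int(desc["freq"]),
        # Fraction of all answers for this task that match the top answer.
        "agreement": float(desc["freq"]) / float(desc["count"]),
    })

summary = pd.DataFrame(rows).set_index("task_id")
print(summary)
```

With real data you would loop over e.tasks and use t.id as the key instead of the made-up dictionary.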
If you want to create some charts and graphics, you will need to install matplotlib:
!pip install matplotlib
Now we count the values, which gives us a Pandas Series, and plot it to see the distribution of answers for the first task:
s = e.task_runs_df[e.tasks[0].id]['answer'].value_counts()
s.plot(kind='bar', rot=0)
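To check the numbers behind the chart, you can look at the Series that value_counts() returns. This is a small sketch on invented answers; passing normalize=True is an optional extra that gives proportions instead of raw counts.

```python
import pandas as pd

# Made-up answers for a single task (not real Crowdcrafting data).
answers = pd.Series(["Yes", "Yes", "No", "I don't know", "Yes", "No"])

# Raw counts, sorted from the most to the least frequent answer.
counts = answers.value_counts()
print(counts)

# The same distribution expressed as a fraction of all answers.
shares = answers.value_counts(normalize=True)
print(shares)
```

Either Series can be passed to plot(kind='bar') in the same way as s above.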