Notebook

Structural Analysis and Visualization of Networks¶

Course Project #2¶

Student: {Your Name}¶

General Information¶

Hard Deadline: 21.06.2015 23:59

Please send your reports to mailto:leonid.e.zhukov@gmail.com and mailto:shestakoffandrey@gmail.com with message subject of the following structure: [HSE Networks 2015] {LastName} {First Name} Project*{Number}*

Support your computations with figures and comments. If you are using IPython Notebook you may use this file as a starting point of your report.

Description¶

Task 1¶

You are provided with the DBLP dataset (warning, raw data!). It contains coauthorships that were revealed during $2000$-$2014$. Particularly, the file contains $3$ colomns: first two for authors' names and the third for the year of publication. This data can be naturally mapped to undirected graph structure.

Your task is construct supervised link prediction scheme.

Guidelines:¶

Use pandas module to load and manipulate the dataset in Python
Initiallize your classification set as follows:
- Determine training and testing intervals on your time domain (for instance, in DBLP dataset take a period $2000$-$2010$ as training period and $2011$-$2014$ as testing period)
- Pick pairs of authors that have appeared during training interval but have not published together during it
- These pairs form positive or negative examples depending on whether they have formed coauthorships during the testing interval
- You have arrived to binary classification problem. PROFIT!
Construct feature space:
- Most of our features tend to be topological. Examples of the features can be: (weighted) sum of neigbours, shortest distance, etc
Choose at least $4$ classification algorithms from scikit module (goes with Anaconda) and compare them in terms of Accuracy, Precision, Recall, F-Score (for positive class) and Mean Squared Error. Use k-fold cross-validation and average your results

Task 2¶

Consider the flickr dataset (warning, raw data!).
File ''users.txt'' provides a table of form userID, enterTimeStamp, additionalInfo...
File "contacts.txt" consists of pairs of userID's and link establishment timestamp

Recall scoring functions for link prediction. Your task is to compare the performance of each scoring function as follows:

TOP-$n$ accuracy
- Denote the number of links $E_\text{new}$ appeared during testing period as $n$
- Denote the ranked list of node pairs provided by score $s$ as $\hat{E}_s$
- Take top-$n$ pairs from $\hat{E}_s$ and intersect it with $E_\text{new}$. Performance is measured as the size of resulted set
ROC and AUC ('star' subtask)

Essentially, for this task you also have to follow the guideline points $1$ and $2$ above. The only thing you have to keep in mind is that flickr dataset is growing dataset. Since then, consider nodes that are significantly represented both in training and testing intervals (for instance, have at least $5$ adjacent edges in training and testing intervals)