[Paris-Saclay Center for Data Science](http://www.datascience-paris-saclay.fr) [Institut Pasteur, Roberto Toro's group](https://research.pasteur.fr/en/team/group-roberto-toro/)

Imaging-psychiatry challenge: predicting autism

A data challenge on Autism Spectrum Disorder detection


_Roberto Toro (Institut Pasteur), Nicolas Traut (Institut Pasteur), Anita Beggiato (Institut Pasteur), Katja Heuer (Institut Pasteur),
Gael Varoquaux (Inria, Parietal), Alex Gramfort (Inria, Parietal), Balazs Kegl (LAL),
Guillaume Lemaitre (CDS), Alexandre Boucaud (CDS), and Joris van den Bossche (CDS)_

To download and run this notebook: download the full starting kit, with all the necessary files.

Software prerequisites

This starting kit requires the following dependencies:

  • numpy
  • scipy
  • pandas
  • scikit-learn
  • matplotlib
  • seaborn
  • nilearn
  • jupyter
  • ramp-workflow

The following two cells will install the missing dependencies if necessary.

In [1]:
import sys
!{sys.executable} -m pip install scikit-learn seaborn nilearn
Requirement already satisfied: scikit-learn in /home/lemaitre/miniconda3/lib/python3.6/site-packages
Requirement already satisfied: seaborn in /home/lemaitre/miniconda3/lib/python3.6/site-packages
Requirement already satisfied: nilearn in /home/lemaitre/miniconda3/lib/python3.6/site-packages
Requirement already satisfied: numpy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: scipy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: matplotlib in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: pandas in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: nibabel>=2.0.2 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from nilearn)
Requirement already satisfied: cycler>=0.10 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)
Requirement already satisfied: python-dateutil>=2.1 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)
Requirement already satisfied: pytz in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)
Requirement already satisfied: six>=1.10 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from matplotlib->seaborn)
Requirement already satisfied: setuptools in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib->seaborn)
You are using pip version 9.0.3, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Install ramp-workflow from the master branch on GitHub.

In [2]:
!{sys.executable} -m pip install https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master
Collecting https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master
  Downloading https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master (2.8MB)
    100% |████████████████████████████████| 2.8MB 444kB/s eta 0:00:01
  Requirement already satisfied (use --upgrade to upgrade): ramp-workflow==0+unknown from https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master in /home/lemaitre/miniconda3/lib/python3.6/site-packages
Requirement already satisfied: numpy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)
Requirement already satisfied: scipy in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)
Requirement already satisfied: pandas>=0.19.2 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)
Requirement already satisfied: scikit-learn>=0.18 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)
Requirement already satisfied: cloudpickle in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)
Requirement already satisfied: colored in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from ramp-workflow==0+unknown)
Requirement already satisfied: python-dateutil>=2.5.0 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from pandas>=0.19.2->ramp-workflow==0+unknown)
Requirement already satisfied: pytz>=2011k in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from pandas>=0.19.2->ramp-workflow==0+unknown)
Requirement already satisfied: six>=1.5 in /home/lemaitre/miniconda3/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas>=0.19.2->ramp-workflow==0+unknown)
You are using pip version 9.0.3, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [3]:
%load_ext autoreload
%autoreload 2

Introduction: what is this challenge about?

Autism spectrum disorder (ASD) is a developmental disorder affecting communication and behavior, with a wide range of symptom severity. ASD has been reported to affect approximately 1 in 166 children.

Although there is a consensus that ASD is related to atypical brain networks and anatomy, the precise differences in brain anatomy and functional connectivity remain unclear. To address this issue, studies on large cohorts of subjects are necessary to ensure relevant findings.

Competition rules

You can sign up for the challenge on the ramp.studio event page. Here are the competition rules:

  • Submissions will be trained on the 1150 subjects given in the starting kit and tested on a private set of size 1003. More information on the distribution of the public and private sets can be found on the challenge website.
  • The competition will end on July 1, 2018 at 18h UTC (20h in Paris).
  • All models will be trained on AWS instance m5.xlarge (4 CPUs and 16 GiB of RAM).
  • Participants will be given a total of 80 machine hours. Submissions of a given participant will be ordered by submission timestamp. We will make an attempt to train all submissions, but starting from (and including) the first submission that makes the participant's total training time exceed 80 hours, all submissions will be disqualified from the competition (but can enter the collaborative phase). Testing time will not count towards the limit. Training time will be displayed on the leaderboard for all submissions, rounded to the second, per cross-validation fold. Since we have 8 CV folds, the maximum total training time shown on the leaderboard is 10h. If a submission raises an exception, its training time will not count towards the total.
  • There is a timeout of 15 minutes between submissions.
  • Submissions submitted after the end of the competition will not qualify for prizes.
  • The public leaderboard will display validation scores obtained with a stratified CV with 8 folds and a 20% validation set, executed on the public (starting kit) data. The official scores will be calculated on the hidden test set and will be published after the closing of the competition. We will measure several scores for each submission: (i) the area under the receiver operating characteristic curve (ROC-AUC) and (ii) the accuracy. The ROC-AUC will be used to rank submissions.
  • The organizers will do their best so that the provided backend runs flawlessly. We will communicate with participants in case of concerns and will try to resolve all issues, but we reserve the right to make unilateral decisions in specific cases, not covered by this set of minimal rules.
  • The organizers reserve the right to disqualify any participant found to violate the fair competitive spirit of the challenge. Possible reasons, without being exhaustive, are multiple accounts, attempts to access the test data, etc.
  • The challenge is essentially an individual contest, so there is no way to form official teams. Participants can form teams outside the platform before submitting any model individually, contact the organizers to let them know about the team, and submit on a single team member's account. However, submitting on one's own and participating in such a team at the same time is against the "no multiple accounts" rule, so, if discovered, may lead to disqualification.
  • Participants retain copyright on their submitted code and grant reuse under BSD 3-Clause License.

Participants accept these rules automatically when making a submission at the RAMP site.

Prizes of the competitive phase

Ties in the competitive scores will be broken by earlier submission time.

  • 3000 €: the top submission according to private test ROC-AUC at the end of the competitive phase.
  • 2000 €: the second best submission according to private test ROC-AUC at the end of the competitive phase.
  • 1000 €: the third best submission according to private test ROC-AUC at the end of the competitive phase.
  • 500 €: from the fourth to the tenth best submissions according to the private test ROC-AUC at the end of the competitive phase.

Sponsorship

Prizes are provided by IESF and the Paris-Saclay CDS. Computational time is provided by AWS through a research credit granted to the Paris-Saclay CDS.

The data

We start by downloading the data from the Internet:

In [4]:
from problem import get_train_data

data_train, labels_train = get_train_data()
In [5]:
data_train.head()
Out[5]:
participants_site participants_sex participants_age anatomy_lh_bankssts_area anatomy_lh_caudalanteriorcingulate_area anatomy_lh_caudalmiddlefrontal_area anatomy_lh_cuneus_area anatomy_lh_entorhinal_area anatomy_lh_fusiform_area anatomy_lh_inferiorparietal_area ... anatomy_select fmri_basc064 fmri_basc122 fmri_basc197 fmri_craddock_scorr_mean fmri_harvard_oxford_cort_prob_2mm fmri_motions fmri_msdl fmri_power_2011 fmri_select
subject_id
1932355398536124106 5 F 9.301370 977.0 427.0 1884.0 1449.0 463.0 2790.0 4091.0 ... 1 ./data/fmri/basc064/1932355398536124106/run_1/... ./data/fmri/basc122/1932355398536124106/run_1/... ./data/fmri/basc197/1932355398536124106/run_1/... ./data/fmri/craddock_scorr_mean/19323553985361... ./data/fmri/harvard_oxford_cort_prob_2mm/19323... ./data/fmri/motions/1932355398536124106/run_1/... ./data/fmri/msdl/1932355398536124106/run_1/193... ./data/fmri/power_2011/1932355398536124106/run... 1
5174041730092253771 19 M 29.000000 1279.0 730.0 2419.0 1611.0 467.0 3562.0 5380.0 ... 1 ./data/fmri/basc064/5174041730092253771/run_1/... ./data/fmri/basc122/5174041730092253771/run_1/... ./data/fmri/basc197/5174041730092253771/run_1/... ./data/fmri/craddock_scorr_mean/51740417300922... ./data/fmri/harvard_oxford_cort_prob_2mm/51740... ./data/fmri/motions/5174041730092253771/run_1/... ./data/fmri/msdl/5174041730092253771/run_1/517... ./data/fmri/power_2011/5174041730092253771/run... 1
10219322676643534800 19 F 45.000000 926.0 446.0 1897.0 2135.0 570.0 3064.0 4834.0 ... 1 ./data/fmri/basc064/10219322676643534800/run_1... ./data/fmri/basc122/10219322676643534800/run_1... ./data/fmri/basc197/10219322676643534800/run_1... ./data/fmri/craddock_scorr_mean/10219322676643... ./data/fmri/harvard_oxford_cort_prob_2mm/10219... ./data/fmri/motions/10219322676643534800/run_1... ./data/fmri/msdl/10219322676643534800/run_1/10... ./data/fmri/power_2011/10219322676643534800/ru... 1
10645466564919190227 5 F 9.216438 983.0 588.0 2479.0 1312.0 525.0 3766.0 5091.0 ... 1 ./data/fmri/basc064/10645466564919190227/run_1... ./data/fmri/basc122/10645466564919190227/run_1... ./data/fmri/basc197/10645466564919190227/run_1... ./data/fmri/craddock_scorr_mean/10645466564919... ./data/fmri/harvard_oxford_cort_prob_2mm/10645... ./data/fmri/motions/10645466564919190227/run_1... ./data/fmri/msdl/10645466564919190227/run_1/10... ./data/fmri/power_2011/10645466564919190227/ru... 1
14512541342641936232 28 M 15.050000 1488.0 593.0 2309.0 1829.0 726.0 3720.0 5432.0 ... 1 ./data/fmri/basc064/14512541342641936232/run_1... ./data/fmri/basc122/14512541342641936232/run_1... ./data/fmri/basc197/14512541342641936232/run_1... ./data/fmri/craddock_scorr_mean/14512541342641... ./data/fmri/harvard_oxford_cort_prob_2mm/14512... ./data/fmri/motions/14512541342641936232/run_1... ./data/fmri/msdl/14512541342641936232/run_1/14... ./data/fmri/power_2011/14512541342641936232/ru... 1

5 rows × 220 columns

In [6]:
print(labels_train)
[0 0 1 ... 0 1 0]
In [7]:
print('Number of subjects in the training set: {}'.format(labels_train.size))
Number of subjects in the training set: 1127

Participant features

In [8]:
data_train_participants = data_train[[col for col in data_train.columns if col.startswith('participants')]]
data_train_participants.head()
Out[8]:
participants_site participants_sex participants_age
subject_id
1932355398536124106 5 F 9.301370
5174041730092253771 19 M 29.000000
10219322676643534800 19 F 45.000000
10645466564919190227 5 F 9.216438
14512541342641936232 28 M 15.050000

Structural MRI features

A set of structural features has been extracted for each subject with FreeSurfer: (i) normalized brain volumes computed from the subcortical segmentation and (ii) cortical thickness and area for the right and left hemispheres.

In [9]:
data_train_anatomy = data_train[[col for col in data_train.columns if col.startswith('anatomy')]]
data_train_anatomy.head()
Out[9]:
anatomy_lh_bankssts_area anatomy_lh_caudalanteriorcingulate_area anatomy_lh_caudalmiddlefrontal_area anatomy_lh_cuneus_area anatomy_lh_entorhinal_area anatomy_lh_fusiform_area anatomy_lh_inferiorparietal_area anatomy_lh_inferiortemporal_area anatomy_lh_isthmuscingulate_area anatomy_lh_lateraloccipital_area ... anatomy_MaskVol anatomy_BrainSegVol-to-eTIV anatomy_MaskVol-to-eTIV anatomy_lhSurfaceHoles anatomy_rhSurfaceHoles anatomy_SurfaceHoles anatomy_EstimatedTotalIntraCranialVol anatomy_eTIV anatomy_BrainSegVolNotVent anatomy_select
subject_id
1932355398536124106 977.0 427.0 1884.0 1449.0 463.0 2790.0 4091.0 3305.0 897.0 4406.0 ... 1375171.0 0.840976 1.077472 30.0 31.0 61.0 1.276294e+06 1.276294e+06 1058903.0 1
5174041730092253771 1279.0 730.0 2419.0 1611.0 467.0 3562.0 5380.0 3555.0 1155.0 5611.0 ... 1807924.0 0.771229 1.033285 45.0 54.0 99.0 1.749685e+06 1.749685e+06 1329340.0 1
10219322676643534800 926.0 446.0 1897.0 2135.0 570.0 3064.0 4834.0 2602.0 1171.0 6395.0 ... 1522076.0 0.774117 1.082107 48.0 50.0 98.0 1.406585e+06 1.406585e+06 1072503.0 1
10645466564919190227 983.0 588.0 2479.0 1312.0 525.0 3766.0 5091.0 3433.0 1028.0 5405.0 ... 1544951.0 0.845986 1.083437 55.0 57.0 112.0 1.425972e+06 1.425972e+06 1194831.0 1
14512541342641936232 1488.0 593.0 2309.0 1829.0 726.0 3720.0 5432.0 3956.0 1033.0 5644.0 ... 1738955.0 0.793794 1.083640 22.0 42.0 64.0 1.604735e+06 1.604735e+06 1263065.0 1

5 rows × 208 columns

Note that the column anatomy_select contains a label assigned during a manual quality check (0 and 3: reject, 1: accept, 2: accept with reserve). This column can be used during training to exclude noisy data, for instance.

In [10]:
data_train_anatomy['anatomy_select'].head()
Out[10]:
subject_id
1932355398536124106     1
5174041730092253771     1
10219322676643534800    1
10645466564919190227    1
14512541342641936232    1
Name: anatomy_select, dtype: int64
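
If you want to exclude subjects that failed this quality check before training, here is a minimal sketch (assuming, as an example choice, that only the labels 1 and 2 are kept):

# keep subjects accepted (1) or accepted with reserve (2) by the manual QC
qc_ok = data_train['anatomy_select'].isin([1, 2])
data_train_qc = data_train[qc_ok]
labels_train_qc = labels_train[qc_ok.values]
print('Number of subjects kept after QC: {}'.format(labels_train_qc.size))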

Functional MRI features

The original acquired data are resting-state functional MRI scans. Each subject comes with fMRI time series extracted on different brain parcellations and atlases, together with a set of confound signals. Those brain atlases and parcellations are: (i) BASC parcellations with 64, 122, and 197 regions (Bellec 2010), (ii) Ncuts parcellations (Craddock 2012), (iii) Harvard-Oxford anatomical parcellations, (iv) the MSDL functional atlas (Varoquaux 2011), and (v) the Power atlas (Power 2011). The script used for this extraction can be found here.

In [11]:
data_train_functional = data_train[[col for col in data_train.columns if col.startswith('fmri')]]
data_train_functional.head()
Out[11]:
fmri_basc064 fmri_basc122 fmri_basc197 fmri_craddock_scorr_mean fmri_harvard_oxford_cort_prob_2mm fmri_motions fmri_msdl fmri_power_2011 fmri_select
subject_id
1932355398536124106 ./data/fmri/basc064/1932355398536124106/run_1/... ./data/fmri/basc122/1932355398536124106/run_1/... ./data/fmri/basc197/1932355398536124106/run_1/... ./data/fmri/craddock_scorr_mean/19323553985361... ./data/fmri/harvard_oxford_cort_prob_2mm/19323... ./data/fmri/motions/1932355398536124106/run_1/... ./data/fmri/msdl/1932355398536124106/run_1/193... ./data/fmri/power_2011/1932355398536124106/run... 1
5174041730092253771 ./data/fmri/basc064/5174041730092253771/run_1/... ./data/fmri/basc122/5174041730092253771/run_1/... ./data/fmri/basc197/5174041730092253771/run_1/... ./data/fmri/craddock_scorr_mean/51740417300922... ./data/fmri/harvard_oxford_cort_prob_2mm/51740... ./data/fmri/motions/5174041730092253771/run_1/... ./data/fmri/msdl/5174041730092253771/run_1/517... ./data/fmri/power_2011/5174041730092253771/run... 1
10219322676643534800 ./data/fmri/basc064/10219322676643534800/run_1... ./data/fmri/basc122/10219322676643534800/run_1... ./data/fmri/basc197/10219322676643534800/run_1... ./data/fmri/craddock_scorr_mean/10219322676643... ./data/fmri/harvard_oxford_cort_prob_2mm/10219... ./data/fmri/motions/10219322676643534800/run_1... ./data/fmri/msdl/10219322676643534800/run_1/10... ./data/fmri/power_2011/10219322676643534800/ru... 1
10645466564919190227 ./data/fmri/basc064/10645466564919190227/run_1... ./data/fmri/basc122/10645466564919190227/run_1... ./data/fmri/basc197/10645466564919190227/run_1... ./data/fmri/craddock_scorr_mean/10645466564919... ./data/fmri/harvard_oxford_cort_prob_2mm/10645... ./data/fmri/motions/10645466564919190227/run_1... ./data/fmri/msdl/10645466564919190227/run_1/10... ./data/fmri/power_2011/10645466564919190227/ru... 1
14512541342641936232 ./data/fmri/basc064/14512541342641936232/run_1... ./data/fmri/basc122/14512541342641936232/run_1... ./data/fmri/basc197/14512541342641936232/run_1... ./data/fmri/craddock_scorr_mean/14512541342641... ./data/fmri/harvard_oxford_cort_prob_2mm/14512... ./data/fmri/motions/14512541342641936232/run_1... ./data/fmri/msdl/14512541342641936232/run_1/14... ./data/fmri/power_2011/14512541342641936232/ru... 1

Unlike the anatomical and participant data, the fMRI columns contain filenames of CSV files in which the time-series information is stored. We show below how to read and extract meaningful information from those data.
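
For a quick look at what one of these files contains, here is a minimal sketch reading the MSDL time series of the first training subject (it assumes the corresponding files have already been downloaded with the fetcher shown further below; rows are time points and columns are atlas regions):

import pandas as pd

# read the time series of the first subject for the MSDL atlas
first_fmri_file = data_train['fmri_msdl'].iloc[0]
time_series = pd.read_csv(first_fmri_file, header=None).values
print(time_series.shape)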

Similarly to the anatomical data, the column fmri_select gives information about the manual quality check.

In [12]:
data_train_functional['fmri_select'].head()
Out[12]:
subject_id
1932355398536124106     1
5174041730092253771     1
10219322676643534800    1
10645466564919190227    1
14512541342641936232    1
Name: fmri_select, dtype: int64

Testing data

The testing data can be loaded in a similar way:

In [13]:
from problem import get_test_data

data_test, labels_test = get_test_data()
In [14]:
data_test.head()
Out[14]:
participants_site participants_sex participants_age anatomy_lh_bankssts_area anatomy_lh_caudalanteriorcingulate_area anatomy_lh_caudalmiddlefrontal_area anatomy_lh_cuneus_area anatomy_lh_entorhinal_area anatomy_lh_fusiform_area anatomy_lh_inferiorparietal_area ... anatomy_select fmri_basc064 fmri_basc122 fmri_basc197 fmri_craddock_scorr_mean fmri_harvard_oxford_cort_prob_2mm fmri_motions fmri_msdl fmri_power_2011 fmri_select
subject_id
5181409268393785348 31 M 12.200000 985.0 723.0 2851.0 1844.0 495.0 3526.0 5658.0 ... 1 ./data/fmri/basc064/5181409268393785348/run_1/... ./data/fmri/basc122/5181409268393785348/run_1/... ./data/fmri/basc197/5181409268393785348/run_1/... ./data/fmri/craddock_scorr_mean/51814092683937... ./data/fmri/harvard_oxford_cort_prob_2mm/51814... ./data/fmri/motions/5181409268393785348/run_1/... ./data/fmri/msdl/5181409268393785348/run_1/518... ./data/fmri/power_2011/5181409268393785348/run... 1
8797865049371315550 9 M 14.000000 1174.0 506.0 1890.0 1327.0 462.0 3564.0 4408.0 ... 1 ./data/fmri/basc064/8797865049371315550/run_1/... ./data/fmri/basc122/8797865049371315550/run_1/... ./data/fmri/basc197/8797865049371315550/run_1/... ./data/fmri/craddock_scorr_mean/87978650493713... ./data/fmri/harvard_oxford_cort_prob_2mm/87978... ./data/fmri/motions/8797865049371315550/run_1/... ./data/fmri/msdl/8797865049371315550/run_1/879... ./data/fmri/power_2011/8797865049371315550/run... 1
6486385878325245147 20 M 14.425000 1288.0 568.0 2406.0 1546.0 432.0 3497.0 4808.0 ... 1 ./data/fmri/basc064/6486385878325245147/run_1/... ./data/fmri/basc122/6486385878325245147/run_1/... ./data/fmri/basc197/6486385878325245147/run_1/... ./data/fmri/craddock_scorr_mean/64863858783252... ./data/fmri/harvard_oxford_cort_prob_2mm/64863... ./data/fmri/motions/6486385878325245147/run_1/... ./data/fmri/msdl/6486385878325245147/run_1/648... ./data/fmri/power_2011/6486385878325245147/run... 1
17126438435398394588 33 M 22.880200 1179.0 991.0 2427.0 1771.0 363.0 3579.0 6082.0 ... 1 ./data/fmri/basc064/17126438435398394588/run_1... ./data/fmri/basc122/17126438435398394588/run_1... ./data/fmri/basc197/17126438435398394588/run_1... ./data/fmri/craddock_scorr_mean/17126438435398... ./data/fmri/harvard_oxford_cort_prob_2mm/17126... ./data/fmri/motions/17126438435398394588/run_1... ./data/fmri/msdl/17126438435398394588/run_1/17... ./data/fmri/power_2011/17126438435398394588/ru... 1
16638049522113999228 2 M 8.252055 1064.0 721.0 2445.0 1453.0 561.0 3262.0 4885.0 ... 2 ./data/fmri/basc064/16638049522113999228/run_1... ./data/fmri/basc122/16638049522113999228/run_1... ./data/fmri/basc197/16638049522113999228/run_1... ./data/fmri/craddock_scorr_mean/16638049522113... ./data/fmri/harvard_oxford_cort_prob_2mm/16638... ./data/fmri/motions/16638049522113999228/run_1... ./data/fmri/msdl/16638049522113999228/run_1/16... ./data/fmri/power_2011/16638049522113999228/ru... 0

5 rows × 220 columns

In [15]:
print(labels_test)
[0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 1 0 1 1 0 0]

Workflow

Evaluation

The framework is evaluated with a cross-validation approach. The metrics used are the area under the ROC curve (ROC-AUC) and the accuracy.

In [16]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from problem import get_cv

def evaluation(X, y):
    pipe = make_pipeline(FeatureExtractor(), Classifier())
    cv = get_cv(X, y)
    results = cross_validate(pipe, X, y, scoring=['roc_auc', 'accuracy'], cv=cv,
                             verbose=1, return_train_score=True,
                             n_jobs=1)
    
    return results
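
Note that get_cv is provided by problem.py. For reference, the competition rules describe a stratified CV with 8 folds and a 20% validation set; a minimal sketch of an equivalent scheme is shown below (the exact implementation and random seed used by get_cv may differ):

from sklearn.model_selection import StratifiedShuffleSplit

def make_cv(X, y):
    # 8 stratified splits, each holding out 20% of the subjects for validation
    cv = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=42)
    return cv.split(X, y)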

Simple starting kit: using only anatomical features

FeatureExtractor

The available structural data can be used directly for classification. In this regard, we define a feature extractor (i.e. FeatureExtractor) which selects only the anatomical features, dropping all fMRI-related columns.

In [17]:
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin


class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X_df, y):
        return self

    def transform(self, X_df):
        # get only the anatomical information
        X = X_df[[col for col in X_df.columns if col.startswith('anatomy')]]
        return X.drop(columns='anatomy_select')

Classifier

We propose to use a logistic regression classifier preceded by a scaler which centers each feature and scales it to unit variance, using statistics computed on the training set.

In [18]:
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(StandardScaler(), LogisticRegression())

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self
        
    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)

Testing the submission

We can test our pipeline locally using the evaluation function that we defined earlier.

In [19]:
import numpy as np
In [20]:
results = evaluation(data_train, labels_train)

print("Training score ROC-AUC: {:.3f} +- {:.3f}".format(np.mean(results['train_roc_auc']),
                                                        np.std(results['train_roc_auc'])))
print("Validation score ROC-AUC: {:.3f} +- {:.3f} \n".format(np.mean(results['test_roc_auc']),
                                                          np.std(results['test_roc_auc'])))

print("Training score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['train_accuracy']),
                                                         np.std(results['train_accuracy'])))
print("Validation score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['test_accuracy']),
                                                           np.std(results['test_accuracy'])))
Training score ROC-AUC: 0.850 +- 0.004
Validation score ROC-AUC: 0.652 +- 0.016 

Training score accuracy: 0.772 +- 0.007
Validation score accuracy: 0.622 +- 0.016
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    1.1s finished
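
Since the public test split was loaded earlier, you can also compute a single hold-out estimate directly. This is only a local sanity check, not the official evaluation; a minimal sketch:

from sklearn.metrics import roc_auc_score

# fit the anatomy-only pipeline on the public training set and score it on the
# public test set
pipe = make_pipeline(FeatureExtractor(), Classifier())
pipe.fit(data_train, labels_train)
proba_test = pipe.predict_proba(data_test)
print('Test ROC-AUC: {:.3f}'.format(roc_auc_score(labels_test, proba_test[:, 1])))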

Going further: using fMRI-derived features

In the typical fMRI classification framework, the first steps (preprocessing and time-series extraction from the raw scans) have already been computed and correspond to the data given in this challenge. Therefore, our feature extractor implements the next step, which corresponds to the extraction of functional connectivity features. The classification step is identical to the pipeline presented for the anatomy: a standard scaler followed by a logistic regression classifier.

We pointed out that the available fMRI features are filenames to the time series. In order to limit the amount of data to be downloaded, we provide a fetcher, fetch_fmri_time_series(), to download only the time series linked to a specific atlas.

In [21]:
from download_data import fetch_fmri_time_series
help(fetch_fmri_time_series)
Help on function fetch_fmri_time_series in module download_data:

fetch_fmri_time_series(atlas='all')
    Fetch the time-series extracted from the fMRI data using a specific
    atlas.
    
    Parameters
    ----------
    atlas : string, default='all'
        The name of the atlas used during the extraction. The possibilities
        are:
    
        * `'basc064`, `'basc122'`, `'basc197'`: BASC parcellations with 64,
        122, and 197 regions [1]_;
        * `'craddock_scorr_mean'`: Ncuts parcellations [2]_;
        * `'harvard_oxford_cort_prob_2mm'`: Harvard-Oxford anatomical
        parcellations;
        * `'msdl'`: MSDL functional atlas [3]_;
        * `'power_2011'`: Power atlas [4]_.
    
    Returns
    -------
    None
    
    References
    ----------
    .. [1] Bellec, Pierre, et al. "Multi-level bootstrap analysis of stable
       clusters in resting-state fMRI." Neuroimage 51.3 (2010): 1126-1139.
    
    .. [2] Craddock, R. Cameron, et al. "A whole brain fMRI atlas generated
       via spatially constrained spectral clustering." Human brain mapping
       33.8 (2012): 1914-1928.
    
    .. [3] Varoquaux, Gaël, et al. "Multi-subject dictionary learning to
       segment an atlas of brain spontaneous activity." Biennial International
       Conference on Information Processing in Medical Imaging. Springer,
       Berlin, Heidelberg, 2011.
    
    .. [4] Power, Jonathan D., et al. "Functional network organization of the
       human brain." Neuron 72.4 (2011): 665-678.

In [22]:
fetch_fmri_time_series(atlas='msdl')
Downloading completed ...

You can download all the atlases at once by passing atlas='all'. It is also possible to run the file as a script: python download_data.py all.

In the FeatureExtractor below, we first select only the filenames of the MSDL time-series data. We create a FunctionTransformer which reads the time series from the CSV files on the fly and stores them in a numpy array. Those series are then used to compute the functional connectivity matrices which will later be fed to the classifier.

In [23]:
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

from nilearn.connectome import ConnectivityMeasure


def _load_fmri(fmri_filenames):
    """Load time-series extracted from the fMRI using a specific atlas."""
    return np.array([pd.read_csv(subject_filename,
                                 header=None).values
                     for subject_filename in fmri_filenames])


class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self):
        # make a transformer which will load the time series and compute the
        # connectome matrix
        self.transformer_fmri = make_pipeline(
            FunctionTransformer(func=_load_fmri, validate=False),
            ConnectivityMeasure(kind='tangent', vectorize=True))
        
    def fit(self, X_df, y):
        # get only the time series for the MSDL atlas
        fmri_filenames = X_df['fmri_msdl']
        self.transformer_fmri.fit(fmri_filenames, y)
        return self

    def transform(self, X_df):
        fmri_filenames = X_df['fmri_msdl']
        return self.transformer_fmri.transform(fmri_filenames)
/home/lemaitre/miniconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
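
As a quick sanity check of the output dimensionality, here is a minimal sketch (it assumes the MSDL atlas provides 39 regions; with the diagonal kept by default, the vectorized tangent connectome then has 39 × 40 / 2 = 780 features per subject):

# fit and transform on a small subset of subjects just to inspect the shape
fe = FeatureExtractor()
X_connectome = fe.fit(data_train.head(10), labels_train[:10]).transform(data_train.head(10))
print(X_connectome.shape)   # expected: (10, 780)
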
In [24]:
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.))

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self
       
    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)
In [25]:
results = evaluation(data_train, labels_train)

print("Training score ROC-AUC: {:.3f} +- {:.3f}".format(np.mean(results['train_roc_auc']),
                                                        np.std(results['train_roc_auc'])))
print("Validation score ROC-AUC: {:.3f} +- {:.3f} \n".format(np.mean(results['test_roc_auc']),
                                                          np.std(results['test_roc_auc'])))

print("Training score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['train_accuracy']),
                                                         np.std(results['train_accuracy'])))
print("Validation score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['test_accuracy']),
                                                           np.std(results['test_accuracy'])))
Training score ROC-AUC: 1.000 +- 0.000
Validation score ROC-AUC: 0.612 +- 0.019 

Training score accuracy: 1.000 +- 0.000
Validation score accuracy: 0.587 +- 0.021
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  4.1min finished

More elaborate pipeline: combining anatomy and fMRI

This workflow is a combination of the two previous workflows. The FeatureExtractor extracts both the structural and functional connectivity information and concatenates them. Note that each column name contains either connectome or anatomy depending on the type of feature; this naming will be used later on to train different classifiers.

In [26]:
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

from nilearn.connectome import ConnectivityMeasure


def _load_fmri(fmri_filenames):
    """Load time-series extracted from the fMRI using a specific atlas."""
    return np.array([pd.read_csv(subject_filename,
                                 header=None).values
                     for subject_filename in fmri_filenames])


class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self):
        # make a transformer which will load the time series and compute the
        # connectome matrix
        self.transformer_fmri = make_pipeline(
            FunctionTransformer(func=_load_fmri, validate=False),
            ConnectivityMeasure(kind='tangent', vectorize=True))
    
    def fit(self, X_df, y):
        fmri_filenames = X_df['fmri_msdl']
        self.transformer_fmri.fit(fmri_filenames, y)
        return self

    def transform(self, X_df):
        fmri_filenames = X_df['fmri_msdl']
        X_connectome = self.transformer_fmri.transform(fmri_filenames)
        X_connectome = pd.DataFrame(X_connectome, index=X_df.index)
        X_connectome.columns = ['connectome_{}'.format(i)
                                for i in range(X_connectome.columns.size)]
        # get the anatomical information
        X_anatomy = X_df[[col for col in X_df.columns
                          if col.startswith('anatomy')]]
        X_anatomy = X_anatomy.drop(columns='anatomy_select')
        # concatenate both matrices
        return pd.concat([X_connectome, X_anatomy], axis=1)

We create a classifier (a random forest classifier) which uses both the connectome and anatomical features.

In [27]:
import numpy as np

from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self
    
    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)
In [28]:
results = evaluation(data_train, labels_train)

print("Training score ROC-AUC: {:.3f} +- {:.3f}".format(np.mean(results['train_roc_auc']),
                                                        np.std(results['train_roc_auc'])))
print("Validation score ROC-AUC: {:.3f} +- {:.3f} \n".format(np.mean(results['test_roc_auc']),
                                                          np.std(results['test_roc_auc'])))

print("Training score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['train_accuracy']),
                                                         np.std(results['train_accuracy'])))
print("Validation score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['test_accuracy']),
                                                           np.std(results['test_accuracy'])))
Training score ROC-AUC: 1.000 +- 0.000
Validation score ROC-AUC: 0.655 +- 0.028 

Training score accuracy: 1.000 +- 0.000
Validation score accuracy: 0.613 +- 0.033
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  4.1min finished

We can also propose a more elaborate classifier than the previous one. We train two classifiers independently, one on the sMRI-derived features and one on the fMRI-derived features. Then, a meta-classifier combines both sources of information. Part of the training data is held out to train the meta-classifier.

In [29]:
import numpy as np

from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf_connectome = make_pipeline(StandardScaler(),
                                            LogisticRegression(C=1.))
        self.clf_anatomy = make_pipeline(StandardScaler(),
                                         LogisticRegression(C=1.))
        self.meta_clf = LogisticRegression(C=1.)

    def fit(self, X, y):
        X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]
        X_connectome = X[[col for col in X.columns
                          if col.startswith('connectome')]]
        train_idx, validation_idx = train_test_split(range(y.size),
                                                     test_size=0.33,
                                                     shuffle=True,
                                                     random_state=42)
        X_anatomy_train = X_anatomy.iloc[train_idx]
        X_anatomy_validation = X_anatomy.iloc[validation_idx]
        X_connectome_train = X_connectome.iloc[train_idx]
        X_connectome_validation = X_connectome.iloc[validation_idx]
        y_train = y[train_idx]
        y_validation = y[validation_idx]

        self.clf_connectome.fit(X_connectome_train, y_train)
        self.clf_anatomy.fit(X_anatomy_train, y_train)

        y_connectome_pred = self.clf_connectome.predict_proba(
            X_connectome_validation)
        y_anatomy_pred = self.clf_anatomy.predict_proba(
            X_anatomy_validation)

        self.meta_clf.fit(
            np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1),
            y_validation)
        return self
    
    def predict(self, X):
        X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]
        X_connectome = X[[col for col in X.columns
                          if col.startswith('connectome')]]

        y_anatomy_pred = self.clf_anatomy.predict_proba(X_anatomy)
        y_connectome_pred = self.clf_connectome.predict_proba(X_connectome)

        return self.meta_clf.predict(
            np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1))

    def predict_proba(self, X):
        X_anatomy = X[[col for col in X.columns if col.startswith('anatomy')]]
        X_connectome = X[[col for col in X.columns
                          if col.startswith('connectome')]]

        y_anatomy_pred = self.clf_anatomy.predict_proba(X_anatomy)
        y_connectome_pred = self.clf_connectome.predict_proba(X_connectome)

        return self.meta_clf.predict_proba(
            np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1))
In [30]:
results = evaluation(data_train, labels_train)

print("Training score ROC-AUC: {:.3f} +- {:.3f}".format(np.mean(results['train_roc_auc']),
                                                        np.std(results['train_roc_auc'])))
print("Validation score ROC-AUC: {:.3f} +- {:.3f} \n".format(np.mean(results['test_roc_auc']),
                                                          np.std(results['test_roc_auc'])))

print("Training score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['train_accuracy']),
                                                         np.std(results['train_accuracy'])))
print("Validation score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['test_accuracy']),
                                                           np.std(results['test_accuracy'])))
Training score ROC-AUC: 0.915 +- 0.010
Validation score ROC-AUC: 0.649 +- 0.023 

Training score accuracy: 0.854 +- 0.017
Validation score accuracy: 0.606 +- 0.022
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  4.2min finished

Submitting to the online challenge: ramp.studio

Once you have found a good model, you can submit it to ramp.studio to enter the online challenge. First, if it is your first time using the RAMP platform, sign up; otherwise log in. Then sign up for the autism event. Both signups are controlled by RAMP administrators, so there can be a delay between requesting a signup and being able to submit.

Once your signup request is accepted, you can go to your sandbox and copy-paste your code there. You can also create a new submission in the submissions folder containing both feature_extractor.py and classifier.py and upload those files directly. You can check the starting kit (feature_extractor.py and classifier.py) for an example. The submission is trained and tested on our backend in a similar way to how ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get a mail, and your submission shows up on the public leaderboard. If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.

After submission, do not forget to give credits to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions locally (adding feature transformations, selecting models, perhaps doing some AutoML/hyperopt, etc.) and to check them with ramp_test_submission. The script prints the mean cross-validation scores.

The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is the ROC-AUC. When the score is good enough, you can submit your model on the RAMP site.

In [31]:
!ramp_test_submission --submission starting_kit_anatomy
Testing Autism Spectrum Disorder classification
Reading train and test files from ./data ...
Reading cv ...
Training ./submissions/starting_kit_anatomy ...
CV fold 0
	score    auc    acc
	train  0.847  0.767
	valid  0.647  0.611
	test   0.765  0.696
CV fold 1
	score    auc    acc
	train  0.842  0.766
	valid  0.662  0.628
	test   0.659  0.478
CV fold 2
	score    auc    acc
	train  0.854  0.786
	valid  0.645  0.615
	test   0.720  0.609
CV fold 3
	score    auc    acc
	train  0.849  0.769
	valid  0.645  0.619
	test   0.758  0.565
CV fold 4
	score    auc    acc
	train  0.852  0.770
	valid  0.650  0.606
	test   0.735  0.652
CV fold 5
	score    auc    acc
	train  0.847  0.776
	valid  0.680  0.642
	test   0.598  0.565
CV fold 6
	score    auc    acc
	train  0.852  0.764
	valid  0.624  0.602
	test   0.773  0.652
CV fold 7
	score    auc    acc
	train  0.854  0.779
	valid  0.662  0.650
	test   0.644  0.478
----------------------------
Mean CV scores
----------------------------
	score             auc             acc
	train   0.85 ± 0.0039  0.772 ± 0.0071
	valid  0.652 ± 0.0155  0.622 ± 0.0161
	test   0.706 ± 0.0605  0.587 ± 0.0753
----------------------------
Bagged scores
----------------------------
	score    auc
	valid  0.651
	test   0.720

More information

You can find more information in the README of the ramp-workflow library.

Questions

Questions related to the starting kit should be asked on the issue tracker. The RAMP site administrators can be pinged on the RAMP Slack team in the #autism channel.