To run this notebook locally, download the full starting kit, which contains all the necessary files.
This starting kit requires the following dependencies:
numpy
scipy
pandas
scikit-learn
matplotlib
seaborn
nilearn
jupyter
ramp-workflow
The following two cells will install any missing dependencies.
# import sys
# !{sys.executable} -m pip install "scikit-learn>=0.19,<=0.21" seaborn nilearn
Install ramp-workflow from GitHub.
# !{sys.executable} -m pip install https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/0.2.1
%load_ext autoreload
%autoreload 2
Autism spectrum disorder (ASD) is a developmental disorder affecting communication and behavior, with a wide range of symptom severity. ASD has been reported to affect approximately 1 in 166 children.
Although there is a consensus that ASD is related to atypical brain networks and anatomy, the exact differences in brain anatomy and functional connectivity remain unclear. To address these issues, studies on large cohorts of subjects are necessary to ensure relevant findings.
You can sign up to the challenge on the ramp.studio event page. Here are the competition rules:
Submissions are evaluated on an AWS m5.xlarge instance (4 CPUs and 16 GiB of RAM). Participants accept these rules automatically when making a submission on the RAMP site.
Ties in the competitive scores will be broken by earlier submission time.
Prizes are provided by IESF and Paris-Saclay CDS. The computational time is provided by AWS on a research credit granted to the Paris-Saclay CDS.
We start by downloading the data from the Internet.
from problem import get_train_data
data_train, labels_train = get_train_data()
data_train.head()
participants_site | participants_sex | participants_age | anatomy_lh_bankssts_area | anatomy_lh_caudalanteriorcingulate_area | anatomy_lh_caudalmiddlefrontal_area | anatomy_lh_cuneus_area | anatomy_lh_entorhinal_area | anatomy_lh_fusiform_area | anatomy_lh_inferiorparietal_area | ... | anatomy_select | fmri_basc064 | fmri_basc122 | fmri_basc197 | fmri_craddock_scorr_mean | fmri_harvard_oxford_cort_prob_2mm | fmri_motions | fmri_msdl | fmri_power_2011 | fmri_select | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
subject_id | |||||||||||||||||||||
1932355398536124106 | 5 | F | 9.301370 | 977.0 | 427.0 | 1884.0 | 1449.0 | 463.0 | 2790.0 | 4091.0 | ... | 1 | ./data/fmri/basc064/1932355398536124106/run_1/... | ./data/fmri/basc122/1932355398536124106/run_1/... | ./data/fmri/basc197/1932355398536124106/run_1/... | ./data/fmri/craddock_scorr_mean/19323553985361... | ./data/fmri/harvard_oxford_cort_prob_2mm/19323... | ./data/fmri/motions/1932355398536124106/run_1/... | ./data/fmri/msdl/1932355398536124106/run_1/193... | ./data/fmri/power_2011/1932355398536124106/run... | 1 |
5174041730092253771 | 19 | M | 29.000000 | 1279.0 | 730.0 | 2419.0 | 1611.0 | 467.0 | 3562.0 | 5380.0 | ... | 1 | ./data/fmri/basc064/5174041730092253771/run_1/... | ./data/fmri/basc122/5174041730092253771/run_1/... | ./data/fmri/basc197/5174041730092253771/run_1/... | ./data/fmri/craddock_scorr_mean/51740417300922... | ./data/fmri/harvard_oxford_cort_prob_2mm/51740... | ./data/fmri/motions/5174041730092253771/run_1/... | ./data/fmri/msdl/5174041730092253771/run_1/517... | ./data/fmri/power_2011/5174041730092253771/run... | 1 |
10219322676643534800 | 19 | F | 45.000000 | 926.0 | 446.0 | 1897.0 | 2135.0 | 570.0 | 3064.0 | 4834.0 | ... | 1 | ./data/fmri/basc064/10219322676643534800/run_1... | ./data/fmri/basc122/10219322676643534800/run_1... | ./data/fmri/basc197/10219322676643534800/run_1... | ./data/fmri/craddock_scorr_mean/10219322676643... | ./data/fmri/harvard_oxford_cort_prob_2mm/10219... | ./data/fmri/motions/10219322676643534800/run_1... | ./data/fmri/msdl/10219322676643534800/run_1/10... | ./data/fmri/power_2011/10219322676643534800/ru... | 1 |
10645466564919190227 | 5 | F | 9.216438 | 983.0 | 588.0 | 2479.0 | 1312.0 | 525.0 | 3766.0 | 5091.0 | ... | 1 | ./data/fmri/basc064/10645466564919190227/run_1... | ./data/fmri/basc122/10645466564919190227/run_1... | ./data/fmri/basc197/10645466564919190227/run_1... | ./data/fmri/craddock_scorr_mean/10645466564919... | ./data/fmri/harvard_oxford_cort_prob_2mm/10645... | ./data/fmri/motions/10645466564919190227/run_1... | ./data/fmri/msdl/10645466564919190227/run_1/10... | ./data/fmri/power_2011/10645466564919190227/ru... | 1 |
14512541342641936232 | 28 | M | 15.050000 | 1488.0 | 593.0 | 2309.0 | 1829.0 | 726.0 | 3720.0 | 5432.0 | ... | 1 | ./data/fmri/basc064/14512541342641936232/run_1... | ./data/fmri/basc122/14512541342641936232/run_1... | ./data/fmri/basc197/14512541342641936232/run_1... | ./data/fmri/craddock_scorr_mean/14512541342641... | ./data/fmri/harvard_oxford_cort_prob_2mm/14512... | ./data/fmri/motions/14512541342641936232/run_1... | ./data/fmri/msdl/14512541342641936232/run_1/14... | ./data/fmri/power_2011/14512541342641936232/ru... | 1 |
5 rows × 220 columns
print(labels_train)
[0 0 1 ... 0 1 0]
print('Number of subjects in the training set: {}'.format(labels_train.size))
Number of subjects in the training set: 1127
data_train_participants = data_train[[col for col in data_train.columns if col.startswith('participants')]]
data_train_participants.head()
participants_site | participants_sex | participants_age | |
---|---|---|---|
subject_id | |||
1932355398536124106 | 5 | F | 9.301370 |
5174041730092253771 | 19 | M | 29.000000 |
10219322676643534800 | 19 | F | 45.000000 |
10645466564919190227 | 5 | F | 9.216438 |
14512541342641936232 | 28 | M | 15.050000 |
A set of structural features has been extracted for each subject: (i) normalized brain volumes computed from the FreeSurfer subcortical segmentation and (ii) cortical thickness and area for the left and right hemispheres computed with FreeSurfer.
data_train_anatomy = data_train[[col for col in data_train.columns if col.startswith('anatomy')]]
data_train_anatomy.head()
anatomy_lh_bankssts_area | anatomy_lh_caudalanteriorcingulate_area | anatomy_lh_caudalmiddlefrontal_area | anatomy_lh_cuneus_area | anatomy_lh_entorhinal_area | anatomy_lh_fusiform_area | anatomy_lh_inferiorparietal_area | anatomy_lh_inferiortemporal_area | anatomy_lh_isthmuscingulate_area | anatomy_lh_lateraloccipital_area | ... | anatomy_MaskVol | anatomy_BrainSegVol-to-eTIV | anatomy_MaskVol-to-eTIV | anatomy_lhSurfaceHoles | anatomy_rhSurfaceHoles | anatomy_SurfaceHoles | anatomy_EstimatedTotalIntraCranialVol | anatomy_eTIV | anatomy_BrainSegVolNotVent | anatomy_select | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
subject_id | |||||||||||||||||||||
1932355398536124106 | 977.0 | 427.0 | 1884.0 | 1449.0 | 463.0 | 2790.0 | 4091.0 | 3305.0 | 897.0 | 4406.0 | ... | 1375171.0 | 0.840976 | 1.077472 | 30.0 | 31.0 | 61.0 | 1.276294e+06 | 1.276294e+06 | 1058903.0 | 1 |
5174041730092253771 | 1279.0 | 730.0 | 2419.0 | 1611.0 | 467.0 | 3562.0 | 5380.0 | 3555.0 | 1155.0 | 5611.0 | ... | 1807924.0 | 0.771229 | 1.033285 | 45.0 | 54.0 | 99.0 | 1.749685e+06 | 1.749685e+06 | 1329340.0 | 1 |
10219322676643534800 | 926.0 | 446.0 | 1897.0 | 2135.0 | 570.0 | 3064.0 | 4834.0 | 2602.0 | 1171.0 | 6395.0 | ... | 1522076.0 | 0.774117 | 1.082107 | 48.0 | 50.0 | 98.0 | 1.406585e+06 | 1.406585e+06 | 1072503.0 | 1 |
10645466564919190227 | 983.0 | 588.0 | 2479.0 | 1312.0 | 525.0 | 3766.0 | 5091.0 | 3433.0 | 1028.0 | 5405.0 | ... | 1544951.0 | 0.845986 | 1.083437 | 55.0 | 57.0 | 112.0 | 1.425972e+06 | 1.425972e+06 | 1194831.0 | 1 |
14512541342641936232 | 1488.0 | 593.0 | 2309.0 | 1829.0 | 726.0 | 3720.0 | 5432.0 | 3956.0 | 1033.0 | 5644.0 | ... | 1738955.0 | 0.793794 | 1.083640 | 22.0 | 42.0 | 64.0 | 1.604735e+06 | 1.604735e+06 | 1263065.0 | 1 |
5 rows × 208 columns
Note that the column anatomy_select contains a label assigned during a manual quality check (0 and 3: reject; 1: accept; 2: accept with reserve). This column can be used during training, for instance to exclude noisy data.
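As an illustration, here is a minimal sketch of excluding subjects that did not pass the quality check. It uses a small toy DataFrame rather than the real data, but the same boolean mask applies to data_train and labels_train:

```python
import numpy as np
import pandas as pd

# toy stand-in for the 'anatomy_select' quality-check column:
# 0 and 3 = reject, 1 = accept, 2 = accept with reserve
df = pd.DataFrame({'anatomy_select': [1, 0, 2, 3, 1]})
labels = np.array([0, 1, 1, 0, 1])

# keep only subjects accepted without reserve
mask = df['anatomy_select'] == 1
df_clean, labels_clean = df[mask], labels[mask.values]
print(len(df_clean))  # 2 subjects kept
```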
data_train_anatomy['anatomy_select'].head()
subject_id
1932355398536124106     1
5174041730092253771     1
10219322676643534800    1
10645466564919190227    1
14512541342641936232    1
Name: anatomy_select, dtype: int64
The original data acquired are resting-state functional MRI scans. Each subject also comes with fMRI signals extracted on different brain parcellations and atlases, and a set of confound signals. Those brain atlases and parcellations are: (i) BASC parcellations with 64, 122, and 197 regions (Bellec 2010), (ii) Ncuts parcellations (Craddock 2012), (iii) Harvard-Oxford anatomical parcellations, (iv) MSDL functional atlas (Varoquaux 2011), and (v) Power atlas (Power 2011). The script used for this extraction can be found here.
data_train_functional = data_train[[col for col in data_train.columns if col.startswith('fmri')]]
data_train_functional.head()
fmri_basc064 | fmri_basc122 | fmri_basc197 | fmri_craddock_scorr_mean | fmri_harvard_oxford_cort_prob_2mm | fmri_motions | fmri_msdl | fmri_power_2011 | fmri_select | |
---|---|---|---|---|---|---|---|---|---|
subject_id | |||||||||
1932355398536124106 | ./data/fmri/basc064/1932355398536124106/run_1/... | ./data/fmri/basc122/1932355398536124106/run_1/... | ./data/fmri/basc197/1932355398536124106/run_1/... | ./data/fmri/craddock_scorr_mean/19323553985361... | ./data/fmri/harvard_oxford_cort_prob_2mm/19323... | ./data/fmri/motions/1932355398536124106/run_1/... | ./data/fmri/msdl/1932355398536124106/run_1/193... | ./data/fmri/power_2011/1932355398536124106/run... | 1 |
5174041730092253771 | ./data/fmri/basc064/5174041730092253771/run_1/... | ./data/fmri/basc122/5174041730092253771/run_1/... | ./data/fmri/basc197/5174041730092253771/run_1/... | ./data/fmri/craddock_scorr_mean/51740417300922... | ./data/fmri/harvard_oxford_cort_prob_2mm/51740... | ./data/fmri/motions/5174041730092253771/run_1/... | ./data/fmri/msdl/5174041730092253771/run_1/517... | ./data/fmri/power_2011/5174041730092253771/run... | 1 |
10219322676643534800 | ./data/fmri/basc064/10219322676643534800/run_1... | ./data/fmri/basc122/10219322676643534800/run_1... | ./data/fmri/basc197/10219322676643534800/run_1... | ./data/fmri/craddock_scorr_mean/10219322676643... | ./data/fmri/harvard_oxford_cort_prob_2mm/10219... | ./data/fmri/motions/10219322676643534800/run_1... | ./data/fmri/msdl/10219322676643534800/run_1/10... | ./data/fmri/power_2011/10219322676643534800/ru... | 1 |
10645466564919190227 | ./data/fmri/basc064/10645466564919190227/run_1... | ./data/fmri/basc122/10645466564919190227/run_1... | ./data/fmri/basc197/10645466564919190227/run_1... | ./data/fmri/craddock_scorr_mean/10645466564919... | ./data/fmri/harvard_oxford_cort_prob_2mm/10645... | ./data/fmri/motions/10645466564919190227/run_1... | ./data/fmri/msdl/10645466564919190227/run_1/10... | ./data/fmri/power_2011/10645466564919190227/ru... | 1 |
14512541342641936232 | ./data/fmri/basc064/14512541342641936232/run_1... | ./data/fmri/basc122/14512541342641936232/run_1... | ./data/fmri/basc197/14512541342641936232/run_1... | ./data/fmri/craddock_scorr_mean/14512541342641... | ./data/fmri/harvard_oxford_cort_prob_2mm/14512... | ./data/fmri/motions/14512541342641936232/run_1... | ./data/fmri/msdl/14512541342641936232/run_1/14... | ./data/fmri/power_2011/14512541342641936232/ru... | 1 |
Unlike the anatomical and participants data, the available data are filenames of CSV files in which the time-series are stored. We show in the next section how to read these files and extract meaningful information from them.
Similarly to the anatomical data, the column fmri_select gives the result of the manual quality check.
data_train_functional['fmri_select'].head()
subject_id
1932355398536124106     1
5174041730092253771     1
10219322676643534800    1
10645466564919190227    1
14512541342641936232    1
Name: fmri_select, dtype: int64
The testing data can be loaded similarly as follows:
from problem import get_test_data
data_test, labels_test = get_test_data()
data_test.head()
participants_site | participants_sex | participants_age | anatomy_lh_bankssts_area | anatomy_lh_caudalanteriorcingulate_area | anatomy_lh_caudalmiddlefrontal_area | anatomy_lh_cuneus_area | anatomy_lh_entorhinal_area | anatomy_lh_fusiform_area | anatomy_lh_inferiorparietal_area | ... | anatomy_select | fmri_basc064 | fmri_basc122 | fmri_basc197 | fmri_craddock_scorr_mean | fmri_harvard_oxford_cort_prob_2mm | fmri_motions | fmri_msdl | fmri_power_2011 | fmri_select | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
subject_id | |||||||||||||||||||||
5181409268393785348 | 31 | M | 12.200000 | 985.0 | 723.0 | 2851.0 | 1844.0 | 495.0 | 3526.0 | 5658.0 | ... | 1 | ./data/fmri/basc064/5181409268393785348/run_1/... | ./data/fmri/basc122/5181409268393785348/run_1/... | ./data/fmri/basc197/5181409268393785348/run_1/... | ./data/fmri/craddock_scorr_mean/51814092683937... | ./data/fmri/harvard_oxford_cort_prob_2mm/51814... | ./data/fmri/motions/5181409268393785348/run_1/... | ./data/fmri/msdl/5181409268393785348/run_1/518... | ./data/fmri/power_2011/5181409268393785348/run... | 1 |
8797865049371315550 | 9 | M | 14.000000 | 1174.0 | 506.0 | 1890.0 | 1327.0 | 462.0 | 3564.0 | 4408.0 | ... | 1 | ./data/fmri/basc064/8797865049371315550/run_1/... | ./data/fmri/basc122/8797865049371315550/run_1/... | ./data/fmri/basc197/8797865049371315550/run_1/... | ./data/fmri/craddock_scorr_mean/87978650493713... | ./data/fmri/harvard_oxford_cort_prob_2mm/87978... | ./data/fmri/motions/8797865049371315550/run_1/... | ./data/fmri/msdl/8797865049371315550/run_1/879... | ./data/fmri/power_2011/8797865049371315550/run... | 1 |
6486385878325245147 | 20 | M | 14.425000 | 1288.0 | 568.0 | 2406.0 | 1546.0 | 432.0 | 3497.0 | 4808.0 | ... | 1 | ./data/fmri/basc064/6486385878325245147/run_1/... | ./data/fmri/basc122/6486385878325245147/run_1/... | ./data/fmri/basc197/6486385878325245147/run_1/... | ./data/fmri/craddock_scorr_mean/64863858783252... | ./data/fmri/harvard_oxford_cort_prob_2mm/64863... | ./data/fmri/motions/6486385878325245147/run_1/... | ./data/fmri/msdl/6486385878325245147/run_1/648... | ./data/fmri/power_2011/6486385878325245147/run... | 1 |
17126438435398394588 | 33 | M | 22.880200 | 1179.0 | 991.0 | 2427.0 | 1771.0 | 363.0 | 3579.0 | 6082.0 | ... | 1 | ./data/fmri/basc064/17126438435398394588/run_1... | ./data/fmri/basc122/17126438435398394588/run_1... | ./data/fmri/basc197/17126438435398394588/run_1... | ./data/fmri/craddock_scorr_mean/17126438435398... | ./data/fmri/harvard_oxford_cort_prob_2mm/17126... | ./data/fmri/motions/17126438435398394588/run_1... | ./data/fmri/msdl/17126438435398394588/run_1/17... | ./data/fmri/power_2011/17126438435398394588/ru... | 1 |
16638049522113999228 | 2 | M | 8.252055 | 1064.0 | 721.0 | 2445.0 | 1453.0 | 561.0 | 3262.0 | 4885.0 | ... | 2 | ./data/fmri/basc064/16638049522113999228/run_1... | ./data/fmri/basc122/16638049522113999228/run_1... | ./data/fmri/basc197/16638049522113999228/run_1... | ./data/fmri/craddock_scorr_mean/16638049522113... | ./data/fmri/harvard_oxford_cort_prob_2mm/16638... | ./data/fmri/motions/16638049522113999228/run_1... | ./data/fmri/msdl/16638049522113999228/run_1/16... | ./data/fmri/power_2011/16638049522113999228/ru... | 0 |
5 rows × 220 columns
print(labels_test)
[0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 1 0 1 1 0 0]
The framework is evaluated with a cross-validation approach. The metrics used are the area under the ROC curve (ROC-AUC) and the accuracy.
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from problem import get_cv
def evaluation(X, y):
pipe = make_pipeline(FeatureExtractor(), Classifier())
cv = get_cv(X, y)
results = cross_validate(pipe, X, y, scoring=['roc_auc', 'accuracy'], cv=cv,
verbose=1, return_train_score=True,
n_jobs=1)
return results
The available structural data can be used directly for classification. To this end, we will use a feature extractor (i.e. FeatureExtractor). This extractor selects only the anatomical features, dropping any fMRI-based features.
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
class FeatureExtractor(BaseEstimator, TransformerMixin):
def fit(self, X_df, y):
return self
def transform(self, X_df):
# get only the anatomical information
X = X_df[[col for col in X_df.columns if col.startswith('anatomy')]]
return X.drop(columns='anatomy_select')
We propose to use a logistic regression classifier preceded by a scaler, which removes the mean and scales to unit variance using statistics computed on the training set.
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
class Classifier(BaseEstimator):
def __init__(self):
self.clf = make_pipeline(StandardScaler(), LogisticRegression())
def fit(self, X, y):
self.clf.fit(X, y)
return self
def predict(self, X):
return self.clf.predict(X)
def predict_proba(self, X):
return self.clf.predict_proba(X)
We can test our pipeline locally using the evaluation function that we defined earlier.
import numpy as np
results = evaluation(data_train, labels_train)
print("Training score ROC-AUC: {:.3f} +- {:.3f}".format(np.mean(results['train_roc_auc']),
np.std(results['train_roc_auc'])))
print("Validation score ROC-AUC: {:.3f} +- {:.3f} \n".format(np.mean(results['test_roc_auc']),
np.std(results['test_roc_auc'])))
print("Training score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['train_accuracy']),
np.std(results['train_accuracy'])))
print("Validation score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['test_accuracy']),
np.std(results['test_accuracy'])))
Training score ROC-AUC: 0.850 +- 0.004
Validation score ROC-AUC: 0.652 +- 0.016
Training score accuracy: 0.772 +- 0.007
Validation score accuracy: 0.622 +- 0.016
[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 1.1s finished
From the framework illustrated in the figure above, steps 1 and 2 have already been computed during preprocessing and correspond to the data given in this challenge. Therefore, our feature extractor will implement step 3, which corresponds to the extraction of functional connectivity features. Step 4 is identical to the pipeline presented for the anatomy: a standard scaler followed by a logistic regression classifier.
We pointed out that the available fMRI features are filenames of the time-series. In order to limit the amount of data to be downloaded, we provide a fetcher, fetch_fmri_time_series(), to download only the time-series linked to a specific atlas.
from download_data import fetch_fmri_time_series
help(fetch_fmri_time_series)
Help on function fetch_fmri_time_series in module download_data:

fetch_fmri_time_series(atlas='all')
    Fetch the time-series extracted from the fMRI data using a specific
    atlas.

    Parameters
    ----------
    atlas : string, default='all'
        The name of the atlas used during the extraction. The possibilities
        are:

        * 'basc064', 'basc122', 'basc197': BASC parcellations with 64, 122,
          and 197 regions [1]_;
        * 'craddock_scorr_mean': Ncuts parcellations [2]_;
        * 'harvard_oxford_cort_prob_2mm': Harvard-Oxford anatomical
          parcellations;
        * 'msdl': MSDL functional atlas [3]_;
        * 'power_2011': Power atlas [4]_.

    Returns
    -------
    None

    References
    ----------
    .. [1] Bellec, Pierre, et al. "Multi-level bootstrap analysis of stable
       clusters in resting-state fMRI." Neuroimage 51.3 (2010): 1126-1139.
    .. [2] Craddock, R. Cameron, et al. "A whole brain fMRI atlas generated
       via spatially constrained spectral clustering." Human brain mapping
       33.8 (2012): 1914-1928.
    .. [3] Varoquaux, Gaël, et al. "Multi-subject dictionary learning to
       segment an atlas of brain spontaneous activity." Biennial International
       Conference on Information Processing in Medical Imaging. Springer,
       Berlin, Heidelberg, 2011.
    .. [4] Power, Jonathan D., et al. "Functional network organization of the
       human brain." Neuron 72.4 (2011): 665-678.
fetch_fmri_time_series(atlas='msdl')
Downloading completed ...
You can download all atlases at once by passing atlas='all'. It is also possible to run the download as a script: python download_data.py all.
In the FeatureExtractor below, we first select only the filenames related to the MSDL time-series data. We create a FunctionTransformer which reads the time-series from the CSV files on the fly and stores them in a numpy array.
Those series are used to extract the functional connectivity matrices, which will be used later in the classifier.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from nilearn.connectome import ConnectivityMeasure
def _load_fmri(fmri_filenames):
"""Load time-series extracted from the fMRI using a specific atlas."""
return np.array([pd.read_csv(subject_filename,
header=None).values
for subject_filename in fmri_filenames])
class FeatureExtractor(BaseEstimator, TransformerMixin):
def __init__(self):
# make a transformer which will load the time series and compute the
# connectome matrix
self.transformer_fmri = make_pipeline(
FunctionTransformer(func=_load_fmri, validate=False),
ConnectivityMeasure(kind='tangent', vectorize=True))
def fit(self, X_df, y):
# get only the time series for the MSDL atlas
fmri_filenames = X_df['fmri_msdl']
self.transformer_fmri.fit(fmri_filenames, y)
return self
def transform(self, X_df):
fmri_filenames = X_df['fmri_msdl']
return self.transformer_fmri.transform(fmri_filenames)
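To give an intuition for the size of the vectorized connectivity features, here is a plain-NumPy sketch with synthetic time series. It illustrates vectorizing a symmetric correlation matrix rather than nilearn's actual tangent-space embedding: an atlas with R regions yields R * (R + 1) / 2 features per subject (lower triangle plus diagonal).

```python
import numpy as np

rng = np.random.RandomState(0)
n_timepoints, n_regions = 150, 39          # the MSDL atlas has 39 regions
time_series = rng.randn(n_timepoints, n_regions)

# correlation matrix between regional time series
corr = np.corrcoef(time_series.T)          # shape (39, 39), symmetric

# vectorize the symmetric matrix: lower triangle including the diagonal
tril_idx = np.tril_indices(n_regions)
features = corr[tril_idx]
print(features.shape)                      # (780,) = 39 * 40 / 2
```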
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
class Classifier(BaseEstimator):
def __init__(self):
self.clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.))
def fit(self, X, y):
self.clf.fit(X, y)
return self
def predict(self, X):
return self.clf.predict(X)
def predict_proba(self, X):
return self.clf.predict_proba(X)
results = evaluation(data_train, labels_train)
print("Training score ROC-AUC: {:.3f} +- {:.3f}".format(np.mean(results['train_roc_auc']),
np.std(results['train_roc_auc'])))
print("Validation score ROC-AUC: {:.3f} +- {:.3f} \n".format(np.mean(results['test_roc_auc']),
np.std(results['test_roc_auc'])))
print("Training score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['train_accuracy']),
np.std(results['train_accuracy'])))
print("Validation score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['test_accuracy']),
np.std(results['test_accuracy'])))
Training score ROC-AUC: 1.000 +- 0.000
Validation score ROC-AUC: 0.612 +- 0.019
Training score accuracy: 1.000 +- 0.000
Validation score accuracy: 0.587 +- 0.021
[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 4.1min finished
This workflow is a combination of the two previous workflows. The FeatureExtractor extracts both the structural and the functional connectivity information and concatenates them. Note that each column name will contain either connectome or anatomy depending on the type of feature; this will be used to train different classifiers later on.
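The prefix-based split of the concatenated feature matrix can be sketched with a toy DataFrame (column names below are illustrative, following the connectome_/anatomy_ naming convention):

```python
import pandas as pd

# toy combined feature matrix with the two column-name prefixes
X = pd.DataFrame({
    'connectome_0': [0.1, 0.2],
    'connectome_1': [0.3, 0.4],
    'anatomy_lh_bankssts_area': [977.0, 1279.0],
})

# downstream classifiers can recover each block by its prefix
X_connectome = X[[c for c in X.columns if c.startswith('connectome')]]
X_anatomy = X[[c for c in X.columns if c.startswith('anatomy')]]
print(X_connectome.shape, X_anatomy.shape)  # (2, 2) (2, 1)
```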
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from nilearn.connectome import ConnectivityMeasure
def _load_fmri(fmri_filenames):
"""Load time-series extracted from the fMRI using a specific atlas."""
return np.array([pd.read_csv(subject_filename,
header=None).values
for subject_filename in fmri_filenames])
class FeatureExtractor(BaseEstimator, TransformerMixin):
def __init__(self):
# make a transformer which will load the time series and compute the
# connectome matrix
self.transformer_fmri = make_pipeline(
FunctionTransformer(func=_load_fmri, validate=False),
ConnectivityMeasure(kind='tangent', vectorize=True))
def fit(self, X_df, y):
fmri_filenames = X_df['fmri_msdl']
self.transformer_fmri.fit(fmri_filenames, y)
return self
def transform(self, X_df):
fmri_filenames = X_df['fmri_msdl']
X_connectome = self.transformer_fmri.transform(fmri_filenames)
X_connectome = pd.DataFrame(X_connectome, index=X_df.index)
X_connectome.columns = ['connectome_{}'.format(i)
for i in range(X_connectome.columns.size)]
# get the anatomical information
X_anatomy = X_df[[col for col in X_df.columns
if col.startswith('anatomy')]]
X_anatomy = X_anatomy.drop(columns='anatomy_select')
# concatenate both matrices
return pd.concat([X_connectome, X_anatomy], axis=1)
We will create a classifier (i.e. a random forest classifier) which uses both the connectome and the anatomical features.
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
class Classifier(BaseEstimator):
def __init__(self):
self.clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
def fit(self, X, y):
self.clf.fit(X, y)
return self
def predict(self, X):
return self.clf.predict(X)
def predict_proba(self, X):
return self.clf.predict_proba(X)
results = evaluation(data_train, labels_train)
print("Training score ROC-AUC: {:.3f} +- {:.3f}".format(np.mean(results['train_roc_auc']),
np.std(results['train_roc_auc'])))
print("Validation score ROC-AUC: {:.3f} +- {:.3f} \n".format(np.mean(results['test_roc_auc']),
np.std(results['test_roc_auc'])))
print("Training score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['train_accuracy']),
np.std(results['train_accuracy'])))
print("Validation score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['test_accuracy']),
np.std(results['test_accuracy'])))
Training score ROC-AUC: 1.000 +- 0.000
Validation score ROC-AUC: 0.655 +- 0.028
Training score accuracy: 1.000 +- 0.000
Validation score accuracy: 0.613 +- 0.033
[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 4.1min finished
We can propose a more complex classifier than the previous one. We will train two classifiers independently on the sMRI-derived and fMRI-derived features. Then, a meta-classifier will combine both sources of information. We hold out some data to be able to train the meta-classifier.
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split


class Classifier(BaseEstimator):
    def __init__(self):
        # one base classifier per modality
        self.clf_connectome = make_pipeline(StandardScaler(),
                                            LogisticRegression(C=1.))
        self.clf_anatomy = make_pipeline(StandardScaler(),
                                         LogisticRegression(C=1.))
        # meta-classifier combining the predictions of the base classifiers
        self.meta_clf = LogisticRegression(C=1.)

    def fit(self, X, y):
        X_anatomy = X[[col for col in X.columns
                       if col.startswith('anatomy')]]
        X_connectome = X[[col for col in X.columns
                          if col.startswith('connectome')]]

        # hold out a third of the training data for the meta-classifier
        train_idx, validation_idx = train_test_split(range(y.size),
                                                     test_size=0.33,
                                                     shuffle=True,
                                                     random_state=42)
        X_anatomy_train = X_anatomy.iloc[train_idx]
        X_anatomy_validation = X_anatomy.iloc[validation_idx]
        X_connectome_train = X_connectome.iloc[train_idx]
        X_connectome_validation = X_connectome.iloc[validation_idx]
        y_train = y[train_idx]
        y_validation = y[validation_idx]

        # fit each base classifier on the training split
        self.clf_connectome.fit(X_connectome_train, y_train)
        self.clf_anatomy.fit(X_anatomy_train, y_train)

        # fit the meta-classifier on the held-out predicted probabilities
        y_connectome_pred = self.clf_connectome.predict_proba(
            X_connectome_validation)
        y_anatomy_pred = self.clf_anatomy.predict_proba(
            X_anatomy_validation)
        self.meta_clf.fit(
            np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1),
            y_validation)
        return self

    def predict(self, X):
        X_anatomy = X[[col for col in X.columns
                       if col.startswith('anatomy')]]
        X_connectome = X[[col for col in X.columns
                          if col.startswith('connectome')]]
        y_anatomy_pred = self.clf_anatomy.predict_proba(X_anatomy)
        y_connectome_pred = self.clf_connectome.predict_proba(X_connectome)
        return self.meta_clf.predict(
            np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1))

    def predict_proba(self, X):
        X_anatomy = X[[col for col in X.columns
                       if col.startswith('anatomy')]]
        X_connectome = X[[col for col in X.columns
                          if col.startswith('connectome')]]
        y_anatomy_pred = self.clf_anatomy.predict_proba(X_anatomy)
        y_connectome_pred = self.clf_connectome.predict_proba(X_connectome)
        return self.meta_clf.predict_proba(
            np.concatenate([y_connectome_pred, y_anatomy_pred], axis=1))
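To see the stacking idea in isolation, here is a minimal, self-contained sketch of the same scheme outside the `Classifier` class: two base logistic regressions are trained on disjoint feature blocks, and a meta logistic regression is fit on their held-out predicted probabilities. The variable names and the synthetic random data below are illustrative only, not part of the challenge data, so the resulting scores are meaningless.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 200
X_anatomy = rng.randn(n, 10)      # stand-in for sMRI-derived features
X_connectome = rng.randn(n, 20)   # stand-in for fMRI-derived features
y = rng.randint(0, 2, n)

# hold out a third of the samples to train the meta-classifier
train_idx, val_idx = train_test_split(np.arange(n), test_size=0.33,
                                      shuffle=True, random_state=42)

clf_anatomy = LogisticRegression(C=1.).fit(X_anatomy[train_idx], y[train_idx])
clf_connectome = LogisticRegression(C=1.).fit(X_connectome[train_idx],
                                              y[train_idx])

# the meta-classifier only sees probabilities predicted on the held-out split
Z_val = np.concatenate([clf_connectome.predict_proba(X_connectome[val_idx]),
                        clf_anatomy.predict_proba(X_anatomy[val_idx])], axis=1)
meta_clf = LogisticRegression(C=1.).fit(Z_val, y[val_idx])

# final predictions combine both modalities through the meta-classifier
Z = np.concatenate([clf_connectome.predict_proba(X_connectome),
                    clf_anatomy.predict_proba(X_anatomy)], axis=1)
predictions = meta_clf.predict(Z)
print(predictions.shape)  # (200,)
```

Training the meta-classifier on held-out probabilities, rather than on probabilities predicted on the base classifiers' own training data, avoids feeding it overconfident, overfitted inputs.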
results = evaluation(data_train, labels_train)
print("Training score ROC-AUC: {:.3f} +- {:.3f}".format(np.mean(results['train_roc_auc']),
np.std(results['train_roc_auc'])))
print("Validation score ROC-AUC: {:.3f} +- {:.3f} \n".format(np.mean(results['test_roc_auc']),
np.std(results['test_roc_auc'])))
print("Training score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['train_accuracy']),
np.std(results['train_accuracy'])))
print("Validation score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['test_accuracy']),
np.std(results['test_accuracy'])))
Training score ROC-AUC: 0.915 +- 0.010
Validation score ROC-AUC: 0.649 +- 0.023

Training score accuracy: 0.854 +- 0.017
Validation score accuracy: 0.606 +- 0.022
[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 4.2min finished
Once you have found a good model, you can submit it to ramp.studio to enter the online challenge. First, if it is your first time using the RAMP platform, sign up; otherwise log in. Then sign up for the event autism. Both signups are controlled by RAMP administrators, so there can be a delay between asking to sign up and being able to submit.
Once your signup request is accepted, you can go to your sandbox and copy-paste your code there. You can also create a new starting kit in the submissions
folder containing both feature_extractor.py
and classifier.py
and upload those files directly. You can check the starting kit (feature_extractor.py
and classifier.py
) for an example. The submission is trained and tested on our backend in the same way as ramp_test_submission
does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get a mail, and your submission shows up on the public leaderboard.
If there is an error (despite having tested your submission locally with ramp_test_submission
), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.
After submission, do not forget to give credits to the previous submissions you reused or integrated into your submission.
The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.
The usual way to work with RAMP is to explore solutions locally: add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., and check them with ramp_test_submission
. The script prints the mean cross-validation scores.
The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is the AUC. When the score is good enough, you can submit your solution to the RAMP.
!ramp_test_submission --submission starting_kit_anatomy
Testing Autism Spectrum Disorder classification
Reading train and test files from ./data ...
Reading cv ...
Training ./submissions/starting_kit_anatomy ...
CV fold 0
	score    auc    acc
	train  0.847  0.767
	valid  0.647  0.611
	test   0.765  0.696
CV fold 1
	score    auc    acc
	train  0.842  0.766
	valid  0.662  0.628
	test   0.659  0.478
CV fold 2
	score    auc    acc
	train  0.854  0.786
	valid  0.645  0.615
	test   0.720  0.609
CV fold 3
	score    auc    acc
	train  0.849  0.769
	valid  0.645  0.619
	test   0.758  0.565
CV fold 4
	score    auc    acc
	train  0.852  0.770
	valid  0.650  0.606
	test   0.735  0.652
CV fold 5
	score    auc    acc
	train  0.847  0.776
	valid  0.680  0.642
	test   0.598  0.565
CV fold 6
	score    auc    acc
	train  0.852  0.764
	valid  0.624  0.602
	test   0.773  0.652
CV fold 7
	score    auc    acc
	train  0.854  0.779
	valid  0.662  0.650
	test   0.644  0.478
----------------------------
Mean CV scores
----------------------------
	score            auc             acc
	train  0.85 ± 0.0039  0.772 ± 0.0071
	valid  0.652 ± 0.0155  0.622 ± 0.0161
	test   0.706 ± 0.0605  0.587 ± 0.0753
----------------------------
Bagged scores
----------------------------
	score    auc
	valid  0.651
	test   0.720
You can find more information in the README of the ramp-workflow library.
Questions related to the starting kit should be asked on the issue tracker. The RAMP site administrators can be pinged at the RAMP slack team in the #autism channel.