The StudyForrest project centers around the use of the movie Forrest Gump, which provides complex sensory input that is both reproducible and richly laden with real-life-like content and context. While study participants listened to and watched the movie, we collected a rich dataset encompassing many hours of fMRI scans, structural brain scans, eye-tracking data, and extensive annotations of the movie.
Apart from these data acquisitions, several processed derivatives have also been generated and shared. These include functional contrasts, cortical and subcortical brain parcellations, retinotopic maps, and more.
The StudyForrest data and all its derivatives are structured as a nested DataLad dataset that is publicly accessible and provides fine-grained data access down to the level of individual files.
DataLad is a free and open-source distributed data management system, available for all major operating systems, that was developed to aid with everything related to the evolution of digital objects. As explained in the DataLad Handbook:
It is not only keeping track of code, it is not only keeping track of data, it is not only making sharing, retrieving and linking data (and metadata) easy, but it assists with the combination of all things necessary in the digital workflow of data and science.
By following this interactive Jupyter Notebook, you are guided through the step-by-step process of accessing, downloading, and processing the StudyForrest data and its derivatives. This is made seamless with the use of DataLad and various other open source software packages, which are already installed in this Binder instance and ready to use.
To run a code cell in this notebook, select it and press Shift + Enter (or Shift + Return). If the notebook becomes unresponsive, you can restart it via the Kernel menu option above.

*Note: this notebook is not intended to be an in-depth tutorial on the use of DataLad. For a more detailed walk-through, please use the DataLad Handbook and introductory tutorial linked above. Some content of the current notebook was imported and adapted from that tutorial.*
You can find more details about how to install DataLad and its dependencies on all operating systems in the DataLad handbook, in the installation section. It also details how to install DataLad on shared machines that you don’t have administrative privileges (sudo rights) on, such as high performance compute clusters.
As a first step in this walk-through, let's ensure that we install the most recent version of DataLad using the Python package manager pip:
!pip install -U datalad
Once installed, DataLad can be used as a command line tool or via its Python API. In the command line, an instruction always starts with the general datalad
command:
!datalad
To find out more about the available commands, run datalad --help
. If you already have DataLad installed, make sure that it is a recent version (0.12 or higher):
!datalad --version
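If you prefer to check this programmatically, here is a minimal pure-Python sketch that compares a version string against the required minimum. Note that version_at_least is an illustrative helper of our own, not part of DataLad:

```python
import re

def version_at_least(version, minimum=(0, 12)):
    """Return True if a version string like '0.19.6' meets the minimum.

    NOTE: `version_at_least` is an illustrative helper, not part of DataLad.
    """
    parts = []
    for chunk in version.split('.'):
        match = re.match(r'\d+', chunk)  # keep only leading digits ('13rc1' -> 13)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts) >= minimum

print(version_at_least('0.19.6'))  # True
print(version_at_least('0.11.8'))  # False
```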
DataLad has further functionality that is particularly useful for getting local access to the dataset, its subdatasets, and individual files. These include:
datalad install  # install a top-level DataLad dataset, with the option of recursively installing subdatasets
datalad clone    # install a single DataLad dataset
datalad get      # download a local copy of a file or files of an installed DataLad dataset
datalad drop     # remove a local copy of a file or files of an installed DataLad dataset
The use of these commands is illustrated below with the StudyForrest dataset.
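All of these commands can also be scripted from Python. DataLad ships its own Python API (import datalad.api), but as a dependency-free sketch you can assemble and run the CLI calls yourself. The datalad_cmd helper below is hypothetical, not part of DataLad:

```python
import subprocess

def datalad_cmd(subcommand, *args, flags=()):
    """Assemble a datalad CLI invocation as an argument list.

    NOTE: `datalad_cmd` is an illustrative helper, not part of DataLad itself.
    """
    return ["datalad", subcommand, *flags, *args]

# For example, to clone the StudyForrest superdataset (requires datalad on PATH):
# subprocess.run(datalad_cmd("clone",
#     "https://github.com/psychoinformatics-de/studyforrest-data"), check=True)

print(datalad_cmd("get", ".", flags=("-r", "-n")))
# ['datalad', 'get', '-r', '-n', '.']
```

Passing an argument list (rather than a shell string) to subprocess.run avoids shell-quoting issues with file paths.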
Getting local access to the dataset is as simple as running the clone
command and pointing it to the location of the DataLad dataset (in this case: https://github.com/psychoinformatics-de/studyforrest-data):
!datalad clone https://github.com/psychoinformatics-de/studyforrest-data
Once the dataset is cloned, it exists as a lightweight directory on your local machine (in this case: /studyforrest-data
). At this point, it contains only metadata and information on the identity of the files in the dataset, but not the actual content of the (sometimes large) data files. This fact can be verified by checking the disk usage in the relevant local directory:
cd studyforrest-data
!du -sh
As you can see, the dataset size is tiny, and definitely not the size one would expect for a multi-modality neuroimaging dataset. However, the dataset is a complete representation of all data files. To explore this, the structure of the installed StudyForrest dataset can be viewed with the tree
command:
!tree -d
There are subdirectories for the original data, derivative data, artifacts, stimuli, and code, all of which add to the rich StudyForrest dataset. In turn, the contents of these subdirectories are structured as DataLad datasets themselves. This demonstrates the concept of dataset nesting, with the top-level (or super) dataset being the StudyForrest dataset that we just cloned, and the subdatasets being the subdirectories two levels down from the superdataset. These can be identified using the subdatasets
command:
!datalad subdatasets
We see something unexpected, however, when navigating down to and listing the content of specific subdatasets. For example, the original/phase2
subdataset:
cd original/phase2
ls
Note that the ls
command does not yield an output, implying that there are no files or folders in the phase2
directory. This is because our initial datalad clone https://github.com/psychoinformatics-de/studyforrest-data
command only cloned the top-level dataset and referenced its subdatasets; it did not clone the subdatasets themselves. To clone the subdatasets as well, one option is to use the install
command with an added recursive flag, as in:
datalad install -r https://github.com/psychoinformatics-de/studyforrest-data
More generally, the get
command can be used (recursively or on a single level) to install subdatasets, and adding the -n
flag prevents all data from being retrieved, as in:
datalad get -r -n https://github.com/psychoinformatics-de/studyforrest-data
Here, we use get
to clone the subdataset without retrieving data:
!datalad get -n .
Now, after successfully cloning the subdataset, running the ls
command should show the dataset content:
ls
So we have cloned a subdataset and inspected its contents; now we want to work with the actual data in the files. Let's try to access the participants.tsv
file:
from pathlib import Path
import pandas as pd

tsv_file = Path('participants.tsv')
try:
    tsv_abs_path = tsv_file.resolve(strict=True)
except FileNotFoundError:
    print('File does not exist')
else:
    print('File exists. Printing file contents.')
    participant_info = pd.read_csv(tsv_file, sep='\t')
    print(participant_info)
Easy, right? Now let's try with a NIfTI file, for example the functional time series located at sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
:
from pathlib import Path
import nibabel as nib

nii_file = Path('sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz')
try:
    nii_abs_path = nii_file.resolve(strict=True)
except FileNotFoundError:
    print('File does not exist')
else:
    print('File exists. Printing file header content.')
    img = nib.load(nii_file)
    print(img.header)
As you can see, the returned message says File does not exist
. If we navigate to and list the contents of the subdirectory where this file is supposed to be located, we see the following:
cd sub-01/ses-movie/func/
ls
cd ../../..
Thus, the file named sub-01_ses-movie_task-movie_run-1_bold.nii.gz
is there, but what appears to be the file in the dataset is merely a symbolic link (or symlink, indicated by the @
at the end of the filename) to the actual file content stored elsewhere. This is intentional behaviour of DataLad (and its dependency git-annex), and it underlies the core functionality: local access to a full representation of the dataset, while the (often large) data files themselves are only retrieved on demand.
DataLad can be instructed to retrieve small text-based files upon dataset installation or cloning (technically, these are then stored and tracked with git rather than git-annex), which explains why the tsv file was available and the nii file was not.
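The symlink mechanics can be demonstrated independently of DataLad using only the standard library. In this sketch we create a dangling symlink in a temporary directory (all paths here are made up for illustration); it behaves exactly like an annexed file before datalad get: the link exists, but resolving it strictly raises FileNotFoundError:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'annex' / 'content.nii.gz'  # intentionally never created
    link = Path(tmp) / 'sub-01_bold.nii.gz'
    link.symlink_to(target)                          # a dangling symlink

    print(link.is_symlink())   # True: the link itself is present
    print(link.exists())       # False: its target is not
    try:
        link.resolve(strict=True)
    except FileNotFoundError:
        print('File does not exist')
```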
To retrieve a specific file or many files, we use the datalad get
command:
!datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
Now, the same code to access the nii
-file and print its header content should run without errors:
from pathlib import Path
import nibabel as nib

nii_file = Path('sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz')
try:
    nii_abs_path = nii_file.resolve(strict=True)
except FileNotFoundError:
    print('File does not exist')
else:
    print('File exists. Printing file header content.')
    img = nib.load(nii_file)
    print(img.header)
Once data processing is completed and the data are no longer required locally, the content can be dropped from the dataset to save disk space:
!datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
cd ../..
Now that we are familiar with the concepts of install
, clone
, get
and drop
, we can explore and visualize the StudyForrest dataset!
Let's view the structural T1w and T2w images of a single participant (located in the original/3T_structural_mri
subdataset).
First we'll clone the dataset and get the relevant data:
!datalad get -n original/3T_structural_mri
!datalad get original/3T_structural_mri/sub-01/anat/sub-01_T*w.nii.gz
Then we can plot the structural images in 3 dimensions. We can do this using a variety of Python tools. Here we use the functionality of matplotlib
as an example:
%matplotlib inline
import os
from utilities import plot_structural
t1w_fn = os.path.abspath('original/3T_structural_mri/sub-01/anat/sub-01_T1w.nii.gz')
fig = plot_structural(t1w_fn)
t2w_fn = os.path.abspath('original/3T_structural_mri/sub-01/anat/sub-01_T2w.nii.gz')
fig = plot_structural(t2w_fn)
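Note that plot_structural comes from this repository's local utilities module. If you want a self-contained alternative, here is a minimal sketch using only numpy and matplotlib; the function name and the three-panel layout are our own choices, not part of the StudyForrest code:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_orthogonal_slices(vol):
    """Plot the central sagittal, coronal, and axial slices of a 3D volume.

    NOTE: illustrative helper, not part of the StudyForrest utilities.
    """
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    x, y, z = (s // 2 for s in vol.shape)  # indices of the central voxel
    panels = [(vol[x, :, :], 'sagittal'), (vol[:, y, :], 'coronal'), (vol[:, :, z], 'axial')]
    for ax, (sl, title) in zip(axes, panels):
        ax.imshow(sl.T, cmap='gray', origin='lower')
        ax.set_title(title)
        ax.axis('off')
    return fig

# e.g. with nibabel: fig = plot_orthogonal_slices(nib.load(t1w_fn).get_fdata())
fig = plot_orthogonal_slices(np.random.rand(32, 32, 32))
```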
Since head movement during the acquisition of a functional MRI time series can be detrimental to the eventual data analysis and results, volume-to-volume head movement parameters are typically inspected as a quality indicator of fMRI data. Framewise displacement (FD) captures head movement in a single value per volume, resulting in an FD time series per functional run. Below we present interactive distribution plots of FD values for all participants over all runs of the 3T audiovisual movie dataset. Distributions and an example time series are also presented for a single subject and a single run.
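For reference, FD can be computed directly from the six motion parameters per volume. The sketch below follows the common Power et al. convention (sum of absolute volume-to-volume differences, with rotations in radians converted to millimetres on an assumed 50 mm head radius) and assumes FSL mcflirt column order (three rotations, then three translations). The function name is ours, and the pipeline behind derivative/aligned_mri may differ in detail:

```python
import numpy as np

def framewise_displacement(mcparams, head_radius=50.0):
    """Compute FD per volume from an (n_volumes, 6) motion-parameter array.

    Assumes columns are [rot_x, rot_y, rot_z, trans_x, trans_y, trans_z],
    with rotations in radians and translations in mm (mcflirt convention).
    """
    params = np.asarray(mcparams, dtype=float).copy()
    params[:, :3] *= head_radius              # radians -> arc length in mm
    diffs = np.abs(np.diff(params, axis=0))   # volume-to-volume changes
    return np.concatenate([[0.0], diffs.sum(axis=1)])  # FD of first volume is 0

# A toy example: a 0.5 mm translation between volumes 1 and 2
toy = np.zeros((3, 6))
toy[1, 3] = 0.5
print(framewise_displacement(toy))  # [0.  0.5 0.5]
```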
First, we retrieve the relevant data:
!datalad get -n derivative/aligned_mri
!datalad get derivative/aligned_mri/sub-*/in_bold3Tp2/*_mcparams.txt
Then we prepare the FD data and plot its distribution per participant:
import os
import utilities as util
from plotly.colors import sequential, n_colors, qualitative
import plotly.graph_objs as go

dataset_dir = os.path.abspath('derivative/aligned_mri')
participants, column_names, df_subs, df_subsruns = util.prepare_fd(dataset_dir)
colors = n_colors('rgb(255, 149, 81)', 'rgb(109, 52, 137)', 20, colortype='rgb')
data = []
layout = go.Layout(
    xaxis=dict(tickangle=45),
    yaxis=dict(title='Framewise displacement (mm)', range=[-0.3, 2]),
    title='Framewise displacement for all participants over all runs (audio-visual movie task)'
)
fig1 = go.Figure(layout=layout)
for i, (colname, color) in enumerate(zip(participants, colors)):
    data.append(df_subs[colname].dropna().to_numpy())
    fig1.add_trace(go.Violin(y=data[i], line_color=color, name=colname, orientation='v',
                             side='positive', width=1.8, points=False,
                             box_visible=True, meanline_visible=True))
fig1.update_layout(xaxis_showgrid=False, xaxis_zeroline=False)
fig1
sub_nr = 1
sub = f"sub-{sub_nr:02d}"
data = []
layout = go.Layout(
    title='Framewise displacement for all 8 runs of ' + sub + ' (audio-visual movie task)',
    xaxis=dict(tickangle=45),
    yaxis=dict(title='Framewise displacement (mm)', range=[-0.05, 1]),
    height=400,
)
fig2 = go.Figure(layout=layout)
for i, (colname, color) in enumerate(zip(column_names[8*(sub_nr-1):8*sub_nr], colors)):
    data.append(df_subsruns[colname].dropna().to_numpy())
    fig2.add_trace(go.Violin(y=data[i], line_color=sequential.Viridis[i], name=colname,
                             orientation='v', side='positive', width=1.8, points=False,
                             box_visible=True, meanline_visible=True))
fig2.update_layout(xaxis_showgrid=False, xaxis_zeroline=False)
fig2
from plotly.subplots import make_subplots

run_nr = 2
run = f"run-{run_nr}"
marker = sub + '_' + run
fig3 = make_subplots(rows=1, cols=2, column_widths=[0.85, 0.15], shared_yaxes=True,
                     subplot_titles=("Time series", "Distribution"), horizontal_spacing=0.01)
fig3.add_trace(go.Scatter(y=df_subsruns[marker].dropna().to_numpy(), mode='lines',
                          line=dict(color=sequential.Viridis[run_nr-1], width=2),
                          name='Time series', showlegend=False),
               row=1, col=1)
fig3.add_trace(go.Violin(y=df_subsruns[marker].dropna().to_numpy(),
                         line_color=sequential.Viridis[run_nr-1], name='Distribution',
                         orientation='v', side='positive', width=1.5, points='all',
                         jitter=0.5, box_visible=True, meanline_visible=True, showlegend=False),
               row=1, col=2)
fig3.update_layout(
    height=300,
    yaxis=dict(title='FD (mm)', range=[-0.05, 1]),
    title=f'Framewise displacement for {marker} (time series and distribution)'
)
fig3.update_xaxes(showticklabels=False)
fig3.update_xaxes(showticklabels=True, row=1, col=1)
fig3
Thanks for following along! You have now experienced the basics of working with the StudyForrest data using DataLad. You have also seen some sample scripts and visualizations of the structural and functional data.
Now it's your turn :) Feel free to add more code cells below and test out your favorite algorithm/script/package on the StudyForrest data!