The StudyForrest project centers around the use of the movie Forrest Gump, which provides complex sensory input that is both reproducible and richly laden with real-life-like content and context. While study participants listened to and watched the movie, we collected a rich dataset encompassing many hours of fMRI scans, structural brain scans, eye-tracking data, and extensive annotations of the movie.
Apart from these data acquisitions, several processed derivatives have also been generated and shared. These include functional contrasts, cortical and subcortical brain parcellations, retinotopic maps, and more.
The StudyForrest data and all its derivatives are structured as a nested DataLad dataset that is publicly accessible and provides fine-grained data access down to the level of individual files.
DataLad is a free and open-source distributed data management system, available for all major operating systems, that was developed to aid with everything related to the evolution of digital objects. As explained in the DataLad Handbook:
It is not only keeping track of code, it is not only keeping track of data, it is not only making sharing, retrieving and linking data (and metadata) easy, but it assists with the combination of all things necessary in the digital workflow of data and science.
By following this interactive Jupyter Notebook, you are guided through the step-by-step process of accessing, downloading, and processing the StudyForrest data and its derivatives. This is made seamless with the use of DataLad and various other open source software packages, which are already installed in this Binder instance and ready to use.
To run a code cell in this notebook, select it and press Shift + Enter (or Shift + Return). If the notebook becomes unresponsive, you can restart it via the Kernel menu option above.

*Note: this notebook is not intended to be an in-depth tutorial on the use of DataLad. For a more detailed walk-through, please use the DataLad Handbook and introductory tutorial linked above. Some content of the current notebook was imported and adapted from that tutorial.*
You can find more details about how to install DataLad and its dependencies on all operating systems in the DataLad handbook, in the installation section. It also details how to install DataLad on shared machines that you don’t have administrative privileges (sudo rights) on, such as high performance compute clusters.
As a first step in this walk-through, let's ensure that we install the most recent version of DataLad using the Python package manager pip:
!pip install -U datalad
Once installed, DataLad can be used as a command line tool or via its Python API. In the command line, an instruction always starts with the general datalad
command:
!datalad
To find out more about the available commands, run datalad --help
. If you already have DataLad installed, make sure that it is a recent version (0.12 or higher):
!datalad --version
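If you prefer to check this programmatically, here is a minimal pure-Python sketch that compares a version string against the required minimum. Note that version_at_least is an illustrative helper of our own, not part of DataLad:

```python
import re

def version_at_least(version, minimum=(0, 12)):
    """Return True if a version string like '0.19.6' meets the minimum.

    NOTE: `version_at_least` is an illustrative helper, not part of DataLad.
    """
    parts = []
    for chunk in version.split('.'):
        match = re.match(r'\d+', chunk)  # keep only leading digits ('13rc1' -> 13)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts) >= minimum

print(version_at_least('0.19.6'))  # True
print(version_at_least('0.11.8'))  # False
```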
DataLad has further functionality that is particularly useful for getting local access to the dataset, its subdatasets, and individual files. These include:
datalad install  # install a top-level DataLad dataset, with the option of recursively installing subdatasets
datalad clone    # install a single DataLad dataset
datalad get      # download a local copy of a file or files of an installed DataLad dataset
datalad drop     # remove a local copy of a file or files of an installed DataLad dataset
The use of these commands is illustrated below with the StudyForrest dataset.
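All of these commands can also be scripted from Python. DataLad ships its own Python API (import datalad.api), but as a dependency-free sketch you can assemble and run the CLI calls yourself. The datalad_cmd helper below is hypothetical, not part of DataLad:

```python
import subprocess

def datalad_cmd(subcommand, *args, flags=()):
    """Assemble a datalad CLI invocation as an argument list.

    NOTE: `datalad_cmd` is an illustrative helper, not part of DataLad itself.
    """
    return ["datalad", subcommand, *flags, *args]

# For example, to clone the StudyForrest superdataset (requires datalad on PATH):
# subprocess.run(datalad_cmd("clone",
#     "https://github.com/psychoinformatics-de/studyforrest-data"), check=True)

print(datalad_cmd("get", ".", flags=("-r", "-n")))
# ['datalad', 'get', '-r', '-n', '.']
```

Passing an argument list (rather than a shell string) to subprocess.run avoids shell-quoting issues with file paths.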
Getting local access to the dataset is as simple as running the clone
command and pointing it to the location of the DataLad dataset (in this case: https://github.com/psychoinformatics-de/studyforrest-data):
!datalad clone https://github.com/psychoinformatics-de/studyforrest-data
Once the dataset is cloned, it exists as a lightweight directory on your local machine (in this case: /studyforrest-data
). At this point, it contains only metadata and information on the identity of the files in the dataset, but not the actual content of the (sometimes large) data files. This fact can be verified by checking the disk usage in the relevant local directory:
cd studyforrest-data
!du -sh
As you can see, the dataset size is tiny, and definitely not the size one would expect for a multi-modality neuroimaging dataset. However, the dataset is a complete representation of all data files. To explore this, the structure of the installed StudyForrest dataset can be viewed with the tree
command:
!tree -d
There are subdirectories for the original data, derivative data, artifacts, stimuli, and code, all of which add to the rich StudyForrest dataset. In turn, the contents of these subdirectories are structured as DataLad datasets themselves. This demonstrates the concept of dataset nesting, with the top-level (or super) dataset being the StudyForrest dataset that we just cloned, and the subdatasets being the subdirectories two levels down from the superdataset. These can be identified using the subdatasets
command:
!datalad subdatasets
We see something unexpected, however, when navigating down to and listing the content of specific subdatasets. For example, the original/phase2
subdataset:
cd original/phase2
ls
Note that the ls
command does not yield an output, implying that there are no files or folders in the phase2
directory. This is because our initial datalad clone https://github.com/psychoinformatics-de/studyforrest-data
command only cloned the top-level dataset and referenced its subdatasets; it did not clone the subdatasets themselves. To clone the subdatasets as well, one option is to use the install
command with an added recursive flag, as in:
datalad install -r https://github.com/psychoinformatics-de/studyforrest-data
More generally, the get
command can be used (recursively or on a single level) to install subdatasets, and adding the -n
flag prevents all data from being retrieved, as in:
datalad get -r -n https://github.com/psychoinformatics-de/studyforrest-data
Here, we use get
to clone the subdataset without retrieving data:
!datalad get -n .
Now, after successfully cloning the subdataset, running the ls
command should show the dataset content:
ls
So we have cloned a subdataset and inspected its contents; now we want to work with the actual data in the files. Let's try to access the participants.tsv
file:
from pathlib import Path
import pandas as pd

tsv_file = Path('participants.tsv')
try:
    tsv_abs_path = tsv_file.resolve(strict=True)
except FileNotFoundError:
    print('File does not exist')
else:
    print('File exists. Printing file contents.')
    participant_info = pd.read_csv(tsv_file, sep='\t')
    print(participant_info)
Easy, right? Now let's try with a NIfTI file, for example the functional time series located at sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
:
from pathlib import Path
import nibabel as nib

nii_file = Path('sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz')
try:
    nii_abs_path = nii_file.resolve(strict=True)
except FileNotFoundError:
    print('File does not exist')
else:
    print('File exists. Printing file header content.')
    img = nib.load(nii_file)
    print(img.header)
As you can see, the returned message says File does not exist
. If we navigate to and list the contents of the subdirectory where this file is supposed to be located, we see the following:
cd sub-01/ses-movie/func/
ls
cd ../../..
Thus, the file named sub-01_ses-movie_task-movie_run-1_bold.nii.gz
is there, but what appears to be the file in the dataset is merely a symbolic link (or symlink, indicated by the @
at the end of the filename) to the actual file content stored elsewhere. This is intentional behaviour of DataLad (and its dependency git-annex), and it underlies the core functionality: local access to a full representation of the dataset, while the (often large) data files themselves are only retrieved on demand.
DataLad can be instructed to retrieve small text-based files upon dataset installation or cloning (technically, these are then stored and tracked with git rather than git-annex), which explains why the tsv file was available and the nii file was not.
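The symlink mechanics can be demonstrated independently of DataLad using only the standard library. In this sketch we create a dangling symlink in a temporary directory (all paths here are made up for illustration); it behaves exactly like an annexed file before datalad get: the link exists, but resolving it strictly raises FileNotFoundError:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'annex' / 'content.nii.gz'  # intentionally never created
    link = Path(tmp) / 'sub-01_bold.nii.gz'
    link.symlink_to(target)                          # a dangling symlink

    print(link.is_symlink())   # True: the link itself is present
    print(link.exists())       # False: its target is not
    try:
        link.resolve(strict=True)
    except FileNotFoundError:
        print('File does not exist')
```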
To retrieve a specific file or many files, we use the datalad get
command:
!datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
Now, the same code to access the nii
-file and print its header content should run without errors:
from pathlib import Path
import nibabel as nib

nii_file = Path('sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz')
try:
    nii_abs_path = nii_file.resolve(strict=True)
except FileNotFoundError:
    print('File does not exist')
else:
    print('File exists. Printing file header content.')
    img = nib.load(nii_file)
    print(img.header)
Once data processing is completed and the data are no longer required locally, the content can be dropped from the dataset to save disk space:
!datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
cd ../..
Now that we are familiar with the concepts of install
, clone
, get
and drop
, we can explore and visualize the StudyForrest dataset!
Let's view the structural T1w and T2w images of a single participant (located in the original/3T_structural_mri
subdataset).
First we'll clone the dataset and get the relevant data:
!datalad get -n original/3T_structural_mri
!datalad get original/3T_structural_mri/sub-01/anat/sub-01_T*w.nii.gz
Then we can plot the structural images in 3 dimensions. We can do this using a variety of Python tools. Here we use the functionality of matplotlib
as an example:
%matplotlib inline
import os
from utilities import plot_structural
t1w_fn = os.path.abspath('original/3T_structural_mri/sub-01/anat/sub-01_T1w.nii.gz')
fig = plot_structural(t1w_fn)
t2w_fn = os.path.abspath('original/3T_structural_mri/sub-01/anat/sub-01_T2w.nii.gz')
fig = plot_structural(t2w_fn)
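Note that plot_structural comes from this repository's local utilities module. If you want a self-contained alternative, here is a minimal sketch using only numpy and matplotlib; the function name and the three-panel layout are our own choices, not part of the StudyForrest code:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_orthogonal_slices(vol):
    """Plot the central sagittal, coronal, and axial slices of a 3D volume.

    NOTE: illustrative helper, not part of the StudyForrest utilities.
    """
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    x, y, z = (s // 2 for s in vol.shape)  # indices of the central voxel
    panels = [(vol[x, :, :], 'sagittal'), (vol[:, y, :], 'coronal'), (vol[:, :, z], 'axial')]
    for ax, (sl, title) in zip(axes, panels):
        ax.imshow(sl.T, cmap='gray', origin='lower')
        ax.set_title(title)
        ax.axis('off')
    return fig

# e.g. with nibabel: fig = plot_orthogonal_slices(nib.load(t1w_fn).get_fdata())
fig = plot_orthogonal_slices(np.random.rand(32, 32, 32))
```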
Since head movement during the acquisition of a functional MRI time series can be detrimental to the eventual data analysis and results, volume-to-volume head movement parameters are typically inspected as a quality indicator of fMRI data. Framewise displacement (FD) captures head movement in a single value per volume, resulting in an FD time series per functional run. Below we present interactive distribution plots of FD values for all participants over all runs of the 3T audiovisual movie dataset. Distributions and an example time series are also presented for a single subject and a single run.
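For reference, FD can be computed directly from the six motion parameters per volume. The sketch below follows the common Power et al. convention (sum of absolute volume-to-volume differences, with rotations in radians converted to millimetres on an assumed 50 mm head radius) and assumes FSL mcflirt column order (three rotations, then three translations). The function name is ours, and the pipeline behind derivative/aligned_mri may differ in detail:

```python
import numpy as np

def framewise_displacement(mcparams, head_radius=50.0):
    """Compute FD per volume from an (n_volumes, 6) motion-parameter array.

    Assumes columns are [rot_x, rot_y, rot_z, trans_x, trans_y, trans_z],
    with rotations in radians and translations in mm (mcflirt convention).
    """
    params = np.asarray(mcparams, dtype=float).copy()
    params[:, :3] *= head_radius              # radians -> arc length in mm
    diffs = np.abs(np.diff(params, axis=0))   # volume-to-volume changes
    return np.concatenate([[0.0], diffs.sum(axis=1)])  # FD of first volume is 0

# A toy example: a 0.5 mm translation between volumes 1 and 2
toy = np.zeros((3, 6))
toy[1, 3] = 0.5
print(framewise_displacement(toy))  # [0.  0.5 0.5]
```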
First, we retrieve the relevant data:
!datalad get -n derivative/aligned_mri
!datalad get derivative/aligned_mri/sub-*/in_bold3Tp2/*_mcparams.txt
Then we prepare the FD data and plot its distribution per participant:
import os
import utilities as util
from plotly.colors import sequential, n_colors, qualitative
import plotly.graph_objs as go

dataset_dir = os.path.abspath('derivative/aligned_mri')
participants, column_names, df_subs, df_subsruns = util.prepare_fd(dataset_dir)
colors = n_colors('rgb(255, 149, 81)', 'rgb(109, 52, 137)', 20, colortype='rgb')
data = []
layout = go.Layout(
    xaxis=dict(tickangle=45),
    yaxis=dict(title='Framewise displacement (mm)', range=[-0.3, 2]),
    title='Framewise displacement for all participants over all runs (audio-visual movie task)'
)
fig1 = go.Figure(layout=layout)
for i, (colname, color) in enumerate(zip(participants, colors)):
    data.append(df_subs[colname].dropna().to_numpy())
    fig1.add_trace(go.Violin(y=data[i], line_color=color, name=colname, orientation='v',
                             side='positive', width=1.8, points=False,
                             box_visible=True, meanline_visible=True))
fig1.update_layout(xaxis_showgrid=False, xaxis_zeroline=False)
fig1
sub_nr = 1
sub = f"sub-{sub_nr:02d}"
data = []
layout = go.Layout(
    title='Framewise displacement for all 8 runs of ' + sub + ' (audio-visual movie task)',
    xaxis=dict(tickangle=45),
    yaxis=dict(title='Framewise displacement (mm)', range=[-0.05, 1]),
    height=400,
)
fig2 = go.Figure(layout=layout)
for i, (colname, color) in enumerate(zip(column_names[8*(sub_nr-1):8*sub_nr], colors)):
    data.append(df_subsruns[colname].dropna().to_numpy())
    fig2.add_trace(go.Violin(y=data[i], line_color=sequential.Viridis[i], name=colname,
                             orientation='v', side='positive', width=1.8, points=False,
                             box_visible=True, meanline_visible=True))
fig2.update_layout(xaxis_showgrid=False, xaxis_zeroline=False)
fig2
from plotly.subplots import make_subplots

run_nr = 2
run = f"run-{run_nr}"
marker = sub + '_' + run
fig3 = make_subplots(rows=1, cols=2, column_widths=[0.85, 0.15], shared_yaxes=True,
                     subplot_titles=("Time series", "Distribution"), horizontal_spacing=0.01)
fig3.add_trace(go.Scatter(y=df_subsruns[marker].dropna().to_numpy(), mode='lines',
                          line=dict(color=sequential.Viridis[run_nr-1], width=2),
                          name='Time series', showlegend=False),
               row=1, col=1)
fig3.add_trace(go.Violin(y=df_subsruns[marker].dropna().to_numpy(),
                         line_color=sequential.Viridis[run_nr-1], name='Distribution',
                         orientation='v', side='positive', width=1.5, points='all',
                         jitter=0.5, box_visible=True, meanline_visible=True, showlegend=False),
               row=1, col=2)
fig3.update_layout(
    height=300,
    yaxis=dict(title='FD (mm)', range=[-0.05, 1]),
    title=f'Framewise displacement for {marker} (time series and distribution)'
)
fig3.update_xaxes(showticklabels=False)
fig3.update_xaxes(showticklabels=True, row=1, col=1)
fig3
Thanks for following along! You have now experienced the basics of working with the StudyForrest data using DataLad. You have also seen some sample scripts and visualizations of the structural and functional data.
Now it's your turn :) Feel free to add more code cells below and test out your favorite algorithm/script/package on the StudyForrest data!