In this notebook I will show you how to use a study design configuration in JSON format, as produced by datascriptor (https://gitlab.com/datascriptor/datascriptor), to generate a single-study ISA investigation, and how to then serialise it in JSON and tabular (i.e. CSV) format.
Our study design configuration consists of:
Let's import all the required libraries
# If executing the notebook on `Google Colab`, uncomment the following command
# and run it to install the required Python libraries and make the test datasets available.
# !pip install -r requirements.txt
from time import time
import os
import json
## ISA-API related imports
from isatools.model import Investigation, Study
## ISA-API create mode related imports
from isatools.create.model import StudyDesign
from isatools.create.connectors import generate_study_design
# serializer from ISA Investigation to JSON
from isatools.isajson import ISAJSONEncoder
# ISA-Tab serialisation
from isatools import isatab
First of all, we load the study design configuration with all the specs defined above
with open(os.path.abspath(os.path.join(
"isa-study-design-as-json/datascriptor", "crossover-study-design-4-arms-blood-derma-nmr-ms-chipseq.json")), "r") as config_file:
study_design_config = json.load(config_file)
study_design_config
To perform the conversion, we just need to use the `generate_study_design` function, which consumes the configuration dictionary and returns a `StudyDesign` object.
study_design = generate_study_design(study_design_config)
assert isinstance(study_design, StudyDesign)
The `StudyDesign.generate_isa_study()` method returns the complete ISA-API `Study` object.
start = time()
study = study_design.generate_isa_study()
end = time()
print('The generation of the study design took {:.2f} s.'.format(end - start))
assert isinstance(study, Study)
investigation = Investigation(studies=[study])
start = time()
inv_json = json.dumps(investigation, cls=ISAJSONEncoder, sort_keys=True, indent=4, separators=(',', ': '))
end = time()
print('The JSON serialisation of the ISA investigation took {:.2f} s.'.format(end - start))
directory = os.path.abspath('output')
os.makedirs(directory, exist_ok=True)
with open(os.path.abspath(os.path.join('output','isa-investigation-2-arms-nmr-ms.json')), 'w') as out_fp:
json.dump(json.loads(inv_json), out_fp)
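Note that the `json.loads`/`json.dump` round trip above re-parses the pretty-printed string and writes it back without indentation, producing a compact file. A minimal self-contained sketch of that pattern, using a toy payload in place of the serialised investigation:

```python
import json

# Toy payload standing in for the serialised ISA investigation (illustrative only)
payload = {"studies": [{"identifier": "study_01", "assays": 3}]}

# Pretty-printed string, as produced above with indent=4
pretty = json.dumps(payload, sort_keys=True, indent=4, separators=(',', ': '))

# Re-parsing and dumping without indent yields a compact, single-line form,
# which is what the round trip in the cell above writes to disk
compact = json.dumps(json.loads(pretty))

print(len(pretty) > len(compact))  # True: the compact form is smaller
```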
To use the tables in the notebook, we can also dump them to pandas DataFrames, using the `dump_tables_to_dataframes` function rather than `dump`.
start = time()
dataframes = isatab.dump_tables_to_dataframes(investigation)
end = time()
print('The Tab serialisation of the ISA investigation took {:.2f} s.'.format(end - start))
# alternatively, if you just want to write the isatab files to file, you can run
# isatab.dump(investigation, os.path.abspath(os.path.join('notebook-output/isa-study-from-design-config')))
len(dataframes)
We have 1 study file and 3 assay files (one for MS, one for NMR, and one for ChIP-Seq). Let's check the names:
for key in dataframes.keys():
display(key)
We have 10 subjects in each of the six arms, for a total of 60 subjects. 5 blood samples per subject are collected (1 in each of the two treatment phases and 3 in the follow-up phase), for a total of 300 blood samples. These will undergo the NMR assay. We have 4 saliva samples per subject (1 during the screen phase and 3 during follow-up), for a total of 240 saliva samples. These will undergo the mass spectrometry assay.
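The totals quoted above can be checked with a quick back-of-the-envelope computation (all numbers taken from the text):

```python
subjects_per_arm = 10
arms = 6
subjects = subjects_per_arm * arms  # 60 subjects in total

blood_per_subject = 5   # 1 + 1 in the treatment phases, 3 in follow-up
saliva_per_subject = 4  # 1 at screen, 3 during follow-up

blood_samples = subjects * blood_per_subject    # routed to the NMR assay
saliva_samples = subjects * saliva_per_subject  # routed to the MS assay
print(subjects, blood_samples, saliva_samples)  # 60 300 240
```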
study_frame = dataframes['s_study_01.txt']
for group in ('GRP0', 'GRP1', 'GRP2', 'GRP3', 'GRP4', 'GRP5'):
    count = len(study_frame[study_frame['Source Name'].apply(lambda el: group in el)])
    print("There are {} samples in the {} arm (i.e. group)".format(count, group))
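The per-group counts rely on substring matching against the `Source Name` column. A minimal self-contained sketch of the same pattern on toy data (the source names below are made up, following the `GRP<n>` naming convention), using pandas' vectorised `str.contains` as an alternative to `apply` with a lambda:

```python
import pandas as pd

# Toy study table with made-up source names following the GRP<n> convention
toy = pd.DataFrame({"Source Name": [
    "GRP0_SBJ0", "GRP0_SBJ1", "GRP1_SBJ0", "GRP5_SBJ0", "GRP5_SBJ1",
]})

# str.contains is a vectorised alternative to apply(lambda el: grp in el)
counts = {grp: int(toy["Source Name"].str.contains(grp).sum())
          for grp in ("GRP0", "GRP1", "GRP5")}
print(counts)  # {'GRP0': 2, 'GRP1': 1, 'GRP5': 2}
```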
For the mass spectrometry assay table, we have 240 (saliva) samples, 480 extracts (2 per sample: the "lipids" and "polar" fractions), 960 labeled extracts (2 per extract, as "#replicates" is 2), and 3840 mass spectrometry protocols + 3840 output files (4 per labeled extract, as we run 2 technical replicates with 2 protocol parameter combinations: ["Agilent QTQF 6510", "FIA", "positive mode"] and ["Agilent QTQF 6510", "LC", "positive mode"]).
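The figures in the paragraph above follow from multiplying out the replication factors at each step of the assay workflow:

```python
saliva_samples = 240
extracts = saliva_samples * 2        # "lipids" + "polar" fractions per sample
labeled_extracts = extracts * 2      # 2 labelling replicates per extract

technical_replicates = 2
parameter_combinations = 2           # FIA vs LC, both in positive mode
ms_runs = labeled_extracts * technical_replicates * parameter_combinations

print(extracts, labeled_extracts, ms_runs)  # 480 960 3840
```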
dataframes['a_AT0_metabolite-profiling_mass-spectrometry.txt']
For the NMR assay table, we have 300 (blood) samples, 1200 extracts (4 per sample: 2 extraction replicates of the "supernatant" and "pellet" fractions), and 4800 NMR protocols + 4800 output files (4 per extract, as we run 2 technical replicates with 2 protocol parameter combinations: ["Bruker Avance II 1 GHz", "1D 1H NMR", "CPMG"] and ["Bruker Avance II 1 GHz", "1D 1H NMR", "TOCSY"]).
dataframes['a_AT2_metabolite-profiling_NMR-spectroscopy.txt']
For the ChIP-Seq assay table, we have 300 (blood) samples and 1200 extracts (4 per sample: 2 extraction replicates of the "supernatant" and "pellet" fractions).
dataframes['a_AT16_protein-DNA-binding-site-identification_nucleic-acid-sequencing.txt']
dataframes['a_AT0_metabolite-profiling_mass-spectrometry.txt'].nunique(axis=0, dropna=True)
dataframes['a_AT2_metabolite-profiling_NMR-spectroscopy.txt'].nunique(axis=0, dropna=True)