Code structure for data analysis

Marcos Duarte

Sometimes data from experiments are stored in different files, where each file contains data for a different subject, trial, condition, etc. This text presents a common and simple way to write code to analyze such data.
The basic idea is that the filenames are created in a structured way, and you can use that structure to run a sequence of procedures inside one or more nested loops.
For instance, consider that the first two letters of the filename encode the initials of the subject's name, the next two letters the condition, and the last two characters the trial number.

In [1]:
subjects   = ['AA', 'AB']
conditions = ['c1', 'c2']
trials     = ['01', '02', '03']

We could open and process these files with:

In [2]:
for subject in subjects:
    for condition in conditions:
        for trial in trials:
            filename = subject + condition + trial
            print(filename)
            # read file, process data, save results
AAc101
AAc102
AAc103
AAc201
AAc202
AAc203
ABc101
ABc102
ABc103
ABc201
ABc202
ABc203

The problem with this code is that if one or more files are missing or corrupted (which is typical), it will break. A solution is to read each file inside a try statement. The try...except block handles exceptions such as a failure to read a file, and then we can use a continue statement to skip the failed iteration of the inner loop.
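A minimal sketch of this pattern (the filenames here are hypothetical, just to illustrate the control flow):

import numpy as np

for filename in ['good_file.txt', 'missing_file.txt']:  # hypothetical files
    try:
        data = np.loadtxt(filename, skiprows=1)
    except Exception as err:
        print(filename, 'could not be read:', err)
        continue  # skip the rest of this iteration and move to the next file
    print(filename, 'loaded')
    # process data only when the file was read successfully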
Let's create some files and implement this idea.

Read and save files

If the data are in text (ASCII) format, the easiest way to read a file is with the NumPy function loadtxt or with the pandas function read_csv. Both functions behave similarly; they can skip a given number of initial rows, read files with different column separators, handle both numbers and text, etc. read_csv tends to be faster, but it returns a pandas DataFrame object, which might not be useful if you are not into pandas (but you should be).
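As a rough sketch, assuming a tab-separated file with one header row (like the files created below), reading a file with either function looks like this:

import numpy as np
import pandas as pd

fname = './../data/AAc101.txt'        # one of the files created below
data = np.loadtxt(fname, skiprows=1)  # returns a NumPy array
df = pd.read_csv(fname, sep='\t')     # returns a pandas DataFrame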

To save data to a file, we can use the counterpart functions savetxt and to_csv:

In [3]:
import numpy as np

path = './../data/'
extension = '.txt'

for subject in subjects:
    for condition in conditions:
        for trial in trials:
            filename = path + subject + condition + trial + extension
            data = np.random.randn(5, 3)
            header = 'Col A\tCol B\tCol C'
            np.savetxt(filename, data, fmt='%g',
                       delimiter='\t', header=header, comments='')
            print('File', filename, 'saved')
File ./../data/AAc101.txt saved
File ./../data/AAc102.txt saved
File ./../data/AAc103.txt saved
File ./../data/AAc201.txt saved
File ./../data/AAc202.txt saved
File ./../data/AAc203.txt saved
File ./../data/ABc101.txt saved
File ./../data/ABc102.txt saved
File ./../data/ABc103.txt saved
File ./../data/ABc201.txt saved
File ./../data/ABc202.txt saved
File ./../data/ABc203.txt saved

In my case, I used './../' in the path to move up one directory relative to my current directory (see the cd command).
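Just as an aside (an alternative, not what is used above), the directory and filename could also be assembled with os.path.join, which handles the separator for the operating system:

import os

filename = os.path.join('..', 'data', 'AA' + 'c1' + '01' + '.txt')
print(filename)  # e.g., '../data/AAc101.txt' on Linux/macOS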
Let's remove one of the files:

In [4]:
import os
os.remove('./../data/AAc202.txt')

Now let's read the data in these files and handle a possible missing or corrupted file:

In [5]:
for subject in subjects:
    for condition in conditions:
        for trial in trials:
            filename = path + subject + condition + trial + extension
            try:
                data = np.loadtxt(filename, skiprows=1)
            except Exception as err:
                print(filename, err)          
                continue
            else:
                print(filename, 'loaded')
            
            # process data
            # ...
            # save results
./../data/AAc101.txt loaded
./../data/AAc102.txt loaded
./../data/AAc103.txt loaded
./../data/AAc201.txt loaded
./../data/AAc202.txt [Errno 2] No such file or directory: './../data/AAc202.txt'
./../data/AAc203.txt loaded
./../data/ABc101.txt loaded
./../data/ABc102.txt loaded
./../data/ABc103.txt loaded
./../data/ABc201.txt loaded
./../data/ABc202.txt loaded
./../data/ABc203.txt loaded

Store results

The results of the analysis for each file can be stored in a variable in different ways.
We can store the results in a multidimensional array where each dimension corresponds to one of the loop indices. With the data above, this produces results[s, c, t, :], a 2x2x3x3 array (the last dimension holds the three columns computed for each file). Or we can store everything in a two-dimensional array where, for example, each row corresponds to one combination of subject, condition, and trial.
Let's try both ways:

In [6]:
results = np.empty(shape=(2, 2, 3, 3))*np.nan
for s, subject in enumerate(subjects):
    for c, condition in enumerate(conditions):
        for t, trial in enumerate(trials):
            filename = path + subject + condition + trial + extension
            try:
                data = np.loadtxt(filename, skiprows=1)
            except Exception as err:
                #print(filename, err)          
                continue
            else:
                #print(filename, 'loaded')
                pass
            
            results[s, c, t, :] = np.mean(data, axis=0)
            
print(results.shape)
print(results)
(2, 2, 3, 3)
[[[[-0.49757448 -0.2692196   0.6739528 ]
   [ 0.2925762   0.3684852   0.522114  ]
   [-0.11486849  0.2477564  -0.5365558 ]]

  [[-0.0768988   0.2418202  -0.3577064 ]
   [        nan         nan         nan]
   [ 0.39883422  0.1565358  -0.03583156]]]


 [[[-0.10402126 -0.25288802  0.2824056 ]
   [-0.44776056  0.4005756  -0.3517632 ]
   [ 0.196913   -0.019765   -0.15877938]]

  [[-1.016956    0.615349   -0.4952094 ]
   [-1.0465332   0.3817752   0.211719  ]
   [ 0.0124684  -0.3307858  -0.0447546 ]]]]

One problem with this approach is that with many dimensions the array becomes convoluted and difficult to inspect.
The results for the first subject, condition, and trial are:

In [7]:
results[0, 0, 0, :]
Out[7]:
array([-0.49757448, -0.2692196 ,  0.6739528 ])

We can use the second approach and store the results in a two-dimensional array:

In [8]:
results = np.empty(shape=(2*2*3, 3))*np.nan
results2 = np.empty(shape=(2*2*3, 3))*np.nan
ind = 0
for s, subject in enumerate(subjects):
    for c, condition in enumerate(conditions):
        for t, trial in enumerate(trials):
            ind += 1
            filename = path + subject + condition + trial + extension
            try:
                data = np.loadtxt(filename, skiprows=1)
            except Exception as err:
                #print(filename, err)          
                continue
            else:
                #print(filename, 'loaded')
                pass
            
            # 1st way, using an index:
            results[ind-1, :] = np.mean(data, axis=0)
            # 2nd way, no separate counter: compute the row index from the loop counters (row-major order):
            results2[len(conditions)*len(trials)*s + len(trials)*c + t, :] = np.mean(data, axis=0)
            
print(results.shape)
print(results)
print(results2.shape)
print(results2)
(12, 3)
[[-0.49757448 -0.2692196   0.6739528 ]
 [ 0.2925762   0.3684852   0.522114  ]
 [-0.11486849  0.2477564  -0.5365558 ]
 [-0.0768988   0.2418202  -0.3577064 ]
 [        nan         nan         nan]
 [ 0.39883422  0.1565358  -0.03583156]
 [-0.10402126 -0.25288802  0.2824056 ]
 [-0.44776056  0.4005756  -0.3517632 ]
 [ 0.196913   -0.019765   -0.15877938]
 [-1.016956    0.615349   -0.4952094 ]
 [-1.0465332   0.3817752   0.211719  ]
 [ 0.0124684  -0.3307858  -0.0447546 ]]
(12, 3)
[[-0.49757448 -0.2692196   0.6739528 ]
 [ 0.2925762   0.3684852   0.522114  ]
 [-0.11486849  0.2477564  -0.5365558 ]
 [-0.0768988   0.2418202  -0.3577064 ]
 [        nan         nan         nan]
 [ 0.39883422  0.1565358  -0.03583156]
 [-0.10402126 -0.25288802  0.2824056 ]
 [-0.44776056  0.4005756  -0.3517632 ]
 [ 0.196913   -0.019765   -0.15877938]
 [-1.016956    0.615349   -0.4952094 ]
 [-1.0465332   0.3817752   0.211719  ]
 [ 0.0124684  -0.3307858  -0.0447546 ]]

We can create columns identifying the subject, condition, and trial, which might be useful for running statistical analysis:

In [9]:
results = np.empty(shape=(2*2*3, 3))*np.nan
ind = 0
indexes = []
for s, subject in enumerate(subjects):
    for c, condition in enumerate(conditions):
        for t, trial in enumerate(trials):
            ind += 1
            indexes.append([s, c, t])
            filename = path + subject + condition + trial + extension
            try:
                data = np.loadtxt(filename, skiprows=1)
            except Exception as err:
                #print(filename, err)          
                continue
            else:
                #print(filename, 'loaded')
                pass
            
            results[ind-1, :] = np.mean(data, axis=0)
  
results = np.hstack((np.array(indexes), results))
print(results.shape)
print(results)
(12, 6)
[[ 0.          0.          0.         -0.49757448 -0.2692196   0.6739528 ]
 [ 0.          0.          1.          0.2925762   0.3684852   0.522114  ]
 [ 0.          0.          2.         -0.11486849  0.2477564  -0.5365558 ]
 [ 0.          1.          0.         -0.0768988   0.2418202  -0.3577064 ]
 [ 0.          1.          1.                 nan         nan         nan]
 [ 0.          1.          2.          0.39883422  0.1565358  -0.03583156]
 [ 1.          0.          0.         -0.10402126 -0.25288802  0.2824056 ]
 [ 1.          0.          1.         -0.44776056  0.4005756  -0.3517632 ]
 [ 1.          0.          2.          0.196913   -0.019765   -0.15877938]
 [ 1.          1.          0.         -1.016956    0.615349   -0.4952094 ]
 [ 1.          1.          1.         -1.0465332   0.3817752   0.211719  ]
 [ 1.          1.          2.          0.0124684  -0.3307858  -0.0447546 ]]

These are just some generic approaches for analyzing data stored in multiple files.

And we can save the results in a file:

In [10]:
filename = path + 'results.txt'
header = 'Subject\tCondition\tTrial\tCol A\tCol B\tCol C'
np.savetxt(filename, results, fmt='%d\t%d\t%d\t%g\t%g\t%g',
           delimiter='\t', header=header, comments='')
print('File', filename, 'saved')
File ./../data/results.txt saved
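As a final sketch (assuming you have pandas installed), the saved results file can be read back into a DataFrame, where the Subject, Condition, and Trial columns make it easy to group the data for statistical analysis:

import pandas as pd

df = pd.read_csv('./../data/results.txt', sep='\t')
# example: mean of each measurement per condition (NaN rows are ignored by default)
print(df.groupby('Condition')[['Col A', 'Col B', 'Col C']].mean())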