Marcos Duarte
Laboratory of Biomechanics and Motor Control (http://demotu.org/)
Federal University of ABC, Brazil
Sometimes data from experiments are stored in different files where each file contains data for different subjects, trials, conditions, etc. This text presents a common and simple solution for writing code to analyze such data.
The basic idea is that the name of the file is created in a structured way and you can use that to run a sequence of procedures inside one or more nested loops.
For instance, consider that the first two characters of the filename encode the initials of the subject's name, the next two characters the condition, and the last two characters the trial number.
subjects = ['AA', 'AB']
conditions = ['c1', 'c2']
trials = ['01', '02', '03']
We could open and process these files with:
for subject in subjects:
    for condition in conditions:
        for trial in trials:
            filename = subject + condition + trial
            print(filename)
            # read file, process data, save results
AAc101
AAc102
AAc103
AAc201
AAc202
AAc203
ABc101
ABc102
ABc103
ABc201
ABc202
ABc203
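The same sequence of filenames can also be generated without nesting loops. The following is only a sketch of an alternative, using `itertools.product` from the Python standard library; the list name `filenames` is an assumption introduced for this illustration.

```python
from itertools import product

subjects = ['AA', 'AB']
conditions = ['c1', 'c2']
trials = ['01', '02', '03']

# product() yields every (subject, condition, trial) combination in the
# same order as the three nested loops, with the last list varying fastest.
filenames = [subject + condition + trial
             for subject, condition, trial in product(subjects, conditions, trials)]
print(filenames)
```

Nested loops are kept in the rest of this text because they make the loop indices explicit, which is convenient for storing the results later.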
The problem with this code is that if one or more files are missing or corrupted (which is typical), it will break. A solution is to read the file inside a try statement. The try...except construct handles exceptions such as a failure in reading a file, and we can then use a continue statement to skip each failed iteration in the inner loop.
Let's create some files and implement this idea.
If the data is in text (ASCII) format, it's easier to read the file with the NumPy function loadtxt or with the pandas function read_csv. Both functions behave similarly; they can skip a certain number of initial rows, read files with different column separators, read numbers and letters, etc. read_csv tends to be faster, but it returns a pandas DataFrame object, which might not be useful if you are not into pandas (but you should be).
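As a minimal sketch of these options, the example below reads tab-separated text with loadtxt. The text content is hypothetical and is read from an in-memory buffer (`io.StringIO`) instead of a real file, only to keep the example self-contained.

```python
import io
import numpy as np

# Hypothetical file content: one header row, then tab-separated numbers.
text = 'Col A\tCol B\tCol C\n1\t2\t3\n4\t5\t6\n'

# skiprows=1 skips the header row; delimiter selects the column separator.
data = np.loadtxt(io.StringIO(text), delimiter='\t', skiprows=1)
print(data.shape)
print(data)
```

With pandas, the corresponding call would be `pd.read_csv(io.StringIO(text), sep='\t')`, which uses the header row as column names and returns a DataFrame.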
To save data to a file, we can use the counterpart functions savetxt and to_csv:
import numpy as np
path = './../data/'
extension = '.txt'
for subject in subjects:
    for condition in conditions:
        for trial in trials:
            filename = path + subject + condition + trial + extension
            data = np.random.randn(5, 3)
            header = 'Col A\tCol B\tCol C'
            np.savetxt(filename, data, fmt='%g',
                       delimiter='\t', header=header, comments='')
            print('File', filename, 'saved')
File ./../data/AAc101.txt saved
File ./../data/AAc102.txt saved
File ./../data/AAc103.txt saved
File ./../data/AAc201.txt saved
File ./../data/AAc202.txt saved
File ./../data/AAc203.txt saved
File ./../data/ABc101.txt saved
File ./../data/ABc102.txt saved
File ./../data/ABc103.txt saved
File ./../data/ABc201.txt saved
File ./../data/ABc202.txt saved
File ./../data/ABc203.txt saved
In my case I used the relative path './../' to move up one directory relative to my current directory (see the cd command).
Let's remove one of the files:
import os
os.remove('./../data/AAc202.txt')
Now let's read the data in these files and handle a possible missing or corrupted file:
for subject in subjects:
    for condition in conditions:
        for trial in trials:
            filename = path + subject + condition + trial + extension
            try:
                data = np.loadtxt(filename, skiprows=1)
            except Exception as err:
                print(filename, err)
                continue
            else:
                print(filename, 'loaded')
            # process data
            # ...
            # save results
./../data/AAc101.txt loaded
./../data/AAc102.txt loaded
./../data/AAc103.txt loaded
./../data/AAc201.txt loaded
./../data/AAc202.txt [Errno 2] No such file or directory: './../data/AAc202.txt'
./../data/AAc203.txt loaded
./../data/ABc101.txt loaded
./../data/ABc102.txt loaded
./../data/ABc103.txt loaded
./../data/ABc201.txt loaded
./../data/ABc202.txt loaded
./../data/ABc203.txt loaded
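Catching the generic Exception also hides unrelated bugs, so it may be preferable to catch only the errors expected when reading a file. The sketch below is one possible variation, not the only way to do it: the helper `load_or_none` and the file names are hypothetical, and the files are created in a temporary directory so the example does not touch real data.

```python
import os
import tempfile
import numpy as np

def load_or_none(filename):
    """Return the data in filename, or None if it is missing or unreadable."""
    try:
        return np.loadtxt(filename, skiprows=1)
    except OSError as err:      # missing file, permission problem, etc.
        print(filename, 'could not be opened:', err)
    except ValueError as err:   # content that cannot be converted to numbers
        print(filename, 'could not be parsed:', err)
    return None

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.txt')        # valid file
    bad = os.path.join(tmp, 'bad.txt')          # corrupted file
    missing = os.path.join(tmp, 'missing.txt')  # never created
    with open(good, 'w') as f:
        f.write('Col A\tCol B\n1\t2\n3\t4\n')
    with open(bad, 'w') as f:
        f.write('Col A\tCol B\n1\tx\n')
    loaded = [load_or_none(name) for name in (good, bad, missing)]
```

Inside the nested loops, a returned None would play the role of the failed iteration and could be followed by continue.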
The results of the analysis for each file can be stored in a variable in different ways.
We can store the results in a multidimensional variable where each dimension corresponds to one of the loop indices. With the data above, storing a single value per file would produce results[s, c, t], a 2x2x3 array; storing several values per file adds one more dimension. Or we can store everything in a two-dimensional array where, for example, each row corresponds to one combination of subject, condition, and trial.
Let's try both ways:
# pre-fill with NaN so that entries for missing files remain marked
results = np.full((2, 2, 3, 3), np.nan)
for s, subject in enumerate(subjects):
    for c, condition in enumerate(conditions):
        for t, trial in enumerate(trials):
            filename = path + subject + condition + trial + extension
            try:
                data = np.loadtxt(filename, skiprows=1)
            except Exception as err:
                #print(filename, err)
                continue
            else:
                #print(filename, 'loaded')
                pass
            # mean of each column of the file
            results[s, c, t, :] = np.mean(data, axis=0)
print(results.shape)
print(results)
(2, 2, 3, 3)
[[[[-0.274438   -0.2594482   0.107014  ]
   [ 0.526563    0.1208578  -0.1596212 ]
   [-0.107973    0.5240266   0.59081164]]

  [[-0.6094492  -0.020314    0.8049366 ]
   [        nan         nan         nan]
   [-0.16159672  0.7030814  -0.140353  ]]]


 [[[-1.1080408   0.45704074 -1.1114716 ]
   [-0.4147048  -0.3402416  -0.6172871 ]
   [ 0.21572672  1.3349914   0.14154375]]

  [[-0.4043338  -0.2619066   0.071193  ]
   [-0.03757986  0.7670396  -0.1359376 ]
   [ 0.768179    0.1022648  -0.69671646]]]]
One problem with this approach is that for many dimensions the data gets convoluted and it might be difficult to read it.
The results for the first subject, condition, and trial are:
results[0, 0, 0, :]
array([-0.274438 , -0.2594482, 0.107014 ])
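One advantage of the multidimensional layout is that summaries along a given loop index reduce to a single call. As a sketch, the example below averages across trials while ignoring the NaN entries left by missing files; the array content is hypothetical (consecutive numbers instead of real data) and `mean_trials` is a name introduced only for this illustration.

```python
import numpy as np

# Hypothetical results: 2 subjects x 2 conditions x 3 trials x 3 variables.
results = np.arange(36, dtype=float).reshape(2, 2, 3, 3)
results[0, 1, 1, :] = np.nan  # pretend the file for this trial was missing

# Average across trials (axis 2); nanmean ignores the NaN entries.
mean_trials = np.nanmean(results, axis=2)
print(mean_trials.shape)
print(mean_trials[0, 1])
```

Using np.mean instead of np.nanmean would propagate the NaN and leave the whole subject/condition combination undefined.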
We can use the second approach and store the results in a two-dimensional array:
# pre-fill with NaN so that rows for missing files remain marked
results = np.full((2*2*3, 3), np.nan)
results2 = np.full((2*2*3, 3), np.nan)
ind = 0
for s, subject in enumerate(subjects):
    for c, condition in enumerate(conditions):
        for t, trial in enumerate(trials):
            ind += 1
            filename = path + subject + condition + trial + extension
            try:
                data = np.loadtxt(filename, skiprows=1)
            except Exception as err:
                #print(filename, err)
                continue
            else:
                #print(filename, 'loaded')
                pass
            # 1st way, using an index:
            results[ind-1, :] = np.mean(data, axis=0)
            # 2nd way, no index (row computed from the loop counters):
            row = len(conditions)*len(trials)*s + len(trials)*c + t
            results2[row, :] = np.mean(data, axis=0)
print(results.shape)
print(results)
print(results2.shape)
print(results2)
(12, 3)
[[-0.274438   -0.2594482   0.107014  ]
 [ 0.526563    0.1208578  -0.1596212 ]
 [-0.107973    0.5240266   0.59081164]
 [-0.6094492  -0.020314    0.8049366 ]
 [        nan         nan         nan]
 [-0.16159672  0.7030814  -0.140353  ]
 [-1.1080408   0.45704074 -1.1114716 ]
 [-0.4147048  -0.3402416  -0.6172871 ]
 [ 0.21572672  1.3349914   0.14154375]
 [-0.4043338  -0.2619066   0.071193  ]
 [-0.03757986  0.7670396  -0.1359376 ]
 [ 0.768179    0.1022648  -0.69671646]]
(12, 3)
[[-0.274438   -0.2594482   0.107014  ]
 [ 0.526563    0.1208578  -0.1596212 ]
 [-0.107973    0.5240266   0.59081164]
 [-0.6094492  -0.020314    0.8049366 ]
 [        nan         nan         nan]
 [-0.16159672  0.7030814  -0.140353  ]
 [-1.1080408   0.45704074 -1.1114716 ]
 [-0.4147048  -0.3402416  -0.6172871 ]
 [ 0.21572672  1.3349914   0.14154375]
 [-0.4043338  -0.2619066   0.071193  ]
 [-0.03757986  0.7670396  -0.1359376 ]
 [ 0.768179    0.1022648  -0.69671646]]
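The row number computed from the loop counters in the second way is the usual row-major flattening of the index (s, c, t). As a sketch, NumPy's `np.ravel_multi_index` gives the same number and `np.unravel_index` inverts it; the variable `shape` and the list `rows` are assumptions introduced for this illustration.

```python
import numpy as np

shape = (2, 2, 3)  # number of subjects, conditions, and trials

rows = []
for s in range(shape[0]):
    for c in range(shape[1]):
        for t in range(shape[2]):
            # row computed by hand, as in the loop counters formula
            manual = shape[1]*shape[2]*s + shape[2]*c + t
            # row computed by NumPy (row-major order by default)
            row = np.ravel_multi_index((s, c, t), shape)
            rows.append((manual, int(row)))
print(rows)

# The inverse mapping recovers (s, c, t) from a row number.
print(np.unravel_index(4, shape))
```

Row 4 corresponds to the first subject, second condition, and second trial, which is the file removed earlier and the row filled with nan in the output above.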
We can create columns identifying the subject, condition, and trial, which might be useful for running statistical analysis:
# pre-fill with NaN so that rows for missing files remain marked
results = np.full((2*2*3, 3), np.nan)
ind = 0
indexes = []
for s, subject in enumerate(subjects):
    for c, condition in enumerate(conditions):
        for t, trial in enumerate(trials):
            ind += 1
            # store the identifiers even if the file cannot be read
            indexes.append([s, c, t])
            filename = path + subject + condition + trial + extension
            try:
                data = np.loadtxt(filename, skiprows=1)
            except Exception as err:
                #print(filename, err)
                continue
            else:
                #print(filename, 'loaded')
                pass
            results[ind-1, :] = np.mean(data, axis=0)
# prepend the subject, condition, and trial columns to the results
results = np.hstack((np.array(indexes), results))
print(results.shape)
print(results)
(12, 6)
[[ 0.          0.          0.         -0.274438   -0.2594482   0.107014  ]
 [ 0.          0.          1.          0.526563    0.1208578  -0.1596212 ]
 [ 0.          0.          2.         -0.107973    0.5240266   0.59081164]
 [ 0.          1.          0.         -0.6094492  -0.020314    0.8049366 ]
 [ 0.          1.          1.                 nan         nan         nan]
 [ 0.          1.          2.         -0.16159672  0.7030814  -0.140353  ]
 [ 1.          0.          0.         -1.1080408   0.45704074 -1.1114716 ]
 [ 1.          0.          1.         -0.4147048  -0.3402416  -0.6172871 ]
 [ 1.          0.          2.          0.21572672  1.3349914   0.14154375]
 [ 1.          1.          0.         -0.4043338  -0.2619066   0.071193  ]
 [ 1.          1.          1.         -0.03757986  0.7670396  -0.1359376 ]
 [ 1.          1.          2.          0.768179    0.1022648  -0.69671646]]
These are just some possible generic approaches to analyze data in multiple files.
We can also save the results in a file:
filename = path + 'results.txt'
header = 'Subject\tCondition\tTrial\tCol A\tCol B\tCol C'
np.savetxt(filename, results, fmt='%d\t%d\t%d\t%g\t%g\t%g',
           delimiter='\t', header=header, comments='')
print('File', filename, 'saved')
File ./../data/results.txt saved
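A results file saved in this format can be read back with loadtxt, and the identifier columns can then be used to select rows. The following is a minimal sketch under stated assumptions: the table content is hypothetical, it has a single variable column, and the file is written to a temporary directory instead of the data directory used above.

```python
import os
import tempfile
import numpy as np

# Hypothetical results table: subject, condition, trial, and one variable.
table = np.array([[0, 0, 0, 1.5],
                  [0, 1, 0, 2.5],
                  [1, 0, 0, 3.5],
                  [1, 1, 0, np.nan]])

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'results.txt')
    np.savetxt(filename, table, fmt='%d\t%d\t%d\t%g',
               header='Subject\tCondition\tTrial\tCol A', comments='')
    # skip the header row; 'nan' entries are read back as NaN
    loaded = np.loadtxt(filename, skiprows=1)

# Boolean mask selecting the rows of the second condition (index 1).
cond1 = loaded[loaded[:, 1] == 1]
print(cond1)
print(np.nanmean(cond1[:, 3]))
```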