This document will walk through the steps for parsing and cleaning flow cytometry spreadsheets. The result will be a single R dataframe containing all the data, with standardized column names and formatting.
Required Files:
SIG_WNV_Flow_Data_Cleaning.ipynb
): [Download here]flow_data_cleaning_functions.r
): [Download here]WNV_Data_Dictionary.xlsx
): [Download here]** Note: this notebook can also be downloaded as an R script (only the code blocks seen below will be included): [Download R script here]
Required R Packages:
All code is available on GitHub: https://github.com/biodev/SIG
If you are not familiar with Jupyter Notebooks, I've created a short tutorial to get you up and running quickly. There is also plenty of documentation online:
After finishing this workflow you will have a cleaned dataset ready for exploration and analysis. A notebook with code examples for plotting the data (including interactive plots created with the Shiny library) is available here: [Flow Data Plotting Workflow]
Sometimes, a bit of manual cleaning of the spreadsheets is necessary for the parsing function to work correctly.
final_flow.txt
file (unexpected columns will be printed to the screen to alert the user). If any of the following columns are missing, errors will occur when merging the different panels: 'UNC strain', 'UW strain', 'RIX_ID' (changed from 'Mouse #'), 'Timepoint', 'Tissue' (changed from 'Organ'), 'Total Cell Count'.There are a number of functions in the accompanying R script (flow_data_cleaning_functions.r
) necessary for parsing and then processing the flow cytometry data:
read_flow_exp_file()
: Parses a flow spreadsheet and creates an R dataframe.fix_column_names()
: Standardizes the column names of the above dataframe.calc_treg_counts()
calc_tcell_counts()
calc_ics_counts()
calc_ics_percent_ratios()
calc_ics_count_ratios()
clean_inf_nan()
: Sets any infinite or NaN values to NA.More information on each of these functions is available by calling the describe()
function. For example, the following command will print documentation for the read_flow_exp_file()
function:
describe(read_flow_exp_file)
Remember that, in addition to the help documentation provided with describe()
, you can view the actual function definitions at any time by simply typing the function name without parentheses (e.g. describe
) at the command prompt.
## Load functions for parsing the flow cytometry spreadsheets
## The gdata library is necessary for reading Excel spreadsheets; it will be loaded as well.
source('./scripts/flow_data_cleaning_functions.r')
gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED. gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED. Attaching package: ‘gdata’ The following object is masked from ‘package:stats’: nobs The following object is masked from ‘package:utils’: object.size
## View help documentation on the functions listed above
describe(read_flow_exp_file)
This function parses flow cytometry data from an Excel workbook. Parameters: f: The Excel file name. cn_expected: A character vector containing the expected column names. Returns: A dataframe containing the processed flow cytometry data.
## Load all expected flow variables (expected column names)
flow_cn = read.xls(xls="./data/WNV_Data_Dictionary.xlsx", sheet="Flow Data", as.is=T)
flow_cn = flow_cn[,1]
## Move to the directory holding the data
flow_dir = "/Users/mooneymi/Documents/MyDocuments/SystemsImmunogenetics/WNV/Lund_Flow_fixed_Apr_26"
## Get a list of data files to read (in this case all flow spreadsheets begin with the prefix 'Expt')
flow_files = list.files(flow_dir, pattern="Expt.*\\.xls")
print(flow_files)
[1] "Expt 35 cell counts 04_26_16_fixedMM.xlsx" [2] "Expt 45 cell counts 04_26_16_fixedMM.xlsx"
## Iterate through all the files, parse each, and merge all data into a single dataframe
i = 1
for (file in flow_files) {
print(file)
flow_dat = read_flow_exp_file(file.path(flow_dir, file), flow_cn)
## Check if there are any unexpected columns
new_columns = setdiff(colnames(flow_dat), flow_cn)
if (length(new_columns) > 0) {
flow_cn = c(flow_cn, new_columns)
}
if (i > 1) {
## Fill extra columns with NAs
for (col in new_columns) {
flow_all[,col] = NA
}
## Merge data
flow_all = rbind(flow_all[,flow_cn], flow_dat[,flow_cn])
} else {
flow_all = flow_dat
}
i = i + 1
}
[1] "Expt 35 cell counts 04_26_16_fixedMM.xlsx" [1] "Treg panel" [1] "CD8 d7 T cell" "CD8 d21" [1] "ICS panel" [1] "Expt 45 cell counts 04_26_16_fixedMM.xlsx" [1] "Treg panel" [1] "CD8 d12" "CD8 d21" [1] "ICS panel"
## Check the dimensions of the dataframe
dim(flow_all)
## Check that all expected columns are present
setdiff(flow_cn, colnames(flow_all))
flow_all[1:10,1:11]
ID | Mating | UW_Line | RIX_ID | Timepoint | Virus | Tissue | Lab | Data_Altered | Notes | Total_Cell_Count | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 8024x8049_16 | 8024x8049 | 038 | 16 | 28 | WNV | brain | NA | NA | NA | 8000000 |
2 | 8024x8049_16 | 8024x8049 | 038 | 16 | 28 | WNV | spleen | NA | NA | NA | 27840000 |
3 | 8024x8049_17 | 8024x8049 | 038 | 17 | 28 | WNV | brain | NA | NA | NA | 7040000 |
4 | 8024x8049_17 | 8024x8049 | 038 | 17 | 28 | WNV | spleen | NA | NA | NA | 31680000 |
5 | 8024x8049_3 | 8024x8049 | 038 | 3 | 28 | Mock | brain | NA | NA | NA | 3200000 |
6 | 8024x8049_3 | 8024x8049 | 038 | 3 | 28 | Mock | spleen | NA | NA | NA | 15360000 |
7 | 8024x8049_7 | 8024x8049 | 038 | 7 | 28 | Mock | brain | NA | NA | NA | 3520000 |
8 | 8024x8049_7 | 8024x8049 | 038 | 7 | 28 | Mock | spleen | NA | NA | NA | 19200000 |
9 | 8024x8049_8 | 8024x8049 | 038 | 8 | 28 | Mock | brain | NA | NA | NA | 3680000 |
10 | 8024x8049_8 | 8024x8049 | 038 | 8 | 28 | Mock | spleen | NA | NA | NA | 20160000 |
## Order columns, add Lab column and fix formatting
flow_all = flow_all[, flow_cn]
flow_all$Lab = "Lund"
flow_all$ID = gsub(" ", "", flow_all$ID)
flow_all$ID = gsub("X", "x", flow_all$ID)
flow_all$Mating = gsub(" ", "", flow_all$Mating)
flow_all$Mating = gsub("X", "x", flow_all$Mating)
flow_all$UW_Line = as.numeric(flow_all$UW_Line)
flow_all$Timepoint = as.numeric(flow_all$Timepoint)
## For validation data mock animals do not have IDs, set to NA
flow_all$ID[is.na(flow_all$RIX_ID)] = NA
flow_all[1:10,1:11]
ID | Mating | UW_Line | RIX_ID | Timepoint | Virus | Tissue | Lab | Data_Altered | Notes | Total_Cell_Count | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 8024x8049_16 | 8024x8049 | 38 | 16 | 28 | WNV | brain | Lund | NA | NA | 8000000 |
2 | 8024x8049_16 | 8024x8049 | 38 | 16 | 28 | WNV | spleen | Lund | NA | NA | 27840000 |
3 | 8024x8049_17 | 8024x8049 | 38 | 17 | 28 | WNV | brain | Lund | NA | NA | 7040000 |
4 | 8024x8049_17 | 8024x8049 | 38 | 17 | 28 | WNV | spleen | Lund | NA | NA | 31680000 |
5 | 8024x8049_3 | 8024x8049 | 38 | 3 | 28 | Mock | brain | Lund | NA | NA | 3200000 |
6 | 8024x8049_3 | 8024x8049 | 38 | 3 | 28 | Mock | spleen | Lund | NA | NA | 15360000 |
7 | 8024x8049_7 | 8024x8049 | 38 | 7 | 28 | Mock | brain | Lund | NA | NA | 3520000 |
8 | 8024x8049_7 | 8024x8049 | 38 | 7 | 28 | Mock | spleen | Lund | NA | NA | 19200000 |
9 | 8024x8049_8 | 8024x8049 | 38 | 8 | 28 | Mock | brain | Lund | NA | NA | 3680000 |
10 | 8024x8049_8 | 8024x8049 | 38 | 8 | 28 | Mock | spleen | Lund | NA | NA | 20160000 |
## Check for duplicate IDs
new_flow_ids = paste(flow_all$ID, flow_all$Tissue, sep='_')
sum(duplicated(new_flow_ids))
## Read the previously cleaned data
## Note: you will have to change the file path
cleaned_dir = '/Users/mooneymi/Documents/MyDocuments/SystemsImmunogenetics/WNV/Cleaned_Data_Releases/23-Mar-2016'
flow_prev = read.xls(file.path(cleaned_dir, 'Lund_Flow_21-Mar-2016_final.xlsx'),
header=T, as.is=T, na.strings=c(""," ", "NA", "#DIV/0!"))
dim(flow_prev)
## Check for duplicate IDs
dup_ids1 = intersect(flow_prev$ID[flow_prev$Tissue=='brain'], flow_all$ID[flow_all$Tissue=='brain'])
dup_ids2 = intersect(flow_prev$ID[flow_prev$Tissue=='spleen'], flow_all$ID[flow_all$Tissue=='spleen'])
## Overwrite old data with new
idx1 = which(flow_prev$ID %in% dup_ids1 & flow_prev$Tissue=='brain')
idx2 = which(flow_prev$ID %in% dup_ids2 & flow_prev$Tissue=='spleen')
idx_dups = c(idx1, idx2)
print(length(idx_dups))
[1] 12
idx_to_keep = setdiff(1:nrow(flow_prev), idx_dups)
flow_prev = flow_prev[idx_to_keep,]
flow_all = rbind(flow_prev[, flow_cn], flow_all[, flow_cn])
## Check the dimensions of the dataframe
dim(flow_all)
## Change all data columns to numeric
for (i in 11:277) {
flow_all[,i] = as.numeric(flow_all[,i])
}
## Calculate cell counts and ratios
flow_full = flow_all
flow_full = calc_treg_counts(flow_full)
flow_full = calc_tcell_counts(flow_full)
flow_full = calc_ics_counts(flow_full)
flow_full = calc_ics_percent_ratios(flow_full)
flow_full = calc_ics_count_ratios(flow_full)
flow_full = clean_inf_nan(flow_full)
dim(flow_full)
## Save the data file
write.table(flow_all, file=file.path(flow_dir, 'Lund_Flow_12-May-2016_final.txt'),
col.names=T, row.names=F, quote=T, sep='\t', na='')
## Save the full data file
write.table(flow_full, file=file.path(flow_dir, 'Lund_Flow_Full_12-May-2016_final.txt'),
col.names=T, row.names=F, quote=T, sep='\t', na='')
save(flow_full, file=file.path(flow_dir, 'lund_flow_full_12-May-2016_final.rda'))
Code for plotting this data is available here: [Flow Data Plotting Workflow]