Notebook

Determination of optimal biomarker for the decription of stochastic disease dynamics from logitudinal data¶

The general idea¶

Assume that stochastic disease dynamics can be described by a Markov chain. It means that disease can be roughly characterized by a number of states. And when at the time moment ($t$) the disease is in a particular state ($i$) it has a probability $P_{\Delta t} (j|i)$ to go to another state ($j$) at the next time moment ($t+\Delta t$). If one known all the transition probabilities $P_{\Delta t} (j|i)$, he can evaluet possible disease trajectories starting from current state, and, in particular, compute the probability to end up healthy or dead, roughly speaking, after some time.

How can one determine $P_{\Delta t} (j|i)$? Assume that one has a very large set of patients, whose state is monitored at repeated time intervals of $\Delta t$. By the state we mean the collection of all the parameters of the patient, i.e., metobolome, genomee, proteome, etc, that could be important for the diease. Such a large type of data sets is an example of big longitudinal data. Now, if we have a collection of patients that were at some time moment at state $i$ and were found at the next time moment at different states, the transition probability can be estmated as $P_{\Delta t} (j|i)=n(j|i)/n(i)$, where $n(j|i)$ is the total number of times a patient was found at state $i$ at time $t$ and at state $j$ at time $t+\Delta t$ and $n(i)=\sum_j n(j|i)$. As one can see it is important that each patient has more than one meausrement, at least two, i.e., the data should be longitudinal. There are methods to determine biomarker using non-longitudinal data, however they can not be used to construct a Markov chain model of desease dynamics.

Using longitudinal data and Markov chain model of dynamics allows one to avoid the following problem associated with standard methods that use non-longitudinal data. Such methods divide patients into two groups, e.g., healthy and ill, where one seeks a parameter or a set of parameters that can be used to differentiate between the patients in two groups. The weak point in such methods is the necessity to accurately divide patients into the two groups. Usually one assumes the existence of a gold standard method, that is used for such division. However, such accurate methods do not always exist, and second they are likley to be very accurate only for patients that are very healthy or very ill. The classification of patients in the grey zone, between the two extreme cases, e.g., a patient having a dissease in the initial stage, is likely to be problematic. And hence such patients can not be included in the training sets to determine the biomarker. However diagnostics of such patients is of outmost importance, since it is easier to fight disease at its initial stage. The Markov chain model avoids this problem by constructing the model of the dynamics of the entire disease, so that one can predict the probability to end up in any state of interest, starting from any state of interst. In particular, the probability to end up completely healthy vs very ill, starting from any state of the disease. In other words, one needs only to define the final states, where the patient is either very healthy or very ill, which can be done with a very good accuracy by a golden standard or other reasonably good method.

Another difference with the standard methods of biomarker determination and analysis, which is a consequence of the assumed stochastic model of disease dynamics, is that we predict not outcomes, i.e., whether a patient is ill or healthy, or whether a patient will survive or die, but rather the probabilities of such outcomes. To illuminate the difference consider a hypothetical case with a few absolutely identical patients in absolutely identical conditions. The standard approches demand that they all have identical outcome, and difference in outcome means that we have missed some important characteristic difference between the patients. The stochastic dynamics means that the patients may end up having different outcome, one may day another may survive, and it is not due to inaccuracy of the analysis, but due to inherent stochasticity. Another consequence, is that one may not use the standard ROC analysis of accuracy and specificity for diseases with stochastic dynamics.

Optimal biomarker¶

Determining the complete model of the disease dynamics, i.e., the Markov chain, is a very difficult task, mainly do to exponentially large amount of data required, because of the exponentially large configuration space. And while it might be possible in the future, we are not there yet. One way to significantly restrict the size of the configuration space, and correspondingly, the amount of data required, is to consider just a few important variables, or in extreme case just one. It is clear that one looses a lot of information, thus it is important to select variables in an optimal way, so that the important information is preserved. One such optimal variable, known as the committor function, is an ideal candidate for optimal biomarker. To define it, consider a Markov chain model for a disease dynamics. Define two end or boundary states, corresponding to perfectly healthy or very ill patient. We assume that if a patient, during disease dynamics visits any of these two states, he/she will stay in them forever. The committor function for state $i$, $q(i)$, equals the probability to end up in healthy state rather than in ill state starting from the current state. It is, literally, the likelihood of positive outcome, the most important information one would like to know about the prognosis of the patient dynamics. This optimal variable has a number of nice properties, in particule it can be used to construct simplified models of dynamics, such that they can be used to compute exactly some important properties of the original dynamics. It is extensively used in the analysis of protein folding simulations.

A nice property of the commitor function, useful here, is that it can be constructed without constructing first the complete Markov chain model of the dynamics. For an equilibrium dynamics with detailed balance, one can show that the commitor function satisfies the following variational principle: it provide minimums to the total squared displacement $\Delta r^2$. More specifically, consider a long multidimensional trajectory $\vec{X}(t)$ recorded with time interval $\Delta t$. Let $R(\vec{X})$ be function of configuration space that computes the value of putatie optimal variable $r$. Let $r(t)=R(\vec{X(t)})$ be the time-series of the putative optimal variable. We assume that the two end/boundary states, defined as $A$ and $B$ are mapped to $r(A)=0$ and $r(B)=1$. The total squared displacment is defined as $\Delta r^2=\sum_t [r(t+\Delta t)-r(t)]^2$. Then $\Delta r^2$ reachies minimum when $r(i)$ equals $q(i)$ the committor function for the Markov chain with the boundary states A and B. Which suggest the following approach for determining $q$. One suggest a functional form with many parameters $r=R(\vec{X},\alpha_i)$ as an approximation to $q$. The parameters $\alpha_i$ are determined to provide minimum to the $\Delta r^2$ functional. If function $R$ linearly dependes of $\alpha_i$, then their values can be found analytically. In case one have no idea how the functional form $R(\vec{X},\alpha_i)$ may look like, one may determine $q(i)$ in a non-parametric way, without using $R(\vec{X},\alpha_i)$ at all. These ideas has been sucessfully applied to the analysis of protein folding simulations.

One complication in using the described approach to determine the committor or optimal biomarker from patient trajectories, is that oned does not have a single long equilbrium trajectory but rather a set of short trajectories for different patients. It means that the sampling is not at equilbrium and the detailed balance, which was assumed in the derivation of the variational principle is not satisfied. A simple idea, that has been sucessfuly tested, is to assume that changes to optimal variable due to non-equilbrium sampling are negligible, so that one can use the variational functional without modifications. After the optiaml biomarker has been deterrmined, one may unbias the sampling, to obtian equilibrium properties of the dynamics. We will use such an approach here.

The overall analysis workflow¶

The simplest analysis consists of the following steps:

Reading in the patients data using the panda library.
After reading the data might be preprocessed. For example, some of the variables may be removed. The NMR spectra, used in metabolomics, could be coarsegrained. The variables could be transformed, popular transformations include $\sqrt{x}$, $\log(x+\epsilon)$. More advanced steps may include the variable or feature selection by using e.g., the lasso or $L_1$ regularisation.
An optimal biomarker is constructed from all the variables by using subroutines from the module tools. In particular, commitor function determines optimal biomarker in a supervised way, using the known clinical outcomes for the patients. The evec function determines a few optimal biomarkers in an unsupervised way, by determining a few lowest eigenmodes/eigenvectors of the stochastic disease dynamics.
Visualisation of the results. Informative plots may include the following. Patient trajectories: plot of optimal biomarker time-series for each patient. It shows how the state of a patient changes with time. - Free energy landscape of the disease dynamics. This plot provides a simple, while quantitatively accurate picture of the stochastic disease dynamics as diffusion along the optimal biomarker coordinate. The picture can be supplemented with the commitor function as the function of the optimal biomarker. While optimal biomarker and committor function a should be the same, due to biased sampling, small number of parameters or poor approximation via functional form, they may differ. Leave-one-out cross validation provides cross validation test, and can be used to demonstrate the robustness of the results, that there is no overfitting, and that the obtained optimal biomarker can be used to analyze new patients.

Analysis of dynamics of patient recovery after kidney transplant¶

We will analyse the stochastic dynamics of patient recovery after kidney transplant. The biomarker is constructed based on NMR spectra of blood from 18 patients, taken before, during, and after the operation (upto 7 days). More details can be found in Krivov et al. PLoS Comput. Biol. 10 (2014) e1003685.

Reading data¶

In [ ]:

import pandas as pd
data=pd.read_excel('test.xlsx')
data.head()

The patients data are stored in data object, which is an instance of DataFrame object of pandas module. data.head() gives a summary (first 5 rows) of its content, e.g., it shows that we have 140 samples.

In [ ]:

data.shape

The shape of the data object shows that we have 31427 variables i.e., the NMR spectral lines. The first column shows the NMR shift in PPM.

Before constructng the optimal biomarker, one may want to visualize the spectra. To do this for patient k1 one can use the plot function of pandas:

In [ ]:

%matplotlib inline
import matplotlib.pyplot as plt
ax=plt.gca()
data.plot(x='ppm',y='k1 pre',ax=ax)
data.plot(x='ppm',y='k1 int',ax=ax)
data.plot(x='ppm',y='k1 post',ax=ax)
data.plot(x='ppm',y='k1 01',ax=ax)
data.plot(x='ppm',y='k1 03',ax=ax)
data.plot(x='ppm',y='k1 04',ax=ax)
data.plot(x='ppm',y='k1 07',ax=ax)      

As one can see the spectra are very complex and it is difficult to see the difference between different outcomes.

To compute the optimal biomarker - the committor, we first need to coarsegrain the spectra, otherwise the number of variables is too large, which leads to overfitting, as we will see later. The following will truncate the data to the segment ppm min=0 and ppm max=8.0 and bin it with interval dppm=0.32. Other prepocessing of the data can be done at this point. For example, we remove samples taken before and after the operation 'pre' and 'post' and rename sample 'int' (during operation) to day 0. We also convert to MultiIndex to handle longitudinal data more transparently.

Coarsegraning the data¶

In [ ]:

def cleanup(data): #modified to work with pandas 0.24.2
    for c in data.columns[1:]:  # remove pre and post and convert int to 0
        pat,time=c.split()
        if time=='pre' or time=='post':data.drop(columns=c,inplace=True)
        if time=='int':data.rename(columns={c:pat+' 0'},inplace=True)
    data.set_index('ppm',inplace=True) # move ppm column to index
    multi={} # create multiIndex: first index - patient, second index - time
    for c in data.columns:
        pat,time=c.split()
        time=float(time)
        multi[c]=pat,time
    data.rename(columns=multi,inplace=True)
    mi=pd.MultiIndex.from_tuples(data.columns)
    data=data.reindex(columns=mi)
    data.columns.names=['patient','time']
    return data
            
def coarsegrain(data,ppmmin,ppmmax,dppm):
    import numpy
    bins=numpy.arange(ppmmin,ppmmax+dppm+1e-5,dppm)
    cdata=data.groupby(pd.cut(data.index, bins=bins)).sum()
    cdata.index.names=['ppm']
    return cdata

data=cleanup(data)
cdata=coarsegrain(data,ppmmin=0,ppmmax=8.0,dppm=0.32)

Coarse-grained data with MultiIndex looks like

In [ ]:

cdata

To see a subset of data for patient 'k1'

In [ ]:

cdata['k1']

Due to MultiIndex, visualization now can be done more strighforwardly. The coarse-grained spectra for patient 'k1':

In [ ]:

cdata['k1'].plot.bar(figsize=(12,4))

Determining the committor¶

To compute the commitor one needs to define the two boundary or end states, which is usually done by clinical assesment. In our case the patients, based on a clinical assesment at the last day, are divided into three groups:

AR - Acute Rejection
DGF - Delayed Graft Function
PF - Perfect Function

In [ ]:

AR=['k8', 'k12', 'k14', 'k21', 'k27']
DGF=['k4', 'k7', 'k9', 'k15', 'k17', 'k18', 'k24']
PF=['k1', 'k2', 'k3', 'k5', 'k6', 'k10', 'k11', 'k13', 'k16', 'k25', 'k26', 'k33', 'k34']

Lets consider dynamics between PF and DGF states. Correspondingly, we define the boundary nodes A and B as

In [ ]:

DGF_PF={}
for pat in DGF:DGF_PF[pat]='B'
for pat in PF:DGF_PF[pat]='A'

Now we are ready to compute an approximation to the committor function, as a linear combination of variables in the sample $q=\sum_\alpha c_\alpha x_\alpha$, or to project the multidimensional data onto the committor coordinate.

In [ ]:

def p2committor(data,AB,maskx=None,lxuse=None,lmb=0):
    import numpy
    nx,nt=data.shape
    if maskx is None:maskx=numpy.ones(nx,'i')
    if lxuse is not None:
      maskx=numpy.zeros(len(lppm),'i')
      for i in lxuse: maskx[i]=1
    i21=[i for i in range(nx) if maskx[i]==1]
    nx2=len(i21)
    import numpy
    lx=numpy.zeros((nx2,nt),'float64') # array of samples/variables
    bval=numpy.zeros(nt,'int')        # boundary values for supervised optimization
    lp=numpy.zeros(nt,'int')          # list/index of patients
    skip=numpy.zeros(nt,'int')       # index of patients to skip, used for cross-validation
    
    ipat={}# patient to number
    for it in range(nt):
        pat,time=data.columns[it]
        if pat not in ipat:ipat[pat]=len(ipat)+1
        if ipat[pat]<len(ipat): print ('patient samples are not continous',pat,time)
        bval[it]=0
        lp[it]=ipat[pat]
        for j in range(nx2):lx[j,it]=data.iloc[j,it]
        if it==nt-1:
            bval[it]=2
        else:
            npat,ntime=data.columns[it+1]
            if npat!=pat:bval[it]=2
        if bval[it]==2:
          if pat in AB and (AB[pat]==1 or AB[pat]=='B'): bval[it]=1  # supervised
          if pat in AB and (AB[pat]==-1 or AB[pat]=='A'): bval[it]=-1  # supervised
    #lx=data.to_numpy() works for future pandas
    
    import tools
    la,lrc,dx2,info=tools.committor(lx,lp,bval,skip,lmb=lmb)
    
    d={}
    for it in range(nt):
        c=data.columns[it]
        rc=(lrc[it]+1)/2
        d[c]=[rc,]
    qdata = pd.DataFrame(d)
    qdata.columns.names=['patient','time']
    qdata.rename(index={0:'q'},inplace=True)
    
    la2=numpy.zeros(nx,'float64')
    for j in range(nx2): la2[i21[j]]=la[j]
    d={'ppm':data.index,'ai':la2}
    ai=pd.DataFrame(d,columns=['ppm','ai'])

    return qdata,ai,dx2/4.

qdata,ai,dx2=p2committor(cdata,DGF_PF)

The DataFrame object qdata now contains time series of the optimal biomarker ($q$) or its approximation for each patient. The multidimensional signal was optimally projected onto the single coordinate.

In [ ]:

qdata

Or to see the $q$ time series for a particular patient, e.g. 'k1':

In [ ]:

qdata['k1']

The coefficients $c_\alpha$ are in the other DataFrame object ai. It is a long column and to save space, we print it as a row - we print its transpose;

In [ ]:

ai.T

To visualize how the optimal biomarker changes with time for each patient, i.e., the patient "trajectories", we proceed as follows. Patients in groups AR, DGF and PF are shown by red, blue and black colors:

In [ ]:

def plot_ptraj(data,ltype='-'):
    dpat={}
    for pat,time in data.columns:
        if pat not in dpat:dpat[pat]=[]
        dpat[pat].append((float(time),data[(pat,time)]))

    import matplotlib.pyplot as plt
    for pat in dpat:
        lt=[t for t,x in dpat[pat]]
        lx=[x for t,x in dpat[pat]]
        if pat in AR:plt.plot(lt,lx,'r'+ltype)
        if pat in DGF:plt.plot(lt,lx,'b'+ltype)
        if pat in PF:plt.plot(lt,lx,'k'+ltype)

plot_ptraj(qdata)

The same plot using the pandas:

In [ ]:

ax=plt.gca()
for pat in qdata.columns.levels[0]:
    if pat in AR:style='r-'
    if pat in DGF:style='b-'
    if pat in PF:style='k-'
    qdata[pat].T.plot(ax=ax,legend=False,style=style)

To see the coefficients $c_\alpha$ with which variable $x_\alpha$ contributes to the optimal coordinate:

In [ ]:

ai.plot.bar(x='ppm',y='ai',stacked=True,color='blue',legend=False,figsize=(8,4))

As one can see the optimal biomarker has separated patients in groups DGF (blue) and PF (black). How robust is this result? Does this variable indeed separate the patients? In principle, given large enough number of variables one can always find a combination that will separate the patients. In other words, is there an overfitting? Can we use this variable to analyse a new patient. These all are serious qeustions that require serious statistical analysis and more data. However, a leave-one-out cross validation can be used to provide a support to the analysis and tentative positive answer. In leave-one-out corss validation one computes the optimal coordinate for patient without using data for this patient, i.e., each patient is used as a new patient.

In [ ]:

def p2committor_cv(data,AB,maskx=None,lxuse=None,lmb=0,keeplastpoint=False):
    import numpy
    nx,nt=data.shape
    if maskx is None:maskx=numpy.ones(nx,'i')
    if lxuse is not None:
      maskx=numpy.zeros(len(lppm),'i')
      for i in lxuse: maskx[i]=1
    i21=[i for i in range(nx) if maskx[i]==1]
    nx2=len(i21)
    import numpy
    lx=numpy.zeros((nx2,nt),'float64') # array of samples/variables
    bval=numpy.zeros(nt,'int')        # boundary values for supervised optimization
    lp=numpy.zeros(nt,'int')          # list/index of patients
    skip=numpy.zeros(nt,'int')       # index of patients to skip, used for cross-validation
    
    ipat={}# patient to number
    for it in range(nt):
        pat,time=data.columns[it]
        if pat not in ipat: ipat[pat]=len(ipat)+1
        bval[it]=0
        lp[it]=ipat[pat]
        for j in range(nx2):lx[j,it]=data.iloc[j,it]
        if it==nt-1:
            bval[it]=2
        else:
            npat,ntime=data.columns[it+1]
            if npat!=pat:bval[it]=2
        if bval[it]==2:
          if pat in AB and (AB[pat]==1 or AB[pat]=='B'): bval[it]=1  # supervised
          if pat in AB and (AB[pat]==-1 or AB[pat]=='A'): bval[it]=-1  # supervised

    import tools
    lrc_cv=[]
    for pskip in range(1,len(ipat)+1): # the patient to skip
        skip=numpy.zeros(nt,'int') # skip patient if nonzero
        if keeplastpoint:
            for it in range(nt-1):
                if lp[it]==pskip and lp[it+1]==pskip:skip[it]=1 # skip samples for the patient
        else:
            for it in range(nt):
                if lp[it]==pskip:skip[it]=1 # skip samples for the patient
        la0,lrc0,dx20,info0=tools.committor(lx,lp,bval,skip,lmb=lmb)
        for it in range(nt):
            if lp[it]==pskip: 
                lrc_cv.append(lrc0[it]) # add rc values for the patient

    d={}
    for it in range(nt):
        c=data.columns[it]
        d[c]=[(lrc_cv[it]+1)/2,]
    qdata_cv = pd.DataFrame(d) 
    qdata_cv.columns.names=['patient','time']
    qdata_cv.rename(index={0:'q_cv'},inplace=True)

    dx2=0
    for it in range(nt):
        if bval[it]==0: dx2+=(lrc_cv[it+1]-lrc_cv[it])**2
        elif bval[it]!=2:dx2+=(bval[it]-lrc_cv[it])**2
        
    return qdata_cv,dx2/4.

qdata_cv,dx2_cv=p2committor_cv(cdata,DGF_PF)
print (dx2,dx2_cv)

plot_ptraj(qdata,'-')
plot_ptraj(qdata_cv,':')

As one can see the leave-one-out cross validation shows that trajectories computed with (dashed lines) and without cross validation (solid lines) are rather close to each other. It suggests that the analysis is robust, i.e., the inclusion of new data does not change significantly the biomarker. There is no overfitting. And that it can be used to analyse new patients. Note that difference between the $\Delta r^2$ numbers is relatively large, suggesting that some overfitting is still hapenning.

Unsupervised analysis¶

The analysis above used clinical information in terms of the boundary states, i.e., it was supervised. It is possible to perform an unsupervised analysis as well. It can be used for the initial exploratory data analysis or for dimensionality reduction. It consists in projecting the stochastic dynamics on the eigenvectors of the transition matrix. It can be considered as an analog of PCA for longitudinal data. Interstingly, as we will see, the first eigenvector separates the DGF and PF states similar to the supervised analysis above, but without using clinical information.

In [ ]:

def p2evecs(data,lmb=0):
    import numpy
    nx,nt=data.shape
    import numpy
    lx=numpy.zeros((nx,nt),'float64') # array of samples/variables
    lp=numpy.zeros(nt,'int')          # list/index of patients
    skip=numpy.zeros(nt,'int')        # index of patients to skip, used for cross-validation
    
    ipat={}# patient to number
    for it in range(nt):
        pat,time=data.columns[it]
        if pat not in ipat:ipat[pat]=len(ipat)+1
        if ipat[pat]<len(ipat): print ('patient samples are not continous',pat,time)
        lp[it]=ipat[pat]
        for j in range(nx):lx[j,it]=data.iloc[j,it]
    #lx=data.to_numpy() works for future pandas
    
    import tools
    evecs,info=tools.evecs(lx,lp,skip,lmb=0)
    
    d={}
    for it in range(nt):
        c=data.columns[it]
        d[c]=[evecs[ix,it] for ix in range(nx)]
    evdata = pd.DataFrame(d)
    evdata.columns.names=['patient','time']
    evdata.index.names=['evecs']
    
    return evdata

evecs=p2evecs(cdata)

In [ ]:

evecs

To visualise patients trajectories projected on the first eigenvector (iev=1), we use pandas MultiIndex. Note that the zeroth eigenvectors (iev=0) is always constant.

In [ ]:

iev=1
ax=plt.gca()
for pat in evecs.columns.levels[0]:
    if pat in AR:style='r-'
    if pat in DGF:style='b-'
    if pat in PF:style='k-'
    evecs[pat].T[iev].plot(ax=ax,legend=False,style=style)

It is interesting to see if other eigenvectors can be used to provide separation between other clinical conditions. However, direct inspection shows that it is not the case here. One possibility is that at this level of coarsegraining the variables contain no signal describing other outcomes and such a signal can appear at lower coarsegraning with more variables. However using many variables may lead to overfitting, which we discuss next.

Regularization¶

The clinicians distingish three different outcomes AR, PF and DGF. The determined biomarker, as can be seen from the previos pictures allows one to distinguish between PF and DGF. It is of clinical interest to find a biomarker that can predicts AR, in order, to e.g., suppres immune rejection of the graft. In order to determine it we, correspondingly, define the boundary states as AR and PF+DGF and proceed as follows

In [ ]:

AR_PFDGF={}
for pat in AR:AR_PFDGF[pat]='B'
for pat in PF:AR_PFDGF[pat]='A'
for pat in DGF:AR_PFDGF[pat]='A'
qdata2,ai2,dx22=p2committor(cdata,AR_PFDGF)
plot_ptraj(qdata2,'-')
print  (dx22)

The figure shows that the obtained coordinate does not separate the patients trajectories, i.e., it will serve as a poor biomarker. One posible reason could be that due to the usage of coarsegrained variables, the signal has been lost. We keep the optimistic view that the original high-dimensional NMR spectra contains the information necessary to accurately descrie the dynamics. Lets decrease the level of coarsegraining and repeate the analysis:

In [ ]:

cdata=coarsegrain(data,ppmmin=0,ppmmax=8.0,dppm=0.1)
qdata,ai,dx2=p2committor(cdata,AR_PFDGF)
plot_ptraj(qdata,'-')

It looks like a success! The obtained biomarker obviosly separates the AR cohort from the rest! The higher resolution NMR spectra contains the important signal! However, lets perform the cross validation, in particular to see, if the obtained biomarker can be used to analyse new patients.

In [ ]:

qdata_cv,dx2_cv=p2committor_cv(cdata,AR_PFDGF)
plot_ptraj(qdata_cv,':')

The results are obviosly poor! Even if the new higher resolution spectra contained the importnat signal, the large number of variables leads to overfitting. There are a number of approaches to avoid overfitting. Optimally one should use feature or variable selection, i.e., select only those variables which contain the importnat signal. To do that one can use the Lasso method. He we will fight overfitting with $L^2$ regularization, which is a penalty term, which penalizing overfitting. Namely we find the optimal coordinate by optimizing the following functional $\Delta r^2 +\lambda\sum_\alpha c_\alpha^2$, where $c_\alpha$ are the coefficients with which variable $x_\alpha$ contributes to the optimal coordinate $r=\sum_\alpha c_\alpha x_\alpha$. The optimal value for $\lambda$ can be found by monitoring $\Delta r^2$ of the cross-validated trajectories. The Lasso method, which can be used for variable selection uses the $L^1$ penalty $\sum_\alpha |c_\alpha|$.

Lets repeate the previous analysis, but now with a penalty term. To find the optimal value of $\lambda$ we scan for $\lambda$ and select the one with the minimal cross-validated $\Delta r^2$. We also increased the resolution to dppm$=0.01$

In [ ]:

cdata=coarsegrain(data,ppmmin=0,ppmmax=8.0,dppm=0.01)
llmb=[0.0001*i for i in range(1,10)]
for lmb in llmb:
    qdata,ai,dx2=p2committor(cdata,AR_PFDGF,lmb=lmb)
    qdata_cv,dx2_cv=p2committor_cv(cdata,AR_PFDGF,lmb=lmb,keeplastpoint=True) 
    print ('%6.2g %5.2f %5.2f' %(lmb,dx2,dx2_cv))

In [ ]:

lmb=0.0003
qdata,ai,dx2=p2committor(cdata,AR_PFDGF,lmb=lmb)
qdata_cv,dx2_cv=p2committor_cv(cdata,AR_PFDGF,lmb=lmb,keeplastpoint=True) 
print (lmb,dx2,dx2_cv)
plot_ptraj(qdata,'-')
plot_ptraj(qdata_cv,':')

This picture shows that using the penalty term removes the overfitting. And, arguably, the picture shows a better separation between the AR vs PF+DGF states, than the one before. Even better description might be obtained with the variable selection method, e.g. LASSO, followed by the analysis without regularization penalty, which we dont attempt here.