Parallel GST using MPI Tutorial

The purpose of this tutorial is to demonstrate how to compute GST estimates in parallel (using multiple CPUs or "processors"). The core pyGSTi computational routines are written to take advantage of multiple processors via the MPI communication framework, so you must have an MPI implementation and the mpi4py Python package installed in order to run pyGSTi calculations in parallel.

Since mpi4py doesn't play nicely with Jupyter notebooks, this tutorial is a bit clunkier than the others. In it, we will create a standalone Python script that imports mpi4py and then execute it.

We will use as an example the same "standard" single-qubit gate set of the first tutorial. We'll first create a dataset, and then a script (to be run in parallel) which loads the data. Simulated data is created in the same way as in the first tutorial. Since random numbers are generated and used as simulated counts within the call to generate_fake_data, it is important that this not be done in a parallel environment, or different CPUs may end up with different data sets. (This isn't an issue in the typical situation where the data is obtained experimentally.)

In [1]:
#Import pyGSTi and the "standard 1-qubit quantities for a gateset with X(pi/2), Y(pi/2), and idle gates"
import pygsti
from pygsti.construction import std1Q_XYI

#Create a data set
gs_target = std1Q_XYI.gs_target
fiducials = std1Q_XYI.fiducials
germs = std1Q_XYI.germs
maxLengths = [0,1,2,4,8,16,32]

gs_datagen = gs_target.depolarize(gate_noise=0.1, spam_noise=0.001)
listOfExperiments = pygsti.construction.make_lsgst_experiment_list(gs_target.gates.keys(), fiducials, fiducials, germs, maxLengths)
ds = pygsti.construction.generate_fake_data(gs_datagen, listOfExperiments, nSamples=1000,
                                            sampleError="binomial", seed=1234)
pygsti.io.write_dataset("tutorial_files/mpi_example_dataset.txt", ds)
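To see why the fixed seed matters, here is a toy stand-in for the data-generation step (toy_fake_counts is an illustrative helper, not part of pyGSTi's API): if each MPI rank drew its own random counts, their datasets would almost surely disagree, whereas generating the data once with a single seed, as above, guarantees every rank later loads identical counts.

```python
import random

def toy_fake_counts(seed, n_strings=5, n_samples=1000):
    """Toy stand-in for generate_fake_data: one binomial count per string."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(n_samples))
            for _ in range(n_strings)]

# If each of 3 ranks seeded its own generator (e.g. by rank number),
# the resulting "datasets" would almost surely differ from one another.
per_rank = [toy_fake_counts(seed=rank) for rank in range(3)]

# Generating once with a single fixed seed gives identical data everywhere.
shared = [toy_fake_counts(seed=1234) for _ in range(3)]
assert shared[0] == shared[1] == shared[2]
```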

Next, we'll write a Python script that will load in the just-created DataSet, run GST on it, and write the output to a file. The only major difference between the contents of this script and previous examples is that the script imports mpi4py and passes an MPI communicator object (comm) to the do_long_sequence_gst function. Since parallel computing is best used for computationally intensive GST calculations, we also demonstrate how to set a per-processor memory limit, which tells pyGSTi to partition its computations so as to not exceed this memory usage. Lastly, note the use of the gaugeOptParams argument of do_long_sequence_gst, which can be used to weight different gate set members differently during gauge optimization.

In [2]:
mpiScript = """
import time
import pygsti
from pygsti.construction import std1Q_XYI

#get MPI comm
from mpi4py import MPI
comm = MPI.COMM_WORLD

print("Rank %d started" % comm.Get_rank())

#define target gateset, fiducials, and germs as before
gs_target = std1Q_XYI.gs_target
fiducials = std1Q_XYI.fiducials
germs = std1Q_XYI.germs
maxLengths = [0,1,2,4,8,16,32]

#tell gauge optimization to weight the gate matrix
# elements 100x more heavily than the SPAM vector elements, and
# to specifically weight the Gx gate twice as heavily as the other
# gates.
goParams = {'itemWeights':{'spam': 0.01, 'gates': 1.0, 'Gx': 2.0} }

#Specify a per-core memory limit (useful for larger GST calculations)
memLim = 2.1*(1024)**3  # 2.1 GB

#Perform TP-constrained GST
gs_target.set_all_parameterizations("TP")
    
#load the dataset
ds = pygsti.io.load_dataset("tutorial_files/mpi_example_dataset.txt")

start = time.time()
results = pygsti.do_long_sequence_gst(ds, gs_target, fiducials, fiducials,
                                      germs, maxLengths, memLimit=memLim,
                                      gaugeOptParams=goParams, comm=comm,
                                      verbosity=2)
end = time.time()
print("Rank %d finished in %.1fs" % (comm.Get_rank(), end-start))
if comm.Get_rank() == 0:
    import pickle
    pickle.dump(results, open("tutorial_files/mpi_example_results.pkl","wb"))
"""
with open("tutorial_files/mpi_example_script.py","w") as f:
    f.write(mpiScript)
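As a rough guide for choosing memLim, one common approach is to divide a node's RAM among the ranks running on it, leaving headroom for the OS and MPI buffers. A minimal sketch (the 16 GiB node and 4-ranks-per-node figures are hypothetical, not taken from this tutorial):

```python
# Hypothetical node: 16 GiB of RAM shared by 4 MPI ranks.
node_ram_gib = 16
ranks_per_node = 4
headroom = 0.8  # budget at most 80% of RAM for the GST calculation itself

# Per-rank limit in bytes, suitable for the memLimit argument used above
memLim = int(headroom * node_ram_gib * (1024)**3 / ranks_per_node)
print("%.2f GiB per rank" % (memLim / (1024)**3))  # prints "3.20 GiB per rank"
```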

Next, we run the script with 3 processors using mpiexec. The mpiexec executable should have been installed with your MPI distribution -- if it doesn't exist, try replacing mpiexec with mpirun.

In [3]:
! mpiexec -n 3 python3 "tutorial_files/mpi_example_script.py"
Rank 0 started
Rank 1 started
Rank 2 started
--- LGST ---
  Singular values of I_tilde (truncating to first 4 of 6) = 
  4.24507306085
  1.15858456866
  0.96719520065
  0.921798550187
  0.070140897297
  0.0208719086309
  
  Singular values of target I_tilde (truncating to first 4 of 6) = 
  4.24264068712
  1.41421356237
  1.41421356237
  1.41421356237
  3.3483200348e-16
  2.72548486209e-16
  
--- Iterative MLGST: Iter 1 of 7  92 gate strings ---: 
  --- Minimum Chi^2 GST ---
  Memory limit = 2.10GB
  Cur, Persist, Gather = 0.08, 0.00, 0.21 GB
  Evaltree generation (deriv) w/mem limit = 1.81GB
   mem(1 subtrees, 3,1 param-grps, 1 proc-grps) in 0s = 0.00GB (0.00GB fc)
  Created evaluation tree with 1 subtrees.  Will divide 3 procs into 1 (subtree-processing)
   groups of ~3 procs each, to distribute over 43 params (taken as 3 param groups of ~14 params).
   Memory estimate = 0.00GB (cache=92, wrtLen1=15, wrtLen2=43, subsPerProc=1).
  Sum of Chi^2 = 86.1192 (92 data params - 31 model params = expected mean of 61; p-value = 0.0188009)
  Completed in 0.6s
  2*Delta(log(L)) = 86.4539
  Iteration 1 took 0.6s
  
--- Iterative MLGST: Iter 2 of 7  92 gate strings ---: 
  --- Minimum Chi^2 GST ---
  Memory limit = 2.10GB
  Cur, Persist, Gather = 0.09, 0.00, 0.21 GB
  Evaltree generation (deriv) w/mem limit = 1.80GB
   mem(1 subtrees, 3,1 param-grps, 1 proc-grps) in 0s = 0.00GB (0.00GB fc)
  Created evaluation tree with 1 subtrees.  Will divide 3 procs into 1 (subtree-processing)
   groups of ~3 procs each, to distribute over 43 params (taken as 3 param groups of ~14 params).
   Memory estimate = 0.00GB (cache=92, wrtLen1=15, wrtLen2=43, subsPerProc=1).
  Sum of Chi^2 = 86.1192 (92 data params - 31 model params = expected mean of 61; p-value = 0.0188009)
  Completed in 0.4s
  2*Delta(log(L)) = 86.4539
  Iteration 2 took 0.4s
  
--- Iterative MLGST: Iter 3 of 7  168 gate strings ---: 
  --- Minimum Chi^2 GST ---
  Memory limit = 2.10GB
  Cur, Persist, Gather = 0.09, 0.00, 0.21 GB
  Evaltree generation (deriv) w/mem limit = 1.80GB
   mem(1 subtrees, 3,1 param-grps, 1 proc-grps) in 0s = 0.00GB (0.00GB fc)
  Created evaluation tree with 1 subtrees.  Will divide 3 procs into 1 (subtree-processing)
   groups of ~3 procs each, to distribute over 43 params (taken as 3 param groups of ~14 params).
   Memory estimate = 0.00GB (cache=168, wrtLen1=15, wrtLen2=43, subsPerProc=1).
  Sum of Chi^2 = 179.782 (168 data params - 31 model params = expected mean of 137; p-value = 0.00830446)
  Completed in 0.7s
  2*Delta(log(L)) = 180.545
  Iteration 3 took 0.8s
  
--- Iterative MLGST: Iter 4 of 7  441 gate strings ---: 
  --- Minimum Chi^2 GST ---
  Memory limit = 2.10GB
  Cur, Persist, Gather = 0.09, 0.00, 0.21 GB
  Evaltree generation (deriv) w/mem limit = 1.80GB
   mem(1 subtrees, 3,1 param-grps, 1 proc-grps) in 0s = 0.00GB (0.00GB fc)
  Created evaluation tree with 1 subtrees.  Will divide 3 procs into 1 (subtree-processing)
   groups of ~3 procs each, to distribute over 43 params (taken as 3 param groups of ~14 params).
   Memory estimate = 0.00GB (cache=441, wrtLen1=15, wrtLen2=43, subsPerProc=1).
  Sum of Chi^2 = 471.249 (441 data params - 31 model params = expected mean of 410; p-value = 0.0195172)
  Completed in 0.8s
  2*Delta(log(L)) = 472.317
  Iteration 4 took 0.8s
  
--- Iterative MLGST: Iter 5 of 7  817 gate strings ---: 
  --- Minimum Chi^2 GST ---
  Memory limit = 2.10GB
  Cur, Persist, Gather = 0.09, 0.00, 0.21 GB
  Evaltree generation (deriv) w/mem limit = 1.80GB
   mem(1 subtrees, 3,1 param-grps, 1 proc-grps) in 0s = 0.00GB (0.00GB fc)
  Created evaluation tree with 1 subtrees.  Will divide 3 procs into 1 (subtree-processing)
   groups of ~3 procs each, to distribute over 43 params (taken as 3 param groups of ~14 params).
   Memory estimate = 0.00GB (cache=823, wrtLen1=15, wrtLen2=43, subsPerProc=1).
  Sum of Chi^2 = 829.132 (817 data params - 31 model params = expected mean of 786; p-value = 0.138889)
  Completed in 1.1s
  2*Delta(log(L)) = 830.191
  Iteration 5 took 1.2s
  
--- Iterative MLGST: Iter 6 of 7  1201 gate strings ---: 
  --- Minimum Chi^2 GST ---
  Memory limit = 2.10GB
  Cur, Persist, Gather = 0.09, 0.00, 0.21 GB
  Evaltree generation (deriv) w/mem limit = 1.80GB
   mem(1 subtrees, 3,1 param-grps, 1 proc-grps) in 0s = 0.00GB (0.00GB fc)
  Created evaluation tree with 1 subtrees.  Will divide 3 procs into 1 (subtree-processing)
   groups of ~3 procs each, to distribute over 43 params (taken as 3 param groups of ~14 params).
   Memory estimate = 0.00GB (cache=1279, wrtLen1=15, wrtLen2=43, subsPerProc=1).
  Sum of Chi^2 = 1235.33 (1201 data params - 31 model params = expected mean of 1170; p-value = 0.0901359)
  Completed in 1.4s
  2*Delta(log(L)) = 1236.48
  Iteration 6 took 1.6s
  
--- Iterative MLGST: Iter 7 of 7  1585 gate strings ---: 
  --- Minimum Chi^2 GST ---
  Memory limit = 2.10GB
  Cur, Persist, Gather = 0.10, 0.00, 0.21 GB
  Evaltree generation (deriv) w/mem limit = 1.79GB
   mem(1 subtrees, 3,1 param-grps, 1 proc-grps) in 0s = 0.00GB (0.00GB fc)
  Created evaluation tree with 1 subtrees.  Will divide 3 procs into 1 (subtree-processing)
   groups of ~3 procs each, to distribute over 43 params (taken as 3 param groups of ~14 params).
   Memory estimate = 0.00GB (cache=1810, wrtLen1=15, wrtLen2=43, subsPerProc=1).
  Sum of Chi^2 = 1678.94 (1585 data params - 31 model params = expected mean of 1554; p-value = 0.0140621)
  Completed in 1.9s
  2*Delta(log(L)) = 1680.42
  Iteration 7 took 2.2s
  
  Switching to ML objective (last iteration)
  --- MLGST ---
  Memory: limit = 2.10GB(cur, persist, gthr = 0.10, 0.00, 0.21 GB)
  Evaltree generation (deriv) w/mem limit = 1.79GB
   mem(1 subtrees, 3,1 param-grps, 1 proc-grps) in 0s = 0.00GB (0.00GB fc)
  Created evaluation tree with 1 subtrees.  Will divide 3 procs into 1 (subtree-processing)
   groups of ~3 procs each, to distribute over 43 params (taken as 3 param groups of ~14 params).
   Memory estimate = 0.00GB (cache=1810, wrtLen1=15, wrtLen2=43, subsPerProc=1).
    Maximum log(L) = 840.168 below upper bound of -2.65029e+06
      2*Delta(log(L)) = 1680.34 (1585 data params - 31 model params = expected mean of 1554; p-value = 0.0132307)
    Completed in 3.4s
  2*Delta(log(L)) = 1680.34
  Final MLGST took 3.4s
  
Iterative MLGST Total Time: 11.0s
Rank 2 finished in 21.7s
Rank 1 finished in 21.7s
Rank 0 finished in 21.9s

Notice in the above that output within do_long_sequence_gst is not duplicated (only the first processor writes to stdout), so the output looks identical to a single-processor run. Finally, we just need to load the pickled Results object from file and proceed with any post-processing analysis. In this case, we'll just create a brief report.
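The rank-gating that produces this single output stream is just an explicit rank check. A minimal sketch of the pattern, using a mock communicator so it runs without MPI installed (FakeComm and root_print are illustrative names, not pyGSTi internals):

```python
class FakeComm:
    """Mock of an mpi4py communicator, just enough for this illustration."""
    def __init__(self, rank):
        self._rank = rank
    def Get_rank(self):
        return self._rank

def root_print(comm, msg):
    """Print only on rank 0 (or when no comm is given); return whether printed."""
    if comm is None or comm.Get_rank() == 0:
        print(msg)
        return True
    return False

assert root_print(FakeComm(0), "visible")        # rank 0 prints
assert not root_print(FakeComm(1), "suppressed") # other ranks stay silent
assert root_print(None, "serial run")            # no comm: ordinary behavior
```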

In [4]:
import pickle
results = pickle.load(open("tutorial_files/mpi_example_results.pkl","rb"))
results.create_brief_report_pdf(filename="tutorial_files/mpi_example_brief.pdf",verbosity=2)
*** Generating tables ***
  Generating table: bestGatesetSpamTable  [0.0s]
  Generating table: bestGatesetSpamParametersTable  [0.0s]
  Generating table: bestGatesetGatesTable  [0.0s]
  Generating table: bestGatesetDecompTable  [0.0s]
  Generating table: bestGatesetRotnAxisTable  [0.0s]
  Generating table: bestGatesetVsTargetTable  [0.5s]
  Generating table: bestGatesetErrorGenTable  [0.0s]
  Generating table: progressTable  [0.8s]
*** Generating plots ***
*** Merging into template file ***
Latex file(s) successfully generated.  Attempting to compile with pdflatex...
Initial output PDF tutorial_files/mpi_example_brief.pdf successfully generated.
Final output PDF tutorial_files/mpi_example_brief.pdf successfully generated. Cleaning up .aux and .log files.

Open tutorial_files/mpi_example_brief.pdf to see the report.