#!/usr/bin/env python # coding: utf-8 # # Parallelization in `ipyrad` using `ipyparallel` # One of the real strenghts of `ipyrad` is the advanced parallelization methods that it uses to distribute work across arbitrarily large computing clusters, and to be able to do so when working interactively and remotely. This is done through use of the `ipyparallel` package, which is tightly linked to `ipython` and `jupyter`. When you run the command-line `ipyrad` program all of the work of `ipyparallel` is hidden under the hood, which we've done to make the program very user-friendly. However, when using the `ipyrad` API, we've taken the alternative approach of instructing users to become intimate with the `ipyparallel` library to better understand how work is being distributed on their system. This has the benefit of allowing more flexible parallelization setups, and also makes it easier for users to take advantage of `ipyparallel` for parallelizing downstream analyses, which we have many examples of in the `analysis-tools` section of the ipyrad documentation. # ### Required software # All software required for this tutorial is installed during the ipyrad conda installation. # # In[1]: ## conda install ipyrad -c ipyrad # ### Starting an `ipcluster` instance # # The tricky aspect of using `ipyparallel` inside of Python (i.e., in a jupyter-notebook) is that you need to first start a cluster instance by running a command-line program called ``ipcluster`` (alternatively, you can also install an extension that makes it possible to start ipcluster from a tab in jupyter notebooks, but I feel the command line tool is simpler). This command will start separate python "kernels" (instances) running on the cluster/computer and ensure that they can all talk to each other. Using advanced options you can even connect kernels across multiple computers or nodes on a HPC cluster, which we'll demonstrate. # In[2]: ## ipcluster start --n=4 # ### Start a `jupyter-notebook` # If you are working on your laptop of a workstation then I typically open up two terminals, one to start a notebook and one to start an ipcluster instance. # In[18]: ## jupyter-notebook # # # # # # ### Open a notebook # Running `jupyter-notebook` will launch a server that will open a dashboard view in your browser, usually at the address is at `localhost:8888`. From the dashboard go to the menu and select `new/notebook/Python` to open a new notebook. # ### Connect to `ipcluster` in your notebook # Now from inside a notebook you can connect to the cluster using the `ipyparallel` library. Below we will connect to the client by providing no additional arguments, which is sufficient in this case sine we are using a very basic `ipcluster` setup. # In[3]: import ipyparallel as ipp print ipp.__version__ # In[4]: ## connect to ipcluster using default arguments ipyclient = ipp.Client() ## count how many engines are connected print len(ipyclient), 'cores' # ### `Profiles` in ipcluster and how ipyrad uses them # Below we show an example of a common error caused when the Client cannot find the `ipcluster` instance, in this case because it has a differnt profile name. When you start an ipcluster instance it keeps track of itself by using a specific name (its profile). The default profile is an empty string ("") and so this is the default profile that the `ipp.Client()` command will look for (and similarly the default profile that `ipyrad` will look for). If you change the name of the profile then you have to indicate this, like below. # In[22]: ## example connecting to a named profile mpi = ipp.Client(profile="MPI") # ### Start a second `ipcluster` instance with a specific profile # By using separate profiles you can have multiple `ipcluster` instances running at the same time (and possibly using different options or connected to different nodes of an HPC cluster) and you ensure that you connect to each one dictinctly. Start a new instance and then connect to it like below. Here we give it a profile name and we also tell it to initiate the engines using the `MPI` initiator by using the `--engines` flag. # In[5]: ## ipcluster start --n=4 --engines=MPI --profile=MPI # In[6]: ## now you should be able to connect to the MPI profile mpi = ipp.Client(profile="MPI") ## print mpi info print len(mpi), 'cores' # ### Using an `client()` object to distribute jobs # For full details of how this works you can read the `ipyparallel` documentation. Here I will focus on the tools in `ipyrad` that we have developed to utilize `Client` objects to distribute work. In general, all you have to do is provide the ipyclient object to the `.run()` function and `ipyrad` will take care of the rest. # In[7]: ## the ipyclient object is simply a view to the engines ipyclient # In[8]: import ipyrad as ip import ipyrad.analysis as ipa # ### Example of using a Client in an ipyrad assembly # Here we create an Assembly and when we call the `.run()` command we provide a specific `ipyclient` object as the target to distribute work on. If you do not provide this option then by default `ipyrad` will look for an `ipcluster` instance running on the default profile (""). # In[9]: ## run steps of an ipyrad assembly on a specific ipcluster instance data = ip.Assembly("example") data.set_params("sorted_fastq_path", "example_empirical_rad/*.fastq.gz") data.run("1", ipyclient=mpi) # ### Example of using a Client in an analysis (`tetrad`) # In[10]: ## run ipyrad analysis tools on a specific ipcluster instance tet = ipa.tetrad( name="test", data="./analysis-ipyrad/pedic-full_outfiles/pedic-full.snps.phy", mapfile="./analysis-ipyrad/pedic-full_outfiles/pedic-full.snps.map", nboots=20); tet.run(ipyclient=mpi) # ### Parallelizing any arbitrary function with ipyparallel # A very simple example... # In[18]: ## get load-balanced view of our ipyclient lbview = ipyclient.load_balanced_view() ## a dict to store results res = {} ## define your func def my_sum_func(x, y): return x, y, sum([x, y]) ## submit jobs to cluster with arguments import random for job in range(10): x = random.randint(0, 10) y = random.randint(0, 10) ## submitting a job returns an async object async = lbview.apply(my_sum_func, x, y) ## store results res[job] = async ## block until all jobs finish ipyclient.wait() # In[19]: ## the results objects res # In[20]: for ridx in range(10): print "job: {}, result: {}".format(ridx, res[ridx].get()) # ### Starting an ipcluster instance on a HPC cluster # See the multi-node setup in the HPC tunneling tutorial for instructions. http://ipyrad.readthedocs.io/HPC_Tunnel.html # In[ ]: ## an example command to connect to 80 cores across multiple nodes using MPI ## ipcluster start --n=80 --engines=MPI --ip='*' --profile='MPI80'