Notebook

In [2]:

%matplotlib inline

Table of Contents¶

1. The original code
2. Making our code executable at the command-line
3. Adding command-line arguments
4. Separating implementation from user interface
5. Separating library files from program files
6. Packaging and installation
Bonus enhancements

1. The original code¶

In [3]:

import numpy as np
import astropy.io.fits as fits
from astropy.convolution import Gaussian2DKernel, convolve_fft

# Create empty image
nx, ny = 512, 512
image = np.zeros((ny, nx))

# Set number of stars
n_stars = 10000

# Generate random positions
r = np.random.random(n_stars) * nx
theta = np.random.uniform(0., 2. * np.pi, n_stars)

# Generate random fluxes
fluxes = np.random.random(n_stars) ** 2

# Compute position
x = nx / 2 + r * np.cos(theta)
y = ny / 2 + r * np.sin(theta)

# Add stars to image
# ==> First for loop and if statement <==
for idx in range(n_stars):
    if x[idx] >= 0 and x[idx] < nx and y[idx] >= 0 and y[idx] < ny:
        image[y[idx], x[idx]] += fluxes[idx]

# Convolve with a gaussian
kernel = Gaussian2DKernel(stddev=1)
image = convolve_fft(image, kernel)

# Add noise
image += np.random.normal(1., 0.001, image.shape)

# Write to a FITS file
fits.writeto('cluster.fits', image, clobber=True)

WARNING: Overwriting existing file 'cluster.fits'. [astropy.io.fits.file]
WARNING:astropy:Overwriting existing file 'cluster.fits'.

In [4]:

# APLpy is a third-party library for Python designed for easy creation of
# astronomy image plots
import aplpy

fig = aplpy.FITSFigure('cluster.fits')
fig.show_colorscale(cmap='hot')

WARNING: No WCS information found in header - using pixel coordinates [aplpy.header]
WARNING:astropy:No WCS information found in header - using pixel coordinates
INFO:astropy:Auto-setting vmin to  9.820e-01
INFO:astropy:Auto-setting vmax to  1.168e+00

INFO: Auto-setting vmin to  9.820e-01 [aplpy.core]
INFO: Auto-setting vmax to  1.168e+00 [aplpy.core]

^Back to top

2. Making our code executable at the command-line¶

First thing's first, make this into an executable script by adding a "shebang" line, like

#!/usr/bin/env python

This indicates to your shell to run the remainder of this file through Python with a copy of your current shell environment. We also use IPython's %%file magic to write this out as a file on disk:

In [5]:

%%file simcluster.py
#!/usr/bin/env python
import numpy as np
import astropy.io.fits as fits
from astropy.convolution import Gaussian2DKernel, convolve_fft

# Create empty image
nx, ny = 512, 512
image = np.zeros((ny, nx))

# Set number of stars
n_stars = 10000

# Generate random positions
r = np.random.random(n_stars) * nx
theta = np.random.uniform(0., 2. * np.pi, n_stars)

# Generate random fluxes
fluxes = np.random.random(n_stars) ** 2

# Compute position
x = nx / 2 + r * np.cos(theta)
y = ny / 2 + r * np.sin(theta)

# Add stars to image
# ==> First for loop and if statement <==
for idx in range(n_stars):
    if x[idx] >= 0 and x[idx] < nx and y[idx] >= 0 and y[idx] < ny:
        image[y[idx], x[idx]] += fluxes[idx]

# Convolve with a gaussian
kernel = Gaussian2DKernel(stddev=1)
image = convolve_fft(image, kernel)

# Add noise
image += np.random.normal(1., 0.001, image.shape)

# Write to a FITS file
fits.writeto('cluster.fits', image, clobber=True)

Overwriting simcluster.py

We also want to set the "executable" bit on our file's permissions, which tells the shell that this is an executable file (i.e. if we enter the file's name at a command prompt it will try to run it as a program):

In [6]:

! chmod +x simcluster.py

chmod is the shell program for setting file permissions (it can also be used to set read/write permissions). The +x option means to set the file simcluster as "executable" for all users (likewise -x removes the executable flag).

Now we can run our program at the command prompt:

In [7]:

! ./simcluster.py

WARNING: Overwriting existing file 'cluster.fits'. [astropy.io.fits.file]

We include the ./ because without it the shell does not know where to look for the program named "simcluster". If a program name is entered at the shell without an explicit path (like we do when we run ls, chmod, etc.) the shell searches all the entries on our $PATH environment variable for a program by that name. We can check the $PATH using the shell like:

In [8]:

! echo $PATH

/usr/lib/lightdm/lightdm:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

Or we can read and set environment variables directly in Python using the os module:

In [9]:

import os
os.environ['PATH']

Out[9]:

'/usr/lib/lightdm/lightdm:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games'

^Back to top

3. Adding command-line arguments¶

Getting back to our script, the next thing to do to make this script into a reusable program that others can use too, would be to note that we actually have several parameters for which specified some values that worked for us at the time, but that others (where "others" includes ourselves, in the future) may wish to adjust.

Perhaps most notably the file name that we wrote to (maybe "clobber.fits" doesn't do it for everyone). Other parameters we might want to make adjustable would be the number of stars, and the dimensions (in pixels, for now) of the image. Those are the only parameters we'll focus on for now--can you think of others?

It isn't very nice to have to modify our code every time we want to tweak some parameters. Nevermind that we might not even remember what parameters we used to generate a particular image (did this have 10000 stars? or 12000? I don't remember since I changed it since I last generated that image...). Of course since we're writing to FITS files there are things we might do, like include keywords in the header specifying the parameters of the image). But for ease of use we would like our script to accept command-line argument to set the number of stars, etc. Then the command we used to generate that file would be in our shell history. It also means that when we share are script with someone else we just have to tell them how to run it--not how to go in and edit the code.

How do Python scripts accept command-line arguments?

I've you've ever written a program in C you've probably written a main function like:

int main(int argc, char** argv) { ... }

When you run your program, the program name itself along with any command-line arguments that followed the program name are passed in via an array of strings in argv (where argc tells us the number of arguments that were passed in).

In Python the equivalent is just a variable that lives in the sys module called sys.argv. It gives us a list of all the command-line arguments. To see this, let's just print sys.argv at the top of our script:

In [10]:

%%file simcluster.py
#!/usr/bin/env python
import numpy as np
import astropy.io.fits as fits
from astropy.convolution import Gaussian2DKernel, convolve_fft

# Let's take a look at the command-line arguments:
import sys
print(sys.argv)

# Create empty image
nx, ny = 512, 512
image = np.zeros((ny, nx))

# Set number of stars
n_stars = 10000

# Generate random positions
r = np.random.random(n_stars) * nx
theta = np.random.uniform(0., 2. * np.pi, n_stars)

# Generate random fluxes
fluxes = np.random.random(n_stars) ** 2

# Compute position
x = nx / 2 + r * np.cos(theta)
y = ny / 2 + r * np.sin(theta)

# Add stars to image
# ==> First for loop and if statement <==
for idx in range(n_stars):
    if x[idx] >= 0 and x[idx] < nx and y[idx] >= 0 and y[idx] < ny:
        image[y[idx], x[idx]] += fluxes[idx]

# Convolve with a gaussian
kernel = Gaussian2DKernel(stddev=1)
image = convolve_fft(image, kernel)

# Add noise
image += np.random.normal(1., 0.001, image.shape)

# Write to a FITS file
fits.writeto('cluster.fits', image, clobber=True)

Overwriting simcluster.py

In [11]:

! ./simcluster.py --stars=10000 -x 512 -y 512 cluster.fits

['./simcluster.py', '--stars=10000', '-x', '512', '-y', '512', 'cluster.fits']
WARNING: Overwriting existing file 'cluster.fits'. [astropy.io.fits.file]

What we would like to do is parse these command-line options so that we use the value of the --stars option to set our n_stars variable, and we also note that we accept -x and -y options that we want to take as our x and y dimensions. We also need to handle the filename.

And remember when we said we'll want to explain to our colleagues how to use our program? Well, if we provide them with a --help option, maybe we won't have to do even that, if it's a simple enough program.

Just to throw a wrench into things we also want to support -s as a shortcut for --stars and -h as a shortcut for --help.

We could spend a while writing all the code to handle these command-line arguments (rather than getting on with our science). But fortunately, except in extreme cases, we don't have to do that. This is already a solved problem, and there are libraries that will make it easy to add standard argument handling to our script.

Python comes out of the box with three such libraries:

Why three? Is this just to confuse you? No! It's because Python is a living language, and the standard library continues to evolve. getopt is an interface to a C library of the same name, and is fairly basic. optparse is a higher-level library that getopt that has been in Python for a long time, but was beginning to show its age. argparse already existed as a third-party library, and (most) people liked it so much more than the old optparse that it was added as a replacement for optparse (optparse is kept around though so that old programs don't just break!)

We will take a brief look at argparse. Here's one way we might go about adding argument parsing to our program with argparse:

In [12]:

%%file simcluster.py
#!/usr/bin/env python

import argparse  # We added a new import statement for argarse
import sys       # Also import sys for sys.argv

import numpy as np
import astropy.io.fits as fits
from astropy.convolution import Gaussian2DKernel, convolve_fft

# Create an ArgumentParser--a special object that keeps track of all the
# arguments we want our script to be able to handle, and then implements parsing
# them from sys.argv
parser = argparse.ArgumentParser(description="Generates simulated images of clusters")

# Add an optional argument for # of stars (default=10000)
parser.add_argument('-s', '--stars', type=int, default=10000,
                    help='the number of stars to generate')

# Add the x and y arguments
parser.add_argument('-x', type=int, default=512,
                    help='the x dimension (in pixels) of the image')
parser.add_argument('-y', type=int, default=512,
                    help='the y dimension (in pixels) of the image')

# Add an argument to handle the output file.  If we use argparse.FileType it will
# handle opening a writeable file (and ensuring we can write to it).
# Note that the argument name 'file' does not beging with a '-' or '--'; this indicates
# to argparse that it is a *positional* argument
parser.add_argument('file', type=argparse.FileType('w'),
                    help='the name of the output file')

args = parser.parse_args(sys.argv[1:])

n_stars = args.stars
nx = args.x
ny = args.y
fileobj = args.file

# Create empty image
image = np.zeros((ny, nx))

# Generate random positions
r = np.random.random(n_stars) * nx
theta = np.random.uniform(0., 2. * np.pi, n_stars)

# Generate random fluxes
fluxes = np.random.random(n_stars) ** 2

# Compute position
x = nx / 2 + r * np.cos(theta)
y = ny / 2 + r * np.sin(theta)

# Add stars to image
# ==> First for loop and if statement <==
for idx in range(n_stars):
    if x[idx] >= 0 and x[idx] < nx and y[idx] >= 0 and y[idx] < ny:
        image[y[idx], x[idx]] += fluxes[idx]

# Convolve with a gaussian
kernel = Gaussian2DKernel(stddev=1)
image = convolve_fft(image, kernel)

# Add noise
image += np.random.normal(1., 0.001, image.shape)

# Write to a FITS file
fits.writeto(fileobj, image, clobber=True)

Overwriting simcluster.py

So what did all that get us? What do all the help strings do for us (other than document the code)? Well let's try it out:

In [13]:

! ./simcluster.py --help

usage: simcluster.py [-h] [-s STARS] [-x X] [-y Y] file

Generates simulated images of clusters

positional arguments:
  file                  the name of the output file

optional arguments:
  -h, --help            show this help message and exit
  -s STARS, --stars STARS
                        the number of stars to generate
  -x X                  the x dimension (in pixels) of the image
  -y Y                  the y dimension (in pixels) of the image

Note also that the output file is not optional--we must provide one now:

In [14]:

! ./simcluster.py --stars=12000

usage: simcluster.py [-h] [-s STARS] [-x X] [-y Y] file
simcluster.py: error: too few arguments

In [15]:

! ./simcluster.py --stars=12000 cluster2.fits

In [16]:

fig = aplpy.FITSFigure('cluster2.fits')
fig.show_colorscale(cmap='hot')

WARNING: No WCS information found in header - using pixel coordinates [aplpy.header]
WARNING:astropy:No WCS information found in header - using pixel coordinates
INFO:astropy:Auto-setting vmin to  9.814e-01
INFO:astropy:Auto-setting vmax to  1.175e+00

INFO: Auto-setting vmin to  9.814e-01 [aplpy.core]
INFO: Auto-setting vmax to  1.175e+00 [aplpy.core]

^Back to top

4. Separating implementation from user interface¶

Now we have a new problem--our script is getting long, and a bit jumbled. We have argument handling, image processing, and file writing all mixed together. Even though we've done a fairly good job organizing the source into stages, this does not scale well as our script gets more sophisticated and grows from 50 lines to 100 lines to 1000 lines or more (for example the writing stage is just one line right now, but what if we want to add more to the FITS header? Maybe generate a WCS (which will require a lot more command-line options as well). Maybe we want to support other output formats. And of course the image generating itself could grow vastly more sophisticated.

Note also that the code for generating the image has no deep connection to the code for argument handling. It existed before we added argument parsing. So maybe that independence should be made explicit by breaking these stages into separate functions. Note also that generating the image just involved creating a Numpy array. We don't necessarily have to write that out to a file at all. Maybe we could use that function as just one step in a pipeline that will do some analysis (source counting, etc.) on our generated image.

To get this script under control we'll start breaking it up into a few different functions. Remember how for a C program we had a main function:

int main(int argc, char** argv) { ... }

that receives all the command-line arguments? Let's make a main function in our Python program as well and use that as a place to do all our argument handling:

In [17]:

%%file simcluster.py
#!/usr/bin/env python

import argparse  # We added a new import statement for argarse
import sys       # Also import sys for sys.argv

import numpy as np
import astropy.io.fits as fits
from astropy.convolution import Gaussian2DKernel, convolve_fft


def main(argv):
    # Create an ArgumentParser--a special object that keeps track of all the
    # arguments we want our script to be able to handle, and then implements parsing
    # them from sys.argv
    parser = argparse.ArgumentParser(description="Generates simulated images of clusters")

    # Add an optional argument for # of stars (default=10000)
    parser.add_argument('-s', '--stars', type=int, default=10000,
                        help='the number of stars to generate')

    # Add the x and y arguments
    parser.add_argument('-x', type=int, default=512,
                        help='the x dimension (in pixels) of the image')
    parser.add_argument('-y', type=int, default=512,
                        help='the y dimension (in pixels) of the image')

    # Add an argument to handle the output file.  If we use argparse.FileType it will
    # handle opening a writeable file (and ensuring we can write to it).
    # Note that the argument name 'file' does not beging with a '-' or '--'; this indicates
    # to argparse that it is a *positional* argument
    parser.add_argument('file', type=argparse.FileType('w'),
                        help='the name of the output file')

    args = parser.parse_args(argv)

    n_stars = args.stars
    nx = args.x
    ny = args.y
    fileobj = args.file

    # Create empty image
    image = np.zeros((ny, nx))

    # Generate random positions
    r = np.random.random(n_stars) * nx
    theta = np.random.uniform(0., 2. * np.pi, n_stars)

    # Generate random fluxes
    fluxes = np.random.random(n_stars) ** 2

    # Compute position
    x = nx / 2 + r * np.cos(theta)
    y = ny / 2 + r * np.sin(theta)

    # Add stars to image
    # ==> First for loop and if statement <==
    for idx in range(n_stars):
        if x[idx] >= 0 and x[idx] < nx and y[idx] >= 0 and y[idx] < ny:
            image[y[idx], x[idx]] += fluxes[idx]

    # Convolve with a gaussian
    kernel = Gaussian2DKernel(stddev=1)
    image = convolve_fft(image, kernel)

    # Add noise
    image += np.random.normal(1., 0.001, image.shape)

    # Write to a FITS file
    fits.writeto(fileobj, image, clobber=True)

if __name__ == '__main__':
    main(sys.argv[1:])

Overwriting simcluster.py

Now let's move the code for actually generating the pixels of the simulated image into its own function:

In [18]:

%%file simcluster.py
#!/usr/bin/env python

import argparse  # We added a new import statement for argarse
import sys       # Also import sys for sys.argv

import numpy as np
import astropy.io.fits as fits
from astropy.convolution import Gaussian2DKernel, convolve_fft


def simulated_cluster(n_stars=10000, dimensions=(512, 512)):
    """
    Generates an image simulating a cluster of stars, including
    a Gaussian filter and background noise.
    
    Parameters
    ----------
    n_stars : `int`
        A positive integer giving the number of visible stars in the image
        (default: 10000).
        
    dimensions : `tuple`
        A two-tuple of positive integers specifying the dimensions (in pixels)
        of the output image (default: 512x512).
        
    Returns
    -------
    array : `~numpy.ndarray`
        A 2D Numpy array containing the pixels of the generated image.
    """ 
        
    nx, ny = dimensions
    
    # Create empty image
    image = np.zeros((ny, nx))

    # Generate random positions
    r = np.random.random(n_stars) * nx
    theta = np.random.uniform(0., 2. * np.pi, n_stars)

    # Generate random fluxes
    fluxes = np.random.random(n_stars) ** 2

    # Compute position
    x = nx / 2 + r * np.cos(theta)
    y = ny / 2 + r * np.sin(theta)

    # Add stars to image
    # ==> First for loop and if statement <==
    for idx in range(n_stars):
        if x[idx] >= 0 and x[idx] < nx and y[idx] >= 0 and y[idx] < ny:
            image[y[idx], x[idx]] += fluxes[idx]

    # Convolve with a gaussian
    kernel = Gaussian2DKernel(stddev=1)
    image = convolve_fft(image, kernel)

    # Add noise
    image += np.random.normal(1., 0.001, image.shape)
    
    return image


def main(argv):
    """Main function for the simcluster.py script."""
    
    # Create an ArgumentParser--a special object that keeps track of all the
    # arguments we want our script to be able to handle, and then implements parsing
    # them from sys.argv
    parser = argparse.ArgumentParser(description="Generates simulated images of clusters")

    # Add an optional argument for # of stars (default=10000)
    parser.add_argument('-s', '--stars', type=int, default=10000,
                        help='the number of stars to generate')

    # Add the x and y arguments
    parser.add_argument('-x', type=int, default=512,
                        help='the x dimension (in pixels) of the image')
    parser.add_argument('-y', type=int, default=512,
                        help='the y dimension (in pixels) of the image')

    # Add an argument to handle the output file.  If we use argparse.FileType it will
    # handle opening a writeable file (and ensuring we can write to it).
    # Note that the argument name 'file' does not beging with a '-' or '--'; this indicates
    # to argparse that it is a *positional* argument
    parser.add_argument('file', type=argparse.FileType('w'),
                        help='the name of the output file')

    args = parser.parse_args(argv)

    image = simulated_cluster(n_stars=args.stars, dimensions=(args.x, args.y))

    # Write to a FITS file
    # For now the file writing is simple enough that we leave it in the main() function; in the future
    # we may want to break this out into its own routine that takes a filename and the image array
    # (and any other relevant options) and handles the writing
    fits.writeto(args.file, image, clobber=True)

if __name__ == '__main__':
    main(sys.argv[1:])

Overwriting simcluster.py

The most mysterious aspect of this last version is probably the incantation at the end:

if __name__ == '__main__':
    main(sys.argv[1:])

Clearly this reads sys.argv and passes all the argument (except the first, which is the program name) to our main() function. But what is if __name__ == '__main__':?

Every Python source file read by Python (also known as a "module" in Python parlance), such as our simcluster.py file, automatically gets a variable that is global (to that file, not to the entire Python interpreter) variable called __name__ defined. Typically __name__ just comes from the filename, but with the .py stripped off. Nearly every library or "module" you load in Python comes from a file somewhere, and has this __name__ attribute. For example:

In [19]:

import argparse
argparse.__file__

Out[19]:

'/usr/lib/python2.7/argparse.pyc'

In [20]:

argparse.__name__

Out[20]:

'argparse'

However, there is a caveat. When we import argparse, argparse.__name__ is the same as the file name. However, when we run a .py file as a script, as we are doing when we run simcluster.py at the command-line, it gets a special __name__ with the value '__main__'. We can see this with a simple test script:

In [21]:

%%file maintest.py

print __name__

Overwriting maintest.py

In [22]:

!python maintest.py

__main__

This is how we know that this file was the main entry point to our program. Any other Python module imported into our program still reports its normal __name__. (Interesting note, when you run the Python REPL, all the commands you run are actually in a default module called '__main__'. Try it yourself--run python and then type print __name__).

The reason we put this incantation in our script is to note that we have now separated out a function called simulated_cluster, which we might like to use sometime in an interactive Python session, or perhaps as part of a different program. In other words, simcluster.py is no longer just a script--it has also become a (small) library of functions, that can also optionally be run as a script. That is, rather than call ./simcluster.py from the command-line, we can also import it in our Python session:

In [23]:

import simcluster

Note that when we import simcluster, its __name__ is no longe '__main__':

In [24]:

simcluster.__name__

Out[24]:

'simcluster'

That means we can import the module without it automatically running the main function of the script. We can then call functions from within the module in our interactive session:

In [25]:

simcluster.simulated_cluster(100, (10, 10))

Out[25]:

array([[ 1.08060842,  1.07781005,  1.06687394,  1.04314849,  1.02514778,
         1.07103758,  1.21289128,  1.30040318,  1.1865629 ,  1.05154149],
       [ 1.14474091,  1.15789996,  1.14743815,  1.0866468 ,  1.04366706,
         1.09124621,  1.19923599,  1.2307618 ,  1.13277319,  1.03827623],
       [ 1.14379661,  1.18677995,  1.2045861 ,  1.13782828,  1.09185713,
         1.10385719,  1.12215755,  1.10702282,  1.06145937,  1.02006527],
       [ 1.1888733 ,  1.21211689,  1.21824269,  1.18211722,  1.18498579,
         1.198651  ,  1.13984625,  1.08152114,  1.04217991,  1.01276278],
       [ 1.25177213,  1.2134936 ,  1.19002654,  1.2160672 ,  1.26027973,
         1.28623961,  1.18859329,  1.0913626 ,  1.0357741 ,  1.0089897 ],
       [ 1.14867159,  1.12709839,  1.16200597,  1.23744945,  1.2437626 ,
         1.22530688,  1.15773793,  1.09226373,  1.04001322,  1.00919381],
       [ 1.03924094,  1.0634581 ,  1.14238012,  1.20993034,  1.18198106,
         1.15080055,  1.12951789,  1.11843634,  1.0859279 ,  1.03178174],
       [ 1.02265141,  1.05677639,  1.13390076,  1.17900249,  1.15885219,
         1.15614676,  1.16340906,  1.2117383 ,  1.19567374,  1.09095349],
       [ 1.06123756,  1.06292554,  1.08276469,  1.09781104,  1.09732647,
         1.1150451 ,  1.1707453 ,  1.28957764,  1.30740887,  1.15728917],
       [ 1.09167092,  1.06188045,  1.02819676,  1.02438726,  1.02829093,
         1.04469458,  1.09684281,  1.20752832,  1.2506197 ,  1.14320566]])

^Back to top

5. Separating library files from program files¶

We have done a nice job so far of separating the implementation of our algorithm (the actual "science") from the "user-friendly" command-line interface that we now provide to it. Our simcluster.py module is both an importable Python module (or library), but also acts as a command-line program when executed.

However, we can still do better. There are at least a few issues with this (in the case of this simple program hypothetical, but still realistic for a more complicated application):

There's a degree to which, by putting .py in the filename, we're exposing how the sausage is made. Maybe we just want people to be able to make simulated images without knowing or caring what programming language it happened to be implemented in. We might even change the implementation language in the distant future--or maybe we're writing a replacement for an already existing program. If we're the only user it doesn't matter so much, but if we want to share our program with colleagues it might be nicer if we just gave them a program called simcluster (without the .py).
Right now our "library" code consists of a single function simulated_cluster. But we've also mentioned that in the future it might grow more components. Real simulation code might involve dozens of subroutines and utilitiy functions that we might want to break up into separate files. It might become unmaintainable if we insist that all the code for simcluster stay in a single file.
Whether our code consists of a single file, or a bundle of files, we want to be able to install it to a standard location on the operating system. That way we can run the simcluster program just like any other program like ls or grep or ds9. We also want to be able to import simcluster from our Python code just like we import argparse or import astropy.

In this section we'll address the first bullet point. In the subsequent section we'll address the third. (The middle bullet point also pertains to the first one, but a little more softfly in our example program since the code is very small.)

What we are going to do now is separate our "library" code, from the command-line interface, this time by moving them into entirely separate files. The "library" code, currently just consisting of our simulated_cluster function can stay where it is in simcluster.py. We will move the main() function into a new file called just simcluster:

In [26]:

%%file simcluster.py
"""Utilities for generating simulated astronomical images."""

__version__ = '0.1'

import numpy as np
from astropy.convolution import Gaussian2DKernel, convolve_fft


def simulated_cluster(n_stars=10000, dimensions=(512, 512)):
    """
    Generates an image simulating a cluster of stars, including
    a Gaussian filter and background noise.
    
    Parameters
    ----------
    n_stars : `int`
        A positive integer giving the number of visible stars in the image
        (default: 10000).
        
    dimensions : `tuple`
        A two-tuple of positive integers specifying the dimensions (in pixels)
        of the output image (default: 512x512).
        
    Returns
    -------
    array : `~numpy.ndarray`
        A 2D Numpy array containing the pixels of the generated image.
    """ 
        
    nx, ny = dimensions
    
    # Create empty image
    image = np.zeros((ny, nx))

    # Generate random positions
    r = np.random.random(n_stars) * nx
    theta = np.random.uniform(0., 2. * np.pi, n_stars)

    # Generate random fluxes
    fluxes = np.random.random(n_stars) ** 2

    # Compute position
    x = nx / 2 + r * np.cos(theta)
    y = ny / 2 + r * np.sin(theta)

    # Add stars to image
    # ==> First for loop and if statement <==
    for idx in range(n_stars):
        if x[idx] >= 0 and x[idx] < nx and y[idx] >= 0 and y[idx] < ny:
            image[y[idx], x[idx]] += fluxes[idx]

    # Convolve with a gaussian
    kernel = Gaussian2DKernel(stddev=1)
    image = convolve_fft(image, kernel)

    # Add noise
    image += np.random.normal(1., 0.001, image.shape)
    
    return image

Overwriting simcluster.py

Note that we also added a __version__ variable in simcluster.py, as well as a docstring for the module. Using __version__ to specify a library's version is standard practice. But you should also think about how to use the version number effectively. Time permitting we can have a discussion about that, but in short there are many documented schemes for how to version your software. One popular one is called "semantic versions" or "semver" for short. Most Python projects in the wild use some variation on "semver", if not "semver" in the stricted definition.

In [27]:

%%file simcluster
#!/usr/bin/env python

import sys
import argparse

from astropy.io import fits
from simcluster import simulated_cluster

def main(argv):
    """Main function for the simcluster.py script."""
    
    # Create an ArgumentParser--a special object that keeps track of all the
    # arguments we want our script to be able to handle, and then implements parsing
    # them from sys.argv
    parser = argparse.ArgumentParser(description="Generates simulated images of clusters")

    # Add an optional argument for # of stars (default=10000)
    parser.add_argument('-s', '--stars', type=int, default=10000,
                        help='the number of stars to generate')

    # Add the x and y arguments
    parser.add_argument('-x', type=int, default=512,
                        help='the x dimension (in pixels) of the image')
    parser.add_argument('-y', type=int, default=512,
                        help='the y dimension (in pixels) of the image')

    # Add an argument to handle the output file.  If we use argparse.FileType it will
    # handle opening a writeable file (and ensuring we can write to it).
    # Note that the argument name 'file' does not beging with a '-' or '--'; this indicates
    # to argparse that it is a *positional* argument
    parser.add_argument('file', type=argparse.FileType('w'),
                        help='the name of the output file')

    args = parser.parse_args(argv)

    image = simulated_cluster(n_stars=args.stars, dimensions=(args.x, args.y))

    # Write to a FITS file
    # For now the file writing is simple enough that we leave it in the main() function; in the future
    # we may want to break this out into its own routine that takes a filename and the image array
    # (and any other relevant options) and handles the writing
    fits.writeto(args.file, image, clobber=True)

if __name__ == '__main__':
    main(sys.argv[1:])

Overwriting simcluster

Note that now all simcluster.py contains is a Python function definition. There is no code in it (like the main function) that is executable in the context of a command-line script. So we will remove the +x executable mode from the file (but add it to the new simcluster file):

In [28]:

! chmod -x simcluster.py
! chmod +x simcluster

^Back to top

6. Packaging and installation¶

As mentioned in the previous section, although we have nicely separated our library code from our user-interface code, we now have a new problem: Instead of a nice, single script file that we can e-mail around to our colleagues, we now how two files to deal with, at least one of which depends on the other. How do we keep things straight--how do we make sure that simcluster is executable, and that simcluster.py is importable from within Python?

We all have a fairly "standard" execution $PATH for our programs (by "standard" we mean standard for a particular platform--this actually varies wildly between OS's):

In [29]:

! echo $PATH

/usr/lib/lightdm/lightdm:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

When we give people (or ourselves) the simcluster script we could ask them to copy it to on eof those standard paths. But which one? We'd have to give instructions for all OS's, and even with that things can be messy and complicated. Things get even worse when it comes to Python libraries. Python has a number of standard paths that it searches for libraries, and this varies depending on what Python version you're using. To see where Python searches, print out the variable sys.path from the sys module:

In [30]:

import sys
sys.path

Out[30]:

['',
 '/usr/local/lib/python2.7/dist-packages/prettyplotlib-0.1.4-py2.7.egg',
 '/usr/local/lib/python2.7/dist-packages/brewer2mpl-1.3.2-py2.7.egg',
 '/usr/lib/python2.7',
 '/usr/lib/python2.7/plat-linux2',
 '/usr/lib/python2.7/lib-tk',
 '/usr/lib/python2.7/lib-old',
 '/usr/lib/python2.7/lib-dynload',
 '/usr/local/lib/python2.7/dist-packages',
 '/usr/lib/python2.7/dist-packages',
 '/usr/lib/python2.7/dist-packages/PIL',
 '/usr/lib/python2.7/dist-packages/gst-0.10',
 '/usr/lib/python2.7/dist-packages/gtk-2.0',
 '/usr/lib/pymodules/python2.7',
 '/usr/lib/python2.7/dist-packages/ubuntu-sso-client',
 '/usr/lib/python2.7/dist-packages/ubuntuone-client',
 '/usr/lib/python2.7/dist-packages/ubuntuone-couch',
 '/usr/lib/python2.7/dist-packages/ubuntuone-installer',
 '/usr/lib/python2.7/dist-packages/ubuntuone-storage-protocol',
 '/usr/local/lib/python2.7/dist-packages/IPython/extensions']

Each value in sys.path is a directory on your filesystem. When you run the code import simcluster it searches each of those directories, in order (the first entry '' means the current directory) for a file called simcluster.py to import.

Note: Alternatively, it looks for a directory called simcluster containing a file named simcluster/__init__.py. That indicates that simcluster is a "package", which is beyond the scope of this tutorial.

Knowing where to properly install a Python module can be a nightmare for novices.

Fortunately, your Python distribution does know where best to install both scripts and Python modules (so long as you have permission to install to those locations--there are ways of specifying alternative paths, but that is again outside the scope of this tutorial).

The "standard" way to install Python software is to write a special script called setup.py, which uses a library called setuptools (or sometimes the older distutils). You may have seen this before if you have ever downloaded a third-party Python library. The library probably came in some archive file (either .zip or .tar.gz) containing all the source code for the library and/or scripts, along with a setup.py file, and a few other files (a license, a README, etc.)

Users can simply unpack the archive, and run the command

./setup.py install

to install all the software in your package to the correct locations for your particular Python distribution (this will work even with Anaconda and the like). The key is that you don't have to implement the details of how this command works. All of that is already done by Python. All you have to do is provide a little boilerplate in the setup.py file that looks something like:

In [31]:

%%file setup.py
#!/usr/bin/env python

from setuptools import setup

setup(
    name='simcluster',
    author='Your Name Here!',
    author_email='your.name@example.com',  # so people can pester you ;)
    version='0.1',
    py_modules=['simcluster'],  # Python modules to install (without the .py in the filename)
    scripts=['simcluster']  # This is the full name of the script "simcluster"; this will be installed to a bin/ directory
)

Overwriting setup.py

When the setup.py script is run all that happens is the setup() function is called. This passes all the information about your project to the Python setuptools/distutils infrastructure, and it uses it to implement an entire command-line interface for building and installing your project, as we'll see.

Next, be kind to your users and make the setup.py script executable as well:

In [32]:

! chmod +x setup.py

Now when we run this setup.py we can see that it implements a whole lot of functionality that we didn't have to write:

In [33]:

! ./setup.py --help

Common commands: (see '--help-commands' for more)

  setup.py build      will build the package underneath 'build/'
  setup.py install    will install the package

Global options:
  --verbose (-v)      run verbosely (default)
  --quiet (-q)        run quietly (turns verbosity off)
  --dry-run (-n)      don't actually do anything
  --help (-h)         show detailed help message
  --no-user-cfg       ignore pydistutils.cfg in your home directory
  --command-packages  list of packages that provide distutils commands

Information display options (just display information, ignore any commands)
  --help-commands     list all available commands
  --name              print package name
  --version (-V)      print package version
  --fullname          print <package name>-<version>
  --author            print the author's name
  --author-email      print the author's email address
  --maintainer        print the maintainer's name
  --maintainer-email  print the maintainer's email address
  --contact           print the maintainer's name if known, else the author's
  --contact-email     print the maintainer's email address if known, else the
                      author's
  --url               print the URL for this package
  --license           print the license of the package
  --licence           alias for --license
  --description       print the package description
  --long-description  print the long package description
  --platforms         print the list of platforms
  --classifiers       print the list of classifiers
  --keywords          print the list of keywords
  --provides          print the list of packages/modules provided
  --requires          print the list of packages/modules required
  --obsoletes         print the list of packages/modules made obsolete

usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

To install the package we just run ./setup.py install. We can also run ./setup.py install --help to see additional options for controlling where files are installed to. But in general, it will by default copy files to their appropriate locations (so long as you have permission to write to those locations on the filesystem-- you may have to run sudo ./setup.py install).

setup.py also provides us with a number of other services. For example, it was mentioned earlier that when you download a Python library it comes in an archive file (usually referred to as a "source distribution"). Well, you can make your own simply by running:

In [34]:

! ./setup.py sdist

running sdist
running egg_info
writing simcluster.egg-info/PKG-INFO
writing top-level names to simcluster.egg-info/top_level.txt
writing dependency_links to simcluster.egg-info/dependency_links.txt
reading manifest file 'simcluster.egg-info/SOURCES.txt'
writing manifest file 'simcluster.egg-info/SOURCES.txt'
warning: sdist: standard file not found: should have one of README, README.txt

warning: check: missing required meta-data: url

creating simcluster-0.1
creating simcluster-0.1/simcluster.egg-info
making hard links in simcluster-0.1...
hard linking setup.py -> simcluster-0.1
hard linking simcluster -> simcluster-0.1
hard linking simcluster.py -> simcluster-0.1
hard linking simcluster.egg-info/PKG-INFO -> simcluster-0.1/simcluster.egg-info
hard linking simcluster.egg-info/SOURCES.txt -> simcluster-0.1/simcluster.egg-info
hard linking simcluster.egg-info/dependency_links.txt -> simcluster-0.1/simcluster.egg-info
hard linking simcluster.egg-info/top_level.txt -> simcluster-0.1/simcluster.egg-info
Writing simcluster-0.1/setup.cfg
Creating tar archive
removing 'simcluster-0.1' (and everything under it)

This automatically creates an archive with all the required files for your project (though sometimes you need to hold its hand a bit for it to know what all the files are that it needs, see Python documentation for more details).

By default, it automatically creates the file in a subdirectory of your project called dist/ (note on OSX and Linux it usually creates a .tar.gz by default, while on Windows it creates a .zip by default, so the following example may not work on Windows):

In [35]:

! tar -tf dist/simcluster-0.1.tar.gz

simcluster-0.1/
simcluster-0.1/PKG-INFO
simcluster-0.1/setup.cfg
simcluster-0.1/simcluster
simcluster-0.1/simcluster.egg-info/
simcluster-0.1/simcluster.egg-info/SOURCES.txt
simcluster-0.1/simcluster.egg-info/PKG-INFO
simcluster-0.1/simcluster.egg-info/dependency_links.txt
simcluster-0.1/simcluster.egg-info/top_level.txt
simcluster-0.1/setup.py
simcluster-0.1/simcluster.py

Finally, if you're really ambitious and want to share your code with the rest of the world you can post it to PyPI, the Python Package Index (also known to old-timers as "the cheeseshop"). When your project is on PyPI then people don't even have to manually download and unpack the source distribution. They can just install your package with pip, the Python installer. You can just tell your users to run:

pip install simcluster

and pip will do the rest. This only works if your project has a setup.py that implements these standard interfaces. If you have an account on PyPI you can post your project to PyPI by running:

./setup.py register sdist upload

This is actually three commands in one: the register command tells PyPI you want to reserve a project name called "simcluster" and that its latest version is "0.1" (along with any other metadata you provided setup.py such as your name). Then sdist builds a source distribution archive as we already saw. Finally upload uploads the source archive to PyPI (this is separate from register which only posts the metadata).

You are now the proud owner of a Python package!

^Back to top

Bonus enhancements¶

Getting back to our code, let's see if there some other improvements we can make...

If you look carefully, you'll note also that I now have some repeated information--there are two copies of my default values for n_stars and the image dimensions: My simulated_cluster function has some default values for those arguments, and I also have a set of defaults that I gave to the ArgumentParser. Maybe I want these to have different defaults--there's nothing wrong with that. But I might also consider some way of adding this information in one place. There are many approaches one could take for this. One of the simplest might be a global variable listing defaults in one place:

In [36]:

%%file simcluster.py
"""Utilities for generating simulated astronomical images."""

import numpy as np
from astropy.convolution import Gaussian2DKernel, convolve_fft


CLUSTER_DEFAULTS = {
    'stars': 10000,
    'dimensions': (512, 512)
}


def simulated_cluster(n_stars=10000, dimensions=(512, 512)):
    """
    Generates an image simulating a cluster of stars, including
    a Gaussian filter and background noise.
    
    Parameters
    ----------
    n_stars : `int`
        A positive integer giving the number of visible stars in the image
        (default: 10000).
        
    dimensions : `tuple`
        A two-tuple of positive integers specifying the dimensions (in pixels)
        of the output image (default: 512x512).
        
    Returns
    -------
    array : `~numpy.ndarray`
        A 2D Numpy array containing the pixels of the generated image.
    """ 
        
    nx, ny = dimensions
    
    # Create empty image
    image = np.zeros((ny, nx))

    # Generate random positions
    r = np.random.random(n_stars) * nx
    theta = np.random.uniform(0., 2. * np.pi, n_stars)

    # Generate random fluxes
    fluxes = np.random.random(n_stars) ** 2

    # Compute position
    x = nx / 2 + r * np.cos(theta)
    y = ny / 2 + r * np.sin(theta)

    # Add stars to image
    # ==> First for loop and if statement <==
    for idx in range(n_stars):
        if x[idx] >= 0 and x[idx] < nx and y[idx] >= 0 and y[idx] < ny:
            image[y[idx], x[idx]] += fluxes[idx]

    # Convolve with a gaussian
    kernel = Gaussian2DKernel(stddev=1)
    image = convolve_fft(image, kernel)

    # Add noise
    image += np.random.normal(1., 0.001, image.shape)
    
    return image

Overwriting simcluster.py

In [37]:

%%file simcluster
#!/usr/bin/env python

import sys
import argparse

from astropy.io import fits
from simcluster import simulated_cluster, CLUSTER_DEFAULTS

def main(argv):
    """Main function for the simcluster.py script."""
    
    # Create an ArgumentParser--a special object that keeps track of all the
    # arguments we want our script to be able to handle, and then implements parsing
    # them from sys.argv
    parser = argparse.ArgumentParser(description="Generates simulated images of clusters")

    # Add an optional argument for # of stars (default=10000)
    parser.add_argument('-s', '--stars', type=int, default=CLUSTER_DEFAULTS['stars'],
                        help='the number of stars to generate')

    # Add the x and y arguments
    parser.add_argument('-x', type=int, default=CLUSTER_DEFAULTS['dimensions'][0],
                        help='the x dimension (in pixels) of the image')
    parser.add_argument('-y', type=int, default=CLUSTER_DEFAULTS['dimensions'][1],
                        help='the y dimension (in pixels) of the image')

    # Add an argument to handle the output file.  If we use argparse.FileType it will
    # handle opening a writeable file (and ensuring we can write to it).
    # Note that the argument name 'file' does not beging with a '-' or '--'; this indicates
    # to argparse that it is a *positional* argument
    parser.add_argument('file', type=argparse.FileType('w'),
                        help='the name of the output file')

    args = parser.parse_args(argv)

    image = simulated_cluster(n_stars=args.stars, dimensions=(args.x, args.y))

    # Write to a FITS file
    # For now the file writing is simple enough that we leave it in the main() function; in the future
    # we may want to break this out into its own routine that takes a filename and the image array
    # (and any other relevant options) and handles the writing
    fits.writeto(args.file, image, clobber=True)

if __name__ == '__main__':
    main(sys.argv[1:])

Overwriting simcluster

Another possible approach would have been to determine the default arguments to simulated_cluster directly from the function definition itself by using the inspect module from the Python standard library. This will be left as an exercise.

^Back to top