Notebook

1 Cooler command line interface

Notes:

To simulate shell scripts and terminal interaction, we preface every code cell with the "cell magic": %%bash, which sends the code to bash instead of the Python interpreter. Another way to send code from IPython to the shell is to prefix a line with shell escape !.

Cooler command line interface¶

If you type cooler at the command line with no arguments or with -h or --help you'll get the following quick reference of available subcommands.

In [1]:

%%bash

cooler -h

Usage: cooler [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging.
  -d, --debug    On error, drop into the post-mortem debugger shell.
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  cload     Create a cooler from genomic pairs and bins.
  load      Create a cooler from a pre-binned matrix.
  merge     Merge multiple coolers with identical axes.
  coarsen   Coarsen a cooler to a lower resolution.
  zoomify   Generate a multi-resolution cooler file by coarsening.
  balance   Out-of-core matrix balancing.
  info      Display a cooler's info and metadata.
  dump      Dump a cooler's data to a text stream.
  show      Display and browse a cooler in matplotlib.
  makebins  Generate fixed-width genomic bins.
  digest    Generate fragment-delimited genomic bins.
  csort     Sort and index a contact list.
  ls        List all coolers inside a file.
  cp        Copy a cooler from one file to another or within the same file.
  ln        Create a hard link to a cooler (rather than a true copy) in the...
  mv        Rename a cooler within the same file.
  tree      Display a file's data hierarchy.
  attrs     Display a file's attribute hierarchy.

For more information about a specific subcommand, type cooler <subcommand> -h to display the help text.

In [2]:

%%bash

cooler info -h

Usage: cooler info [OPTIONS] COOL_PATH

  Display a cooler's info and metadata.

  COOL_PATH : Path to a COOL file or cooler URI.

Options:
  -f, --field TEXT  Print the value of a specific info field.
  -m, --metadata    Print the user metadata in JSON format.
  -o, --out TEXT    Output file (defaults to stdout)
  -h, --help        Show this message and exit.

Example¶

Let's try it.

In [3]:

%%bash

cooler info data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

{
    "bin-size": 1000000,
    "bin-type": "fixed",
    "creation-date": "2016-02-25T21:05:29.075865",
    "format-url": "https://github.com/mirnylab/cooler",
    "format-version": 2,
    "genome-assembly": "hg19",
    "id": null,
    "library-version": "0.3.0",
    "nbins": 3114,
    "nchroms": 25,
    "nnz": 4150156
}

In [4]:

%%bash

cooler info -f bin-size data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

In [5]:

%%bash

cooler info -m data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

{
    "publication": "",
    "QC": {
        "double-sided": {
            "total": 3390352656,
            "valid": 3147639590,
            "filtered-invalid": {
                "removed-self-circles": 1741768,
                "removed-error-pair": 6074295,
                "removed-dangling-ends": 234897003
            },
            "filtered-valid": {
                "removed-duplicate": 110650005,
                "removed-start-near-rsite": "",
                "removed-outlier-fragment": 151337031,
                "removed-large-small-pair": 657466
            }
        },
        "pre-filtering": {
            "total": 5332721651,
            "double-sided": 3390352656,
            "unused": 0,
            "single-sided": 1942368995
        },
        "post-filtering": {
            "total": 2884995088,
            "cis": 2085711027,
            "trans": 799284061
        }
    },
    "enzyme": "MboI",
    "cell-type": "GM12878",
    "species": "Homo sapiens",
    "sex": "F"
}

For more in-depth introspection into the HDF5 file structure, you can use cooler tree to display the group and array hierarchy and cooler attrs to display all attributes in the hierarchy. As a bonus, these commands work on any HDF5 file!

In [6]:

%%bash

cooler tree data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

/
 ├── bins
 │   ├── chrom (3114,) int32
 │   ├── end (3114,) int64
 │   ├── start (3114,) int64
 │   └── weight (3114,) float64
 ├── chroms
 │   ├── length (25,) int64
 │   └── name (25,) |S32
 ├── indexes
 │   ├── bin1_offset (3115,) int32
 │   └── chrom_offset (26,) int32
 └── pixels
     ├── bin1_id (4150156,) int64
     ├── bin2_id (4150156,) int64
     └── count (4150156,) int64

In [7]:

%%bash

cooler attrs data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool

'@attrs':
  bin-size: 1000000
  bin-type: fixed
  creation-date: 2016-02-25 21:05:29.075865
  format-url: https://github.com/mirnylab/cooler
  format-version: 2
  genome-assembly: hg19
  id: null
  library-version: 0.3.0
  metadata:
    QC:
      double-sided:
        filtered-invalid:
          removed-dangling-ends: 234897003
          removed-error-pair: 6074295
          removed-self-circles: 1741768
        filtered-valid:
          removed-duplicate: 110650005
          removed-large-small-pair: 657466
          removed-outlier-fragment: 151337031
          removed-start-near-rsite: ''
        total: 3390352656
        valid: 3147639590
      post-filtering:
        cis: 2085711027
        total: 2884995088
        trans: 799284061
      pre-filtering:
        double-sided: 3390352656
        single-sided: 1942368995
        total: 5332721651
        unused: 0
    cell-type: GM12878
    enzyme: MboI
    publication: ''
    sex: F
    species: Homo sapiens
  nbins: 3114
  nchroms: 25
  nnz: 4150156
bins:
  '@attrs': {}
  chrom:
    '@attrs': {}
  end:
    '@attrs': {}
  start:
    '@attrs': {}
  weight:
    '@attrs': {}
chroms:
  '@attrs': {}
  length:
    '@attrs': {}
  name:
    '@attrs': {}
indexes:
  '@attrs': {}
  bin1_offset:
    '@attrs': {}
  chrom_offset:
    '@attrs': {}
pixels:
  '@attrs': {}
  bin1_id:
    '@attrs': {}
  bin2_id:
    '@attrs': {}
  count:
    '@attrs': {}

Aggregate a list of read pairs into a `cool` file¶

To make a contact matrix, we need

A list of read pairs representing captured contacts.
A segmentation of the genome into bins by which we aggregate (bin) the read pair counts.

For(1), we will start with a very small subsample of 100,000 read pairs from GSM1551552 (Rao et al, GM12878). The fields of the file are readID, strand1, chrom1, pos1, frag1, strand2, chrom2, pos2, frag2, mapq1, mapq2.

In [8]:

%%bash

zcat data/GSM1551552_HIC003_merged_nodups.txt.subset.gz | head

D260LACXX130602:2:2315:7361:72358 0 1 85378 186 16 1 591085 1097 0 0
D258GACXX130605:8:2316:2958:10584 0 1 728403 1418 16 1 719104 1406 20 10
C24LCACXX130513:8:2315:5697:82732 0 1 758309 1523 0 1 43498676 121266 68 120
D260LACXX130602:2:2202:16754:100485 16 1 890311 1801 16 1 993887 2056 165 18
D260LACXX130602:2:2309:8547:48542 0 1 925938 1866 16 1 1034493 2117 178 60
C24LCACXX130513:8:2312:14225:27548 0 1 941657 1904 0 1 1620964 3551 178 0
D258GACXX130605:8:2308:5276:50416 0 1 992230 2049 0 1 1255504 2556 175 178
C24LCACXX130513:8:1308:7340:36704 0 1 1070613 2191 16 1 1070941 2193 153 40
C24LCACXX130513:8:2104:5074:54778 16 1 1137145 2329 16 1 201625081 455535 144 178
C24LCACXX130513:8:1302:14464:56521 0 1 1170252 2381 16 1 1353433 2788 178 22

This data was mapped to the Broad's b37 assembly and uses ENSEMBL-style chromosome names (1..22, X, Y, MT) instead of the UCSC-style (chr1..chr22, chrX, chrY, chrM).

We provide a chromosome sizes file with the chromosomes we want to use in a desired order. Make sure the name style of the chromsizes file matches the name style of the pairs file! The following is the b37 chromosome sizes file with chromosomes in a "natural" semantic order, leaving out the unlocalized and unplaced scaffolds.

In [9]:

%%bash

cat data/b37.chrom.sizes

1	249250621
2	243199373
3	198022430
4	191154276
5	180915260
6	171115067
7	159138663
8	146364022
9	141213431
10	135534747
11	135006516
12	133851895
13	115169878
14	107349540
15	102531392
16	90354753
17	81195210
18	78077248
19	59128983
20	63025520
21	48129895
22	51304566
X	155270560
Y	59373566
MT	16569

We also need to decide how we want to bin the contacts. Usually, we choose a fixed bin size or "resolution". Another option for Hi-C data is to use restriction fragment-delimited genomic bins based on the restriction enzyme used in the experiment. cooler allows for any binning scheme you like, as long as you provide it as a bin table. We can store a bin table in a simple BED file using the makebins command.

In [10]:

%%bash

cooler makebins -h

Usage: cooler makebins [OPTIONS] CHROMSIZES_PATH BINSIZE

  Generate fixed-width genomic bins.

  Output a genome segmentation at a fixed resolution as a BED file.

  CHROMSIZES_PATH : UCSC-like chromsizes file, with chromosomes in desired
  order.

  BINSIZE : Resolution (bin size) in base pairs <int>.

Options:
  -o, --out TEXT       Output file (defaults to stdout)
  -H, --header         Print the header of column names as the first row.
                       [default: False]
  -i, --rel-ids [0|1]  Include a column of relative bin IDs for each
                       chromosome. Choose whether to report them as 0- or
                       1-based.
  -h, --help           Show this message and exit.

If you have the FASTA sequence of the reference genome, you can also "digest" it to create a bin table of fragments.

In [11]:

%%bash

cooler digest -h

Usage: cooler digest [OPTIONS] CHROMSIZES_PATH FASTA_PATH ENZYME

  Generate fragment-delimited genomic bins.

  Output a genome segmentation of restriction fragments as a BED file.

  CHROMSIZES_PATH : UCSC-like chromsizes file, with chromosomes in desired
  order.

  FASTA_PATH : Genome assembly FASTA file or folder containing FASTA files
  (uncompressed).

  ENZYME : Name of restriction enzyme

Options:
  -o, --out TEXT       Output file (defaults to stdout)
  -H, --header         Print the header of column names as the first row.
                       [default: False]
  -i, --rel-ids [0|1]  Include a column of relative bin IDs for each
                       chromosome. Choose whether to report them as 0- or
                       1-based.
  -h, --help           Show this message and exit.

In [12]:

%%bash

CHROMSIZES_FILE='data/b37.chrom.sizes'

cooler makebins --out "data/bins.1000kb.bed" $CHROMSIZES_FILE 1000000

# what's in the file?
head "data/bins.1000kb.bed"

1	0	1000000
1	1000000	2000000
1	2000000	3000000
1	3000000	4000000
1	4000000	5000000
1	5000000	6000000
1	6000000	7000000
1	7000000	8000000
1	8000000	9000000
1	9000000	10000000

Note¶

There is a convenient syntax to specify a fixed-resolution bin table, so you rarely need to generate one manually:

<chromsizes_path>:<binsize-in-bp>

e.g. The bin table above can be specified as data/b37.chrom.sizes.reduced:1000000.

In [13]:

%%bash

cooler cload pairs -h

Usage: cooler cload pairs [OPTIONS] BINS PAIRS_PATH COOL_PATH

  Bin any text file or stream of pairs.

  Pairs data need not be sorted. Accepts compressed files. To pipe input
  from stdin, set PAIRS_PATH to '-'.

  BINS : One of the following

      <TEXT:INTEGER> : 1. Path to a chromsizes file, 2. Bin size in bp

      <TEXT> : Path to BED file defining the genomic bin segmentation.

  PAIRS_PATH : Path to contacts (i.e. read pairs) file.

  COOL_PATH : Output COOL file path or URI.

Options:
  --metadata TEXT                 Path to JSON file containing user metadata.
  --assembly TEXT                 Name of genome assembly (e.g. hg19, mm10)
  -c1, --chrom1 INTEGER           chrom1 field number (one-based)  [required]
  -p1, --pos1 INTEGER             pos1 field number (one-based)  [required]
  -c2, --chrom2 INTEGER           chrom2 field number (one-based)  [required]
  -p2, --pos2 INTEGER             pos2 field number (one-based)  [required]
  --chunksize INTEGER             Number of input lines to load at a time
  -0, --zero-based                Positions are zero-based  [default: False]
  --comment-char TEXT             Comment character that indicates lines to
                                  ignore.  [default: #]
  -N, --no-symmetric-upper        Create a complete square matrix without
                                  implicit symmetry. This allows for distinct
                                  upper- and lower-triangle values
  --input-copy-status [unique|duplex]
                                  Copy status of input data when using
                                  symmetric-upper storage. | `unique`:
                                  Incoming data comes from a unique half of a
                                  symmetric map, regardless of how the
                                  coordinates of a pair are ordered. `duplex`:
                                  Incoming data contains upper- and lower-
                                  triangle duplicates. All input records that
                                  map to the lower triangle will be discarded!
                                  | If you wish to treat lower- and upper-
                                  triangle input data as distinct, use the
                                  ``--no-symmetric-upper`` option.   [default:
                                  unique]
  --field TEXT                    Specify quantitative input fields to
                                  aggregate into value columns using the
                                  syntax ``--field <field-name>=<field-
                                  number>``. Optionally, append ``:`` followed
                                  by ``dtype=<dtype>`` to specify the data
                                  type (e.g. float), and/or ``agg=<agg>`` to
                                  specify an aggregation function different
                                  from sum (e.g. mean). Field numbers are
                                  1-based. Passing 'count' as the target name
                                  will override the default behavior of
                                  storing pair counts. Repeat the ``--field``
                                  option for each additional field.
  --temp-dir DIRECTORY            Create temporary files in a specified
                                  directory. Pass ``-`` to use the platform
                                  default temp dir.
  --no-delete-temp                Do not delete temporary files when finished.
  --max-merge INTEGER             Maximum number of chunks to merge before
                                  invoking recursive merging  [default: 200]
  --storage-options TEXT          Options to modify the data filter pipeline.
                                  Provide as a comma-separated list of key-
                                  value pairs of the form 'k1=v1,k2=v2,...'.
                                  See http://docs.h5py.org/en/stable/high/data
                                  set.html#filter-pipeline for more details.
  -h, --help                      Show this message and exit.

In [14]:

%%bash
# Note that the input pairs file happens to be space-delimited, so we convert to tab-delimited with `tr`.
CHROMSIZES_FILE='data/b37.chrom.sizes'
BINSIZE=1000000
PAIRS_FILE='data/GSM1551552_HIC003_merged_nodups.txt.subset.gz'
OUTPUT_FILE='data/test.cool'

zcat $PAIRS_FILE \
    | tr ' ' '\t' \
    | cooler cload pairs -c1 3 -p1 4 -c2 7 -p2 8 $CHROMSIZES_FILE:$BINSIZE - $OUTPUT_FILE

INFO:cooler.create:Writing chunk 0: /home/nezar/local/devel/cooler-binder/data/tmp32q3ijrr.multi.cool::0
WARNING:py.warnings:/home/nezar/miniconda3/lib/python3.7/site-packages/dask/dataframe/utils.py:15: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

INFO:cooler.create:Creating cooler at "/home/nezar/local/devel/cooler-binder/data/tmp32q3ijrr.multi.cool::/0"
INFO:cooler.create:Writing chroms
INFO:cooler.create:Writing bins
INFO:cooler.create:Writing pixels
INFO:cooler.create:Writing indexes
INFO:cooler.create:Writing info
INFO:cooler.create:Done
INFO:cooler.create:Merging into data/test.cool
INFO:cooler.create:Creating cooler at "data/test.cool::/"
INFO:cooler.create:Writing chroms
INFO:cooler.create:Writing bins
INFO:cooler.create:Writing pixels
INFO:cooler.reduce:nnzs: [46598]
INFO:cooler.reduce:current: [46598]
INFO:cooler.create:Writing indexes
INFO:cooler.create:Writing info
INFO:cooler.create:Done

There are benefits to sorting and indexing pairs. See below.

Text export¶

The cooler dump command lets us print the data back out as text with several formatting and annotation options. It also accepts range queries, both intra- and inter-chromosomal.

In [15]:

%%bash

cooler dump -h

Usage: cooler dump [OPTIONS] COOL_PATH

  Dump a cooler's data to a text stream.

  COOL_PATH : Path to COOL file or cooler URI.

Options:
  -t, --table [chroms|bins|pixels]
                                  Which table to dump. Choosing 'chroms' or
                                  'bins' will cause all pixel-related options
                                  to be ignored. Note that for coolers stored
                                  in symmetric-upper mode, 'pixels' only holds
                                  the upper triangle values of the matrix.
                                  [default: pixels]
  -c, --columns SEPARATED[,]      Restrict output to a subset of columns,
                                  provided as a comma-separated list.
  -H, --header                    Print the header of column names as the
                                  first row.  [default: False]
  --na-rep TEXT                   Missing data representation. Default is
                                  empty ''.
  --float-format TEXT             Format string for floating point numbers
                                  (e.g. '.12g', '03.2f').  [default: g]
  -r, --range TEXT                The coordinates of a genomic region shown
                                  along the row dimension, in UCSC-style
                                  notation. (Example:
                                  chr1:10,000,000-11,000,000). If omitted, the
                                  entire contact matrix is printed.
  -r2, --range2 TEXT              The coordinates of a genomic region shown
                                  along the column dimension. If omitted, the
                                  column range is the same as the row range.
  -m, --matrix                    For coolers stored in symmetric-upper mode,
                                  ensure any empty areas of the genomic query
                                  window are populated by generating the
                                  lower-triangular pixels.  [default: False]
  -b, --balanced / --no-balance   Apply balancing weights to data. This will
                                  print an extra column called `balanced`
                                  [default: False]
  --join                          Print the full chromosome bin coordinates
                                  instead of bin IDs. This will replace the
                                  `bin1_id` column with `chrom1`, `start1`,
                                  and `end1`, and the `bin2_id` column with
                                  `chrom2`, `start2` and `end2`.  [default:
                                  False]
  --annotate SEPARATED[,]         Join additional columns from the bin table
                                  against the pixels. Provide a comma
                                  separated list of column names (no spaces).
                                  The merged columns will be suffixed by '1'
                                  and '2' accordingly.
  --one-based-ids                 Print bin IDs as one-based rather than zero-
                                  based.
  --one-based-starts              Print start coordinates as one-based rather
                                  than zero-based.
  -k, --chunksize INTEGER         Sets the amount of pixel data loaded from
                                  disk at one time. Can affect the performance
                                  of joins on high resolution datasets.
                                  Default is to load as many rows as there are
                                  bins.
  -o, --out TEXT                  Output text file If .gz extension is
                                  detected, file is written using zlib.
                                  Default behavior is to stream to stdout.
  -h, --help                      Show this message and exit.

In [16]:

%%bash

cooler dump -t chroms data/test.cool

1	249250621
2	243199373
3	198022430
4	191154276
5	180915260
6	171115067
7	159138663
8	146364022
9	141213431
10	135534747
11	135006516
12	133851895
13	115169878
14	107349540
15	102531392
16	90354753
17	81195210
18	78077248
19	59128983
20	63025520
21	48129895
22	51304566
X	155270560
Y	59373566
MT	16569

In [17]:

%%bash

cooler dump -t bins data/test.cool | head

1	0	1000000
1	1000000	2000000
1	2000000	3000000
1	3000000	4000000
1	4000000	5000000
1	5000000	6000000
1	6000000	7000000
1	7000000	8000000
1	8000000	9000000
1	9000000	10000000

In [18]:

%%bash

cooler dump -t pixels --header data/test.cool | head

bin1_id	bin2_id	count
0	0	3
0	1	4
0	43	1
0	155	1
0	229	1
0	437	1
0	492	1
0	493	1
0	666	1

In [19]:

%%bash

cooler dump -t pixels --header --join data/test.cool | head

chrom1	start1	end1	chrom2	start2	end2	count
1	0	1000000	1	0	1000000	3
1	0	1000000	1	1000000	2000000	4
1	0	1000000	1	43000000	44000000	1
1	0	1000000	1	155000000	156000000	1
1	0	1000000	1	229000000	230000000	1
1	0	1000000	2	187000000	188000000	1
1	0	1000000	2	242000000	243000000	1
1	0	1000000	2	243000000	243199373	1
1	0	1000000	3	172000000	173000000	1

In [20]:

%%bash

cooler dump -t pixels -r 10:10,000,000-20,000,000 -r2 10:30,000,000-80,000,000 --header --join data/test.cool | head

chrom1	start1	end1	chrom2	start2	end2	count
10	10000000	11000000	10	70000000	71000000	1
10	11000000	12000000	10	30000000	31000000	1
10	11000000	12000000	10	35000000	36000000	1
10	11000000	12000000	10	42000000	43000000	2
10	11000000	12000000	10	43000000	44000000	1
10	11000000	12000000	10	52000000	53000000	1
10	11000000	12000000	10	58000000	59000000	1
10	12000000	13000000	10	30000000	31000000	1
10	12000000	13000000	10	45000000	46000000	1

In [21]:

%%bash

cooler dump -t pixels --header --balanced data/test.cool | head

Balancing weights not found

Oops! Our contact matrix isn't balanced yet. Let's do that next.

Balancing¶

Matrix balancing normalization, i.e. iterative correction.

We usually normalize or "correct" Hi-C using a technique called matrix balancing. This involves finding a set of weights or biases $b_i$ for each bin $i$ such that

$$ Normalized[i,j] = Observed[i,j] \cdot b[i] \cdot b[j], $$

such that the marginals (i.e., row/column sums) of the global contact matrix are flat and equal.

cooler balance will store the pre-computed balancing weights in the bin table as an extra column called weight.

Note that whole-genome matrix balancing on a high resolution matrix requires iterative computations on a matrix that may not fit in computer memory, even in sparse form. Our "out-of-core" method performs the calculations by splitting and loading the data into smaller chunks and combining the partial results afterwards.

In [22]:

%%bash

cooler balance -h

Usage: cooler balance [OPTIONS] COOL_PATH

  Out-of-core matrix balancing.

  Matrix must be symmetric. See the help for various filtering options to
  mask out poorly mapped bins.

  COOL_PATH : Path to a COOL file.

Options:
  -p, --nproc INTEGER             Number of processes to split the work
                                  between.  [default: 8]
  -c, --chunksize INTEGER         Control the number of pixels handled by each
                                  worker process at a time.  [default:
                                  10000000]
  --mad-max INTEGER               Ignore bins from the contact matrix using
                                  the 'MAD-max' filter: bins whose log
                                  marginal sum is less than ``mad-max`` median
                                  absolute deviations below the median log
                                  marginal sum of all the bins in the same
                                  chromosome.  [default: 5]
  --min-nnz INTEGER               Ignore bins from the contact matrix whose
                                  marginal number of nonzeros is less than
                                  this number.  [default: 10]
  --min-count INTEGER             Ignore bins from the contact matrix whose
                                  marginal count is less than this number.
                                  [default: 0]
  --blacklist PATH                Path to a 3-column BED file containing
                                  genomic regions to mask out during the
                                  balancing procedure, e.g. sequence gaps or
                                  regions of poor mappability.
  --ignore-diags INTEGER          Number of diagonals of the contact matrix to
                                  ignore, including the main diagonal.
                                  Examples: 0 ignores nothing, 1 ignores the
                                  main diagonal, 2 ignores diagonals (-1, 0,
                                  1), etc.  [default: 2]
  --ignore-dist INTEGER           Distance in bp to ignore.
  --tol FLOAT                     Threshold value of variance of the marginals
                                  for the algorithm to converge.  [default:
                                  1e-05]
  --max-iters INTEGER             Maximum number of iterations to perform if
                                  convergence is not achieved.  [default: 200]
  --cis-only                      Calculate weights against intra-chromosomal
                                  data only instead of genome-wide.
  --trans-only                    Calculate weights against inter-chromosomal
                                  data only instead of genome-wide.
  --name TEXT                     Name of column to write to.  [default:
                                  weight]
  -f, --force                     Overwrite the target dataset, 'weight', if
                                  it already exists.
  --check                         Check whether a data column 'weight' already
                                  exists.
  --stdout                        Print weight column to stdout instead of
                                  saving to file.
  --convergence-policy [store_final|store_nan|discard|error]
                                  What to do with weights when balancing
                                  doesn't converge in max_iters.
                                  'store_final': Store the final result,
                                  regardless of whether the iterations
                                  converge to the specified tolerance;
                                  'store_nan': Store a vector of NaN values to
                                  indicate that the matrix failed to converge;
                                  'discard': Store nothing and exit
                                  gracefully; 'error': Abort with non-zero
                                  exit status.  [default: store_final]
  -h, --help                      Show this message and exit.

cooler balance iterates until the balanced marginals (i.e. row sums of the balanced matrix) are sufficiently flat (the variance falls below the limit tol).

In [23]:

%%bash

cooler balance -p 10 -c 10000 data/test.cool

INFO:cooler.cli.balance:Balancing "data/test.cool"
INFO:cooler.balance:variance is 108.70508940352958
INFO:cooler.balance:variance is 11.544325707535195
INFO:cooler.balance:variance is 2.6406181035180114
INFO:cooler.balance:variance is 0.7820569377009834
INFO:cooler.balance:variance is 0.2430786648548748
INFO:cooler.balance:variance is 0.08214976966884285
INFO:cooler.balance:variance is 0.0281252239555167
INFO:cooler.balance:variance is 0.01007790709086638
INFO:cooler.balance:variance is 0.003641581079865616
INFO:cooler.balance:variance is 0.0013517640780139287
INFO:cooler.balance:variance is 0.0005055339361931239
INFO:cooler.balance:variance is 0.00019213135598238062
INFO:cooler.balance:variance is 7.348487864296598e-05
INFO:cooler.balance:variance is 2.838381218829392e-05
INFO:cooler.balance:variance is 1.1018544001826925e-05
INFO:cooler.balance:variance is 4.304041056888539e-06

In [24]:

%%bash

cooler dump --header --balanced data/test.cool | head

bin1_id	bin2_id	count	balanced
0	0	3	0.0619576
0	1	4	0.0917378
0	43	1	0.0246198
0	155	1	0.0223535
0	229	1	0.0292275
0	437	1	0.0364777
0	492	1	0.0238998
0	493	1	0.0843047
0	666	1	0.023308

cooler balance will also attach some metadata to the bins/weight array as attributes, which you can inspect:

In [25]:

%%bash

cooler attrs data/test.cool::bins

'@attrs': {}
chrom:
  '@attrs': {}
end:
  '@attrs': {}
start:
  '@attrs': {}
weight:
  '@attrs':
    cis_only: false
    converged: true
    ignore_diags: 2
    mad_max: 5
    min_count: 0
    min_nnz: 10
    scale: 31.707717089629107
    tol: 1.0e-05
    var: 4.304041056888539e-06

Display the contact matrix¶

You can also use the cooler show function to produce images of the contact matrix. Requires the matplotlib Python package.

In [26]:

%%bash

cooler show -h

Usage: cooler show [OPTIONS] COOL_PATH RANGE

  Display and browse a cooler in matplotlib.

  COOL_PATH : Path to a COOL file or Cooler URI.

  RANGE : The coordinates of the genomic region to display, in UCSC
  notation. Example: chr1:10,000,000-11,000,000

Options:
  -r2, --range2 TEXT              The coordinates of a genomic region shown
                                  along the column dimension. If omitted, the
                                  column range is the same as the row range.
                                  Use to display asymmetric matrices or trans
                                  interactions.
  -b, --balanced                  Show the balanced contact matrix. If not
                                  provided, display the unbalanced counts.
  -o, --out TEXT                  Save the image of the contact matrix to a
                                  file. If not specified, the matrix is
                                  displayed in an interactive window. The
                                  figure format is deduced from the extension
                                  of the file, the supported formats are png,
                                  jpg, svg, pdf, ps and eps.
  --dpi INTEGER                   The DPI of the figure, if saving to a file
  -s, --scale [linear|log2|log10]
                                  Scale transformation of the colormap:
                                  linear, log2 or log10. Default is log10.
  -f, --force                     Force display very large matrices (>=10^8
                                  pixels). Use at your own risk as it may
                                  cause performance issues.
  --zmin FLOAT                    The minimal value of the color scale. Units
                                  must match those of the colormap scale. To
                                  provide a negative value use a equal sign
                                  and quotes, e.g. -zmin='-0.5'
  --zmax FLOAT                    The maximal value of the color scale. Units
                                  must match those of the colormap scale. To
                                  provide a negative value use a equal sign
                                  and quotes, e.g. -zmax='-0.5'
  --cmap TEXT                     The colormap used to display the contact
                                  matrix. See the full list at http://matplotl
                                  ib.org/examples/color/colormaps_reference.ht
                                  ml
  --field TEXT                    Pixel values to display.  [default: count]
  -h, --help                      Show this message and exit.

Here's the undersampled dataset.

In [27]:

%%bash

cooler show --out data/test.png --dpi 200 data/test.cool 3:0-80,000,000

WARNING:py.warnings:/home/nezar/miniconda3/lib/python3.7/site-packages/cooler/cli/show.py:30: RuntimeWarning: divide by zero encountered in log10
  mat = np.log10(mat)

In [28]:

from IPython.display import Image
Image('data/test.png')

Out[28]:

Here's what the full one looks like.

In [29]:

%%bash

cooler show --out data/test2.png --dpi 200 data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool chr3:0-80,000,000

In [30]:

from IPython.display import Image
Image('data/test2.png')

Out[30]:

Optional: Load a pairs file indexed with Pairix ¶

Alternatively, you can sort, format and index your pairs file before ingesting. Having an indexed pairs file can also be useful of other kinds of read-level analyses.

We use the cooler csort command to do this. What does it do?

Given a chromosome order, it creates a new pairs file with the following properties:

Consistently ordered mates: mates of every interchromsomal pair are properly "flipped" in order to respect the requested order of the chromosomes. For intrachromosomal pairs, mates are flipped such that pos1 is always less than or equal to pos2. As a result, the data will have an upper triangular orientation with respect to the chromosome order (interpreting sides 1 and 2 as i and j axes in a matrix coordinate system).
Sorted: once the mates are oriented, the pair records are lexically sorted by chrom1, chrom2, pos1, pos2. With (1) and (2), contacts are said to be sorted by chromosome-chromosome block, a.k.a. "block" sorted.
Indexed: we use bgzip to compress the file and Pairix to index it. This creates a small .px2 index file which facilitates 2-dimensional queries on the reads.

Notes:

If (1) is already satisfied, you can also prepare a Pairix-indexed file manually, without cooler csort. See this example

In [31]:

!conda install --yes -c bioconda pairix

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

In [32]:

!cooler csort -h

Usage: cooler csort [OPTIONS] PAIRS_PATH CHROMOSOMES_PATH

  Sort and index a contact list.

  Order the mates of each pair record so that all contacts are upper
  triangular with respect to the chromosome ordering given by the
  chromosomes file, sort contacts by genomic location, and index the
  resulting file.

  PAIRS_PATH : Contacts (i.e. read pairs) text file, optionally compressed.

  CHROMOSOMES_PATH : File listing desired chromosomes in the desired order.
  May be tab-delimited, e.g. a UCSC-style chromsizes file. Contacts mapping
  to other chromosomes will be discarded.

  **Notes**

  - csort can also be used to sort and index a text representation of
    a contact *matrix* in bedGraph-like format. In this case, substitute
    `pos1` and `pos2` with `start1` and `start2`, respectively.
  - Requires Unix tools: sort, bgzip + tabix or pairix.

  If indexing with Tabix, the output file will have the following
  properties:

  - Upper triangular: the read pairs on each row are assigned to side 1 or 2
    in such a way that (chrom1, pos1) is always "less than" (chrom2, pos2)
  - Rows are lexicographically sorted by chrom1, pos1, chrom2, pos2;
    i.e. "positionally sorted"
  - Compressed with bgzip [*]
  - Indexed using Tabix [*] on chrom1 and pos1.

  If indexing with Pairix, the output file will have the following
  properties:

  - Upper triangular: the read pairs on each row are assigned to side 1 or 2
    in such a way that (chrom1, pos1) is always "less than" (chrom2, pos2)
  - Rows are lexicographically sorted by chrom1, chrom2, pos1, pos2; i.e.
    "block sorted"
  - Compressed with bgzip [*]
  - Indexed using Pairix [+] on chrom1, chrom2 and pos1.

  [*] Tabix manpage: <http://www.htslib.org/doc/tabix.html>.
  [+] Pairix on Github: <https://github.com/4dn-dcic/pairix>

Options:
  -c1, --chrom1 INTEGER       chrom1 field number in the input file (starting
                              from 1)  [required]
  -c2, --chrom2 INTEGER       chrom2 field number  [required]
  -p1, --pos1 INTEGER         pos1 field number  [required]
  -p2, --pos2 INTEGER         pos2 field number  [required]
  -i, --index [tabix|pairix]  Select the preset sort and indexing options
                              [default: pairix]
  --flip-only                 Only flip mates; no sorting or indexing. Write
                              to stdout.  [default: False]
  -p, --nproc INTEGER         Number of processors  [default: 8]
  -0, --zero-based            Read positions are zero-based  [default: False]
  --sep TEXT                  Data delimiter in the input file  [default: \t]
  --comment-char TEXT         Comment character to skip header  [default: #]
  --sort-options TEXT         Quoted list of additional options to `sort`
                              command
  -o, --out TEXT              Output gzip file
  -s1, --strand1 INTEGER      strand1 field number (deprecated)
  -s2, --strand2 INTEGER      strand2 field number (deprecated)
  -h, --help                  Show this message and exit.

In [33]:

%%bash
# Note that the input pairs file happens to be space-delimited, which we specify 
# with the --sep argument  (tab is assumed by default).
# The output pairs file will always be tab-delimited!

CHROMSIZES_FILE='data/b37.chrom.sizes'
PAIRS_FILE='data/GSM1551552_HIC003_merged_nodups.txt.subset.gz'

cooler csort -c1 3 -p1 4 -c2 7 -p2 8 --sep ' ' --out data/pairs.sorted.txt.gz $PAIRS_FILE $CHROMSIZES_FILE

INFO:cooler.cli.csort:Enumerating requested chromosomes...
INFO:cooler.cli.csort:1	1
INFO:cooler.cli.csort:2	2
INFO:cooler.cli.csort:3	3
INFO:cooler.cli.csort:4	4
INFO:cooler.cli.csort:5	5
INFO:cooler.cli.csort:6	6
INFO:cooler.cli.csort:7	7
INFO:cooler.cli.csort:8	8
INFO:cooler.cli.csort:9	9
INFO:cooler.cli.csort:10	10
INFO:cooler.cli.csort:11	11
INFO:cooler.cli.csort:12	12
INFO:cooler.cli.csort:13	13
INFO:cooler.cli.csort:14	14
INFO:cooler.cli.csort:15	15
INFO:cooler.cli.csort:16	16
INFO:cooler.cli.csort:17	17
INFO:cooler.cli.csort:18	18
INFO:cooler.cli.csort:19	19
INFO:cooler.cli.csort:20	20
INFO:cooler.cli.csort:21	21
INFO:cooler.cli.csort:22	22
INFO:cooler.cli.csort:X	23
INFO:cooler.cli.csort:Y	24
INFO:cooler.cli.csort:MT	25
INFO:cooler.cli.csort:Input: 'data/GSM1551552_HIC003_merged_nodups.txt.subset.gz'
INFO:cooler.cli.csort:Output: 'data/pairs.sorted.txt.gz'
INFO:cooler.cli.csort:Reordering pair mates and sorting pair records...
INFO:cooler.cli.csort:Sort order: block (chrom1, chrom2, pos1, pos2)
INFO:cooler.cli.csort:sort -k3,3 -k7,7 -k4,4n -k8,8n --parallel=8 --buffer-size=50%
INFO:cooler.cli.csort:Indexing...
INFO:cooler.cli.csort:Indexer: pairix
INFO:cooler.cli.csort:pairix -f -s3 -d7 -b4 -e4 -u8 -v8 data/pairs.sorted.txt.gz

In [34]:

%%bash

# What's in the output?
zcat data/pairs.sorted.txt.gz | head

D260LACXX130602:2:2315:7361:72358	0	1	85378	186	16	1	591085	1097	0	0
D258GACXX130605:8:2316:2958:10584	0	1	719104	1418	16	1	728403	1406	20	10
C24LCACXX130513:8:2315:5697:82732	0	1	758309	1523	0	1	43498676	121266	68	120
D260LACXX130602:2:2209:16327:23744	0	1	784407	527093	16	1	229918735	1593	88	40
D260LACXX130602:2:2202:16754:100485	16	1	890311	1801	16	1	993887	2056	165	18
D260LACXX130602:2:2309:8547:48542	0	1	925938	1866	16	1	1034493	2117	178	60
C24LCACXX130513:8:2312:14225:27548	0	1	941657	1904	0	1	1620964	3551	178	0
D258GACXX130605:8:1201:10692:67155	0	1	949416	342780	16	1	155311666	1935	178	178
C24LCACXX130513:8:2307:2896:10307	0	1	951652	3944	16	1	1751062	1946	146	175
D258GACXX130605:8:2308:5276:50416	0	1	992230	2049	0	1	1255504	2556	175	178

Finally, using cooler cload pairix, we aggregate (bin) the contacts in pairs.sorted.txt.gz against the bins file, bins.1000kb.bed, and write the contents to the binary test.cool file.

A Pairix-indexed file has the advantage of 2D querying. However, it uses a slightly different sorting convention:

Like the previous Tabix scheme, interchromosomal pairs in Pairix files should consistently respect some order of the chromosomes (i.e. be "upper triangular"). Unlike the previous scheme, the chromosome order used to create the pairs file can be arbitrary, and does not need to match the order you wish to use in the cooler file.
Unlike the previous Tabix scheme, where the file is sorted by chrom1, pos1, chrom2, pos2, Pairix files are sorted by chrom1, chrom2, pos1, pos2.

In [35]:

%%bash

cooler cload pairix -h

Usage: cooler cload pairix [OPTIONS] BINS PAIRS_PATH COOL_PATH

  Bin a pairix-indexed contact list file.

  BINS : One of the following

      <TEXT:INTEGER> : 1. Path to a chromsizes file, 2. Bin size in bp

      <TEXT> : Path to BED file defining the genomic bin segmentation.

  PAIRS_PATH : Path to contacts (i.e. read pairs) file.

  COOL_PATH : Output COOL file path or URI.

  See also: 'cooler csort' to sort and index a contact list file

  Pairix on GitHub: <https://github.com/4dn-dcic/pairix>.

Options:
  --metadata TEXT          Path to JSON file containing user metadata.
  --assembly TEXT          Name of genome assembly (e.g. hg19, mm10)
  -p, --nproc INTEGER      Number of processes to split the work between.
                           [default: 8]
  -0, --zero-based         Positions are zero-based  [default: False]
  -s, --max-split INTEGER  Divide the pairs from each chromosome into at most
                           this many chunks. Smaller chromosomes will be split
                           less frequently or not at all. Increase ths value
                           if large chromosomes dominate the workload on
                           multiple processors.  [default: 2]
  -h, --help               Show this message and exit.

In [36]:

%%bash

# alternatively, we could pass $CHROMSIZES_FILE:1000000 below instead of creating $BINS_FILE
BINS_FILE='data/bins.1000kb.bed'
INDEXED_PAIRS_FILE='data/pairs.sorted.txt.gz'
OUTPUT_FILE='data/test.cool'

cooler cload pairix $BINS_FILE $INDEXED_PAIRS_FILE $OUTPUT_FILE

INFO:cooler.cli.cload:Using 8 cores
WARNING:py.warnings:/home/nezar/miniconda3/lib/python3.7/site-packages/dask/dataframe/utils.py:15: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

INFO:cooler.create:Creating cooler at "data/test.cool::/"
INFO:cooler.create:Writing chroms
INFO:cooler.create:Writing bins
INFO:cooler.create:Writing pixels
INFO:cooler.create:Binning 1:0-125000000|*
INFO:cooler.create:Binning 1:125000000-249250621|*
INFO:cooler.create:Binning 2:0-129000000|*
INFO:cooler.create:Binning 2:129000000-243199373|*
INFO:cooler.create:Binning 3:0-158000000|*
INFO:cooler.create:Binning 3:158000000-198022430|*
INFO:cooler.create:Binning 4:0-163000000|*
INFO:cooler.create:Binning 4:163000000-191154276|*
INFO:cooler.create:Finished 3:158000000-198022430|*
INFO:cooler.create:Binning 5:0-173000000|*
INFO:cooler.create:Finished 4:163000000-191154276|*
INFO:cooler.create:Binning 5:173000000-180915260|*
INFO:cooler.create:Finished 5:173000000-180915260|*
INFO:cooler.create:Binning 6:0-171115067|*
INFO:cooler.create:Finished 1:125000000-249250621|*
INFO:cooler.create:Binning 7:0-159138663|*
INFO:cooler.create:Finished 2:129000000-243199373|*
INFO:cooler.create:Binning 8:0-146364022|*
INFO:cooler.create:Finished 2:0-129000000|*
INFO:cooler.create:Binning 9:0-141213431|*
INFO:cooler.create:Finished 1:0-125000000|*
INFO:cooler.create:Binning 10:0-135534747|*
INFO:cooler.create:Finished 4:0-163000000|*
INFO:cooler.create:Binning 11:0-135006516|*
INFO:cooler.create:Finished 3:0-158000000|*
INFO:cooler.create:Binning 12:0-133851895|*
INFO:cooler.create:Finished 5:0-173000000|*
INFO:cooler.create:Binning 13:0-115169878|*
INFO:cooler.create:Finished 6:0-171115067|*
INFO:cooler.create:Binning 14:0-107349540|*
INFO:cooler.create:Finished 9:0-141213431|*
INFO:cooler.create:Finished 8:0-146364022|*
INFO:cooler.create:Binning 15:0-102531392|*
INFO:cooler.create:Binning 16:0-90354753|*
INFO:cooler.create:Finished 7:0-159138663|*
INFO:cooler.create:Binning 17:0-81195210|*
INFO:cooler.create:Finished 10:0-135534747|*
INFO:cooler.create:Binning 18:0-78077248|*
INFO:cooler.create:Finished 11:0-135006516|*
INFO:cooler.create:Binning 19:0-59128983|*
INFO:cooler.create:Finished 12:0-133851895|*
INFO:cooler.create:Binning 20:0-63025520|*
INFO:cooler.create:Finished 13:0-115169878|*
INFO:cooler.create:Binning 21:0-48129895|*
INFO:cooler.create:Finished 14:0-107349540|*
INFO:cooler.create:Binning 22:0-51304566|*
INFO:cooler.create:Finished 22:0-51304566|*
INFO:cooler.create:Finished 21:0-48129895|*
INFO:cooler.create:Binning X:0-155270560|*
INFO:cooler.create:Binning Y:0-59373566|*
INFO:cooler.create:Finished 19:0-59128983|*
INFO:cooler.create:Binning MT:0-16569|*
INFO:cooler.create:Finished 18:0-78077248|*
INFO:cooler.create:Finished 17:0-81195210|*
INFO:cooler.create:Finished 16:0-90354753|*
INFO:cooler.create:Finished Y:0-59373566|*
INFO:cooler.create:Finished MT:0-16569|*
INFO:cooler.create:Finished 20:0-63025520|*
INFO:cooler.create:Finished 15:0-102531392|*
INFO:cooler.create:Finished X:0-155270560|*
INFO:cooler.create:Writing indexes
INFO:cooler.create:Writing info
INFO:cooler.create:Done

In [ ]:

Table of Contents