Notes:
To simulate shell scripts and terminal interaction, we preface every code cell with the "cell magic": %%bash
, which sends the code to bash instead of the Python interpreter. Another way to send code from IPython to the shell is to prefix a line with shell escape !
.
If you type cooler
at the command line with no arguments or with -h
or --help
you'll get the following quick reference of available subcommands.
%%bash
cooler -h
Usage: cooler [OPTIONS] COMMAND [ARGS]... Type -h or --help after any subcommand for more information. Options: -v, --verbose Verbose logging. -d, --debug On error, drop into the post-mortem debugger shell. -V, --version Show the version and exit. -h, --help Show this message and exit. Commands: cload Create a cooler from genomic pairs and bins. load Create a cooler from a pre-binned matrix. merge Merge multiple coolers with identical axes. coarsen Coarsen a cooler to a lower resolution. zoomify Generate a multi-resolution cooler file by coarsening. balance Out-of-core matrix balancing. info Display a cooler's info and metadata. dump Dump a cooler's data to a text stream. show Display and browse a cooler in matplotlib. makebins Generate fixed-width genomic bins. digest Generate fragment-delimited genomic bins. csort Sort and index a contact list. ls List all coolers inside a file. cp Copy a cooler from one file to another or within the same file. ln Create a hard link to a cooler (rather than a true copy) in the... mv Rename a cooler within the same file. tree Display a file's data hierarchy. attrs Display a file's attribute hierarchy.
For more information about a specific subcommand, type cooler <subcommand> -h
to display the help text.
%%bash
cooler info -h
Usage: cooler info [OPTIONS] COOL_PATH Display a cooler's info and metadata. COOL_PATH : Path to a COOL file or cooler URI. Options: -f, --field TEXT Print the value of a specific info field. -m, --metadata Print the user metadata in JSON format. -o, --out TEXT Output file (defaults to stdout) -h, --help Show this message and exit.
Let's try it.
%%bash
cooler info data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool
{ "bin-size": 1000000, "bin-type": "fixed", "creation-date": "2016-02-25T21:05:29.075865", "format-url": "https://github.com/mirnylab/cooler", "format-version": 2, "genome-assembly": "hg19", "id": null, "library-version": "0.3.0", "nbins": 3114, "nchroms": 25, "nnz": 4150156 }
%%bash
cooler info -f bin-size data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool
1000000
%%bash
cooler info -m data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool
{ "publication": "", "QC": { "double-sided": { "total": 3390352656, "valid": 3147639590, "filtered-invalid": { "removed-self-circles": 1741768, "removed-error-pair": 6074295, "removed-dangling-ends": 234897003 }, "filtered-valid": { "removed-duplicate": 110650005, "removed-start-near-rsite": "", "removed-outlier-fragment": 151337031, "removed-large-small-pair": 657466 } }, "pre-filtering": { "total": 5332721651, "double-sided": 3390352656, "unused": 0, "single-sided": 1942368995 }, "post-filtering": { "total": 2884995088, "cis": 2085711027, "trans": 799284061 } }, "enzyme": "MboI", "cell-type": "GM12878", "species": "Homo sapiens", "sex": "F" }
For more in-depth introspection into the HDF5 file structure, you can use cooler tree
to display the group and array hierarchy and cooler attrs
to display all attributes in the hierarchy. As a bonus, these commands work on any HDF5 file!
%%bash
cooler tree data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool
/ ├── bins │ ├── chrom (3114,) int32 │ ├── end (3114,) int64 │ ├── start (3114,) int64 │ └── weight (3114,) float64 ├── chroms │ ├── length (25,) int64 │ └── name (25,) |S32 ├── indexes │ ├── bin1_offset (3115,) int32 │ └── chrom_offset (26,) int32 └── pixels ├── bin1_id (4150156,) int64 ├── bin2_id (4150156,) int64 └── count (4150156,) int64
%%bash
cooler attrs data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool
'@attrs': bin-size: 1000000 bin-type: fixed creation-date: 2016-02-25 21:05:29.075865 format-url: https://github.com/mirnylab/cooler format-version: 2 genome-assembly: hg19 id: null library-version: 0.3.0 metadata: QC: double-sided: filtered-invalid: removed-dangling-ends: 234897003 removed-error-pair: 6074295 removed-self-circles: 1741768 filtered-valid: removed-duplicate: 110650005 removed-large-small-pair: 657466 removed-outlier-fragment: 151337031 removed-start-near-rsite: '' total: 3390352656 valid: 3147639590 post-filtering: cis: 2085711027 total: 2884995088 trans: 799284061 pre-filtering: double-sided: 3390352656 single-sided: 1942368995 total: 5332721651 unused: 0 cell-type: GM12878 enzyme: MboI publication: '' sex: F species: Homo sapiens nbins: 3114 nchroms: 25 nnz: 4150156 bins: '@attrs': {} chrom: '@attrs': {} end: '@attrs': {} start: '@attrs': {} weight: '@attrs': {} chroms: '@attrs': {} length: '@attrs': {} name: '@attrs': {} indexes: '@attrs': {} bin1_offset: '@attrs': {} chrom_offset: '@attrs': {} pixels: '@attrs': {} bin1_id: '@attrs': {} bin2_id: '@attrs': {} count: '@attrs': {}
cool
file¶To make a contact matrix, we need
For(1), we will start with a very small subsample of 100,000 read pairs from GSM1551552 (Rao et al, GM12878). The fields of the file are readID, strand1, chrom1, pos1, frag1, strand2, chrom2, pos2, frag2, mapq1, mapq2.
%%bash
zcat data/GSM1551552_HIC003_merged_nodups.txt.subset.gz | head
D260LACXX130602:2:2315:7361:72358 0 1 85378 186 16 1 591085 1097 0 0 D258GACXX130605:8:2316:2958:10584 0 1 728403 1418 16 1 719104 1406 20 10 C24LCACXX130513:8:2315:5697:82732 0 1 758309 1523 0 1 43498676 121266 68 120 D260LACXX130602:2:2202:16754:100485 16 1 890311 1801 16 1 993887 2056 165 18 D260LACXX130602:2:2309:8547:48542 0 1 925938 1866 16 1 1034493 2117 178 60 C24LCACXX130513:8:2312:14225:27548 0 1 941657 1904 0 1 1620964 3551 178 0 D258GACXX130605:8:2308:5276:50416 0 1 992230 2049 0 1 1255504 2556 175 178 C24LCACXX130513:8:1308:7340:36704 0 1 1070613 2191 16 1 1070941 2193 153 40 C24LCACXX130513:8:2104:5074:54778 16 1 1137145 2329 16 1 201625081 455535 144 178 C24LCACXX130513:8:1302:14464:56521 0 1 1170252 2381 16 1 1353433 2788 178 22
This data was mapped to the Broad's b37
assembly and uses ENSEMBL-style chromosome names (1..22
, X
, Y
, MT
) instead of the UCSC-style (chr1..chr22
, chrX
, chrY
, chrM
).
We provide a chromosome sizes file with the chromosomes we want to use in a desired order. Make sure the name style of the chromsizes file matches the name style of the pairs file! The following is the b37
chromosome sizes file with chromosomes in a "natural" semantic order, leaving out the unlocalized and unplaced scaffolds.
%%bash
cat data/b37.chrom.sizes
1 249250621 2 243199373 3 198022430 4 191154276 5 180915260 6 171115067 7 159138663 8 146364022 9 141213431 10 135534747 11 135006516 12 133851895 13 115169878 14 107349540 15 102531392 16 90354753 17 81195210 18 78077248 19 59128983 20 63025520 21 48129895 22 51304566 X 155270560 Y 59373566 MT 16569
We also need to decide how we want to bin the contacts. Usually, we choose a fixed bin size or "resolution". Another option for Hi-C data is to use restriction fragment-delimited genomic bins based on the restriction enzyme used in the experiment. cooler
allows for any binning scheme you like, as long as you provide it as a bin table. We can store a bin table in a simple BED file using the makebins
command.
%%bash
cooler makebins -h
Usage: cooler makebins [OPTIONS] CHROMSIZES_PATH BINSIZE Generate fixed-width genomic bins. Output a genome segmentation at a fixed resolution as a BED file. CHROMSIZES_PATH : UCSC-like chromsizes file, with chromosomes in desired order. BINSIZE : Resolution (bin size) in base pairs <int>. Options: -o, --out TEXT Output file (defaults to stdout) -H, --header Print the header of column names as the first row. [default: False] -i, --rel-ids [0|1] Include a column of relative bin IDs for each chromosome. Choose whether to report them as 0- or 1-based. -h, --help Show this message and exit.
If you have the FASTA sequence of the reference genome, you can also "digest" it to create a bin table of fragments.
%%bash
cooler digest -h
Usage: cooler digest [OPTIONS] CHROMSIZES_PATH FASTA_PATH ENZYME Generate fragment-delimited genomic bins. Output a genome segmentation of restriction fragments as a BED file. CHROMSIZES_PATH : UCSC-like chromsizes file, with chromosomes in desired order. FASTA_PATH : Genome assembly FASTA file or folder containing FASTA files (uncompressed). ENZYME : Name of restriction enzyme Options: -o, --out TEXT Output file (defaults to stdout) -H, --header Print the header of column names as the first row. [default: False] -i, --rel-ids [0|1] Include a column of relative bin IDs for each chromosome. Choose whether to report them as 0- or 1-based. -h, --help Show this message and exit.
%%bash
CHROMSIZES_FILE='data/b37.chrom.sizes'
cooler makebins --out "data/bins.1000kb.bed" $CHROMSIZES_FILE 1000000
# what's in the file?
head "data/bins.1000kb.bed"
1 0 1000000 1 1000000 2000000 1 2000000 3000000 1 3000000 4000000 1 4000000 5000000 1 5000000 6000000 1 6000000 7000000 1 7000000 8000000 1 8000000 9000000 1 9000000 10000000
There is a convenient syntax to specify a fixed-resolution bin table, so you rarely need to generate one manually:
<chromsizes_path>:<binsize-in-bp>
e.g. The bin table above can be specified as data/b37.chrom.sizes.reduced:1000000
.
%%bash
cooler cload pairs -h
Usage: cooler cload pairs [OPTIONS] BINS PAIRS_PATH COOL_PATH Bin any text file or stream of pairs. Pairs data need not be sorted. Accepts compressed files. To pipe input from stdin, set PAIRS_PATH to '-'. BINS : One of the following <TEXT:INTEGER> : 1. Path to a chromsizes file, 2. Bin size in bp <TEXT> : Path to BED file defining the genomic bin segmentation. PAIRS_PATH : Path to contacts (i.e. read pairs) file. COOL_PATH : Output COOL file path or URI. Options: --metadata TEXT Path to JSON file containing user metadata. --assembly TEXT Name of genome assembly (e.g. hg19, mm10) -c1, --chrom1 INTEGER chrom1 field number (one-based) [required] -p1, --pos1 INTEGER pos1 field number (one-based) [required] -c2, --chrom2 INTEGER chrom2 field number (one-based) [required] -p2, --pos2 INTEGER pos2 field number (one-based) [required] --chunksize INTEGER Number of input lines to load at a time -0, --zero-based Positions are zero-based [default: False] --comment-char TEXT Comment character that indicates lines to ignore. [default: #] -N, --no-symmetric-upper Create a complete square matrix without implicit symmetry. This allows for distinct upper- and lower-triangle values --input-copy-status [unique|duplex] Copy status of input data when using symmetric-upper storage. | `unique`: Incoming data comes from a unique half of a symmetric map, regardless of how the coordinates of a pair are ordered. `duplex`: Incoming data contains upper- and lower- triangle duplicates. All input records that map to the lower triangle will be discarded! | If you wish to treat lower- and upper- triangle input data as distinct, use the ``--no-symmetric-upper`` option. [default: unique] --field TEXT Specify quantitative input fields to aggregate into value columns using the syntax ``--field <field-name>=<field- number>``. Optionally, append ``:`` followed by ``dtype=<dtype>`` to specify the data type (e.g. float), and/or ``agg=<agg>`` to specify an aggregation function different from sum (e.g. mean). Field numbers are 1-based. Passing 'count' as the target name will override the default behavior of storing pair counts. Repeat the ``--field`` option for each additional field. --temp-dir DIRECTORY Create temporary files in a specified directory. Pass ``-`` to use the platform default temp dir. --no-delete-temp Do not delete temporary files when finished. --max-merge INTEGER Maximum number of chunks to merge before invoking recursive merging [default: 200] --storage-options TEXT Options to modify the data filter pipeline. Provide as a comma-separated list of key- value pairs of the form 'k1=v1,k2=v2,...'. See http://docs.h5py.org/en/stable/high/data set.html#filter-pipeline for more details. -h, --help Show this message and exit.
%%bash
# Note that the input pairs file happens to be space-delimited, so we convert to tab-delimited with `tr`.
CHROMSIZES_FILE='data/b37.chrom.sizes'
BINSIZE=1000000
PAIRS_FILE='data/GSM1551552_HIC003_merged_nodups.txt.subset.gz'
OUTPUT_FILE='data/test.cool'
zcat $PAIRS_FILE \
| tr ' ' '\t' \
| cooler cload pairs -c1 3 -p1 4 -c2 7 -p2 8 $CHROMSIZES_FILE:$BINSIZE - $OUTPUT_FILE
INFO:cooler.create:Writing chunk 0: /home/nezar/local/devel/cooler-binder/data/tmp32q3ijrr.multi.cool::0 WARNING:py.warnings:/home/nezar/miniconda3/lib/python3.7/site-packages/dask/dataframe/utils.py:15: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm INFO:cooler.create:Creating cooler at "/home/nezar/local/devel/cooler-binder/data/tmp32q3ijrr.multi.cool::/0" INFO:cooler.create:Writing chroms INFO:cooler.create:Writing bins INFO:cooler.create:Writing pixels INFO:cooler.create:Writing indexes INFO:cooler.create:Writing info INFO:cooler.create:Done INFO:cooler.create:Merging into data/test.cool INFO:cooler.create:Creating cooler at "data/test.cool::/" INFO:cooler.create:Writing chroms INFO:cooler.create:Writing bins INFO:cooler.create:Writing pixels INFO:cooler.reduce:nnzs: [46598] INFO:cooler.reduce:current: [46598] INFO:cooler.create:Writing indexes INFO:cooler.create:Writing info INFO:cooler.create:Done
There are benefits to sorting and indexing pairs. See below.
The cooler dump
command lets us print the data back out as text with several formatting and annotation options. It also accepts range queries, both intra- and inter-chromosomal.
%%bash
cooler dump -h
Usage: cooler dump [OPTIONS] COOL_PATH Dump a cooler's data to a text stream. COOL_PATH : Path to COOL file or cooler URI. Options: -t, --table [chroms|bins|pixels] Which table to dump. Choosing 'chroms' or 'bins' will cause all pixel-related options to be ignored. Note that for coolers stored in symmetric-upper mode, 'pixels' only holds the upper triangle values of the matrix. [default: pixels] -c, --columns SEPARATED[,] Restrict output to a subset of columns, provided as a comma-separated list. -H, --header Print the header of column names as the first row. [default: False] --na-rep TEXT Missing data representation. Default is empty ''. --float-format TEXT Format string for floating point numbers (e.g. '.12g', '03.2f'). [default: g] -r, --range TEXT The coordinates of a genomic region shown along the row dimension, in UCSC-style notation. (Example: chr1:10,000,000-11,000,000). If omitted, the entire contact matrix is printed. -r2, --range2 TEXT The coordinates of a genomic region shown along the column dimension. If omitted, the column range is the same as the row range. -m, --matrix For coolers stored in symmetric-upper mode, ensure any empty areas of the genomic query window are populated by generating the lower-triangular pixels. [default: False] -b, --balanced / --no-balance Apply balancing weights to data. This will print an extra column called `balanced` [default: False] --join Print the full chromosome bin coordinates instead of bin IDs. This will replace the `bin1_id` column with `chrom1`, `start1`, and `end1`, and the `bin2_id` column with `chrom2`, `start2` and `end2`. [default: False] --annotate SEPARATED[,] Join additional columns from the bin table against the pixels. Provide a comma separated list of column names (no spaces). The merged columns will be suffixed by '1' and '2' accordingly. --one-based-ids Print bin IDs as one-based rather than zero- based. --one-based-starts Print start coordinates as one-based rather than zero-based. -k, --chunksize INTEGER Sets the amount of pixel data loaded from disk at one time. Can affect the performance of joins on high resolution datasets. Default is to load as many rows as there are bins. -o, --out TEXT Output text file If .gz extension is detected, file is written using zlib. Default behavior is to stream to stdout. -h, --help Show this message and exit.
%%bash
cooler dump -t chroms data/test.cool
1 249250621 2 243199373 3 198022430 4 191154276 5 180915260 6 171115067 7 159138663 8 146364022 9 141213431 10 135534747 11 135006516 12 133851895 13 115169878 14 107349540 15 102531392 16 90354753 17 81195210 18 78077248 19 59128983 20 63025520 21 48129895 22 51304566 X 155270560 Y 59373566 MT 16569
%%bash
cooler dump -t bins data/test.cool | head
1 0 1000000 1 1000000 2000000 1 2000000 3000000 1 3000000 4000000 1 4000000 5000000 1 5000000 6000000 1 6000000 7000000 1 7000000 8000000 1 8000000 9000000 1 9000000 10000000
%%bash
cooler dump -t pixels --header data/test.cool | head
bin1_id bin2_id count 0 0 3 0 1 4 0 43 1 0 155 1 0 229 1 0 437 1 0 492 1 0 493 1 0 666 1
%%bash
cooler dump -t pixels --header --join data/test.cool | head
chrom1 start1 end1 chrom2 start2 end2 count 1 0 1000000 1 0 1000000 3 1 0 1000000 1 1000000 2000000 4 1 0 1000000 1 43000000 44000000 1 1 0 1000000 1 155000000 156000000 1 1 0 1000000 1 229000000 230000000 1 1 0 1000000 2 187000000 188000000 1 1 0 1000000 2 242000000 243000000 1 1 0 1000000 2 243000000 243199373 1 1 0 1000000 3 172000000 173000000 1
%%bash
cooler dump -t pixels -r 10:10,000,000-20,000,000 -r2 10:30,000,000-80,000,000 --header --join data/test.cool | head
chrom1 start1 end1 chrom2 start2 end2 count 10 10000000 11000000 10 70000000 71000000 1 10 11000000 12000000 10 30000000 31000000 1 10 11000000 12000000 10 35000000 36000000 1 10 11000000 12000000 10 42000000 43000000 2 10 11000000 12000000 10 43000000 44000000 1 10 11000000 12000000 10 52000000 53000000 1 10 11000000 12000000 10 58000000 59000000 1 10 12000000 13000000 10 30000000 31000000 1 10 12000000 13000000 10 45000000 46000000 1
%%bash
cooler dump -t pixels --header --balanced data/test.cool | head
Balancing weights not found
Oops! Our contact matrix isn't balanced yet. Let's do that next.
Matrix balancing normalization, i.e. iterative correction.
We usually normalize or "correct" Hi-C using a technique called matrix balancing. This involves finding a set of weights or biases $b_i$ for each bin $i$ such that
$$ Normalized[i,j] = Observed[i,j] \cdot b[i] \cdot b[j], $$such that the marginals (i.e., row/column sums) of the global contact matrix are flat and equal.
cooler balance
will store the pre-computed balancing weights in the bin table as an extra column called weight
.
Note that whole-genome matrix balancing on a high resolution matrix requires iterative computations on a matrix that may not fit in computer memory, even in sparse form. Our "out-of-core" method performs the calculations by splitting and loading the data into smaller chunks and combining the partial results afterwards.
%%bash
cooler balance -h
Usage: cooler balance [OPTIONS] COOL_PATH Out-of-core matrix balancing. Matrix must be symmetric. See the help for various filtering options to mask out poorly mapped bins. COOL_PATH : Path to a COOL file. Options: -p, --nproc INTEGER Number of processes to split the work between. [default: 8] -c, --chunksize INTEGER Control the number of pixels handled by each worker process at a time. [default: 10000000] --mad-max INTEGER Ignore bins from the contact matrix using the 'MAD-max' filter: bins whose log marginal sum is less than ``mad-max`` median absolute deviations below the median log marginal sum of all the bins in the same chromosome. [default: 5] --min-nnz INTEGER Ignore bins from the contact matrix whose marginal number of nonzeros is less than this number. [default: 10] --min-count INTEGER Ignore bins from the contact matrix whose marginal count is less than this number. [default: 0] --blacklist PATH Path to a 3-column BED file containing genomic regions to mask out during the balancing procedure, e.g. sequence gaps or regions of poor mappability. --ignore-diags INTEGER Number of diagonals of the contact matrix to ignore, including the main diagonal. Examples: 0 ignores nothing, 1 ignores the main diagonal, 2 ignores diagonals (-1, 0, 1), etc. [default: 2] --ignore-dist INTEGER Distance in bp to ignore. --tol FLOAT Threshold value of variance of the marginals for the algorithm to converge. [default: 1e-05] --max-iters INTEGER Maximum number of iterations to perform if convergence is not achieved. [default: 200] --cis-only Calculate weights against intra-chromosomal data only instead of genome-wide. --trans-only Calculate weights against inter-chromosomal data only instead of genome-wide. --name TEXT Name of column to write to. [default: weight] -f, --force Overwrite the target dataset, 'weight', if it already exists. --check Check whether a data column 'weight' already exists. --stdout Print weight column to stdout instead of saving to file. --convergence-policy [store_final|store_nan|discard|error] What to do with weights when balancing doesn't converge in max_iters. 'store_final': Store the final result, regardless of whether the iterations converge to the specified tolerance; 'store_nan': Store a vector of NaN values to indicate that the matrix failed to converge; 'discard': Store nothing and exit gracefully; 'error': Abort with non-zero exit status. [default: store_final] -h, --help Show this message and exit.
cooler balance
iterates until the balanced marginals (i.e. row sums of the balanced matrix) are sufficiently flat (the variance falls below the limit tol
).
%%bash
cooler balance -p 10 -c 10000 data/test.cool
INFO:cooler.cli.balance:Balancing "data/test.cool" INFO:cooler.balance:variance is 108.70508940352958 INFO:cooler.balance:variance is 11.544325707535195 INFO:cooler.balance:variance is 2.6406181035180114 INFO:cooler.balance:variance is 0.7820569377009834 INFO:cooler.balance:variance is 0.2430786648548748 INFO:cooler.balance:variance is 0.08214976966884285 INFO:cooler.balance:variance is 0.0281252239555167 INFO:cooler.balance:variance is 0.01007790709086638 INFO:cooler.balance:variance is 0.003641581079865616 INFO:cooler.balance:variance is 0.0013517640780139287 INFO:cooler.balance:variance is 0.0005055339361931239 INFO:cooler.balance:variance is 0.00019213135598238062 INFO:cooler.balance:variance is 7.348487864296598e-05 INFO:cooler.balance:variance is 2.838381218829392e-05 INFO:cooler.balance:variance is 1.1018544001826925e-05 INFO:cooler.balance:variance is 4.304041056888539e-06
%%bash
cooler dump --header --balanced data/test.cool | head
bin1_id bin2_id count balanced 0 0 3 0.0619576 0 1 4 0.0917378 0 43 1 0.0246198 0 155 1 0.0223535 0 229 1 0.0292275 0 437 1 0.0364777 0 492 1 0.0238998 0 493 1 0.0843047 0 666 1 0.023308
cooler balance
will also attach some metadata to the bins/weight
array as attributes, which you can inspect:
%%bash
cooler attrs data/test.cool::bins
'@attrs': {} chrom: '@attrs': {} end: '@attrs': {} start: '@attrs': {} weight: '@attrs': cis_only: false converged: true ignore_diags: 2 mad_max: 5 min_count: 0 min_nnz: 10 scale: 31.707717089629107 tol: 1.0e-05 var: 4.304041056888539e-06
You can also use the cooler show
function to produce images of the contact matrix. Requires the matplotlib
Python package.
%%bash
cooler show -h
Usage: cooler show [OPTIONS] COOL_PATH RANGE Display and browse a cooler in matplotlib. COOL_PATH : Path to a COOL file or Cooler URI. RANGE : The coordinates of the genomic region to display, in UCSC notation. Example: chr1:10,000,000-11,000,000 Options: -r2, --range2 TEXT The coordinates of a genomic region shown along the column dimension. If omitted, the column range is the same as the row range. Use to display asymmetric matrices or trans interactions. -b, --balanced Show the balanced contact matrix. If not provided, display the unbalanced counts. -o, --out TEXT Save the image of the contact matrix to a file. If not specified, the matrix is displayed in an interactive window. The figure format is deduced from the extension of the file, the supported formats are png, jpg, svg, pdf, ps and eps. --dpi INTEGER The DPI of the figure, if saving to a file -s, --scale [linear|log2|log10] Scale transformation of the colormap: linear, log2 or log10. Default is log10. -f, --force Force display very large matrices (>=10^8 pixels). Use at your own risk as it may cause performance issues. --zmin FLOAT The minimal value of the color scale. Units must match those of the colormap scale. To provide a negative value use a equal sign and quotes, e.g. -zmin='-0.5' --zmax FLOAT The maximal value of the color scale. Units must match those of the colormap scale. To provide a negative value use a equal sign and quotes, e.g. -zmax='-0.5' --cmap TEXT The colormap used to display the contact matrix. See the full list at http://matplotl ib.org/examples/color/colormaps_reference.ht ml --field TEXT Pixel values to display. [default: count] -h, --help Show this message and exit.
Here's the undersampled dataset.
%%bash
cooler show --out data/test.png --dpi 200 data/test.cool 3:0-80,000,000
WARNING:py.warnings:/home/nezar/miniconda3/lib/python3.7/site-packages/cooler/cli/show.py:30: RuntimeWarning: divide by zero encountered in log10 mat = np.log10(mat)
from IPython.display import Image
Image('data/test.png')
Here's what the full one looks like.
%%bash
cooler show --out data/test2.png --dpi 200 data/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool chr3:0-80,000,000
from IPython.display import Image
Image('data/test2.png')
Alternatively, you can sort, format and index your pairs file before ingesting. Having an indexed pairs file can also be useful of other kinds of read-level analyses.
We use the cooler csort
command to do this. What does it do?
Given a chromosome order, it creates a new pairs file with the following properties:
pos1
is always less than or equal to pos2
. As a result, the data will have an upper triangular orientation with respect to the chromosome order (interpreting sides 1 and 2 as i
and j
axes in a matrix coordinate system)..px2
index file which facilitates 2-dimensional queries on the reads.Notes:
cooler csort
. See this example!conda install --yes -c bioconda pairix
Collecting package metadata (current_repodata.json): done Solving environment: done # All requested packages already installed.
!cooler csort -h
Usage: cooler csort [OPTIONS] PAIRS_PATH CHROMOSOMES_PATH Sort and index a contact list. Order the mates of each pair record so that all contacts are upper triangular with respect to the chromosome ordering given by the chromosomes file, sort contacts by genomic location, and index the resulting file. PAIRS_PATH : Contacts (i.e. read pairs) text file, optionally compressed. CHROMOSOMES_PATH : File listing desired chromosomes in the desired order. May be tab-delimited, e.g. a UCSC-style chromsizes file. Contacts mapping to other chromosomes will be discarded. **Notes** - csort can also be used to sort and index a text representation of a contact *matrix* in bedGraph-like format. In this case, substitute `pos1` and `pos2` with `start1` and `start2`, respectively. - Requires Unix tools: sort, bgzip + tabix or pairix. If indexing with Tabix, the output file will have the following properties: - Upper triangular: the read pairs on each row are assigned to side 1 or 2 in such a way that (chrom1, pos1) is always "less than" (chrom2, pos2) - Rows are lexicographically sorted by chrom1, pos1, chrom2, pos2; i.e. "positionally sorted" - Compressed with bgzip [*] - Indexed using Tabix [*] on chrom1 and pos1. If indexing with Pairix, the output file will have the following properties: - Upper triangular: the read pairs on each row are assigned to side 1 or 2 in such a way that (chrom1, pos1) is always "less than" (chrom2, pos2) - Rows are lexicographically sorted by chrom1, chrom2, pos1, pos2; i.e. "block sorted" - Compressed with bgzip [*] - Indexed using Pairix [+] on chrom1, chrom2 and pos1. [*] Tabix manpage: <http://www.htslib.org/doc/tabix.html>. [+] Pairix on Github: <https://github.com/4dn-dcic/pairix> Options: -c1, --chrom1 INTEGER chrom1 field number in the input file (starting from 1) [required] -c2, --chrom2 INTEGER chrom2 field number [required] -p1, --pos1 INTEGER pos1 field number [required] -p2, --pos2 INTEGER pos2 field number [required] -i, --index [tabix|pairix] Select the preset sort and indexing options [default: pairix] --flip-only Only flip mates; no sorting or indexing. Write to stdout. [default: False] -p, --nproc INTEGER Number of processors [default: 8] -0, --zero-based Read positions are zero-based [default: False] --sep TEXT Data delimiter in the input file [default: \t] --comment-char TEXT Comment character to skip header [default: #] --sort-options TEXT Quoted list of additional options to `sort` command -o, --out TEXT Output gzip file -s1, --strand1 INTEGER strand1 field number (deprecated) -s2, --strand2 INTEGER strand2 field number (deprecated) -h, --help Show this message and exit.
%%bash
# Note that the input pairs file happens to be space-delimited, which we specify
# with the --sep argument (tab is assumed by default).
# The output pairs file will always be tab-delimited!
CHROMSIZES_FILE='data/b37.chrom.sizes'
PAIRS_FILE='data/GSM1551552_HIC003_merged_nodups.txt.subset.gz'
cooler csort -c1 3 -p1 4 -c2 7 -p2 8 --sep ' ' --out data/pairs.sorted.txt.gz $PAIRS_FILE $CHROMSIZES_FILE
INFO:cooler.cli.csort:Enumerating requested chromosomes... INFO:cooler.cli.csort:1 1 INFO:cooler.cli.csort:2 2 INFO:cooler.cli.csort:3 3 INFO:cooler.cli.csort:4 4 INFO:cooler.cli.csort:5 5 INFO:cooler.cli.csort:6 6 INFO:cooler.cli.csort:7 7 INFO:cooler.cli.csort:8 8 INFO:cooler.cli.csort:9 9 INFO:cooler.cli.csort:10 10 INFO:cooler.cli.csort:11 11 INFO:cooler.cli.csort:12 12 INFO:cooler.cli.csort:13 13 INFO:cooler.cli.csort:14 14 INFO:cooler.cli.csort:15 15 INFO:cooler.cli.csort:16 16 INFO:cooler.cli.csort:17 17 INFO:cooler.cli.csort:18 18 INFO:cooler.cli.csort:19 19 INFO:cooler.cli.csort:20 20 INFO:cooler.cli.csort:21 21 INFO:cooler.cli.csort:22 22 INFO:cooler.cli.csort:X 23 INFO:cooler.cli.csort:Y 24 INFO:cooler.cli.csort:MT 25 INFO:cooler.cli.csort:Input: 'data/GSM1551552_HIC003_merged_nodups.txt.subset.gz' INFO:cooler.cli.csort:Output: 'data/pairs.sorted.txt.gz' INFO:cooler.cli.csort:Reordering pair mates and sorting pair records... INFO:cooler.cli.csort:Sort order: block (chrom1, chrom2, pos1, pos2) INFO:cooler.cli.csort:sort -k3,3 -k7,7 -k4,4n -k8,8n --parallel=8 --buffer-size=50% INFO:cooler.cli.csort:Indexing... INFO:cooler.cli.csort:Indexer: pairix INFO:cooler.cli.csort:pairix -f -s3 -d7 -b4 -e4 -u8 -v8 data/pairs.sorted.txt.gz
%%bash
# What's in the output?
zcat data/pairs.sorted.txt.gz | head
D260LACXX130602:2:2315:7361:72358 0 1 85378 186 16 1 591085 1097 0 0 D258GACXX130605:8:2316:2958:10584 0 1 719104 1418 16 1 728403 1406 20 10 C24LCACXX130513:8:2315:5697:82732 0 1 758309 1523 0 1 43498676 121266 68 120 D260LACXX130602:2:2209:16327:23744 0 1 784407 527093 16 1 229918735 1593 88 40 D260LACXX130602:2:2202:16754:100485 16 1 890311 1801 16 1 993887 2056 165 18 D260LACXX130602:2:2309:8547:48542 0 1 925938 1866 16 1 1034493 2117 178 60 C24LCACXX130513:8:2312:14225:27548 0 1 941657 1904 0 1 1620964 3551 178 0 D258GACXX130605:8:1201:10692:67155 0 1 949416 342780 16 1 155311666 1935 178 178 C24LCACXX130513:8:2307:2896:10307 0 1 951652 3944 16 1 1751062 1946 146 175 D258GACXX130605:8:2308:5276:50416 0 1 992230 2049 0 1 1255504 2556 175 178
Finally, using cooler cload pairix
, we aggregate (bin) the contacts in pairs.sorted.txt.gz
against the bins file, bins.1000kb.bed
, and write the contents to the binary test.cool
file.
A Pairix-indexed file has the advantage of 2D querying. However, it uses a slightly different sorting convention:
chrom1
, pos1
, chrom2
, pos2
, Pairix files are sorted by chrom1
, chrom2
, pos1
, pos2
.%%bash
cooler cload pairix -h
Usage: cooler cload pairix [OPTIONS] BINS PAIRS_PATH COOL_PATH Bin a pairix-indexed contact list file. BINS : One of the following <TEXT:INTEGER> : 1. Path to a chromsizes file, 2. Bin size in bp <TEXT> : Path to BED file defining the genomic bin segmentation. PAIRS_PATH : Path to contacts (i.e. read pairs) file. COOL_PATH : Output COOL file path or URI. See also: 'cooler csort' to sort and index a contact list file Pairix on GitHub: <https://github.com/4dn-dcic/pairix>. Options: --metadata TEXT Path to JSON file containing user metadata. --assembly TEXT Name of genome assembly (e.g. hg19, mm10) -p, --nproc INTEGER Number of processes to split the work between. [default: 8] -0, --zero-based Positions are zero-based [default: False] -s, --max-split INTEGER Divide the pairs from each chromosome into at most this many chunks. Smaller chromosomes will be split less frequently or not at all. Increase ths value if large chromosomes dominate the workload on multiple processors. [default: 2] -h, --help Show this message and exit.
%%bash
# alternatively, we could pass $CHROMSIZES_FILE:1000000 below instead of creating $BINS_FILE
BINS_FILE='data/bins.1000kb.bed'
INDEXED_PAIRS_FILE='data/pairs.sorted.txt.gz'
OUTPUT_FILE='data/test.cool'
cooler cload pairix $BINS_FILE $INDEXED_PAIRS_FILE $OUTPUT_FILE
INFO:cooler.cli.cload:Using 8 cores WARNING:py.warnings:/home/nezar/miniconda3/lib/python3.7/site-packages/dask/dataframe/utils.py:15: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm INFO:cooler.create:Creating cooler at "data/test.cool::/" INFO:cooler.create:Writing chroms INFO:cooler.create:Writing bins INFO:cooler.create:Writing pixels INFO:cooler.create:Binning 1:0-125000000|* INFO:cooler.create:Binning 1:125000000-249250621|* INFO:cooler.create:Binning 2:0-129000000|* INFO:cooler.create:Binning 2:129000000-243199373|* INFO:cooler.create:Binning 3:0-158000000|* INFO:cooler.create:Binning 3:158000000-198022430|* INFO:cooler.create:Binning 4:0-163000000|* INFO:cooler.create:Binning 4:163000000-191154276|* INFO:cooler.create:Finished 3:158000000-198022430|* INFO:cooler.create:Binning 5:0-173000000|* INFO:cooler.create:Finished 4:163000000-191154276|* INFO:cooler.create:Binning 5:173000000-180915260|* INFO:cooler.create:Finished 5:173000000-180915260|* INFO:cooler.create:Binning 6:0-171115067|* INFO:cooler.create:Finished 1:125000000-249250621|* INFO:cooler.create:Binning 7:0-159138663|* INFO:cooler.create:Finished 2:129000000-243199373|* INFO:cooler.create:Binning 8:0-146364022|* INFO:cooler.create:Finished 2:0-129000000|* INFO:cooler.create:Binning 9:0-141213431|* INFO:cooler.create:Finished 1:0-125000000|* INFO:cooler.create:Binning 10:0-135534747|* INFO:cooler.create:Finished 4:0-163000000|* INFO:cooler.create:Binning 11:0-135006516|* INFO:cooler.create:Finished 3:0-158000000|* INFO:cooler.create:Binning 12:0-133851895|* INFO:cooler.create:Finished 5:0-173000000|* INFO:cooler.create:Binning 13:0-115169878|* INFO:cooler.create:Finished 6:0-171115067|* INFO:cooler.create:Binning 14:0-107349540|* INFO:cooler.create:Finished 9:0-141213431|* INFO:cooler.create:Finished 8:0-146364022|* INFO:cooler.create:Binning 15:0-102531392|* INFO:cooler.create:Binning 16:0-90354753|* INFO:cooler.create:Finished 7:0-159138663|* INFO:cooler.create:Binning 17:0-81195210|* INFO:cooler.create:Finished 10:0-135534747|* INFO:cooler.create:Binning 18:0-78077248|* INFO:cooler.create:Finished 11:0-135006516|* INFO:cooler.create:Binning 19:0-59128983|* INFO:cooler.create:Finished 12:0-133851895|* INFO:cooler.create:Binning 20:0-63025520|* INFO:cooler.create:Finished 13:0-115169878|* INFO:cooler.create:Binning 21:0-48129895|* INFO:cooler.create:Finished 14:0-107349540|* INFO:cooler.create:Binning 22:0-51304566|* INFO:cooler.create:Finished 22:0-51304566|* INFO:cooler.create:Finished 21:0-48129895|* INFO:cooler.create:Binning X:0-155270560|* INFO:cooler.create:Binning Y:0-59373566|* INFO:cooler.create:Finished 19:0-59128983|* INFO:cooler.create:Binning MT:0-16569|* INFO:cooler.create:Finished 18:0-78077248|* INFO:cooler.create:Finished 17:0-81195210|* INFO:cooler.create:Finished 16:0-90354753|* INFO:cooler.create:Finished Y:0-59373566|* INFO:cooler.create:Finished MT:0-16569|* INFO:cooler.create:Finished 20:0-63025520|* INFO:cooler.create:Finished 15:0-102531392|* INFO:cooler.create:Finished X:0-155270560|* INFO:cooler.create:Writing indexes INFO:cooler.create:Writing info INFO:cooler.create:Done