-
Notifications
You must be signed in to change notification settings - Fork 10
Getting Started with CCMgen and CCMpredPy
Find an example alignment taken from the PSICOV supplementary data in examples/1atzA.fas
. In general, you can run CCMpredPy on any FASTA-formatted multiple sequence alignment although results will be better the more sequences you have per column in the alignment.
Copy over examples/1atzA.fas
and examples/1atzA.pdb
to a new directory data/
which will be our working directory.
First, we'll use CCMpredPy to learn evolutionary couplings characteristic for our example protein family by maximizing the pseudo-likelihood of the Markov Random Field (MRF) model.
A contact map can be computed from the coupling coefficients of the MRF using the standard L2 norm score. Typically, corrections are applied to this matrix to remove entropy and phylogenetic bias.
ccmpred data/1atzA.fas --ofn-pll\
--plot-opt-progress data/1atzA.log.html \
-m data/1atzA.noapc.mat \
--apc data/1atzA.apc.mat \
--entropy-correction data/1atzA.ec.mat \
CCMpredPy output will show as follows:
┏━╸┏━╸┏┳┓┏━┓┏━┓┏━╸╺┳┓ version 1.0.0
┃ ┃ ┃┃┃┣━┛┣┳┛┣╸ ┃┃ Vorberg, Seemayer and Soeding (2018)
┗━╸┗━╸╹ ╹╹ ╹┗╸┗━╸╺┻┛ https://github.com/soedinglab/ccmgen
Using 1 threads for OMP parallelization.
1atzA is of length L=75 and there are 3068 sequences in the alignment.
Alignment has diversity [sqrt(N)/L]=0.739 and Neff(HHsuite-like)=5.492.
Number of effective sequences after simple reweighting (id-threshold=0.8, ignore_gaps=False): 1149.44.
Calculating AA Frequencies with 0.00087 percent pseudocounts (uniform_pseudocounts 1 1)
L₂ regularization (λsingle=10 λpairfactor=0.2 λpair=0.2)
Plot with optimization statistics will be written to data/1atzA.log.html
Will optimize 3781600 float64 variables wrt PLL
and L₂ regularization (λsingle=10 λpairfactor=0.2 λpair=0.2)
Optimizer: LBFGS optimization
convergence criteria: maxit=2000
[ removed the per-iteration statistics for brevity ]
Finished with code 0 -- CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
Compute contact map using frobenius norm of couplings.
Apply Average Product Correction (APC).
Apply entropy correction (using 20 states and log2).
Writing contact matrices to:
data/1atzA.noapc.mat
data/1atzA.apc.mat
data/1atzA.ec.mat
We now have several new files in data/
:
- a summed contact score matrix
1atzA.noapc.mat
with raw contact scores - a summed contact score matrix
1atzA.apc.mat
with contact scores that have been corrected with the Average Product Correction (APC) - a summed contact score matrix
1atzA.ec.mat
with contact scores that have been corrected for entropy bias - an interactive html file
1atzA.log.html
visualizing the optimization log
Contact Maps that have been generated with the CCMpredPy -m
flag can be visualized as .html file using the following command:
plot_ccmpred cmap \
--mat-file data/1atzA.apc.mat
--alignment-file data/1atzA.fas \
--pdb-file data/1atzA.pdb \
--plot-file data/ 1atzA.apc.html \
--seq-sep 4 --contact-threshold 8
Specifying the original alignment file with the flag --alignment-file
will add a subplot with a per-column entropy line graph.
Specifying a reference PDB structure with the flag --pdb-file
will show the observed pairwise amino distance in the lower triangle of the matrix (Note that numbering of residues in the PDB file must begin with 1 and match the dimensions of the contact matrix file!).
The C_beta distance threshold for defining true positive contacts can be specified with the flag --contact-threshold
and residue pairs along the diagonal can be masked by specifying a sequence separation cutoff with the flag --seq-sep
.
The contact map will look like this:
We can learn the coupling parameters of the Markov Random Field model using a more precise inference method that needs slightly longer run time than pseudo-likelihood maximization. However, it is recommended to use evolutionary couplings inferred with persistent contrastive divergence with CCMgen to generate new, similar protein sequences:
ccmpred data/1atzA.fas --ofn-cd --persistent --alg-gd \
-b data/1atzA.braw.gz \
The output generated by CCMgenPy will look as follows:
┏━╸┏━╸┏┳┓┏━┓┏━┓┏━╸╺┳┓ version 1.0.0
┃ ┃ ┃┃┃┣━┛┣┳┛┣╸ ┃┃ Vorberg, Seemayer and Soeding (2018)
┗━╸┗━╸╹ ╹╹ ╹┗╸┗━╸╺┻┛ https://github.com/soedinglab/ccmgen
Using 1 threads for OMP parallelization.
1atzA is of length L=75 and there are 3068 sequences in the alignment.
Alignment has diversity [sqrt(N)/L]=0.739 and Neff(HHsuite-like)=5.492.
Number of effective sequences after simple reweighting (id-threshold=0.8, ignore_gaps=False): 1149.44.
Calculating AA Frequencies with 0.00087 percent pseudocounts (uniform_pseudocounts 1 1)
L₂ regularization (λsingle=10 λpairfactor=0.2 λpair=0.2)
Will optimize 2482125 float64 variables wrt persistent contrastive divergence:
nr of sampled sequences=500 (0.163xN and 0.435xNeff and 6.667xL) Gibbs steps=1
and L₂ regularization (λsingle=10 λpairfactor=0.2 λpair=0.2)
Optimizer: Gradient descent optimization (alpha0=0.001)
convergence criteria: maxit=2000 early_stopping=False epsilon=1e-05 prev=5
decay: decay=False
[ removed the per-iteration statistics for brevity ]
Finished with code 2 -- Reached maximum number of iterations
Writing msgpack-formatted potentials to data/1atzA.braw.gz
We now have one additional file in the data/
directory: 1atzA.braw.gz
, the raw coupling potentials stored in binary MessagePack format for use with CCMgen.
If one wishes to generate diverse alignments with CCMgen, it is helpful to reduce the number of constraints in the MRF model that would otherwise trap Gibbs sampler in local optima leading to alignments of low diversity. For that purpose, it is possible to specify a reference structure in form of a PDB file and a distance threshold, so that residue pairs separated by Cbeta distances larger than the threshold will obtain zero-couplings.
ccmpred data/1atzA.fas --ofn-cd --persistent --alg-gd \
-b data/1atzA.braw.gz \
--pdb-file data/1atzA.pdb --contact-threshold 12
It is possible to initialise potentials in CCMpredPy from a binary MessagePack file by using the flag --init-from-raw
, which might be of interest in order to restart optimization given a predefined model or in combination with the flag --do-not-optimize
to simply compute contact matrix files from the raw coupling potentials.
Furthermore, to work with the raw coupling files in your Python code you can use raw
module in the ccmpred package as follows:
# load the raw module
>>> import ccmpred.raw
#read raw coupling potentials stored in binary MessagePack format
>>> raw = ccmpred.raw.parse_msgpack("data/1atzA.braw.gz")
#the length of the protein (number columns in alignment)
>>> raw.ncol
75
#the dimensions of the single potentials
>>> raw.x_single.shape
(75, 20)
#the dimensions of the pairwise potentials
>>> raw.x_pair.shape
(75, 75, 21, 21)
#meta information about the CCMpredPy run that generated the model
>>> raw.meta
{'version': '1.0.0', 'method': 'ccmpred-py', 'workflow': [{'timestamp': '2018-06-04 14:39:35.902889',
'msafile': {'neff': 1149.4357714155785, 'nrow': 3068, 'ncol': 75, 'file': 'data/1atzA.fas', 'max_gap_pos': 100, 'max_gap_seq': 100}, 'pseudocounts': {'pseudocount_type': 'uniform_pseudocounts', 'pseudocount_n_single': 1, 'pseudocount_n_pair': 1, 'remove_gaps': False, 'pseudocount_ratio_single': 0.0008692358364079104, 'pseudocount_ratio_pair': 0.0008692358364079104}, 'weighting': {'type': 'simple', 'ignore_gaps': False, 'cutoff': 0.8}, 'results': {'opt_code': 2, 'opt_message': 'Reached maximum number of iterations', 'num_iterations': 50, 'runtime': 0.252776, 'fx_final': -1, 'out_binary_raw_file': 'data/1atzA.braw.gz'}, 'initialisation': {'single_potential_init': None, 'pair_potential_init': 'zero'}, 'regularization': {'regularization_type': None, 'regularization_scaling': None, 'single_prior': None, 'lambda_single': 10, 'lambda_pair': 0.2, 'lambda_pair_factor': 0.2}, 'optimization': {'method': 'gradientDescent', 'convergence': {'maxit': 50, 'early_stopping': False, 'epsilon': 1e-05, 'convergence_prev': 5}, 'decay': {'alpha0': 0.001, 'decay': False, 'decay_start': 0.0001, 'decay_rate': 10.0, 'decay_type': 'step'}, 'fix_v': False}, 'obj_function': {'name': 'ContrastiveDivergence', 'gibbs_steps': 1, 'nr_seq_sample': 500}}]}
With CCMge it is possible to either generate a MCMC sample of specified size from a MRF model or generate a sequence sample along a predefined phylogeny (idealized binary/star tree or topology read from Newick file).
In order to generate a simple MCMC sample with 1000 sequences (sequence drawn after 500 steps of Gibbs sampling, Markov chain has been randomly initialized but gap structure of original sequence has been used) you can run:
ccmgen data/1atzA.braw.gz data/1atzA.mcmc.fasta \
--mcmc-sampling \
--alnfile data/1atzA.fas \
--mcmc-sample-random-gapped --mcmc-burn-in 500 --num-sequences 1000
The output generated by CCMgen will look as follows:
┏━╸┏━╸┏┳┓┏━╸┏━╸┏┓╻ version 1.0.0
┃ ┃ ┃┃┃┃╺┓┣╸ ┃┗┫ Vorberg, Seemayer and Soeding (2018)
┗━╸┗━╸╹ ╹┗━┛┗━╸╹ ╹ https://github.com/soedinglab/ccmgen
Using 1 threads for OMP parallelization.
1atzA is of length L=75 and there are 3068 sequences in the alignment.
Alignment has diversity [sqrt(N)/L]=0.739 and Neff(HHsuite-like)=5.492.
Successfully loaded model parameters from data/1atzA.braw.gz.
Start sampling 1000 sequences according to model starting with random-gapped sequences using burn-in=500.
sampled alignment has 1000 sequences...
Sampled alignment has Neff 2.59408
Writing sampled alignment to data/1atzA.mcmc.fasta
In order to generate an alignment along a certain phylogenetic topology while obeying the constraints from a MRF model you can run:
ccmgen data/1atzA.braw.gz data/1atzA.binary.fasta \
--tree-binary \
--mutation-rate 3 \
--seq0-mrf 10 \
--num-sequences 1024
This command will generate an alignment with 1024 sequences that are sampled according to a binary tree topology with an ancestor (=root) sequence obtained by evolving a polyA sequence with 10 Gibbs steps according to the MRF model and that have in total 3 * 75 (length of protein) mutations per position. The output generated by CCMgen will look as follows:
┏━╸┏━╸┏┳┓┏━╸┏━╸┏┓╻ version 1.0.0
┃ ┃ ┃┃┃┃╺┓┣╸ ┃┗┫ Vorberg, Seemayer and Soeding (2018)
┗━╸┗━╸╹ ╹┗━┛┗━╸╹ ╹ https://github.com/soedinglab/ccmgen
Using 1 threads for OMP parallelization.
Successfully loaded model parameters from data/1atzA.braw.gz.
Created binary tree with 1024 leaves, 2048 nodes, avg branch length=0.1, depth_min=1.0000e+00, depth_max=1.0000e+00
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) : QPTDLVFLIDGSWSIRRNVFELVKRFLSRVVDALDIGPDSTRVALVQYSSDPRLEFLLNSHGDKNSLLQAIARLQ
avg number of amino acid substitutions (parent -> child): 23.0
avg number of amino acid substitutions (root -> leave): 225.0
Alignment was sampled with mutation rate 3.0 and has Neff 2.2551
Writing sampled alignment to data/1atzA.binary.fasta
It is also possible to generate an alignment of the same diversity as a specified input alignment (diversity measured as Neff as defined in HHsuite package). Note that it is helpful to use a MRF with only a small number of coupling constraints, e.g. by learning a MRF while setting couplings for non-contacts to zero (see Learning MRF with small number of constraints).
ccmgen data/1atzA.braw.gz data/1atzA.binary.fixedNeff.fas \
--tree-binary \
--mutation-rate-neff \
--alnfile data/1atzA.fas
--seq0-mrf 10 \
--num-sequences 1024
Starting with a mutation rate = 1, CCMgen will adapt the mutation rate until an alignment with desired diversity has been generated (algorithm will be restarted after reaching a maxmimal mutation rate of 50).
The output will look similar to this:
┏━╸┏━╸┏┳┓┏━╸┏━╸┏┓╻ version 1.0.0
┃ ┃ ┃┃┃┃╺┓┣╸ ┃┗┫ Vorberg, Seemayer and Soeding (2018)
┗━╸┗━╸╹ ╹┗━┛┗━╸╹ ╹ https://github.com/soedinglab/ccmgen
Using 1 threads for OMP parallelization.
1atzA is of length L=75 and there are 3068 sequences in the alignment.
Alignment has diversity [sqrt(N)/L]=0.739 and Neff(HHsuite-like)=5.492.
Successfully loaded model parameters from data/1atzA.braw.gz.
Created binary tree with 1024 leaves, 2048 nodes, avg branch length=0.1, depth_min=1.0000e+00, depth_max=1.0000e+00
Sample sequences to generate alignment with target Neff~5.49181...
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) : RTTDLILLLDASDSVSRTYWQLMKDFTESFARNFYVYPVLARIGLLEYAGDPQVEIGLNDYGNNKDLLKALDSLK
avg number of amino acid substitutions (parent -> child): 8.0
avg number of amino acid substitutions (root -> leave): 75.0
Alignment was sampled with mutation rate 1 and has Neff 3.9756 (ΔNeff [%] = 27.608)
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) : QMMDLYMLLDGSGSVGLGNMNKLRSFAIDLFERLTIGPEAIQLAALQFARRASVAFSFGEANSKESMLEELRKLQ
avg number of amino acid substitutions (parent -> child): 10.0
avg number of amino acid substitutions (root -> leave): 98.0
Alignment was sampled with mutation rate 1.31 and has Neff 4.1846 (ΔNeff [%] = 23.802)
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) : RQLDVALALDSSSSIGTDEFETMKGFLRSMTNSLDVGGTATRLALLFYSSRSKTEISFTSHTNREEVLREIRRMQ
avg number of amino acid substitutions (parent -> child): 14.0
avg number of amino acid substitutions (root -> leave): 142.0
Alignment was sampled with mutation rate 1.9 and has Neff 5.2428 (ΔNeff [%] = 4.5344)
[ removed the sampling output for brevity ]
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) : RPIDFGFLLDASSSISVYDFRLALAFIQAVTSSFDLFWNRMRYSLVKYSDQPVVGLTLTDWETINEIMDAIKNVQ
avg number of amino acid substitutions (parent -> child): 14.0
avg number of amino acid substitutions (root -> leave): 135.0
Alignment was sampled with mutation rate 1.8 and has Neff 5.5054 (ΔNeff [%] = -0.24752)
Writing sampled alignment to data/1atzA.binary.fixedNeff.fas
Getting started learning MRF models from alignments of protein families with CCMgenPy and sampling sequences with CCMgen is straightforward but there are many more options available to fine-tune the learning and sequence generation. See the CCMgen and CCMpredPy documentation for more information.