Tool for motif deconvolution in large HLA-I ligand datasets

GfellerLab/MixMHCp

############
# Written by David Gfeller.
#
# For any question, please contact [email protected]
#
# To cite MixMHCp2.1, please refer to:
# - Bassani-Sternberg M and Gfeller D, Unsupervised HLA Peptidome Deconvolution Improves Ligand Prediction Accuracy and Predicts Cooperative Effects in Peptide-HLA Interactions. J. Immunol. (2016)
# - Gfeller et al, The length distribution and multiple specificity of naturally presented HLA-I ligands, J. Immunol (2018).
#
# MixMHCp can be used freely by academic groups for non-commercial purposes (see license).
# The product is provided free of charge, and, therefore, on an "as is"
# basis, without warranty of any kind.
#
# FOR-PROFIT USERS
# If you plan to use MixMHCp (version 2.1) in any for-profit
# application, you are required to obtain a separate license.
# To do so, please contact Nadette Bulgin ([email protected]) at the Ludwig Institute for Cancer Research Ltd.
#
# Copyright (2018) David Gfeller
############

############
# New features of the version 2.1
############

The main novelty of this version is support for non-standard amino acids (e.g., phosphorylated residues). Non-standard amino acids must be given in the input sequences as lowercase letters and are displayed in purple in the logos.

############

MixMHCp can be run from the terminal (command line) on Mac and Linux. It requires perl, C++ and Rscript to be installed on your machine and included in your path. If Rscript or R are not available in your path, you can manually specify the location of Rscript in run_MixMHCp.pl (search for 'Rscript').

- Installing MixMHCp:

1) Create a directory in which to install MixMHCp. Please avoid blank spaces and non-standard characters in its path.

2) Decompress MixMHCp2.1.tar.gz in this directory.

3) Open the MixMHCp file with a text editor (e.g., emacs) and replace the text between the quotes in lib_path="YOUR PATH TO MixMHCp2.1/lib FOLDER" with the absolute path of the MixMHCp2.1/lib directory.

4) Put the MixMHCp executable in your PATH.

5) On Linux, you need to recompile the MixMHCp.cc file (for instance: g++ lib/MixMHCp.cc -o lib/MixMHCp.x)


- Running MixMHCp (see help for other options):

MixMHCp -i INPUT_FILE -o OUTPUT_DIR -m MAX_MOTIF
Example: MixMHCp -i input.fa -o output -m 6

MixMHCp requires as input fasta files with no gaps. Sequences should contain only standard amino acids, apart from the non-standard amino acids supported since version 2.1, which must be written as lowercase letters.
You can also use txt files with peptide sequences (lines starting with '>' are simply ignored).
Make sure your input file and path names do not contain blank spaces or special characters.
Make sure your input files are plain text files and do not include any rich formatting (e.g., save Excel files as tab-delimited .txt files).
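The input requirements above can be checked before running MixMHCp. The following is a minimal Python sketch, not part of the MixMHCp package: the function name clean_peptides and the extra_letters parameter are hypothetical, and the letters "sty" used in the example for phosphorylated residues are an assumption (use whatever lowercase letters your own alphabet defines).

```python
def clean_peptides(lines, extra_letters=""):
    """Split raw input lines into (valid, rejected) peptide lists.

    Header lines starting with '>' and empty lines are skipped.
    A peptide is rejected if it contains a gap, a blank or any
    character outside the allowed alphabet.
    """
    standard = set("ACDEFGHIKLMNPQRSTVWY")
    # extra_letters: hypothetical lowercase codes for non-standard residues
    allowed = standard | set(extra_letters)
    valid, rejected = [], []
    for raw in lines:
        seq = raw.strip()
        if not seq or seq.startswith(">"):
            continue
        if set(seq) <= allowed:
            valid.append(seq)
        else:
            rejected.append(seq)
    return valid, rejected


if __name__ == "__main__":
    example = [">pep1", "YLLPAIVHI", "YLL-AIVHI", "", "RLYsPTQEL"]
    valid, rejected = clean_peptides(example, extra_letters="sty")
    print("valid:", valid)
    print("rejected:", rejected)
```

Writing the valid peptides back to a plain text file, one per line, gives a file in the txt format described above.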

- Test MixMHCp:

Make sure you are in the MixMHCp2.1 directory and have manually entered the path in the MixMHCp file (see step 3 of the installation instructions).

Run the command: ./MixMHCp -i test/test.txt -o test/out -m 3

Apart from path names, your output should be similar to the content of the test/out_compare directory.

The test file contains the HLA-I ligands identified in the Mel12 sample in Bassani-Sternberg et al. Nat Commun. 2016. The solution with 3 motifs corresponds to the accurate deconvolution (HLA-A01:01 - motif 3, HLA-B08:01 - motif 1, HLA-C07:01 - motif 2).



#####################
- Other information:
#####################


1) MixMHCp does not include any alignment step and should only be applied to pre-aligned peptides that are no more than 10 amino acids longer than the core. Ligands of different lengths can be used as input. Unsupervised motif deconvolution is performed on ligands of the same size as the core (default: 9). The other peptides are scored based on the 9-mer motifs (see below). The logos for each motif and each peptide length can be visualized in the logos_html directory. Responsibility files include the predicted start and end positions of each of the core motifs. Please refer to MHCpExt (Guillaume et al PNAS 2018) for an analysis with statistical significance of C- or N-terminal extensions.

2) The responsibility values indicate how much each peptide contributes to each motif. These values can be used to cluster the peptides, and users can choose different thresholds (e.g., any number, or the maximal value). The start and end positions reported for the core motif correspond to the best position based on the scoring with each PWM inferred from peptides of the core length.
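Clustering on the responsibility values can be sketched as follows in Python. This is an illustration only: the layout of the responsibility files is not described here, so the sketch works on an in-memory mapping from peptide to a list of responsibilities (one per motif, 0-based index), and the function name assign_by_responsibility is hypothetical.

```python
def assign_by_responsibility(responsibilities, threshold=None):
    """Assign each peptide to the motif with its largest responsibility.

    responsibilities: dict mapping peptide -> list of values, one per motif.
    threshold: if given, peptides whose largest value is below it are dropped.
    Returns a dict mapping motif index (0-based) -> list of peptides.
    """
    clusters = {}
    for peptide, values in responsibilities.items():
        best = max(range(len(values)), key=lambda k: values[k])
        if threshold is not None and values[best] < threshold:
            continue  # not confidently assigned to any motif
        clusters.setdefault(best, []).append(peptide)
    return clusters


if __name__ == "__main__":
    # Made-up values for illustration, not MixMHCp output.
    resp = {
        "YLLPAIVHI": [0.90, 0.10],
        "RLYSPTQEL": [0.55, 0.45],
        "ELKEKYEKL": [0.20, 0.80],
    }
    print(assign_by_responsibility(resp))
    print(assign_by_responsibility(resp, threshold=0.7))
```

Without a threshold every peptide goes to its best motif; a stricter threshold keeps only peptides that clearly belong to one motif.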

3) By default, the logos are plotted with a modified version of the R package ggseqlogo, which is included in the package (Wagih, Omar. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics 33, no. 22 (2017)). To plot logos with Seq2Logo (Martin Christen Frolund Thomsen; Morten Nielsen, Nucleic Acids Research 2012; 40 (W1): W281-W287), you need to have the proper license (http://www.cbs.dtu.dk/biotools/Seq2Logo/), have Seq2Logo in your path and add '-lt Seq2Logo' to the command line. You also need a working 'gs' (Ghostscript) on your machine. If you use Seq2Logo to generate the logos, be aware that the logos do not include any random count or background frequency corrections. This is useful to explore what is in your data, but may not exactly correspond to the motifs used to build predictors.

4) The logos (and the number of peptides in html files) are determined after clustering the peptides based on the largest responsibility values.

5) The PWMs provided as output correspond to those optimally describing the data. They do not necessarily correspond to the best predictors, since they do not incorporate any renormalization by the background frequencies and are built with a relatively low random count. In general, for making predictions, it is better to train a predictor based on the peptides themselves (for instance by clustering the peptides based on the responsibility values).
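To illustrate what renormalization by background frequencies and a random count (pseudocount) mean in practice, here is a minimal Python sketch that builds a log-odds matrix from the peptides of one cluster. The formula is a common textbook scheme chosen for illustration; it is an assumption and not necessarily the one used by MixMHCp or by any specific predictor. The function name log_odds_pwm and the default pseudocount are hypothetical.

```python
import math


def log_odds_pwm(peptides, background, pseudocount=1.0):
    """Build a log-odds PWM from equal-length peptides.

    background: dict mapping amino acid -> background frequency (sums to 1).
    For each position, freq = (count + pseudocount * bg) / (n + pseudocount)
    and the score is log(freq / bg).
    Returns a list (one entry per position) of dicts amino acid -> score.
    """
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    n = len(peptides)
    pwm = []
    for pos in range(length):
        counts = {aa: 0 for aa in background}
        for p in peptides:
            counts[p[pos]] += 1  # KeyError if a letter has no background value
        column = {}
        for aa, bg in background.items():
            freq = (counts[aa] + pseudocount * bg) / (n + pseudocount)
            column[aa] = math.log(freq / bg)
        pwm.append(column)
    return pwm


def score(peptide, pwm):
    """Sum the log-odds scores of a peptide of the same length as the PWM."""
    return sum(column[aa] for aa, column in zip(peptide, pwm))
```

A larger pseudocount pulls the frequencies towards the background, which matters most for clusters with few peptides.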

6) The optimal number of motifs is difficult to automatically determine and existing measures provide unsatisfactory results. We encourage the users to manually look at the motifs and determine the optimal number. The value provided in KLD/best_ncl.txt (first line) corresponds to the method proposed in Bassani-Sternberg and Gfeller J Immunol 2016. However, in many cases it is worth considering more motifs than this number. The KLD values printed below for each number of motifs correspond to the method developed in GibbsCluster (Andreatta et al Bioinformatics 2013). However, here as well, highest KLD values often do not correspond to the best number of motifs. For instance, in the example used for testing (test/test.txt), the motifs for the different HLA-I alleles are nicely obtained choosing 8 motifs (including multiple specificity for HLA-A03:01 and HLA-A11:01), and not 3 motifs.

7) A trash cluster is included to remove peptides that do not match any of the inferred motifs.

8) Motif deconvolution is only performed for peptides of length lcore (default: 9). Peptides of other lengths are scored based on the PWM inferred from the 9-mers, using the first 3 and last 2 positions, and allowing for N- or C-terminal extensions. However, the fraction of peptides from each motif may change substantially for different peptide lengths (either because HLA-I molecules have different length distributions, or because of more frequent contaminations in longer peptides in MS data). For this reason, the weights of the motifs (including the trash motif) are inferred by Expectation Maximization for each peptide length separately. Information about these weights and the corresponding length distributions is provided in the weights/ directory. In the weight_n.txt files, the first column gives the peptide length, the second column gives the total number of peptides of each length, the n_weight columns give the weight of each motif as defined in the mixture model, and the n_distr columns give the peptide length distribution of each motif, as obtained after clustering the peptides based on the maximum of the responsibility values. Peptide length distributions are plotted in length_distribution.html for each motif.
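The idea of inferring motif weights separately for each peptide length can be sketched with a generic Expectation Maximization loop in Python. This is a simplified illustration under stated assumptions, not the MixMHCp implementation: the likelihood of each peptide under each motif is assumed to be already computed and kept fixed, and the function names em_weights and weights_by_length are hypothetical.

```python
def em_weights(likelihoods, n_iter=100, tol=1e-9):
    """Estimate mixture weights by EM with fixed per-motif likelihoods.

    likelihoods: list of rows, one per peptide; row[k] is the likelihood
    of that peptide under motif k (the last motif could be the trash motif).
    """
    n_motifs = len(likelihoods[0])
    weights = [1.0 / n_motifs] * n_motifs
    for _ in range(n_iter):
        totals = [0.0] * n_motifs
        for row in likelihoods:
            norm = sum(w * l for w, l in zip(weights, row))
            if norm == 0.0:
                continue  # peptide that no motif can explain
            for k in range(n_motifs):
                # E-step: responsibility of motif k for this peptide
                totals[k] += weights[k] * row[k] / norm
        total = sum(totals)
        if total == 0.0:
            return weights
        # M-step: new weight = average responsibility
        new_weights = [t / total for t in totals]
        converged = max(abs(a - b) for a, b in zip(weights, new_weights)) < tol
        weights = new_weights
        if converged:
            break
    return weights


def weights_by_length(peptide_likelihoods):
    """Run em_weights separately for each peptide length.

    peptide_likelihoods: dict mapping peptide -> list of likelihoods.
    Returns a dict mapping peptide length -> list of weights.
    """
    by_length = {}
    for peptide, row in peptide_likelihoods.items():
        by_length.setdefault(len(peptide), []).append(row)
    return {n: em_weights(rows) for n, rows in sorted(by_length.items())}
```

Because each length gets its own call to em_weights, a motif that dominates among 9-mers can receive a much smaller weight among longer peptides.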

9) MixMHCp is tailored to motif deconvolution in HLA-I peptides and includes several specific options related to this problem. You can still use it for other types of data, but it is advised to treat separately peptides of different lengths, unless you expect N- and C-terminal motifs to be conserved.

10) For the default alphabet, UniProt-based background frequencies are used. If additional amino acids are part of the alphabet, background frequencies must be provided for the alphabet used. The file should be a plain text file listing each amino acid together with its frequency (separated by a single space ' ') on a separate line. A bias file for canonical plus phosphorylated amino acids (20+3) is provided in lib/, containing background frequencies based on the phosphorylated human proteome.
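A background frequency file in this format can be produced with a few lines of Python. The sketch below counts letters in a set of reference sequences of your choice; the function name background_lines, the six-decimal precision and the alphabetical ordering are assumptions, since the text above only specifies one amino acid and its frequency per line, separated by a single space.

```python
from collections import Counter


def background_lines(sequences):
    """Return 'LETTER FREQUENCY' lines computed from reference sequences.

    Lowercase letters (non-standard amino acids) are counted as their
    own symbols, separately from the uppercase standard amino acids.
    """
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no residues found in the reference sequences")
    return [f"{aa} {counts[aa] / total:.6f}" for aa in sorted(counts)]


if __name__ == "__main__":
    # Toy sequences for illustration; use a real reference set in practice.
    for line in background_lines(["MKTAYIAKsR", "MstAAL"]):
        print(line)
```

Any letter of your alphabet that never occurs in the reference sequences will be missing from the output and would need to be added by hand with a suitable frequency.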