protein_complex_maps

#Scripts for handling protein complex map data

##Elution correlation

###Correlation matrices for each experiment, each species, and all experiments concatenated

python ./protein_complex_maps/external/infer_complexes/score.py

input: tab separated wide elution profile: prot_ids [tab] total_spectral_count [tab] frac1_spectral_count [tab] ...

output: corr_poisson

The output is a large all-by-all matrix (every protein against every other protein).

Example

python ./protein_complex_maps/external/infer_complexes/score.py ./examples/Hs_helaN_ph_hcw120_2_psome_exosc_randos.txt poisson
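To make the shape of an all-by-all matrix concrete, here is a minimal sketch that computes plain Pearson correlations between elution profiles. It is not the `score.py` implementation: the Poisson handling is left out, and the protein IDs and spectral counts are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical per-fraction spectral counts (total count column already dropped).
profiles = {
    "P1": [0, 5, 9, 2, 0],
    "P2": [0, 4, 8, 3, 0],
    "P3": [7, 1, 0, 0, 6],
}
prot_ids = list(profiles)

# Square, symmetric matrix: one row and one column per protein.
corr = [[pearson(profiles[a], profiles[b]) for b in prot_ids] for a in prot_ids]

for pid, row in zip(prot_ids, corr):
    print(pid, "\t".join(f"{v:.3f}" for v in row), sep="\t")
```

Proteins that co-elute (here P1 and P2) get a coefficient near 1, while proteins that peak in different fractions get a low or negative one.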

###Reformat all by all to tidy (3 column)

python ./protein_complex_maps/features/convert_correlation.py

input: corr_poisson

output: corr_poisson.pairs

One line per protein pair, for all protein pairs: P1 P2 correlation_coefficient

Example

python ./protein_complex_maps/features/convert_correlation.py --input_correlation_matrix ./examples/Hs_helaN_ph_hcw120_2_psome_exosc_randos.txt.corr_poisson --input_elution_profile ./examples/Hs_helaN_ph_hcw120_2_psome_exosc_randos.txt --output_file ./examples/Hs_helaN_ph_hcw120_2_psome_exosc_randos.txt.corr_poisson_tidy
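The reshaping step can be sketched as follows. This is an illustration only, not the code in `convert_correlation.py`; in particular, emitting each unordered pair once (upper triangle, no self-pairs) is an assumption, and the matrix values are made up.

```python
from itertools import combinations

def matrix_to_pairs(prot_ids, matrix):
    """Yield (P1, P2, coefficient) once per unordered protein pair."""
    for i, j in combinations(range(len(prot_ids)), 2):
        yield prot_ids[i], prot_ids[j], matrix[i][j]

# Hypothetical 3 x 3 correlation matrix.
prot_ids = ["P1", "P2", "P3"]
matrix = [
    [1.0, 0.9, -0.2],
    [0.9, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]

pairs = list(matrix_to_pairs(prot_ids, matrix))
for p1, p2, coeff in pairs:
    print(f"{p1}\t{p2}\t{coeff}")
```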

###Feature matrix

A feature is any value that can be assigned to a pair of proteins.

python ./protein_complex_maps/features/build_feature_matrix.py

input: all .corr_poisson.pairs

output: feature_matrix.txt

Note: this is the point at which to add further features such as AP-MS, as long as each feature describes a pair of proteins.

pairs Feature1 Feature2 Feature3
P1 P2 value1 value2 value3
... ... ... ...
PN PN-1 value4 value5 value6

The matrix is n x m, where n = (number of proteins) choose 2 and m = number of features.

Example

python ./protein_complex_maps/features/build_feature_matrix.py --input_pairs_files ./examples/Hs_helaN_ph_hcw120_2_psome_exosc_randos.txt.corr_poisson_tidy --output_file ./examples/Hs_helaN_ph_hcw120_2_psome_exosc_randos.txt.corr_poisson_tidy.featmat
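A minimal sketch of how several pairs tables could be combined into one feature matrix. The function name, the feature names, and the choice to leave missing values as `None` are assumptions for illustration; `build_feature_matrix.py` may handle these differently.

```python
def build_feature_matrix(pair_tables):
    """Merge {feature_name: [(p1, p2, value), ...]} into one row per pair."""
    names = list(pair_tables)
    rows = {}
    for col, name in enumerate(names):
        for p1, p2, value in pair_tables[name]:
            key = frozenset((p1, p2))  # (P1, P2) and (P2, P1) are the same pair
            rows.setdefault(key, [None] * len(names))[col] = value
    return names, rows

# Hypothetical features; an AP-MS score would simply be one more entry.
tables = {
    "exp1_poisson": [("P1", "P2", 0.9), ("P1", "P3", -0.2)],
    "exp2_poisson": [("P2", "P1", 0.7)],
}
names, rows = build_feature_matrix(tables)

print("pairs", *names, sep="\t")
for pair, values in rows.items():
    print(" ".join(sorted(pair)), *values, sep="\t")
```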

###Format corum into test and training sets

Remove redundancy from corum (merge similar clusters)

python ./protein_complex_maps/complex_merge.py

input: nonredundant_allComplexesCore_mammals.txt

output: nonredundant_allComplexesCore_mammals_merged06.txt
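One way to merge similar complexes is sketched below. It assumes that similarity is the Jaccard index of the member sets and that 0.6 is the cutoff (suggested by the `merged06` suffix); the actual metric used by `complex_merge.py` is not stated here, so treat both as assumptions.

```python
def jaccard(a, b):
    """Shared members divided by total distinct members of two non-empty sets."""
    return len(a & b) / len(a | b)

def merge_complexes(complexes, threshold=0.6):
    """Repeatedly merge any two complexes whose Jaccard index >= threshold."""
    merged = [set(c) for c in complexes]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if jaccard(merged[i], merged[j]) >= threshold:
                    merged[i] |= merged.pop(j)
                    changed = True
                    break
            if changed:
                break  # restart the scan, since the list was modified
    return merged

# Hypothetical complexes: the first two share 3 of 4 members (Jaccard 0.75).
complexes = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"X", "Y"}]
merged = merge_complexes(complexes)
print(merged)
```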

Randomly split the corum complexes into training and test sets

python ./protein_complex_maps/features/split_complexes.py

input: complexes nonredundant_allComplexesCore_mammals_merged06.txt

output:

  • [input_basename].test.txt
  • [input_basename].train.txt
  • [input_basename].test_ppis.txt
  • [input_basename].train_ppis.txt
  • [input_basename].neg_test_ppis.txt
  • [input_basename].neg_train_ppis.txt

Takes any pairwise overlap between the train and test ppis and randomly removes each overlapping ppi from either test or train. For example, if complex 1 = AB, AC, BC and complex 2 = AB, AC, AD, BC, BD, one possible result is complex 1 = AB, BC and complex 2 = AC, AD, BD. Also make sure the training and test complexes are completely separated.

Example

python ./protein_complex_maps/complex_merge.py --cluster_filename ./examples/allComplexesCore_geneid.txt --output_filename ./examples/allComplexesCore_geneid_merged06.txt --merge_threshold 0.6
python ./protein_complex_maps/features/split_complexes.py --input_complexes ./examples/allComplexesCore_geneid_merged06.txt
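The overlap removal can be sketched like this. It is an illustration rather than the logic of `split_complexes.py`; the helper names and the 50/50 coin flip are assumptions.

```python
import random

def pairs(text):
    """Parse 'AB AC BC' into a set of unordered pairs."""
    return {frozenset(token) for token in text.split()}

def remove_overlap(train_ppis, test_ppis, seed=0):
    """Drop every shared pair from either train or test, chosen at random."""
    rng = random.Random(seed)
    train, test = set(train_ppis), set(test_ppis)
    for pair in sorted(train & test, key=sorted):  # sorted for reproducibility
        if rng.random() < 0.5:
            train.discard(pair)
        else:
            test.discard(pair)
    return train, test

train, test = remove_overlap(pairs("AB AC BC"), pairs("AB AC AD BC BD"))
print("train:", sorted("".join(sorted(p)) for p in train))
print("test: ", sorted("".join(sorted(p)) for p in test))
```

After the call, no pair appears in both sets, and pairs that were never shared (AD, BD) are untouched.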

###Make feature matrix w/ labels from corum

python ./protein_complex_maps/features/add_label.py

input: feature_matrix.txt

output: corum_train_labeled.txt

(These are the possible labels)

  • +1 positive label = pair is co-complex in corum
  • -1 negative label = pair is in corum, but not in same complex
  • 0 = at least one protein in the pair is not in corum
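The three labels can be expressed as a small function. This is a sketch of the rule as listed above, with a hypothetical function name and toy complexes, not the code in `add_label.py`.

```python
def label_pair(p1, p2, complexes):
    """Return +1, -1 or 0 for a protein pair given a list of member sets."""
    in_corum = set().union(*complexes)
    if p1 not in in_corum or p2 not in in_corum:
        return 0   # at least one protein is not in corum
    if any(p1 in c and p2 in c for c in complexes):
        return 1   # co-complex pair
    return -1      # both in corum, but never in the same complex

# Hypothetical complexes.
complexes = [{"A", "B"}, {"C", "D"}]
for p1, p2 in [("A", "B"), ("A", "C"), ("A", "Z")]:
    print(p1, p2, label_pair(p1, p2, complexes))
```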

###Make input for the SVM

Convert the labeled training set to libsvm format, stripping out headers and other extra fields.

python ./protein_complex_maps/features/feature2libsvm.py

input: corum_train_labeled.txt

output: corum_train_labeled.libsvm1.txt, tab separated
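For reference, a libsvm line is a label followed by `index:value` entries with 1-based feature indices. The sketch below writes one such line; the function name is hypothetical, the tab separator follows the output described above, and skipping missing values (libsvm's sparse convention) is an assumption about how `feature2libsvm.py` treats them.

```python
def to_libsvm_line(label, features, sep="\t"):
    """Format one labeled feature row as 'label<sep>1:v1<sep>2:v2...'."""
    fields = [str(label)]
    fields += [
        f"{index}:{value:g}"
        for index, value in enumerate(features, start=1)
        if value is not None  # omitted entries are treated as absent (sparse)
    ]
    return sep.join(fields)

# Hypothetical row: label +1, three features, the second one missing.
line = to_libsvm_line(1, [0.9, None, 0.25])
print(line)
```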

SVMs are biased toward features with large numeric values. Scaling puts all features on the same range so that no single feature dominates.

$LIBSVM_HOME/svm-scale

input: corum_train_labeled.libsvm1.scale_parameters

output: corum_train_labeled.libsvm1.scale.txt
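The idea behind scaling, and behind reusing the training-set parameters on other data (as the `scaleByTrain` file names suggest), can be sketched as min-max scaling. This is not `svm-scale` itself; the function names, the target range of [-1, 1], and the toy values are assumptions.

```python
def fit_ranges(rows):
    """Record (min, max) of every feature column in the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature to [lower, upper] using the stored ranges."""
    scaled = []
    for row in rows:
        out = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                out.append(0.0)  # constant feature carries no information
            else:
                out.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(out)
    return scaled

train = [[0, 10], [5, 20], [10, 30]]
ranges = fit_ranges(train)             # the "scale parameters"
train_scaled = scale(train, ranges)
other_scaled = scale([[2.5, 40]], ranges)  # scaled by the training ranges
print(train_scaled, other_scaled)
```

Values outside the training range (40 in the second column) fall outside [-1, 1], which is expected when the training parameters are reused.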

SVM training and parameter sweep to optimize C and gamma

The parameter sweep uses cross-validation on the training set (trains on 9/10 of the data and evaluates on the held-out 1/10)

python $LIBSVM_HOME/tools/grid.py

input: corum_train_labeled.libsvm1.scale.txt

output: corum_train_labeled.libsvm1.scale.txt.out
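The 9/10 versus 1/10 split mentioned above corresponds to 10-fold cross-validation. A minimal sketch of how such folds can be generated is shown here; it is independent of `grid.py`, and the function name and interleaved fold assignment are illustrative choices.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out

splits = list(k_fold_indices(20, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), "train rows, held out:", test_idx)
```

For each candidate (C, gamma) pair, a model would be trained on every `train` part and scored on the matching held-out part, and the pair with the best average score would be kept.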

###Train classifier

Takes the optimal C and gamma from the parameter sweep and trains a classifier

$LIBSVM_HOME/svm-train

input: corum_train_labeled.libsvm1.scale.txt

output: corum_train_labeled.libsvm1.scale.model_c_g (with c and g values)

Predict the unlabeled set, along with the test set, using the trained model

$LIBSVM_HOME/svm-predict

input: corum_train_labeled.libsvm0.scaleByTrain.txt, corum_train_labeled.libsvm1.scale.model_c_g

output: corum_train_labeled.libsvm0.scaleByTrain.resultsWprob

Produce a probability-ordered list of pairs

python ./protein_complex_maps/features/svm_results2pairs.py

inputs: corum_train_labeled.txt, corum_train_labeled.libsvm0.scaleByTrain.resultsWprob

output: corum_train_labeled.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob.txt
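The `noself_nodups_wprob` suffix suggests that self-pairs and duplicate pairs are removed and that each pair keeps its probability. A sketch of that post-processing follows; the function name, and keeping the highest probability when a pair occurs twice, are assumptions rather than documented behavior of `svm_results2pairs.py`.

```python
def ranked_pairs(predictions):
    """Return unique, non-self pairs sorted by descending probability."""
    best = {}
    for p1, p2, prob in predictions:
        if p1 == p2:
            continue  # drop self-interactions
        key = tuple(sorted((p1, p2)))  # (A, B) and (B, A) are duplicates
        if key not in best or prob > best[key]:
            best[key] = prob
    return sorted(
        ((a, b, prob) for (a, b), prob in best.items()),
        key=lambda row: row[2],
        reverse=True,
    )

# Hypothetical predictions.
predictions = [("A", "B", 0.9), ("B", "A", 0.8), ("A", "A", 1.0), ("C", "D", 0.95)]
ranked = ranked_pairs(predictions)
for p1, p2, prob in ranked:
    print(f"{p1}\t{p2}\t{prob}")
```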

###Cluster PPIs

At this point, we want to find clusters (dense regions) in the PPI network

two-stage clustering

python ./protein_complex_maps/features/clustering_parameter_optimization.py

inputs:

  • corum_train_labeled.libsvm1.scale.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob.txt

  • nonredundant_allComplexesCore_mammals_merged06.train.txt

outputs:

  • corum_train_labeled.libsvm1.scale.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob_combined.best_cluster_wOverlap_nr_allComplexesCore_mammals_psweep_clusterone_mcl.txt
  • corum_train_labeled.libsvm1.scale.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob.best_cluster_wOverlap_nr_allComplexesCore_mammals_psweep_clusterone_mcl.out

Do a parameter sweep (about 1000 different possibilities):

  • PPi score threshold [1.0, 0.9, 0.8 ... 0.1]
  • Clusterone parameters
    • overlap (jaccard score) [0.8, 0.7, 0.6] -- merging complexes with overlap
    • density (threshold of total number of interactions vs. total possible interactions) unconnected -> fully connected
  • MCL inflation [1.2, 3, 4, 7]
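The density criterion in the list above (observed interactions over possible interactions) can be written out directly. This sketch follows that unweighted definition; it is not taken from the Clusterone code, and the function name and toy edges are hypothetical.

```python
def density(members, edges):
    """Fraction of possible member-member pairs that are present in edges."""
    members = set(members)
    n = len(members)
    possible = n * (n - 1) // 2
    if possible == 0:
        return 0.0
    observed = sum(1 for edge in edges if len(edge) == 2 and edge <= members)
    return observed / possible

# Hypothetical network: AD is ignored because D is not in the cluster.
edges = {frozenset("AB"), frozenset("BC"), frozenset("AD")}
print(density("ABC", edges))
print(density("ABC", edges | {frozenset("AC")}))
```

A value of 0 means the cluster members are unconnected and 1 means the cluster is fully connected.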

Process: Run through clusterone, then run clusters from clusterone through MCL.

Output: one set of clusters for each parameter combination

Select the best set of clusters (usually a couple thousand clusters) by comparing each set to the corum training complexes using the K-Cliques metric or another comparison metric
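The sweep and selection described above amount to enumerating a grid and keeping the best-scoring entry. In the sketch below the threshold, overlap, and inflation values come from the list above, while the density values and the scoring function are placeholders; `clustering_parameter_optimization.py` may organize this differently.

```python
from itertools import product

ppi_thresholds = [round(1.0 - 0.1 * i, 1) for i in range(10)]  # 1.0 down to 0.1
overlaps = [0.8, 0.7, 0.6]
densities = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical grid, not from the README
inflations = [1.2, 3, 4, 7]

# One parameter set per combination; each would yield one set of clusters.
grid = [
    {"threshold": t, "overlap": o, "density": d, "inflation": i}
    for t, o, d, i in product(ppi_thresholds, overlaps, densities, inflations)
]

def pick_best(param_sets, score_fn):
    """Return the parameter set whose clusters score highest."""
    return max(param_sets, key=score_fn)

print(len(grid), "parameter combinations")
```

In practice `score_fn` would run Clusterone and MCL with the given parameters and compare the resulting clusters to the training complexes; here it is left to the caller.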

###Generate Cytoscape Network

Make clusters into pairs

python ./protein_complex_maps/util/cluster2pairwise.py

input: corum_train_labeled.libsvm1.scale.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob_combined.best_cluster_wOverlap_nr_allComplexesCore_mammals_psweep_clusterone_mcl.[best].txt

output: corum_train_labeled.libsvm1.scale.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob_combined.best_cluster_wOverlap_nr_allComplexesCore_mammals_psweep_clusterone_mcl.[best].pairsWclustID.txt
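Turning clusters into pairs means listing every member-member pair together with the cluster it came from. A minimal sketch, assuming the cluster ID is simply the cluster's position in the input (the real `cluster2pairwise.py` may assign IDs differently):

```python
from itertools import combinations

def clusters_to_pairs(clusters):
    """Yield (P1, P2, cluster_id) for every pair within each cluster."""
    for cluster_id, members in enumerate(clusters):
        for p1, p2 in combinations(sorted(members), 2):
            yield p1, p2, cluster_id

# Hypothetical clusters; protein C belongs to both.
clusters = [["A", "B", "C"], ["C", "D"]]
cluster_pairs = list(clusters_to_pairs(clusters))
for p1, p2, cluster_id in cluster_pairs:
    print(f"{p1}\t{p2}\t{cluster_id}")
```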

Make clusters into node table

python ./protein_complex_maps/util/cluster2node_table.py

input: corum_train_labeled.libsvm1.scale.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob_combined.best_cluster_wOverlap_nr_allComplexesCore_mammals_psweep_clusterone_mcl.[best].txt

output: corum_train_labeled.libsvm1.scale.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob_combined.best_cluster_wOverlap_nr_allComplexesCore_mammals_psweep_clusterone_mcl.[best].nodeTable.txt
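A node table has one row per protein. The sketch below lists each protein with the IDs of the clusters it belongs to; the column layout and the `;` separator are assumptions for illustration and are not taken from `cluster2node_table.py`.

```python
def clusters_to_node_table(clusters):
    """Return [(protein, [cluster_ids])], sorted by protein ID."""
    membership = {}
    for cluster_id, members in enumerate(clusters):
        for protein in members:
            membership.setdefault(protein, []).append(cluster_id)
    return sorted(membership.items())

# Hypothetical clusters; protein C belongs to both.
clusters = [["A", "B", "C"], ["C", "D"]]
node_table = clusters_to_node_table(clusters)
for protein, cluster_ids in node_table:
    print(protein, ";".join(str(c) for c in cluster_ids), sep="\t")
```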

Make edge attribute table

python ./protein_complex_maps/util/pairwise2clusterid.py

inputs:

  • corum_train_labeled.libsvm1.scale.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob.txt
  • corum_train_labeled.libsvm1.scale.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob.best_cluster_wOverlap_nr_allComplexesCore_mammals_psweep_clusterone_mcl.[best].txt

output: corum_train_labeled.libsvm1.scale.libsvm0.scaleByTrain.resultsWprob_pairs_noself_nodups_wprob.best_cluster_wOverlap_nr_allComplexesCore_mammals_psweep_clusterone_mcl.[best].edgeAttributeWClusterid.txt

Load into Cytoscape