This repository contains the scripts of the PDBBind-Opt workflow, which combines several open-source software packages to detect and fix structural problems in PDBBind.
pre_process/
: Scripts to prepare the PDBBind and BioLiP datasets (identifying ligands and extracting binding affinity data)
workflow/
: Code for the PDBBind-Opt workflow
dimorphite_dl
: Package to assign protonation states. We modified the site_substructures.smarts to make the rules easier.
fix_ligand.py
: LigandFixer module
fix_protein.py
: ProteinFixer module
process.py
: Main workflow
rcsb.py
: Functions to query RCSB (i.e. downloading files, querying SMILES strings)
gather.py
: Functions to create metadata csv files
fix_polymer.py
: Functions to fix polymer ligands
maual_smiles.json
: Manually corrected reference SMILES
building_blocks.csv
: SMILES of alpha-amino acids and common N/C terminal caps. Used to create reference SMILES for polymers
error_fix/
: Contains some error analysis
figshare/
: Metadata of BioLiP2-Opt and PDBBind-Opt deposited in the Figshare repo.
The BioLiP2-Opt dataset prepared by the PDBBind-Opt workflow can be found in this Figshare repository.
The PDBBind-Opt dataset is not directly accessible at the moment; we will find the best way to release it soon. In the meantime, users can reproduce the PDBBind-Opt dataset by following the instructions below.
- Step 1: Download the PDBBind index file from their official website. Run download.sh in the pre_process directory to download the BioLiP2 dataset.
- Step 2: Run pre_process/create_dataset_csv.ipynb to extract binding affinity data and identify ligands. This produces three csv files.
- Step 3: Go to the workflow directory and run the workflow with the following commands:
mkdir ../raw_data
python process.py -i ../pre_process/BioLiP_bind_sm.csv -d ../raw_data/biolip2_opt
python process.py -i ../pre_process/PDBBind_poly.csv -d ../raw_data/pdbbind_opt_poly --poly
python process.py -i ../pre_process/PDBBind_sm.csv -d ../raw_data/pdbbind_opt_sm
This takes about one day on a 256-core CPU. If you have more nodes, consider splitting the input csv file into several chunks and running them in parallel. When the workflow finishes, each PDBID has its own folder in the output directory. If the workflow succeeded for a PDBID, its folder contains a file named done.tag; otherwise it contains a file named err.
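The chunking suggested above could be sketched as follows. This is a minimal illustration, not part of the repository: the helper name split_csv and the chunk file naming are hypothetical, and it assumes the input csv has a single header row that should be repeated in every chunk.

```python
import csv
from pathlib import Path


def split_csv(input_csv, out_dir, n_chunks):
    """Split input_csv into up to n_chunks files, repeating the header in each.

    Hypothetical helper; assumes the first row of input_csv is a header.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")

    input_csv = Path(input_csv)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with input_csv.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    chunk_paths = []
    for i in range(n_chunks):
        # Round-robin assignment keeps chunk sizes within one row of each other.
        chunk_rows = rows[i::n_chunks]
        if not chunk_rows:
            continue  # fewer rows than chunks: skip empty chunks
        path = out_dir / f"{input_csv.stem}_chunk{i:03d}.csv"
        with path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(chunk_rows)
        chunk_paths.append(path)
    return chunk_paths
```

Each resulting chunk could then be passed to process.py through -i, one job per node. Whether several jobs can safely share the same -d output directory is not stated here, so a separate output directory per chunk is the cautious choice.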
- Step 4: Run gather.py to create the metadata files, for example:
python gather.py -i ../pre_process/BioLiP_bind_sm.csv -d ../raw_data/biolip2_opt -o ../figshare/biolip2_opt/biolip2_opt.csv
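Before gathering metadata, it can help to tally how many entries succeeded. The sketch below relies only on the done.tag and err marker files described in Step 3. The function name summarize_run is hypothetical, and it assumes that each PDBID folder sits directly under the output directory with the marker file directly inside it.

```python
from pathlib import Path


def summarize_run(output_dir):
    """Classify each per-PDBID folder as done, failed, or pending.

    Hypothetical helper; assumes the markers 'done.tag' and 'err'
    are located directly inside each PDBID folder.
    """
    summary = {"done": [], "failed": [], "pending": []}
    for entry in sorted(Path(output_dir).iterdir()):
        if not entry.is_dir():
            continue
        if (entry / "done.tag").exists():
            summary["done"].append(entry.name)
        elif (entry / "err").exists():
            summary["failed"].append(entry.name)
        else:
            # Neither marker present: not processed yet or still running.
            summary["pending"].append(entry.name)
    return summary


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        result = summarize_run(sys.argv[1])
        for status, names in result.items():
            print(f"{status}: {len(names)}")
```

Run as a script with the output directory as its only argument (for example ../raw_data/biolip2_opt), it prints one count per status.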
After creating the environment with conda create -n PDBBindOPTenv, most packages can be installed directly with pip, for example pip install gemmi, pip install rdkit-pypi, and pip install openmm. In my experience (HPC, Linux, Python==3.11.9 environment), a few packages are not easy for newcomers to install with conda install -c conda-forge: openmmforcefields, openff, pdbfixer and openbabel.
For these I recommend mamba (mamba, not mamda).
- Install Miniforge
# in my case, I install
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh"
bash Miniforge3-Linux-x86_64.sh
- Navigate to the ${HOME} root; you will see a new miniforge3 folder alongside your miniconda3 folder. In ${HOME}/miniforge3/etc/profile.d/ you will see conda.sh and mamba.sh. Source them:
source /${HOME}/miniforge3/etc/profile.d/conda.sh
source /${HOME}/miniforge3/etc/profile.d/mamba.sh
- At this moment, if we check conda env list, we will see
# conda environments:
#
                         /${HOME}/miniconda3
                         /${HOME}/miniconda3/envs/PDBBindOPTenv
base                     /${HOME}/miniforge3
conda activate /${HOME}/miniconda3/envs/PDBBindOPTenv
mamba install -c conda-forge openmmforcefields
mamba install -c conda-forge openff-toolkit
mamba install -c conda-forge pdbfixer
mamba install -c conda-forge openbabel
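Once the installs finish, a quick check that the packages are visible to the active interpreter can save a failed run later. The sketch below is an illustration only: the helper name check_modules is hypothetical, and the import names listed (for example openff.toolkit for the openff-toolkit package) are assumptions about how each package exposes itself, so adjust them if your versions differ.

```python
import importlib.util

# Assumed import names for the packages mentioned above.
REQUIRED_MODULES = [
    "gemmi",
    "rdkit",
    "openmm",
    "openmmforcefields",
    "openff.toolkit",
    "pdbfixer",
    "openbabel",
]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be located."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is missing
            # or when a module's spec is malformed.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in check_modules(REQUIRED_MODULES).items():
        print(f"{name:20s} {'ok' if ok else 'MISSING'}")
```

Run it inside the activated PDBBindOPTenv environment. The check only locates each module and does not fully import it, so it will not catch a package that is installed but broken.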