Implement all thermodynamic property features into Thermo target. #47

Open
leeping opened this issue Feb 20, 2014 · 0 comments
leeping commented Feb 20, 2014

This is a summary of my discussions with Erik (@ebran) today. Hopefully I'll get to this implementation within the next week or two. The overall goal is to merge the existing functionality of the Liquid and Lipid targets into the more general Thermo target, which will greatly facilitate optimizations that use general combinations (or new types) of experimental thermodynamic data.

After this is done, the Liquid and Lipid targets will remain as long as there are users who are currently using them / want to use them, though it's my hope that I can someday remove them since all of their functionality will be in Thermo.

Overall workflow of Thermo target.

The target is specified in the input file as follows:

$target
name MyThermoProperties
type Thermo_GMX
properties H_ads Rho
simulations
global
steps 20000
--
name Liquid
steps 10000
--
name Adsorbed
gmx_gro conf.gro
gmx_top topol.top
gmx_mdp grompp.mdp
gmx_ndx index.ndx
/simulations
$end

Note that the properties field is a list of property names, each corresponding to a column in expset.txt (below) and to implemented methods for calculating that property and its derivatives from the MD simulation. In general, properties are calculated from time series of instantaneous observables that come out of simulations. In other words, specifying a given property will cause particular simulations to run, and particular time series to be extracted from these simulations. We won't extract all possible time series, because doing so is time-consuming and space-intensive; instead we extract only those needed. Multiple properties can make use of the same time series from the same simulation.
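The property-to-simulation mapping could be expressed as a small registry; a minimal sketch (the names PROPERTY_REQUIREMENTS and plan_extraction, and the specific observables listed, are illustrative assumptions, not the actual ForceBalance API):

```python
# Hypothetical registry: each property declares which simulations it
# needs and which instantaneous observables to extract from each.
# The observable lists here are illustrative placeholders.
PROPERTY_REQUIREMENTS = {
    "Rho":   {"Liquid": ["volume"]},
    "H_ads": {"Liquid": ["potential"], "Adsorbed": ["potential"]},
}

def plan_extraction(properties):
    """Collect the union of simulations and observables needed, so
    that each simulation runs once and each time series is extracted
    once even when several properties share it."""
    plan = {}
    for prop in properties:
        for sim, observables in PROPERTY_REQUIREMENTS[prop].items():
            plan.setdefault(sim, set()).update(observables)
    return plan
```

With this shape, requesting H_ads and Rho together would schedule the Liquid and Adsorbed simulations and extract each shared time series only once.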

The simulation settings are specified within the simulations block which is a "section input type" - i.e. in contrast to most options, there will be a custom parser for the text contained between simulations and /simulations. Each simulation has a name which corresponds to the required simulations for the calculated properties - that is to say, if a given property requires a timeseries from the Interface simulation, then the target must contain simulation settings for Interface. Simple simulation settings - such as time step - can be entered here, and depending on which MD software is used, the user may provide files like grompp.mdp and tinker.key. The global block contains default settings that are copied over to the individual simulations.
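The custom section parser described above could look roughly like this; a sketch only (function name and return shape are my assumptions), handling the `--` separators and the copying of global defaults shown in the example input:

```python
def parse_simulations_block(lines):
    """Parse the text between 'simulations' and '/simulations'.

    Sub-blocks are separated by '--'; a block headed by the bare word
    'global' supplies defaults copied into every named simulation.
    Sketch only; the real parser would live in ForceBalance's option
    machinery and do proper validation.
    """
    blocks, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line == "--":
            blocks.append(current)
            current = {}
        else:
            key, _, val = line.partition(" ")
            current[key] = val.strip()
    blocks.append(current)

    defaults, sims = {}, {}
    for blk in blocks:
        if "global" in blk:
            blk.pop("global")
            defaults = blk
        else:
            sims[blk.pop("name")] = blk
    # Global settings fill in only where a simulation did not override.
    for settings in sims.values():
        for key, val in defaults.items():
            settings.setdefault(key, val)
    return sims
```

Applied to the example above, Liquid would keep its own steps (10000) while Adsorbed inherits steps 20000 from the global block.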

In the target folder, there will be a plain text file expset.txt corresponding to the experimental data set. A separate chain of simulations will be run for each row of data in this file. This file will contain columns corresponding to the experimental data and the relevant variables for the simulation (e.g. temperature and pressure). I might want to change the default name to data.txt because my other collaborators are used to data.csv.

# Experimental data for liquid bromine.
  Label     Temp_K Density_kg/m^3 w   Enthalpy_kJ/mol w    Temperature_K Pressure_bar
  298.15K   298.15 3102.8         1.0 29.96           1.0  298.15        1.01325

# Variables: Denominators and weights for quantities
Denoms  = 30 0.3
Weights = 1.0 1.0
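A first-pass reader for this file might look like the following sketch (parse_expset is a hypothetical name; the real parser would need to keep the duplicate per-property `w` columns distinct and handle array-valued properties, which this toy version does not):

```python
def parse_expset(text):
    """Minimal sketch of an expset.txt / data.txt reader following the
    example format above: '#' lines are comments, 'key = values' lines
    are variables, the first remaining line is the header, and later
    lines are data rows. Caveat: dict(zip(...)) collapses repeated
    column names such as the per-property 'w' weights, so a real
    implementation must track columns positionally."""
    header, rows, variables = None, [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=" in line:
            key, _, val = line.partition("=")
            variables[key.strip()] = [float(x) for x in val.split()]
        elif header is None:
            header = line.split()
        else:
            rows.append(dict(zip(header, line.split())))
    return header, rows, variables
```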

Here's how the calculation will work: the optimizer first provides the force field parameters, and the target's stage() function will execute the simulations of thermodynamic properties, with one or more chains of simulations per row of experimental data. These can run either locally or through the Work Queue. The executed script, md-chain.py, takes as input the simulation settings (including initial conditions) required for calculating the properties. The simulations are executed sequentially, and the time series of instantaneous observables (e.g. potentials, volumes) are written to disk. Each time series is keyed by the simulation name and observable name. As a sanity check / indicator, md-chain.py will perform a standalone calculation of the properties that can be computed from the available time series.
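The keying scheme and the standalone sanity check could look like this sketch (synthetic data; sanity_density is a hypothetical helper, and the box size and molecule count are made-up illustration values):

```python
import numpy as np

# Illustrative only: a synthetic volume time series standing in for
# what md-chain.py would extract, keyed (simulation, observable).
rng = np.random.default_rng(0)
timeseries = {
    ("Liquid", "volume"): rng.normal(loc=30.0, scale=0.1, size=1000),  # nm^3
}

def sanity_density(vol_nm3, n_molecules, molar_mass_g):
    """Quick standalone density estimate (kg/m^3) from a volume time
    series - the kind of sanity-check number md-chain.py could print,
    analogous to what npt.py reports today."""
    avogadro = 6.02214076e23
    mass_kg = n_molecules * molar_mass_g / 1000.0 / avogadro
    mean_vol_m3 = np.mean(vol_nm3) * 1e-27  # nm^3 -> m^3
    return mass_kg / mean_vol_m3

# Hypothetical example: 500 Br2 molecules (molar mass 159.808 g/mol).
rho = sanity_density(timeseries[("Liquid", "volume")], 500, 159.808)
```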

After md-chain.py returns the time series, they will be used to calculate the thermodynamic properties themselves in the target's get() method. The current thought is to create an object for each row of experimental data, which is initialized using the time series and provides methods to calculate each property and their gradients. The main advantage of Thermo is that it allows for the implementation of any thermodynamic property as a method that operates on time series of observables from simulation chains.
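As a sketch of that per-row object (using the tentative ThermoProperties name discussed below; the h_ads formula here is a deliberately simplified placeholder, not the true definition, which may include pV and reference-state terms):

```python
import numpy as np

class ThermoProperties:
    """Sketch of the per-row-of-experimental-data object: initialized
    with the extracted time series, it exposes one method per property.
    Gradient methods would be added analogously, operating on the
    parametric-derivative time series; they are omitted here."""

    def __init__(self, timeseries):
        # timeseries: dict keyed by (simulation name, observable name)
        self.ts = timeseries

    def h_ads(self):
        """Placeholder adsorption enthalpy as a difference of mean
        potentials between the Adsorbed and Liquid simulations."""
        return (np.mean(self.ts[("Adsorbed", "potential")])
                - np.mean(self.ts[("Liquid", "potential")]))
```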

Here the time series of observables may be preprocessed using MBAR which increases the amount of data available for each row of experimental data, as long as the only variables are temperature and pressure. Also here would be the entry point for evaluating the standard error of the objective function via bootstrapping.
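The bootstrapping entry point could be as simple as the following sketch (function name is mine; a production version should resample statistically independent blocks rather than individual correlated frames):

```python
import numpy as np

def bootstrap_error(series, estimator, n_boot=200, seed=0):
    """Bootstrap standard error of a property computed from one time
    series: resample frames with replacement, re-evaluate the
    estimator, and take the spread of the resampled estimates."""
    rng = np.random.default_rng(seed)
    n = len(series)
    estimates = [estimator(series[rng.integers(0, n, n)])
                 for _ in range(n_boot)]
    return np.std(estimates, ddof=1)
```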

After the property values and their gradients are calculated, the objective function, gradient and Hessian contributions are easily computed and returned to Objective.
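Per property, that contribution follows the usual weighted least-squares form; a sketch assuming the familiar ForceBalance shape X = w * d^2 / D^2 with a Gauss-Newton Hessian (function name and argument layout are mine):

```python
import numpy as np

def objective_contribution(calc, grad, ref, denom, weight):
    """One property's contribution to the objective function, its
    gradient, and a Gauss-Newton Hessian. calc/ref are the calculated
    and experimental values, denom/weight come from the Denoms and
    Weights lines in expset.txt, and grad is the gradient of the
    calculated property with respect to the force field parameters."""
    d = (calc - ref) / denom
    X = weight * d * d
    G = weight * 2.0 * d * grad / denom
    H = weight * 2.0 * np.outer(grad, grad) / denom**2
    return X, G, H
```

Summing these over all rows and properties gives the quantities handed back to Objective.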

Required coding.

  • Parser for input options. This should be easy.

  • Need to create a class corresponding to "row of experimental data", which contains the time series of observables / experimental data and implements the methods for calculating properties and their derivatives. This code should be straightforward.

    I don't really know what to name this class, because DataSet sounds more like the entire experimental data set, and DataPoint makes it sound like just one property. How about ThermoProperties?

  • Need to modify md-chain.py so that it automatically knows how to set up and execute the simulation chain and extract the relevant time series. This should not be too hard except for these potential challenges:

    • Handling of initial conditions. Under normal circumstances, each simulation chain should store the initial conditions corresponding to the final coordinates of the last good optimization iteration. If we want to run multiple simulation chains for each row of properties, then each chain needs different initial conditions for the extra sampling to be meaningful. Because of this, I think I need a way to specify the initial conditions explicitly, rather than having a single set of default coordinates that is used everywhere.
    • Flexible handling of user input. ForceBalance will expect the user to have run parameters and simulation settings (.mdp, .top, .key etc.) inside the target folder, under a subfolder corresponding to the row label. Settings in the input file (e.g. number of MD steps) and variables in expset.txt will override the options in the run parameter files. If the run parameter files do not contain required options, ForceBalance will provide some defaults (e.g. reasonable Ewald summation settings if none provided). This is how I'm currently doing it, but I imagine md-chain.py needs to be especially careful with handling multiple layers of possible input. (Extra note: The input can be simplified greatly if the GROMACS, TINKER etc. input files are required to have the same base name as the simulation name.)
    • Calculation of relevant properties. Depending on which simulations are run, md-chain.py can calculate some properties on its own to provide a sanity check for the user (I've found this to be helpful in npt.py). Since md-chain.py already knows the simulations to run and which time series to extract, it should be able to look up which properties it is able to calculate on its own.
  • Need to implement Work Queue functionality, MBAR, and error estimation via bootstrapping into thermo.py. It also needs to make extensive use of the ThermoProperties objects. We should incorporate the backtracking feature from Liquid that reuses simulation data when an optimization step is rejected. We should also have the ability to execute multiple copies of a simulation chain for each ThermoProperties object in order to take advantage of parallelism. I think these are in order of increasing difficulty, and since this piece ties everything together, we should implement it last.

  • thermo.py needs to parse expset.txt (or data.txt depending on what we name it) intelligently. In the beginning this should be easy, because many properties are simply floating point numbers. However, in the case of deuterium order parameters, the property is a medium-sized array and we need an intuitive syntax for parsing array input. I think Keri (@kmckiern) already has code for this.

  • Engine objects need to have a more general way to run molecular dynamics, and implement methods for extracting time series. This should be easy, though I would like to rewrite the molecular_dynamics() method while preserving the compatibility with the Liquid and Lipid targets.
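The layered input handling described above (ForceBalance defaults, run-parameter files, input-file options, expset.txt variables) amounts to a precedence-ordered merge; a minimal sketch (resolve_settings and the example keys are illustrative assumptions):

```python
def resolve_settings(defaults, run_file, input_file, row_vars):
    """Merge simulation settings in increasing order of precedence:
    ForceBalance defaults < run-parameter files (.mdp/.key/.top) <
    input-file options < per-row variables from expset.txt. Later
    layers override earlier ones key by key."""
    merged = {}
    for layer in (defaults, run_file, input_file, row_vars):
        merged.update(layer)
    return merged
```

The real implementation will be messier (unit handling, per-engine key names), but keeping the precedence rule in one place should make md-chain.py's multiple layers of input easier to reason about.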

Considerations for future properties

  • I'm not an expert at free energy calculations, but it's best to know whether this framework will accommodate free energies (as a simulation in the chain and a method in ThermoProperties), or if it is too complex and requires a different framework.
  • Protein NMR measurements. Are there any special considerations here?