This repository is a place to plan and solidify ideas about data management in chemistry and related fields, primarily following discussions that occurred at the Chemical Science Symposium 2020: "How can machine learning and autonomy accelerate chemistry?".
It was suggested that the community should try to centralise efforts on a common core schema for data that focusses on interoperability and openness. Mechanisms for other parties to extend, re-use and adapt these schemas can then be codified.
The aim of this repository (and GitHub organisation more broadly) is to provide a version-controlled "scratchpad" for discussion, ideas and organisation. All content and names are placeholders until otherwise decided.
There is a Gitter chatroom associated with this repository that can be used for informal discussions (and potentially more in the future). There is also a Slack workspace for those interested in contributing; please request an invite on Gitter if you wish to join.
I'll try to summarise the possible to-do list that we discussed:
- If this GitHub is not the place, then potentially make a Slack or otherwise.
- A Slack and Gitter have been made.
- One option may be to request a forum on the new matsci.org platform.
- Collect interested parties, via email and so-on, and establish a focal point for discussion.
- Discussions are now ongoing on Slack and Etherpad.
- Find out who has the time to contribute, and in what ways.
- Potentially prepare a perspective paper as a "call-to-arms" for improved data standards.
All ideas and suggested changes are welcome; please submit a pull request!
This will need to be decided on by the community. As a placeholder, we adopt the Code of Conduct from the Contributor Covenant, found in CODE_OF_CONDUCT.md.
This list is a scattershot of related projects that were mentioned in discussions. For a more complete list (under construction), please see neo-chem/awesome-chemical-data.
- Comparison grid of many ELNs: produced by Harvard Biomedical Data Management (more info).
- ESCALATE: A fully-featured data platform for experiment specification, comprehension and data management.
- RightField: Semantically-tagged spreadsheets, a potential data entry solution/approach with a low barrier to entry.
- NMReData initiative: A FAIR data format for NMR experiments. CHEMeDATA tries to do the same for all of chemistry.
- SpectroscopyHub: A standardisation initiative/data platform for XPS experiments.
- OPTIMADE: An open API specification for materials databases (recently released).
- The bluesky project: A collection of Python libraries for experiment and data control.
- Blue Obelisk on GitHub.
- cheminfo: a community on GitHub and their ELN (with an option to export data to Zenodo, using a well-defined JSON structure); one instance is hosted on C6H6.
- LabTrove: chemistry ELN/"Smart Research Framework" (potentially defunct)
- Chemotion: ELN focussed on organic chemistry.
- Chemical Analysis Metadata Platform: Defined metadata and ontology for chemical analysis.
- Open Reaction Database: Connor Coley, Abby Doyle, Pfizer, Merck and Google are working on a protocol buffer schema for a chemical reactions database.
- Autoprotocol: language for specifying experimental protocols for scientific research.
A collection of papers to motivate discussion.
- Too many tags spoil the metadata: investigating the knowledge management of scientific research with semantic web technologies, Kanza, S, et al., Journal of Cheminformatics, 11, 23 (2019) 10.1186/s13321-019-0345-8.
- What influence would a cloud based semantic laboratory notebook have on the digitisation and management of scientific research? Kanza, S, University of Southampton Doctoral Thesis, (2018) 10.5258/SOTON/D0384.
- An entire PhD on the merits of semantic lab notebooks, with an open source prototype, Semanticat.
- Tremouilhac, P.; Nguyen, A.; Huang, Y.-C.; Kotov, S.; Lütjohann, D. S.; Hübsch, F.; Jung, N.; Bräse, S. Chemotion ELN: An Open Source Electronic Lab Notebook for Chemists in Academia. Journal of Cheminformatics 2017, 9 (1), 54. https://doi.org/10.1186/s13321-017-0240-0.
- Patiny, L.; Zasso, M.; Kostro, D.; Bernal, A.; Castillo, A. M.; Bolaños, A.; Asencio, M. A.; Pellet, N.; Todd, M.; Schloerer, N.; Kuhn, S.; Holmes, E.; Javor, S.; Wist, J. The C6H6 NMR Repository: An Integral Solution to Control the Flow of Your Data from the Magnet to the Public. Magnetic Resonance in Chemistry 2018, 56 (6), 520–528. https://doi.org/10.1002/mrc.4669.
- Hastings, J.; Chepelev, L.; Willighagen, E.; Adams, N.; Steinbeck, C.; Dumontier, M. The Chemical Information Ontology: Provenance and Disambiguation for Chemical Data on the Biological Semantic Web. PLOS ONE 2011, 6 (10), e25513. https://doi.org/10.1371/journal.pone.0025513.
- Chalk, S. J. SciData: A Data Model and Ontology for Semantic Representation of Scientific Data. J Cheminform 2016, 8 (1), 54. https://doi.org/10.1186/s13321-016-0168-9.
- Murray-Rust, P.; Rzepa, H. S.; Tyrrell, S. M.; Zhang, Y. Representation and Use of Chemistry in the Global Electronic Age. Org. Biomol. Chem. 2004, 2 (22), 3192–3203. https://doi.org/10.1039/B410732B.
- Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on, O'Boyle, N, et al., Journal of Cheminformatics, 3, 37 (2011) 10.1186/1758-2946-3-37.