Many biological databases split, merge, or withdraw identifiers. Below you can see some examples from HGNC:
HGNC_ID | STATUS | WITHDRAWN_SYMBOL | MERGED_INTO_REPORT(S) (i.e. HGNC_ID/SYMBOL/STATUS) |
---|---|---|---|
HGNC:1531 | Merged/Split | CBBM | HGNC:4206/OPN1MW/Approved, HGNC:9936/OPN1LW/Approved |
HGNC:354 | Merged/Split | AIH2 | HGNC:3344/ENAM/Approved |
HGNC:440 | Entry Withdrawn | ALPPL1 | |
Some entries are withdrawn (deleted) from a database and no longer exist. HGNC:440 is an example of a withdrawn id from HGNC.
When an id is split or merged in a database, one or more new ids are used for that entity. The new ids are called primary ids, while the split/merged id is the secondary id, which is no longer used.
secondary id | primary id |
---|---|
HGNC:1531 | HGNC:4206 |
HGNC:1531 | HGNC:9936 |
CBBM | OPN1MW |
CBBM | OPN1LW |
HGNC:354 | HGNC:3344 |
AIH2 | ENAM |
Split ids may introduce one-to-many mappings (for example, HGNC:1531 maps to both HGNC:4206 and HGNC:9936), which should be further evaluated.
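
One way to spot such split ids is to group the secondary-to-primary pairs by secondary id and report every secondary id that has more than one primary id. The sketch below is only an illustration using the HGNC rows from the table above; the class `SplitIdCheck` and its methods are hypothetical and are not part of the BridgeDb code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the BridgeDb tooling.
class SplitIdCheck {

    /** Groups (secondary, primary) pairs by secondary id, keeping input order. */
    public static Map<String, List<String>> groupBySecondary(String[][] pairs) {
        Map<String, List<String>> bySecondary = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            bySecondary.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return bySecondary;
    }

    /** Returns the secondary ids that map to more than one primary id. */
    public static List<String> oneToMany(Map<String, List<String>> bySecondary) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : bySecondary.entrySet()) {
            if (entry.getValue().size() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pairs copied from the secondary/primary id table above.
        String[][] pairs = {
            {"HGNC:1531", "HGNC:4206"},
            {"HGNC:1531", "HGNC:9936"},
            {"HGNC:354", "HGNC:3344"},
        };
        System.out.println(oneToMany(groupBySecondary(pairs))); // prints [HGNC:1531]
    }
}
```

How a flagged id should then be handled (kept, dropped, or curated by hand) depends on the use case and is not decided by this sketch.
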
In some databases, multiple ids refer to the same entity. We define these ids as duplicate ids. Below you see an example from HMDB:
Version | Status | Creation Date | Update Date | HMDB ID | Secondary Accession Numbers |
---|---|---|---|---|---|
5.0 | Detected and Quantified | 2006-08-13 13:18:56 UTC | 2021-09-14 14:59:00 UTC | HMDB0004160 | HMDB0004159, HMDB0004161, HMDB04159, HMDB04160, HMDB04161 |
In this case, the id currently used by the database to refer to the entity is the primary id.
duplicate id | primary id |
---|---|
HMDB0004159 | HMDB0004160 |
HMDB0004161 | HMDB0004160 |
HMDB04159 | HMDB0004160 |
HMDB04160 | HMDB0004160 |
HMDB04161 | HMDB0004160 |
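
The table above can be derived from the HMDB record by pairing each secondary accession number with the current HMDB ID. The following is a minimal sketch of that step, assuming the secondary accessions arrive as one comma-separated string; the class `DuplicateIdExpander` and its `expand` method are hypothetical names, not the BridgeDb implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the BridgeDb tooling.
class DuplicateIdExpander {

    /**
     * Turns one primary id and its comma-separated secondary accessions
     * into tab-separated "duplicate id / primary id" rows.
     */
    public static List<String> expand(String primaryId, String secondaryAccessions) {
        List<String> rows = new ArrayList<>();
        for (String id : secondaryAccessions.split(",")) {
            String duplicate = id.trim();
            // Skip empty fields and an accidental self-mapping.
            if (!duplicate.isEmpty() && !duplicate.equals(primaryId)) {
                rows.add(duplicate + "\t" + primaryId);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Values copied from the HMDB example above.
        String primary = "HMDB0004160";
        String secondary = "HMDB0004159, HMDB0004161, HMDB04159, HMDB04160, HMDB04161";
        expand(primary, secondary).forEach(System.out::println);
    }
}
```

Run on the HMDB example, this yields five rows, one per secondary accession number, each pointing to HMDB0004160.
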
The BridgeDb project is collecting this information to create secondary-to-primary mapping databases, which will improve data interoperability.
Java 11 is required. Build the project with Maven:
mvn clean install assembly:single
- when the input is a txt file
java -cp target/create-bridgedb-secondary2primary-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.bridgedb.sec2pri.TXTsec2pri $databaseName $databaseCode $separator
databaseName
: database name located in the input directory. Some examples of input data can be found here;
databaseCode
: the annotation of the data source, called a SystemCode, extracted from here;
separator
: the field separator character.
- when the input is a zip file containing the XML files
java -cp target/create-bridgedb-secondary2primary-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.bridgedb.sec2pri.XMLsec2pri $databaseName $databaseCode $databaseSymbolCode $priIdNode $secIdNode $secIdNodeTag $priSymbolNode $secSymbolNode $secSymbolNodeTag
The files are released via the BridgeDb website.
The mapping files are also archived on Zenodo.