create-bridgedb-secondary2primary

Introduction

Many biological databases split, merge, or withdraw identifiers over time. Some examples from HGNC:

HGNC_ID	STATUS	WITHDRAWN_SYMBOL	MERGED_INTO_REPORT(S) (i.e HGNC_ID/SYMBOL/STATUS)
HGNC:1531	Merged/Split	CBBM	HGNC:4206/OPN1MW/Approved, HGNC:9936/OPN1LW/Approved
HGNC:354	Merged/Split	AIH2	HGNC:3344/ENAM/Approved
HGNC:440	Entry Withdrawn	ALPPL1

Withdrawn entries (deleted ids)

Some molecular entries are withdrawn (deleted) from a database and no longer exist. HGNC:440 is an example of a withdrawn id in HGNC.

Split/merged ids

When an id is split or merged in a database, one or more new ids are used for that entity. The new ids are called primary ids, while the split/merged id is the secondary id, which is no longer used.

secondary id	primary id
HGNC:1531	HGNC:4206
HGNC:1531	HGNC:9936
CBBM	OPN1MW
CBBM	OPN1LW
HGNC:354	HGNC:3344
AIH2	ENAM

Split ids can introduce one-to-many mappings (for example, HGNC:1531 maps to both HGNC:4206 and HGNC:9936), which should be evaluated further.
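The expansion from an HGNC row to secondary/primary pairs can be sketched roughly as follows. This is a minimal illustration, assuming a tab-separated row laid out like the HGNC example above; the class `SplitMergeSketch` and its methods are hypothetical and are not part of this repository's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the repository's implementation.
class SplitMergeSketch {

    /** Expands one tab-separated row into {secondary, primary} pairs. */
    static List<String[]> expand(String row) {
        List<String[]> pairs = new ArrayList<>();
        String[] cols = row.split("\t");
        // Withdrawn entries have no MERGED_INTO_REPORT(S) value, so they yield no pairs.
        if (cols.length < 4 || cols[3].isBlank()) {
            return pairs;
        }
        String secondaryId = cols[0].trim();
        String secondarySymbol = cols[2].trim();
        for (String report : cols[3].split(",")) {
            // Each report is assumed to have the form HGNC_ID/SYMBOL/STATUS.
            String[] parts = report.trim().split("/");
            if (parts.length < 2) {
                continue;
            }
            pairs.add(new String[] {secondaryId, parts[0]});
            pairs.add(new String[] {secondarySymbol, parts[1]});
        }
        return pairs;
    }

    /** Returns true when the given secondary id maps to more than one distinct primary id. */
    static boolean isOneToMany(List<String[]> pairs, String secondary) {
        long distinctPrimaries = pairs.stream()
                .filter(p -> p[0].equals(secondary))
                .map(p -> p[1])
                .distinct()
                .count();
        return distinctPrimaries > 1;
    }

    public static void main(String[] args) {
        String row = "HGNC:1531\tMerged/Split\tCBBM\t"
                + "HGNC:4206/OPN1MW/Approved, HGNC:9936/OPN1LW/Approved";
        for (String[] pair : expand(row)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Flagging one-to-many cases separately, as `isOneToMany` does here, keeps the split ids visible so they can be reviewed rather than silently resolved to a single primary id.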

Secondary ids vs duplicate ids

In some databases, multiple ids refer to the same entity. We define these ids as duplicate ids. Below is an example from HMDB:

Version	Status	Creation Date	Update Date	HMDB ID	Secondary Accession Numbers
5.0	Detected and Quantified	2006-08-13 13:18:56 UTC	2021-09-14 14:59:00 UTC	HMDB0004160	HMDB0004159, HMDB0004161, HMDB04159, HMDB04160, HMDB04161

In this case, the id that the database currently uses to refer to the entity is the primary id.

duplicate id	primary id
HMDB0004159	HMDB0004160
HMDB0004161	HMDB0004160
HMDB04159	HMDB0004160
HMDB04160	HMDB0004160
HMDB04161	HMDB0004160
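The table above can be derived from the HMDB record with a small amount of code. The sketch below assumes the secondary accession numbers arrive as one comma-separated string, as in the example record; `DuplicateIdSketch` and `toPrimary` are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the repository's implementation.
class DuplicateIdSketch {

    /** Maps every duplicate id in a comma-separated list to the given primary id. */
    static Map<String, String> toPrimary(String primaryId, String secondaryAccessions) {
        // LinkedHashMap keeps the ids in the order they were listed.
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String id : secondaryAccessions.split(",")) {
            String duplicate = id.trim();
            // Skip empty tokens and avoid mapping the primary id to itself.
            if (!duplicate.isEmpty() && !duplicate.equals(primaryId)) {
                mapping.put(duplicate, primaryId);
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = toPrimary(
                "HMDB0004160",
                "HMDB0004159, HMDB0004161, HMDB04159, HMDB04160, HMDB04161");
        mapping.forEach((duplicate, primary) ->
                System.out.println(duplicate + "\t" + primary));
    }
}
```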

The BridgeDb project collects this information to create secondary-to-primary mapping databases, which will improve data interoperability.

Installation

Java 11 is required.

mvn clean install assembly:single

How to create a Derby file

When the input is a txt file

java -cp target/create-bridgedb-secondary2primary-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.bridgedb.sec2pri.TXTsec2pri $databaseName $databaseCode $separator

databaseName: the name of the database, as located in the input directory. Some examples of input data can be found here.

databaseCode: the annotation of the data source database, called SystemCodes, extracted from here.

separator: the field separator character.

When the input is a zip file containing the XML files

java -cp target/create-bridgedb-secondary2primary-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.bridgedb.sec2pri.XMLsec2pri $databaseName $databaseCode $databaseSymbolCode $priIdNode $secIdNode $secIdNodeTag $priSymbolNode $secSymbolNode $secSymbolNodeTag

Releases

The files are released via the BridgeDb website.

The mapping files are also archived on Zenodo.
