4.5 Ambiguous symbols
The underlying sequence files in `.FASTA` format can contain any of the following symbols:
A // Adenine
C // Cytosine
G // Guanine
T // Thymine
- // Deletion
N // failed read / any
R // A or G
Y // C or T
S // C or G
W // A or T
K // G or T
M // A or C
B // not A
D // not C
H // not G
V // not T
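The symbol table above can be written as a lookup from each symbol to the bases it may stand for. The following is a minimal Python sketch for illustration only; the names `POSSIBLE_BASES` and `codes_compatible_with` are hypothetical and not part of LAPIS. As an assumption of this sketch, `-` and `N` are left out of the lookup because they do not name a choice among specific bases.

```python
# Each symbol mapped to the set of bases it can represent.
# "-" (deletion) and "N" (failed read / any) are deliberately omitted (assumption of this sketch).
POSSIBLE_BASES = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
    "B": {"C", "G", "T"},  # not A
    "D": {"A", "G", "T"},  # not C
    "H": {"A", "C", "T"},  # not G
    "V": {"A", "C", "G"},  # not T
}


def codes_compatible_with(base):
    """Return, in alphabetical order, every symbol that can represent `base`."""
    return sorted(code for code, bases in POSSIBLE_BASES.items() if base in bases)


print(codes_compatible_with("G"))  # ['B', 'D', 'G', 'K', 'R', 'S', 'V']
```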
While one mostly queries for the symbols `A`, `C`, `G`, `T` and `-` to look for specific features and mutations of a sequence, or for `N` for quality control of the underlying data, the ambiguous symbols `R` through `V` are often too cumbersome to consider in analyses.
LAPIS supports the flexible consideration of these ambiguous symbols through an extension of the boolean logic syntax in the variant queries.
Here we introduce two new expressions. `Maybe` (or `UpperBound`) also considers sequences that have an ambiguous code which may match the queried value. The complementary expression `Exact` (or `LowerBound`) considers only sequences that are certain to match.
Consider the following sequences:
12345
AAACG
AARCG
AANCG
AAGCG
AAACG
A filter for the mutation `3G` returns only the sequence `AAGCG`, as it is the only sequence with the symbol `G` at position 3.
The filter `Maybe(3G)`, however, also considers that the sequence `AARCG` may have the symbol `G` at position 3, because the symbol `R` can represent Guanine.
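The difference between the plain filter and `Maybe` can be sketched on the five example sequences. This is an illustrative Python sketch, not the LAPIS implementation; `matches_exact`, `matches_maybe` and `POSSIBLE_BASES` are hypothetical names. It assumes that `N` is treated as missing data and never satisfies `Maybe`, which is the reading consistent with the result lists in this section.

```python
SEQUENCES = ["AAACG", "AARCG", "AANCG", "AAGCG", "AAACG"]

# Bases each symbol can represent. "N" and "-" are absent on purpose:
# this sketch assumes they never count as a possible match.
POSSIBLE_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def matches_exact(sequence, position, symbol):
    """True if the sequence has exactly `symbol` at the 1-based `position`."""
    return sequence[position - 1] == symbol


def matches_maybe(sequence, position, symbol):
    """True if the observed symbol at `position` could represent `symbol`."""
    observed = sequence[position - 1]
    return symbol in POSSIBLE_BASES.get(observed, "")


print([s for s in SEQUENCES if matches_exact(s, 3, "G")])  # ['AAGCG']
print([s for s in SEQUENCES if matches_maybe(s, 3, "G")])  # ['AARCG', 'AAGCG']
```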
Conversely, the query `Not(3A)` returns the sequences:
`AARCG`
`AANCG`
`AAGCG`
If you want to restrict the set of sequences to those which also do not have an ambiguous code containing `A` at position 3, you can get the lower bound of the sequences using the query `Exact(Not(3A))` or, equivalently, `Not(Maybe(3A))`:
`AANCG`
`AAGCG`
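One way to see why the lower bound of a negation equals the negation of the upper bound is a small evaluator in which `Not` swaps the bound being computed. The Python sketch below is an assumption about how such semantics could be modelled, not a description of LAPIS internals. The tuple-based query format and the names `evaluate` and `select` are hypothetical, and `N` is again assumed never to satisfy `Maybe`.

```python
SEQUENCES = ["AAACG", "AARCG", "AANCG", "AAGCG", "AAACG"]

POSSIBLE_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def evaluate(expr, sequence, mode=None):
    """Evaluate a query tree against one sequence.

    expr is ("sym", position, symbol), ("not", child),
    ("maybe", child) or ("exact", child).
    mode is None (plain query), "exact" (lower bound) or "maybe" (upper bound).
    """
    kind = expr[0]
    if kind == "sym":
        _, position, symbol = expr
        observed = sequence[position - 1]
        if mode == "maybe":
            return symbol in POSSIBLE_BASES.get(observed, "")
        return observed == symbol
    if kind == "not":
        # The lower bound of Not(x) is the complement of the upper bound of x,
        # and vice versa; a plain query just takes the complement.
        child_mode = {"exact": "maybe", "maybe": "exact", None: None}[mode]
        return not evaluate(expr[1], sequence, child_mode)
    if kind in ("maybe", "exact"):
        return evaluate(expr[1], sequence, kind)
    raise ValueError(f"unknown expression kind: {kind!r}")


def select(expr):
    """Return the example sequences that satisfy the query tree."""
    return [s for s in SEQUENCES if evaluate(expr, s)]


three_a = ("sym", 3, "A")
print(select(("not", three_a)))             # ['AARCG', 'AANCG', 'AAGCG']
print(select(("exact", ("not", three_a))))  # ['AANCG', 'AAGCG']
print(select(("not", ("maybe", three_a))))  # ['AANCG', 'AAGCG']
```

Under these assumptions, both formulations drop `AARCG`, because `R` may stand for `A`.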