Commit

Keep only sequences without errors
manuelburger committed Nov 5, 2024
1 parent 5872c04 commit eaaaca4
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions src/nanotron/data/petagraph_dataset.py
@@ -378,8 +378,9 @@ def fasta_parsing_func(self, input_data: Tuple[str, bytes]):
         decoded_lines = data.decode()
         sequences = [str(s.seq) for s in SeqIO.parse(StringIO(decoded_lines), "fasta")]

-        # make sure only ALPHABET
-        # sequences = ["".join([c for c in s if c in ALPHABET]) for s in sequences]
+        # Following DNA-BERTv2: https://arxiv.org/pdf/2306.15006
+        # Zhou et al.: "We exclude all sequences with N and retain only sequences that consist of A, T, C, and G.
+        sequences = [s for s in sequences if set(s).issubset(ALPHABET)]

         # Chop sequences in preparation for graph traversal
         sequences = [self.chop_at_first_repeated_kmer(s, k=KMER_LENGTH) for s in sequences]
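
The behavioural difference between the commented-out line and the new filter can be illustrated with a small standalone sketch. It assumes `ALPHABET` contains exactly the characters A, C, G and T (the diff does not show its definition), and the helper names `strip_invalid` and `keep_clean` are hypothetical, introduced only for this comparison.

```python
# Assumption: ALPHABET holds the four unambiguous nucleotide characters.
# The real definition lives elsewhere in petagraph_dataset.py and is not shown in the diff.
ALPHABET = frozenset("ACGT")


def strip_invalid(sequences):
    """Commented-out approach: delete offending characters, keep every sequence."""
    return ["".join(c for c in s if c in ALPHABET) for s in sequences]


def keep_clean(sequences):
    """Approach in this commit: drop any sequence with a character outside ALPHABET."""
    return [s for s in sequences if set(s).issubset(ALPHABET)]


# Hypothetical inputs: a clean read, a read with ambiguous bases,
# a lowercase (soft-masked) read, and an empty string.
sequences = ["ACGT", "ACNNGT", "acgt", ""]

stripped = strip_invalid(sequences)
kept = keep_clean(sequences)

print(stripped)  # ['ACGT', 'ACGT', '', '']
print(kept)      # ['ACGT', '']
```

Stripping characters silently joins the bases on either side of an `N` run, which creates k-mers that never occur in the original sequence; dropping the whole sequence avoids that. Under the assumed alphabet, two side effects are worth noting: lowercase sequences are discarded unless they are upper-cased first, and an empty string passes the subset test, so it is still handed to the chopping step.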