Commit

add todo
manuelburger committed Nov 5, 2024
1 parent e031d07 commit 22bfa96
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/nanotron/data/petagraph_dataset.py
@@ -378,7 +378,7 @@ def fasta_parsing_func(self, input_data: Tuple[str, bytes]):
decoded_lines = data.decode()
sequences = [str(s.seq) for s in SeqIO.parse(StringIO(decoded_lines), "fasta")]

- # make sure only ALPHABET
+ # make sure only ALPHABET, TODO: align with training vocabulary allow "N" to pass through
sequences = ["".join([c for c in s if c in ALPHABET]) for s in sequences]

# Chop sequences in preparation for graph traversal
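
The TODO added in this commit concerns which characters survive the filter. As a minimal sketch of what letting "N" pass through might look like, the snippet below parameterizes the allowed character set. The contents of `ALPHABET` are not shown in this diff, so the value `"ACGT"` is an assumption, and `ALPHABET_WITH_N` and `filter_sequences` are hypothetical names used only for illustration.

```python
from typing import AbstractSet, Iterable, List

# Assumption: the real ALPHABET in petagraph_dataset.py is not visible in this diff.
ALPHABET: AbstractSet[str] = frozenset("ACGT")
# Hypothetical relaxed alphabet that also keeps the ambiguity code "N".
ALPHABET_WITH_N: AbstractSet[str] = ALPHABET | {"N"}


def filter_sequences(
    sequences: Iterable[str], allowed: AbstractSet[str] = ALPHABET
) -> List[str]:
    """Drop every character that is not in `allowed` (hypothetical helper)."""
    return ["".join(c for c in s if c in allowed) for s in sequences]


raw = ["ACGTNNAC", "acgtRYAC"]
strict = filter_sequences(raw)                    # current behaviour: "N" is removed
relaxed = filter_sequences(raw, ALPHABET_WITH_N)  # possible TODO behaviour: "N" is kept
```

Note that the filter is case-sensitive, so lowercase (soft-masked) bases are dropped as well unless the allowed set includes them. Whether that is desired depends on the training vocabulary the TODO refers to.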