EDTA v2.0.0 - faster, better, and nicer!

@oushujun oushujun released this 26 Nov 02:22
· 320 commits to master since this release

Performance improvements

  1. Use the original LTRharvest and LTR_FINDER when --threads 1 is set. This is much faster for highly fragmented genomes (> 5,000 sequences) because it reduces the number of files created (#225). Users may run EDTA_raw.pl for each TE type with --threads 1, then run EDTA.pl with multiple threads and --overwrite 0.
  2. Improve the filtering scheme for TE flanking sequences that are highly repetitive. If both flanking sequences are repetitive, filter out those with copy number > 50k on either side (based on feedback from Zhigui Bao @baozg). This avoids the program stalling on the long stretches of tandem repeats that exist in high-quality genomes.
  3. Improve and polish the filtering scheme suggested by Sergei Ryazansky @DrHogart (#136).
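
The two-stage workflow described in item 1 can be sketched as a dry run that only prints the commands it would execute. This is a sketch under assumptions, not the author's recommended invocation: the function name edta_two_stage_plan, the file name genome.fa, and the thread count are hypothetical, and the --genome and --type options (with the values ltr, tir, helitron) are given as I understand them from the EDTA help text, so check EDTA_raw.pl -h for your installed version.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of running them.
# edta_two_stage_plan is a hypothetical helper, not part of EDTA.
edta_two_stage_plan() {
  local genome="$1" threads="$2" type
  # Stage 1: raw TE discovery, one TE type at a time, single-threaded.
  # The --type values below are assumptions; verify with EDTA_raw.pl -h.
  for type in ltr tir helitron; do
    echo "EDTA_raw.pl --genome $genome --type $type --threads 1"
  done
  # Stage 2: full pipeline with multiple threads, reusing existing results.
  echo "EDTA.pl --genome $genome --overwrite 0 --threads $threads"
}

# Placeholder inputs: replace with your own genome file and thread count.
edta_two_stage_plan genome.fa 10
```

Once the printed commands look right for your installation, they can be run directly, or the output can be piped to a shell.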

New features

  1. Reduce the maximum sequence ID length from 15 to 13 characters to allow sequences > 100 Mb (#239).
  2. Support renaming LTR sequences that RepeatModeler reports via --sensitive 1 (#184).
  3. Support renaming TEsorter libraries (#184).
  4. cleanup_nested.pl: added the -clean option to allow for cleaning or not cleaning nested sequences.
  5. get_consistent_TE.pl: a new script that helps find TEs that are consistently annotated in a genome.
  6. Add more specific guidance for using EDTA installed via conda (#208).
  7. Rename and save the existing .EDTA.intact.fa.out file when using the parameter --overwrite 0.
  8. Updated EDTA_processI.pl and TE_purifier.pl: redirect RepeatMasker error messages to STDERR (suggested by Nathalie de Vries).
  9. make_panTElib.pl: a mature script that helps create a pan-genome TE library for pan-genome TE annotations. A detailed, documented usage example can be found here: https://github.com/HuffordLab/NAM-genomes/tree/master/te-annotation
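
Because item 1 lowers the sequence ID limit to 13 characters, it can help to check a genome file before starting a run. The sketch below lists IDs that exceed a given length. It assumes the limit applies to the first whitespace-delimited word of each FASTA header, and the function name long_ids is hypothetical rather than part of EDTA.

```shell
#!/usr/bin/env bash
# long_ids FASTA [MAX]: print sequence IDs longer than MAX characters.
# MAX defaults to 13. The ID is taken to be the first word after ">"
# (an assumption about how the limit is applied).
long_ids() {
  local fasta="$1" max="${2:-13}"
  awk -v max="$max" '
    /^>/ {
      id = substr($1, 2)          # drop the leading ">"
      if (length(id) > max) print id
    }
  ' "$fasta"
}

# Example usage (genome.fa is a placeholder file name):
#   long_ids genome.fa
```

If the function prints nothing, every ID fits within the limit. Otherwise, the listed sequences need shorter names before the genome is passed to EDTA.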

Issues fixed

  1. Resolve classification inconsistency when --curatedlib is provided:
    1. Added new entries and aliases to the TE SO database (#219).
    2. Format sequence IDs for library files provided via --curatedlib to use the TE SO system (#220).
    3. Check for TIR classification discrepancies between candidate sequences and library sequences using TE_SO name conversion.
  2. Resolve singularity warnings by adding "LC_ALL=C" and author info to the Dockerfile (#122).
  3. Fix #150 when flanking sequence is empty.
  4. Fixed typos in EDTA.pl and EDTA_processI.pl reported by Nathalie de Vries.

Note

If your run was successful with version 1.9.4+ and you didn't notice any particular errors, you may not need to rerun it with 2.0.0. The core filtering algorithms are not very different between these versions.