diff --git a/faqs/index.html b/faqs/index.html
index e49af8b..f0a55c2 100644
--- a/faqs/index.html
+++ b/faqs/index.html
@@ -59,7 +59,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/faqs/",
"headline": "FAQs",
"description": "Table of contents Table of contents Does LexicMap support short reads? Does LexicMap support fungi genomes? How’s the hardware requirement? Can I extract the matched sequences? How can I extract the upstream and downstream flanking sequences of matched regions? Why isn’t the pident 100% when aligning with a sequence from the reference genomes? Why is LexicMap slow for batch searching? Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene\/plasmid\/virus\/phage sequences) longer than 200 bp by default.",
- "wordCount" : "773",
+ "wordCount" : "818",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1830,21 +1830,31 @@
FAQs
LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 17 million) of genomes.
There are some ways to improve the search speed of lexicmap search.
-
Increasing the concurrency number
+
Increasing the concurrency number
-
Increasing the value of --max-open-files (default 512). You might need to
+
Increasing the value of --max-open-files (default 512). You might need to
+>change the open files limit.
+
+
+
(If you have many queries) Increase the value of -J/--max-query-conc (default 12); this will also increase memory usage.
+
-
Loading the entire seed data into memory (It’s unnecessary if the index is stored on an SSD)
+
Loading the entire seed data into memory (It’s unnecessary if the index is stored on an SSD)
Setting -w/--load-whole-seeds to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.
-
Returning fewer results
+
Returning fewer results
Setting -n/--top-n-genomes to keep top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time.
diff --git a/search/en.data.min.json b/search/en.data.min.json
index 2e0f01d..ad889b1 100644
--- a/search/en.data.min.json
+++ b/search/en.data.min.json
@@ -1 +1 @@
-[{"id":0,"href":"/LexicMap/tutorials/misc/index-gtdb/","title":"Indexing GTDB","parent":"More","content":"Info:\nhttps://gtdb.ecogenomic.org/ Tools:\nhttps://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\ntime genome_updater.sh -d \u0026quot;refseq,genbank\u0026quot; -g \u0026quot;archaea,bacteria\u0026quot; \\ -f \u0026quot;genomic.fna.gz\u0026quot; -o \u0026quot;GTDB_complete\u0026quot; -M \u0026quot;gtdb\u0026quot; -t 12 -m -L curl cd GTDB_complete/2024-01-30_19-34-40/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0026quot;*.gz\u0026quot; \\ fd \u0026quot;.gz$\u0026quot; $genomes \\ | rush --eta 'seqkit seq -w 0 {} \u0026gt; /dev/null; if [ $? -ne 0 ]; then echo {}; fi' \\ \u0026gt; failed.txt # empty files find $genomes -name \u0026quot;*.gz\u0026quot; -size 0 \u0026gt;\u0026gt; failed.txt # delete these files cat failed.txt | rush '/bin/rm {}' # redownload them: # run the genome_updater command again, with the flag -i Indexing. On a 48-CPU machine, time: 11 h, ram: 64 GB, index size: 906 GB. 
If you don\u0026rsquo;t have enough memory, please decrease the value of -b.\nlexicmap index \\ -I files/ \\ --ref-name-regexp '^(\\w{3}_\\d{9}\\.\\d+)' \\ -O gtdb_complete.lmi --log gtdb_complete.lmi.log \\ -b 5000 Files:\n$ du -sh files gtdb_complete.lmi --apparent-size 413G files 907G gtdb_complete.lmi $ dirsize gtdb_complete.lmi gtdb_complete.lmi: 906.14 GiB (972,962,162,476) 543.06 GiB seeds 362.98 GiB genomes 102.37 MiB kmers-m12345.tsv 9.60 MiB genomes.map.bin 312.53 KiB masks.bin 330 B info.toml ","description":"Info:\nhttps://gtdb.ecogenomic.org/ Tools:\nhttps://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\ntime genome_updater.sh -d \u0026quot;refseq,genbank\u0026quot; -g \u0026quot;archaea,bacteria\u0026quot; \\ -f \u0026quot;genomic.fna.gz\u0026quot; -o \u0026quot;GTDB_complete\u0026quot; -M \u0026quot;gtdb\u0026quot; -t 12 -m -L curl cd GTDB_complete/2024-01-30_19-34-40/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0026quot;*.gz\u0026quot; \\ fd \u0026quot;.gz$\u0026quot; $genomes \\ | rush --eta 'seqkit seq -w 0 {} \u0026gt; /dev/null; if [ $?"},{"id":1,"href":"/LexicMap/usage/utils/masks/","title":"masks","parent":"utils","content":"$ lexicmap utils masks -h View masks of the index or generate new masks randomly Usage: lexicmap utils masks [flags] { -d \u0026lt;index path\u0026gt; | [-k \u0026lt;k\u0026gt;] [-n \u0026lt;masks\u0026gt;] [-s \u0026lt;seed\u0026gt;] } [-o out.tsv.gz] Flags: -h, --help help for masks -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -m, --masks int ► Number of masks. 
(default 40000) -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) -p, --prefix int ► Length of mask k-mer prefix for checking low-complexity (0 for no checking). (default 15) -s, --seed int ► The seed for generating random masks. (default 1) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples $ lexicmap utils masks --quiet -d demo.lmi/ | head -n 10 1 AAAAAAAAGTCACTTGACAATCCACACGGTG 2 AAAAAAACTGCTTGCACCTTTCTCGCCTCTC 3 AAAAAAATTCTCGGCGGTGTTTCCAGGCGCA 4 AAAAAACCCAAGCGCGAAAGCCTGAACAACC 5 AAAAAACGTGGCGTCCCCTGTATAACGGCTA 6 AAAAAAGAGGGGAAGCAAGCTGAAGGATATG 7 AAAAAAGCTTAGTGTGAATGAATGGCTTCCG 8 AAAAAATCCAGGGTTCCGTTAAGGATCTGTC 9 AAAAAATGCCTCGCAGAGCAGGCTATGCTGA 10 AAAAAATTGATTCTTAGAGCGTTCCCGCCCA $ lexicmap utils masks --quiet -d demo.lmi/ | tail -n 10 39991 TTTTTTACACGCTGTGACTGCATTACAAAAA 39992 TTTTTTAGCCAGGGTTCACAGCGCCAAAACA 39993 TTTTTTATCGGACGCCAAGTTTGTAATCGTC 39994 TTTTTTCACTCGCATCTAGGAAGGAAGCATA 39995 TTTTTTCTTGCATCGTATTCAGCACGTTCCT 39996 TTTTTTGCCGAGTGACCCCGAAAAGCTCACA 39997 TTTTTTGGCGTGAGGCATTGTTTACTGCCTT 39998 TTTTTTTAAGTGGTCGTGGTAGGAGCCTCAC 39999 TTTTTTTCCGTAACTAGGTTCTGGCGATTCC 40000 TTTTTTTGAGGGTATAAGATAGAGAAAAGCT # check a specific mask $ lexicmap utils masks --quiet -d demo.lmi/ -m 12345 12345 CATTAGTAGAAGAAGGCACAATGTATCGTCG Frequency of prefixes.\n$ lexicmap utils masks --quiet -d demo.lmi/ \\ | csvtk mutate -Ht -f 2 -p \u0026#39;^(.{7})\u0026#39; \\ | csvtk freq -Ht -f 3 -nr \\ | head -n 10 AAAAAAA 3 AAAAAAT 3 AAAAACA 3 AAAAACC 3 AAAAACG 3 AAAAACT 3 AAAAAGC 3 AAAAAGG 3 AAAAAGT 3 AAAAATT 3 $ lexicmap utils masks --quiet -d demo.lmi/ \\ | csvtk 
mutate -Ht -f 2 -p \u0026#39;^(.{7})\u0026#39; \\ | csvtk freq -Ht -f 3 -n \\ | head -n 10 AAAAAAC 2 AAAAAAG 2 AAAAAGA 2 AAAAATA 2 AAAAATC 2 AAAAATG 2 AAAACAC 2 AAAACAT 2 AAAACCG 2 AAAACGC 2 Frequency of frequencies. i.e., for 40,000 masks and 7-bp prefixes, there are 4^7 = 16384 possible prefixes. Each of the 16,384 prefixes is shared by at least two masks, and 7,232 of them are shared by three.\n$ lexicmap utils masks --quiet -d demo.lmi/ | csvtk mutate -Ht -f 2 -p \u0026#39;^(.{7})\u0026#39; | csvtk freq -Ht -f 3 -n | csvtk freq -Ht -f 2 -k 2 9152 3 7232 ","description":"$ lexicmap utils masks -h View masks of the index or generate new masks randomly Usage: lexicmap utils masks [flags] { -d \u0026lt;index path\u0026gt; | [-k \u0026lt;k\u0026gt;] [-n \u0026lt;masks\u0026gt;] [-s \u0026lt;seed\u0026gt;] } [-o out.tsv.gz] Flags: -h, --help help for masks -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -m, --masks int ► Number of masks."},{"id":2,"href":"/LexicMap/usage/index/","title":"index","parent":"Usage","content":" Terminology differences In the LexicMap source code and command line options, the term \u0026ldquo;mask\u0026rdquo; is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use \u0026ldquo;probe\u0026rdquo; as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Usage $ lexicmap index -h Generate an index from FASTA/Q sequences Input: *1. Sequences of each reference genome should be saved in separate FASTA/Q files, with reference identifiers in the file names. 2. Input plain or gzip/xz/zstd/bzip2 compressed FASTA/Q files can be given via positional arguments or the flag -X/--infile-list with a list of input files. Flag -S/--skip-file-check is optional for skipping file checking if you trust the file list. 3. 
Input can also be a directory containing sequence files via the flag -I/--in-dir, with multiple-level sub-directories allowed. A regular expression for matching sequencing files is available via the flag -r/--file-regexp. 4. Some non-isolate assemblies might have extremely large genomes (e.g., GCA_000765055.1, \u0026gt;150 mb). The flag -g/--max-genome is used to skip these input files, and the file list would be written to a file (-G/--big-genomes). You need to increase the value for indexing fungi genomes. 5. Maximum genome size: 268,435,456. More precisely: $total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456, as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index. 6. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value). Attention: *1) ► You can rename the sequence files for convenience, e.g., GCF_000017205.1.fa.gz, because the genome identifiers in the index and search result would be: the basenames of files with common FASTA/Q file extensions removed, which are extracted via the flag -N/--ref-name-regexp. ► The extracted genome identifiers better be distinct, which will be shown in search results and are used to extract subsequences in the command \u0026#34;lexicmap utils subseq\u0026#34;. 2) ► Unwanted sequences like plasmids can be filtered out by content in FASTA/Q header via regular expressions (-B/--seq-name-filter). 3) All degenerate bases are converted to their lexicographic first bases. E.g., N is converted to A. code bases saved A A A C C C G G G T/U T T M A/C A R A/G A W A/T A S C/G C Y C/T C K G/T G V A/C/G A H A/C/T A D A/G/T A B C/G/T C N A/C/G/T A Important parameters: --- Genome data --- *1. -b/--batch-size, ► Maximum number of genomes in each batch (maximum: 131072, default: 5000). ► If the number of input files exceeds this number, input files are split into multiple batches and indexes are built for all batches. 
In the end, seed files are merged, while genome data files are kept unchanged and collected. ■ Bigger values increase indexing memory occupation and increase batch searching speed, while single query searching speed is not affected. --- LexicHash mask generation --- 0. -M/--mask-file, ► File with custom masks, which could be exported from an existing index or newly generated by \u0026#34;lexicmap utils masks\u0026#34;. This flag oversides -k/--kmer, -m/--masks, -s/--rand-seed, etc. *1. -k/--kmer, ► K-mer size (maximum: 32, default: 31). ■ Bigger values improve the search specificity and do not increase the index size. *2. -m/--masks, ► Number of LexicHash masks (default: 40000). ■ Bigger values improve the search sensitivity, increase the index size, and slow down the search speed. --- Seeds data (k-mer-value data) --- *1. --seed-max-desert ► Maximum length of distances between seeds (default: 200). The default value of 200 guarantees queries \u0026gt;=200 bp would match at least one seed. ► Large regions with no seeds are called sketching deserts. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every --seed-in-desert-dist (50 by default) bases. ■ Big values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected. 2. -c/--chunks, ► Number of seed file chunks (maximum: 128, default: #CPUs). ► Bigger values accelerate the search speed at the cost of a high disk reading load. The maximum number should not exceed the maximum number of open files set by the operating systems. *3. -J/--seed-data-threads ► Number of threads for writing seed data and merging seed chunks from all batches (maximum: -c/--chunks, default: 8). ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation. 4. 
--partitions, ► Number of partitions for indexing each seed file (default: 1024). ► Bigger values bring a little higher memory occupation. ► After indexing, \u0026#34;lexicmap utils reindex-seeds\u0026#34; can be used to reindex the seeds data with another value of this flag. 5. --max-open-files, ► Maximum number of open files (default: 512). ► It\u0026#39;s only used in merging indexes of multiple genome batches. Usage: lexicmap index [flags] [-k \u0026lt;k\u0026gt;] [-m \u0026lt;masks\u0026gt;] { -I \u0026lt;seqs dir\u0026gt; | -X \u0026lt;file list\u0026gt;} -O \u0026lt;out dir\u0026gt; Flags: -b, --batch-size int ► Maximum number of genomes in each batch (maximum value: 131072) (default 5000) -G, --big-genomes string ► Out file of skipped files with $total_bases + ($num_contigs - 1) * $contig_interval \u0026gt;= -g/--max-genome. The second column is one of the skip types: no_valid_seqs, too_large_genome, too_many_seqs. -c, --chunks int ► Number of chunks for storing seeds (k-mer-value data) files. (default 16) --contig-interval int ► Length of interval (N\u0026#39;s) between contigs in a genome. (default 1000) -r, --file-regexp string ► Regular expression for matching sequence files in -I/--in-dir, case ignored. Attention: use double quotation marks for patterns containing commas, e.g., -p \u0026#39;\u0026#34;A{2,}\u0026#34;\u0026#39;. (default \u0026#34;\\\\.(f[aq](st[aq])?|fna)(\\\\.gz|\\\\.xz|\\\\.zst|\\\\.bz2)?$\u0026#34;) --force ► Overwrite existing output directory. -h, --help help for index -I, --in-dir string ► Input directory containing FASTA/Q files. Directory and file symlinks are followed. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -M, --mask-file string ► File of custom masks. This flag oversides -k/--kmer, -m/--masks, -s/--rand-seed etc. -m, --masks int ► Number of LexicHash masks. (default 40000) -g, --max-genome int ► Maximum genome size. 
Extremely large genomes (e.g., non-isolate assemblies from Genbank) will be skipped. Need to be smaller than the maximum supported genome size: 268435456 (default 15000000) --max-open-files int ► Maximum opened files, used in merging indexes. (default 512) -l, --min-seq-len int ► Maximum sequence length to index. The value would be k for values \u0026lt;= 0 (default -1) --no-desert-filling ► Disable sketching desert filling (only for debug). -O, --out-dir string ► Output LexicMap index directory. --partitions int ► Number of partitions for indexing seeds (k-mer-value data) files. The value needs to be the power of 4. (default 1024) -s, --rand-seed int ► Rand seed for generating random masks. (default 1) -N, --ref-name-regexp string ► Regular expression (must contains \u0026#34;(\u0026#34; and \u0026#34;)\u0026#34;) for extracting the reference name from the filename. Attention: use double quotation marks for patterns containing commas, e.g., -p \u0026#39;\u0026#34;A{2,}\u0026#34;\u0026#39; (default \u0026#34;(?i)(.+)\\\\.(f[aq](st[aq])?|fna)(\\\\.gz|\\\\.xz|\\\\.zst|\\\\.bz2)?$\u0026#34;) --save-seed-pos ► Save seed positions, which can be inspected with \u0026#34;lexicmap utils seed-pos\u0026#34;. -J, --seed-data-threads int ► Number of threads for writing seed data and merging seed chunks from all batches, the value should be in range of [1, -c/--chunks] (default 8) -d, --seed-in-desert-dist int ► Distance of k-mers to fill deserts. (default 50) -D, --seed-max-desert int ► Maximum length of sketching deserts, or maximum seed distance. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every --seed-in-desert-dist bases. (default 200) -B, --seq-name-filter strings ► List of regular expressions for filtering out sequences by contents in FASTA/Q header/name, case ignored. -S, --skip-file-check ► Skip input file checking when given files or a file list. 
Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples See Building an index ","description":"Terminology differences In the LexicMap source code and command line options, the term \u0026ldquo;mask\u0026rdquo; is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use \u0026ldquo;probe\u0026rdquo; as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Usage $ lexicmap index -h Generate an index from FASTA/Q sequences Input: *1."},{"id":3,"href":"/LexicMap/tutorials/misc/index-genbank/","title":"Indexing GenBank+RefSeq","parent":"More","content":"Make sure you have enough disk space, \u0026gt;10 TB is preferred.\nTools:\nhttps://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\ntime genome_updater.sh -d \u0026quot;refseq,genbank\u0026quot; -g \u0026quot;archaea,bacteria\u0026quot; \\ -f \u0026quot;genomic.fna.gz\u0026quot; -o \u0026quot;genbank\u0026quot; -M \u0026quot;ncbi\u0026quot; -t 12 -m -L curl cd genbank/2024-02-15_11-00-51/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0026quot;*.gz\u0026quot; \\ fd \u0026quot;.gz$\u0026quot; $genomes \\ | rush --eta 'seqkit seq -w 0 {} \u0026gt; /dev/null; if [ $? 
-ne 0 ]; then echo {}; fi' \\ \u0026gt; failed.txt # empty files find $genomes -name \u0026quot;*.gz\u0026quot; -size 0 \u0026gt;\u0026gt; failed.txt # delete these files cat failed.txt | rush '/bin/rm {}' # redownload them: # run the genome_updater command again, with the flag -i Indexing. On a 48-CPU machine, time: 54 h, ram: 178 GB, index size: 4.94 TB. If you don\u0026rsquo;t have enough memory, please decrease the value of -b.\nlexicmap index \\ -I files/ \\ --ref-name-regexp '^(\\w{3}_\\d{9}\\.\\d+)' \\ -O genbank_refseq.lmi --log genbank_refseq.lmi.log \\ -b 25000 ","description":"Make sure you have enough disk space, \u0026gt;10 TB is preferred.\nTools:\nhttps://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\ntime genome_updater.sh -d \u0026quot;refseq,genbank\u0026quot; -g \u0026quot;archaea,bacteria\u0026quot; \\ -f \u0026quot;genomic.fna.gz\u0026quot; -o \u0026quot;genbank\u0026quot; -M \u0026quot;ncbi\u0026quot; -t 12 -m -L curl cd genbank/2024-02-15_11-00-51/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0026quot;*.gz\u0026quot; \\ fd \u0026quot;.gz$\u0026quot; $genomes \\ | rush --eta 'seqkit seq -w 0 {} \u0026gt; /dev/null; if [ $?"},{"id":4,"href":"/LexicMap/introduction/","title":"Introduction","parent":"","content":" LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.\nPreprint:\nWei Shen and Zamin Iqbal. (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes. bioRxiv. 
https://doi.org/10.1101/2024.08.30.610459\nTable of contents Table of contents Features Introduction Quick start Performance Indexing Searching Installation Algorithm overview Citation Support License Related projects Features LexicMap is scalable to up to millions of prokaryotic genomes. The sensitivity of LexicMap is comparable with Blastn. The alignment is fast and memory-efficient. LexicMap is easy to install: we provide binary files with no dependencies for Linux, Windows, MacOS (x86 and arm CPUs). LexicMap is easy to use (tutorials and usages). Both tabular and Blast-style output formats are available. Besides, we provide several commands to explore the index data and extract indexed subsequences. Introduction Motivation: Alignment against a database of genomes is a fundamental operation in bioinformatics, popularised by BLAST. However, given the increasing rate at which genomes are sequenced, existing tools struggle to scale.\nExisting full alignment tools face challenges of high memory consumption and slow speeds. Alignment-free large-scale sequence searching tools only return the matched genomes, without the vital positional information for downstream analysis. Prefilter+Align strategies have the sensitivity issue in the prefiltering step. Methods: (algorithm overview)\nAn improved version of the sequence sketching method LexicHash is adopted to compute alignment seeds accurately and efficiently. We solved the sketching deserts problem of LexicHash seeds to provide a window guarantee. We added the support of suffix matching of seeds, making seeds much more tolerant to mutations. Any 31-bp seed with a common ≥15 bp prefix or suffix can be matched, which means seeds are immune to any single SNP. A hierarchical index enables fast and low-memory variable-length seed matching (prefix + suffix matching). A pseudo alignment algorithm is used to find similar sequence regions from chaining results for alignment. 
A reimplemented Wavefront alignment algorithm is used for base-level alignment. Results:\nLexicMap enables efficient indexing and searching of both RefSeq+GenBank and the AllTheBacteria datasets (2.3 and 1.9 million prokaryotic assemblies respectively). Running at this scale has previously only been achieved by Phylign (previously called mof-search), which compresses genomes with phylogenetic information and provides searching (prefiltering with COBS and alignment with minimap2).\nFor searching in all 2,340,672 Genbank+Refseq prokaryotic genomes, Bastn is unable to run with this dataset on common servers as it requires \u0026gt;2000 GB RAM. (see performance).\nWith LexicMap (48 CPUs),\nQuery Genome hits Time RAM A 1.3-kb marker gene 37,164 36 s 4.1 GB A 1.5-kb 16S rRNA 1,949,496 10 m 41 s 14.1 GB A 52.8-kb plasmid 544,619 19 m 20 s 19.3 GB 1003 AMR genes 25,702,419 187 m 40 s 55.4 GB Quick start Building an index (see the tutorial of building an index).\n# From a directory with multiple genome files lexicmap index -I genomes/ -O db.lmi # From a file list with one file per line lexicmap index -X files.txt -O db.lmi Querying (see the tutorial of searching).\n# For short queries like genes or long reads, returning top N hits. lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 # For longer queries like plasmids, returning all hits. lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Sample output (queries are a few Nanopore Q20 reads). 
See output format details.\nquery qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ------------------ ---- ---- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- ------- ERR5396170.1000016 740 1 GCF_013394085.1 NZ_CP040910.1 89.595 1 89.595 663 99.246 0 71 733 13515 14177 + 1887974 ERR5396170.1000000 698 1 GCF_001457615.1 NZ_LN831024.1 85.673 1 85.673 603 98.010 5 53 650 4452083 4452685 + 6316979 ERR5396170.1000017 516 1 GCF_013394085.1 NZ_CP040910.1 94.574 1 94.574 489 99.591 2 27 514 293509 293996 + 1887974 ERR5396170.1000012 848 1 GCF_013394085.1 NZ_CP040910.1 95.165 1 95.165 811 97.411 7 22 828 190329 191136 - 1887974 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 1 60.000 973 95.889 13 365 1333 88793 89756 - 2884551 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 2 4.706 76 98.684 0 266 341 89817 89892 - 2884551 ERR5396170.1000036 1159 1 GCF_013394085.1 NZ_CP040910.1 95.427 1 95.427 1107 99.729 1 32 1137 1400097 1401203 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 1 86.486 707 99.151 3 104 807 242235 242941 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 2 86.486 707 98.444 3 104 807 1138777 1139483 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 3 84.152 688 98.983 4 104 788 154620 155306 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 4 84.029 687 99.127 3 104 787 32477 33163 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 5 72.727 595 98.992 3 104 695 1280183 1280777 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 6 11.671 95 100.000 0 693 787 1282480 1282574 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 7 82.064 671 99.106 3 120 787 1768782 1769452 + 1887974 CIGAR string, aligned query and subject sequences can be outputted as extra columns via the flag 
-a/--all.\n# Extracting similar sequences for a query gene. # search matches with query coverage \u0026gt;= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta -o results.tsv \\ --min-qcov-per-hsp 90 --all # extract matched sequences as FASTA format sed 1d results.tsv | awk -F'\\t' '{print \u0026quot;\u0026gt;\u0026quot;$5\u0026quot;:\u0026quot;$14\u0026quot;-\u0026quot;$15\u0026quot;:\u0026quot;$16\u0026quot;\\n\u0026quot;$20;}' \\ | seqkit seq -g \u0026gt; results.fasta seqkit head -n 1 results.fasta | head -n 3 \u0026gt;NZ_JALSCK010000007.1:39224-40522:- TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT Export blast-style format:\nseqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 1 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast --kv-file-genome ass2species.map Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Staphylococcus epidermidis Query coverage per genome = 92.575% \u0026gt;NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| ||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677 Learn more tutorials and usages.\nPerformance Indexing dataset genomes gzip_size tool db_size time RAM GTDB complete 402,538 443 GB LexicMap 973 GB 10 h 36 m 63.3 GB Blastn 387 GB 3 h 11 m 718 MB AllTheBacteria HQ 1,858,610 2.5 TB LexicMap 4.26 TB 48 h 08 m 88.6 GB Blastn 1.93 TB 14 h 03 m 2.9 GB Phylign 248 GB / / Genbank+RefSeq 2,340,672 2.7 TB LexicMap 5.43 TB 54 h 33 m 178.3 GB Blastn 2.37 TB 14 h 04 m 4.3 GB Notes:\nAll files are stored on a server with HDD disks. No files are cached in memory. Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz). LexicMap index building parameters: -k 31 -m 40000. Genome batch size: -b 5000 for GTDB datasets, -b 25000 for others. Searching Blastn failed to run as it requires \u0026gt;2000GB RAM for Genbank+RefSeq and AllTheBacteria datasets. 
Phylign only has the index for AllTheBacteria HQ dataset.\nGTDB complete (402,538 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 5,170 5,143 17 s 1.4 GB Blastn 7,121 6,177 2,171 s 351.2 GB a 16S rRNA gene 1,542 bp LexicMap 303,925 278,141 235 s 4.4 GB Blastn 301,197 277,042 2,353 s 378.4 GB a plasmid 52,830 bp LexicMap 63,108 1,190 499 s 4.6 GB Blastn 69,311 2,308 2,262 s 364.7 GB 1033 AMR genes 1 kb (median) LexicMap 3,867,003 2,228,339 4,350 s 16.3 GB Blastn 5,357,772 2,240,766 4,686 s 442.1 GB AllTheBacteria HQ (1,858,610 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 27,963 27,953 31 s 3.4 GB Phylign_local 7,936 30 m 48 s 77.6 GB Phylign_cluster 7,936 28 m 33 s a 16S rRNA gene 1,542 bp LexicMap 1,857,761 1,740,000 9 m 36 s 14.9 GB Phylign_local 1,017,765 130 m 33 s 77.0 GB Phylign_cluster 1,017,765 86 m 41 s a plasmid 52,830 bp LexicMap 468,821 3,618 15 m 55 s 15.7 GB Phylign_local 46,822 47 m 33 s 82.6 GB Phylign_cluster 46,822 39 m 34 s 1033 AMR genes 1 kb (median) LexicMap 21,288,000 12,148,642 138 m 55 s 49.9 GB Phylign_local 1,135,215 156 m 08 s 85.9 GB Phylign_cluster 1,135,215 133 m 49 s Genbank+RefSeq (2,340,672 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 37,164 37,082 36 s 4.1 GB a 16S rRNA gene 1,542 bp LexicMap 1,949,496 1,381,974 10 m 41 s 14.1 GB a plasmid 52,830 bp LexicMap 544,619 6,563 19 m 20 s 19.3 GB 1033 AMR genes 1 kb (median) LexicMap 25,702,419 14,692,624 187 m 40 s 55.4 GB Notes:\nAll files are stored on a server with HDD disks. No files are cached in memory. Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz). Main searching parameters: LexicMap v0.4.0: --threads 48 --top-n-genomes 0 --min-qcov-per-genome 0 --min-qcov-per-hsp 0 --min-match-pident 70. 
Blastn v2.15.0+: -num_threads 48 -max_target_seqs 10000000. Phylign (AllTheBacteria fork 9fc65e6): threads: 48, cobs_kmer_thres: 0.33, minimap_preset: \u0026quot;asm20\u0026quot;, nb_best_hits: 5000000, max_ram_gb: 100; For cluster, maximum number of slurm jobs is 100. Installation LexicMap is implemented in the Go programming language; executable binary files for most popular operating systems are freely available on the release page.\nOr install with conda:\nconda install -c bioconda lexicmap Algorithm overview Citation Wei Shen and Zamin Iqbal. (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes. bioRxiv. https://doi.org/10.1101/2024.08.30.610459\nSupport Please open an issue to report bugs, propose new functions or ask for help.\nLicense MIT License\nRelated projects High-performance LexicHash computation in Go. Wavefront alignment algorithm (WFA) in Golang. ","description":"LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.\nPreprint:\nWei Shen and Zamin Iqbal. (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes. bioRxiv. https://doi.org/10.1101/2024.08.30.610459\nTable of contents Table of contents Features Introduction Quick start Performance Indexing Searching Installation Algorithm overview Citation Support License Related projects Features LexicMap is scalable to up to millions of prokaryotic genomes."},{"id":5,"href":"/LexicMap/usage/utils/kmers/","title":"kmers","parent":"utils","content":"$ lexicmap utils kmers -h View k-mers captured by the masks Attention: 1. Mask index (column mask) is 1-based. 2. Prefix means the length of shared prefix between a k-mer and the mask. 3. K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. 4. 
Reversed means if the k-mer is reversed for suffix matching. Usage: lexicmap utils kmers [flags] -d \u0026lt;index path\u0026gt; [-m \u0026lt;mask index\u0026gt;] [-o out.tsv.gz] Flags: -h, --help help for kmers -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -m, --mask int ► View k-mers captured by Xth mask. (0 for all) (default 1) -f, --only-forward ► Only output forward k-mers. -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples The default output is captured k-mers of the first mask.\n$ lexicmap utils kmers --quiet -d demo.lmi/ | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ---- ------------------------------- ------ ------ --------------- ------- ------ -------- 1 AAAAAAAAAAAAAAAGACAACAAAAGACATA 8 1 GCF_002950215.1 3870418 - yes 1 AAAAAAAAACAAACATTTGCGGCGGGGCCAT 8 1 GCF_000742135.1 2043044 + no 1 AAAAAAAAACCAGAAATCACACGCCAACTCC 8 1 GCF_002949675.1 1345415 + yes 1 AAAAAAAAACGATTATCCTCAATTAATTTCT 8 1 GCF_000392875.1 814251 + no 1 AAAAAAAAACGCTTCTACATCGAGCAGCGAG 8 1 GCF_001457655.1 941619 + yes 1 AAAAAAAAACGCTTTGTAACTCGATTGATAG 8 1 GCF_009759685.1 997945 + yes 1 AAAAAAAAACTGCTGTCCCTGGTCCGTCAGG 8 1 GCF_002950215.1 4262890 - yes 1 AAAAAAAAAGATTTGATTTTTTTCATTAATA 8 1 GCF_000392875.1 766998 - yes 1 AAAAAAAAAGCATTTTTTCGATCTCTTTACG 8 1 GCF_000392875.1 1623731 + yes 1 AAAAAAAAAGTTTCCGGGACACTACCTAACC 8 1 GCF_000017205.1 5804200 - yes 1 AAAAAAAAATTATTTTGCTAATCAATAGGTC 8 1 GCF_000006945.2 4886411 - yes 1 AAAAAAAACAAAGAATTATTACACAACATTC 
8 1 GCF_003697165.2 4055655 + yes 1 AAAAAAAACACGGACTTATTGAAATCGTATT 8 1 GCF_000392875.1 746746 + yes 1 AAAAAAAACCAACTTTGAAAAAAGTAATGTA 8 1 GCF_000148585.2 917529 - yes 1 AAAAAAAACCATATTATGTCCGATCCTCACA 8 1 GCF_000392875.1 1060650 + yes 1 AAAAAAAACCCGCCGAAGCGGGTTTTTTTAT 8 1 GCF_000742135.1 1612499 + no 1 AAAAAAAACCTAATGGTAAATAACGTTTTGG 8 1 GCF_006742205.1 2346818 + yes 1 AAAAAAAACGAAAAACGGTAACACGGGAATT 8 1 GCF_001544255.1 1605298 + yes 1 AAAAAAAACGACTCCAGAGAGATCATCGTAT 8 1 GCF_000392875.1 1279686 + yes Only forward k-mers.\n$ lexicmap utils kmers --quiet -d demo.lmi/ -f | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ---- ------------------------------- ------ ------ --------------- ------- ------ -------- 1 AAAAAAAAACAAACATTTGCGGCGGGGCCAT 8 1 GCF_000742135.1 2043044 + no 1 AAAAAAAAACGATTATCCTCAATTAATTTCT 8 1 GCF_000392875.1 814251 + no 1 AAAAAAAACCCGCCGAAGCGGGTTTTTTTAT 8 1 GCF_000742135.1 1612499 + no 1 AAAAAAAACGGTTCAGCTGACCAGCCAGCTG 8 1 GCF_002950215.1 401140 + no 1 AAAAAAAAGAACAAATTCGAGGAAAAAGAAG 9 1 GCF_001027105.1 1268573 + no 1 AAAAAAAAGATATTGAAGTTAAAGTAATTTG 9 1 GCF_000742135.1 3038258 + no 1 AAAAAAAAGCCCACGAACCGGGGGCAATATC 9 1 GCF_002950215.1 3578394 + no 1 AAAAAAAAGCCCCGCCGAAGCGGGGCTTTTT 9 1 GCF_000017205.1 5110420 + no 1 AAAAAAAAGGATTATAACAAAATTTTGTCAT 9 1 GCF_001544255.1 426716 + no 1 AAAAAAAAGGCTTTACGGATGATCCGATGGA 9 1 GCF_009759685.1 3033057 + no 1 AAAAAAAAGTAATTGCAGCTATTATTGGGAC 10 1 GCF_001027105.1 437272 + no 1 AAAAAAAAGTATTAAGCAACTGACTAAAAGT 10 1 GCF_006742205.1 1841209 + no 1 AAAAAAAAGTCACAATTATTGGTGCCGGTTT 13 1 GCF_000392875.1 1508457 - no 1 AAAAAAAAGTCATCAAGGATTATTTGAGTTA 12 1 GCF_001457655.1 1847867 + no 1 AAAAAAAAGTCATCGCTTTATCTGTCAGTAT 12 1 GCF_001544255.1 156689 - no 1 AAAAAAAAGTCATCTTCGGATGGCTTTTTTA 12 1 GCF_000148585.2 1363150 - no 1 AAAAAAAAGTCCATCCTGCAGCATAAAATAA 11 1 GCF_000742135.1 4671015 + no 1 AAAAAAAAGTCCCTGCTGTTTGCCCAGTCCT 11 1 GCF_000006945.2 3796 - no 1 AAAAAAAAGTCCGCTGATAAGGCTTGAAAAG 11 3 
GCF_002949675.1 2356807 + no Specify the mask.\n$ lexicmap utils kmers --quiet -d demo.lmi/ --mask 12345 | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ----- ------------------------------- ------ ------ --------------- ------- ------ -------- 12345 CATTAGTAAAAACCAACTTAGTTACGACACG 8 1 GCF_001027105.1 1823411 + no 12345 CATTAGTAAAACATTTTGAACCTGTGATTGA 8 1 GCF_006742205.1 1192019 + no 12345 CATTAGTAAAAGTCGTTTGGTAAAGCGATTA 8 1 GCF_001027105.1 1334989 + yes 12345 CATTAGTAAACGTACAAAACTATTGGTTAGA 8 1 GCF_001027105.1 2037559 + yes 12345 CATTAGTAAATCCAGGAATCCTAACCGACGA 8 1 GCF_001027105.1 963152 + yes 12345 CATTAGTAACGCGTACGAAACCGTAGTAAGT 8 1 GCF_001027105.1 1958187 + yes 12345 CATTAGTAAGTTGTCGGTCTAACGCGGATTA 8 1 GCF_002950215.1 2882180 + yes 12345 CATTAGTACATTCAAGTATTATTCATTAAAC 8 1 GCF_009759685.1 665376 + yes 12345 CATTAGTACCGATAGGACATCATGAACACAA 8 1 GCF_002950215.1 4677222 + yes 12345 CATTAGTACCTTCATCGCTATCCCATTAGGC 8 1 GCF_000006945.2 92542 + yes 12345 CATTAGTACGTGTCCCGCAAAGAGAAAGAAC 8 1 GCF_000006945.2 3412102 + yes 12345 CATTAGTAGAAAAATACAAAGGCATTTATGA 11 1 GCF_900638025.1 665985 - no 12345 CATTAGTAGAAAATTGATAATCTAAGAGTTC 11 1 GCF_002950215.1 2940281 + no 12345 CATTAGTAGAAATGGGCAAAGAATAGGAAAA 11 1 GCF_000148585.2 81286 + no 12345 CATTAGTAGAAGAAATTGCAGCAAGTATTAA 14 1 GCF_001027105.1 621160 + no 12345 CATTAGTAGAAGAACTGAAGTTAGTGCCTAT 14 1 GCF_001096185.1 2113047 + no 12345 CATTAGTAGAAGAAGACCAAGCACGACGCAT 15 1 GCF_000392875.1 891723 + no 12345 CATTAGTAGAAGAGTTGTTCGTCAGTTACGG 13 1 GCF_001544255.1 831068 - no 12345 CATTAGTAGAAGATTTAGTGGCAAGCTCAAT 13 1 GCF_001457655.1 1280653 + no \u0026ldquo;reversed\u0026rdquo; means means if the k-mer is reversed for suffix matching. 
E.g., CATTAGTAAAAGTCGTTTGGTAAAGCGATTA is reversed, so you need to reverse it before searching in the genome.\n$ seqkit locate -p $(echo CATTAGTAAAAGTCGTTTGGTAAAGCGATTA | rev) refs/GCF_001027105.1.fa.gz -M | csvtk pretty -t seqID patternName pattern strand start end ------------- ------------------------------- ------------------------------- ------ ------- ------- NZ_CP011526.1 ATTAGCGAAATGGTTTGCTGAAAATGATTAC ATTAGCGAAATGGTTTGCTGAAAATGATTAC + 1334989 1335019 For all masks. The result might be very big, therefore, writing to gzip format is recommended.\n$ lexicmap utils kmers -d demo.lmi/ --mask 0 -o kmers.tsv.gz $ zcat kmers.tsv.gz | csvtk freq -t -f mask -nr | head -n 10 mask frequency 24088 322 15814 295 13923 293 27102 291 13922 282 15967 281 10001 280 15986 272 16440 269 a faster way\nseq 1 $(lexicmap utils masks -d demo.lmi/ --quiet | wc -l) \\ | rush --eta 'echo -e {}\u0026quot;\\t\u0026quot;$(lexicmap utils kmers -d demo.lmi/ -m {} -f --quiet | csvtk nrow)' \\ | csvtk add-header -t -n mask,seeds \\ | csvtk sort -t -k seeds:nr \\ | head -n 10 Lengths of shared prefixes between probes and captured k-mers.\nzcat kmers.tsv.gz \\ | csvtk grep -t -f reversed -p no \\ | csvtk plot hist -t -f prefix -o prefix.hist.png \\ --xlab \u0026quot;length of common prefixes between captured k-mers and masks\u0026quot; The output (TSV format) is formatted with csvtk pretty.\n","description":"$ lexicmap utils kmers -h View k-mers captured by the masks Attention: 1. Mask index (column mask) is 1-based. 2. Prefix means the length of shared prefix between a k-mer and the mask. 3. K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. 4. Reversed means if the k-mer is reversed for suffix matching. 
Usage: lexicmap utils kmers [flags] -d \u0026lt;index path\u0026gt; [-m \u0026lt;mask index\u0026gt;] [-o out."},{"id":6,"href":"/LexicMap/tutorials/search/","title":"Step 2. Searching","parent":"Tutorials","content":" Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length\nLexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned. Input should be (gzipped) FASTA or FASTQ records from files or STDIN.\nHardware requirements See benchmark of index building.\nLexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.\nCPU: No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported. More is better as LexicMap is a CPU-intensive software. It uses all CPUs by default (-j/--threads). RAM More RAM (\u0026gt; 16 GB) is preferred. The memory usage in searching is mainly related to: The number of matched genomes and sequences. The length of query sequences. Similarities between query and target sequences. The number of threads. It uses all CPUs by default (-j/--threads). Disk Sufficient space is required to store the index size. No temporary files are generated during searching. Algorithm Click to show details. ... 
Masking: Query sequence is masked by the masks of the index. In other words, each mask captures the most similar k-mer which shares the longest prefix with the mask, and stores its position and strand information. Seeding: For each mask, the captured k-mer is used to search seeds (captured k-mers in reference genomes) sharing prefixes or suffixes of at least p bases. Prefix matching Setting the search range: Since the seeded k-mers are stored in lexicographic order, the k-mer matching turns into a range query. For example, for a query CATGCT, requiring a matching prefix of at least 4 bp is equivalent to extracting k-mers ranging from CATGAA, CATGAC, CATGAG, \u0026hellip;, to CATGTT. Retrieving search start point: The index file of each seed data file stores some k-mers\u0026rsquo; offsets in the data file, and the index is loaded in RAM. Retrieving seed data: Seed k-mers are read from the file and checked one by one, and k-mers in the search range are returned, along with the k-mer information (genome batch, genome number, location, and strand). Suffix matching Reversing the query k-mer and performing prefix matching, returning seeds of reversed k-mers (see indexing algorithm). Chaining: Seeding results, i.e., anchors (matched k-mers from the query and subject sequence), are summarized by genome, and deduplicated. Performing chaining (see the paper). Alignment for each chain. Extending the anchor region, for extracting sequences from the query and reference genome. For example, extending 1 kb upstream and downstream of the anchor region. Performing pseudo-alignment with extended query and subject sequences, to find similar regions. For similar regions that span more than one reference sequence, splitting them into multiple ones. Fast alignment of query and subject sequence regions with our implementation of the Wavefront alignment algorithm. Filtering alignments based on user options. 
Parameters Flags in bold text are important and frequently used.\nGeneral Flag Value Function Comment -w/--load-whole-seeds Load the whole seed data into memory for faster search Use this if the index is not big and many queries are needed to search. -n/--top-n-genomes Default 0, 0 for all Keep top N genome matches for a query in the chaining phase Value 1 is not recommended as the best chaining result does not always bring the best alignment, so it better be \u0026gt;= 5. The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step. -a/--all Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026ldquo;lexicmap utils 2blast\u0026rdquo; -J/\u0026ndash;max-query-conc Default 12, 0 for all Maximum number of concurrent queries Bigger values do not improve the batch searching speed and consume much memory. Chaining Flag Value Function Comment -p, --seed-min-prefix Default 15 Minimum (prefix) length of matched seeds. Smaller values produce more results at the cost of slow speed. -P, --seed-min-single-prefix Default 17 Minimum (prefix) length of matched seeds if there\u0026rsquo;s only one pair of seeds matched. Smaller values produce more results at the cost of slow speed. --seed-max-dist Default 1000 Max distance between seeds in seed chaining. It should be \u0026lt;= contig interval length in database. --seed-max-gap Default 200 Max gap in seed chaining. Alignment Flag Value Function Comment -Q/--min-qcov-per-genome Default 0 Minimum query coverage (percentage) per genome. -q/--min-qcov-per-hsp Default 0 Minimum query coverage (percentage) per HSP. -l/--align-min-match-len Default 50 Minimum aligned length in a HSP segment. -i/--align-min-match-pident Default 70 Minimum base identity (percentage) in a HSP segment. --align-band Default 50 Band size in backtracking the score matrix. 
--align-ext-len Default 1000 Extend length of upstream and downstream of seed regions, for extracting query and target sequences for alignment. It should be \u0026lt;= contig interval length in database. --align-max-gap Default 20 Maximum gap in a HSP segment. Improving searching speed Here are some tips to improve the search speed.\nIncreasing the concurrency number Increasing the value of --max-open-files (default 512). You might need to change the open files limit. (If you have many queries) Increase the value of -J/--max-query-conc (default 12), which will increase memory usage. Loading the entire seed data into memory (It\u0026rsquo;s unnecessary if the index is stored in SSD) Setting -w/--load-whole-seeds to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters. Returning fewer results Setting -n/--top-n-genomes to keep top N genome matches for a query (0 for all) in chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time. Steps For short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-match-pident 70 \\ --min-qcov-per-hsp 70 \\ --min-qcov-per-genome 70 \\ --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-match-pident 70 \\ --min-qcov-per-hsp 0 \\ --min-qcov-per-genome 0 \\ --top-n-genomes 0 Click to show the log of a demo run. ... $ lexicmap search -d demo.lmi/ q.gene.fasta -o q.gene.fasta.lexicmap.tsv 09:32:55.551 [INFO] LexicMap v0.4.0 09:32:55.551 [INFO] https://github.com/shenwei356/LexicMap 09:32:55.551 [INFO] 09:32:55.551 [INFO] checking input files ... 09:32:55.551 [INFO] 1 input file(s) given 09:32:55.551 [INFO] 09:32:55.551 [INFO] loading index: demo.lmi/ 09:32:55.551 [INFO] reading masks... 
09:32:55.552 [INFO] reading indexes of seeds (k-mer-value) data... 09:32:55.555 [INFO] creating genome reader pools, each batch with 16 readers... 09:32:55.555 [INFO] index loaded in 4.192051ms 09:32:55.555 [INFO] 09:32:55.555 [INFO] searching ... 09:32:55.596 [INFO] 09:32:55.596 [INFO] processed queries: 1, speed: 1467.452 queries per minute 09:32:55.596 [INFO] 100.0000% (1/1) queries matched 09:32:55.596 [INFO] done searching 09:32:55.596 [INFO] search results saved to: q.gene.fasta.lexicmap.tsv 09:32:55.596 [INFO] 09:32:55.596 [INFO] elapsed time: 45.230604ms 09:32:55.596 [INFO] Extracting similar sequences for a query gene.\n# search matches with query coverage \u0026gt;= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta --min-qcov-per-hsp 90 --all -o results.tsv # extract matched sequences as FASTA format sed 1d results.tsv | awk -F\u0026#39;\\t\u0026#39; \u0026#39;{print \u0026#34;\u0026gt;\u0026#34;$5\u0026#34;:\u0026#34;$14\u0026#34;-\u0026#34;$15\u0026#34;:\u0026#34;$16\u0026#34;\\n\u0026#34;$20;}\u0026#39; | seqkit seq -g \u0026gt; results.fasta seqkit head -n 1 results.fasta | head -n 3 \u0026gt;NZ_JALSCK010000007.1:39224-40522:- TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT Exporting blast-like alignment text.\nFrom file:\nlexicmap utils 2blast results.tsv -o results.txt Add genome annotation\nlexicmap utils 2blast results.tsv -o results.txt --kv-file-genome ass2species.map From stdin:\n# align only one long-read \u0026lt;= 500 bp $ seqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 1 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast --kv-file-genome ass2species.map Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Staphylococcus epidermidis Query coverage per genome = 92.575% \u0026gt;NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps 
= 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| ||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677 Output Alignment result relationship Query ├── Subject genome # A query might have one or more genome hits, ├── Subject sequence # in different sequences. ├── High-Scoring segment Pair (HSP) # HSP is an alignment segment. Here, the defination of HSP is similar with that in BLAST. 
Actually there are small gaps in HSPs.\nA High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the highest alignment scores in a given search. https://www.ncbi.nlm.nih.gov/books/NBK62051/\nOutput format Tab-delimited format with 17+ columns, with 1-based positions.\n1. query, Query sequence ID. 2. qlen, Query sequence length. 3. hits, Number of subject genomes. 4. sgenome, Subject genome ID. 5. sseqid, Subject sequence ID. 6. qcovGnm, Query coverage (percentage) per genome: $(aligned bases in the genome)/$qlen. 7. hsp, Nth HSP in the genome. (just for improving readability) 8. qcovHSP, Query coverage (percentage) per HSP: $(aligned bases in a HSP)/$qlen. 9. alenHSP, Aligned length in the current HSP. 10. pident, Percentage of identical matches in the current HSP. 11. gaps, Gaps in the current HSP. 12. qstart, Start of alignment in query sequence. 13. qend, End of alignment in query sequence. 14. sstart, Start of alignment in subject sequence. 15. send, End of alignment in subject sequence. 16. sstr, Subject strand. 17. slen, Subject sequence length. 18. cigar, CIGAR string of the alignment. (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026quot;|\u0026quot; and \u0026quot; \u0026quot;) between qseq and sseq. 
(optional with -a/--all) Examples A single-copy gene (SecY) query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ---------------------------------------- ---- ---- --------------- -------------------- ------- --- ------- ------- ------- ---- ------ ---- ------ ------ ---- ------- lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_000395405.1 NZ_KB947497.1 100.000 1 100.000 1299 100.000 0 1 1299 232279 233577 + 274511 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_019731615.1 NZ_JAASJA010000010.1 100.000 1 100.000 1299 100.000 0 1 1299 2798 4096 + 42998 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCA_004103085.1 RPCL01000012.1 100.000 1 100.000 1299 100.000 0 1 1299 44095 45393 + 84242 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_023571745.1 NZ_JAMKBS010000014.1 100.000 1 100.000 1299 100.000 0 1 1299 44077 45375 + 84206 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_013248625.1 NZ_JABTDK010000002.1 100.000 1 100.000 1299 100.000 0 1 1299 9609 10907 + 49787 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_900092155.1 NZ_FLUS01000006.1 100.000 1 100.000 1299 100.000 0 1 1299 63161 64459 + 77366 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_902165815.1 NZ_CABHHZ010000005.1 100.000 1 100.000 1299 100.000 0 1 1299 39386 40684 - 200163 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_014243495.1 NZ_SJAV01000002.1 100.000 1 100.000 1299 100.000 0 1 1299 39085 40383 - 256772 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_900148695.1 NZ_FRXS01000009.1 100.000 1 100.000 1299 100.000 0 1 1299 39230 40528 - 96692 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_902164645.1 NZ_LR607334.1 100.000 1 100.000 1299 100.000 0 1 1299 236677 237975 + 3380663 A 16S rRNA gene query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen --------------------------- ---- ------ --------------- ----------------- ------- --- 
------- ------- ------- ---- ------ ---- ------- ------- ---- ------- NC_000913.3:4166659-4168200 1542 293398 GCF_002248685.1 NZ_NQBE01000079.1 100.000 1 100.000 1542 100.000 0 1 1542 40 1581 - 99259 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 1 100.000 1542 100.000 0 1 1542 1270211 1271752 + 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 2 100.000 1542 100.000 0 1 1542 5466287 5467828 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 3 100.000 1543 99.546 2 1 1542 557008 558549 + 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 4 100.000 1543 99.482 2 1 1542 4473658 4475199 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 5 100.000 1543 99.482 2 1 1542 5154150 5155691 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 6 100.000 1543 99.482 2 1 1542 5195176 5196717 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 7 100.000 1543 99.482 2 1 1542 5369865 5371406 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_000460355.1 NZ_KE701684.1 100.000 1 100.000 1542 100.000 0 1 1542 1108651 1110192 - 1914390 NC_000913.3:4166659-4168200 1542 293398 GCF_000460355.1 NZ_KE701686.1 100.000 2 100.000 1542 99.741 0 1 1542 100680 102221 + 102235 A plasmid query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ---------- ----- ----- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ----- ------- ------- ---- ------- CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 1 75.792 40041 99.995 0 12069 52109 11439 51479 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 2 20.316 10733 100.000 0 1 10733 722 11454 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 3 1.365 721 100.000 0 52110 52830 1 721 + 51479 
CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086535.1 97.473 4 0.916 484 91.116 0 51686 52169 27192 27675 - 34058 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086535.1 97.473 5 0.829 438 90.868 1 52342 52779 26583 27019 - 34058 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 6 1.552 820 100.000 0 9049 9868 23092 23911 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086534.1 97.473 7 0.502 265 100.000 0 19788 20052 29842 30106 + 47185 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 8 0.159 84 97.619 0 8348 8431 19574 19657 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 1 75.792 40041 99.995 0 12069 52109 11439 51479 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 2 20.316 10733 100.000 0 1 10733 722 11454 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 3 1.365 721 100.000 0 52110 52830 1 721 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086547.1 97.473 4 0.916 484 91.116 0 51686 52169 3843 4326 + 34058 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086547.1 97.473 5 0.829 438 90.868 1 52342 52779 4499 4935 + 34058 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 6 1.552 820 100.000 0 9049 9868 23092 23911 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086546.1 97.473 7 0.502 265 100.000 0 19788 20052 29842 30106 + 47185 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 8 0.159 84 97.619 0 8348 8431 19574 19657 + 51479 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 1 77.157 40762 99.993 0 12069 52830 9513 50274 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 2 18.033 9528 99.990 1 1207 10733 1 9528 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 3 2.283 1206 100.000 0 1 1206 50275 51480 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058618.1 97.473 4 2.497 1319 100.000 0 25153 26471 3019498 3020816 - 4718403 Long reads Queries are a few Nanopore Q20 reads from a mock metagenomic community.\nquery 
qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ------------------ ---- ---- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- ------- ERR5396170.1000016 740 1 GCF_013394085.1 NZ_CP040910.1 89.595 1 89.595 663 99.246 0 71 733 13515 14177 + 1887974 ERR5396170.1000000 698 1 GCF_001457615.1 NZ_LN831024.1 85.673 1 85.673 603 98.010 5 53 650 4452083 4452685 + 6316979 ERR5396170.1000017 516 1 GCF_013394085.1 NZ_CP040910.1 94.574 1 94.574 489 99.591 2 27 514 293509 293996 + 1887974 ERR5396170.1000012 848 1 GCF_013394085.1 NZ_CP040910.1 95.165 1 95.165 811 97.411 7 22 828 190329 191136 - 1887974 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 1 60.000 973 95.889 13 365 1333 88793 89756 - 2884551 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 2 4.706 76 98.684 0 266 341 89817 89892 - 2884551 ERR5396170.1000036 1159 1 GCF_013394085.1 NZ_CP040910.1 95.427 1 95.427 1107 99.729 1 32 1137 1400097 1401203 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 1 86.486 707 99.151 3 104 807 242235 242941 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 2 86.486 707 98.444 3 104 807 1138777 1139483 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 3 84.152 688 98.983 4 104 788 154620 155306 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 4 84.029 687 99.127 3 104 787 32477 33163 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 5 72.727 595 98.992 3 104 695 1280183 1280777 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 6 11.671 95 100.000 0 693 787 1282480 1282574 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 7 82.064 671 99.106 3 120 787 1768782 1769452 + 1887974 Search results (TSV format) above are formatted with csvtk pretty.\nSummarizing results If you would like to summarize alignment results, 
e.g., the number of species, here\u0026rsquo;s the method.\nPrepare a two-column tab-delimited file for mapping reference (genome) or sequence IDs to any information (such as species name).\n# for GTDB/GenBank/RefSeq genomes downloaded with genome_updater cut -f 1,8 assembly_summary.txt \u0026gt; ass2species.tsv head -n 3 ass2species.tsv GCF_002287175.1 Methanobacterium bryantii GCF_000762265.1 Methanobacterium formicicum GCF_029601605.1 Methanobacterium formicicum Add information to the alignment result with csvtk or other tools.\n# add species cat b.gene_E_coli_16S.fasta.lexicmap.tsv \\ | csvtk mutate -t --after slen -n species -f sgenome \\ | csvtk replace -t -f species -p \u0026quot;(.+)\u0026quot; -r \u0026quot;{kv}\u0026quot; -k ass2species.tsv \\ \u0026gt; result.with_species.tsv # filter result with query coverage \u0026gt;= 80 and count the species cat result.with_species.tsv \\ | csvtk uniq -t -f sgenome \\ | csvtk filter2 -t -f \u0026quot;\\$qcovHSP \u0026gt;= 80\u0026quot; \\ | csvtk freq -t -f species -nr \\ \u0026gt; result.with_species.tsv.stats.tsv csvtk head -t -n 5 result.with_species.tsv.stats.tsv \\ | csvtk pretty -t species frequency ------------------------ --------- Salmonella enterica 135065 Escherichia coli 128071 Streptococcus pneumoniae 51971 Staphylococcus aureus 44215 Pseudomonas aeruginosa 34254 ","description":"Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query."},{"id":7,"href":"/LexicMap/tutorials/misc/index-allthebacteria/","title":"Indexing 
AllTheBacteria","parent":"More","content":"Make sure you have enough disk space, at least 8 TB, \u0026gt;10 TB is preferred.\nTools:\nhttps://github.com/shenwei356/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.\nDownloading the list file of all assemblies in the latest version (v0.2 plus incremental versions).\nmkdir -p atb; cd atb; # attention: the URL might change; please check it in the browser. wget https://osf.io/download/4yv85/ -O file_list.all.latest.tsv.gz If you only need to add assemblies from an incremental version, please manually download the file list from the path AllTheBacteria/Assembly/OSF Storage/File_lists.\nDownloading assembly tarball files.\n# tarball file names and their URLs zcat file_list.all.latest.tsv.gz | awk 'NR\u0026gt;1 {print $3\u0026quot;\\t\u0026quot;$4}' | uniq \u0026gt; tar2url.tsv # download cat tar2url.tsv | rush --eta -j 2 -c -C download.rush 'wget -O {1} {2}' Decompressing all tarballs. The decompressed genomes are stored in plain text, so we use gzip (can be replaced with the faster pigz) to compress them to save disk space.\n# {^.tar.xz} is for removing the suffix \u0026quot;.tar.xz\u0026quot; ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^.tar.xz}/*.fa' cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz/xz/zstd-compressed) genome files. 
You can also give a file list with selected assemblies.\n$ tree atb | more atb ├── atb.assembly.r0.2.batch.1 │ ├── SAMD00013333.fa.gz │ ├── SAMD00049594.fa.gz │ ├── SAMD00195911.fa.gz │ ├── SAMD00195914.fa.gz Prepare a file list of assemblies.\nJust use find or fd (much faster).\n# find find atb/ -name \u0026quot;*.fa.gz\u0026quot; \u0026gt; files.txt # fd fd .fa.gz$ atb/ \u0026gt; files.txt What it looks like:\n$ head -n 2 files.txt atb/atb.assembly.r0.2.batch.1/SAMD00013333.fa.gz atb/atb.assembly.r0.2.batch.1/SAMD00049594.fa.gz (Optional) Only keep high-quality assemblies. Please manually download the hq_set.sample_list.txt.gz file from this path, e.g., AllTheBacteria/Metadata/OSF Storage/Aggregated/Latest_2024-08/ (choose the latest date).\nfind atb/ -name \u0026quot;*.fa.gz\u0026quot; | grep -w -f \u0026lt;(zcat hq_set.sample_list.txt.gz) \u0026gt; files.txt Creating a LexicMap index. (more details: https://bioinf.shenwei.me/LexicMap/tutorials/index/)\nlexicmap index -S -X files.txt -O atb.lmi -b 25000 --log atb.lmi.log Steps for v0.2 hosted at EBI ftp Downloading assembly tarballs here (except those starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/\nmkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf.shenwei.me/LexicMap/AllTheBacteria-v0.2.url.txt # download # rush is used: https://github.com/shenwei356/rush # The download.rush file stores finished jobs, which will be skipped in a second run for resuming jobs. cat AllTheBacteria-v0.2.url.txt | rush --eta -j 2 -c -C download.rush 'wget {}' # list of high-quality samples wget https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/metadata/hq_set.sample_list.txt.gz Decompressing all tarballs. 
The decompressed genomes are stored in plain text, so we use gzip (can be replaced with faster pigz ) to compress them to save disk space.\n# {^asm.tar.xz} is for removing the suffix \u0026quot;asm.tar.xz\u0026quot; ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^asm.tar.xz}/*.fa' cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz/xz/zstd-compressed) genome files. You can also give a file list with selected assemblies.\n$ tree atb | more atb ├── achromobacter_xylosoxidans__01 │ ├── SAMD00013333.fa.gz │ ├── SAMD00049594.fa.gz │ ├── SAMD00195911.fa.gz │ ├── SAMD00195914.fa.gz # disk usage $ du -sh atb 2.9T atb $ du -sh atb --apparent-size 2.1T atb Creating a LexicMap index. (more details: https://bioinf.shenwei.me/LexicMap/tutorials/index/)\n# file paths of all samples find atb/ -name \u0026quot;*.fa.gz\u0026quot; \u0026gt; atb_all.txt # wc -l atb_all.txt # 1876015 atb_all.txt # file paths of high-quality samples grep -w -f \u0026lt;(zcat atb/hq_set.sample_list.txt.gz) atb_all.txt \u0026gt; atb_hq.txt # wc -l atb_hq.txt # 1858610 atb_hq.txt # index lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log For 1,858,610 HQ genomes, on a 48-CPU machine, time: 48 h, ram: 85 GB, index size: 3.88 TB. If you don\u0026rsquo;t have enough memory, please decrease the value of -b.\n# disk usage $ du -sh atb_hq.lmi 4.6T atb_hq.lmi $ du -sh atb_hq.lmi --apparent-size 3.9T atb_hq.lmi $ dirsize atb_hq.lmi atb_hq.lmi: 3.88 TiB (4,261,437,129,065) 2.11 TiB seeds 1.77 TiB genomes 39.22 MiB genomes.map.bin 312.53 KiB masks.bin 332 B info.toml Note that, there\u0026rsquo;s a tmp directory atb_hq.lmi being created during indexing. 
In the tmp directory, the seed data would be bigger than the final size of seeds directory, however, the genome files are simply moved to the final index.\n","description":"Make sure you have enough disk space, at least 8 TB, \u0026gt;10 TB is preferred.\nTools:\nhttps://github.com/shenwei356/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF."},{"id":8,"href":"/LexicMap/usage/utils/genomes/","title":"genomes","parent":"utils","content":" Usage $ lexicmap utils genomes -h View genome IDs in the index Usage: lexicmap utils genomes [flags] Flags: -h, --help help for genomes -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 8) Examples $ lexicmap utils genomes -d demo.lmi/ GCF_000148585.2 GCF_001457655.1 GCF_900638025.1 GCF_001096185.1 GCF_006742205.1 GCF_001544255.1 GCF_000392875.1 GCF_001027105.1 GCF_009759685.1 GCF_002949675.1 GCF_002950215.1 GCF_000006945.2 GCF_003697165.2 GCF_000742135.1 GCF_000017205.1 ","description":"Usage $ lexicmap utils genomes -h View genome IDs in the index Usage: lexicmap utils genomes [flags] Flags: -h, --help help for genomes -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments."},{"id":9,"href":"/LexicMap/tutorials/misc/index-globdb/","title":"Indexing GlobDB","parent":"More","content":"Info:\nGlobDB, a dereplicated dataset of the species representatives of the GTDB, GEM, SPIRE and SMAG datasets. https://x.com/daanspeth/status/1822964436950192218 Steps:\n# download data wget https://fileshare.lisc.univie.ac.at/globdb/globdb_r220/globdb_r220_genome_fasta.tar.gz tar -zxf globdb_r220_genome_fasta.tar.gz # file list find globdb_r220_genome_fasta/ -name \u0026quot;*.fa.gz\u0026quot; \u0026gt; files.txt # index with lexicmap # elapsed time: 3h:40m:38s # peak rss: 87.15 GB lexicmap index -S -X files.txt -O globdb_r220.lmi --log globdb_r220.lmi.log -g 50000000 ","description":"Info:\nGlobDB, a dereplicated dataset of the species representatives of the GTDB, GEM, SPIRE and SMAG datasets. 
https://x.com/daanspeth/status/1822964436950192218 Steps:\n# download data wget https://fileshare.lisc.univie.ac.at/globdb/globdb_r220/globdb_r220_genome_fasta.tar.gz tar -zxf globdb_r220_genome_fasta.tar.gz # file list find globdb_r220_genome_fasta/ -name \u0026quot;*.fa.gz\u0026quot; \u0026gt; files.txt # index with lexicmap # elapsed time: 3h:40m:38s # peak rss: 87.15 GB lexicmap index -S -X files.txt -O globdb_r220.lmi --log globdb_r220.lmi.log -g 50000000 "},{"id":10,"href":"/LexicMap/installation/","title":"Installation","parent":"","content":"LexicMap can be installed via conda, downloading executable binary files, or compiling from the source.\nBesides, it supports shell completion, which could help accelerate typing.\nConda Install conda, then run\nconda install -c bioconda lexicmap Or use mamba, which is faster.\nconda install -c conda-forge mamba mamba install -c bioconda lexicmap Linux and MacOS (both x86 and arm CPUs) are supported.\nBinary files Linux Download the binary file.\nOS Arch File, China mirror Linux 64-bit lexicmap_linux_amd64.tar.gz, China mirror Linux arm64 lexicmap_linux_arm64.tar.gz, China mirror Decompress it:\ntar -zxvf lexicmap_linux_amd64.tar.gz If you have root privileges, simply copy it to /usr/local/bin:\nsudo cp lexicmap /usr/local/bin/ If you don\u0026rsquo;t have root privileges, copy it to any directory in the environment variable PATH:\nmkdir -p $HOME/bin/; cp lexicmap $HOME/bin/ And optionally add the directory to the environment variable PATH if it\u0026rsquo;s not already there.\n# bash echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.bashrc source $HOME/.bashrc # apply the configuration # zsh echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.zshrc source $HOME/.zshrc # apply the configuration MacOS Download the binary file.\nOS Arch File, China mirror macOS 64-bit lexicmap_darwin_amd64.tar.gz, China mirror macOS arm64 lexicmap_darwin_arm64.tar.gz, China mirror Copy it to any directory in the environment variable PATH:\nmkdir -p $HOME/bin/; cp 
lexicmap $HOME/bin/ And optionally add the directory to the environment variable PATH if it\u0026rsquo;s not already there.\n# bash echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.bashrc source $HOME/.bashrc # apply the configuration # zsh echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.zshrc source $HOME/.zshrc # apply the configuration Windows Download the binary file.\nOS Arch File, China mirror Windows 64-bit lexicmap_windows_amd64.exe.tar.gz, China mirror Decompress it.\nCopy lexicmap.exe to C:\\WINDOWS\\system32.\nOthers Please open an issue to request binaries for other platforms, or compile from the source. Compile from the source Install go (go 1.22 or later versions).\nwget https://go.dev/dl/go1.22.6.linux-amd64.tar.gz tar -zxf go1.22.6.linux-amd64.tar.gz -C $HOME/ # or # echo \u0026quot;export PATH=$PATH:$HOME/go/bin\u0026quot; \u0026gt;\u0026gt; ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile LexicMap.\n# ------------- the latest stable version ------------- # \u0026quot;go get\u0026quot; no longer installs binaries in recent Go versions, so \u0026quot;go install\u0026quot; is used go install -v github.com/shenwei356/LexicMap/lexicmap@latest # The executable binary file is located in: # ~/go/bin/lexicmap # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/lexicmap $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/LexicMap cd LexicMap/lexicmap/ go build # The executable binary file is located in: # ./lexicmap # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./lexicmap $HOME/bin/ Shell-completion Supported shell: bash|zsh|fish|powershell\nBash:\n# generate the completion script lexicmap autocompletion --shell bash # configure if you have never done so. # install bash-completion if the \u0026quot;complete\u0026quot; command is not found. 
echo \u0026quot;for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\u0026quot; \u0026gt;\u0026gt; ~/.bash_completion echo \u0026quot;source ~/.bash_completion\u0026quot; \u0026gt;\u0026gt; ~/.bashrc Zsh:\n# generate the completion script lexicmap autocompletion --shell zsh --file ~/.zfunc/_lexicmap # configure if you have never done so echo 'fpath=( ~/.zfunc \u0026quot;${fpath[@]}\u0026quot; )' \u0026gt;\u0026gt; ~/.zshrc echo \u0026quot;autoload -U compinit; compinit\u0026quot; \u0026gt;\u0026gt; ~/.zshrc fish:\nlexicmap autocompletion --shell fish --file ~/.config/fish/completions/lexicmap.fish ","description":"LexicMap can be installed via conda, downloading executable binary files, or compiling from the source.\nBesides, it supports shell completion, which could help accelerate typing.\nConda Install conda, then run\nconda install -c bioconda lexicmap Or use mamba, which is faster.\nconda install -c conda-forge mamba mamba install -c bioconda lexicmap Linux and MacOS (both x86 and arm CPUs) are supported.\nBinary files Linux Download the binary file.\nOS Arch File, China mirror Linux 64-bit lexicmap_linux_amd64."},{"id":11,"href":"/LexicMap/usage/search/","title":"search","parent":"Usage","content":"$ lexicmap search -h Search sequences against an index Attention: 1. Input should be (gzipped) FASTA or FASTQ records from files or stdin. 2. For multiple queries, the order of queries might be different from the input. Tips: 1. When using -a/--all, the search result would be formatted to Blast-style format with \u0026#39;lexicmap utils 2blast\u0026#39;. And the search speed would be slightly slowed down. 2. Alignment result filtering is performed in the final phase, so stricter filtering criteria, including -q/--min-qcov-per-hsp, -Q/--min-qcov-per-genome, and -i/--align-min-match-pident, do not significantly accelerate the search speed. Hence, you can search with default parameters and then filter the result with tools like awk or csvtk. 
Alignment result relationship: Query ├── Subject genome ├── Subject sequence ├── High-Scoring segment Pair (HSP) Here, the defination of HSP is similar with that in BLAST. Actually there are small gaps in HSPs. \u0026gt; A High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the \u0026gt; highest alignment scores in a given search. https://www.ncbi.nlm.nih.gov/books/NBK62051/ Output format: Tab-delimited format with 17+ columns, with 1-based positions. 1. query, Query sequence ID. 2. qlen, Query sequence length. 3. hits, Number of subject genomes. 4. sgenome, Subject genome ID. 5. sseqid, Subject sequence ID. 6. qcovGnm, Query coverage (percentage) per genome: $(aligned bases in the genome)/$qlen. 7. hsp, Nth HSP in the genome. (just for improving readability) 8. qcovHSP Query coverage (percentage) per HSP: $(aligned bases in a HSP)/$qlen. 9. alenHSP, Aligned length in the current HSP. 10. pident, Percentage of identical matches in the current HSP. 11. gaps, Gaps in the current HSP. 12. qstart, Start of alignment in query sequence. 13. qend, End of alignment in query sequence. 14. sstart, Start of alignment in subject sequence. 15. send, End of alignment in subject sequence. 16. sstr, Subject strand. 17. slen, Subject sequence length. 18. cigar, CIGAR string of the alignment. (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026#34;|\u0026#34; and \u0026#34; \u0026#34;) between qseq and sseq. (optional with -a/--all) Usage: lexicmap search [flags] -d \u0026lt;index path\u0026gt; [query.fasta.gz ...] [-o query.tsv.gz] Flags: --align-band int ► Band size in backtracking the score matrix (pseduo alignment phase). (default 50) --align-ext-len int ► Extend length of upstream and downstream of seed regions, for extracting query and target sequences for alignment. 
It should be \u0026lt;= contig interval length in database. (default 1000) --align-max-gap int ► Maximum gap in a HSP segment. (default 20) -l, --align-min-match-len int ► Minimum aligned length in a HSP segment. (default 50) -i, --align-min-match-pident float ► Minimum base identity (percentage) in a HSP segment. (default 70) -a, --all ► Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026#34;lexicmap utils 2blast\u0026#34;. -h, --help help for search -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -w, --load-whole-seeds ► Load the whole seed data into memory for faster search. --max-open-files int ► Maximum opened files. (default 512) -J, --max-query-conc int ► Maximum number of concurrent queries. Bigger values do not improve the batch searching speed and consume much memory. (default 12) -Q, --min-qcov-per-genome float ► Minimum query coverage (percentage) per genome. -q, --min-qcov-per-hsp float ► Minimum query coverage (percentage) per HSP. -o, --out-file string ► Out file, supports a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) --pseudo-align ► Only perform pseudo alignment, alignment metrics, including qcovGnm, qcovSHP and pident, will be less accurate. --seed-max-dist int ► Max distance between seeds in seed chaining. (default 10000) --seed-max-gap int ► Max gap in seed chaining. (default 500) -p, --seed-min-prefix int ► Minimum (prefix) length of matched seeds. (default 15) -P, --seed-min-single-prefix int ► Minimum (prefix) length of matched seeds if there\u0026#39;s only one pair of seeds matched. (default 17) -n, --top-n-genomes int ► Keep top N genome matches for a query (0 for all) in chaining phase. Value 1 is not recommended as the best chaining result does not always bring the best alignment, so it better be \u0026gt;= 5. 
Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 8) Examples See Searching ","description":"$ lexicmap search -h Search sequences against an index Attention: 1. Input should be (gzipped) FASTA or FASTQ records from files or stdin. 2. For multiple queries, the order of queries might be different from the input. Tips: 1. When using -a/--all, the search result would be formatted to Blast-style format with \u0026#39;lexicmap utils 2blast\u0026#39;. And the search speed would be slightly slowed down. 2. Alignment result filtering is performed in the final phase, so stricter filtering criteria, including -q/--min-qcov-per-hsp, -Q/--min-qcov-per-genome, and -i/--align-min-match-pident, do not significantly accelerate the search speed."},{"id":12,"href":"/LexicMap/usage/utils/subseq/","title":"subseq","parent":"utils","content":" Usage $ lexicmap utils subseq -h Exextract subsequence via reference name, sequence ID, position and strand Attention: 1. The option -s/--seq-id is optional. 1) If given, the positions are these in the original sequence. 2) If not given, the positions are these in the concatenated sequence. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes. Usage: lexicmap utils subseq [flags] Flags: -h, --help help for subseq -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -w, --line-width int ► Line width of sequence (0 for no wrap). (default 60) -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). 
(default \u0026#34;-\u0026#34;) -n, --ref-name string ► Reference name. -r, --region string ► Region of the subsequence (1-based). -R, --revcom ► Extract subsequence on the negative strand. -s, --seq-id string ► Sequence ID. If the value is empty, the positions in the region are treated as that in the concatenated sequence. Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples Extracting subsequence with genome ID, sequence ID, position range and strand information.\n$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4591684:4593225 -R \u0026gt;NZ_CP033092.2:4591684-4593225:- AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAA GTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAA TGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCAT AACGTCGCAAGACCAAAGAGGGGGACCTTAGGGCCTCTTGCCATCGGATGTGCCCAGATG GGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGG GGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCT TCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATT GACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAG GGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTC GTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACC GGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCA AACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCC CTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAAT TCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAG 
AATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGA AATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGC CGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTC ATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCG ACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAAC TCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGT TCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGT AGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAA CAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA If the sequence ID (-s/--seq-id) is not given, the positions are these in the concatenated sequence.\nChecking sequence lengths of a genome with seqkit.\n$ seqkit fx2tab -nil refs/GCF_003697165.2.fa.gz NZ_CP033092.2 4903501 NZ_CP033091.2 131333 Extracting the 1000-bp interval sequence inserted by lexicmap index.\n$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -r 4903502:4904501 \u0026gt;GCF_003697165.2:4903502-4904501:+ AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA It detects if the 
end position is larger than the sequence length.\n# the length of NZ_CP033092.2 is 4903501 $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903501:1000000000 \u0026gt;NZ_CP033092.2:4903501-4903501:+ C $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903502:1000000000 \u0026gt;NZ_CP033092.2:4903502-4903501:+ ","description":"Usage $ lexicmap utils subseq -h Exextract subsequence via reference name, sequence ID, position and strand Attention: 1. The option -s/--seq-id is optional. 1) If given, the positions are these in the original sequence. 2) If not given, the positions are these in the concatenated sequence. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes."},{"id":13,"href":"/LexicMap/tutorials/misc/index-uhgg/","title":"Indexing UHGG","parent":"More","content":"Info:\nUnified Human Gastrointestinal Genome (UHGG) v2.0.2 A unified catalog of 204,938 reference genomes from the human gut microbiome Number of Genomes: 289,232 Tools:\nhttps://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\n# meta data wget https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0.2/genomes-all_metadata.tsv # gff url sed 1d genomes-all_metadata.tsv | cut -f 20 | sed 's/v2.0/v2.0.2/' | sed -E 's/^ftp/https/' \u0026gt; url.txt # download gff files mkdir -p files; cd files time cat ../url.txt \\ | rush --eta -v 'dir={///%}/{//%}' \\ 'mkdir -p {dir}; curl -s -o {dir}/{%} {}' \\ -c -C download.rush -j 12 cd .. # extract sequences from gff files find files/ -name \u0026quot;*.gff.gz\u0026quot; \\ | rush --eta \\ 'zcat {} | perl -ne \u0026quot;print if \\$s; \\$s=true if /^##FASTA/\u0026quot; | seqkit seq -w 0 -o {/}/{%:}.fna.gz' \\ -c -C extract.rush Indexing. 
On a 48-CPU machine, time: 3 h, ram: 41 GB, index size: 426 GB. If you don\u0026rsquo;t have enough memory, please decrease the value of -b.\nlexicmap index \\ -I files/ \\ -O uhgg.lmi --log uhgg.lmi.log \\ -b 5000 File sizes:\n$ du -sh files/ uhgg.lmi 658G files/ 509G uhgg.lmi $ du -sh files/ uhgg.lmi --apparent-size 425G files/ 426G uhgg.lmi $ dirsize uhgg.lmi uhgg.lmi: 425.15 GiB (456,497,171,291) 243.47 GiB seeds 181.67 GiB genomes 6.34 MiB genomes.map.bin 312.53 KiB masks.bin 330 B info.toml ","description":"Info:\nUnified Human Gastrointestinal Genome (UHGG) v2.0.2 A unified catalog of 204,938 reference genomes from the human gut microbiome Number of Genomes: 289,232 Tools:\nhttps://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\n# meta data wget https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0.2/genomes-all_metadata.tsv # gff url sed 1d genomes-all_metadata.tsv | cut -f 20 | sed 's/v2.0/v2.0.2/' | sed -E 's/^ftp/https/' \u0026gt; url.txt # download gff files mkdir -p files; cd files time cat ../url.txt \\ | rush --eta -v 'dir={///%}/{//%}' \\ 'mkdir -p {dir}; curl -s -o {dir}/{%} {}' \\ -c -C download."},{"id":14,"href":"/LexicMap/releases/","title":"Releases","parent":"","content":" Latest version v0.4.0 v0.4.0 - 2024-08-15 New commands: lexicmap utils 2blast: Convert the default search output to blast-style format. lexicmap index: Support suffix matching of seeds, now seeds are immune to any single SNP!!!, at the cost of doubled seed data. Better sketching desert filling for highly-repetitive regions. Change the default value of --seed-max-desert from 900 to 200 to increase alignment sensitivity. Mask gap regions (N\u0026rsquo;s). Fix skipping interval regions by further including the last k-1 bases of contigs. Fix a bug in indexing small genomes. Change the default value of -b, --batch-size from 10,000 to 5,000. Improve lexichash data structure. 
Write and merge seed data in parallel, new flag -J/--seed-data-threads. Improve the log. lexicmap search: Fix chaining for highly-repetitive regions. Perform more accurate alignment with WFA. Use buffered reader for seeds file reading. Fix object recycling and reduce memory usage. Fix alignment against genomes with many short contigs. Fix early quit when meeting a sequence shorter than k. Add a new option -J/--max-query-conc to limit the maximum number of concurrent queries, with a default value of 12 instead of the number of CPUs, which reduces the memory usage in batch searching. Result format: Cluster alignments of each target sequence. Remove the column seeds. Add columns gaps, cigar, align, which can be reformatted with lexicmap utils 2blast. lexicmap utils kmers: Fix the progress bar. Fix a bug where some masks do not have any k-mer. Add a new column prefix to show the length of common prefix between the seed and the probe. Add a new column reversed to indicate if the k-mer is reversed for suffix matching. lexicmap utils masks: Add support for outputting only a specific mask. lexicmap utils seed-pos: New columns: sseqid and pos_seq. More accurate seed distance. Add histograms of numbers of seeds in sliding windows. lexicmap utils subseq: Fix a bug when the given end position is larger than the sequence length. Add the strand (\u0026quot;+\u0026quot; or \u0026quot;-\u0026quot;) in the sequence header. Please run lexicmap version to check for updates !!! Please run lexicmap autocompletion to update the shell autocompletion script !!! Previous versions v0.3.0 v0.3.0 - 2024-05-14 lexicmap index: Better seed coverage by filling sketching deserts. Use longer (1000bp N\u0026rsquo;s, previous: k-1) intervals between contigs. Fix a concurrency bug between genome data writing and k-mer-value data collecting. Change the format of k-mer-value index file, and fix the computation of index partitions. 
Optionally save seed positions which can be outputted by lexicmap utils seed-pos. lexicmap search: Improved seed-chaining algorithm. Better support of long queries. Add a new flag -w/--load-whole-seeds for loading the whole seed data into memory for faster search. Parallelize alignment in each query, so it\u0026rsquo;s faster for a single query. Optionally output matched query and subject sequences. 2-5X searching speed with a faster masking method. Change output format. Add output of query start and end positions. Fix a target sequence extracting bug. Keep indexes of genome data in memory. lexicmap utils kmers: Fix a minor bug: wrong number of k-mers for the second k-mer in each k-mer pair. New commands: lexicmap utils gen-masks for generating masks from the top N largest genomes. lexicmap utils seed-pos for extracting seed positions via reference names. lexicmap utils reindex-seeds for recreating indexes of k-mer-value (seeds) data. lexicmap utils genomes for listing genome IDs in the index. v0.2.0 v0.2.0 - 2024-02-02 Software architecture and index formats are redesigned to reduce memory usage during searching. Indexing: genomes are processed in batches to reduce RAM usage, then indexes of all batches are merged. Searching: seed matching is performed on disk yet it\u0026rsquo;s ultra-fast. v0.1.0 v0.1.0 - 2024-01-15 The first release. Seed indexing and querying are performed in RAM. GTDB r214 with 10k masks: index size 75GB, RAM: 130GB. ","description":"Latest version v0.4.0 v0.4.0 - 2024-08-15 New commands: lexicmap utils 2blast: Convert the default search output to blast-style format. lexicmap index: Support suffix matching of seeds, now seeds are immune to any single SNP!!!, at the cost of doubled seed data. Better sketching desert filling for highly-repetitive regions. Change the default value of --seed-max-desert from 900 to 200 to increase alignment sensitivity. Mask gap regions (N\u0026rsquo;s). 
Fix skipping interval regions by further including the last k-1 bases of contigs."},{"id":15,"href":"/LexicMap/usage/utils/seed-pos/","title":"seed-pos","parent":"utils","content":" Usage $ lexicmap utils seed-pos -h Extract and plot seed positions via reference name(s) Attention: 0. This command requires the index to be created with the flag --save-seed-pos in lexicmap index. 1. Seed/K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. So values of column pos_gnm and pos_seq might be different. The positions can be used to extract subsequence with \u0026#39;lexicmap utils subseq\u0026#39;. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes. Extra columns: Using -v/--verbose will output more columns: len_aaa, length of consecutive A\u0026#39;s. seq, sequence between the previous and current seed. Figures: Using -O/--plot-dir will write plots into given directory: - Histograms of seed distances. - Histograms of numbers of seeds in sliding windows. Usage: lexicmap utils seed-pos [flags] Flags: -a, --all-refs ► Output for all reference genomes. This would take a long time for an index with a lot of genomes. -b, --bins int ► Number of bins in histograms. (default 100) --color-index int ► Color index (1-7). (default 1) --force ► Overwrite existing output directory. --height float ► Histogram height (unit: inch). (default 4) -h, --help help for seed-pos -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --max-open-files int ► Maximum opened files, used for extracting sequences. (default 512) -D, --min-dist int ► Only output records with seed distance \u0026gt;= this value. 
-o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) -O, --plot-dir string ► Output directory for 1) histograms of seed distances, 2) histograms of numbers of seeds in sliding windows. --plot-ext string ► Histogram plot file extention. (default \u0026#34;.png\u0026#34;) -n, --ref-name strings ► Reference name(s). -s, --slid-step int ► The step size of sliding windows for counting the number of seeds (default 200) -w, --slid-window int ► The window size of sliding windows for counting the number of seeds (default 500) -v, --verbose ► Show more columns including position of the previous seed and sequence between the two seeds. Warning: it\u0026#39;s slow to extract the sequences, recommend set -D 1000 or higher values to filter results --width float ► Histogram width (unit: inch). (default 6) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 16) Examples Adding the flag --save-seed-pos in index building.\n$ lexicmap index -I refs/ -O demo.lmi --save-seed-pos --force Listing seed position of one genome.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -o seed_distance.tsv $ head -n 10 seed_distance.tsv | csvtk pretty -t ref seqid pos_gnm pos_seq strand distance --------------- ----------- ------- ------- ------ -------- GCF_000017205.1 NC_009656.1 90 90 - 89 GCF_000017205.1 NC_009656.1 133 133 + 43 GCF_000017205.1 NC_009656.1 137 137 - 4 GCF_000017205.1 NC_009656.1 139 139 - 2 GCF_000017205.1 NC_009656.1 160 160 - 21 GCF_000017205.1 NC_009656.1 300 300 - 140 GCF_000017205.1 NC_009656.1 338 338 + 38 GCF_000017205.1 NC_009656.1 360 360 + 22 GCF_000017205.1 NC_009656.1 361 361 + 1 Check the biggest seed distances.\n$ csvtk freq -t -f distance seed_distance.tsv \\ | csvtk sort -t -k distance:nr \\ | head -n 10 \\ | csvtk pretty -t distance frequency -------- --------- 199 43 198 49 197 52 196 43 195 44 194 47 193 43 192 53 191 38 Or only list records with seed distances longer than a threshold.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -D 190 \\ | csvtk pretty -t | head -n 5 ref seqid pos_gnm pos_seq strand distance --------------- ----------- ------- ------- ------ -------- GCF_000017205.1 NC_009656.1 13964 13964 - 197 GCF_000017205.1 NC_009656.1 27420 27420 + 191 GCF_000017205.1 NC_009656.1 30942 30942 + 193 Plot histogram of distances between seeds and histogram of number of seeds in sliding windows.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -o seed_distance.tsv --plot-dir seed_distance In the plot below, there\u0026rsquo;s a peak at 50 bp, because LexicMap fills sketching deserts with extra k-mers (seeds) of which their distance is 50 bp by default.\nMore columns including sequences between two seeds.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -v \\ | head -n4 | csvtk pretty -t -W 40 --clip ref seqid pos_gnm pos_seq strand distance 
len_aaa seq --------------- ----------- ------- ------- ------ -------- ------- ---------------------------------------- GCF_000017205.1 NC_009656.1 90 90 - 89 9 TTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGC... GCF_000017205.1 NC_009656.1 133 133 + 43 3 TTTCTTTTAAAGGATAGAAGCGGTTATTGCTCTTGGT... GCF_000017205.1 NC_009656.1 137 137 - 4 0 GGTT Or only list records with seed distance longer than a threshold.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -v -D 190 \\ | head -n 2 \\ | csvtk pretty -t -W 40 ref seqid pos_gnm pos_seq strand distance len_aaa seq --------------- ----------- ------- ------- ------ -------- ------- ---------------------------------------- GCF_000017205.1 NC_009656.1 13964 13964 - 197 8 ATTTGCCCATTGAGGCGCCGGTATTGCGCATGGAAGTGGT GCGCATCGACGCCGAGGGCGTCGGCCTGCGCTTCCTCGCC GATCAATGAAACCCGAGTTCCACGTGGAACCACGGTCCTG CCATCGATCAGCGAACGGGCGAATCCGCCGCCCGTTATCG GCTAGAATGCGCGCCGCTCGGCATGGGGCCGGGCATG Listing seed positions of all genomes.\n$ lexicmap utils seed-pos -d demo.lmi/ --all-refs -o seed-pos.tsv.gz Show the number of seed positions in each genome. Frequencies larger than 40000 (the number of masks) mean that some k-mers can be found at more than one position in a genome.\n$ csvtk freq -t -f ref -nr seed-pos.tsv.gz | csvtk pretty -t ref frequency --------------- --------- GCF_000017205.1 134674 GCF_000742135.1 103882 GCF_003697165.2 92389 GCF_000006945.2 91007 GCF_002950215.1 89876 GCF_002949675.1 84731 GCF_009759685.1 72615 GCF_001027105.1 56806 GCF_000392875.1 55397 GCF_006742205.1 52670 GCF_001544255.1 49919 GCF_900638025.1 46654 GCF_001457655.1 46226 GCF_001096185.1 46222 GCF_000148585.2 44848 Plot the histograms of distances between seeds for all genomes.\n$ lexicmap utils seed-pos -d demo.lmi/ --all-refs -o seed-pos.tsv.gz \\ --plot-dir seed_distance --force 09:56:34.059 [INFO] creating genome reader pools, each batch with 1 readers... processed files: 15 / 15 [======================================] ETA: 0s. 
done 09:56:34.656 [INFO] seed positions of 15 genomes(s) saved to seed-pos.tsv.gz 09:56:34.656 [INFO] histograms of 15 genomes(s) saved to seed_distance 09:56:34.656 [INFO] 09:56:34.656 [INFO] elapsed time: 598.080462ms 09:56:34.656 [INFO] $ ls seed_distance/ GCF_000006945.2.png GCF_000742135.1.png GCF_001544255.1.png GCF_006742205.1.png GCF_000006945.2.seed_number.png GCF_000742135.1.seed_number.png GCF_001544255.1.seed_number.png GCF_006742205.1.seed_number.png GCF_000017205.1.png GCF_001027105.1.png GCF_002949675.1.png GCF_009759685.1.png GCF_000017205.1.seed_number.png GCF_001027105.1.seed_number.png GCF_002949675.1.seed_number.png GCF_009759685.1.seed_number.png GCF_000148585.2.png GCF_001096185.1.png GCF_002950215.1.png GCF_900638025.1.png GCF_000148585.2.seed_number.png GCF_001096185.1.seed_number.png GCF_002950215.1.seed_number.png GCF_900638025.1.seed_number.png GCF_000392875.1.png GCF_001457655.1.png GCF_003697165.2.png GCF_000392875.1.seed_number.png GCF_001457655.1.seed_number.png GCF_003697165.2.seed_number.png In the plots below, there\u0026rsquo;s a peak at 50 bp, because LexicMap fills sketching deserts with extra k-mers (seeds) of which their distance is 50 bp by default. And they show that the seed number, seed distance and seed density are related to genome sizes.\nGCF_000392875.1 (genome size: 2.9 Mb)\n","description":"Usage $ lexicmap utils seed-pos -h Extract and plot seed positions via reference name(s) Attention: 0. This command requires the index to be created with the flag --save-seed-pos in lexicmap index. 1. Seed/K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. So values of column pos_gnm and pos_seq might be different. 
The positions can be used to extract subsequence with \u0026#39;lexicmap utils subseq\u0026#39;."},{"id":16,"href":"/LexicMap/tutorials/misc/","title":"More","parent":"Tutorials","content":"","description":""},{"id":17,"href":"/LexicMap/tutorials/","title":"Tutorials","parent":"","content":"","description":""},{"id":18,"href":"/LexicMap/usage/utils/","title":"utils","parent":"Usage","content":"$ lexicmap utils Some utilities Usage: lexicmap utils [command] Available Commands: 2blast Convert the default search output to blast-style format genomes View genome IDs in the index kmers View k-mers captured by the masks masks View masks of the index or generate new masks randomly reindex-seeds Recreate indexes of k-mer-value (seeds) data seed-pos Extract and plot seed positions via reference name(s) subseq Extract subsequence via reference name, sequence ID, position and strand Flags: -h, --help help for utils Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 16) The output (TSV format) is formatted with csvtk pretty.\n","description":"$ lexicmap utils Some utilities Usage: lexicmap utils [command] Available Commands: 2blast Convert the default search output to blast-style format genomes View genome IDs in the index kmers View k-mers captured by the masks masks View masks of the index or generate new masks randomly reindex-seeds Recreate indexes of k-mer-value (seeds) data seed-pos Extract and plot seed positions via reference name(s) subseq Extract subsequence via reference name, sequence ID, position and strand Flags: -h, --help help for utils Global Flags: -X, --infile-list string ► File of input file list (one file per line)."},{"id":19,"href":"/LexicMap/usage/utils/reindex-seeds/","title":"reindex-seeds","parent":"utils","content":" Usage $ lexicmap utils reindex-seeds -h Recreate indexes of k-mer-value (seeds) data Usage: lexicmap utils reindex-seeds [flags] Flags: -h, --help help for reindex-seeds -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --partitions int ► Number of partitions for re-indexing seeds (k-mer-value data) files. The value needs to be the power of 4. (default 1024) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples $ lexicmap utils reindex-seeds -d demo.lmi/ --partitions 1024 10:20:29.150 [INFO] recreating seed indexes with 1024 partitions for: demo.lmi/ processed files: 16 / 16 [======================================] ETA: 0s. 
done 10:20:29.166 [INFO] update index information file: demo.lmi/info.toml 10:20:29.166 [INFO] finished updating the index information file: demo.lmi/info.toml 10:20:29.166 [INFO] 10:20:29.166 [INFO] elapsed time: 15.981266ms 10:20:29.166 [INFO] ","description":"Usage $ lexicmap utils reindex-seeds -h Recreate indexes of k-mer-value (seeds) data Usage: lexicmap utils reindex-seeds [flags] Flags: -h, --help help for reindex-seeds -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --partitions int ► Number of partitions for re-indexing seeds (k-mer-value data) files. The value needs to be the power of 4. (default 1024) Global Flags: -X, --infile-list string ► File of input file list (one file per line)."},{"id":20,"href":"/LexicMap/usage/","title":"Usage","parent":"","content":"","description":""},{"id":21,"href":"/LexicMap/faqs/","title":"FAQs","parent":"","content":" Table of contents Table of contents Does LexicMap support short reads? Does LexicMap support fungi genomes? How\u0026rsquo;s the hardware requirement? Can I extract the matched sequences? How can I extract the upstream and downstream flanking sequences of matched regions? Why isn\u0026rsquo;t the pident 100% when aligning with a sequence from the reference genomes? Why is LexicMap slow for batch searching? Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned.\nIf you just want to search long (\u0026gt;1kb) queries for highly similar (\u0026gt;95%) targets, you can build an index with a bigger -D/--seed-max-desert (200 by default), e.g.,\n--seed-max-desert 450 --seed-in-desert-dist 150 Bigger values decrease the search sensitivity for distant targets, speed up indexing, reduce the indexing memory usage, and reduce the index size. 
The alignment speed is almost unaffected.\nDoes LexicMap support fungi genomes? Yes. LexicMap mainly supports small genomes including prokaryotic, viral, and plasmid genomes. Fungal genomes are also supported; just remember to increase the value of -g/--max-genome when running lexicmap index, which is used to skip genomes larger than 15 Mb by default.\n-g, --max-genome int ► Maximum genome size. Extremely large genomes (e.g., non-isolate assemblies from Genbank) will be skipped. (default 15000000) Maximum genome size is about 268 Mb (268,435,456). More precisely:\n$total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456 as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index.\nFor big and complex genomes, like the human genome (chr1 is ~248 Mb) which has many repetitive sequences, LexicMap would be slow to align.\nHow\u0026rsquo;s the hardware requirement? For index building: see the hardware requirement details. For searching: see the hardware requirement details. Can I extract the matched sequences? Yes, lexicmap search has a flag\n-a, --all ► Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026#34;lexicmap utils 2blast\u0026#34;. to output the CIGAR string and the aligned query and subject sequences.\n18. cigar, CIGAR string of the alignment (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026#34;|\u0026#34; and \u0026#34; \u0026#34;) between qseq and sseq. (optional with -a/--all) An example:\n# Extracting similar sequences for a query gene. 
# search matches with query coverage \u0026gt;= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta -o results.tsv \\ --min-qcov-per-hsp 90 --all # extract matched sequences as FASTA format sed 1d results.tsv | awk -F'\\t' '{print \u0026quot;\u0026gt;\u0026quot;$5\u0026quot;:\u0026quot;$14\u0026quot;-\u0026quot;$15\u0026quot;:\u0026quot;$16\u0026quot;\\n\u0026quot;$20;}' \\ | seqkit seq -g \u0026gt; results.fasta seqkit head -n 1 results.fasta | head -n 3 \u0026gt;NZ_JALSCK010000007.1:39224-40522:- TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT lexicmap utils 2blast can convert the tabular format to BLAST-style format; see examples.\nHow can I extract the upstream and downstream flanking sequences of matched regions? lexicmap utils subseq can extract subsequences via genome ID, sequence ID and positions. You can take this information from the search result and expand the region positions to extract flanking sequences.\nWhy isn\u0026rsquo;t the pident 100% when aligning with a sequence from the reference genomes? This happens if there are degenerate bases (e.g., N) in the query sequence. In the indexing step, all degenerate bases are converted to their lexicographic first bases. E.g., N is converted to A. Query sequences, however, are not converted.\nWhy is LexicMap slow for batch searching? LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 17 million) of genomes. There are some ways to improve the search speed of lexicmap search.\nIncreasing the concurrency number Increasing the value of --max-open-files (default 512). You might need to change the open files limit. (If you have many queries) Increase the value of -J/--max-query-conc (default 12), which also increases memory usage. 
Loading the entire seed data into memory (It\u0026rsquo;s unnecessary if the index is stored on an SSD) Setting -w/--load-whole-seeds to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters. Returning fewer results Setting -n/--top-n-genomes to keep top N genome matches for a query (0 for all) in chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time. Sacrificing accuracy Setting --pseudo-align to only perform pseudo alignment, which is slightly faster and uses less memory. It can be used in searching with long and divergent query sequences like nanopore long-reads. Click to read more detail of the usage.\n","description":"Table of contents Table of contents Does LexicMap support short reads? Does LexicMap support fungi genomes? How\u0026rsquo;s the hardware requirement? Can I extract the matched sequences? How can I extract the upstream and downstream flanking sequences of matched regions? Why isn\u0026rsquo;t the pident 100% when aligning with a sequence from the reference genomes? Why is LexicMap slow for batch searching? Does LexicMap support short reads? 
LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default."},{"id":22,"href":"/LexicMap/notes/","title":"Notes","parent":"","content":"","description":""},{"id":23,"href":"/LexicMap/","title":"","parent":"","content":" LexicMap LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, virus, or long-read sequences against up to millions of prokaryotic genomes.\nIntroduction Feature overview Easy to install Linux, Windows, MacOS and more OS are supported.\nBoth x86 and ARM CPUs are supported.\nJust download the binary files and run!\nOr install it by\nconda install -c bioconda lexicmap Installation Releases Easy to use Step 1: indexing\nlexicmap index -I genomes/ -O db.lmi Step 2: searching\nlexicmap search -d db.lmi q.fasta -o r.tsv Tutorials Usages FAQs Notes Accurate and efficient alignment Using LexicMap to search in the whole 2,340,672 Genbank+Refseq prokaryotic genomes with 48 CPUs.\nQuery Genome hits Time RAM A 1.3-kb gene 37,164 36s 4.1GB A 1.5-kb 16S rRNA 1,949,496 10m41s 14.1GB A 52.8-kb plasmid 544,619 19m20s 19.3GB 1003 AMR genes 25,702,419 187m40s 55.4GB Blastn is unable to run with the same dataset on common servers as it requires \u0026gt;2000 GB RAM.\nPerformance ","description":"LexicMap LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, virus, or long-read sequences against up to millions of prokaryotic genomes.\nIntroduction Feature overview Easy to install Linux, Windows, MacOS and more OS are supported.\nBoth x86 and ARM CPUs are supported.\nJust download the binary files and run!\nOr install it by\nconda install -c bioconda lexicmap Installation Releases Easy to use Step 1: indexing"},{"id":24,"href":"/LexicMap/usage/utils/2blast/","title":"2blast","parent":"utils","content":" Usage $ lexicmap utils 2blast -h Convert the default search output to blast-style format LexicMap only stores 
genome IDs and sequence IDs, without description information. But the option -g/--kv-file-genome enables adding description data after the genome ID with a tabular key-value mapping file. Input: - Output of \u0026#39;lexicmap search\u0026#39; with the flag -a/--all. Usage: lexicmap utils 2blast [flags] Flags: -b, --buffer-size string ► Size of buffer, supported unit: K, M, G. You need increase the value when \u0026#34;bufio.Scanner: token too long\u0026#34; error reported (default \u0026#34;20M\u0026#34;) -h, --help help for 2blast -i, --ignore-case ► Ignore cases of sgenome and sseqid -g, --kv-file-genome string ► Two-column tabular file for mapping the target genome ID (sgenome) to the corresponding value -s, --kv-file-seq string ► Two-column tabular file for mapping the target sequence ID (sseqid) to the corresponding value -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 16) Examples From stdin.\n$ seqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 2 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast --kv-file-genome ass2species.map Query = GCF_000017205.1_r160 Length = 478 [Subject genome #1/1] = GCF_000017205.1 Pseudomonas aeruginosa Query coverage per genome = 95.188% \u0026gt;NC_009656.1 Length = 6588339 HSP #1 Query coverage per seq = 95.188%, Aligned length = 463, Identities = 95.680%, Gaps = 12 Query range = 13-467, Subject range = 4866862-4867320, Strand = Plus/Plus Query 13 CCTCAAACGAGTCC-AACAGGCCAACGCCTAGCAATCCCTCCCCTGTGGGGCAGGGAAAA 71 |||||||||||||| |||||||| |||||| | ||||||||||||| |||||||||||| Sbjct 4866862 CCTCAAACGAGTCCGAACAGGCCCACGCCTCACGATCCCTCCCCTGTCGGGCAGGGAAAA 4866921 Query 72 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCATCTTCCACGGTGCCC 131 |||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||| Sbjct 4866922 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCAT-TTCCACGGTGCCC 4866980 Query 132 GCCCACGGCGGACCCGCGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTA-CA 190 ||| ||||||||||| ||||||||||||||||||||||||||||||||||||||||| || Sbjct 4866981 GCC-ACGGCGGACCC-CGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTATCA 4867038 Query 191 CTCGGCGTTCGGCCAGCGACAGC---GACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 247 |||||||| |||||||||||||| |||||||||||||||||||||||||||||||||| Sbjct 4867039 CTCGGCGT-CGGCCAGCGACAGCAGCGACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 4867097 Query 248 AGGTGGTGCGCTCGCTGAC-AAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCG- 305 ||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||| Sbjct 4867098 AGGTGGTGCGCTCGCTGACGAAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCGG 4867157 Query 306 TGCCGGACAGCCCGTGGCCGCCGAACAGTTGCACGCCCACCACCGCGCCGAT-TGGTTTC 364 |||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| | Sbjct 4867158 TGCCGGACAGCCCGTGGCCGCCGAACGGTTGCACGCCCACCACCGCGCCGATCTGGTTGC 4867217 Query 365 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTGGATGCGGCGGGCGGTTTCCTCGT 424 |||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| Sbjct 
4867218 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTCGATGCGGCGGGCGGTTTCCTCGT 4867277 Query 425 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 467 ||||||||||||||||||||||||||||||||||||||||||| Sbjct 4867278 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 4867320 Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Staphylococcus epidermidis Query coverage per genome = 92.575% \u0026gt;NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| 
||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677 From file.\n$ lexicmap utils 2blast r.lexicmap.tsv -o r.lexicmap.txt ","description":"Usage $ lexicmap utils 2blast -h Convert the default search output to blast-style format LexicMap only stores genome IDs and sequence IDs, without description information. But the option -g/--kv-file-genome enables adding description data after the genome ID with a tabular key-value mapping file. Input: - Output of \u0026#39;lexicmap search\u0026#39; with the flag -a/--all. Usage: lexicmap utils 2blast [flags] Flags: -b, --buffer-size string ► Size of buffer, supported unit: K, M, G."},{"id":25,"href":"/LexicMap/usage/lexicmap/","title":"lexicmap","parent":"Usage","content":"$ lexicmap -h LexicMap: efficient sequence alignment against millions of prokaryotic genomes Version: v0.4.0 Documents: https://bioinf.shenwei.me/LexicMap Source code: https://github.com/shenwei356/LexicMap Usage: lexicmap [command] Available Commands: autocompletion Generate shell autocompletion scripts index Generate an index from FASTA/Q sequences search Search sequences against an index utils Some utilities version Print version information and check for update Flags: -h, --help help for lexicmap -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Use \u0026#34;lexicmap [command] --help\u0026#34; for more information about a command. 
","description":"$ lexicmap -h LexicMap: efficient sequence alignment against millions of prokaryotic genomes Version: v0.4.0 Documents: https://bioinf.shenwei.me/LexicMap Source code: https://github.com/shenwei356/LexicMap Usage: lexicmap [command] Available Commands: autocompletion Generate shell autocompletion scripts index Generate an index from FASTA/Q sequences search Search sequences against an index utils Some utilities version Print version information and check for update Flags: -h, --help help for lexicmap -X, --infile-list string ► File of input file list (one file per line)."},{"id":26,"href":"/LexicMap/notes/motivation/","title":"Motivation","parent":"Notes","content":" BLASTN is not able to scale to millions of bacterial genomes, it\u0026rsquo;s slow and has a high memory occupation. For example, it requires \u0026gt;2000 GB for alignment a 2-kb gene sequence against all the 2.34 millions of prokaryotics genomes in Genbank and RefSeq.\nLarge-scale sequence searching tools only return which genomes a query matches (color), but they can\u0026rsquo;t return positional information.\n","description":"BLASTN is not able to scale to millions of bacterial genomes, it\u0026rsquo;s slow and has a high memory occupation. For example, it requires \u0026gt;2000 GB for alignment a 2-kb gene sequence against all the 2.34 millions of prokaryotics genomes in Genbank and RefSeq.\nLarge-scale sequence searching tools only return which genomes a query matches (color), but they can\u0026rsquo;t return positional information."},{"id":27,"href":"/LexicMap/tutorials/index/","title":"Step 1. Building a database","parent":"Tutorials","content":"Terminology differences:\nOn this page and in the LexicMap command line options, the term \u0026ldquo;mask\u0026rdquo; is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use \u0026ldquo;probe\u0026rdquo; as it is easier to understand. 
Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names. E.g., GCF_000006945.2.fna.gz A regular expression is also available to extract reference id from the file name. E.g., --ref-name-regexp '^(\\w{3}_\\d{9}\\.\\d+)' extracts GCF_000006945.2 from GenBank assembly file GCF_000006945.2_ASM694v2_genomic.fna.gz Saving a few small (viral) complete genomes (one sequence per genome) in each file is also feasible, as sequence IDs in the search result help to distinguish target genomes. Run: From a directory with multiple genome files:\nlexicmap index -I genomes/ -O db.lmi From a file list with one file per line:\nlexicmap index -X files.txt -O db.lmi Input Genome size\nLexicMap is mainly suitable for small genomes like Archaea, Bacteria, Viruses and plasmids.\nMaximum genome size: 268 Mb (268,435,456). More precisely:\n$total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456 as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index.\nSequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names. Saving a few small (viral) complete genomes (one sequence per genome) in each file is also feasible, as sequence IDs in the search result help to distinguish target genomes.\nFile type: FASTA/Q files, in plain text or gzip/xz/zstd/bzip2 compressed formats. File name: \u0026ldquo;Genome ID\u0026rdquo; + \u0026ldquo;File extension\u0026rdquo;. E.g., GCF_000006945.2.fna.gz. 
Genome ID: genome IDs are shown in the search result and should be distinct for accurate result interpretation. A regular expression is also available to extract reference id from the file name. E.g., --ref-name-regexp '^(\\w{3}_\\d{9}\\.\\d+)' extracts GCF_000006945.2 from GenBank assembly file GCF_000006945.2_ASM694v2_genomic.fna.gz File extension: a regular expression set by the flag -r/--file-regexp is used to match input files. The default value supports common sequence file extensions, e.g., .fa, .fasta, .fna, .fa.gz, .fasta.gz, .fna.gz, .fasta.xz, .fasta.zst, and .fasta.bz2. Sequences: Only DNA or RNA sequences are supported. Sequence IDs are shown in the search result and should be distinct for accurate result interpretation. Sequence description (text after the sequence ID) is not saved. If you do need it, you can create a mapping file (seqkit seq -n ref.fa.gz | sed -E 's/\\s+/\\t/' \u0026gt; id2desc.tsv) and use it to add descriptions to the search result. One or more sequences (contigs) in each file are allowed. Unwanted sequences can be filtered out by regular expressions from the flag -B/--seq-name-filter. Genome size limit. Some non-isolate assemblies might have extremely large genomes, e.g., GCA_000765055.1 has \u0026gt;150 Mb. The flag -g/--max-genome (default 15 Mb) is used to skip these input files, and the list of skipped files is written to a file via the flag -G/--big-genomes. For fungi genomes, please increase the value. Minimum sequence length. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value). At most 17,179,869,184 (2^34) genomes are supported. For more genomes, please create a file list and split it into multiple parts, and build an index for each part. Input files can be given via one of the following ways:\nPositional arguments. For a few input files. A file list via the flag -X/--infile-list with one file per line. 
It can be STDIN (-), e.g., you can filter a file list and pass it to lexicmap index. The flag -S/--skip-file-check is optional for skipping input file checking if you believe these files do exist. A directory containing input files via the flag -I/--in-dir. Multiple-level directories are supported, so you don\u0026rsquo;t need to save hundreds of thousands of files in one directory. Directory and file symlinks are followed. Hardware requirements See benchmark of index building.\nLexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.\nCPU: No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported. More is better as LexicMap is CPU-intensive. It uses all CPUs by default (-j/--threads). RAM More RAM (\u0026gt; 100 GB) is preferred. The memory usage in index building is mainly related to: The number of masks (-m/--masks, default 40,000). Bigger values improve the search sensitivity, increase the index size, and slow down the search speed. For smaller genomes like phages/viruses, m=10,000 is high enough. The number of genomes. More genomes consume more memory. The divergence between genome sequences in each batch. Diverse genomes consume more memory. The genome batch size (-b/--batch-size, default 5,000). This is the main parameter to adjust memory usage. Bigger values increase indexing memory occupation. The maximum seed distance or the maximum sketching desert size (-D/--seed-max-desert, default 200), and the distance of k-mers to fill deserts (-d/--seed-in-desert-dist, default 50). Bigger -D/--seed-max-desert values decrease the search sensitivity for distant targets, speed up indexing, reduce the indexing memory usage, and reduce the index size. The alignment speed is almost unaffected. If the RAM is not sufficient, please: Use a smaller genome batch size. 
It decreases indexing memory occupation and has little effect on searching performance. Use a smaller number of masks, e.g., 20,000 performs well for small genomes (\u0026lt;=5 Mb). And if the queries are long (\u0026gt;= 2kb), there\u0026rsquo;s little effect on the alignment results. Disk More is better. LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance. See some examples. Note that the index size is not linear with the number of genomes, it\u0026rsquo;s sublinear. Because the seed data are compressed with the VARINT-GB algorithm, more genomes bring higher compression rates. SSD disks are preferred, while HDD disks are also fast enough. Algorithm Click to show details. ... Generating m LexicHash masks.\nGenerate m prefixes. Generating all permutations of p-bp prefixes that can cover all possible k-mers, p is the biggest value for 4^p \u0026lt;= m (desired number of masks), e.g., p=7 for 40,000 masks. (4^7 = 16384) Duplicating these prefixes to m prefixes. For each prefix, randomly generating the remaining k-p bases. If the mask is duplicated, re-generating. Building an index for each genome batch (-b/--batch-size, default 5,000, max 131,072).\nFor each genome file in a genome batch. Optionally discarding sequences via regular expression (-B/--seq-name-filter). Skipping genomes bigger than the value of -g/--max-genome. Concatenating all sequences, with intervals of 1000-bp N\u0026rsquo;s. Capturing the most similar k-mer (in non-gap and non-interval regions) for each mask and recording the k-mer and its location(s) and strand information. Base N is treated as A. Filling sketching deserts (genome regions longer than --seed-max-desert [default 200] without any captured k-mers/seeds). In a sketching desert, not a single k-mer is captured because there\u0026rsquo;s another k-mer in another place which shares a longer prefix with the mask. 
As a result, for a query similar to sequences in this region, all captured k-mers can’t match the correct seeds. For a desert region (start, end), masking the extended region (start-1000, end+1000) with the masks. Starting from start, roughly every --seed-in-desert-dist (default 50) bp, finding a k-mer which is captured by some mask, and adding the k-mer and its position information into the index of that mask. Saving the concatenated genome sequence (bit-packed, 2 bits for one base, N is treated as A) and genome information (genome ID, size, and lengths of all sequences) into the genome data file, and creating an index file for the genome data file for fast random subsequence extraction. Duplicate and reverse all k-mers, and save each reversed k-mer along with the duplicated position information in the seed data of the closest (sharing the longest prefix) mask. This is for suffix matching of seeds. Compressing k-mers and the corresponding data (k-mer-data, or seeds data, including genome batch, genome number, location, and strand) into chunks of files, and creating an index file for each k-mer-data file for fast seeding. Writing summary information into the info.toml file. Merging indexes of multiple batches.\nFor each k-mer-data chunk file (belonging to a list of masks), serially reading data of each mask from all batches, merging them and writing to a new file. For genome data files, just moving them. Concatenating genomes.map.bin, which maps each genome ID to its batch ID and index in the batch. Update the index summary file. Parameters Query length\nLexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. 
However, short queries can also be aligned.\nIf you just want to search long (\u0026gt;1kb) queries for highly similar (\u0026gt;95%) targets, you can build an index with a bigger -D/--seed-max-desert (200 by default), e.g.,\n--seed-max-desert 450 --seed-in-desert-dist 150 Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected.\nFlags in bold text are important and frequently used.\nGenome batches Flag Value Function Comment -b/--batch-size Max: 131072, default: 5000 Maximum number of genomes in each batch If the number of input files exceeds this number, input files are split into multiple batches and indexes are built for all batches. In the end, seed files are merged, while genome data files are kept unchanged and collected. ■ Bigger values increase indexing memory occupation and increase batch searching speed, while single query searching speed is not affected. LexicHash mask generation Flag Value Function Comment -M/--mask-file A file File with custom masks File with custom masks, which could be exported from an existing index or newly generated by \u0026ldquo;lexicmap utils masks\u0026rdquo;. This flag overrides -k/--kmer, -m/--masks, -s/--rand-seed, etc. -k/--kmer Max: 32, default: 31 K-mer size ■ Bigger values improve the search specificity and do not increase the index size. -m/--masks Default: 40,000 Number of masks ■ Bigger values improve the search sensitivity, increase the index size, and slow down the search speed. For smaller genomes like phages/viruses, m=10,000 is high enough. Seeds (k-mer-value) data Flag Value Function Comment --seed-max-desert Default: 200 Maximum length of distances between seeds The default value of 200 guarantees queries \u0026gt;200 bp would match at least one seed. ► Large regions with no seeds are called sketching deserts. 
Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every --seed-in-desert-dist (50 by default) bases. ■ Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected. -c/--chunks Maximum: 128, default: #CPUs Number of seed file chunks Bigger values accelerate the search speed at the cost of a high disk reading load. The maximum number should not exceed the maximum number of open files set by the operating systems. -J/--seed-data-threads Maximum: -c/--chunks, default: 8 Number of threads for writing seed data and merging seed chunks from all batches ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation. -p/--partitions Default: 1024 Number of partitions for indexing each seed file Bigger values bring a little higher memory occupation. ► After indexing, lexicmap utils reindex-seeds can be used to reindex the seeds data with another value of this flag. --max-open-files Default: 512 Maximum number of open files It\u0026rsquo;s only used in merging indexes of multiple genome batches. 
Also see the usage of lexicmap index.\nSteps We use a small dataset for demonstration.\nPreparing the test genomes (15 bacterial genomes) in the refs directory.\nNote that the genome files contain the assembly accessions (ID) in the file names.\ngit clone https://github.com/shenwei356/LexicMap cd LexicMap/demo/ ls refs/ GCF_000006945.2.fa.gz GCF_000392875.1.fa.gz GCF_001096185.1.fa.gz GCF_002949675.1.fa.gz GCF_006742205.1.fa.gz GCF_000017205.1.fa.gz GCF_000742135.1.fa.gz GCF_001457655.1.fa.gz GCF_002950215.1.fa.gz GCF_009759685.1.fa.gz GCF_000148585.2.fa.gz GCF_001027105.1.fa.gz GCF_001544255.1.fa.gz GCF_003697165.2.fa.gz GCF_900638025.1.fa.gz Building an index with genomes from a directory.\nlexicmap index -I refs/ -O demo.lmi It would take about 6 seconds and 3 GB RAM in a 16-CPU PC.\nOptionally, we can also use a file list as the input.\n$ head -n 3 files.txt refs/GCF_000006945.2.fa.gz refs/GCF_000017205.1.fa.gz refs/GCF_000148585.2.fa.gz lexicmap index -X files.txt -O demo.lmi Click to show the log of a demo run. ... # here we set a small --batch-size 5 $ lexicmap index -I refs/ -O demo.lmi --batch-size 5 16:22:49.745 [INFO] LexicMap v0.4.0 (14c2606) 16:22:49.745 [INFO] https://github.com/shenwei356/LexicMap 16:22:49.745 [INFO] 16:22:49.745 [INFO] checking input files ... 
16:22:49.745 [INFO] 15 input file(s) given 16:22:49.745 [INFO] 16:22:49.745 [INFO] --------------------- [ main parameters ] --------------------- 16:22:49.745 [INFO] 16:22:49.745 [INFO] input and output: 16:22:49.745 [INFO] input directory: refs/ 16:22:49.745 [INFO] regular expression of input files: (?i)\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$ 16:22:49.745 [INFO] *regular expression for extracting reference name from file name: (?i)(.+)\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$ 16:22:49.745 [INFO] *regular expressions for filtering out sequences: [] 16:22:49.745 [INFO] max genome size: 15000000 16:22:49.745 [INFO] output directory: demo.lmi 16:22:49.745 [INFO] 16:22:49.745 [INFO] mask generation: 16:22:49.745 [INFO] k-mer size: 31 16:22:49.745 [INFO] number of masks: 40000 16:22:49.745 [INFO] rand seed: 1 16:22:49.745 [INFO] prefix length for checking low-complexity in mask generation: 15 16:22:49.745 [INFO] 16:22:49.745 [INFO] seed data: 16:22:49.745 [INFO] maximum sketching desert length: 450 16:22:49.745 [INFO] distance of k-mers to fill deserts: 150 16:22:49.745 [INFO] seeds data chunks: 16 16:22:49.745 [INFO] seeds data indexing partitions: 1024 16:22:49.745 [INFO] 16:22:49.745 [INFO] general: 16:22:49.745 [INFO] genome batch size: 5 16:22:49.745 [INFO] batch merge threads: 8 16:22:49.745 [INFO] 16:22:49.745 [INFO] 16:22:49.745 [INFO] --------------------- [ generating masks ] --------------------- 16:22:50.180 [INFO] 16:22:50.180 [INFO] --------------------- [ building index ] --------------------- 16:22:50.328 [INFO] 16:22:50.328 [INFO] ------------------------[ batch 1/3 ]------------------------ 16:22:50.328 [INFO] building index for batch 1 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:51.192 [INFO] writing seeds... 
16:22:51.264 [INFO] finished writing seeds in 71.756662ms 16:22:51.264 [INFO] finished building index for batch 1 in: 935.464336ms 16:22:51.264 [INFO] 16:22:51.264 [INFO] ------------------------[ batch 2/3 ]------------------------ 16:22:51.264 [INFO] building index for batch 2 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:53.126 [INFO] writing seeds... 16:22:53.212 [INFO] finished writing seeds in 86.823785ms 16:22:53.212 [INFO] finished building index for batch 2 in: 1.948770015s 16:22:53.212 [INFO] 16:22:53.212 [INFO] ------------------------[ batch 3/3 ]------------------------ 16:22:53.212 [INFO] building index for batch 3 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:54.350 [INFO] writing seeds... 16:22:54.437 [INFO] finished writing seeds in 87.058101ms 16:22:54.437 [INFO] finished building index for batch 3 in: 1.224414126s 16:22:54.437 [INFO] 16:22:54.437 [INFO] merging 3 indexes... 16:22:54.437 [INFO] [round 1] 16:22:54.437 [INFO] batch 1/1, merging 3 indexes to demo.lmi.tmp/r1_b1 with 8 threads... 
16:22:54.613 [INFO] [round 1] finished in 175.640164ms 16:22:54.613 [INFO] rename demo.lmi.tmp/r1_b1 to demo.lmi 16:22:54.620 [INFO] 16:22:54.620 [INFO] finished building LexicMap index from 15 files with 40000 masks in 4.875616203s 16:22:54.620 [INFO] LexicMap index saved: demo.lmi 16:22:54.620 [INFO] 16:22:54.620 [INFO] elapsed time: 4.875654824s 16:22:54.620 [INFO] Output The LexicMap index is a directory with multiple files.\nFile structure $ tree demo.lmi/ demo.lmi/ # the index directory ├── genomes # directory of genome data │ └── batch_0000 # genome data of one batch │ ├── genomes.bin # genome data file, containing genome ID, size, sequence lengths, bit-packed sequences │ └── genomes.bin.idx # index of genome data file, for fast subsequence extraction ├── seeds # seed data: pairs of k-mer and its location information (genome batch, genome number, location, strand) │ ├── chunk_000.bin # seed data file │ ├── chunk_000.bin.idx # index of seed data file, for fast seed searching and data extraction ... ... ... │ ├── chunk_015.bin # the number of chunks is set by flag `-c/--chunks`, default: #cpus │ └── chunk_015.bin.idx ├── genomes.map.bin # mapping genome ID to batch number of genome number in the batch ├── info.toml # summary of the index └── masks.bin # mask data Index size LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance.\nNote that the index size is not linear with the number of genomes, it\u0026rsquo;s sublinear. 
Because the seed data are compressed with the VARINT-GB algorithm, more genomes bring higher compression rates.\nDemo data # 15 genomes demo.lmi: 73.30 MB (73,297,328) 59.41 MB seeds 13.57 MB genomes 320.03 kB masks.bin 375 B genomes.map.bin 323 B info.toml GTDB repr # 85,205 genomes gtdb_repr.lmi: 228.15 GB (228,149,871,198) 156.44 GB seeds 71.71 GB genomes 2.13 MB genomes.map.bin 320.03 kB masks.bin 329 B info.toml GTDB complete # 402,538 genomes gtdb_complete.lmi: 972.85 GB (972,854,821,322) 583.10 GB seeds 389.74 GB genomes 10.06 MB genomes.map.bin 320.03 kB masks.bin 330 B info.toml Genbank\u0026#43;RefSeq # 2,340,672 genomes genbank_refseq.lmi: 5.43 TB (5,428,824,803,581) 3.04 TB seeds 2.38 TB genomes 821.17 MB kmers-m12345.tsv 58.52 MB genomes.map.bin 320.03 kB masks.bin 332 B info.toml AllTheBacteria HQ # 1,858,610 genomes atb_hq.lmi: 4.26 TB (4,261,437,129,065) 2.32 TB seeds 1.94 TB genomes 41.12 MB genomes.map.bin 320.03 kB masks.bin 332 B info.toml Directory/file sizes are counted with https://github.com/shenwei356/dirsize v1.2.1 (dirsize -k, base: 1000). Index building parameters: -k 31 -m 40000. Genome batch size: -b 5000 for GTDB datasets, -b 25000 for others. Explore the index We provide several commands to explore the index data and extract indexed subsequences:\nlexicmap utils genomes can list genome IDs of indexed genomes, see the usage and example. lexicmap utils masks can list masks of the index, see the usage and example. lexicmap utils kmers can list details of all seeds (k-mers), including reference, location(s), the strand, and the k-mer direction. See the usage and example. lexicmap utils seed-pos can help to explore the seed positions, see the usage and example. Before that, the flag --save-seed-pos needs to be added to lexicmap index. lexicmap utils subseq can extract subsequences via genome ID, sequence ID and positions, see the usage and example. 
What\u0026rsquo;s next: Searching ","description":"Terminology differences:\nOn this page and in the LexicMap command line options, the term \u0026ldquo;mask\u0026rdquo; is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use \u0026ldquo;probe\u0026rdquo; as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names."},{"id":28,"href":"/LexicMap/tags/","title":"Tags","parent":"","content":"","description":""}]
\ No newline at end of file
+[{"id":0,"href":"/LexicMap/tutorials/misc/index-gtdb/","title":"Indexing GTDB","parent":"More","content":"Info:\nhttps://gtdb.ecogenomic.org/ Tools:\nhttps://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\ntime genome_updater.sh -d \u0026quot;refseq,genbank\u0026quot; -g \u0026quot;archaea,bacteria\u0026quot; \\ -f \u0026quot;genomic.fna.gz\u0026quot; -o \u0026quot;GTDB_complete\u0026quot; -M \u0026quot;gtdb\u0026quot; -t 12 -m -L curl cd GTDB_complete/2024-01-30_19-34-40/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0026quot;*.gz\u0026quot; \\ fd \u0026quot;.gz$\u0026quot; $genomes \\ | rush --eta 'seqkit seq -w 0 {} \u0026gt; /dev/null; if [ $? -ne 0 ]; then echo {}; fi' \\ \u0026gt; failed.txt # empty files find $genomes -name \u0026quot;*.gz\u0026quot; -size 0 \u0026gt;\u0026gt; failed.txt # delete these files cat failed.txt | rush '/bin/rm {}' # redownload them: # run the genome_updater command again, with the flag -i Indexing. On a 48-CPU machine, time: 11 h, ram: 64 GB, index size: 906 GB. 
If you don\u0026rsquo;t have enough memory, please decrease the value of -b.\nlexicmap index \\ -I files/ \\ --ref-name-regexp '^(\\w{3}_\\d{9}\\.\\d+)' \\ -O gtdb_complete.lmi --log gtdb_complete.lmi.log \\ -b 5000 Files:\n$ du -sh files gtdb_complete.lmi --apparent-size 413G files 907G gtdb_complete.lmi $ dirsize gtdb_complete.lmi gtdb_complete.lmi: 906.14 GiB (972,962,162,476) 543.06 GiB seeds 362.98 GiB genomes 102.37 MiB kmers-m12345.tsv 9.60 MiB genomes.map.bin 312.53 KiB masks.bin 330 B info.toml ","description":"Info:\nhttps://gtdb.ecogenomic.org/ Tools:\nhttps://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\ntime genome_updater.sh -d \u0026quot;refseq,genbank\u0026quot; -g \u0026quot;archaea,bacteria\u0026quot; \\ -f \u0026quot;genomic.fna.gz\u0026quot; -o \u0026quot;GTDB_complete\u0026quot; -M \u0026quot;gtdb\u0026quot; -t 12 -m -L curl cd GTDB_complete/2024-01-30_19-34-40/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0026quot;*.gz\u0026quot; \\ fd \u0026quot;.gz$\u0026quot; $genomes \\ | rush --eta 'seqkit seq -w 0 {} \u0026gt; /dev/null; if [ $?"},{"id":1,"href":"/LexicMap/usage/utils/masks/","title":"masks","parent":"utils","content":"$ lexicmap utils masks -h View masks of the index or generate new masks randomly Usage: lexicmap utils masks [flags] { -d \u0026lt;index path\u0026gt; | [-k \u0026lt;k\u0026gt;] [-n \u0026lt;masks\u0026gt;] [-s \u0026lt;seed\u0026gt;] } [-o out.tsv.gz] Flags: -h, --help help for masks -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -m, --masks int ► Number of masks. 
(default 40000) -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) -p, --prefix int ► Length of mask k-mer prefix for checking low-complexity (0 for no checking). (default 15) -s, --seed int ► The seed for generating random masks. (default 1) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples $ lexicmap utils masks --quiet -d demo.lmi/ | head -n 10 1 AAAAAAAAGTCACTTGACAATCCACACGGTG 2 AAAAAAACTGCTTGCACCTTTCTCGCCTCTC 3 AAAAAAATTCTCGGCGGTGTTTCCAGGCGCA 4 AAAAAACCCAAGCGCGAAAGCCTGAACAACC 5 AAAAAACGTGGCGTCCCCTGTATAACGGCTA 6 AAAAAAGAGGGGAAGCAAGCTGAAGGATATG 7 AAAAAAGCTTAGTGTGAATGAATGGCTTCCG 8 AAAAAATCCAGGGTTCCGTTAAGGATCTGTC 9 AAAAAATGCCTCGCAGAGCAGGCTATGCTGA 10 AAAAAATTGATTCTTAGAGCGTTCCCGCCCA $ lexicmap utils masks --quiet -d demo.lmi/ | tail -n 10 39991 TTTTTTACACGCTGTGACTGCATTACAAAAA 39992 TTTTTTAGCCAGGGTTCACAGCGCCAAAACA 39993 TTTTTTATCGGACGCCAAGTTTGTAATCGTC 39994 TTTTTTCACTCGCATCTAGGAAGGAAGCATA 39995 TTTTTTCTTGCATCGTATTCAGCACGTTCCT 39996 TTTTTTGCCGAGTGACCCCGAAAAGCTCACA 39997 TTTTTTGGCGTGAGGCATTGTTTACTGCCTT 39998 TTTTTTTAAGTGGTCGTGGTAGGAGCCTCAC 39999 TTTTTTTCCGTAACTAGGTTCTGGCGATTCC 40000 TTTTTTTGAGGGTATAAGATAGAGAAAAGCT # check a specific mask $ lexicmap utils masks --quiet -d demo.lmi/ -m 12345 12345 CATTAGTAGAAGAAGGCACAATGTATCGTCG Freqency of prefixes.\n$ lexicmap utils masks --quiet -d demo.lmi/ \\ | csvtk mutate -Ht -f 2 -p \u0026#39;^(.{7})\u0026#39; \\ | csvtk freq -Ht -f 3 -nr \\ | head -n 10 AAAAAAA 3 AAAAAAT 3 AAAAACA 3 AAAAACC 3 AAAAACG 3 AAAAACT 3 AAAAAGC 3 AAAAAGG 3 AAAAAGT 3 AAAAATT 3 $ lexicmap utils masks --quiet -d demo.lmi/ \\ | csvtk 
mutate -Ht -f 2 -p \u0026#39;^(.{7})\u0026#39; \\ | csvtk freq -Ht -f 3 -n \\ | head -n 10 AAAAAAC 2 AAAAAAG 2 AAAAAGA 2 AAAAATA 2 AAAAATC 2 AAAAATG 2 AAAACAC 2 AAAACAT 2 AAAACCG 2 AAAACGC 2 Frequency of frequencies, i.e., for 40,000 masks, there are 4^7 = 16384 prefixes. All 16,384 prefixes are used at least twice, and 7,232 of them are used 3 times.\n$ lexicmap utils masks --quiet -d demo.lmi/ | csvtk mutate -Ht -f 2 -p \u0026#39;^(.{7})\u0026#39; | csvtk freq -Ht -f 3 -n | csvtk freq -Ht -f 2 -k 2 9152 3 7232 ","description":"$ lexicmap utils masks -h View masks of the index or generate new masks randomly Usage: lexicmap utils masks [flags] { -d \u0026lt;index path\u0026gt; | [-k \u0026lt;k\u0026gt;] [-n \u0026lt;masks\u0026gt;] [-s \u0026lt;seed\u0026gt;] } [-o out.tsv.gz] Flags: -h, --help help for masks -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -m, --masks int ► Number of masks."},{"id":2,"href":"/LexicMap/usage/index/","title":"index","parent":"Usage","content":" Terminology differences In the LexicMap source code and command line options, the term \u0026ldquo;mask\u0026rdquo; is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use \u0026ldquo;probe\u0026rdquo; as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Usage $ lexicmap index -h Generate an index from FASTA/Q sequences Input: *1. Sequences of each reference genome should be saved in separate FASTA/Q files, with reference identifiers in the file names. 2. Input plain or gzip/xz/zstd/bzip2 compressed FASTA/Q files can be given via positional arguments or the flag -X/--infile-list with a list of input files. Flag -S/--skip-file-check is optional for skipping file checking if you trust the file list. 3. 
Input can also be a directory containing sequence files via the flag -I/--in-dir, with multiple-level sub-directories allowed. A regular expression for matching sequencing files is available via the flag -r/--file-regexp. 4. Some non-isolate assemblies might have extremely large genomes (e.g., GCA_000765055.1, \u0026gt;150 mb). The flag -g/--max-genome is used to skip these input files, and the file list would be written to a file (-G/--big-genomes). You need to increase the value for indexing fungi genomes. 5. Maximum genome size: 268,435,456. More precisely: $total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456, as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index. 6. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value). Attention: *1) ► You can rename the sequence files for convenience, e.g., GCF_000017205.1.fa.gz, because the genome identifiers in the index and search result would be: the basenames of files with common FASTA/Q file extensions removed, which are extracted via the flag -N/--ref-name-regexp. ► The extracted genome identifiers better be distinct, which will be shown in search results and are used to extract subsequences in the command \u0026#34;lexicmap utils subseq\u0026#34;. 2) ► Unwanted sequences like plasmids can be filtered out by content in FASTA/Q header via regular expressions (-B/--seq-name-filter). 3) All degenerate bases are converted to their lexicographic first bases. E.g., N is converted to A. code bases saved A A A C C C G G G T/U T T M A/C A R A/G A W A/T A S C/G C Y C/T C K G/T G V A/C/G A H A/C/T A D A/G/T A B C/G/T C N A/C/G/T A Important parameters: --- Genome data --- *1. -b/--batch-size, ► Maximum number of genomes in each batch (maximum: 131072, default: 5000). ► If the number of input files exceeds this number, input files are split into multiple batches and indexes are built for all batches. 
In the end, seed files are merged, while genome data files are kept unchanged and collected. ■ Bigger values increase indexing memory occupation and increase batch searching speed, while single query searching speed is not affected. --- LexicHash mask generation --- 0. -M/--mask-file, ► File with custom masks, which could be exported from an existing index or newly generated by \u0026#34;lexicmap utils masks\u0026#34;. This flag oversides -k/--kmer, -m/--masks, -s/--rand-seed, etc. *1. -k/--kmer, ► K-mer size (maximum: 32, default: 31). ■ Bigger values improve the search specificity and do not increase the index size. *2. -m/--masks, ► Number of LexicHash masks (default: 40000). ■ Bigger values improve the search sensitivity, increase the index size, and slow down the search speed. --- Seeds data (k-mer-value data) --- *1. --seed-max-desert ► Maximum length of distances between seeds (default: 200). The default value of 200 guarantees queries \u0026gt;=200 bp would match at least one seed. ► Large regions with no seeds are called sketching deserts. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every --seed-in-desert-dist (50 by default) bases. ■ Big values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected. 2. -c/--chunks, ► Number of seed file chunks (maximum: 128, default: #CPUs). ► Bigger values accelerate the search speed at the cost of a high disk reading load. The maximum number should not exceed the maximum number of open files set by the operating systems. *3. -J/--seed-data-threads ► Number of threads for writing seed data and merging seed chunks from all batches (maximum: -c/--chunks, default: 8). ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation. 4. 
--partitions, ► Number of partitions for indexing each seed file (default: 1024). ► Bigger values bring a little higher memory occupation. ► After indexing, \u0026#34;lexicmap utils reindex-seeds\u0026#34; can be used to reindex the seeds data with another value of this flag. 5. --max-open-files, ► Maximum number of open files (default: 512). ► It\u0026#39;s only used in merging indexes of multiple genome batches. Usage: lexicmap index [flags] [-k \u0026lt;k\u0026gt;] [-m \u0026lt;masks\u0026gt;] { -I \u0026lt;seqs dir\u0026gt; | -X \u0026lt;file list\u0026gt;} -O \u0026lt;out dir\u0026gt; Flags: -b, --batch-size int ► Maximum number of genomes in each batch (maximum value: 131072) (default 5000) -G, --big-genomes string ► Out file of skipped files with $total_bases + ($num_contigs - 1) * $contig_interval \u0026gt;= -g/--max-genome. The second column is one of the skip types: no_valid_seqs, too_large_genome, too_many_seqs. -c, --chunks int ► Number of chunks for storing seeds (k-mer-value data) files. (default 16) --contig-interval int ► Length of interval (N\u0026#39;s) between contigs in a genome. (default 1000) -r, --file-regexp string ► Regular expression for matching sequence files in -I/--in-dir, case ignored. Attention: use double quotation marks for patterns containing commas, e.g., -p \u0026#39;\u0026#34;A{2,}\u0026#34;\u0026#39;. (default \u0026#34;\\\\.(f[aq](st[aq])?|fna)(\\\\.gz|\\\\.xz|\\\\.zst|\\\\.bz2)?$\u0026#34;) --force ► Overwrite existing output directory. -h, --help help for index -I, --in-dir string ► Input directory containing FASTA/Q files. Directory and file symlinks are followed. -k, --kmer int ► Maximum k-mer size. K needs to be \u0026lt;= 32. (default 31) -M, --mask-file string ► File of custom masks. This flag oversides -k/--kmer, -m/--masks, -s/--rand-seed etc. -m, --masks int ► Number of LexicHash masks. (default 40000) -g, --max-genome int ► Maximum genome size. 
Extremely large genomes (e.g., non-isolate assemblies from Genbank) will be skipped. Need to be smaller than the maximum supported genome size: 268435456 (default 15000000) --max-open-files int ► Maximum opened files, used in merging indexes. (default 512) -l, --min-seq-len int ► Maximum sequence length to index. The value would be k for values \u0026lt;= 0 (default -1) --no-desert-filling ► Disable sketching desert filling (only for debug). -O, --out-dir string ► Output LexicMap index directory. --partitions int ► Number of partitions for indexing seeds (k-mer-value data) files. The value needs to be the power of 4. (default 1024) -s, --rand-seed int ► Rand seed for generating random masks. (default 1) -N, --ref-name-regexp string ► Regular expression (must contains \u0026#34;(\u0026#34; and \u0026#34;)\u0026#34;) for extracting the reference name from the filename. Attention: use double quotation marks for patterns containing commas, e.g., -p \u0026#39;\u0026#34;A{2,}\u0026#34;\u0026#39; (default \u0026#34;(?i)(.+)\\\\.(f[aq](st[aq])?|fna)(\\\\.gz|\\\\.xz|\\\\.zst|\\\\.bz2)?$\u0026#34;) --save-seed-pos ► Save seed positions, which can be inspected with \u0026#34;lexicmap utils seed-pos\u0026#34;. -J, --seed-data-threads int ► Number of threads for writing seed data and merging seed chunks from all batches, the value should be in range of [1, -c/--chunks] (default 8) -d, --seed-in-desert-dist int ► Distance of k-mers to fill deserts. (default 50) -D, --seed-max-desert int ► Maximum length of sketching deserts, or maximum seed distance. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every --seed-in-desert-dist bases. (default 200) -B, --seq-name-filter strings ► List of regular expressions for filtering out sequences by contents in FASTA/Q header/name, case ignored. -S, --skip-file-check ► Skip input file checking when given files or a file list. 
Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples See Building an index ","description":"Terminology differences In the LexicMap source code and command line options, the term \u0026ldquo;mask\u0026rdquo; is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use \u0026ldquo;probe\u0026rdquo; as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Usage $ lexicmap index -h Generate an index from FASTA/Q sequences Input: *1."},{"id":3,"href":"/LexicMap/tutorials/misc/index-genbank/","title":"Indexing GenBank+RefSeq","parent":"More","content":"Make sure you have enough disk space, \u0026gt;10 TB is preferred.\nTools:\nhttps://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\ntime genome_updater.sh -d \u0026quot;refseq,genbank\u0026quot; -g \u0026quot;archaea,bacteria\u0026quot; \\ -f \u0026quot;genomic.fna.gz\u0026quot; -o \u0026quot;genbank\u0026quot; -M \u0026quot;ncbi\u0026quot; -t 12 -m -L curl cd genbank/2024-02-15_11-00-51/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0026quot;*.gz\u0026quot; \\ fd \u0026quot;.gz$\u0026quot; $genomes \\ | rush --eta 'seqkit seq -w 0 {} \u0026gt; /dev/null; if [ $? 
-ne 0 ]; then echo {}; fi' \\ \u0026gt; failed.txt # empty files find $genomes -name \u0026quot;*.gz\u0026quot; -size 0 \u0026gt;\u0026gt; failed.txt # delete these files cat failed.txt | rush '/bin/rm {}' # redownload them: # run the genome_updater command again, with the flag -i Indexing. On a 48-CPU machine, time: 54 h, ram: 178 GB, index size: 4.94 TB. If you don\u0026rsquo;t have enough memory, please decrease the value of -b.\nlexicmap index \\ -I files/ \\ --ref-name-regexp '^(\\w{3}_\\d{9}\\.\\d+)' \\ -O genbank_refseq.lmi --log genbank_refseq.lmi.log \\ -b 25000 ","description":"Make sure you have enough disk space, \u0026gt;10 TB is preferred.\nTools:\nhttps://github.com/pirovc/genome_updater, for downloading genomes https://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\ntime genome_updater.sh -d \u0026quot;refseq,genbank\u0026quot; -g \u0026quot;archaea,bacteria\u0026quot; \\ -f \u0026quot;genomic.fna.gz\u0026quot; -o \u0026quot;genbank\u0026quot; -M \u0026quot;ncbi\u0026quot; -t 12 -m -L curl cd genbank/2024-02-15_11-00-51/ # ----------------- check the file integrity ----------------- genomes=files # corrupted files # find $genomes -name \u0026quot;*.gz\u0026quot; \\ fd \u0026quot;.gz$\u0026quot; $genomes \\ | rush --eta 'seqkit seq -w 0 {} \u0026gt; /dev/null; if [ $?"},{"id":4,"href":"/LexicMap/introduction/","title":"Introduction","parent":"","content":" LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.\nPreprint:\nWei Shen and Zamin Iqbal. (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes. bioRxiv. 
https://doi.org/10.1101/2024.08.30.610459\nTable of contents Table of contents Features Introduction Quick start Performance Indexing Searching Installation Algorithm overview Citation Support License Related projects Features LexicMap is scalable to up to millions of prokaryotic genomes. The sensitivity of LexicMap is comparable with Blastn. The alignment is fast and memory-efficient. LexicMap is easy to install, we provide binary files with no dependencies for Linux, Windows, MacOS (x86 and arm CPUs). LexicMap is easy to use (tutorials and usages). Both tabular and Blast-style output formats are available. Besides, we provide several commands to explore the index data and extract indexed subsequences. Introduction Motivation: Alignment against a database of genomes is a fundamental operation in bioinformatics, popularised by BLAST. However, given the increasing rate at which genomes are sequenced, existing tools struggle to scale.\nExisting full alignment tools face challenges of high memory consumption and slow speeds. Alignment-free large-scale sequence searching tools only return the matched genomes, without the vital positional information for downstream analysis. Prefilter+Align strategies have the sensitivity issue in the prefiltering step. Methods: (algorithm overview)\nAn improved version of the sequence sketching method LexicHash is adopted to compute alignment seeds accurately and efficiently. We solved the sketching deserts problem of LexicHash seeds to provide a window guarantee. We added the support of suffix matching of seeds, making seeds much more tolerant to mutations. Any 31-bp seed with a common ≥15 bp prefix or suffix can be matched, which means seeds are immune to any single SNP. A hierarchical index enables fast and low-memory variable-length seed matching (prefix + suffix matching). A pseudo alignment algorithm is used to find similar sequence regions from chaining results for alignment. 
A reimplemented Wavefront alignment algorithm is used for base-level alignment. Results:\nLexicMap enables efficient indexing and searching of both RefSeq+GenBank and the AllTheBacteria datasets (2.3 and 1.9 million prokaryotic assemblies respectively). Running at this scale has previously only been achieved by Phylign (previously called mof-search), which compresses genomes with phylogenetic information and provides searching (prefiltering with COBS and alignment with minimap2).\nFor searching in all 2,340,672 Genbank+Refseq prokaryotic genomes, Blastn is unable to run with this dataset on common servers as it requires \u0026gt;2000 GB RAM (see performance).\nWith LexicMap (48 CPUs),\nQuery Genome hits Time RAM A 1.3-kb marker gene 37,164 36 s 4.1 GB A 1.5-kb 16S rRNA 1,949,496 10 m 41 s 14.1 GB A 52.8-kb plasmid 544,619 19 m 20 s 19.3 GB 1003 AMR genes 25,702,419 187 m 40 s 55.4 GB Quick start Building an index (see the tutorial of building an index).\n# From a directory with multiple genome files lexicmap index -I genomes/ -O db.lmi # From a file list with one file per line lexicmap index -X files.txt -O db.lmi Querying (see the tutorial of searching).\n# For short queries like genes or long reads, returning top N hits. lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 # For longer queries like plasmids, returning all hits. lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Sample output (queries are a few Nanopore Q20 reads). 
See output format details.\nquery qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ------------------ ---- ---- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- ------- ERR5396170.1000016 740 1 GCF_013394085.1 NZ_CP040910.1 89.595 1 89.595 663 99.246 0 71 733 13515 14177 + 1887974 ERR5396170.1000000 698 1 GCF_001457615.1 NZ_LN831024.1 85.673 1 85.673 603 98.010 5 53 650 4452083 4452685 + 6316979 ERR5396170.1000017 516 1 GCF_013394085.1 NZ_CP040910.1 94.574 1 94.574 489 99.591 2 27 514 293509 293996 + 1887974 ERR5396170.1000012 848 1 GCF_013394085.1 NZ_CP040910.1 95.165 1 95.165 811 97.411 7 22 828 190329 191136 - 1887974 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 1 60.000 973 95.889 13 365 1333 88793 89756 - 2884551 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 2 4.706 76 98.684 0 266 341 89817 89892 - 2884551 ERR5396170.1000036 1159 1 GCF_013394085.1 NZ_CP040910.1 95.427 1 95.427 1107 99.729 1 32 1137 1400097 1401203 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 1 86.486 707 99.151 3 104 807 242235 242941 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 2 86.486 707 98.444 3 104 807 1138777 1139483 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 3 84.152 688 98.983 4 104 788 154620 155306 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 4 84.029 687 99.127 3 104 787 32477 33163 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 5 72.727 595 98.992 3 104 695 1280183 1280777 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 6 11.671 95 100.000 0 693 787 1282480 1282574 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 7 82.064 671 99.106 3 120 787 1768782 1769452 + 1887974 CIGAR string, aligned query and subject sequences can be outputted as extra columns via the flag 
-a/--all.\n# Extracting similar sequences for a query gene. # search matches with query coverage \u0026gt;= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta -o results.tsv \\ --min-qcov-per-hsp 90 --all # extract matched sequences as FASTA format sed 1d results.tsv | awk -F'\\t' '{print \u0026quot;\u0026gt;\u0026quot;$5\u0026quot;:\u0026quot;$14\u0026quot;-\u0026quot;$15\u0026quot;:\u0026quot;$16\u0026quot;\\n\u0026quot;$20;}' \\ | seqkit seq -g \u0026gt; results.fasta seqkit head -n 1 results.fasta | head -n 3 \u0026gt;NZ_JALSCK010000007.1:39224-40522:- TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT Export blast-style format:\nseqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 1 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast --kv-file-genome ass2species.map Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Staphylococcus epidermidis Query coverage per genome = 92.575% \u0026gt;NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| ||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677 Learn more tutorials and usages.\nPerformance Indexing dataset genomes gzip_size tool db_size time RAM GTDB complete 402,538 443 GB LexicMap 973 GB 10 h 36 m 63.3 GB Blastn 387 GB 3 h 11 m 718 MB AllTheBacteria HQ 1,858,610 2.5 TB LexicMap 4.26 TB 48 h 08 m 88.6 GB Blastn 1.93 TB 14 h 03 m 2.9 GB Phylign 248 GB / / Genbank+RefSeq 2,340,672 2.7 TB LexicMap 5.43 TB 54 h 33 m 178.3 GB Blastn 2.37 TB 14 h 04 m 4.3 GB Notes:\nAll files are stored on a server with HDD disks. No files are cached in memory. Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz). LexicMap index building parameters: -k 31 -m 40000. Genome batch size: -b 5000 for GTDB datasets, -b 25000 for others. Searching Blastn failed to run as it requires \u0026gt;2000GB RAM for Genbank+RefSeq and AllTheBacteria datasets. 
Phylign only has the index for AllTheBacteria HQ dataset.\nGTDB complete (402,538 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 5,170 5,143 17 s 1.4 GB Blastn 7,121 6,177 2,171 s 351.2 GB a 16S rRNA gene 1,542 bp LexicMap 303,925 278,141 235 s 4.4 GB Blastn 301,197 277,042 2,353 s 378.4 GB a plasmid 52,830 bp LexicMap 63,108 1,190 499 s 4.6 GB Blastn 69,311 2,308 2,262 s 364.7 GB 1033 AMR genes 1 kb (median) LexicMap 3,867,003 2,228,339 4,350 s 16.3 GB Blastn 5,357,772 2,240,766 4,686 s 442.1 GB AllTheBacteria HQ (1,858,610 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 27,963 27,953 31 s 3.4 GB Phylign_local 7,936 30 m 48 s 77.6 GB Phylign_cluster 7,936 28 m 33 s a 16S rRNA gene 1,542 bp LexicMap 1,857,761 1,740,000 9 m 36 s 14.9 GB Phylign_local 1,017,765 130 m 33 s 77.0 GB Phylign_cluster 1,017,765 86 m 41 s a plasmid 52,830 bp LexicMap 468,821 3,618 15 m 55 s 15.7 GB Phylign_local 46,822 47 m 33 s 82.6 GB Phylign_cluster 46,822 39 m 34 s 1033 AMR genes 1 kb (median) LexicMap 21,288,000 12,148,642 138 m 55 s 49.9 GB Phylign_local 1,135,215 156 m 08 s 85.9 GB Phylign_cluster 1,135,215 133 m 49 s Genbank+RefSeq (2,340,672 genomes):\nquery query_len tool genome_hits genome_hits(qcov\u0026gt;50) time RAM a marker gene 1,299 bp LexicMap 37,164 37,082 36 s 4.1 GB a 16S rRNA gene 1,542 bp LexicMap 1,949,496 1,381,974 10 m 41 s 14.1 GB a plasmid 52,830 bp LexicMap 544,619 6,563 19 m 20 s 19.3 GB 1033 AMR genes 1 kb (median) LexicMap 25,702,419 14,692,624 187 m 40 s 55.4 GB Notes:\nAll files are stored on a server with HDD disks. No files are cached in memory. Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz). Main searching parameters: LexicMap v0.4.0: --threads 48 --top-n-genomes 0 --min-qcov-per-genome 0 --min-qcov-per-hsp 0 --min-match-pident 70. 
Blastn v2.15.0+: -num_threads 48 -max_target_seqs 10000000. Phylign (AllTheBacteria fork 9fc65e6): threads: 48, cobs_kmer_thres: 0.33, minimap_preset: \u0026quot;asm20\u0026quot;, nb_best_hits: 5000000, max_ram_gb: 100; For cluster, maximum number of slurm jobs is 100. Installation LexicMap is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.\nOr install with conda:\nconda install -c bioconda lexicmap Algorithm overview Citation Wei Shen and Zamin Iqbal. (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes. bioRxiv. https://doi.org/10.1101/2024.08.30.610459\nSupport Please open an issue to report bugs, propose new functions or ask for help.\nLicense MIT License\nRelated projects High-performance LexicHash computation in Go. Wavefront alignment algorithm (WFA) in Golang. ","description":"LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.\nPreprint:\nWei Shen and Zamin Iqbal. (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes. bioRxiv. https://doi.org/10.1101/2024.08.30.610459\nTable of contents Table of contents Features Introduction Quick start Performance Indexing Searching Installation Algorithm overview Citation Support License Related projects Features LexicMap is scalable to up to millions of prokaryotic genomes."},{"id":5,"href":"/LexicMap/usage/utils/kmers/","title":"kmers","parent":"utils","content":"$ lexicmap utils kmers -h View k-mers captured by the masks Attention: 1. Mask index (column mask) is 1-based. 2. Prefix means the length of shared prefix between a k-mer and the mask. 3. K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. 4. 
Reversed means if the k-mer is reversed for suffix matching. Usage: lexicmap utils kmers [flags] -d \u0026lt;index path\u0026gt; [-m \u0026lt;mask index\u0026gt;] [-o out.tsv.gz] Flags: -h, --help help for kmers -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -m, --mask int ► View k-mers captured by Xth mask. (0 for all) (default 1) -f, --only-forward ► Only output forward k-mers. -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples The default output is captured k-mers of the first mask.\n$ lexicmap utils kmers --quiet -d demo.lmi/ | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ---- ------------------------------- ------ ------ --------------- ------- ------ -------- 1 AAAAAAAAAAAAAAAGACAACAAAAGACATA 8 1 GCF_002950215.1 3870418 - yes 1 AAAAAAAAACAAACATTTGCGGCGGGGCCAT 8 1 GCF_000742135.1 2043044 + no 1 AAAAAAAAACCAGAAATCACACGCCAACTCC 8 1 GCF_002949675.1 1345415 + yes 1 AAAAAAAAACGATTATCCTCAATTAATTTCT 8 1 GCF_000392875.1 814251 + no 1 AAAAAAAAACGCTTCTACATCGAGCAGCGAG 8 1 GCF_001457655.1 941619 + yes 1 AAAAAAAAACGCTTTGTAACTCGATTGATAG 8 1 GCF_009759685.1 997945 + yes 1 AAAAAAAAACTGCTGTCCCTGGTCCGTCAGG 8 1 GCF_002950215.1 4262890 - yes 1 AAAAAAAAAGATTTGATTTTTTTCATTAATA 8 1 GCF_000392875.1 766998 - yes 1 AAAAAAAAAGCATTTTTTCGATCTCTTTACG 8 1 GCF_000392875.1 1623731 + yes 1 AAAAAAAAAGTTTCCGGGACACTACCTAACC 8 1 GCF_000017205.1 5804200 - yes 1 AAAAAAAAATTATTTTGCTAATCAATAGGTC 8 1 GCF_000006945.2 4886411 - yes 1 AAAAAAAACAAAGAATTATTACACAACATTC 
8 1 GCF_003697165.2 4055655 + yes 1 AAAAAAAACACGGACTTATTGAAATCGTATT 8 1 GCF_000392875.1 746746 + yes 1 AAAAAAAACCAACTTTGAAAAAAGTAATGTA 8 1 GCF_000148585.2 917529 - yes 1 AAAAAAAACCATATTATGTCCGATCCTCACA 8 1 GCF_000392875.1 1060650 + yes 1 AAAAAAAACCCGCCGAAGCGGGTTTTTTTAT 8 1 GCF_000742135.1 1612499 + no 1 AAAAAAAACCTAATGGTAAATAACGTTTTGG 8 1 GCF_006742205.1 2346818 + yes 1 AAAAAAAACGAAAAACGGTAACACGGGAATT 8 1 GCF_001544255.1 1605298 + yes 1 AAAAAAAACGACTCCAGAGAGATCATCGTAT 8 1 GCF_000392875.1 1279686 + yes Only forward k-mers.\n$ lexicmap utils kmers --quiet -d demo.lmi/ -f | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ---- ------------------------------- ------ ------ --------------- ------- ------ -------- 1 AAAAAAAAACAAACATTTGCGGCGGGGCCAT 8 1 GCF_000742135.1 2043044 + no 1 AAAAAAAAACGATTATCCTCAATTAATTTCT 8 1 GCF_000392875.1 814251 + no 1 AAAAAAAACCCGCCGAAGCGGGTTTTTTTAT 8 1 GCF_000742135.1 1612499 + no 1 AAAAAAAACGGTTCAGCTGACCAGCCAGCTG 8 1 GCF_002950215.1 401140 + no 1 AAAAAAAAGAACAAATTCGAGGAAAAAGAAG 9 1 GCF_001027105.1 1268573 + no 1 AAAAAAAAGATATTGAAGTTAAAGTAATTTG 9 1 GCF_000742135.1 3038258 + no 1 AAAAAAAAGCCCACGAACCGGGGGCAATATC 9 1 GCF_002950215.1 3578394 + no 1 AAAAAAAAGCCCCGCCGAAGCGGGGCTTTTT 9 1 GCF_000017205.1 5110420 + no 1 AAAAAAAAGGATTATAACAAAATTTTGTCAT 9 1 GCF_001544255.1 426716 + no 1 AAAAAAAAGGCTTTACGGATGATCCGATGGA 9 1 GCF_009759685.1 3033057 + no 1 AAAAAAAAGTAATTGCAGCTATTATTGGGAC 10 1 GCF_001027105.1 437272 + no 1 AAAAAAAAGTATTAAGCAACTGACTAAAAGT 10 1 GCF_006742205.1 1841209 + no 1 AAAAAAAAGTCACAATTATTGGTGCCGGTTT 13 1 GCF_000392875.1 1508457 - no 1 AAAAAAAAGTCATCAAGGATTATTTGAGTTA 12 1 GCF_001457655.1 1847867 + no 1 AAAAAAAAGTCATCGCTTTATCTGTCAGTAT 12 1 GCF_001544255.1 156689 - no 1 AAAAAAAAGTCATCTTCGGATGGCTTTTTTA 12 1 GCF_000148585.2 1363150 - no 1 AAAAAAAAGTCCATCCTGCAGCATAAAATAA 11 1 GCF_000742135.1 4671015 + no 1 AAAAAAAAGTCCCTGCTGTTTGCCCAGTCCT 11 1 GCF_000006945.2 3796 - no 1 AAAAAAAAGTCCGCTGATAAGGCTTGAAAAG 11 3 
GCF_002949675.1 2356807 + no Specify the mask.\n$ lexicmap utils kmers --quiet -d demo.lmi/ --mask 12345 | head -n 20 | csvtk pretty -t mask kmer prefix number ref pos strand reversed ----- ------------------------------- ------ ------ --------------- ------- ------ -------- 12345 CATTAGTAAAAACCAACTTAGTTACGACACG 8 1 GCF_001027105.1 1823411 + no 12345 CATTAGTAAAACATTTTGAACCTGTGATTGA 8 1 GCF_006742205.1 1192019 + no 12345 CATTAGTAAAAGTCGTTTGGTAAAGCGATTA 8 1 GCF_001027105.1 1334989 + yes 12345 CATTAGTAAACGTACAAAACTATTGGTTAGA 8 1 GCF_001027105.1 2037559 + yes 12345 CATTAGTAAATCCAGGAATCCTAACCGACGA 8 1 GCF_001027105.1 963152 + yes 12345 CATTAGTAACGCGTACGAAACCGTAGTAAGT 8 1 GCF_001027105.1 1958187 + yes 12345 CATTAGTAAGTTGTCGGTCTAACGCGGATTA 8 1 GCF_002950215.1 2882180 + yes 12345 CATTAGTACATTCAAGTATTATTCATTAAAC 8 1 GCF_009759685.1 665376 + yes 12345 CATTAGTACCGATAGGACATCATGAACACAA 8 1 GCF_002950215.1 4677222 + yes 12345 CATTAGTACCTTCATCGCTATCCCATTAGGC 8 1 GCF_000006945.2 92542 + yes 12345 CATTAGTACGTGTCCCGCAAAGAGAAAGAAC 8 1 GCF_000006945.2 3412102 + yes 12345 CATTAGTAGAAAAATACAAAGGCATTTATGA 11 1 GCF_900638025.1 665985 - no 12345 CATTAGTAGAAAATTGATAATCTAAGAGTTC 11 1 GCF_002950215.1 2940281 + no 12345 CATTAGTAGAAATGGGCAAAGAATAGGAAAA 11 1 GCF_000148585.2 81286 + no 12345 CATTAGTAGAAGAAATTGCAGCAAGTATTAA 14 1 GCF_001027105.1 621160 + no 12345 CATTAGTAGAAGAACTGAAGTTAGTGCCTAT 14 1 GCF_001096185.1 2113047 + no 12345 CATTAGTAGAAGAAGACCAAGCACGACGCAT 15 1 GCF_000392875.1 891723 + no 12345 CATTAGTAGAAGAGTTGTTCGTCAGTTACGG 13 1 GCF_001544255.1 831068 - no 12345 CATTAGTAGAAGATTTAGTGGCAAGCTCAAT 13 1 GCF_001457655.1 1280653 + no \u0026ldquo;reversed\u0026rdquo; indicates whether the k-mer is reversed for suffix matching. 
E.g., CATTAGTAAAAGTCGTTTGGTAAAGCGATTA is reversed, so you need to reverse it before searching in the genome.\n$ seqkit locate -p $(echo CATTAGTAAAAGTCGTTTGGTAAAGCGATTA | rev) refs/GCF_001027105.1.fa.gz -M | csvtk pretty -t seqID patternName pattern strand start end ------------- ------------------------------- ------------------------------- ------ ------- ------- NZ_CP011526.1 ATTAGCGAAATGGTTTGCTGAAAATGATTAC ATTAGCGAAATGGTTTGCTGAAAATGATTAC + 1334989 1335019 For all masks. The result might be very big, therefore, writing to gzip format is recommended.\n$ lexicmap utils kmers -d demo.lmi/ --mask 0 -o kmers.tsv.gz $ zcat kmers.tsv.gz | csvtk freq -t -f mask -nr | head -n 10 mask frequency 24088 322 15814 295 13923 293 27102 291 13922 282 15967 281 10001 280 15986 272 16440 269 a faster way\nseq 1 $(lexicmap utils masks -d demo.lmi/ --quiet | wc -l) \\ | rush --eta 'echo -e {}\u0026quot;\\t\u0026quot;$(lexicmap utils kmers -d demo.lmi/ -m {} -f --quiet | csvtk nrow)' \\ | csvtk add-header -t -n mask,seeds \\ | csvtk sort -t -k seeds:nr \\ | head -n 10 Lengths of shared prefixes between probes and captured k-mers.\nzcat kmers.tsv.gz \\ | csvtk grep -t -f reversed -p no \\ | csvtk plot hist -t -f prefix -o prefix.hist.png \\ --xlab \u0026quot;length of common prefixes between captured k-mers and masks\u0026quot; The output (TSV format) is formatted with csvtk pretty.\n","description":"$ lexicmap utils kmers -h View k-mers captured by the masks Attention: 1. Mask index (column mask) is 1-based. 2. Prefix means the length of shared prefix between a k-mer and the mask. 3. K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. 4. Reversed means if the k-mer is reversed for suffix matching. 
Usage: lexicmap utils kmers [flags] -d \u0026lt;index path\u0026gt; [-m \u0026lt;mask index\u0026gt;] [-o out."},{"id":6,"href":"/LexicMap/tutorials/search/","title":"Step 2. Searching","parent":"Tutorials","content":" Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0 Input Query length\nLexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned. Input should be (gzipped) FASTA or FASTQ records from files or STDIN.\nHardware requirements See benchmark of index building.\nLexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.\nCPU: No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported. More is better as LexicMap is a CPU-intensive software. It uses all CPUs by default (-j/--threads). RAM More RAM (\u0026gt; 16 GB) is preferred. The memory usage in searching is mainly related to: The number of matched genomes and sequences. The length of query sequences. Similarities between query and target sequences. The number of threads. It uses all CPUs by default (-j/--threads). Disk Sufficient space is required to store the index size. No temporary files are generated during searching. Algorithm Click to show details. ... 
Masking: Query sequence is masked by the masks of the index. In other words, each mask captures the most similar k-mer which shares the longest prefix with the mask, and stores its position and strand information. Seeding: For each mask, the captured k-mer is used to search seeds (captured k-mers in reference genomes) sharing prefixes or suffixes of at least p bases. Prefix matching Setting the search range: Since the seeded k-mers are stored in lexicographic order, the k-mer matching turns into a range query. For example, for a query CATGCT, requiring a match of at least a 4-bp prefix is equivalent to extracting k-mers ranging from CATGAA, CATGAC, CATGAG, \u0026hellip;, to CATGTT. Retrieving search start point: The index file of each seed data file stores some k-mers\u0026rsquo; offsets in the data file, and the index is loaded in RAM. Retrieving seed data: Seed k-mers are read from the file and checked one by one, and k-mers in the search range are returned, along with the k-mer information (genome batch, genome number, location, and strand). Suffix matching Reversing the query k-mer and performing prefix matching, returning seeds of reversed k-mers (see indexing algorithm). Chaining: Seeding results, i.e., anchors (matched k-mers from the query and subject sequence), are summarized by genome, and deduplicated. Performing chaining (see the paper). Alignment for each chain. Extending the anchor region for extracting sequences from the query and reference genome. For example, extending 1 kb upstream and downstream of the anchor region. Performing pseudo-alignment with extended query and subject sequences, to find similar regions. For similar regions that span more than one reference sequence, splitting them into multiple ones. Fast alignment of query and subject sequence regions with our implementation of Wavefront alignment algorithm. Filtering alignments based on user options. 
Parameters Flags in bold text are important and frequently used.\nGeneral Flag Value Function Comment -w/--load-whole-seeds Load the whole seed data into memory for faster search Use this if the index is not big and many queries are needed to search. -n/--top-n-genomes Default 0, 0 for all Keep top N genome matches for a query in the chaining phase Value 1 is not recommended as the best chaining result does not always bring the best alignment, so it better be \u0026gt;= 5. The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step. -a/--all Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026ldquo;lexicmap utils 2blast\u0026rdquo; -J/\u0026ndash;max-query-conc Default 12, 0 for all Maximum number of concurrent queries Bigger values do not improve the batch searching speed and consume much memory. Chaining Flag Value Function Comment -p, --seed-min-prefix Default 15 Minimum (prefix) length of matched seeds. Smaller values produce more results at the cost of slow speed. -P, --seed-min-single-prefix Default 17 Minimum (prefix) length of matched seeds if there\u0026rsquo;s only one pair of seeds matched. Smaller values produce more results at the cost of slow speed. --seed-max-dist Default 1000 Max distance between seeds in seed chaining. It should be \u0026lt;= contig interval length in database. --seed-max-gap Default 200 Max gap in seed chaining. Alignment Flag Value Function Comment -Q/--min-qcov-per-genome Default 0 Minimum query coverage (percentage) per genome. -q/--min-qcov-per-hsp Default 0 Minimum query coverage (percentage) per HSP. -l/--align-min-match-len Default 50 Minimum aligned length in a HSP segment. -i/--align-min-match-pident Default 70 Minimum base identity (percentage) in a HSP segment. --align-band Default 50 Band size in backtracking the score matrix. 
--align-ext-len Default 1000 Extend length of upstream and downstream of seed regions, for extracting query and target sequences for alignment. It should be \u0026lt;= contig interval length in database. --align-max-gap Default 20 Maximum gap in a HSP segment. Improving searching speed LexicMap\u0026rsquo;s searching speed is related to many factors:\nThe number of similar sequences in the index/database. More genome hits cost more time, e.g., 16S rRNA gene. Similarity between query and subject sequences. Alignment of diverse sequences is slower than that of highly similar sequences. The length of query sequence. Longer queries take more time. The I/O performance and load. LexicMap is I/O bound, because seed matching and extracting candidate subsequences for alignment require a large number of file reads in parallel. CPU frequency and the number of threads. Faster CPUs and more threads cost less time. Here are some tips to improve the search speed.\nIncreasing the concurrency number Make sure that the value of -j/--threads (default: all available CPUs) is ≥ the number of seed chunk files (default: all available CPUs in the indexing step), which can be found in info.toml file, e.g.,\n# Seeds (k-mer-value data) files chunks = 48 Increasing the value of --max-open-files (default 512). You might also need to change the open files limit.\n(If you have many queries) Increase the value of -J/--max-query-conc (default 12), which will increase memory usage.\n(If you have many queries) Loading the entire seed data into memory (It\u0026rsquo;s unnecessary if the index is stored in SSD) Setting -w/--load-whole-seeds to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters. Returning fewer results Setting -n/--top-n-genomes to keep top N genome matches for a query (0 for all) in chaining phase. 
For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time. Steps For short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-match-pident 70 \\ --min-qcov-per-hsp 70 \\ --min-qcov-per-genome 70 \\ --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-match-pident 70 \\ --min-qcov-per-hsp 0 \\ --min-qcov-per-genome 0 \\ --top-n-genomes 0 Click to show the log of a demo run. ... $ lexicmap search -d demo.lmi/ q.gene.fasta -o q.gene.fasta.lexicmap.tsv 09:32:55.551 [INFO] LexicMap v0.4.0 09:32:55.551 [INFO] https://github.com/shenwei356/LexicMap 09:32:55.551 [INFO] 09:32:55.551 [INFO] checking input files ... 09:32:55.551 [INFO] 1 input file(s) given 09:32:55.551 [INFO] 09:32:55.551 [INFO] loading index: demo.lmi/ 09:32:55.551 [INFO] reading masks... 09:32:55.552 [INFO] reading indexes of seeds (k-mer-value) data... 09:32:55.555 [INFO] creating genome reader pools, each batch with 16 readers... 09:32:55.555 [INFO] index loaded in 4.192051ms 09:32:55.555 [INFO] 09:32:55.555 [INFO] searching ... 
09:32:55.596 [INFO] 09:32:55.596 [INFO] processed queries: 1, speed: 1467.452 queries per minute 09:32:55.596 [INFO] 100.0000% (1/1) queries matched 09:32:55.596 [INFO] done searching 09:32:55.596 [INFO] search results saved to: q.gene.fasta.lexicmap.tsv 09:32:55.596 [INFO] 09:32:55.596 [INFO] elapsed time: 45.230604ms 09:32:55.596 [INFO] Extracting similar sequences for a query gene.\n# search matches with query coverage \u0026gt;= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta --min-qcov-per-hsp 90 --all -o results.tsv # extract matched sequences as FASTA format sed 1d results.tsv | awk -F\u0026#39;\\t\u0026#39; \u0026#39;{print \u0026#34;\u0026gt;\u0026#34;$5\u0026#34;:\u0026#34;$14\u0026#34;-\u0026#34;$15\u0026#34;:\u0026#34;$16\u0026#34;\\n\u0026#34;$20;}\u0026#39; | seqkit seq -g \u0026gt; results.fasta seqkit head -n 1 results.fasta | head -n 3 \u0026gt;NZ_JALSCK010000007.1:39224-40522:- TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT Exporting blast-like alignment text.\nFrom file:\nlexicmap utils 2blast results.tsv -o results.txt Add genome annotation\nlexicmap utils 2blast results.tsv -o results.txt --kv-file-genome ass2species.map From stdin:\n# align only one long-read \u0026lt;= 500 bp $ seqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 1 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast --kv-file-genome ass2species.map Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Staphylococcus epidermidis Query coverage per genome = 92.575% \u0026gt;NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 
TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| ||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677 Output Alignment result relationship Query ├── Subject genome # A query might have one or more genome hits, ├── Subject sequence # in different sequences. ├── High-Scoring segment Pair (HSP) # HSP is an alignment segment. Here, the defination of HSP is similar with that in BLAST. Actually there are small gaps in HSPs.\nA High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the highest alignment scores in a given search. https://www.ncbi.nlm.nih.gov/books/NBK62051/\nOutput format Tab-delimited format with 17+ columns, with 1-based positions.\n1. query, Query sequence ID. 2. qlen, Query sequence length. 3. 
hits, Number of subject genomes. 4. sgenome, Subject genome ID. 5. sseqid, Subject sequence ID. 6. qcovGnm, Query coverage (percentage) per genome: $(aligned bases in the genome)/$qlen. 7. hsp, Nth HSP in the genome. (just for improving readability) 8. qcovHSP Query coverage (percentage) per HSP: $(aligned bases in a HSP)/$qlen. 9. alenHSP, Aligned length in the current HSP. 10. pident, Percentage of identical matches in the current HSP. 11. gaps, Gaps in the current HSP. 12. qstart, Start of alignment in query sequence. 13. qend, End of alignment in query sequence. 14. sstart, Start of alignment in subject sequence. 15. send, End of alignment in subject sequence. 16. sstr, Subject strand. 17. slen, Subject sequence length. 18. cigar, CIGAR string of the alignment. (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026quot;|\u0026quot; and \u0026quot; \u0026quot;) between qseq and sseq. 
(optional with -a/--all) Examples A single-copy gene (SecY) query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ---------------------------------------- ---- ---- --------------- -------------------- ------- --- ------- ------- ------- ---- ------ ---- ------ ------ ---- ------- lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_000395405.1 NZ_KB947497.1 100.000 1 100.000 1299 100.000 0 1 1299 232279 233577 + 274511 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_019731615.1 NZ_JAASJA010000010.1 100.000 1 100.000 1299 100.000 0 1 1299 2798 4096 + 42998 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCA_004103085.1 RPCL01000012.1 100.000 1 100.000 1299 100.000 0 1 1299 44095 45393 + 84242 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_023571745.1 NZ_JAMKBS010000014.1 100.000 1 100.000 1299 100.000 0 1 1299 44077 45375 + 84206 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_013248625.1 NZ_JABTDK010000002.1 100.000 1 100.000 1299 100.000 0 1 1299 9609 10907 + 49787 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_900092155.1 NZ_FLUS01000006.1 100.000 1 100.000 1299 100.000 0 1 1299 63161 64459 + 77366 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_902165815.1 NZ_CABHHZ010000005.1 100.000 1 100.000 1299 100.000 0 1 1299 39386 40684 - 200163 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_014243495.1 NZ_SJAV01000002.1 100.000 1 100.000 1299 100.000 0 1 1299 39085 40383 - 256772 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_900148695.1 NZ_FRXS01000009.1 100.000 1 100.000 1299 100.000 0 1 1299 39230 40528 - 96692 lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_902164645.1 NZ_LR607334.1 100.000 1 100.000 1299 100.000 0 1 1299 236677 237975 + 3380663 A 16S rRNA gene query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen --------------------------- ---- ------ --------------- ----------------- ------- --- 
------- ------- ------- ---- ------ ---- ------- ------- ---- ------- NC_000913.3:4166659-4168200 1542 293398 GCF_002248685.1 NZ_NQBE01000079.1 100.000 1 100.000 1542 100.000 0 1 1542 40 1581 - 99259 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 1 100.000 1542 100.000 0 1 1542 1270211 1271752 + 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 2 100.000 1542 100.000 0 1 1542 5466287 5467828 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 3 100.000 1543 99.546 2 1 1542 557008 558549 + 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 4 100.000 1543 99.482 2 1 1542 4473658 4475199 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 5 100.000 1543 99.482 2 1 1542 5154150 5155691 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 6 100.000 1543 99.482 2 1 1542 5195176 5196717 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 7 100.000 1543 99.482 2 1 1542 5369865 5371406 - 5483624 NC_000913.3:4166659-4168200 1542 293398 GCF_000460355.1 NZ_KE701684.1 100.000 1 100.000 1542 100.000 0 1 1542 1108651 1110192 - 1914390 NC_000913.3:4166659-4168200 1542 293398 GCF_000460355.1 NZ_KE701686.1 100.000 2 100.000 1542 99.741 0 1 1542 100680 102221 + 102235 A plasmid query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ---------- ----- ----- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ----- ------- ------- ---- ------- CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 1 75.792 40041 99.995 0 12069 52109 11439 51479 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 2 20.316 10733 100.000 0 1 10733 722 11454 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 3 1.365 721 100.000 0 52110 52830 1 721 + 51479 
CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086535.1 97.473 4 0.916 484 91.116 0 51686 52169 27192 27675 - 34058 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086535.1 97.473 5 0.829 438 90.868 1 52342 52779 26583 27019 - 34058 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 6 1.552 820 100.000 0 9049 9868 23092 23911 + 51479 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086534.1 97.473 7 0.502 265 100.000 0 19788 20052 29842 30106 + 47185 CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 8 0.159 84 97.619 0 8348 8431 19574 19657 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 1 75.792 40041 99.995 0 12069 52109 11439 51479 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 2 20.316 10733 100.000 0 1 10733 722 11454 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 3 1.365 721 100.000 0 52110 52830 1 721 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086547.1 97.473 4 0.916 484 91.116 0 51686 52169 3843 4326 + 34058 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086547.1 97.473 5 0.829 438 90.868 1 52342 52779 4499 4935 + 34058 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 6 1.552 820 100.000 0 9049 9868 23092 23911 + 51479 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086546.1 97.473 7 0.502 265 100.000 0 19788 20052 29842 30106 + 47185 CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 8 0.159 84 97.619 0 8348 8431 19574 19657 + 51479 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 1 77.157 40762 99.993 0 12069 52830 9513 50274 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 2 18.033 9528 99.990 1 1207 10733 1 9528 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 3 2.283 1206 100.000 0 1 1206 50275 51480 + 51480 CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058618.1 97.473 4 2.497 1319 100.000 0 25153 26471 3019498 3020816 - 4718403 Long reads Queries are a few Nanopore Q20 reads from a mock metagenomic community.\nquery 
qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen ------------------ ---- ---- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- ------- ERR5396170.1000016 740 1 GCF_013394085.1 NZ_CP040910.1 89.595 1 89.595 663 99.246 0 71 733 13515 14177 + 1887974 ERR5396170.1000000 698 1 GCF_001457615.1 NZ_LN831024.1 85.673 1 85.673 603 98.010 5 53 650 4452083 4452685 + 6316979 ERR5396170.1000017 516 1 GCF_013394085.1 NZ_CP040910.1 94.574 1 94.574 489 99.591 2 27 514 293509 293996 + 1887974 ERR5396170.1000012 848 1 GCF_013394085.1 NZ_CP040910.1 95.165 1 95.165 811 97.411 7 22 828 190329 191136 - 1887974 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 1 60.000 973 95.889 13 365 1333 88793 89756 - 2884551 ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 2 4.706 76 98.684 0 266 341 89817 89892 - 2884551 ERR5396170.1000036 1159 1 GCF_013394085.1 NZ_CP040910.1 95.427 1 95.427 1107 99.729 1 32 1137 1400097 1401203 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 1 86.486 707 99.151 3 104 807 242235 242941 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 2 86.486 707 98.444 3 104 807 1138777 1139483 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 3 84.152 688 98.983 4 104 788 154620 155306 - 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 4 84.029 687 99.127 3 104 787 32477 33163 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 5 72.727 595 98.992 3 104 695 1280183 1280777 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 6 11.671 95 100.000 0 693 787 1282480 1282574 + 1887974 ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 7 82.064 671 99.106 3 120 787 1768782 1769452 + 1887974 Search results (TSV format) above are formatted with csvtk pretty.\nSummarizing results If you would like to summarize alignment results, 
e.g., the number of species, here\u0026rsquo;s the method.\nPrepare a two-column tab-delimited file for mapping reference (genome) or sequence IDs to any information (such as species name).\n# for GTDB/GenBank/RefSeq genomes downloaded with genome_updater cut -f 1,8 assembly_summary.txt \u0026gt; ass2species.tsv head -n 3 ass2species.tsv GCF_002287175.1 Methanobacterium bryantii GCF_000762265.1 Methanobacterium formicicum GCF_029601605.1 Methanobacterium formicicum Add information to the alignment result with csvtk or other tools.\n# add species cat b.gene_E_coli_16S.fasta.lexicmap.tsv \\ | csvtk mutate -t --after slen -n species -f sgenome \\ | csvtk replace -t -f species -p \u0026quot;(.+)\u0026quot; -r \u0026quot;{kv}\u0026quot; -k ass2species.tsv \\ \u0026gt; result.with_species.tsv # filter result with query coverage \u0026gt;= 80 and count the species cat result.with_species.tsv \\ | csvtk uniq -t -f sgenome \\ | csvtk filter2 -t -f \u0026quot;\\$qcovHSP \u0026gt;= 80\u0026quot; \\ | csvtk freq -t -f species -nr \\ \u0026gt; result.with_species.tsv.stats.tsv csvtk head -t -n 5 result.with_species.tsv.stats.tsv \\ | csvtk pretty -t species frequency ------------------------ --------- Salmonella enterica 135065 Escherichia coli 128071 Streptococcus pneumoniae 51971 Staphylococcus aureus 44215 Pseudomonas aeruginosa 34254 ","description":"Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query."},{"id":7,"href":"/LexicMap/tutorials/misc/index-allthebacteria/","title":"Indexing 
AllTheBacteria","parent":"More","content":"Make sure you have enough disk space, at least 8 TB, \u0026gt;10 TB is preferred.\nTools:\nhttps://github.com/shenwei356/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.\nDownloading the list file of all assemblies in the latest version (v0.2 plus incremental versions).\nmkdir -p atb; cd atb; # attention: the URL might change; please check it in the browser. wget https://osf.io/download/4yv85/ -O file_list.all.latest.tsv.gz If you only need to add assemblies from an incremental version, please manually download the file list in the path AllTheBacteria/Assembly/OSF Storage/File_lists.\nDownloading assembly tarball files.\n# tarball file names and their URLs zcat file_list.all.latest.tsv.gz | awk 'NR\u0026gt;1 {print $3\u0026quot;\\t\u0026quot;$4}' | uniq \u0026gt; tar2url.tsv # download cat tar2url.tsv | rush --eta -j 2 -c -C download.rush 'wget -O {1} {2}' Decompressing all tarballs. The decompressed genomes are stored in plain text, so we use gzip (can be replaced with faster pigz) to compress them to save disk space.\n# {^.tar.xz} is for removing the suffix \u0026quot;.tar.xz\u0026quot; ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^.tar.xz}/*.fa' cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz/xz/zstd-compressed) genome files. 
You can also give a file list with selected assemblies.\n$ tree atb | more atb ├── atb.assembly.r0.2.batch.1 │ ├── SAMD00013333.fa.gz │ ├── SAMD00049594.fa.gz │ ├── SAMD00195911.fa.gz │ ├── SAMD00195914.fa.gz Prepare a file list of assemblies.\nJust use find or fd (much faster).\n# find find atb/ -name \u0026quot;*.fa.gz\u0026quot; \u0026gt; files.txt # fd fd .fa.gz$ atb/ \u0026gt; files.txt What it looks like:\n$ head -n 2 files.txt atb/atb.assembly.r0.2.batch.1/SAMD00013333.fa.gz atb/atb.assembly.r0.2.batch.1/SAMD00049594.fa.gz (Optional) Only keep high-quality assemblies. Please manually download the hq_set.sample_list.txt.gz file from this path, e.g., AllTheBacteria/Metadata/OSF Storage/Aggregated/Latest_2024-08/ (choose the latest date).\nfind atb/ -name \u0026quot;*.fa.gz\u0026quot; | grep -w -f \u0026lt;(zcat hq_set.sample_list.txt.gz) \u0026gt; files.txt Creating a LexicMap index. (more details: https://bioinf.shenwei.me/LexicMap/tutorials/index/)\nlexicmap index -S -X files.txt -O atb.lmi -b 25000 --log atb.lmi.log Steps for v0.2 hosted at EBI ftp Downloading assembly tarballs here (except those starting with unknown__) to a directory (like atb): https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/\nmkdir -p atb; cd atb; # assembly file list, 650 files in total wget https://bioinf.shenwei.me/LexicMap/AllTheBacteria-v0.2.url.txt # download # rush is used: https://github.com/shenwei356/rush # The download.rush file stores finished jobs, which will be skipped in a second run for resuming jobs. cat AllTheBacteria-v0.2.url.txt | rush --eta -j 2 -c -C download.rush 'wget {}' # list of high-quality samples wget https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/metadata/hq_set.sample_list.txt.gz Decompressing all tarballs. 
The decompressed genomes are stored in plain text, so we use gzip (can be replaced with faster pigz ) to compress them to save disk space.\n# {^asm.tar.xz} is for removing the suffix \u0026quot;asm.tar.xz\u0026quot; ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^asm.tar.xz}/*.fa' cd .. After that, the assemblies directory would have multiple subdirectories. When you give the directory to lexicmap index -I, it can recursively scan (plain or gz/xz/zstd-compressed) genome files. You can also give a file list with selected assemblies.\n$ tree atb | more atb ├── achromobacter_xylosoxidans__01 │ ├── SAMD00013333.fa.gz │ ├── SAMD00049594.fa.gz │ ├── SAMD00195911.fa.gz │ ├── SAMD00195914.fa.gz # disk usage $ du -sh atb 2.9T atb $ du -sh atb --apparent-size 2.1T atb Creating a LexicMap index. (more details: https://bioinf.shenwei.me/LexicMap/tutorials/index/)\n# file paths of all samples find atb/ -name \u0026quot;*.fa.gz\u0026quot; \u0026gt; atb_all.txt # wc -l atb_all.txt # 1876015 atb_all.txt # file paths of high-quality samples grep -w -f \u0026lt;(zcat atb/hq_set.sample_list.txt.gz) atb_all.txt \u0026gt; atb_hq.txt # wc -l atb_hq.txt # 1858610 atb_hq.txt # index lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log For 1,858,610 HQ genomes, on a 48-CPU machine, time: 48 h, ram: 85 GB, index size: 3.88 TB. If you don\u0026rsquo;t have enough memory, please decrease the value of -b.\n# disk usage $ du -sh atb_hq.lmi 4.6T atb_hq.lmi $ du -sh atb_hq.lmi --apparent-size 3.9T atb_hq.lmi $ dirsize atb_hq.lmi atb_hq.lmi: 3.88 TiB (4,261,437,129,065) 2.11 TiB seeds 1.77 TiB genomes 39.22 MiB genomes.map.bin 312.53 KiB masks.bin 332 B info.toml Note that, there\u0026rsquo;s a tmp directory atb_hq.lmi being created during indexing. 
In the tmp directory, the seed data would be bigger than the final size of seeds directory, however, the genome files are simply moved to the final index.\n","description":"Make sure you have enough disk space, at least 8 TB, \u0026gt;10 TB is preferred.\nTools:\nhttps://github.com/shenwei356/rush, for running jobs Info:\nAllTheBacteria, All WGS isolate bacterial INSDC data to June 2023 uniformly assembled, QC-ed, annotated, searchable. Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable Data on OSF: https://osf.io/xv7q9/ Steps for v0.2 and later versions hosted at OSF After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF."},{"id":8,"href":"/LexicMap/usage/utils/genomes/","title":"genomes","parent":"utils","content":" Usage $ lexicmap utils genomes -h View genome IDs in the index Usage: lexicmap utils genomes [flags] Flags: -h, --help help for genomes -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 8) Examples $ lexicmap utils genomes -d demo.lmi/ GCF_000148585.2 GCF_001457655.1 GCF_900638025.1 GCF_001096185.1 GCF_006742205.1 GCF_001544255.1 GCF_000392875.1 GCF_001027105.1 GCF_009759685.1 GCF_002949675.1 GCF_002950215.1 GCF_000006945.2 GCF_003697165.2 GCF_000742135.1 GCF_000017205.1 ","description":"Usage $ lexicmap utils genomes -h View genome IDs in the index Usage: lexicmap utils genomes [flags] Flags: -h, --help help for genomes -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments."},{"id":9,"href":"/LexicMap/tutorials/misc/index-globdb/","title":"Indexing GlobDB","parent":"More","content":"Info:\nGlobDB, a dereplicated dataset of the species representatives of the GTDB, GEM, SPIRE and SMAG datasets. https://x.com/daanspeth/status/1822964436950192218 Steps:\n# download data wget https://fileshare.lisc.univie.ac.at/globdb/globdb_r220/globdb_r220_genome_fasta.tar.gz tar -zxf globdb_r220_genome_fasta.tar.gz # file list find globdb_r220_genome_fasta/ -name \u0026quot;*.fa.gz\u0026quot; \u0026gt; files.txt # index with lexicmap # elapsed time: 3h:40m:38s # peak rss: 87.15 GB lexicmap index -S -X files.txt -O globdb_r220.lmi --log globdb_r220.lmi.log -g 50000000 ","description":"Info:\nGlobDB, a dereplicated dataset of the species representatives of the GTDB, GEM, SPIRE and SMAG datasets. 
https://x.com/daanspeth/status/1822964436950192218 Steps:\n# download data wget https://fileshare.lisc.univie.ac.at/globdb/globdb_r220/globdb_r220_genome_fasta.tar.gz tar -zxf globdb_r220_genome_fasta.tar.gz # file list find globdb_r220_genome_fasta/ -name \u0026quot;*.fa.gz\u0026quot; \u0026gt; files.txt # index with lexicmap # elapsed time: 3h:40m:38s # peak rss: 87.15 GB lexicmap index -S -X files.txt -O globdb_r220.lmi --log globdb_r220.lmi -g 50000000 "},{"id":10,"href":"/LexicMap/installation/","title":"Installation","parent":"","content":"LexicMap can be installed via conda, downloading executable binary files, or compiling from the source.\nBesides, it supports shell completion, which could help accelerate typing.\nConda Install conda, then run\nconda install -c bioconda lexicmap Or use mamba, which is faster.\nconda install -c conda-forge mamba mamba install -c bioconda lexicmap Linux and MacOS (both x86 and arm CPUs) are supported.\nBinary files Linux Download the binary file.\nOS Arch File, 中国镜像 Linux 64-bit lexicmap_linux_amd64.tar.gz, 中国镜像 Linux arm64 lexicmap_linux_arm64.tar.gz, 中国镜像 Decompress it:\ntar -zxvf lexicmap_linux_amd64.tar.gz If you have the root privilege, simply copy it to /usr/local/bin:\nsudo cp lexicmap /usr/local/bin/ If you don\u0026rsquo;t have the root privilege, copy it to any directory in the environment variable PATH:\nmkdir -p $HOME/bin/; cp lexicmap $HOME/bin/ And optionally add the directory into the environment variable PATH if it\u0026rsquo;s not in.\n# bash echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.bashrc source $HOME/.bashrc # apply the configuration # zsh echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.zshrc source $HOME/.zshrc # apply the configuration MacOS Download the binary file.\nOS Arch File, 中国镜像 macOS 64-bit lexicmap_darwin_amd64.tar.gz, 中国镜像 macOS arm64 lexicmap_darwin_arm64.tar.gz, 中国镜像 Copy it to any directory in the environment variable PATH:\nmkdir -p $HOME/bin/; cp 
lexicmap $HOME/bin/ And optionally add the directory into the environment variable PATH if it\u0026rsquo;s not in.\n# bash echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.bashrc source $HOME/.bashrc # apply the configuration # zsh echo export PATH=\\$PATH:\\$HOME/bin/ \u0026gt;\u0026gt; $HOME/.zshrc source $HOME/.zshrc # apply the configuration Windows Download the binary file.\nOS Arch File, 中国镜像 Windows 64-bit lexicmap_windows_amd64.exe.tar.gz, 中国镜像 Decompress it.\nCopy lexicmap.exe to C:\\WINDOWS\\system32.\nOthers Please open an issue to request binaries for other platforms. Or compiling from the source. Compile from the source Install go (go 1.22 or later versions).\nwget https://go.dev/dl/go1.22.6.linux-amd64.tar.gz tar -zxf go1.22.6.linux-amd64.tar.gz -C $HOME/ # or # echo \u0026quot;export PATH=$PATH:$HOME/go/bin\u0026quot; \u0026gt;\u0026gt; ~/.bashrc # source ~/.bashrc export PATH=$PATH:$HOME/go/bin Compile LexicMap.\n# ------------- the latest stable version ------------- go get -v -u github.com/shenwei356/LexicMap/lexicmap # The executable binary file is located in: # ~/go/bin/lexicmap # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ~/go/bin/lexicmap $HOME/bin/ # --------------- the development version -------------- git clone https://github.com/shenwei356/LexicMap cd LexicMap/lexicmap/ go build # The executable binary file is located in: # ./lexicmap # You can also move it to anywhere in the $PATH mkdir -p $HOME/bin cp ./lexicmap $HOME/bin/ Shell-completion Supported shell: bash|zsh|fish|powershell\nBash:\n# generate completion shell lexicmap autocompletion --shell bash # configure if never did. # install bash-completion if the \u0026quot;complete\u0026quot; command is not found. 
echo \u0026quot;for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\u0026quot; \u0026gt;\u0026gt; ~/.bash_completion echo \u0026quot;source ~/.bash_completion\u0026quot; \u0026gt;\u0026gt; ~/.bashrc Zsh:\n# generate completion shell lexicmap autocompletion --shell zsh --file ~/.zfunc/_lexicmap # configure if never did echo 'fpath=( ~/.zfunc \u0026quot;${fpath[@]}\u0026quot; )' \u0026gt;\u0026gt; ~/.zshrc echo \u0026quot;autoload -U compinit; compinit\u0026quot; \u0026gt;\u0026gt; ~/.zshrc fish:\nlexicmap autocompletion --shell fish --file ~/.config/fish/completions/lexicmap.fish ","description":"LexicMap can be installed via conda, downloading executable binary files, or compiling from the source.\nBesides, it supports shell completion, which could help accelerate typing.\nConda Install conda, then run\nconda install -c bioconda lexicmap Or use mamba, which is faster.\nconda install -c conda-forge mamba mamba install -c bioconda lexicmap Linux and MacOS (both x86 and arm CPUs) are supported.\nBinary files Linux Download the binary file.\nOS Arch File, 中国镜像 Linux 64-bit lexicmap_linux_amd64."},{"id":11,"href":"/LexicMap/usage/search/","title":"search","parent":"Usage","content":"$ lexicmap search -h Search sequences against an index Attention: 1. Input should be (gzipped) FASTA or FASTQ records from files or stdin. 2. For multiple queries, the order of queries might be different from the input. Tips: 1. When using -a/--all, the search result would be formatted to Blast-style format with \u0026#39;lexicmap utils 2blast\u0026#39;. And the search speed would be slightly slowed down. 2. Alignment result filtering is performed in the final phase, so stricter filtering criteria, including -q/--min-qcov-per-hsp, -Q/--min-qcov-per-genome, and -i/--align-min-match-pident, do not significantly accelerate the search speed. Hence, you can search with default parameters and then filter the result with tools like awk or csvtk. 
Alignment result relationship: Query ├── Subject genome ├── Subject sequence ├── High-Scoring segment Pair (HSP) Here, the defination of HSP is similar with that in BLAST. Actually there are small gaps in HSPs. \u0026gt; A High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the \u0026gt; highest alignment scores in a given search. https://www.ncbi.nlm.nih.gov/books/NBK62051/ Output format: Tab-delimited format with 17+ columns, with 1-based positions. 1. query, Query sequence ID. 2. qlen, Query sequence length. 3. hits, Number of subject genomes. 4. sgenome, Subject genome ID. 5. sseqid, Subject sequence ID. 6. qcovGnm, Query coverage (percentage) per genome: $(aligned bases in the genome)/$qlen. 7. hsp, Nth HSP in the genome. (just for improving readability) 8. qcovHSP Query coverage (percentage) per HSP: $(aligned bases in a HSP)/$qlen. 9. alenHSP, Aligned length in the current HSP. 10. pident, Percentage of identical matches in the current HSP. 11. gaps, Gaps in the current HSP. 12. qstart, Start of alignment in query sequence. 13. qend, End of alignment in query sequence. 14. sstart, Start of alignment in subject sequence. 15. send, End of alignment in subject sequence. 16. sstr, Subject strand. 17. slen, Subject sequence length. 18. cigar, CIGAR string of the alignment. (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026#34;|\u0026#34; and \u0026#34; \u0026#34;) between qseq and sseq. (optional with -a/--all) Usage: lexicmap search [flags] -d \u0026lt;index path\u0026gt; [query.fasta.gz ...] [-o query.tsv.gz] Flags: --align-band int ► Band size in backtracking the score matrix (pseduo alignment phase). (default 50) --align-ext-len int ► Extend length of upstream and downstream of seed regions, for extracting query and target sequences for alignment. 
It should be \u0026lt;= contig interval length in database. (default 1000) --align-max-gap int ► Maximum gap in a HSP segment. (default 20) -l, --align-min-match-len int ► Minimum aligned length in a HSP segment. (default 50) -i, --align-min-match-pident float ► Minimum base identity (percentage) in a HSP segment. (default 70) -a, --all ► Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026#34;lexicmap utils 2blast\u0026#34;. -h, --help help for search -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -w, --load-whole-seeds ► Load the whole seed data into memory for faster search. --max-open-files int ► Maximum opened files. (default 512) -J, --max-query-conc int ► Maximum number of concurrent queries. Bigger values do not improve the batch searching speed and consume much memory. (default 12) -Q, --min-qcov-per-genome float ► Minimum query coverage (percentage) per genome. -q, --min-qcov-per-hsp float ► Minimum query coverage (percentage) per HSP. -o, --out-file string ► Out file, supports a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) --pseudo-align ► Only perform pseudo alignment, alignment metrics, including qcovGnm, qcovSHP and pident, will be less accurate. --seed-max-dist int ► Max distance between seeds in seed chaining. (default 10000) --seed-max-gap int ► Max gap in seed chaining. (default 500) -p, --seed-min-prefix int ► Minimum (prefix) length of matched seeds. (default 15) -P, --seed-min-single-prefix int ► Minimum (prefix) length of matched seeds if there\u0026#39;s only one pair of seeds matched. (default 17) -n, --top-n-genomes int ► Keep top N genome matches for a query (0 for all) in chaining phase. Value 1 is not recommended as the best chaining result does not always bring the best alignment, so it better be \u0026gt;= 5. 
Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 8) Examples See Searching ","description":"$ lexicmap search -h Search sequences against an index Attention: 1. Input should be (gzipped) FASTA or FASTQ records from files or stdin. 2. For multiple queries, the order of queries might be different from the input. Tips: 1. When using -a/--all, the search result would be formatted to Blast-style format with \u0026#39;lexicmap utils 2blast\u0026#39;. And the search speed would be slightly slowed down. 2. Alignment result filtering is performed in the final phase, so stricter filtering criteria, including -q/--min-qcov-per-hsp, -Q/--min-qcov-per-genome, and -i/--align-min-match-pident, do not significantly accelerate the search speed."},{"id":12,"href":"/LexicMap/usage/utils/subseq/","title":"subseq","parent":"utils","content":" Usage $ lexicmap utils subseq -h Exextract subsequence via reference name, sequence ID, position and strand Attention: 1. The option -s/--seq-id is optional. 1) If given, the positions are these in the original sequence. 2) If not given, the positions are these in the concatenated sequence. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes. Usage: lexicmap utils subseq [flags] Flags: -h, --help help for subseq -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. -w, --line-width int ► Line width of sequence (0 for no wrap). (default 60) -o, --out-file string ► Out file, supports the \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). 
(default \u0026#34;-\u0026#34;) -n, --ref-name string ► Reference name. -r, --region string ► Region of the subsequence (1-based). -R, --revcom ► Extract subsequence on the negative strand. -s, --seq-id string ► Sequence ID. If the value is empty, the positions in the region are treated as that in the concatenated sequence. Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples Extracting subsequence with genome ID, sequence ID, position range and strand information.\n$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4591684:4593225 -R \u0026gt;NZ_CP033092.2:4591684-4593225:- AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAA GTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAA TGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCAT AACGTCGCAAGACCAAAGAGGGGGACCTTAGGGCCTCTTGCCATCGGATGTGCCCAGATG GGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGG GGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCT TCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATT GACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAG GGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTC GTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACC GGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCA AACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCC CTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAAT TCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAG 
AATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGA AATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGC CGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTC ATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCG ACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAAC TCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGT TCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGT AGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAA CAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA If the sequence ID (-s/--seq-id) is not given, the positions are these in the concatenated sequence.\nChecking sequence lengths of a genome with seqkit.\n$ seqkit fx2tab -nil refs/GCF_003697165.2.fa.gz NZ_CP033092.2 4903501 NZ_CP033091.2 131333 Extracting the 1000-bp interval sequence inserted by lexicmap index.\n$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -r 4903502:4904501 \u0026gt;GCF_003697165.2:4903502-4904501:+ AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA It detects if the 
end position is larger than the sequence length.\n# the length of NZ_CP033092.2 is 4903501 $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903501:1000000000 \u0026gt;NZ_CP033092.2:4903501-4903501:+ C $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903502:1000000000 \u0026gt;NZ_CP033092.2:4903502-4903501:+ ","description":"Usage $ lexicmap utils subseq -h Exextract subsequence via reference name, sequence ID, position and strand Attention: 1. The option -s/--seq-id is optional. 1) If given, the positions are these in the original sequence. 2) If not given, the positions are these in the concatenated sequence. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes."},{"id":13,"href":"/LexicMap/tutorials/misc/index-uhgg/","title":"Indexing UHGG","parent":"More","content":"Info:\nUnified Human Gastrointestinal Genome (UHGG) v2.0.2 A unified catalog of 204,938 reference genomes from the human gut microbiome Number of Genomes: 289,232 Tools:\nhttps://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\n# meta data wget https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0.2/genomes-all_metadata.tsv # gff url sed 1d genomes-all_metadata.tsv | cut -f 20 | sed 's/v2.0/v2.0.2/' | sed -E 's/^ftp/https/' \u0026gt; url.txt # download gff files mkdir -p files; cd files time cat ../url.txt \\ | rush --eta -v 'dir={///%}/{//%}' \\ 'mkdir -p {dir}; curl -s -o {dir}/{%} {}' \\ -c -C download.rush -j 12 cd .. # extract sequences from gff files find files/ -name \u0026quot;*.gff.gz\u0026quot; \\ | rush --eta \\ 'zcat {} | perl -ne \u0026quot;print if \\$s; \\$s=true if /^##FASTA/\u0026quot; | seqkit seq -w 0 -o {/}/{%:}.fna.gz' \\ -c -C extract.rush Indexing. 
On a 48-CPU machine, time: 3 h, ram: 41 GB, index size: 426 GB. If you don\u0026rsquo;t have enough memory, please decrease the value of -b.\nlexicmap index \\ -I files/ \\ -O uhgg.lmi --log uhgg.lmi.log \\ -b 5000 File sizes:\n$ du -sh files/ uhgg.lmi 658G files/ 509G uhgg.lmi $ du -sh files/ uhgg.lmi --apparent-size 425G files/ 426G uhgg.lmi $ dirsize uhgg.lmi uhgg.lmi: 425.15 GiB (456,497,171,291) 243.47 GiB seeds 181.67 GiB genomes 6.34 MiB genomes.map.bin 312.53 KiB masks.bin 330 B info.toml ","description":"Info:\nUnified Human Gastrointestinal Genome (UHGG) v2.0.2 A unified catalog of 204,938 reference genomes from the human gut microbiome Number of Genomes: 289,232 Tools:\nhttps://github.com/shenwei356/seqkit, for checking sequence files https://github.com/shenwei356/rush, for running jobs Data:\n# meta data wget https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0.2/genomes-all_metadata.tsv # gff url sed 1d genomes-all_metadata.tsv | cut -f 20 | sed 's/v2.0/v2.0.2/' | sed -E 's/^ftp/https/' \u0026gt; url.txt # download gff files mkdir -p files; cd files time cat ../url.txt \\ | rush --eta -v 'dir={///%}/{//%}' \\ 'mkdir -p {dir}; curl -s -o {dir}/{%} {}' \\ -c -C download."},{"id":14,"href":"/LexicMap/releases/","title":"Releases","parent":"","content":" Latest version v0.4.0 v0.4.0 - 2024-08-15 New commands: lexicmap utils 2blast: Convert the default search output to blast-style format. lexicmap index: Support suffix matching of seeds, now seeds are immune to any single SNP!!!, at the cost of doubled seed data. Better sketching desert filling for highly-repetitive regions. Change the default value of --seed-max-desert from 900 to 200 to increase alignment sensitivity. Mask gap regions (N\u0026rsquo;s). Fix skipping interval regions by further including the last k-1 bases of contigs. Fix a bug in indexing small genomes. Change the default value of -b, --batch-size from 10,000 to 5,000. Improve lexichash data structure. 
Write and merge seed data in parallel, new flag -J/--seed-data-threads. Improve the log. lexicmap search: Fix chaining for highly-repetitive regions. Perform more accurate alignment with WFA. Use buffered reader for seeds file reading. Fix object recycling and reduce memory usage. Fix alignment against genomes with many short contigs. Fix early quit when meeting a sequence shorter than k. Add a new option -J/--max-query-conc to limit the maximum number of concurrent queries, with a default value of 12 instead of the number of CPUs, which reduces the memory usage in batch searching. Result format: Cluster alignments of each target sequence. Remove the column seeds. Add columns gaps, cigar, align, which can be reformatted with lexicmap utils 2blast. lexicmap utils kmers: Fix the progress bar. Fix a bug where some masks do not have any k-mer. Add a new column prefix to show the length of common prefix between the seed and the probe. Add a new column reversed to indicate if the k-mer is reversed for suffix matching. lexicmap utils masks: Add the support of only outputting a specific mask. lexicmap utils seed-pos: New columns: sseqid and pos_seq. More accurate seed distance. Add histograms of numbers of seeds in sliding windows. lexicmap utils subseq: Fix a bug when the given end position is larger than the sequence length. Add the strand (\u0026quot;+\u0026quot; or \u0026ldquo;-\u0026rdquo;) in the sequence header. Please run lexicmap version to check update !!! Please run lexicmap autocompletion to update shell autocompletion script !!! Previous versions v0.3.0 v0.3.0 - 2024-05-14 lexicmap index: Better seed coverage by filling sketching deserts. Use longer (1000bp N\u0026rsquo;s, previous: k-1) intervals between contigs. Fix a concurrency bug between genome data writing and k-mer-value data collecting. Change the format of k-mer-value index file, and fix the computation of index partitions. 
Optionally save seed positions which can be outputted by lexicmap utils seed-pos. lexicmap search: Improved seed-chaining algorithm. Better support of long queries. Add a new flag -w/--load-whole-seeds for loading the whole seed data into memory for faster search. Parallelize alignment in each query, so it\u0026rsquo;s faster for a single query. Optionally output matched query and subject sequences. 2-5X searching speed with a faster masking method. Change output format. Add output of query start and end positions. Fix a target sequence extracting bug. Keep indexes of genome data in memory. lexicmap utils kmers: Fix a little bug, wrong number of k-mers for the second k-mer in each k-mer pair. New commands: lexicmap utils gen-masks for generating masks from the top N largest genomes. lexicmap utils seed-pos for extracting seed positions via reference names. lexicmap utils reindex-seeds for recreating indexes of k-mer-value (seeds) data. lexicmap utils genomes for listing genome IDs in the index. v0.2.0 v0.2.0 - 2024-02-02 Software architecture and index formats are redesigned to reduce searching memory occupation. Indexing: genomes are processed in batches to reduce RAM usage, then indexes of all batches are merged. Searching: seeds matching is performed on disk yet it\u0026rsquo;s ultra-fast. v0.1.0 v0.1.0 - 2024-01-15 The first release. Seed indexing and querying are performed in RAM. GTDB r214 with 10k masks: index size 75GB, RAM: 130GB. ","description":"Latest version v0.4.0 v0.4.0 - 2024-08-15 New commands: lexicmap utils 2blast: Convert the default search output to blast-style format. lexicmap index: Support suffix matching of seeds, now seeds are immune to any single SNP!!!, at the cost of doubled seed data. Better sketching desert filling for highly-repetitive regions. Change the default value of --seed-max-desert from 900 to 200 to increase alignment sensitivity. Mask gap regions (N\u0026rsquo;s). 
Fix skipping interval regions by further including the last k-1 bases of contigs."},{"id":15,"href":"/LexicMap/usage/utils/seed-pos/","title":"seed-pos","parent":"utils","content":" Usage $ lexicmap utils seed-pos -h Extract and plot seed positions via reference name(s) Attention: 0. This command requires the index to be created with the flag --save-seed-pos in lexicmap index. 1. Seed/K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. So values of column pos_gnm and pos_seq might be different. The positions can be used to extract subsequence with \u0026#39;lexicmap utils subseq\u0026#39;. 2. All degenerate bases in reference genomes were converted to the lexicographic first bases. E.g., N was converted to A. Therefore, consecutive A\u0026#39;s in output might be N\u0026#39;s in the genomes. Extra columns: Using -v/--verbose will output more columns: len_aaa, length of consecutive A\u0026#39;s. seq, sequence between the previous and current seed. Figures: Using -O/--plot-dir will write plots into given directory: - Histograms of seed distances. - Histograms of numbers of seeds in sliding windows. Usage: lexicmap utils seed-pos [flags] Flags: -a, --all-refs ► Output for all reference genomes. This would take a long time for an index with a lot of genomes. -b, --bins int ► Number of bins in histograms. (default 100) --color-index int ► Color index (1-7). (default 1) --force ► Overwrite existing output directory. --height float ► Histogram height (unit: inch). (default 4) -h, --help help for seed-pos -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --max-open-files int ► Maximum opened files, used for extracting sequences. (default 512) -D, --min-dist int ► Only output records with seed distance \u0026gt;= this value. 
-o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) -O, --plot-dir string ► Output directory for 1) histograms of seed distances, 2) histograms of numbers of seeds in sliding windows. --plot-ext string ► Histogram plot file extention. (default \u0026#34;.png\u0026#34;) -n, --ref-name strings ► Reference name(s). -s, --slid-step int ► The step size of sliding windows for counting the number of seeds (default 200) -w, --slid-window int ► The window size of sliding windows for counting the number of seeds (default 500) -v, --verbose ► Show more columns including position of the previous seed and sequence between the two seeds. Warning: it\u0026#39;s slow to extract the sequences, recommend set -D 1000 or higher values to filter results --width float ► Histogram width (unit: inch). (default 6) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 16) Examples Adding the flag --save-seed-pos in index building.\n$ lexicmap index -I refs/ -O demo.lmi --save-seed-pos --force Listing seed position of one genome.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -o seed_distance.tsv $ head -n 10 seed_distance.tsv | csvtk pretty -t ref seqid pos_gnm pos_seq strand distance --------------- ----------- ------- ------- ------ -------- GCF_000017205.1 NC_009656.1 90 90 - 89 GCF_000017205.1 NC_009656.1 133 133 + 43 GCF_000017205.1 NC_009656.1 137 137 - 4 GCF_000017205.1 NC_009656.1 139 139 - 2 GCF_000017205.1 NC_009656.1 160 160 - 21 GCF_000017205.1 NC_009656.1 300 300 - 140 GCF_000017205.1 NC_009656.1 338 338 + 38 GCF_000017205.1 NC_009656.1 360 360 + 22 GCF_000017205.1 NC_009656.1 361 361 + 1 Check the biggest seed distances.\n$ csvtk freq -t -f distance seed_distance.tsv \\ | csvtk sort -t -k distance:nr \\ | head -n 10 \\ | csvtk pretty -t distance frequency -------- --------- 199 43 198 49 197 52 196 43 195 44 194 47 193 43 192 53 191 38 Or only list records with seed distances longer than a threshold.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -D 190 \\ | csvtk pretty -t | head -n 5 ref seqid pos_gnm pos_seq strand distance --------------- ----------- ------- ------- ------ -------- GCF_000017205.1 NC_009656.1 13964 13964 - 197 GCF_000017205.1 NC_009656.1 27420 27420 + 191 GCF_000017205.1 NC_009656.1 30942 30942 + 193 Plot histogram of distances between seeds and histogram of number of seeds in sliding windows.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -o seed_distance.tsv --plot-dir seed_distance In the plot below, there\u0026rsquo;s a peak at 50 bp, because LexicMap fills sketching deserts with extra k-mers (seeds) of which their distance is 50 bp by default.\nMore columns including sequences between two seeds.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -v \\ | head -n4 | csvtk pretty -t -W 40 --clip ref seqid pos_gnm pos_seq strand distance 
len_aaa seq --------------- ----------- ------- ------- ------ -------- ------- ---------------------------------------- GCF_000017205.1 NC_009656.1 90 90 - 89 9 TTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGC... GCF_000017205.1 NC_009656.1 133 133 + 43 3 TTTCTTTTAAAGGATAGAAGCGGTTATTGCTCTTGGT... GCF_000017205.1 NC_009656.1 137 137 - 4 0 GGTT Or only list records with seed distance longer than a threshold.\n$ lexicmap utils seed-pos -d demo.lmi/ -n GCF_000017205.1 -v -D 190 \\ | head -n 2 \\ | csvtk pretty -t -W 40 ref seqid pos_gnm pos_seq strand distance len_aaa seq --------------- ----------- ------- ------- ------ -------- ------- ---------------------------------------- GCF_000017205.1 NC_009656.1 13964 13964 - 197 8 ATTTGCCCATTGAGGCGCCGGTATTGCGCATGGAAGTGGT GCGCATCGACGCCGAGGGCGTCGGCCTGCGCTTCCTCGCC GATCAATGAAACCCGAGTTCCACGTGGAACCACGGTCCTG CCATCGATCAGCGAACGGGCGAATCCGCCGCCCGTTATCG GCTAGAATGCGCGCCGCTCGGCATGGGGCCGGGCATG Listing seed position of all genomes.\n$ lexicmap utils seed-pos -d demo.lmi/ --all-refs -o seed-pos.tsv.gz Show the number of seed positions in each genome. Frequencies larger than 40000 (the number of masks) means some k-mers can be foud in more than one positions in a genome.\n$ csvtk freq -t -f ref -nr seed-pos.tsv.gz | csvtk pretty -t ref frequency --------------- --------- GCF_000017205.1 134674 GCF_000742135.1 103882 GCF_003697165.2 92389 GCF_000006945.2 91007 GCF_002950215.1 89876 GCF_002949675.1 84731 GCF_009759685.1 72615 GCF_001027105.1 56806 GCF_000392875.1 55397 GCF_006742205.1 52670 GCF_001544255.1 49919 GCF_900638025.1 46654 GCF_001457655.1 46226 GCF_001096185.1 46222 GCF_000148585.2 44848 Plot the histograms of distances between seeds for all genomes.\n$ lexicmap utils seed-pos -d demo.lmi/ --all-refs -o seed-pos.tsv.gz \\ --plot-dir seed_distance --force 09:56:34.059 [INFO] creating genome reader pools, each batch with 1 readers... processed files: 15 / 15 [======================================] ETA: 0s. 
done 09:56:34.656 [INFO] seed positions of 15 genomes(s) saved to seed-pos.tsv.gz 09:56:34.656 [INFO] histograms of 15 genomes(s) saved to seed_distance 09:56:34.656 [INFO] 09:56:34.656 [INFO] elapsed time: 598.080462ms 09:56:34.656 [INFO] $ ls seed_distance/ GCF_000006945.2.png GCF_000742135.1.png GCF_001544255.1.png GCF_006742205.1.png GCF_000006945.2.seed_number.png GCF_000742135.1.seed_number.png GCF_001544255.1.seed_number.png GCF_006742205.1.seed_number.png GCF_000017205.1.png GCF_001027105.1.png GCF_002949675.1.png GCF_009759685.1.png GCF_000017205.1.seed_number.png GCF_001027105.1.seed_number.png GCF_002949675.1.seed_number.png GCF_009759685.1.seed_number.png GCF_000148585.2.png GCF_001096185.1.png GCF_002950215.1.png GCF_900638025.1.png GCF_000148585.2.seed_number.png GCF_001096185.1.seed_number.png GCF_002950215.1.seed_number.png GCF_900638025.1.seed_number.png GCF_000392875.1.png GCF_001457655.1.png GCF_003697165.2.png GCF_000392875.1.seed_number.png GCF_001457655.1.seed_number.png GCF_003697165.2.seed_number.png In the plots below, there\u0026rsquo;s a peak at 50 bp, because LexicMap fills sketching deserts with extra k-mers (seeds) of which their distance is 50 bp by default. And they show that the seed number, seed distance and seed density are related to genome sizes.\nGCF_000392875.1 (genome size: 2.9 Mb)\n","description":"Usage $ lexicmap utils seed-pos -h Extract and plot seed positions via reference name(s) Attention: 0. This command requires the index to be created with the flag --save-seed-pos in lexicmap index. 1. Seed/K-mer positions (column pos) are 1-based. For reference genomes with multiple sequences, the sequences were concatenated to a single sequence with intervals of N\u0026#39;s. So values of column pos_gnm and pos_seq might be different. 
The positions can be used to extract subsequence with \u0026#39;lexicmap utils subseq\u0026#39;."},{"id":16,"href":"/LexicMap/tutorials/misc/","title":"More","parent":"Tutorials","content":"","description":""},{"id":17,"href":"/LexicMap/tutorials/","title":"Tutorials","parent":"","content":"","description":""},{"id":18,"href":"/LexicMap/usage/utils/","title":"utils","parent":"Usage","content":"$ lexicmap utils Some utilities Usage: lexicmap utils [command] Available Commands: 2blast Convert the default search output to blast-style format genomes View genome IDs in the index kmers View k-mers captured by the masks masks View masks of the index or generate new masks randomly reindex-seeds Recreate indexes of k-mer-value (seeds) data seed-pos Extract and plot seed positions via reference name(s) subseq Extract subsequence via reference name, sequence ID, position and strand Flags: -h, --help help for utils Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 16) The output (TSV format) is formatted with csvtk pretty.\n","description":"$ lexicmap utils Some utilities Usage: lexicmap utils [command] Available Commands: 2blast Convert the default search output to blast-style format genomes View genome IDs in the index kmers View k-mers captured by the masks masks View masks of the index or generate new masks randomly reindex-seeds Recreate indexes of k-mer-value (seeds) data seed-pos Extract and plot seed positions via reference name(s) subseq Extract subsequence via reference name, sequence ID, position and strand Flags: -h, --help help for utils Global Flags: -X, --infile-list string ► File of input file list (one file per line)."},{"id":19,"href":"/LexicMap/usage/utils/reindex-seeds/","title":"reindex-seeds","parent":"utils","content":" Usage $ lexicmap utils reindex-seeds -h Recreate indexes of k-mer-value (seeds) data Usage: lexicmap utils reindex-seeds [flags] Flags: -h, --help help for reindex-seeds -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --partitions int ► Number of partitions for re-indexing seeds (k-mer-value data) files. The value needs to be the power of 4. (default 1024) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Examples $ lexicmap utils reindex-seeds -d demo.lmi/ --partitions 1024 10:20:29.150 [INFO] recreating seed indexes with 1024 partitions for: demo.lmi/ processed files: 16 / 16 [======================================] ETA: 0s. 
done 10:20:29.166 [INFO] update index information file: demo.lmi/info.toml 10:20:29.166 [INFO] finished updating the index information file: demo.lmi/info.toml 10:20:29.166 [INFO] 10:20:29.166 [INFO] elapsed time: 15.981266ms 10:20:29.166 [INFO] ","description":"Usage $ lexicmap utils reindex-seeds -h Recreate indexes of k-mer-value (seeds) data Usage: lexicmap utils reindex-seeds [flags] Flags: -h, --help help for reindex-seeds -d, --index string ► Index directory created by \u0026#34;lexicmap index\u0026#34;. --partitions int ► Number of partitions for re-indexing seeds (k-mer-value data) files. The value needs to be the power of 4. (default 1024) Global Flags: -X, --infile-list string ► File of input file list (one file per line)."},{"id":20,"href":"/LexicMap/usage/","title":"Usage","parent":"","content":"","description":""},{"id":21,"href":"/LexicMap/faqs/","title":"FAQs","parent":"","content":" Table of contents Table of contents Does LexicMap support short reads? Does LexicMap support fungi genomes? How\u0026rsquo;s the hardware requirement? Can I extract the matched sequences? How can I extract the upstream and downstream flanking sequences of matched regions? Why isn\u0026rsquo;t the pident 100% when aligning with a sequence from the reference genomes? Why is LexicMap slow for batch searching? Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned.\nIf you just want to search long (\u0026gt;1kb) queries for highy similar (\u0026gt;95%) targets, you can build an index with a bigger -D/--seed-max-desert (200 by default), e.g.,\n--seed-max-desert 450 --seed-in-desert-dist 150 Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. 
The alignment speed, however, is almost unaffected.\nDoes LexicMap support fungi genomes? Yes. LexicMap mainly supports small genomes including prokaryotic, viral, and plasmid genomes. Fungal genomes are also supported; just remember to increase the value of -g/--max-genome when running lexicmap index, which is used to skip genomes larger than 15 Mb by default.\n-g, --max-genome int ► Maximum genome size. Extremely large genomes (e.g., non-isolate assemblies from Genbank) will be skipped. (default 15000000) Maximum genome size is about 268 Mb (268,435,456). More precisely:\n$total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456 as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index.\nFor big and complex genomes, like the human genome (chr1 is ~248 Mb) which has many repetitive sequences, LexicMap would be slow to align.\nHow\u0026rsquo;s the hardware requirement? For index building. See details hardware requirement. For searching. See details hardware requirement. Can I extract the matched sequences? Yes, lexicmap search has a flag\n-a, --all ► Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with \u0026#34;lexicmap utils 2blast\u0026#34;. to output the CIGAR string and the aligned query and subject sequences.\n18. cigar, CIGAR string of the alignment (optional with -a/--all) 19. qseq, Aligned part of query sequence. (optional with -a/--all) 20. sseq, Aligned part of subject sequence. (optional with -a/--all) 21. align, Alignment text (\u0026#34;|\u0026#34; and \u0026#34; \u0026#34;) between qseq and sseq. (optional with -a/--all) An example:\n# Extracting similar sequences for a query gene. 
# search matches with query coverage \u0026gt;= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta -o results.tsv \\ --min-qcov-per-hsp 90 --all # extract matched sequences as FASTA format sed 1d results.tsv | awk -F'\\t' '{print \u0026quot;\u0026gt;\u0026quot;$5\u0026quot;:\u0026quot;$14\u0026quot;-\u0026quot;$15\u0026quot;:\u0026quot;$16\u0026quot;\\n\u0026quot;$20;}' \\ | seqkit seq -g \u0026gt; results.fasta seqkit head -n 1 results.fasta | head -n 3 \u0026gt;NZ_JALSCK010000007.1:39224-40522:- TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT And lexicmap utils 2blast can help to convert the tabular format to Blast-style format, see examples.\nHow can I extract the upstream and downstream flanking sequences of matched regions? lexicmap utils subseq can extract subsequences via genome ID, sequence ID and positions. So you can take this information from the search result and expand the region positions to extract flanking sequences.\nWhy isn\u0026rsquo;t the pident 100% when aligning with a sequence from the reference genomes? It happens if there are some degenerate bases (e.g., N) in the query sequence. In the indexing step, all degenerate bases are converted to their lexicographic first bases. E.g., N is converted to A. The query sequences, however, are not converted.\nWhy is LexicMap slow for batch searching? LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 17 million) of genomes. 
There are some ways to improve the search speed of lexicmap search.\nIncreasing the concurrency number Make sure that the value of -j/--threads (default: all available CPUs) is ≥ the number of seed chunk files (default: all available CPUs in the indexing step), which can be found in the info.toml file, e.g.,\n# Seeds (k-mer-value data) files chunks = 48 Increasing the value of --max-open-files (default 512). You might also need to change the open files limit.\n(If you have many queries) Increase the value of -J/--max-query-conc (default 12); note that this increases memory usage.\nLoading the entire seed data into memory (It\u0026rsquo;s unnecessary if the index is stored in SSD) Setting -w/--load-whole-seeds to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters. Returning less results Setting -n/--top-n-genomes to keep top N genome matches for a query (0 for all) in chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time. Sacrificing accuracy Setting --pseudo-align to only perform pseudo alignment, which is slightly faster and uses less memory. It can be used in searching with long and divergent query sequences like nanopore long-reads. Click to read more detail of the usage.\n","description":"Table of contents Table of contents Does LexicMap support short reads? Does LexicMap support fungi genomes? How\u0026rsquo;s the hardware requirement? Can I extract the matched sequences? How can I extract the upstream and downstream flanking sequences of matched regions? Why isn\u0026rsquo;t the pident 100% when aligning with a sequence from the reference genomes? Why is LexicMap slow for batch searching? Does LexicMap support short reads? 
LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default."},{"id":22,"href":"/LexicMap/notes/","title":"Notes","parent":"","content":"","description":""},{"id":23,"href":"/LexicMap/","title":"","parent":"","content":" LexicMap LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, virus, or long-read sequences against up to millions of prokaryotic genomes.\nIntroduction Feature overview Easy to install Linux, Windows, MacOS and more OS are supported.\nBoth x86 and ARM CPUs are supported.\nJust download the binary files and run!\nOr install it by\nconda install -c bioconda lexicmap Installation Releases Easy to use Step 1: indexing\nlexicmap index -I genomes/ -O db.lmi Step 2: searching\nlexicmap search -d db.lmi q.fasta -o r.tsv Tutorials Usages FAQs Notes Accurate and efficient alignment Using LexicMap to search in the whole 2,340,672 Genbank+Refseq prokaryotic genomes with 48 CPUs.\nQuery Genome hits Time RAM A 1.3-kb gene 37,164 36s 4.1GB A 1.5-kb 16S rRNA 1,949,496 10m41s 14.1GB A 52.8-kb plasmid 544,619 19m20s 19.3GB 1003 AMR genes 25,702,419 187m40s 55.4GB Blastn is unable to run with the same dataset on common servers as it requires \u0026gt;2000 GB RAM.\nPerformance ","description":"LexicMap LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, virus, or long-read sequences against up to millions of prokaryotic genomes.\nIntroduction Feature overview Easy to install Linux, Windows, MacOS and more OS are supported.\nBoth x86 and ARM CPUs are supported.\nJust download the binary files and run!\nOr install it by\nconda install -c bioconda lexicmap Installation Releases Easy to use Step 1: indexing"},{"id":24,"href":"/LexicMap/usage/utils/2blast/","title":"2blast","parent":"utils","content":" Usage $ lexicmap utils 2blast -h Convert the default search output to blast-style format LexicMap only stores 
genome IDs and sequence IDs, without description information. But the option -g/--kv-file-genome enables adding description data after the genome ID with a tabular key-value mapping file. Input: - Output of \u0026#39;lexicmap search\u0026#39; with the flag -a/--all. Usage: lexicmap utils 2blast [flags] Flags: -b, --buffer-size string ► Size of buffer, supported unit: K, M, G. You need increase the value when \u0026#34;bufio.Scanner: token too long\u0026#34; error reported (default \u0026#34;20M\u0026#34;) -h, --help help for 2blast -i, --ignore-case ► Ignore cases of sgenome and sseqid -g, --kv-file-genome string ► Two-column tabular file for mapping the target genome ID (sgenome) to the corresponding value -s, --kv-file-seq string ► Two-column tabular file for mapping the target sequence ID (sseqid) to the corresponding value -o, --out-file string ► Out file, supports and recommends a \u0026#34;.gz\u0026#34; suffix (\u0026#34;-\u0026#34; for stdout). (default \u0026#34;-\u0026#34;) Global Flags: -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. 
(default 16) Examples From stdin.\n$ seqkit seq -M 500 q.long-reads.fasta.gz \\ | seqkit head -n 2 \\ | lexicmap search -d demo.lmi/ -a \\ | lexicmap utils 2blast --kv-file-genome ass2species.map Query = GCF_000017205.1_r160 Length = 478 [Subject genome #1/1] = GCF_000017205.1 Pseudomonas aeruginosa Query coverage per genome = 95.188% \u0026gt;NC_009656.1 Length = 6588339 HSP #1 Query coverage per seq = 95.188%, Aligned length = 463, Identities = 95.680%, Gaps = 12 Query range = 13-467, Subject range = 4866862-4867320, Strand = Plus/Plus Query 13 CCTCAAACGAGTCC-AACAGGCCAACGCCTAGCAATCCCTCCCCTGTGGGGCAGGGAAAA 71 |||||||||||||| |||||||| |||||| | ||||||||||||| |||||||||||| Sbjct 4866862 CCTCAAACGAGTCCGAACAGGCCCACGCCTCACGATCCCTCCCCTGTCGGGCAGGGAAAA 4866921 Query 72 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCATCTTCCACGGTGCCC 131 |||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||| Sbjct 4866922 TCGTCCTTTATGGTCCGTTCCGGGCACGCACCGGAACGGCGGTCAT-TTCCACGGTGCCC 4866980 Query 132 GCCCACGGCGGACCCGCGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTA-CA 190 ||| ||||||||||| ||||||||||||||||||||||||||||||||||||||||| || Sbjct 4866981 GCC-ACGGCGGACCC-CGGAAACCGACCCGGGCGCCAAGGCGCCCGGGAACGGAGTATCA 4867038 Query 191 CTCGGCGTTCGGCCAGCGACAGC---GACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 247 |||||||| |||||||||||||| |||||||||||||||||||||||||||||||||| Sbjct 4867039 CTCGGCGT-CGGCCAGCGACAGCAGCGACGCGTTGCCGCCCACCGCGGTGGTGTTCACCG 4867097 Query 248 AGGTGGTGCGCTCGCTGAC-AAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCG- 305 ||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||| Sbjct 4867098 AGGTGGTGCGCTCGCTGACGAAACGCAGCAGGTAGTTCGGCCCGCCGGCCTTGGGACCGG 4867157 Query 306 TGCCGGACAGCCCGTGGCCGCCGAACAGTTGCACGCCCACCACCGCGCCGAT-TGGTTTC 364 |||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| | Sbjct 4867158 TGCCGGACAGCCCGTGGCCGCCGAACGGTTGCACGCCCACCACCGCGCCGATCTGGTTGC 4867217 Query 365 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTGGATGCGGCGGGCGGTTTCCTCGT 424 |||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||| Sbjct 
4867218 GGTTGACGTAGAGGTTGCCGACCCGCGCCAGCTCTTCGATGCGGCGGGCGGTTTCCTCGT 4867277 Query 425 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 467 ||||||||||||||||||||||||||||||||||||||||||| Sbjct 4867278 TGCGGCTGTGGACCCCCATGGTCAGGCCGAAACCGGTGGCGTT 4867320 Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Staphylococcus epidermidis Query coverage per genome = 92.575% \u0026gt;NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| 
||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677 From file.\n$ lexicmap utils 2blast r.lexicmap.tsv -o r.lexicmap.txt ","description":"Usage $ lexicmap utils 2blast -h Convert the default search output to blast-style format LexicMap only stores genome IDs and sequence IDs, without description information. But the option -g/--kv-file-genome enables adding description data after the genome ID with a tabular key-value mapping file. Input: - Output of \u0026#39;lexicmap search\u0026#39; with the flag -a/--all. Usage: lexicmap utils 2blast [flags] Flags: -b, --buffer-size string ► Size of buffer, supported unit: K, M, G."},{"id":25,"href":"/LexicMap/usage/lexicmap/","title":"lexicmap","parent":"Usage","content":"$ lexicmap -h LexicMap: efficient sequence alignment against millions of prokaryotic genomes Version: v0.4.0 Documents: https://bioinf.shenwei.me/LexicMap Source code: https://github.com/shenwei356/LexicMap Usage: lexicmap [command] Available Commands: autocompletion Generate shell autocompletion scripts index Generate an index from FASTA/Q sequences search Search sequences against an index utils Some utilities version Print version information and check for update Flags: -h, --help help for lexicmap -X, --infile-list string ► File of input file list (one file per line). If given, they are appended to files from CLI arguments. --log string ► Log file. --quiet ► Do not print any verbose information. But you can write them to a file with --log. -j, --threads int ► Number of CPU cores to use. By default, it uses all available cores. (default 16) Use \u0026#34;lexicmap [command] --help\u0026#34; for more information about a command. 
","description":"$ lexicmap -h LexicMap: efficient sequence alignment against millions of prokaryotic genomes Version: v0.4.0 Documents: https://bioinf.shenwei.me/LexicMap Source code: https://github.com/shenwei356/LexicMap Usage: lexicmap [command] Available Commands: autocompletion Generate shell autocompletion scripts index Generate an index from FASTA/Q sequences search Search sequences against an index utils Some utilities version Print version information and check for update Flags: -h, --help help for lexicmap -X, --infile-list string ► File of input file list (one file per line)."},{"id":26,"href":"/LexicMap/notes/motivation/","title":"Motivation","parent":"Notes","content":" BLASTN is not able to scale to millions of bacterial genomes, it\u0026rsquo;s slow and has a high memory occupation. For example, it requires \u0026gt;2000 GB for alignment a 2-kb gene sequence against all the 2.34 millions of prokaryotics genomes in Genbank and RefSeq.\nLarge-scale sequence searching tools only return which genomes a query matches (color), but they can\u0026rsquo;t return positional information.\n","description":"BLASTN is not able to scale to millions of bacterial genomes, it\u0026rsquo;s slow and has a high memory occupation. For example, it requires \u0026gt;2000 GB for alignment a 2-kb gene sequence against all the 2.34 millions of prokaryotics genomes in Genbank and RefSeq.\nLarge-scale sequence searching tools only return which genomes a query matches (color), but they can\u0026rsquo;t return positional information."},{"id":27,"href":"/LexicMap/tutorials/index/","title":"Step 1. Building a database","parent":"Tutorials","content":"Terminology differences:\nOn this page and in the LexicMap command line options, the term \u0026ldquo;mask\u0026rdquo; is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use \u0026ldquo;probe\u0026rdquo; as it is easier to understand. 
Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names. E.g., GCF_000006945.2.fna.gz A regular expression is also available to extract reference id from the file name. E.g., --ref-name-regexp '^(\\w{3}_\\d{9}\\.\\d+)' extracts GCF_000006945.2 from GenBank assembly file GCF_000006945.2_ASM694v2_genomic.fna.gz While if you save a few small (viral) complete genomes (one sequence per genome) in each file, it\u0026rsquo;s feasible as sequence IDs in search result can help to distinguish targe genomes. Run: From a directory with multiple genome files:\nlexicmap index -I genomes/ -O db.lmi From a file list with one file per line:\nlexicmap index -X files.txt -O db.lmi Input Genome size\nLexicMap is mainly suitable for small genomes like Archaea, Bacteria, Viruses and plasmids.\nMaximum genome size: 268 Mb (268,435,456). More precisely:\n$total_bases + ($num_contigs - 1) * 1000 \u0026lt;= 268,435,456 as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index.\nSequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names. While if you save a few small (viral) complete genomes (one sequence per genome) in each file, it\u0026rsquo;s feasible as sequence IDs in search result can help to distinguish targe genomes.\nFile type: FASTA/Q files, in plain text or gzip/xz/zstd/bzip2 compressed formats. File name: \u0026ldquo;Genome ID\u0026rdquo; + \u0026ldquo;File extention\u0026rdquo;. E.g., GCF_000006945.2.fna.gz. 
Genome ID: they should be distinct for accurate result interpretation, which will be shown in the search result. A regular expression is also available to extract reference id from the file name. E.g., --ref-name-regexp '^(\\w{3}_\\d{9}\\.\\d+)' extracts GCF_000006945.2 from GenBank assembly file GCF_000006945.2_ASM694v2_genomic.fna.gz File extention: a regular expression set by the flag -r/--file-regexp is used to match input files. The default value supports common sequence file extentions, e.g., .fa, .fasta, .fna, .fa.gz, .fasta.gz, .fna.gz, fasta.xz, fasta.zst, and fasta.bz2. Sequences: Only DNA or RNA sequences are supported. Sequence IDs should be distinct for accurate result interpretation, which will be shown in the search result. Sequence description (text behind sequence ID) is not saved. If you do need it, you can create a mapping file (seqkit seq -n ref.fa.gz | sed -E 's/\\s+/\\t/' \u0026gt; id2desc.tsv) and use it to add description in search result. One or more sequences (contigs) in each file are allowed. Unwanted sequences can be filtered out by regular expressions from the flag -B/--seq-name-filter. Genome size limit. Some none-isolate assemblies might have extremely large genomes, e.g., GCA_000765055.1 has \u0026gt;150 Mb. The flag -g/--max-genome (default 15 Mb) is used to skip these input files, and the file list would be written to a file via the flag -G/--big-genomes. For fungi genomes, please increase the value. Minimum sequence length. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value). At most 17,179,869,184 (234) genomes are supported. For more genomes, please create a file list and split it into multiple parts, and build an index for each part. Input files can be given via one of the following ways:\nPositional arguments. For a few input files. A file list via the flag -X/--infile-list with one file per line. 
It can be STDIN (-), e.g., you can filter a file list and pass it to lexicmap index. The flag -S/--skip-file-check is optional for skiping input file checking if you believe these files do exist. A directory containing input files via the flag -I/--in-dir. Multiple-level directories are supported. So you don\u0026rsquo;t need to saved hundreds of thousand files into one directoy. Directory and file symlinks are followed. Hardware requirements See benchmark of index building.\nLexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.\nCPU: No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported. More is better as LexicMap is a CPU-intensive software. It uses all CPUs by default (-j/--threads). RAM More RAM (\u0026gt; 100 GB) is preferred. The memory usage in index building is mainly related to: The number of masks (-m/--masks, default 40,000). Bigger values improve the search sensitivity, increase the index size, and slow down the search speed. For smaller genomes like phages/viruses, m=10,000 is high enough. The number of genomes. More genomes consume more memory. The divergence between genome sequences in each batch. Diverse genomes consume more memory. The genome batch size (-b/--batch-size, default 5,000). This is the main parameter to adjust memory usage. Bigger values increase indexing memory occupation. The maximum seed distance or the maximum sketching desert size (-D/--seed-max-desert, default 200), and the distance of k-mers to fill deserts (-d/--seed-in-desert-dist, default 50). Bigger -D/--seed-max-desert values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected. If the RAM is not sufficient. Please: Use a smaller genome batch size. 
It decreases indexing memory occupation and has little affection on searching performance. Use a smaller number of masks, e.g., 20,000 performs well for small genomes (\u0026lt;=5 Mb). And if the queries are long (\u0026gt;= 2kb), there\u0026rsquo;s little affection for the alignment results. Disk More is better. LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance. See some examples. Note that the index size is not linear with the number of genomes, it\u0026rsquo;s sublinear. Because the seed data are compressed with VARINT-GB algorithm, more genomes bring higher compression rates. SSD disks are preferred, while HDD disks are also fast enough. Algorithm Click to show details. ... Generating m LexicHash masks.\nGenerate m prefixes. Generating all permutations of p-bp prefixes that can cover all possible k-mers, p is the biggest value for 4p \u0026lt;= m (desired number of masks), e.g., p=7 for 40,000 masks. (47 = 16384) Duplicating these prefixes to m prefixes. For each prefix, Randomly generating left k-p bases. If the mask is duplicated, re-generating. Building an index for each genome batch (-b/--batch-size, default 5,000, max 131,072).\nFor each genome file in a genome batch. Optionally discarding sequences via regular expression (-B/--seq-name-filter). Skipping genomes bigger than the value of -g/--max-genome. Concatenating all sequences, with intervals of 1000-bp N\u0026rsquo;s. Capturing the most similar k-mer (in non-gap and non-interval regions) for each mask and recording the k-mer and its location(s) and strand information. Base N is treated as A. Filling sketching deserts (genome regions longer than --seed-max-desert [default 200] without any captured k-mers/seeds). In a sketching desert, not a single k-mer is captured because there\u0026rsquo;s another k-mer in another place which shares a longer prefix with the mask. 
As a result, for a query similar to seqs in this region, all captured k-mers can’t match the correct seeds. For a desert region (start, end), masking the extended region (start-1000, end+1000) with the masks. Starting from start, every around --seed-in-desert-dist (default 50) bp, finding a k-mer which is captured by some mask, and adding the k-mer and its position information into the index of that mask. Saving the concatenated genome sequence (bit-packed, 2 bits for one base, N is treated as A) and genome information (genome ID, size, and lengths of all sequences) into the genome data file, and creating an index file for the genome data file for fast random subsequence extraction. Duplicate and reverse all k-mers, and save each reversed k-mer along with the duplicated position information in the seed data of the closest (sharing the longgest prefix) mask. This is for suffix matching of seeds. Compressing k-mers and the corresponding data (k-mer-data, or seeds data, including genome batch, genome number, location, and strand) into chunks of files, and creating an index file for each k-mer-data file for fast seeding. Writing summary information into info.toml file. Merging indexes of multiple batches.\nFor each k-mer-data chunk file (belonging to a list of masks), serially reading data of each mask from all batches, merging them and writting to a new file. For genome data files, just moving them. Concatenating genomes.map.bin, which maps each genome ID to its batch ID and index in the batch. Update the index summary file. Parameters Query length\nLexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. 
However, short queries can also be aligned.\nIf you just want to search long (\u0026gt;1kb) queries for highy similar (\u0026gt;95%) targets, you can build an index with a bigger -D/--seed-max-desert (200 by default), e.g.,\n--seed-max-desert 450 --seed-in-desert-dist 150 Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected.\nFlags in bold text are important and frequently used.\nGenome batches Flag Value Function Comment -b/--batch-size Max: 131072, default: 5000 Maximum number of genomes in each batch If the number of input files exceeds this number, input files are split into multiple batches and indexes are built for all batches. In the end, seed files are merged, while genome data files are kept unchanged and collected. ■ Bigger values increase indexing memory occupation and increase batch searching speed, while single query searching speed is not affected. LexicHash mask generation Flag Value Function Comment -M/--mask-file A file File with custom masks File with custom masks, which could be exported from an existing index or newly generated by \u0026ldquo;lexicmap utils masks\u0026rdquo;. This flag oversides -k/--kmer, -m/--masks, -s/--rand-seed, etc. -k/--kmer Max: 32, default: 31 K-mer size ■ Bigger values improve the search specificity and do not increase the index size. -m/--masks Default: 40,000 Number of masks ■ Bigger values improve the search sensitivity, increase the index size, and slow down the search speed. For smaller genomes like phages/viruses, m=10,000 is high enough. Seeds (k-mer-value) data Flag Value Function Comment --seed-max-desert Default: 200 Maximum length of distances between seeds The default value of 200 guarantees queries \u0026gt;200 bp would match at least one seed. ► Large regions with no seeds are called sketching deserts. 
Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every \u0026ndash;seed-in-desert-dist (50 by default) bases. ■ Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected. -c/--chunks Maximum: 128, default: #CPUs Number of seed file chunks Bigger values accelerate the search speed at the cost of a high disk reading load. The maximum number should not exceed the maximum number of open files set by the operating systems. -J/--seed-data-threads Maximum: -c/\u0026ndash;chunks, default: 8 Number of threads for writing seed data and merging seed chunks from all batches ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation. -p/--partitions Default: 1024 Number of partitions for indexing each seed file Bigger values bring a little higher memory occupation. ► After indexing, lexicmap utils reindex-seeds can be used to reindex the seeds data with another value of this flag. --max-open-files Default: 512 Maximum number of open files It\u0026rsquo;s only used in merging indexes of multiple genome batches. 
Also see the usage of lexicmap index.\nSteps We use a small dataset for demonstration.\nPreparing the test genomes (15 bacterial genomes) in the refs directory.\nNote that the genome files contain the assembly accessions (ID) in the file names.\ngit clone https://github.com/shenwei356/LexicMap cd LexicMap/demo/ ls refs/ GCF_000006945.2.fa.gz GCF_000392875.1.fa.gz GCF_001096185.1.fa.gz GCF_002949675.1.fa.gz GCF_006742205.1.fa.gz GCF_000017205.1.fa.gz GCF_000742135.1.fa.gz GCF_001457655.1.fa.gz GCF_002950215.1.fa.gz GCF_009759685.1.fa.gz GCF_000148585.2.fa.gz GCF_001027105.1.fa.gz GCF_001544255.1.fa.gz GCF_003697165.2.fa.gz GCF_900638025.1.fa.gz Building an index with genomes from a directory.\nlexicmap index -I refs/ -O demo.lmi It would take about 6 seconds and 3 GB RAM in a 16-CPU PC.\nOptionally, we can also use a file list as the input.\n$ head -n 3 files.txt refs/GCF_000006945.2.fa.gz refs/GCF_000017205.1.fa.gz refs/GCF_000148585.2.fa.gz lexicmap index -X files.txt -O demo.lmi Click to show the log of a demo run. ... # here we set a small --batch-size 5 $ lexicmap index -I refs/ -O demo.lmi --batch-size 5 16:22:49.745 [INFO] LexicMap v0.4.0 (14c2606) 16:22:49.745 [INFO] https://github.com/shenwei356/LexicMap 16:22:49.745 [INFO] 16:22:49.745 [INFO] checking input files ... 
16:22:49.745 [INFO] 15 input file(s) given 16:22:49.745 [INFO] 16:22:49.745 [INFO] --------------------- [ main parameters ] --------------------- 16:22:49.745 [INFO] 16:22:49.745 [INFO] input and output: 16:22:49.745 [INFO] input directory: refs/ 16:22:49.745 [INFO] regular expression of input files: (?i)\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$ 16:22:49.745 [INFO] *regular expression for extracting reference name from file name: (?i)(.+)\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$ 16:22:49.745 [INFO] *regular expressions for filtering out sequences: [] 16:22:49.745 [INFO] max genome size: 15000000 16:22:49.745 [INFO] output directory: demo.lmi 16:22:49.745 [INFO] 16:22:49.745 [INFO] mask generation: 16:22:49.745 [INFO] k-mer size: 31 16:22:49.745 [INFO] number of masks: 40000 16:22:49.745 [INFO] rand seed: 1 16:22:49.745 [INFO] prefix length for checking low-complexity in mask generation: 15 16:22:49.745 [INFO] 16:22:49.745 [INFO] seed data: 16:22:49.745 [INFO] maximum sketching desert length: 450 16:22:49.745 [INFO] distance of k-mers to fill deserts: 150 16:22:49.745 [INFO] seeds data chunks: 16 16:22:49.745 [INFO] seeds data indexing partitions: 1024 16:22:49.745 [INFO] 16:22:49.745 [INFO] general: 16:22:49.745 [INFO] genome batch size: 5 16:22:49.745 [INFO] batch merge threads: 8 16:22:49.745 [INFO] 16:22:49.745 [INFO] 16:22:49.745 [INFO] --------------------- [ generating masks ] --------------------- 16:22:50.180 [INFO] 16:22:50.180 [INFO] --------------------- [ building index ] --------------------- 16:22:50.328 [INFO] 16:22:50.328 [INFO] ------------------------[ batch 1/3 ]------------------------ 16:22:50.328 [INFO] building index for batch 1 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:51.192 [INFO] writing seeds... 
16:22:51.264 [INFO] finished writing seeds in 71.756662ms 16:22:51.264 [INFO] finished building index for batch 1 in: 935.464336ms 16:22:51.264 [INFO] 16:22:51.264 [INFO] ------------------------[ batch 2/3 ]------------------------ 16:22:51.264 [INFO] building index for batch 2 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:53.126 [INFO] writing seeds... 16:22:53.212 [INFO] finished writing seeds in 86.823785ms 16:22:53.212 [INFO] finished building index for batch 2 in: 1.948770015s 16:22:53.212 [INFO] 16:22:53.212 [INFO] ------------------------[ batch 3/3 ]------------------------ 16:22:53.212 [INFO] building index for batch 3 with 5 files... processed files: 5 / 5 [======================================] ETA: 0s. done 16:22:54.350 [INFO] writing seeds... 16:22:54.437 [INFO] finished writing seeds in 87.058101ms 16:22:54.437 [INFO] finished building index for batch 3 in: 1.224414126s 16:22:54.437 [INFO] 16:22:54.437 [INFO] merging 3 indexes... 16:22:54.437 [INFO] [round 1] 16:22:54.437 [INFO] batch 1/1, merging 3 indexes to demo.lmi.tmp/r1_b1 with 8 threads... 
16:22:54.613 [INFO] [round 1] finished in 175.640164ms 16:22:54.613 [INFO] rename demo.lmi.tmp/r1_b1 to demo.lmi 16:22:54.620 [INFO] 16:22:54.620 [INFO] finished building LexicMap index from 15 files with 40000 masks in 4.875616203s 16:22:54.620 [INFO] LexicMap index saved: demo.lmi 16:22:54.620 [INFO] 16:22:54.620 [INFO] elapsed time: 4.875654824s 16:22:54.620 [INFO] Output The LexicMap index is a directory with multiple files.\nFile structure $ tree demo.lmi/ demo.lmi/ # the index directory ├── genomes # directory of genome data │ └── batch_0000 # genome data of one batch │ ├── genomes.bin # genome data file, containing genome ID, size, sequence lengths, bit-packed sequences │ └── genomes.bin.idx # index of genome data file, for fast subsequence extraction ├── seeds # seed data: pairs of k-mer and its location information (genome batch, genome number, location, strand) │ ├── chunk_000.bin # seed data file │ ├── chunk_000.bin.idx # index of seed data file, for fast seed searching and data extraction ... ... ... │ ├── chunk_015.bin # the number of chunks is set by flag `-c/--chunks`, default: #cpus │ └── chunk_015.bin.idx ├── genomes.map.bin # mapping genome ID to batch number of genome number in the batch ├── info.toml # summary of the index └── masks.bin # mask data Index size LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance.\nNote that the index size is not linear with the number of genomes, it\u0026rsquo;s sublinear. 
Because the seed data are compressed with VARINT-GB algorithm, more genome bring higher compression rates.\nDemo data # 15 genomes demo.lmi: 73.30 MB (73,297,328) 59.41 MB seeds 13.57 MB genomes 320.03 kB masks.bin 375 B genomes.map.bin 323 B info.toml GTDB repr # 85,205 genomes gtdb_repr.lmi: 228.15 GB (228,149,871,198) 156.44 GB seeds 71.71 GB genomes 2.13 MB genomes.map.bin 320.03 kB masks.bin 329 B info.toml GTDB complete # 402,538 genomes gtdb_complete.lmi: 972.85 GB (972,854,821,322) 583.10 GB seeds 389.74 GB genomes 10.06 MB genomes.map.bin 320.03 kB masks.bin 330 B info.toml Genbank\u0026#43;RefSeq # 2,340,672 genomes genbank_refseq.lmi: 5.43 TB (5,428,824,803,581) 3.04 TB seeds 2.38 TB genomes 821.17 MB kmers-m12345.tsv 58.52 MB genomes.map.bin 320.03 kB masks.bin 332 B info.toml AllTheBacteria HQ # 1,858,610 genomes atb_hq.lmi: 4.26 TB (4,261,437,129,065) 2.32 TB seeds 1.94 TB genomes 41.12 MB genomes.map.bin 320.03 kB masks.bin 332 B info.toml Directory/file sizes are counted with https://github.com/shenwei356/dirsize v1.2.1 (dirsize -k, base: 1000). Index building parameters: -k 31 -m 40000. Genome batch size: -b 5000 for GTDB datasets, -b 25000 for others. Explore the index We provide several commands to explore the index data and extract indexed subsequences:\nlexicmap utils genomes can list genome IDs of indexed genomes, see the usage and example. lexicmap utils masks can list masks of the index, see the usage and example. lexicmap utils kmers can list details of all seeds (k-mers), including reference, location(s), the strand, and the k-mer direction. see the usage and example. lexicmap utils seed-pos can help to explore the seed positions, see the usage and example. Before that, the flag --save-seed-pos needs to be added to lexicmap index. lexicmap utils subseq can extract subsequences via genome ID, sequence ID and positions, see the usage and example. 
What\u0026rsquo;s next: Searching ","description":"Terminology differences:\nOn this page and in the LexicMap command line options, the term \u0026ldquo;mask\u0026rdquo; is used, following the terminology in the LexicHash paper. In the LexicMap manuscript, however, we use \u0026ldquo;probe\u0026rdquo; as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology. Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Steps Output File structure Index size Explore the index TL;DR Prepare input files: Sequences of each reference genome should be saved in separate FASTA/Q files, with identifiers in the file names."},{"id":28,"href":"/LexicMap/tags/","title":"Tags","parent":"","content":"","description":""}]
\ No newline at end of file
diff --git a/tutorials/search/index.html b/tutorials/search/index.html
index 35bf6f4..8c10210 100644
--- a/tutorials/search/index.html
+++ b/tutorials/search/index.html
@@ -71,7 +71,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/search/",
"headline": "Step 2. Searching",
"description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.",
- "wordCount" : "2941",
+ "wordCount" : "3088",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -2089,23 +2089,41 @@
Step 2. Searching
+
LexicMap’s searching speed is related to many factors:
+
+
The number of similar sequences in the index/database. More genome hits cost more time, e.g., for a 16S rRNA gene query.
+
Similarity between query and subject sequences. Alignment of diverse sequences is slower than that of highly similar sequences.
+
The length of the query sequence. Longer queries take more time.
+
The I/O performance and load. LexicMap is I/O bound, because seed matching and extracting candidate subsequences for alignment require a large number of parallel file reads.
+
CPU frequency and the number of threads. Faster CPUs and more threads reduce the run time.
+
Here are some tips to improve the search speed.
-
Increasing the concurrency number
+
Increasing the concurrency number
-
Increasing the value of --max-open-files (default 512). You might need to
+
Increasing the value of --max-open-files (default 512). You might also need to
+>change the open files limit.
+
+
+
(If you have many queries) Increase the value of -J/--max-query-conc (default 12); this increases memory usage.
+
-
Loading the entire seed data into memory (It’s unnecessary if the index is stored on an SSD)
+
(If you have many queries) Loading the entire seed data into memory (It’s unnecessary if the index is stored on an SSD)
Setting -w/--load-whole-seeds to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.
-
Returning fewer results
+
Returning fewer results
Setting -n/--top-n-genomes to keep top N genome matches for a query (0 for all) in chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time.