docs: update tips for improving searching speed
shenwei356 committed Sep 24, 2024
1 parent 883b7e9 commit 2fceec9
Showing 3 changed files with 43 additions and 15 deletions.
24 changes: 17 additions & 7 deletions faqs/index.html
@@ -59,7 +59,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/faqs/",
"headline": "FAQs",
"description": "Table of contents Table of contents Does LexicMap support short reads? Does LexicMap support fungi genomes? How’s the hardware requirement? Can I extract the matched sequences? How can I extract the upstream and downstream flanking sequences of matched regions? Why isn’t the pident 100% when aligning with a sequence from the reference genomes? Why is LexicMap slow for batch searching? Does LexicMap support short reads? LexicMap is mainly designed for sequence alignment with a small number of queries (gene\/plasmid\/virus\/phage sequences) longer than 200 bp by default.",
"wordCount" : "773",
"wordCount" : "818",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -1830,21 +1830,31 @@ <h1>FAQs</h1>
<p>LexicMap is mainly designed for sequence alignment with a small number of queries against a database with a huge number (up to 17 million) of genomes.
There are some ways to improve the search speed of <code>lexicmap search</code>; a combined example follows the list.</p>
<ul>
-<li>Increasing the concurrency number
+<li><strong>Increasing the concurrency number</strong>
<ul>
-<li>Increasing the value of <code>--max-open-files</code> (default 512). You might need to <a
+<li>
+<p>Make sure that the value of <code>-j/--threads</code> (default: all available CPUs) is ≥ the number of seed chunk files (default: all available CPUs in the indexing step), which can be found in the <code>info.toml</code> file, e.g.,</p>
+<pre><code># Seeds (k-mer-value data) files
+chunks = 48
+</code></pre>
+</li>
+<li>
+<p>Increasing the value of <code>--max-open-files</code> (default 512). You might also need to <a
class="gdoc-markdown__link"
href="https://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux"
->change the open files limit</a>.</li>
-<li>(If you have many queries) Increase the value of <code>-J/--max-query-conc</code> (default 12), it will increase the memory.</li>
+>change the open files limit</a>.</p>
+</li>
+<li>
+<p>(If you have many queries) Increase the value of <code>-J/--max-query-conc</code> (default 12); this will increase memory usage.</p>
+</li>
</ul>
</li>
-<li>Loading the entire seed data into memoy (It&rsquo;s unnecessary if the index is stored in SSD)
+<li><strong>Loading the entire seed data into memory</strong> (it&rsquo;s unnecessary if the index is stored on an SSD)
<ul>
<li>Setting <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.</li>
</ul>
</li>
-<li>Returning less results
+<li><strong>Returning fewer results</strong>
<ul>
<li>Setting <code>-n/--top-n-genomes</code> to keep the top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time.</li>
</ul>
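<p>Putting these options together, a minimal sketch of a speed-tuned search; the flag values are illustrative and should match your hardware, and <code>db.lmi</code> and <code>query.fasta</code> are placeholder paths:</p>
<pre><code># threads matched to the 48 seed chunk files reported in info.toml,
# a raised open-files limit, and only the top 1000 genome hits kept
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
    --threads 48 --max-open-files 4096 --top-n-genomes 1000
</code></pre>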
2 changes: 1 addition & 1 deletion search/en.data.min.json

Large diffs are not rendered by default.

32 changes: 25 additions & 7 deletions tutorials/search/index.html
@@ -71,7 +71,7 @@
"url" : "https://bioinf.shenwei.me/LexicMap/tutorials/search/",
"headline": "Step 2. Searching",
"description": "Table of contents Table of contents TL;DR Input Hardware requirements Algorithm Parameters Improving searching speed Steps Output Alignment result relationship Output format Examples Summarizing results TL;DR Build a LexicMap index.\nRun:\nFor short queries like genes or long reads, returning top N hits.\nlexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \\ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000 For longer queries like plasmids, returning all hits.\nlexicmap search -d db.lmi query.",
"wordCount" : "2941",
"wordCount" : "3088",
"inLanguage": "en",
"isFamilyFriendly": "true",
"mainEntityOfPage": {
@@ -2089,23 +2089,41 @@ <h1>Step 2. Searching</h1>
<svg class="gdoc-icon gdoc_link"><use xlink:href="#gdoc_link"></use></svg>
</a>
</div>
+<p>LexicMap&rsquo;s searching speed is related to many factors:</p>
+<ul>
+<li><strong>The number of similar sequences in the index/database</strong>. More genome hits take more time, e.g., for 16S rRNA genes.</li>
+<li><strong>Similarity between query and subject sequences</strong>. Alignment of diverse sequences is slower than that of highly similar sequences.</li>
+<li><strong>The length of the query sequence</strong>. Longer queries take more time.</li>
+<li><strong>The I/O performance and load</strong>. LexicMap is I/O bound, because seed matching and extracting candidate subsequences for alignment require a large number of parallel file reads.</li>
+<li><strong>CPU frequency and the number of threads</strong>. Faster CPUs and more threads reduce the search time.</li>
+</ul>
<p>Here are some tips to improve the search speed; a worked example follows the list.</p>
<ul>
-<li>Increasing the concurrency number
+<li><strong>Increasing the concurrency number</strong>
<ul>
-<li>Increasing the value of <code>--max-open-files</code> (default 512). You might need to <a
+<li>
+<p>Make sure that the value of <code>-j/--threads</code> (default: all available CPUs) is ≥ the number of seed chunk files (default: all available CPUs in the indexing step), which can be found in the <code>info.toml</code> file, e.g.,</p>
+<pre><code># Seeds (k-mer-value data) files
+chunks = 48
+</code></pre>
+</li>
+<li>
+<p>Increasing the value of <code>--max-open-files</code> (default 512). You might also need to <a
class="gdoc-markdown__link"
href="https://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux"
->change the open files limit</a>.</li>
-<li>(If you have many queries) Increase the value of <code>-J/--max-query-conc</code> (default 12), it will increase the memory.</li>
+>change the open files limit</a>.</p>
+</li>
+<li>
+<p>(If you have many queries) Increase the value of <code>-J/--max-query-conc</code> (default 12); this will increase memory usage.</p>
+</li>
</ul>
</li>
-<li>Loading the entire seed data into memoy (It&rsquo;s unnecessary if the index is stored in SSD)
+<li>(If you have many queries) <strong>Loading the entire seed data into memory</strong> (it&rsquo;s unnecessary if the index is stored on an SSD)
<ul>
<li>Setting <code>-w/--load-whole-seeds</code> to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.</li>
</ul>
</li>
-<li>Returning less results
+<li><strong>Returning fewer results</strong>
<ul>
<li>Setting <code>-n/--top-n-genomes</code> to keep the top N genome matches for a query (0 for all) in the chaining phase. For queries with a large number of genome hits, a reasonable value such as 1000 would reduce the computation time.</li>
</ul>
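<p>For example, a minimal sketch of a run that raises the open-files limit and loads the whole seed data into memory (assuming sufficient RAM; the limit value and the paths <code>db.lmi</code> and <code>query.fasta</code> are placeholders):</p>
<pre><code># raise the per-process open-files limit for the current shell
ulimit -n 4096
# load all seed data into memory to avoid random disk reads;
# expect ~260 GB of RAM for ~85,000 GTDB representative genomes
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
    --load-whole-seeds
</code></pre>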
