From fe3d92495df728cbd817ad84493c73df378e974d Mon Sep 17 00:00:00 2001
From: Wei Shen
Date: Wed, 28 Sep 2022 09:13:27 +0800
Subject: [PATCH] v0.9.0

---
 CHANGELOG.md                     |  2 +-
 docs/database.md                 |  6 ++--
 docs/download.md                 | 49 +++++++++++++++++++++++---------
 docs/tutorial/profiling/index.md |  5 ++--
 docs/tutorial/searching/index.md | 18 ++++++------
 5 files changed, 52 insertions(+), 28 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 03eeca6..16a6772 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,6 +1,6 @@
 # Changelog

-### v0.9.0 - 2022-09-00
+### v0.9.0 - 2022-09-28

 - `compute`:
     - smaller output files and faster speed.
diff --git a/docs/database.md b/docs/database.md
index 7cc3e24..91e4eea 100644
--- a/docs/database.md
+++ b/docs/database.md
@@ -6,11 +6,11 @@ All prebuilt databases and the used reference genomes are available at:

 - [OneDrive](https://1drv.ms/u/s!Ag89cZ8NYcqtjHwpe0ND3SUEhyrp?e=QDRbEC) for *global users*.
 - [CowTransfer](https://shenwei356.cowtransfer.com/s/c7220dd5901c42) for *Chinese users and global users*.
- **Please click the "kmcp+105 more files" link to browse directories and files, and choose an indiviual file to download**.
+ **Please click the "kmcp" link to browse directories and files, and choose an individual file to download**.

Please check file integrity with `md5sum` after download the files:

-    md5sum -c gtdb.kmcp.tar.gz.md5.txt
+    md5sum -c gtdb.kmcp.tar.gz.md5.txt genbank-viral.kmcp.tar.gz.md5.txt refseq-fungi.kmcp.tar.gz.md5.txt

 **Hardware requirements**

@@ -43,7 +43,7 @@ Users can also [build custom databases](#building-custom-databases), it's simple
 |**Bacteria and Archaea**|GTDB r202 |28073+ |47894 |k=21, chunks=10;
fpr=0.3, hashes=1 |[gtdb.kmcp.tar.gz](https://1drv.ms/u/s!Ag89cZ8NYcqtkBFGpARKkdzpfAxf?e=IPQN22) (50.34 GB, [md5](https://1drv.ms/t/s!Ag89cZ8NYcqtkA8IUG1zuh2wuYCh?e=jUkUXQ)),
[CowTransfer link](https://shenwei356.cowtransfer.com/s/3426e055bee74a) ([md5](https://shenwei356.cowtransfer.com/s/a8e60e9040eb4c)) |58.03 GB |
 |**Bacteria and Archaea**|HumGut |1594+ |30691 |k=21, chunks=10;
fpr=0.3, hashes=1 |[humgut.kmcp.tar.gz](https://1drv.ms/u/s!Ag89cZ8NYcqtjUxZymOTLu1qJyDI?e=ZPWhDt) (18.77 GB, [md5](https://1drv.ms/t/s!Ag89cZ8NYcqtjUVZu1Y-Vtussvdc?e=wHlWdm)),
[CowTransfer link](https://shenwei356.cowtransfer.com/s/0b88a8ef2cff42) ([md5](https://shenwei356.cowtransfer.com/s/a04127a6bfb648)) |21.52 GB |
 |**Fungi** |Refseq r208|398 |403 |k=21, chunks=10;
fpr=0.3, hashes=1 |[refseq-fungi.kmcp.tar.gz](https://1drv.ms/u/s!Ag89cZ8NYcqtkBCf0vPMatJbSvtF?e=2jE0HH) (3.68 GB, [md5](https://1drv.ms/t/s!Ag89cZ8NYcqtkA0ZuDblb_hNJAtP?e=brrpFn)),
[CowTransfer link](https://shenwei356.cowtransfer.com/s/62e1abfa795443) ([md5](https://shenwei356.cowtransfer.com/s/09a50702304343)) |4.18 GB |
-|**Viruses** |GenBank 246|23632 |27936 |k=21, chunks=5;
fpr=0.05, hashes=1 |[genbank-viral.kmcp.tar.gz](https://1drv.ms/u/s!Ag89cZ8NYcqtkA7ofenEH6ve7va7?e=rgb5Vz) (1.25 GB, [md5](https://1drv.ms/t/s!Ag89cZ8NYcqtkAx0HPhHUSthZMxO?e=sUwaKM)),
[CowTransfer link](https://shenwei356.cowtransfer.com/s/351451ef4e6d41) ([md5](https://shenwei356.cowtransfer.com/s/e359c61253fb44))|4.72 GB |
+|**Viruses** |GenBank 246|23632 |27936 |k=21, chunks=10;
fpr=0.05, hashes=1 |[genbank-viral.kmcp.tar.gz](https://1drv.ms/u/s!Ag89cZ8NYcqtkA7ofenEH6ve7va7?e=rgb5Vz) (1.25 GB, [md5](https://1drv.ms/t/s!Ag89cZ8NYcqtkAx0HPhHUSthZMxO?e=sUwaKM)),
[CowTransfer link](https://shenwei356.cowtransfer.com/s/351451ef4e6d41) ([md5](https://shenwei356.cowtransfer.com/s/e359c61253fb44))|4.72 GB |
 |**Human** |CHM13 |1 |1 |k=21, chunks=1024;
fpr=0.3, hashes=1|[human-chm13.kmcp.tar.gz](https://1drv.ms/u/s!Ag89cZ8NYcqtjVQgKPCZ7jciZqEp?e=jAO76U) (818 MB, [md5](https://1drv.ms/t/s!Ag89cZ8NYcqtjU1nGeOJaFf70y_K?e=bzJPcE)),
[CowTransfer link](https://shenwei356.cowtransfer.com/s/07e614a36b1a4b) ([md5](https://shenwei356.cowtransfer.com/s/c91d4c98677645)) |946 MB |

 *based on NCBI taxonomy data 2021-12-06. `+` is used because some species are unclassfied xxx.
diff --git a/docs/download.md b/docs/download.md
index 842e76e..ceb07e3 100644
--- a/docs/download.md
+++ b/docs/download.md
@@ -15,27 +15,38 @@ in two packages for better searching performance.

 ## Current Version

-### v0.8.3 - 2022-08-15 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/kmcp/v0.8.3/total.svg)](https://github.com/shenwei356/kmcp/releases/tag/v0.8.3)
+### v0.9.0 - 2022-09-28 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/kmcp/v0.9.0/total.svg)](https://github.com/shenwei356/kmcp/releases/tag/v0.9.0)

-- `kmcp`: fix compiling from source for ARM architectures.[#17](https://github.com/shenwei356/kmcp/issues/17)
+- `compute`:
+    - smaller output files and faster speed.
+    - more even genome splitting.
+- `index`:
+    - faster speed due to smaller input files.
 - `search`:
-    - fix searching with paired-end reads where the read2 is shorter than the value of `--min-query-len`. [#10](https://github.com/shenwei356/kmcp/issues/10)
-    - fix the log. [#8](https://github.com/shenwei356/kmcp/issues/8)
-    - a new flag `-f/--max-fpr`: maximum false positive rate of a query (default 0.05). It reduces the unnecessary output when searching with a low minimum query coverage (`-t/--min-query-cov`).
+    - ***more accurate and smaller query FPR following Theorem 2 in SBT paper, instead of the Chernoff bound***.
+    - change the default value of `-f/--max-fpr` from 0.05 to 0.01.
+    - ***10-20% speedup***.
 - `profile`:
-    - recommend using the flag `--no-amb-corr` to disable ambiguous reads correction when >= 1000 candidates are detected.
-    - fix logging when using `--level strain` and no taxonomy given.
-
+    - ***more accurate abundance estimation using EM algorithm***.
+    - change the default value of `-f/--max-fpr` from 0.05 to 0.01.
+    - mode 0: change the default value of `-H/--min-hic-ureads-qcov` from 0.55 to 0.7.
+    - increase float width of reference coverage in KMCP profile format from 2 to 6.
+- `utils query-fpr`:
+    - compute query FPR following Theorem 2 in SBT paper, instead of the Chernoff bound.
+- new commands:
+    - `utils split-genomes` for splitting genomes into chunks.
+    - `utils ref-info` for printing information of reference (chunks), including the number of k-mers
+      and the actual false-positive rate.

 ### Links

 OS |Arch |File, 中国镜像 |Download Count
 :------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-Linux |**64-bit**|[**kmcp_linux_amd64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.8.3/kmcp_linux_amd64.tar.gz),
[中国镜像](http://app.shenwei.me/data/kmcp/kmcp_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.8.3/kmcp_linux_amd64.tar.gz)
-Linux |arm64 |[**kmcp_linux_arm64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.8.3/kmcp_linux_arm64.tar.gz),
[中国镜像](http://app.shenwei.me/data/kmcp/kmcp_linux_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.8.3/kmcp_linux_arm64.tar.gz)
-macOS |**64-bit**|[**kmcp_darwin_amd64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.8.3/kmcp_darwin_amd64.tar.gz),
[中国镜像](http://app.shenwei.me/data/kmcp/kmcp_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.8.3/kmcp_darwin_amd64.tar.gz)
-macOS |arm64 |[**kmcp_darwin_arm64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.8.3/kmcp_darwin_arm64.tar.gz),
[中国镜像](http://app.shenwei.me/data/kmcp/kmcp_darwin_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.8.3/kmcp_darwin_arm64.tar.gz)
-Windows|**64-bit**|[**kmcp_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.8.3/kmcp_windows_amd64.exe.tar.gz),
[中国镜像](http://app.shenwei.me/data/kmcp/kmcp_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.8.3/kmcp_windows_amd64.exe.tar.gz)
+Linux |**64-bit**|[**kmcp_linux_amd64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.0/kmcp_linux_amd64.tar.gz),
[中国镜像](http://app.shenwei.me/data/kmcp/kmcp_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.0/kmcp_linux_amd64.tar.gz)
+Linux |arm64 |[**kmcp_linux_arm64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.0/kmcp_linux_arm64.tar.gz),
[中国镜像](http://app.shenwei.me/data/kmcp/kmcp_linux_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.0/kmcp_linux_arm64.tar.gz)
+macOS |**64-bit**|[**kmcp_darwin_amd64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.0/kmcp_darwin_amd64.tar.gz),
[中国镜像](http://app.shenwei.me/data/kmcp/kmcp_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.0/kmcp_darwin_amd64.tar.gz)
+macOS |arm64 |[**kmcp_darwin_arm64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.0/kmcp_darwin_arm64.tar.gz),
[中国镜像](http://app.shenwei.me/data/kmcp/kmcp_darwin_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.0/kmcp_darwin_arm64.tar.gz)
+Windows|**64-bit**|[**kmcp_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.0/kmcp_windows_amd64.exe.tar.gz),
[中国镜像](http://app.shenwei.me/data/kmcp/kmcp_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.0/kmcp_windows_amd64.exe.tar.gz)

 *Notes:*

@@ -136,6 +147,18 @@ fish:

 ## Release History

+
+### v0.8.3 - 2022-08-15 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/kmcp/v0.8.3/total.svg)](https://github.com/shenwei356/kmcp/releases/tag/v0.8.3)
+
+- `kmcp`: fix compiling from source for ARM architectures.[#17](https://github.com/shenwei356/kmcp/issues/17)
+- `search`:
+    - fix searching with paired-end reads where the read2 is shorter than the value of `--min-query-len`. [#10](https://github.com/shenwei356/kmcp/issues/10)
+    - fix the log. [#8](https://github.com/shenwei356/kmcp/issues/8)
+    - a new flag `-f/--max-fpr`: maximum false positive rate of a query (default 0.05). It reduces the unnecessary output when searching with a low minimum query coverage (`-t/--min-query-cov`).
+- `profile`:
+    - recommend using the flag `--no-amb-corr` to disable ambiguous reads correction when >= 1000 candidates are detected.
+    - fix logging when using `--level strain` and no taxonomy given.
+
 ### [v0.8.2](https://github.com/shenwei356/kmcp/releases/tag/v0.8.2) - 2022-03-26 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/kmcp/v0.8.2/total.svg)](https://github.com/shenwei356/kmcp/releases/tag/v0.8.2)

 - `search`:
diff --git a/docs/tutorial/profiling/index.md b/docs/tutorial/profiling/index.md
index 05a6287..c9699a6 100644
--- a/docs/tutorial/profiling/index.md
+++ b/docs/tutorial/profiling/index.md
@@ -129,7 +129,8 @@ can use stricter criteria in `kmcp profile`.
         --min-kmers 10 \
         --min-query-len 30 \
         --min-query-cov 0.55 \
-        $read1 $read2 \
+        $read1 \
+        $read2 \
         --out-file $sample.kmcp@$dbname.tsv.gz \
         --log $sample.kmcp@$dbname.tsv.gz.log
     done
@@ -137,7 +138,7 @@ can use stricter criteria in `kmcp profile`.
     # 2. Merging search results against multiple databases
     kmcp merge $sample.kmcp@*.tsv.gz --out-file $sample.kmcp.tsv.gz

-Pair-end reads:
+Paired-end reads:

     # ---------------------------------------------------
     # paired-end
diff --git a/docs/tutorial/searching/index.md b/docs/tutorial/searching/index.md
index a0e273c..97d1f15 100644
--- a/docs/tutorial/searching/index.md
+++ b/docs/tutorial/searching/index.md
@@ -206,7 +206,7 @@ The searching process is simple and [very fast](https://bioinf.shenwei.me/kmcp/b

     kmcp search --query-whole-file -d gtdb.minhash.kmcp/ \
         --query-whole-file --sort-by jacc --min-query-cov 0.2 \
-        --query-id genomme1 contigs.fasta -o result.tsv
+        --query-id genome1 contigs.fasta -o result.tsv

 The output is in tab-delimited format:

@@ -232,34 +232,34 @@ A full search result:

 |#query |qLen |qKmers|FPR |hits|target |chunkIdx|chunks|tLen |kSize|mKmers|qCov |tCov |jacc |queryIdx|
 |:-------|:------|:-----|:----------|:---|:--------------|:-------|:-----|:------|:----|:-----|:-----|:-----|:-----|:-------|
-|genomme1|9488952|18737 |0.0000e+00 |2 |GCF_000742135.1|0 |1 |5545784|31 |8037 |0.4289|0.7365|0.3719|0 |
-|genomme1|9488952|18737 |3.1964e-183|2 |GCF_000392875.1|0 |1 |2881400|31 |3985 |0.2127|0.7062|0.1954|0 |
+|genome1 |9488952|18737 |0.0000e+00 |2 |GCF_000742135.1|0 |1 |5545784|31 |8037 |0.4289|0.7365|0.3719|0 |
+|genome1 |9488952|18737 |3.1964e-183|2 |GCF_000392875.1|0 |1 |2881400|31 |3985 |0.2127|0.7062|0.1954|0 |

 Reference IDs can be optionally mapped to their names, let's print the main columns only:

     kmcp search --query-whole-file -d gtdb.minhash.kmcp/ \
         --name-map name.map \
         --query-whole-file --sort-by jacc --min-query-cov 0.2 \
-        --query-id genomme1 contigs.fasta \
+        --query-id genome1 contigs.fasta \
         | csvtk rename -t -C $ -f 1 -n query \
         | csvtk cut -t -f query,jacc,target \
         > result.tsv

 |query |jacc |target |
 |:-------|:-----|:-----------------------------------------------------------------------------------------------|
-|genomme1|0.3719|NZ_KN046818.1 Klebsiella pneumoniae strain ATCC 13883 scaffold1, whole genome shotgun sequence |
-|genomme1|0.1954|NZ_KB944588.1 Enterococcus faecalis ATCC 19433 acAqW-supercont1.1, whole genome shotgun sequence|
+|genome1 |0.3719|NZ_KN046818.1 Klebsiella pneumoniae strain ATCC 13883 scaffold1, whole genome shotgun sequence |
+|genome1 |0.1954|NZ_KB944588.1 Enterococcus faecalis ATCC 19433 acAqW-supercont1.1, whole genome shotgun sequence|

 Using closed syncmer:

     kmcp search --query-whole-file -d gtdb.syncmer.kmcp/ \
         --name-map name.map \
         --query-whole-file --sort-by jacc --min-query-cov 0.2 \
-        --query-id genomme1 contigs.fasta \
+        --query-id genome1 contigs.fasta \
         | csvtk rename -t -C $ -f 1 -n query \
         | csvtk cut -t -f query,jacc,target

 |query |jacc |target |
 |:-------|:-----|:-----------------------------------------------------------------------------------------------|
-|genomme1|0.3712|NZ_KN046818.1 Klebsiella pneumoniae strain ATCC 13883 scaffold1, whole genome shotgun sequence |
-|genomme1|0.1974|NZ_KB944588.1 Enterococcus faecalis ATCC 19433 acAqW-supercont1.1, whole genome shotgun sequence|
+|genome1 |0.3712|NZ_KN046818.1 Klebsiella pneumoniae strain ATCC 13883 scaffold1, whole genome shotgun sequence |
+|genome1 |0.1974|NZ_KB944588.1 Enterococcus faecalis ATCC 19433 acAqW-supercont1.1, whole genome shotgun sequence|