rkmh: Read Classification by MinHash
Welcome to the rkmh wiki!
For questions and comments, please post an issue. You might also consider contacting me via email, though I prefer that discussion happen on GitHub so that open development works as designed.
rkmh should be very easy to build as long as your compiler supports C++11 (clang 3.8 or newer; gcc 4.9 or newer).
git clone --recursive https://github.com/edawson/rkmh
cd rkmh
make
This should build the backing mkmh, murmurhash3 and kseq_reader libraries and produce the rkmh executable.
Running
./rkmh
or
./rkmh -h
should give you a list of subcommands and their descriptions. The currently available subcommands are:
- hash - Generate the 64-bit hashes of the input sequences. Optionally, rkmh can be told to not hash and just output the kmers.
- stream - Compare a set of references and reads and return a file which maps from a read name to the reference it most resembles.
- filter - Given a set of reads, a set of references and a threshold N for the minimum number of matches, return all the reads that share at least N hashes with any reference.
- call - Call SNVs against a reference. While multiple references are permitted we don't recommend it at the moment.
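The ideas behind hash, stream and filter can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not rkmh's implementation: the function names, the kmer size and the sketch size are all hypothetical, and blake2b from the standard library stands in for the murmurhash3 library that rkmh builds on.

```python
import hashlib

def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def hash64(kmer):
    """64-bit hash of a kmer (blake2b is a stand-in, not rkmh's hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(seq, k, size):
    """Bottom-`size` MinHash sketch: the `size` smallest distinct kmer hashes."""
    return set(sorted({hash64(km) for km in kmers(seq, k)})[:size])

def best_reference(read, ref_sketches, k, size):
    """Like `stream`: name the reference sharing the most hashes with the read."""
    read_sketch = sketch(read, k, size)
    name = max(ref_sketches, key=lambda n: len(read_sketch & ref_sketches[n]))
    return name, len(read_sketch & ref_sketches[name])

def filter_reads(reads, ref_sketches, min_matches, k, size):
    """Like `filter`: keep reads sharing >= min_matches hashes with any reference."""
    kept = []
    for name, seq in reads.items():
        read_sketch = sketch(seq, k, size)
        if any(len(read_sketch & s) >= min_matches for s in ref_sketches.values()):
            kept.append(name)
    return kept

# Toy data (hypothetical): r1 is cut from refA, r2 from refB, r3 from neither.
K, SIZE = 4, 1000
refs = {"refA": "ACGTACGTTGCAAGCT", "refB": "TTTTGGGGCCCCAAAATTGG"}
reads = {"r1": "ACGTACGTTGCA", "r2": "GGGGCCCCAAAA", "r3": "AGAGAGAGAGAG"}
ref_sketches = {name: sketch(seq, K, SIZE) for name, seq in refs.items()}
```

With this toy data, `best_reference(reads["r1"], ref_sketches, K, SIZE)` picks `refA`, and `filter_reads(reads, ref_sketches, 5, K, SIZE)` keeps `r1` and `r2` but drops `r3`, which shares no kmers with either reference.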
Soon to be deprecated:
- classify - Do the same as stream, but require exact counts when using minimum / maximum occurrence filters (i.e. use a std::map instead of a lock-free hash table that permits collisions). Collision rates tend to be low, and we plan to give stream the option to do this soon.
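To see why exact counts can matter for occurrence filters, here is a minimal Python sketch contrasting an exact map with a fixed-size table that lets colliding hashes share a slot. It models only the collision behaviour, not the lock-free aspect; the table layout, the slot count and the function names are assumptions made for illustration and are not taken from rkmh's source.

```python
from collections import Counter

def exact_counts(hashes):
    """Exact occurrence counts, one entry per distinct hash (like a std::map)."""
    return Counter(hashes)

def approx_table(hashes, n_slots):
    """Fixed-size counting table: hashes that collide share one counter."""
    table = [0] * n_slots
    for h in hashes:
        table[h % n_slots] += 1
    return table

def approx_count(table, h):
    """Look up the (possibly inflated) count for hash h."""
    return table[h % len(table)]

def passing(hashes, count_of, min_occ):
    """Distinct hashes whose reported count is at least min_occ."""
    return sorted(h for h in set(hashes) if count_of(h) >= min_occ)

# Toy hashes (hypothetical): 3 and 11 collide in an 8-slot table.
hashes = [3, 3, 11, 20]
exact = exact_counts(hashes)
table = approx_table(hashes, 8)
exact_pass = passing(hashes, lambda h: exact[h], 2)
approx_pass = passing(hashes, lambda h: approx_count(table, h), 2)
```

Here the exact map lets only hash 3 through a minimum-occurrence filter of 2, while the colliding table also lets 11 through because it shares a slot with 3. With a realistically large table such collisions are rare, which matches the note above that collision rates tend to be low.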
Coming soon:
Consider the following functions experimental, unstable, and as yet unsupported (but with support coming in the next few months):
- count - Count the number of times a kmer occurs in a query file, and return a two-column file mapping each kmer to its number of occurrences. We'd recommend using Jellyfish for this if your genomes are larger than a few kb, as it's a fantastic kmer counter.
- search - Given a list of kmers or hashes, find reads in the query set that contain them. Much like filter but designed as a step in a different workflow (stay tuned).
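As a rough picture of what count and search are meant to do, the following Python sketch counts kmers into two-column lines and finds reads containing any of a set of target kmers. Since these subcommands are still experimental, everything here is an assumption: the tab delimiter, the sorted output order, the "any target" matching rule and the function names are all hypothetical.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Count every length-k substring across all sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def two_column(counts):
    """Format counts as 'kmer<TAB>count' lines, sorted by kmer."""
    return [f"{kmer}\t{n}" for kmer, n in sorted(counts.items())]

def search(reads, targets, k):
    """Names of reads that contain at least one of the target kmers."""
    hits = []
    for name, seq in reads.items():
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        if present & targets:
            hits.append(name)
    return hits

# Toy data (hypothetical).
counts = count_kmers(["ACGTACGT"], 4)
lines = two_column(counts)
hits = search({"r1": "ACGTACGT", "r2": "TTTTTTTT"}, {"GTAC"}, 4)
```

In this example `ACGT` occurs twice and the other three kmers once each, and only `r1` contains the target kmer `GTAC`. A plain in-memory counter like this is fine for sequences of a few kb; for anything larger, a dedicated counter such as Jellyfish is the better choice, as noted above.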