MicrobeDB ChangeLog

  • Documentation greatly improved

  • Merged the Mac installation instructions into the general INSTALL documentation.

  • Added two additional example scripts for using the MicrobeDB API (get_genome_sizes_of_aquatic_genomes.pl & search_for_pathogen_recA_genes.pl)

  • Added the ability to directly retrieve genomeproject and replicon objects from a gene object when using the MicrobeDB API.

  • Several MySQL columns have been indexed to improve query time on these fields (e.g. gene_name, locus_tag, rep_accnum, etc.).

  • Users can choose to download only a subset of all genomes from NCBI using a search string at the genus and species level (see the -s option in download_version.pl). This works for both complete and incomplete genomes.

  • Added support for incomplete/draft genomes from NCBI (see the -i and -o options in download_version.pl), including downloading the incomplete genome metadata table. Also, unpack_version.pl has been adapted to properly unpack incomplete genomes (these are .tgz files WITHIN genome directories).

### v0.2 ###

  • MySQL database name can be set by the user using the environment variable 'MicrobeDB'

  • Added a write_fasta() routine to Replicon.pm, which writes a FASTA file and accepts a formatting string to set the sequence header format.

  • Moved the symlink change in download_load...pl to the END, so that if the script dies partway through it doesn't leave a /Bacteria symlink pointing at an incomplete MicrobeDB download/version.

  • Added the Aspera binary for Mac OS and made download_version.pl aware of it.

  • Fixed a bug in load_version.pl triggered by sub-directories that do not contain genomes.

  • All scripts in /scripts directory have proper POD documentation. Access via --help option or perldoc.

  • A script under 'information/UPDATE/' allows users to check whether their microbedb schema matches their MicrobeDB code. 'information/DEVELOPERS_ONLY' contains a script to run after making schema changes; it updates the schema version number and generates the sql diff file.

  • Updated Versions.pm slightly, adding calls such as isvalid(), which returns whether a given version number is valid in the database. This is a good test when doing diffs between versions: check that the version you're trying to update from is still loaded!

  • Added microbedb_meta table to schema file and update sql file

  • Changed gp_id from Project id to RefSeq project id; it only seems to be actually referenced in about 2 locations, and one array index also needed changing.

  • Parsing of files is now all done within the new Parse.pm module (a proper OO module that follows the rest of the MicrobeDB code). Will's old NCBI2hash.pm module is gone. Also, the parsing has been simplified so that everything comes from Genbank files and the two NCBI special table files. This makes the parsing code more robust and much easier to maintain and update.

  • Users can now easily add their own unpublished genomes. They just put their genbank files for each genome in a directory and load them using the --custom option. The only fields not filled are the information about the organism such as gram stain, pathogenicity, habitat, etc. Things like GC% and size of genome are calculated from the genbank files. Also, taxon information is retrieved for these custom genomes if a taxon id is given in the genbank file. This is a nice feature since many labs have their own unpublished genomes, and is the main reason I started changing microbedb since I have a comparative genomics project that I wanted MicrobeDB to help me organize.

  • Somewhat related to the custom genomes, I have added rep_type='contig' (only plasmid and chromosome were allowed before). This allows unfinished genomes to be added to MicrobeDB. Also, I added fields to the GenomeProject table class for the number of chromosomes, plasmids and contigs. Use "information/INSTALL/update_microbedb.sql" to update your MySQL schema.

  • Unpacking and loading of genomes can be parallelized by giving the -p option. Used by itself, the option detects the number of CPUs and uses all of them; given a number, it limits itself to that many. This makes the whole update process much faster. Users are now required to have the perl modules Parallel::ForkManager and Sys::CPU installed.

  • Parsing now works on gzipped genbank files. This is a nice feature if disk space is an issue, or in the future when the number of genomes grows even larger.

  • In theory, parsing would work on EMBL files as well (since I am using BioPerl), but this has not been tested.

  • Scripts in the 'scripts/' directory have been updated to use proper options. Also, scripts have been added to allow manipulation at the 'genomeproject' level and not just for entire versions (e.g. load_genome.pl, delete_genome.pl, and reload_genome.pl). This makes adding custom genomes easier as well as fixing 'problematic' genomes from NCBI without having to re-load entire versions.
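
The -p behaviour described in the list above (no number: use every detected CPU; a number: limit to that many workers) can be sketched roughly as follows. This is an illustrative Python sketch, not MicrobeDB's Perl code: `resolve_workers`, `process_genome`, and `process_all` are hypothetical names, and a thread pool stands in for the forked processes that Parallel::ForkManager would create.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(requested=None):
    """Pick a worker count: all detected CPUs by default, else the requested number."""
    if requested is None:
        # Option given with no number: use every CPU the system reports.
        return os.cpu_count() or 1
    if requested < 1:
        raise ValueError("worker count must be at least 1")
    return requested


def process_genome(name):
    # Hypothetical placeholder for per-genome work (unpacking or loading).
    return f"processed {name}"


def process_all(genomes, requested=None):
    """Run the per-genome work across a pool of workers, preserving input order."""
    workers = resolve_workers(requested)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_genome, genomes))
```

Calling `process_all(["genome_a", "genome_b"], 2)` would run the placeholder on both names with at most two workers; omitting the second argument uses every detected CPU.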

### v0.1 ###

  • Matthew started using logging (Log::Log4perl now required) and I embraced this by adding logging in more places. Also, logging is output to screen at the 'info' level and above, while logging for all categories ('debug' and up) is output to a log file.

  • Aspera is used for download by default instead of FTP, allowing faster downloads.
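
The split logging setup described in the list above ('info' and above to the screen, 'debug' and above to a log file) can be illustrated with a small sketch. It uses Python's standard `logging` module as an analogue of Log::Log4perl and is not MicrobeDB's actual configuration; `build_logger`, the logger name, and the message format are assumptions made for illustration.

```python
import logging
import sys


def build_logger(log_path, name="microbedb_sketch"):
    """Create a logger that writes INFO+ to the screen and DEBUG+ to a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)   # let every record reach the handlers
    logger.handlers.clear()          # avoid duplicate handlers on repeated calls
    logger.propagate = False

    screen = logging.StreamHandler(sys.stderr)
    screen.setLevel(logging.INFO)    # screen: 'info' level and above

    logfile = logging.FileHandler(log_path)
    logfile.setLevel(logging.DEBUG)  # file: everything from 'debug' up

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (screen, logfile):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

With this setup, a `logger.debug(...)` call appears only in the log file, while `logger.info(...)` appears in both places.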