---
title: "README"
author: "dnanto"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output: 
  html_document: 
    keep_md: yes
    toc: yes
    toc_depth: 2
---

# Abstract

**ffbio** is a collection of scripts for working with flat-file sequence databases, written to support my dissertation and related projects.

Local install: ```./setup.py install```

Remote install: ```pip install git+https://github.com/dnanto/ffbio#egg=ffbio```

Note: this repository is experimental and has no tests yet.

Note: this is an R Markdown document that uses a mix of R and bash code chunks.

Note: the fields used in the search terms are documented in the NCBI [search field descriptions](https://www.ncbi.nlm.nih.gov/books/NBK49540/).

# Scripts

## ffbio.ffdb

The ```ffdb.py``` script searches NCBI and downloads the entire result set into a destination folder as [BGZip](http://www.htslib.org/doc/bgzip.html)-compressed flat files. The [index_db](https://biopython.org/DIST/docs/api/Bio.SeqIO-module.html#index_db) method then generates a [SQLite](https://www.sqlite.org/index.html) database in the same directory, and the script modifies its ```meta_data``` table to preserve the NCBI search parameters.

```{bash}
python -m ffbio.ffdb -h
```

### Ex-1

Initialize a repository of indexed sequence files: search NCBI for all *Mimivirus* genomic DNA sequences and download them in FASTA format (```-rettype fasta```).

```{bash}
python -m ffbio.ffdb data/315393 \
  -term 'txid315393[PORGN] AND biomol_genomic[PROP] NOT gbdiv_pat[PROP] NOT gbdiv_syn[PROP] NOT WGS[KYWD]' \
  -rettype fasta \
  -retmax 100
```
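
The search above maps naturally onto NCBI's E-utilities. As a rough sketch of what such a request might look like (not necessarily how ```ffdb.py``` issues it), the snippet below builds an ```esearch``` URL with the standard library only. The ```nuccore``` database and the ```build_esearch_url``` helper are assumptions made for illustration, and no network request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL of the NCBI E-utilities esearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", retmax=100):
    """Return an esearch URL for the given query (hypothetical helper)."""
    # usehistory=y asks NCBI to keep the result set on its history server
    # so that it can be fetched in batches afterwards.
    params = {"db": db, "term": term, "retmax": retmax, "usehistory": "y"}
    return f"{ESEARCH}?{urlencode(params)}"


term = (
    "txid315393[PORGN] AND biomol_genomic[PROP] "
    "NOT gbdiv_pat[PROP] NOT gbdiv_syn[PROP] NOT WGS[KYWD]"
)
url = build_esearch_url(term)

# Round-trip the query string to confirm the term survives URL encoding.
query = parse_qs(urlsplit(url).query)
print(url)
```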

List the results in the directory.

```{bash}
ls data/315393.*
```

### Ex-2

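Update the same repository. No search arguments are given, since the parameters were preserved in the database when it was initialized.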

```{bash}
python -m ffbio.ffdb data/315393
```
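
The idea of keeping the search parameters next to the index can be sketched with the standard ```sqlite3``` module. This is a minimal illustration, not the code in ```ffdb.py```: it assumes a two-column key/value ```meta_data``` table like the one ```index_db``` creates, and the helper names and the keys ```term```, ```rettype```, and ```retmax``` are hypothetical.

```python
import sqlite3

# Assumption: a key/value table named meta_data, similar to the one that
# Biopython's index_db creates inside its SQLite index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE meta_data (key TEXT PRIMARY KEY, value TEXT)")


def save_params(con, params):
    """Store search parameters as key/value rows (hypothetical helper)."""
    with con:
        con.executemany(
            "INSERT OR REPLACE INTO meta_data (key, value) VALUES (?, ?)",
            [(key, str(val)) for key, val in params.items()],
        )


def load_params(con, keys):
    """Read back the requested keys, skipping any that are missing."""
    marks = ",".join("?" * len(keys))
    rows = con.execute(
        f"SELECT key, value FROM meta_data WHERE key IN ({marks})", list(keys)
    )
    return dict(rows)


# First run: the user supplies the search parameters.
save_params(con, {"term": "txid315393[PORGN]", "rettype": "fasta", "retmax": 100})

# Later run: no arguments are given, so the stored parameters are reused.
params = load_params(con, ("term", "rettype", "retmax"))
print(params)
```

Values come back as text because the table stores them that way, so a caller would need to convert numeric parameters such as ```retmax``` before use.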

## ffbio.ffidx

The ```ffidx.py``` script creates and/or queries an index over a set of sequence files using the [index_db](https://biopython.org/DIST/docs/api/Bio.SeqIO-module.html#index_db) method.

```{bash}
python -m ffbio.ffidx -h
```

### Ex-1

Create an index from multiple multi-record GenBank files and query two accessions.

```{bash}
ls data/oantigen.[1-2].gbk.gz | \
  xargs python -m ffbio.ffidx data/oantigen.db -entry AF390573.1 GU576499.1 -fi gb -filenames | \
  grep -A 2 \>
```

Show the size of the index.

```{bash}
du -sh data/oantigen.db
```

### Ex-2

Repeat Ex-1, but create the index temporarily in memory.

```{bash}
ls data/oantigen.[1-2].gbk.gz | \
  xargs python -m ffbio.ffidx ":memory:" -entry AF390573.1 GU576499.1 -fi gb -filenames | \
  grep -A 2 \>
```
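
The examples above rely on an index that maps each record key to its position in a file, so a single record can be read without parsing everything before it. The sketch below illustrates that idea with the standard library only. It is a simplified stand-in, not Biopython's implementation: it handles plain-text FASTA rather than BGZip-compressed GenBank, the table layout is an assumption modeled on ```index_db```, and ```build_index``` and ```fetch``` are hypothetical helpers.

```python
import os
import sqlite3
import tempfile


def build_index(con, paths):
    """Record the byte offset and length of every FASTA record."""
    con.execute("CREATE TABLE file_data (file_number INTEGER, name TEXT)")
    con.execute(
        "CREATE TABLE offset_data "
        "(key TEXT PRIMARY KEY, file_number INTEGER, offset INTEGER, length INTEGER)"
    )
    for number, path in enumerate(paths):
        con.execute("INSERT INTO file_data VALUES (?, ?)", (number, path))
        rows, key, start, pos = [], None, 0, 0
        with open(path, "rb") as handle:
            for line in handle:
                if line.startswith(b">"):
                    if key is not None:
                        rows.append((key, number, start, pos - start))
                    # The record key is the first word after ">".
                    key, start = line[1:].split()[0].decode(), pos
                pos += len(line)
        if key is not None:
            rows.append((key, number, start, pos - start))
        con.executemany("INSERT INTO offset_data VALUES (?, ?, ?, ?)", rows)
    con.commit()


def fetch(con, key):
    """Seek straight to one record without parsing the rest of the file."""
    row = con.execute(
        "SELECT name, offset, length FROM offset_data "
        "JOIN file_data USING (file_number) WHERE key = ?",
        (key,),
    ).fetchone()
    if row is None:
        raise KeyError(key)
    name, offset, length = row
    with open(name, "rb") as handle:
        handle.seek(offset)
        return handle.read(length).decode()


# Demo with two small throwaway files. ":memory:" keeps the index transient,
# while a file path would persist it between runs.
tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate([">a one\nACGT\n>b two\nGGCC\nTT\n", ">c three\nAAAA\n"]):
    path = os.path.join(tmpdir, f"part{i}.fasta")
    with open(path, "w", newline="\n") as handle:
        handle.write(text)
    paths.append(path)

con = sqlite3.connect(":memory:")
build_index(con, paths)
record_b = fetch(con, "b")
record_c = fetch(con, "c")
print(record_b, record_c, sep="")
```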

### Ex-3

Dump all sequences as GenBank records.

```{bash}
python -m ffbio.ffidx data/oantigen.db -dump -fo gb | grep ^DEFINITION
```

## ffbio.ffcds

The ```ffcds.py``` script extracts all CDS records from a GenBank file.

```{bash}
python -m ffbio.ffcds -h
```

### Ex-1

Extract all CDS records.

```{bash}
gunzip -c data/oantigen.[1-2].gbk.gz | python -m ffbio.ffcds - | grep -A 2 \> | head
```
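
Extracting a CDS comes down to slicing the parent sequence at the feature's location and, for features on the reverse strand, taking the reverse complement. The sketch below shows that step on a toy sequence with the standard library only. It is an illustration under stated assumptions rather than the logic in ```ffcds.py```: the ```extract``` helper is hypothetical and only understands plain ranges, ```join(...)```, and an outer ```complement(...)```.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract(seq, location):
    """Slice `seq` according to a simple GenBank location string.

    Handles plain ranges, join(...), and an outer complement(...).
    Partial markers (< and >) are ignored; anything else is out of scope.
    """
    # GenBank coordinates are 1-based and inclusive.
    parts = [
        seq[int(start) - 1 : int(end)]
        for start, end in re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    ]
    sub = "".join(parts)
    return reverse_complement(sub) if location.startswith("complement(") else sub


# Toy sequence: a forward-strand ORF at 3..11 and a reverse-strand ORF at 14..22.
genome = "AA" + "ATGCCCTAA" + "GG" + "TTATTTCAT" + "TT"
forward = extract(genome, "3..11")
spliced = extract(genome, "join(3..5,9..11)")
reverse = extract(genome, "complement(14..22)")
print(forward, spliced, reverse)
```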

## ffbio.ffuseq

The ```ffuseq.py``` script computes the unique set of sequences and optionally writes a tab-separated mapping file.

```{bash}
python -m ffbio.ffuseq -h
```

### Ex-1

Create a file stream that contains duplicate records, then output the unique set and a mapping file.

```{bash}
gunzip -c data/oantigen.[1-2].gbk.gz | python -m ffbio.ffuseq - -fmt gb -map data/urec.tsv > data/urec.gbk
```

```grep``` the LOCUS lines of the unique records.

```{bash}
grep ^LOCUS data/urec.gbk
```

```cat``` the unique record mapping file.

```{bash}
cat data/urec.tsv
```
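
The deduplication shown in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in ```ffuseq.py```: it assumes records are compared by exact sequence, that the first record seen becomes the representative, and that the mapping lists one representative/duplicate pair per line. The ```unique_records``` helper and the record identifiers are hypothetical.

```python
def unique_records(records):
    """Collapse records that share an identical sequence.

    `records` is an iterable of (identifier, sequence) pairs. Returns the
    unique pairs in first-seen order plus a mapping from each representative
    identifier to the identifiers of its duplicates.
    """
    first_seen = {}  # sequence -> representative identifier
    mapping = {}  # representative identifier -> duplicate identifiers
    for identifier, sequence in records:
        if sequence in first_seen:
            mapping[first_seen[sequence]].append(identifier)
        else:
            first_seen[sequence] = identifier
            mapping[identifier] = []
    unique = [(identifier, sequence) for sequence, identifier in first_seen.items()]
    return unique, mapping


records = [("r1", "ACGT"), ("r2", "GGCC"), ("r3", "ACGT"), ("r4", "ACGT")]
unique, mapping = unique_records(records)

# Assumed layout: one tab-separated "representative, duplicate" pair per line.
tsv = "\n".join(f"{rep}\t{dup}" for rep, dups in mapping.items() for dup in dups)
print(unique)
print(tsv)
```

Keying the dictionary on the sequence itself keeps the sketch simple. For large inputs, a digest of the sequence could serve as the key instead to reduce memory use.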