-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
143 lines (96 loc) · 3.27 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
---
title: "README"
author: "dnanto"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
keep_md: yes
toc: yes
toc_depth: 2
---
# Abstract
**ffbio** is a collection of scripts to work with flat-file sequence databases and help complete my dissertation/projects...
Local install: ```./setup.py install```
Remote install: ```pip install git+https://github.com/dnanto/ffbio#egg=ffbio```.
Note: this repository is experimental and there are no tests yet...
Note: this is an Rmarkdown document and uses a mix of R and bash code chunks.
Note: [search field descriptions](https://www.ncbi.nlm.nih.gov/books/NBK49540/).
# Scripts
## ffbio.ffdb
The ```ffdb.py``` script searches NCBI and downloads the entire result set into a destination folder that stores the [BGZip](http://www.htslib.org/doc/bgzip.html)-compressed flat-files while the [index_db](https://biopython.org/DIST/docs/api/Bio.SeqIO-module.html#index_db) method generates a [SQLite](https://www.sqlite.org/index.html) database in the same directory and modifies the meta_data table to preserve NCBI search parameters.
```{bash}
python -m ffbio.ffdb -h
```
### Ex-1
Initialize a repository of indexed GenBank files. Search NCBI for all *Mimivirus* genomic DNA sequences.
```{bash}
python -m ffbio.ffdb data/315393 \
-term 'txid315393[PORGN] AND biomol_genomic[PROP] NOT gbdiv_pat[PROP] NOT gbdiv_syn[PROP] NOT WGS[KYWD]' \
-rettype fasta \
-retmax 100
```
List the results in the directory.
```{bash}
ls data/315393.*
```
### Ex-2
Update the same repository.
```{bash}
python -m ffbio.ffdb data/315393
```
## ffbio.ffidx
The ```ffidx.py``` script creates and/or queries an indexed set of sequence files created via the [index_db](https://biopython.org/DIST/docs/api/Bio.SeqIO-module.html#index_db) method.
```{bash}
python -m ffbio.ffidx -h
```
### Ex-1
Create an index using multiple multi-GenBank files and query some accessions.
```{bash}
ls data/oantigen.[1-2].gbk.gz | \
xargs python -m ffbio.ffidx data/oantigen.db -entry AF390573.1 GU576499.1 -fi gb -filenames | \
grep -A 2 \>
```
List the index.
```{bash}
du -sh data/oantigen.db
```
### Ex-2
Same thing, except create the index temporarily in memory.
```{bash}
ls data/oantigen.[1-2].gbk.gz | \
xargs python -m ffbio.ffidx ":memory:" -entry AF390573.1 GU576499.1 -fi gb -filenames | \
grep -A 2 \>
```
### Ex-3
Dump all sequences as GenBank records.
```{bash}
python -m ffbio.ffidx data/oantigen.db -dump -fo gb | grep ^DEFINITION
```
## ffbio.ffcds
The ```ffcds.py``` script extract all CDS records from a GenBank file.
```{bash}
python -m ffbio.ffcds -h
```
### Ex-1
Extract all CDS records.
```{bash}
gunzip -c data/oantigen.[1-2].gbk.gz | python -m ffbio.ffcds - | grep -A 2 \> | head
```
## ffbio.ffuseq
The ```ffuseq.py``` script computes the unique set of sequences and an optional tab-separated mapping file.
```{bash}
python -m ffbio.ffuseq -h
```
### Ex-1
This example creates a file stream with duplicate records. Output the unique set and a mapping file.
```{bash}
gunzip -c data/oantigen.[1-2].gbk.gz | python -m ffbio.ffuseq - -fmt gb -map data/urec.tsv > data/urec.gbk
```
```grep``` Unique record LOCUS tags.
```{bash}
grep ^LOCUS data/urec.gbk
```
```cat``` the unique record mapping file.
```{bash}
cat data/urec.tsv
```