You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is related to #74 and to some degree with #73 .
The current formats ({fastq}, {mapping}) are somewhat hard use.
Both formats leak internal information (e.g. 1:file preproc.lno10.pairs.1) and require contextual information for interpretation (ref. to line numbers in script.ngl or the filenames).
The fastq format is particularly hairy when mixing samples with a variable number of input files.
Since stats are computed per-file, you will end up with a variable number of pair and or single files. Once collect()ed, zero lines will be added to ensure alignment.
The parsing and zero-filling problem could be addressed by using a long format or some other form of tidy data.
These are popular with Python and R users that use pandas and tidyverse frameworks.
To address the information leak problem and avoid contextual information dependency, I believe an explicit function call is required. This could be use to define explicit requirements for computing stats but also to provide additional metadata.
A proposal: instead of implicitly calculating statistics on select()#65 , preprocess(), as_reads(), ... one could:
this could in turn produce outputs/qcstats.tsv (tab-delimited - visually aligned here):
sample steptype step identifier feature value
SAMPLE_A mapping Human reads mapped sam reference hg19 (if internal reference, /path otherwise)
SAMPLE_A mapping Human reads mapped sam total 8348246
SAMPLE_A mapping Human reads mapped sam aligned 5678246
SAMPLE_A mapping Human reads mapped sam file /path (/path or 'step + identifier' if exists)
SAMPLE_A mapping Human reads mapped sam ... ... (other fields in mapping output)
SAMPLE_A mapping Human reads kept sam total 5678246
SAMPLE_A mapping Human reads kept sam aligned 5678246
SAMPLE_A mapping Human reads kept sam file Human reads mapped-sam (/path or 'step + identifier' if exists)
SAMPLE_A mapping Human reads kept sam ... ...
SAMPLE_A fastq Human reads after strict QC pair.1 file Human reads kept-sam
SAMPLE_A fastq Human reads after strict QC pair.1 numSeqs 2839123
SAMPLE_A fastq Human reads after strict QC pair.2 numSeqs 2839123
SAMPLE_A fastq Human reads after strict QC single numSeqs 0 (always present even if zero or non-existing)
SAMPLE_A fastq Human reads after strict QC ... ... ...
SAMPLE_B mapping Human reads mapped sam total 6412432
SAMPLE_B mapping Human reads mapped sam aligned 4327644
SAMPLE_B mapping Human reads mapped sam ... ...
SAMPLE_B mapping Human reads kept sam total 4327644
SAMPLE_B mapping Human reads kept sam aligned 4327644
SAMPLE_B mapping Human reads kept sam ... ...
SAMPLE_B fastq Human reads after strict QC pair.1 file Human reads kept-sam
SAMPLE_B fastq Human reads after strict QC pair.1 numSeqs 0 (always present)
SAMPLE_B fastq Human reads after strict QC pair.2 numSeqs 0 (always present)
SAMPLE_B fastq Human reads after strict QC single numSeqs 4327644 (all single end)
SAMPLE_B fastq Human reads after strict QC ... ... ...
See notes on the rightmost side of above table.
Aspects not discussed:
implicit dependency between *stats() functions and qcstats({stats}) - should this be explicit too?
impact on internal hashing - dependency on user specified name= for qcstats().
The text was updated successfully, but these errors were encountered:
This is related to #74 and to some degree with #73 .
The current formats (
{fastq}
,{mapping}
) are somewhat hard use.Both formats leak internal information (e.g.
1:file preproc.lno10.pairs.1
) and require contextual information for interpretation (ref. to line numbers inscript.ngl
or the filenames).The
fastq
format is particularly hairy when mixing samples with a variable number of input files.Since stats are computed per-file, you will end up with a variable number of
pair
and orsingle
files. Oncecollect()
ed, zero lines will be added to ensure alignment.The parsing and zero-filling problem could be addressed by using a long format or some other form of tidy data.
These are popular with Python and R users that use
pandas
andtidyverse
frameworks.To address the information leak problem and avoid contextual information dependency, I believe an explicit function call is required. This could be use to define explicit requirements for computing stats but also to provide additional metadata.
A proposal: instead of implicitly calculating statistics on
select()
#65 ,preprocess()
,as_reads()
, ... one could:this could in turn produce
outputs/qcstats.tsv
(tab-delimited - visually aligned here):See notes on the rightmost side of above table.
Aspects not discussed:
*stats()
functions andqcstats({stats})
- should this be explicit too?name=
forqcstats()
.The text was updated successfully, but these errors were encountered: