PSMB_Seqan_2013_NGS_Quality_Control_Details_Features
Our application will read fastq files and collect statistical data about their content.
The input itself will not be modified. A summary document will be created to
display the results.
The application implements three areas of functionality: input, data collection, and output formatting of the results.
Each functionality is classified as Must-Have or Nice-To-Have. Implementation and testing will be done by Antje (A) or Daniel (D).
Input
The application will be able to:
- read fastq files (Must-Have, A)
- read bam files (Nice-To-Have, D)
- read compressed fastq files (Nice-To-Have, D)
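As a rough illustration of the input step, the following Python sketch reads four-line fastq records from a plain or gzip-compressed file. It is not the project's implementation: the function name `read_fastq` and the `.gz` suffix check are assumptions made for this example, records are assumed not to wrap across lines, and bam input is not covered.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each four-line fastq record."""
    # Assumption: compressed input is recognised by a ".gz" suffix.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file (or a trailing blank line)
            sequence = handle.readline().rstrip("\n")
            separator = handle.readline().rstrip("\n")
            quality = handle.readline().rstrip("\n")
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed fastq record near {header!r}")
            if len(sequence) != len(quality):
                raise ValueError(f"sequence/quality length mismatch in {header!r}")
            yield header[1:], sequence, quality
```

Because the function is a generator, the input is streamed record by record and never modified or held in memory as a whole.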
Data Collection
The following data will be collected:
- input filename and format (Must-Have, A)
- which scoring system was used (Must-Have, A)
- conversion of scoring systems (Must-Have, D)
- total number of sequences (Must-Have, A)
- overall quality score average of all bases in all sequences (Must-Have, A)
- overall GC percent of all bases in all sequences (Must-Have, A)
- per read and per position (Must-Have, A):
  - basic quality distribution data: median, mean, quantiles (10,25,75,90)
  - distribution of [A,C,G,T]
  - GC percent content
  - N Content
- for all reads (Must-Have, A):
  - mean qualities distribution (A)
  - sequence length distribution (A)
- overall sequence metrics
- duplicated sequences (Nice-To-Have, A)
- k-mer distribution (Must-Have, D)
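A few of the statistics listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the planned implementation. All function names are hypothetical. The scoring system check is a simple heuristic: quality characters with an ASCII code below 59 only occur in Phred+33 encoding, otherwise an offset of 64 is assumed. Quantiles use the nearest-rank method, which the specification does not prescribe, and bases are assumed to be upper case.

```python
from collections import Counter
from statistics import mean, median
from typing import Dict, List, Sequence


def detect_phred_offset(qualities: Sequence[str]) -> int:
    """Guess the ASCII offset: codes below 59 only occur in Phred+33."""
    lowest = min(ord(ch) for q in qualities for ch in q)
    return 33 if lowest < 59 else 64


def convert_64_to_33(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33."""
    return "".join(chr(ord(ch) - 31) for ch in quality)


def gc_percent(sequences: Sequence[str]) -> float:
    """GC percentage over all bases of all sequences."""
    total = sum(len(s) for s in sequences)
    gc = sum(s.count("G") + s.count("C") for s in sequences)
    return 100.0 * gc / total if total else 0.0


def percentile(sorted_values: List[int], p: int) -> int:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = (p * len(sorted_values) + 99) // 100  # integer ceiling
    return sorted_values[max(rank, 1) - 1]


def per_position_quality(qualities: Sequence[str], offset: int) -> List[Dict[str, float]]:
    """Mean, median and quantiles of the quality scores at each read position."""
    longest = max((len(q) for q in qualities), default=0)
    rows = []
    for pos in range(longest):
        # Only reads that are long enough contribute to this position.
        scores = sorted(ord(q[pos]) - offset for q in qualities if len(q) > pos)
        row = {"mean": mean(scores), "median": median(scores)}
        for p in (10, 25, 75, 90):
            row[f"q{p}"] = percentile(scores, p)
        rows.append(row)
    return rows


def kmer_counts(sequences: Sequence[str], k: int) -> Counter:
    """Count every overlapping k-mer in all sequences."""
    counts = Counter()
    for s in sequences:
        counts.update(s[i:i + k] for i in range(len(s) - k + 1))
    return counts
```

The per-position function keeps all scores of one position in memory at once; for large inputs a histogram of score counts per position would be the more economical choice.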
Summary Document Generation
- Primary application output will be a tab-separated text file. (A)
- The tsv file can be read by an accompanying R-script. The R-script will generate graphics from the data, and an HTML document will display them. (D)
- Document generation itself will be performed by a secondary application. (D)
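A minimal sketch of the tab-separated output, using Python's standard `csv` module. The function name `write_tsv` and the column names in the usage example are hypothetical; the specification does not fix the layout of the file.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO


def write_tsv(rows: Iterable[Dict[str, object]], columns: List[str], out: TextIO) -> None:
    """Write a header line followed by one tab-separated line per row."""
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Usage example with made-up column names, writing to an in-memory buffer.
buffer = io.StringIO()
write_tsv([{"position": 1, "mean": 30.0}], ["position", "mean"], buffer)
```

A file in this shape, with a header line and tab-separated fields, can be loaded in R with `read.delim`.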
Application Options
- force quality score system
- k-mer length
- quick analysis (randomly choose a subset of reads to analyze)
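The three options could be exposed roughly as in the following Python sketch. All option names and the default k-mer length are hypothetical. For the quick analysis, reservoir sampling is used here as one possible way to pick a random subset in a single pass over the reads; the specification only asks for a random subset and does not name a method.

```python
import argparse
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="fastq quality control (sketch)")
    parser.add_argument("input", help="fastq file to analyse")
    # All option names and defaults below are hypothetical.
    parser.add_argument("--phred-offset", type=int, choices=(33, 64), default=None,
                        help="force the quality score system instead of detecting it")
    parser.add_argument("--kmer-length", type=int, default=5, help="k-mer length")
    parser.add_argument("--quick", type=int, metavar="N", default=None,
                        help="analyse only a random subset of N reads")
    return parser


def sample_reads(reads: Iterable[T], n: int, seed: Optional[int] = None) -> List[T]:
    """Reservoir sampling: pick n reads uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, read in enumerate(reads):
        if index < n:
            reservoir.append(read)  # fill the reservoir first
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < n:
                reservoir[slot] = read
    return reservoir
```

If the input holds fewer than N reads, all of them are returned, so the quick mode degrades to a full analysis on small files.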
Bonus Bonus Bonus List
- linking with Galaxy or KNIME
- one script which starts all the others
- Graphical User Interface
Milestones
- 1st week
  - testing and implementation of a functionally minimal version that works through all steps
- 2nd week
  - testing and implementation of all basic statistics (A) and k-mer content (D)
- 3rd week
  - testing and implementation of sequence duplication (A) and output refinement (D)
- 4th week
  - buffer for surprises, testing and implementation of Nice-To-Have features (A+D)