PSMB_Seqan_2013_NGS_Quality_Control_Details_Features

Our application will read fastq files and collect statistical data about their content. The input itself will not be modified. A summary document will be created to display the results.

The application implements 3 areas of functionality: input, data collection, and formatting of the results output. Each feature is classified as Must-Have or Nice-To-Have. Implementation and testing will be done by Antje (A) or Daniel (D); the letter after each feature marks who is responsible for it.

Input

The application will be able to
  1. read fastq files Must-Have A
  2. read bam files Nice-To-Have D
  3. read compressed fastq files Nice-To-Have D
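The first and third input items can be sketched as follows. This is a minimal illustration in Python, not the project's implementation. It assumes four-line FASTQ records (no wrapped sequence lines) and gzip as the compression format. The function names `open_maybe_gzip` and `read_fastq` are hypothetical. BAM input is not covered.

```python
import gzip
import io


def open_maybe_gzip(path):
    """Open a text file, decompressing it if it starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in: " + header.strip())
        yield header[1:].strip(), seq, qual


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n")
    for name, seq, qual in read_fastq(demo):
        print(name, seq, qual)
```

Checking the magic bytes rather than the file extension means that a compressed file with an unexpected name is still read correctly.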

Data Collection

The following data will be collected:
  1. input filename and format Must-Have A
  2. which scoring system was used Must-Have A
  3. conversion of scoring systems Must-Have D
  4. total number of sequences Must-Have A
  5. overall quality score average of all bases in all sequences Must-Have A
  6. overall GC percent of all bases in all sequences Must-Have A
  7. per read and per position: Must-Have A
    • basic quality distribution data: median, mean, quantiles (10,25,75,90)
    • distribution of [A,C,G,T]
    • GC percent content
    • N Content
  8. for all reads: Must-Have A
    • mean qualities distribution A
    • sequence length distribution A
  9. overall sequence metrics
    • duplicated sequences Nice-To-Have A
    • k-mer distribution Must-Have D
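Several of the items above can be illustrated with a short Python sketch. All function names are hypothetical, and the choices made here are assumptions rather than decisions taken from this specification:

- The scoring system guess uses a common heuristic: characters below ASCII 59 cannot occur in a +64 encoding, so they imply Phred+33. Anything else is assumed to be +64. This can misclassify high-quality Phred+33 data, which is one reason for an option to force the scoring system.
- The quantiles use the nearest-rank method.
- The GC percent counts every character, including N, in the denominator.

```python
from collections import Counter
from statistics import mean, median


def guess_phred_offset(qualities):
    """Guess the ASCII offset of the quality encoding (heuristic, see lead-in)."""
    lowest = min(min(map(ord, q)) for q in qualities if q)
    return 33 if lowest < 59 else 64


def gc_percent(sequences):
    """Percentage of G and C among all characters of all sequences."""
    counts = Counter()
    for s in sequences:
        counts.update(s.upper())
    total = sum(counts.values())
    return 100.0 * (counts["G"] + counts["C"]) / total if total else 0.0


def percentile(sorted_values, p):
    """Nearest-rank percentile of a non-empty sorted list; p is an integer in 1..100."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def per_position_quality(qualities, offset):
    """Return one dict of quality statistics per read position."""
    columns = []
    for q in qualities:
        for i, ch in enumerate(q):
            if i == len(columns):
                columns.append([])
            columns[i].append(ord(ch) - offset)
    stats = []
    for col in columns:
        col.sort()
        row = {"mean": mean(col), "median": median(col)}
        for p in (10, 25, 75, 90):
            row["q%d" % p] = percentile(col, p)
        stats.append(row)
    return stats


def kmer_counts(sequences, k):
    """Count all overlapping k-mers across the sequences."""
    counts = Counter()
    for s in sequences:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts
```

The per-position function keeps every score in memory, which is acceptable for a sketch. For large inputs, a per-position histogram of score counts would give the same statistics with bounded memory.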

Summary Document Generation

  • Primary application output will be a tab-separated text file. A
  • The tsv file can be read by an accompanying R script. The R script and an HTML document will present the graphics that R generates from the data. D
  • Document generation itself will be performed by a secondary application. D
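As an illustration of the primary output, the following Python sketch writes per-position statistics as tab-separated text with a header row. The column names (`position`, `mean`, `median`) and the function name `write_tsv` are hypothetical. The actual column layout is not fixed by this specification.

```python
import csv
import io


def write_tsv(rows, fieldnames, handle):
    """Write a list of dicts as tab-separated text with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buf = io.StringIO()
    write_tsv(
        [{"position": 1, "mean": 38.5, "median": 40}],
        ["position", "mean", "median"],
        buf,
    )
    print(buf.getvalue(), end="")
```

A file in this shape, with one header line and tab-delimited columns, should be loadable in R with `read.delim`, which keeps the R script independent of how the statistics were computed.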

Application Options

  • force quality score system
  • k-mer length
  • quick analysis (randomly choose a subset of reads to analyze)
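One way to choose a random subset of reads in a single pass, without knowing the number of reads in advance, is reservoir sampling. The Python sketch below is an assumption about how the quick analysis option could work, not a description of the planned implementation. The name `reservoir_sample` and its `seed` parameter are hypothetical.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec  # replace with probability k / (i + 1)
    return sample


if __name__ == "__main__":
    print(reservoir_sample(range(1000), 5, seed=42))
```

Because the input is only iterated once, this works directly on a stream of FASTQ records and needs memory only for the k reads kept.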

Bonus Bonus Bonus List

  • linking with Galaxy or KNIME
  • One script which starts all the others
  • Graphical User Interface

Milestones

1st week
testing and implementation of a functionally minimal version that works through all steps
2nd week
testing and implementation of all basic statistics (A) and k-mer content (D)
3rd week
testing and implementation of sequence duplication (A) and output refinement (D)
4th week
buffer for surprises, testing and implementation of NICE-TO-HAVE features (A+D)

 