You are here: ABI » ThesesHome » MultiSplitMappingNGS

Multi-Split-Mapping of NGS reads for variant detection

Student
Kathrin Trappe

Academic Advisor
Prof. Dr. Knut Reinert, Anne-Katrin Emde

Expose

The goal of this thesis is to design, implement and test a “multi-split-read-mapping” method that
  1. generates local matches of reads using STELLAR,
  2. chains these local matches into “multi-split-matches”, localizing optimal junction positions and
  3. outputs all multi-split-matches, i.e., chains, that obtain a certain minimum score.
Conceptually, the method has two stages, first detecting regions of sufficiently many local hits and second, combining those local hits in an application dependent fashion. Paying attention to a generic implementation, the developed algorithms should be easily adaptable for applications in 1) structural variant detection in genomic resequencing data, 2) exon-exon-junction discovery in RNA-Seq reads, and 3) identification of methylated DNA in bisulfite-treated genomic DNA.

In this thesis, the initial focus will be on structural variant detection, but the method should be implemented in a generic way such that it can be easily extended in possible future work. The method is meant to be tested on simulated genomic data - where reads were simulated from a manipulated reference containing structural variants and are mapped onto the unmanipulated reference sequence - and/or on real transcriptomic sequencing data where we expect to see reads confirming known gene fusions.

The main steps of the algorithm will be:

  1. generate local matches for all reads
    • using STELLAR
    • optional: using external matches (code should be generic enough such that external (STELLAR) matches can be imported from a file)
  2. generate possible chains for each read
    • identify compatible local matches
    • merge into chains (possibly already discarding improbable chains)
  3. compute optimal breakpoint(s)
  4. assign a score to each chain
  5. rank and output all or up to x best chains per read
    • optional: upweight scores of chains that are supported by multiple reads

The output should be inspected with a simple clustering method to detect structural variants supported by multiple read chains.

Literature

[0] MultiSplitMapping

[1] Trapnell, Cole, Lior Pachter, and Steven L Salzberg. 2009. “TopHat: discovering splice junctions with RNA-Seq.” Bioinformatics (Oxford, England) 25 (9) (April 30): 1105-1111. doi:10.1093/bioinformatics/btp120.

[2] Maher, Christopher A, Chandan Kumar-Sinha, Xuhong Cao, Shanker Kalyana-Sundaram, Bo Han, Xiaojun Jing, Lee Sam, Terrence Barrette, Nallasivam Palanisamy, and Arul M Chinnaiyan. 2009. “Transcriptome sequencing to detect gene fusions in cancer.” Nature 457 (7234) (February 22): 97-101. doi:10.1038/nature07638.

[3] Wu, T, and C Watanabe. 2005. “GMAP: a genomic mapping and alignment program for mRNA and EST sequences.” Bioinformatics (Oxford, England).

[4] Alkan, Can, Coe, Bradley P., and Eichler, Evan E. 2011. “Genome structural variation discovery and genotyping.” Nature Rev. Gent. 12, 363-376 (May 2011). doi:10.1038/nrg2958

[#] PLEASE ADD

Work Plan

June:

  1. Learning SeqAn
  2. Simple demo program:
    • Read input
    • Call Stellar
    • Move Stellar matches to the FragmentStore
    • Create Interfaces for MultiSplitChaining function + MultiSplitChain type
    • Output of matches in SAM format

Milestone: Concept/Interfaces:
  • What defines the MultiSplitChain type (members and functions)?
  • What is the MultiSplitChaining function output?

July:

  1. From demo to real implementation
  2. Chaining function: test multiple possibilities
  3. Further conceptional design/thoughts:
Which conceptional problems are there? e.g. data types, interfaces, …

Milestone: First version of implementation/program (0.1)

August:

  1. Further coding/implementation
  2. Further conceptional design/thoughts: * Scoring function for matches/chains * Thesis writing: methods

Milestone: Usable program version (1.0)

September:

  1. Benchmarking (simulated/real data)
  2. What problems occur? e.g. memory or running time affecting parts, bad reads
  3. Thesis writing

Milestone: Working benchmark and code revision

October:

  1. Tests and results
  2. Thesis writing

Milestone: First thesis draft by end of October

November:

Thesis writing (final version)

Weekly Report

June:

  1. SeqAn tutorials
  2. Paper reading
  3. Simple demo program that can:
    • Read input file containing reads in fasta format and store reads in a FragmentStore
    • Call Stellar given the FragmentStore input in a direct call of function stellar()
    • Move Stellar matches to the FragmentStore
  4. Idea for match chaining: use graph structure to store matches in nodes and connect compatible matches
    • Compatible matches overlap
    • Compute break point between overlapping matches and store score gain on edge (i.e. number of gained matched)

July:

  1. ArrayGaps based StellarMatches incompatible with AnchorGaps based FragmentStore, would need to rewrite ArrayGaps class (ref: David) which is too complex atm. Alternative: Write SAM file output for Stellar and read input for FragmentStore from SAM file. Started writing SAM output file. Note: SAM file output within Stellar is not triggered when using function stellar() from another application (instead of using Stellar via terminal).
  2. StellarMatches have wrong read positions when using FragmentStore readSeqStore as input (s.b.). Now using _importSequence function from Stellar to load sequences from file and StringSet (as in Stellar) as input.
  3. shorter stellar() function sets default database ID → switched to longer one where you can manually provide database IDs
  4. wrote compareStellarMatch struct to be able to sort StellarMatches according to their read/query start position
  5. assigned/calculated distance score (edit distance) to matches using Stellars _analyzeAlignment function
  6. created (modified) SAM output for StellarMatches (external): has correct database and query IDs, atm additional information about distance score of matches
  7. code revision (from test.cpp to more organized code): created app msplazer that
    • reads genome and read sequences from fasta files and stores them intern (same structures as in Stellar)
    • calls Stellar for every read and genome and stores matches
    • computes and stores distance score for every match (external property map for graph vertices)
  8. created graph structure that (for one read)
    • inserts one vertex for each match
    • inserts an artificial start vertex and adds edges from the start to each vertex with cargo=distanceScore(match)
  9. included breakpoint functions from Anne-Katrin and got it working for StellarMatches from a simple breakpointtest example (1 Genome, 1 Read)
    • computes breakpoint score for two (overlapping) matches and inserts edge into graph with cargo=distance(match2)-breakpointScore (mind problem of context of negative edge weights)
  10. tried dijkstra() on the simple graph from the breakpointtest example to compute shortest paths from the start vertex to every other vertex
    • have to mind vertices not connected to start vertex (have infinite distance, see _getInfinityDistance() function in dijkstra file)

August:

  1. created chainQueryMatches function that (for each read)
    • checks matches for compatibility, i.e. if they are overlapping with at most x% of their length or not overlapping but within y bases in reach
    • computes breakpoint score for overlapping matches and inserts edge into graph with cargo=distance(match2)- breakpointScore + differentDatabasePenalty or diffStrandPen or for matches within reach inserts edge with cargo=distance(match2) + gap(match1, match2) + differentDatabasePenalty or diffStrandPen
  2. expanded criteria for match compatibility: have to be in the right order within the genome sequence (i.e. transfer/exchange of parts within the same genome sequence is not considered to be compatible).
  3. wrote specified dot output for the graph containing the match vertices, compatibility edges with cargo and match information within the vertices.
  4. started writing methods part for thesis
  5. modified StellarMatches on reverse complement strand: match position referred to reverse complement strand (which is not available as a real copy), this caused issues when using these matches. Now they refer to the forward strand (modified breakpoint function input according to that, fixed issue with breakpoint assertion failure on reverse complement matches).
  6. tested FusionMaher reads: found the right matches and chained them in the right way

Comments

  • Stellar can now handle FragmentStore input which is based on an Index using a StringSet based on Owner<ConcatDirect<>>. However, StellarMatch positions are wrong when processing multiple reads (i.e. after read 1). Maybe bc. of consecutive positions due to the ConcatDirect format?
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback