MultiSplitMappingNGS < ABI

You are here: Foswiki>ABI Web>ThesesHome>MultiSplitMappingNGS (02 May 2013, KtrappeUserTopic)Edit Attach

Multi-Split-Mapping of NGS reads for variant detection

Student
Kathrin Trappe

Academic Advisor
Prof. Dr. Knut Reinert, Anne-Katrin Emde

Expose

The goal of this thesis is to design, implement and test a “multi-split-read-mapping” method that

generates local matches of reads using STELLAR,
chains these local matches into “multi-split-matches”, localizing optimal junction positions and
outputs all multi-split-matches, i.e., chains, that obtain a certain minimum score.

Conceptually, the method has two stages, first detecting regions of sufficiently many local hits and second, combining those local hits in an application dependent fashion. Paying attention to a generic implementation, the developed algorithms should be easily adaptable for applications in 1) structural variant detection in genomic resequencing data, 2) exon-exon-junction discovery in RNA-Seq reads, and 3) identification of methylated DNA in bisulfite-treated genomic DNA.

In this thesis, the initial focus will be on structural variant detection, but the method should be implemented in a generic way such that it can be easily extended in possible future work. The method is meant to be tested on simulated genomic data - where reads were simulated from a manipulated reference containing structural variants and are mapped onto the unmanipulated reference sequence - and/or on real transcriptomic sequencing data where we expect to see reads confirming known gene fusions.

The main steps of the algorithm will be:

generate local matches for all reads
- using STELLAR
- optional: using external matches (code should be generic enough such that external (STELLAR) matches can be imported from a file)
generate possible chains for each read
- identify compatible local matches
- merge into chains (possibly already discarding improbable chains)
compute optimal breakpoint(s)
assign a score to each chain
rank and output all or up to x best chains per read
- optional: upweight scores of chains that are supported by multiple reads

The output should be inspected with a simple clustering method to detect structural variants supported by multiple read chains.

Literature

[0] MultiSplitMapping

[1] Trapnell, Cole, Lior Pachter, and Steven L Salzberg. 2009. “TopHat: discovering splice junctions with RNA-Seq.” Bioinformatics (Oxford, England) 25 (9) (April 30): 1105-1111. doi:10.1093/bioinformatics/btp120.

[2] Maher, Christopher A, Chandan Kumar-Sinha, Xuhong Cao, Shanker Kalyana-Sundaram, Bo Han, Xiaojun Jing, Lee Sam, Terrence Barrette, Nallasivam Palanisamy, and Arul M Chinnaiyan. 2009. “Transcriptome sequencing to detect gene fusions in cancer.” Nature 457 (7234) (February 22): 97-101. doi:10.1038/nature07638.

[3] Wu, T, and C Watanabe. 2005. “GMAP: a genomic mapping and alignment program for mRNA and EST sequences.” Bioinformatics (Oxford, England).

[4] Alkan, Can, Coe, Bradley P., and Eichler, Evan E. 2011. “Genome structural variation discovery and genotyping.” Nature Rev. Gent. 12, 363-376 (May 2011). doi:10.1038/nrg2958

[#] PLEASE ADD

Work Plan

June:

Learning SeqAn
Simple demo program:
- Read input
- Call Stellar
- Move Stellar matches to the FragmentStore
- Create Interfaces for MultiSplitChaining function + MultiSplitChain type
- Output of matches in SAM format

Milestone: Concept/Interfaces:

What defines the MultiSplitChain type (members and functions)?
What is the MultiSplitChaining function output?

July:

From demo to real implementation
Chaining function: test multiple possibilities
Further conceptional design/thoughts:

Which conceptional problems are there? e.g. data types, interfaces, …

Milestone: First version of implementation/program (0.1)

August:

Further coding/implementation
Further conceptional design/thoughts: * Scoring function for matches/chains * Thesis writing: methods

Milestone: Usable program version (1.0)

September:

Benchmarking (simulated/real data)
What problems occur? e.g. memory or running time affecting parts, bad reads
Thesis writing

Milestone: Working benchmark and code revision

October:

Tests and results
Thesis writing

Milestone: First thesis draft by end of October

November:

Thesis writing (final version)

Weekly Report

June:

SeqAn tutorials
Paper reading
Simple demo program that can:
- Read input file containing reads in fasta format and store reads in a FragmentStore
- Call Stellar given the FragmentStore input in a direct call of function stellar()
- Move Stellar matches to the FragmentStore
Idea for match chaining: use graph structure to store matches in nodes and connect compatible matches
- Compatible matches overlap
- Compute break point between overlapping matches and store score gain on edge (i.e. number of gained matched)

July:

ArrayGaps based StellarMatches incompatible with AnchorGaps based FragmentStore, would need to rewrite ArrayGaps class (ref: David) which is too complex atm. Alternative: Write SAM file output for Stellar and read input for FragmentStore from SAM file. Started writing SAM output file. Note: SAM file output within Stellar is not triggered when using function stellar() from another application (instead of using Stellar via terminal).
StellarMatches have wrong read positions when using FragmentStore readSeqStore as input (s.b.). Now using _importSequence function from Stellar to load sequences from file and StringSet (as in Stellar) as input.
shorter stellar() function sets default database ID → switched to longer one where you can manually provide database IDs
wrote compareStellarMatch struct to be able to sort StellarMatches according to their read/query start position
assigned/calculated distance score (edit distance) to matches using Stellars _analyzeAlignment function
created (modified) SAM output for StellarMatches (external): has correct database and query IDs, atm additional information about distance score of matches
code revision (from test.cpp to more organized code): created app msplazer that
- reads genome and read sequences from fasta files and stores them intern (same structures as in Stellar)
- calls Stellar for every read and genome and stores matches
- computes and stores distance score for every match (external property map for graph vertices)
created graph structure that (for one read)
- inserts one vertex for each match
- inserts an artificial start vertex and adds edges from the start to each vertex with cargo=distanceScore(match)
included breakpoint functions from Anne-Katrin and got it working for StellarMatches from a simple breakpointtest example (1 Genome, 1 Read)
- computes breakpoint score for two (overlapping) matches and inserts edge into graph with cargo=distance(match2)-breakpointScore (mind problem of context of negative edge weights)
tried dijkstra() on the simple graph from the breakpointtest example to compute shortest paths from the start vertex to every other vertex
- have to mind vertices not connected to start vertex (have infinite distance, see _getInfinityDistance() function in dijkstra file)

August:

created chainQueryMatches function that (for each read)
- checks matches for compatibility, i.e. if they are overlapping with at most x% of their length or not overlapping but within y bases in reach
- computes breakpoint score for overlapping matches and inserts edge into graph with cargo=distance(match2)- breakpointScore + differentDatabasePenalty or diffStrandPen or for matches within reach inserts edge with cargo=distance(match2) + gap(match1, match2) + differentDatabasePenalty or diffStrandPen
expanded criteria for match compatibility: have to be in the right order within the genome sequence (i.e. transfer/exchange of parts within the same genome sequence is not considered to be compatible).
wrote specified dot output for the graph containing the match vertices, compatibility edges with cargo and match information within the vertices.
started writing methods part for thesis
modified StellarMatches on reverse complement strand: match position referred to reverse complement strand (which is not available as a real copy), this caused issues when using these matches. Now they refer to the forward strand (modified breakpoint function input according to that, fixed issue with breakpoint assertion failure on reverse complement matches).
tested FusionMaher reads: found the right matches and chained them in the right way

Comments

Stellar can now handle FragmentStore input which is based on an Index using a StringSet based on Owner<ConcatDirect<>>. However, StellarMatch positions are wrong when processing multiple reads (i.e. after read 1). Maybe bc. of consecutive positions due to the ConcatDirect format?

Topic revision: r14 - 02 May 2013, KtrappeUserTopic - This page was cached on 04 Mar 2025 - 15:32.

ABI

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback