You are here: ABI » ThesesHome » ThesisGenomesPerMail » Expose

Page Expose

An expose for the Master's Thesis: "Journal Set: A container for utilizing Incremental Index Structures".

Abstract

In this Master's Thesis the concepts of incremental index structures (Journal Strings) [1] should be extended to allow collecting Journal Strings in a StringSet. Further more, different tools working on this Journal-StringSet used to be implemented to facilitate easy data access on it, while reducing the memory requirements needed to store the sequences.

Description

The Master's Thesis deals with the implementation of an appropriate data structure that utilizes the Incremental Index Structures known as JournalStrings. Using a StringSet [2] that collects finite many objects of the same type as a basic class maintain the SeqAn library consistent. In the following paragraphs the concept and the interfaces of the JournalString specific StringSet ( JournalSet ) are described.

1. Container concept

The JournalSet, despite of beeing a subclass of a StringSet, allows many specific operations on the set contained JournalStrings than just collecting several sequences from the same type. As a generous difference the JournalSet contains two string objects. It stores on the one hand the assigned JournalStrings but is also able to store several reference sequences too. The assigned JounalStrings can than be assigned to different reference sequences. The JournalString collects indeed only JournalStrings but those can be from different organisms or sequence families. Hence the JournalSet is a subclass of the SeqAn StringSet the container must provide the basic interfaces of the StringSet class. The interfaces must be adapted to the concept of the JournalSet. You find the basic methods and their description in the subsequent table.

Function Description
resize Acquires enough memory for the specified number of elements.
length Returns the current length of the set
begin The first element of the container
end The last element of the container
appendValue Append a new element to the container
clear Resets the container
value Reference to the value
remove deletes elements of this container
erase frees the memory for the erased positions
insert inserts elements in the container

Furthermore the container must implement a RandomAccessIterator for interacting with the elements contaiened in the set.

2. Assigning sequences

When using the JournalSet to collect JournalStrings it should also be possible to add normal sequences to the set, that would be transformed into a JournalString given the reference sequence automatically. For this feature 3 basic behaviours are expected.

2.1. Fast-Assignment

The user can put the sequence as a entire insertion into the JournalSet, which results in a JournalTree with just one node for the insertion. This method would be very fast, since no comparisons between the sequence to assign and their reference sequence have to be made.

2.2. Memory-Efficient-Assignment

In addition the whole data structure remains best memory efficient if that global alignment between sequence and reference would be computed, that computes the smallest JournalTree. Therefor the entire recursion of the DP-Algorithm must be adapted such that it computes the trace that requires the fewest memory allocation. This recursion must be formalised and in the further progress conveyed to the more appropriate seed and extend algorithm for larger sequences such as human genomes. In that case the chaning metric has to be adapted.

2.3. Alignment-Assignment

Also the user should have the opportunity to just uses his own global alignment between the sequence and the reference to transform it into a JournalTree and put it to the set.

3. Special interfaces

Beside the basic interfaces that come with the basic StringSet data structure some more functions have to be implemented in order to guarantee a high flexible and comprehensive container structure.

3.1. flatten()

This function allows the user to a) request the entire sequence of a JournalString by applying the stored modifications to the reference sequence and by that flatten the JournalTree accordingly and b) produces a new reference sequence that is appended to the JournalSet member taht manages the reference sequences.

3.2. join()

This function serves to reassign the reference sequence of a set of stored JournalStrings within the set. Unfortunately, for all JournalStrings a global alignment with the new reference sequence has to be computed, while the old reference sequence remains unchanged. Because the both reference sequences don't need to be related necessarily, there is no other way around. Also the revised JournalStrings don't need necessarily reference the same sequence.

3.3. swap()

In contrast it is also possible to switch the reference sequence within a group of JournalStrings that match the same reference sequence, by choosing one of these group members to be the new reference sequence. In that case the JournalTrees of the group members must be revised such that they fit the new reference. Furthermore the reference will be journaled in that way that it gets the inverse modification of the new reference sequence. All other group members can than changed without recomputing the entire alignment, since all sequences are related to each other some how.

Scheduled Tasks

1. Task - Concept of the JournalSet
  • stores assigned JournalStrings in a String object.
  • stores reference sequences in a String object.
  • additional c'tor that initializes set of reference sequences.
  • export/import of the JournalSet

Literature

[1] Pyl Paul Theodor (2010). Incremental Index Structures

[2] A. Döring, D. Weese, T. Rausch, and K. Reinert, SeqAn - An efficient, generic C++ library for sequence analysis, BMC Bioinformatics 2008, 9:11

 
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback