Non-coding RNAs (ncRNAs), transcripts that do not code for proteins, have received considerable attention over the last decade, thanks to several high-throughput genome-wide sequencing efforts which showed that, whereas less than 2% of the genome encodes proteins, at least 75% is actively transcribed into non-coding RNAs. NcRNAs have emerged as key players in several biological processes throughout development, differentiation and disease, including mechanisms of transcriptional and post-transcriptional gene regulation. Their size varies widely, from small RNAs of about 20 nucleotides to transcripts of hundreds of thousands of nucleotides. Of interest in this proposal are long ncRNAs (lncRNAs), transcripts longer than 200 nucleotides, of which about 20,000 are estimated to exist in the genome and which exhibit distinct patterns of spatio-temporal expression. Despite their abundance, only a few of them have a characterized function: most of the lncRNAs detected by high-throughput methods remain without a functional classification.
LncRNA sequences evolve very rapidly. It is therefore accepted that analysis methods focusing solely on sequence conservation are less suitable than methods that take RNA secondary structure into account and allow for non-conventional motifs, such as pseudoknots, G-quadruplexes and intramolecular RNA triplexes, which have been shown to be related to functional aspects of lncRNAs.
While global approaches for elucidating sequence-structure conservation are valuable, it is reasonable to assume that for lncRNAs one has to resort to locally conserved sequence-structure motifs. In this proposal we will investigate various computational aspects of searching for and clustering lncRNAs that share local sequence-structure motifs. Such shared motifs might indicate that the lncRNAs are functionally related (e.g. they bind the same RNA-binding protein via the sequence-structure motif) and will therefore enable a first comprehensive functional classification of lncRNAs. We have two main goals, each led primarily by one of the two PIs.
1) We want to devise computational methods for aligning (pseudoknotted) RNA families, then derive probabilistic global and local motifs from the alignments while allowing experimental evidence to be incorporated, and finally develop methods to search for the motifs quickly and sensitively in large genomic sequences.
2) We want to extend and apply the methodology developed under the first goal to retrieve sequence-structure motifs in lncRNAs. We want to enable the classification of lncRNAs into different classes based on sequence-structure motifs and other genomic features, and finally find and functionally annotate new classes of non-coding RNAs.
The project started in January 2018 and ends in December 2020.