SFB 1404 FONDA
DFG ● GZ:
Foundations for Large-Scale Scientific Data Analysis Workflows
Essentially all scientific disciplines are generating an ever-increasing amount of data. To derive scientific discoveries, these data sets are analyzed by complex data analysis workflows (DAWs), which are series of discrete analysis programs arranged in (often non-linear) pipelines. Because they usually deal with very large data sets, these DAWs must be executed on distributed and/or parallel computational infrastructures, ranging from multi-core servers through mid-sized clusters to high-performance computing (HPC) infrastructures. Traditionally, DAWs are optimized for speed, which leads to solutions that are hard to reproduce and share and that are tightly bound to exactly one type of input. They are also optimized for exactly the computational infrastructure available at the time of development, which forces scientists to work with heterogeneous low-level programming concepts.
The CRC FONDA – “Foundations of workflows for large-scale scientific data analysis” – will investigate methods for increasing productivity in the development, execution, and maintenance of DAWs for large scientific data sets. Our long-term goal is to develop methods and tools that achieve substantial reductions in development time and development cost of DAWs.
DAW runtime in distributed infrastructures is often dominated by the time required for data access and data exchange (DADE), which in turn depends on the data being analyzed, the tasks being executed, and the infrastructure on which a DAW runs. Changes in any of these aspects can quickly lead to deteriorating runtimes when a DAW is not adapted properly. Subproject A2 investigates methods that can adapt a given DAW to new input data or a different infrastructure with the goal of keeping runtime low.
A2 is an interdisciplinary project; it will develop its research using DAWs for large-scale genome data analysis, which are typically I/O-heavy and thus particularly dependent on proper DADE operations. It will cooperate closely with subproject A6 by also testing its newly developed methods on DAWs for finding structural genomic variations, and it will use the hardware abstractions developed in B1. It will be carried out by Prof. Reinert, an expert in data structures and algorithms for genomic data, and Prof. Leser, an expert in optimization of UDF-heavy DAWs.
The FONDA project is a collaborative effort of a consortium composed of:
HUB: Humboldt-Universität zu Berlin (Speaker)
TUB: Technische Universität Berlin
FUB: Freie Universität Berlin
UP/HPI: Universität Potsdam, Hasso-Plattner-Institut for Digital Engineering
UO: Universität Osnabrück
Charité: Charité - Universitätsmedizin Berlin, Berlin Institute of Health (BIH), Bernstein Center for Computational Neuroscience (BCCN)
HHI: Fraunhofer Heinrich-Hertz-Institut, Berlin
MDC: Max Delbrück Center for Molecular Medicine, Berlin
ZIB: Zuse-Institut Berlin