Software Framework for Time Series Analysis / Phase Profiler

Overview

SoftwareFramework_Mindmap.pdf: Mindmap of ideas for Software Framework

Goal

This project aims at the development of a phase profiler for the analysis of (potentially very large) time series. These time series can contain data from multiple sources. The phase profiler should be able to analyse data from sources as different as:

  • ° financial (stock market),
  • ° meteorological (sensor and satellite measurements) or
  • ° physical experiments (mass spectrometry).

The phase profiler's task is to recognize certain states hidden in those time series.

Open Questions

As this is a work in progress, some questions remain to be solved. The most prominent issues are:

  • ° Which algorithms should be contained in the framework?
  • ° Which standards should be used?
  • ° How are users supposed to access the framework?
  • ° How should large quantities of data be handled over the network? Can clusters be used?
  • ° Which programming languages should be used?
  • ° What kind of visualization should be used?


Survey of potentially useful software and libraries

The final framework should bundle a collection of established tools to avoid redundant work. This wiki page distinguishes between software (tools) and libraries: in general, software can be used stand-alone, while libraries are used by software tools. All the examined libraries serve mathematical purposes, mainly linear algebra and numerical calculations. The following programs could provide useful mechanisms, algorithms and libraries.

For each surveyed software tool or library, a short evaluation will be given that assesses its usefulness for the upcoming Phase Profiler Project.

Molecular Dynamics

GROMACS

GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics simulation package originally developed at the University of Groningen, now maintained and extended at different places, including the University of Uppsala, University of Stockholm and the Max Planck Institute for Polymer Research. The GROMACS project was originally started to construct a dedicated parallel computer system for molecular simulations, based on a ring architecture. The molecular dynamics specific routines were rewritten in the C programming language from the Fortran77-based program GROMOS, which had been developed in the same group.

Evaluation: Many of the MD calculations performed by GROMACS can hardly be carried out by any other product, so including GROMACS in the Phase Profiler Project is necessary. Furthermore, the GROMACS code is highly optimized. It could be included via JNI or through Bioclipse, a Java project utilizing GROMACS' Fortran code.

MolTools

Collection of Java tools by F. Noe. Featuring a trajectory implementation, this toolbox allows (amongst other things) testing trajectories for Markovian properties, creating a transition matrix (and finding its eigenvalues and eigenvectors) and mapping micro states to metastable states.
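
As a rough illustration of what creating a transition matrix from a trajectory involves, the following Python sketch counts the transitions in a discretized trajectory and normalizes each row. It is not MolTools code: the function name `transition_matrix`, the `lag` parameter and the toy trajectory are assumptions made for this example only.

```python
def transition_matrix(states, n_states, lag=1):
    """Estimate a row-stochastic transition matrix by counting transitions.

    states   -- discretized trajectory, a list of state indices 0..n_states-1
    n_states -- number of discrete states
    lag      -- number of time steps between the two ends of a transition
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(states[:-lag], states[lag:]):
        counts[i][j] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # A state that is never left keeps an all-zero row in this sketch.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy trajectory over two states (hypothetical data).
trajectory = [0, 0, 1, 1, 1, 0, 0, 1]
T = transition_matrix(trajectory, n_states=2)
```

Each row of `T` sums to 1 for every state that occurs as the start of at least one transition, which is the property an eigenvalue analysis of the matrix relies on.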

File format: xtc

Evaluation:

This Java library contains algorithms that aid in PCCA (Perron Cluster Cluster Analysis). The open question is:

  • How easily can algorithms from MolTools be used together with other routines?

OpenMS

OpenMS is an open-source C++ framework for mass spectrometry. It contains tools to analyse spectrometry data (e.g. peptide & protein identification and clustering) and libraries for LC-MS data management. OpenMS can find mass spectrometric peaks in raw LC-MS data (in mzData format). Peptides can be recognized by a special isotopic pattern.

Furthermore it provides a framework for the development of mass spectrometry related software.

File format (input data): mzData

Evaluation:

This C++ tool's main purpose is to assist in liquid chromatography-mass spectrometry (LC-MS).

The program is very specialized and fit for the single task of aiding in LC-MS. It is probably of little use in the Phase Profiler Project.

Metamacs

Metamacs is a Java library for simulation and analysis of metastable Markov chains. It contains (amongst others) multiple algorithms for molecular structure alignment, time series discretization, HMMs, etc.

This library contains the COLT package by CERN.

Metamacs contains the following algorithms:

  • ° Computes an optimal alignment for molecular structures in terms of the mean square distance while the position of one atom/one axis is fixed
  • ° Discretizes a time series, Inner Simplex Algorithm (ISA) from Marcus Weber
  • ° Graph Theory: Dijkstra (shortest path), representing flow, transition pathways between two metastable sets in a rough energy landscape
  • ° Langevin dynamics, Lennard-Jones cluster, Mueller Potential, Ryckaert-Bellemans united atoms
  • ° HMM: computes the likelihood of an observation series by means of backward or forward variables; Baum-Welch (estimates the model parameters that maximize the likelihood of the given observations); generates a realization of a Hidden Markov Model with the output distribution of specified parameters; Viterbi (computes the most likely state path q* for a given observed time series)
  • ° Deterministic and stochastic integrators for Hamiltonian systems
  • ° Linear algebra subroutines like an eigenvalue solver
  • ° Markov chain Monte Carlo sampling methods
  • ° Variants of the string method for finding transition paths in (rough) energy landscapes
  • ° Wrapper for Gromacs Pipe Interface
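
To make the Viterbi entry in the list above more concrete, here is a minimal Python sketch of the algorithm for a discrete HMM. It is not taken from Metamacs; the function name `viterbi`, the argument layout (lists of probabilities indexed by state and symbol) and the two-state toy model are assumptions chosen for illustration.

```python
import math


def _log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden state path for the observations.

    obs     -- list of observed symbol indices
    start_p -- start_p[s]: probability of starting in state s
    trans_p -- trans_p[r][s]: probability of moving from state r to s
    emit_p  -- emit_p[s][o]: probability of emitting symbol o in state s
    """
    n = len(start_p)
    # Log-probability of the best path ending in each state so far.
    score = [_log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1

    for o in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + _log(trans_p[r][s]))
            pointers.append(best)
            score.append(prev[best] + _log(trans_p[best][s]) + _log(emit_p[s][o]))
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state model with "sticky" states.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]
path = viterbi([0, 0, 1, 1], start, trans, emit)
```

Working in log-space avoids numerical underflow on long time series, which matters for the potentially very large series this project targets.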

COLT library:
  • ° Fundamental general-purpose data structures optimized for numerical data, e.g. dense and sparse matrices (multi-dimensional arrays), linear algebra, resizable arrays, associative containers, buffer management

File format: trr

Evaluation:

This Java library contains many useful algorithms for the discretization of time series, molecular structure alignment and the analysis of HMMs.

Aida/FreeHEP

The developers of Aida (Abstract Interfaces for Data Analysis) are working on high-energy physics data analysis tools. The open source AIDA Project aims at developing abstract interfaces for common physics analysis objects, such as histograms and clouds. Tools which implement AIDA interfaces can exchange objects in an XML format.

There are AIDA implementations in Java (JAIDA), C++ and Python. JAIDA is a subproject of FreeHEP, another open source high-energy physics Java library. Files written with JAIDA adhere to the AIDA IO standards and can be read by any AIDA compliant analysis system.

Interesting libraries:

  • ° FreeHep Physics (collection of High Energy Physics related classes, including 3- and 4- vectors, simple matrices, particles and events, particle properties and jet finding)
  • ° JAIDA: clouds, data points (1D, 2D, 3D), histograms, …

TODO: - Fitting with JAIDA

Evaluation: Flexible interfaces could be useful if they can be implemented by other components. The existing interfaces are, however, restricted to a limited number of data types.


Mathematical Tools / Statistics Software

MATLAB

MATLAB is a numerical computing environment and programming language. Created by The MathWorks, MATLAB allows easy matrix manipulation, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs in other languages.

MATLAB is a proprietary product of The MathWorks, so users are subject to vendor lock-in. Some other source languages, however, are partially compatible (like GNU Octave) or provide a simple migration path (like Scilab).

MATLAB supports neither static type checking nor the use of references.

MATLAB has its own implementation of the ARPACK package and can calculate eigenvalues through the eigs() function.

MATLAB implements the HDF5 data format specification and can access HDF5 files accordingly.

R

GNU R is a free statistics software package, available under the GNU General Public License. R is modeled on the programming language S, which was developed at Bell Laboratories to process statistical data.

Mathematics Libraries

Libraries are needed to perform numerical analysis, especially:
  • ° Linear algebra
  • ° Time series analysis
  • ° Operations on matrices


Professional Mathematics Libraries

For the exchange of scientific data it is important to implement the same standards and data formats as other projects. This section explores the libraries and formats used by other projects.

COLT

Colt provides a set of Open Source Libraries for High Performance Scientific and Technical Computing in Java. The Colt library provides fundamental general-purpose data structures optimized for numerical data, such as resizable arrays, dense and sparse matrices (multi-dimensional arrays), linear algebra, associative containers and buffer management.

Features of Colt are:

  • ° Templated multi-dimensional matrices: dense and sparse fixed-size (non-resizable) 1-, 2-, 3- and d-dimensional matrices holding objects or primitive data types such as int, double, etc.; also known as multi-dimensional arrays or data cubes.
  • ° Linear Algebra: Standard matrix operations and decompositions. LU, QR, Cholesky, Eigenvalue, Singular value.
  • ° Statistics: Tools for basic and advanced statistics: Estimators, Gamma functions, Beta functions, Probabilities, Special integrals, etc.

Colt uses AIDA's (Package: hep.aida) histogram implementation.

Basic Linear Algebra Subprograms (BLAS)

BLAS are standardized application programming interfaces for subroutines that perform basic linear algebra operations such as vector and matrix multiplication. First published in 1979, they are used to build larger packages such as LAPACK. They are heavily used in high-performance computing, and highly optimized implementations of the BLAS interface have been developed by hardware vendors such as Intel as well as by other authors.

LINPACK

LINPACK is a software library for performing numerical linear algebra on digital computers. It was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Pete Stewart, and was intended for use on supercomputers in the 1970s and early 1980s. It has been largely superseded by LAPACK, which will run more efficiently on modern architectures.

LINPACK makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations.

LAPACK (Linear Algebra PACKage)

LAPACK, the Linear Algebra PACKage, is a software library for numerical computing written in Fortran 77. It provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, etc.

LAPACK can be seen as the successor to the original LINPACK, which was designed to run on the then-modern vector computers with shared memory. LAPACK, in contrast, depends upon the Basic Linear Algebra Subprograms (BLAS) in order to effectively exploit the caches on modern cache-based architectures, and thus can run orders of magnitude faster than LINPACK on such machines, given a well-tuned BLAS implementation.

ARPACK (ARnoldi PACKage)

ARPACK is a collection of Fortran77 subroutines designed to solve large scale eigenvalue problems. ARPACK is also available as a MATLAB library. ARPACK uses BLAS and LAPACK.

The package is designed to compute a few eigenvalues and corresponding eigenvectors of a general n by n matrix A. It is most appropriate for large sparse or structured matrices A, where structured means that a matrix-vector product w ← Av requires order n rather than the usual order n² floating point operations. This software is based upon an algorithmic variant of the Arnoldi process called the Implicitly Restarted Arnoldi Method (IRAM). When the matrix A is symmetric it reduces to a variant of the Lanczos process called the Implicitly Restarted Lanczos Method (IRLM). These variants may be viewed as a synthesis of the Arnoldi/Lanczos process with the Implicitly Shifted QR technique that is suitable for large scale problems. For many standard problems, a matrix factorization is not required. Only the action of the matrix on a vector is needed.
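
The remark that only the action of the matrix on a vector is needed can be illustrated with a small matrix-free sketch in Python. Note that this uses plain power iteration, a much simpler relative of the Arnoldi process, and not ARPACK's implicitly restarted method; the names `power_iteration` and `tridiagonal_matvec` and the test operator are assumptions made for this example.

```python
import math


def power_iteration(matvec, n, iterations=200):
    """Approximate the dominant eigenpair of an operator given as v -> A v.

    The matrix A is never stored; only the callback `matvec` is used.
    Assumes the start vector (first unit vector) is not orthogonal to
    the dominant eigenvector.
    """
    v = [1.0] + [0.0] * (n - 1)
    eigenvalue = 0.0
    for _ in range(iterations):
        w = matvec(v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        # Rayleigh quotient v^T A v; v has unit length at this point.
        eigenvalue = sum(a * b for a, b in zip(v, w))
        v = [x / norm for x in w]
    return eigenvalue, v


def tridiagonal_matvec(v):
    """Apply the matrix with 2 on the diagonal and -1 beside it in O(n)."""
    n = len(v)
    return [
        2.0 * v[i]
        - (v[i - 1] if i > 0 else 0.0)
        - (v[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


largest, vector = power_iteration(tridiagonal_matvec, n=4)
```

For this 4 by 4 operator the largest eigenvalue is known in closed form, 2 + 2 cos(π/5), which gives a simple check of the result. Power iteration finds only the dominant eigenvalue and converges slowly when eigenvalues are close together, which is one reason to prefer ARPACK for real problems.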

PARPACK, a parallel version of the ARPACK library, is now available.

  • ° Representation of sparse matrices in MATLAB's eigs function (using ARPACK):
  • MATLAB uses the Harwell-Boeing format. This method uses three arrays internally to store sparse matrices with real elements. Consider an m-by-n sparse matrix with nnz nonzero entries stored in arrays of length nzmax:

  • The first array contains all the nonzero elements of the array in floating-point format. The length of this array is equal to nzmax.

  • The second array contains the corresponding integer row indices for the nonzero elements stored in the first nnz entries. This array also has length equal to nzmax.

  • The third array contains n integer pointers to the start of each column in the other arrays and an additional pointer that marks the end of those arrays. The length of the third array is n+1.
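
The three arrays described above can be sketched in Python as follows. This is an illustration of the storage scheme only, not MATLAB's internal code; the function names `to_csc` and `csc_matvec` are hypothetical, indices are assumed to be zero-based, and the arrays are sized exactly to the number of nonzeros rather than to a larger nzmax.

```python
def to_csc(dense):
    """Convert a dense matrix (list of rows) into three column-wise arrays.

    Returns (values, row_indices, col_pointers):
    values       -- nonzero elements, stored column by column
    row_indices  -- row index of each stored element
    col_pointers -- start of each column in the other two arrays, plus
                    one final entry marking their end (length n + 1)
    """
    n_rows, n_cols = len(dense), len(dense[0])
    values, row_indices, col_pointers = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                values.append(float(dense[i][j]))
                row_indices.append(i)
        col_pointers.append(len(values))
    return values, row_indices, col_pointers


def csc_matvec(values, row_indices, col_pointers, x, n_rows):
    """Compute y = A x, touching only the stored nonzero elements."""
    y = [0.0] * n_rows
    for j in range(len(col_pointers) - 1):
        for k in range(col_pointers[j], col_pointers[j + 1]):
            y[row_indices[k]] += values[k] * x[j]
    return y


A = [[4, 0, 0],
     [0, 0, 5],
     [1, 0, 6]]
values, rows, pointers = to_csc(A)
y = csc_matvec(values, rows, pointers, [1.0, 2.0, 3.0], n_rows=3)
```

The empty second column shows up as two equal consecutive entries in `pointers`, and the matrix-vector product costs time proportional to the number of nonzeros, which is what makes this layout attractive for eigenvalue solvers that only need products with A.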

GLPK (GNU Linear Programming Kit)

The GLPK (GNU Linear Programming Kit) package is intended for solving large-scale linear programming (LP), mixed integer programming (MIP), and other related problems. It is a set of routines written in ANSI C and organized in the form of a callable library.

GLPK supports the GNU MathProg language, which is a subset of the AMPL language. The GLPK package includes the following main components:

  • Revised simplex method.
  • Primal-dual interior point method.
  • Branch-and-bound method.
  • Translator for GNU MathProg.
  • Application program interface (API).
  • Stand-alone LP/MIP solver.

Quantum-Mechanical Software

WavePacket

WavePacket is a general purpose program package for numerical simulation of quantum-mechanical wavepacket dynamics for distinguishable particles. It can be used to solve time-independent or time-dependent linear Schrödinger equations yielding stationary wavefunctions or dynamically evolving wavepackets, respectively. Accounting also for the (semiclassical) interaction of the quantum system through its dipole moment with external electric fields, WavePacket is especially suited to simulate modern experiments in time-dependent spectroscopy using ultrashort laser pulses on ps/fs/as timescales. Thus it can be used as a flexible tool for many simulation tasks in photoinduced physics, (femto-)chemistry, and in related fields. In addition, the extended graphical capabilities allow visualization of wavepacket dynamics 'on the fly', thereby generating animations for later use in presentations. Hence, WavePacket is especially suitable for teaching of quantum mechanics in physics, chemistry, and scientific computing.

Turbomole

TURBOMOLE is a powerful Quantum Chemistry (QC) program package, developed at the group of Prof. Ahlrichs at the University of Karlsruhe and at the Forschungszentrum Karlsruhe, covering a wide range of research areas from both academia and industry. With more than 15 years of development, TURBOMOLE has become a valuable tool for chemists, physicists and engineers.

Presently TURBOMOLE is one of the fastest and most stable codes available for standard quantum chemical applications (HF, DFT, MP2). Unlike many other programs, the main focus in the development of TURBOMOLE has not been to implement all new methods and functionals, but to provide a fast and stable code which is able to treat molecules of industrial relevance at reasonable time and memory requirements. Especially the RI-DFT method often saves a factor 10 in CPU-time compared with many other QM programs. TURBOMOLE runs under LINUX, several UNIX variants, and Windows in serial and parallel mode.

Databases

Microsoft SQL Server

Microsoft SQL Server is a relational database management system (RDBMS) produced by Microsoft. Its primary query language is Transact-SQL, an implementation of the ANSI/ISO standard Structured Query Language (SQL) used by both Microsoft and Sybase.

Transact-SQL (T-SQL) implements SQL-92 (the ISO standard for SQL, certified in 1992) with many extensions. T-SQL mainly adds syntax for use in stored procedures and affects the syntax of transaction support. (Note that SQL standards require Atomic, Consistent, Isolated, Durable or "ACID" transactions.) Microsoft SQL Server and Sybase/ASE both communicate over networks using an application-level protocol called Tabular Data Stream (TDS). The TDS protocol has also been implemented by the FreeTDS project in order to allow more kinds of client applications to communicate with Microsoft SQL Server and Sybase databases. Microsoft SQL Server also supports Open Database Connectivity (ODBC).

SQL Server includes support for database mirroring and clustering. A SQL server cluster is a collection of identically configured servers, which help distribute the workload among multiple servers. All the servers share an identical virtual server name, and it is resolved into the IP address of any of the identically configured machines by the clustering runtime.

Hibernate

Hibernate is an object-relational mapping (ORM) solution for the Java language: it provides an easy to use framework for mapping an object-oriented domain model to a traditional relational database. Its purpose is to relieve the developer from a significant amount of relational data persistence-related programming tasks.

Hibernate is free, open-source software distributed under the GNU Lesser General Public License.

Hibernate's primary feature is mapping from Java classes to database tables (and from Java data types to SQL data types). Hibernate also provides data query and retrieval facilities. Hibernate generates the SQL calls and relieves the developer from manual result set handling and object conversion, keeping the application portable to all SQL databases, with database portability delivered at very little performance overhead.

Hibernate provides transparent persistence for Plain Old Java Objects (POJOs). The only strict requirement for a persistent class is a no-argument constructor, which need not be public. (Proper behavior in some applications also requires special attention to the equals() and hashCode() methods.)

Hibernate can be used both in standalone Java applications and in Java EE applications using servlets or EJB session beans.

JBoss Application Server

JBoss AS 4.0 is a Java EE 1.4 application server, with embedded Apache Tomcat 5.5. Any Java Virtual Machine between versions 1.4 and 1.5 is supported. JBoss can run on numerous operating systems including many POSIX platforms like Red Hat Enterprise Linux, MacOS X, Microsoft Windows and others, as long as a suitable JVM is present. JBoss AS 4.2 is also a Java EE 1.4 application server, but Enterprise JavaBeans 3.0 is deployed by default. It requires the Java Development Kit version 5. Tomcat 6 is bundled with it.

JBoss Features include:

  • Clustering
  • Failover (including sessions)
  • Load balancing
  • Distributed caching (using JBoss Cache, a standalone product)
  • Distributed deployment (farming)
  • Enterprise JavaBeans version 3

JBoss uses the Hypersonic database (Java, open source) by default, but it is possible to enable another database.

JBoss serves as the application server for EJB (together with Hibernate).

GUI

Qt

Qt is a cross-platform application development framework, widely used for the development of GUI programs (in which case it is known as a Widget toolkit), and also used for developing non-GUI programs such as console tools and servers. Qt is most notably used in KDE, the web browser Opera, Google Earth, Skype, Qtopia and OPIE. It is produced by the Norwegian company Trolltech. Trolltech insiders pronounce Qt as "cute".

Qt uses C++ with several non-standard extensions implemented by an additional pre-processor that generates standard C++ code before compilation. Qt can also be used in several other programming languages; bindings exist for Python (PyQt), Ruby (RubyQt), PHP (PHP-Qt), Pascal, C#, Perl, Java, and Ada. It runs on all major platforms, and has extensive internationalization support. Non-GUI features include SQL database access, XML parsing, thread management, and a unified cross-platform API for file handling.

Visualization

JFreeChart

JFreeChart is an open-source framework for the programming language Java, which allows the creation of complex charts in a simple way. JFreeChart also works with GNU Classpath, a free software implementation of the standard class library for the Java programming language. The following chart types are supported:

  • X-Y charts
  • Pie charts
  • Gantt charts
  • Bar charts

VMD

Visual molecular dynamics (VMD) is a molecular modelling and visualization computer program. VMD is primarily developed as a tool for viewing and analyzing the results of molecular dynamics simulations, but it also includes tools for working with volumetric data, sequence data, and arbitrary graphics objects. Molecular scenes can be exported to external rendering tools such as POV-Ray, Renderman, Tachyon, VRML, and many others. Users can run their own Tcl and Python scripts within VMD as it includes embedded Tcl and Python interpreters. VMD is available free of charge, and includes source code, but it's under a non-free license.

PyMOL

PyMOL is an open-source, user-sponsored, molecular visualization system created by Warren Lyford DeLano and commercialized by DeLano Scientific LLC, which is a private software company dedicated to creating useful tools that become universally accessible to scientific and educational communities. It is well suited to producing high quality 3D images of small molecules and biological macromolecules such as proteins. According to the author, almost a quarter of all published images of 3D protein structures in the scientific literature were made using PyMOL.

PyMOL is one of few open source visualization tools available for use in structural biology. The Py portion of the software's name refers to the fact that it extends, and is extensible by, the Python programming language.

Amira

Amira is a tool for visualizing, manipulating, and understanding scientific and industrial data and was developed by the Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB). Amira is written in C++ and uses OpenGL. The amira Molecular Pack includes a very powerful molecule editor with specific tools for molecular visualization and data analysis, such as molecular surfaces, sequence alignment, configuration density computation, molecule trajectories and more. The amira Very Large Data Pack manages and visualizes very large amounts of volume data, up to hundreds of gigabytes.

Features of amira Molecular Pack include:

  • ° Visualization of static molecules as well as time dependent data (trajectories).
  • ° Computation and visualization of configuration densities from trajectories
  • ° Flexible and fast ball and stick visualization, flexible color schemes
  • ° "BondAngle-style" visualization
  • ° Flexible and convenient tools to select and display parts of a molecule including color management
  • ° Extraction and visualization of molecular surfaces (van der Waals surfaces and solvent accessible surfaces)
  • ° Visualization of the backbone
  • ° Computation and visualization of secondary structures & hydrogen bonds
  • ° Visualization of additional quantities like scalar or vector fields with color coding
  • ° Simultaneous display of multiple molecules
  • ° Measuring of lengths and angles in molecules
  • ° Sequence and structural alignment of molecules
  • ° All molecular visualization tools can be arbitrarily combined with the standard amira modules like volume rendering, slicing, or iso-surfaces

File format (amira Molecular Pack, input): PDB, MDL, Tripos, UniChem, PSF/DCD (CHARMM, trajectories), AMBER, GROMACS, amira's native file format ZibMolFile (trajectories)

File format (3D data exchange file formats): IGES, Step, Catia, SEGY, Fluent, FIDAP, AVS, Madymo,

File format (output only): ACR-NEMA/DICOM, TIFF, PPM, JPEG, binary raw data, VRML, PLY, Fluent/UNS, AVS, …

Evaluation:

Amira Developer Pack provides all the C++ header files needed to compile custom extensions. Amira uses Qt, a platform independent C++ library for building graphical user interfaces (GUIs). The amira Molecular Pack contains useful algorithms.

Data Format

Transfer Formats

HDF (Hierarchical Data Format)

Current Version: HDF5

Benefits:
  • ° Convenient self-describing properties: the header contains all meta information (data types used, length, etc.)
  • ° No maximum file size
  • ° Easy network access
  • ° Reading and writing a portion of a dataset is possible

Example:
  • ° netCDF (as of netCDF-4) uses HDF5 as its storage layer

NetCDF

NetCDF (network Common Data Form) is a self-describing data format for scientific data exchange, especially of meteorological data. The NetCDF format is platform independent and, as of netCDF-4, is based on the HDF5 format. Core libraries for NetCDF access exist in C++, Fortran and Java. An extension of NetCDF for parallel computing called Parallel-NetCDF exists.

More about NetCDF and its usefulness to the project here.

Advantages:
  • ° Platform independence
  • ° Interoperability with existing NetCDF projects
  • ° Predefined Datatypes for:
  • °° General scientific data: coordinate systems, gridded data, radial data, …
  • °° Meteorological data
  • °° Trajectories (TrajectoryObsDataset), time series station data

  • ° Access for example with the NetCDF-Java Library (predefined open()-, and process()-methods).

Evaluation: Its self-describing properties make this data format very interesting. Some useful data types (trajectories, meteorological data) are already implemented.

Other Formats

edr
Gromacs' Energy file.

gro
Gromacs' file describing molecular structures using the Gromos87 format. This format can also be used to describe trajectories.

mdp
Gromacs' file describing molecular dynamics parameters. Such a file contains all necessary information (time steps, temperature, pressure) to run a simulation.

top
Gromacs' file describing topology (atoms, bonds, etc).

tpr
Gromacs' file containing the starting structure of a simulation, the molecular topology and the simulation parameters.

trr
Gromacs' file containing trajectory (coordinates, velocities, forces, energies).

xtc
Gromacs' & MolTools' file for trajectories using a reduced precision algorithm for data compression.

pdb (protein databank file format)
This format is similar to xtc and is used for saving the molecular structure of proteins.
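
The reduced-precision idea mentioned for xtc above can be illustrated with a small Python sketch. It only shows the rounding step, under the assumption that the precision is expressed as a multiplier (1000 here, i.e. three decimal places); the actual xtc compression of the resulting integers is more elaborate and is not reproduced. The function names `quantize` and `restore` are hypothetical.

```python
def quantize(coordinates, precision=1000):
    """Map floating-point coordinates to integers at a fixed precision."""
    return [int(round(x * precision)) for x in coordinates]


def restore(integers, precision=1000):
    """Recover approximate coordinates from the stored integers."""
    return [i / precision for i in integers]


original = [1.23456, -0.0004, 7.0]
stored = quantize(original)
recovered = restore(stored)

# The round trip loses at most half a unit of the chosen precision.
max_error = max(abs(a - b) for a, b in zip(original, recovered))
```

Small integers compress far better than full floating-point values, at the price of a bounded rounding error per coordinate.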


Professional Data Formats & Libraries (by Institution)

Numerical Analysis:

| Data Format | Libraries | Used by Institution | More information |
| Hierarchical Data Format (HDF) | supporting libraries | National Center for Supercomputing Applications (NCSA) | HDF5 |
| IRIS Explorer format | NAG Library | Numerical Algorithms Group (NAG) | import data |
| Matrix Market Exchange Formats | BLAS, LINPACK, LAPACK | netlib: Matrix Market | Matrix file formats |
| HDF, netCDF, netCDF Operators (NCO) | ARPACK, ATLAS, BLAS, LAPACK, METIS, PBLAS, more... | National Center for Computational Sciences (NCCS) | more |
| netCDF | - | National Energy Research Scientific Computing Center (NERSC) | more |
| - | CodeLib (Zuse Institut Berlin) | ZIB | more |
| HDF | BLAS, LAPACK, FFTs, NAMD | National Renewable Energy Laboratory (NREL) | more |
| - | LAPACK | High Performance Center Stuttgart (HLRS) | more |
| - | CERNLIB, Physics Analysis Workstation (PAW), ROOT | CERN | - |
| - | - | J. Craig Venter Institute | - |

Visualization:

| Data Format | Libraries | Used by Institution | More information |
| - | Matlab, Molekel, UCSF Chimera, VMD | Swiss National Supercomputing Centre (SNSC) | more |

Other Tools

Apache Commons

The Commons is an Apache project focused on all aspects of reusable Java components. The Apache Commons project is composed of three parts:

  • The Commons Proper: a repository of reusable Java components.
  • The Commons Sandbox: a workspace for Java component development.
  • The Commons Dormant: a repository of Sandbox components that are currently inactive.

JNI (Java Native Interface)

The Java Native Interface (JNI) is a programming framework that allows Java code running in the Java virtual machine (JVM) to call and be called by native applications (programs specific to a hardware and operating system platform) and libraries written in other languages, such as C, C++ and assembly.

The JNI is used to write native methods to handle situations when an application cannot be written entirely in the Java programming language such as when the standard Java class library does not support the platform-specific features or program library. It is also used to modify an existing application, written in another programming language, to be accessible to Java applications. Many of the standard library classes depend on the JNI to provide functionality to the developer and the user, e.g. I/O file reading and sound capabilities. Including performance- and platform-sensitive API implementations in the standard library allows all Java applications to access this functionality in a safe and platform-independent manner. Before resorting to using the JNI, developers should make sure the functionality is not already provided in the standard libraries.

