This project aims at the development of a phase profiler for the analysis of (potentially very large) time series. These time series can contain data from multiple sources. The phase profiler should be able to analyse data from sources as different as:
The phase profiler's task is to recognize certain states hidden in those time series.
As this is a work in progress, some questions remain to be solved. The most prominent issues are:
The final framework should bundle a collection of established tools to avoid redundant work. On this wiki page a distinction is made between software (tools) and libraries. (In general, the software can be used stand-alone, while the libraries are used by software tools. All the examined libraries serve mathematical purposes, mainly linear algebra and numerical calculations.) The following programs could provide useful mechanisms, algorithms and libraries.
Each software tool or library surveyed is followed by a short evaluation assessing its usefulness for the coming Phase Profiler Project.
GROMACS
GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics simulation package originally developed at the University of Groningen, now maintained and extended at different places, including the University of Uppsala, University of Stockholm and the Max Planck Institute for Polymer Research. The GROMACS project was originally started to construct a dedicated parallel computer system for molecular simulations, based on a ring architecture. The routines specific to molecular dynamics were rewritten in the C programming language from the Fortran77-based program GROMOS, which had been developed in the same group.
Evaluation: Many of GROMACS' MD calculations can hardly be performed by any other product, so including GROMACS in the Phase Profiler Project is necessary. Furthermore, the GROMACS code is highly optimized. It could be included via JNI or through Bioclipse, a Java project utilizing GROMACS' Fortran code.
MolTools
Java tools by F. Noe. Featuring a trajectory implementation, this toolbox allows (amongst other things) testing trajectories for Markovian properties, creating transition matrices (and finding their eigenvalues/eigenvectors) and mapping micro states to metastable states.
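As a rough illustration of the transition matrix step mentioned above, the following Python sketch counts transitions in a discrete trajectory and normalises each row. It is not MolTools code: the function name `estimate_transition_matrix`, the `lag` parameter and the example trajectory are assumptions made only for this illustration.

```python
def estimate_transition_matrix(trajectory, n_states, lag=1):
    """Row-normalised transition counts between states observed `lag` steps apart.

    `trajectory` is a list of integer state indices in range(n_states).
    All names here are hypothetical and not taken from MolTools.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(trajectory[:-lag], trajectory[lag:]):
        counts[i][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States with no observed outgoing transition keep an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Made-up discrete trajectory over three states.
trajectory = [0, 0, 1, 1, 1, 0, 2, 2, 0, 0]
T = estimate_transition_matrix(trajectory, n_states=3)
for row in T:
    print(row)
```

Each row of `T` holds the estimated probabilities of moving from one state to every state; the eigenvalues of such a matrix are what a metastability analysis would examine next.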
Trajectory implementation of MolTools
This Java library contains algorithms which aid in PCCA. The problem is:
OpenMS
C++ framework for mass spectrometry. It contains tools to analyse spectrometry data (e.g. peptide & protein identification and clustering) and libraries for LC-MS data management. OpenMS can find mass spectrometric peaks in raw LC-MS data (in mzData format). Peptides can be recognized by a special isotopic pattern.
Furthermore, it provides a framework for the development of mass spectrometry related software.
File format (input data): mzData
Evaluation: This C++ tool's main purpose is to assist in liquid chromatography-mass spectrometry (LC-MS). The program is very specialized, fit for the single task of aiding in LC-MS. It is probably of little use in the Phase Profiler Project.
Metamacs
Java library for simulation and analysis of metastable Markov chains. It contains (amongst others) multiple algorithms for molecular structure alignment, time series discretization, HMMs, etc.
TimeSeries implementation of Metamacs
COLT package by CERN.
Metamacs contains the following algorithms:
COLT library
This Java library contains many useful algorithms for the discretization of time series, molecular structure alignment and the analysis of HMMs.
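To make "discretization of time series" concrete, here is a minimal Python sketch that maps real-valued samples to equal-width bins. It only illustrates the general idea and does not reflect how Metamacs implements it; the function name `discretize`, the choice of equal-width binning and the sample values are assumptions.

```python
def discretize(series, n_bins):
    """Map each real-valued sample to the index of an equal-width bin.

    Hypothetical helper for illustration; not part of Metamacs.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series falls into a single bin.
        return [0] * len(series)
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum sample in the last bin instead of bin n_bins.
    return [min(int((x - lo) / width), n_bins - 1) for x in series]


# Made-up signal that switches between a low and a high regime.
signal = [0.0, 0.1, 0.9, 1.0, 0.5, 0.45]
states = discretize(signal, n_bins=2)
print(states)
```

The resulting list of integer states is the kind of input a Markov chain or HMM analysis works on. Equal-width bins are only one option; a clustering-based discretization may suit structured data better.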
Aida/FreeHEP
The AIDA Project aims at developing abstract interfaces for common physics analysis objects, such as histograms and clouds. Tools which implement AIDA interfaces can exchange objects in an XML format.
There are AIDA implementations in Java (JAIDA), C++ and Python. JAIDA is a subproject of FreeHEP, another open source high-energy physics Java library. Files written with JAIDA adhere to the AIDA IO standards and can be read by any AIDA compliant analysis system.
Interesting libraries:
JAIDA: clouds, data points (1D, 2D, 3D), histograms, …
TODO: - Fitting with JAIDA
Evaluation: Flexible interfaces could be useful if they can be implemented by other components. The existing interfaces are, however, restricted to a limited number of data types.
MATLAB supports neither static type checking nor the use of references.
MATLAB ships its own implementation of the ARPACK package and can calculate eigenvalues through the eigs() function.
MATLAB implements the HDF5 data format specification and can access HDF5 files accordingly.
GNU R is a free statistics software package, available under the GNU General Public License. R is modelled on the programming language S, which was developed at Bell Laboratories to process statistical data.
For the exchange of scientific data it is important to implement the same standards and data formats as other projects. This section explores the libraries and formats used by other projects.
Features of Colt are:
Colt uses AIDA's (Package: hep.aida) histogram implementation.
LINPACK makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations.
LAPACK can be seen as the successor to the original LINPACK, which was designed to run on the then-modern vector computers with shared memory. LAPACK, in contrast, depends upon the Basic Linear Algebra Subprograms (BLAS) in order to effectively exploit the caches on modern cache-based architectures, and thus can run orders of magnitude faster than LINPACK on such machines, given a well-tuned BLAS implementation.
The ARPACK package is designed to compute a few eigenvalues and corresponding eigenvectors of a general n by n matrix A. It is most appropriate for large sparse or structured matrices A, where structured means that a matrix-vector product w ← Av requires order n rather than the usual order n^2 floating point operations. This software is based upon an algorithmic variant of the Arnoldi process called the Implicitly Restarted Arnoldi Method (IRAM). When the matrix A is symmetric it reduces to a variant of the Lanczos process called the Implicitly Restarted Lanczos Method (IRLM). These variants may be viewed as a synthesis of the Arnoldi/Lanczos process with the Implicitly Shifted QR technique that is suitable for large scale problems. For many standard problems, a matrix factorization is not required. Only the action of the matrix on a vector is needed.
PARPACK, a parallel version of the ARPACK library, is now available.
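The point that only the action of the matrix on a vector is needed can be illustrated with a much simpler method than the one ARPACK uses. The Python sketch below runs plain power iteration (not IRAM or IRLM) against a caller-supplied matrix-vector product, so the matrix itself is never stored. The names `power_iteration` and `tridiagonal_matvec`, the test matrix and the iteration count are assumptions chosen for this illustration.

```python
import math


def power_iteration(matvec, n, iterations=200):
    """Estimate the dominant eigenpair using only matrix-vector products.

    Plain power iteration, far simpler than ARPACK's restarted Arnoldi method.
    """
    start = [float(i + 1) for i in range(n)]
    scale = math.sqrt(sum(x * x for x in start))
    v = [x / scale for x in start]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = matvec(v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        # Rayleigh quotient v.w (v has unit length) approximates the eigenvalue.
        eigenvalue = sum(a * b for a, b in zip(v, w))
        v = [x / norm for x in w]
    return eigenvalue, v


def tridiagonal_matvec(v):
    """Apply the tridiagonal matrix (2 on the diagonal, -1 beside it) in O(n)."""
    n = len(v)
    return [
        2.0 * v[i]
        - (v[i - 1] if i > 0 else 0.0)
        - (v[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


value, vector = power_iteration(tridiagonal_matvec, n=5)
print(value)
```

Because `tridiagonal_matvec` touches each entry a constant number of times, one product costs order n operations, which is the "structured matrix" situation described above. Power iteration only finds the eigenvalue of largest magnitude and converges slowly when eigenvalues are close together, which is one reason a library such as ARPACK is preferable in practice.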
The GLPK (GNU Linear Programming Kit) package is intended for solving large-scale linear programming (LP), mixed integer programming (MIP), and other related problems. It is a set of routines written in ANSI C and organized in the form of a callable library.
GLPK supports the GNU MathProg language, which is a subset of the AMPL language. The GLPK package includes the following main components:
TURBOMOLE is a powerful Quantum Chemistry (QC) program package, developed in the group of Prof. Ahlrichs at the University of Karlsruhe and at the Forschungszentrum Karlsruhe, covering a wide range of research areas from both academia and industry. With more than 15 years of development, TURBOMOLE has become a valuable tool for chemists, physicists and engineers.
Presently TURBOMOLE is one of the fastest and most stable codes available for standard quantum chemical applications (HF, DFT, MP2). Unlike many other programs, the main focus in the development of TURBOMOLE has not been to implement all new methods and functionals, but to provide a fast and stable code which is able to treat molecules of industrial relevance at reasonable time and memory requirements. Especially the RI-DFT method often saves a factor 10 in CPU-time compared with many other QM programs. TURBOMOLE runs under LINUX, several UNIX variants, and Windows in serial and parallel mode.
Microsoft SQL Server is a relational database management system (RDBMS) produced by Microsoft. Its primary query language is Transact-SQL, an implementation of the ANSI/ISO standard Structured Query Language (SQL) used by both Microsoft and Sybase.
Microsoft SQL Server uses a variant of SQL called T-SQL, or Transact-SQL, an implementation of SQL-92 (the ISO standard for SQL, certified in 1992) with many extensions. T-SQL mainly adds additional syntax for use in stored procedures, and affects the syntax of transaction support. (Note that SQL standards require Atomic, Consistent, Isolated, Durable or "ACID" transactions.) Microsoft SQL Server and Sybase/ASE both communicate over networks using an application-level protocol called Tabular Data Stream (TDS). The TDS protocol has also been implemented by the FreeTDS project in order to allow more kinds of client applications to communicate with Microsoft SQL Server and Sybase databases. Microsoft SQL Server also supports Open Database Connectivity (ODBC).
SQL Server includes support for database mirroring and clustering. A SQL Server cluster is a collection of identically configured servers, which helps distribute the workload among multiple servers. All the servers share an identical virtual server name, which the clustering runtime resolves into the IP address of any of the identically configured machines.
Hibernate is an object-relational mapping (ORM) solution for the Java language: it provides an easy to use framework for mapping an object-oriented domain model to a traditional relational database. Its purpose is to relieve the developer from a significant amount of relational data persistence-related programming tasks.
Hibernate is free, open source software distributed under the GNU Lesser General Public License.
Hibernate's primary feature is mapping from Java classes to database tables (and from Java data types to SQL data types). Hibernate also provides data query and retrieval facilities. Hibernate generates the SQL calls and relieves the developer from manual result set handling and object conversion, keeping the application portable to all SQL databases, with database portability delivered at very little performance overhead.
Hibernate provides transparent persistence for Plain Old Java Objects (POJOs). The only strict requirement for a persistent class is a no-argument constructor, not compulsorily public. (Proper behavior in some applications also requires special attention to the equals() and hashCode() methods.)
Hibernate can be used both in standalone Java applications and in Java EE applications using servlets or EJB session beans.
JBoss Features include:
JBoss uses the Hypersonic database (Java, open source) by default, but another database can be enabled.
JBoss serves as the application server for EJB (together with Hibernate).
Qt is a cross-platform application development framework, widely used for the development of GUI programs (in which case it is known as a Widget toolkit), and also used for developing non-GUI programs such as console tools and servers. Qt is most notably used in KDE, the web browser Opera, Google Earth, Skype, Qtopia and OPIE. It is produced by the Norwegian company Trolltech. Trolltech insiders pronounce Qt as "cute".
Qt uses C++ with several non-standard extensions implemented by an additional pre-processor that generates standard C++ code before compilation. Qt can also be used in several other programming languages; bindings exist for Python (PyQt), Ruby (RubyQt), PHP (PHP-Qt), Pascal, C#, Perl, Java, and Ada. It runs on all major platforms, and has extensive internationalization support. Non-GUI features include SQL database access, XML parsing, thread management, and a unified cross-platform API for file handling.
Visual molecular dynamics (VMD) is a molecular modelling and visualization computer program. VMD is primarily developed as a tool for viewing and analyzing the results of molecular dynamics simulations, but it also includes tools for working with volumetric data, sequence data, and arbitrary graphics objects. Molecular scenes can be exported to external rendering tools such as POV-Ray, Renderman, Tachyon, VRML, and many others. Users can run their own Tcl and Python scripts within VMD as it includes embedded Tcl and Python interpreters. VMD is available free of charge and includes source code, but it is distributed under a non-free license.
Amira
Amira is written in C++ and uses OpenGL. The amira Molecular Pack includes a very powerful molecule editor with specific tools for molecular visualization and data analysis, such as molecular surfaces, sequence alignment, configuration density computation, molecule trajectories and more. The amira Very Large Data Pack manages and visualizes very large amounts of volume data, up to hundreds of gigabytes.
Features of amira Molecular Pack include:
Current Version: HDF5
Benefits:
NetCDF
The NetCDF format is platform independent and uses the HDF5 format. Core libraries for NetCDF access exist in C++, Fortran and Java. An extension of NetCDF for parallel computing, called Parallel-NetCDF, exists.
More about NetCDF and its usefulness to the project here.
Advantages:
NetCDF projects
Data Format | Libraries | Used by Institution | More information |
---|---|---|---|
netCDF | - | National Energy Research Scientific Computing Center (NERSC) | more |
HDF, netCDF, netCDF Operators (NCO) | ARPACK, ATLAS, BLAS, LAPACK, METIS, PBLAS, more... | National Center for Computational Sciences (NCCS) | more |
HDF | BLAS, LAPACK, FFTs, NAMD | National Renewable Energy Laboratory (NREL) | more |
Matrix Market Exchange Formats | BLAS, LINPACK, LAPACK | netlib: Matrix Market | Matrix file formats |
- | CERNLIB, Physics Analysis Workstation (PAW), ROOT | CERN | - |
- | CodeLib | (Zuse Institut Berlin) ZIB | more |
- | LAPACK | High Performance Center Stuttgart (HLRS) | more |
IRIS Explorer format | NAG Library | Numerical Algorithms Group (NAG) | import data |
Hierarchical Data Format (HDF) | supporting libraries | National Center for Supercomputing Applications (NCSA) | HDF5 |
- | - | J. Craig Venter Institute | - |

Data Format | Libraries | Used by Institution | More information |
---|---|---|---|
- | Matlab, Molekel, UCSF Chimera, VMD | Swiss National Supercomputing Centre (SNSC) | more |
The Commons is an Apache project focused on all aspects of reusable Java components. The Apache Commons project is composed of three parts:
The Commons Proper - A repository of reusable Java components.
The Commons Sandbox - A workspace for Java component development.
The Commons Dormant - A repository of Sandbox components that are currently inactive.
The Java Native Interface (JNI) is a programming framework that allows Java code running in the Java virtual machine (JVM) to call and be called by native applications (programs specific to a hardware and operating system platform) and libraries written in other languages, such as C, C++ and assembly.
The JNI is used to write native methods to handle situations when an application cannot be written entirely in the Java programming language such as when the standard Java class library does not support the platform-specific features or program library. It is also used to modify an existing application, written in another programming language, to be accessible to Java applications. Many of the standard library classes depend on the JNI to provide functionality to the developer and the user, e.g. I/O file reading and sound capabilities. Including performance- and platform-sensitive API implementations in the standard library allows all Java applications to access this functionality in a safe and platform-independent manner. Before resorting to using the JNI, developers should make sure the functionality is not already provided in the standard libraries.