# Presentation

The Inria ParaSuite aims to provide the research community with a new set of benchmarks. These benchmarks are not mere computation codelets but real applications that reflect today's parallel software ecosystem, covering a wide range of domains and parallel technologies.

The suite has been designed for researchers in hardware architecture, compilation and systems.

# Benchmarks

The Inria ParaSuite gathers multiple applications developed at the Inria research centers. These applications come from a wide range of domains, such as biology and hydrogeology.

## Bocop

**Team**: Commands

**Categories**:
OpenMP
Multicore

**Dependencies**:

- None

**Description**:
The package BocopHJB implements a global optimization method. Similarly to the Dynamic Programming approach, the optimal control problem is solved in two steps. First we solve the Hamilton-Jacobi-Bellman equation satisfied by the value function of the problem. Then we simulate the optimal trajectory from any chosen initial condition. The computational effort is essentially taken by the first step, whose result, the value function, can be stored for subsequent trajectory simulations. The BocopHJB package can be found at www.bocop.org.
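The two-step scheme can be illustrated on a toy discrete-time problem (our own example, not BocopHJB code): a backward Bellman sweep computes the value function on a state grid, then a forward pass simulates the optimal trajectory from a chosen initial condition.

```python
import numpy as np

# Toy problem: minimise the sum of running costs x^2 + u^2 over a
# horizon T, with dynamics x_{t+1} = x_t + u_t on a bounded grid.
xs = np.linspace(-2.0, 2.0, 41)   # state grid
us = np.linspace(-1.0, 1.0, 21)   # control grid
T = 20                            # horizon

# Step 1: backward sweep (Bellman recursion) computes the value function.
V = np.zeros((T + 1, xs.size))
policy = np.zeros((T, xs.size), dtype=int)
for t in range(T - 1, -1, -1):
    for i, x in enumerate(xs):
        nxt = np.clip(x + us, xs[0], xs[-1])
        cost = x**2 + us**2 + np.interp(nxt, xs, V[t + 1])
        policy[t, i] = int(np.argmin(cost))
        V[t, i] = cost[policy[t, i]]

# Step 2: forward simulation of the optimal trajectory from x0 = 1.5.
# The stored value function / policy can be reused for any other x0.
x = 1.5
traj = [x]
for t in range(T):
    i = int(np.argmin(np.abs(xs - x)))           # nearest grid point
    x = float(np.clip(x + us[policy[t, i]], xs[0], xs[-1]))
    traj.append(x)
```

As in BocopHJB, the expensive part is the backward sweep; the forward simulation is cheap once the value function is stored.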

## Genfield

**Team**: SAGE

**Categories**:
MPI
Multicore
Cluster

**Dependencies**:

- MPI

**Description**:
Genfield is an application dedicated to hydrogeology, based on the PARADIS application. It relies essentially on FFT computations and uses the MPI version of the FFTW library.
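As a rough illustration of FFT-based random-field generation, the kind of kernel such a code parallelizes with FFTW, here is a serial sketch. The spectrum model and all parameters are our own; this is not Genfield code.

```python
import numpy as np

# Spectral method: filter white noise in Fourier space with a chosen
# power spectrum, then transform back to obtain a correlated 2-D field.
rng = np.random.default_rng(0)
n = 64
k = np.fft.fftfreq(n)                      # 1-D wavenumbers
kx, ky = np.meshgrid(k, k, indexing="ij")
kk = np.sqrt(kx**2 + ky**2)
kk[0, 0] = 1.0                             # avoid division by zero at k = 0
spectrum = kk**-1.5                        # illustrative power-law model
spectrum[0, 0] = 0.0                       # zero out the mean (DC) mode

noise = rng.standard_normal((n, n))
field = np.real(np.fft.ifft2(np.fft.fft2(noise) * spectrum))
```

In a distributed-memory code the two transforms become parallel FFTW calls, which is where the MPI dependency comes from.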

## Grph

**Team**: COATI

**Categories**:
pThread
Multicore

**Dependencies**:

- Java 7 (JDK and JRE)

**Description**:
Grph is an open-source Java library for the manipulation of graphs. Its design objectives are to make it portable, simple to use and extend, computationally and memory efficient, and, in line with its initial motivation, useful in the context of graph experimentation and network simulation. Grph also comes with tools such as an evolutionary computation engine, a bridge to linear solvers, a framework for distributed computing, etc.

## HIPS

**Team**: HiePACS

**Categories**:
MPI
Multicore
Cluster

**Dependencies**:

- MPI

**Description**:
**Keywords**: multilevel method, domain decomposition, Schur complement, parallel iterative solver.

HIPS (Hierarchical Iterative Parallel Solver) is a scientific library that provides an efficient parallel iterative solver for very large sparse linear systems.

The key point of the methods implemented in HIPS is to define an ordering and a partition of the unknowns that relies on a form of nested dissection ordering, in which cross points in the separators play a special role (Hierarchical Interface Decomposition ordering). The sub-graphs obtained by nested dissection correspond to the unknowns that are eliminated using a direct method, while the Schur complement system on the remaining unknowns (which correspond to the interface between the sub-graphs, viewed as sub-domains) is solved using an iterative method (currently GMRES or Conjugate Gradient). This special ordering and partitioning allows the use of dense block algorithms in both the direct and iterative parts of the solver and provides a high degree of parallelism to these algorithms. The code thus provides a hybrid method which blends direct and iterative solvers. HIPS exploits the partitioning and multistage ILU techniques to enable a highly parallel scheme where several sub-domains can be assigned to the same process. It also provides a scalar preconditioner based on the multistage ILUT factorization.
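The Schur-complement construction described above can be sketched on a toy 4×4 system. The matrix, the index sets and the hand-rolled conjugate gradient are all our own illustration, not HIPS code: interior unknowns are eliminated directly, and the interface system is solved iteratively.

```python
import numpy as np

# Symmetric positive-definite toy system split into interior (I) and
# interface (G) unknowns, mimicking a two-sub-domain decomposition.
A = np.array([[4.0, 1.0, 0.0, 1.0],
              [1.0, 4.0, 1.0, 0.0],
              [0.0, 1.0, 4.0, 1.0],
              [1.0, 0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])
I, G = [0, 1], [2, 3]

AII, AIG = A[np.ix_(I, I)], A[np.ix_(I, G)]
AGI, AGG = A[np.ix_(G, I)], A[np.ix_(G, G)]

# Direct elimination of the interior: S = AGG - AGI * AII^-1 * AIG.
S = AGG - AGI @ np.linalg.solve(AII, AIG)
rhs = b[G] - AGI @ np.linalg.solve(AII, b[I])

def cg(M, r0, tol=1e-12):
    """Plain conjugate gradient for the (SPD) interface system."""
    x = np.zeros_like(r0)
    r = r0.copy()
    p = r.copy()
    while np.dot(r, r) > tol:
        alpha = (r @ r) / (p @ (M @ p))
        x += alpha * p
        r_new = r - alpha * (M @ p)
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

xG = cg(S, rhs)                                  # iterative interface solve
xI = np.linalg.solve(AII, b[I] - AIG @ xG)       # interior back-substitution
x = np.empty(4)
x[I], x[G] = xI, xG
```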

## Opennl

**Team**: Alice

**Categories**:
CUDA
GPGPU

**Dependencies**:

- Cuda

**Description**:
OpenNL is a library of sparse linear solvers for linear and quadratic problems, with or without equality constraints. It is tailored for maximum portability and ease of use through a simple API. It uses several data structures: a dynamic matrix structure used during system assembly, and compressed row storage used during the solve. On multicore architectures, it uses OpenMP directives to optimize sparse matrix-vector products. It also has a GPU extension that implements the parallel sparse matrix-vector product in Cuda.
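The compressed row storage layout and the sparse matrix-vector product it enables can be sketched as follows. This is a serial toy version of the kernel; OpenNL's actual OpenMP/CUDA implementations are not shown.

```python
# CRS stores only the non-zeros of the matrix
#   [[4, 0, 1],
#    [0, 3, 0],
#    [2, 0, 5]]
values  = [4.0, 1.0, 3.0, 2.0, 5.0]   # non-zeros, row by row
col_idx = [0, 2, 1, 0, 2]             # column of each non-zero
row_ptr = [0, 2, 3, 5]                # start of each row in `values`

def spmv(row_ptr, col_idx, values, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):            # rows are independent, which is
        for k in range(row_ptr[i], row_ptr[i + 1]):  # why the product
            y[i] += values[k] * x[col_idx[k]]        # parallelises well
    return y

y = spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0])  # -> [5.0, 3.0, 7.0]
```

Because each output row depends only on read-only inputs, the outer loop maps directly to an OpenMP `parallel for` or one CUDA thread per row.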

## PaStiX

**Team**: HiePACS

**Categories**:
MPI
pThread
Multicore
Cluster

**Dependencies**:

- MPI
- BLAS

**Description**:
**Keywords**: complete and incomplete supernodal sparse parallel factorizations.

PaStiX (Parallel Sparse matriX package) is a scientific library that provides a high-performance parallel solver for very large sparse linear systems, based on block direct and block ILU(k) iterative methods. Numerical algorithms are implemented in single or double precision (real or complex): LLt (Cholesky), LDLt (Crout) and LU with static pivoting (for non-symmetric matrices with a symmetric pattern).
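As a pointwise illustration of the LLt factorization mentioned above, here is a scalar (non-supernodal, non-parallel) Cholesky sketch on a small matrix of our own; it is not PaStiX code, which operates on dense blocks via BLAS 3 instead of scalars.

```python
import numpy as np

def cholesky(A):
    """Factor a symmetric positive-definite A as L @ L.T, L lower triangular."""
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        # diagonal entry: remove contributions of previously computed columns
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
L = cholesky(A)
```

In a supernodal solver, the scalar updates above become dense block operations on groups of columns with identical sparsity structure, which is what makes BLAS 3 kernels and the static scheduling described below applicable.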

The PaStiX library uses the graph partitioning and sparse matrix block ordering package Scotch. PaStiX is based on an efficient static scheduler and memory manager, in order to solve 3D problems with more than 50 million unknowns. The mapping and scheduling algorithm handles a combination of 1D and 2D block distributions, and computes an efficient static schedule of the block computations for the supernodal parallel solver, which uses a local aggregation of contribution blocks. This is done by precisely accounting for the computational costs of the BLAS 3 primitives, the communication costs and the cost of local aggregations. We also improved this static computation and communication schedule to anticipate the sending of partially aggregated blocks, in order to free memory dynamically. This reduces the aggregated memory overhead while keeping good performance.

Another important point is that our study is suitable for any heterogeneous parallel/distributed architecture whose performance is predictable, such as clusters of multicore nodes. In particular, we now offer a high-performance version with a low memory overhead for multicore node architectures, which fully exploits the advantage of shared memory through a hybrid MPI-thread implementation.

Direct methods are numerically robust, but very large three-dimensional problems may lead to systems that require a huge amount of memory despite any memory optimization. An approach under study defines an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Such an incomplete factorization can take advantage of the latest breakthroughs in sparse direct methods, and in particular should be very competitive in CPU time (effective processor utilization and good scalability) while avoiding the memory limitations encountered by direct methods.

## PLAST

**Team**: GenScale

**Categories**:
pThread
Multicore

**Dependencies**:

- None

**Description**:
PLAST is a software package dedicated to the comparison of banks of genomic sequences. It takes as input either banks of protein sequences or banks of DNA sequences. The output is a list of pairs of sequences in which similarities have been found. For each pair, the percentage of similarity is given together with other information of interest. The PLAST algorithm first indexes the two banks in memory, then searches for similarities using the "seed-and-extend" heuristic. Compared to the reference software of the domain (BLAST), PLAST runs 5 to 10 times faster.
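The seed-and-extend heuristic can be sketched as follows. This is a toy version with exact-match right extension only; the seed length, the lack of scoring, and the function names are our own choices, not PLAST's.

```python
from collections import defaultdict

K = 3  # seed length (illustrative; real tools tune this per alphabet)

def index_bank(seq):
    """Index every K-length seed of a sequence bank (here: one sequence)."""
    idx = defaultdict(list)
    for i in range(len(seq) - K + 1):
        idx[seq[i:i + K]].append(i)
    return idx

def seed_and_extend(query, subject):
    """Find exact K-mer seed hits, then extend each hit to the right."""
    idx = index_bank(subject)
    hits = []
    for i in range(len(query) - K + 1):
        for j in idx.get(query[i:i + K], []):
            l = K
            while (i + l < len(query) and j + l < len(subject)
                   and query[i + l] == subject[j + l]):
                l += 1
            hits.append((i, j, l))  # (query pos, subject pos, match length)
    return hits

hits = seed_and_extend("ACGTTGC", "TTACGTTGA")
```

The index is built once per bank, so the cost of the search is dominated by the extension phase, which PLAST parallelizes across threads.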

## Selalib

**Team**: Calvi

**Categories**:
MPI
Cluster

**Dependencies**:

- MPI

**Description**:

In the context of the French Action Fusion (Action d'envergure Nationale Fusion) and a long-standing collaboration between CEA-Cadarache (French Atomic Energy Commission), the Tonus Inria project-team based at the University of Strasbourg, and the Max-Planck-Institut für Plasmaphysik (IPP) in Garching, we are currently developing the Selalib library. Selalib is a collection of interfaces and implementations of individual building blocks that help researchers create and test numerical methods, develop new types of simulations, parallelization schemes, etc., either independently or in preparation for modifying existing production codes such as GYSELA, the CEA's massively parallel gyrokinetic full-tokamak simulator. The Selalib project arose from the need of researchers to develop numerical methods on simplified test cases while also having independently tested modules that would facilitate gradual changes in existing production code. While originally envisioned as specialized in the semi-Lagrangian method, the abstractions we have built can be used with other approaches, such as particle-in-cell. So far, development has been kept within an international group of collaborators, but we expect a release later in 2014. Stay tuned!
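As a minimal illustration of the semi-Lagrangian idea (trace characteristics backward in time, then interpolate the transported quantity), here is a toy 1-D constant-advection step. It is our own sketch, not Selalib code, and uses plain linear interpolation where production codes use higher-order splines.

```python
import numpy as np

# Constant-velocity advection on a periodic 1-D grid.
n, L, v, dt = 64, 1.0, 0.3, 0.05
x = np.linspace(0.0, L, n, endpoint=False)
f = np.exp(-100.0 * (x - 0.5) ** 2)        # initial profile

def step(f):
    foot = (x - v * dt) % L                # backward characteristic feet
    s = foot / (L / n)                     # feet in grid-index units
    i0 = np.floor(s).astype(int) % n
    w = s - np.floor(s)
    # periodic linear interpolation of f at the feet
    return (1.0 - w) * f[i0] + w * f[(i0 + 1) % n]

f1 = step(f)                               # profile after one time step
```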

# Categories

The tables below detail the categories each application belongs to. These categories are the ones used for the `--bench` option (see the benchmarks section of the documentation for further details).

## Technologies

The applications use a wide range of technologies to implement their parallelism. This table sums up their usages.

Applications | OpenMP | MPI | pThread | SSE2 | Cuda |
---|---|---|---|---|---|
Bocop | ✓ | | | | |
Genfield | | ✓ | | | |
Grph | | | ✓ | | |
Hips | | ✓ | | | |
Opennl | | | | | ✓ |
Pastix | | ✓ | ✓ | | |
Plast | | | ✓ | ✓ | |
Selalib | | ✓ | | | |

## Platforms

Depending on the technologies implemented, each application is designed to run on specific architectures. The targeted architectures for each application are detailed in the following table.

Applications | Multicore | Cluster | GPGPU |
---|---|---|---|
Bocop | ✓ | | |
Genfield | ✓ | ✓ | |
Grph | ✓ | | |
Hips | ✓ | ✓ | |
Opennl | | | ✓ |
Pastix | ✓ | ✓ | |
Plast | ✓ | | |
Selalib | | ✓ | |