# Direct Solution of Linear Systems

Tools for Solution of Algebraic Systems
Direct Solution of Linear Systems

## Summary

Providing factorization-based sparse solvers and preconditioners on large-scale parallel machines.

We are developing scalable sparse direct linear solvers and factorization-based algebraic preconditioners, which can solve the most challenging linear systems that are often too difficult for iterative methods. This work is applicable to accelerator design, earthquake simulation, fluid dynamics, fusion energy, material sciences, quantum mechanics, and many other engineering and science domains.

Our effort is focused on the development of three types of linear solvers. The first, encapsulated in the software symPACK and SuperLU, are pure direct solvers using LU factorization followed by triangular solutions. The second, encapsulated in the software STRUMPACK, is the quasi-linear complexity preconditioners using low-rank approximate factorization of the dense submatrices, where the hierarchical matrix algebra, such as Hierarchically Semi-separable (HSS) matrix structure, is employed. The third, encapsulated in the software PDSLin, is a Schur complement-based nonoverlapping domain decomposition method, blending direct solvers for the subdomain solution with preconditioned iterative solvers for the solution of the Schur complement system.

Sparse direct linear solvers are usually organized into three distinct phases: a preprocessing (or commonly called symbolic) phase, the numerical factorization phase, and finally the triangular solution phase. When factoring a sparse matrix, some zero entries become nonzero. Increasing both memory and computational costs. These new nonzero entries are referred to as the fill-in entries. By applying good permutations to reorder the equations (rows) and the variables (columns) it is possible to reduce the number of fill-ins. Computing good permutations is referred to as the Ordering problem, and is usually performed during the symbolic phase. Our work covers all these areas.

Recent accomplishments:

Sparse Symmetric Solver. FASTMath has developed and released a new direct linear solver for sparse symmetric positive definite matrices, symPACK (www.sympack.org). By exploiting the symmetry explicitly, symPACK achieves low storage costs. This novel solver implements the Fan-Both method on distributed memory platforms by using the UPC++ framework, which allows an asynchronous pull communication strategy coupled with an event-driven task scheduling to be incorporated. SymPACK often outperforms state-of-the-art solvers on a wide collection of problems. It has been integrated into the Pole EXpansion and Selective Inversion library (PEXSI), developed under a couple of BES SciDAC-3 projects on computational chemistry and computational materials science.

Providing memory-scalable implementations of various steps in the symbolic phase is a challenging but crucial task. A significant accomplishment of FASTMath in this area is the implementation of a scalable distributed-memory Reverse Cuthill-McKee (RCM) ordering algorithm for reducing the profile of a sparse matrix. This is the first distributed-memory implementation of the RCM ordering algorithm. This figure shows the performance advantage of symPACK when using UPC++ features over MPI, and up to 2.65x speedup over the other state-of-the-art symmetric direct solvers.

Sparse Unsymmetric Solver. For the sparse direct solver SuperLU library, we developed new algorithms to make elective use of GPUs and Intel Xeon Phi. These include aggregating large dense blocks for BLAS, overlapping GPU/CPU computations, using OpenMP tasking for irregular nested loops to avoid load imbalance. SuperLU achieves 3x speedup and 2-5x memory saving using 100s of GPU nodes on Titan at ORNL, and strong scales to over 64 KNL nodes on Cori at NERSC. SuperLU is used in a wide variety of applications, in particular in the SciDAC fusion M3D-C1 and NIMROD code, HEP ACE3P code, BES AMIS and DGDFT codes.

Recently we have developed a communication-avoiding 3D sparse LU factorization in SuperLU_DIST, which reduces the latency by a factor of O(log n) for plannar graphs and O(n^(1/3)) for non-plannar graphs. Our new 3D code achieves speedups up to 27x for planar graphs and up to 2.5x for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30, Edison at NERSC. This figure shows a heatmap plot of performance (TFLOPS/s) for two matrices K2D5pt4096 and nlpkkt80, with different combinations of 3D process grid P_xy X P_z. Here we increase P_xy and P_z by a factor of two along x and y-axis respectively. Thus, the bottommost row (P_z = 1) is the baseline 2D algorithm. In the best case, the 3D algorithm achieved 27.4x speedup for graph K2D5pt4096.

Sparse Low-rank Unsymmetric Solver. We developed faster algorithms using hierarchical low-rank matrix algebra (e.g. Hierarchically Semiseparable (HSS) matrix), which generalizes fast multipole methods and leads to factorization with quasi-linear arithmetic and memory complexity. The compressed representation also results in much reduced communication. We implemented the new algorithms in the STRUMPACK (STRUctured Matrices PACKage) software. The dense package is effective for the matrices associated with boundary element methods, integral equations, and the Toeplitz type of structured matrices. The sparse package can be used as (inexact) direct solver or preconditioner for general PDEs. STRUMPACK strong scales to over 8000 cores on Edison at NERSC. It is integrated into MFEM and PETSc and is used in fusion SciDAC M3D-C1 code for MHD equations. This figure shows STRUMPACK parallel strong scaling of the symbolic and numerical factorization phases and the overall time spent in preconditioner application on a Cray XC30, Edison at NERSC. Matrix tdr455k comes from accelerator shape design which involves solving Maxwell equations. This has always been a challenge for the other scalable solvers, including multigrid, whereas STRUMPACK can easily solve this class of problems.

Communication-Avoiding Numerical Linear Algebra. We developed new linear algebra algorithms that communicate as little as possible, attaining lower bounds when possible. Our inspiration was a classical result for sequential O(n^3) matrix multiplication, implemented with a large DRAM and small cache of size M, which says that any implementation must move at least (n^3/\sqrt{M}) words between DRAM and cache, and that this is attained by a well-known blocking technique. We first extended this result to the rest of classical O(n^3) linear algebra, with lower bounds and optimal algorithms for both sequential and parallel algorithms, then to sparse linear algebra (with optimal algorithms in some cases, depending on the sparsity structure), then to "fast linear algebra" (i.e. Strassen-like algorithms), then to all algorithms that can be described by nested loops accessing arrays, whose subscripts are affine functions of the loop indices (e.g., A(i), B(i; j), C(2*i- j + 1, k), etc). In a second body of work, we looked at Krylov subspace methods (KSMs), where the communication costs of the multiple sparse-matrix-vector multiplications and dot-products dominate. We again showed how to reorganize many such methods to avoid communication, for example BiCGStab, while maintaining numerical stability. Our work has resulted in large speedups.

## Contact

Sherry Li

Lawrence Berkeley National Laboratory, One Cyclotron Road, Mail Stop 50F, Berkeley, CA 94720

510-486-6684