Optimizing BLAS Level 3 with Machine Learning

ADSALA dynamically selects optimal thread configurations for BLAS operations, achieving 1.5× to 3.0× speedups across multi-core systems.

[Figure: Performance comparison of Default, ADSALA, OpenBLAS, and MKL on an Intel Xeon Gold 6248R (3.0 GHz); ADSALA averages a 2.1× speedup over the default configuration.]

Research Overview

Adaptive Thread Configuration for BLAS

Machine learning-driven optimization of Basic Linear Algebra Subprograms (BLAS) Level 3 operations

Performance Gains

Achieves 1.5× to 3.0× speedups compared to traditional maximum-thread approaches across various hardware platforms.

ML-Powered

Machine learning models are trained during installation to predict the optimal thread count for each operation on your specific hardware.

Comprehensive Coverage

Expanded ADSALA library includes all single- and double-precision BLAS Level 3 operations with adaptive optimization.

Performance

Benchmark Results

Speedup Comparison

DGEMM 2.8×
DSYMM 2.1×
DTRMM 1.7×

Speedup factors compared to default maximum-thread configuration across different BLAS Level 3 operations.

System Compatibility

Intel Xeon

Gold/Platinum series processors

AMD EPYC

Rome/Milan series processors

Consumer CPUs

Intel Core i7/i9, AMD Ryzen

Performance Across Matrix Sizes

ADSALA shows consistent performance improvements across varying matrix dimensions, with particularly strong gains for medium-sized matrices (512×512 to 2048×2048).

Implementation

How ADSALA Works

Installation Process

  1. Download and compile the ADSALA library

  2. Training phase executes benchmarks to profile system performance

  3. Machine learning models are trained on the collected data

  4. Optimized library is ready for production use

Runtime Operation

  • BLAS Call

    Application calls a BLAS Level 3 routine

  • Model Prediction

    ADSALA predicts optimal thread count based on operation parameters

  • Execution

    Operation executes with optimized thread configuration
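The runtime flow above can be sketched as follows. This is a minimal illustration, not the library's actual API: the function and table names are assumptions, and the trained model is approximated by a simple size-bucket lookup, whereas ADSALA uses learned models over the full operation parameters.

```python
# Hypothetical sketch of ADSALA-style runtime thread selection.
# The "model" here is a lookup keyed on operation type and a
# matrix-size bucket; the real library predicts from a trained model.
THREAD_MODEL = {
    ("dgemm", "small"): 2,
    ("dgemm", "medium"): 8,
    ("dgemm", "large"): 16,
}

def size_bucket(m, n, k):
    """Bucket the problem by total work m*n*k (illustrative cutoffs)."""
    work = m * n * k
    if work < 256**3:
        return "small"
    if work < 2048**3:
        return "medium"
    return "large"

def predict_threads(op, m, n, k, max_threads=16):
    """Predict a thread count for one BLAS Level 3 call."""
    bucket = size_bucket(m, n, k)
    return min(THREAD_MODEL.get((op, bucket), max_threads), max_threads)
```

The key point is that the decision happens per call: a small DGEMM may run on 2 threads while a large one uses all cores, rather than every call defaulting to the maximum.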

adsala_install.sh
# Installation and training process
./configure --prefix=/usr/local/adsala
make
make train   # Executes benchmark suite
make install # Installs optimized library

# Training output example
[ADSALA] Training on Intel Xeon Gold 6248R
[ADSALA] Benchmarking DGEMM... 2.4× potential
[ADSALA] Benchmarking DSYMM... 1.9× potential
[ADSALA] Generated optimization models
[ADSALA] Installation complete
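The `make train` phase above amounts to timing each operation at several thread counts and recording which is fastest. A sketch of that loop, with a synthetic cost function standing in for a real timed DGEMM run (the overhead constants are illustrative, not measured):

```python
# Sketch of the training/profiling phase: for each problem size,
# "time" the operation at candidate thread counts and keep the fastest.

def synthetic_runtime(n, threads):
    """Stand-in for a measured DGEMM wall time: ideal parallel speedup
    plus a per-thread overhead, so the optimum is not always max threads."""
    flops = 2 * n**3
    return flops / (threads * 1e9) + 1e-4 * threads

def best_thread_count(n, candidates=(1, 2, 4, 8, 16)):
    """Pick the thread count with the lowest observed runtime."""
    times = {t: synthetic_runtime(n, t) for t in candidates}
    return min(times, key=times.get)
```

Because the overhead term grows with the thread count, small problems favor few threads while large ones favor many; the recorded (size, best-thread-count) pairs become the training data for the models.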

Technical Details

Machine Learning Models

Random Forest and Gradient Boosted Decision Trees trained on operation parameters (matrix dimensions, operation type) and hardware counters.

Features Used

Matrix dimensions, operation type, memory hierarchy characteristics, core utilization patterns.
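One plausible encoding of those features as a numeric vector is shown below. The feature names and scalings are assumptions for illustration; the source only states that dimensions, operation type, memory-hierarchy characteristics, and core utilization are used.

```python
# Hypothetical feature vector for one BLAS Level 3 call.
import math

OP_CODES = {"dgemm": 0, "dsymm": 1, "dtrmm": 2}

def make_features(op, m, n, k, l2_bytes=1 << 20, cores=16):
    """Encode a call as [op, log2 dims, cache-pressure proxy, cores]."""
    # Working set of A (m*k), B (k*n), C (m*n) in double precision.
    working_set = 8 * (m * k + k * n + m * n)
    return [
        OP_CODES[op],
        math.log2(m), math.log2(n), math.log2(k),
        working_set / l2_bytes,   # how far the data overflows L2
        cores,
    ]
```

Log-scaled dimensions and a cache-pressure ratio keep the feature ranges comparable across problem sizes, which tree-based models like the ones described above handle well.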

Overhead

Less than 1% runtime overhead for model prediction, with substantial gains in execution time.

Get ADSALA

Download and Installation

Install via package manager

ADSALA is available through popular package managers for easy installation.

sudo add-apt-repository ppa:adsala/optimized-blas
sudo apt-get update
sudo apt-get install libadsala

Source Code

Compile from source for maximum customization

GitHub Repository

Contribute or report issues

Documentation

Full API reference and usage examples

Research Paper

Cite Our Work

The complete technical details and evaluation of ADSALA are available in our peer-reviewed paper.

Abstract

We present ADSALA, an adaptive BLAS Level 3 library that uses machine learning to dynamically select the optimal number of threads for each operation. Our approach achieves 1.5× to 3.0× speedups compared to the traditional maximum-thread configuration across various multi-core systems. The library automatically trains models during installation that capture the performance characteristics of the target hardware, then uses these models at runtime to predict the best thread configuration for each BLAS operation.

Published: June 2023