Overview

The Auto-Tuning Framework (ATF) is a general-purpose auto-tuning approach: given a program that is implemented as generic in its performance-critical parameters (a.k.a. tuning parameters), such as tile sizes and numbers of threads, ATF fully automatically determines a configuration of these parameters that is optimized for the target hardware and input data.

Key Feature of ATF

A key feature of ATF is its support for Tuning Parameter Constraints. Parameter constraints allow auto-tuning programs whose tuning parameters have interdependencies among them, e.g., when the value of one tuning parameter has to evenly divide the value of another.

ATF’s support for parameter constraints is important: modern parallel programs target novel parallel architectures, which typically have deep memory and core hierarchies and thus require constraints on tuning parameters, e.g., the value of a tile-size parameter on an upper memory layer has to be a multiple of the tile-size value on a lower memory layer.
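
For example, such a tile-size constraint can be expressed as follows (a minimal sketch using pyATF’s interface, which is introduced in detail in the code examples below; the parameter names and ranges are chosen for illustration only):

    from pyatf import TP, Interval
    
    # hypothetical tile sizes on two memory layers (illustrative ranges)
    TILE_LOWER = TP('TILE_LOWER', Interval(1, 1024))
    # constraint: the upper layer's tile size must be a multiple
    # of the lower layer's tile size
    TILE_UPPER = TP('TILE_UPPER', Interval(1, 1024),
                    lambda TILE_LOWER, TILE_UPPER: TILE_UPPER % TILE_LOWER == 0)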

For such parameters, ATF introduces novel concepts for Generating, Storing & Exploring the search spaces of constrained tuning parameters, thereby making the overall auto-tuning process substantially more efficient, as confirmed in our Experiments.

Generality of ATF

For wide applicability, ATF is designed as generic in:

  1. The target program’s Programming Language, e.g., C/C++, CUDA, OpenMP, or OpenCL. ATF offers pre-implemented cost functions for conveniently auto-tuning C/C++ programs, as well as CUDA and OpenCL kernels; the host code required for executing such kernels is automatically generated and executed by ATF’s pre-implemented CUDA and OpenCL cost functions. ATF also offers a pre-implemented generic cost function for conveniently auto-tuning programs in any other programming language different from C/C++, CUDA, and OpenCL.

  2. The Search Technique to use. ATF offers different kinds of pre-implemented search techniques, such as simulated annealing and AUC bandit (inspired by OpenTuner), which combines multiple exploration techniques (such as differential evolution, Nelder-Mead, and Torczon hillclimbers). New techniques can be conveniently added to ATF by implementing a straightforward interface.

  3. The Tuning Objective, e.g., high runtime performance and/or low energy consumption. By default, ATF’s pre-implemented cost functions auto-tune for high runtime performance. The user can also define arbitrary custom tuning objectives.

  4. The Abort Condition, which specifies when to stop the auto-tuning process. ATF’s pre-implemented abort conditions allow stopping the auto-tuning process dependent on the tuning time (e.g., after a user-defined time interval has elapsed or a user-defined number of parameter configurations has been explored), dependent on the tuning result (e.g., when a particular performance level has been achieved), or dependent on both time and result (e.g., when the auto-tuning run was not able to further improve the target program’s performance within a particular time interval). New conditions can be conveniently added to ATF by the user, by implementing a straightforward interface. A minimal example combining a search technique with an abort condition follows this list.
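
For illustration, here is a minimal pyATF sketch that combines a pre-implemented search technique with a pre-implemented abort condition (the tuning parameter THREADS, the executable ./run.bin, and the $THREADS placeholder in the run command are illustrative assumptions; the interfaces are introduced in detail in the code examples below):

    from pyatf import TP, Interval, Tuner
    from pyatf.cost_functions import generic
    from pyatf.search_techniques import AUCBandit
    from pyatf.abort_conditions import Evaluations
    
    # hypothetical tuning parameter: number of threads (illustrative range)
    THREADS = TP('THREADS', Interval(1, 64))
    
    # hypothetical program to tune; we assume that './run.bin' obtains the
    # chosen THREADS value via the substituted '$THREADS' placeholder
    cf = generic.CostFunction('./run.bin $THREADS')
    
    # explore the search space using the AUC-bandit search technique,
    # stopping after 100 evaluated configurations
    tuning_result = Tuner().tuning_parameters( THREADS )    \
                           .search_technique( AUCBandit() ) \
                           .tune( cf, Evaluations(100) )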


Getting Started

ATF is open source on GitHub, with both of its interface kinds, under the commercially permissive MIT license:

  1. pyATF – ATF with its Python-based user interface
  2. cppATF – ATF with its C++-based user interface

We encourage you to use ATF in other projects, whether open source or commercial.


Code Examples

ATF currently offers two kinds of user interfaces that can be freely chosen by the user:

  1. A Python-based user interface – we call ATF with its Python-based interface pyATF
  2. A C++-based user interface – we call ATF with its C++-based interface cppATF

More interface kinds for ATF, in further mainstream programming languages, might follow.

Using ATF for Constrained Search Spaces

We show how ATF is used for auto-tuning the CUDA SAXPY routine from the CLBlast library.

The routine relies on two tuning parameters – WPT and LS. Both parameters range over the potential values 1,...,N, where N denotes the user-defined input size. The two parameters are interdependent: for correctness of the CUDA routine, parameter LS needs to divide N/WPT – the input size N divided by the value of the WPT parameter.
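
To make the interdependency concrete, the following plain-Python snippet (independent of ATF) enumerates all valid (WPT, LS) configurations for a small illustrative input size:

    # enumerate all valid (WPT, LS) configurations for an illustrative input size
    N = 8
    valid_configurations = [ (WPT, LS)
                             for WPT in range(1, N + 1) if N % WPT == 0
                             for LS  in range(1, N + 1) if (N // WPT) % LS == 0 ]
    # e.g., (WPT=2, LS=4) is valid, since 2 divides 8 and 4 divides 8/2 = 4
    print(valid_configurations)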


  • The SAXPY CUDA routine is conveniently auto-tuned via pyATF using its pre-implemented CUDA cost function cuda.CostFunction, which automatically generates and executes the CUDA host code required for executing the routine.

    import numpy as np
    
    from pyatf import TP, Interval, Tuner
    from pyatf.cost_functions import cuda
    from pyatf.search_techniques import AUCBandit
    from pyatf.abort_conditions import Evaluations
    
    # kernel code as string
    saxpy_kernel_as_string = '''
    extern "C" __global__ void saxpy( const int N, const float a, const float* x, float* y )
    {
        for( int w = 0 ; w < WPT ; ++w )
        {
            const int index = w * gridDim.x * blockDim.x + blockIdx.x * blockDim.x + threadIdx.x;
            y[ index ] += a * x[ index ];
        }
    }
    '''
    
    # input size
    N = 1000
    
    # Step 1: Generate the Search Space
    WPT = TP('WPT', Interval( 1, N ), lambda WPT: N % WPT == 0           )
    LS  = TP('LS',  Interval( 1, N ), lambda WPT, LS: (N // WPT) % LS == 0)
    
    # Step 2: Implement a Cost Function
    saxpy_kernel = cuda.Kernel( cuda.source(saxpy_kernel_as_string), 'saxpy' )  # kernel's code & name
    
    cf_saxpy = cuda.CostFunction( saxpy_kernel ).device_id( 0 )                                       \
                                                .kernel_args( np.int32( N )                        ,
                                                              np.float32(np.random.random())       ,
                                                              np.random.rand(N).astype(np.float32) ,
                                                              np.random.rand(N).astype(np.float32) )  \
                                                .grid_dim( lambda WPT, LS: N // WPT // LS )           \
                                                .block_dim( lambda LS: LS )
    
    # Step 3: Explore the Search Space
    tuning_result = Tuner().tuning_parameters( WPT, LS )       \
                           .search_technique( AUCBandit() )    \
                           .tune( cf_saxpy, Evaluations(50) )
    

  • The SAXPY CUDA routine is conveniently auto-tuned via cppATF using its pre-implemented CUDA cost function atf::cuda::cost_function, which automatically generates and executes the CUDA host code required for executing the routine.

    #define ENABLE_CUDA_COST_FUNCTION
    #include <atf.hpp>
    
    int main( int argc, char* argv[] )
    {
        // kernel code as string
        auto saxpy_kernel_as_string = R"(
    extern "C" __global__ void saxpy( const int N, const float a, const float* x, float* y )
    {
        for( int w = 0 ; w < WPT ; ++w )
        {
            const int index = w * gridDim.x * blockDim.x + blockIdx.x * blockDim.x + threadIdx.x;
            y[ index ] += a * x[ index ];
        }
    })";
    
        // input size
        int N = 1000;
    
        // Step 1: Generate the Search Space
        auto WPT = atf::tuning_parameter( "WPT"                        ,
                                          atf::interval<size_t>( 1,N ) ,
                                          atf::divides( N )            );
    
        auto LS  = atf::tuning_parameter( "LS"                         ,
                                          atf::interval<size_t>( 1,N ) ,
                                          atf::divides( N/WPT )        );
    
        // Step 2: Implement a Cost Function
        auto saxpy_kernel = atf::cuda::kernel< atf::scalar<int>   ,  // N
                                               atf::scalar<float> ,  // a
                                               atf::buffer<float> ,  // x
                                               atf::buffer<float> >  // y
                                             ( atf::source(saxpy_kernel_as_string), "saxpy" );  // kernel's code & name
    
        auto cf_saxpy = atf::cuda::cost_function( saxpy_kernel ).device_id( 0 )                     // CUDA device id
                                                                .inputs( atf::scalar<int>( N )   ,  // N
                                                                         atf::scalar<float>()    ,  // a
                                                                         atf::buffer<float>( N ) ,  // x
                                                                         atf::buffer<float>( N ) )  // y
                                                                .grid_dim( N/WPT/LS )               // CUDA grid dim
                                                                .block_dim( LS );                   // CUDA block dim
    
        // Step 3: Explore the Search Space
        auto tuning_result = atf::tuner().tuning_parameters( WPT, LS )
                                         .search_technique( atf::auc_bandit() )
                                         .tune( cf_saxpy, atf::evaluations(50) );
    }
    

Using ATF for Conventional Search Spaces

We show how ATF is used for auto-tuning the C++ Matrix Multiplication example from OpenTuner – a state-of-the-art auto-tuning approach for programs with conventional search spaces, i.e., whose tuning parameters have no interdependencies among them.

The matrix multiplication implementation uses as its tuning parameter BLOCK_SIZE, whose value lies within the range 1,...,10, according to OpenTuner’s tuning program for matrix multiplication.

The corresponding, equivalent OpenTuner program for auto-tuning the C++ matrix multiplication implementation is available here.

Our Experiments confirm that we achieve the same high auto-tuning quality as OpenTuner for programs with unconstrained search spaces.


  • We conveniently auto-tune the matrix multiplication implementation via pyATF using pyATF’s pre-implemented cost function generic.CostFunction, which allows auto-tuning programs implemented in arbitrary programming languages (the underlying placeholder mechanism is sketched after these examples). The full example is available here.

    from pyatf import TP, Interval, Tuner
    from pyatf.cost_functions import generic
    from pyatf.search_techniques import Exhaustive
    
    # Step 1: Generate the Search Space
    BLOCK_SIZE = TP('BLOCK_SIZE', Interval(1, 10))
    
    # Step 2: Implement a Cost Function
    run_command     = './tmp.bin'
    compile_command = 'g++ ../mmm_block.cpp -DBLOCK_SIZE=$BLOCK_SIZE -o ./tmp.bin'
    
    cf_matmul = generic.CostFunction(run_command).compile_command(compile_command)
    
    # Step 3: Explore the Search Space
    tuning_result = Tuner().tuning_parameters( BLOCK_SIZE )   \
                           .search_technique( Exhaustive() )  \
                           .tune( cf_matmul )
    

  • We conveniently auto-tune the matrix multiplication implementation via cppATF using cppATF’s pre-implemented cost function atf::generic::cost_function, which allows auto-tuning programs implemented in arbitrary programming languages. The full example is available here.

    #include <atf.hpp>
    
    int main()
    {
      // Step 1: Generate the Search Space
      auto BLOCK_SIZE = atf::tuning_parameter( "BLOCK_SIZE", atf::interval<int>( 1,10 ) );
    
      // Step 2: Implement a Cost Function
      auto run_command     = "./run.bin";
      auto compile_command = "g++ mmm_block.cpp -DBLOCK_SIZE=$BLOCK_SIZE -o ./run.bin";
    
      auto cf_matmul = atf::generic::cost_function( run_command ).compile_command( compile_command );
    
      // Step 3: Explore the Search Space
      auto tuning_result = atf::tuner().tuning_parameters( BLOCK_SIZE )
                                       .search_technique( atf::auc_bandit() )
                                       .tune( cf_matmul, atf::duration<std::chrono::seconds>( 30 ) );
    }
    

    Note: Since cppATF is implemented in C++, the same language as the matrix multiplication example to tune, we could also use for this example cppATF’s pre-implemented cost function atf::cxx::cost_function (as demonstrated here). This cost function is more convenient to use, and also more performant for auto-tuning C++ programs via cppATF, than atf::generic::cost_function, which is generic in the programming language of the program to tune. We use atf::generic::cost_function in this example for demonstration purposes.
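
In both variants above, the $BLOCK_SIZE placeholder in the compile command is replaced with the value chosen for the BLOCK_SIZE tuning parameter before each evaluation. Conceptually, each evaluation proceeds as in the following plain-Python sketch (an illustration of the presumed mechanism, not pyATF’s or cppATF’s actual implementation):

    import subprocess
    
    # conceptual sketch of one evaluation: substitute the chosen value into
    # the compile command, then compile and run the program; the runtime of
    # the run command serves as the configuration's cost
    def evaluate(block_size: int) -> None:
        compile_command = 'g++ mmm_block.cpp -DBLOCK_SIZE=$BLOCK_SIZE -o ./tmp.bin'
        subprocess.run(compile_command.replace('$BLOCK_SIZE', str(block_size)),
                       shell=True, check=True)
        subprocess.run('./tmp.bin', shell=True, check=True)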


Publications

  1. A. Rasch, R. Schulze, M. Steuwer, S. Gorlatch
    Efficient Auto-Tuning of Parallel Programs With Interdependent Tuning Parameters via Auto-Tuning Framework (ATF)
    ACM Transactions on Architecture and Code Optimization (TACO 2021); original work, presented at the HiPEAC’21 conference
    Paper Slides Talk

  2. A. Rasch, S. Gorlatch
    ATF: A Generic, Directive-Based Auto-Tuning Framework
    Concurrency and Computation: Practice and Experience (CCPE 2018)
    Paper

  3. A. Rasch, M. Haidl, S. Gorlatch
    ATF: A Generic Auto-Tuning Framework
    IEEE International Conference on High Performance Computing and Communications (HPCC 2017)
    Paper Slides

WIP/Short Papers & Talks

  1. A. Rasch, R. Schulze
    Auto-Tuning Framework (ATF)
    Generic Autotuning Technology for GPU Applications (Lorentz Center 2022), (invited talk)
    Slides


Citations

Please use the following citations when referring to ATF’s:

  1. Internal Design & Implementation
    @article{10.1145/3427093,
      author = {Rasch, Ari and Schulze, Richard and Steuwer, Michel and Gorlatch, Sergei},
      title = {Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF)},
      year = {2021},
      issue_date = {March 2021},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      volume = {18},
      number = {1},
      issn = {1544-3566},
      url = {https://doi.org/10.1145/3427093},
      doi = {10.1145/3427093},
      journal = {ACM Trans. Archit. Code Optim.},
      month = {jan},
      articleno = {1},
      numpages = {26},
      keywords = {parallel programs, Auto-tuning, interdependent tuning parameters}
    }
    
  2. DSL-Based User Interface
    @article{https://doi.org/10.1002/cpe.4423,
      author = {Rasch, Ari and Gorlatch, Sergei},
      title = {ATF: A generic directive-based auto-tuning framework},
      journal = {Concurrency and Computation: Practice and Experience},
      volume = {31},
      number = {5},
      pages = {e4423},
      keywords = {auto-tuning, CLBlast, CLTune, CUDA, dependent tuning parameters, GEMM, many-core, multi-core, multi-objective auto-tuning, OpenCL, OpenTuner, tuning parameter constraints},
      doi = {10.1002/cpe.4423},
      url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4423},
      eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.4423},
      note = {e4423 cpe.4423},
      year = {2019}
    }
    
  3. GPL-Based C++ User Interface
    @INPROCEEDINGS{8291912,
      author={Rasch, Ari and Haidl, Michael and Gorlatch, Sergei},
      booktitle={2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)},
      title={ATF: A Generic Auto-Tuning Framework},
      year={2017},
      pages={64-71},
      doi={10.1109/HPCC-SmartCity-DSS.2017.9}
    }
    


Contact


You can also find us on Discord.