Overview
The Auto-Tuning Framework (ATF) is a general-purpose auto-tuning approach: given a program that is implemented as generic in performance-critical program parameters (a.k.a. tuning parameters), such as tile sizes and numbers of threads, ATF fully automatically determines a hardware- and data-optimized configuration of such parameters.
Key Feature of ATF
A key feature of ATF is its support for Tuning Parameter Constraints. Parameter constraints allow auto-tuning programs whose tuning parameters have interdependencies among them, e.g., the value of one tuning parameter has to evenly divide the value of another tuning parameter.
ATF's support for parameter constraints is important: modern parallel programs target novel parallel architectures, and such architectures typically have deep memory and core hierarchies, thus requiring constraints on tuning parameters, e.g., the value of a tile-size parameter for an upper memory layer has to be a multiple of the tile-size value for a lower memory layer.
For such parameters, ATF introduces novel concepts for Generating, Storing, and Exploring the search spaces of constrained tuning parameters, thereby contributing to a substantially more efficient overall auto-tuning process for such parameters, as confirmed in our Experiments.
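As a brief illustration, the following minimal sketch declares such interdependent parameters in pyATF; the `TP`, `Interval`, and constraint-lambda constructs are taken from the full SAXPY example in the Code Examples section below, and `N = 1000` is an assumed example input size:

```python
# Minimal sketch of constrained tuning parameters in pyATF, taken from the
# SAXPY example below: WPT must evenly divide N, and LS must evenly divide N/WPT.
from pyatf import TP, Interval

N = 1000  # assumed example input size
WPT = TP('WPT', Interval(1, N), lambda WPT: N % WPT == 0)
LS  = TP('LS',  Interval(1, N), lambda WPT, LS: (N / WPT) % LS == 0)
```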
Generality of ATF
For wide applicability, ATF is designed as generic in:
- The target program's Programming Language, e.g., C/C++, CUDA, OpenMP, or OpenCL. ATF offers pre-implemented cost functions for conveniently auto-tuning C/C++ programs, as well as CUDA and OpenCL kernels; the host code required for executing such kernels is automatically generated and executed by ATF's pre-implemented CUDA and OpenCL cost functions. ATF also offers a pre-implemented generic cost function for conveniently auto-tuning programs written in any other programming language.
- The Search Technique to use. ATF offers various pre-implemented search techniques, such as simulated annealing and AUC bandit (inspired by OpenTuner), which combines multiple exploration techniques (such as differential evolution, Nelder-Mead, and Torczon hillclimbers). New techniques can be conveniently added to ATF by implementing a straightforward interface (see the sketch after this list).
- The Tuning Objective, e.g., high runtime performance and/or low energy consumption. By default, ATF's pre-implemented cost functions auto-tune for high runtime performance; the user can also choose arbitrary, self-defined tuning objectives.
- The Abort Condition, which specifies when to stop the auto-tuning process. ATF's pre-implemented abort conditions allow stopping the auto-tuning process dependent on the tuning time (e.g., after a user-defined time interval has elapsed or a user-defined number of parameter configurations has been explored), dependent on the tuning result (e.g., when a particular performance has been achieved), or dependent on both time and result (e.g., when the auto-tuning run has not further improved the target program's performance within a particular time interval). New conditions can be conveniently added to ATF by the user, likewise by implementing a straightforward interface.
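To give an impression of such an interface, the sketch below outlines a user-defined random search technique in Python. Note that the class and method names (`initialize`, `finalize`, `get_next_coordinates`, `report_costs`) and the convention that techniques operate on coordinates in the unit hypercube are assumptions based on our reading of pyATF; please consult the pyATF repository for the exact interface.

```python
# Hedged sketch of a user-defined search technique for pyATF. Assumption:
# search techniques propose points of the unit hypercube [0,1]^d via the
# four methods below -- check the pyATF repository for the exact interface.
import random
from typing import Dict, Set, Tuple

Coordinates = Tuple[float, ...]  # one point of the unit hypercube
Cost = float

class RandomSearch:
    def initialize(self, num_dims: int) -> None:
        # called once before the tuning run starts
        self._num_dims = num_dims

    def finalize(self) -> None:
        # called once after the tuning run finishes
        pass

    def get_next_coordinates(self) -> Set[Coordinates]:
        # propose the next point of the search space to evaluate;
        # random search draws a uniformly random point of [0,1]^d
        return {tuple(random.random() for _ in range(self._num_dims))}

    def report_costs(self, costs: Dict[Coordinates, Cost]) -> None:
        # receive the measured costs for the proposed points;
        # random search ignores this feedback
        pass
```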
Getting Started
ATF is open source on GitHub, with both of its interface kinds, under the commercially permissive MIT license.
We encourage you to use ATF in your own projects, whether open source or commercial.
Code Examples
ATF currently offers two kinds of user interfaces that can be freely chosen by the user:
- A Python-based user interface – we call ATF with its Python-based interface pyATF
- A C++-based user interface – we call ATF with its C++-based interface cppATF
More interface kinds for ATF, in further mainstream programming languages, might follow.
Using ATF for Constrained Search Spaces
We show how ATF is used for auto-tuning the CUDA SAXPY routine from the CLBlast library.
The routine relies on two tuning parameters – `WPT` and `LS`. Both parameters have as their range of potential values 1,...,N, where N denotes the user-defined input size. The two parameters are interdependent: parameter `LS` needs to evenly divide `N/WPT` – the input size N divided by the value of the `WPT` parameter – for correctness of the CUDA routine.
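To make the constraints concrete, this plain-Python snippet (illustration only, not ATF code) enumerates all configurations that satisfy the constraints used in the code examples below – `WPT` evenly divides `N`, and `LS` evenly divides `N/WPT` – for an assumed small input size `N = 8`:

```python
# Illustration only (plain Python, not ATF code): enumerate all (WPT, LS)
# configurations that satisfy the SAXPY constraints for N = 8.
N = 8
valid = [(WPT, LS)
         for WPT in range(1, N + 1) if N % WPT == 0           # WPT evenly divides N
         for LS  in range(1, N + 1) if (N // WPT) % LS == 0]  # LS evenly divides N/WPT
print(valid)
# -> [(1, 1), (1, 2), (1, 4), (1, 8), (2, 1), (2, 2), (2, 4), (4, 1), (4, 2), (8, 1)]
```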
- The SAXPY CUDA routine is conveniently auto-tuned via pyATF using its pre-implemented CUDA cost function `cuda.CostFunction`, which automatically generates and executes the CUDA host code required for executing the routine.

```python
import numpy as np

from pyatf import TP, Interval, Tuner
from pyatf.cost_functions import cuda
from pyatf.search_techniques import AUCBandit
from pyatf.abort_conditions import Evaluations

# kernel code as string
saxpy_kernel_as_string = '''
extern "C" __global__ void saxpy( const int N, const float a, const float* x, float* y )
{
    for( int w = 0 ; w < WPT ; ++w )
    {
        const int index = w * gridDim.x * blockDim.x + blockIdx.x * blockDim.x + threadIdx.x;
        y[ index ] += a * x[ index ];
    }
}
'''

# input size
N = 1000

# Step 1: Generate the Search Space
WPT = TP('WPT', Interval( 1, N ), lambda WPT: N % WPT == 0)
LS  = TP('LS' , Interval( 1, N ), lambda WPT, LS: (N / WPT) % LS == 0)

# Step 2: Implement a Cost Function
saxpy_kernel = cuda.Kernel( cuda.source(saxpy_kernel_as_string), 'saxpy' )  # kernel's code & name
cf_saxpy = cuda.CostFunction( saxpy_kernel ).device_id( 0 )                        \
                                            .kernel_args( np.int32( N )                         ,
                                                          np.float32( np.random.random() )      ,
                                                          np.random.rand( N ).astype(np.float32),
                                                          np.random.rand( N ).astype(np.float32) ) \
                                            .grid_dim( lambda WPT, LS: N / WPT / LS )          \
                                            .block_dim( lambda LS: LS )

# Step 3: Explore the Search Space
tuning_result = Tuner().tuning_parameters( WPT, LS )    \
                       .search_technique( AUCBandit() ) \
                       .tune( cf_saxpy, Evaluations(50) )
```
- The SAXPY CUDA routine is conveniently auto-tuned via cppATF using its pre-implemented CUDA cost function `atf::cuda::cost_function`, which automatically generates and executes the CUDA host code required for executing the routine.

```cpp
#define ENABLE_CUDA_COST_FUNCTION
#include <atf.hpp>

int main( int argc, char* argv[] )
{
    // kernel code as string
    auto saxpy_kernel_as_string = R"(
extern "C" __global__ void saxpy( const int N, const float a, const float* x, float* y )
{
    for( int w = 0 ; w < WPT ; ++w )
    {
        const int index = w * gridDim.x * blockDim.x + blockIdx.x * blockDim.x + threadIdx.x;
        y[ index ] += a * x[ index ];
    }
})";

    // input size
    int N = 1000;

    // Step 1: Generate the Search Space
    auto WPT = atf::tuning_parameter( "WPT",
                                      atf::interval<size_t>( 1, N ),
                                      atf::divides( N ) );
    auto LS  = atf::tuning_parameter( "LS",
                                      atf::interval<size_t>( 1, N ),
                                      atf::divides( N / WPT ) );

    // Step 2: Implement a Cost Function
    auto saxpy_kernel = atf::cuda::kernel< atf::scalar<int>   ,  // N
                                           atf::scalar<float> ,  // a
                                           atf::buffer<float> ,  // x
                                           atf::buffer<float> >  // y
                        ( atf::source( saxpy_kernel_as_string ), "saxpy" );  // kernel's code & name

    auto cf_saxpy = atf::cuda::cost_function( saxpy_kernel )
                        .device_id( 0 )                        // CUDA device id
                        .inputs( atf::scalar<int>( N )    ,    // N
                                 atf::scalar<float>()     ,    // a
                                 atf::buffer<float>( N )  ,    // x
                                 atf::buffer<float>( N ) )     // y
                        .grid_dim( N / WPT / LS )              // CUDA grid dim
                        .block_dim( LS );                      // CUDA block dim

    // Step 3: Explore the Search Space
    auto tuning_result = atf::tuner().tuning_parameters( WPT, LS )
                                     .search_technique( atf::auc_bandit() )
                                     .tune( cf_saxpy, atf::evaluations(50) );
}
```
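Note how the two interfaces express the same interdependencies differently: pyATF uses Python lambdas as constraint functions, whereas cppATF uses `atf::divides`; both declare the identical constrained search space.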
Using ATF for Conventional Search Spaces
We show how ATF is used for auto-tuning the C++ Matrix Multiplication example from OpenTuner – a state-of-the-art auto-tuning approach for programs with conventional search spaces, i.e., programs whose tuning parameters have no interdependencies among them.
The matrix multiplication implementation uses as its tuning parameter `BLOCK_SIZE`, whose value lies within the range 1,...,10, according to OpenTuner's tuning program for matrix multiplication.
The corresponding, equivalent OpenTuner program for auto-tuning the C++ matrix multiplication implementation is available here.
Our Experiments confirm that we achieve the same high auto-tuning quality as OpenTuner for programs with unconstrained search spaces.
- We conveniently auto-tune the matrix multiplication implementation via pyATF using pyATF's pre-implemented cost function `generic.CostFunction`, which allows auto-tuning programs implemented in arbitrary programming languages. The full example is available here.

```python
from pyatf import TP, Interval, Tuner
from pyatf.cost_functions import generic
from pyatf.search_techniques import Exhaustive

# Step 1: Generate the Search Space
BLOCK_SIZE = TP('BLOCK_SIZE', Interval(1, 10))

# Step 2: Implement a Cost Function
run_command     = './tmp.bin'
compile_command = 'g++ ../mmm_block.cpp -DBLOCK_SIZE=$BLOCK_SIZE -o ./tmp.bin'
cf_matmul = generic.CostFunction(run_command).compile_command(compile_command)

# Step 3: Explore the Search Space
tuning_result = Tuner().tuning_parameters( BLOCK_SIZE )  \
                       .search_technique( Exhaustive() ) \
                       .tune( cf_matmul )
```
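Tuning parameters can be referenced in the commands via `$<parameter name>`: for each explored configuration, ATF substitutes the parameter's current value, here handing `BLOCK_SIZE` to the compiler as a preprocessor definition.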
- We conveniently auto-tune the matrix multiplication implementation via cppATF using cppATF's pre-implemented cost function `atf::generic::cost_function`, which allows auto-tuning programs implemented in arbitrary programming languages. The full example is available here.

```cpp
#include <atf.hpp>

int main()
{
    // Step 1: Generate the Search Space
    auto BLOCK_SIZE = atf::tuning_parameter( "BLOCK_SIZE", atf::interval<int>( 1, 10 ) );

    // Step 2: Implement a Cost Function
    auto run_command     = "./run.bin";
    auto compile_command = "g++ mmm_block.cpp -DBLOCK_SIZE=$BLOCK_SIZE -o ./run.bin";
    auto cf_matmul = atf::generic::cost_function( run_command ).compile_command( compile_command );

    // Step 3: Explore the Search Space
    auto tuning_result = atf::tuner().tuning_parameters( BLOCK_SIZE )
                                     .search_technique( atf::auc_bandit() )
                                     .tune( cf_matmul, atf::duration<std::chrono::seconds>( 30 ) );
}
```
Note: Since cppATF is implemented in C++, the same language as the matrix multiplication example to tune, we could also use cppATF's pre-implemented cost function `atf::cxx::cost_function` for this example (as demonstrated here). For auto-tuning C++ programs via cppATF, this cost function is more convenient to use and more performant than `atf::generic::cost_function`, which is generic in the programming language the program to tune is implemented in. We use `atf::generic::cost_function` in this example for demonstration.
Publications
- A. Rasch, R. Schulze, M. Steuwer, S. Gorlatch
  Efficient Auto-Tuning of Parallel Programs With Interdependent Tuning Parameters via Auto-Tuning Framework (ATF)
  ACM Transactions on Architecture and Code Optimization (TACO 2021) (original work, presented at the HiPEAC'21 conference)
  Paper · Slides · Talk
- A. Rasch, S. Gorlatch
  ATF: A Generic, Directive-Based Auto-Tuning Framework
  Concurrency and Computation: Practice and Experience (CCPE 2018)
  Paper
- A. Rasch, M. Haidl, S. Gorlatch
  ATF: A Generic Auto-Tuning Framework
  IEEE International Conference on High Performance Computing and Communications (HPCC 2017)
  Paper · Slides
WIP/Short Papers & Talks
- A. Rasch, R. Schulze
  Auto-Tuning Framework (ATF)
  Generic Autotuning Technology for GPU Applications (Lorentz Center 2022) (invited talk)
  Slides
Citations
Please use the following citations when referring to ATF's:
- Internal Design & Implementation
```bibtex
@article{10.1145/3427093,
  author     = {Rasch, Ari and Schulze, Richard and Steuwer, Michel and Gorlatch, Sergei},
  title      = {Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF)},
  year       = {2021},
  issue_date = {March 2021},
  publisher  = {Association for Computing Machinery},
  address    = {New York, NY, USA},
  volume     = {18},
  number     = {1},
  issn       = {1544-3566},
  url        = {https://doi.org/10.1145/3427093},
  doi        = {10.1145/3427093},
  journal    = {ACM Trans. Archit. Code Optim.},
  month      = {jan},
  articleno  = {1},
  numpages   = {26},
  keywords   = {parallel programs, Auto-tuning, interdependent tuning parameters}
}
```
- DSL-Based User Interface
```bibtex
@article{https://doi.org/10.1002/cpe.4423,
  author   = {Rasch, Ari and Gorlatch, Sergei},
  title    = {ATF: A generic directive-based auto-tuning framework},
  journal  = {Concurrency and Computation: Practice and Experience},
  volume   = {31},
  number   = {5},
  pages    = {e4423},
  keywords = {auto-tuning, CLBlast, CLTune, CUDA, dependent tuning parameters, GEMM, many-core, multi-core, multi-objective auto-tuning, OpenCL, OpenTuner, tuning parameter constraints},
  doi      = {https://doi.org/10.1002/cpe.4423},
  url      = {https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4423},
  eprint   = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.4423},
  note     = {e4423 cpe.4423},
  year     = {2019}
}
```
- GPL-Based C++ User Interface
```bibtex
@inproceedings{8291912,
  author    = {Rasch, Ari and Haidl, Michael and Gorlatch, Sergei},
  booktitle = {2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)},
  title     = {ATF: A Generic Auto-Tuning Framework},
  year      = {2017},
  pages     = {64-71},
  doi       = {10.1109/HPCC-SmartCity-DSS.2017.9}
}
```
Contact
Richard Schulze
Affiliation: University of Münster
Email: r.schulze@uni-muenster.de
Website: richardschulze.net
You can also find us on Discord.