Short talks

Théo Cantaloube (Émeraude, CITI, INSA Lyon) – A New Constraint Programming Model for the Multiple Constant Multiplication

Co-authors: Xiao Peng, Christine Solnon, Anastasia Volkova

The Multiple Constant Multiplication (MCM) problem arises in many applications, such as digital signal processing. Given a set T of target constants, the goal of MCM is to find the most efficient way of multiplying an input number by each constant in T, where multiplications are realized through bit-shifts and additions, and where intermediate results may be shared to produce different target constants. Different metrics may be considered for evaluating the cost of a solution, and a classical objective function is to minimize the number of adders. State-of-the-art methods, based on Integer Linear Programming (ILP), suffer from numerous performance and scalability bottlenecks. In this work, we propose for the first time a Constraint Programming (CP) model for minimizing the number of adders for MCM. Compared to the state-of-the-art ILP approach, CP does not suffer from the curse of linearization and hence permits significantly simpler formulations of the mathematical model. To evaluate our CP model, we focus on a widely used benchmark extracted from a collection of digital filter designs and compare with state-of-the-art ILP and SAT models. We show that our CP approach is less efficient on some easy instances, but much more efficient on hard instances. We also introduce a pseudo-polynomial time algorithm which is able to solve some instances, and show that using this algorithm in a preprocessing step improves the solution process.
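To make the shift-and-add decomposition behind MCM concrete, here is a minimal sketch (an illustrative example only, not the CP model presented in the talk): multiplying by the targets T = {7, 21} with two adders, where the intermediate result 7x is shared.

```python
# Illustrative shift-and-add realization of T = {7, 21}; the targets are an
# assumption chosen for the example, not taken from the benchmark in the talk.
def multiply_by_7_and_21(x):
    x7 = (x << 3) - x        # 7*x  = 8*x - x         -> 1 adder
    x21 = (x7 << 1) + x7     # 21*x = 2*(7*x) + 7*x   -> 1 adder, reuses 7*x
    return x7, x21

assert multiply_by_7_and_21(5) == (35, 105)
```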

Maël Chaumette (OCKHAM, LIP, ENS Lyon) – CROQuant: Complex Rank-One Quantization Algorithm

Co-authors: Rémi Gribonval, Elisa Riccietti

This paper presents a quantization algorithm for complex-valued rank-one matrices that exploits rescaling invariances of the problem to obtain better results than the round-to-nearest strategy. This algorithm is also used as a building block for a heuristic strategy to quantize complex-valued butterfly-structured sparse matrices appearing, for example, in the fast Fourier transform. Compared to element-wise round-to-nearest quantization, the number of bits is reduced by 30% for a given precision on butterfly matrices, while maintaining a polynomial time complexity in the dimension of the matrices.
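A rough sketch of the rescaling idea, with real values and a uniform grid quantizer standing in for the low-precision format (the actual CROQuant algorithm is more involved):

```python
import numpy as np

def quantize(v, step):
    return step * np.round(v / step)          # round-to-nearest on a uniform grid

def quantize_rank_one(u, v, step, scales):
    # u v^T = (a*u) (v/a)^T for any a != 0: pick the scale that minimizes the
    # error on the reconstructed matrix rather than on each factor separately.
    best = None
    for a in scales:
        uq, vq = quantize(a * u, step), quantize(v / a, step)
        err = np.linalg.norm(np.outer(u, v) - np.outer(uq, vq))
        if best is None or err < best[0]:
            best = (err, uq, vq)
    return best

u, v = np.random.randn(8), np.random.randn(8)
err, uq, vq = quantize_rank_one(u, v, step=0.125, scales=np.geomspace(0.25, 4, 33))
```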

Orégane Desrentes (Émeraude, CITI, INSA Lyon and LMF, Inria Saclay) – Reciprocal Square Root Accelerated Using Hardware and Software Techniques

The square root function is an elementary function with wide-ranging applications, particularly in the field of vector mathematics.

One of the common variants is the reciprocal square root, used when normalising a vector by its Euclidean norm. In this process, each vector coefficient is divided by the norm to ensure that the resulting vector has a length of one. This normalisation is essential for various graphical transformations. It is also used to accelerate training with batch normalisation. Squaring the reciprocal square root gives the reciprocal, and multiplying it by the input gives the square root.

The square root function and its variants are already hardware accelerated in most processors, including Kalray's, where one square root can be computed per cycle in the core. This leads to two issues. First, when intensive computations are carried out within the accelerator, going back and forth with the core for every computation is not ideal. Secondly, a throughput of one function approximation per cycle is too low for an accelerator that manipulates vectors of 8 FP32 numbers, and even worse for the reciprocal square root if it is computed in two steps.

This leads to the need for a new, accelerator-specific way of implementing the family of square root functions, using a mix of hardware and software acceleration. This presentation proposes a method to accelerate the reciprocal square root, the square root and the reciprocal functions using one hardware operator based on multipartite tables, with software refinements as needed to reach the required accuracy target.
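For context, one standard software refinement of this kind is a Newton-Raphson step for 1/sqrt(x); the sketch below (a generic illustration, not the specific hardware/software split of the talk) shows how a coarse seed, such as one read from a table-based operator, is improved:

```python
# Generic Newton-Raphson refinement for the reciprocal square root: each
# iteration roughly doubles the number of correct bits of the seed y0.
def rsqrt(x, y0, iterations=2):
    y = y0
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

print(rsqrt(2.0, 0.7, iterations=3))   # converges towards 1/sqrt(2) ~ 0.70710678
```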

Alexy Dutois (IMATH, Université de Toulon) – RSA-CRT Using a Hybrid RNS/PMNS System

Co-authors: Laurent-Stéphane Didier, Jean-Marc Robert

RSA is one of the most popular encryption systems. It makes use of two or more prime numbers and relies heavily on modular exponentiations by the product of these prime numbers. Such operations are costly. When we have access to the prime numbers (for example in a signature process), one way to improve efficiency is to perform modular exponentiations by these smaller numbers and then use the Chinese Remainder Theorem (CRT) to get the final result. This algorithm is called RSA-CRT. RNS and PMNS are alternative modular number representations. In the first system, numbers are represented by remainders and in the second one, by polynomials. Both systems get rid of the need for carry propagation, therefore allowing computations to be parallelized. In this work, we will explain how we used a hybrid RNS/PMNS system to improve RSA-CRT's efficiency. We will describe the algorithms we used, and we will explain how we made use of advanced vectorized instruction sets in our algorithms.
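As a reminder of the CRT recombination that RSA-CRT relies on, here is a minimal sketch using Garner's formula and plain Python integers (the RNS/PMNS representations discussed in the talk are not shown, and the toy key is an assumption chosen for illustration):

```python
# Two small exponentiations modulo p and q replace one large exponentiation
# modulo n = p*q.
def rsa_crt_decrypt(c, d, p, q):
    dp, dq = d % (p - 1), d % (q - 1)
    mp = pow(c, dp, p)                 # exponentiation modulo p
    mq = pow(c, dq, q)                 # exponentiation modulo q
    qinv = pow(q, -1, p)               # precomputable constant
    h = (qinv * (mp - mq)) % p
    return mq + h * q                  # result modulo n = p*q

p, q, e = 61, 53, 17
n, d = p * q, pow(e, -1, (p - 1) * (q - 1))
assert rsa_crt_decrypt(pow(42, e, n), d, p, q) == 42
```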

El-Mehdi El Arar (LIP6, Sorbonne Université) – Mixed Precision Accumulation for Neural Network Inference Guided by Componentwise Forward Error Analysis

Co-authors: Silviu-Ioan Filip, Elisa Riccietti, Theo Mary

This work proposes a mathematically founded mixed precision accumulation strategy for the inference of neural networks. Our strategy is based on a new componentwise forward error analysis that explains the propagation of errors in the forward pass of neural networks. Specifically, our analysis shows that the error in each component of the output of a layer is proportional to the condition number of the inner product between the weights and the input, multiplied by the condition number of the activation function. These condition numbers can vary widely from one component to the other, thus creating a significant opportunity to introduce mixed precision: each component should be accumulated in a precision inversely proportional to the product of these condition numbers. We propose a numerical algorithm that exploits this observation: it first computes all components in low precision, uses this output to estimate the condition numbers, and recomputes in higher precision only the components associated with large condition numbers. We test our algorithm on various networks and datasets and confirm experimentally that it can significantly improve the cost-accuracy tradeoff compared with uniform precision accumulation baselines.
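A hedged sketch of that three-step algorithm, with NumPy precisions standing in for the target formats and an arbitrary threshold (the paper's actual condition-number estimate and precision assignment are more refined):

```python
import numpy as np

def mixed_precision_layer(W, x, threshold=1e2):
    # 1) compute all components in low precision
    y = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float64)
    # 2) estimate the condition number of each inner product: sum|w_i x_i| / |sum w_i x_i|
    cond = (np.abs(W) @ np.abs(x)) / np.maximum(np.abs(y), 1e-30)
    # 3) recompute only the ill-conditioned components in higher precision
    ill = cond > threshold
    y[ill] = W[ill].astype(np.float64) @ x.astype(np.float64)
    return y
```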

Cédric Gernigon (Nantes Université) – Beyond Weight-Only: Mixed-Precision Quantization for BERT Weights, Activations and Embeddings

Co-authors: Xavier Pillet, Anastasia Volkova, Richard Dufour

Pre-trained language models deliver strong performance across various Natural Language Processing (NLP) tasks but remain costly to deploy due to memory and compute demands. To address this, model compression techniques such as pruning, knowledge distillation, and quantization have emerged, with quantization gaining traction due to hardware support for low precision. While uniform and extremely low-precision quantization have shown promise, mixed-precision approaches that assign variable bit-widths to weights/activations across the model offer a superior balance between compression and accuracy. In this work, we aim to evaluate the impact of mixed-precision quantization for inference on the BERT language model. Unlike prior work that often neglects activation quantization, our study systematically explores both weights and activations in mixed-precision configurations. To further improve performance, we integrate knowledge distillation into the mixed-precision pipeline. We also evaluate the impact of quantization on the embedding layer, which is generally restricted to quantizing the token weights only. Evaluated on SQuAD and GLUE benchmarks, our results achieve substantial memory and computational reductions without sacrificing accuracy.
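For reference, the basic building block that such schemes vary across the model is a per-tensor fake quantization with a chosen bit-width; a minimal sketch (uniform symmetric quantization, with purely illustrative bit-widths and tensor shapes):

```python
import numpy as np

def fake_quantize(t, bits):
    # uniform symmetric quantization to `bits` bits, then back to float
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(t)) / qmax
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

w8 = fake_quantize(np.random.randn(768, 768), bits=8)   # e.g. 8-bit weights
a4 = fake_quantize(np.random.randn(128, 768), bits=4)   # e.g. 4-bit activations
```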

David Hamelin (EDF R&D and LIP6, Sorbonne Univ. and LMF) – Generation of Pathological Cases for Rounding Errors

Co-authors: Sylvie Boldo, Pierre-Yves Piriou, Thibault Hilaire

Given a program with floating-point computations, there are many analysis tools available to obtain an upper bound on the rounding error. We aim to evaluate the tightness of this bound by exhibiting input values for the program where the rounding error is close to it (called "pathological cases"). We propose an automatic method for generating pathological cases based on delta debugging and guided by Gappa. We have implemented this method and applied it to numerous examples from FPBench and from publications.
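To fix ideas, measuring the rounding error of a program on a given input amounts to comparing its floating-point result with an exact reference; the naive random search below only illustrates that notion, and is not the delta-debugging, Gappa-guided method of the talk:

```python
from fractions import Fraction
import random

def prog(x):                                       # toy program with an absorption issue
    return (x + 1.0) - x

def rounding_error(x):
    exact = (Fraction(x) + 1) - Fraction(x)        # exact rational evaluation
    return abs(Fraction(prog(x)) - exact)

worst = max((random.uniform(1.0, 1e16) for _ in range(10_000)), key=rounding_error)
print(worst, float(rounding_error(worst)))         # an input where the error reaches ~1
```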

Antoine Jégo (LIP6, Sorbonne Université) – Memory Accessor for Mixed-Precision GMRES-Based Iterative Refinement

Co-authors: Patrick Amestoy, Jean-Yves L'Excellent, Theo Mary

Mixed-precision iterative refinement algorithms have been designed to provide high-precision solutions to well-conditioned problems while achieving high performance by relying on low-precision factorizations. However, both the factorization step and the iterative correction step of these algorithms may use multiple arithmetics. This mixture of precisions requires the developer to either convert the factorized system or use flexible iterative correction steps. Such modifications limit problem scalability because the memory footprint grows. Instead, we propose to use mixed-precision memory accessor approaches that decouple storage and compute precisions (data is stored and accessed in low precision, but computations are kept in higher precision), reduce data accesses, improve accuracy, and simplify programming. In this work, we present experimental results of mixed-precision GMRES-based iterative refinement. We leverage the block low-rank structures and the mixed-precision storage of the sparse direct solver MUMPS to achieve a low memory footprint. We also present the BLAS-based block memory accessor of MUMPS that is leveraged during the solve step to reach high performance. We discuss the importance of the memory accessor to reach better problem scalability compared with a flexible GMRES-based iterative refinement.
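A toy sketch of the accessor idea (NumPy stand-in precisions, not the MUMPS implementation): blocks are stored in low precision and promoted to high precision only when accessed, so the computation itself runs in fp64:

```python
import numpy as np

class LowPrecisionBlock:
    def __init__(self, block):
        self.data = block.astype(np.float32)      # storage precision
    def get(self):
        return self.data.astype(np.float64)       # compute precision on access

blocks = [LowPrecisionBlock(b) for b in np.array_split(np.random.randn(1024, 64), 16)]
x = np.random.randn(64)
y = np.concatenate([blk.get() @ x for blk in blocks])   # fp32 storage, fp64 compute
```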

Christoph Lauter (University of Texas at El Paso) – Computer Arithmetic on the 8-bit Floating-Point Formats G.711 alaw and ulaw

With the 2020s hype in AI, IEEE is putting effort, through P3109, into standardizing 8-bit Floating-Point formats. The current draft of the standard foresees two to three different 8-bit formats, E3M4 and E4M3 through E5M2, with precisions of 5 down to 2 bits.

In this talk, we would like to present two other 8-bit Floating-Point formats, namely CCITT/ITU G.711 alaw and ulaw. These formats date back to the middle of the 1960s and are still in widespread use for telephony all around the world. Standardizing documents such as the ITU G.711 Recommendation do exist; however, they differ heavily from documents like IEEE754 or the P3109 draft. For instance, the words "exponent" or "mantissa" appear nowhere in the G.711 Standard. The Standard also leaves the decoding of G.711-encoded bit-patterns to real numbers as "an exercise to the reader", merely hinting at the absolute value of a bit-pattern by referring to the power of a special G.711 alaw sequence in a table in an annex.

We will show that G.711 alaw is very similar to an IEEE754-inspired floating-point format, with 5 bits of precision, subnormal numbers and signed zeros, but leaving out special values like NaNs and Infinities. We will show branch-less, loop-less and table-less procedures to decode and encode G.711 alaw to Fixed-Point and IEEE754 Floating-Point arithmetic. We will present alaw-in and alaw-out addition and multiplication operations and report on experiments around the implementation of Digital Signal Processing algorithms directly in the alaw domain, such as FFT and Goertzel algorithms.
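For readers unfamiliar with the format, the widely circulated reference A-law decoder below (adapted to Python) already exposes the sign, segment ("exponent") and "mantissa" structure that the talk relates to an IEEE754-like format; it is the classic reference procedure, not the branch-less, loop-less, table-less variants proposed in the presentation:

```python
def alaw_to_linear(a):
    a ^= 0x55                                   # undo the even-bit inversion
    mant = (a & 0x0F) << 4                      # 4-bit "mantissa" field
    seg = (a & 0x70) >> 4                       # 3-bit segment ("exponent") field
    if seg == 0:
        t = mant + 8                            # first segment: subnormal-like
    else:
        t = (mant + 0x108) << (seg - 1)         # hidden bit plus rounding offset
    return t if a & 0x80 else -t

print([alaw_to_linear(b) for b in (0xD5, 0x55, 0x2A, 0xAA)])  # [8, -8, -32256, 32256]
```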

We will also comment on the even older G.711 ulaw encoding and its non-traditional non-semi-logarithmic encoding that ends up being semi-logarithmic on its first 7 most significant bits.

Elisa Riccietti (OCKHAM, LIP, ENS Lyon) – Optimal Quantization of Rank-One Matrices in Floating-Point Arithmetic

Co-authors: Rémi Gribonval, Theo Mary

We consider the problem of optimally quantizing rank-one matrices to low-precision floating-point arithmetic. By exploiting the natural scaling invariance property of the problem, we characterize the optimal solution as the quantization of suitably scaled factors of the rank-one matrix, and we develop an algorithm of tractable complexity to find the optimal scaling parameters. We show experimentally that our algorithm can significantly reduce the quantization error with respect to the naive strategy of separately quantizing the two rank-one factors, which may be far from optimal. In particular, we propose an application of our algorithm to design a heuristic strategy to quantize a product of butterfly factors and we show that the proposed strategy can achieve storage reductions of up to 30% with no loss of accuracy.
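The scaling invariance in question can be written compactly: for any nonzero scalar, the rank-one matrix is unchanged if one factor is scaled up and the other scaled down accordingly, so the search reduces to choosing that scalar before quantizing each factor.

```latex
u v^{\top} \;=\; (\alpha u)\,\bigl(\alpha^{-1} v\bigr)^{\top},
\qquad \text{for every scalar } \alpha \neq 0 .
```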

Kristalys Ruiz-Rohena (University of Texas at El Paso) – Semi-Automatic Compile-Time Math Intrinsic Optimization Using LLVM

Co-authors: Christoph Lauter, Shirley Moore

For scientific applications requiring higher numerical accuracy, simply increasing precision, such as switching from single to double precision, is often insufficient. Errors introduced during computation can accumulate over many operations, degrading the accuracy of the result. To address this, we introduce a tool built on LLVM that detects operations of interest and replaces them with user-defined implementations. The tool is developed in C++ and targets LibTorch, the low-level C++ API for PyTorch. It will be evaluated by replacing transcendental functions in deep learning model architectures such as Inception and ResNet. The evaluation will include accuracy and performance tests using CIFAR-10 and ImageNet. This work aims to enable systematic exploration of numerical robustness in machine learning and scientific computing applications.

Alexandre Tabouret (LIP6, Sorbonne Université) – Solving Sparse Linear Systems with Adaptive Precision GMRES

Co-authors: Emmanuel Agullo, Luc Giraud, Pierre Jolivet, Theo Mary

The growing use of low-precision arithmetic, initially popularized by artificial intelligence, has created new opportunities for scientific computing. In high-performance computing (HPC), low-precision formats open the door to mixed-precision algorithms that can deliver large performance gains, but at the cost of reduced numerical accuracy if not carefully monitored. In this talk, we present a variant of the Generalized Minimal Residual (GMRES) method that relies on a mixed-precision sparse matrix-vector multiplication. In our approach, the precision of the coefficients of the system matrix can vary both across entries within the same iteration and over the course of the iterations, effectively turning GMRES into a mixed and adaptive precision solver. We also show how combining this approach with restarted GMRES further reduces memory usage and computational cost, while maintaining stable convergence. We will discuss the algorithmic design, highlight how accuracy losses can be controlled in Krylov subspace methods, and present performance results compared to standard double-precision GMRES. Our results show that, when accuracy is properly monitored, low-precision arithmetic together with restarts can significantly improve solver efficiency on modern HPC systems.
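A minimal sketch of what a mixed-precision sparse matrix-vector product can look like (SciPy stand-in with magnitude-based bucketing and an arbitrary cutoff; the adaptive criterion actually used inside GMRES is the talk's contribution, not this one):

```python
import numpy as np
import scipy.sparse as sp

def adaptive_spmv(A, x, cutoff):
    A = sp.csr_matrix(A)
    hi = A.copy()                                  # large entries kept in fp64
    lo = A.copy().astype(np.float32)               # small entries stored in fp32
    small = np.abs(A.data) < cutoff
    hi.data[small] = 0.0
    lo.data[~small] = 0.0
    return hi @ x + (lo @ x.astype(np.float32)).astype(np.float64)
```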

Lisa Taldir (Université de Perpignan, UPVD) – Detection of Numerical Bugs Using Large Language Models

Co-author: David Defour

The study examines how Large Language Models (LLMs) can detect and classify floating-point arithmetic errors in software, which are subtle but potentially catastrophic. To evaluate this, the authors introduce InterFLOPBench, a benchmark of 91 C code instances and 1,138 test samples covering six error categories (cancellation, overflow, underflow, NaN, division by zero, and comparison errors), validated with FPChecker. Five LLMs (Gemini 2.0 Flash, Gemini 2.5 Flash, GPT-4o, DeepSeek-R1, and Gemma 3 27B IT) are tested using different methods, including single models, ensembles, confidence-based classification, and a Division of Labor pipeline that separates code analysis from error detection. The results show that LLMs display strong numerical reasoning, with the best models achieving Micro F1 scores above 0.90. DeepSeek-R1 and Gemini 2.5 Flash perform best individually, while the Division of Labor pipeline yields the highest overall performance (0.8481 F1). LLMs detect explicit errors such as comparison and division by zero well but struggle with subtler issues like underflow. Confidence-based classification provides a good balance of accuracy and coverage, outperforming simple binary approaches, while prompt design has a major impact on performance, and temperature settings have little effect. This research demonstrates that LLMs can support educational use and code review for explicit numerical errors but should be combined with specialized tools to handle subtler cases. Overall, the findings highlight both the potential and the limitations of LLMs in numerically aware program analysis.
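Two of the benchmark's error categories rendered as minimal examples (the actual InterFLOPBench instances are C programs; these Python lines are purely illustrative):

```python
# Absorption/cancellation: the small term is lost entirely in binary64.
a, b = 1.0, 1e-16
print((a + b) - a)          # prints 0.0 instead of 1e-16

# Comparison error: equality tests on rounded results are unreliable.
x = 0.1 + 0.2
print(x == 0.3)             # prints False because of rounding
```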

Anastasia Volkova (Inria Emeraude, CITI, INSA Lyon) – Hardware-Aware Numerical Data Formats for DNN Acceleration

Co-authors: Florent de Dinechin, Bastien Barbe, Romain Bouarah

The growing demand for maximum efficiency in deep neural network (DNN) inference and training, especially on embedded and edge devices, has intensified research into customized numerical data formats. Traditional floating-point and fixed-point representations are being replaced by extremely low-precision formats that promise significant gains in computational and energy efficiency. However, implementing such formats in hardware requires bridging the gap between algorithmic design and practical realization through careful co-optimization. In our current work, we focus on the design of highly efficient arithmetic units for sub-8-bit integer computations. When the representable range is limited to only a few discrete values (for example, 16 levels), the assumptions behind classic fixed-point arithmetic no longer align with the specific requirements of modern machine learning models. We therefore consider a data format as a pair: a set of representable values and an associated rounding function. This view allows exploration of non-uniform and non-symmetric ranges that better reflect the distribution of learned parameters and have a lower associated hardware cost of arithmetic operations. We illustrate this concept using two examples: a logarithmic number system and shift-and-add-friendly numbers. Preliminary results from quantization-aware training show the potential of these formats to improve both hardware efficiency and model performance.
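As a toy rendering of the "value set plus rounding function" view, the sketch below defines a hypothetical set of shift-and-add friendly values (signed sums of at most two powers of two; the sets actually studied in the work may differ) together with a round-to-nearest function over that set:

```python
# Hypothetical value set: signed sums of at most two powers of two, which map
# to cheap shift-and-add hardware; the exponent ranges are illustrative only.
values = sorted({s * (2 ** a + e * 2 ** b)
                 for s in (1, -1) for e in (0, 1)
                 for a in range(-2, 3) for b in range(-4, a)} | {0.0})

def round_to_format(x):
    return min(values, key=lambda v: abs(v - x))   # the format's rounding function

print(round_to_format(2.7))   # 2.5 = 2 + 2**-1, i.e. (x << 1) + (x >> 1) in hardware
```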
