

Institut de Recherche en Informatique et Systèmes Aléatoires

### Machine Learning for Timing Estimation: the good, the bad and the ugly – short version

Isabelle Puaut, RT-ML workshop, 2024







#### Outline

- Motivations for using ML in WCET analysis
- Our contributions (in a nutshell)
- The good
- The bad
- The ugly
- Takeaways







with no branch except last instruction





# Contributions



#### Machine learning for timing estimation

Replace the low-level analysis by a machine learning (ML) model

#### **Supervised learning**











1. Collecting training data

2. Training the model
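The two steps above can be sketched with the simplest model from the talk's list, linear regression. This is a minimal illustration, not the authors' pipeline: the single feature (instruction count), the helper names `fit_line` and `predict_cycles`, and the cycle counts are all hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y ~ a * x + b with a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b

# 1. Collecting training data: (instruction count, measured cycles) per basic block.
#    Hypothetical values for illustration only, not measurements from the talk.
training_data = [(2, 5), (4, 9), (6, 13), (8, 17)]

# 2. Training the model.
counts = [n for n, _ in training_data]
cycles = [c for _, c in training_data]
slope, intercept = fit_line(counts, cycles)

def predict_cycles(instruction_count):
    """Predict the timing of an unseen basic block from its instruction count."""
    return slope * instruction_count + intercept
```

Once trained, `predict_cycles` needs no further measurement on the target, which is the point of replacing the low-level analysis with a model.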

# Context

#### Spectrum of contributions

- Collection of timing data (at basic block level)
  - Synthetic programs vs real code
  - Metric: average-case performance and worst-case timing
  - Features: proportions of instructions, sequences of instructions (BB "in context")
- Learning
  - Basic techniques: Linear Regression, Random Forests, Gradient Boosting, Neural Networks (scikit-learn)
  - Natural Language Processing (NLP) Techniques: LSTM, Transformers-XL
- Large panel of architectures:
  - Very simple ones: TI MSP430, Cortex M4
  - More complex ones: Cortex M7, Cortex A53



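| Program              | Natural language |
|----------------------|------------------|
| operation/operands   | letters          |
| instruction          | word             |
| basic block          | sentence         |
| basic block sequence | paragraph        |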
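The analogy with natural language (operations and operands as letters, instructions as words, basic blocks as sentences, sequences of basic blocks as paragraphs) can be made concrete with a toy tokenizer. This is only a sketch: the function names, the `<BB>` separator token and the sample assembly are hypothetical, and this is not the tokenizer used by CAWET.

```python
def tokenize_instruction(instruction):
    """Split one instruction ("word") into operation and operand tokens ("letters")."""
    return instruction.replace(",", " ").split()

def tokenize_sequence(basic_blocks, separator="<BB>"):
    """Flatten a sequence of basic blocks ("paragraph") into one token stream."""
    tokens = []
    for block in basic_blocks:        # each basic block is a "sentence"
        for instruction in block:     # each instruction is a "word"
            tokens.extend(tokenize_instruction(instruction))
        tokens.append(separator)      # hypothetical end-of-block marker
    return tokens

# Two hypothetical basic blocks executed one after the other.
context = [["mov r4, r5", "add #1, r5"], ["cmp r5, r6", "jne loop"]]
stream = tokenize_sequence(context)
```

A sequence model such as an LSTM or Transformer-XL would then consume `stream` (after mapping tokens to integer ids) instead of a hand-crafted feature vector.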



#### CAWET: use of Transformers XL

#### Deep learning model for processing (very) long sequential data












# Benefits of machine learning for timing estimation (by construction)

| No need for details<br>of the processor<br>microarchitecture | Once deployed, no<br>measurements<br>needed | Easy porting to<br>a new<br>architecture | Tokenization for<br>free |
|--------------------------------------------------------------|---------------------------------------------|------------------------------------------|--------------------------|
| Only need to measure (in the worst case)                     | Fast predictions                            | Only re-train                            | Tokenizers exist         |




#### Good precision with simple targets


| ML algorithm          | MAPE (MSP430) |
|-----------------------|---------------|
| Linear regression     | 56.7%         |
| Bayesian Ridge        | 62.1%         |
| Gradient Boosting     | 42.1%         |
| Random Forest         | 43.4%         |
| Multilayer perceptron | 8.2%          |

Low-power TI MSP430 micro-controller (2-stage pipeline, tiny icache, no dcache), basic ML algorithms
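For reference, the tables report MAPE, the mean absolute percentage error: the average over all samples of |predicted - measured| / measured, expressed as a percentage. A minimal sketch follows; the function name and the sample cycle counts are illustrative and not taken from the experiments.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / m for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)

# Hypothetical measured and predicted timings (in cycles) for three basic blocks.
measured_cycles = [10, 20, 40]
predicted_cycles = [12, 18, 50]
error = mape(measured_cycles, predicted_cycles)  # about 18.3
```

MAPE is not capped at 100%: a prediction three times larger than the measurement contributes an error of 200%.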



| ML algorithm           | MAPE<br>(Cortex M4) |
|------------------------|---------------------|
| Multilayer perceptron  | 43.8%               |
| LSTM                   | 36.2%               |
| CAWET (Transformer-XL) | 23.8%               |

MAPE on Cortex M4 (in-order pipeline, 3 stages, no cache, measurements via JTAG)





Results on Cortex M4 (basic block level on the left, program level on the right)



### The bad



#### Pessimism increases with more complex targets

| ML algorithm           | MAPE<br>(Cortex M4) | MAPE<br>(Cortex M7) |
|------------------------|---------------------|---------------------|
| Multilayer perceptron  | 43.8%               | 132.7%              |
| LSTM                   | 36.2%               | 126.4%              |
| CAWET (Transformer-XL) | 23.8%               | 102.2%              |

Cortex M7 (in-order pipeline, 6 stages, L1 caches, measurements via JTAG)



#### Hyper-parameter selection may drive you nuts

| Loss function               | MSE       | MSE       | MSE       | MSE       | MSE       | MSE       | MAPE      | MAPE      | MAPE      | MAPE      | MAPE      | MAPE      |
|-----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| Learning rate               | $10^{-4}$ | $10^{-4}$ | $10^{-3}$ | $10^{-3}$ | $10^{-2}$ | $10^{-2}$ | $10^{-4}$ | $10^{-4}$ | $10^{-3}$ | $10^{-3}$ | $10^{-2}$ | $10^{-2}$ |
| Optimizer                   | SGD       | ADAM      | SGD       | ADAM      | SGD       | ADAM      | SGD       | ADAM      | SGD       | ADAM      | SGD       | ADAM      |
| Default                     | 163%      | 159%      | 182%      | 198%      | 170%      | 195%      | 152%      | 161%      | 210%      | 176%      | 182%      | 169%      |
| Larger ML<br>network size   | 210%      | 139%      | 156%      | 167%      | 321%      | 124%      | 110%      | 134%      | 152%      | 143%      | 126%      | 134%      |
| Without float instructions  | 77%       | 69%       | 88%       | 65%       | 75%       | 98%       | 78%       | 65%       | 56%       | 55%       | 89%       | 60%       |
| Learning on normalized time | 19.2%     | 18.7%     | 19.5%     | 19.2%     | 17%       | 22.1%     | 15.2%     | 11.4%     | 14.4%     | 15.9%     | 17.3%     | 19.1%     |

ACET learning, LSTM, Cortex-M7, learning lasts several days
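The twelve columns of the table above correspond to a full grid over three hyper-parameters. The sketch below enumerates that grid and picks the best column of the "Learning on normalized time" row. It assumes the columns read as six MSE configurations followed by six MAPE configurations, each with learning rates 10^-4, 10^-3, 10^-2 and the SGD/ADAM pair; the names `grid` and `best_configuration` are illustrative.

```python
from itertools import product

loss_functions = ["MSE", "MAPE"]
learning_rates = [1e-4, 1e-3, 1e-2]
optimizers = ["SGD", "ADAM"]

# One dictionary per table column, in the assumed left-to-right column order.
grid = [
    {"loss": loss, "learning_rate": lr, "optimizer": opt}
    for loss, lr, opt in product(loss_functions, learning_rates, optimizers)
]

def best_configuration(results):
    """Return (configuration, error) with the lowest error from {column_index: error}."""
    best_index = min(results, key=results.get)
    return grid[best_index], results[best_index]

# Errors (in %) of the "Learning on normalized time" row, in column order.
normalized_time_errors = [19.2, 18.7, 19.5, 19.2, 17.0, 22.1,
                          15.2, 11.4, 14.4, 15.9, 17.3, 19.1]
config, error = best_configuration(dict(enumerate(normalized_time_errors)))
```

With training lasting several days, every additional hyper-parameter multiplies the cost of exploring such a grid.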



- Handcrafted features, multilayer perceptron (MLP)

| Instruction proportions | Adding access type                                                   |
|-------------------------|----------------------------------------------------------------------|
| %MOV, %ADD, %SUB        | %instruction_with_direct_access<br>%instruction_with_indirect_access |
| Error = 311%            | Error = 181%                                                         |




#### Models are hard to debug



- Local Interpretable Model-agnostic Explanations (LIME): see the WORTEX talk
- The most impactful feature is instruction count, really?

| Instruction_Src_Dst | MOV.W_X(Rn)_Rn | SUB.W_#N_Rn    |
|---------------------|----------------|----------------|
| MOV.W_Rn_Rn         | ADD.W_#N_Rn    | SUB.W_Rn_ADDF  |
| MOV.W_@Rn_Rn        | CMP.W_Rn_Rn    | JMP            |
| MOV.B_#N_Rn         | MOV.W_Rn_Rn    | SUB.W_X(Rn)_Rn |
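The feature names in the table follow an `Instruction_Src_Dst` pattern in which concrete operands are abstracted (`Rn`, `#N`, `@Rn`, `X(Rn)`). The sketch below shows one way such names could be derived from assembly text. It is an assumption-laden illustration, not the extractor used in the cited work: the regular expressions, the function names, the handling of jumps (target dropped, inferred from the bare `JMP` entry) and the pass-through of other addressing modes are all hypothetical.

```python
import re

def abstract_operand(operand):
    """Map a concrete operand to an abstract addressing-mode token."""
    operand = operand.strip()
    if re.fullmatch(r"[Rr]\d+", operand):
        return "Rn"        # register direct
    if re.fullmatch(r"@[Rr]\d+", operand):
        return "@Rn"       # register indirect
    if operand.startswith("#"):
        return "#N"        # immediate
    if re.fullmatch(r"-?\w+\([Rr]\d+\)", operand):
        return "X(Rn)"     # indexed
    return operand         # other modes are left unchanged in this sketch

def feature_name(instruction):
    """Build an Instruction_Src_Dst style name from one line of assembly."""
    parts = instruction.strip().split(None, 1)
    mnemonic = parts[0].upper()
    # Assumption: jump targets are dropped, as suggested by the bare JMP entry.
    if len(parts) == 1 or mnemonic.startswith("J"):
        return mnemonic
    operands = [abstract_operand(op) for op in parts[1].split(",")]
    return "_".join([mnemonic] + operands)
```

For example, `feature_name("mov.w 4(r5), r6")` yields `MOV.W_X(Rn)_Rn`, matching the naming style of the table.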





The ugly





# **Takeaways**



#### Takeaways: lessons learnt

- Feature selection is key to success
- Training data is crucial
- ML for timing estimation works pretty well, but
  - Many (too many) parameters to control: be calm, patient, and methodical
  - Requires domain expertise and ML expertise: cooperate with ML experts!
  - Techniques are hard to debug: need for (more) explainability
  - No formal guarantee of safety/precision: need for certifiable ML



- Extension to multi-cores



Postdoc position (University of Toulouse and University of Rennes, France), project AlxIA (Artificial Intelligence for Interference Analysis)

For the bounty please contact:

- Thomas CARLE : thomas.carle@irit.fr
- Isabelle PUAUT : puaut@irisa.fr
- More details needed? Join the WCET workshop for the long version!

# Any questions?



# No questions, really?





- A. N. Amalou, I. Puaut and G. Muller. WE-HML: Hybrid WCET Estimation using Machine Learning for Architectures With Caches. RTCSA 2021.
- A. N. Amalou, E. Fromont and I. Puaut. CATREEN: Context-Aware Code Timing Estimation with Stacked Recurrent Networks. ICTAI 2022.
- A. N. Amalou, E. Fromont and I. Puaut. CAWET: Context-Aware Worst-Case Execution Time Estimation Using Transformers. ECRTS 2023.
- A. N. Amalou, E. Fromont and I. Puaut. Fast and Accurate Context-Aware Basic Block Timing Prediction using Transformers. CC 2024.
- H. Reymond, A. N. Amalou and I. Puaut. WORTEX: Worst-Case Execution Time and Energy Estimation in Low-Power Microprocessors using Explainable ML. WCET 2024.



- A. N. Amalou and I. Puaut. A dataset of synthetically generated code blocks for the learning of WCET on Cortex A53 [Dataset]. Zenodo.
- A. N. Amalou, E. Fromont and I. Puaut. Training dataset for transformers consisting of basic blocks and their execution times along with the execution context of these blocks, for various Cortex processors M7, M4, A53, and A72 [Dataset]. Zenodo.
- H. Reymond, H. Chabot, A. N. Amalou and I. Puaut. MSP430FR5969 Basic Block Worst-Case Energy Consumption (WCEC) and Worst-Case Execution Time (WCET) dataset [Dataset]. Zenodo.



[22] Hardy, D., Rouxel, B., & Puaut, I. (2017). The heptane static worst-case execution time estimation tool. In 17th International Workshop on Worst-Case Execution Time Analysis







#### Pessimism and hardware complexity






#### Never under-estimates WCETs (Cortex-M7)




#### Experimental setup: competitors

#### Multilayer perceptron (MLP) based on the work of WE-HML [8]



#### LSTM based: ITHEMAL [20] and CATREEN [9]




[8] AMALOU A. N., PUAUT I. and MULLER G. "WE-HML: Hybrid WCET Estimation using Machine Learning for Architectures With Caches." The 27th International Conference on Embedded and Real-Time Computing Systems and Applications. IEEE, 2021.

[20] MENDIS, C., et al. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks. International Conference on Learning Representations, 2018.

[9] AMALOU A. N., FROMONT E., and PUAUT I. "CATREEN: Context-Aware Code Timing Estimation with Stacked Recurrent Networks." The 34th IEEE International Conference on Tools with Artificial Intelligence. IEEE, 2022.

#### ARM targets used in experimental evaluation


| Core                            | Cortex-M4    | Cortex-M7                   | Cortex-A53                  |
|---------------------------------|--------------|-----------------------------|-----------------------------|
| Board                           | STM32F407    | STM32H743                   | Raspberry Pi 3              |
| Pipeline type (#stages)         | In-order (3) | In-order<br>superscalar (6) | In-order<br>superscalar (8) |
| Cache memory                    | N/A          | L1                          | L2                          |
| Branch predictor                | N/A          | Yes                         | Yes                         |
| Measurement solution            | JTAG         | JTAG                        | Instrumentation             |
| Microarchitecture<br>complexity | Low          | Medium                      | High                        |



#### ORXESTRA (ACET) on M4, M7, A53. Evaluation metric: MAPE

| Target     | Metric | MLP [1] | ITHEMAL [9] | CATREEN [2] | ORXESTRA [3] |
|------------|--------|---------|-------------|-------------|--------------|
| Cortex-M4  | MAPE   | 26.4%   | 14.4%       | 8.8%        | 7.8%         |
| Cortex-M7  | MAPE   | 22.7%   | 17.6%       | 13.3%       | 9.6%         |
| Cortex-A53 | MAPE   | 38.4%   | 10.1%       | 8.5%        | 5.2%         |

- ORXESTRA outperforms all models on all targets
- Context-aware models are better than context-agnostic ones



#### **CAWET: Context extraction**





Solution:

- Divide to conquer
- Local exploration in SESE regions
- (more details if asked for)



#### Comparison of WCET predictions for CAWET [12], WE-HML [8] and a neural network baseline on TacleBench programs

| Predictor                             | Cortex-M4<br>MRPE | Cortex-M7<br>MRPE | Cortex-A53<br>MRPE |
|---------------------------------------|-------------------|-------------------|--------------------|
| WE-HML [1]<br>(Multilayer perceptron) | -                 | -                 | 494.2%             |
| Multilayer perceptron                 | 43.8%             | 132.7%            | 85.7%              |
| CAWET [3]                             | 23.8%             | 102.2%            | 62.4%              |

- CAWET is less pessimistic than WE-HML

