#### **ROC: A Rank-switching, Open-row DRAM Controller for Time-predictable Systems**

Yogen Krishnapillai, Zheng Pei Wu, **Rodolfo Pellizzoni**



#### **Multi-Requestor Systems**




- Problem: DRAM latency is variable and changes depending on the state of the DRAM
- Existing approaches can bound the interference but they assume the latency for DRAM access is constant

# **DRAM Latency - Solutions**

- Solution 1: use (complex) analysis to determine upper latency bound
  - Ex: "Bounding Memory Interference Delay in COTS-based Multi-Core Systems", RTAS'14
  - Problem: COTS DRAM controllers optimized for average case latency; worst case bound can be very pessimistic
- Solution 2: predictable DRAM controller
  - Various solutions available
  - Typically simplifies analysis by making latency constant
  - Problem: without architectural optimizations, latency can be high for modern DRAM devices (more on this later)

# Our Solution

- Key idea: design new architectural optimizations targeted at reducing worst case, not average case latency
- In this paper: DRAM rank-switching with open-row policy
- ROC: Rank-switching, Open-row Controller
- We discuss:
  - Design
  - Latency Analysis
  - Implementation
  - Results



## Outline

- 1. Background & Related Work
- 2. Rank-Switching Mechanism
- 3. Worst Case Latency Analysis
- 4. Memory Controller Model
- 5. Results & Conclusion
























# Row Policy

- Close Row Policy:
  - Used by most predictable memory controllers
  - After each access, the row buffer is automatically precharged
  - Constant request latency
  - Cannot take advantage of locality (row hits)
- Open Row Policy:
  - Used in our approach
  - Keep the row open to exploit locality
  - Different latency for open requests (row hits) and close requests (row misses)
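As a rough illustration of the trade-off above, the sketch below compares the command-level latency of a single, uncontended read under the two policies. The timing values are illustrative assumptions (roughly a DDR3-1333 9-9-9 device), the function names are hypothetical, and interference from other requestors is ignored.

```python
# Illustrative timing parameters in clock cycles (assumed values, roughly a
# DDR3-1333 9-9-9 device); they are not taken from the slides.
T_RP = 9    # PRE -> ACT (precharge time)
T_RCD = 9   # ACT -> CAS (row activation time)
T_CL = 9    # read CAS -> first data on the bus
T_BUS = 4   # data transfer time of one burst of 8 (double data rate)


def close_row_latency():
    """Close-row policy: the bank is already precharged, so every read
    pays ACT + CAS + data transfer, regardless of locality."""
    return T_RCD + T_CL + T_BUS


def open_row_latency(row_hit):
    """Open-row policy: a hit only needs CAS; a miss must first precharge
    the old row and then activate the new one."""
    if row_hit:
        return T_CL + T_BUS
    return T_RP + T_RCD + T_CL + T_BUS


def average_open_row_latency(hit_ratio):
    """Expected open-row latency for a given fraction of row hits."""
    return (hit_ratio * open_row_latency(True)
            + (1.0 - hit_ratio) * open_row_latency(False))
```

With these assumed values, the open-row policy only pays off on average when at least half of the requests are row hits, which is why the analysis has to distinguish between the two request types.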

# Interleaving Banks

- Accessing data in multiple banks
- This is good for systems with a small DRAM data bus width (e.g. 16 bits)
- Larger data buses can transfer the same amount of data without interleaving so many banks
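The effect of the bus width can be checked with simple arithmetic. The sketch below assumes a burst length of 8 (as in DDR3) and a 64-byte request, e.g. one cache line; the request size and the function names are assumptions made for illustration.

```python
BURST_LENGTH = 8  # data beats per burst (DDR3 burst length)


def bytes_per_burst(bus_width_bits):
    """Bytes moved by one burst on a data bus of the given width."""
    return bus_width_bits // 8 * BURST_LENGTH


def bursts_per_request(request_bytes, bus_width_bits):
    """Number of bursts (and hence interleaved banks) needed per request."""
    per_burst = bytes_per_burst(bus_width_bits)
    return -(-request_bytes // per_burst)  # ceiling division


for width in (16, 32, 64):
    print(width, "bits:", bursts_per_request(64, width), "burst(s)")
```

Under these assumptions a 16-bit bus needs four bursts per request, which interleaving can spread over four banks, while a 64-bit bus needs only one.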



# Private Banks

- Used in our approach
- Partition banks among requestors. How:
  - In hardware, if the memory controller supports it
  - By the compiler
  - In the OS, using virtual memory
- No row conflicts
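A minimal sketch of the OS-based option: if the bank index can be read from the physical address, the allocator can hand a requestor only those pages that fall in its private bank. The position of the bank bits (`BANK_SHIFT`) and all names are hypothetical; real address mappings depend on the memory controller.

```python
NUM_BANKS = 8     # banks per rank in a DDR3 device
BANK_SHIFT = 13   # hypothetical position of the bank bits in the address


def bank_of(phys_addr):
    """Bank index selected by a physical address (assumed mapping)."""
    return (phys_addr >> BANK_SHIFT) & (NUM_BANKS - 1)


def partition_banks(requestors):
    """Give each requestor one private bank."""
    if len(requestors) > NUM_BANKS:
        raise ValueError("more requestors than banks")
    return {req: bank for bank, req in enumerate(requestors)}


def page_allowed(phys_addr, requestor, assignment):
    """True if the page at phys_addr lies in the requestor's private bank."""
    return bank_of(phys_addr) == assignment[requestor]
```

Because no two requestors share a bank, one requestor can never close a row that another requestor has open.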

### **Related Work**

- In RTSS'13 [5], we presented the first predictable controller with:
  - Open row policy
  - Private banks
- Improved latency, but...
  - Problem: DRAM is very inefficient when switching between write and read

# Write to Read Switching

- Transactions of the same type can be pipelined...
- ... but a read after a write cannot.

Huge latency penalty! We need a solution.
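The size of the penalty can be estimated from the CAS-to-CAS timing constraints within one rank. The sketch below uses assumed cycle counts (roughly DDR3-1333) and only considers constraints between consecutive commands; the names and values are illustrative, not taken from the slides.

```python
# Assumed timing constraints in clock cycles (roughly DDR3-1333).
T_BUS = 4   # data transfer time of one burst of 8
T_CCD = 4   # minimum spacing between two CAS commands of the same type
T_WL = 7    # write CAS -> first write data
T_RL = 9    # read CAS -> first read data
T_WTR = 5   # end of write data -> next read CAS (same rank)
T_RTW = T_RL + T_BUS + 2 - T_WL  # read CAS -> next write CAS


def cas_gap(prev, nxt):
    """Minimum cycles between two consecutive CAS commands in one rank."""
    if prev == nxt:
        return T_CCD                  # same type: fully pipelined
    if prev == "W":
        return T_WL + T_BUS + T_WTR   # write -> read turnaround
    return T_RTW                      # read -> write turnaround


def issue_times(sequence):
    """Earliest CAS issue time of each command in a string of 'R'/'W'."""
    if not sequence:
        return []
    times = [0]
    for prev, nxt in zip(sequence, sequence[1:]):
        times.append(times[-1] + cas_gap(prev, nxt))
    return times


print(issue_times("RRRR"))  # pipelined reads
print(issue_times("WRWR"))  # alternating writes and reads
```

With these assumed values, four pipelined reads issue within 12 cycles, while the alternating sequence needs 40 cycles to issue its last command.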



# Outline

- 1. Background & Related Work
- 2. Rank-Switching Mechanism
- 3. Worst Case Latency Analysis
- 4. Memory Controller Model
- 5. Results & Conclusion

## **Multi-Rank Device**









#### **Example: Write-Read-Write-Read**
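The benefit of alternating ranks on a write-read-write-read sequence can be sketched with a simplified timing model. Within a rank, the usual turnaround constraints apply; across ranks, the model assumes that only the shared data bus matters, with a rank-to-rank switching gap `T_RTR`. All cycle counts are assumptions (roughly DDR3-1333), and this is not the exact rank-switching rule used by ROC.

```python
# Assumed timing constraints in clock cycles (roughly DDR3-1333).
T_BUS, T_CCD, T_WL, T_RL, T_WTR = 4, 4, 7, 9, 5
T_RTR = 2  # assumed data bus gap when switching between ranks
DATA_LATENCY = {"W": T_WL, "R": T_RL}


def min_gap(prev, nxt):
    """Minimum cycles between two CAS commands given as (type, rank)."""
    (prev_type, prev_rank), (next_type, next_rank) = prev, nxt
    if prev_rank != next_rank:
        # Different ranks: data bursts must not overlap on the shared bus.
        gap = DATA_LATENCY[prev_type] + T_BUS + T_RTR - DATA_LATENCY[next_type]
        return max(1, gap)
    if prev_type == next_type:
        return T_CCD
    if prev_type == "W":
        return T_WL + T_BUS + T_WTR    # write -> read in the same rank
    return T_RL + T_BUS + 2 - T_WL     # read -> write in the same rank


def issue_times(commands):
    """Earliest issue time of each CAS, checked against all earlier ones."""
    times = []
    for i, cmd in enumerate(commands):
        earliest = 0
        for j in range(i):
            earliest = max(earliest, times[j] + min_gap(commands[j], cmd))
        times.append(earliest)
    return times


one_rank = [("W", 0), ("R", 0), ("W", 0), ("R", 0)]
two_ranks = [("W", 0), ("R", 1), ("W", 0), ("R", 1)]
print(issue_times(one_rank))
print(issue_times(two_ranks))
```

In this model, sending the writes to one rank and the reads to another lets the last command issue at cycle 16 instead of cycle 40, because the write-to-read turnaround inside a rank is never triggered.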



# Outline

- 1. Background & Related Work
- 2. Rank-Switching Mechanism
- 3. Worst Case Latency Analysis
- 4. Memory Controller Model
- 5. Results & Conclusion

### Worst Case Analysis





#### Single Request Latency





Our arbitration mechanism similarly distinguishes between PRE/ACT and CAS (R/W) commands

# Outline

- 1. Background & Related Work
- 2. Rank-Switching Mechanism
- 3. Worst Case Latency Analysis
- 4. Memory Controller Model
- 5. Results & Conclusion

# **Back End Model**

- The Front End adds a constant delay, so we focus on the Back End
- Three levels of arbitration



- L1: CAS (R/W) commands have higher priority than PRE/ACT commands, since CAS commands contend for the data bus



- L2 alternates among ranks
- L3 alternates among requestors within a rank
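The three arbitration levels can be sketched as a single selection function. This is a simplified model, not the RTL implementation: it assumes every pending command is already allowed to issue (timing constraints are ignored), that each requestor exposes only its head-of-queue command, and that both command classes share one set of round-robin pointers. The class and method names are hypothetical.

```python
class Arbiter:
    """Three-level arbitration: command type, then rank, then requestor."""

    def __init__(self, requestors_per_rank):
        # requestors_per_rank[r] = number of requestors mapped to rank r
        self.sizes = list(requestors_per_rank)
        self.next_rank = 0
        self.next_req = [0] * len(self.sizes)

    def pick(self, pending):
        """pending maps (rank, requestor) -> "CAS" or "PREACT".
        Returns the (rank, requestor) to serve, or None if nothing is pending."""
        num_ranks = len(self.sizes)
        for kind in ("CAS", "PREACT"):                 # L1: CAS first
            for i in range(num_ranks):                 # L2: rotate over ranks
                rank = (self.next_rank + i) % num_ranks
                for k in range(self.sizes[rank]):      # L3: rotate in rank
                    req = (self.next_req[rank] + k) % self.sizes[rank]
                    if pending.get((rank, req)) == kind:
                        # Advance both round-robin pointers past the winner.
                        self.next_rank = (rank + 1) % num_ranks
                        self.next_req[rank] = (req + 1) % self.sizes[rank]
                        return rank, req
        return None
```

For example, with two requestors in rank 0 and one in rank 1, a pending PRE/ACT command is served only after the pending CAS commands of both ranks have been issued.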



- Let:
  - $R$: number of ranks
  - $M_r$: number of requestors for rank $r$
- Then, given the alternation in L2/L3, the latency of each command is a function of $R \cdot M_r$
  - Isolation property: the latency of a requestor does not depend on the number of requestors or the scheduling policy used in other ranks
  - We can dedicate some ranks to hard real-time requestors, and others to soft real-time requestors
  - Optimize hard requestors for latency, soft requestors for bandwidth
- Full details for hard requestors in the paper...
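The dependence on $R \cdot M_r$ and the isolation property can be illustrated with a small simulation. It makes strong simplifying assumptions: every requestor always has a command ready, each slot issues exactly one command, and ranks and requestors are served in strict round-robin order. The actual analysis in the paper accounts for DRAM timing constraints; the function names here are hypothetical.

```python
def grant_order(requestors_per_rank, num_slots):
    """Sequence of (rank, requestor) grants under two-level round robin."""
    num_ranks = len(requestors_per_rank)
    next_req = [0] * num_ranks
    order = []
    for slot in range(num_slots):
        rank = slot % num_ranks            # L2: alternate among ranks
        req = next_req[rank]               # L3: alternate within the rank
        next_req[rank] = (req + 1) % requestors_per_rank[rank]
        order.append((rank, req))
    return order


def worst_gap(requestors_per_rank, rank, req, num_slots=1000):
    """Largest number of slots between two grants to the same requestor."""
    order = grant_order(requestors_per_rank, num_slots)
    slots = [i for i, grant in enumerate(order) if grant == (rank, req)]
    return max(b - a for a, b in zip(slots, slots[1:]))


# Rank 0 has 2 requestors in both setups; only rank 1 changes.
print(worst_gap([2, 2], rank=0, req=0))
print(worst_gap([2, 6], rank=0, req=0))
print(worst_gap([2, 6], rank=1, req=0))
```

In this model a requestor in rank 0 is served every $R \cdot M_0 = 4$ slots whether rank 1 holds two or six requestors, while a requestor in the six-requestor rank waits $R \cdot M_1 = 12$ slots.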

# CAS Rank Switching Rule



# Outline

- 1. Background & Related Work
- 2. Rank-Switching Mechanism
- 3. Worst Case Latency Analysis
- 4. Memory Controller Model
- 5. Results & Conclusion

- Comparison against Analyzable Memory Controller (AMC) [1]
  - Fair arbitration (Round Robin) similar to our approach
  - Focus on WCET guarantees for hard real-time tasks
- Synthetic Benchmarks
  - Used to show how worst case latency bound varies as parameters are changed
- CHStone Benchmarks
  - Memory traces are obtained from the gem5 simulator and used as inputs to both the analysis and the simulators
  - Core under analysis is in-order
  - Interfering requestors are out-of-order running *lbm*

- DDR3-1333H memory device
  - 64-bit and 32-bit data bus widths
- Simulations
  - Python simulators for our RTSS'13 work [5] and AMC [1]
- ROC Implementation
  - Three-stage pipelined implementation in Verilog RTL
  - Synthesizes to a Xilinx FPGA at 340 MHz (original soft memory controller: 400 MHz)
  - ASIC implementation could likely be significantly faster...
- Code available at http://ece.uwaterloo.ca/~rpellizz/roc.php

- Synthetic Benchmarks, 20% write
  - Figure: average worst case latency (ns) versus row hit %; legend entries include Analysis [5] and ROC-2Rank
  - AMC can parallelize transactions over 2 banks: no advantage from private bank parallelism
- Synthetic: 8 Requestors-32bits
  - ROC-4 has between 5% and 35% lower WCET than [5]



# Conclusions

- Architectural optimizations designed for general purpose systems do not necessarily work for time-predictable systems
- We need to design the architecture around the concept of guaranteed worst case latency
- We introduced a new DRAM optimization targeted at reducing worst case latency: rank-switching
- The implemented ROC memory controller significantly reduces latency for hard requestors and guarantees strong isolation between hard/soft requestors

# Future Work

- Implementation:
  - Support for shared data
  - Soft requestor optimizations
  - Improved RTL code
- Extended comparison with other real-time controllers

### References

[1] M. Paolieri, E. Quinones, F. Cazorla, and M. Valero, "An Analyzable Memory Controller for Hard Real-Time CMPs," *Embedded Systems Letters, IEEE*, …

[2] B. Akes…, "…ator: a predictable SDRAM me…," …, pp. 251–256.

[3] S. Goos…, "…rvative Open-page Policy for N…," in *DATE*, 2013.

[4] J. Reine…, "Pret dram controller: … temporal isolation," in *CODES+ISSS*, 2011, pp. 99–108.

[5] Z. P. Wu, Y. Krish, and R. Pellizzoni, "Worst Case Analysis of DRAM Latency in Multi-Requestor Systems," in *RTSS*, 2013.

Thank you! Questions?