Intel® High Level Synthesis Compiler Standard Edition: Best Practices Guide

ID 683259
Date 12/18/2019
Public

5.4. Example: Specifying Bank-Selection Bits for Local Memory Addresses

You can tell the Intel® HLS Compiler Standard Edition which bits of a local memory address select a memory bank and which bits select a word in that bank. Specify the bank-selection bits with the hls_bankbits(b0, b1, ..., bn) attribute.

The (b0, b1, ..., bn) arguments are the local memory address bit positions that the Intel® HLS Compiler should use as the bank-selection bits. Specifying the hls_bankbits(b0, b1, ..., bn) attribute implies that the number of banks equals 2^(number of bank bits).

Table 8.  Example of Local Memory Addresses Showing Word and Bank Selection Bits

This table shows an example of how a local memory might be addressed. The memory attribute is set as hls_bankbits(3,4). In each address, the two leading bits (bits 3 and 4) are the bank-selection bits and the three trailing bits (bits 0-2) are the word-selection bits.

         Bank 0    Bank 1    Bank 2    Bank 3
Word 0   00 000    01 000    10 000    11 000
Word 1   00 001    01 001    10 001    11 001
Word 2   00 010    01 010    10 010    11 010
Word 3   00 011    01 011    10 011    11 011
Word 4   00 100    01 100    10 100    11 100
Word 5   00 101    01 101    10 101    11 101
Word 6   00 110    01 110    10 110    11 110
Word 7   00 111    01 111    10 111    11 111

Restriction: Currently, the hls_bankbits(b0, b1, ..., bn) attribute supports only consecutive bank bits.
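
The address split shown in Table 8 can be sketched in plain C++ as follows. This is only an illustrative model, not compiler output and not part of the Intel HLS headers: the BankedAddress type and the num_banks and decode functions are hypothetical names, and the sketch assumes a consecutive bank-bit range lo..hi, with the word index formed by packing together the remaining address bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not an Intel HLS type.
struct BankedAddress {
  std::uint32_t bank;  // index of the selected memory bank
  std::uint32_t word;  // index of the word inside that bank
};

// Number of banks implied by a consecutive bank-bit range lo..hi:
// 2 raised to the number of bank bits.
std::uint32_t num_banks(unsigned lo, unsigned hi) {
  assert(lo <= hi && hi < 31);
  return 1u << (hi - lo + 1);
}

// Splits a word address into a bank index and a word-in-bank index.
// Bits lo..hi select the bank; all other bits, packed together,
// select the word (an assumption of this sketch).
BankedAddress decode(std::uint32_t addr, unsigned lo, unsigned hi) {
  assert(lo <= hi && hi < 31);
  const unsigned width = hi - lo + 1;
  const std::uint32_t bank = (addr >> lo) & ((1u << width) - 1u);
  const std::uint32_t below = addr & ((1u << lo) - 1u);  // bits under the bank field
  const std::uint32_t above = addr >> (hi + 1);          // bits over the bank field
  return {bank, (above << lo) | below};
}
```

With this model, decode(0x13, 3, 4) (binary 10 011) yields bank 2 and word 3, which matches the Bank 2, Word 3 cell of the table, and num_banks(3, 4) yields 4.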

Example of Implementing the hls_bankbits Attribute

Consider the following example component code (see footnote 1):

component int bank_arb_consecutive_multidim (int raddr,
                                             int waddr,
                                             int wdata,
                                             int upperdim) {

  int a[2][4][128] hls_numbanks(1);

  #pragma unroll
  for (int i = 0; i < 4; i++) {
    a[upperdim][i][(waddr & 0x7f)] = wdata + i;
  }

  int rdata = 0;

  #pragma unroll
  for (int i = 0; i < 4; i++) {
    rdata += a[upperdim][i][(raddr & 0x7f)];
  }

  return rdata;
}

As illustrated in the following figure, this code example generates multiple load and store instructions, and therefore multiple load/store units (LSUs) in the hardware. If the memory system is not split into multiple banks, there are fewer ports than memory access instructions, leading to arbitrated accesses. This arbitration results in a high loop initiation interval (II) value. Avoid arbitration blocks whenever possible because they consume a lot of FPGA area and can hurt the performance of your component.

Figure 18. Accesses to Local Memory a for Component bank_arb_consecutive_multidim

By default, the Intel® HLS Compiler splits the memory into banks if it determines that the split is beneficial to the performance of your component. When the compiler generates a memory system, it uses the lower-order memory address bits to access the different memory banks. This behavior means that if you define your component memory structure so that the lowest order addresses are accessed in parallel, the compiler automatically infers the bank-selection bits for you.

Accessing the lowest-order addresses in parallel prevents stallable arbitration on the memory. In this example, preventing stallable arbitration reduces the II value to 1. In practice, this might mean that you store a matrix in column-major format instead of row-major format if you intend to access multiple matrix rows concurrently.

Swapping the 128-element and 4-element dimensions, as in the following code example, results in no stallable memory arbitration.

component int bank_arb_consecutive_multidim (int raddr,
                                             int waddr,
                                             int wdata,
                                             int upperdim) {
  int a[2][128][4];
                
  #pragma unroll
  for (int i = 0; i < 4; i++) {
    a[upperdim][(waddr & 0x7f)][i] = wdata + i;
  }
                    
  int rdata = 0;
                    
  #pragma unroll
  for (int i = 0; i < 4; i++) {
    rdata += a[upperdim][(raddr & 0x7f)][i];
  }
               
  return rdata; 
}

The dimension that is accessed in parallel is moved to be the lowest-order dimension in the memory array. The load has a width of 128 bits, which is the same as four 32-bit loads.
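
To see why the swap helps, it can be useful to compare where the unrolled index i lands in the word address for each layout. The sketch below is a plain C++ model that assumes row-major word addressing, as C++ uses for multidimensional arrays; the function names flat_address, addr_original, and addr_swapped are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Row-major word address of a[d0][d1][d2] for an array declared as
// int a[2][n1][n2]. Assumption: one word per int element.
std::uint32_t flat_address(std::uint32_t d0, std::uint32_t d1, std::uint32_t d2,
                           std::uint32_t n1, std::uint32_t n2) {
  assert(d1 < n1 && d2 < n2);
  return (d0 * n1 + d1) * n2 + d2;
}

// Address of the i-th parallel access in the original layout, int a[2][4][128].
std::uint32_t addr_original(std::uint32_t upperdim, std::uint32_t i,
                            std::uint32_t waddr) {
  return flat_address(upperdim, i, waddr & 0x7f, 4, 128);
}

// Address of the i-th parallel access in the swapped layout, int a[2][128][4].
std::uint32_t addr_swapped(std::uint32_t upperdim, std::uint32_t i,
                           std::uint32_t waddr) {
  return flat_address(upperdim, waddr & 0x7f, i, 128, 4);
}
```

Under these assumptions, addr_swapped(1, 2, 100) is 914, whose two lowest bits equal i = 2, so the four parallel accesses differ only in the lowest-order address bits. In contrast, addr_original(1, 2, 100) is 868, where i appears in bits 7 and 8 instead.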

If you cannot change your memory structure, you can use the hls_bankbits attribute to explicitly control how load and store instructions access local memory. As shown in the following modified code example and figure, when you choose constant bank-select bits for each access to the local memory a, each pair of load and store instructions needs to connect to only one memory bank. Assuming row-major word addressing of a[2][4][128], the 128-element dimension occupies address bits 0-6, so bits 7 and 8 hold the index of the 4-element dimension, which is a constant in each unrolled access. In this example, there are four 32-bit loads, which results in a memory system similar to the earlier example.

Figure 19. Accesses to Local Memory a for Component bank_arb_consecutive_multidim with the hls_bankbits Attribute

component int bank_arb_consecutive_multidim (int raddr,
                                             int waddr,
                                             int wdata,
                                             int upperdim) {
  int a[2][4][128] hls_bankbits(8,7);
                
  #pragma unroll
  for (int i = 0; i < 4; i++) {
    a[upperdim][i][(waddr & 0x7f)] = wdata + i;
  }
                
  int rdata = 0;
                
  #pragma unroll
  for (int i = 0; i < 4; i++) {
    rdata += a[upperdim][i][(raddr & 0x7f)];
  }
                
  return rdata;
}

When you choose which word-address bits to pass to the hls_bankbits attribute, ensure that the resulting bank-select bits are constant for each access to local memory. In the following example, the local memory access pattern does not guarantee that the chosen bank-select bits are constant for each access, because bits 4 and 5 depend on the runtime values of waddr and raddr. As a result, each pair of load and store instructions must connect to all the local memory banks, leading to stallable accesses.

Figure 20. Stallable Accesses to Local Memory a for Component bank_arb_consecutive_multidim with the hls_bankbits Attribute

component int bank_arb_consecutive_multidim (int raddr,
                                             int waddr,
                                             int wdata,
                                             int upperdim){

  int a[2][4][128] hls_bankbits(5,4);

  #pragma unroll
  for (int i = 0; i < 4; i++) {
    a[upperdim][i][(waddr & 0x7f)] = wdata + i;
  }

  int rdata = 0;

  #pragma unroll
  for (int i = 0; i < 4; i++) {
    rdata += a[upperdim][i][(raddr & 0x7f)];
  }

  return rdata;
}

In this case, the II is estimated to be approximately 64.
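
The difference between hls_bankbits(8,7) and hls_bankbits(5,4) can be checked with a small software model that counts how many banks a single unrolled access can reach. This is a sketch under stated assumptions, not a description of what the compiler does internally: it assumes row-major word addressing of int a[2][4][128], treats upperdim as ranging over 0..1, and uses the hypothetical function names word_address and banks_reached.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Row-major word address of a[upperdim][i][addr & 0x7f] for int a[2][4][128].
std::uint32_t word_address(std::uint32_t upperdim, std::uint32_t i,
                           std::uint32_t addr) {
  return (upperdim * 4u + i) * 128u + (addr & 0x7fu);
}

// Number of distinct banks that the i-th unrolled access can reach when
// address bits lo..hi select the bank, taken over every runtime value of
// upperdim (0..1) and addr (0..127).
std::size_t banks_reached(std::uint32_t i, unsigned lo, unsigned hi) {
  assert(i < 4 && lo <= hi && hi < 31);
  const std::uint32_t mask = (1u << (hi - lo + 1)) - 1u;
  std::set<std::uint32_t> banks;
  for (std::uint32_t upperdim = 0; upperdim < 2; ++upperdim) {
    for (std::uint32_t addr = 0; addr < 128; ++addr) {
      banks.insert((word_address(upperdim, i, addr) >> lo) & mask);
    }
  }
  return banks.size();
}
```

In this model, banks_reached(i, 7, 8) returns 1 for every i from 0 to 3, so each access needs a connection to only one bank. banks_reached(i, 4, 5) returns 4, which corresponds to every access needing a connection to all four banks.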

Footnote 1: For this example, the initial component was generated with the hls_numbanks attribute set to 1 (hls_numbanks(1)) to prevent the compiler from automatically splitting the memory into banks.