Usage Analysis NN on ZCU102 (1) --- Memory Usage

In Design NN on ZCU102 (3), we introduced the variables used in the design of NNs. In this post, we analyze the hardware usage in terms of these variables.
We target two kinds of hardware usage: (1) memory usage and (2) DSP usage.
For DSP usage, we assume 16-bit fixed-point for the IFM and intermediate data, so one multiplication uses one DSP.
Before going deeper, we first review the related variables. The last column, “Value”, gives the corresponding value for layer CONV3 in AlexNet.

| Notation | Definition | Value |
|---|---|---|
| ifm_ch_mem | The number of IFM channels that the on-chip buffers can store | 8 |
| ifm_ch_proc | The number of IFM channels processed in each iteration | 4 |
| ifm_len | The number of pixels in each channel of the IFM | 13*13 |
| ifm_row, ifm_col | ifm_len = ifm_row × ifm_col | 13 |
| ofm_ch_mem | The number of OFM channels that the on-chip buffers can store | 4 |
| ofm_ch_proc | The number of OFM channels processed in each iteration | 4 |
| ofm_len | The number of pixels in each channel of the OFM | 13*13 |
| ofm_row, ofm_col | ofm_len = ofm_row × ofm_col | 13 |
| kernel_size | The size of a kernel | 9 |
| kernel_row | The number of rows in one kernel | 3 |
| win_pad_size | The padding for convolution | 1 |
| win_stride | The stride in performing convolution | 1 |

###Memory Usage in One Convolutional Layer

First, let’s see the memory usage of a convolutional layer.
It is composed of several parts, namely buffers for (1) input feature maps, denoted as “IFM_BUF”, (2) output feature maps, denoted as “OFM_BUF”, (3) kernels, denoted as “weight_BUF”, (4) biases, denoted as “BIAS_BUF”, (5) sliding windows, denoted as “window_BUF”, (6) line buffers, denoted as “Line_BUF”, and (7) the last-line buffer, denoted as “temp_BUF”. For buffer types (5)-(7), please refer to Design NN on ZCU102 (2).

#####IFM_BUF
IFM_BUF=ifm_ch_mem×ifm_len

#####OFM_BUF
OFM_BUF=ofm_ch_mem×ofm_len

#####weight_BUF
weight_BUF=ofm_ch_mem×ifm_ch_mem×kernel_size

#####BIAS_BUF
BIAS_BUF=ofm_ch_mem

#####window_BUF
window_BUF=ifm_ch_proc×kernel_size

#####Line_BUF
Note that, since some layers require padding in convolution, the line buffers must account for it. For a 3×3 kernel, there are 2 line buffers. The size of each buffer is:
Line_BUF=ifm_ch_proc×(ifm_col+win_pad_size+win_pad_size)

#####temp_BUF
For each channel, there is a temp buffer, which stores the last element of each row. Hence, its size is:
temp_BUF=ifm_ch_proc×kernel_row

####Summary
We also need to consider the bit width of each kind of buffer. For example, an implementation can use 16-bit fixed-point for the IFM, OFM, window, line, and temp buffers, while the weights and biases can be 8-bit fixed-point to reduce buffer space. Here, for ease of analysis, we assume all data types are 16-bit fixed-point.

Using the values in the table above, the total memory usage of the IP core is 3164 elements, which at 16 bits per element equals 6328 bytes.
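
The per-buffer formulas in this section are easy to check numerically. The following is a minimal C++ sketch, not part of the original design: the struct `ConvParams`, the helper functions, and the constant `kConv3` are hypothetical names introduced for illustration, and the values are the CONV3 column of the notation table. Sizes are in elements; under the 16-bit assumption each element takes 2 bytes.

```cpp
// Hypothetical helper code for checking the buffer-size formulas; the names
// below are illustrative and do not appear in the original HLS design.
struct ConvParams {
    int ifm_ch_mem, ifm_ch_proc, ifm_len, ifm_col;
    int ofm_ch_mem, ofm_len;
    int kernel_size, kernel_row, win_pad_size;
};

// Sizes in elements, one function per formula in this section.
constexpr int ifm_buf(const ConvParams& p)    { return p.ifm_ch_mem * p.ifm_len; }
constexpr int ofm_buf(const ConvParams& p)    { return p.ofm_ch_mem * p.ofm_len; }
constexpr int weight_buf(const ConvParams& p) { return p.ofm_ch_mem * p.ifm_ch_mem * p.kernel_size; }
constexpr int bias_buf(const ConvParams& p)   { return p.ofm_ch_mem; }
constexpr int window_buf(const ConvParams& p) { return p.ifm_ch_proc * p.kernel_size; }
// Size of ONE line buffer; a 3x3 kernel uses two of them.
constexpr int line_buf(const ConvParams& p)   { return p.ifm_ch_proc * (p.ifm_col + 2 * p.win_pad_size); }
constexpr int temp_buf(const ConvParams& p)   { return p.ifm_ch_proc * p.kernel_row; }

// Bytes occupied by `elements` values, assuming 16-bit fixed-point everywhere.
constexpr int to_bytes(int elements) { return elements * 2; }

// CONV3 values from the notation table, in the field order of ConvParams.
constexpr ConvParams kConv3 = {8, 4, 13 * 13, 13, 4, 13 * 13, 9, 3, 1};
```

For CONV3 the seven functions give 1352, 676, 288, 4, 36, 60 and 12 elements respectively. One copy of each buffer plus the second line buffer sums to 2488 elements, whereas the total quoted above is 3164. The difference of 676 equals one more OFM_BUF, so the quoted figure may include a second output buffer; the post does not itemise the total, so treat that reading as an assumption.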

###BRAM Usage in One Convolutional Layer
The above section analyzes the memory usage of one layer. However, these results cannot be directly applied to calculate the usage of on-chip memory (i.e., BRAM or LUTRAM on FPGA).
Before explaining why, we first present a property of BRAM on an FPGA.
Unlike DDR-based main memory, BRAMs are distributed across the FPGA, and the size of each BRAM is fixed, e.g., 18 Kbit.

Now, let’s see the main reason that we cannot directly use the above calculations to obtain the usage of BRAM on an FPGA. The reason is:

  • We need high parallelism to accelerate the application, which requires us to retrieve/store data in parallel. In consequence, it is hard to fully occupy each BRAM.

To calculate the BRAM usage, we need to understand how BRAM is allocated on the FPGA. This involves “array partition”. Let’s look at an example of array partition.

```cpp
static FPGA_DATA IFM[ifm_ch_mem][ifm_len];
#pragma HLS RESOURCE variable=IFM core=RAM_1P_BRAM
#pragma HLS ARRAY_PARTITION variable=IFM cyclic factor=ifm_ch_proc dim=1
```

In the code above, we use single-port BRAM to implement IFM, and the IFM array is partitioned cyclically along its first dimension with a factor of ifm_ch_proc. (For details, please refer to HLS Pragma.)

The factor ifm_ch_proc means that ifm_ch_proc BRAMs are accessed simultaneously. In other words, storing the IFM requires “at least” ifm_ch_proc different BRAMs. I say “at least” because ifm_len may be so large that one BRAM cannot hold a whole partition.
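
To make the cyclic mapping concrete, the sketch below models which partition (bank) each channel index lands in. It assumes the usual interleaved meaning of `cyclic` partitioning, where element i of the partitioned dimension goes to bank i mod factor; the function names `cyclic_bank` and `cyclic_offset` are hypothetical and only serve the illustration.

```cpp
// Illustrative model of a cyclic ARRAY_PARTITION along one dimension.
// Assumption: element `index` is placed in bank (index % factor), at
// position (index / factor) inside that bank.
constexpr int cyclic_bank(int index, int factor)   { return index % factor; }
constexpr int cyclic_offset(int index, int factor) { return index / factor; }
```

With ifm_ch_mem = 8 and a factor of ifm_ch_proc = 4, channels 0 to 3 map to banks 0 to 3, and channels 4 to 7 map to the same four banks again at offset 1. Any four consecutive channels therefore sit in four different banks, so they can be read in the same cycle without two accesses hitting the same BRAM.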

With the knowledge of array partition, we can analyze the BRAM usage of each kind of buffer. Two metrics matter: (1) the number of words stored in each BRAM and (2) the number of BRAMs. We take IFM_BUF as the first example.

#####IFM_BUF

```cpp
static FPGA_DATA IFM[ifm_ch_mem][ifm_len];
#pragma HLS RESOURCE variable=IFM core=RAM_1P_BRAM
#pragma HLS ARRAY_PARTITION variable=IFM cyclic factor=ifm_ch_proc dim=1
```

We can obtain the following information from the code above.

  • The size of IFM is ifm_ch_mem×ifm_len.
  • The partition factor is ifm_ch_proc.
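  • Each partition stores $\frac{\text{ifm\_ch\_mem}\times\text{ifm\_len}}{\text{ifm\_ch\_proc}}$ elements. With 16-bit fixed-point data and 18K-bit BRAMs, the number of BRAMs required for each partition is $\left\lceil \frac{\frac{\text{ifm\_ch\_mem}\times\text{ifm\_len}}{\text{ifm\_ch\_proc}}\times 16}{18\text{K}} \right\rceil$, rounded up because a BRAM cannot be shared between partitions.

In summary, the number of BRAMs used by IFM is $\left\lceil \frac{\frac{\text{ifm\_ch\_mem}\times\text{ifm\_len}}{\text{ifm\_ch\_proc}}\times 16}{18\text{K}} \right\rceil \times \text{ifm\_ch\_proc}$.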
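
The count for one buffer can be sketched as a small helper. This is an illustration rather than the tool's actual allocation rule: `ceil_div` and `bram_count` are hypothetical names, and the default capacity of 18000 bits per BRAM is an assumption chosen to match the 18K figure used in this post.

```cpp
// Integer ceiling division for positive operands.
constexpr long ceil_div(long a, long b) { return (a + b - 1) / b; }

// Number of BRAMs for an array of `elements` values split into `partitions`
// banks: each bank needs ceil(bits_in_bank / bram_bits) BRAMs.
// bram_bits = 18000 is an assumed capacity, not a value reported by HLS.
constexpr long bram_count(long elements, long partitions,
                          long bits_per_element = 16, long bram_bits = 18000) {
    return ceil_div(ceil_div(elements, partitions) * bits_per_element, bram_bits)
           * partitions;
}
```

For the CONV3 values in the first table (ifm_ch_mem = 8, ifm_len = 13*13, ifm_ch_proc = 4), `bram_count(8 * 169, 4)` evaluates to 4: each partition holds 338 elements, which fit in a single BRAM.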

#####weight_BUF and OFM_BUF

```cpp
static FPGA_WEIGHTS WEIGHT[ofm_ch_mem][ifm_ch_mem][kernel_size];
#pragma HLS RESOURCE variable=WEIGHT core=RAM_1P_BRAM
#pragma HLS ARRAY_PARTITION variable=WEIGHT cyclic factor=ofm_ch_proc/2 dim=1
#pragma HLS ARRAY_PARTITION variable=WEIGHT cyclic factor=ifm_ch_proc/2 dim=2
#pragma HLS ARRAY_PARTITION variable=WEIGHT complete dim=3

static FPGA_DATA OFM[ofm_ch_mem][ofm_len];
#pragma HLS RESOURCE variable=OFM core=RAM_S2P_BRAM
#pragma HLS ARRAY_PARTITION variable=OFM cyclic factor=ofm_ch_proc dim=1
```
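
Applying the same analysis to every buffer gives the following table, where $P$ is the number of partitions.

| Buffers | # of Partitions | # of BRAMs |
|---|---|---|
| IFM | $P=\text{ifm\_ch\_proc}$ | $\left\lceil \frac{\frac{\text{ifm\_ch\_mem}\times\text{ifm\_len}}{P}\times 16}{18\text{K}} \right\rceil \times P$ |
| OFM | $P=\text{ofm\_ch\_proc}$ | $\left\lceil \frac{\frac{\text{ofm\_ch\_mem}\times\text{ofm\_len}}{P}\times 16}{18\text{K}} \right\rceil \times P$ |
| WEIGHTs | $P=\frac{\text{ofm\_ch\_proc}\times\text{ifm\_ch\_proc}\times\text{kernel\_size}}{4}$ | $\left\lceil \frac{\frac{\text{ofm\_ch\_mem}\times\text{ifm\_ch\_mem}\times\text{kernel\_size}}{P}\times 16}{18\text{K}} \right\rceil \times P$ |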

####Simplify and Example

To simplify the formulas above, we introduce shorter notations in the mapping table below. We use layer CONV3 as an example, and the table also lists the corresponding value of each variable.

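| Original Notation | New Notation | Value |
|---|---|---|
| ifm_ch_mem | $I_m$ | 64 |
| ifm_ch_proc | $I_p$ | 8 |
| ifm_len | $I_{len}$ | 13*13 |
| ofm_ch_mem | $O_m$ | 64 |
| ofm_ch_proc | $O_p$ | 8 |
| ofm_len | $O_{len}$ | 13*13 |
| kernel_size | $K$ | 9 |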

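Using these simplified notations, let’s rewrite the formulas. The constant 1125 = 18K/16 is the number of 16-bit words that fit in one 18K BRAM (taking 18K as 18000 bits), and each per-partition quotient is rounded up to a whole number of BRAMs.

| Buffers | # of Partitions | # of BRAMs |
|---|---|---|
| IFM | $I_p=8$ | $\left\lceil \frac{I_m\times I_{len}}{1125\times I_p} \right\rceil \times I_p = \lceil 1.20 \rceil \times 8 = 16$ |
| OFM | $O_p=8$ | $\left\lceil \frac{O_m\times O_{len}}{1125\times O_p} \right\rceil \times O_p = \lceil 1.20 \rceil \times 8 = 16$ |
| WEIGHTs | $\frac{O_p\times I_p\times K}{4}=144$ | $\left\lceil \frac{4\times O_m\times I_m\times K}{1125\times O_p\times I_p\times K} \right\rceil \times \frac{O_p\times I_p\times K}{4} = \lceil 0.2275 \rceil \times 144 = 144$ |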

Based on the table above, the number of BRAMs used in the design for IFM, OFM, and WEIGHTs is 16+16+144=176, which is exactly the number reported by HLS and by post-implementation.
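
The total can be reproduced with a few lines of C++. The sketch assumes 1125 sixteen-bit words per BRAM, as in the table, and uses hypothetical names (`brams`, `kIm`, and so on); the WEIGHT partition count follows the three ARRAY_PARTITION pragmas shown earlier.

```cpp
// Integer ceiling division for positive operands.
constexpr long ceil_div(long a, long b) { return (a + b - 1) / b; }

constexpr long kWordsPerBram = 1125;  // assumed: 18000 bits / 16 bits per word

// BRAMs for `elements` 16-bit words split evenly across `partitions` banks.
constexpr long brams(long elements, long partitions) {
    return ceil_div(elements, partitions * kWordsPerBram) * partitions;
}

// Values from the mapping table (layer CONV3).
constexpr long kIm = 64, kIp = 8, kIlen = 13 * 13;
constexpr long kOm = 64, kOp = 8, kOlen = 13 * 13;
constexpr long kK = 9;

// WEIGHT partitions: (Op/2) banks on dim 1, (Ip/2) on dim 2, K on dim 3.
constexpr long kWeightParts = (kOp / 2) * (kIp / 2) * kK;

constexpr long kIfmBrams    = brams(kIm * kIlen, kIp);
constexpr long kOfmBrams    = brams(kOm * kOlen, kOp);
constexpr long kWeightBrams = brams(kOm * kIm * kK, kWeightParts);
constexpr long kTotalBrams  = kIfmBrams + kOfmBrams + kWeightBrams;
```

Because everything is `constexpr`, the values can be checked at compile time, for example with `static_assert(kTotalBrams == 176, "")`.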

####Efficiency of using BRAMs

In the sections above, we saw how many BRAMs are required for a given array partition. Now, let’s go further and look at how efficiently those BRAMs are used. For an 18K BRAM, the efficiency is the percentage of its 18K bits that actually store data. For example, if 9K of the 18K is used, the efficiency is 50%.

We build the following formulas to compute the BRAM efficiency for different types of buffers.

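| Buffers | Efficiency of BRAM | Example |
|---|---|---|
| IFM | $\frac{\frac{I_m\times I_{len}}{I_p}\times 16}{\left\lceil \frac{I_m\times I_{len}}{1125\times I_p} \right\rceil \times 18\text{K}}$ | $\frac{\frac{64\times 13\times 13\times 16}{8}}{36\text{K}} \approx 60.08\%$ |
| OFM | $\frac{\frac{O_m\times O_{len}}{O_p}\times 16}{\left\lceil \frac{O_m\times O_{len}}{1125\times O_p} \right\rceil \times 18\text{K}}$ | $\frac{\frac{64\times 13\times 13\times 16}{8}}{36\text{K}} \approx 60.08\%$ |
| WEIGHTs | $\frac{\frac{4\times O_m\times I_m\times K}{O_p\times I_p\times K}\times 16}{\left\lceil \frac{4\times O_m\times I_m\times K}{1125\times O_p\times I_p\times K} \right\rceil \times 18\text{K}}$ | $\frac{4096}{18\text{K}} \approx 22.75\%$ |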
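
The efficiency figures can be checked with the sketch below. It is an illustration under the same assumption as the table (18000 bits per BRAM, 16-bit data); `ceil_div` and `bram_efficiency` are hypothetical names, and the function returns a fraction between 0 and 1 rather than a percentage.

```cpp
// Integer ceiling division for positive operands.
constexpr long ceil_div(long a, long b) { return (a + b - 1) / b; }

// Fraction of the allocated BRAM bits that hold data, for `elements` values
// split across `partitions` banks. All banks are the same size, so the
// per-bank ratio is also the overall ratio.
// bram_bits = 18000 is an assumed capacity, not a value reported by HLS.
constexpr double bram_efficiency(long elements, long partitions,
                                 long bits_per_element = 16,
                                 long bram_bits = 18000) {
    return static_cast<double>(ceil_div(elements, partitions) * bits_per_element)
         / static_cast<double>(
               ceil_div(ceil_div(elements, partitions) * bits_per_element, bram_bits)
               * bram_bits);
}
```

For CONV3, `bram_efficiency(64 * 169, 8)` is about 0.60 for IFM and OFM, and `bram_efficiency(64 * 64 * 9, 144)` is about 0.23 for WEIGHTs, in line with the table.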

Weiwen Jiang
Jul 19, 2018
jiang.wwen@pitt.edu
At UPITT