Usage Analysis NN on ZCU102 (1) --- Memory Usage

In Design NN on ZCU102 (3), we introduced the variables used in the design of NNs. In this blog, we are going to analyze the hardware usage based on these variables.
There are two kinds of hardware usage we are targeting: (1) memory usage and (2) DSP usage.
For the DSP usage, we assume 16-bit fixed-point for the IFM and intermediate data, which means one multiplication uses one DSP.
Before going deeper, we first review the related variables. The last column "Value" gives the corresponding value for layer CONV3 in AlexNet.

| Notation | Definition | Value |
|---|---|---|
| ifm_ch_mem | The number of IFM channels that the on-chip buffers can store | 8 |
| ifm_ch_proc | The number of IFM channels processed in each iteration | 4 |
| ifm_len | The number of pixels in each channel of the IFM | $13\times 13$ |
| ifm_row, ifm_col | $ifm\_len=ifm\_row\times ifm\_col$ | 13 |
| ofm_ch_mem | The number of OFM channels that the on-chip buffers can store | 4 |
| ofm_ch_proc | The number of OFM channels processed in each iteration | 4 |
| ofm_len | The number of pixels in each channel of the OFM | $13\times 13$ |
| ofm_row, ofm_col | $ofm\_len=ofm\_row\times ofm\_col$ | 13 |
| kernel_size | The size of a kernel | 9 |
| kernel_row | The number of rows in one kernel | 3 |
| win_pad_size | The padding for convolution | 1 |
| win_stride | The stride in performing convolution | 1 |
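
To keep the analysis concrete, here is a minimal sketch of how these variables might be fixed at compile time in an HLS design. The macro names are hypothetical (the original source is not shown); the values follow the table above.

```c
/* Hypothetical compile-time design parameters; values from the table above. */
#define IFM_CH_MEM    8        /* channels the on-chip IFM buffer can store */
#define IFM_CH_PROC   4        /* channels processed per iteration          */
#define IFM_ROW       13
#define IFM_COL       13
#define IFM_LEN       (IFM_ROW * IFM_COL)        /* pixels per IFM channel  */
#define OFM_CH_MEM    4
#define OFM_CH_PROC   4
#define OFM_ROW       13
#define OFM_COL       13
#define OFM_LEN       (OFM_ROW * OFM_COL)        /* pixels per OFM channel  */
#define KERNEL_ROW    3
#define KERNEL_SIZE   (KERNEL_ROW * KERNEL_ROW)  /* 3x3 = 9 weights         */
#define WIN_PAD_SIZE  1        /* convolution padding                       */
#define WIN_STRIDE    1        /* convolution stride                        */
```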

###Memory Usage in One Convolutional Layer

First, let's look at the memory usage of a convolutional layer.
It is composed of several parts, including buffers for (1) input feature maps, denoted as "IFM_BUF"; (2) output feature maps, denoted as "OFM_BUF"; (3) kernels, denoted as "weight_BUF"; (4) bias, denoted as "BIAS_BUF"; (5) sliding windows, denoted as "window_BUF"; (6) line buffers, denoted as "Line_BUF"; and (7) the last-line buffer, denoted as "temp_BUF". For the buffer types (5)-(7), please refer to Design NN on ZCU102 (2).

#####IFM_BUF
$IFM\_BUF=ifm\_ch\_mem\times ifm\_len$

#####OFM_BUF
$OFM\_BUF=ofm\_ch\_mem\times ofm\_len$

#####weight_BUF
$weight\_BUF=ofm\_ch\_mem\times ifm\_ch\_mem\times kernel\_size$

#####BIAS_BUF
$BIAS\_BUF=ofm\_ch\_mem$

#####window_BUF
$window\_BUF=ifm\_ch\_proc\times kernel\_size$

#####Line_BUF
Note that some layers require padding in convolution, so we need to account for the padding in the line buffers. For a $3\times 3$ kernel, there are 2 line buffers. The size of each buffer is:
$Line\_BUF=ifm\_ch\_proc\times (ifm\_col+2\times win\_pad\_size)$

#####temp_BUF
For each channel, there is a temp buffer that stores the last element of each row. Hence, its size is:
$temp\_BUF=ifm\_ch\_proc\times kernel\_row$
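
Putting the seven formulas together, the buffer declarations could look like the following sketch. The IFM, OFM, and WEIGHT declarations mirror the code shown later in this post; the BIAS, WINDOW, LINE, and TEMP declarations, the type definitions, and the two-line-buffer layout are assumptions for illustration, reusing the hypothetical macros from the earlier sketch.

```c
#include <stdint.h>

typedef int16_t FPGA_DATA;     /* 16-bit fixed-point, as assumed below       */
typedef int16_t FPGA_WEIGHTS;  /* could be 8-bit in a real design            */

/* Buffers sized by the formulas above (2 line buffers for a 3x3 kernel). */
static FPGA_DATA    IFM[IFM_CH_MEM][IFM_LEN];
static FPGA_DATA    OFM[OFM_CH_MEM][OFM_LEN];
static FPGA_WEIGHTS WEIGHT[OFM_CH_MEM][IFM_CH_MEM][KERNEL_SIZE];
static FPGA_WEIGHTS BIAS[OFM_CH_MEM];
static FPGA_DATA    WINDOW[IFM_CH_PROC][KERNEL_SIZE];
static FPGA_DATA    LINE[2][IFM_CH_PROC][IFM_COL + 2 * WIN_PAD_SIZE];
static FPGA_DATA    TEMP[IFM_CH_PROC][KERNEL_ROW];
```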

####Summary
We need to consider the bitwidth of each kind of buffer. For example, in the implementation, we can use 16-bit fixed-point for the IFM, OFM, window, line, and temp buffers, while the weights and bias can be 8-bit fixed-point to reduce buffer size. Here, for ease of analysis, we assume all data types are 16-bit fixed-point.

Using the numbers in the above table, we obtain a total memory usage of 3164 words for the IP core, which equals 6328 bytes at 16 bits per word.

###BRAM Usage in One Convolutional Layer
The above section analyzes the memory usage of one layer. However, these results cannot be directly applied to calculate the usage of on-chip memory (i.e., BRAM or LUTRAM on an FPGA).
Before explaining the root cause of the above statement, we first present a key property of BRAM on FPGAs.
Unlike DDR-based main memory, BRAMs are distributed across the FPGA, and the size of one BRAM is fixed, e.g., 18Kb.

Now, let's see the main reason that we cannot directly use the above calculations to obtain the BRAM usage on an FPGA:

  • We need high parallelism to accelerate the application, which requires retrieving/storing data in parallel. As a consequence, it is hard to fully occupy each BRAM.

In order to calculate the BRAM usage, we need to understand how BRAM is allocated on an FPGA, which involves "array partitioning". Let's look at an example of array partitioning:

```c
static FPGA_DATA IFM[ifm_ch_mem][ifm_len];
#pragma HLS RESOURCE variable=IFM core=RAM_1P_BRAM
#pragma HLS ARRAY_PARTITION variable=IFM cyclic factor=ifm_ch_proc dim=1
```

In the above code, we use single-port BRAM to implement IFM, and the IFM array is partitioned cyclically on the first dimension with a factor of ifm_ch_proc. (For details, please refer to the HLS pragma documentation.)

The factor ifm_ch_proc means that the number of BRAMs accessed simultaneously is ifm_ch_proc. In other words, it requires "at least" ifm_ch_proc different BRAMs to store one IFM. I say "at least" because ifm_len may be so large that one BRAM cannot hold all the data of a partition.
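
To see what a cyclic partition does, the following plain-C sketch (not HLS code) prints the partition each channel lands in: with `cyclic factor=F` on dim 1, element c of that dimension goes to partition c mod F, round-robin.

```c
#include <stdio.h>

/* Illustration only: with "cyclic factor=F" on dim 1, channel c of IFM
 * is placed in partition (c % F), so F channels can be read per cycle. */
int main(void) {
    const int ifm_ch_mem  = 8;  /* example values from the first table */
    const int ifm_ch_proc = 4;  /* the partition factor F              */
    for (int c = 0; c < ifm_ch_mem; c++)
        printf("IFM channel %d -> partition %d\n", c, c % ifm_ch_proc);
    return 0;
}
```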

With the knowledge of array partitioning, we can analyze the BRAM usage of each kind of buffer. Before going deeper, we first introduce the metrics related to BRAM: (1) the number of words used in each BRAM and (2) the number of BRAMs. We now take IFM_BUF as an example to analyze the number of BRAMs required for each kind of buffer.

#####IFM_BUF

```c
static FPGA_DATA IFM[ifm_ch_mem][ifm_len];
#pragma HLS RESOURCE variable=IFM core=RAM_1P_BRAM
#pragma HLS ARRAY_PARTITION variable=IFM cyclic factor=ifm_ch_proc dim=1
```

We can obtain the following information from the above code.

  • The size of IFM is $ifm\_ch\_mem\times ifm\_len$.
  • The partition factor is $ifm\_ch\_proc$.
  • Each partition stores $\frac{ifm\_ch\_mem\times ifm\_len}{ifm\_ch\_proc}$ words. With 16-bit fixed-point, the number of BRAMs required for each partition is $\left\lceil (\frac{ifm\_ch\_mem\times ifm\_len}{ifm\_ch\_proc}\times 16)/18K\right\rceil$.

In summary, the number of BRAMs used by IFM is $\left\lceil (\frac{ifm\_ch\_mem\times ifm\_len}{ifm\_ch\_proc}\times 16)/18K\right\rceil \times ifm\_ch\_proc$.
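
This calculation can be packaged as a small helper. The sketch below assumes this post's convention that one BRAM holds 18K = 18,000 bits (i.e., 1125 16-bit words); `brams_needed` is a hypothetical name, not from the original design.

```c
#define BRAM_BITS 18000  /* this post's 18K convention: 1125 x 16-bit words */

/* BRAMs for a buffer of `total_words` 16-bit words split evenly into
 * `partitions` banks: ceil(bits per bank / 18K) BRAMs per bank,
 * multiplied by the number of banks. */
long brams_needed(long total_words, long partitions) {
    long bits_per_bank = (total_words / partitions) * 16;
    long per_bank = (bits_per_bank + BRAM_BITS - 1) / BRAM_BITS;  /* ceil */
    return per_bank * partitions;
}
```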

#####weight_BUF and OFM_BUF

```c
static FPGA_WEIGHTS WEIGHT[ofm_ch_mem][ifm_ch_mem][kernel_size];
#pragma HLS RESOURCE variable=WEIGHT core=RAM_1P_BRAM
#pragma HLS ARRAY_PARTITION variable=WEIGHT cyclic factor=ofm_ch_proc/2 dim=1
#pragma HLS ARRAY_PARTITION variable=WEIGHT cyclic factor=ifm_ch_proc/2 dim=2
#pragma HLS ARRAY_PARTITION variable=WEIGHT complete dim=3

static FPGA_DATA OFM[ofm_ch_mem][ofm_len];
#pragma HLS RESOURCE variable=OFM core=RAM_S2P_BRAM
#pragma HLS ARRAY_PARTITION variable=OFM cyclic factor=ofm_ch_proc dim=1
```

Applying the same analysis to OFM and WEIGHT, we summarize the number of partitions and BRAMs for each buffer in the following table. Note that the partitions of WEIGHT multiply across dimensions: $(ofm\_ch\_proc/2)\times(ifm\_ch\_proc/2)\times kernel\_size=\frac{ofm\_ch\_proc\times ifm\_ch\_proc\times kernel\_size}{4}$.

| Buffers | # of Partitions ($P$) | # of BRAMs |
|---|---|---|
| IFM | $P=ifm\_ch\_proc$ | $\left\lceil (\frac{ifm\_ch\_mem\times ifm\_len}{P}\times 16)/18K\right\rceil \times P$ |
| OFM | $P=ofm\_ch\_proc$ | $\left\lceil (\frac{ofm\_ch\_mem\times ofm\_len}{P}\times 16)/18K\right\rceil \times P$ |
| WEIGHTs | $P=\frac{ofm\_ch\_proc\times ifm\_ch\_proc\times kernel\_size}{4}$ | $\left\lceil (\frac{ofm\_ch\_mem\times ifm\_ch\_mem\times kernel\_size}{P}\times 16)/18K\right\rceil \times P$ |

####Simplification and Example

To simplify the above formulas, we map the original notations to simpler ones, as shown in the following table. Meanwhile, we use CONV3 as an example; the corresponding value of each variable is also given in the table.

| Original Notation | New Notation | Value |
|---|---|---|
| ifm_ch_mem | $I_m$ | 64 |
| ifm_ch_proc | $I_p$ | 8 |
| ifm_len | $I_{len}$ | $13\times 13$ |
| ofm_ch_mem | $O_m$ | 64 |
| ofm_ch_proc | $O_p$ | 8 |
| ofm_len | $O_{len}$ | $13\times 13$ |
| kernel_size | $K$ | 9 |

Using these simplified notations, let's rewrite the formulas. Note that one 18K BRAM holds $18K/16=1125$ 16-bit words, so the term "$\times 16/18K$" from the previous table becomes "$/1125$" below.

| Buffers | # of Partitions | # of BRAMs |
|---|---|---|
| IFM | $I_p=8$ | $\left\lceil\frac{I_m\times I_{len}}{1125\times I_p}\right\rceil\times I_p=\left\lceil 1.20\right\rceil\times 8=16$ |
| OFM | $O_p=8$ | $\left\lceil\frac{O_m\times O_{len}}{1125\times O_p}\right\rceil\times O_p=\left\lceil 1.20\right\rceil\times 8=16$ |
| WEIGHTs | $\frac{O_p\times I_p\times K}{4}=144$ | $\left\lceil\frac{4\times O_m\times I_m\times K}{1125\times O_p\times I_p\times K}\right\rceil\times\frac{O_p\times I_p\times K}{4}=\left\lceil 0.2275\right\rceil\times 144=144$ |

Based on the above table, the number of BRAMs used in the design for IFM, OFM, and WEIGHTs is 16+16+144=176, which is exactly the same as the number obtained from HLS synthesis and post-implementation reports.
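
As a quick sanity check of the arithmetic, the hypothetical `brams_needed` helper sketched earlier reproduces these three counts when fed the CONV3 values:

```c
#include <stdio.h>

long brams_needed(long total_words, long partitions);  /* sketch from above */

int main(void) {
    long Im = 64, Ip = 8, Om = 64, Op = 8, K = 9, len = 13 * 13;
    printf("IFM:    %ld\n", brams_needed(Im * len, Ip));                 /* 16  */
    printf("OFM:    %ld\n", brams_needed(Om * len, Op));                 /* 16  */
    printf("WEIGHT: %ld\n", brams_needed(Om * Im * K, Op * Ip * K / 4)); /* 144 */
    return 0;
}
```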

####Efficiency of using BRAMs

In the above sections, we figured out how many BRAMs are involved under a given array partition. Now, let's go further and examine how efficiently the BRAMs are used. For an 18K BRAM, efficiency means the percentage of the 18Kb space that is used for storing data. For example, if 9Kb of the 18Kb is used, the efficiency is 50%.

We build the following formulas to compute the BRAM efficiency for different types of buffers.

| Buffers | Efficiency of BRAM | Example |
|---|---|---|
| IFM | $(\frac{I_m\times I_{len}}{I_p}\times 16)/(\left\lceil\frac{I_m\times I_{len}}{1125\times I_p}\right\rceil\times 18K)$ | $\frac{64\times 13\times 13\times 16}{8}/36K=60.08\%$ |
| OFM | $(\frac{O_m\times O_{len}}{O_p}\times 16)/(\left\lceil\frac{O_m\times O_{len}}{1125\times O_p}\right\rceil\times 18K)$ | $\frac{64\times 13\times 13\times 16}{8}/36K=60.08\%$ |
| WEIGHTs | $(\frac{4\times O_m\times I_m\times K}{O_p\times I_p\times K}\times 16)/(\left\lceil\frac{4\times O_m\times I_m\times K}{1125\times O_p\times I_p\times K}\right\rceil\times 18K)$ | $4096/18K=22.75\%$ |
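
The same style of sketch computes the efficiencies, again under the 18K = 18,000-bit assumption. It prints four decimals; the table above rounds down to two.

```c
#include <stdio.h>

#define BRAM_BITS 18000  /* this post's 18K convention: 1125 x 16-bit words */

/* Fraction of the allocated BRAM bits that actually hold data. */
double bram_efficiency(long total_words, long partitions) {
    long bits  = (total_words / partitions) * 16;     /* bits per partition */
    long brams = (bits + BRAM_BITS - 1) / BRAM_BITS;  /* ceil               */
    return 100.0 * bits / (brams * BRAM_BITS);
}

int main(void) {
    printf("IFM/OFM: %.4f%%\n", bram_efficiency(64 * 13 * 13, 8));   /* 60.0889 */
    printf("WEIGHT:  %.4f%%\n", bram_efficiency(64L * 64 * 9, 144)); /* 22.7556 */
    return 0;
}
```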

Weiwen Jiang
Jul 19, 2018
jiang.wwen@pitt.edu
At UPITT