Design of "Lenet" on ZCU102 (1) — HLS Implementation

In this blog, I'll outline the main operations implemented in HLS to generate the IP core for Lenet. Since the ZCU102 has enough on-chip memory to store all weights and inputs, we allocate buffers large enough to hold all parameters. Therefore, the PS only needs to send the input feature map to the IP core, and there is no extra data transmission between PS and PL except for the intermediate data.
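To see why this fits, a quick parameter count for the standard Lenet-5 topology (conv1: 1->6 channels with 5x5 kernels, conv2: 6->16, fully-connected layers 400->120->84->10) gives:

conv1: 6*(1*25+1) = 156
conv2: 16*(6*25+1) = 2,416
fc1: 400*120+120 = 48,120
fc2: 120*84+84 = 10,164
fc3: 84*10+10 = 850
total = 61,706 parameters

Even at 32 bits per parameter this is only about 247 KB, comfortably within the on-chip BRAM of the ZCU102.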

The design of the programmable logic (PL) part, using Vivado HLS

In order to clearly show the process of Lenet on an FPGA, we design an IP core for each layer. Each IP core has two stream channels (in_stream and out_stream) and an integer argument. The weights and input feature map (IFM) are sent to the PL part through in_stream, and the output feature map (OFM) is transmitted back to the PS part through out_stream. The integer is an operation code indicating which operation will be performed.

void conv1(hls::stream<DMA_DATA> &in_stream, hls::stream<DMA_DATA> &out_stream, int op)
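The definition of DMA_DATA is not shown in this post. A minimal sketch, assuming a float payload plus the TLAST side-channel flag that the AXI DMA expects (both field names are assumptions, not the original definition):

#include "ap_int.h"
#include "hls_stream.h"

typedef float FPGA_DATA;          // payload type (assumed; could be fixed-point)
typedef struct {
    FPGA_DATA data;               // one weight or one pixel per beat
    ap_uint<1> last;              // asserted on the final beat of a transfer
} DMA_DATA;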

Definition of variables

We define three sets of variables related to IFM, OFM, and kernel, respectively.

First, let's look at the variables related to the IFM.

IFM Notations and Definitions
ifm_ch_mem: The number of IFM channels that the on-chip buffers can store
ifm_ch_proc: The number of IFM channels that will be processed in each iteration
ifm_len: The number of pixels in each channel of the IFM
ifm_row, ifm_col: ifm_len = ifm_row * ifm_col

Then, let's look at the variables related to the OFM. They are similar to those of the IFM.

OFM Notations and Definitions
ofm_ch_mem: The number of OFM channels that the on-chip buffers can store
ofm_ch_proc: The number of OFM channels that will be processed in each iteration
ofm_len: The number of pixels in each channel of the OFM
ofm_row, ofm_col: ofm_len = ofm_row * ofm_col

Note that ifm_len and ofm_len are related; the relationship is determined by the kernel size, stride, padding, etc.
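For example, with stride s and no padding, ofm_row = (ifm_row - kernel_row)/s + 1. For conv1 of the standard Lenet-5 (32x32 input, 5x5 kernel, stride 1), this gives ofm_row = (32 - 5)/1 + 1 = 28.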

Finally, let's look at the variables related to the kernel/weights.

Kernel Notations and Definitions
kernel_size: The size of a kernel, e.g., for a 5x5 kernel, kernel_size = 25
kernel_row: The number of rows in one kernel

The design of the conv1 IP Core

We define the operation code as follows, i.e., the IP core performs the following operations when op=x.

  • x=1: The IP core receives weights and bias from the PS part.
  • x=2: The IP core performs five operations sequentially (a sketch of this dispatch follows the list).
    1. It receives the IFM from the PS.
    2. It obtains the processing window.
    3. It performs the convolution operation.
    4. It performs the pooling operation.
    5. It sends the intermediate results back to the PS.
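A minimal sketch of how conv1 could dispatch on the op code is shown below; the buffer names (WEIGHT, BIAS) and the assumption that DMA_DATA carries a .data field are illustrative, not the original implementation:

void conv1(hls::stream<DMA_DATA> &in_stream,
           hls::stream<DMA_DATA> &out_stream, int op){
    // Static buffers persist across calls, so op=1 is issued only once.
    static FPGA_DATA WEIGHT[OFM_CHANNEL][IFM_CHANNEL][KERNEL_SIZE];
    static FPGA_DATA BIAS[OFM_CHANNEL];
    if(op == 1){
        // Receive weights, then biases, from the PS.
        LOAD_W:for(int k=0;k<OFM_CHANNEL;k++)
            for(int c=0;c<IFM_CHANNEL;c++)
                for(int m=0;m<KERNEL_SIZE;m++)
                    WEIGHT[k][c][m] = in_stream.read().data;
        LOAD_B:for(int k=0;k<OFM_CHANNEL;k++)
            BIAS[k] = in_stream.read().data;
    }else if(op == 2){
        // Steps 1-5: receive the IFM, build windows, convolve,
        // pool, and stream OFM_POOL back through out_stream.
    }
}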

The most important operations affecting the overall performance are "window generation", "convolution", and "pooling". We will introduce these operations one by one.

Window Generation

First, let's talk about the window generation. The code is as follows.

// We have kernel_size=25, kernel_row=5
void window_generator_5_5(FPGA_DATA d_in, FPGA_DATA win_out[kernel_size], int column,
        FPGA_DATA linebuf1[ifm_col], FPGA_DATA linebuf2[ifm_col], FPGA_DATA linebuf3[ifm_col],
        FPGA_DATA linebuf4[ifm_col], FPGA_DATA temp[kernel_row]){

    // Gather the new pixel and the four pixels above it from the line buffers;
    // temp[0] is the newest (bottom) row, temp[4] the oldest (top) row.
    temp[0] = d_in;
    temp[1] = linebuf1[column];
    temp[2] = linebuf2[column];
    temp[3] = linebuf3[column];
    temp[4] = linebuf4[column];

    // Shift the 5x5 window (row-major) left by one step:
    // column i receives column i+1, for i = 0..3.
    for(int i=0;i<kernel_row-1;i++){
        for(int j=i;j<kernel_size;j+=kernel_row){
            win_out[j] = win_out[j+1];
        }
    }

    // Fill the rightmost column (indices 4, 9, 14, 19, 24) with the gathered
    // pixels; the oldest row (temp[4]) goes to the top of the window.
    int i=0;
    for(int j=kernel_row-1;j<kernel_size;j+=kernel_row){
        win_out[j] = temp[kernel_row-1-i];
        i++;
    }

    // Shift the line buffers down: linebuf1 keeps the newest row.
    linebuf1[column] = temp[0];
    linebuf2[column] = temp[1];
    linebuf3[column] = temp[2];
    linebuf4[column] = temp[3];
}

The idea is that we use four line buffers to keep the four most recently processed rows. Each time we receive a new pixel at column c, it is placed at the bottom-right of the window (win_out): every column in the window first shifts left by one step, and the new pixel together with the data stored at column c of the line buffers forms the last column of the window. Finally, we update the line buffers by moving the data at column c of linebuf_x to linebuf_(x+1), with the new pixel stored in linebuf1. You can refer to the video VIVADO HLS 2D Convolution on hardware for more details.
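For context, here is a hypothetical caller (the buffer names, in particular IFM_BUF, are illustrative and not from the original design). It scans one channel of the IFM in raster order; the window becomes valid once the first kernel_row-1 rows and the first kernel_row-1 pixels of the current row have been consumed:

FPGA_DATA linebuf1[ifm_col], linebuf2[ifm_col], linebuf3[ifm_col], linebuf4[ifm_col];
FPGA_DATA win[kernel_size], temp[kernel_row];
WIN_R:for(int r=0;r<ifm_row;r++){
    WIN_C:for(int c=0;c<ifm_col;c++){
#pragma HLS PIPELINE II=1
        // IFM_BUF holds one channel of the IFM in row-major order (name assumed)
        window_generator_5_5(IFM_BUF[r*ifm_col+c], win, c,
                             linebuf1, linebuf2, linebuf3, linebuf4, temp);
        if(r >= kernel_row-1 && c >= kernel_row-1){
            // win now holds the 5x5 neighborhood whose bottom-right pixel
            // is (r, c); feed it to the convolution step here.
        }
    }
}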

Convolution

Now, let's talk about the convolution operation. Each time, we would like to perform ofm_ch_proc*ifm_ch_proc*kernel_size multiplications simultaneously. [Details can be found in our CASES 18 paper, titled "Heterogeneous FPGA-based Cost-Optimal Design for Timing-Constrained CNNs".] Then, we use an adder tree to sum the results up.

FPGA_DATA MUL_RES[KERNEL_SIZE];
#pragma HLS ARRAY_PARTITION variable=MUL_RES complete dim=1
CONV_K:for(int k=0;k<OFM_CHANNEL;k++){
#pragma HLS UNROLL
    CONV_C:for(int c=0;c<IFM_CHANNEL;c++){
#pragma HLS UNROLL
        // 25 multiplications of the window of input channel c
        // against the kernel of the (k, c) pair, all in parallel.
        CONV_M:for(int m=0;m<KERNEL_SIZE;m++){
#pragma HLS UNROLL
            MUL_RES[m] = data[c][m]*kernel[k][c][m];
        }
        // Sum the 25 products and accumulate into output channel k
        // (OFM_PARTIAL is a per-channel partial-sum buffer).
        OFM_PARTIAL[k] += ADDER_TREE_25(MUL_RES);
    }
}

In the final step, we invoke ADDER_TREE_25 to sum up the results obtained from the multiplications.
For ease of implementation, the adder tree treats the number of inputs as 2^5=32>25, i.e., the 25 real inputs are conceptually padded with zeros up to a power of two.
The detailed code is given as follows.

FPGA_DATA ADDER_TREE_25(FPGA_DATA data[25]){
#pragma HLS INLINE
#pragma HLS PIPELINE
    FPGA_DATA sum6[16];
#pragma HLS ARRAY_PARTITION variable=sum6 complete dim=1
    FPGA_DATA sum5[8];
#pragma HLS ARRAY_PARTITION variable=sum5 complete dim=1
    FPGA_DATA sum4[4];
#pragma HLS ARRAY_PARTITION variable=sum4 complete dim=1
    FPGA_DATA sum3[2];
#pragma HLS ARRAY_PARTITION variable=sum3 complete dim=1
    FPGA_DATA sum2;
    // Level 1: pair up data[0..23] into 12 sums; the remaining
    // four slots are the zero padding of the virtual 32 inputs.
    for(int i=0; i<12; i++){
#pragma HLS UNROLL
        sum6[i] = data[2*i] + data[2*i+1];
    }
    for(int i=12; i<16; i++){
#pragma HLS UNROLL
        sum6[i] = 0;
    }
    // Levels 2-5: halve the number of partial sums at each level.
    for(int i=0; i<8; i++){
#pragma HLS UNROLL
        sum5[i] = sum6[2*i] + sum6[2*i+1];
    }
    for(int i=0; i<4; i++){
#pragma HLS UNROLL
        sum4[i] = sum5[2*i] + sum5[2*i+1];
    }
    for(int i=0; i<2; i++){
#pragma HLS UNROLL
        sum3[i] = sum4[2*i] + sum4[2*i+1];
    }
    sum2 = sum3[0]+sum3[1];
    // data[24] is the 25th input, added in at the root.
    return sum2+data[24];
}

Pooling

The pooling operation in Lenet is max pooling with a 2x2 window. To implement it in HLS, we simply traverse the OFM and select the maximum value in each 2x2 window as the corresponding entry of the OFM_POOL matrix.
The code is listed as follows.

// Return the maximum of the four pixels in a 2x2 window.
FPGA_DATA max_pool_2_2(FPGA_DATA A, FPGA_DATA B, FPGA_DATA C, FPGA_DATA D){
    FPGA_DATA tmp1,tmp2;
    tmp1 = A>B?A:B;
    tmp2 = C>D?C:D;
    return tmp1>tmp2?tmp1:tmp2;
}

POOL_1:for(int i=0;i<OFM_CHANNEL;i++){
    int pool_add=0;  // write index into the pooled channel
    POOL_2:for(int j=0;j<OFM_HEIGHT;j+=2){
#pragma HLS PIPELINE II=1
        POOL_3:for(int k=0;k<OFM_WIDTH;k+=2){
#pragma HLS UNROLL
            // Take the max over the 2x2 window anchored at (j, k).
            OFM_POOL[i][pool_add]=max_pool_2_2(OFM[i][j*OFM_WIDTH+k],
                OFM[i][j*OFM_WIDTH+k+1],OFM[i][(j+1)*OFM_WIDTH+k],
                OFM[i][(j+1)*OFM_WIDTH+k+1]);
            pool_add++;
        }
    }
}
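Step 5 of op=2 (sending the intermediate results back to the PS) is not shown above. A minimal sketch, reusing the assumed DMA_DATA fields from earlier; OFM_POOL_LEN, i.e. (OFM_HEIGHT/2)*(OFM_WIDTH/2), is a hypothetical constant:

SEND:for(int i=0;i<OFM_CHANNEL;i++){
    for(int j=0;j<OFM_POOL_LEN;j++){
#pragma HLS PIPELINE II=1
        DMA_DATA d;
        d.data = OFM_POOL[i][j];
        // Assert 'last' on the final beat so the AXI DMA can close the transfer.
        d.last = (i==OFM_CHANNEL-1 && j==OFM_POOL_LEN-1) ? 1 : 0;
        out_stream.write(d);
    }
}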

July 13, 2018 Weiwen Jiang jiang.wwen@pitt.edu At UPITT