

ECE618 Hardware Accelerators for Machine Learning (Spring 2022)

### Lecture 1: Course Information & Machine Learning and FPGA Accelerator Recap

Weiwen Jiang, Ph.D.

**Electrical and Computer Engineering** 

George Mason University

wjiang8@gmu.edu







## **Course Information**

| Instructor         | Dr. Weiwen Jiang                                  |
|--------------------|---------------------------------------------------|
| E-Mail             | wjiang8@gmu.edu                                   |
| Phone              | (703)993-5083                                     |
| Lecture Time       | <u>Monday 19:20 - 22:00</u>                       |
| Location           | Room 1002, Music/Theater Building                 |
| Office Hour        | Monday 16:30 - 17:30                              |
| Office             | Room 3247, Nguyen Engineering Building            |
| Zoom               | http://go.gmu.edu/zoom4weiwen                     |
| Backup Course Zoom | https://go.gmu.edu/ece618 (Need Permission First) |

## About Me.



Dr. Weiwen Jiang

#### Background

- Researcher at University of Pittsburgh (2017-2019)
- Postdoc at University of Notre Dame (2019-2021)
- George Mason University (2021-present)

#### **Research Interests**

- HW/SW Co-Design
- Quantum Machine Learning

#### **Contacts**

- wjiang8@gmu.edu
- Nguyen Engineering Building, Room 3247
- (703)993-5083
- <u>https://jqub.ece.gmu.edu/</u>

## **Teaching Assistant**



Yi Sheng (Ph.D. Candidate)

ysheng2@gmu.edu

https://jqub.ece.gmu.edu/yi/

Office Hours: TBD

ECE618 HW Accelerators for ML

Dr. Weiwen Jiang, ECE, GMU

## **Course Description**

Covers the <u>hardware design</u> principles to <u>deploy</u> different machine learning algorithms. The emphasis is on understanding the fundamentals of <u>machine learning and hardware</u> <u>architectures</u> and determining plausible methods to <u>bridge them</u>.

Topics include precision scaling, in-memory computing, hyperdimensional computing, architectural modifications, GPUs and vector architectures, and quantum computing, as well as recent hardware programming tools such as <u>Xilinx Vitis AI, Xilinx</u> <u>HLS, and IBM Qiskit</u>.

## **Recommended Prerequisites**

- ECE 554: Machine Learning for Embedded Systems
- Good C programming skills
  - Especially required for FPGA-related projects
- Familiarity with Python and PyTorch
















## **Course Resources**

- Blackboard:
  - Assignments will be posted and submitted here!
  - Online discussion, shared documents, announcements.
    - Do NOT upload code to the discussion board.
- Course Website:
  - https://jqub.ece.gmu.edu/2022/01/01/HA4ML/
  - Course information (TA time, location, zoom, etc.)
  - Slides, readings, and documents will be posted here!








## **Grading Policy**

| Component                   | Weight |
|-----------------------------|--------|
| Midterm Exam                | 10%    |
| Final Exam                  | 20%    |
| Research Paper Presentation | 20%    |
| Assignments and Labs        | 20%    |
| Project                     | 30%    |

## You Have Been Warned. Zero Tolerance!

- A face mask is required in class, whether or not you are vaccinated
  - Request Zoom access for a few classes if needed
- Lecture content and materials should NOT go online without explicit permission
- No plagiarism!
  - The common-sense interpretation of "no plagiarism": you need to DO your own work.






## **Tools for Lab**

Google <u>Colab</u>






### What Software to Be Accelerated? --- MLP/CNN

#### Supervised Learning Example: Classification

#### Training

**Given:** <u>Labeled</u> data as the training dataset

 $(x_i, y_i)$: $x_i$ is a training sample, $y_i$ is its label

 Example: $x_i$ = [image of the digit 3], $y_i = 3$

**Output:** A learned function *f* from X to Y

 $f: x \mapsto y$

#### Inference/Execution

**Given:** Unseen data (the test dataset) and a learned function *f*
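The training/inference split can be sketched with a deliberately tiny classifier. This is only an illustration, not the course's method: the nearest-centroid rule, the `train` helper, and the 2-D toy data are all assumptions chosen for brevity.

```python
# Toy supervised learning: learn f from labeled pairs (x_i, y_i),
# then apply f to unseen data. Nearest-centroid is an illustrative
# choice, not the model used in the course.

def train(dataset):
    """Training: build and return a learned function f from labeled data."""
    sums, counts = {}, {}
    for x, y in dataset:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def f(x):
        # Predict the label whose centroid is closest (squared distance).
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(centroids, key=lambda y: dist(centroids[y]))

    return f

# Made-up training set: label 0 clusters near (0, 0), label 1 near (1, 1).
training_data = [((0.0, 0.1), 0), ((0.2, 0.0), 0),
                 ((0.9, 1.0), 1), ((1.0, 0.8), 1)]
f = train(training_data)

# Inference: apply f to points that were not in the training set.
predictions = [f((0.1, 0.2)), f((0.8, 0.9))]
print(predictions)  # [0, 1]
```

Training happens once and produces `f`; inference only evaluates `f`, which is the part a hardware accelerator typically has to run fast.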





### What Software to Be Accelerated? --- MLP/CNN



- Local receptive fields
- Shared weights
- Pooling (subsampling)



### What Software to Be Accelerated? --- RNN

#### Supervised Learning

**Example:** Classification

#### Training

**Given:** <u>Labeled</u> data as the training dataset $(x_i, y_i)$: $x_i$ is a training sample, $y_i$ is its label

Example: $x_i$ = [speech audio clip], $y_i$ = "can I"

**Output:** A learned function *f* from X to Y

 $f: x \mapsto y$

#### Inference/Execution

**Given:** Unseen data (the test dataset) and a learned function *f*

**Do:** $f(\text{[speech audio clip]})$ = "brown fox"



#### SEP-28k Dataset

- Actual speech: "Can I feed my d-dog uh uh [...] pea-ea-nut butter?"
- Intended speech: "Can I feed my dog peanut butter?"

### What Software to Be Accelerated? --- RNN





[Figure: RNN application examples, with outputs labeled as classes, words, and sentiment]



The von Neumann structure, also known as the Princeton structure, is a memory structure that merges program instruction memory and data memory.

Instruction addresses and data addresses point to different physical locations in the same memory, so program instructions and data have the same width.



Intel's 12th Gen "Alder Lake" 10nm Desktop CPU





NVIDIA RTX A6000 Workstation Graphics Card (in my lab)



<u>NVIDIA Jetson Nano</u>

ODROID-XU4 Single Board Computer with Quad Core 2GHz A15, 2GB RAM



**Streaming architecture:** data items are pushed in and out as sequential streams, and the **instructions are mapped into programmable circuit units** along the path from the input ports to the output ports. Therefore, instead of fetching instructions and data back and forth from memory, the computation is performed as the **data streams flow** through the circuit units in one pass.
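A loose software analogy for the streaming model is a chain of Python generators: each stage is fixed in place, and items flow through the chain once. This is only a sketch. The stage names (`source`, `scale`, `offset`, `sink`) are hypothetical, and real streaming hardware runs its stages concurrently in circuits, which a single-threaded generator chain does not capture.

```python
# Software analogy of a streaming architecture: each stage plays the role
# of a fixed circuit unit, and data items flow through the chain in one
# pass. Stage names are hypothetical.

def source(values):
    for v in values:
        yield v              # push items in as a sequential stream

def scale(stream, factor):
    for v in stream:
        yield v * factor     # stage 1: multiply

def offset(stream, bias):
    for v in stream:
        yield v + bias       # stage 2: add

def sink(stream):
    return list(stream)      # collect the output stream

pipeline = offset(scale(source([1, 2, 3, 4]), factor=3), bias=1)
result = sink(pipeline)
print(result)  # [4, 7, 10, 13]
```

No stage stores the whole dataset or fetches an instruction per item; each item is consumed as soon as the previous stage produces it.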



Xilinx Alveo U280 Data Center Accelerator Card



ZCU Series (<u>102</u>, 104, 106)



ASIC


**In-memory computing** is the technique of **running computer calculations entirely in computer memory** (e.g., in RAM).




Quantum computing is a type of computation that harnesses the collective properties of quantum states, such as superposition, interference, and entanglement, to perform calculations.




#### **Embedded Processor Energy Breakdown**

[Figure: energy breakdown of an embedded processor across arithmetic, clock and control, data supply, and instruction supply]








https://openai.com/blog/ai-and-compute/

[Images credit]: Prof. Callie Hao @ GATech

- Specialized High-Efficiency Computing!
- Why specialization?
  - Power constraint of modern computers
  - Inefficiency of general-purpose computing
  - Data and computation explosion (big data, AI)
  - Real-time processing requirement



[Figure: images per second (FPS)] [Bianco, IEEE Access 2018]

#### **An Overview of Hardware Accelerators**



- Intel's 12th Gen "Alder Lake" 10nm Desktop CPU
- ODROID-XU4 Single Board Computer with Quad Core 2GHz A15, 2GB RAM
- NVIDIA RTX A6000 Workstation Graphics Card (in my lab)
- NVIDIA Jetson Nano
- PYNQ
- Xilinx Alveo U280 Data Center Accelerator Card
- ZCU Series (<u>102</u>, 104, 106)
- ASIC

#### Schedule




#### Session I: Classical Computing Accelerators for Machine Learning

| Date    | Topic                                                            |
|---------|------------------------------------------------------------------|
| Jan. 24 | Course Information & Machine Learning and FPGA Accelerator Recap |
| Jan. 31 | Vector Architectures, FPGAs and GPU Architectures                |
| Feb. 07 | ASIC Accelerators                                                |





#### **Session II: Novel Post-Moore Computing Accelerators for ML**

| Date    | Topic                                   |
|---------|-----------------------------------------|
| Feb. 14 | In-Memory Computing Accelerator Design  |
| Feb. 21 | Neuromorphic Accelerators               |
| Feb. 28 | Hyperdimensional Computing Accelerators |
| Mar. 07 | Quantum Neural Network Accelerators     |


#### **Session III: Other Accelerator Related Topics**

| Date      | Topic                           |
|-----------|---------------------------------|
| Mar. 28   | Project Proposal                |
| Apr. 04   | Distributed Learning            |
| Apr. 11   | Hands-on Accelerator Design (1) |
| Apr. 18   | Project Overview                |
| Apr. 25   | Hands-on Accelerator Design (2) |
| May 02    | Project Presentations           |
| May 11-18 | Final exam                      |

#### **Expectation & Final Project**

- Implement ML on any hardware in a team of 1-3 students

### What Did We Learn in ECE 554? (Recap)




## ECE 554 Course Recap

- Machine Learning Basics:
  - Different neural networks: MLP, CNN, RNN, RL
  - Training (gradient descent) and inference of neural networks using PyTorch
  - Implement convolution using "for loops"

## Biological Neuron

Human intelligence resides in the brain:

- Approximately **86 billion neurons** in the human brain
- The brain is a **network** of **neurons**, connected by nearly $10^{14}$ to $10^{15}$ synapses

[Figure: a biological neuron, with labels for dendrites, cell body, axon, synapse, and signal direction]

### How to equip intelligence in the machine?

- To understand how the brain network is constructed
- To mimic the brain

## Biological Neuron

#### Neurons work together:

- The cell body processes the information
- **Dendrites** receive messages from other neurons
- The axon transmits the output to many smaller branches
- Synapses are the contact points between the axon (Neuron 1) and the dendrites (Neuron 2) for message passing

[Figure: two connected neurons, with labels for dendrites, cell body, axon, synapse, and signal direction]

The **cell body** receives input signals from the **dendrites** and produces an output signal along the **axon**, which interacts with the next neurons via **synaptic weights**.

#### Synaptic weights are learnable to perform useful computations

(e.g., recognizing objects, understanding language, making plans, controlling the body)

## **Artificial Neuron Design**

- Idealized neuron models
  - Idealization removes complicated details that are not essential for understanding the main principles.
  - It allows us to apply mathematics and to make analogies.

## **McCulloch-Pitts (MP) Neuron**

The first computational model of a biological neuron (1943)



Warren McCulloch



Walter Pitts



#### **Assumptions:**

- Binary devices (i.e.,  $x_i \in \{0,1\}$  and  $y \in \{0,1\}$ )
- Identical synaptic weights (i.e., +1)
- Activation function *f* has a fixed threshold *θ*
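Under these three assumptions the MP neuron reduces to a thresholded count of active inputs. The sketch below is illustrative: the function name `mp_neuron` is hypothetical, and firing when the sum is greater than or equal to θ (rather than strictly greater) is an assumed convention.

```python
# McCulloch-Pitts neuron under the stated assumptions: binary inputs,
# identical +1 weights, fixed threshold theta. Firing when the sum is
# >= theta (rather than > theta) is an assumed convention.

def mp_neuron(inputs, theta):
    assert all(x in (0, 1) for x in inputs), "inputs must be binary"
    return 1 if sum(inputs) >= theta else 0

# With two inputs, theta = 2 behaves as AND and theta = 1 behaves as OR.
pairs = [(a, b) for a in (0, 1) for b in (0, 1)]
and_table = [mp_neuron([a, b], theta=2) for a, b in pairs]
or_table = [mp_neuron([a, b], theta=1) for a, b in pairs]
print(and_table)  # [0, 0, 0, 1]
print(or_table)   # [0, 1, 1, 1]
```

The only tunable quantity is θ, and it has to be chosen by hand for each function.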



- Break the limitations of the MP neuron
  - What about non-Boolean inputs (say, real numbers)?
  - What if we want to assign more weight (importance) to some inputs?
  - What about functions that are not linearly separable?
  - Do we always need to hand-code the threshold?
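One common answer to the questions about real-valued inputs, per-input importance, and hand-coded thresholds is a neuron with learnable weights and a learnable bias, trained with the classic perceptron update rule. The sketch below is an illustration under assumptions: the names `predict` and `train_perceptron`, the learning rate, and the AND toy data are not from the slides. It does not address the non-linearly-separable case, which needs hidden layers.

```python
# A neuron with real-valued inputs, per-input weights, and a learned bias
# (the bias plays the role of a negative threshold). Trained with the
# classic perceptron rule: w <- w + lr * (target - prediction) * x.

def predict(weights, bias, x):
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

def train_perceptron(data, epochs=20, lr=1.0):
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Toy, linearly separable task (logical AND), so the rule converges.
and_data = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0), ((1.0, 1.0), 1)]
weights, bias = train_perceptron(and_data)

learned = [predict(weights, bias, x) for x, _ in and_data]
print(learned)  # [0, 0, 0, 1]
```

Here the threshold is no longer hand-coded: it emerges as the learned `bias`, and the two weights are free to differ.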


## Multi-Layer Perceptron (MLP) – <u>Lecture 2</u>

- Input layer, output layer and hidden layers
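A minimal sketch of why hidden layers matter: XOR is not linearly separable, so no single threshold neuron computes it, but a two-layer network does. The weights below are set by hand for illustration (not learned), and the helper names `step`, `dense`, and `mlp_xor` are hypothetical.

```python
# Two-layer MLP with step activations computing XOR.
# Weights are hand-picked for illustration, not trained.

def step(s):
    return 1 if s > 0 else 0

def dense(inputs, weights, biases):
    """One fully connected layer: one output per weight row."""
    return [step(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_xor(x):
    # Hidden layer: neuron 0 acts as OR, neuron 1 acts as AND.
    hidden = dense(x, weights=[[1, 1], [1, 1]], biases=[-0.5, -1.5])
    # Output layer: fires when OR is on and AND is off.
    return dense(hidden, weights=[[1, -1]], biases=[-0.5])[0]

xor_table = [mlp_xor([a, b]) for a in (0, 1) for b in (0, 1)]
print(xor_table)  # [0, 1, 1, 0]
```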



#### **Deep Convolutional Neural Networks (CNN) – <u>Lecture 3</u>**

- One of the most widely used types of deep network
- Fully-connected nets treat input pixels that are far apart the same as those close together, so spatial information must be inferred from the training data
- In contrast, CNN proposes an architecture that inherently tries to take advantage of the spatial structure
  - Such an architecture makes convolutional networks fast to train
  - This, in turn, helps us train even deeper, many-layer networks
- Today, deep convolutional networks or some close variants are used in solving many interesting problems that go beyond image classification
- We will use image classification as a driving use case to explain the main concepts behind CNN



#### **Parameters:**

- N: input channels
- M: output channels
- K: kernel size
- P: padding size
- S: stride
- D: dilation
- R: rows
- C: columns
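These parameters map directly onto a "for loop" convolution. The sketch below is a plain-Python illustration under assumptions: R and C are read here as the input feature map's rows and columns (if they denote output rows and columns instead, only the size bookkeeping changes), padding is zero padding, and the function name `conv2d` and the nested-list layout are hypothetical. The output size uses the standard formula $\lfloor (R + 2P - D(K-1) - 1)/S \rfloor + 1$.

```python
# Direct convolution with nested for loops.
# ifm:    N x R x C       (input channels, rows, columns)
# weight: M x N x K x K   (output channels, input channels, kernel)
# P: zero padding, S: stride, D: dilation

def conv2d(ifm, weight, P=0, S=1, D=1):
    N, R, C = len(ifm), len(ifm[0]), len(ifm[0][0])
    M, K = len(weight), len(weight[0][0])
    R_out = (R + 2 * P - D * (K - 1) - 1) // S + 1
    C_out = (C + 2 * P - D * (K - 1) - 1) // S + 1
    ofm = [[[0.0] * C_out for _ in range(R_out)] for _ in range(M)]
    for m in range(M):                      # output channels
        for r in range(R_out):              # output rows
            for c in range(C_out):          # output columns
                acc = 0.0
                for n in range(N):          # input channels
                    for i in range(K):      # kernel rows
                        for j in range(K):  # kernel columns
                            y = r * S - P + i * D
                            x = c * S - P + j * D
                            # Positions outside the input act as zeros.
                            if 0 <= y < R and 0 <= x < C:
                                acc += ifm[n][y][x] * weight[m][n][i][j]
                ofm[m][r][c] = acc
    return ofm

ifm = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]  # N=1, R=C=3
weight = [[[[1.0, 1.0], [1.0, 1.0]]]]                        # M=1, N=1, K=2

plain = conv2d(ifm, weight)              # P=0, S=1
strided = conv2d(ifm, weight, P=1, S=2)  # padded and strided
print(plain)    # [[[12.0, 16.0], [24.0, 28.0]]]
print(strided)  # [[[1.0, 5.0], [11.0, 28.0]]]
```

The six nested loops are the structure that HLS tools later tile, reorder, pipeline, and unroll.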

[ref] Aqeel Anwar, What is Transposed Convolutional Layer? <u>https://towardsdatascience.com/what-is-transposed-convolutional-layer-40e5e6e31c11</u>


#### From Static Image to Sequences of Data



#### **RNN and Feedforward Network – <u>Lecture 5</u>**



- Assume each connection has 1 unit delay
- RNN can be unrolled into feedforward networks
  - Each layer keeps on reusing the same weights




## ECE 554 Course Recap

- Machine Learning Basics:
  - Different neural networks: MLP, CNN, RNN, RL
  - Training (gradient descent) and inference of neural networks using PyTorch
  - Implement convolution using "for loops"
- Put Machine Learning onto Embedded Systems:
  - Introduction to HLS (Lec 8-9)
    - Using MLP as an example in class
    - Using CNN as an example in labs, based on the "for loop" implementation
  - Model compression on FPGA: pruning and quantization (Lec 10-11)
  - Neural architecture search (Lec 12)
    - Using RNN-based RL as the controller/optimizer
    - Using a gradient descent approach for optimization
  - Data movement in HLS-based FPGA implementation (Lec 13)
  - Co-explore neural architectures and FPGA design (Lec 14)
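As a rough illustration of the two model-compression ideas named in the list, the sketch below prunes small weights and quantizes values to a signed fixed-point grid. Everything here is an assumption for illustration: the helper names `prune` and `to_fixed`, the magnitude threshold, and the 16-bit format with 8 integer bits, truncation toward zero, and saturation on overflow. It is not the course's lab code.

```python
# Magnitude pruning and fixed-point quantization, sketched in plain Python.

def prune(weights, threshold):
    """Zero out weights whose magnitude is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def to_fixed(value, total_bits=16, int_bits=8):
    """Quantize to signed fixed point: truncate toward zero, then saturate."""
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits
    raw = int(value * scale)               # int() truncates toward zero
    lo = -(1 << (total_bits - 1))          # most negative raw code
    hi = (1 << (total_bits - 1)) - 1       # most positive raw code
    raw = max(lo, min(hi, raw))            # saturate instead of wrapping
    return raw / scale

pruned = prune([0.05, -0.4, 0.01, 0.9], threshold=0.1)
quantized = [to_fixed(v) for v in (0.3, -0.3, 1000.0)]
print(pruned)     # [0.0, -0.4, 0.0, 0.9]
print(quantized)  # [0.296875, -0.296875, 127.99609375]
```

Pruning reduces the number of multiply-accumulates that matter, while quantization narrows each operand, which lowers the FPGA resource cost per operation.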

## High-Level Synthesis: HLS – <u>Lecture 8</u>

- High-Level Synthesis
  - Creates an RTL implementation from C, C++, SystemC, or OpenCL API C kernel code
  - Extracts control and dataflow from the source code
  - Implements the design based on defaults and user-applied directives
- Many implementations are possible from the same source description
  - Smaller designs, faster designs, optimal designs
  - Enables design exploration



#### Accelerates Algorithmic C to RTL IP integration

## C Validation and RTL Verification – <u>Lecture 8</u>

- There are two steps to verifying the design
  - Pre-synthesis: C Validation
    - Validate the algorithm is correct
  - Post-synthesis: RTL Verification
    - Verify the RTL is correct
- C validation
  - A HUGE reason users want to use HLS
    - Fast, free verification
  - Validate the algorithm is correct before synthesis
    - Follow the test bench tips given over
- RTL Verification
  - Vivado HLS can co-simulate the RTL with the original test bench



#### AXI\_Stream – Lecture 13



```
#include <ap_fixed.h>
#include <hls_stream.h>

// Tile sizes: rows, columns, kernel size, input channels, output channels
const int Tr=4;
const int Tc=4;
const int K=3;
const int Tn=3;
const int Tm=6;

//typedef ap_fixed<16,8,AP_TRN_ZERO, AP_SAT> FPGA_DATA;
typedef float FPGA_DATA;

// One stream beat: a data word plus the end-of-transfer flag
struct DMA_DATA{
    FPGA_DATA data;
    bool last;
};

void read_data(hls::stream<DMA_DATA> &input_dma_I, FPGA_DATA *output){
    static FPGA_DATA IFM[Tn][Tr+K-1][Tc+K-1];

    DMA_DATA ifm_input_dma;

    // Read the input tile from the stream into the on-chip buffer
    I0:for(int i=0;i<Tn;i++){
        I1:for(int j=0;j<Tr+K-1;j++){
            I2:for(int m=0;m<Tc+K-1;m++){
                ifm_input_dma=input_dma_I.read();
                IFM[i][j][m]=ifm_input_dma.data;
            }
        }
    }

    // Write the buffer to the flat output array, adding 2 to each element
    O0:for(int i=0;i<Tn;i++){
        O1:for(int j=0;j<Tr+K-1;j++){
            O2:for(int m=0;m<Tc+K-1;m++){
                output[i*(Tr+K-1)*(Tc+K-1)+j*(Tc+K-1)+m] = IFM[i][j][m]+2;
            }
        }
    }
}
```

#### Test Bench – <u>Lecture 13</u>

```
// The headers, tile constants, and DMA_DATA definition from the read_data
// source are assumed to precede this excerpt; printf needs <cstdio>.
// Two long lines are truncated in the source and are left as they appear.
FPGA_DATA input[Tn*(Tr+K-1)*(Tc+K-1)] = { 0.47902363538742065,0.5932260751724243,0.59

void read_data(hls::stream<DMA_DATA> &input_dma_I, FPGA_DATA *output);

int main(){
    hls::stream<DMA_DATA> input_dma_I("input_dma_I");

    FPGA_DATA y[Tn*(Tr+K-1)*(Tc+K-1)]={0};

    // Push the whole input tile into the stream; flag the final element
    DMA_DATA ifm;
    for(int i=0;i<Tn;i++){
        for(int j=0;j<Tr+K-1;j++){
            for(int m=0;m<Tc+K-1;m++){
                ifm.data = input[i*(Tr+K-1)*(Tc+K-1)+j*(Tc+K-1)+m];
                if(i==Tn-1 && j==Tr+K-1-1 && m==Tc+K-1-1)
                    ifm.last = true;
                else
                    ifm.last = false;
                input_dma_I.write(ifm);
            }
        }
    }

    // Run the function under test
    read_data(input_dma_I,y);

    // Print the difference between each input and the returned value
    for(int i=0;i<Tn;i++){
        for(int j=0;j<Tr+K-1;j++){
            for(int m=0;m<Tc+K-1;m++){
                printf("gap: %f\n", input[i*(Tr+K-1)*(Tc+K-1)+j*(Tc+K-1)+m]-y[i*(Tr+K
            }
        }
    }
    printf("Done!\n");
    return 0;
}
```

- Golden results generation code for Lab 4

```
import torch
import torch.nn as nn

# Tile sizes: output rows/columns, kernel size, input/output channels
Tr = 16
Tc = 16
K = 3
Tn = 3
Tm = 10

# Random input tile (including the K-1 border) and random weights
input = torch.rand(1, Tn, Tr + K - 1, Tc + K - 1)
weight = torch.rand(Tm, Tn, K, K)

# Bias-free convolution sized to match the weight tensor
conv = nn.Conv2d(Tn, Tm, K, bias=False)
conv.weight = torch.nn.Parameter(weight)
output = conv(input)

torch.set_printoptions(precision=16)
torch.set_printoptions(profile="full")

# Emit the tensors as C array initializers for the HLS test bench
print("FPGA_DATA input[Tn*(Tr+K-1)*(Tc+K-1)] = {", end=" ")
for elem in list(input.flatten()):
    print(float(elem), end=",")
print("};")
print("FPGA_DATA weight[Tm*Tn*K*K] = {", end=" ")
for elem in list(weight.flatten()):
    print(float(elem), end=",")
print("};")
print("FPGA_DATA output[Tm*Tr*Tc] = {", end=" ")
for elem in list(output.flatten()):
    print(float(elem), end=",")
print("};")
print()
```

https://colab.research.google.com/drive/1TufHcDNMftm3bwAfcKEF5Njev0y_v6rM#scrollTo=X3pBQmyNW4rs

#### Export RTL as IP Core – <u>Lecture 13</u>

[Screenshot: the "Export RTL as IP/XO" dialog with export format "Vivado IP (zip)", and the resulting `read_data_0` IP block with ports `ap_ctrl`, `input_dma_I_V`, `ap_clk`, and `ap_rst_n`]

#### Import the IP into Block Design – <u>Lecture 13</u>



#### **Goal: Enable AI for Everyone – <u>Lecture 14</u>**

#### AI Democratization — Two Levels




## Datasets/Applications, Hardware, and Neural Networks – Lecture 14

#### Datasets / Applications



#### **Hardware Platforms**






#### AutoML: Hardware-Aware NAS – <u>Lecture 14</u>



#### AutoML: Network-FPGA Co-Design Using NAS – Lecture 14



#### How to Conduct Neural Architecture Search – Lecture 14

#### Selection of the Backbone Architecture

- VGG (NAS with RL, FNAS), GoogLeNet (NASNet), MobileNet (FBNet, ProxylessNAS), etc.

#### Determination of the Search Space

- **Software:** Number of Channels, Kernel Size, Convolution Type, etc.
- **Hardware:** Loop Tiling Parameters, Loop Order, Schedule, etc.

#### Optimization Approaches

- Deep Reinforcement Learning: RNN based controller
- Gradient Descent: DARTS
- Metaheuristics: Swarm

#### Optimization Objective(s):

- Software: Accuracy, Robustness, Fairness, etc.
- Hardware: Latency, Chip Area, Energy Efficiency, etc.
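The four ingredients (backbone, search space, optimizer, objectives) can be seen together in a toy co-search loop. This is only a sketch under loud assumptions: plain exhaustive enumeration stands in for the RL, gradient-based, and swarm optimizers listed above, and `proxy_accuracy` and `proxy_latency` are made-up formulas, not measurements of any real network or FPGA.

```python
import itertools

# Toy software/hardware co-search over a tiny joint space.
# Both proxy functions are made-up stand-ins for real evaluation.

SOFTWARE_SPACE = {"channels": [16, 32, 64], "kernel": [3, 5]}
HARDWARE_SPACE = {"tile": [4, 8, 16]}

def proxy_accuracy(channels, kernel):
    # Hypothetical: larger networks score higher.
    return 0.5 + 0.004 * channels + 0.01 * kernel

def proxy_latency(channels, kernel, tile):
    # Hypothetical: work grows with channels * K^2, tiling divides it.
    return channels * kernel * kernel / tile

def search(latency_budget):
    """Return the most accurate configuration that meets the latency budget."""
    best = None
    for ch, k, t in itertools.product(SOFTWARE_SPACE["channels"],
                                      SOFTWARE_SPACE["kernel"],
                                      HARDWARE_SPACE["tile"]):
        latency = proxy_latency(ch, k, t)
        if latency > latency_budget:
            continue  # violates the hardware objective
        accuracy = proxy_accuracy(ch, k)
        if best is None or accuracy > best["accuracy"]:
            best = {"channels": ch, "kernel": k, "tile": t,
                    "accuracy": accuracy, "latency": latency}
    return best

best = search(latency_budget=40)
print(best["channels"], best["kernel"], best["tile"], best["latency"])
# 64 3 16 36.0
```

Treating accuracy as the objective and latency as a constraint is one possible formulation; weighted sums or Pareto fronts over several objectives are alternatives.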

#### **Programming Platform**




**George Mason University** 

4400 University Drive Fairfax, Virginia 22030 Tel: (703)993-1000