# FPGA Acceleration on a Multi-Layer Perceptron Neural Network for Digit Recognition

Isaac Westby  $\cdot$  Xiaokun Yang<br/>\* $\cdot$  Tao Liu $\cdot$ Hailu Xu

Received: date / Accepted: date

Abstract This paper proposes field-programmable gate array (FPGA) acceleration on a scalable multi-layer perceptron (MLP) neural network (NN) for classifying handwritten digits. First, an investigation to the network architectures is conducted to find the optimal FPGA design corresponding to different classification rates. As a case study, then a specific single-hidden-layer MLP network is implemented with an eight-stage pipelined structure on Xilinx Ultrascale FPGA. It mainly contains a timing controller designed by Verilog Hardware Description Language (HDL) and sigmoid neurons integrated by Xilinx IPs. Finally, experimental results show a greater than  $\times 10$  speedup compared with prior implementations. The proposed FPGA architecture is expandable to other specifications on different accuracy (up to 95.82%) and hardware cost.

Isaac Westby University of Houston Clear Lake, 2700 Bay Area Blvd, Houston, TX 77058, US

Xiaokun Yang<sup>\*</sup> University of Houston Clear Lake, 2700 Bay Area Blvd, Houston, TX 77058, US \*E-mail: YangXia@uhcl.edu https://sceweb.sce.uhcl.edu/xiaokun/

Tao Liu Lawrence Technological University, 21000 West Ten Mile Road, Southfield, MI 48075, US

Hailu Xu California State University Long Beach 1250 Bellflower Blvd., Long Beach, CA 90840, US **Keywords** digit recognition, field-programmable gate array (FPGA), multi-layer perceptron (MLP), modified national institute of standards and technology (MNIST), neural network (NN)

#### **1** Introduction

Neural networks (NNs) have become an indispensable technique for a wide range of applications such as image classification, natural language processing, speech recognition, and many more [1–4]. Specifically in the era of edge computing, to limit the complexity of NN in the power-constrained and latencycritical scenarios is a big challenge. Many researches thus focused on mapping NN models onto field-programmable gate array (FPGA) with the benefits of high parallelism and energy efficiency [5,6].

The implementation of NN on FPGA is much harder than that on CPUs and GPUs. The development framework like Caffe and Tensorflow for CPU and GPU is absent for FPGA [7,8]. Most of existing FPGA designs of NN are based on software-hardware co-design platforms such as Xilinx Zynq FPGA and Intel HARPv2, where a CPU host and an FPGA in the same chip or package are integrated [9–11]. For example, the Zynq SoC family contains software programmability of an ARM-based processor with the hardware programmability of an FPGA [34,35]. The flexibility to use the programming system (PS) can significantly reduce the design work on Hardware Description Language (HDL); nonetheless, the utilization of the ARM core is very costly in terms of FPGA slice count and delay.

Another research direction to the FPGA accelerator on NNs is based on the High-Level Synthesis (HLS) tools like Xilinx Vivado [12–14]. HLS allows users to build a network by using high level language like C or C++, and then convert the design into register-transfer level (RTL). However, the high-level description has limitation to optimize real-life design metrics to meet timing and power requirement, making the design performance lower than the specific hardware implementation [15].

Therefore, in this paper a multi-layer perceptron (MLP) neural network, including RTL design to the controller and an integration with multiple Vivado IPs, is presented to perform a practical application of handwritten digits recognition. The database of Modified National Institute of Standards and Technology (MNIST), which was developed by Yann LeCun, Corinna Cortes and Christopher Burges, is used to build the MLP network and evaluate the accuracy of the NN models [32]. Specifically, the contributions are below.

- This paper first conducts an investigation to several design architectures of the MLP neural network related to different quality results. To show a case study, a single-hidden-layer MLP network is implemented with an eight-stage pipelined structure on Xilinx Ultrascale FPGA. Though a specific design is demonstrated in this paper, the proposed design structure is expandable and scalable to different accuracy constraints and hardware specifications. - Experimental results show that our proposed work can achieve a latency of 1.55 microseconds per digit recognition with an accuracy of 93.25%. To the best of our knowledge, this is the minimum inference latency compared to those of existing works. Additionally, the FPGA slice count and energy consumption are evaluated as well.

The remainder of this paper is organized as follows: Section 2 introduces the related works to the application of handwritten digit recognition with FPGA, and Section 3 presents the background of MLP network. In Section 4, the design architecture of the MLP is discussed. The implementation of the NN is further described in Section 5. In Section 6, the FPGA design performance is evaluated in terms of latency, slice count, energy consumption. Finally, Section 7 concludes this paper.

#### 2 Related Works

The practical application of handwritten digits recognition has been performed by numerous researches to overcome challenges such as reducing the computational complexity [16–18] and increasing the calcification correctness [19–21]. However, this paper focuses on finding the minimum latency corresponding to different quality bounds of classifying images.

The computational speed of handwritten numeral digit recognition has been greatly improved by using hardware accelerator in the past few years. For example, two HLS FPGA designs on LeNet-5 CNN were presented in [22] and [23], achieving a latency of 3.58 ms and 3.2 ms with accuracy of 98.64% and 96%, respectively. The implementations on LeNet-5 CNN contained three convolutional layers, two average pooling layers, and two fully connected layers, in addition to the input and output layer.

Logic design on FPGA can further reduce the latency of NNs with the benefit of computational parallelism. As an example, a deep neural network (DNN) was implemented on Xilinx Zync-7020 FPGA in [24]. By optimizing the scheduling of input memory and weight memory, the proposed work can reach a latency of 640 us with 100 MHz clock. Additionally, in [28] a CNN network was built end-to-end using a reconfigurable IP core and then implemented on an Intel Cyclone10 FPGA. Experimental result showed that the design spent 17.6 us to recognize a handwritten digital picture with an accuracy rate of 97.57%.

To reduce the design complexity on NNs, authors of [29] presented a Super-Skinny CNN (SS-CNN) with 39,541 parameters and only three layers in addition to input and output layer. The implementation on a Cyclone IVE FPGA achieved a latency of 2.2 seconds including both training (55,000 images) and inference (10,000 images). Another two-layer MLP network, containing only one input layer and one output layer, was presented by the same authors [30]. Using a 25 MHz clock frequency, the design took 3.8 seconds to train 55,000 images and recognize 10,000 handwritten digits. In this paper, a scalable MLP network architecture is proposed, aiming to significantly reduce the latency by slightly decreasing the classification rate. The MNIST data base is used to evaluate different design structures corresponding to different quality constrains. Though a case study on a singlehidden-layer design is finally implemented on FPGA, the proposed network architecture is expandable to meet different specifications on latency, accuracy, and hardware cost.

## **3** Fundamental Theorem of MLP Neural Network

This section discusses the fundamental knowledge of MLP networks, including the sigmoid neurons and the data set used for training the network. Finally, the way for finding the optimal design structure is depicted.

#### 3.1 Sigmoid Neurons

Due to the benefit that a sigmoid neuron is much smoother than the step functional output from perception, a sigmoid neural network is performed in our work [33]. Generally the output for the sigmoid neuron can be written as

$$sigmoid\_neuron\_output = \frac{1}{1 + exp(-\sum_{i=1}^{n} w_i \times x_i - b)}$$
(1)

where  $w_i$  denotes the weight corresponding to the input  $x_i$ , and b represents the bias. The parameter n demotes the number of input neurons. By making small changes to the weights and biases of the sigmoid neuron, small changes to the output would occur, eventually converging on a 'correct' or most effective set of weights and biases.

From the hardware perspective, the design on each sigmoid neuron needs the hardware designs on multiplication, addition, subtraction, accumulation, exponential, and reciprocal.

### 3.2 Finding the Design Structure of the MLP Network

In this paper, The MNIST handwritten digit data set is used to train the MLP network [32]. This data set has 60,000 handwritten digits with corresponding labels that can be used to train the network. There is then a separate set of 10,000 different handwritten digits with labels that can be used to estimate the network. Once the method for finding the accuracy of the network with a trained set of weights and biases have been established, the goal is to decide on a network design.

First, the input layer would require 784 neurons due to the fact that the MNIST digit images to train the network are  $28 \times 28$  input pixels in size.

The second thing that would need to remain static, is the fact there would be 10 output neurons because there are 10 possible outputs (0-9) that would converge to a value of around 1. This leaves the number of hidden layers, as well as the number of neurons in each hidden layer as the values that can be adjusted.

Once a network design (number of hidden layers and number of nodes in each layer) has been decided, the values of epochs used, mini\_batch\_size, and learning rate can be tweaked in order to find the most accurate set of weights and biases. These values are static with values of epoch = 30, mini\_batch\_size = 10, and learning rate = 3.0, when comparing different network designs.

## 4 Proposed Design Architecture

This section discusses the design methodology of choosing a network. First, we start an investigation to several single-hidden-layer and two-hidden-layer networks corresponding to different quality specifications. In what follows, one of the network structures is chosen as a case study to the design on FPGA acceleration.

#### 4.1 Comparing Different Networks

As stated above, various network designs are considered to find a network that would be able to be implemented with relative simplicity but still attain a high accuracy. When testing different networks, the epoch, mini\_batch size, and learning rate all remain the same, but the number of hidden layers and neurons in each layer are changed. First the one-hidden layer networks are tested: [784, 50, 10], [784, 30, 10], [784, 20, 10], [784, 16, 10], [784, 12, 10], [784, 10, 10], [784, 8, 10], [784, 5, 10], [784, 4, 10], [784, 3, 10], [784, 2, 10], [784, 1, 10], where the first number is the input layer number, the second number is the number of neurons in the hidden layer, and the third number is the output layer neurons.

When using two-hidden layers, the following networks are tested: [784, 50, 50, 10], [784, 30, 30, 10], [784, 20, 20, 10], [784, 16, 16, 10], [784, 12, 12, 10], [784, 10, 10, 10], [784, 8, 8, 10], [784, 5, 5, 10], [784, 4, 4, 10], [784, 3, 3, 10], [784, 2, 2, 10], [784, 1, 1, 10], where the second and third numbers denote the value of neurons used in the first and second hidden layers.

The accuracy of the one-hidden layer (blue dots) and two-hidden layer (orange dots) networks is summarized in Fig. 1(a). It can be observed that as the number of neurons in the hidden layer(s) increases, the accuracy increases exponentially and asymptotically approaches a value of 1.0. In Fig. 1(b) and Fig. 1(c), it shows that for networks that have 10 neurons or greater in the hidden-layer(s), the difference in accuracy is especially small. This is taken into consideration when deciding the final network design.



Fig. 1 Comparing accuracy of MLP networks with a single hidden layer and two hidden layers.

4.2 Adjusting Epoch, mini\_batch\_size, and Learning Rate

In what follows, three parameters – epoch, mini\_batch\_size, and learning rate are adjusted when training the network. These values are important to how quickly the network's weights and biases can be trained to the highest attainable digit recognition accuracy.

The mini\_batch is used to set the number size of the batches that are used to train the weights and biases. When setting learning rate = 3.0, the result in Fig. 2(a) from changing the mini\_batch shows that the highest digit recognition accuracy is achieved by the networks of mini\_batch=30 after 10 epochs.

When setting mini\_batch = 30, the accuracy result from changing learning rate is shown in Fig. 2(b). It can be observed that the highest digit recognition accuracy occurs when learning rate is 3.0. For these results there is virtually no discernible difference in the performance of training for learning rates between 2.0 - 10.

The final test is conducted in Fig. 2(c) using four different combinations across 60 epochs. The combinations used here are, mini\_batch = 10 & learning rate = 3.0, mini\_batch = 10 & learning rate = 0.1, mini\_batch = 30 & learning rate = 3.0. By testing these varying parameters, it is able to figure out which combination will lead to the most accurate set of weights and biases in the shortest amount of time. Since the biases and weights are only going to be generated one time, then used



Fig. 2 A graph comparing the digit recognition accuracy with (a) the mini-batch size, (b) learning rate, and (c) the combination

in the network after that point, this study trains the network for many more epoch, so that the accuracy can be as high as possible.

#### 4.3 Final Network Design

In conclusion, networks with two hidden layers perform better than networks with single hidden layer, but this increase is very limited. Second, at least 10 neurons in the hidden layer are needed to reach a classification rate over 90%. Therefore, a single-hidden-layer MLP network is chosen as a case study, and further 12 sigmoid neurons in the hidden layer is instantiated. Notice that the design structure is expandable to achieve higher accuracy with more hardware cost, or reversely, to trade the design accuracy for less hardware consumption.

Fig. 3 shows the network structure, containing 784 input neurons, 12 hidden neurons, and 10 output neurons. It results in  $784 \times 12 = 9408$  weights and 12 biases in the hidden layer, and  $12 \times 10 = 120$  weights and 10 biases in the



Fig. 3 A figure showing the final network design: 784 input pixels to the input layer,  $784 \times 12$  weights and 12 biases to the hidden layer, and  $12 \times 10$  weights and 10 biases to the output layer.

output layer. Totally there would be thus 9,550 parameters stored into FPGA buffers.

Once a network has been chosen, the weights and biases are generated by running the python program [33]. For this task, the techniques with Epoch = 60, mini\_batch\_size = 30, and Learning Rate = 3.0 are used. There is a random nature to these numbers, so the system is run multiple times until a set of weights and biases are obtained which achieves the highest accuracy of 93.25%.

# **5** Implementation

In this section, the ML algorithm is broken down by hardware operations. Further, different architectural designs to the algorithm are analyzed and evaluated by hardware implementations.

#### 5.1 Non-Pipelined Hardware Design Architecture

Generally the NN can be divided into the hidden layer neurons and the output layer neurons. First the equation for the output of the hidden layer neuron j can be rewritten:

$$hidden\_neuron\_output(j) = \frac{1}{1 + \exp(-\sum_{i=1}^{784} w_{i,j} \times x_i - b_j)}$$
(2)

where *i* specifies the number of input pixels (ranges from 1 to 784) and *j* denotes the number of the hidden neurons (ranges from 1 to 12). This allows us to break the problem of finding the output of a hidden layer neuron down into two parts. The first part is to multiply all 784 input pixels  $(x_i)$  by their corresponding weights  $(w_{i,j})$ , then sum those values into one result which can be formulated as  $\sum_{i=1}^{784} w_{i,j} \times x_i$ . A visual representation of this process is shown in Fig. 4(a). Assuming that each hardware operation has a latency of one clock cycle, in the first clock cycle 784 multipliers are needed to multiply the input pixels by their correct weights, and then 10 cycles of cascading adders to sum the results up into one final value.

The second part is to take that summed value, denoted as  $S_j = \sum_{i=1}^{784} w_{i,j} \times x_i$ ), and plug it into the sigmoid function formulated as  $\frac{1}{1+\exp(-S_j-b_j)}$ . This involves five different operations: 1) taking the negative value of the summed result, 2) subtracting bias, 3) taking exponential to that value, 4) adding a value of 1, and finally 5) taking the reciprocal. As shown in Fig. 4(b), this stage takes another 5 clock cycles by assuming a latency of one cycle for every operation. Putting the two stages together, it spends a total of 16 clock cycles in order to find the output of a hidden layer neuron.

| 11           |                      |                       |                       |                       |          |          | Adder10_1 |           |                          |                          |                           |                           | -           |
|--------------|----------------------|-----------------------|-----------------------|-----------------------|----------|----------|-----------|-----------|--------------------------|--------------------------|---------------------------|---------------------------|-------------|
| 10           |                      |                       |                       |                       |          | Adder9_1 | :         | Adder9_2  |                          |                          |                           |                           |             |
| 6            |                      |                       |                       |                       | Adder8_1 |          | :         | Adder8_3  |                          |                          |                           |                           | n neuron    |
| 8            |                      |                       |                       |                       | Adder7_1 |          | :         | Adder7_6  |                          |                          |                           |                           | the hidde   |
| 7            |                      |                       |                       | Addare 1              | T        |          | :         | Addare 12 | 7T <sup>-</sup> O ISNNY  |                          |                           |                           | dders with  |
| 9            |                      |                       |                       | Adder5_1              |          |          | :         |           | Adder5_24                |                          |                           |                           | liers and a |
| 5            |                      |                       | A Manufacture of      | T <sup>-</sup> +Janny |          |          | :         |           | Addrew 40                | Addel 4-49               |                           |                           | the multip  |
| 4            |                      |                       | Adder3_1              |                       |          |          | :         |           |                          | Adder3_98                |                           |                           | intation of |
| e            |                      |                       | Adder2_1              |                       |          |          | :         |           |                          | Addard 106               | OCT 7 IDDAY               |                           | ual represe |
| 2            | Addard 4             | T <sup>-</sup> T IDDD | C Profiles            |                       |          |          | :         |           | 100 Probby               | Tec_t laund              | COC 1                     | Adder 1_392               | (a) A vis   |
| 1            | Multiplier1          | Multiplier2           | Multiplier3           | Multiplier4           |          |          | •         |           | Multiplier781            | Multiplier782            | Multiplier783             | Multiplier784             |             |
| Clock Cycle: | Pixel1<br>Weight_x_1 | Pixel2<br>Weight x 2  | Pixel3<br>Weights_x_3 | Pixel4<br>Weights_x_4 |          |          | :         |           | Pixel781<br>Weight_x_781 | Pixel782<br>Weight_x_782 | Pixel783<br>Weights x 783 | Pixel784<br>Weights_x_784 |             |

Clock Cycle:

 Adder10\_1\_result
 Negative
 sub bias
 exp
 add 1.0
 Recip
 Neuron\_final\_result

 (b)
 A visual representation of the sigmod with the hiden neuron

 ${f Fig.}~4~$  A visual representation of the operations of a single hidden layer neuron.

| Clock Cycle: | 1                     | 2        | 3         | 4            | 5        | 6        | 7        | 8   | 9       | 10    | 10          |
|--------------|-----------------------|----------|-----------|--------------|----------|----------|----------|-----|---------|-------|-------------|
| Hidden_out1  | N As a lating line of |          |           |              |          |          |          |     |         |       |             |
| Weight_x_1   | wintiplier1           | Addor1 1 |           |              |          |          |          |     |         |       |             |
| Hidden_out2  | Multiplier2           | Adder1_1 |           |              |          |          |          |     |         |       |             |
| Weight_x_2   | wiutipiierz           |          | Addor 2 1 |              |          |          |          |     |         |       |             |
| Hidden_out3  | Multiplier2           |          | Adderz_1  |              |          |          |          |     |         |       |             |
| Weights_x_3  | wultipliers           | Addorf 2 |           | Addor2 1     |          |          |          |     |         |       |             |
| Hidden_out4  | Multiplier/           | Adder1_2 |           | Adder5_1     |          |          |          |     |         |       |             |
| Weights_x_4  | Wultiplier4           |          |           |              |          |          |          |     |         |       |             |
|              |                       |          |           |              | Adder/ 1 | Negative | sub bias | evn | add 1 0 | Recin | Neuron_     |
| •••          | •••                   | •••      | •••       | •••          | Adder+_1 | Negative | 300 0103 | Слр | auu 1.0 | Recip | final_resul |
| Hidden_out9  | Multiplier9           |          |           |              |          |          |          |     |         |       |             |
| Weight_x_9   | Watepiers             | Adder1 5 |           | Register2 3  |          |          |          |     |         |       |             |
| Hidden_out10 | Multiplier10          | Adder1_5 |           | iteBisterz_3 |          |          |          |     |         |       |             |
| Weight_x_10  | Waterplier 10         |          | Adder2 3  |              |          |          |          |     |         |       |             |
| Hidden_out11 | Multiplier11          |          | //ddcr2_5 |              |          |          |          |     |         |       |             |
| Weights_x_11 | Manpherii             | Adder1 6 |           |              |          |          |          |     |         |       |             |
| Hidden_out12 | Multiplier12          | /////1_0 |           |              |          |          |          |     |         |       |             |
| Weights x 12 | wultiplier12          |          |           |              | 1        |          |          | 1   | 1       |       |             |

Fig. 5 A visual representation of the operations of a single output layer neuron.

Then the output layer neurons are considered. Specifically the hardware involves taking the outputs of the hidden-layer neurons, then plugging those values into the output layers below:

$$final\_neuron\_output(k) = \frac{1}{1 + \exp(-\sum_{j=1}^{12} w_{j,k} \times hidden\_neuron\_output_j - b_k)}$$
(3)

where k ranges from 1 to 10 as the number of output neurons. This output layer is implemented in a similar manner as the hidden layer. A visual representation is shown in Fig. 5. In this stage, it takes only 4 cycles of adders to sum the multiplication results, allowing the entire process of calculating the result of the final layer neurons to take only 10 clock cycles.

In summary, the design described above contains the operations needed for one neuron in the hidden layer, and one neuron in the output layer. The hardware design on a non-pipelined network thus can be implemented for every neuron in the hidden layer, as well as every neuron in the output layer by repeating the operations 12 and 10 times respectively. Specifically, one neuron in the hidden layer spends 784 floating-point multiplyers, and one neuron in the output layer spends 12 floating-point multiplyers, hence the non-pipelined architecture will use  $784 \times 12 + 12 \times 10 = 9528$  floating-point multiplyers. Similarly, 9528 adders are needed including 9,408 adders in the hidden layer and 120 adders in the output layer. Though the latency is only 26 cycles the non-pipelined architecture is very costly on hardware and not affordable by most advanced FPGA boards.

| Design             | Latency  | Hardware Cost |      |     |     |     |  |  |  |
|--------------------|----------|---------------|------|-----|-----|-----|--|--|--|
| Structures         | (Cycles) | MUL           | ADD  | SUB | EXP | REC |  |  |  |
| Non-pipelined      | 26       | 9528          | 9528 | 44  | 22  | 22  |  |  |  |
| Fully Pipelined    | 48       | 796           | 796  | 4   | 2   | 2   |  |  |  |
| 8-statge Pipelined | 129      | 110           | 110  | 4   | 2   | 2   |  |  |  |

 Table 1
 A comparison of the resources needed for a pipelined, non-pipelined design, and pipelined with 98 multipliers designs.

#### 5.2 Pipelined Hardware Design Architecture

To reduce the hardware cost, a fully pipelined design architecture is further presented. Instead of duplicating the design neuron 12 times in the hidden layer and 10 times in the output layer, the design on the single neuron can be reused over clock cycles. The comparison between resource cost of the nonpipelined design and pipelined design is shown in Table 1. It can be observed that even though the non-pipelined approach is nearly twice the speed of the pipelined approach, it uses nearly 12 times the resources.

In order to further reduce the hardware cost, the multiplication of the first layer weights and inputs within the hidden layer neuron can be continuously broken down. For example, an eight-stage pipelined structure for the hidden layer neurons is shown in Fig. 6(a). This design keeps the same pipelined structure for the output level neurons, but for the hidden layer neurons, instead of 784 multipliers followed by adders it will only need 98. Hence, it is able to process 98 inputs with each iteration. So eight iterations will be executed in order to process all the inputs for a neuron. As the results from each eighth of the multiplications come through they are added together in an accumulator.

| 8            |             |                |            |             | -                                        |             |           | -           |          | Adder7_1 |          | -           |                  |              |                        |            |              | -          |               |             |
|--------------|-------------|----------------|------------|-------------|------------------------------------------|-------------|-----------|-------------|----------|----------|----------|-------------|------------------|--------------|------------------------|------------|--------------|------------|---------------|-------------|
| 7            |             |                |            |             |                                          |             |           | Addare 1    | T-nianne |          | Addarf 7 |             |                  |              |                        |            |              |            |               | bid on a    |
| 9            |             |                |            |             |                                          |             | 6 J       | Adder_1     |          | :        |          | Addare 0    |                  |              |                        |            |              |            |               |             |
| 5            |             |                |            |             |                                          | AddarA 4    |           |             |          | :        |          |             | AddarA C         | Auuer4_0     |                        |            |              |            |               |             |
| 4            |             |                |            |             | V CHEFT                                  |             |           | •           |          | :        |          |             |                  | CF CAOPPV    | 7T <sup>-</sup> cianne |            |              |            |               | tinline of  |
| e            |             |                |            | 1 Croppy    |                                          |             |           |             |          | :        |          |             |                  |              | Addry 11               | Auuei 2_24 |              |            |               | h. 00 mul   |
| 2            |             | Addord 1       | T_T IANNA  |             |                                          | C Property  | Adder 1_2 |             |          | :        |          |             | Addev1 10        | Auuel 1_40   |                        |            | Adder1 10    | Adder 1_49 |               | + Jo coitot |
| 1            | Multipliced | ואומות הוובו ד | Croinition | Multiplierz | (11) 11 11 11 11 11 11 11 11 11 11 11 11 | Multipliers |           | Multiplier4 |          | :        |          | Multipliced | . ระเอแต่เวเทเงเ | Multiplicede | ואומונוחוובו אס        |            |              | 00         | - wuunpiieraa | aoponada la |
| Clock Cycle: | Pixel1      | Weight_x_1     | Pixel2     | Weight_x_2  | Pixel3                                   | Weights_x_3 | Pixel4    | Weights_x_4 |          | :        |          | Pixel95     | Weight_x_95      | Pixel96      | Weight_x_96            | Pixel97    | Weights_x_97 | Pixel98    | Weights_x_98  | (a) A trian |

5 2 212 Ľ, (a)

| 21 22        | 0 Recip Neuron_final_result |
|--------------|-----------------------------|
| ) 20         | p add 1.                    |
| 8 15         | bias ex                     |
| 17 18        | Negative sub                |
| 16           | Accum8                      |
| 15           | Accum7                      |
| 14           | Accum6                      |
| 13           | Accum5                      |
| 12           | Accum4                      |
| 11           | Accum3                      |
| 10           | Accum2                      |
| 6            | Accum1                      |
| Clock Cycle: | Adder7_1_result             |

 ${\bf Fig.~6}$  A visual representation of the operations of a single hidden layer neuron.



Fig. 7 A visual representation of all the components of the design put together.

An idea of how the accumulator fits into the output of the hidden layer neurons is shown in Fig. 6(b). Specifically, it can be seen that at cycle nine the first value from the multiplier-adder comes in. For eight cycles these values are accumulated to get one final value for all 784 inputs. Once this value has been found, the result is just sent into the same sequence, of taking the negative, subtracting bias, taking exponential, adding 1.0, then taking the reciprocal. This entire process ends up taking 21 cycles to find the result.

A comparison of the resource utilization and latency between pipelined and non-pipelined designs is shown in Table 1. It can be observed that the eight-stage pipelined structure consumes more clock cycles but only spends 13.8% the number of hardware components as the fully-pipelined design, and 1.15% the number of hardware as the non-pipelined design.

The proposed design structure can be expanded to different pipelined levels with different specifications to resource cost and computational speed. The higher of the pipelined levels, the less hardware resource is needed but more clock cycles will be taken.

## 5.3 Design and Simulation

In what follows, the case study of the eight-stage structure of the MLP network is designed by Verilog HDL and integrated with multiple Vivado IPs. Basically the design structure can be divided into two stages, as shown in Fig. 7. Stage one is used for finding the output of the hidden layer neurons, and stage two is used for finding the output of the output layer neurons.

In stage one, the 98 multipliers and cascading adders are executed eight times, feeding the result each time into the accumulator. This allows the network to process all 784 input pixels and their corresponding weights. Once the accumulator has accumulated all eight summed values, the results propagate sequentially through the rest of stage one. Once all 12 of the stage one outputs have been calculated, stage two processes the results of the output layer neurons sequentially.

The design on the timing controller is verified by Mentor Graphic Mondelsim, and the final functionality of the digit recognition is tested by using Xilinx Vivado. As an example shown in Fig. 8, the final results of the network can be obtained by looking at the value of the signal 'final\_neuron\_result\_value', when 'cnt6' spanning over hexadecimal '0xa - 0x13'. These results come in

| Name                                                          | Value    |            | 1,59     | 0 ns        | ,600 ns    | 1,610 ns | 1,620    |  |  |  |
|---------------------------------------------------------------|----------|------------|----------|-------------|------------|----------|----------|--|--|--|
| 🕌 clk                                                         | 1        |            | 0        | 1           | 2          |          | 3        |  |  |  |
| > 😻 cnt6[4:0]                                                 | 0a       | 09         | 0a       | ( оь        | 0          | 2        | oa       |  |  |  |
| > 😼 final_nee[31:0]                                           | 37bdb0e4 |            | 37bdb0e4 | 339040      | )de X 3948 | 1839 X   | 36d3fec3 |  |  |  |
| s, 1,630 ns, 1,640 ns, 1,650 ns, 1,660 ns, 1,670 ns, 1,680 ns |          |            |          |             |            |          |          |  |  |  |
| 4                                                             | - 5      |            | 6        | 7           | 8          |          | 9        |  |  |  |
|                                                               | 0f       | <u>X</u>   |          | 0.999975324 |            |          | 13 X     |  |  |  |
| X 358937a6 X                                                  | 37b59523 | <u>Х з</u> | 7a6d32f  | 3f7ffe62    | 34e6f739   | X 379    | 85сы1 🗙  |  |  |  |

Fig. 8 Vivado waveform showing the final results of the network.

such that the value of 'final\_neuron\_result\_value' at 'cnt6 = 0xa' corresponds to detection of a handwritten digit '0', at 'cnt6 = 0xb' corresponds to a detection of a digit '1', etc. Thus the final detection results of the network show that the network has strongly detected a digit '7', with that output being equal to '0.999975324', and all other outputs being nearly '0'.

#### 6 Experimental Results

In this section the performance evaluation is further discussed in terms of speed, slice count, and energy cost. Xilinx Vivado is applied as the synthesis tool with FPGA part xcku035-sfva784-1LV-I.

#### 6.1 Execution Time on FPGA

For the FPGA execution, the latency of a single digit recognition is 1.55 us running on a 100 MHz clock. An equivalent design on Matlab is performed in order to compare the execution time between hardware and software. Generally the software latency is taken on a personal computer that has the following processor, Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz, 2904 Mhz, 2 Core(s), 4 Logical Processor(s).

As shown in Table 2, the execution time in software varies every time running it with the longest total execution time taken into consideration, and the fastest taken into consideration. The execution time for the software implementation is based upon 10,000 input images, so the total execution time is divided by 10,000 in order to find the execution time per image. The Speedup is found by a simple formula  $Speedup_A = \frac{(ExecutionTime)_B}{(ExecutionTime)_A}$ .

Specifically shown in the fourth column, the longest measured time per image in software is 197.19 us, and the fastest is 62.61 us. This leads to a speedup in hardware of  $127.2 \times$  over the longest software run, and a speedup of  $40.4 \times$  over the fastest software run. When further taking the difference of the clock frequency (100 MHz in hardware v.s. 2.7 GHz in software) into

| Compa    | arison  | Time for<br>10,000 Images (s) | Time per<br>Image (us) | Speedup |              |
|----------|---------|-------------------------------|------------------------|---------|--------------|
| Software | Fastest | 0.6261488                     | 62.61                  | n/2     |              |
| (Matlab) | Longest | 1.9718928                     | 197.19                 | 11/a    |              |
| Hardware |         | n/a                           | 1.55                   | 40.4    | over fastest |
| (FPC     | GA)     | II/a                          | 1.00                   | 127.2   | over longest |

Table 2 A comparison of the execution times in software and hardware.

Table 3 A further breakdown of the utilization results.

| Resource            | CLB LUTs | CLB Registers |
|---------------------|----------|---------------|
| digitrec (top)      | 44,668   | 14,274        |
| acc_add             | 1,880    | 656           |
| mul_98              | 30,100   | 11,255        |
| mul_12              | 3,473    | 1,330         |
| u0_neuron_finish    | 1,575    | 311           |
| u1_neuron_finish    | 1,543    | 311           |
| all blk_mem (total) | 6,097    | 411           |

Table 4 The energy consumption on FPGA

|                | D       | ynamic 1 | Energy | Static Energy | Total |      |  |
|----------------|---------|----------|--------|---------------|-------|------|--|
|                | Signals | Logic    | DSP    | I/O           | 0.01  | 0.88 |  |
| Energy (mJ)    | 0.35    | 0.49     | 0.03   | 0.00          | 0.01  | 0.00 |  |
| Percentage (%) | 40      | 56       | 3      | 0             | 1     | 100  |  |

consideration, the implementation of a solution in hardware allows significant speedup over a software implementation.

## 6.2 Resource Cost on FPGA

After synthesis, the hardware utilization is shown in Table 3. From the overall summary in the second row, it shows that the design on the MLP network mainly spends 44,668 LUTs and 14,274 FFs. In the further breakdown of the utilization from the third to the eighth row, it can be observed that most of the resource utilization comes from the multiplication modules, which includes 98 single-precision floating-point multipliers in the hidden layer (mul\_98) and additional 12 in the output layer (mul\_12).

## 6.3 Energy Consumption on FPGA

In what follows, the energy dissipation is summarized in Table 4. The total on-chip energy is 0.88 uJ, including 0.87 dynamic energy and 0.01 static energy. This gives a breakdown of 99% dynamic power and 1% static power consumption. Further, in the second and third column it can be observed that the high switching activities on signals and logic take most of the dynamic energy (96%).

| Companicon       | FPGA -                | Accuracy | Hardwa | are Cost | Latonau            |  |
|------------------|-----------------------|----------|--------|----------|--------------------|--|
| Comparison       | Clock (MHz)           | (%)      | LUTs   | FFs      | Latency            |  |
| [22] LeNet-5 CNN | Xilinx ZCU102-100     | 98.64    | 32,589 | 33,585   | $3.58 \mathrm{ms}$ |  |
| [23] LeNet-5 CNN | Xilinx Zync 7Z020-100 | 90-96    | 18,426 | 8,264    | 3.2 ms             |  |
| [24] DNN         | Xilinx Zync 7Z020-100 | 94.67    | 38,899 | 40,534   | 637 us             |  |
| [28] CNN         | lntel Cyclone10-150   | 97.57    | 12,588 | 48,765   | 17.6 us            |  |
| [29] SS-CNN      | lntel Cyclone IVE-30  | 98.8     | 98,000 | -        | 220 us             |  |
| [30] MLP NN      | lntel Cyclone IVE-25  | 89       | 34,000 |          | 380 us             |  |
| Our Case Study   | Xilinx Ultrascale-100 | 93.25    | 44,668 | 14,274   | 1.55 us            |  |

Table 5 Comparison to related works.

#### 6.4 Comparison to Related Works

Finally, the comparison to prior works is emphasized in Table 5. The accuracy shown in the third column is based on the 10,000 test images from the MNIST data set. Our case study to the eight-stage MLP network achieves an accuracy of 93.25%, between those of existing works. By using our proposed MLP architectures, the classification rate can reach 95.82% with a two-hidden-layer network and 50 neurons in each hidden layer.

Then the hardware cost is summarized in the fourth and fifth columns in terms of LUTs and FFs. For all the implementations except for [23], the slice count of LUTs and FFs are similar. The results of [23] in the fourth row doesn't include many hardware functions like sigmoid neurons thus the slice number is less than those of others. The resource cost is highly dependent on the pipelined levels with our proposed work. For example, the hardware cost on multiplications and additions can be reduced by half with a 16-stage pipelined design. In other words, the slice count can be reduced by half for  $2 \times$  level of the pipelined design. It is a trade off between hardware cost and recognition accuracy.

Focusing on FPGA acceleration, the latency to recognize handwritten digits is mainly compared in the last column. It can be observed that our case study achieves the highest speed by using a single-hidden-layer MLP network and logic-only implementation. Specifically, the latency of [22] and [23] are very high due to the limitation of timing and speed optimization using an HLS design. Compared to the DNN and CNN designs in the fifth and sixth rows, our proposed work achieves  $411 \times$  and  $11 \times$  speedup, respectively. Finally, the execution time for our proposed work is greatly less than [30] and [29], because the weights and biases for our work are trained beforehand, while the total execution time for [30,29] also includes training operations.

In summary, our proposed logic-only design on a single-hidden-layer network is able to detect handwritten digits with a significant speedup, and maintain a 93.25% accuracy and similar utilization to exiting works.

## 7 Conclusion

This paper proposes a scalable MLP network for the recognition of handwritten digits. As a case study, a single-hidden-layer design structure is implemented on FPGA, achieving 93% classification rate with just 12 hidden-layer neurons and 9,550 weight and bias parameters. The logic-only design and off-board training parameters enable to significantly decrease the complexity of the final network implementation and provide a low latency when classifying images. Experimental results show a > 10× acceleration over existing works.

#### References

- K. Benidis, S. Rangapuram, V. Flunkert, "Neural forecasting: Introduction and literature overview," arXiv:2004.10240, 2020.
- G. Ismayilov and H. R. Topcuoglu, "Neural network based multi-objective evolutionary algorithm for dynamic workflow scheduling in cloud computing," Future Generation Computer Systems, Vol. 102, PP. 307-322, 2020.
- 3. P. Molchanov, A. Mallya, S. Tyree, I. Frosio, J. Kautz, "Importance Estimation for Neural Network Pruning," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), PP. 11264-11272, 2019.
- C. Son, S. Park, J. Lee, J. Paik, "Deep Learning-based Number Detection and Recognition for Gas Meter Reading," IEIE Trans. on Smart Processing & Computing Vol. 8, No. 5, PP. 367-372, Oct. 2019.
- A. Shawahna, S. M. Sait and A. El-Maleh, "FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review," in IEEE Access, Vol. 7, PP. 7823-7859, 2019.
- 6. E. Nurvitadhi, D. Kwon, A. Jafari, et al., "Evaluating and Enhancing Intel Stratix 10 FPGAs for Persistent Real-Time AI," Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, PP. 119, Feb. 2019.
- K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, "A Survey of FPGABased Neural Network Accelerator," arXiv:1712.08934, Dec. 2017.
- Albert Reuther, et. al, "Survey and Benchmarking of Machine Learning Accelerators," arXiv:1908.11348v1, Aug. 2019.
- D. Gschwend, "ZynqNet: An FPGA-Accelerated Embedded Convolutional Neural Network," arXiv:2005.06892, 2020.
- C. Gao, S. Braun, I. Kiselev, J. Anumula, T. Delbruck and S. Liu, "Real-Time Speech Recognition for IoT Purpose using a Delta Recurrent Neural Network Accelerator," 2019 IEEE International Symposium on Circuits and Systems (ISCAS), PP. 1-5, 2019.
- 11. K. Vaca, A. Gajjar, and X. Yang , "Real-Time Automatic Music Transcription (AMT) with Zync FPGA," IEEE Computer Society Annual Symposium on VLSI (ISVLSI), PP. 378-384, 2019.
- 12. Q. Li, X. Zhang, J. Xiong, W. Hwu, and D. Chen, "Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS," Proceedings of the 24th Asia and South Pacific Design Automation Conference, PP. 693–698, 2019.
- 13. D. Rongshi and T. Yongming, "Accelerator Implementation of Lenet-5 Convolution Neural Network Based on FPGA with HLS," 2019 3rd International Conference on Circuits, System and Simulation (ICCSS), Nanjing, China, 2019, PP. 64-67
- Q. Zhang, J. Cao, Y. Zhang, S. Zhang, Q. Zhang and D. Yu, "FPGA Implementation of Quantized Convolutional Neural Networks," 2019 IEEE 19th International Conference on Communication Technology (ICCT), 2019, PP. 1605-1610
- J. Cong, B. Liu, S. Neuendorffer, et. al, "High-Level Synthesis for FPGAs: From Prototyping to Deployment," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 30, No. 4, PP. 473-491, April 2011.
- Akgun O. C. and Mei J., "An energy efficient time-mode digit classification neural network implementation," Phil. Trans. R. Soc. A.37820190163, 2020.

- Y. Xiang, et. al, "Hardware Implementation of Energy Efficient Deep Learning Neural Network Based on Nanoscale Flash Computing Array," Adv. Materials Technologies, Vol. 4, No. 5, Feb. 2019.
- Y. Ma, J. Guo and W. Wei, "An Exceedingly Fast Model for Low Resolution Handwritten Digit String Recognition," IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT), PP. 282-288, 2019.
- M. B. Abdulrazzaq and J. N. Saeed, "A Comparison of Three Classification Algorithms for Handwritten Digit Recognition," International Conference on Advanced Science and Engineering (ICOASE), PP. 58-63, 2019.
- Ahlawat, S., Choudhary, A., Nayyar, A., Singh, S., and Yoon, B., "Improved Handwritten Digit Recognition Using Convolutional Neural Networks (CNN)," Sensors, Vol. 2020, No.20, 3344, June 2020.
- 21. Ali, S., Shaukat, Z., Azeem, M. et al., "An efficient and improved scheme for handwritten digit recognition based on convolutional neural network," SN Appl. Sci. 1, PP. 1125, 2019.
- 22. M. Cho and Y. Kim, "Implementation of Data-optimized FPGA-based Accelerator for Convolutional Neural Network," International Conference on Electronics, Information, and Communication (ICEIC), PP. 1-2, 2020.
- H. Madadum and Y. Becerikli, "FPGA-Based Optimized Convolutional Neural Network Framework for Handwritten Digit Recognition," 1st International Informatics and Software Engineering Conference (UBMYK), PP. 1-6, 2019.
- Tsai, T.-H., Ho, Y.-C., and Sheu, M.-H, "Implementation of FPGA-based Accelerator for Deep Neural Networks," 2019 IEEE 22nd International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), 2019.
- 25. S. Ghaffari et al, "FPGA-based convolutional neural network accelerator design using high level synthesize," In Intl. Conf. of Signal Processing and Intelligent Syst., 2016.
- 26. Y. Zhou and J. Jiang, "An FPGA-based accelerator implementation for deep convolutional neural networks," 2015 th International Conference on Computer Science and Network Technology (ICCSNT), PP. 829-832, 2015.
- S. Wisayataksin and G. Boonyuu, "A Programmable Artificial Neural Network Coprocessor for Handwritten Digit Recognition," International Conference on Information and Communications Technology (ICOIACT), PP. 139-142, 2019.
- R. Xiao, J. Shi and C. Zhang, "FPGA Implementation of CNN for Handwritten Digit Recognition," IEEE Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), PP. 1128-1133, 2020.
- J. Si, E. Yfantis and S. L. Harris, "A SS-CNN on an FPGA for Handwritten Digit Recognition," 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), PP. 88-93, 2019.
- J. Si and S. L. Harris, "Handwritten digit recognition system on an FPGA," 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), PP. 402-407, 2018.
- 31. I. Westby, H. Koc, J. Lu, and X. Yang , "Accelerating Digit Recognition with Neural Network," The 22nd Int'l Conf on Artificial Intelligence (ICAI 2020), In Press, July 2020.
- Y. LeCun, C. Cortes, and C. Burges, "MNIST handwritten digit database," 2010.
   Nielsen, M. "Neural Networks and Deep Learning," Neural Networks and Deep Learning, Determination Press, neuralnetworksanddeeplearning.com, 2019.
- 34. "Zynq-7000 SoC Data Sheet: Overview," V1.11.1, Xilinx, July 2, 2018.
- 35. "Zynq-7000 SoC Technical Reference Manual," V1.12.2, Xilinx, July 2018.