TFLite Model Quantization | Column | RISC-V MCU Chinese Community

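1 Introduction

1.1 What is model quantization?

In short, model quantization is the conversion of a floating-point model into a fixed-point model.

Models trained with tools such as TensorFlow, PyTorch, or Caffe typically store their weights as float32. Such models are accurate but large, and in memory-constrained settings the model size has to be kept as small as possible. Quantization converts the floating-point model into a fixed-point one, which cuts memory use severalfold (about 4x when float32 values become 8-bit integers) and can speed up inference considerably, at the cost of a small loss in accuracy. It can be thought of as lossy compression: a little accuracy is traded for a much smaller and faster model.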
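As a rough illustration of the float-to-fixed-point mapping, the sketch below applies the affine scheme commonly used for 8-bit quantization, q = round(x / scale) + zero_point, and then maps the integers back to floats. The function names and the example weights are made up for illustration and are not taken from TensorFlow.

```python
def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to int8 codes: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate floats: x ~ (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in codes]


# Hypothetical weights, assumed to lie in [-1.0, 1.0].
weights = [-1.0, -0.5, 0.0, 0.25, 0.9]
scale = 2.0 / 255   # width of the float range divided by the number of steps
zero_point = 0      # symmetric range, so 0.0 maps to code 0

codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(max_error)  # bounded by roughly scale / 2
```

Each weight now occupies one byte instead of four, and the round-trip error stays within about half a quantization step, which is the "lossy" part of the compression.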

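1.2 Why quantize a model?

The table below summarizes the trade-off:

	Parameters	Computation	Memory footprint	Accuracy
Before quantization	Many, full precision	Heavy	Large	High
After quantization	Compressed	Faster	Reduced	Slightly lower

Embedded AI applications place tight constraints on memory and compute speed but can tolerate some loss of accuracy, so model quantization is a natural way to reduce a model's size and computational cost.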

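2 Quantization methods

TensorFlow provides two approaches to quantization:

Quantization-aware training, also called in-training quantization, based on tf.keras
Post-training quantization, where an already trained model is quantized with the TensorFlow Lite converter

They differ as follows:

Post-training quantization is quick to iterate on and easy to use, but tends to lose more accuracy;
Quantization-aware training is harder to use and requires retraining the model, but preserves accuracy better.

Choose between the two by weighing ease of use, iteration time, and model accuracy.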

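3 Post-training quantization

Post-training quantization is the simpler of the two, so this article walks through it using MNIST as the example (quantization-aware training is left for a future article).

3.1 Training the model

The following script trains the model and saves it as mnist_train.h5: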

# mnist_train.py
import tensorflow as tf
from tensorflow import keras

print("TensorFlow version {}".format(tf.__version__))

(train_images, train_labels),(test_images, test_labels) = tf.keras.datasets.mnist.load_data()
class_num = 10

# Train the model
model = tf.keras.models.Sequential([
    keras.layers.Flatten(input_shape=(train_images.shape[1], train_images.shape[2])),  # input_shape=(28,28)
    keras.layers.Dense(512, activation=tf.nn.relu),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(class_num, activation=tf.nn.softmax)
])
model.compile(optimizer='Adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print("model stucture:")
model.summary()

# train
model.fit(train_images, train_labels, epochs=5, batch_size=64)

# evaluate accuracy
loss, acc = model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}% loss: {:5.2f}".format(100*acc, loss))

# Save the trained model in HDF5 format
model.save('mnist_train.h5')
del model  # discard the in-memory model

The log is as follows:

TensorFlow version 2.12.0
model stucture:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 flatten (Flatten)           (None, 784)               0

 dense (Dense)               (None, 512)               401920

 dense_1 (Dense)             (None, 64)                32832

 dense_2 (Dense)             (None, 10)                650

=================================================================
Total params: 435,402
Trainable params: 435,402
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
938/938 [==============================] - 3s 3ms/step - loss: 1.4964 - accuracy: 0.7708
Epoch 2/5
938/938 [==============================] - 3s 3ms/step - loss: 0.5160 - accuracy: 0.8773
Epoch 3/5
938/938 [==============================] - 3s 3ms/step - loss: 0.2994 - accuracy: 0.9254
Epoch 4/5
938/938 [==============================] - 3s 3ms/step - loss: 0.1852 - accuracy: 0.9519
Epoch 5/5
938/938 [==============================] - 3s 3ms/step - loss: 0.1400 - accuracy: 0.9615
313/313 [==============================] - 1s 2ms/step - loss: 0.1657 - accuracy: 0.9590
Restored model, accuracy: 95.90% loss:  0.17

This produces the mnist_train.h5 model.
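The parameter counts in the model summary can be checked by hand, which also gives a first estimate of how much storage the weights need at each precision. The sketch below uses only the layer widths from the training script; the byte figures cover raw parameters only and ignore any file-format overhead.

```python
# Layer widths of the MLP above: 28*28 inputs, then Dense(512), Dense(64), Dense(10).
layer_sizes = [28 * 28, 512, 64, 10]

# Each Dense layer has fan_in * fan_out weights plus fan_out biases.
params_per_layer = [
    fan_in * fan_out + fan_out
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

float32_bytes = total_params * 4  # 4 bytes per float32 value
int8_bytes = total_params * 1     # 1 byte per int8 value

print(params_per_layer, total_params)  # [401920, 32832, 650] 435402
print(float32_bytes, int8_bytes)       # 1741608 435402
```

The total matches the 435,402 parameters reported by model.summary(), so an 8-bit version of this model should need on the order of 435 KB for its parameters.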

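3.2 Post-training quantization

1. Hybrid quantization (weights stored as 8-bit integers, activations kept in floating point):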

# mnist_quant_hybrid.py
import tensorflow as tf
from tensorflow import keras

# Load the h5 model and evaluate its accuracy
(train_images, train_labels),(test_images, test_labels) = tf.keras.datasets.mnist.load_data()
model = keras.models.load_model('mnist_train.h5')  # load the HDF5 file 'mnist_train.h5'
# evaluate accuracy
loss, acc = model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}% loss: {:5.2f}".format(100*acc, loss))

tflite_mnist_model = 'mnist_quant_hybrid.tflite'
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open(tflite_mnist_model, "wb") as f:
    flatbuffer_size = f.write(tflite_model)

print('hybrid: The size of the converted flatbuffer is: %d bytes' % flatbuffer_size)

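# Evaluate the accuracy of the quantized model
# Test the TFLite model's accuracy with Python on the PC
def evaluate(interpreter_path):
    # Load the model and allocate tensors
    interpreter = tf.lite.Interpreter(model_path=interpreter_path)
    interpreter.allocate_tensors()

    # Get the input and output tensor details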
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    import numpy as np
    index = input_details[0]['index']
    shape = input_details[0]['shape']
    acc_count = 0
    image_count = test_images.shape[0]
    for i in range(image_count):
        interpreter.set_tensor(index, test_images[i].reshape(shape).astype("float32"))
        interpreter.invoke()
        output_data = interpreter.get_tensor(output_details[0]['index'])
        label = np.argmax(output_data)
        if label == test_labels[i]:
            acc_count += 1
    print("test_images accuracy is {:.2%}".format(acc_count/(image_count)))

evaluate(tflite_mnist_model)

2. Full integer quantization:

# mnist_quant_full_integer.py
import tensorflow as tf
from tensorflow import keras
import numpy as np
print("TensorFlow version {}".format(tf.__version__))

# Load the h5 model and evaluate its accuracy
(train_images, train_labels),(test_images, test_labels) = tf.keras.datasets.mnist.load_data()
model = keras.models.load_model('mnist_train.h5')  # load the HDF5 file 'mnist_train.h5'
# evaluate accuracy
loss, acc = model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}% loss: {:5.2f}".format(100*acc, loss))

tflite_mnist_model = 'mnist_quant_full_integer.tflite'
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_data_gen():
    for image in train_images[0:100,:,:]:
        yield [image.reshape(-1, train_images.shape[1], train_images.shape[2]).astype("float32")]
# Set the representative dataset used for calibration
converter.representative_dataset = representative_data_gen
# Restrict the converter to int8 built-in ops
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the data type of the model's input and output
converter.inference_input_type = tf.uint8  # or tf.int8
converter.inference_output_type = tf.uint8  # or tf.int8

tflite_model = converter.convert()
with open(tflite_mnist_model, "wb") as f:
    flatbuffer_size = f.write(tflite_model)

print('full_integer: The size of the converted flatbuffer is: %d bytes' % flatbuffer_size)

# Evaluate the accuracy of the quantized model
# Test the TFLite model's accuracy with Python on the PC
def evaluate(interpreter_path):
    # Load the model and allocate tensors
    interpreter = tf.lite.Interpreter(model_path=interpreter_path)
    interpreter.allocate_tensors()

    # Get the input and output tensor details
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    index = input_details[0]['index']
    shape = input_details[0]['shape']
    acc_count = 0
    image_count = test_images.shape[0]
    for i in range(image_count):
        interpreter.set_tensor(index, test_images[i].reshape(shape).astype("uint8"))
        interpreter.invoke()
        output_data = interpreter.get_tensor(output_details[0]['index'])
        label = np.argmax(output_data)
        if label == test_labels[i]:
            acc_count += 1
    print("test_images accuracy is {:.2%}".format(acc_count/(image_count)))

evaluate(tflite_mnist_model)
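
The representative dataset lets the converter observe the range of values flowing through the model so that it can choose quantization parameters. The sketch below shows the idea in plain Python for a uint8 input tensor. It is a simplified min/max stand-in, not the converter's actual calibration algorithm, and calibrate_uint8, quantize_uint8, and the sample data are hypothetical.

```python
def calibrate_uint8(samples):
    """Derive (scale, zero_point) for uint8 from observed float samples."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain 0.0
    scale = (hi - lo) / 255 or 1.0       # avoid a zero scale for constant data
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize_uint8(values, scale, zero_point):
    """Map floats to uint8 codes with the calibrated parameters."""
    return [max(0, min(255, round(v / scale) + zero_point)) for v in values]


# Hypothetical calibration batches of raw pixel values in [0, 255].
samples = [[0.0, 12.0, 255.0], [3.0, 128.0, 200.0]]
scale, zero_point = calibrate_uint8(samples)

print(scale, zero_point)                                       # 1.0 0
print(quantize_uint8([0.0, 128.0, 255.0], scale, zero_point))  # [0, 128, 255]
```

For unnormalized pixels in [0, 255] the scale comes out close to 1 with a zero point of 0, which is presumably why the script above can feed raw uint8 images directly. For inputs with a different range, the real parameters can be read from input_details[0]['quantization'] and applied the same way before calling set_tensor.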

The results are:

h5:                                                    5258320 bytes accuracy: 95.80%
hybrid: The size of the converted flatbuffer is:       441576 bytes  accuracy is 95.78%
full_integer: The size of the converted flatbuffer is: 440312 bytes  accuracy is 95.69%
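
To put these numbers in perspective, the sketch below computes the size ratios and accuracy differences from the figures reported above. One caveat, stated as an assumption: an HDF5 file written by model.save() probably also contains the Adam optimizer state, so the roughly 12x ratio overstates the gain from quantization alone. The last lines therefore compare against the raw float32 weights (435,402 parameters at 4 bytes each).

```python
# Sizes in bytes and accuracies in percent, as reported above.
results = {
    "h5": (5258320, 95.80),
    "hybrid": (441576, 95.78),
    "full_integer": (440312, 95.69),
}

base_size, base_acc = results["h5"]
for name, (size, acc) in results.items():
    ratio = base_size / size
    drop = base_acc - acc
    print(f"{name:>12}: {ratio:5.1f}x smaller, accuracy drop {drop:.2f} points")

# Rough baseline: float32 weights alone, without any file-format overhead.
weights_only = 435402 * 4
hybrid_size = results["hybrid"][0]
print(f"float32 weights only: {weights_only} bytes, "
      f"{weights_only / hybrid_size:.1f}x the hybrid model")
```

Against the weights-only baseline the reduction is close to the 4x expected from replacing float32 with 8-bit values, while the accuracy loss stays within about 0.1 percentage points for both variants.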
