Speeding up TFS models: freezing and converting to half precision

Speed comparison of fp16 and fp32 for the attention punctuation model

NVIDIA-SMI Driver Version: 410.104, CUDA Version: 10.0, served with TensorFlow Serving (TFS) in Docker

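| model | length | batch 1 | batch 8 | batch 16 | batch 32 | batch 64 | batch 128 |
|---|---|---|---|---|---|---|---|
| fp32+freeze | 64 | 2.67ms | 3.95ms | 5.61ms | 10.30ms | 17.45ms | 33.21ms |
| fp32+freeze | 128 | 2.81ms | 6.06ms | 10.50ms | 17.70ms | 34.18ms | 65.86ms |
| fp16+freeze | 64 | 7.94ms | 3.13ms | 4.08ms | 5.36ms | 9.46ms | 16.18ms |
| fp16+freeze | 128 | 2.79ms | 4.05ms | 5.59ms | 9.44ms | 16.17ms | 31.35ms |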

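"freeze" refers to the frozen model, that is, the model after its variables have been converted to constants. The TFS format is used here; for other formats see [Save format conversion](https://blog.csdn.net/qq_43208303/article/details/106528606).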

NVIDIA-SMI Driver Version: 418.56, CUDA Version: 10.1, TensorFlow Version: 1.13 (GPU), run directly on the GPU without TFS

model\batch length 1 8 16 32 64 128
fp32+freeze length64 4.7ms 13.55ms 21.47ms 38.20ms
length128 6.09ms 27.52ms 39.60ms 98.16ms
fp16+freeze length64 7.94ms 29.08ms 49.44ms 82.36ms
length128 10.44ms 54.47ms 80.03ms 170.35ms
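  • The TFS results show a large speedup, and the model file shrinks from 118M (fp32, before conversion) to 74M (fp16, after conversion);

  • Running directly on the GPU gives no speedup with fp16 and is in fact slower. Comparing the TFS and CPU results at the same precision, batch size, and length, the likely cause is that the compute bottleneck takes up too large a share of the time; in terms of raw hardware computation, fp32 is clearly faster than fp16 here;

  • Under TFS the speedup is clear, close to 2x at large batch sizes. Given the smaller model file and the smaller parameter size, the time saved is probably mainly in data reading;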
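
The "close to 2x at large batch sizes" observation can be checked with a little arithmetic. The sketch below copies the latencies from the TFS table above and divides fp32 by fp16 for each batch size. The names `LATENCY_MS` and `speedups` are made up for this illustration.

```python
# Latencies in ms, copied from the TFS table (driver 410.104, CUDA 10.0).
BATCH_SIZES = [1, 8, 16, 32, 64, 128]
LATENCY_MS = {
    ("fp32", 64): [2.67, 3.95, 5.61, 10.30, 17.45, 33.21],
    ("fp32", 128): [2.81, 6.06, 10.50, 17.70, 34.18, 65.86],
    ("fp16", 64): [7.94, 3.13, 4.08, 5.36, 9.46, 16.18],
    ("fp16", 128): [2.79, 4.05, 5.59, 9.44, 16.17, 31.35],
}


def speedups(length):
    """Return {batch_size: fp32_latency / fp16_latency} for one sequence length."""
    fp32 = LATENCY_MS[("fp32", length)]
    fp16 = LATENCY_MS[("fp16", length)]
    return {b: round(a / h, 2) for b, a, h in zip(BATCH_SIZES, fp32, fp16)}


if __name__ == "__main__":
    for length in (64, 128):
        print("length", length, speedups(length))
```

A ratio above 1 means fp16 is faster. The ratio grows with batch size, which is the trend the last bullet describes.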

The conversion code is as follows:

import os
import sys

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import graph_util

import modeling  # the author's local BERT module; BertModel is assumed to be defined there
from modeling import BertConfig

if __name__ == '__main__':
    # Holds the variable values read from the fp32 model.
    weight_list = []
    # 1. Create a graph first.
    loaded_graph = tf.Graph()
    sconfig = tf.ConfigProto(log_device_placement=False)
    with tf.Session(graph=loaded_graph, config=sconfig) as sess1:
        # 2. With a session open, load the fp32 TFS SavedModel. A ckpt also works:
        #    once the graph structure is loaded it can be converted to other save
        #    formats (see the previous post on format conversion).
        tf.saved_model.loader.load(sess1, [tf.saved_model.tag_constants.SERVING], "/mnt/lustre02/jiangsu/aispeech/home/jbl01/80w_offline/cc/saved_model_no_freeze")
        # for op in loaded_graph.get_operations():
        #     print(op.name)
        # 3. Freeze the variables; the result can then be saved as a pb.
        frozen_graph_def = graph_util.convert_variables_to_constants(
            sess1,
            sess1.graph.as_graph_def(),
            ['whichPun/output'])  # note: this is the name of the output op
        with tf.gfile.GFile('./punc_model/pb_model_fp32/graph.pb', mode='wb') as f:
            f.write(frozen_graph_def.SerializeToString())
        # fp32 inputs and outputs; obtainable from Module.input or by graph node name.
        inputs = sess1.graph.get_tensor_by_name("inputs/input_ids:0")
        print(inputs)
        inputmask = sess1.graph.get_tensor_by_name("inputs/input_mask:0")
        print(inputmask)

        pout = sess1.graph.get_tensor_by_name("whichPun/output:0")
        print(pout)

        # A minimal one-token input used as a smoke test.
        aaa = np.asarray([111]).reshape((-1, 1))
        bbb = np.asarray([1]).reshape((-1, 1))
        print(sess1.run([pout], feed_dict={inputs: aaa, inputmask: bbb}))

        ddd = sess1.run(sess1.graph.get_tensor_by_name("bert/encoder/layer_1/attention/self/key/bias:0"), feed_dict={inputs: aaa, inputmask: bbb})
        print("key bias in sess1 from tfs32", ddd)
        print("key bias astype to fp16", ddd.astype(np.float16))
        num = 0
        for v in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES):
            # if num == 24:
            #     print("num in sess1:", sess1.run(v))
            # Some values of variable 24 (a bias) turned into NaN during conversion,
            # so the value before conversion is printed here.
            value = sess1.run(v)
            print(value.shape)
            # Cast the fp32 value read from the TFS model to fp16 and store it.
            weight_list.append(value.astype(np.float16))
            # Alternatively, store the fp32 value and force the cast by changing
            # the dtype definitions in modeling.
            # weight_list.append(value)
            num += 1

    # Build the fp16 model in the default graph (not in loaded_graph).
    model = modeling.BertModel(bert_config="bert_config_online.json", ckpt="/mnt/lustre02/jiangsu/aispeech/home/jbl01/80w_offline/cc/-1")
    print(model.input_ids)
    print(model.input_mask)
    print(model.logit)

    with tf.Session() as sess:
        list_id = 0
        # sess.run(tf.global_variables_initializer())
        # Key step: tf.global_variables_initializer() would initialize the variables;
        # it is commented out and the values are set with assign instead.
        for v in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES):
            print(list_id, v)
            sess.run(tf.assign(v, weight_list[list_id]))
            list_id += 1
        print("weight_list[24]", weight_list[24])

        # 1. Save as ckpt (two lines)
        # saver = tf.train.Saver()

        ccc = sess.run([model.logit],
                       feed_dict={model.input_ids: aaa, model.input_mask: bbb})
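
The comment in the loop above notes that part of one bias variable turned into NaN during conversion. As a hedged sketch of how each variable could be checked while it is cast, the helper below reports values that do not survive the fp32 to fp16 cast. `cast_to_fp16_checked` and its report fields are hypothetical names introduced for illustration; they are not part of the original script, and this check only covers overflow, underflow, and NaN counts, so it may not explain the NaN seen there.

```python
import numpy as np

# Largest finite fp16 value; larger magnitudes become inf when cast.
FP16_MAX = float(np.finfo(np.float16).max)


def cast_to_fp16_checked(name, weights):
    """Cast an fp32 array to fp16 and count values damaged by the cast."""
    weights = np.asarray(weights, dtype=np.float32)
    # Overflow is counted explicitly below, so silence the cast warning.
    with np.errstate(over="ignore", invalid="ignore"):
        half = weights.astype(np.float16)
    report = {
        "name": name,
        "nan_before": int(np.isnan(weights).sum()),
        "nan_after": int(np.isnan(half).sum()),
        "overflow_to_inf": int((np.isinf(half) & np.isfinite(weights)).sum()),
        "underflow_to_zero": int(((half == 0) & (weights != 0)).sum()),
        "bytes_fp32": int(weights.nbytes),
        "bytes_fp16": int(half.nbytes),
    }
    return half, report


if __name__ == "__main__":
    demo = np.array([0.5, -1.25, 7.0e4, 1.0e-9], dtype=np.float32)
    _, demo_report = cast_to_fp16_checked("demo/bias", demo)
    print("fp16 max:", FP16_MAX)
    print(demo_report)
```

In the conversion loop, `weight_list.append(value.astype(np.float16))` could be swapped for a call to such a helper, logging any variable whose report shows new NaN or inf values before the weights are assigned to the fp16 model.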

After the conversion, save within the same sess (not sess1):

        # 2. Save as pb: two lines inside sess.
        frozen_graph_def = graph_util.convert_variables_to_constants(
            sess,
            sess.graph.as_graph_def(),
            ['whichPun/output'])
        # An empty file needs to be created beforehand.
        with tf.gfile.GFile('./punc_model/pb_model_fp16/graph.pb', mode='wb') as f:
            f.write(frozen_graph_def.SerializeToString())

    # 3. Save as a TFS model.
    with tf.Graph().as_default() as graph:
        # The model is already frozen, so importing frozen_graph_def is enough.
        tf.import_graph_def(frozen_graph_def, name="")
        with tf.Session() as sess:
            export_path = "./punc_model/tfs_model_fp16/"
            # SavedModelBuilder requires that the export directory does not exist yet.
            if os.path.exists(export_path):
                os.system("rm -rf " + export_path)
            # Create a builder and set the model output path.
            builder = tf.saved_model.builder.SavedModelBuilder(export_path)
            # Declare the model's inputs and output. The tensor names match the
            # imported graph because it was imported with name="".
            inids = tf.saved_model.utils.build_tensor_info(model.input_ids)
            inmask = tf.saved_model.utils.build_tensor_info(model.input_mask)
            poutput = tf.saved_model.utils.build_tensor_info(model.logit)

            # signature_def wraps the input and output information; the keys used
            # here ('input', 'mask', 'punc_output') can be chosen freely.
            prediction_signature = (
                tf.saved_model.signature_def_utils.build_signature_def(
                    inputs={'input': inids, 'mask': inmask},
                    outputs={'punc_output': poutput},
                    method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME))
            # Add the graph and variable information.
            builder.add_meta_graph_and_variables(
                sess, [tf.saved_model.tag_constants.SERVING],
                signature_def_map={
                    'ac_forward': prediction_signature,
                })

            builder.save()
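
Once the exported directory is served by TensorFlow Serving, a client has to address the `ac_forward` signature and the `input` / `mask` keys defined above. The sketch below only builds the URL and JSON body for the TFS REST predict endpoint and sends nothing. The model name `punc_model`, the host, and port 8501 are assumptions for illustration and depend on how the server was started.

```python
import json


def build_predict_request(input_ids, input_mask, model_name="punc_model",
                          host="localhost", port=8501):
    """Build (url, json_body) for a TFS REST predict call.

    model_name, host and port are placeholders; the signature name and the
    input keys come from the signature_def_map in the export code.
    """
    url = "http://{}:{}/v1/models/{}:predict".format(host, port, model_name)
    body = {
        "signature_name": "ac_forward",
        # Columnar format: one entry per named input of the signature.
        "inputs": {"input": input_ids, "mask": input_mask},
    }
    return url, json.dumps(body)


if __name__ == "__main__":
    # Same one-token example as aaa / bbb in the conversion script.
    url, body = build_predict_request([[111]], [[1]])
    print(url)
    print(body)
```

The returned body can be posted with any HTTP client. With the columnar `inputs` format, the response should carry the result under an `outputs` key.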