YOLOv5 TensorRT Benchmark for NVIDIA® Jetson™ AGX Xavier™ and NVIDIA® Laptop


03 June 2021
WHAT YOU WILL LEARN

1- How to set up the YOLOv5 environment

2- How to create and test the engine files

3- Which model is the fastest







ENVIRONMENT

Hardware 1: Jetson AGX Xavier Dev. Kit

OS 1: JetPack 4.5 (L4T-32.5.0)

GPU 1: 512-Core Volta GPU with Tensor Cores

Hardware 2: Gigabyte Aero 15 Laptop

OS 2: Windows 10 Pro (Insider Preview 21382)

GPU 2: RTX 3060 Laptop GPU (6 GB)




YOLOv5 Environment Preparation


In this blog post, we will benchmark the detection performance of the TensorRT implementation of YOLOv5 on our Jetson AGX Xavier and on a laptop with an NVIDIA GPU.


First, we will set up the YOLOv5 environment on both devices. Then, we will create and test the engine files for all models (s, m, l, x, s6, m6, l6, x6) on both devices. Finally, we will combine all results into two tables to compare them easily.


Our host PC is a Windows laptop with an NVIDIA GPU. We installed WSL with CUDA support and used Ubuntu 18.04 from the Microsoft Store. We started the Docker service, cloned the YOLOv5 repository and pulled Ultralytics’ latest YOLOv5 Docker image. Then, we downloaded all P5 and P6 model files, cloned the TensorRTX repository, created a “.wts” file for each model and prepared the calibration and test images. We performed the setup with the commands below:



sudo service docker start

# clone YOLOv5 repository
git clone https://github.com/ultralytics/yolov5.git
cd yolov5/
git checkout abfcf9eb79877971acd238cafe6149711c5056ad
cd ..

# Mount the cloned repository into the container (${PWD} expands to the current directory)
docker run --env NVIDIA_DISABLE_REQUIRE=1 --ipc=host --gpus all -it -v ${PWD}/yolov5:/yolov5 ultralytics/yolov5:latest

cd /yolov5
apt update
apt install unzip

# Download P5 and P6 models
./weights/download_weights.sh
./weights/download_weights_6.sh
python detect.py --source ./data/images/zidane.jpg

# clone TensorRTX repository
git clone https://github.com/wang-xinyu/tensorrtx.git
cd tensorrtx/
git checkout 668d89bbd5319c00be9f648d3db8f61689417d2d
cd ..

# create .wts files
cp tensorrtx/yolov5/gen_wts.py .
python3 gen_wts.py yolov5s.pt
python3 gen_wts.py yolov5m.pt
python3 gen_wts.py yolov5l.pt
python3 gen_wts.py yolov5x.pt
python3 gen_wts.py yolov5s6.pt
python3 gen_wts.py yolov5m6.pt
python3 gen_wts.py yolov5l6.pt
python3 gen_wts.py yolov5x6.pt
cd tensorrtx/yolov5/
mkdir build && cd build
mv ../../../*.wts .
cmake ..

# configure calibration and test images
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1nZzd4jOM6XyVmne_BtmrmHGHGfrsBstP' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1nZzd4jOM6XyVmne_BtmrmHGHGfrsBstP" -O coco_calib.zip && rm -rf /tmp/cookies.txt
unzip coco_calib.zip
mkdir test
cp ./coco_calib/*1.jpg ./test/
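
The eight gen_wts.py calls above follow one pattern, so they could also be generated with a loop. The following is only a sketch: the function name wts_commands is hypothetical, and it prints the commands instead of running them so that it can be tried outside the container.

```shell
# Dry run: print one gen_wts.py call per model (P5 and P6).
# wts_commands is a hypothetical helper, not part of YOLOv5 or TensorRTX.
wts_commands() {
    for model in s m l x s6 m6 l6 x6; do
        echo "python3 gen_wts.py yolov5${model}.pt"
    done
}

wts_commands
```

To run the conversions for real, the output could be piped to a shell from inside the yolov5 directory, for example with `wts_commands | sh`.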


On the Jetson side, we created our YOLOv5 Docker environment. To do this, we cloned the YOLOv5 repository, pulled the L4T-ML Docker image and configured the Docker environment. Then, we downloaded all P5 and P6 model files, cloned the TensorRTX repository, created a “.wts” file for each model and prepared the calibration and test images. We performed this setup with the commands below:



# clone YOLOv5 repository
git clone https://github.com/ultralytics/yolov5.git
cd yolov5/
git checkout 91547edec1a33121b3607e8c946e051514182b4f
cd ..
docker run -it --gpus all -v ${PWD}/yolov5:/yolov5 nvcr.io/nvidia/l4t-ml:r32.5.0-py3
cd /yolov5
apt update
apt install unzip

sed -i 's/opencv-python/#opencv-python/g' requirements.txt
sed -i 's/torch/#torch/g' requirements.txt
pip3 install -r requirements.txt

# Download P5 and P6 models
sed -i 's/python/python3/g' ./weights/download_weights.sh
cp weights/download_weights.sh weights/download_weights_p6.sh
sed -i 's/}/}6/g' ./weights/download_weights_p6.sh
./weights/download_weights.sh
./weights/download_weights_p6.sh
python3 detect.py --source ./data/images/zidane.jpg --conf 0.5

# clone TensorRTX repository
git clone https://github.com/wang-xinyu/tensorrtx.git
cd tensorrtx/
git checkout 668d89bbd5319c00be9f648d3db8f61689417d2d
cd ..

# create .wts files
cp tensorrtx/yolov5/gen_wts.py .
python3 gen_wts.py yolov5s.pt
python3 gen_wts.py yolov5m.pt
python3 gen_wts.py yolov5l.pt
python3 gen_wts.py yolov5x.pt
python3 gen_wts.py yolov5s6.pt
python3 gen_wts.py yolov5m6.pt
python3 gen_wts.py yolov5l6.pt
python3 gen_wts.py yolov5x6.pt
cd tensorrtx/yolov5/
mkdir build && cd build
mv ../../../*.wts .
cmake ..

# configure calibration and test images
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1nZzd4jOM6XyVmne_BtmrmHGHGfrsBstP' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1nZzd4jOM6XyVmne_BtmrmHGHGfrsBstP" -O coco_calib.zip && rm -rf /tmp/cookies.txt
unzip coco_calib.zip
mkdir test
cp ./coco_calib/*1.jpg ./test/
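
The sed edits in the Jetson setup are easier to follow on a small example. The sketch below applies the same expressions to throwaway files. The file contents are assumed stand-ins, not copies of the real requirements.txt or download_weights.sh.

```shell
# Work on throwaway copies so the real files stay untouched.
tmp=$(mktemp -d)

# Assumed excerpt of requirements.txt (package names only).
printf '%s\n' 'numpy' 'opencv-python' 'torch' 'torchvision' 'tqdm' > "$tmp/requirements.txt"
sed -i 's/opencv-python/#opencv-python/g' "$tmp/requirements.txt"
sed -i 's/torch/#torch/g' "$tmp/requirements.txt"
patched_requirements=$(cat "$tmp/requirements.txt")
echo "$patched_requirements"

# Assumed shape of the model-name line in weights/download_weights.sh.
echo "attempt_download(f'yolov5{x}.pt')" > "$tmp/download_weights_p6.sh"
sed -i 's/}/}6/g' "$tmp/download_weights_p6.sh"
patched_download=$(cat "$tmp/download_weights_p6.sh")
echo "$patched_download"

rm -rf "$tmp"
```

Note that the torch pattern also comments out torchvision. This is presumably intended, on the assumption that the L4T-ML image already ships its own PyTorch, torchvision and OpenCV builds for Jetson. The second edit appends 6 after every closing brace, which turns each model name into its P6 variant.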

Creating and Testing The TensorRT Engine Files


At this step, we used the same commands on both devices. We checked the batch size, precision mode and input resolution with these commands:



grep INPUT ../yololayer.h
grep "define USE_" ../yolov5.cpp
grep "define BATCH" ../yolov5.cpp
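
The three checks above could also be folded into one helper that prints the configuration on a single line. This is a sketch under assumptions: show_config is a hypothetical name, and the sample files below only imitate the relevant lines of yolov5.cpp and yololayer.h, based on the strings used by the sed commands in this post.

```shell
# Print precision, batch size and input resolution on one line.
# $1: directory that contains yolov5.cpp and yololayer.h.
show_config() {
    local dir=$1 precision batch height width
    precision=$(sed -n 's/^#define USE_\([A-Z0-9]*\).*/\1/p' "$dir/yolov5.cpp")
    batch=$(sed -n 's/^#define BATCH_SIZE \([0-9]*\).*/\1/p' "$dir/yolov5.cpp")
    height=$(sed -n 's/.*INPUT_H = \([0-9]*\).*/\1/p' "$dir/yololayer.h")
    width=$(sed -n 's/.*INPUT_W = \([0-9]*\).*/\1/p' "$dir/yololayer.h")
    echo "precision=$precision batch=$batch resolution=${width}x${height}"
}

# Stand-in files so the helper can be tried without the TensorRTX sources.
demo=$(mktemp -d)
printf '%s\n' '#define USE_FP16' '#define BATCH_SIZE 1' > "$demo/yolov5.cpp"
printf '%s\n' 'static constexpr int INPUT_H = 640;' 'static constexpr int INPUT_W = 640;' > "$demo/yololayer.h"
config_line=$(show_config "$demo")
echo "$config_line"
rm -rf "$demo"
```

From the build directory used in this post, the call would be `show_config ..`.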




Our current configuration uses FP16 mode, a batch size of 1 and a resolution of 640x640. The resolution differs between P5 and P6 models: 640x640 is for P5 models (s, m, l, x) and 1280x1280 is for P6 models (s6, m6, l6, x6). After each configuration change, we rebuilt the “yolov5” application.
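
The mapping from model type to input resolution can be written down as a small lookup. This is a minimal sketch, and input_size_for is a hypothetical helper name.

```shell
# Map a model type to the input resolution used in this post:
# P5 models (s, m, l, x) use 640, P6 models (s6, m6, l6, x6) use 1280.
input_size_for() {
    case "$1" in
        s|m|l|x)     echo 640 ;;
        s6|m6|l6|x6) echo 1280 ;;
        *)           echo "unknown model: $1" >&2; return 1 ;;
    esac
}

for model in s m l x s6 m6 l6 x6; do
    size=$(input_size_for "$model")
    echo "yolov5${model}: ${size}x${size}"
done
```

A lookup like this makes it harder to build a P6 engine while INPUT_H and INPUT_W are still set to 640.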


We started testing with INT8 and a batch size of 1. Then, we changed the batch size to 8. After the INT8 tests ended, we tested the other modes one by one. These are the complete commands for creating and testing the engine files:



# INT8 – batch size 1 – resolution 640x640
sed -i 's/define USE_FP16/define USE_INT8/g' ../yolov5.cpp
make clean
make
./yolov5 -s yolov5s.wts yolov5s_int8_b1.engine s
./yolov5 -s yolov5m.wts yolov5m_int8_b1.engine m
./yolov5 -s yolov5l.wts yolov5l_int8_b1.engine l
./yolov5 -s yolov5x.wts yolov5x_int8_b1.engine x
./yolov5 -d yolov5s_int8_b1.engine ./test
./yolov5 -d yolov5m_int8_b1.engine ./test
./yolov5 -d yolov5l_int8_b1.engine ./test
./yolov5 -d yolov5x_int8_b1.engine ./test

# INT8 – batch size 1 – resolution 1280x1280
sed -i 's/INPUT_H = 640/INPUT_H = 1280/g' ../yololayer.h
sed -i 's/INPUT_W = 640/INPUT_W = 1280/g' ../yololayer.h
make clean
make
./yolov5 -s yolov5s6.wts yolov5s6_int8_b1.engine s6
./yolov5 -s yolov5m6.wts yolov5m6_int8_b1.engine m6
./yolov5 -s yolov5l6.wts yolov5l6_int8_b1.engine l6
./yolov5 -s yolov5x6.wts yolov5x6_int8_b1.engine x6
./yolov5 -d yolov5s6_int8_b1.engine ./test
./yolov5 -d yolov5m6_int8_b1.engine ./test
./yolov5 -d yolov5l6_int8_b1.engine ./test
./yolov5 -d yolov5x6_int8_b1.engine ./test
sed -i 's/INPUT_H = 1280/INPUT_H = 640/g' ../yololayer.h
sed -i 's/INPUT_W = 1280/INPUT_W = 640/g' ../yololayer.h

# INT8 – batch size 8 – resolution 640x640
sed -i 's/define BATCH_SIZE 1/define BATCH_SIZE 8/g' ../yolov5.cpp
make clean
make
./yolov5 -s yolov5s.wts yolov5s_int8_b8.engine s
./yolov5 -s yolov5m.wts yolov5m_int8_b8.engine m
./yolov5 -s yolov5l.wts yolov5l_int8_b8.engine l
./yolov5 -s yolov5x.wts yolov5x_int8_b8.engine x
./yolov5 -d yolov5s_int8_b8.engine ./test
./yolov5 -d yolov5m_int8_b8.engine ./test
./yolov5 -d yolov5l_int8_b8.engine ./test
./yolov5 -d yolov5x_int8_b8.engine ./test

# INT8 – batch size 8 – resolution 1280x1280
sed -i 's/INPUT_H = 640/INPUT_H = 1280/g' ../yololayer.h
sed -i 's/INPUT_W = 640/INPUT_W = 1280/g' ../yololayer.h
make clean
make
./yolov5 -s yolov5s6.wts yolov5s6_int8_b8.engine s6
./yolov5 -s yolov5m6.wts yolov5m6_int8_b8.engine m6
./yolov5 -s yolov5l6.wts yolov5l6_int8_b8.engine l6
./yolov5 -s yolov5x6.wts yolov5x6_int8_b8.engine x6
./yolov5 -d yolov5s6_int8_b8.engine ./test
./yolov5 -d yolov5m6_int8_b8.engine ./test
./yolov5 -d yolov5l6_int8_b8.engine ./test
./yolov5 -d yolov5x6_int8_b8.engine ./test
sed -i 's/INPUT_H = 1280/INPUT_H = 640/g' ../yololayer.h
sed -i 's/INPUT_W = 1280/INPUT_W = 640/g' ../yololayer.h
sed -i 's/define BATCH_SIZE 8/define BATCH_SIZE 1/g' ../yolov5.cpp

# FP16 – batch size 1 – resolution 640x640
sed -i 's/define USE_INT8/define USE_FP16/g' ../yolov5.cpp
make clean
make
./yolov5 -s yolov5s.wts yolov5s_fp16_b1.engine s
./yolov5 -s yolov5m.wts yolov5m_fp16_b1.engine m
./yolov5 -s yolov5l.wts yolov5l_fp16_b1.engine l
./yolov5 -s yolov5x.wts yolov5x_fp16_b1.engine x
./yolov5 -d yolov5s_fp16_b1.engine ./test
./yolov5 -d yolov5m_fp16_b1.engine ./test
./yolov5 -d yolov5l_fp16_b1.engine ./test
./yolov5 -d yolov5x_fp16_b1.engine ./test

# FP16 – batch size 1 – resolution 1280x1280
sed -i 's/INPUT_H = 640/INPUT_H = 1280/g' ../yololayer.h
sed -i 's/INPUT_W = 640/INPUT_W = 1280/g' ../yololayer.h
make clean
make
./yolov5 -s yolov5s6.wts yolov5s6_fp16_b1.engine s6
./yolov5 -s yolov5m6.wts yolov5m6_fp16_b1.engine m6
./yolov5 -s yolov5l6.wts yolov5l6_fp16_b1.engine l6
./yolov5 -s yolov5x6.wts yolov5x6_fp16_b1.engine x6
./yolov5 -d yolov5s6_fp16_b1.engine ./test
./yolov5 -d yolov5m6_fp16_b1.engine ./test
./yolov5 -d yolov5l6_fp16_b1.engine ./test
./yolov5 -d yolov5x6_fp16_b1.engine ./test
sed -i 's/INPUT_H = 1280/INPUT_H = 640/g' ../yololayer.h
sed -i 's/INPUT_W = 1280/INPUT_W = 640/g' ../yololayer.h

# FP16 – batch size 8 – resolution 640x640
sed -i 's/define BATCH_SIZE 1/define BATCH_SIZE 8/g' ../yolov5.cpp
make clean
make
./yolov5 -s yolov5s.wts yolov5s_fp16_b8.engine s
./yolov5 -s yolov5m.wts yolov5m_fp16_b8.engine m
./yolov5 -s yolov5l.wts yolov5l_fp16_b8.engine l
./yolov5 -s yolov5x.wts yolov5x_fp16_b8.engine x
./yolov5 -d yolov5s_fp16_b8.engine ./test
./yolov5 -d yolov5m_fp16_b8.engine ./test
./yolov5 -d yolov5l_fp16_b8.engine ./test
./yolov5 -d yolov5x_fp16_b8.engine ./test

# FP16 – batch size 8 – resolution 1280x1280
sed -i 's/INPUT_H = 640/INPUT_H = 1280/g' ../yololayer.h
sed -i 's/INPUT_W = 640/INPUT_W = 1280/g' ../yololayer.h
make clean
make
./yolov5 -s yolov5s6.wts yolov5s6_fp16_b8.engine s6
./yolov5 -s yolov5m6.wts yolov5m6_fp16_b8.engine m6
./yolov5 -s yolov5l6.wts yolov5l6_fp16_b8.engine l6
./yolov5 -s yolov5x6.wts yolov5x6_fp16_b8.engine x6
./yolov5 -d yolov5s6_fp16_b8.engine ./test
./yolov5 -d yolov5m6_fp16_b8.engine ./test
./yolov5 -d yolov5l6_fp16_b8.engine ./test
./yolov5 -d yolov5x6_fp16_b8.engine ./test
sed -i 's/define BATCH_SIZE 8/define BATCH_SIZE 1/g' ../yolov5.cpp
sed -i 's/INPUT_H = 1280/INPUT_H = 640/g' ../yololayer.h
sed -i 's/INPUT_W = 1280/INPUT_W = 640/g' ../yololayer.h

# FP32 – batch size 1 – resolution 640x640
sed -i 's/define USE_FP16/define USE_FP32/g' ../yolov5.cpp
make clean
make
./yolov5 -s yolov5s.wts yolov5s_fp32_b1.engine s
./yolov5 -s yolov5m.wts yolov5m_fp32_b1.engine m
./yolov5 -s yolov5l.wts yolov5l_fp32_b1.engine l
./yolov5 -s yolov5x.wts yolov5x_fp32_b1.engine x
./yolov5 -d yolov5s_fp32_b1.engine ./test
./yolov5 -d yolov5m_fp32_b1.engine ./test
./yolov5 -d yolov5l_fp32_b1.engine ./test
./yolov5 -d yolov5x_fp32_b1.engine ./test

# FP32 – batch size 1 – resolution 1280x1280
sed -i 's/INPUT_H = 640/INPUT_H = 1280/g' ../yololayer.h
sed -i 's/INPUT_W = 640/INPUT_W = 1280/g' ../yololayer.h
make clean
make
./yolov5 -s yolov5s6.wts yolov5s6_fp32_b1.engine s6
./yolov5 -s yolov5m6.wts yolov5m6_fp32_b1.engine m6
./yolov5 -s yolov5l6.wts yolov5l6_fp32_b1.engine l6
./yolov5 -s yolov5x6.wts yolov5x6_fp32_b1.engine x6
./yolov5 -d yolov5s6_fp32_b1.engine ./test
./yolov5 -d yolov5m6_fp32_b1.engine ./test
./yolov5 -d yolov5l6_fp32_b1.engine ./test
./yolov5 -d yolov5x6_fp32_b1.engine ./test
sed -i 's/INPUT_H = 1280/INPUT_H = 640/g' ../yololayer.h
sed -i 's/INPUT_W = 1280/INPUT_W = 640/g' ../yololayer.h

# FP32 – batch size 8 – resolution 640x640
sed -i 's/define BATCH_SIZE 1/define BATCH_SIZE 8/g' ../yolov5.cpp
make clean
make
./yolov5 -s yolov5s.wts yolov5s_fp32_b8.engine s
./yolov5 -s yolov5m.wts yolov5m_fp32_b8.engine m
./yolov5 -s yolov5l.wts yolov5l_fp32_b8.engine l
./yolov5 -s yolov5x.wts yolov5x_fp32_b8.engine x
./yolov5 -d yolov5s_fp32_b8.engine ./test
./yolov5 -d yolov5m_fp32_b8.engine ./test
./yolov5 -d yolov5l_fp32_b8.engine ./test
./yolov5 -d yolov5x_fp32_b8.engine ./test

# FP32 – batch size 8 – resolution 1280x1280
sed -i 's/INPUT_H = 640/INPUT_H = 1280/g' ../yololayer.h
sed -i 's/INPUT_W = 640/INPUT_W = 1280/g' ../yololayer.h
make clean
make
./yolov5 -s yolov5s6.wts yolov5s6_fp32_b8.engine s6
./yolov5 -s yolov5m6.wts yolov5m6_fp32_b8.engine m6
./yolov5 -s yolov5l6.wts yolov5l6_fp32_b8.engine l6
./yolov5 -s yolov5x6.wts yolov5x6_fp32_b8.engine x6
./yolov5 -d yolov5s6_fp32_b8.engine ./test
./yolov5 -d yolov5m6_fp32_b8.engine ./test
./yolov5 -d yolov5l6_fp32_b8.engine ./test
./yolov5 -d yolov5x6_fp32_b8.engine ./test
sed -i 's/INPUT_H = 1280/INPUT_H = 640/g' ../yololayer.h
sed -i 's/INPUT_W = 1280/INPUT_W = 640/g' ../yololayer.h
sed -i 's/define BATCH_SIZE 8/define BATCH_SIZE 1/g' ../yolov5.cpp
sed -i 's/define USE_FP32/define USE_FP16/g' ../yolov5.cpp
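
The listing above repeats the same pattern for 3 precision modes, 2 batch sizes and 2 model families. The sketch below prints the engine creation and test commands for the whole matrix as a dry run. The names plan_run and full_plan are hypothetical, and the sed edits and the rebuild (make clean, make) before each group are left out here, so they would still have to be done as shown above.

```shell
# Dry run: print the engine creation and test commands for one configuration.
# $1: precision (int8|fp16|fp32), $2: batch size, $3: model family (p5|p6).
plan_run() {
    local precision=$1 batch=$2 family=$3 models m
    case "$family" in
        p5) models="s m l x" ;;
        p6) models="s6 m6 l6 x6" ;;
        *)  echo "unknown family: $family" >&2; return 1 ;;
    esac
    for m in $models; do
        echo "./yolov5 -s yolov5${m}.wts yolov5${m}_${precision}_b${batch}.engine $m"
    done
    for m in $models; do
        echo "./yolov5 -d yolov5${m}_${precision}_b${batch}.engine ./test"
    done
}

# Walk the matrix in the same order as the listing above.
full_plan() {
    local precision batch family
    for precision in int8 fp16 fp32; do
        for batch in 1 8; do
            for family in p5 p6; do
                echo "# $precision, batch size $batch, $family"
                plan_run "$precision" "$batch" "$family"
            done
        done
    done
}

full_plan
```

Printing the plan first makes it easy to confirm that all 48 engine files are covered before spending time on the builds.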


These are some of the results for the YOLOv5-P5 L model in FP16 mode with a batch size of 1:


Comparison of Benchmark Results



The average image processing times (excluding preprocessing such as reading the image and loading the engine file) for the Jetson AGX Xavier and the NVIDIA laptop are shown in the tables below:
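
An average like this could be computed from the test output with a short awk filter. The sketch below assumes that the test run prints one line per batch in the form "12ms" and that each value covers a whole batch. The name summarize_times is hypothetical, and the sample values are placeholders, not measurements from our tests.

```shell
# Average the per-batch times and convert the result to images per second.
# Reads lines such as "12ms" on stdin; $1 is the batch size.
summarize_times() {
    awk -v batch="$1" '
        /^[0-9]+ms$/ { sub(/ms$/, ""); total += $0; n++ }
        END {
            if (n == 0) { print "no timing lines found"; exit 1 }
            avg = total / n
            fps = (avg > 0) ? 1000 * batch / avg : 0
            printf "avg_ms=%.2f fps=%.2f\n", avg, fps
        }'
}

# Placeholder values for illustration only.
printf '%s\n' '10ms' '12ms' '14ms' | summarize_times 1
```

With a real engine, the test output could be piped in directly, for example `./yolov5 -d yolov5s_fp16_b1.engine ./test | summarize_times 1`.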




Thank you for reading our blog post. 

