How to Re-Train a Dataset using PyTorch?

Jetson AGX Xavier | Jetson Nano | Jetson TX2 NX | Jetson Xavier NX

09 August 2021

In this blog post, we will explain how to re-train a model on a dataset that includes 5000 training, 1000 validation, and 200 test images. The dataset contains equal numbers of cat and dog pictures.

How to Re-Train ResNet-18 on a 2-Class Dataset?


The training images will be used for transfer learning, the validation set is used to measure accuracy during training, and the test images will be used after the training is finished. The network we will re-train is ResNet-18.


Before we get started, make sure the jetson-inference project is set up. If you haven’t downloaded the project, click here to learn how to do it step by step. While building the project, do not forget to install PyTorch as well. If you haven’t installed PyTorch, you can run the following commands.


cd jetson-inference/build 
./install-pytorch.sh


If you use the Docker container, PyTorch is installed automatically.

You should also check that torch and torchvision can be imported. Start the interpreter by typing python or python3 in the terminal, then run the following commands.


>>> import torch
>>> import torchvision
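
If you prefer a script over the interactive prompt, the same check can be sketched as below. This is a minimal sketch that uses only the Python standard library. The helper name check_packages is hypothetical, and it only reports whether each package can be found, without actually importing it.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each top-level package name to True if it is installed."""
    # find_spec returns None when a top-level package cannot be found.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(["torch", "torchvision"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either package is reported as missing, run the install script above before continuing.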


Before re-training, we must download the dataset. Go to the jetson-inference/python/training/classification/data sub-directory, where we will store it.

cd jetson-inference/python/training/classification/data


Then, download the dataset.


wget https://nvidia.box.com/shared/static/o577zd8yp3lmxf5zhm38svrbrv45am3y.gz -O cat_dog.tar.gz


Extract the compressed file. 

tar xvzf cat_dog.tar.gz
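
After extracting, it can be useful to confirm that the image counts match the numbers given at the top of this post. The sketch below assumes the archive extracts to a cat_dog folder laid out as split/class sub-folders (for example test/cat, which is used later in this post; the other split names are an assumption), and that the images are .jpg, .jpeg, or .png files. The helper name count_images is hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image file types


def count_images(root):
    """Count image files per (split, class) under root/<split>/<class>/."""
    counts = {}
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue  # skip plain files such as labels.txt
        for class_dir in sorted(split_dir.iterdir()):
            if class_dir.is_dir():
                n = sum(1 for f in class_dir.iterdir()
                        if f.suffix.lower() in IMAGE_SUFFIXES)
                counts[(split_dir.name, class_dir.name)] = n
    return counts


if __name__ == "__main__":
    root = Path("cat_dog")  # assumed extraction folder, relative to data/
    if root.is_dir():
        for (split, label), n in count_images(root).items():
            print(f"{split}/{label}: {n} images")
```

Run it from the data sub-directory. Since the classes are balanced, the cat and dog counts in each split should be equal.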


Now, we can re-train the ResNet-18 model on our dataset. Go back to the jetson-inference directory and run the Docker container. Once the container is running, go to the python/training/classification directory to launch the training.

cd ~/jetson-inference
docker/run.sh
# cd python/training/classification/


Next, we will move on to re-training the model.


By default, the training script uses the ResNet-18 model, but if you would like to use a different one, add the --arch flag.

Also, you should adjust the batch size, which sets how many images are processed at once during training. By default, it is set to 8, but we need to reduce it to 4 to be able to train on Jetson Nano, which has less RAM than the other Jetson modules.

In addition, reduce the number of data loader workers by setting --workers to 1 from the default of 2 for the same reason.
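
To see what the batch size means in practice, the following minimal sketch computes how many batches make up one epoch. It uses the 5000 training images stated at the top of this post; the function name batches_per_epoch is hypothetical.

```python
import math


def batches_per_epoch(num_images, batch_size):
    """Number of batches needed to pass over the whole training set once."""
    return math.ceil(num_images / batch_size)


TRAIN_IMAGES = 5000  # training set size stated at the top of this post

for batch_size in (8, 4):
    n = batches_per_epoch(TRAIN_IMAGES, batch_size)
    print(f"batch size {batch_size}: {n} batches per epoch")
```

Halving the batch size from 8 to 4 doubles the number of batches per epoch from 625 to 1250, while each batch needs less memory.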

Another important factor is the number of epochs, which is the number of passes over the entire training set. To get a more accurate model, we need to increase the number of epochs. However, training ResNet-18 takes ~7-8 minutes per epoch, so it takes around 4 hours to train for 30 epochs and reach about 80% accuracy. You can see the epoch vs. accuracy graph below.



We will use 2 epochs to reduce the duration of training.
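
The time estimates above are simple multiplication, which the sketch below reproduces. It assumes the ~7-8 minutes per epoch quoted in this post holds for your board; the function name training_minutes is hypothetical.

```python
def training_minutes(epochs, minutes_per_epoch=(7, 8)):
    """Return the (low, high) estimate of the total training time in minutes."""
    low, high = minutes_per_epoch
    return epochs * low, epochs * high


for epochs in (30, 2):
    low, high = training_minutes(epochs)
    print(f"{epochs} epochs: {low}-{high} minutes "
          f"({low / 60:.1f}-{high / 60:.1f} hours)")
```

With these numbers, 30 epochs take 210-240 minutes (3.5-4 hours), while 2 epochs take only 14-16 minutes.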

Now, run the following example command to start training.


python3 train.py --model-dir=models/cat_dog --batch-size=4 --workers=1 --epochs=2 data/cat_dog


--model-dir=models/cat_dog: where the model will be stored.

data/cat_dog: where the dataset is stored.

While the training is running, you can see the following parameters, which give information about the accuracy, time, and epoch number.



Epoch [N]: The epoch you are on, i.e., the number of the current pass over the dataset.

[N/1250]: The current image batch in the epoch. Since we set the batch size to 4, it means we are on image [4xN/4x1250].

Acc@1: Classification accuracy that shows whether the predicted class is correct.

Acc@5: Classification accuracy that shows whether the correct class is among the top 5 predicted classes. Since we have only 2 classes in this example, it is always 100%.
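
The difference between Acc@1 and Acc@5 can be illustrated with a small top-k accuracy sketch. This is plain Python written for illustration, not the code used by the training script, and the scores and labels below are made-up values.

```python
def topk_accuracy(scores, targets, k):
    """Percentage of samples whose true class is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices sorted from the highest to the lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(targets)


# Hypothetical scores for 4 images; the columns are [cat, dog].
scores = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]
targets = [0, 1, 1, 1]  # true class index of each image

print("Acc@1:", topk_accuracy(scores, targets, k=1))
print("Acc@5:", topk_accuracy(scores, targets, k=5))
```

Here the third image is misclassified, so Acc@1 is 75.0. With only 2 classes, the top 5 always contains every class, so Acc@5 is 100.0 regardless of the predictions.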

You can stop the training at any time by pressing Ctrl+C (or Ctrl+Shift+C, depending on your terminal's shortcut settings). To restart the training later, you can use the --resume flag to continue from where you stopped instead of starting over.


How to Run Re-Trained Model with TensorRT?


To run the re-trained model with TensorRT, which provides real-time inference, we need to convert our PyTorch model to a framework-independent model format called ONNX. For this purpose, simply run the onnx_export.py script that comes with jetson-inference.

python3 onnx_export.py --model-dir=models/cat_dog


This operation will create the resnet18.onnx model in the jetson-inference/python/training/classification/models/cat_dog directory.

The next step is to process all the test images in the dataset for the cat and dog classes separately.

First, create output folders for both classes, where we will store the classified test images.


mkdir data/test_cat_output data/test_dog_output


Then, run the imagenet program. The --model flag gives the path of the exported ONNX model, and --labels gives the file that lists the class names. The --input_blob and --output_blob flags give the names of the model's input and output layers. Do not forget to add the input and output paths as well.

imagenet --model=models/cat_dog/resnet18.onnx --labels=data/cat_dog/labels.txt --input_blob=input_0 --output_blob=output_0 data/cat_dog/test/cat data/test_cat_output
imagenet --model=models/cat_dog/resnet18.onnx --labels=data/cat_dog/labels.txt --input_blob=input_0 --output_blob=output_0 data/cat_dog/test/dog data/test_dog_output
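
Since the two commands differ only in the class name, they can also be generated in a loop. The sketch below only builds and prints the commands, using the same flags and paths as above. The helper name imagenet_command and its default arguments are hypothetical.

```python
def imagenet_command(label, model_dir="models/cat_dog", data_dir="data/cat_dog"):
    """Build the imagenet argument list for one class of test images."""
    return [
        "imagenet",
        f"--model={model_dir}/resnet18.onnx",
        f"--labels={data_dir}/labels.txt",
        "--input_blob=input_0",
        "--output_blob=output_0",
        f"{data_dir}/test/{label}",    # input folder with the test images
        f"data/test_{label}_output",   # output folder created earlier
    ]


for label in ("cat", "dog"):
    # Print the command; pass the list to subprocess.run(...) to execute it.
    print(" ".join(imagenet_command(label)))
```

Building the command as a list keeps each argument separate, so it can be passed directly to subprocess.run without shell quoting.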

How to Test Re-Trained Model on Live Camera?


To test the re-trained model on a live camera, you must connect your camera first. If the camera is already connected, run the command below that matches your camera type.


V4L2 (USB Camera):


imagenet --model=models/cat_dog/resnet18.onnx --labels=data/cat_dog/labels.txt --input_blob=input_0 --output_blob=output_0 /dev/video0


CSI Camera:


imagenet --model=models/cat_dog/resnet18.onnx --labels=data/cat_dog/labels.txt --input_blob=input_0 --output_blob=output_0 csi://0


If the camera was not connected when you started the container, you must exit the container, connect the camera, and then run the container again. After that, you can run the commands above.

Thank you for reading our blog post. 

