Research in Science and Technology
Journal of Military Science and Technology, Special Issue of the FEE National Conference, 10 - 2020 455
TRACKING UAV IN INFRARED VIDEOS 
USING SIAMESE NETWORKS 
Hoang Dinh Thang¹*, Tran Quoc Long², Thai Kien Trung¹, Nguyen Chi Thanh¹
Abstract: Siamese network-based trackers have achieved excellent performance
in visual object tracking. Several Siamese networks achieve state-of-the-art results
on long-term visual tracking benchmarks, confirming their effectiveness and
efficiency. In this work, we study state-of-the-art Siamese networks and then
propose a model based on the Siamese architecture to track UAVs in the Anti-UAV
Challenge dataset, which includes 100 infrared videos. The network architecture
uses a pre-trained ResNet50 backbone, depth-wise cross-correlation, and focal loss.
Experiments on the Anti-UAV infrared dataset show its robustness to the various
challenges of real infrared scenes, with high efficiency.
Keywords: Tracking UAV, Siamese network, Visual Tracking, Deep Learning. 
1. INTRODUCTION 
Visual tracking is a fundamental but challenging task in computer vision. Given the 
target state in the initial frame of a sequence, the tracker needs to predict the target state in 
each subsequent frame. It has a wide range of applications in diverse fields such as visual
surveillance, human-computer interaction, and augmented reality. Despite great progress
in recent years, visual tracking still faces challenges due to occlusion, scale variation,
background clutter, fast motion, illumination variation, and appearance variation.
Deep learning technologies have significantly advanced visual object tracking by
providing a strong capacity for learning powerful deep features. Recently,
Siamese network-based trackers [1-8] have drawn much attention in the single object
tracking community. These Siamese trackers formulate the visual object tracking
problem as learning a general similarity map by cross-correlation between the feature
representations learned for the target template and the search region.
Recently, the number of unmanned aerial vehicles (UAVs) has been growing rapidly because
of their autonomy, flexibility, and broad range of applications. Computer vision on UAVs
has received increasing attention as a result. The development of
drone technology will inevitably promote the development of counter-drone technology,
and the detection and tracking of drones are becoming increasingly important.
In this paper, we propose a tracking method using a Siamese architecture that can robustly
track UAVs in complex real scenes. The method accurately determines the position and
size of the drone in subsequent video frames based on the initial area specified in the first frame.
The rest of the paper is organized as follows. We first review related work in
section 2 and present our approach for tracking UAVs in infrared videos in section 3.
Section 4 presents the experimental results on the Anti-UAV benchmark dataset.
Finally, we summarize our work in section 5.
2. RELATED WORKS 
Object Detectors: The Region Proposal Network (RPN) integrates proposal generation
with the second-stage classifier into a single convolutional network, forming the Faster
R-CNN framework [9]. The learned RPN also improves region proposal quality and thus the
overall object detection accuracy. RetinaNet [10] proposes the focal loss, which applies a
modulating term to the cross-entropy loss in order to focus learning on hard negative
examples. The approach is simple and highly effective: its authors demonstrate its efficacy
by designing a fully convolutional one-stage detector and report extensive experimental
analysis showing that it achieves state-of-the-art accuracy and speed.
Mathematics – Information Technology
 H. D. Thang,  , N. C. Thanh, “Tracking UAV in infrared videos using Siamese networks.” 456
Visual Object Tracking (VOT) is one of the most active research topics in computer 
vision in recent decades. VOT is the task of tracking an object through a video given the 
first-frame bounding box of the object. Many methods approach VOT using Siamese 
architectures. The Siamese network-based trackers have received significant attention for 
their well-balanced tracking accuracy and efficiency [1-8]. These trackers formulate visual 
tracking as a cross-correlation problem and are expected to better leverage the merits of 
deep networks from end-to-end learning. SiamFC [1] utilizes a large number of template
and search region paired samples in offline training. In online tracking, through forward
propagation, the template and search region are correlated with each other in the feature
space, and the target location is determined according to the peak position of the
cross-correlation response. SiamRPN [2] adds the Region Proposal Network [9] to obtain
candidate target boxes with various aspect ratios. It interprets the template branch of the
Siamese subnetwork as training parameters to predict the kernel of the local detection task,
thereby treating tracking as a one-shot local detection task. Recent tracking approaches
improve upon SiamRPN by making it distractor-aware (DaSiamRPN [3]), adding a cascade
(C-RPN [4]), and using deeper architectures (SiamRPN+ [5] and SiamRPN++ [6]).
SiamRPN++ [6] proposes a depth-wise separable correlation structure that enhances the
cross-correlation to produce multiple similarity maps associated with different semantic meanings.
Other recent developments in VOT include using domain-specific layers with online
learning [11], learning an adaptive spatial filter regularizer [12], exploiting
category-specific semantic information [13], using continuous [14] or factorized (ECO [15])
convolutions, and achieving accurate bounding box predictions using an overlap
prediction network (ATOM [16]). Huang et al. [17] propose a framework to convert any
detector into a tracker. Their method relies on meta-learning and a two-stage architecture,
and achieves much lower accuracy.
Koksal et al. [18] use YOLOv3 [19] for the UAV detection problem. The YOLO
network is trained on the Anti-UAV Challenge dataset to detect UAVs. ATMF [20]
uses multi-modal fusion to track UAVs in the Anti-UAV Challenge dataset. When the
short-term tracker fails, a re-detection module is activated to search for the target
globally in the image.
3. METHOD 
The standard Siamese architecture takes an image pair as input, comprising an 
exemplar image z and a candidate search image x. The image z represents the object of 
interest (e.g., an image patch centered on the target object in the first video frame), while x 
is typically larger and represents the search area in subsequent video frames. Both inputs 
are processed by a ConvNet $\varphi_\theta$ with parameters $\theta$. This yields two feature maps, which are
cross-correlated as
$f_\theta(z, x) = \varphi_\theta(z) \ast \varphi_\theta(x) + b \cdot \mathbb{1}$ (1)
where $b \cdot \mathbb{1}$ denotes a bias term which takes the value $b \in \mathbb{R}$ at every location. Eq. 1
amounts to performing an exhaustive search of the pattern z over the image x. The goal is
for the maximum value of the response map to coincide with the target location.
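The exhaustive search described by Eq. 1 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it works on toy single-channel maps rather than learned ConvNet features, and the function names `cross_correlate` and `peak_location` are hypothetical.

```python
def cross_correlate(template, search, bias=0.0):
    """Slide `template` over `search` (both 2-D lists) and return the response map."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            # Inner product between the template and the search window at (i, j).
            score = sum(
                template[u][v] * search[i + u][j + v]
                for u in range(th)
                for v in range(tw)
            )
            row.append(score + bias)  # the same bias is added at every location
        response.append(row)
    return response


def peak_location(response):
    """Return (row, col) of the maximum value in the response map."""
    best = max(
        (value, i, j)
        for i, row in enumerate(response)
        for j, value in enumerate(row)
    )
    return best[1], best[2]


# Toy maps: the 2x2 pattern is embedded in the search map at row 1, column 2.
template = [[1.0, 2.0],
            [3.0, 4.0]]
search = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 2.0],
          [0.0, 0.0, 3.0, 4.0]]
response = cross_correlate(template, search)
peak = peak_location(response)
```

In this toy case the peak of the response map falls on the window where the pattern was embedded, which is the behaviour the training objective aims for on real feature maps.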
To achieve this goal, the network is trained offline with random image pairs (z, x) taken
from training videos and the corresponding ground-truth label y. The parameters $\theta$ of the
ConvNet are obtained by minimizing the following logistic loss $\ell$ over the training set:
$\arg\min_{\theta} \; \mathbb{E}_{(z, x, y)} \, \ell\big(y, f_\theta(z, x)\big)$ (2)
In this section, we describe the proposed network, which is a more advanced ConvNet
$\varphi_\theta$ that learns an effective model to enhance tracking robustness and accuracy. As shown
in figure 1, the network consists of a Siamese network backbone and multiple RPN heads. The
Siamese network backbone, an off-the-shelf convolutional network, is responsible for
computing the convolutional feature maps of the template patch and the search patch.
Each RPN head includes a classification module and a regression module.
3.1. Network Backbone 
In our tracking method, we adopt ResNet-50 [21] as the backbone network, modifying
the strides and adding dilated convolutions in the conv4 and conv5 blocks, as detailed
in table 1. The feature maps output by conv3, conv4, and conv5 are fed into three RPN head
modules individually.
Table 1. ResNet50 backbone.
                                Bottleneck in conv4          Bottleneck in conv5
                                conv1x1 conv3x3 conv1x1      conv1x1 conv3x3 conv1x1
original ResNet-50   stride        1       2       1            1       2       1
                     padding       0       1       0            0       1       0
                     dilation      1       1       1            1       1       1
modified ResNet-50   stride        1       1       1            1       1       1
                     padding       0       2       0            0       4       0
                     dilation      1       2       1            1       4       1
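The effect of the settings in table 1 on spatial resolution can be checked with the standard convolution output-size formula. The sketch below is only an illustration: the input size of 31 is a hypothetical value chosen for the example, and `conv_output_size` is a helper name introduced here.

```python
def conv_output_size(size, kernel, stride, padding, dilation):
    """Spatial output size of a convolution (standard formula with floor division)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


size_in = 31  # hypothetical feature-map size entering the 3x3 convolution

# Original ResNet-50 conv3x3 in conv4/conv5: stride 2, padding 1, dilation 1.
original = conv_output_size(size_in, kernel=3, stride=2, padding=1, dilation=1)

# Modified conv4 conv3x3: stride 1, padding 2, dilation 2.
modified_conv4 = conv_output_size(size_in, kernel=3, stride=1, padding=2, dilation=2)

# Modified conv5 conv3x3: stride 1, padding 4, dilation 4.
modified_conv5 = conv_output_size(size_in, kernel=3, stride=1, padding=4, dilation=4)

print(original, modified_conv4, modified_conv5)
```

With stride 1 and padding equal to the dilation, the 3x3 convolution keeps the input resolution, whereas the original stride-2 setting roughly halves it. This is why the modified conv4 and conv5 blocks can produce feature maps of the same spatial size as conv3.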
3.2. RPN Head 
The RPN head (figure 1, right) consists of a classification module and a regression module.
Both modules receive features from the template branch and the search branch. Features
from the two branches are first adjusted to the same number of channels. The two feature
maps are then combined by depth-wise cross-correlation, channel by channel.
Figure 1. Illustration of proposed framework. The left sub-figure shows its main structure, 
where c3, c4, and c5 denote the feature maps of the backbone network. The right sub-
figure shows each RPN head, where DW-Corr means depth-wise cross-correlation 
operation, Reg/Cls Map denote the feature maps of the RPN heads output. 
With k anchors, the RPN head needs to output 2k channels for classification and 4k
channels for regression. The correlation is computed on both the classification branch and
the regression branch:
$A^{cls} = [\varphi_\theta(x)]_{cls} \ast [\varphi_\theta(z)]_{cls}$ (3)
$A^{reg} = [\varphi_\theta(x)]_{reg} \ast [\varphi_\theta(z)]_{reg}$
where $\ast$ denotes the convolution operation with $[\varphi_\theta(z)]_{cls}$ or $[\varphi_\theta(z)]_{reg}$ as the convolution
kernel, $A^{cls}$ denotes the classification map, and $A^{reg}$ denotes the regression map.
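The channel-by-channel correlation used in the RPN head can be sketched as follows. This is a simplified pure-Python illustration under stated assumptions: features are plain nested lists with one 2-D map per channel, the helper names are hypothetical, and the step that maps the resulting response maps to the 2k and 4k output channels is not shown.

```python
def xcorr2d(kernel, feature):
    """Valid 2-D cross-correlation of one template channel over one search channel."""
    kh, kw = len(kernel), len(kernel[0])
    fh, fw = len(feature), len(feature[0])
    return [
        [
            sum(kernel[u][v] * feature[i + u][j + v]
                for u in range(kh) for v in range(kw))
            for j in range(fw - kw + 1)
        ]
        for i in range(fh - kh + 1)
    ]


def depthwise_xcorr(template_feat, search_feat):
    """Correlate channel c of the template only with channel c of the search region.

    Both inputs are lists of C 2-D maps; the result is a list of C response maps.
    """
    if len(template_feat) != len(search_feat):
        raise ValueError("both branches must have the same number of channels")
    return [xcorr2d(k, f) for k, f in zip(template_feat, search_feat)]


# Two channels, 2x2 template features, 3x3 search features.
template_feat = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
search_feat = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
response_maps = depthwise_xcorr(template_feat, search_feat)
```

Each channel yields its own response map, so the output keeps the channel dimension instead of collapsing it into a single similarity map.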
3.3. Classification Loss and Regression Loss 
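With $y \in \{\pm 1\}$ specifying the ground-truth class and $p \in [0, 1]$ the model’s
estimated probability for the class with label $y = 1$, define $p_t$:
$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$ (4)
Loss for classification is the focal loss [10]:
$L_{cls} = -(1 - p_t)^{\gamma} \log(p_t)$ (5)
where $\gamma \ge 0$ is the focusing parameter.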
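A minimal sketch of the focal loss for a single prediction is given below. The focusing parameter value `gamma=2.0` and the small clamp `eps` are assumptions made for the example (the value of the focusing parameter used in this work is not stated), and the function name is hypothetical.

```python
import math


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Focal loss for one sample.

    p: estimated probability of the class with label y = 1.
    y: ground-truth label, +1 or -1.
    gamma: focusing parameter (assumed value, not taken from the paper).
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy_positive = focal_loss(0.9, 1)   # well-classified sample
hard_positive = focal_loss(0.1, 1)   # misclassified sample
cross_entropy = focal_loss(0.9, 1, gamma=0.0)  # gamma = 0 gives plain cross-entropy
```

The modulating factor shrinks the loss of well-classified samples far more than that of hard ones, so training is dominated by the hard examples.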
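Smooth $L_1$ loss with normalized coordinates is used for regression:
$smooth_{L_1}(x, \sigma) = \begin{cases} 0.5 \sigma^2 x^2, & |x| < 1/\sigma^2 \\ |x| - 1/(2\sigma^2), & |x| \ge 1/\sigma^2 \end{cases}$ (6)
Let $A_x, A_y, A_w, A_h$ denote the center point and shape of the anchor boxes, and let
$T_x, T_y, T_w, T_h$ denote those of the ground-truth boxes. The normalized distance [2] is:
$\delta[0] = \frac{T_x - A_x}{A_w}, \quad \delta[1] = \frac{T_y - A_y}{A_h}, \quad \delta[2] = \ln\frac{T_w}{A_w}, \quad \delta[3] = \ln\frac{T_h}{A_h}$ (7)
Loss for regression is:
$L_{reg} = \sum_{i=0}^{3} smooth_{L_1}(\delta[i], \sigma)$ (8)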
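The regression loss for one anchor and one ground-truth box can be sketched as below, following the smooth L1 form and the normalized distance of SiamRPN [2]. The value `sigma=3.0` and the example boxes are hypothetical, as are the function names.

```python
import math


def smooth_l1(x, sigma):
    """Smooth L1: quadratic for |x| < 1/sigma^2, linear otherwise."""
    if abs(x) < 1.0 / sigma ** 2:
        return 0.5 * sigma ** 2 * x ** 2
    return abs(x) - 0.5 / sigma ** 2


def normalized_distance(anchor, target):
    """anchor and target are (center_x, center_y, width, height) tuples."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = target
    return [
        (tx - ax) / aw,
        (ty - ay) / ah,
        math.log(tw / aw),
        math.log(th / ah),
    ]


def regression_loss(anchor, target, sigma=3.0):
    """Sum of smooth L1 terms over the four normalized coordinates."""
    return sum(smooth_l1(d, sigma) for d in normalized_distance(anchor, target))


anchor = (100.0, 100.0, 64.0, 64.0)
target = (108.0, 100.0, 64.0, 128.0)
delta = normalized_distance(anchor, target)
loss = regression_loss(anchor, target)
```

Dividing the offsets by the anchor size and taking the logarithm of the size ratios makes the regression targets independent of the absolute scale of the box.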
The overall loss of the network is a combination of the classification loss and the
regression loss:
$L = \lambda_1 L_{cls} + \lambda_2 L_{reg}$ (9)
where $\lambda_1$ and $\lambda_2$ are hyper-parameters that balance the two parts.
3.4. Training and Inference 
Training. Our entire network can be trained end-to-end on large-scale datasets. The
backbone network ResNet50 [21] is pre-trained on ImageNet [22] for image labeling. We
train the network with image pairs sampled from videos or still images to learn a generic
notion of how to measure the similarity between general objects for visual tracking. The
training sets include ImageNet VID [22], ImageNet DET [22], COCO [23], and GOT-10k [24].
The size of a template patch is 127×127 pixels, while the size of a search patch is 255×255
pixels. The number of anchors is k=5 with stride=8, scales=8 and
ratios=[0.33,0.5,1,2,3] hyper-parameter .
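The anchor settings above can be turned into concrete anchor shapes with a short sketch. It rests on two assumptions that the text does not state: each ratio is interpreted as height divided by width, and every anchor keeps the area of the square anchor of side stride × scale. The function name `generate_anchor_shapes` is hypothetical.

```python
import math


def generate_anchor_shapes(stride=8, scales=(8,), ratios=(0.33, 0.5, 1, 2, 3)):
    """Return one (width, height) pair per (ratio, scale) combination.

    Assumptions: ratio = height / width, and the anchor area equals
    (stride * scale) ** 2 for every ratio.
    """
    shapes = []
    for ratio in ratios:
        for scale in scales:
            side = stride * scale  # side of the square (ratio 1) anchor
            width = side / math.sqrt(ratio)
            height = side * math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


anchor_shapes = generate_anchor_shapes()
for width, height in anchor_shapes:
    print(f"{width:7.2f} x {height:7.2f}")
```

With one scale and five ratios this yields the k=5 anchors placed at every position of the response map, so the classification and regression branches output 2k=10 and 4k=20 channels respectively.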
Data augmentation: we use data augmentation techniques such as flipping, shifting,
scaling, blurring, and grayscale conversion to increase the variety of samples fed to the network.
Inference. During inference, we crop the template patch from the first frame and feed 
it to the feature extraction network. For subsequent frames, we crop the search patch and 
extract features based on the target position of the previous frame, and then perform 
prediction in the search region to get the total classification map and regression map. 
Afterward, we obtain the prediction boxes using the strategy described in [2]. After the
prediction boxes are generated, we use a cosine window and a scale change penalty to
smooth target movements and changes; the prediction box with the best score is then
selected, and its size is updated by linear interpolation with the state in the previous frame.
The Anti-UAV dataset poses the challenges of long-term tracking datasets: severe
out-of-view, full occlusion, and fast motion. During failure cases, we gradually increase the
search region with a local-to-global strategy [5]. Specifically, the size of the search region
grows iteratively with a constant step when tracking failure is indicated.
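The box selection step (scale change penalty, cosine window, linear interpolation of the size) can be sketched in the style of the SiamRPN tracker [2]. This is an illustrative reading, not the authors' code: the hyper-parameter values `penalty_k`, `window_influence`, and `lr` are placeholders, the window weights are passed in directly instead of being computed over a 2-D grid, and all names are hypothetical.

```python
import math


def change(r):
    """Symmetric ratio: how far r is from 1, regardless of direction."""
    return max(r, 1.0 / r)


def padded_size(w, h):
    """Equivalent side length of a box with context padding."""
    pad = (w + h) * 0.5
    return math.sqrt((w + pad) * (h + pad))


def select_box(boxes, scores, window, prev_size,
               penalty_k=0.05, window_influence=0.4, lr=0.4):
    """Pick the best candidate and smooth its size.

    boxes: list of (w, h) candidate sizes; scores: classification scores;
    window: cosine-window weight of each candidate position;
    prev_size: (w, h) of the target in the previous frame.
    """
    prev_w, prev_h = prev_size
    best_idx, best_pscore, best_penalty = -1, -math.inf, 0.0
    for idx, ((w, h), score, win) in enumerate(zip(boxes, scores, window)):
        s_c = change(padded_size(w, h) / padded_size(prev_w, prev_h))  # scale change
        r_c = change((prev_w / prev_h) / (w / h))                      # ratio change
        penalty = math.exp(-(r_c * s_c - 1.0) * penalty_k)
        pscore = penalty * score
        pscore = pscore * (1.0 - window_influence) + win * window_influence
        if pscore > best_pscore:
            best_idx, best_pscore, best_penalty = idx, pscore, penalty
    # Linear interpolation between the previous size and the selected box.
    rate = best_penalty * scores[best_idx] * lr
    w, h = boxes[best_idx]
    new_size = (prev_w * (1.0 - rate) + w * rate,
                prev_h * (1.0 - rate) + h * rate)
    return best_idx, new_size


boxes = [(60.0, 30.0), (21.0, 10.0), (20.0, 40.0)]
scores = [0.95, 0.90, 0.92]
window = [0.2, 1.0, 0.2]
best_idx, (new_w, new_h) = select_box(boxes, scores, window, prev_size=(20.0, 10.0))
```

In this example the candidate with the highest raw score is rejected because its size changes abruptly and it lies far from the window center; the selected box only moves the tracked size part of the way towards the new estimate.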
4. EXPERIMENTS
We evaluate the performance of the proposed tracking approach on the Anti-UAV
infrared dataset.
4.1. Implementation Details 
The network backbone is pre-trained on the ImageNet-1k classification task. The
network is trained with stochastic gradient descent (SGD). We use synchronized SGD
over 4 GPUs with a total of 64 pairs per minibatch (16 pairs per GPU), which takes 48
hours to converge. We train for a total of 20 epochs, with a warmup learning rate increasing
from 0.001 to 0.005 in the first 5 epochs and a learning rate exponentially decayed from
0.005 to 0.00005 in the last 15 epochs. Weight decay of 0.0005 and momentum of 0.9 are used.
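The learning-rate schedule just described can be written as a small function. The sketch assumes that the rate is updated once per epoch, that the warmup is linear, and that both phases include their end points; none of these details is stated in the text, and the function name is hypothetical.

```python
def learning_rate(epoch, warmup_epochs=5, total_epochs=20,
                  lr_start=0.001, lr_peak=0.005, lr_end=0.00005):
    """Learning rate for a 0-based epoch index.

    Linear warmup from lr_start to lr_peak, then exponential (log-linear)
    decay from lr_peak to lr_end over the remaining epochs.
    """
    if epoch < warmup_epochs:
        frac = epoch / (warmup_epochs - 1)
        return lr_start + frac * (lr_peak - lr_start)
    decay_epochs = total_epochs - warmup_epochs
    frac = (epoch - warmup_epochs) / (decay_epochs - 1)
    return lr_peak * (lr_end / lr_peak) ** frac


schedule = [learning_rate(epoch) for epoch in range(20)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Under these assumptions the rate is multiplied by a constant factor between consecutive decay epochs, which is what makes the decay exponential.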
The training loss is the sum of the classification loss (focal loss) and the standard
smooth $L_1$ loss for regression. During the inference phase, the sizes of the search region in
the short-term phase and in defined failure cases are set to 255 and 832, respectively. The
thresholds to enter and leave failure cases are set to 0.825 and 0.996. The code is
implemented in Python using PyTorch, based on PySOT.
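One way to read the failure handling is as a two-threshold state machine combined with a growing search region. The sketch below assumes that the tracker enters the failure state when the best score drops below 0.825 and leaves it when the score exceeds 0.996; the growth step of 64 pixels is a hypothetical value, since the step size is not given, and the function name is introduced here.

```python
def update_search_state(score, in_failure, search_size,
                        enter_thr=0.825, leave_thr=0.996,
                        local_size=255, global_size=832, step=64):
    """Update the failure flag and the search-region size for one frame."""
    if not in_failure and score < enter_thr:
        in_failure = True       # confidence too low: start expanding the search
    elif in_failure and score > leave_thr:
        in_failure = False      # target confidently re-found: back to local search
    if in_failure:
        search_size = min(search_size + step, global_size)
    else:
        search_size = local_size
    return in_failure, search_size


in_failure, search_size = False, 255
sizes = []
for score in [0.99, 0.80, 0.90, 0.95, 0.997, 0.99]:
    in_failure, search_size = update_search_state(score, in_failure, search_size)
    sizes.append(search_size)
```

Using a leave threshold that is much stricter than the enter threshold prevents the tracker from switching back to the small search region on a weak, possibly spurious response.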
4.2. Evaluation 
The Anti-UAV workshop (https://anti-uav.github.io) presents a benchmark dataset and
evaluation methodology for detecting and tracking UAVs. The test-dev dataset consists of
100 high-quality infrared video sequences, spanning multiple occurrences of multi-scale
UAVs. The dataset poses several challenges: varying sizes, varying aspect ratios, motion blur,
fast motion, and indistinguishable backgrounds (figure 2).
Figure 2. Anti-UAV Dataset (https://anti-uav.github.io/dataset/). 
We use the data to evaluate our algorithm. The tracking average accuracy score (acc) is
utilized for evaluation. The acc is defined as:
$acc = \frac{1}{T} \sum_{t=1}^{T} \left[ IoU_t \cdot \delta(v_t > 0) + p_t \cdot \big(1 - \delta(v_t > 0)\big) \right]$ (10)
At frame t, $IoU_t$ is the IoU between the corresponding ground truth and tracking
boxes, and $v_t$ is the visibility flag of the ground truth. If the target exists in the current
frame, $\delta(v_t > 0) = 1$. When the target does not exist in the frame, $\delta(v_t > 0) = 0$;
in that case, if the tracker’s prediction is empty, $p_t = 1$, otherwise $p_t = 0$.
The accuracy is averaged over all T frames. Our acc score is
calculated according to the average results on the 100 IR videos. Table 2 shows
that our approach achieves the highest average accuracy among the compared trackers,
slightly above SiamRPN++ [6] (0.654 versus 0.648). The results also show that the released
models of SiamMask [7] and SiamBAN [8] are not well suited to long-term tracking datasets
such as Anti-UAV. The code of SiamFC [1] is the PyTorch version and the
model is provided by the Anti-UAV organizers. For SiamMask [7] and SiamRPN++ [6], we use
the code and models released at https://github.com/STVIR/pysot. For SiamBAN [8], we
use the code and models released at https://github.com/hqucv/siamban. Results of
ATOM [16], DiMP [25], and SiamDW-LT [26] are taken from [20].
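The acc metric of Eq. 10 can be computed for one sequence with the sketch below. The representation is an assumption made for the example: boxes are (x, y, w, h) tuples, an absent target or an empty prediction is encoded as `None`, and the function names are hypothetical rather than taken from the official evaluation toolkit.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def tracking_accuracy(predictions, ground_truths):
    """Average per-frame score over one sequence.

    ground_truths[t] is None when the target is absent (visibility flag 0);
    predictions[t] is None when the tracker reports an empty prediction.
    """
    total = 0.0
    for pred, gt in zip(predictions, ground_truths):
        if gt is not None:
            # Target visible: the frame scores the IoU (0 for an empty prediction).
            total += iou(pred, gt) if pred is not None else 0.0
        else:
            # Target absent: the frame scores 1 only for an empty prediction.
            total += 1.0 if pred is None else 0.0
    return total / len(ground_truths)


predictions = [(0, 0, 10, 10), None, (5, 5, 10, 10), (0, 0, 10, 10)]
ground_truths = [(0, 0, 10, 10), None, None, (5, 0, 10, 10)]
acc = tracking_accuracy(predictions, ground_truths)
```

The metric therefore rewards a tracker both for overlapping the target when it is visible and for correctly reporting nothing when the target has left the view.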
Table 2. Benchmark results. 
Tracker Name acc 
SiamBAN[8] 0.390 
SiamMask[7] 0.403 
SiamFC[1] 0.420 
ATOM[16] 0.5322 
DiMP[25] 0.5507 
SiamDW-LT[26] 0.6379 
SiamRPN++[6] 0.648 
Ours 0.654 
4.3. Visual Results 
To visualize the performance of our tracker, we provide some representative results on
frames from the Anti-UAV dataset. As shown in figure 3, each row represents a video
sequence. The green box denotes our result, and the yellow box denotes the
ground truth.
[Figure 3 image: five rows of frames, one per video sequence: 20190925_131530_1_5, 20190925_213001_1_4, 20190926_133516_1_7, 20190926_183400_1_3, 20190926_193515_1_8.]
Figure 3. Visual results of our tracker on 5 videos, all the data from Anti-UAV infrared 
dataset. The yellow box denotes ground truth, the green box denotes ours. 
5. CONCLUSION 
In this paper, a Siamese network-based tracker for UAVs in infrared videos is proposed,
which consists of a ResNet50 backbone and three RPN heads for classification and
regression. Experiments on the Anti-UAV dataset show that the proposed infrared tracking
algorithm is robust to the challenges in real infrared scenes with high efficiency. Our
approach achieves an acc of 0.654. Compared with other methods such as ATMF [20], the
model still needs improvement; in future work, we plan to apply additional strategies in
failure cases, such as random search or a classifier.
REFERENCES 
[1]. L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. “Fully-
convolutional siamese networks for object tracking”. In ECCV Workshops, 2016. 
[2]. B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. “High performance visual tracking with 
siamese region proposal network”. In CVPR, 2018. 
[3]. Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. “Distractor-aware siamese 
networks for visual object tracking”. In ECCV, 2018. 
[4]. H. Fan and H. Ling. “Siamese cascaded region proposal networks for real-time visual 
tracking”. In CVPR, 2019. 
[5]. Z. Zhang, H. Peng, and Q. Wang. “Deeper and wider siamese networks for real-time
visual tracking”. In CVPR, 2019.
[6]. Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan.
“SiamRPN++: Evolution of siamese visual tracking with very deep networks”. In
CVPR, 2019.
[7]. Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. “Fast 
online object tracking and segmentation: A unifying approach”. In CVPR, 2019. 
[8]. Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, and Rongrong Ji.
“SiamBAN: Siamese Box Adaptive Network for Visual Tracking”. In
CVPR, 2020.
[9]. S. Ren, K. He, R. Girshick, and J. Sun. “Faster R-CNN: Towards real-time object
detection with region proposal networks”. In NIPS, 2015.
[10]. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollar. “Focal Loss for 
Dense Object Detection”. In ICCV, 2017. 
[11]. H. Nam and B. Han. “Learning multi-domain convolutional neural networks for 
visual tracking”. In CVPR, 2016. 
[12]. K. Dai, D. Wang, H. Lu, C. Sun, and J. Li. “Visual tracking via adaptive spatially-
regularized correlation filters”. In CVPR, 2019. 
[13]. A. S. Tripathi, M. Danelljan, L. Van Gool, and R. Timofte. “Tracking the known and 
the unknown by leveraging semantic information”. In BMVC, 2019. 
[14]. M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. “Beyond correlation filters: 
Learning continuous convolution operators for visual tracking”. In ECCV, 2016. 
[15]. M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. “ECO: Efficient convolution 
operators for tracking”. In CVPR, 2017. 
[16]. M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. “ATOM: accurate tracking by 
overlap maximization”. In CVPR, 2019. 
[17]. L. Huang, X. Zhao, and K. Huang. “Bridging the gap between detection and tracking: 
A unified approach”. In ICCV, 2019. 
[18]. Aybora Koksal, Kutalmis Gokalp Ince, A. Aydin Alatan. “Effect of Annotation Errors 
on Drone Detection with YOLOv3”. In CVPR, 2020. 
[19]. Joseph Redmon and Ali Farhadi. “Yolov3: An incremental improvement”. CoRR, 
abs/1804.02767, 2018. 
[20]. Chunhui Zhang, Haolin Liu, Tianyang Xu, Yong Wang, Shiming Ge. “ATMF: 
Accurate Tracking by Multi-Modal Fusion”. In CVPR, 2020. 
[21]. K. He, X. Zhang, S. Ren, and J. Sun. “Deep residual learning for image recognition”. 
In CVPR, 2016. 
[22]. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. 
Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. “ImageNet Large 
Scale Visual Recognition Challenge”. IJCV, 2015. 
[23]. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva 
Ramanan, Piotr Dollar, and C Lawrence Zitnick. “Microsoft COCO: Common objects 
in context”. In ECCV, 2014. 
[24]. Lianghua Huang, Xin Zhao, and Kaiqi Huang. “GOT-10k: A large high-diversity
benchmark for generic object tracking in the wild”. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2019.
[25]. Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. “Learning
discriminative model prediction for tracking”. In IEEE International Conference on
Computer Vision, 2019.
[26]. Zhipeng Zhang, Houwen Peng. “Deeper and Wider Siamese Networks for Real-Time 
Visual Tracking”. In CVPR, 2019. 
SUMMARY
TRACKING UAV MOTION IN INFRARED VIDEOS
USING SIAMESE NETWORKS
Siamese network-based trackers have achieved high performance in visual object
tracking. Several Siamese networks evaluated on large object-tracking datasets
achieve state-of-the-art performance, confirming their effectiveness and efficiency.
In this paper, we study state-of-the-art Siamese networks and then propose a model
based on the Siamese architecture to track UAVs in the Anti-UAV Challenge dataset,
which consists of 100 infrared videos. The network architecture uses a pre-trained
network such as ResNet50, depth-wise cross-correlation, and focal loss for
classifying UAV versus background. Experiments on the Anti-UAV infrared dataset
show the robustness of the model to the various challenges of real infrared scenes,
with high efficiency.
Keywords: UAV tracking; Siamese network; Object tracking; Deep learning.
Received 3rd August 2020
Revised 5th October 2020
Published 5th October 2020
Author affiliations:
¹ Military Information Technology Institute, Academy of Military Science and Technology;
² University of Engineering and Technology, Vietnam National University.
* Email: hoangdinhthang@gmail.com.