UVid-Net: Enhanced Semantic Segmentation of UAV Aerial Videos by Embedding Temporal Information

Semantic segmentation of aerial videos has been extensively used for decision-making in monitoring environmental changes, urban planning, and disaster management. The reliability of these decision-support systems depends on the accuracy of the video semantic segmentation algorithms. Existing CNN-based video semantic segmentation methods extend image semantic segmentation methods by incorporating an additional module, such as an LSTM or optical flow, to compute the temporal dynamics of the video, which introduces computational overhead. The proposed work instead modifies the CNN architecture itself to incorporate temporal information, improving the efficiency of video semantic segmentation.

In this work, an enhanced encoder-decoder based CNN architecture (UVid-Net) is proposed for UAV video semantic segmentation. The encoder of the proposed architecture embeds temporal information for temporally consistent labelling. The decoder is enhanced with a feature-refiner module, which aids in accurate localization of the class labels. The proposed UVid-Net architecture for UAV video semantic segmentation is quantitatively evaluated on the extended ManipalUAVid dataset, where an mIoU of 0.79 is observed, significantly higher than that of other state-of-the-art algorithms. Further, the proposed work produces promising results even when UVid-Net is pre-trained on urban street scenes and only the final layer is fine-tuned on UAV aerial videos.
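The reported metric, mean Intersection-over-Union (mIoU), averages, over all classes, the ratio of correctly labelled pixels to the union of predicted and ground-truth pixels for that class. The following is a minimal sketch of the standard computation on flattened label maps, not the authors' evaluation code; the function name and the choice to skip classes absent from both maps are illustrative assumptions:

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union between two flat lists of integer class labels.

    pred and target must have equal length; labels are in [0, num_classes).
    This is a generic sketch of the standard metric, not UVid-Net's exact code.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where class c is predicted and/or present in the ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue  # class absent from both maps; excluded from the mean
        ious.append(inter / union)
    return sum(ious) / len(ious)
```

For example, a prediction that matches the ground truth exactly yields an mIoU of 1.0, while each misclassified pixel lowers both the affected classes' IoU terms.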