AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures

Learning to represent videos is a very challenging task, both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to include the time dimension, using modules such as 3D convolutions, or by using a two-stream design to capture both appearance and motion in videos. We interpret a video CNN as a collection of multi-stream convolutional blocks connected to each other, and propose the approach of automatically finding neural architectures with better connectivity and spatio-temporal interactions for video understanding. This is done by evolving a population of overly-connected architectures guided by connection weight learning. We search for architectures that combine representations abstracting different input types (i.e., RGB and optical flow) at multiple temporal resolutions, allowing different types or sources of information to interact with each other. Our method, referred to as AssembleNet, outperforms prior approaches on public video datasets, in some cases by a great margin. We obtain 58.6% mAP on Charades and 34.27% accuracy on Moments-in-Time.
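
The following is a minimal sketch, not the authors' code, of the core idea described above: a convolutional block that receives several candidate input streams (e.g., features derived from RGB and optical flow) and learns scalar connection weights that decide how strongly each stream feeds into it. The names (ConnectedBlock, connection_logits), the choice of PyTorch, and the sigmoid gating are illustrative assumptions; the paper's exact formulation of connection weight learning may differ.

```python
import torch
import torch.nn as nn


class ConnectedBlock(nn.Module):
    """A block with learnable connection weights over multiple input streams."""

    def __init__(self, num_inputs: int, channels: int):
        super().__init__()
        # One learnable logit per incoming stream; training can effectively
        # prune weak connections by driving their weights toward zero.
        self.connection_logits = nn.Parameter(torch.zeros(num_inputs))
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm3d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, streams):
        # streams: list of tensors of shape (N, C, T, H, W), one per input stream.
        weights = torch.sigmoid(self.connection_logits)
        fused = sum(w * s for w, s in zip(weights, streams))
        return self.act(self.norm(self.conv(fused)))


# Usage: fuse an RGB-derived and a flow-derived feature map of matching shape.
rgb_feat = torch.randn(2, 32, 8, 28, 28)
flow_feat = torch.randn(2, 32, 8, 28, 28)
block = ConnectedBlock(num_inputs=2, channels=32)
out = block([rgb_feat, flow_feat])  # shape: (2, 32, 8, 28, 28)
```

In a search setting, the learned connection weights can guide which edges of an overly-connected candidate architecture are kept or dropped during evolution; how that guidance is implemented here is only a sketch under the stated assumptions.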