SlowFast Explained: Dual-mode CNN for Video Understanding

Because of its smaller channel size, the Fast pathway requires about 4x less compute than the Slow pathway despite having a higher temporal frequency.

An example instantiation of the SlowFast network. The dimensions of kernels are denoted by {T×S², C} for temporal (T), spatial (S), and channel (C) sizes. Strides are denoted as {temporal stride, spatial stride²}. The speed ratio (frame skipping rate) is α = 8 and the channel ratio is β = 1/8. τ is 16. The green colors mark higher temporal resolution, and orange colors mark fewer channels, for the Fast pathway. The higher temporal resolution of the Fast pathway can be observed in the data layer row, while the smaller channel size can be observed in the conv1 row and afterward in the residual stages. Residual blocks are shown by brackets. The backbone is ResNet-50. (Image & Description from SlowFast)

High-level illustration of the SlowFast network with parameters (Image: SlowFast)

Lateral Connections

As shown in the visual illustration, data from the Fast pathway is fed into the Slow pathway via lateral connections throughout the network, allowing the Slow pathway to become aware of the results from the Fast pathway.

The shape of a single data sample differs between the two pathways: Fast is {αT, S², βC} while Slow is {T, S², C}. SlowFast therefore has to transform the output of the Fast pathway before fusing it into the Slow pathway by summation or concatenation.

The paper suggests three techniques for this data transformation, with the third one proving in practice to be the most effective:

Time-to-channel: Reshaping and transposing {αT, S², βC} into {T, S², αβC}, meaning packing all α frames into the channels of one frame.

Time-strided sampling: Simply sampling one out of every α frames, so {αT, S², βC} becomes {T, S², βC}.

Time-strided convolution: Performing a 3D convolution with a 5×1² kernel, 2βC output channels, and stride = α.

Interestingly, the researchers found that bidirectional lateral connections, i.e., also feeding the Slow pathway into the Fast pathway, do not improve performance.

Combining the pathways

At the end of each pathway, SlowFast performs Global Average Pooling, a standard operation intended to reduce dimensionality.
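To make the shape bookkeeping of the three lateral-connection transformations and the final pooling step concrete, here is a minimal sketch in plain Python. It only tracks shapes and does not compute on real tensors. The names Shape, fuse, and global_average_pool are hypothetical helpers introduced for illustration, the example sizes (T = 4, S = 56, C = 256) are illustrative, and the temporal padding of kernel_t // 2 in the convolution is an assumption chosen so that the output length equals T.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    """Feature shape {t, s^2, c}; a hypothetical helper for this sketch."""
    t: int  # temporal length (number of frames)
    s: int  # spatial side length (feature map is s x s)
    c: int  # number of channels


def time_to_channel(fast: Shape, alpha: int) -> Shape:
    # Pack every alpha consecutive frames into the channel axis:
    # {alpha*T, S^2, beta*C} -> {T, S^2, alpha*beta*C}
    assert fast.t % alpha == 0, "temporal length must be divisible by alpha"
    return Shape(fast.t // alpha, fast.s, fast.c * alpha)


def time_strided_sampling(fast: Shape, alpha: int) -> Shape:
    # Keep one frame out of every alpha; channels are unchanged:
    # {alpha*T, S^2, beta*C} -> {T, S^2, beta*C}
    return Shape(len(range(0, fast.t, alpha)), fast.s, fast.c)


def time_strided_conv(fast: Shape, alpha: int, kernel_t: int = 5) -> Shape:
    # 5x1x1 convolution with temporal stride alpha and 2*beta*C output channels.
    # Assumption: temporal padding of kernel_t // 2 (not stated in the text).
    pad = kernel_t // 2
    t_out = (fast.t + 2 * pad - kernel_t) // alpha + 1
    return Shape(t_out, fast.s, 2 * fast.c)


def fuse(slow: Shape, lateral: Shape, mode: str = "concat") -> Shape:
    # Fusion needs matching temporal and spatial sizes.
    assert (slow.t, slow.s) == (lateral.t, lateral.s), "shape mismatch"
    if mode == "sum":
        assert slow.c == lateral.c, "summation needs equal channel counts"
        return slow
    return Shape(slow.t, slow.s, slow.c + lateral.c)  # channel concatenation


def global_average_pool(x: Shape) -> Shape:
    # Averages over time and space, leaving one value per channel.
    return Shape(1, 1, x.c)


alpha = 8                                            # speed ratio
slow = Shape(t=4, s=56, c=256)                       # {T, S^2, C}
fast = Shape(t=alpha * slow.t, s=56, c=slow.c // 8)  # {alpha*T, S^2, beta*C}, beta = 1/8

for fn in (time_to_channel, time_strided_sampling, time_strided_conv):
    lateral = fn(fast, alpha)
    print(fn.__name__, lateral, "-> fused:", fuse(slow, lateral))

print("pooled Slow:", global_average_pool(slow))
print("pooled Fast:", global_average_pool(fast))
```

All three transformations bring the Fast features down to the temporal length T of the Slow pathway, which is what makes fusion possible. Only time-to-channel produces a channel count equal to C when αβ = 1, so it is the only one of the three that could be fused by summation without a further projection in this setting.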
