Real Time Video Neural Style Transfer

The implementation of the model can be found in the PyTorch repository here.

Training

An overview of the model architecture and training process (Image source)

The goal of style transfer is to optimize the output image so that its style resembles a designated "style target" image (e.g. a Pablo Picasso cubist painting) while its content resembles that of the input image (e.g. the frame of the video being stylized). The training loss is therefore computed in two separate parts, a content loss and a style loss, both of which are extracted from a convolutional neural network that is pre-trained for image classification.

The content loss measures how similar the content of the style transfer network's output image (labeled y^ in the image above) is to the content of the input image, or "content target" (y_c). Because image classification networks are forced to learn high-level, abstract representations, i.e. the "content" of images, in order to classify them accurately, a representation of an image's content can be read directly from the layer activations of a pre-trained classifier. The pre-trained network used in this implementation is VGG-16. Deeper layers in a classification network learn progressively higher-level features, so the content loss is defined as the Euclidean distance between the activations of layer j of VGG-16 when y^ is passed through and when y_c is passed through. The choice of j (which layer to extract the content features from) is up to the designer; the deeper the chosen layer, the more abstract the output image will look.

Similarly, albeit slightly more complex, the style loss between the "style target" (y_s) and the network's output y^ is also computed as a distance between features extracted from VGG-16 layer activations. The main difference is that instead of comparing the layer activations directly, we first transform them into a Gram matrix. The Gram matrix of a set of vectors is the matrix of all pairwise inner products between them; here the vectors are the flattened channel feature maps of a layer activation. Without going into too many gory details of the math, the Gram matrix captures which features of the VGG layer activation occur together while discarding the information about where those features appear spatially within the image. It represents the general substance of the image without regard to where particular objects or elements are located; in other words, the Gram matrix represents an image's style. The style loss is then the (Frobenius) distance between the Gram matrix of VGG's layer i activation when y^ is passed through and the Gram matrix of the same layer's activation when y_s is passed through. In practice, the style loss is computed as the sum of style losses over several layers i of VGG, as represented in the diagram above.

The network is trained to simultaneously minimize the content loss and the style loss; a minimal sketch of both losses is given below.
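To make the two losses concrete, here is a minimal PyTorch sketch of how they can be computed with torchvision's pre-trained VGG-16. The layer indices, the loss weights, and the use of mean-squared error as the (squared, normalized) Euclidean/Frobenius distance are illustrative choices in the spirit of Johnson et al., not necessarily those of the linked implementation; the sketch also assumes the style image y_s has already been resized to the frame size and that ImageNet normalization is handled elsewhere.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# VGG-16 pre-trained on ImageNet is used purely as a frozen loss network;
# only its convolutional feature extractor is needed.
vgg = vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

# Indices into vgg16.features whose activations are compared.
# relu2_2 for content and relu1_2..relu4_3 for style are illustrative
# choices; the actual repository may pick different layers.
CONTENT_LAYER = 8                 # relu2_2
STYLE_LAYERS = [3, 8, 15, 22]     # relu1_2, relu2_2, relu3_3, relu4_3


def extract_features(x, layers):
    """Run x through VGG-16 and collect activations at the given layer indices."""
    feats = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            feats[i] = x
    return feats


def gram_matrix(feat):
    """Gram matrix of a (batch, channels, height, width) activation:
    all pairwise inner products between flattened channel feature maps,
    normalized by the number of elements per channel."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)


def perceptual_loss(y_hat, y_c, y_s, content_weight=1.0, style_weight=1e5):
    """Content loss + style loss as described above.
    y_hat: output of the style transfer network
    y_c:   content target (the input video frame)
    y_s:   style target image (batch size 1, broadcast to the batch)"""
    layers = set(STYLE_LAYERS) | {CONTENT_LAYER}
    f_hat = extract_features(y_hat, layers)
    f_c = extract_features(y_c, layers)
    f_s = extract_features(y_s.expand_as(y_hat), layers)

    # Content loss: Euclidean (MSE) distance between layer-j activations.
    content_loss = F.mse_loss(f_hat[CONTENT_LAYER], f_c[CONTENT_LAYER])

    # Style loss: sum over layers of the Frobenius (MSE) distance
    # between Gram matrices of y^ and y_s.
    style_loss = sum(
        F.mse_loss(gram_matrix(f_hat[i]), gram_matrix(f_s[i]))
        for i in STYLE_LAYERS
    )
    return content_weight * content_loss + style_weight * style_loss
```

In a training loop, this combined loss is back-propagated only into the style transfer network's parameters; VGG-16 stays frozen throughout and serves solely to define the content and style distances.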
