- An accurate generative model of videos captures the data distribution from which the observed data was generated.
- A suitable feature representation for videos needs to consider
  - the temporal coherence of the visual content across a sequence of frames,
  - in addition to its visual presentation at any given point in time.
- FVD is computed like FID, but uses a pre-trained Inflated 3D ConvNet (I3D) as the feature extractor instead of an image classifier.
  - The I3D network generalizes the Inception architecture to sequential data, and is trained to perform action recognition on the Kinetics dataset, which consists of human-centered YouTube videos.
  - Action recognition can be viewed as a temporal extension of image classification, requiring visual context and temporal evolution to be considered simultaneously.
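The FID-style distance mentioned above fits a Gaussian to the feature activations of real and generated samples and compares the two Gaussians with the 2-Wasserstein (Fréchet) distance. A minimal sketch of that computation, with random arrays standing in for I3D activations (an actual FVD implementation would obtain these by running video clips through the pre-trained network):

```python
import numpy as np
from scipy.linalg import sqrtm

def feature_stats(feats):
    """Mean and covariance of an (num_videos, feature_dim) activation matrix."""
    mu = feats.mean(axis=0)
    sigma = np.cov(feats, rowvar=False)
    return mu, sigma

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2))."""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

rng = np.random.default_rng(0)
# Hypothetical stand-ins for I3D feature activations of real and generated videos.
real_feats = rng.normal(0.0, 1.0, size=(500, 16))
fake_feats = rng.normal(0.5, 1.0, size=(500, 16))
score = frechet_distance(*feature_stats(real_feats), *feature_stats(fake_feats))
```

A lower score means the generated feature distribution is closer to the real one; identical statistics give a distance of zero.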