• https://arxiv.org/pdf/1812.01717

  • An accurate generative model of videos captures the data distribution from which the observed data was generated.

  • A suitable feature representation for videos needs to consider

    • the temporal coherence of the visual content across a sequence of frames,
    • in addition to its visual presentation at any given point in time.

  • Similar to FID, but with features extracted by a pre-trained Inflated 3D ConvNet (I3D) instead of a 2D Inception network

    • The I3D network generalizes the Inception architecture to sequential data; it is trained to perform action recognition on the Kinetics dataset of human-centered YouTube videos
    • Action-recognition can be viewed as a temporal extension of image classification, requiring visual context and temporal evolution to be considered simultaneously
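Like FID, the resulting metric fits a Gaussian to the real and generated feature sets and computes the 2-Wasserstein (Fréchet) distance between them: ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^½). A minimal sketch of that computation, using random arrays as stand-ins for the I3D activations (in practice these would come from the pre-trained network):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits of two feature sets.

    feats_* : arrays of shape (num_videos, feature_dim), e.g. pooled
    I3D activations (here replaced by random stand-ins).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; small imaginary
    # components can appear from numerical error, so keep the real part.
    covmean = linalg.sqrtm(sigma_a @ sigma_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sigma_a + sigma_b - 2.0 * covmean))

# Stand-in "features": a shifted distribution should score worse
# against the reference than the reference does against itself.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(256, 16))
fake = rng.normal(0.5, 1.0, size=(256, 16))
print(frechet_distance(real, real))  # near zero
print(frechet_distance(real, fake))  # strictly larger
```

Swapping the Inception features of FID for I3D features is the only conceptual change; the distance computation itself is identical.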