Tom Bocklisch edited this page Apr 28, 2015 · 10 revisions

State of the art video classification

| Method | UCF-101 accuracy (3-fold, %) | Notes |
|---|---|---|
| LRCN+CNN (Donahue) | 82.92 | Weighted average of the RGB (1/3) and Flow (2/3) networks. The LRCN is applied after the first fully connected CNN layer. |
| Two-stream CNN (Simonyan), poster | 88.0 | Temporal + spatial ConvNets, fused using an SVM. Multi-task learning for the temporal ConvNet. The spatial ConvNet is pre-trained on ILSVRC-2012 and fine-tuned on the last layer only. |
| LSTM + 30-frame unroll (Yue-Hei Ng) | 88.6 | Optical flow + image frames. Frames are processed at 1 FPS, with motion information supplied through flow. Re-uses GoogLeNet. The LSTM performed better than the feature-pooling architecture. |
| Evaluating Two-Stream CNN (Ye) | 87.7 | TODO |
| Slow Fusion (Karpathy) | 65.4 | Trained on 1M sports videos first, then applied transfer learning. Uses multiresolution CNNs (fovea and context streams) and slow fusion. |
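
The LRCN+CNN entry combines its RGB and Flow networks through a weighted average with weights 1/3 and 2/3. As a rough illustration of that kind of late fusion, here is a minimal sketch in plain Python. The function names (`fuse_scores`, `predict`) and the toy score vectors are hypothetical and are not taken from any of the papers' code; it assumes each network outputs one score per class, in the same class order.

```python
def fuse_scores(rgb_scores, flow_scores, rgb_weight=1 / 3, flow_weight=2 / 3):
    """Weighted average of per-class scores from two networks.

    The default weights mirror the 1/3 (RGB) and 2/3 (Flow) split noted
    in the table; everything else here is an illustrative assumption.
    """
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("score vectors must have the same length")
    return [
        rgb_weight * r + flow_weight * f
        for r, f in zip(rgb_scores, flow_scores)
    ]


def predict(rgb_scores, flow_scores):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(rgb_scores, flow_scores)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical softmax outputs for a 3-class toy problem.
rgb = [0.6, 0.3, 0.1]
flow = [0.2, 0.7, 0.1]

fused = fuse_scores(rgb, flow)
label = predict(rgb, flow)
print(fused, label)
```

In this toy case the RGB network alone would pick class 0, but the heavier weight on the Flow network moves the fused prediction to class 1.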
