Video action anticipation has become popular in many domains in recent years because of its wide applications in AR/VR, robot imitation learning, and autonomous driving.
This project explores a Siamese network built on the Temporal Shift Module (TSM) to recognize actions and to anticipate the next action in an egocentric video. In this work, we mainly focus on video action recognition. A subset of the popular EPIC-Kitchens dataset is used to evaluate our method. Object masks are fused with RGB frames to improve action recognition accuracy, which yields a 6.25% increase in top-1 test accuracy compared with using RGB inputs alone.
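To make the mask fusion concrete, the sketch below shows one common way such a fusion can be realized: concatenating a one-channel object mask with each RGB frame and widening the first convolution of a 2D backbone (as used by TSM) to accept the extra channel. Tensor shapes, channel counts, and layer parameters here are illustrative assumptions, not the exact implementation used in this work.

```python
# Hedged sketch of RGB + object-mask fusion (assumed channel concatenation).
import torch
import torch.nn as nn

# Example clip: batch=1, T=8 frames, 3 RGB channels, 224x224 resolution.
rgb_frames = torch.rand(1, 8, 3, 224, 224)
# Binary object masks for the same frames (1 channel per frame).
object_masks = torch.randint(0, 2, (1, 8, 1, 224, 224)).float()

# Channel-wise fusion: each frame becomes a 4-channel (RGB + mask) image.
fused = torch.cat([rgb_frames, object_masks], dim=2)  # (1, 8, 4, 224, 224)

# A 2D backbone consumes the fused frames; its first convolution must be
# widened from 3 to 4 input channels to accept the mask channel.
first_conv = nn.Conv2d(in_channels=4, out_channels=64, kernel_size=7,
                       stride=2, padding=3, bias=False)
out = first_conv(fused.flatten(0, 1))  # merge batch and time for the 2D conv
print(out.shape)  # torch.Size([8, 64, 112, 112])
```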