Kumara Kahatapitiya

I am a PhD candidate at Stony Brook University, working with Prof. Michael S. Ryoo. My primary research focus is on video understanding. More recently, I have also been working on video-language models, vision transformers, and video diffusion models.

During my PhD, I was an intern at Qualcomm AI Research, Google Brain, and Wormpex AI Research. Prior to this, I was a Research Assistant at the University of Moratuwa, Sri Lanka, advised by Dr. Ranga Rodrigo, where I also received my Bachelor's in Electronic & Telecommunication Engineering.

[Google Scholar]    [GitHub]    [Twitter]

Recent News
[Mar 2024] Language Repository and MVU for Long Video Understanding are now on arXiv.
[Feb 2024] Video-conditioned Text Representations for activity recognition was accepted at CVPR 2024.
[Jan 2024] Object-Centric Diffusion for Efficient Video Editing is now on arXiv.
[Oct 2023] Grafting Vision Transformers for multi-scale and global information sharing was accepted at WACV 2024.
[Jul 2023] I joined Qualcomm AI Research, Amsterdam as a research intern.
[Apr 2023] SWAT, a structure-aware family of token-based models was accepted at IJCAI 2023.
[Feb 2023] Token Turing Machines for long-term memory in Transformers was accepted at CVPR 2023.
[Dec 2022] Weakly-guided Self-supervised detection pretraining was accepted at AAAI 2023.
[Sep 2022] StARformer extended to real-world robot environments was accepted at TPAMI.
[Jul 2022] StARformer with an MDP-like inductive bias for RL was accepted at ECCV 2022.
[Mar 2022] MS-TCT for temporal action detection with CNN+Transformer embeddings was accepted at CVPR 2022.
[Feb 2022] I joined Robotics at Google as a student researcher.
[Dec 2021] I was a finalist (1/30) for the Adobe Research Fellowship 2022. Congratulations to all the winners!
[Dec 2021] Swift for real-time neural video decoding was accepted at NSDI 2022.
[Sep 2021] I am officially a PhD candidate now!
[Mar 2021] Coarse-Fine Networks for efficient temporal activity detection was accepted at CVPR 2021.
[Jan 2021] Exploiting Redundancy in CNNs for parameter reduction was accepted at WACV 2021.
Pre-prints
Language Repository for Long Video Understanding
Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S. Ryoo
arXiv 2024
[arxiv] [code]

Understanding Long Videos in One Multimodal Language Model Pass
Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo
arXiv 2024
[project page] [arxiv] [code]

Object-Centric Diffusion for Efficient Video Editing
Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Yuki M. Asano, Fatih Porikli, Amirhossein Habibian
arXiv 2024
[project page] [arxiv]

Selected Publications
VicTR: Video-conditioned Text Representations for Activity Recognition
Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo
CVPR 2024
[paper]

Grafting Vision Transformers
Jongwoo Park, Kumara Kahatapitiya, Donghyun Kim, Shivchander Sudalairaj, Quanfu Fan, Michael S. Ryoo
WACV 2024
[paper] [poster]

SWAT: Spatial Structure Within and Among Tokens
Kumara Kahatapitiya, Michael S. Ryoo
IJCAI 2023
[paper] [code] [slides]

Token Turing Machines
Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab
CVPR 2023
[paper] [code] [teaser]

Weakly-guided Self-supervised Pretraining for Temporal Activity Detection
Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Michael S. Ryoo, Gang Hua
AAAI 2023
[paper] [code] [talk] [poster]

StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning
Jinghuan Shang, Kumara Kahatapitiya, Xiang Li, Michael S. Ryoo
ECCV 2022, TPAMI
[paper] [journal] [code] [talk] [poster]

MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, Francois Bremond
CVPR 2022
[paper] [code] [poster]

Swift: Adaptive Video Streaming with Layered Neural Codecs
Mallesham Dasari, Kumara Kahatapitiya, Samir Das, Aruna Balasubramanian, Dimitris Samaras
NSDI 2022
[paper] [code] [slides]

Coarse-Fine Networks for Temporal Activity Detection in Videos
Kumara Kahatapitiya, Michael S. Ryoo
CVPR 2021
[paper] [code] [talk] [poster]

Exploiting the Redundancy in Convolutional Filters for Parameter Reduction
Kumara Kahatapitiya, Ranga Rodrigo
WACV 2021
[paper] [code] [talk]

Feature-dependent Cross-Connections in Multi-Path Neural Networks
Dumindu Tissera, Kasun Vithanage, Rukshan Wijesinghe, Kumara Kahatapitiya, Subha Fernando, Ranga Rodrigo
ICPR 2020
[paper]

Context-Aware Automatic Occlusion Removal
Kumara Kahatapitiya, Dumindu Tissera, Ranga Rodrigo
ICIP 2019
[paper] [code]

Other Projects

  • X3D-Multigrid [code]
    A PyTorch implementation of "X3D: Expanding Architectures for Efficient Video Recognition" [CVPR2020] combined with "A Multigrid Method for Efficiently Training Video Models" [CVPR2020]. In contrast to the original repository by FAIR, this repository provides a simpler, less modular, and more familiar implementation structure for faster and easier adoption.
  • Optimal Transport in NumPy [code]
    This repository contains a few optimal transport algorithms implemented using NumPy, including "A Direct O(1/epsilon) Iteration Parallel Algorithm for Optimal Transport" [NeurIPS2019], "Computational Optimal Transport: Complexity by Accelerated Gradient Descent is better than by Sinkhorn's Algorithm" [PMLR2018] and "Lightspeed Computation of Optimal Transport" [NeurIPS2013]; a minimal Sinkhorn sketch follows this list.
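For a flavor of what the optimal transport repository covers, here is a minimal NumPy sketch of entropy-regularized OT solved with Sinkhorn iterations (in the spirit of "Lightspeed Computation of Optimal Transport"). This is an illustrative sketch, not the repository's actual code; the function name, parameter choices, and toy data are my own assumptions.

    import numpy as np

    def sinkhorn(a, b, C, reg=0.05, n_iters=500):
        """Entropy-regularized OT between histograms a and b with cost matrix C.
        Illustrative sketch only; returns the transport plan P of shape (len(a), len(b))."""
        K = np.exp(-C / reg)           # Gibbs kernel
        u = np.ones_like(a)
        for _ in range(n_iters):
            v = b / (K.T @ u)          # column scaling to match marginal b
            u = a / (K @ v)            # row scaling to match marginal a
        return u[:, None] * K * v[None, :]

    # Toy usage: transport between two random histograms on a 1-D grid.
    rng = np.random.default_rng(0)
    a = rng.random(50); a /= a.sum()
    b = rng.random(50); b /= b.sum()
    x = np.linspace(0, 1, 50)
    C = (x[:, None] - x[None, :]) ** 2     # squared-distance cost
    P = sinkhorn(a, b, C)
    print(P.sum(), (P * C).sum())          # total mass ~1, approximate OT cost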
Teaching


Thanks to Jon Barron for the template.