|
VDT: General-purpose Video Diffusion Transformers via Mask Modeling
Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding
ICLR2024
[Project]
[PDF]
[Code]
[机器之心]
We introduce Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation.
|
|
UniAdapter: Unified parameter-efficient transfer learning for cross-modal modeling
Haoyu Lu, Yuqi Huo, Guoxing Yang, Zhiwu Lu, Wei Zhan, Masayoshi Tomizuka, Mingyu Ding
ICLR2024
[PDF]
[Code]
We propose UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models.
|
|
LGDN: Language-Guided Denoising Network for Video-Language Modeling
Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, Zhiwu Lu
NeurIPS2022, Spotlight
[PDF]
We propose the Language-Guided Denoising Network (LGDN), which performs language-guided key-frame selection for video-language modeling.
|
|
BMU-MoCo: Bidirectional momentum update for continual video-language modeling
Yizhao Gao, Nanyi Fei, Haoyu Lu, Zhiwu Lu, Hao Jiang, Yijie Li, Zhao Cao
NeurIPS2022, Spotlight
[PDF]
We propose a cross-modal MoCo-based continual learning algorithm with bidirectional momentum update (BMU).
|
|
Towards artificial general intelligence via a multimodal foundation model
Nanyi Fei, Zhiwu Lu, Yizhao Gao, Guoxing Yang, Yuqi Huo, Jingyuan Wen, Haoyu Lu, Ruihua Song, Xin Gao, Tao Xiang, Hao Sun, Ji-Rong Wen
Nature Communications
[PDF]
We develop a foundation model pre-trained on large-scale multimodal data, which can be quickly adapted to various downstream cognitive tasks.
|
|
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Haoyu Lu, Nanyi Fei, Yuqi Huo, Yizhao Gao, Zhiwu Lu, Ji-Rong Wen
CVPR2022
[PDF]
We propose COTS, a COllaborative Two-Stream vision-language pre-training model that improves cross-modal retrieval by enhancing cross-modal interaction.
|
|
Learning versatile neural architectures by propagating network codes
Mingyu Ding, Yuqi Huo, Haoyu Lu, Linjie Yang, Zhe Wang, Zhiwu Lu, Jingdong Wang, Ping Luo
ICLR2022
[PDF]
[Project]
We explore how to design a single neural network capable of adapting to multiple heterogeneous vision tasks, such as image segmentation, 3D detection, and video recognition.
|
|
Compressed video contrastive learning
Yuqi Huo, Mingyu Ding, Haoyu Lu, Nanyi Fei, Zhiwu Lu, Ji-Rong Wen, Ping Luo
NeurIPS2021
[PDF]
We propose Motion Vector-based Cross Guidance Contrastive Learning for self-supervised video representation learning.
|
|
Self-supervised video representation learning with constrained spatiotemporal jigsaw
Yuqi Huo, Mingyu Ding, Haoyu Lu, Zhiwu Lu, Tao Xiang, Ji-Rong Wen, Ziyuan Huang, Jianwen Jiang, Shiwei Zhang, Mingqian Tang, Songfang Huang, Ping Luo
IJCAI2021
[PDF]
We propose a pretext task for self-supervised video representation learning by exploiting spatiotemporal continuity in videos.
|
|
The website template was adapted from Jingxiang Sun.
|