Frontier Review of Multimodal AI

Duan Nan


Abstract
Pre-training techniques have enabled foundation models (such as BERT, T5, and GPT) to achieve remarkable success in natural language processing (NLP) and in multimodal tasks that involve text, audio, and visual content. Some of the latest multimodal generative models, such as DALL·E and Stable Diffusion, can synthesize novel visual content from text or video inputs, which greatly enhances the creativity and productivity of content creators. However, multimodal AI also faces challenges, such as adding new modalities or handling diverse tasks that require signals beyond its understanding. Therefore, a new trend in multimodal AI is to build compositional AI systems that connect existing foundation models with external modules and tools, so that the system can perform more varied tasks by leveraging different modalities and signals. In this paper, we give a brief overview of state-of-the-art multimodal AI techniques and of the direction of building compositional AI systems, and we discuss potential future research topics in multimodal AI.
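The compositional AI system mentioned in the abstract can be pictured as a lightweight dispatcher that routes a request through a chain of pretrained models or external tools. The sketch below is only an illustration of that idea, not code from the paper; every class and name in it (ImageCaptioner, TextGenerator, CompositionalSystem, and the task label) is a hypothetical stand-in rather than a real API.

```python
# Illustrative sketch only: a toy "compositional AI system" that chains
# independent pretrained models/tools behind a simple dispatcher.
# All model classes here are hypothetical placeholders, not real APIs.

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Request:
    task: str       # e.g. "caption_then_expand"
    payload: dict   # modality-specific inputs (image path, text prompt, ...)


class ImageCaptioner:
    """Stand-in for a vision-language foundation model."""
    def run(self, image_path: str) -> str:
        return f"a caption describing {image_path}"  # placeholder output


class TextGenerator:
    """Stand-in for a large language model used as a text tool."""
    def run(self, prompt: str) -> str:
        return f"[generated text conditioned on: {prompt}]"  # placeholder output


class CompositionalSystem:
    """Connects existing models/tools and routes tasks across modalities."""
    def __init__(self) -> None:
        self.captioner = ImageCaptioner()
        self.generator = TextGenerator()
        # Each entry composes tools into a pipeline for one task type.
        self.routes: Dict[str, Callable[[dict], str]] = {
            "caption_then_expand": self._caption_then_expand,
        }

    def _caption_then_expand(self, payload: dict) -> str:
        # Vision model output becomes the language model's input.
        caption = self.captioner.run(payload["image_path"])
        return self.generator.run(f"Write a detailed description of: {caption}")

    def handle(self, request: Request) -> str:
        if request.task not in self.routes:
            raise ValueError(f"no tool chain registered for task {request.task!r}")
        return self.routes[request.task](request.payload)


if __name__ == "__main__":
    system = CompositionalSystem()
    print(system.handle(Request("caption_then_expand", {"image_path": "photo.jpg"})))
```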
Anthology ID: 2023.ccl-2.9
Volume: Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum)
Month: August
Year: 2023
Address: Harbin, China
Editor: Jiajun Zhang
Venue: CCL
Publisher: Chinese Information Processing Society of China
Pages: 110–118
Language: English
URL: https://aclanthology.org/2023.ccl-2.9
Cite (ACL): Duan Nan. 2023. Frontier Review of Multimodal AI. In Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum), pages 110–118, Harbin, China. Chinese Information Processing Society of China.
Cite (Informal): Frontier Review of Multimodal AI (Nan, CCL 2023)
PDF: https://aclanthology.org/2023.ccl-2.9.pdf