In this talk, we will first review the most representative pre-trained models and then present a Multitask Multilingual Multimodal Pre-trained model (M3P) that combines multilingual-monomodal pre-training and monolingual-multimodal pre-training into a unified framework via multitask learning. This model learns universal representations that can map objects occurring in different modalities, or expressed in different languages, to vectors in a common semantic space. To verify the generalization capability of M3P, we fine-tune the pre-trained model on different types of downstream tasks: multilingual image-text retrieval, multilingual image captioning, multimodal machine translation, multilingual natural language inference, and multilingual text generation. Evaluation shows that M3P can (i) achieve results on multilingual tasks and on English multimodal tasks comparable to the state-of-the-art models pre-trained separately for these two types of tasks, and (ii) obtain new state-of-the-art results on non-English multimodal tasks in the zero-shot or few-shot setting. In the last part, we will present our current progress and future plans on learning better universal representations based on different types of knowledge.
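To make the "common semantic space" idea concrete, the following is a minimal illustrative sketch, not the actual M3P architecture: it assumes hypothetical linear projections (`W_text`, `W_img`) standing in for the learned Transformer encoders, and shows how captions in any language and image features can be projected into one shared space where retrieval reduces to cosine similarity.

```python
import numpy as np

# Toy sketch of a shared multimodal/multilingual embedding space.
# W_text and W_img are hypothetical stand-ins for pre-trained encoders;
# in M3P these would be learned jointly via multitask pre-training.
rng = np.random.default_rng(0)
DIM_TEXT, DIM_IMG, DIM_SHARED = 8, 12, 4

W_text = rng.standard_normal((DIM_TEXT, DIM_SHARED))
W_img = rng.standard_normal((DIM_IMG, DIM_SHARED))

def embed(x, W):
    """Project a feature vector into the shared space and L2-normalize it."""
    z = x @ W
    return z / np.linalg.norm(z)

def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate with the highest cosine similarity."""
    sims = candidate_embs @ query_emb  # dot products of unit vectors
    return int(np.argmax(sims))

# Captions in different languages and image features all land in one space.
caption_en = rng.standard_normal(DIM_TEXT)   # e.g., an English caption
caption_de = rng.standard_normal(DIM_TEXT)   # e.g., a German caption
image_feats = rng.standard_normal((3, DIM_IMG))  # three candidate images

img_embs = np.stack([embed(f, W_img) for f in image_feats])
best = retrieve(embed(caption_en, W_text), img_embs)
print("best image index for the English caption:", best)
```

Because both modalities share one vector space, the same `retrieve` call works unchanged for the German caption, which is what enables zero-shot cross-lingual image-text retrieval in the abstract above.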