He is now an Associate Professor and a Graduate Advisor at the School of Computer Science and Technology, China University of Mining and Technology (CUMT), as well as a Postdoctoral Fellow at the Department of Computer Science and Engineering, Shanghai Jiao Tong University (SJTU). He received the Ph.D. degree in Computer Science and Technology from SJTU in 2020, advised by Prof. Lizhuang Ma. From 2017 to 2018, he was a joint Ph.D. student at the Multimedia and Interactive Computing Lab, Nanyang Technological University (NTU), advised by Prof. Jianfei Cai. Before that, he received the B.Eng. degree in Computer Science and Technology from Northwestern Polytechnical University (NPU) in 2015. He has been supported by funding programs such as the Young Scientists Fund of the National Natural Science Foundation of China, the High-Level Talent Program for Innovation and Entrepreneurship (ShuangChuang Doctor) of Jiangsu Province, the Talent Program for Deputy General Manager of Science and Technology of Jiangsu Province, and the Young Scientists Fund of the Fundamental Research Funds for the Central Universities. He has published more than 30 academic papers in leading journals and conferences. He serves as a program committee member or reviewer for top journals and conferences such as IEEE TPAMI, IJCV, IEEE TIP, IEEE CVPR, IJCAI, and AAAI. His research interests lie in the fields of Computer Vision and Deep Learning. The official faculty websites can be found here and here. [ Résumé ]
Jan. 2023: I am a recipient of the Outstanding Young Teacher Program of China University of Mining and Technology.
Jan. 2023: I serve as a Publication Chair of Computer Graphics International (CGI) 2023.
Dec. 2022: Our paper Facial Action Unit Detection Using Attention and Relation Learning has become an ESI Highly Cited Paper.
July 2022: I am a recipient of the Talent Program for Deputy General Manager of Science and Technology of Jiangsu Province.
June 2022: Our paper TextDCT: Arbitrary-Shaped Text Detection via Discrete Cosine Transform Mask is accepted by IEEE TMM.
Aug. 2021: I am a recipient of the Young Scientists Fund of the National Natural Science Foundation of China.
July 2021: I am a recipient of the High-Level Talent Program for Innovation and Entrepreneurship (ShuangChuang Doctor) of Jiangsu Province.
June 2021: Our paper Unconstrained Facial Action Unit Detection via Latent Feature Domain is accepted by IEEE TAFFC.
Apr. 2021: Our paper Explicit Facial Expression Transfer via Fine-Grained Representations is accepted by IEEE TIP.
Sept. 2020: I serve as a PC member for AAAI 2021 and IJCAI 2021.
Aug. 2020: Our paper JÂA-Net: Joint Facial Action Unit Detection and Face Alignment via Adaptive Attention is accepted by IJCV.
Oct. 2019: Our paper Facial Action Unit Detection Using Attention and Relation Learning is accepted by IEEE TAFFC.
July 2018: Our paper Deep Adaptive Attention for Joint Facial Action Unit Detection and Face Alignment is accepted by ECCV 2018.
Nov. 2017: I am now working at the Multimedia and Interactive Computing Lab of Nanyang Technological University (NTU), Singapore, as a Research Assistant, advised by Prof. Jianfei Cai.
2023: Outstanding Young Teacher Program of China University of Mining and Technology, Principal Investigator
2022: Participation in Computer Graphics International (CGI) 2022 Supported by K.C.Wong Education Foundation, Principal Investigator
2022: Patent License Project for Method and Device of Facial Action Unit Recognition Based on Joint Learning and Optical Flow Estimation, Principal Investigator
2022: Talent Program for Deputy General Manager of Science and Technology of Jiangsu Province, Principal Investigator
2021: Young Scientists Fund of the National Natural Science Foundation of China, Principal Investigator
2021: High-Level Talent Program for Innovation and Entrepreneurship (ShuangChuang Doctor) of Jiangsu Province, Principal Investigator
2021: Young Scientists Fund of the Fundamental Research Funds for the Central Universities, Principal Investigator
2020: Start-Up Grant of China University of Mining and Technology, Principal Investigator
We propose IterativePFN (iterative point cloud filtering network), which consists of multiple IterationModules that model the true iterative filtering process internally, within a single network. We train our IterativePFN network using a novel loss function that utilizes an adaptive ground truth target at each iteration to capture the relationship between intermediate filtering results during training. This ensures filtered results converge faster to the clean surfaces.
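As an illustration only, the per-iteration adaptive targets might be sketched as points that move progressively from the noisy input toward the clean surface, with the supervised noise level shrinking at each iteration. The exact target schedule in the paper may differ; `adaptive_targets` and `iterative_loss` below are hypothetical names used for this sketch:

```python
import numpy as np

def adaptive_targets(noisy, clean, num_iters):
    """Hypothetical per-iteration targets: iteration t is supervised by a
    target that interpolates between the noisy input and the clean points,
    with the remaining noise fraction decreasing linearly over iterations."""
    targets = []
    for t in range(1, num_iters + 1):
        alpha = t / num_iters  # fraction of noise removed by iteration t
        targets.append(clean + (1.0 - alpha) * (noisy - clean))
    return targets

def iterative_loss(preds, noisy, clean):
    """Sum of L2 losses between each IterationModule's prediction and its
    adaptive ground-truth target."""
    targets = adaptive_targets(noisy, clean, len(preds))
    return sum(np.mean(np.sum((p - g) ** 2, axis=-1))
               for p, g in zip(preds, targets))
```

The key point captured by the sketch is that only the final iteration is supervised by the fully clean points, while earlier iterations receive intermediate targets.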
We propose a novel MER method based on identity-invariant representation learning and transformer-style relational modeling. Specifically, we propose to disentangle the identity information from the input via an adversarial training strategy. Considering the coherent relationships between AUs and MEs, we further employ AU recognition as an auxiliary task to learn AU representations that capture ME information. Moreover, we introduce a transformer to achieve MER by modeling the correlations among AUs. MER and AU recognition are jointly trained, so that the two correlated tasks can benefit each other.
We propose an end-to-end deep learning based attention and relation learning framework for AU detection with only AU labels, which has not been explored before. In particular, multi-scale features shared by all AUs are learned first, and then both channel-wise and spatial attentions are adaptively learned to select and extract AU-related local features. Moreover, pixel-level relations for AUs are captured to refine the spatial attentions so as to extract more relevant local features. Without changing the network architecture, our framework can be easily extended to AU intensity estimation.
We propose a novel light-weight anchor-free text detection framework called TextDCT, which adopts the discrete cosine transform (DCT) to encode the text masks as compact vectors. Further, considering the imbalanced number of training samples among pyramid layers, we only employ a single-level head for top-down prediction. To model the multi-scale texts in a single-level head, we introduce a novel positive sampling strategy by treating the shrunk text region as positive samples, and design a feature awareness module (FAM) for spatial-awareness and scale-awareness by fusing rich contextual information and focusing on more significant features. Moreover, we propose a segmented non-maximum suppression (S-NMS) method that can filter low-quality mask regressions.
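A minimal sketch of the mask encoding idea, assuming a text mask is compressed by keeping only a low-frequency block of its 2D DCT coefficients and reconstructed by the inverse transform. The block size `n_coeffs` and both function names are illustrative, not the paper's exact configuration:

```python
import numpy as np
from scipy.fft import dctn, idctn

def encode_mask(mask, n_coeffs=8):
    """Encode a binary text mask as a compact vector of low-frequency
    DCT coefficients (the top-left n_coeffs x n_coeffs block)."""
    coeffs = dctn(mask.astype(np.float64), norm='ortho')
    return coeffs[:n_coeffs, :n_coeffs].ravel()

def decode_mask(vector, mask_shape, n_coeffs=8):
    """Reconstruct an approximate mask from the compact vector by zero-padding
    the high frequencies and applying the inverse DCT."""
    coeffs = np.zeros(mask_shape)
    coeffs[:n_coeffs, :n_coeffs] = vector.reshape(n_coeffs, n_coeffs)
    return idctn(coeffs, norm='ortho') > 0.5  # threshold back to binary
```

Because smooth text regions concentrate their energy in low frequencies, a short coefficient vector can represent an arbitrary-shaped mask far more compactly than per-pixel regression.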
We propose a novel hybrid relational reasoning (HRR) framework for AU detection. In particular, we propose to adaptively reason pixel-level correlations of each AU, under the constraint of predefined regional correlations by facial landmarks, as well as the supervision of AU detection. Moreover, we propose to adaptively reason AU-level correlations using a graph convolutional network, by considering both predefined AU relationships and learnable relationship weights. Our framework is beneficial for integrating the advantages of correlation predefinition and correlation learning.
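The AU-level reasoning step can be sketched, under the assumption that a predefined AU relationship matrix is modulated element-wise by learnable weights before graph propagation; `gcn_layer` and its argument names are hypothetical, and the paper's exact normalization may differ:

```python
import numpy as np

def gcn_layer(H, A_pre, R_learn, W):
    """One hypothetical AU-level GCN step: the predefined AU relationship
    matrix A_pre is modulated element-wise by learnable weights R_learn,
    row-normalized, and used to propagate AU features H through weights W."""
    A = A_pre * R_learn
    A = A / np.clip(A.sum(axis=1, keepdims=True), 1e-8, None)  # row-normalize
    return np.maximum(A @ H @ W, 0.0)  # ReLU activation
```

Combining `A_pre` and `R_learn` this way reflects the framework's goal of integrating correlation predefinition with correlation learning.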
Facial action unit (AU) recognition based on deep learning is a hot topic in computer vision and affective computing. Each AU describes a local facial expression action, and combinations of AUs can quantitatively represent any expression. Current AU recognition mainly faces three challenges: scarcity of labels, difficulty of feature capture, and imbalance of labels. Accordingly, this paper categorizes existing work into transfer learning based, region learning based, and relation learning based methods, and reviews and summarizes representative methods in each category. Finally, this paper compares and analyzes the different methods and discusses future research directions for AU recognition.
We propose an end-to-end unconstrained facial AU detection framework based on domain adaptation, which transfers accurate AU labels from a constrained source domain to an unconstrained target domain by exploiting labels of AU-related facial landmarks. Specifically, we map a labeled source image and an unlabeled target image into a latent feature domain by combining the source landmark-related feature with the target landmark-free feature. Due to the combination of source AU-related information and target AU-free information, the latent feature domain with the transferred source label can be learned by maximizing target-domain AU detection performance. Moreover, we introduce a novel landmark adversarial loss to disentangle the landmark-free feature from the landmark-related feature by treating the adversarial learning as a multi-player minimax game.
We first use Structural Causal Models (SCMs) to show how two confounders damage image captioning. We then apply the backdoor adjustment to propose a novel causal inference based image captioning (CIIC) framework, which consists of an interventional object detector (IOD) and an interventional transformer decoder (ITD) that jointly confront both confounders. In the encoding stage, the IOD disentangles the region-based visual features by deconfounding the visual confounder. In the decoding stage, the ITD introduces causal intervention into the transformer decoder and deconfounds the visual and linguistic confounders simultaneously. The two modules collaborate to eliminate the spurious correlations caused by the unobserved confounders.
We propose a weakly supervised few-shot semantic segmentation model based on the meta learning framework, which utilizes prior knowledge and adapts itself to new tasks. Consequently, the proposed network achieves both high efficiency and strong generalization to new tasks. In the pseudo mask generation stage, we develop a WRCAM method with a channel-spatial attention mechanism to refine the coverage of targets in pseudo masks. In the few-shot semantic segmentation stage, an optimization based meta learning method realizes few-shot semantic segmentation by virtue of the refined pseudo masks.
We propose a framework for visual tracking based on attention-driven fusion of multi-modal and multi-level features. This fusion method fully exploits the complementary advantages of multi-level and multi-modal information. Specifically, we use a feature fusion module to fuse features from different levels and different modalities at the same time. We use cycle consistency based on a correlation filter to implement unsupervised training of the model, which reduces the cost of annotated data.
We propose a novel geodesic guided convolution (GeoConv) for AU recognition by embedding 3D manifold information into 2D convolutions. Specifically, the kernel of GeoConv is weighted by our introduced geodesic weights, which are negatively correlated to geodesic distances on a coarsely reconstructed 3D morphable face model. Moreover, based on GeoConv, we further develop an end-to-end trainable framework named GeoCNN for AU recognition.
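A toy sketch of the kernel weighting, assuming the negative correlation takes the form of an exponential decay with geodesic distance (one plausible choice; the paper's exact weighting function may differ, and both function names are illustrative):

```python
import numpy as np

def geodesic_kernel_weights(geo_dists, lam=1.0):
    """Hypothetical geodesic weights for a conv kernel: weights decay
    exponentially with geodesic distance from the kernel centre measured
    on the coarsely reconstructed 3D face surface."""
    return np.exp(-lam * geo_dists)

def geoconv2d_single(patch, kernel, geo_dists, lam=1.0):
    """One output value of GeoConv: the learned 2D kernel is modulated
    element-wise by the geodesic weights before the dot product, so
    kernel entries far away on the 3D surface contribute less."""
    w = geodesic_kernel_weights(geo_dists, lam)
    return np.sum(patch * kernel * w)
```

When all geodesic distances are zero the weights reduce to ones and GeoConv degenerates to an ordinary 2D convolution, which is how the sketch embeds 3D manifold information without changing the 2D convolution interface.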
We propose to explicitly transfer facial expression by directly mapping two unpaired input images to two synthesized images with swapped expressions. Specifically, considering AUs semantically describe fine-grained expression details, we propose a novel multi-class adversarial training method to disentangle input images into two types of fine-grained representations: AU-related feature and AU-free feature. Then, we can synthesize new images with preserved identities and swapped expressions by combining AU-free features with swapped AU-related features. Moreover, to obtain reliable expression transfer results of the unpaired input, we introduce a swap consistency loss to make the synthesized images and self-reconstructed images indistinguishable.
We propose an end-to-end expression-guided generative adversarial network (EGGAN), which synthesizes an image with expected expression given continuous expression label and structured latent code. In particular, an adversarial autoencoder is used to translate a source image into a structured latent space. The encoded latent code and the target expression label are input to a conditional GAN to synthesize an image with the target expression. Moreover, a perceptual loss and a multi-scale structural similarity loss are introduced to preserve facial identity and global shape during expression manipulation.
We propose a novel end-to-end deep learning framework for joint AU detection and face alignment, which has not been explored before. In particular, multi-scale shared features are learned first, and high-level features of face alignment are fed into AU detection. Moreover, to extract precise local features, we propose an adaptive attention learning module to refine the attention map of each AU adaptively. Finally, the assembled local features are integrated with face alignment features and global features for AU detection.
We propose a novel deep learning framework named Multi-Center Learning with multiple shape prediction layers for face alignment. In particular, each shape prediction layer emphasizes the detection of a certain cluster of semantically relevant landmarks. Challenging landmarks are focused on first, and each cluster of landmarks is then optimized respectively. Moreover, to reduce model complexity, we propose a model assembling method that integrates the multiple shape prediction layers into a single shape prediction layer.
We propose a novel variant of the Gated Recurrent Unit (GRU), called the Single-Tunnelled GRU, for abnormality detection. In particular, the Single-Tunnelled GRU discards the parameter-heavy reset gate from GRU cells, which overlooks the importance of past content by favouring only the current input, yielding an optimized single-gated cell. Moreover, we substitute the hyperbolic tangent activation in standard GRUs with the sigmoid activation, as the former suffers from performance loss in deeper networks.
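The two modifications can be sketched directly against the standard GRU equations: drop the reset gate so the candidate state sees the full previous hidden state, and use sigmoid for the candidate activation. Weight names and the gate convention below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def st_gru_cell(x, h, Wz, Uz, bz, Wh, Uh, bh):
    """One step of a Single-Tunnelled GRU cell (sketch): compared with a
    standard GRU, the reset gate is removed entirely and the candidate
    activation uses sigmoid instead of tanh."""
    z = sigmoid(Wz @ x + Uz @ h + bz)       # single (update) gate
    h_cand = sigmoid(Wh @ x + Uh @ h + bh)  # sigmoid candidate, no reset gate
    return (1.0 - z) * h + z * h_cand       # gated blend of old and new state
```

Removing the reset gate eliminates one full set of gate parameters per cell, which is the source of the model's reduced weight.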
We propose an end-to-end expression-guided generative adversarial network (EGGAN), which utilizes structured latent codes and continuous expression labels as input to generate images with expected expressions. Specifically, we adopt an adversarial autoencoder to map a source image into a structured latent space. Then, given the source latent code and the target expression label, we employ a conditional GAN to generate a new image with the target expression. Moreover, we introduce a perceptual loss and a multi-scale structural similarity loss to preserve identity and global shape during generation.
We propose a novel end-to-end deep learning framework for joint AU detection and face alignment, which has not been explored before. In particular, multi-scale shared features are learned first, and high-level features of face alignment are fed into AU detection. Moreover, to extract precise local features, we propose an adaptive attention learning module to refine the attention map of each AU adaptively. Finally, the assembled local features are integrated with face alignment features and global features for AU detection.
We propose a novel multi-center convolutional neural network for unconstrained face alignment. To utilize structural correlations among different facial landmarks, we determine several clusters based on their spatial positions. We pre-train our network to learn generic feature representations. We further fine-tune the pre-trained model to emphasize locating a certain cluster of landmarks respectively. Fine-tuning helps search for an optimal solution smoothly without deviating excessively from the pre-trained model. We obtain an excellent solution by combining the multiple fine-tuned models.
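The spatial clustering step can be sketched with plain k-means over 2D landmark coordinates; the function name, cluster count, and deterministic initialization from the first k points are all illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def cluster_landmarks(landmarks, k=5, iters=50):
    """Group facial landmarks into k spatial clusters with plain k-means.
    Each resulting cluster would then be assigned its own fine-tuned
    prediction branch in the multi-center network."""
    landmarks = np.asarray(landmarks, dtype=float)
    centers = landmarks[:k].copy()  # deterministic init from first k points
    labels = np.zeros(len(landmarks), dtype=int)
    for _ in range(iters):
        # assign each landmark to its nearest cluster centre
        d = np.linalg.norm(landmarks[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # move each centre to the mean of its assigned landmarks
        for j in range(k):
            pts = landmarks[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels, centers
```

In practice one would expect clusters corresponding to semantically coherent regions such as eyes, brows, nose, mouth, and jawline.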
We propose a novel data augmentation strategy, and we design an innovative training algorithm with an adaptive learning rate for two iterative procedures, which helps the network search for an optimal solution. Our convolutional network learns global high-level features and directly predicts the coordinates of facial landmarks.
2022: A Method and Device of User Personality Characteristic Prediction Based on Multi-Modal Information Fusion, The Third Inventor, ZL202111079044.4
2021: A Method and Device of Facial Action Unit Recognition Based on Joint Learning and Optical Flow Estimation, The First Inventor, ZL202110360938.4
2018: Identity Verification System V1.0 Based on Face Recognition, Software Copyright, The Second Inventor, 2018SR160441
2022: Honorable Mention for Teaching Competition at the School of Computer Science and Technology, China University of Mining and Technology
2021: Excellent Headteacher of China University of Mining and Technology
2020: Outstanding Prize for Scientific and Technological Progress of Shanghai Municipality, 11/18
2020: One of the Top 10 Scientific Advances at Shanghai Jiao Tong University, 8/9
2019: Super AI Leader (SAIL) TOP 30 project at the World Artificial Intelligence Conference, 6/13
2016-2019: KoGuan Endeavor Scholarship, Suzhou Yucai Scholarship
2015: Outstanding Graduate of Northwestern Polytechnical University
2012-2015: Outstanding Student of Northwestern Polytechnical University
2012-2015: National Endeavor Scholarship, Samsung China Scholarship, Wu Yajun Scholarship
2022: Image Processing and Computer Vision, Lecturer (Principal of Course)
2022: Practice for Python Programming, Lecturer
2021: Computational Thinking and Artificial Intelligence Foundation, Teaching Assistant
2020: Practice for Computational Thinking and Artificial Intelligence Foundation, Lecturer
2020: Technology of Cloud Computing and Big Data, Teaching Assistant
2020: Introduction to Information Science, Teaching Assistant
Publication Chair: Computer Graphics International (CGI) 2023
Session Chair: Computer Graphics International (CGI) 2022, Shanghai Cross-Media Intelligence and Computer Vision Forum 2019
Member in Chinese Association for Artificial Intelligence (CAAI): Professional Committee of Pattern Recognition, Professional Committee of Knowledge Engineering and Distributed Intelligence, Professional Committee of Machine Learning
Member in China Society of Image and Graphics (CSIG): Professional Committee of Machine Vision, Professional Committee of Animation and Digital Entertainment
Program Committee Member/Conference Reviewer: CVPR, IJCAI, AAAI, ACM MM, CGI, ICONIP, ICIG, NCIIP
Journal Reviewer: IEEE TPAMI, IJCV, IEEE TIP, IEEE TAFFC, IEEE TMM, Signal Processing, SPIC, IET IP, TVC, Computers & Graphics, IEEE Sensors, Journal of Electronic Imaging, Frontiers in Computer Science