Abstract Automatic pain assessment technology, in particular the automatic detection of pain from facial expressions, has recently been developed to improve the quality of pain management and has attracted increasing attention. In this paper, we propose self-supervised learning for automatic yet efficient pain assessment, in order to reduce the cost of collecting large amounts of labeled data. To achieve this, we introduce a novel similarity function to learn generalized representations using a Siamese network in the pretext task. The learned representations are fine-tuned in the downstream task of pain intensity estimation. To make the method computationally efficient, we propose Statistical Spatiotemporal Distillation (SSD) to encode the spatiotemporal variations underlying a facial video into a single RGB image, enabling the use of less complex 2D deep models for video representation. Experiments on two publicly available pain datasets and a cross-dataset evaluation demonstrate promising results, showing the good generalization ability of the learned representations.
Abstract Recently, face recognition has made significant progress due to the advancement of large-scale Deep Convolutional Neural Networks (DeepCNNs). Despite this success, the known deficiencies of DeepCNNs have not been addressed: the need for large amounts of labeled training data, high energy consumption, lack of theoretical interpretability, lack of robustness to image transformations and degradations, and vulnerability to attacks. These deficiencies limit the use of DeepCNNs in many real-world applications and make the previously predominant Local Binary Patterns (LBP) based face recognition methods still irreplaceable. In this paper we propose a novel approach called BIRD (learning Binary and Illumination Robust Descriptor) for face representation, which balances three criteria: distinctiveness, robustness, and low computational cost. We propose to learn discriminative and compact binary codes directly from six types of Pixel Difference Vectors (PDVs). For each type of binary code, we cluster and pool the compact binary codes to obtain a histogram representation of each face image. The six global histograms derived from the six types of learned compact binary codes are fused for the final face recognition. Experimental results on the CAS_PEAL_R1 and LFW databases indicate that BIRD surpasses all previous binary-based face recognition methods on the two evaluated datasets. Moreover, BIRD is shown to be highly robust to illumination changes and achieves 89.5% on the CAS_PEAL_R1 illumination subset, which, we believe, is the best result reported so far on this dataset. Our code is available at https://github.com/zhuogege1943/bird-descriptor.
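The idea of turning pixel differences into binary codes and pooling them into a histogram can be illustrated with a minimal sketch. This is not the BIRD method: BIRD learns its binary codes from six types of PDVs and clusters them, whereas the sketch below assumes a single hand-crafted 8-neighbour PDV and a fixed sign threshold, in the spirit of classical LBP. The function names and the neighbour ordering are hypothetical.

```python
from collections import Counter

# Assumed 8-neighbour offsets (row, col), clockwise from the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def pixel_difference_vector(image, r, c):
    """Differences between the 8 neighbours of (r, c) and the centre pixel."""
    centre = image[r][c]
    return [image[r + dr][c + dc] - centre for dr, dc in OFFSETS]

def binary_code(pdv):
    """Threshold each difference at zero and pack the bits into one integer.

    This fixed sign test stands in for the learned projection used by BIRD.
    """
    code = 0
    for diff in pdv:
        code = (code << 1) | (1 if diff >= 0 else 0)
    return code

def code_histogram(image):
    """Histogram of binary codes over all interior pixels of a 2D image."""
    rows, cols = len(image), len(image[0])
    codes = [
        binary_code(pixel_difference_vector(image, r, c))
        for r in range(1, rows - 1)
        for c in range(1, cols - 1)
    ]
    return Counter(codes)
```

Because the codes depend only on differences between pixels, adding a constant brightness offset to every pixel leaves the histogram unchanged, which is one simple reason difference-based descriptors tolerate illumination shifts.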
Abstract In numerous multimedia and multi-modal tasks, from image and video retrieval to zero-shot recognition to multimedia question answering, bridging image and text representations plays an important and in some cases indispensable role. To narrow the modality gap between vision and language, prior approaches attempt to discover their correlated semantics in a common feature space. However, these approaches omit the intra-modal semantic consistency when learning the inter-modal correlations. To address this problem, we propose cycle-consistent embeddings in a deep neural network for matching visual and textual representations. Our approach, named CycleMatch, maintains both inter-modal correlations and intra-modal consistency by cascading dual mappings and reconstructed mappings in a cyclic fashion. Moreover, to achieve robust inference, we employ two late-fusion approaches: average fusion and adaptive fusion. Both effectively integrate the matching scores of different embedding features without increasing the network complexity or training time. In experiments on cross-modal retrieval, we present comprehensive results that verify the effectiveness of the proposed approach. Our approach achieves state-of-the-art performance on two well-known multi-modal datasets, Flickr30K and MSCOCO.
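Average late fusion of matching scores can be sketched as follows. The abstract does not specify how CycleMatch computes its scores or how adaptive fusion works, so this sketch assumes cosine similarity as the matching score and covers only plain averaging; all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed matching score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_fusion(score_lists):
    """Average per-candidate scores across several embedding features.

    score_lists[k][i] is the score of candidate i under embedding feature k.
    """
    count = len(score_lists)
    return [sum(scores) / count for scores in zip(*score_lists)]

def rank(scores):
    """Candidate indices sorted from best to worst fused score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Fusion of this kind happens after each embedding has already produced its scores, so it adds no parameters to the network and does not affect training.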
Abstract Efficiency and robustness are increasingly needed for applications on 3D point clouds, given the ubiquitous use of edge devices in scenarios such as autonomous driving and robotics, which often demand real-time and reliable responses. This paper tackles the challenge by designing a general framework for constructing 3D learning architectures with SO(3) equivariance and network binarization. However, a naive combination of equivariant networks and binarization causes either sub-optimal computational efficiency or geometric ambiguity. We propose to place both scalar and vector features in our networks to avoid both cases. Precisely, the presence of scalar features makes the major part of the network binarizable, while vector features retain rich structural information and ensure SO(3) equivariance. The proposed approach can be applied to general backbones such as PointNet and DGCNN. Experiments on ModelNet40, ShapeNet, and the real-world dataset ScanObjectNN demonstrate that the method achieves a strong trade-off between efficiency, rotation robustness, and accuracy. The code is available at https://github.com/zhuoinoulu/svnet.
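The distinction between scalar and vector features can be made concrete with a small sketch. It assumes a generic construction, not necessarily the one used in the paper: vector features are lists of 3D vectors mixed by a linear layer over channels, and scalar features are their pairwise inner products. The function names are hypothetical, and a rotation about the z-axis stands in for a general SO(3) element.

```python
import math

def rotate_z(v, angle):
    """Rotate a 3D vector about the z-axis (one element of SO(3))."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def mix_channels(vectors, weights):
    """Linear layer on vector features: each output is a weighted sum of inputs.

    Mixing whole vectors (never individual coordinates) commutes with rotation.
    """
    return [
        tuple(sum(w * v[k] for w, v in zip(row, vectors)) for k in range(3))
        for row in weights
    ]

def scalar_features(vectors):
    """Rotation-invariant scalars: pairwise inner products of vector channels."""
    return [sum(a * b for a, b in zip(u, v)) for u in vectors for v in vectors]
```

Rotating the inputs and then mixing gives the same vectors as mixing and then rotating (equivariance), while the inner products do not change at all (invariance). Invariant scalars are the kind of quantity that can be processed by ordinary, and in principle binarized, layers without disturbing the rotation behaviour of the vector branch.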
Abstract Multimodal learning, which aims to bridge the modality gap between heterogeneous representations such as vision and language, has been an important and challenging problem for decades. Unlike many current approaches, which focus only on either multimodal matching or classification, we propose a unified network to jointly learn multimodal matching and classification (MMC-Net) between images and texts. The proposed MMC-Net model seamlessly integrates the matching and classification components. It first learns visual and textual embedding features in the matching component, and then generates discriminative multimodal representations in the classification component. Combining the two components in a unified model can improve the performance of both. Moreover, we present a multi-stage training algorithm that minimizes both the matching and classification loss functions. Experimental results on four well-known multimodal benchmarks demonstrate the effectiveness and efficiency of the proposed approach, which achieves competitive performance for multimodal matching and classification compared to state-of-the-art approaches.
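A joint objective over matching and classification can be sketched as a weighted sum of two standard losses. The abstract does not give the loss definitions used by MMC-Net or the details of its multi-stage schedule, so the hinge ranking loss, the softmax cross-entropy, the `margin` and `weight` parameters, and the function names below are all assumptions chosen for illustration.

```python
import math

def hinge_matching_loss(pos_score, neg_score, margin=0.2):
    """Ranking loss: a matched pair should outscore a mismatched pair by a margin."""
    return max(0.0, margin - pos_score + neg_score)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label]

def joint_loss(pos_score, neg_score, logits, label, weight=1.0, margin=0.2):
    """Assumed joint objective: matching loss plus weighted classification loss."""
    matching = hinge_matching_loss(pos_score, neg_score, margin)
    return matching + weight * cross_entropy(logits, label)
```

Setting `weight` to zero reduces the objective to matching alone, which is one simple way a staged schedule could train the matching component before introducing the classification term.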
Abstract Fashion style transfer has attracted significant attention because it both poses interesting scientific challenges and is important to the fashion industry. This paper addresses a practical problem in fashion style transfer, person-to-person clothing swapping, which aims to visualize what a person would look like wearing the target clothes worn by another person, without physically dressing them. This problem remains challenging due to varying pose deformations between different person images. In contrast to traditional nonparametric methods that blend or warp the target clothes onto the reference person, we propose a multistage deep generative approach named SwapGAN that exploits three generators and one discriminator in a unified framework to fulfill the task end-to-end. The first and second generators are conditioned on a human pose map and a segmentation map, respectively, so that the pose style and the clothes style can be transferred simultaneously. In addition, the third generator is used to preserve the human body shape during the image synthesis process. The discriminator must distinguish two fake image pairs from the real image pair. The entire SwapGAN is trained by integrating the adversarial loss and the mask-consistency loss. Experimental results on the DeepFashion dataset demonstrate the improvements of SwapGAN over existing approaches in both quantitative and qualitative evaluations. Moreover, we conduct ablation studies on SwapGAN and provide a detailed analysis of its effectiveness.