The discovery of correlations from large-scale data sets is a topic of great interest nowadays. Artificial intelligence is now moving toward integrating data-driven learning and knowledge-guided inference to support better reasoning and decision making, rather than relying on correlation learning via metric matching. This talk will discuss potential ways to fuse symbolic AI, data-driven learning, and reinforcement learning to support causal reasoning.


Fei Wu , Zhejiang University, Hangzhou, China


Analyzing human behavior in videos is one of the fundamental problems of computer vision and multimedia understanding. The task is very challenging, as video is an information-intensive medium with large variations and complexities in content. With the development of deep learning techniques, researchers have strived to push the limits of human behavior understanding in a wide variety of applications, from action recognition to event detection. This tutorial will present recent advances under the umbrella of human behavior understanding, ranging from the fundamental problem of how to learn "good" video representations, to the challenge of categorizing video content into human action classes, and finally to multimedia event detection and surveillance event detection in complex scenarios.


Ting Yao , JD AI Research, Beijing, China

Jingen Liu , JD AI Research, Mountain View, CA, USA


Intelligent image/video editing is a fundamental topic in image processing that has witnessed rapid progress in the last two decades. Due to various degradations during image and video capture, transmission, and storage, images and videos exhibit many undesirable effects, such as low resolution, low-light conditions, rain streaks, and raindrop occlusions. Recovering from these degradations is ill-posed. With the wealth of statistics-based and learning-based methods, these problems can be unified as cross-domain transfer, which also covers tasks such as image stylization.

In our tutorial, we will discuss recent progress in image stylization, rain streak/drop removal, image/video super-resolution, and low-light image enhancement. This tutorial covers both traditional statistics-based and deep-learning-based methods, and includes both biologically driven models, e.g., the Retinex model, and data-driven models. An image processing viewpoint that treats the popular deep networks as traditional Maximum a Posteriori (MAP) estimation is provided. Side priors, designed by researchers and learned via multi-task learning, and automatically learned priors, captured by adversarial learning, are the two kinds of important priors in this framework. Three works under this framework, covering single image super-resolution, low-light image enhancement, and single image raindrop removal, are presented.
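As a concrete reading of this viewpoint (a standard restoration formulation, not necessarily the exact notation used in the tutorial): with degraded observation y, latent clean image x, degradation operator H, and Gaussian noise of variance σ², MAP estimation becomes

```latex
\hat{x} \;=\; \arg\max_{x}\, p(x \mid y)
       \;=\; \arg\max_{x}\, p(y \mid x)\, p(x)
       \;=\; \arg\min_{x}\, \frac{1}{2\sigma^{2}} \lVert y - Hx \rVert_2^{2} \;+\; \Phi(x)
```

The first term enforces fidelity to the observation, while Φ(x) = −log p(x) is exactly where hand-designed side priors or adversarially learned priors enter the framework.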

Single image super-resolution is a classical problem in computer vision. It aims at recovering a high-resolution image from a single low-resolution image. This is an underdetermined inverse problem, whose solution is not unique. In this tutorial, we will discuss how we can solve the problem with deep convolutional networks in a data-driven manner. We will review different model variants and important techniques such as adversarial learning for image super-resolution. We will then discuss recent work on hallucinating faces with unconstrained poses and very low resolution. Finally, the tutorial will discuss the challenges of implementing image super-resolution in real-world scenarios.
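To see why the problem is underdetermined, consider a toy degradation model (a hypothetical 2x2 average-pooling stand-in for a real downsampling kernel): two different high-resolution patches can produce exactly the same low-resolution observation, so no inverse can recover both.

```python
import numpy as np

def downsample(hr):
    """Average-pool a (2H, 2W) image down to (H, W) — a toy degradation operator."""
    h, w = hr.shape[0] // 2, hr.shape[1] // 2
    return hr.reshape(h, 2, w, 2).mean(axis=(1, 3))

hr_a = np.array([[0., 4.],
                 [4., 0.]])      # checkerboard patch
hr_b = np.full((2, 2), 2.)       # flat patch with the same block mean

lr_a, lr_b = downsample(hr_a), downsample(hr_b)
assert np.allclose(lr_a, lr_b)   # identical LR observations from distinct HR inputs
```

Data-driven methods resolve this ambiguity by learning, from paired examples, which of the many consistent high-resolution explanations is most plausible.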


Jiaying Liu , Peking University, Beijing, China

Wenhan Yang , National University of Singapore, Singapore

Chen Change Loy , Nanyang Technological University, Singapore


Personal photo and video data are being accumulated at an unprecedented speed. For example, 14 petabytes of personal photos and videos were uploaded to Google Photos by 200 million users in 2015, while a tremendous amount of personal photos and videos are also uploaded to Flickr every day. Efficiently searching and organizing such data presents a huge challenge to both academic research and industrial applications.

To address this challenge, this tutorial will review research efforts on related subjects and showcase successful industrial systems. We will discuss traditional visual search methods and the improvements in visual representation brought by deep neural networks. The instructors will also share their experience of building large-scale fashion search and Flickr similarity search systems, and offer insights into the challenges of extending academic research to industrial applications.

This tutorial will discuss the queries and logs of search engines, and analyze how to address the characteristics of personal media search. By extending search techniques to visual question answering, this tutorial will introduce a new task named MemexQA: given a collection of photos or videos from a user, can we automatically answer questions that help the user recover their memory about events captured in the collection? New datasets and algorithms for MemexQA will be reviewed. We hope MemexQA will shed light on next-generation computer interfaces for the exploding amount of personal photos and videos.


Lu Jiang , Google Cloud AI, Sunnyvale, CA, USA

Yannis Kalantidis , Facebook Research, Oakland, CA, USA

Liangliang Cao , University of Massachusetts, Amherst, MA, USA


Computer vision in transportation has recently received increasing attention from both industry and academia due to the popularity of modern mobile transportation platforms and the rapid development of autonomous driving. In this tutorial, we systematically introduce recent progress in computer vision techniques and their applications in transportation. Specifically, we will provide a general overview of the key problems, common formulations, existing methodologies, and future directions. This tutorial will inspire the audience and facilitate research in computer vision for transportation.

The tutorial mainly consists of three parts:

 Lecture 1: Challenges in applying object recognition, optical character recognition and face recognition in transportation.

  • Recent progress in object recognition, optical character recognition and face recognition technologies.
  • The difficulties and problems when applying these technologies in transportation.
  • The solutions and applications.

 Lecture 2: Towards Driving Scenario Understanding.

  • Object detection, tracking, and segmentation in driving scenarios.
  • Vision-based 3D reconstruction of driving scenarios.
  • Driving behavior modeling and safety risk analysis.

 Lecture 3: Applying transfer learning in CV.

  • An introduction to recent transfer learning technologies.
  • Applications of transfer learning technologies in CV.


Haifeng Shen , AI Labs, Didi Chuxing

Yuhong Guo , AI Labs, Didi Chuxing

Yan Liu , AI Labs, Didi Chuxing

Jieping Ye , AI Labs, Didi Chuxing


Object detection is a fundamental problem in the computer vision community with numerous applications. Recently, with the development of Mask R-CNN and RetinaNet, the object detection pipeline seems to have matured. However, the performance of current state-of-the-art object detectors is still far from what visual applications require. In this tutorial, we will delve into the details of object detection and present improvements from five aspects: backbone, head, scale, batch size, and post-processing.

For the backbone, we will discuss a novel network called DetNet, which is specifically designed for the detection task. Compared with traditional ImageNet-pretrained backbones, DetNet preserves spatial information in the network structure. For the head design, we will introduce Light-Head R-CNN for fast inference speed; moreover, a novel Localization Sensitive Head (LSH) will be discussed, which decouples the classification and regression tasks into two branches. For the scale issue, we present a novel algorithm called SFace, which addresses the large scale variation problem in face detection. A large-batch detector will also be discussed, which significantly reduces model training time. Finally, a new dataset called CrowdHuman will be discussed to address the NMS issue during the post-processing stage.
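As background for the post-processing discussion, here is a minimal numpy sketch of standard greedy NMS (a textbook baseline, not the speakers' specific method). Crowded scenes of the kind CrowdHuman targets expose its weakness: heavily overlapping true positives are suppressed along with genuine duplicates.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; boxes are (N, 4) as [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU between the current top box and the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # discard overlapping lower-scored boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]: the near-duplicate box 1 is suppressed
```

If boxes 0 and 1 had been two distinct pedestrians standing close together, the same suppression step would have deleted a correct detection, which is precisely the crowded-scene failure mode the dataset is designed to study.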


Gang Yu , Face++, Beijing, China

Yichen Wei , Face++, Beijing, China

Xiangyu Zhang , Face++, Beijing, China


Owing to the popularity of big data, abundant multimedia data are accumulated in various domains. At the same time, many machine learning methods have been proposed to exploit these data for prediction. These methods have proven successful in prediction-oriented applications. However, the lack of interpretability of most predictive algorithms makes them less attractive in many settings, especially those requiring decision making. Improving the interpretability of learning algorithms is of paramount importance for both academic research and real applications.

Causal inference, which refers to the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect, is a powerful statistical modeling tool for explanatory analysis. In this tutorial, we focus on causally regularized machine learning, aiming to explore causal knowledge from observational data to improve the explainability and stability of machine learning algorithms. First, we will give some examples of how machine learning algorithms today focus on correlation analysis and prediction, and why those methods are insufficient for decision making. Then, we will give an introduction to causal inference and present some recent data-driven approaches for exploring causal knowledge from observational data, especially in high-dimensional settings. Aiming to bridge the gap between causal inference and machine learning, we will introduce some recent causally regularized machine learning algorithms that improve the stability and interpretability of prediction on multimedia data. Finally, we will discuss future directions in the landscape of open research and challenges in machine learning with causal inference.
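As a self-contained illustration (not the speakers' algorithm) of why correlational prediction can mislead decision making, the sketch below simulates a confounded dataset and shows that inverse-propensity weighting, one of the simplest causal adjustments, recovers the true treatment effect that the naive group comparison overestimates. The propensity score is assumed known here for simplicity; in practice it would be estimated from data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounded observational data: x raises both the treatment probability
# and the outcome, so a plain group-mean comparison is biased.
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))                      # true propensity score
t = rng.random(n) < p                         # treatment assignment
y = 1.0 * t + 3.0 * x + rng.normal(size=n)    # true causal effect = 1.0

naive = y[t].mean() - y[~t].mean()            # inflated by the confounder x

# Inverse-propensity weighting re-balances treated and control groups.
w = np.where(t, 1 / p, 1 / (1 - p))
ipw = np.average(y, weights=w * t) - np.average(y, weights=w * ~t)
```

Here `naive` lands well above the true effect of 1.0, while `ipw` is close to it: reweighting makes the treated and control groups comparable on the confounder, which is the core idea that causal regularizers carry over into predictive models.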


Peng Cui , Tsinghua University, Beijing, China

Kun Kuang , Tsinghua University, Beijing, China

Bo Li , Tsinghua University, Beijing, China


Recent years have witnessed great success in the deployment of deep learning for various tasks. Neural architecture innovation plays an important role in advancing this research direction. From AlexNet and VGG to ResNet and DenseNet, better architecture design has pushed the depth limit of deep models from 7 layers to over one thousand layers. This unprecedented depth endows neural networks with strong representation power.

This tutorial will review classical convolutional network architectures, discuss their underlying design principles, and analyze their strengths and weaknesses. In particular, we will address the recent trend of developing highly efficient lightweight deep models for practical applications with limited computational resources, e.g., mobile phones and wearable devices. Besides hand-designed structures that incorporate human intuition, neural architectures obtained via automatic search have gained great popularity in the past two years. This newly emerged research direction, usually referred to as AutoML, will also be covered in this tutorial.
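A back-of-the-envelope sketch of where the efficiency of lightweight models comes from: replacing a dense 3x3 convolution with a depthwise-separable one (the building block of MobileNet-style designs) cuts the parameter count by roughly an order of magnitude at typical channel widths. The channel sizes below are illustrative, not taken from any specific network.

```python
# Parameter count of a standard 3x3 convolution vs. a depthwise-separable
# one, for c_in = c_out = 256 channels (biases omitted).
k, c_in, c_out = 3, 256, 256

standard = k * k * c_in * c_out            # one dense 3x3 convolution
separable = k * k * c_in + c_in * c_out    # 3x3 depthwise + 1x1 pointwise

print(standard, separable, round(standard / separable, 1))
# → 589824 67840 8.7
```

The same trade-off drives most hand-designed efficient architectures, and it is one of the operations that automatic architecture search routinely selects under a computation budget.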


Gao Huang , Tsinghua University, Beijing, China

Jingdong Wang , MSRA, Beijing, China

Lingxi Xie , Johns Hopkins University, MD, USA


Deep learning has been successfully explored for different multimedia topics in recent years, ranging from object detection, semantic classification, and entity annotation to multimedia captioning, multimedia question answering, and storytelling. Academic researchers are now shifting their attention from identifying what problems deep learning can address to exploring what problems deep learning can NOT address. This tutorial starts with a summary of six 'NOT' problems deep learning fails to solve at the current stage, i.e., low stability, debugging difficulty, poor parameter transparency, poor incrementality, poor reasoning ability, and machine bias. These problems share a common origin in the lack of deep learning interpretation. This tutorial attempts to map the six 'NOT' problems to three levels of deep learning interpretation: (1) Locating: accurately and efficiently locating which features contribute most to the output. (2) Understanding: bidirectional semantic access between human knowledge and deep learning algorithms. (3) Accumulation: storing, accumulating, and reusing the models learned by deep learning. Existing studies falling into these three levels will be reviewed in detail, followed by a discussion of interesting future directions.
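The "locating" level can be illustrated with a minimal, model-agnostic occlusion test (a generic sketch, not a method from the tutorial; the model and weights below are invented for illustration): zero out each input feature in turn and measure how much the prediction moves.

```python
import numpy as np

# Hypothetical toy model: a linear scorer whose weights we pretend not to know.
w = np.array([0.1, 2.0, -0.3])

def model(x):
    return float(x @ w)

x = np.array([1.0, 1.0, 1.0])
base = model(x)

# Occlusion attribution: contribution of feature i = |f(x) - f(x with x_i zeroed)|
contrib = [abs(base - model(np.where(np.arange(3) == i, 0.0, x)))
           for i in range(3)]
print(int(np.argmax(contrib)))  # → 1: the second feature dominates the output
```

The same probe-and-compare idea underlies occlusion maps for image classifiers, where image patches rather than scalar features are masked out.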


Jitao Sang , Beijing Jiaotong University, Beijing, China


Due to the rapid growth of multimedia big data and related novel applications, intelligent recommendation systems have become more and more important in our daily life. During the last decades, various multimedia technologies have been developed by different research communities (e.g., multimedia systems, information retrieval, and machine learning). Meanwhile, recommendation techniques have been successfully leveraged by commercial systems (e.g., Amazon, YouTube, and Spotify) to help general users deal with information overload and provide them with high-quality content, interactions, and services.

While several tutorials and courses were dedicated to multimedia recommendation in the last few years, to the best of our knowledge, this tutorial aims to be an advanced and comprehensive one focusing on intelligent content analytics and its core applications in recommending various types of media content. We plan to summarize the research along this direction and provide a good balance between theoretical methodologies and real system development (including several industrial approaches). Core contributions include:

  • Introducing why advanced recommendation systems are important for Web-scale multimedia retrieval, understanding, and sharing.
  • Examining current commercial systems and research prototypes, focusing on comparing the advantages and the disadvantages of the various strategies and schemes for different types of media documents (e.g., image, video, audio and text) and their composition.
  • Reviewing key challenges and technical issues in building and evaluating modern recommendation systems under different contexts.
  • Discussing and reviewing various limitations of the current generation of systems.
  • Making predictions about the road ahead for scholarly exploration and industrial practice in multimedia and other related communities.

We also plan to hold an open discussion in this tutorial on several promising research directions of significant technical importance and to explore potential solutions. We hope that this tutorial provides an impetus for further research in this important direction.


Jialie Shen , Queen's University Belfast, Belfast, United Kingdom

Jian Zhang , University of Technology Sydney, Sydney, Australia

Tutorial Chairs

Jiebo Luo
University of Rochester, USA
Zheng-Jun Zha
University of Science and Tech of China, China