24-26 Sep 2013, Hyères-les-Palmiers (France)

Program

Cordelia Schmid, DR INRIA, LEAR, Grenoble, France 'Action recognition from videos: some recent results' 

    The amount of digital video content available is growing daily on sites such as YouTube; recent statistics show that around 48 hours of video are uploaded to the site every minute. This massive data production calls for automatic analysis. In this talk we present some recent results on Action Recognition in Videos (ARV). Bag-of-features representations have shown very good performance for ARV. We review the underlying principles and introduce trajectory-based video features, which have been shown to outperform the state of the art. These features are obtained by dense point sampling and tracking based on displacement information from a dense optical flow field. Trajectory descriptors are computed with motion boundary histograms, which are robust to camera motion. 
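    A minimal sketch of the dense-trajectory idea follows: densely sampled points are propagated from frame to frame through a dense optical flow field (Farneback flow is used here as a stand-in for the flow method of the talk), and descriptors such as motion boundary histograms would then be computed along each trajectory. Parameters and the descriptor step are illustrative only.

```python
import cv2
import numpy as np

def track_dense_points(frames, step=5, track_len=15):
    """frames: list of grayscale frames (np.uint8 arrays of equal size).
    Returns trajectories as arrays of (x, y) positions, at most track_len long."""
    h, w = frames[0].shape
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    points = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    trajectories = [[tuple(p)] for p in points]

    for prev, curr in zip(frames, frames[1:]):
        # dense optical flow between consecutive frames
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for traj in trajectories:
            if len(traj) >= track_len:
                continue
            x, y = traj[-1]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < w and 0 <= yi < h:
                dx, dy = flow[yi, xi]          # displacement at current point
                traj.append((x + dx, y + dy))
    # Motion boundary histograms would be built from the spatial gradients of
    # flow[..., 0] and flow[..., 1] sampled along each trajectory (not shown).
    return [np.array(t) for t in trajectories if len(t) > 1]
```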

 

    We then show how to move towards more structured representations by explicitly modeling human-object interactions. We learn to represent a human action as an interaction between a person and an object: we localize in space and track over time both the object and the person, and represent the action by the trajectory of the object relative to the person's position, i.e., our human-object interaction features capture the relative trajectory of the object with respect to the human. 
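    As a toy illustration of the relative-trajectory idea, assuming per-frame bounding-box centers for the person and the object are already available from a detector/tracker (not shown), the interaction feature is simply the object trajectory expressed in the person's coordinate frame; the height normalization below is an assumption of this sketch, not necessarily the exact normalization used in the talk.

```python
import numpy as np

def relative_trajectory(object_centers, person_centers, person_heights):
    """object_centers, person_centers: (T, 2) arrays of (x, y) over T frames;
    person_heights: (T,) array. Returns the (T, 2) relative trajectory."""
    object_centers = np.asarray(object_centers, dtype=float)
    person_centers = np.asarray(person_centers, dtype=float)
    person_heights = np.asarray(person_heights, dtype=float).reshape(-1, 1)
    # object position in the person's frame, normalized by person height
    return (object_centers - person_centers) / person_heights
```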

 

    Finally, we present work on learning object detectors from real-world web videos known only to contain objects of a target class. We propose a fully automatic pipeline that localizes objects in a set of videos of the class and learns a detector for it. The approach extracts candidate spatio-temporal tubes based on motion segmentation and then selects one tube per video jointly over all videos. (Joint work with V. Ferrari, H. Grabner, A. Klaeser, A. Prest, H. Wang) 



Barbara Caputo, Senior Researcher, IDIAP EPF Lausanne, Switzerland 'Learning to learn in computer vision & robotics: some success stories and challenges ahead' 

    The awareness that learning of categories and concepts from multi-modal data should be a never-ending, dynamic process has led to a growing interest in algorithms for leveraging prior knowledge over the last years. This interest has taken different forms in different communities: while the visual recognition and robotics communities have focused mostly on designing algorithms able to cope with large-scale concept learning from multi-modal data, machine learning research has been developing theoretical frameworks able (to some extent) to explain the experimental success of several of these methods. In this lecture I will give an overview of the several settings where learning to learn has been applied (from domain adaptation to transfer learning), review the current state of the art in these research threads, link these algorithms to machine learning theories and outline the open challenges ahead. I will also provide links to various online resources, from software to established benchmark databases. 



Samy Bengio, Senior Researcher, Google, USA 'Large Scale Image/Music Understanding' 

    Image annotation is the task of attaching textual semantics to new images by ranking a large set of possible annotations according to how well they correspond to a given image. In the large-scale setting, there can be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task we propose to build a so-called 'embedding space', into which both images and annotations can be automatically projected. 

 

    In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visuo-semantic tree from these annotations, which describes how concepts (annotations) relate to each other with respect to their visual characteristics. Such a tree differs from semantic-only trees, such as WordNet, which do not take into account the visual appearance of concepts. We propose a new learning-to-rank approach that can scale to such datasets and show some annotation results. The same idea can be applied to many other problems, including music recommendation, which I'll describe briefly. 
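    A minimal sketch of the joint embedding idea: image feature vectors and annotations are both mapped into a common space where ranking is done by dot-product similarity. The ranking loss used to train the projection matrices is not shown; the matrices below are random placeholders and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_image, d_embed, n_labels = 1000, 100, 50000     # hypothetical dimensions

V = rng.normal(scale=0.01, size=(d_embed, d_image))   # image -> embedding map
W = rng.normal(scale=0.01, size=(n_labels, d_embed))  # one row per annotation

def rank_annotations(image_features, top_k=10):
    """Indices of the top_k annotations for one image feature vector."""
    z = V @ image_features            # project the image into the joint space
    scores = W @ z                    # similarity to every annotation
    return np.argsort(-scores)[:top_k]

def similar_annotations(label_index, top_k=10):
    """Annotations whose embeddings are closest to a given annotation."""
    scores = W @ W[label_index]
    return np.argsort(-scores)[:top_k]
```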



Jorge Sanchez, A. Prof. Cordoba univ., Argentina 'Fisher vectors for classification' 

    The Fisher Vector (FV) was introduced for classification as an alternative to the popular Bag-of-Visual-Words (BOV) image representation. As in the BOV, images are characterized by summary statistics computed from a set of low-level patch descriptors extracted from the image. In the FV framework, a sample is characterized by its deviation with respect to a generative model of the data. 

 

    This representation is given by the gradient vector w.r.t. the parameters of the model, which is chosen to be a Gaussian mixture with diagonal covariances. The FV has many advantages compared to the BOV. First, it gives a more complete representation of the samples, as it considers information that goes beyond simple counts. Second, by encoding additional information, it requires smaller vocabularies to achieve a given accuracy, which makes the FV very efficient to compute. Third, its classification performance ranks among the best in a wide range of problems, despite relying on simple linear classifiers. 
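    A compact sketch of the Fisher Vector for the mean parameters of a diagonal-covariance GMM is given below (the full FV also includes variance, and possibly weight, gradients; the power- and L2-normalization steps are omitted). The GMM parameters are assumed to have been estimated beforehand.

```python
import numpy as np

def fisher_vector_means(X, weights, means, sigmas):
    """X: (N, D) local descriptors; weights: (K,); means, sigmas: (K, D).
    Returns the (K*D,) gradient statistics w.r.t. the Gaussian means."""
    N, D = X.shape
    K = weights.shape[0]

    # posterior (soft-assignment) probabilities gamma_{n,k}
    log_p = np.empty((N, K))
    for k in range(K):
        diff = (X - means[k]) / sigmas[k]
        log_p[:, k] = (np.log(weights[k])
                       - 0.5 * np.sum(diff ** 2, axis=1)
                       - np.sum(np.log(sigmas[k]))
                       - 0.5 * D * np.log(2 * np.pi))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # normalized gradient w.r.t. each mean: sum_n gamma_{n,k} (x_n - mu_k)/sigma_k
    fv = np.empty((K, D))
    for k in range(K):
        diff = (X - means[k]) / sigmas[k]
        fv[k] = gamma[:, k] @ diff / (N * np.sqrt(weights[k]))
    return fv.ravel()
```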

 

    I will first present a formal overview of the FV framework, showing some recent results on several small- to large-scale problems. Next, I'll discuss some extensions to the FV which show the generality and modeling power of the approach. Finally, I'll present some applications to other classification-related problems. 



Matthieu Cord, Prof. Paris 6 univ., LIP6, France 'Beyond the Bag of Visual Words model for image representation' 

    I will focus on a few extensions of the classical Bag-of-(Visual)-Words (BoVW) model, a widely used approach for representing visual documents. BoVW relies on the quantization of local descriptors and their aggregation into a single feature vector. The underlying concepts, such as the visual codebook, coding and pooling, and the impact of the main parameters of the BoVW pipeline will be discussed, together with a few proposals concerning pooling.
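    As a bare-bones illustration of the pipeline, the sketch below builds a k-means codebook over precomputed local descriptors, uses hard-assignment coding, and pools the codes into one image-level vector by either summing (histogram) or taking the maximum; all parameter choices are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, n_words=1000):
    """Learn a visual codebook from the pooled training descriptors."""
    return KMeans(n_clusters=n_words, n_init=4).fit(all_descriptors)

def encode_image(descriptors, codebook, pooling="sum"):
    """descriptors: (M, D) local descriptors of one image."""
    words = codebook.predict(descriptors)            # hard assignment
    codes = np.eye(codebook.n_clusters)[words]       # one-hot code per patch
    if pooling == "sum":                             # classical histogram
        pooled = codes.sum(axis=0)
    else:                                            # max pooling
        pooled = codes.max(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-12) # L2 normalization
```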

 

    Recently, unsupervised learning methods have emerged to jointly learn visual codebooks and codes. I will present approaches based on restricted Boltzmann machines (RBM) to achieve this joint optimization. To enhance feature coding, RBMs may be regularized with a sparsity constraint term. I will show experimental results of this code-learning strategy embedded in the BoVW pipeline for image classification. Some extensions concerning hierarchical and bio-inspired approaches for image representation will also be discussed. In addition to classification, I will present some applications to content-based image retrieval, focusing on interactive-learning-based approaches. 
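    Below is a minimal sketch of a binary RBM trained with one step of contrastive divergence, plus a simple penalty that pushes the mean hidden activation toward a small target as one common way to encourage sparse codes; the exact RBM variant and sparsity term used in the talk may differ, and the inputs are assumed to lie in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sparse_rbm(V, n_hidden=256, lr=0.01, epochs=10,
                     sparsity_target=0.05, sparsity_cost=0.1):
    """V: (N, D) data in [0, 1] (e.g. rescaled local descriptors)."""
    N, D = V.shape
    W = 0.01 * rng.standard_normal((D, n_hidden))
    b_v, b_h = np.zeros(D), np.zeros(n_hidden)
    for _ in range(epochs):
        # positive phase: hidden activations and samples
        p_h = sigmoid(V @ W + b_h)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # negative phase (CD-1): reconstruct visibles, re-infer hiddens
        p_v = sigmoid(h @ W.T + b_v)
        p_h_neg = sigmoid(p_v @ W + b_h)
        # CD gradient plus an approximate sparsity gradient
        dW = (V.T @ p_h - p_v.T @ p_h_neg) / N
        dW -= sparsity_cost * (V.T @ (p_h - sparsity_target)) / N
        W += lr * dW
        b_v += lr * (V - p_v).mean(axis=0)
        b_h += lr * ((p_h - p_h_neg).mean(axis=0)
                     - sparsity_cost * (p_h.mean(axis=0) - sparsity_target))
    return W, b_v, b_h   # p_h for new data gives the learned sparse codes
```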



Sebastien Paris, A. Prof. Aix Marseille univ., France 'Efficient Bag of Scenes Analysis for Image Categorization' 

    We address the general problem of image/object categorization with a novel approach referred to as Bag-of-Scenes (BoS). Our approach is efficient for low-semantic applications such as texture classification as well as for higher-semantic tasks such as natural scene recognition or fine-grained visual categorization. It is based on the widely used combination of (i) sparse coding (Sc), (ii) max pooling and (iii) spatial pyramid matching, applied to histograms of multi-scale Local Binary/Ternary Patterns (LBP/LTP) and their improved variants. This approach can be considered as a two-layer hierarchical architecture: the first layer encodes the local spatial patch structure via histograms of LBP/LTP, while the second encodes the relationships between pre-analyzed LBP/LTP scenes/objects. Our method outperforms SIFT-based approaches using Sc techniques and can be trained efficiently with a simple linear SVM. 
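    A rough sketch of the two-layer idea: layer 1 describes local patches by multi-scale uniform-LBP histograms; layer 2 sparse-codes these histograms on a learned dictionary and max-pools the codes (the spatial pyramid and the LTP variant are omitted). Function names and parameters are illustrative, not the authors' exact configuration.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

def lbp_histograms(gray, patch=32, radii=(1, 2, 3)):
    """Describe non-overlapping patches by concatenated multi-scale LBP histograms."""
    feats = []
    for y in range(0, gray.shape[0] - patch + 1, patch):
        for x in range(0, gray.shape[1] - patch + 1, patch):
            block = gray[y:y + patch, x:x + patch]
            hists = []
            for r in radii:
                lbp = local_binary_pattern(block, 8 * r, r, method="uniform")
                h, _ = np.histogram(lbp, bins=8 * r + 2, range=(0, 8 * r + 2))
                hists.append(h / (h.sum() + 1e-12))
            feats.append(np.concatenate(hists))
    return np.array(feats)

def learn_dictionary(all_patch_feats, n_atoms=1024):
    """Layer-2 dictionary learned from training patch histograms."""
    return MiniBatchDictionaryLearning(n_components=n_atoms).fit(all_patch_feats).components_

def bos_vector(patch_feats, dictionary, n_nonzero=5):
    """Sparse-code patch histograms and max-pool them into one image vector."""
    codes = sparse_encode(patch_feats, dictionary,
                          algorithm="omp", n_nonzero_coefs=n_nonzero)
    return np.abs(codes).max(axis=0)
```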



Marc Le Goc, Prof. Aix Marseille univ., France 'Learning with the Theory of Timed Observations' 

    We present the recent Theory of Timed Observations (TTO, Le Goc 2006), which is based on a new mathematical object called the Timed Observation. It notably merges results from the theories of Bayesian networks and Markov chains, Poisson processes, Shannon's theory of communication, and the logical theory of diagnosis. 

 

    By extending the concept of informational entropy to the temporal dimension of the data, the TTO provides (i) the basis of a reasoning process to induce temporal knowledge from timed data, (ii) the organizational laws of the discovered knowledge in a 4-tuple model of the dynamic process that produces the timed data, and (iii) an abstraction principle allowing multi-scale modeling. The theory is thus the first mathematical basis of a learning process that combines a Knowledge Engineering methodology called Tom4D (Timed Observations Modeling for Diagnosis) with a Knowledge Discovery in Databases process called Tom4L (Timed Observations Modeling for Learning).
    The advantage of TTO is to unify the representation formalisms of Tom4D and Tom4L: the human and the data knowledge sources are associated within a single learning process that combines the advantages of human learning with those of the machine. The main contribution of TTO is to model a data production process without any prior knowledge. We present the main concepts and properties of TTO with didactic examples and real-world applications (continuous production processes, smart environments and the financial industry). Through an experimental and conceptual benchmark, we show that the Tom4L process provides better results than the best comparable learning algorithms. We conclude on the new problems that TTO introduces, in particular the validation of the induced knowledge models, and on future developments: multi-scale modeling of the brain. 



Herve Glotin, Prof. Toulon univ. & Inst. Univ. de France 'Sparse and Scattering operators for bioacoustic classification' 

    After a brief introduction to machine learning for automatic speech recognition, we demonstrate the main difficulties in applying it to bioacoustics. We then discuss two efficient approaches for large-scale classification of animal sounds: sparse coding and scattering operators. We illustrate their advantages with various species, from bats to whales.
    A more detailed illustration is given for humpback whale songs, which present several similarities to speech, including voiced and unvoiced vocalizations; a great variety of methods have been used to analyze them. Most studies of these songs are based on the classification of sound units; however, detailed analysis of the vocalizations has shown that the features of a unit can change abruptly throughout its duration, making it difficult to characterize and cluster units systematically. We then show how joint sparse coding and scattering operators can help determine the stable components of a song versus the evolving ones. This results in a separation of the song components, and then highlights song copying between males across years. 
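    Purely as an illustration of the sparse-coding side, and not the authors' actual pipeline: spectrogram frames of one song can be sparse-coded on a learned dictionary, and atoms used consistently across the song can be read as "stable" components, while rarely used atoms point to "evolving" ones. The scattering-transform front end and all parameter choices are omitted or assumed here.

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.decomposition import MiniBatchDictionaryLearning

def song_atom_usage(waveform, fs, n_atoms=64):
    """Return, for each dictionary atom, the fraction of frames that use it."""
    _, _, S = spectrogram(waveform, fs=fs, nperseg=1024, noverlap=512)
    frames = np.log1p(S).T                       # (n_frames, n_freqs)
    dl = MiniBatchDictionaryLearning(n_components=n_atoms,
                                     transform_algorithm="omp",
                                     transform_n_nonzero_coefs=3)
    codes = dl.fit(frames).transform(frames)     # sparse code per frame
    return (np.abs(codes) > 0).mean(axis=0)      # usage frequency per atom
```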
    We will illustrate the scaled bioacoustic paradigm with an overview of the ICML workshop that we organized in 2013: 

Bioacoustic classification challenge at ICML 2013

This work is supported by IUF and the Scaled Acoustic Biodiversity (SABIOD) MASTODONS CNRS project.
