On the Behavior of Convolutional Nets for Feature Extraction

Convolutional neural networks (CNNs) are representation learning techniques that achieve state-of-the-art performance on almost every image-related machine learning task. Applying the representation languages built by these models to tasks beyond the one they were originally trained for is a field of interest known as transfer learning for feature extraction. Through this approach, one can apply the image descriptors learnt by a CNN after processing millions of images to any dataset, without an expensive training phase. Contributions to this field have so far focused on extracting CNN features from layers close to the output (e.g., fully connected layers), particularly because they work better when used out-of-the-box to feed a classifier. Nevertheless, the remaining CNN features are known to encode a wide variety of visual information, which could potentially be exploited in knowledge representation and reasoning tasks. In this paper we analyze the behavior of each feature individually, exploring their intra/inter class activations for all classes of three different datasets. From this study we learn that low and middle level features behave very differently from high level features, the former being more descriptive and the latter being more discriminant. We show how low and middle level features can be used for knowledge representation purposes both by their presence and by their absence. We also study how much noise these features may encode, and propose a thresholding approach to discard most of it. Finally, we discuss the potential implications of these results in the context of knowledge representation using features extracted from a CNN.


Introduction
Image classification is among the most successful applications of deep learning. The performance of deep learning networks on challenges like the ImageNet Large Scale Visual Recognition Competition is based on the capability of these models to build an exceptionally rich representation language for a given dataset, a language that can be used for solving problems like classification or detection, achieving state-of-the-art performance (He, Zhang, Ren, & Sun, 2016). Accordingly, deep learning models are frequently defined as representation learning techniques (LeCun, Bengio, & Hinton, 2015). Unfortunately, to build this representation language, deep networks require a lot of data and a lot of computational effort, which reduces the number of problems to which these models can be directly applied.
Within deep learning, the field of research commonly known as transfer learning tries to reuse the representation language learnt for one problem to solve another. To formalize this we use the notation introduced by Pan and Yang (2010). This notation has two main components, which can be summarized as follows: a domain D, which is defined by a set of data instances with a given probability distribution (e.g., images of a certain resolution with a certain distribution of pixel values), and a task T, which is defined by a set of labels and a target function (e.g., labels assigned to images, and a function classifying those images accordingly).
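As a compact restatement of these two components, and assuming the usual symbols of Pan and Yang (2010) (a feature space, a marginal probability distribution over it, a label space and a predictive function), the notation can be sketched as:

```latex
\mathcal{D} = \{\mathcal{X},\, P(X)\}, \qquad
\mathcal{T} = \{\mathcal{Y},\, f(\cdot)\}
```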
Transfer learning in deep learning is most frequently used to initialize a deep network using the features learnt for a source problem (T_S, D_S), so that the same network can be optimized later through a fine-tuning process for a target problem (T_T, D_T). Significantly, this approach has been shown to produce better results than training a network for (T_T, D_T) from scratch (i.e., using random initialization) (Yosinski, Clune, Bengio, & Lipson, 2014). According to Pan and Yang (2010), this approach would be an example of inductive transfer learning, since labeled data is available for both T_T and T_S. An alternative use of transfer learning is to use a neural network trained for T_S as a feature extractor for T_T, in order to use other machine learning methods on top of the resulting representations. By doing so, one is representing the D_T data in a language learnt for the T_S task, enabling the use of pre-trained deep network representations (not deep network models by themselves) on datasets which lack the size required to train these methods (Azizpour, Razavian, Sullivan, Maki, & Carlsson, 2016; Sharif Razavian, Azizpour, Sullivan, & Carlsson, 2014). According to Pan and Yang (2010), this case would be an example of feature representation transfer (the analogous term transfer learning for feature extraction is also widely used). Notice that this approach to transfer learning can be used to tackle unsupervised learning problems such as clustering (Gui & Morency, 2015), and it would also enable the use of features obtained from an unsupervised learning task (e.g., autoencoders).
In the context of convolutional neural networks (CNNs), most attempts at feature representation transfer have focused on reusing the activations obtained from layers close to the output of the CNN (typically a fully-connected layer). When compared to lower-level layers, these high-level layers provide better results when used out-of-the-box to feed classifiers or clustering algorithms (Azizpour et al., 2016; Sharif Razavian et al., 2014; Gui & Morency, 2015; Donahue et al., 2014). Regardless of these results, all convolutional layers within a CNN encode a large amount of visual knowledge of varying complexity (Yosinski et al., 2015; Garcia-Gasulla et al., 2017a), knowledge which has not been successfully exploited so far. Since the general purpose of feature representation transfer is to maximize representativeness, we may hypothesize that the optimal representation will include, to some degree, information from a larger variety of layers (i.e., beyond the last ones). It is therefore relevant, particularly for the knowledge representation and reasoning fields, to understand what differences exist between convolutional layers and fully connected layers, so that all that is learnt by a CNN can be properly exploited.
In this context, this paper analyzes the behavior of all features within a deep CNN for the purpose of feature extraction. We use a very deep CNN (VGG16) pre-trained on a large dataset ((T_S, D_S) = ImageNet 2012) to build image representations for alternative datasets ((T_T, D_T) ∈ {mit67, flowers102, cub200, ...}), and study the behavior of individual features for the domain defined by each dataset. First, Section 2 introduces previous contributions to the transfer learning field, focusing on feature extraction. Section 3 introduces the datasets and CNN model used in our experiments, as well as the image embedding we build in our feature extraction process. Our study is based on statistical distance methods, which we review in Section 4. Section 5 introduces the behavior of statistical distances for the current problem, and the distributions they compose. These distributions are analyzed in Section 6, while the impact of noise on the measures is discussed in Section 7. The consistency of our findings given a different source task is shown in Section 8. Finally, the conclusions drawn from this study are summarized in Section 9.

Related Work
CNN models are defined by a large number of parameters, requiring lots of data instances (typically images) for their optimization. Until the release of large visual datasets, hand-crafted features (Perronnin, Sánchez, & Mensink, 2010) produced the best results for vision tasks. Nowadays, CNNs can be trained using datasets such as ImageNet and VOC2012 (Everingham, Van Gool, Williams, Winn, & Zisserman, 2012), learning powerful visual descriptors, which allows them to outperform previously competitive solutions such as Improved Fisher Vectors on many visual tasks (Chatfield, Simonyan, Vedaldi, & Zisserman, 2014). Donahue et al. (2014) present one of the first studies on the behavior of convolutional filters in a feature extraction process. In that work, the authors study a CNN (AlexNet architecture, composed of 5 convolutional layers and 3 fully connected layers) trained using ImageNet 2012 as (T_S, D_S) for transfer learning. Qualitatively, the authors observe how features from the first fully connected layer outperform features from lower layers at the task of separating concepts according to the WordNet hierarchy. The authors evaluate features extracted from the last convolutional layer and the first two fully connected layers on various datasets. Their results indicate that by using features from the first fully connected layers to train a support vector machine (SVM) one can achieve state-of-the-art results on various related tasks.
The contribution of Sharif Razavian et al. (2014) goes in a similar direction, using the OverFeat network architecture pre-trained using ImageNet 2012 as (T_S, D_S). The authors focus mostly on the first fully connected layer, performing data augmentation to increase the quality of those features (cropping and rotating samples), and applying a component-wise power transformation. After applying an l2-normalization to the resultant vectors, an SVM is trained and applied to a wide variety of tasks (T_T ∈ {image classification, fine grained recognition, attribute detection, visual instance retrieval, ...}) and domains (D_T ∈ {VOC2007, flowers102, cub200, ...}), achieving competitive results on all of them. For one of those tasks (image classification), features from various layers are evaluated separately using an SVM, with the first fully connected layer obtaining the best results.
A rather different approach is taken by Yosinski et al. (2014), where the goal is to study the transferability of features for the purpose of fine-tuning the deep neural network for the target task and dataset. In that regard, the authors find that the distance between the source and target tasks is strongly related to the depth of the optimal layer to be used in the transfer learning process. Azizpour et al. (2016) empirically evaluated several parameters that can affect the transfer learning process for feature extraction. Among the parameters they considered, some are related to the architecture and training of the initial CNN (network depth and width, distribution of training data, optimization parameters), and some are related to the transfer learning process (fine-tuning, network layer to be extracted, spatial pooling and dimensionality reduction). All these parameters are evaluated on 17 visual recognition tasks, identifying a set of good parameters depending on the distance between the source task and the target task. Regarding the representation layer (which layer is used to build the embedding), the authors find that the first or second fully connected layer produces the best results in most cases when feeding an SVM for classification.
Deep residual networks (ResNets) are an evolution of traditional CNNs which include branching of paths. Unlike CNNs, which stack layers sequentially, ResNets implement shortcut connections which ease convergence during the training process, allowing the training of networks with more layers (up to thousands). Mahmood, Bennamoun, An, and Sohel (2016) explore the use of ResNets for feature extraction, particularly to solve three image classification problems. Results indicate that ResNets are a competitive alternative to classic CNN architectures, also in the context of feature extraction. Long, Cao, Wang, and Jordan (2015) proposed a new Deep Adaptation Network (DAN) architecture to solve the problem of domain adaptation for convolutional neural networks. In this architecture the parameters of the first convolutional layers are reused without modification, while the weights of the last convolutional layers are fine-tuned for the new task. Fully connected weights are tailored to fit specific tasks via Multiple Kernel Maximum Mean Discrepancies (MK-MMD). However, this approach does not consider the problem of solving a different target task and its impact on the reusability of the pre-trained features.

Source and Target Problems
Feature representation transfer in the context of deep learning is particularly useful when the target problem (T_T, D_T) does not include enough labeled data to train its own CNN representation language. In this context, the (T_T, D_T) problem can be solved by using features crafted for a different (T_S, D_S) problem, although the quality of the resultant embedding representation will strongly depend on how similar (T_S, D_S) and (T_T, D_T) are. If the language learnt for (T_S, D_S) lacks the vocabulary to properly characterize the particularities of (T_T, D_T) (e.g., because D_S is defined by black and white images, and D_T includes colorful patterns), the resultant embedding will be of poor quality and any learning applied to it will be deficient. With that in mind, (T_S, D_S) is typically chosen to capture a range of visual patterns as broad as possible, so that its language will be likely to include features relevant for many different target tasks, and capable of characterizing a wide variety of image domains. In this regard, a CNN trained for the ImageNet 2012 dataset (Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, & Fei-Fei, 2015) is a good candidate for a source problem, since the 1,000 categories composing its T_S require a huge variety of visual patterns, while the large number of images available in D_S makes it likely that the learnt model will generalize to different domains.

CNN Architecture
To completely specify T_S one needs the label space Y but also an objective predictive function f(·). In our case the function f(·) is defined by a trained CNN (i.e., its architecture and parameters). There are many popular CNN architectures, and several have been used for feature extraction (see Section 2). Since our goal is to explore the behavior of convolutional layers in the feature extraction process, we will use an architecture which follows the most canonical scheme of layers (i.e., conv/pool/conv/pool/.../fc). At the same time, we wish to use a model capable of learning a rich representation language at various levels (i.e., a very deep network). This combination of requirements leads us to use the VGG16 architecture as the source of features. VGG16 is composed of 13 convolutional layers (with 5 pooling layers) and 3 fully-connected layers (see Table 1 for details on the architecture). The only exceptions are Figures 2, 3, 4 and 7, which are obtained using the VGG19 architecture and are used here only for illustrative purposes. This architecture is from the same authors, and is detailed in the same paper. It only differs from VGG16 by having 3 extra convolutional layers: conv3_4, conv4_4 and conv5_4. Results obtained with the VGG19 architecture were consistent with the ones obtained with VGG16 for all experiments. Both models are publicly available at the authors' web page.

Target Datasets
Once we have defined the source task and domain (T_S, D_S), let us introduce the publicly available datasets we will consider as target (T_T, D_T) in our study on transfer learning:
1. The MIT Indoor Scene Recognition dataset (Quattoni & Torralba, 2009) (mit67) consists of different indoor scenes of 67 categories. Its main challenge resides in the class dependence on global spatial properties and on the relative presence of objects.
(Silvén, Niskanen, & Kauppinen, 2003) (wood) contains knot images from spruce wood, classified according to Nordic Standards. This old dataset of industrial application is considered to be challenging even for human experts.
11. We also use the validation split of ImageNet 2012 (Russakovsky et al., 2015) (imagenet) as a target problem for comparison purposes. Notice that the images composing this dataset are different from the training set of ImageNet 2012, so they can have a different distribution, which implies that the domains can be different, but similar (D_S ≈ D_T). For disambiguation we will refer to ImageNet 2012 when talking about the whole dataset and to imagenet when talking about this target task T_T.
Dataset sizes, number of classes and number of images per class are specified in Table 2. In our experiments we do not train models using these datasets, which means we do not require the provided train and test splits. Instead, we will merge both splits to make use of all the data available.

Embedding
Given the source and target problems (T_S, D_S), (T_T, D_T), there are still several parameters that can modify the construction of the embedding space. Most of those parameters were explored by Azizpour et al. (2016). In our case, we will use two main parameters which we consider to be coherent with our study. First, each image representation will be built as a result of processing 10 crops of the image (4 corners and middle crop, mirrored) through the CNN and averaging the resulting activations. This is a frequently used methodology for feature extraction (Sharif Razavian et al., 2014; Azizpour et al., 2016), which provides robustness to the resultant representations. Second, we perform a spatial average pooling of each convolutional layer to obtain a single value per filter. This transformation reduces the number of features in the embedding, as well as the relative spatial information (i.e., each resulting feature will determine if a visual pattern is found or not in the image on average, regardless of its exact location), while maintaining most of its descriptive power (i.e., each feature is still separately accounted for in the embedding). This spatial pooling methodology is also a recurrent solution in the field (Sharif Razavian et al., 2014; Azizpour et al., 2016).
Since we wish to explore the behavior of convolutional layers, our embedding will contain all 13 convolutional layers available in VGG16 (from conv1_1 to conv5_3). For comparison purposes we will also extract the fully connected layers (fc6, fc7), so that we can contrast the behavior of the convolutional and fully connected features. Notice that the spatial pooling performed on the convolutional layers cannot be applied to the fc layers. The components of the resultant embedding, composed of 12,416 values, are shown in Table 1. For the remainder of the document, all mentions of the embedding will refer to this representation.
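The construction of the embedding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-layer filter counts are those of the standard VGG16 configuration, the helper names `spatial_average_pool` and `average_over_crops` are hypothetical, and activations are represented as plain nested lists instead of the tensors a real CNN framework would return. The sketch also checks that 13 pooled convolutional layers plus fc6 and fc7 add up to the 12,416 values mentioned above.

```python
from statistics import fmean

# Filters per VGG16 convolutional layer (conv1_1 ... conv5_3),
# following the standard VGG16 configuration (assumption of this sketch).
VGG16_CONV_FILTERS = [64, 64, 128, 128, 256, 256, 256,
                      512, 512, 512, 512, 512, 512]
# Sizes of the two extracted fully connected layers (fc6, fc7).
VGG16_FC_UNITS = [4096, 4096]


def spatial_average_pool(feature_maps):
    """Collapse each H x W feature map into a single value per filter.

    feature_maps: list of filters, each given as a list of rows of floats.
    """
    return [fmean(v for row in fmap for v in row) for fmap in feature_maps]


def average_over_crops(per_crop_vectors):
    """Average the pooled vectors of several crops element-wise."""
    return [fmean(values) for values in zip(*per_crop_vectors)]


# Embedding length: one value per conv filter plus one per fc neuron.
embedding_size = sum(VGG16_CONV_FILTERS) + sum(VGG16_FC_UNITS)
print(embedding_size)  # 12416

# Toy layer with 2 filters of size 2x2, seen through 2 crops.
crop_a = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 4.0]]]
crop_b = [[[2.0, 2.0], [2.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]]]
pooled = average_over_crops([spatial_average_pool(crop_a),
                             spatial_average_pool(crop_b)])
print(pooled)  # [3.0, 1.0]
```

Since both operations are averages, pooling each crop first and then averaging over crops gives the same result as averaging the crops first.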

Statistic Distance Methods
Previous studies on the usefulness of convolutional layers for feature extraction transfer learning have been purely empirical, based on the performance of specific classifiers (most frequently, an SVM) using the features extracted from a single layer of a CNN (see Section 2). This approach has been shown to provide consistent results, but it is limited to classification, and strongly influenced by the choice of classifier (e.g., some classifiers may perform better with a certain number of variables, or may be affected differently by noise).
In this paper we propose a different approach to evaluate the behavior of CNN features. Instead of evaluating the performance of a specific machine learning algorithm on the embedding, we measure the descriptive power of CNN features statistically, studying their behavior for the different classes composing each dataset. The goal is to learn about the descriptive nature of CNN features, so that other knowledge representation and reasoning methodologies can be adapted accordingly.
In detail, our approach consists of evaluating how characteristic each feature in the embedding is for each of the target classes of the considered datasets. In other words, we do not want to evaluate the descriptive power of a group of features (which would be a feature selection problem) but to analyze the discriminative power of each single feature. CNN neurons do not have a crisp behavior w.r.t. classes (i.e., neurons do not activate binarily depending on the class), not even for the original training task. Instead, each CNN neuron provides a fuzzy piece of information for each class. To contextualize the information provided by individual features, we consider their activations on a given class of the target task T_T (inner-class behavior), and compare them with the activations happening for the rest of the classes of the same dataset T_T (outer-class behavior). This will also give us an insight into how these features would perform on their own for representing each single class within a dataset.
The inner/outer class behavior can be visualized through two histograms of feature activations (see left plot of Figure 1). Statistically speaking, rescaled histograms are density estimations approximating a true probability density function (PDF). Although more sophisticated methods are available (Scott, 2015), we use the histogram for the sake of computational simplicity. To study the behavior of a given CNN feature for a given class (a feature-class pair), we compare the corresponding inner/outer density estimations. The first statistical distance we consider using for that purpose is the well-known Kullback-Leibler (KL) divergence. The Kullback-Leibler divergence measures how much two PDFs, P and Q, differ, following Equation 1, where i are the points in the domain.
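The equation itself is not reproduced in this text. Assuming the standard discrete form of the divergence is the one intended, with P and Q the two density estimations evaluated over the points i of the domain, it reads:

```latex
D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \, \log \frac{P(i)}{Q(i)}
```

Note that this expression is undefined whenever Q(i) = 0 while P(i) > 0, which is a practical concern when working with histograms that contain empty bins.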
Although histograms are only approximations of PDFs, it is possible to fit a PDF (e.g., normal distribution, uniform distribution, etc.) to a histogram. This is, however, inconvenient, since the histograms of different features may be fit by different PDFs, and there may be some features which are not properly fitted by any PDF. Mutual information measures the information that two random variables, X and Y, share. It can be understood as the expectation of the Kullback-Leibler divergence of the univariate distribution p(x) from the conditional distribution p(x|y). The information gain is greater as the difference between the distributions p(x|y) and p(x) grows. In our experiments, Y is analogous to belonging to class c. Thus, p(x|y) represents the inner-class distribution, and p(x|¬y) represents the outer-class distribution. This p(x|¬y) can be an approximation of p(x) if the number of samples of other classes is much bigger than for class c, formally |I_¬c| ≫ |I_c| ⇒ p(x|¬y) ≈ p(x). In this case, which is usual for tasks with a high number of classes and evenly distributed samples, the mutual information can be approximated by the Kullback-Leibler divergence.
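The relation between mutual information and the Kullback-Leibler divergence can be written compactly as follows. This is the textbook identity, stated for reference and not taken from the paper's own equations; the second expression restates the approximation that holds when the outer-class distribution is close to the marginal one.

```latex
I(X; Y) = \mathbb{E}_{y}\!\left[ D_{KL}\big( p(x \mid y) \,\|\, p(x) \big) \right],
\qquad
p(x \mid \neg y) \approx p(x) \;\Rightarrow\;
D_{KL}\big( p(x \mid y) \,\|\, p(x) \big) \approx D_{KL}\big( p(x \mid y) \,\|\, p(x \mid \neg y) \big)
```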
The Bhattacharyya distance is an alternative to KL which can measure the distance between two discrete probability distributions. Analogously, it can be measured from two density estimations P and Q following Equation 2, where i are the discrete points of the domain X.
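The equation is not reproduced in this text. Assuming the standard definition of the Bhattacharyya distance for discrete distributions is the one intended, it reads:

```latex
D_{B}(P, Q) = -\ln \left( \sum_{i \in X} \sqrt{P(i)\, Q(i)} \right)
```

The sum inside the logarithm is the Bhattacharyya coefficient, which equals 1 for identical distributions (distance 0) and 0 for distributions with disjoint support (infinite distance).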
By comparing two density estimations, the Bhattacharyya distance can be used directly on the data, without having to choose a fitting PDF. However, its mathematical range is non-negative ([0, ∞)), making the Bhattacharyya distance unable to identify which density estimation is above and which is below. In our analysis it will be of interest to know if an inner-class behavior is higher than the outer-class behavior or vice versa, since both situations may provide different insights. The Kolmogorov-Smirnov statistic (D_KS) measures the distance between two empirical distribution functions (EDFs) P and Q. For each point i in the domain X, D_KS evaluates the distance between P(i) and Q(i), and obtains the maximum. It is formally defined in Equation 3 and graphically displayed in the right plot of Figure 1. To reduce the computational cost of evaluating D_KS, we discretize the domain of each EDF into 100 bins, which limits the domain resolution to 1% of its range. Notice that, since an EDF is a cumulative distribution, D_KS is directly computed on a set of values (one pair of inner/outer class behaviors).
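The equation is not reproduced in this text. Following the verbal definition above (the maximum over the domain of the distance between the two EDFs), and assuming the standard unsigned form, it can be written as:

```latex
D_{KS}(P, Q) = \max_{i \in X} \left| P(i) - Q(i) \right|
```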
In contrast with the Bhattacharyya distance, the mathematical range of D_KS is [0, 1]. We use a signed variant of D_KS where the sign indicates which EDF is above and which is below at the point where P and Q differ most. This variant extends the range of D_KS to [−1, 1] and allows us to differentiate when inner class behavior is above outer class behavior (D_KS > 0) and vice versa (D_KS < 0). D_KS = 0 means that both distributions are identical, while D_KS = 1 and D_KS = −1 mean that the two distributions do not overlap at any point. The Kolmogorov-Smirnov statistic does not require the fitting of a PDF (unlike the Kullback-Leibler divergence), which is desirable. For these reasons, in all our following experiments we will use the signed version of the Kolmogorov-Smirnov statistic (D_KS).
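A minimal Python sketch of the signed variant, using only the standard library, may help to fix ideas. The function name `signed_ks`, the evenly spaced evaluation grid and the exact sign convention are assumptions made for illustration and are not taken from the authors' implementation. The sign is chosen so that a positive value means inner-class activations tend to be higher than outer-class ones, which corresponds to the inner-class EDF lying below the outer-class EDF at the point of maximum difference.

```python
from bisect import bisect_right


def signed_ks(inner, outer, bins=100):
    """Signed Kolmogorov-Smirnov statistic between two activation samples.

    Positive when inner-class activations tend to be higher than
    outer-class ones (sign convention assumed for this sketch).
    """
    inner, outer = sorted(inner), sorted(outer)
    lo = min(inner[0], outer[0])
    hi = max(inner[-1], outer[-1])
    if hi == lo:
        return 0.0  # all activations identical: no difference to measure
    best = 0.0
    for k in range(bins + 1):
        # Evaluate both EDFs on a grid covering the shared domain.
        x = lo + (hi - lo) * k / bins
        f_inner = bisect_right(inner, x) / len(inner)
        f_outer = bisect_right(outer, x) / len(outer)
        # Higher inner activations push the inner EDF below the outer one.
        diff = f_outer - f_inner
        if abs(diff) > abs(best):
            best = diff
    return best


print(signed_ks([5, 6, 7], [1, 2, 3]))  # 1.0
print(signed_ks([1, 2, 3], [5, 6, 7]))  # -1.0
print(signed_ks([1, 2, 3], [1, 2, 3]))  # 0.0
```

The two extreme cases use samples that do not overlap at all, and the last one uses identical samples.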

Statistic Distance Behaviors
Our statistical distance analysis is based on the inner/outer class D_KS. In this section we introduce the behavior of the D_KS values, and the layer-wise distribution of these values for a given dataset. Section 5.1 contains a detailed study of the distributions from various perspectives.
Simply put, a distance D_KS(f, c) ≈ 0 means that the distribution of activations of feature f for all the images belonging to class c (i.e., I_c) is almost identical to the distribution of values of feature f for all the images that do not belong to that class (i.e., I_¬c). If D_KS(f, c) > 0, then feature values for images I_c tend to be higher than for the rest of the images I_¬c, which implies that the visual elements represented by feature f are more commonly found in I_c images than in I_¬c images. Similarly, if D_KS(f, c) < 0, feature values are in general lower for I_c than for I_¬c, which implies that elements represented by feature f are rare within I_c images when compared to the rest of the dataset.
To illustrate this behavior we explore which features have the highest D_KS values for the mit67 dataset (i.e., closest to 1). Figure 2 shows some of the top D_KS(f, c) values for different layers, indicating the class c in which the large D_KS value occurs. To show what a particular feature is encoding, we plot the 9 image crops from the ImageNet 2012 validation set (i.e., imagenet) producing the highest activation value for that feature. Images from this dataset will provide better feature characterizations, since CNN features were originally trained for its classes. In the example of conv3_3 n145, which presents a high D_KS value for the Florist class, the crops producing high activations correspond to colorful patterns in contrast with their surroundings.
Figure 2 (caption): The class producing each high D_KS value is shown below each feature. Each feature corresponds to a specific neuron (for fully connected layers) or filter (for convolutional layers) from the original CNN model. To illustrate the visual patterns captured by each feature, we show the 9 cropped images from the ImageNet 2012 validation set producing the highest activation values for this neuron or filter. Images are cropped to match the neuron receptive field.
Analogous to the study of positive D_KS values of Figure 2, we consider the lowest D_KS values (i.e., closest to −1). Initially, one could expect that the features having the lowest D_KS for a given class c would be those identifying elements which never appear in the images of c. For example, a hypothetical class whale could be expected to have an extremely negative D_KS for a feature identifying a wheel. However, since the D_KS values are computed in the context of a dataset (i.e., they indicate inner/outer class disparity), such an assumption is incomplete. As a matter of fact, features having the lowest D_KS for a given class c are those identifying elements which appear in the images of c very rarely when compared with their frequency in the rest of the images. For example, in a dataset composed only of the classes whale and clownfish, the features with the lowest D_KS values for the class whale would correspond to those having the highest D_KS values for the class clownfish, most likely features identifying orange related patterns. On the other hand, features identifying patterns that are uncommon in both classes (e.g., a wheel) would have a D_KS value close to zero for both classes, as their inner/outer class distributions would be very similar.
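The whale and clownfish example can be reproduced with synthetic numbers. The sketch below is purely illustrative: the activation values are randomly generated rather than extracted from a CNN, the feature names `orange` and `wheel` are hypothetical, and `signed_ks` is a compact signed Kolmogorov-Smirnov helper written for this example (positive when inner-class activations tend to be higher), not the authors' code.

```python
import random
from bisect import bisect_right


def signed_ks(inner, outer):
    """Signed KS statistic evaluated at every observed activation value."""
    inner, outer = sorted(inner), sorted(outer)
    best = 0.0
    for x in sorted(inner + outer):
        f_inner = bisect_right(inner, x) / len(inner)
        f_outer = bisect_right(outer, x) / len(outer)
        diff = f_outer - f_inner  # > 0 when inner values tend to be higher
        if abs(diff) > abs(best):
            best = diff
    return best


random.seed(0)
# Synthetic activations of two hypothetical features on a two-class dataset.
whale = {"orange": [random.gauss(0.2, 0.1) for _ in range(200)],
         "wheel": [random.gauss(0.05, 0.02) for _ in range(200)]}
clownfish = {"orange": [random.gauss(0.8, 0.1) for _ in range(200)],
             "wheel": [random.gauss(0.05, 0.02) for _ in range(200)]}

results = {}
for name in ("orange", "wheel"):
    # With only two classes, the outer class of one is the other class.
    d_whale = signed_ks(whale[name], clownfish[name])
    d_clownfish = signed_ks(clownfish[name], whale[name])
    results[name] = (d_whale, d_clownfish)
    print(name, round(d_whale, 2), round(d_clownfish, 2))
```

In this two-class setting the value for one class is the exact negative of the value for the other. The `orange` feature is strongly negative for whale and strongly positive for clownfish, while the `wheel` feature, which is equally rare in both classes, stays close to zero.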
To illustrate the behavior of extremely negative D_KS values, Figure 3 shows a feature which has extremely negative D_KS values (among the top 10 lowest) for four different classes of the cub200 dataset. This particular feature (neuron n1946 of the fc7 layer) is apparently specialized to recognize flying animals, as shown by the set of images from the ImageNet validation set which produce a maximum feature activation (see first row of Figure 3). A deeper analysis of the feature, based on the methods used by Yosinski et al. (2015), indicates that both the central colorful figure and the cluttered background are influential for the feature activation. Nevertheless, according to our D_KS study, this feature produces top negative values for several classes of birds. The explanation behind this lies in the particularities of the classes for which the feature produces extremely negative D_KS values: the four classes correspond to birds which live in a water or coastal environment, and which have dull colors (see second row of Figure 3). The feature, on the other hand, is apparently specialized in identifying colorful flying animals lying on branches. In this case, the extremely negative D_KS values for this neuron would be analogous to identifying flying animals of dull colors through the absence of visual features. Another example of this behavior for the flowers102 dataset is shown in Figure 4. Again, two features which produce top 10 negative D_KS values seem to be very representative of the whole dataset (classes of flowers), but not so for a few specific classes. One feature encodes the visual patterns corresponding to radial orange and red patterns (feature n1449 of fc7), while the other focuses on wide petals (feature n3529 of fc7). Clearly, the classes of flowers with highly negative D_KS values do not have these properties. Hence, through the abnormal absence of both of these features (e.g., the spear thistle class shown in Figure 4), we are roughly characterizing flowers without radial pistils and wide petals.
These two examples illustrate how the lack of feature activations can convey relevant information. Notice how the behavior of negative D_KS values depends on the context provided by the dataset, as extremely negative values on some classes will only happen for features which have a consistently high value on the rest of the dataset. Statistically, the extremely negative values of a feature can only happen for a small set of classes, since, if the set of classes grew, the inner/outer class disparity would decrease, making D_KS closer to zero. This capability of extracting knowledge from the lack of data is novel and particularly relevant for feature representation transfer, where features are not originally designed for the target task. In this setting, both the presence and absence of visual patterns can provide relevant information for the characterization of images.
Let us also discuss the relevance of this behavior for fine-grained datasets, those containing classes belonging to a small, rather similar family of entities. Since extremely negative D_KS values identify unusually low feature activations, the feature needs to be frequent on most of the dataset (e.g., brightly colored flying animals that live on trees are frequent in a dataset of birds). This may often happen in fine-grained datasets, where there are many common features in the data. However, in broad datasets which include a wider visual variety of classes (e.g., ImageNet, mit67), there are far fewer features which are frequent in most classes and infrequent in a few. Hence, it will be much harder to obtain extremely negative D_KS values.

Statistic Distance Distributions
After introducing the essential behavior of positive and negative D_KS, we now consider the overall distribution of D_KS values per dataset. By plotting all D_KS(f, c) values produced for a dataset, we obtain a clear bimodal distribution, separating positive (D_KS^+) and negative (D_KS^−) values (see Figure 6 for an example). Each mode on its own resembles a log-normal distribution. To represent the distribution of D_KS values for all layers and datasets in a single plot, Figure 5 flattens each distribution and displays the two corresponding modes and error bars. Before discussing the resultant distributions, let us define a few terms which we will use in the following sections. A data representation which is good at modeling the target domain D_T can be considered to be highly descriptive, as it is capable of characterizing the associated data. On the other hand, a data representation which is good at modeling the target task T_T can be considered to be highly discriminative, as it is capable of separating the associated labels. The same categorization can be made for features: highly descriptive features are the ones that help to build a rich representation of the domain, and highly discriminative features are the ones that help to solve the classification task. In the context of our study, the discriminativeness of a feature f w.r.t. a class c is shown by how close D_KS(f, c) is to either −1 or 1 (as discriminative features are expected to produce very different values for I_c and I_¬c). Unfortunately, the descriptiveness of a feature cannot be illustrated in terms of D_KS values, as descriptiveness originates from the domain and not from the task labels (which is what D_KS measures). All further references to the discriminativeness of a feature will refer to this definition.
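The separation of the D_KS values of a layer into its positive and negative modes can be sketched as follows. The function name `summarize_modes`, the input values and the choice of mean and standard deviation as the summary of each mode are assumptions of this example; the text does not specify how the modes and error bars of Figure 5 are computed.

```python
from statistics import fmean, pstdev


def summarize_modes(dks_values):
    """Split signed D_KS values by sign and summarize each mode."""
    positive = [d for d in dks_values if d > 0]
    negative = [d for d in dks_values if d < 0]

    def summary(values):
        if not values:
            return None  # no values of this sign for the layer
        return {"count": len(values),
                "mean": fmean(values),
                "std": pstdev(values)}

    return {"positive": summary(positive), "negative": summary(negative)}


# Hypothetical D_KS(f, c) values for one layer of one dataset.
layer_dks = [0.4, 0.6, -0.5, -0.3, 0.5, -0.4]
modes = summarize_modes(layer_dks)
print(modes["positive"])
print(modes["negative"])
```

Under the definition of discriminativeness given above, the further the two summaries lie from zero, the more discriminative the features of that layer are for the dataset.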
Let us now discuss the distributions of D KS values shown in Figure 5. Regardless of the layer depth, most features are discriminative for most tasks (either positively or negatively), as there are very few D KS values close to zero. The separation between D + KS and D − KS decreases on deeper layers (the fully-connected ones), particularly for those tasks which differ the most from the source task. Since deeper features are more specialized for the source task, more of these features may turn out to be irrelevant for the new task, producing similar activation values for I c and I ¬c , which in turn results in D KS values closer to zero. Indeed, the distance between D + KS and D − KS does not decrease for those datasets which are essentially a subset of the source task, such as imagenet, caltech101 and caltech256. The behavior of D KS distributions on fully-connected layers is further discussed in Section 6.2.
To further investigate how D KS distributions vary with layer depth, Figure 6 shows the distributions separated in two plots: one for features from convolutional layers and another for features from fully connected layers. According to the top plot of Figure 6, almost all features from convolutional layers are equally discriminative for all datasets, even for the tasks which are a direct subset of the source task (e.g., imagenet). Furthermore, the number and degree of positively discriminative features is almost symmetrical to the number and degree of negatively discriminative features. This indicates that convolutional features contain a similar amount of information to be exploited from both modalities. The behavior of D KS distributions on convolutional layers is further discussed in Section 6.1.
Most of these insights are coherent with the findings in the state-of-the-art, indicating that features from high-level layers are more specific and discriminant, particularly for target tasks which are close to the source task (Azizpour et al., 2016). However, our results indicate that features from low-level layers are more general and discriminant than originally considered. This opens the door to using them for knowledge representation purposes and related problems such as unsupervised learning.

Desirable Distributions of D KS Values
Before getting into the detailed analysis, let us consider what characterizes a useful feature from the perspective of D KS distributions. This will help motivate some of the conclusions we draw from the subsequent analysis.
As mentioned before, features with high absolute D KS value for a given task are discriminative of the classes of that task. Considering all the features of a layer together, as in Figure 6, the desirable distribution becomes one with density concentrated as much as possible on the extremes of the x axis. The two plots of Figure 6 show that convolutional features are on average further away from the irrelevancy of D KS = 0. However, the literature contains plenty of experiments where a fully-connected layer is shown to outperform a convolutional layer for classification. The explanation for this phenomenon is that the relationship between D KS values and discriminativeness is not linear, as discriminativeness grows rapidly when D KS approaches 1 or −1. As a result, having two features with D KS = 0.3 is not as good as having a single feature with D KS = 0.5. The higher discriminative power of fully-connected features reported in the literature for certain datasets is thus supported by the distribution of D KS values (bottom plot of Figure 6) which, for some datasets, has a slightly higher and longer tail on the positive side than the distribution for convolutional features (top plot of Figure 6). For other datasets (e.g., wood, flowers102) convolutional features have more discriminative power than fully-connected features.

Analysis of Statistic Behaviors
In this section we discuss some of the observations we make on the plots introduced in the previous section. In the two following sections we separately analyze the behavior of convolutional layers and fully-connected layers.

Analysis of Convolutional Layers
In all figures, datasets are ordered by average number of images per class. As shown in Figure 5, there is a clear correlation between that number and the D KS values obtained on the low convolutional layers (fewer images per class may cause more extreme D KS values). This correlation decreases with layer depth and is non-existent on the fully-connected layers. Middle and higher layers are more affected by other properties like dataset similarity, as we will see later.
We start our analysis of the distributions for convolutional layers shown in Figure 5 by focusing on the unusually long bars for the wood dataset and, to a certain degree, for the flowers102 dataset. This behaviour is most obvious in the first convolutional layers of the D + KS modality, becomes attenuated in later layers, and is not mirrored in the D − KS modality. Beyond having relatively few images per class (a particular class of the wood dataset has only 14 samples), the flowers102 dataset, and especially the wood dataset, are composed of classes which differ only in small, texture-like characteristics. Low level convolutional layers are known to learn filters similar to Gabor filters and color blobs (Yosinski et al., 2014), which are appropriate for solving this sort of problem. These two factors explain why these features are so disproportionately discriminative for these datasets. As to why the textures dataset does not display this behavior, even though it is a dataset specifically about textural patterns, the answer lies in the composition of the dataset. In addition to having more images per class, images from the textures dataset display textures at an image level, and by looking at a few pixels in the image (as low convolutional layers do) it is impossible to identify the texture (e.g., there are large portions of images labeled as wrinkled which do not show a single wrinkle). Coherently, the most discriminative features for this dataset are the ones found within middle and upper convolutional layers. As an example of the behavior of low level convolutional layers, Figure 7 shows some of the features from layers conv1_1 and conv1_2 that produce very high D + KS values for a specific class of the flowers102 dataset. These particular features correspond to horizontal gradients, vertical gradients and edge detectors, features which appear unusually often in images of this class when compared to the rest of the flowers in the dataset.
Beyond the behavior of the wood and flowers102 datasets for the first convolutional layers, the distribution of D KS values is rather stable in general. The top plot of Figure 6 shows that low-level convolutional features behave similarly for all datasets (including imagenet). Even though these features were optimized for the classification of ImageNet 2012 classes, it seems that they are roughly as discriminative for imagenet as they are for the rest of the datasets. This provides further evidence on why transfer learning for fine tuning produces such good results (Yosinski et al., 2014), but also indicates that features from these layers could be used almost ubiquitously for knowledge representation. The only dataset behaving clearly differently at the extreme values of low-level layer features (by having a higher tail) is the wood dataset, for the reasons previously discussed: very detailed classes and small sample size. flowers102 and caltech101 also have tails with a height above average, as these datasets share these properties (both in the case of flowers102, and only a limited sample size in the case of caltech101).

Analysis of Fully Connected Layers
Let us now consider the distributions of D KS values for the fully connected layers through the bottom plot of Figure 6. The dataset food101 has the most distinct distribution, with a large spike of D KS values close to 0 (close to 8% of feature-class pairs fall within the same bin) and very few D KS values close to either -1 or 1. This behaviour is likely to be directly related to the variability of the domain, as well as to the number of images per class (food101 has the most, with 250). For something as inconsistent as food, a large number of samples may lead to very different activations within a class (e.g., a caesar salad can include many different ingredients presented in many different ways). These variations lead to indistinguishable inner-class and outer-class behaviors, which in turn results in D KS values close to zero. The particular behavior of the food101 dataset does not extrapolate to the other datasets which also have a large number of images per class, such as catsdogs, stanforddogs and caltech256. Since these datasets are very similar to the source task T S (ImageNet2012), these results indicate that similarity between tasks is the most relevant property for the behavior of fully-connected features. This is further supported by the distributions corresponding to the caltech101 and caltech256 datasets. While they differ in their average number of instances per class |I c | (91 vs. 120), their total size (9,146 vs. 30,607) and their number of classes (101 vs. 256), both D KS distributions are almost identical (see plot (a) of Figure 8). To explore this consideration, next we categorize the 11 datasets in 3 groups, based on their degree of overlap with ImageNet2012:
(a) Datasets where the classes are a direct subset of the ones in ImageNet2012. This group includes imagenet, stanforddogs, catsdogs, caltech101 and caltech256.
(b) Datasets where the classes partially intersect with the ones in ImageNet2012. This group includes cub200, flowers102 and food101.
(c) Datasets where the classes are completely disjoint with the ones in ImageNet2012. This group includes wood, mit67 and textures.
Figure 8 shows the distribution of D KS values for the fully-connected features, plotted separately for each of these three groups. Group (a) is the only group where the distribution gets very close to zero on the y axis at D KS = 0, which happens for three of its five datasets. This implies that, in these three datasets, there is not a single feature-class pair which has an identical inner and outer-class distribution. In other words, for these datasets all fully-connected features are at least somewhat discriminative for all classes. The three datasets showing this behavior are imagenet, caltech101 and caltech256, all of them wide spectrum datasets directly contained within the source task of ImageNet2012. Significantly, this happens regardless of the number of classes (1,000, 101 and 256 respectively). On the other hand, the two datasets of group (a) for which this does not happen (stanforddogs and catsdogs) are limited to a certain domain (dogs, and cats and dogs respectively). Even though these datasets are subsets of the source task, there are still some fully-connected features which are not discriminant for any class. These irrelevant feature-class pairs most likely correspond to those features used to characterize the type of elements which are found in ImageNet2012 but not in these restricted domain datasets (e.g., those used to characterize non-living things). These indiscriminant features prevent the D KS distribution from reaching zero at the D KS = 0 point. The impact of including a wide spectrum of classes on the absence of indiscriminant features is further supported by the dataset with the fourth lowest percentage of feature/class pairs close to D KS = 0. That is the textures dataset (see panel (c) of Figure 8), which, although it apparently has little in common with the source ImageNet2012 task, includes a wide variety of textures coming from plants, animals, man-made objects, etc.
In general, the bimodal distributions for both groups (a) and (b) are more imbalanced than for group (c), as the D − KS part of the distribution accounts for a significantly larger proportion of the total area. On the other hand, the distribution of values for group (c) is closer to the distribution of values for the convolutional features (see top plot of Figure 6), where both modalities are symmetrical. The imbalanced behavior of groups (a) and (b) is explained by the very nature of fully-connected features, which were optimized during their original training to strongly activate for a small subset of classes and to be inhibited for the vast majority. This also results in a higher tail on the D + KS side. On the other hand, the more balanced behavior of group (c) indicates that in these cases, instead of activating very strongly for a small set of classes, features activate moderately for a larger number of classes. This is particularly interesting, as it indicates that fully-connected features could be treated as convolutional features when the target task is completely different from the source task.

Level of Noise and Thresholding
In Section 6, we discussed the distribution of D KS values at a dataset level, assuming that the D KS values were evenly distributed among the classes that compose a dataset. However, this may not be the case, as a subset of the classes composing a dataset may have a large set of relevant features, while another subset of classes is under-represented, with no or very few features characterizing them. To check whether this happens, in Figure 9 we plot a cumulative distribution of D + KS values per class. Each of the black lines represents a single class in the dataset. The graph is cumulative, showing for each class how many features have a D + KS value greater than the x axis value. Thus, at D KS = 0.2 (on the x axis) we are plotting the number of features that meet D KS > 0.2 (on the y axis) for each class. Figure 9 shows a certain variance among classes of the same dataset for any given D KS threshold. This implies that some classes are more richly characterized by the embedding than others, as suspected. Although no class reaches 0 on the y axis until around D KS = 0.4, some do not reach 0 until a remarkable D KS = 0.9. To check whether a class that reaches 0 features at D KS = 0.4 is nonetheless characterized by the embedding beyond noise, in the same figure we show the behavior of the same dataset with randomized labels (in red). This is obtained by assigning to each image a random label while keeping the total number of instances per class unmodified (i.e., shuffling the real labels). Notice that this process leaves unchanged properties like the number of classes or the imbalance in the number of instances per class. By randomizing the labels we can observe the characterization that the embedding produces of purely noisy classes with the same characteristics as the target task.
As shown for all 11 datasets of Figure 9, most randomized classes drop to 0 features between D KS = 0.1 and D KS = 0.3. The gap between the black and red lines allows us to assert that all classes are represented meaningfully (i.e., beyond randomness) at a certain point. It also raises the question of which portion of this curve could or should be pruned to maximize discriminativeness while minimizing noise. This is equivalent to asking which is the minimum D KS value we consider to be relevant when choosing the features to characterize a class.
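The two ingredients of this comparison, the per-class cumulative count and the label randomization, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the D KS values of a class are assumed to be available as a plain list, and the random seed is arbitrary.

```python
import random


def features_above(dks_values, x):
    """Number of features of one class whose D KS value is greater than x."""
    return sum(1 for d in dks_values if d > x)


def cumulative_curve(dks_values, grid):
    """One line of the cumulative plot: a count for every threshold in grid."""
    return [features_above(dks_values, x) for x in grid]


def shuffle_labels(labels, seed=0):
    """Randomized labels obtained by shuffling the real ones.

    Shuffling keeps the number of classes and the number of instances
    per class unchanged; only the image-to-label assignment changes.
    """
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Recomputing the D KS values with the shuffled labels and drawing `cumulative_curve` for every class would give the counterpart of the red lines, under the assumptions above.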

Threshold Measure
One of the main goals of this paper is to study the viability of using convolutional features for feature representation transfer. However, due to their lack of specificity, many convolutional features may generate noise in the sense that they do not provide any information related to the target labels. The plots of Figure 9 showing the inner and outer class distributions for randomized classes provide a first insight into the actual magnitude of that noise.
We now consider the definition of thresholds t + and t − on D KS , such that every D + KS < t + or D − KS > t − could be safely discarded in a feature representation transfer process. These thresholds should allow us to determine which features are likely to be relevant for each class, canceling out a significant amount of noise. Defining such a threshold implies a trade-off, as a t + and t − close to zero would result in representations with a larger descriptive power, while a t + and t − close to 1 or -1 respectively would result in representations with a minimum amount of noise.
As a reliable threshold (the definition is analogous for both t + and t − ), we propose the one which maximizes the distance between a dataset and its corresponding version with randomized labels (as shown in each subfigure of Figure 9). We define such distance using the average number of features having a D KS > x for all x in the range [0, 1]. This is analogous to computing, for every point along the x axis of one of the subfigures of Figure 9, the average y axis value of all black/red lines. The average of the black lines gives us the behavior on the regular dataset, while the average of the red lines gives us the behavior on its randomized version. The difference between both values is the average distance (d avg ) between a dataset and its randomized version. Formally, the average distance for a value D KS = x is:
d_{avg}(x) = \frac{1}{|C|} \sum_{c \in C} |\{ f : D_{KS}(f, c) > x \}| - \frac{1}{|C_{rand}|} \sum_{c \in C_{rand}} |\{ f : D_{KS}(f, c) > x \}|
where C are the known classes (labels) associated with the data, C rand are the randomly associated classes (random-labels), and f ranges over the features of the embedding. The vertical bars | · | denote set cardinality.
Given the distance measure d avg (x), we define the thresholds t + and t − as the values of D KS = x that maximize d avg (x) in each respective sub-domain D + KS and D − KS . Table 3 shows the thresholds found for the 11 datasets.
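The threshold selection just described can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the study: it assumes D KS values are stored as one list per class in a dictionary, searches a uniform grid over [0, 1] (the grid resolution `steps` is a hypothetical parameter), and covers only the D + KS side.

```python
def avg_features_above(dks_by_class, x):
    """Average, over classes, of the number of features with D KS > x."""
    counts = [sum(1 for d in values if d > x) for values in dks_by_class.values()]
    return sum(counts) / len(counts)


def d_avg(real, rand, x):
    """Average distance between a dataset and its randomized version at x."""
    return avg_features_above(real, x) - avg_features_above(rand, x)


def best_threshold(real, rand, steps=100):
    """Grid value in [0, 1] that maximizes d_avg (first maximum on ties)."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda x: d_avg(real, rand, x))
```

One possible way to obtain t − with the same code, again as an assumption, is to negate the D − KS values of both inputs, run the search, and negate the result.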
There is a clear correlation between t + and t − and the number of samples per class |I c |. Indeed, a logarithmic curve can be fitted to both t + and t − with respect to |I c | obtaining coefficients of determination R 2 of 0.82 and 0.84 respectively. This indicates that the number of images per class is a very good indicator of the level of noise to be expected. This factor overshadows other relevant aspects, such as the level of similarity between tasks, which is only important when the tasks are exactly the same (i.e., imagenet).
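A logarithmic fit of this kind can be sketched with ordinary least squares. The functional form t = a · ln(|I c |) + b is an assumption (the exact parametrization of the fitted curve is not stated here), and the function name is hypothetical; the inputs would be the per-dataset values of |I c | and the thresholds of Table 3.

```python
import math


def fit_log_curve(sizes, thresholds):
    """Least-squares fit of t = a * ln(n) + b.

    Returns (a, b, r_squared), where r_squared is the coefficient of
    determination of the fit. The functional form is an assumption.
    """
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(thresholds) / len(thresholds)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, thresholds))
    a = sxy / sxx
    b = mean_y - a * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, thresholds))
    ss_tot = sum((y - mean_y) ** 2 for y in thresholds)
    return a, b, 1.0 - ss_res / ss_tot
```

The sign of `a` indicates whether the threshold grows or shrinks with the number of images per class, and `r_squared` plays the role of the R 2 values reported above.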
It is also interesting to see how many features in the embedding remain relevant after pruning the noisy ones through the application of the threshold. Of the 12,416 features in the embedding, approximately 3,300 features for t − and 4,000 for t + remain on average. The threshold corresponding to each dataset is plotted as a vertical dashed line in Figure 9, showing how all classes (all black lines) would be minimally represented after applying it.
To study the degree of noise layer-wise, in Figure 10 we plot the percentage of D KS (f, c) values that are kept by the t + and t − thresholds on various sets of layers. For datasets in group (a) (i.e., imagenet, stanforddogs, catsdogs, caltech101 and caltech256), the pruned feature-class pairs are evenly distributed among convolutional and fully-connected layers. This indicates that noise is found throughout the embedding for these datasets. For datasets in groups (b) and (c), significantly more feature-class pairs are pruned from fully-connected layers than from convolutional layers. This is caused by the higher specificity of high level features, which are more frequently irrelevant for characterizing classes which differ from the source task. These results could be useful for determining, given a target task, which features and from which layers should be used when building an embedding.
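The layer-wise percentages can be computed along the following lines. The sketch assumes that the D KS (f, c) values are grouped per layer (or per set of layers) in a dictionary, and that a pair is kept when its value is higher than t + or lower than t − ; the function name and data layout are hypothetical.

```python
def kept_percentage(dks_by_layer, t_plus, t_minus):
    """Percentage of feature-class pairs kept by the thresholds, per layer.

    A pair is kept when its D KS value is higher than t_plus or lower
    than t_minus; every other pair is considered noise and pruned.
    """
    result = {}
    for layer, values in dks_by_layer.items():
        kept = sum(1 for d in values if d > t_plus or d < t_minus)
        result[layer] = 100.0 * kept / len(values)
    return result
```

Running the same function on D KS values computed from randomized labels would give a noise baseline to compare against.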

Consistency Between Source Tasks
In this section we validate that our results are consistent beyond the source problem used for CNN training. For that purpose we use the same VGG16 network architecture trained for the Places2 dataset (Zhou, Khosla, Lapedriza, Torralba, & Oliva, 2016). Places2 is a task T S unrelated to ImageNet 2012 containing a large set of samples (1.8 million). However, the domain D S of the Places2 dataset is not as wide as that of ImageNet 2012, as it is focused on scene categories instead of objects.
Table 3: t + and t − thresholds as defined by the maximum average distance for each of the eleven datasets explored. D + KS and D − KS regions are computed separately. The third and fifth columns show the maximum d avg distance between the dataset and its randomized version.
Analogous to Figure 5, Figure 11 shows the distribution of D KS values per layer using the embedding created from the Places2 dataset. Overall, the distribution is quite similar to the one obtained with the ImageNet2012 embedding. Focusing on convolutional layers we confirm the observed correlation between the average number of instances per class |I c | and D KS values. The particular behaviour of wood and flowers102 is also present. Moreover, convolutional layers conv1_1 to conv3_3 look practically the same for all datasets. This similarity reinforces the hypothesis of the generalist nature of convolutional features, regardless of the source and target tasks.
In the case of fully-connected features, the behavior of mit67 is analogous to those of group (a) for the ImageNet2012 embedding, as mit67 is now the closest task to the source task (i.e., Places2 ). Target tasks with no intersection with this source task, such as stanforddogs and catsdogs, now display a completely opposite activity.
There is however a remarkable difference between both embeddings in the second-to-last fully-connected layer, as the D KS divergences contract significantly for all datasets (including mit67). Although we have no clear explanation for this phenomenon, we hypothesize that the different objects and characteristics needed to classify the holistic classes of Places2 (i.e., scenes) cause the features from the fc6 layer (where all this information is aggregated) to be extremely specific of the source task.
Figure 10: For each dataset and layer, percentage of feature/class pairs remaining after pruning, for the original labels (blue) and for the randomized labels (green) (i.e., features whose D KS for a certain class is higher than t + or lower than t − ).

Conclusions
CNN feature representation transfer has been studied in the past through the performance of a classifier (most commonly, an SVM). Most contributions measure how each layer performs on its own at discriminating the classes of a task which is not the one the CNN was originally trained for. Through these contributions we know that, when considered together, the features composing a fully connected layer define the most discriminant of embedding spaces. In contrast with these contributions, the purpose of this paper was to analyze the behavior of all features from all layers individually, to measure their relevance for knowledge representation. We do so by exploring the inner/outer class activations of each feature, for all classes of several datasets.
Figure 11: Inner/outer class D KS distribution per layer for 10 different datasets on the embedding created from the places2 source task. Details are as in Figure 5.
Some of the conclusions we draw from this study are coherent with the current state-of-the-art, and some are novel. Next we outline them all:
• Typically, features are characteristic for a given class by presence, but we have shown that features can also be used to describe classes by their absence, thus providing a different type of information modality. This is particularly relevant for fine-grained datasets, where there may be many common features being characteristic for certain classes by their absence (e.g., birds of dull colors that live in water). This novel contribution could be useful for knowledge representation and reasoning purposes (Section 5).
• Features from the last convolutional layer and fully connected layers are highly specific, being either characteristic of a class or irrelevant for it. Features from the rest of the convolutional layers convey more varied information, and can be characteristic of a class both by their presence and by their absence. This motivates the use of two distinct knowledge extraction approaches, depending on layer depth (Section 5.1).
• For certain tasks, convolutional features outperform fully-connected features at discriminating the corresponding labels. Overall, our results indicate that features from low-level layers are more general and discriminant than originally considered, and open the door to using them for knowledge representation purposes and related problems such as unsupervised learning (Section 5.2).
• Low and middle level features behave very similarly for the dataset they were trained for (imagenet) and for the rest of the target datasets. This indicates that CNN features from these layers could be used for knowledge representation on a wide variety of datasets without fine-tuning. This is something previously proposed in the literature (Section 6.1).
• As previously claimed in the literature, the relevance of fully-connected features is strongly related to the similarity of the task to the original problem the network was trained for. However, we find that the breadth of the target task (i.e., whether it contains all sorts of classes) is also a key factor. Only wide spectrum domains which are similar to the source task guarantee that there will be no indiscriminant fully-connected features (Section 6.2).
• The behavior of fully-connected features for target tasks which have no intersection with the source task is similar to the behavior of convolutional features. In this context, both sets of features could be treated analogously for knowledge representation purposes (Section 6.2).
• Discriminant features were found on all layers of the embedding for all classes of the eleven datasets evaluated. This means that, in a knowledge representation setting, no class would become indescribable, showcasing the richness of the representation language built by the CNN at every layer (Section 7).
• Through the behavior of randomized datasets we obtain an estimation of the inner/outer class distances that can be accounted for by noise. We find that a conservative threshold with little variance can be defined across datasets, and after applying that threshold more than half of the features of the embedding remain relevant (Section 7.1).
• Context is key, both in the feature extraction and knowledge representation processes. The significance of some feature activations (or the lack thereof) depends on the dataset being used as reference. The representation of data using neural network embeddings should consider context to be able to exploit all possible modalities of information.
Beyond these conclusions, this work provides a methodology for identifying relevant features (either by presence or by absence) throughout a deep CNN. By applying the approach presented here, one can define a full-network embedding (an embedding using all layers of a network) which outperforms traditional single-layer embeddings in classification tasks (Garcia-Gasulla, Vilalta, Parés, Moreno, Ayguadé, Labarta, Cortés, & Suzumura, 2017b), and which improves the performance of multimodal pipelines for image caption and image retrieval tasks (Vilalta, Garcia-Gasulla, Parés, Ayguadé, Labarta, Cortés, & Suzumura, 2017).