{"title": "Learning multiple visual domains with residual adapters", "book": "Advances in Neural Information Processing Systems", "page_first": 506, "page_last": 516, "abstract": "There is a growing interest in learning data representations that work well for many different types of problems and data. In this paper, we look in particular at the task of learning a single visual representation that can be successfully utilized in the analysis of very different types of images, from dog breeds to stop signs and digits. Inspired by recent work on learning networks that predict the parameters of another, we develop a tunable deep network architecture that, by means of adapter residual modules, can be steered on the fly to diverse visual domains. Our method achieves a high degree of parameter sharing while maintaining or even improving the accuracy of domain-specific representations. We also introduce the Visual Decathlon Challenge, a benchmark that evaluates the ability of representations to capture simultaneously ten very different visual domains and measures their ability to recognize well uniformly.", "full_text": "Learning multiple visual domains with residual\n\nadapters\n\nSylvestre-Alvise Rebuf\ufb011\n\nHakan Bilen1,2\n\nAndrea Vedaldi1\n\n1 Visual Geometry Group\n\nUniversity of Oxford\n\n{srebuffi,hbilen,vedaldi}@robots.ox.ac.uk\n\n2 School of Informatics\nUniversity of Edinburgh\n\nAbstract\n\nThere is a growing interest in learning data representations that work well for many\ndifferent types of problems and data. In this paper, we look in particular at the\ntask of learning a single visual representation that can be successfully utilized in\nthe analysis of very different types of images, from dog breeds to stop signs and\ndigits. 
Inspired by recent work on learning networks that predict the parameters of another, we develop a tunable deep network architecture that, by means of adapter residual modules, can be steered on the fly to diverse visual domains. Our method achieves a high degree of parameter sharing while maintaining or even improving the accuracy of domain-specific representations. We also introduce the Visual Decathlon Challenge, a benchmark that evaluates the ability of representations to capture simultaneously ten very different visual domains and measures their ability to perform well uniformly.\n\n1 Introduction\n\nWhile research in machine learning is often directed at improving the performance of algorithms on specific tasks, there is a growing interest in developing methods that can tackle a large variety of different problems within a single model. In the case of perception, there are two distinct aspects of this challenge. The first one is to extract from a given image diverse information, such as image-level labels, semantic segments, object bounding boxes, object contours, occluding boundaries, vanishing points, etc. The second aspect is to model simultaneously many different visual domains, such as Internet images, characters, glyphs, animal breeds, sketches, galaxies, planktons, etc. (fig. 1).\nIn this work we explore the second challenge and look at how deep learning techniques can be used to learn universal representations [5], i.e. feature extractors that can work well in several different image domains. We refer to this problem as multiple-domain learning to distinguish it from the more generic multiple-task learning.\nMultiple-domain learning contains in turn two sub-challenges. The first one is to develop algorithms that can learn well from many domains. If domains are learned sequentially (though this is not a requirement), this is reminiscent of domain adaptation. 
However, there are two important differences. First, in standard domain adaptation (e.g. [9]) the content of the images (e.g. "telephone") remains the same, and it is only the style of the images that changes (e.g. real life vs gallery image). Instead, in our case a domain shift changes both style and content. Secondly, the difficulty is not just to adapt the model from one domain to another, but to do so while making sure that it still performs well on the original domain, i.e. to learn without forgetting [21].\nThe second challenge of multiple-domain learning, and our main concern in this paper, is to construct models that can represent compactly all the domains. Intuitively, even though images in different domains may look quite different (e.g. glyphs vs. cats), low and mid-level visual primitives may still be largely shareable. Sharing knowledge between domains should allow learning compact multivalent representations. Provided that sufficient synergies between domains exist, multivalent representations may even work better than models trained individually on each domain (for a given amount of training data).\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Visual Decathlon. We explore deep architectures that can learn simultaneously different tasks from very different visual domains. We experiment with ten representative ones: (a) Aircraft, (b) CIFAR-100, (c) Daimler Pedestrians, (d) Describable Textures, (e) German Traffic Signs, (f) ILSVRC (ImageNet) 2012, (g) VGG-Flowers, (h) OmniGlot, (i) SVHN, (j) UCF101 Dynamic Images.\n\nThe primary contribution of this paper (section 3) is to introduce a design for multivalent neural network architectures for multiple-domain learning (section 3, fig. 2). 
The key idea is to reconfigure a deep neural network on the fly to work on different domains as needed. Our construction is based on recent learning-to-learn methods that showed how the parameters of a deep network can be predicted from another [2, 16]. We show that these formulations are equivalent to packing the adaptation parameters in convolutional layers added to the network (section 3). The layers in the resulting parametric network are either domain-agnostic, hence shared between domains, or domain-specific, hence parametric. The domain-specific layers are changed based on the ground-truth domain of the input image, or based on an estimate of the latter obtained from an auxiliary network. In the latter configuration, our architecture is analogous to the learnet of [2].\nBased on these general observations, we introduce in particular a residual adapter module and use it to parameterize the standard residual network architecture of [13]. The adapters contain a small fraction of the model parameters (less than 10%), enabling a high degree of parameter sharing between domains. A similar architecture was concurrently proposed in [31], which also allows new domains to be learned sequentially without forgetting. However, we also show a specific advantage of the residual adapter modules: the ability to modulate adaptation based on the size of the target dataset.\nOur proposed architectures are thoroughly evaluated empirically (section 5). To this end, our second contribution is to introduce the Visual Decathlon Challenge (fig. 1 and section 4), a new benchmark for multiple-domain learning in image recognition. The challenge consists in performing well simultaneously on ten very different visual classification problems, from ImageNet and SVHN to action classification and describable texture recognition. 
The evaluation metric, also inspired by the decathlon discipline, rewards models that perform better than strong baselines on all the domains simultaneously. A summary of our findings is contained in section 6.\n\n2 Related Work\n\nOur work touches on multi-task learning, learning without forgetting, domain adaptation, and other areas. However, our multiple-domain setup differs in ways that make most of the existing approaches not directly applicable to our problem.\nMulti-task learning (MTL) looks at developing models that can address different tasks, such as detecting objects and segmenting images, while sharing information and computation among them. Earlier examples of this paradigm have focused on kernel methods [10, 1] and deep neural network (DNN) models [6]. In DNNs, a standard approach [6] is to share the earlier layers of the network, training the tasks jointly by means of back-propagation. Caruana [6] shows that sharing network parameters between tasks is beneficial also as a form of regularization, putting additional constraints on the learned representation and thus improving it.\nMTL in DNNs has been applied to various problems ranging from natural language processing [8, 22] and speech recognition [14] to computer vision [41, 42, 4]. Collobert et al. [8] show that semi-supervised learning and multi-task learning can be combined in a DNN model to solve several language processing prediction tasks such as part-of-speech tags, chunks, named entity tags and semantic roles. Huang et al. [14] propose a multilingual DNN which shares hidden layers across many languages. Liu et al. [22] combine multiple-domain classification and information retrieval for ranking web search with a DNN. 
Multi-task DNN models are also reported to achieve performance gains in computer vision problems such as object tracking [41], facial-landmark detection [42], object and part detection [4], and a collection of low-level and high-level vision tasks [18]. The main focus of these works is learning a diverse set of tasks in the same visual domain. In contrast, our paper focuses on learning a representation from a diverse set of domains.\nOur investigation is related to the recent paper of [5], which studied the "size" of the union of different visual domains measured in terms of the capacity of the model required to learn it. The authors propose to absorb different domains in a single neural network by tuning certain parameters in batch and instance normalization layers throughout the architecture; we show that our residual adapter modules, which include the latter as a special case, lead to far superior results.\nLife-long learning. A particularly important aspect of MTL is the ability of learning multiple tasks sequentially, as in Never Ending Learning [25] and Life-long Learning [38]. Sequential learning in fact typically suffers from forgetting the older tasks, a phenomenon aptly referred to as "catastrophic forgetting" in [11]. Recent work in life-long learning tries to address forgetting in two ways. The first one [37, 33] is to freeze the network parameters for the old tasks and learn a new task by adding extra parameters. The second one aims at preserving knowledge of the old tasks by retaining the response of the original network on the new task [21, 30], or by keeping the network parameters of the new task close to the original ones [17]. Our method can be considered as a hybrid of these two approaches, as it can be used to retain the knowledge of previous tasks exactly, while adding a small number of extra parameters for the new tasks.\nTransfer learning. 
Sometimes one is interested in maximizing the performance of a model on a target domain. In this case, sequential learning can be used as a form of initialization [29]. This is very common in visual recognition, where most DNNs are initialized on the ImageNet dataset and then fine-tuned on a target domain and task. Note, however, that this typically results in forgetting the original domain, a fact that we confirm in the experiments.\nDomain adaptation. When domains are learned sequentially, our work can be related to domain adaptation. There is a vast literature on domain adaptation, including recent contributions in deep learning such as [12, 39] based on the idea of minimizing domain discrepancy. Long et al. [23] propose a deep network architecture for domain adaptation that can jointly learn adaptive classifiers and transferable features from labeled data in the source domain and unlabeled data in the target domain. There are two important differences from our work. First, in these cases different domains contain the same objects and it is only the visual style that changes (e.g. webcam vs. DSLR), whereas in our case the objects themselves change. Secondly, domain adaptation is a form of transfer learning and, as the latter, is concerned with maximizing the performance on the target domain regardless of potential forgetting.\n\n3 Method\n\nOur primary goal is to develop neural network architectures that can work well in a multiple-domain setting. Modern neural networks such as residual networks (ResNet [13]) are known to have very high capacity, and are therefore good candidates to learn from diverse data sources. Furthermore, even when domains look fairly different, they may still share a significant amount of low and mid-level visual patterns. 
Nevertheless, we show in the experiments (section 5) that learning a ResNet (or a similar model) directly from multiple domains may still not perform well.\nIn order to address this problem, we consider a compact parametric family of neural networks φα : X → V indexed by parameters α. Concretely, X ⊂ R^(H×W×3) can be a space of RGB images and V = R^(Hv×Wv×Cv) a space of feature tensors. φα can then be obtained by taking all but the last classification layer of a standard ResNet model. The parametric feature extractor φα is then used to construct predictors for each domain d as Φd = ψd ◦ φαd, where αd are domain-specific parameters and ψd(v) = softmax(Wd v) is a domain-specific linear classifier V → Yd mapping features to image labels.\nIf α comprises all the parameters of the feature extractor φα, this approach reduces to learning independent models for each domain. On the contrary, our goal is to maximize parameter sharing, which we do below by introducing certain network parametrizations.\n\nFigure 2: Residual adapter modules. The figure shows a standard residual module with the inclusion of adapter modules (in blue). The filter coefficients (w1, w2) are domain-agnostic and contain the vast majority of the model parameters; (α1, α2) contain instead a small number of domain-specific parameters.\n\n3.1 Learning to learn and filter prediction\n\nThe problem of adapting a neural network dynamically to variations of the input data is similar to the one found in recent approaches to learning to learn. 
A few authors [34, 16, 2], in particular, have proposed to learn neural networks that predict, in a data-dependent manner, the parameters of another. Formally, we can write αd = A e_dx, where e_dx is the indicator vector of the domain dx of image x and A is a matrix whose columns are the parameter vectors αd. As shown later, it is often easy to construct an auxiliary network that can predict d from x, so that the parameter α = ψ(x) can also be expressed as the output of a neural network. If d is known, then ψ(x, d) = αd as before, and if not, ψ can be constructed as suggested above or from scratch as done in [2].\nThe result of this construction is a network φ_ψ(x)(x) whose parameters are predicted by a second network ψ(x). As noted in [2], while this construction is conceptually simple, its implementation is more subtle. Recall that the parameters w of a deep convolutional neural network consist primarily of the coefficients of the linear filters in the convolutional layers. If w = α, then α = ψ(x) would need to predict millions of parameters (or to learn independent models when d is observed). The solution of [2] is to use a low-rank decomposition of the filters, where w = π(w0, α) is a function of a filter basis w0 and α is a small set of tunable parameters.\nHere we build on the same idea, with some important extensions. First, we note that linearly parametrizing a filter bank is the same as introducing a new, intermediate convolutional layer in the network. Specifically, let Fk ∈ R^(Hf×Wf×Cf) be a basis of K filters of size Hf × Wf operating on Cf input feature channels. Given parameters [αtk] ∈ R^(T×K), we can express a bank of T filters as the linear combinations Gt = Σ_k αtk Fk, summing over k = 1, . . . , K. Applying the bank to a tensor x and using associativity and linearity of convolution results in G ∗ x = Σ_k α:k (Fk ∗ x) = α ∗ F ∗ x, where we interpret α as a 1 × 1 × T × K filter bank. While [2] used a slightly different low-rank filter decomposition, their parametrization can also be seen as introducing additional filtering layers in the network.\nAn advantage of this parametrization is that it results in a useful decomposition, where part of the convolutional layers contain the domain-agnostic parameters F and the others contain the domain-specific ones αd. As discussed in section 5, this is particularly useful to address the forgetting problem. In the next section we refine these ideas to obtain an effective parametrization of residual networks.\n\n3.2 Residual adapter modules\n\nAs an example of a parametric network, we propose to modify a standard residual network. Recall that a ResNet is a chain gm ◦ · · · ◦ g1 of residual modules gt. In the simplest variant of the model, each residual module g takes as input a tensor in R^(H×W×C) and produces as output a tensor of the same size using g(x; w) = x + ((w2 ∗ ·) ◦ [·]+ ◦ (w1 ∗ ·))(x). Here w1 and w2 are the coefficients of banks of small linear filters, [z]+ = max{0, z} is the ReLU operator, w ∗ z is the convolution of z by the filter bank w, and ◦ denotes function composition. Note that, for the addition to make sense, filters must be configured such that the dimensions of the output of the last bank are the same as those of x.\nOur goal is to parametrize the ResNet module. As suggested in the previous section, rather than changing the filter coefficients directly, we introduce additional parametric convolutional layers. In
In\nfact, we go one step beyond and make them small residual modules in their own right and call them\n\n4\n\n\fresidual adapter modules (blue blocks in \ufb01g. 2). These modules have the form:\n\ng(x; \u03b1) = x + \u03b1 \u2217 x.\n\nIn order to limit the number of domain-speci\ufb01c parameters, \u03b1 is selected to be a bank of 1 \u00d7 1 \ufb01lters.\nA major advantage of adopting a residual architecture for the adapter modules is that the adapters\nreduce to the identity function when their coef\ufb01cients are zero. When learning the adapters on small\ndomains, this provides a simple way of controlling over-\ufb01tting, resulting in substantially improved\nperformance in some cases.\nBatch normalization and scaling. Batch Normalization (BN) [15] is an important part of very deep\nneural networks. This module is usually inserted after convolutional layers in order to normalize\ntheir outputs and facilitate learning (\ufb01g. 2). The normalization operation is followed by rescaling and\nshift operations s (cid:12) x + b, where (s, b) are learnable parameters. In our architecture, we incorporate\nthe BN layers into the adapter modules (\ufb01g. 2). Furthermore, we add a BN module right before the\nadapter convolution layer.1 Note that the BN scale and bias parameters are also dataset-dependent \u2013\nas noted in the experiments, this alone provides a certain degree of model adaptation.\nDomain-agnostic vs domain-speci\ufb01c parameters. If the residual module of \ufb01g. 
2 is configured to process an input tensor with C feature channels, and if the domain-agnostic filters w1, w2 are of size h × h × C, then the model has 2(h²C² + hC) domain-agnostic parameters (including biases in the convolutional layers) and 2(C² + 5C) domain-specific parameters.2 Hence, there are approximately h² times more domain-agnostic parameters than domain-specific ones (usually h² = 9).\n\n3.3 Sequential learning and avoiding forgetting\n\nWhile in this paper we are not concerned with sequential learning, we have found it to be a good strategy to bootstrap a model when a large number of domains have to be learned. However, the most popular approach to sequential learning, fine-tuning (section 2), is often a poor choice for learning shared representations as it tends to quickly forget the original tasks.\nThe challenge in learning without forgetting is to maintain information about older tasks as new ones are learned (section 2). With respect to forgetting, our adapter modules are similar to the tower model [33] as they preserve the original model exactly: one can pre-train the domain-agnostic parameters w on a large domain such as ImageNet, and then fine-tune only the domain-specific parameters αd for each new domain. Like the tower method, this preserves the original task exactly, but it is far less expensive as it does not require introducing new feature channels for each new domain (a quadratic cost). Furthermore, the residual modules naturally reduce to the identity function when sufficient shrinking regularization is applied to the adapter weights αw. This allows the adapter to be tuned depending on the availability of data for a target domain, sometimes significantly reducing overfitting.\n\n4 Visual decathlon\n\nIn this section we introduce a new benchmark, called Visual Decathlon, to evaluate the performance of algorithms in multiple-domain learning. 
The goal of the benchmark is to assess whether a method can successfully learn to perform well in several different domains at the same time. We do so by choosing ten representative visual domains, from Internet images to characters, as well as by selecting an evaluation metric that rewards performing well on all tasks.\n\n1 While the bias and scale parameters of the latter can be incorporated in the following filter bank, we found it easier to keep them separate.\n2 Including all bias and scaling vectors; 2(C² + 3C) if these are absorbed in the filter banks when possible.\n\nDatasets. The decathlon challenge combines ten well-known datasets from multiple visual domains: FGVC-Aircraft Benchmark [24] contains 10,000 images of aircraft, with 100 images for each of 100 different aircraft model variants such as Boeing 737-400 and Airbus A310. CIFAR100 [19] contains 60,000 32 × 32 colour images for 100 object categories. Daimler Mono Pedestrian Classification Benchmark (DPed) [26] consists of 50,000 grayscale pedestrian and non-pedestrian images, cropped and resized to 18 × 36 pixels. Describable Texture Dataset (DTD) [7] is a texture database, consisting of 5640 images, organized according to a list of 47 terms (categories) such as bubbly, cracked, marbled. The German Traffic Sign Recognition (GTSR) Benchmark [36] contains cropped images for 43 common traffic sign categories in different image resolutions. Flowers102 [28] is a fine-grained classification task which contains 102 flower categories from the UK, each consisting of between 40 and 258 images. ILSVRC12 (ImNet) [32], the largest dataset in our benchmark, contains 1000 categories and 1.2 million images. Omniglot [20] consists of 1623 different handwritten characters from 50 different alphabets. 
Although the dataset is designed for one-shot learning, we use it for a standard multi-class classification task and include all the character categories in the train and test splits. The Street View House Numbers (SVHN) [27] is a real-world digit recognition dataset with around 70,000 32 × 32 images. UCF101 [35] is an action recognition dataset of realistic human action videos, collected from YouTube. It contains 13,320 videos for 101 action categories. In order to make this dataset compatible with our benchmark, we convert the videos into images by using the Dynamic Image encoding of [3], which summarizes each video into an image based on a ranking principle.\nChallenge and evaluation. Each dataset Dd, d = 1, . . . , 10 is formed of pairs (x, y) ∈ Dd where x is an image and y ∈ {1, . . . , Cd} = Yd is a label. For each dataset, we specify training, validation and test subsets. The goal is to train the best possible model to address all ten classification tasks using only the provided training and validation data (no external data is allowed). A model Φ is evaluated on the test data, where, given an image x and its ground-truth domain label dx, it has to predict the corresponding label y = Φ(x, dx) ∈ Yd.\nPerformance is measured in terms of a single scalar score S determined as in the decathlon discipline. Performing well at this metric requires algorithms to perform well in all tasks, compared to a minimum level of baseline performance for each. In detail, S is computed as follows:\n\nS = Σ_{d=1..10} αd max{0, Emax_d − Ed}^γd ,   Ed = (1/|Dtest_d|) Σ_{(x,y)∈Dtest_d} 1{y ≠ Φ(x, d)},   (1)\n\nwhere Ed is the average test error for each domain and Emax_d is the baseline error (section 5), above which no points are scored. 
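As an illustration, the scoring rule of eq. (1) can be sketched in a few lines. Per the paper, γd = 2 and αd = 1000 (Emax_d)^−γd, so each domain contributes at most 1,000 points; the baseline errors below are made up for the example:

```python
# Sketch of the decathlon score S (eq. 1): per-domain test errors E_d are
# compared against baseline errors Emax_d. With gamma = 2 and
# alpha_d = 1000 * Emax_d ** (-2), a perfect domain scores exactly 1000 points.
def decathlon_score(errors, baseline_errors, gamma=2.0):
    S = 0.0
    for E, Emax in zip(errors, baseline_errors):
        alpha = 1000.0 * Emax ** (-gamma)          # normalization per domain
        S += alpha * max(0.0, Emax - E) ** gamma   # no points above baseline
    return S

baselines = [0.5] * 10                               # illustrative baseline errors
print(decathlon_score([0.0] * 10, baselines))        # perfect model: 10000.0
print(decathlon_score([0.6] * 10, baselines))        # worse than baseline: 0.0
```

The max and the exponent make the metric insensitive to models that merely match the baseline, while rewarding progress toward zero error super-linearly.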
The exponent γd ≥ 1 rewards reductions of the classification error more as the error approaches zero, and is set to γd = 2 for all domains. The coefficient αd is set to 1000 (Emax_d)^−γd so that a perfect result receives a score of 1,000 (10,000 in total).\nData preprocessing. Different domains contain a different set of image classes as well as a different number of images. In order to reduce the computational burden, all images have been resized isotropically to have a shorter side of 72 pixels. For some datasets such as ImageNet, this is a substantial reduction in resolution which makes training models much faster (but is still sufficient to obtain excellent classification results with baseline models). For the datasets which come with training, validation, and test subsets, we keep the original splits. For the rest, we use 60%, 20% and 20% of the data for training, validation, and test respectively. For ILSVRC12, since the test labels are not available, we use the original validation subset as the test subset and randomly sample a new validation set from the training split. We are planning to make the data and an evaluation server public soon.\n\n5 Experiments\n\nIn this section we evaluate our method quantitatively against several baselines (section 5.1) and investigate the ability of the proposed techniques to learn models for ten very diverse visual domains.\nImplementation details. In all experiments we choose to use the powerful ResNets [13] as base architectures due to their remarkable performance. In particular, as a compromise between accuracy and speed, we chose the ResNet28 model [40], which consists of three blocks of four residual units. Each residual unit contains 3 × 3 convolutional, BN and ReLU modules (fig. 2). 
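For this architecture, the split between domain-agnostic and domain-specific parameters from section 3.2 can be checked numerically. The sketch below applies the per-unit counts 2(h²C² + hC) and 2(C² + 5C) with h = 3 to the ResNet28's block widths of 64, 128 and 256, and simplifies by treating every residual unit as mapping C channels to C channels:

```python
# Sketch: parameter accounting for the adapted residual units (section 3.2).
# Domain-agnostic: two h x h x C filter banks with biases -> 2 * (h^2*C^2 + h*C).
# Domain-specific: two 1 x 1 adapters plus BN scales/biases -> 2 * (C^2 + 5*C).
# Treating every unit as C -> C is a simplification (ignores the widening units).
def unit_params(C, h=3):
    agnostic = 2 * (h * h * C * C + h * C)
    specific = 2 * (C * C + 5 * C)
    return agnostic, specific

total_a = total_s = 0
for C in (64, 128, 256):      # three blocks...
    for _ in range(4):        # ...of four residual units each
        a, s = unit_params(C)
        total_a += a
        total_s += s

print(total_a, total_s, round(total_a / total_s, 2))
```

The ratio comes out a little below h² = 9, matching the "approximately h² times more domain-agnostic parameters" estimate of section 3.2.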
The network accepts 64 × 64 images as input, downscales the spatial dimensions by two at each block, and ends with a global average pooling and a classifier layer followed by a softmax. We set the number of filters to 64, 128, 256 for these blocks respectively. Each network is optimized to minimize its cross-entropy loss with stochastic gradient descent. The network is run for 80 epochs and the initial learning rate of 0.1 is lowered to 0.01 and then 0.001 gradually.\n\nModel | #par. | ImNet Airc. C100 DPed DTD GTSR Flwr OGlt SVHN UCF | mean | S\n# images | | 1.3m 7k 50k 30k 4k 40k 2k 26k 70k 9k | |\nScratch | 10× | 59.87 57.10 75.73 91.20 37.77 96.55 56.30 88.74 96.63 43.27 | 70.32 | 1625\nScratch+ | 11× | 59.67 59.59 76.08 92.45 39.63 96.90 56.66 88.74 96.78 44.17 | 71.07 | 1826\nFeature extractor | 1× | 59.67 23.31 63.11 80.33 45.37 68.16 73.69 58.79 43.54 26.80 | 54.28 | 544\nFinetune | 10× | 59.87 60.34 82.12 92.82 55.53 97.53 81.41 87.69 96.55 51.20 | 76.51 | 2500\nLwF [21] | 10× | 59.87 61.15 82.23 92.34 58.83 97.57 83.05 88.08 96.10 50.04 | 76.93 | 2515\nBN adapt. [5] | ∼1× | 59.87 43.05 78.62 92.07 51.60 95.82 74.14 84.83 94.10 43.51 | 71.76 | 1363\nRes. adapt. | 2× | 59.67 56.68 81.20 93.88 50.85 97.05 66.24 89.62 96.13 47.45 | 73.88 | 2118\nRes. adapt. decay | 2× | 59.67 61.87 81.20 93.88 57.13 97.57 81.67 89.62 96.13 50.12 | 76.89 | 2621\nRes. adapt. finetune all | 2× | 59.23 63.73 81.31 93.30 57.02 97.47 83.43 89.82 96.17 50.28 | 77.17 | 2643\nRes. adapt. dom-pred | 2.5× | 59.18 63.52 81.12 93.29 54.93 97.20 82.29 89.82 95.99 50.10 | 76.74 | 2503\nRes. adapt. (large) | ∼12× | 67.00 67.69 84.69 94.28 59.41 97.43 84.86 89.92 96.59 52.39 | 79.43 | 3131\n\nTable 1: Multiple-domain networks. The table reports the (top-1) classification accuracy (%) of different models on the decathlon tasks and the final decathlon score (S). ImageNet is used to prime the network in every case, except for the networks trained from scratch. The model size is the number of parameters w.r.t. the baseline ResNet. The fully-finetuned model, written in blue, is used as a baseline to compute the decathlon score.\n\nModel | Airc. | C100 | DPed | DTD | GTSR | Flwr | OGlt | SVHN | UCF\nFinetune | 1.1 60.3 | 3.6 63.1 | 0.6 80.3 | 0.7 45.3 | 1.4 68.1 | 27.2 73.6 | 13.4 87.7 | 0.2 96.6 | 5.4 51.2\nLwF [21] high lr | 4.1 61.1 | 21.0 82.2 | 23.8 92.3 | 36.7 58.8 | 11.5 97.6 | 34.2 83.1 | 3.0 88.1 | 0.2 96.1 | 18.6 50.0\nLwF [21] low lr | 38.0 50.6 | 33.0 80.7 | 53.3 92.2 | 47.0 57.2 | 23.7 96.6 | 45.7 75.7 | 21.0 86.0 | 13.3 94.8 | 29.0 44.6\nRes. adapt. finetune all | 59.2 63.7 | 59.2 81.3 | 59.2 93.3 | 59.2 57.0 | 59.2 97.5 | 59.2 83.4 | 59.2 89.8 | 59.2 96.1 | 59.2 50.3\n\nTable 2: Pairwise forgetting. Each pair of numbers reports the top-1 accuracy (%) on the old task (ImageNet) and a new target task after the network is fully finetuned on the latter. We also show the performance of LwF when it is finetuned on the new task with a high and a low learning rate, trading off forgetting ImageNet against improving the results on the target domain. By comparison, we show the performance of tuning only the residual adapters, which by construction does not result in any performance loss on ImageNet while still achieving very good performance on each target task.\n\n5.1 Results\n\nThere are two possible extremes. The first one is to learn ten independent models, one for each dataset, and the second one is to learn a single model where all feature extractor parameters are shared between the ten domains. We evaluate next different approaches to learning such models.\nPairwise learning. In the first experiment (table 1), we start by learning a ResNet model on ImageNet, and then use different techniques to extend it to the remaining nine tasks, one at a time. 
Depending on the method, this may produce an overall model comprising ten ResNet architectures, or just one ResNet with a few domain-specific parameters; thus we also report the total number of parameters used, where 1× is the size of a single ResNet (excluding the last classification layer, which can never be shared).\nAs baselines, we evaluate four cases: i) learning an individual ResNet model from scratch for each task, ii) freezing all the parameters of the pre-trained network, using it as a feature extractor and only learning a linear classifier, iii) standard finetuning, and iv) applying a reimplementation of the LwF technique of [21], which encourages the fine-tuned network to retain the responses of the original ImageNet model while learning the new task.\nIn terms of accuracy, learning from scratch performs poorly on small target datasets and, by learning 10 independent models, requires 10× parameters in total. Freezing the ImageNet feature extractor is very efficient in terms of parameter sharing (1× parameters in total) and preserves the original domain exactly, but generally performs very poorly on the target domain. Full fine-tuning leads to accurate results both for large and small datasets; however, it also forgets the ImageNet domain substantially (table 2), so it still requires learning 10 complete ResNet models for good overall performance.\nWhen LwF is run as intended by the original authors [21], it still leads to a noticeable performance drop on the original task, even when learning just two domains (table 2), particularly if the target domain is very different from ImageNet (e.g. Omniglot and SVHN). 
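The trade-off that LwF exposes can be seen in a toy scalar version of its objective, where a quadratic proxy for the new task is balanced against staying close to the old model's response. All numbers here are illustrative; this is not the reimplementation used in the experiments:

```python
# Toy sketch of an LwF-style trade-off: minimize
#   (w - new_target)^2 + lam * (w - old_w)^2
# where the second term stands in for retaining the old network's responses.
# lam controls how much forgetting is allowed in exchange for the new task.
old_w, new_target, lam, lr = 1.0, 5.0, 1.0, 0.05
w = old_w
for _ in range(500):
    grad = 2 * (w - new_target) + 2 * lam * (w - old_w)
    w -= lr * grad

# With lam = 1 the fixed point is midway between the old response and the
# new target: neither task is fit exactly, which mirrors the accuracy drops
# on both ImageNet and the target domain reported in table 2.
print(round(w, 3))   # 3.0
```

Raising lam moves the solution toward the old model (less forgetting, worse transfer); lowering it recovers vanilla fine-tuning, which is the high- vs. low-learning-rate trade-off discussed next.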
Still, if one chooses a different trade-off point and allows the method to forget ImageNet more, it can function as a good regularizer that slightly outperforms vanilla fine-tuning overall (but still resulting in a 10× model).
Next, we evaluate the effect of sharing the majority of parameters between tasks, while still allowing a small number of domain-specific parameters to change. First, we consider specializing only the BN layer scaling and bias parameters, which is equivalent to the approach of [5]. In this case, less than 0.1% of the model parameters are domain-specific (for the ten domains, this results in a model with 1.01× parameters overall). Hence the model is very similar to the one with the frozen feature extractor; nevertheless, performance increases very substantially in most cases (e.g. 23.31% → 43.05% accuracy on Aircraft).
As the next step, we introduce the residual adapter modules, which increase the number of parameters per domain by 11%, resulting in a 2× model. In the pre-training phase, we first pretrain the network with the added modules on ImageNet. Then, we freeze the task-agnostic parameters and train the task-specific parameters on the different datasets. Unlike vanilla fine-tuning, there is no forgetting in this setting. While most of the parameters are shared, our method is either close to or better than full fine-tuning. As a further control, we also train 10 models from scratch with the added parameters (denoted Scratch+), but do not observe any noticeable performance gain on average, demonstrating that parameter sharing is highly beneficial. We also compare learning the adapter modules with two values of weight decay (0.002 and 0.005), both higher than the default 0.0005. These values are obtained by a coarse grid search using cross-validation for each dataset.
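The adapter parameter accounting above can be made concrete. A residual adapter is essentially a domain-specific 1×1 convolution with an identity shortcut, combined with domain-specific BN scale and bias. The NumPy sketch below is illustrative rather than the authors' implementation (the per-feature-map normalization stands in for true batch normalization, and the names `residual_adapter`, `gamma`, `beta` are ours); it also checks that a C×C 1×1 bank plus 2C BN parameters adds about (C² + 2C) / 9C² ≈ 11% relative to a shared 3×3 filter bank:

```python
import numpy as np

def residual_adapter(x, A, gamma, beta, eps=1e-5):
    """Domain-specific residual adapter applied to features x of shape (C, H, W):
    a 1x1 convolution (a C x C matrix A) in a residual branch, followed by a
    simplified per-channel normalization with domain-specific scale and bias."""
    C, H, W = x.shape
    z = A @ x.reshape(C, -1)                     # 1x1 conv == mixing across channels
    z = (z - z.mean(axis=1, keepdims=True)) / np.sqrt(z.var(axis=1, keepdims=True) + eps)
    z = gamma[:, None] * z + beta[:, None]       # domain-specific BN-style scale/bias
    return x + z.reshape(C, H, W)                # identity shortcut

# Per-domain overhead relative to one shared 3x3 filter bank with C channels.
C = 64
adapter_params = C * C + 2 * C   # 1x1 bank plus BN scale and bias
shared_params = 9 * C * C        # 3x3 bank
overhead = adapter_params / shared_params  # roughly 0.11, i.e. ~11% per domain
```

With a zero-initialized 1×1 bank, unit scale, and zero bias, the adapter reduces to the identity, so a newly added domain starts from the shared representation.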
Using the higher decay significantly improves the performance on smaller datasets such as Flowers, whereas the smaller decay is best for larger datasets. This shows both the importance and the utility of controlling overfitting in the adaptation process. In practice, there is an almost direct correspondence between the size of the dataset and which of these two values to use: the optimal decay can be selected via validation, but a rough choice can be made by simply looking at the dataset size.
We also compare to another baseline where we finetune only the last two convolutional layers and freeze the others, which may be thought to be generic. This amounts to a network with twice the total number of parameters of a vanilla ResNet, which equals that of our proposed architecture. This model obtains 64.7% mean accuracy over the ten datasets, significantly lower than our 73.9%, likely due to overfitting (controlling overfitting is one of the advantages of our technique).
Furthermore, we also assess the quality of our adapter without residual connections, which corresponds to the low-rank filter parametrization of section 3.1; this approach achieves an accuracy of 70.3%, which is worse than our 73.9%. We also observe that this configuration requires notably more iterations to converge. Hence, the residual architecture for the adapters results in better performance, better control of overfitting, and faster convergence.
End-to-end learning. So far, we have shown that our method, by learning only the adapter modules for each new domain, does not suffer from forgetting. However, for us sequential learning is just a scalable learning strategy. Here, we also show (table 1) that we can further improve the results by fine-tuning all the parameters of the network end-to-end on the ten tasks.
We do so by sampling a batch from each dataset in a round-robin fashion, allowing each domain to contribute to the shared parameters. A final pass is then done on the adapter modules to take into account the change in the shared parameters.
Domain prediction. Up to now we have assumed that the domain of each image is given at test time for all methods. If this is unavailable, it can be predicted on the fly by means of a small neural-network predictor. We train a light ResNet, composed of three stacks of two residual blocks and half as deep as the original network, obtaining 99.8% accuracy in domain prediction and resulting in a barely noticeable drop in the overall multiple-domain challenge (see Res. adapt. dom-pred in table 1). Note that a similar performance drop would be observed for the other baselines.
Decathlon evaluation: overall performance. While so far we have looked at results on individual domains, the Decathlon score eq. (1) can be used to compare overall performance. As baseline error rates in eq. (1), we double the error rates of the fully finetuned networks on each task. In this manner, this 10× model achieves a score of 2,500 points (out of 10,000 possible ones, see eq. (1)). The last column of table 1 reports the scores achieved by the other architectures. As intended, the decathlon score favors methods that perform well overall, emphasizing their consistency rather than just their average accuracy. For instance, although the Res. adapt. model (trained with a single decay coefficient for all domains) performs well in terms of average accuracy (73.88%), its decathlon score (2118) is relatively low because the model performs poorly on DTD and Flowers.
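The scoring rule of eq. (1) can be sketched as follows. We assume here the convention described above: the maximum admissible error per domain is twice the fully finetuned baseline's error, the exponent γ_d is 2, and α_d normalizes a zero-error model to 1,000 points per domain; under these choices the doubled-error baseline scores exactly 2,500 of 10,000 points, as stated.

```python
def decathlon_score(errors, baseline_errors, gamma=2.0, per_domain=1000.0):
    """Sketch of the Visual Decathlon score: the sum over domains of
    alpha_d * max(0, E_d_max - E_d)^gamma, where E_d_max is twice the
    baseline error rate and alpha_d normalizes a zero-error domain
    to `per_domain` points."""
    score = 0.0
    for e, b in zip(errors, baseline_errors):
        e_max = 2.0 * b                            # maximum admissible error
        alpha = per_domain / (e_max ** gamma)      # zero error earns per_domain points
        score += alpha * max(0.0, e_max - e) ** gamma
    return score
```

A model matching the baseline error on every domain scores 10 × 250 = 2,500 points, a hypothetical zero-error model scores 10,000, and any domain at or above twice the baseline error contributes nothing, which is why the score rewards uniformly good performance.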
This also shows that, once the weight decays are configured properly, our model achieves performance (2643 points) superior to all the baselines using only 2× the capacity of a single ResNet.
Finally, we show that using a higher-capacity ResNet28 (12×, ResNet adapt. (large) in table 1), which is comparable to 10 independent networks, significantly improves our results and outperforms the finetuning baseline by 600 points in decathlon score. As a matter of fact, this model outperforms the state of the art [40] (81.2%) by 3.5 points on CIFAR100. In the other cases, our performance is generally in line with current state-of-the-art methods. When this is not the case, it is due to reduced image resolution (ImageNet, Flowers) or to the choice of a specific video representation for UCF (dynamic images).

6 Conclusions

As machine learning applications become more advanced and pervasive, building data representations that work well for multiple problems will become increasingly important. In this paper, we have introduced a simple architectural element, the residual adapter module, that allows compressing many visual domains into relatively small residual networks, with substantial parameter sharing between them. We have also shown that the adapters address the forgetting problem, and that they adapt to target domains for which different amounts of training data are available. Finally, we have introduced a new multi-domain learning challenge, the Visual Decathlon, to allow a systematic comparison of algorithms for multiple-domain learning.
Acknowledgments: This work acknowledges the support of Mathworks/DTA DFR02620 and ERC 677195-IDIU.

References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Proc. NIPS, 2007.
[2] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In Proc.
NIPS, pages 523–531, 2016.
[3] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In Proc. CVPR, 2016.
[4] H. Bilen and A. Vedaldi. Integrated perception with recurrent multi-task neural networks. In Proc. NIPS, 2016.
[5] H. Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv preprint arXiv:1701.07275, 2017.
[6] R. Caruana. Multitask learning. Machine Learning, 28, 1997.
[7] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proc. CVPR, 2014.
[8] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. ICML, pages 160–167. ACM, 2008.
[9] H. Daumé III. Frustratingly easy domain adaptation. In Proc. ACL, page 256, 2007.
[10] T. Evgeniou and M. Pontil. Regularized multi-task learning. In SIGKDD, pages 109–117. ACM, 2004.
[11] R. M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[12] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proc. ICML, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In Proc. ECCV, pages 630–645. Springer, 2016.
[14] J. T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In ICASSP, pages 7304–7308, 2013.
[15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, 2015.
[16] X. Jia, B. De Brabandere, T. Tuytelaars, and L. Gool. Dynamic filter networks. In Proc. NIPS, pages 667–675, 2016.
[17] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K.
Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
[18] I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proc. CVPR, 2017.
[19] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[20] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[21] Z. Li and D. Hoiem. Learning without forgetting. In Proc. ECCV, pages 614–629, 2016.
[22] X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y. Wang. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In HLT-NAACL, pages 912–921, 2015.
[23] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Proc. NIPS, pages 136–144, 2016.
[24] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
[25] T. Mitchell. Never-ending learning. Technical report, DTIC Document, 2010.
[26] S. Munder and D. M. Gavrila. An experimental study on pedestrian classification. PAMI, 28(11):1863–1868, 2006.
[27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[28] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP, Dec 2008.
[29] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition.
In CVPR DeepVision Workshop, 2014.
[30] S. A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In Proc. CVPR, 2017.
[31] A. Rosenfeld and J. K. Tsotsos. Incremental learning through deep adaptation. arXiv preprint arXiv:1705.04228, 2017.
[32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge, 2014.
[33] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[34] J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[35] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[36] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32(0):323–332, 2012.
[37] A. V. Terekhov, G. Montone, and J. K. O'Regan. Knowledge transfer in deep block-modular neural networks. In Biomimetic and Biohybrid Systems, pages 268–279, 2015.
[38] S. Thrun. Lifelong learning algorithms. In Learning to Learn, pages 181–209. Springer, 1998.
[39] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Proc. CVPR, pages 4068–4076, 2015.
[40] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[41] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja. Robust visual tracking via structured multi-task sparse learning. IJCV, 101(2):367–383, 2013.
[42] Z. Zhang, P. Luo, C. C.
Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In Proc. ECCV, 2014.