{"title": "Efficient Match Kernel between Sets of Features for Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 135, "page_last": 143, "abstract": "In visual recognition, the images are frequently modeled as sets of local features (bags). We show that bag of words, a common method to handle such cases, can be viewed as a special match kernel, which counts 1 if two local features fall into the same regions partitioned by visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse. It is, therefore, appealing to design match kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels on large datasets due to their significant computational cost. To address this problem, we propose an efficient match kernel (EMK), which maps local features to a low dimensional feature space, average the resulting feature vectors to form a set-level feature, then apply a linear classifier. The local feature maps are learned so that their inner products preserve, to the best possible, the values of the specified kernel function. EMK is linear both in the number of images and in the number of local features. We demonstrate that EMK is extremely efficient and achieves the current state of the art performance on three difficult real world datasets: Scene-15, Caltech-101 and Caltech-256.", "full_text": "Ef\ufb01cient Match Kernels between Sets of Features\n\nfor Visual Recognition\n\nLiefeng Bo\n\nToyota Technological Institute at Chicago\n\nCristian Sminchisescu\n\nUniversity of Bonn\n\nblf0218@tti-c.org\n\nsminchisescu.ins.uni-bonn.de\n\nAbstract\n\nIn visual recognition, the images are frequently modeled as unordered collections\nof local features (bags). 
We show that bag-of-words representations commonly used in conjunction with linear classifiers can be viewed as special match kernels, which count 1 if two local features fall into the same region partitioned by visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse, motivating research into the design of match kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels for large datasets due to their significant computational cost. To address this problem, we propose efficient match kernels (EMK) that map local features to a low dimensional feature space and average the resulting vectors to form a set-level feature. The local feature maps are learned so that their inner products preserve, as well as possible, the values of the specified kernel function. Classifiers based on EMK are linear both in the number of images and in the number of local features. We demonstrate that EMK are extremely efficient and achieve the current state of the art on three difficult computer vision datasets: Scene-15, Caltech-101 and Caltech-256.\n\n1 Introduction\n\nModels based on local features have achieved state-of-the-art results in many visual object recognition tasks. For example, an image can be described by a set of local features extracted from patches around salient interest points or regular grids, and a shape can be described by a set of local features defined at edge points. This raises the question of how one should measure the similarity between two images represented as sets of local features. The problem is non-trivial because the cardinality of the set varies with each image and the elements are unordered.\n\nBag of words (BOW) [27] is probably one of the most popular image representations, due to both its conceptual simplicity and its computational efficiency. BOW represents each local feature by the closest visual word and counts the occurrence frequencies in the image. The resulting histogram is used as an image descriptor for object recognition, often in conjunction with linear classifiers. The length of the histogram is given by the number of visual words and is the same for all images. Various methods for creating vocabularies exist [10], the most common being k-means clustering of all (or a subsample of) the local features to obtain visual words.\n\nAn even better approach to recognition is to define kernels over sets of local features. One way is to exploit closure rules. The sum match kernel of Haussler [7] is obtained by adding local kernels over all combinations of local features from two different sets. In [17], the authors modify the sum kernel by introducing an integer exponent on the local kernels. Neighborhood kernels [20] integrate the spatial location of local features into a sum match kernel. Pyramid match kernels [5, 14, 13] map local features to multi-resolution histograms and compute a weighted histogram intersection. Algebraic set kernels [26] exploit tensor products to aggregate local kernels, whereas principal angle kernels [29] measure similarities based on angles between linear subspaces spanned by the local features in the two sets. Other approaches estimate a probability distribution on sets of local features, then derive their similarity using distribution-based comparison measures [12, 18, 2]. All of the above methods need to explicitly evaluate the full kernel matrix, hence they require space and time complexity that is quadratic in the number of images. This is impractical for large datasets (see §4).\n\nIn this paper we present efficient match kernels (EMK) that combine the strengths of both bag of words and set kernels. We map local features to a low dimensional feature space and construct set-level features by averaging the resulting feature vectors. 
This feature extraction procedure is not significantly different from BOW. Hence EMK can be used in conjunction with linear classifiers and does not require the explicit computation of a full kernel matrix, which leads to both space and time complexity that is linear in the number of images. Experiments on three image categorization tasks show that EMK are effective computational tools.\n\n2 Bag of Words and Match Kernels\n\nIn supervised image classification, we are given a training set of images and their corresponding labels. The goal is to learn a classifier to label unseen images. We adopt a bag of features method, which represents an image as a set of local features. Let X = {x1, . . . , xp} be the set of local features in an image and V = {v1, . . . , vD} the dictionary, a set of visual words. In BOW, each local feature is quantized into a D dimensional binary indicator vector µ(x) = [µ1(x), . . . , µD(x)]⊤, where µi(x) is 1 if x ∈ R(vi) and 0 otherwise, with R(vi) = {x : ‖x − vi‖ ≤ ‖x − v‖, ∀v ∈ V}. The indicator vectors for one image form a normalized histogram µ(X) = (1/|X|) Σ_{x∈X} µ(x), where | · | is the cardinality of a set. BOW features can be used in conjunction with either a linear or a kernel classifier, although the latter often leads to expensive training and testing (see §4). When a linear classifier is used, the resulting kernel function is:\n\nKB(X, Y) = µ(X)⊤µ(Y) = (1/(|X||Y|)) Σ_{x∈X} Σ_{y∈Y} µ(x)⊤µ(y) = (1/(|X||Y|)) Σ_{x∈X} Σ_{y∈Y} δ(x, y)   (1)\n\nwith\n\nδ(x, y) = 1 if x, y ∈ R(vi) for some i ∈ {1, . . . , D}, and 0 otherwise   (2)\n\nδ(x, y) is obviously a positive definite kernel, measuring the similarity between two local features x and y: δ(x, y) = 1 if x and y belong to the same region R(vi), and 0 otherwise. However, this type of quantization can be too coarse when measuring the similarity of two local features (see also fig. 1 in [21]), risking a significant decrease in classification performance. It would be better to replace δ(x, y) with a continuous kernel function that more accurately measures the similarity between x and y:\n\nKS(X, Y) = (1/(|X||Y|)) Σ_{x∈X} Σ_{y∈Y} k(x, y)   (3)\n\nIn fact, this is the normalized sum match kernel [7, 17]. By closure properties, KS(X, Y) is a positive definite kernel as long as the component k(x, y) is positive definite. For convenience, we refer to k(x, y) as the local kernel. A negative consequence of kernelization is the high computational cost of the summation: a single match kernel value takes O(|X||Y|) to compute, rather than the O(1) cost of evaluating a single kernel function defined on vectors. When used in conjunction with kernel machines, it takes O(n²) space and O(n²m²d) time to store and compute the entire kernel matrix, respectively, where n is the number of images in the training set and m is the average cardinality of the sets. For image classification, m can be in the thousands, so the computational cost rapidly becomes quartic as n approaches (or increases beyond) m. In addition to expensive training, the match kernel also has a fairly high testing cost: for a test input, evaluating the discriminant f(X) = Σ_{i=1}^{n} αi KS(Xi, X) takes O(nm²d). This is, again, unacceptably slow for large n. For sparse kernel machines, such as SVMs, the cost can decrease to some extent, as some of the αi are zero. 
However, this does not change the order of complexity, as the level of sparsity usually grows linearly in n.\n\nDefinition 1. The kernel function k(x, y) = φ(x)⊤φ(y) is called finite dimensional if the feature map φ(·) is finite dimensional.\n\n      | Sum [7]  | Bhattacharyya [12] | PMK [6]        | EMK-CKSVD     | EMK-Fourier\nTrain | O(n²m²d) | O(n²m³d)           | O(n²m log(T)d) | O(nmDd + nD²) | O(nmDd)\nTest  | O(nm²d)  | O(nm³d)            | O(nm log(T)d)  | O(mDd + D²)   | O(mDd)\n\nTable 1: Computational complexity for five types of 'set kernels'. 'Test' means the computational cost per image. 'Sum' is the sum match kernel used in [7]. 'Bhattacharyya' is the Bhattacharyya kernel in [12]. PMK is the pyramid match kernel of [6], with T giving the value of the maximal feature range. d is the dimensionality of the local features. D in EMK is the dimensionality of the feature maps and does not change with the training set size. Our experiments suggest that a value of D in the order of thousands is sufficient for good accuracy. Thus, O(nmDd) dominates the computational cost for training, and O(mDd) the one for testing, since m is usually in the thousands and d in the hundreds. EMK uses linear classifiers and does not require the evaluation of the kernel matrix. The other four methods are used in conjunction with kernel classifiers, hence they all need to evaluate the entire kernel matrix. In the case of nearest neighbor classifiers, there is no training cost, but testing costs remain unchanged.\n\nδ(x, y) is a special type of finite dimensional kernel. With a finite dimensional kernel, the match kernel can be simplified as:\n\nKS(X, Y) = φ(X)⊤φ(Y)   (4)\n\nwhere φ(X) = (1/|X|) Σ_{x∈X} φ(x) is the feature map on the set of vectors. Since φ(X) is finite and can be computed explicitly, we can extract feature vectors on the set X, then apply a linear classifier on the resulting representation. We call (4) an efficient match kernel (EMK). The feature extraction in EMK is not significantly different from the bag of words method. The training and testing costs are O(nmDd) and O(mDd) respectively, where D is the dimensionality of the feature map φ(x). If the feature map φ(x) is low dimensional, the computational cost of EMK can be much lower than the one required to evaluate the match kernel by computing the kernel functions k(x, y) directly. For example, the cost is 1/n lower when D has the same order as m (this is the case in our experiments). Notice that we only need the feature vectors φ(X) in EMK, hence it is not necessary to compute the entire kernel matrix. Since recent developments have shown that linear SVMs can be trained in linear complexity [25], there is no substantial cost added in the training phase. The complexity of EMK and of several other well-known set kernels is reviewed in table 1. If necessary, location information can be incorporated into EMK using a spatial pyramid [14, 13]: KP(X, Y) = Σ_{l=0}^{L−1} Σ_{t=1}^{2^l} 2^{−l} KS(X(l,t), Y(l,t)) = φP(X)⊤φP(Y), where L is the number of pyramid levels, 2^l is the number of spatial cells in the l-th pyramid level, X(l,t) are the local features falling within the spatial cell (l, t), and φP(X) = [φ(X(1,1))⊤, . . . , φ(X(l,t))⊤, . . .]⊤ stacks the per-cell features.\n\nWhile there can be many choices for the local feature maps φ(x), and the positive definiteness of k(x, y) = φ(x)⊤φ(y) is always guaranteed, most do not necessarily lead to a meaningful similarity measure. 
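To make the cost contrast between eq. (3) and eq. (4) concrete, here is a minimal NumPy sketch (our own illustration, not the authors' implementation): the sum match kernel touches every pair of local features, while the factorized EMK evaluation only averages feature maps. The Gaussian local kernel and the toy feature sets are illustrative choices.

```python
import numpy as np

def sum_match_kernel(X, Y, gamma=1.0):
    """Normalized sum match kernel K_S(X, Y) of eq. (3) with a Gaussian
    local kernel; cost is O(|X||Y|d) per kernel value."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists).mean()

def emk_value(X, Y, feature_map):
    """EMK evaluation of eq. (4): average a finite-dimensional feature map
    over each set, then take one inner product; linear in |X| + |Y|."""
    return feature_map(X).mean(axis=0) @ feature_map(Y).mean(axis=0)

# two toy "images", each an unordered set of d-dimensional local features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Y = rng.normal(size=(7, 8))
k_sum = sum_match_kernel(X, Y)
# with the identity feature map, EMK reduces to an inner product of set means
k_emk = emk_value(X, Y, lambda Z: Z)
```

When the feature map approximates the local kernel well (the topic of §3), the EMK value approaches the sum match kernel value at a fraction of the cost.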
In this paper, we give two principled methods to create meaningful local feature maps φ(x), by arranging for their inner products to approximate a given kernel function.\n\n3 Efficient Match Kernels\n\nIn this section we present two kernel approximations, one based on low-dimensional projections (§3.1) and one based on random Fourier set features (§3.2).\n\n3.1 Learning Low Dimensional Set Features\n\nOur approach is to project the high dimensional feature vectors ψ(x) induced by the kernel k(x, y) = ψ(x)⊤ψ(y) onto a low dimensional space spanned by D basis vectors, then construct a local kernel from inner products of the low-dimensional representations. Given {ψ(zi)}_{i=1}^{D}, a set of basis vectors zi, we can approximate the feature vector ψ(x):\n\nvx = argmin_{vx} ‖ψ(x) − Hvx‖²   (5)\n\nFigure 1: Low-dimensional approximations for a Gaussian kernel. Left: approximated Gaussian kernel with 20 learned feature maps. Center: the training objective (11) as a function of stochastic gradient descent iterations. Right: approximated Gaussian kernel based on 200 random Fourier features. The feature maps are learned from 200 samples, uniformly drawn from [-10,10].\n\nwhere H = [ψ(z1), . . . , ψ(zD)] and vx are the low-dimensional (projection) coefficients. This is a convex quadratic program with analytic solution:\n\nvx = (H⊤H)^{-1}(H⊤ψ(x))   (6)\n\nThe local kernel derived from the projected vectors is:\n\nkl(x, y) = [Hvx]⊤[Hvy] = kZ(x)⊤ K_ZZ^{-1} kZ(y)   (7)\n\nwhere kZ(x) is a D × 1 vector with {kZ(x)}_i = k(x, zi) and KZZ is a D × D matrix with {KZZ}_{ij} = k(zi, zj). For G⊤G = K_ZZ^{-1} (notice that K_ZZ^{-1} is positive definite), the local feature maps are:\n\nφ(x) = G kZ(x)   (8)\n\nThe resulting full feature map is φ(X) = (1/|X|) G [Σ_{x∈X} kZ(x)], with computational complexity O(mDd + D²) for a set of local features. A related method is the kernel codebook [28], where a set-level feature is also extracted based on a local kernel, but with a different feature map φ(·). An essential difference is that the inner products of our set-level features φ(X) formally approximate the sum match kernel, whereas the ones induced by the kernel codebook do not. Therefore EMK only requires a linear classifier, whereas a kernel codebook would require a non-linear classifier for comparable performance. As explained, this can be prohibitively expensive to both train and test on large datasets. Our experiments, shown in table 3, further suggest that EMK outperforms the kernel codebook, even in the non-linear case.\n\nHow can we learn the basis vectors? One way is kernel principal component analysis (KPCA) [24] on a randomly selected pool of F local features, with the basis set to the topmost D eigenvectors. This faces two difficulties, however: (i) KPCA scales cubically in the number of selected local features, F; (ii) O(Fmd) work is required to extract the set-level feature vector for one image, because the eigenvectors are linear combinations of the selected local feature vectors, Σ_{i=1}^{F} αi ψ(xi). For large F, as typically required for good accuracy, this approach is too expensive. Although the first difficulty can be palliated by iterative KPCA [11], the second computational challenge remains. Another option would be to approximate each eigenvector with a single feature vector ψ(z), by solving the pre-image problem (z, β) = argmin_{z,β} ‖Σ_{i=1}^{F} αi ψ(xi) − β ψ(z)‖² after KPCA. However, this two step approach is sub-optimal. Intuitively, it should be better to find the single vector approximations within a unified objective function. This motivates our constrained singular value decomposition in kernel feature space (CKSVD):\n\nargmin_{V,Z} R(V, Z) = (1/F) Σ_{i=1}^{F} ‖ψ(xi) − Hvi‖²   (9)\n\nwhere F is the number of randomly selected local features, Z = [z1, . . . , zD] and V = [v1, . . . , vF]. If the pre-image constraints H = [ψ(z1), . . . , ψ(zD)] are dropped, it is easy to show that KPCA is recovered. The partial derivatives of R with respect to vi are:\n\n∂R(V, Z)/∂vi = 2H⊤Hvi − 2H⊤ψ(xi)   (10)\n\nSetting ∂R(V, Z)/∂vi = 0 produces a linear system with respect to vi for a fixed Z. In this case, we obtain the optimal, analytical solution vi = (H⊤H)^{-1}(H⊤ψ(xi)). Substituting this solution into eq. (9), we can eliminate the variable V. To learn the basis vectors, instead of directly optimizing R(V, Z), we can solve the equivalent optimization problem:\n\nargmin_Z R*(Z) = −(1/F) Σ_{i=1}^{F} kZ(xi)⊤ K_ZZ^{-1} kZ(xi)   (11)\n\nOptimizing R*(Z) is tractable because its parameter space is much smaller than that of R(V, Z). Problem (11) can be solved using any gradient descent algorithm. For efficiency, we use the stochastic (on-line) gradient descent (SGD) method. 
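For a fixed basis Z, the feature maps of eqs. (7)-(8) reduce to a few dense linear algebra operations. A minimal NumPy sketch (a randomly drawn Z stands in for the learned one, and the small diagonal jitter added to K_ZZ for numerical stability is our own choice):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Local kernel k(x, y) = exp(-gamma ||x - y||^2) between two row sets."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(1)
d, D = 8, 4
Z = rng.normal(size=(D, d))        # basis vectors z_1..z_D (random stand-in)
X = rng.normal(size=(6, d))        # local features of one image

K_ZZ = gaussian_kernel(Z, Z) + 1e-10 * np.eye(D)   # jitter for stability
G = np.linalg.inv(np.linalg.cholesky(K_ZZ))        # G^T G = K_ZZ^{-1}
k_ZX = gaussian_kernel(X, Z)       # rows are k_Z(x)^T
phi = k_ZX @ G.T                   # rows are phi(x) = G k_Z(x), eq. (8)
phi_X = phi.mean(axis=0)           # set-level feature phi(X)

# sanity check of eq. (7): phi(x)^T phi(y) = k_Z(x)^T K_ZZ^{-1} k_Z(y)
lhs = phi[0] @ phi[1]
rhs = k_ZX[0] @ np.linalg.solve(K_ZZ, k_ZX[1])
```

A linear classifier is then trained directly on the phi_X vectors, one per image.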
SGD applies to problems where the full gradient decomposes as a sum of individual gradients over the training samples. The standard (batch) gradient descent method updates the parameter vector using the full gradient, whereas SGD approximates it using the gradient at a single training sample. For large datasets, SGD is usually much faster than batch gradient descent. At the t-th iteration of SGD, we randomly pick a sample xt from the training set and update the parameter vector based on:\n\nZ(t + 1) = Z(t) − (η/t) ∂[−kZ(xt)⊤ K_ZZ^{-1} kZ(xt)]/∂Z   (12)\n\nwhere η is the learning rate. In our implementation, we use D samples (rather than just one) to compute the gradient. This produces more accurate results and matches the cost of inverting KZZ, which is O(D³) per iteration.\n\n3.2 Random Fourier Set Features\n\nAnother tractable approach to large-scale learning is to approximate the kernel using random feature maps [22, 23]. For a given function µ(x; θ) and probability distribution p(θ), one can define the local kernel as kf(x, y) = ∫ p(θ) µ(x; θ) µ(y; θ) dθ. We consider feature maps of the form µ(x; θ) = cos(ω⊤x + b) with θ = (ω, b), which project local features onto a randomly chosen line, then pass the resulting scalar through a sinusoid. For example, to approximate the Gaussian kernel kf(x, y) = exp(−γ‖x − y‖²), the random feature maps are φ(x) = √(2/D) [cos(ω1⊤x + b1), . . . , cos(ωD⊤x + bD)]⊤, where the bi are drawn from the uniform distribution on [−π, π] and the ωi are drawn from a Gaussian with zero mean and covariance 2γI. Our proposed set-level feature map is (c.f. §2): φ(X) = (1/|X|) Σ_{x∈X} φ(x). 
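The random Fourier construction above is easy to instantiate. A minimal NumPy sketch for the Gaussian case (the feature dimension and toy data are our own illustrative choices):

```python
import numpy as np

def fourier_feature_map(X, D=2000, gamma=1.0, seed=0):
    """Random Fourier features approximating exp(-gamma ||x - y||^2):
    phi(x) = sqrt(2/D) [cos(w_i^T x + b_i)]_{i=1..D},
    with w_i ~ N(0, 2*gamma*I) and b_i ~ Uniform[-pi, pi]."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, X.shape[1]))
    b = rng.uniform(-np.pi, np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))            # toy set of local features
phi = fourier_feature_map(X)
phi_X = phi.mean(axis=0)               # set-level EMK-Fourier feature
# phi(x)^T phi(y) approximates the Gaussian local kernel
approx = phi[0] @ phi[1]
exact = np.exp(-((X[0] - X[1]) ** 2).sum())
```

Unlike the learned maps of §3.1, no training is needed here; the quality of the approximation improves as D grows.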
Although any shift invariant kernel can be represented using random Fourier features, currently these are limited to Gaussian kernels or to kernels with analytical inverse Fourier transforms. In particular, ω needs to be sampled from the inverse Fourier transform of the corresponding shift invariant kernel. The constraint of shift invariance excludes a number of practically interesting similarities. For example, the χ² kernel [8] and the histogram intersection kernel [5] are designed to compare histograms, hence they can be used as local kernels if the features are histograms. However, no random Fourier features can approximate them. Such problems do not occur for the learned low dimensional features, a methodology applicable to any Mercer kernel. Moreover, in experiments, we show that kernels based on low-dimensional approximations (§3.1) can produce superior results when the dimensionality of the feature maps is small. As seen in fig. 2, for applicable kernels, the random Fourier set features also produce very competitive results in the higher-dimensional regime.\n\n4 Experiments\n\nWe illustrate our methodology on three publicly available computer vision datasets: Scene-15, Caltech-101 and Caltech-256. For comparisons, we consider four algorithms: BOW-Linear, BOW-Gaussian, EMK-CKSVD and EMK-Fourier. BOW-Linear and BOW-Gaussian use a linear classifier and a Gaussian kernel classifier on BOW features, respectively. EMK-CKSVD and EMK-Fourier use linear classifiers. For the former, we learn low dimensional feature maps (§3.1), whereas for the latter we obtain them by random sampling (§3.2).\n\nAll images are converted to grayscale. The local features are SIFT descriptors [16] extracted from 16×16 image patches. Instead of detecting interest points, we compute SIFT descriptors over dense regular grids with a spacing of 8 pixels. 
For EMK, our local kernel is a Gaussian, exp(−γ‖x − y‖²). We use the same fixed γ = 1 for our SIFT descriptors on all datasets: Scene-15, Caltech-101 and Caltech-256, although a more careful selection is likely to further improve performance. We run k-means clustering to identify the visual words, and stochastic gradient descent to learn the local feature maps, using a random set of 100,000 SIFT descriptors.\n\nOur classifier is a support vector machine (SVM), extended to multi-class decisions by combining one-versus-all votes. We use LIBLINEAR [3] for BOW-Linear, EMK-Fourier and EMK-CKSVD, and LIBSVM for BOW-Gaussian (the former need a linear classifier whereas the latter uses a nonlinear classifier). The regularization and kernel parameters (where applicable) of the SVM are tuned by ten-fold cross validation on the training set. The dimensionality of the feature maps and the vocabulary size are both set to 1000 for fair comparisons, unless otherwise specified. We have also experimented with larger vocabulary sizes in BOW, but no substantial improvement was found (fig. 2). We measure performance by classification accuracy, averaged over five random training/testing splits. All experiments are run on a cluster built of compute nodes with 1.0 GHz processors and 8GB memory.\n\nScene-15: Scene-15 consists of 4485 images labeled into 15 categories. Each category contains 200 to 400 images whose average size is 300×250 pixels. In our first experiment, we train models on a randomly selected set of 1500 images (100 images per category) and test on the remaining images. We vary the dimensionality of the feature maps (EMK) and the vocabulary size (BOW) from 250 to 2000 with a step length of 250. For this dataset, we only consider the flat BOW and EMK (pyramid level 0 only) in all experiments. 
The classification accuracy of BOW-Linear, BOW-Gaussian, EMK-Fourier and EMK-CKSVD is plotted in fig. 2 (left). Our second experiment is similar to the first one, but the dimensionality of the feature maps and the vocabulary size vary from 50 to 200 with a step length of 25. In our third experiment, we fix the dimensionality of the feature maps to 1000 and vary the training set size from 300 to 2400 with a step length of 300. We show the classification accuracy of the four models as a function of the training set size in fig. 2 (right).\n\nWe notice that EMK is consistently 5-8% better than BOW in all cases. BOW-Gaussian is about 2% better than BOW-Linear on average, whereas EMK-CKSVD gives very similar performance to EMK-Fourier in most cases. We observe that EMK-CKSVD significantly outperforms EMK-Fourier for low-dimensional feature maps, indicating that the learned features preserve the values of the Gaussian kernel better than the random Fourier maps in this regime; see also fig. 1 (center).\n\nFor comparison, we attempted to run the sum match kernel on the full Scene-15 dataset, but were unable to finish within one week. We therefore considered a smaller dataset, training and testing with only 40 images from each category. The sum match kernel obtains 71.8% accuracy, slightly better than EMK-Fourier (71.0%) and EMK-CKSVD (71.4%) on the same dataset. The sum match kernel takes about 10 hours for training and 10 hours for testing, whereas EMK-Fourier and EMK-CKSVD need less than 1 hour, most of it spent computing SIFT descriptors. In addition, we used 10,000 randomly selected SIFT descriptors to learn KPCA-based local feature maps, which takes about 12 hours each for the training and testing sets of the full Scene-15 dataset. We obtain slightly lower accuracy than with EMK-Fourier and EMK-CKSVD. One reason may be the small sample size, but it is currently computationally prohibitive to use larger ones.\n\nCaltech-101: Caltech-101 [15] contains 9144 images from 101 object categories and a background category. Each category has 31 to 800 images, with significant color, pose and lighting variations. Caltech-101 is one of the most frequently used benchmarks for image classification, and results obtained by different algorithms are available from the published papers, allowing direct comparisons. Following the common experimental setting, we train models on 15/30 images per category and test on the remaining images. We consider three pyramid levels: L = 0, L = 1, and L = 2 (for the latter two, spatial information is used). We also tried increasing the number of levels in the pyramid, but did not obtain a significant improvement.\n\nWe report the accuracy of BOW-Linear, BOW-Gaussian, EMK-Fourier and EMK-CKSVD in table 2. EMK-Fourier and EMK-CKSVD perform substantially better than BOW-Linear and BOW-Gaussian for all pyramid levels. The performance gap increases as more pyramid levels are added. EMK-CKSVD is very close to EMK-Fourier, and BOW-Gaussian does not improve much over BOW-Linear. In table 3, we compare EMK to other algorithms. As can be seen, EMK is comparable to the best-scoring classifiers to date. The best result on Caltech-101 was obtained by combining multiple descriptor types [1]. Our main goal in this paper is to analyze the strengths of EMK relative\n\nFigure 2: Classification accuracy on Scene-15. Left: accuracy in the high-dimensional regime; center: accuracy in the low-dimensional regime. Right: accuracy as a function of the training set size. 
The training set size is 1500 in the left plot; the dimensionality of the feature maps and the vocabulary size are both set to 1000 in the right plot (for fair comparisons).\n\n             | 15 training                      | 30 training\nAlgorithms   | L=0      | L=1      | L=2      | L=0      | L=1      | L=2\nBOW-Linear   | 37.3±0.9 | 41.6±0.7 | 45.0±0.5 | 46.2±0.8 | 53.0±0.9 | 56.2±0.7\nBOW-Gaussian | 38.7±0.8 | 43.7±0.7 | 46.5±0.6 | 47.5±0.7 | 54.7±0.8 | 58.1±0.6\nEMK-Fourier  | 46.3±0.7 | 53.0±0.6 | 60.2±0.8 | 54.0±0.7 | 64.1±0.8 | 70.1±0.8\nEMK-CKSVD    | 46.6±0.9 | 53.4±0.8 | 60.5±0.9 | 54.5±0.8 | 63.7±0.9 | 70.3±0.8\n\nTable 2: Classification accuracy comparisons for three pyramid levels, with 15 and 30 training images per category. The results are averaged over five random training/testing splits. The dimensionality of the feature maps and the vocabulary size are both set to 1000. We have also experimented with larger vocabularies, but did not observe noticeable improvement; the performance tends to saturate beyond 1000 dimensions.\n\nto BOW. Only SIFT descriptors are used in BOW and EMK for all compared algorithms listed in table 3. To improve performance, EMK can be conveniently extended to multiple feature types.\n\nCaltech-256: Caltech-256 consists of 30,607 images from 256 object categories and background, where each category contains at least 80 images. Caltech-256 is challenging due to the large number of classes and the diverse lighting conditions, poses, backgrounds, image sizes, etc. We follow the standard setup and increase the training set from 15 to 60 images per category with a step length of 15. In table 4, we show the classification accuracy obtained by BOW-Linear, BOW-Gaussian, EMK-Fourier and EMK-CKSVD. 
As in the other datasets, we notice that EMK-Fourier and EMK-CKSVD consistently outperform BOW-Linear and BOW-Gaussian.\n\nTo compare the four algorithms computationally, we select images from each category, proportionally to the total number of images in that category, as the training set. We consider six different training set sizes: ⌊0.3 × 30607⌋, . . . , ⌊0.8 × 30607⌋. The results are shown in fig. 3. To accelerate BOW-Gaussian, we precompute the entire kernel matrix. As expected, BOW-Gaussian is much slower than the other three algorithms as the training set size increases, for both training and testing.\n\nAlgorithms   | 15 training | 30 training || Algorithms  | 15 training | 30 training\nPMK [5, 6]   | 50.0±0.9    | 58.2        || kCNN [30]   | 59.2        | 67.4\nHMAX [19]    | 51.0        | 56.0        || LDF [4]     | 60.3        | N/A\nML+PMK [9]   | 52.2        | 62.1        || ML+CORR [9] | 61.0        | 69.6\nKC [28]      | N/A         | 64.0        || NBNN [1]    | 65.0±1.1    | 73.0\nSPM [14]     | 56.4        | 64.4±0.5    || EMK-Fourier | 60.2±0.8    | 70.1±0.8\nSVM-KNN [31] | 59.1±0.5    | 66.2±0.8    || EMK-CKSVD   | 60.5±0.9    | 70.3±0.8\n\nTable 3: Accuracy comparisons on Caltech-101. EMK is compared with ten recently published methods. N/A indicates that results are not available. Notice that EMK is used in conjunction with a linear classifier (a linear SVM here), whereas all other methods (except HMAX [19]) require nonlinear classifiers.\n\nAlgorithms  | BOW-Linear | BOW-Gaussian | EMK-Fourier | EMK-CKSVD\n15 training | 19.1±0.8   | 17.4±0.7     | 22.6±0.7    | 23.2±0.6\n30 training | 24.4±0.6   | 22.7±0.4     | 30.1±0.5    | 30.5±0.4\n45 training | 28.3±0.5   | 26.9±0.3     | 34.1±0.5    | 34.4±0.4\n60 training | 30.9±0.4   | 29.3±0.6     | 37.4±0.6    | 37.6±0.5\n\nTable 4: Accuracy on Caltech-256. The results are averaged over five random training/testing splits. The dimensionality of the feature maps and the vocabulary size are both set to 1000 (for fair comparisons). We use 2 pyramid levels.\n\nFigure 3: Computational costs on Caltech-256. Left: training time as a function of the training set size. Right: testing time as a function of the training set size. Testing time is in seconds per 100 samples. Flat BOW and EMK are used (no pyramid, L = 0). Notice that PMK has a similar training and testing cost to BOW-Gaussian.\n\nNonlinear SVMs take O(n² ∼ n³) time even when a highly optimized software package like LIBSVM is used. For large n, SVM training dominates the training cost. The testing time of BOW-Gaussian is linear in the training set size, but constant for the other three algorithms. 
Although we only experiment with a Gaussian kernel, a similar complexity would be typical for other nonlinear kernels, such as those used in [6, 9, 14, 31, 4].

5 Conclusion

We have presented efficient match kernels for visual recognition, based on the novel insight that popular bag-of-words representations used in conjunction with linear models can be viewed as a special type of match kernel, which counts 1 if two local features fall into the same region partitioned by visual words and 0 otherwise. We illustrate the quantization limitations of such models and propose more sophisticated kernel approximations that preserve the computational efficiency of bag-of-words while being as accurate as (or more accurate than) the existing, computationally demanding, nonlinear kernels. The models we propose are built around Efficient Match Kernels (EMK), which map local features to a low dimensional feature space, average the resulting feature vectors to form a set-level feature, then apply a linear classifier. In experiments, we show that EMK are efficient and achieve state of the art classification results on three difficult computer vision datasets: Scene-15, Caltech-101 and Caltech-256.

Acknowledgements: This research was supported, in part, by awards from NSF (IIS-0535140) and the European Commission (MCEXT-025481). Liefeng Bo thanks Jian Peng for helpful discussions.

References

[1] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
[2] M. Cuturi and J. Vert. Semigroup kernels on finite sets. In NIPS, 2004.
[3] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: a library for large linear classification.
JMLR, 9:1871–1874, 2008.
[4] A. Frome, Y. Singer, and J. Malik. Image retrieval and classification using local distance functions. In NIPS, 2006.
[5] K. Grauman and T. Darrell. The pyramid match kernel: discriminative classification with sets of image features. In ICCV, 2005.
[6] K. Grauman and T. Darrell. The pyramid match kernel: efficient learning with sets of features. JMLR, 8:725–760, 2007.
[7] D. Haussler. Convolution kernels on discrete structures. Technical report, 1999.
[8] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: a comprehensive study. IJCV, 73(2):213–238, 2007.
[9] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In CVPR, 2008.
[10] F. Jurie and B. Triggs. Creating efficient codebooks for visual recognition. In ICCV, 2005.
[11] K. Kim, M. Franz, and B. Schölkopf. Iterative kernel principal component analysis for image modeling. PAMI, 27(9):1351–1366, 2005.
[12] R. Kondor and T. Jebara. A kernel between sets of vectors. In ICML, 2003.
[13] A. Kumar and C. Sminchisescu. Support kernel machines for object recognition. In ICCV, 2007.
[14] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[15] F. Li, R. Fergus, and P. Perona. One-shot learning of object categories. PAMI, 28(4):594–611, 2006.
[16] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91–110, 2004.
[17] S. Lyu.
Mercer kernels for object recognition with local features. In CVPR, 2005.
[18] P. Moreno, P. Ho, and N. Vasconcelos. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In NIPS, 2003.
[19] J. Mutch and D. Lowe. Multiclass object recognition with sparse, localized features. In CVPR, 2006.
[20] M. Parsana, S. Bhattacharya, C. Bhattacharyya, and K. Ramakrishnan. Kernels on attributed pointsets with applications. In NIPS, 2007.
[21] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: improving particular object retrieval in large scale image databases. In CVPR, 2008.
[22] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[23] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In NIPS, 2008.
[24] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[25] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In ICML, pages 807–814. ACM, 2007.
[26] A. Shashua and T. Hazan. Algebraic set kernels with application to inference over local image representations. In NIPS, 2004.
[27] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In ICCV, 2003.
[28] J. van Gemert, J. Geusebroek, C. Veenman, and A. Smeulders. Kernel codebooks for scene categorization. In ECCV, 2008.
[29] L. Wolf and A. Shashua. Learning over sets using kernel principal angles. JMLR, 4:913–931, 2003.
[30] K. Yu, W. Xu, and Y. Gong. Deep learning with kernel regularization for visual recognition. In NIPS, 2008.
[31] H. Zhang, A. Berg, M. Maire, and J. Malik. SVM-KNN: discriminative nearest neighbor classification for visual category recognition.
In CVPR, 2006.