Visual Recognition

March 12, 2018 • Computer Vision

A very brief summary on classical methods of visual recognition.


Task: Visual Recognition / Image Classification, to judge if a certain object is present.


  • Traditonal ML method: BoW Model
  • DL


  • Semantic gap
  • Visual variations
  • Multiple labels

Bag-of-Words Model

General View :

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). Also known as the vector space model. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity). The bag-of-words model has also been used for computer vision.

In computer vision, the bag-of-words model (BoW model) can be applied to image classification, by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.

BoW model was borrowed from NLP.

The philosophy of BoW model is to divide the object into small patches (which is natural in NLP), neglect the positional imformation so as to simplify the problem. The model focus on the distribution of small patches (often reprensented by histograms). The lost positional imformation can be compensated in multiple ways as optimiaztions of the original model, such as n-gram in NLP and _ in CV.

The main challenge of applying BoW to CV is perhaps the choose of "words".

Pipeline :

  1. Feature detection: Find "interest" points
  2. Feature description: Decribe features in an algebraic way
  3. Dictionary generation: Cluster the representation of features and learn a visual dictionary
  4. Histogram calculation: Calculate a histogram of all visual words of a image.

Feature detection


  1. Similarity is local.
  2. Corners are nice feature points (locally "unique").

A corner can be defined as the intersection of two edges. A corner can also be defined as a point for which there are two dominant and different edge directions in a local neighbourhood of the point.

Example: Harris Corner Detector

Without loss of generality, we will assume a grayscale 2-dimensional image is used. Let this image be given by $I$. Consider taking an image patch over the area $(u,v)$ and shifting it by $(x,y)$. The weighted sum of squared differences (SSD) between these two patches, denoted $S$, is given by:
S(x,y) = \sum_u \sum_v w(u,v) \, \left( I(u+x,v+y) - I(u,v)\right)^2
$I(u+x,v+y)$ can be approximated by a Taylor expansion. Let $I_{x}$ and $ I_y$ be the partial derivatives of $I$, such that
I(u+x,v+y)\approx I(u,v)+I_{x}(u,v)x+I_{y}(u,v)y
This produces the approximation
S(x,y) \approx \sum_u \sum_v w(u,v) \, \left( I_x(u,v)x + I_y(u,v)y \right)^2​
which can be written in matrix form:
S(x,y) \approx \begin{pmatrix} x & y \end{pmatrix} A \begin{pmatrix} x \\ y \end{pmatrix}
where $A$ is the structure tensor,

{\displaystyle M=\sum _{u}\sum _{v}w(u,v){\begin{bmatrix}I_{x}(u,v)^{2}&I_{x}(u,v)I_{y}(u,v)\\I_{x}(u,v)I_{y}(u,v)&I_{y}(u,v)^{2}\end{bmatrix))={\begin{bmatrix}\langle I_{x}^{2}\rangle &\langle I_{x}I_{y}\rangle \\\langle I_{x}I_{y}\rangle &\langle I_{y}^{2}\rangle end{bmatrix))}

This matrix is a Harris matrix, and angle brackets denote averaging (i.e. summation over $(u,v)$. $w(u, v)$ denotes the type of window that slides over the image. If a Box filter is used the response will be anisotropic, but if a Gaussian is used, then the response will be isotropic.

A corner (or in general an interest point) is characterized by a large variation of S in all directions . By analyzing the eigenvalues of $M$, this characterization can be expressed in the following way: $A$ should have two "large" eigenvalues for an interest point. Based on the magnitudes of the eigenvalues, the following inferences can be made based on this argument:

If $\lambda_1 \approx 0$ and $\lambda_2 \approx 0$ then this pixel $(x,y)$ has no features of interest.
If $\lambda_1 \approx 0$ and $\lambda _{2}$ has some large positive value, then an edge is found.
If $ \lambda_1$ and $ \lambda_2$ have large positive values, then a corner is found.

Harris and Stephens note that exact computation of the eigenvalues is computationally expensive, since it requires the computation of a square root, and instead suggest the following function $M_c$, where $\kappa$ is a tunable sensitivity parameter:
M_c = \lambda_1 \lambda_2 - \kappa \, (\lambda_1 + \lambda_2)^2
= \operatorname{det}(A) - \kappa \, \operatorname{trace}^2(A)
Therefore, the algorithm does not have to actually compute the eigenvalue decomposition of the matrix $A$ and instead it is sufficient to evaluate the determinant and trace of $A$ to find corners, or rather interest points in general.

Feature Description

After we've detected points of interest, some methods should be implemented to represent the feature of these points. There are multiple algorithms: SIFT, SURF, Hog, GLOH...

Take SIFT(scale-invariant feature transform) as an example:

SIFT feature descriptor is invariant to uniform scaling, orientation, illumination changes, and partially invariant to affine distortion.

David G.Lowe's SIFT Paper

Dictionary Generation

Visual feature clustering. A typical way is to use K-means (iterate between "assignment" & "center-updating" steps)

Histogram Calculation

Each SIFT descriptor is quantized into a visual word using its neareat cluster center. An image is therefore represented by a histogram, each bin representing an entry of the dictionary.



  1. Utilize the positional information

    • Applied to "histogram calculation" step
    • Spatial Pyramid Matching
    • Pyramid Match Kernel
  2. Improve the aggregation method

    • VLAD
    • Fisher Vector
    • Soft Assignment
  3. Optimize the dictionary generation

    • Vocabulary Tree
    • Sparse Coding

Spatial Pyramid Matching

Lazebnik Slides

SPM Paper (CVPR 2006)

Philosophy: Add spatial information by adding locally representation at different scales.

Pyramid Match Kernel

Pyramid match kernel aims to measure the similarity of a partial matching between two images within $O(dm log(D))$, where $d$ is vector dim, $m$ is maximum set cardinality, and $D$ is diameter of vector space.

The algorithm compares similarity of histogram of diffrent scaling levels and

compute a score based on "additional matches of this level".

PMK Report

PMK Feature

Vocabulary Trees

Used in an offline unsupervised (K-means) training stage. Build a hierarchical codebook.


Take information within the clusters into consideration by concatenating the residuals. These high-order representations are usually helpful in tasks of fine-graind recognition.

VLAD Paper (CVPR 2010)

Sparse Coding

Sparse Coding Paper

Much thanks to my course instructor Yadong Mu, Wikipedia, and Google Scholar.

Archives QR Code
QR Code for this page
Tipping QR Code