Question

I'm having trouble understanding the difference between equivariant to translation and invariant to translation.

In the book Deep Learning (I. Goodfellow, Y. Bengio, and A. Courville, MIT Press, 2016), one can find the following about convolutional networks:

  • [...] the particular form of parameter sharing causes the layer to have a property called equivariance to translation
  • [...] pooling helps to make the representation become approximately invariant to small translations of the input

Is there any difference between them, or are the terms used interchangeably?


Solution

Equivariance and invariance are sometimes used interchangeably. As pointed out by @Xi'an, you can find such uses in the statistical literature, for instance in the notions of invariant estimators, especially the Pitman estimator.

However, I would like to mention that it would be better if both terms were kept separate, as the prefix "in-" in invariant is privative (meaning "no variance" at all), while "equi-" in equivariant refers to "varying in a similar or equivalent proportion". In other words, one does not move, the other does.

Let us start from simple image features, and suppose that image $I$ has a unique maximum $m$ at spatial pixel location $(x_m,y_m)$, which here is the main classification feature. In other words: an image and all its translations are "the same". An interesting property of classifiers is their ability to classify in the same manner distorted versions $I'$ of $I$, for instance translations by all vectors $(u,v)$.

The maximum value $m'$ of $I'$ is invariant: $m'=m$, the value is the same. Its location, however, will be at $(x'_m,y'_m)=(x_m-u,y_m-v)$, and is equivariant, meaning that it varies "equally" with the distortion.
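
A minimal numeric sketch of this example (assuming NumPy, with circular shifts via np.roll standing in for translations):

```python
import numpy as np

I = np.zeros((5, 5))
I[1, 2] = 7.0                                   # unique maximum m at (x_m, y_m) = (1, 2)

Ip = np.roll(I, shift=(2, 1), axis=(0, 1))      # translated image I'

print(I.max(), Ip.max())                        # 7.0 7.0 -> the value m is invariant
print(np.unravel_index(I.argmax(), I.shape))    # location of m in I:  (1, 2)
print(np.unravel_index(Ip.argmax(), Ip.shape))  # location of m in I': (3, 3) -> moves with the shift
```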

The precise mathematical formulations of equivariance depend on the objects and transformations one considers, so I prefer here the notion that is most often used in practice (and I may get the blame from a theoretical standpoint).

Here, translations (or some more generic action) can be equipped with the structure of a group $G$, $g$ being one specific translation operator. A function or feature $f$ is invariant under $G$ if for all images in a class, and for any $g$, $$f(g(I)) = f(I)\,.$$

It becomes equivariant if there exists another mathematical structure or action (often a group) $G'$ that reflects the transformations in $G$ in a meaningful way. In other words, such that for each $g$, there exists a unique $g' \in G'$ such that

$$f(g(I)) = g'(f(I))\,.$$

In the above example on the group of translations, $g$ and $g'$ are the same (and hence $G'=G$): an integer translation of the image reflects as the exact same translation of the maximum location.
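
This can be checked numerically; here is a small sketch assuming NumPy/SciPy, with circular shifts so that boundary effects do not break the equality:

```python
import numpy as np
from scipy.ndimage import correlate

rng = np.random.default_rng(0)
I = rng.random((8, 8))                                   # toy image
k = rng.random((3, 3))                                   # toy filter

f = lambda img: correlate(img, k, mode='wrap')           # feature map (correlation)
g = lambda img: np.roll(img, shift=(2, 3), axis=(0, 1))  # translation by (2, 3)

# equivariance with g' = g: translating the input translates the feature map
assert np.allclose(f(g(I)), g(f(I)))
```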

Another common definition is:

$$f(g(I)) = g(f(I))\,.$$

However, I used potentially different $G$ and $G'$ because sometimes $f(I)$ and $g(I)$ are not in the same domain. This happens for instance in multivariate statistics (see e.g. Equivariance and invariance properties of multivariate quantile and related functions, and the role of standardisation). But here, the uniqueness of the mapping between $g$ and $g'$ allows one to get back to the original transformation $g$.
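
A concrete, hypothetical illustration of $G' \neq G$: a stride-2 feature map lives on a coarser grid than the input, so shifting the input by 2 corresponds to shifting the output by only 1 (this sketch assumes circular boundaries and sizes divisible by the stride):

```python
import numpy as np
from scipy.ndimage import correlate

rng = np.random.default_rng(1)
x = rng.random(16)
k = rng.random(3)

f  = lambda x: correlate(x, k, mode='wrap')[::2]  # stride-2 correlation: coarser output domain
g  = lambda x: np.roll(x, 2)                      # g in G: shift the input by 2
gp = lambda y: np.roll(y, 1)                      # g' in G': shift the output by 1

assert np.allclose(f(g(x)), gp(f(x)))
```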

Often, people use the term invariance because the concept of equivariance is unknown to them, or because everybody else uses invariance and equivariance would seem more pedantic.

For the record, other related notions (especially in maths and physics) are termed covariance, contravariance, and differential invariance.

In addition, translation invariance, at least approximate, or in envelope, has been a quest for several signal and image processing tools. Notably, multi-rate (filter-bank) and multi-scale (wavelet or pyramid) transformations have been designed over the past 25 years, for instance under the names of shift-invariant, cycle-spinning, stationary, complex, and dual-tree wavelet transforms (for a review on 2D wavelets, see A panorama on multiscale geometric representations). The wavelets can absorb a few discrete scale variations. All these (approximate) invariances often come at the price of redundancy in the number of transformed coefficients, but they are more likely to yield shift-invariant, or shift-equivariant, features.

OTHER TIPS

The terms are different:

  • Equivariant to translation means that a translation of input features results in an equivalent translation of outputs. So if your pattern 0,3,2,0,0 on the input results in 0,1,0,0 in the output, then the pattern 0,0,3,2,0 might lead to 0,0,1,0.

  • Invariant to translation means that a translation of input features does not change the outputs at all. So if your pattern 0,3,2,0,0 on the input results in 0,1,0 in the output, then the pattern 0,0,3,2,0 would also lead to 0,1,0 (a minimal sketch of both properties follows this list).
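
A tiny NumPy sketch of both bullets, using a hypothetical detector for the sub-pattern 3,2 (the concrete output values differ from the illustrative ones above):

```python
import numpy as np

x1 = np.array([0, 3, 2, 0, 0])
x2 = np.array([0, 0, 3, 2, 0])                 # x1 translated right by one step

k = np.array([3, 2])                           # hypothetical pattern detector
detect = lambda x: np.correlate(x, k, mode='valid')

print(detect(x1))                              # [ 6 13  6  0] -> peak at index 1
print(detect(x2))                              # [ 0  6 13  6] -> peak at index 2 (equivariant)
print(detect(x1).max(), detect(x2).max())      # 13 13 -> a global max is invariant
```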

For feature maps in convolutional networks to be useful, they typically need both properties in some balance. Equivariance allows the network to generalise edge, texture and shape detection to different locations. Invariance allows the precise location of the detected features to matter less. These are two complementary types of generalisation for many image processing tasks.

Just adding my 2 cents

Consider an image classification task solved with a typical CNN architecture, consisting of a Backend (Convolutions + NL + possibly Spatial Pooling), which performs Representation Learning, and a Frontend (e.g. Fully Connected Layers, MLP), which solves the specific task, in this case image classification. The idea is to build a function $ f : I \rightarrow L $ able to map from the Spatial Domain $ I $ (Input Image) to the Semantic Domain $ L $ (Label Set) in a 2-step process:

  • Backend (Representation Learning) : $ f_1 : I \rightarrow \mathcal{L} $ maps the Input to the Latent Semantic Space
  • Frontend (Task Specific Solver) : $ f_2 : \mathcal{L} \rightarrow L $ maps from the Latent Semantic Space to the Final Label Space

so that $ f = f_2 \circ f_1 $. This decomposition relies on the following properties:

  • spatial equivariance, regarding the ConvLayer (Spatial 2D Convolution + NonLin, e.g. ReLU): a shift in the Layer Input produces a shift in the Layer Output (note: this is about the Layer, not the single Convolution Operator)
  • spatial invariance, regarding the Pooling Operator (e.g. Max Pooling passes on the max value in its receptive field regardless of its spatial position); a minimal sketch follows this list
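
A minimal sketch of the pooling-induced invariance (hypothetical 1D example, assuming NumPy): a one-step shift of the input leaves the max-pooled output unchanged.

```python
import numpy as np

def max_pool1d(x, size=2):
    # non-overlapping max pooling over windows of the given size
    return x.reshape(-1, size).max(axis=1)

a = np.array([0, 0, 9, 0, 0, 0, 0, 0])
b = np.roll(a, 1)               # small translation of the input

print(max_pool1d(a))            # [0 9 0 0]
print(max_pool1d(b))            # [0 9 0 0] -> unchanged: approximate invariance
```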

The closer to the input layer, the closer to the purely spatial domain $ I $, and the more important the spatial equivariance property, which allows building a spatially equivariant, hierarchical, increasingly semantic representation.

The closer to the frontend, the closer to the latent, purely semantic domain $ \mathcal{L} $, and the more important the spatial invariance, as the specific meaning of the image should be independent of the spatial positions of the features.
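
A minimal PyTorch sketch of this Backend/Frontend split (layer sizes are hypothetical, assuming 28x28 single-channel inputs and 10 classes):

```python
import torch.nn as nn

backend = nn.Sequential(                        # representation learning: I -> latent space
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),  # equivariant feature extraction
    nn.MaxPool2d(2),                            # local invariance to small shifts
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)
frontend = nn.Sequential(                       # task-specific solver: latent -> label space
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                  # 28x28 pooled twice -> 7x7 spatial grid
)
model = nn.Sequential(backend, frontend)
```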

Using fully connected layers in the frontend makes the classifier sensitive to feature position to some extent, depending on the backend structure: the deeper the backend, and the more translation-invariant operators (Pooling) it uses, the less sensitive the classifier becomes.

It has been shown in Quantifying Translation-Invariance in Convolutional Neural Networks that, to improve the Translation Invariance of a CNN Classifier, it is more effective to act on the dataset bias (data augmentation) than on the inductive bias (architecture, hence depth, pooling, …).
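
For instance, acting on the dataset bias could look like the following sketch, assuming torchvision (the 10% shift range is an arbitrary choice):

```python
from torchvision import transforms

# random-translation augmentation: the model sees randomly shifted copies of each image
augment = transforms.Compose([
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
])
```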

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange