The internals of "gist" features

A. Oliva, A. Torralba (IJCV 2001) - Modeling the shape of the scene: a holistic representation of the spatial envelope.
Introduction
Proposed by Aude Oliva and Antonio Torralba, the "gist" features (aka "holistic representation of the spatial envelope") are simple to understand and to compute, yet they yield state-of-the-art performance in scene recognition using very few dimensions (here 512).
Let's play with them and explore the internals of the implementation provided by the authors!
Dataset and source code
The dataset and the original source code are available on the paper's website.
Modified gist source code (MATLAB): gist_NP.tar.gz and svm_v251.tar.gz.
V1-like source code (Python): v1s-0.0.4_scene.tar.gz
Four "gist" variants
One may want to check whether the available implementation is "stable" under a few minor implementation changes that (hopefully) still make sense in the context of scene recognition. Here I'll consider four variants of the gist features, obtained by adding two new parameters, and treat them as neural representations:
- The first one controls the non-linear activation of the local transfer functions (you can also see them as neurons processing their weighted synaptic integration). This parameter can take two values: the original "abs" for absolute values and a new one named "clip" for thresholded values (clipping negative values to zero). The "abs" activation represents the pooling of two "contrast-reversed" functions, whereas the "clip" activation pools only one and treats negative values as pure inhibition, thus eliciting no response (i.e. zero). In both cases, these neurons (i.e. functions) can't "fire" (i.e. respond) negatively.
- The second one will implement some kind of local competition between the outputs by including a post-filtering normalization. This parameter can either be true or false.
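The two parameters above can be sketched in a few lines of Python. This is only an illustration, not the code shipped in gist_NP.tar.gz: the function names are hypothetical, and the post-filtering step is assumed here to be a divisive normalization by the L2 norm across filter outputs, which may differ from what the modified MATLAB code actually does.

```python
import math

def activate(responses, mode="abs"):
    """Apply the non-linear activation to a list of linear filter responses."""
    if mode == "abs":
        # |x| = max(x, 0) + max(-x, 0): pools a filter and its contrast-reversed twin
        return [max(x, 0.0) + max(-x, 0.0) for x in responses]
    if mode == "clip":
        # negative responses are treated as inhibition and produce no output
        return [max(x, 0.0) for x in responses]
    raise ValueError("mode must be 'abs' or 'clip'")

def postfilter_normalize(activations, eps=1e-8):
    """Assumed normalization: divide each output by the L2 norm across filters."""
    norm = math.sqrt(sum(a * a for a in activations))
    return [a / (norm + eps) for a in activations]

# made-up filter outputs at one image location
responses = [0.5, -2.0, 1.5, -0.25]
for mode in ("abs", "clip"):
    out = activate(responses, mode)
    print(mode, out, postfilter_normalize(out))
```

In both modes the output is non-negative, which matches the remark that these units cannot respond negatively.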
"vanilla" gist
This variant uses "abs" as the non-linear transfer function and no post-filtering normalization.
Command:
"vanilla" gist + post-filtering (or "neural competition")
This variant uses "abs" as the non-linear transfer function and a post-filtering normalization.
Command:
thresholded gist
This variant uses "clip" as the non-linear transfer function and no post-filtering normalization.
Command:
thresholded gist + post-filtering (or "neural competition")
This variant uses "clip" as the non-linear transfer function and a post-filtering normalization.
Command:
Scene Recognition Performance
The performance of the four gist variants on the 8 outdoor scene categories dataset is reported as the mean and standard error of the mean (SEM) over a 10-trial random subsampling cross-validation scheme. In addition, I included the performance of V1-like [Pinto et al. PLoS 2008, Pinto et al. ECCV/LFW 2008], a model of primary visual cortex (i.e. simple cells) that shares many properties with gist: input normalization (i.e. pre-filtering), filtering with a bank of Gabor filters, non-linear activation (i.e. threshold and saturation), output normalization (i.e. post-filtering), pooling of neighboring cells and SVM classification. The V1-like model differs in the implementation details of these operations, uses a linear SVM (gist uses an RBF kernel) and uses many more filters. As a consequence, V1-like has a huge output dimensionality (more than 80,000 features) compared to gist (512). We'll see whether that helps in the context of scene recognition.
To reproduce these results using gist under MATLAB:
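As a rough illustration of how the mean and SEM are obtained from repeated random splits, here is a minimal Python sketch. The split sizes and the accuracy values are made up for the example, and the helper names are hypothetical; the actual protocol is the one implemented in the MATLAB and V1-like code.

```python
import math
import random
import statistics

def random_split(items, n_train, rng):
    """Randomly split items into a training part and a test part."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def mean_and_sem(accuracies):
    """Mean and standard error of the mean over independent trials."""
    mean = statistics.mean(accuracies)
    # SEM = sample standard deviation / sqrt(number of trials)
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, sem

rng = random.Random(0)
# hypothetical sizes: 20 images, 15 of them used for training
train, test = random_split(range(20), n_train=15, rng=rng)

# made-up accuracies (in %) standing in for the 10 trials
trial_accuracies = [82.0, 83.5, 82.8, 83.1, 82.4, 83.9, 82.6, 83.0, 82.9, 83.3]
mean, sem = mean_and_sem(trial_accuracies)
print(f"{mean:.2f} +/- {sem:.2f}")
```

Each trial would draw a fresh split, train the classifier on the training part, and record the accuracy on the test part.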
Use the following procedure to reproduce the results using V1-like in Python.
First you need to convert the dataset:
#!/bin/bash
# Execute this script in the directory that contains the dataset
# spatial_envelope_256x256_static_8outdoorcategories to convert it
# to the v1s format (one subdirectory per category).
indir=spatial_envelope_256x256_static_8outdoorcategories
outdir=${indir}_v1sformat
mkdir -p "$outdir"
for f in "$indir"/*.jpg; do
    # the category is the part of the file name before the first underscore
    name=$(basename "$f")
    category=${name%%_*}
    mkdir -p "$outdir/$category"
    cp -va "$f" "$outdir/$category/"
done
Then you can execute the following commands:
for V1-like:
for V1-like+:
Performance on the 8 outdoor scene categories dataset (mean and SEM over 10 trials).
| Model | gist / abs, no postfilt | gist / abs, postfilt | gist / clip, no postfilt | gist / clip, postfilt | V1-like | V1-like+ |
|---|---|---|---|---|---|---|
| Mean | 82.91 | 84.44 | 82.21 | 84.44 | 82.25 | 84.79 |
| SEM | 0.24 | 0.16 | 0.20 | 0.16 | 0.16 | 0.21 |
As we can see, the performance of the four gist variants doesn't change much, although adding the post-filtering (i.e. output normalization or neural competition) seems to help (84.44 with post-filtering versus 82.91 and 82.21 without). The V1-like models, despite their huge dimensionality, don't perform significantly better than the gist features on this scene category recognition task. It is not clear whether having so many dimensions helps, as it may promote overfitting and poor generalization. Overall, we see that the same class of models (input normalization, non-linear Gabor filtering, output normalization, downsampling, or basically normalized non-linear wavelets) performs at comparable levels, and that the details of the implementation don't really matter as long as the model represents the spatial envelope of scenes properly.
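One rough way to read the table is to express the gap between two means in units of their combined standard error. The sketch below does this for two pairs of columns, using the values from the table. It assumes independent trials and roughly normal errors, so it is a back-of-the-envelope check rather than a proper significance test, and the function name is hypothetical.

```python
import math

def sem_gap(mean_a, sem_a, mean_b, sem_b):
    """Difference between two means in units of the combined standard error."""
    return (mean_a - mean_b) / math.sqrt(sem_a ** 2 + sem_b ** 2)

# (mean, SEM) pairs copied from the table above
gist_abs = (82.91, 0.24)
gist_abs_postfilt = (84.44, 0.16)
v1_like_plus = (84.79, 0.21)

# effect of post-filtering on "vanilla" gist
print(round(sem_gap(*gist_abs_postfilt, *gist_abs), 1))      # 5.3
# V1-like+ versus the best gist variant
print(round(sem_gap(*v1_like_plus, *gist_abs_postfilt), 1))  # 1.3
```

Under these assumptions, the post-filtering gain is several standard errors wide, while the gap between V1-like+ and gist with post-filtering is only a little over one.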
Internals
In this section, I present average images of some of the "vanilla" gist internals. We can then see if they reveal some interesting structures depending on the scene category (very coarsely).
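For reference, a per-category average image is just a pixel-wise mean over all images (or internal maps) sharing a label. The following Python sketch shows the idea on tiny made-up grayscale images stored as nested lists; the function names are hypothetical and this is not the MATLAB code used to produce the figures.

```python
def average_image(images):
    """Pixel-wise mean of equally sized grayscale images (lists of rows)."""
    if not images:
        raise ValueError("need at least one image")
    height, width = len(images[0]), len(images[0][0])
    total = [[0.0] * width for _ in range(height)]
    for img in images:
        for y in range(height):
            for x in range(width):
                total[y][x] += img[y][x]
    return [[v / len(images) for v in row] for row in total]

def average_by_category(labelled_images):
    """Group (category, image) pairs by category and average each group."""
    groups = {}
    for category, img in labelled_images:
        groups.setdefault(category, []).append(img)
    return {category: average_image(imgs) for category, imgs in groups.items()}

# made-up 2x2 "images" for two of the eight categories
data = [
    ("coast", [[0.0, 2.0], [4.0, 6.0]]),
    ("coast", [[2.0, 4.0], [6.0, 8.0]]),
    ("forest", [[1.0, 1.0], [1.0, 1.0]]),
]
for category, avg in average_by_category(data).items():
    print(category, avg)
```

The same averaging applies at any stage of the pipeline (input, pre-filtered image, or filter activation maps), as long as the maps have the same size within a category.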
To save the internal representations use the following command in MATLAB (the last parameter has to be true):
input
[Figures: average input image for each category: coast, forest, highway, insidecity, mountain, opencountry, street, tallbuilding]
pre-filtering
[Figures: average pre-filtered image for each category: coast, forest, highway, insidecity, mountain, opencountry, street, tallbuilding]
filtering + non-linear activation
[Figures: average filtering + non-linear activation output for each category: coast, forest, highway, insidecity, mountain, opencountry, street, tallbuilding]
References
- Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision.
- Pinto N*, Cox DD*, DiCarlo JJ (2008) Why is Real-World Visual Object Recognition Hard? PLoS Computational Biology 4(1): e27. doi:10.1371/journal.pcbi.0040027
- Pinto N, DiCarlo JJ, Cox DD (2008) Establishing Good Benchmarks and Baselines for Face Recognition. ECCV 2008 Faces in 'Real-Life' Images Workshop.