Preliminaries

Notation

Spaces

  • $ \mathcal{I} $ : Raw Image Space
    • it can be represented as a 3D Tensor $ I \in \mathbb{R}^{w \times h \times c} $ with Channel Dimension $ c = 3 $ assuming an RGB Input
  • $ \mathcal{B} $ : Bounding Box Space
    • each element $ b \in \mathcal{B} $ has the form $ b = (u,v,w,h) $ and identifies a group of pixels in the image $ I $
    • $ B = \{ b_{i} \}_{i=1,...,N} $ Group of Bounding Boxes
  • $ I \in \mathcal{I} $ represents a generic image
  • $ I_{B} \subset I $ represents a Bounding Box applied to $ I $ according to $ B $
  • $ \mathcal{S} $ : Latent Image Space
    • it results from the CNN Processing and typically denotes the 2D Spatial Tensor, with a certain Channel Depth, obtained after all the Convolutional Processing (Convolutions + Non-Linearities e.g. ReLU + Spatial Reduction Operators e.g. MaxPooling), just before it gets transformed into the $ d^{(bottleneck)} $ Bottleneck Feature Descriptor
  • $ \mathcal{D} $ : Bottleneck Feature Space
    • it is typically a $ d $ Dimensional Space $ \mathbb{R}^{d} $
  • $ \mathcal{L} $ : Label Space
    • it is typically a finite set of semantic labels

Functions

  • $ f^{(ROI)} : \mathbb{R}^{w \times h \times c} \times \mathcal{B} \rightarrow \mathbb{R}^{w' \times h' \times c} \qquad w' < w \quad h' < h $ : Gets a Spatial ROI from a Tensor Space (the Channel Dimension is kept the same)
    • $ f^{(ROI)}(I; b) $ applies to the Input Tensor $ I \in \mathbb{R}^{w \times h \times c} $ the ROI identified by $ b \in \mathcal{B} $, with $ b = ( u_{0}, v_{0}, w', h' ) $
  • $ f^{(CNN)} : \mathcal{I} \rightarrow \mathcal{S} $ : Performs the CNN Processing to compute the Latent Representation
  • $ f^{(Cl)} : \mathcal{D} \rightarrow \mathcal{L} $ : Performs the Classification starting from some Latent Representation, typically consisting of a 1D Tensor of some fixed length (Bottleneck Feature)
  • $ f^{(MaxPooling)} : \mathbb{R}^{w \times h} \rightarrow \mathbb{R} $ : Represents the Spatial Max Pooling Operator
    • it is defined as

$$ f^{(MaxPooling)}(R) = \max_{i=1,...,w, \; j=1,...,h} R(i,j) $$
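The two spatial operators above can be illustrated with a minimal pure-Python sketch. It assumes a single-channel map stored as a list of rows (so an element is addressed as `R[v][u]`); the names `f_roi` and `f_max_pooling` are hypothetical and only mirror the notation of these notes.

```python
def f_roi(tensor, b):
    """Crop the spatial ROI b = (u0, v0, w, h) from a map stored as a list of rows."""
    u0, v0, w, h = b
    # Rows v0 .. v0+h-1, then columns u0 .. u0+w-1 of each selected row.
    return [row[u0:u0 + w] for row in tensor[v0:v0 + h]]


def f_max_pooling(region):
    """Spatial max pooling: reduce a whole 2D region to a single value."""
    return max(value for row in region for value in row)


R = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
patch = f_roi(R, (1, 0, 2, 2))   # columns 1..2 of rows 0..1
print(patch)                     # [[2, 3], [6, 7]]
print(f_max_pooling(patch))      # 7
```

For a multi-channel tensor the same crop would be applied to every channel, which is why the Channel Dimension is left unchanged by the ROI operator.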

RCNN

Main Idea

The RCNN consists of 3 main blocks running sequentially

  1. the Region Proposal $ f^{(RP)} : \mathcal{I} \rightarrow \mathcal{B} $ which, in the original RCNN formulation, relies on the Selective Search Algorithm
  2. the Feature Computation $ f^{(CNN)} : \mathcal{I_{B}} \rightarrow \mathcal{S} $ relying on some CNN Backend (e.g. VGG)
  3. the Classifier $ f^{(Cl)} : \mathcal{D} \rightarrow \mathcal{L} $ which, in its original implementation, relies on a Shallow Classifier like an SVM

Implementation

  1. Compute the Region Proposal Set $ B = \{ b_{i} \}_{i=1,...,N} $
  2. Compute the Latent Representation $ S_{i} = f^{(CNN)}(I_{b_{i}}) $ for each selected BBox
  3. Assign the Semantic Label $ L_{i} = f^{(Cl)}(d_{i}) $ where $ d_{i} \in \mathcal{D} $ is the Bottleneck Feature Descriptor computed from $ S_{i} $
  4. The final result is $ \{ (b_{i}, L_{i}) \}_{i=1,...,N} $
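The sequential structure of these steps can be sketched as a plain loop. This is only an illustration under strong assumptions: the callables `propose_regions`, `crop`, `cnn`, `to_descriptor` and `classify` are hypothetical stand-ins passed as arguments, and the toy versions below replace Selective Search, the CNN backend and the SVM with trivial functions.

```python
def rcnn(image, propose_regions, crop, cnn, to_descriptor, classify):
    """Run the blocks one after the other, once per proposed bounding box."""
    results = []
    for b in propose_regions(image):        # step 1: region proposals B = {b_i}
        s = cnn(crop(image, b))             # step 2: latent representation S_i
        d = to_descriptor(s)                # bottleneck feature descriptor d_i
        results.append((b, classify(d)))    # step 3: semantic label L_i
    return results                          # step 4: list of (b_i, L_i)


# Toy stand-ins (NOT the real components), just to make the sketch runnable.
def toy_proposals(image):
    return [(0, 0, 2, 2), (2, 0, 2, 2)]     # fixed boxes (u, v, w, h)

def toy_crop(image, b):
    u, v, w, h = b
    return [row[u:u + w] for row in image[v:v + h]]

def toy_cnn(patch):
    return patch                             # identity instead of a CNN

def toy_descriptor(s):
    return [x for row in s for x in row]     # flatten to a 1D vector

def toy_classifier(d):
    return "bright" if sum(d) / len(d) > 4 else "dark"


image = [[0, 0, 9, 9],
         [0, 0, 9, 9]]
detections = rcnn(image, toy_proposals, toy_crop, toy_cnn, toy_descriptor, toy_classifier)
print(detections)  # [((0, 0, 2, 2), 'dark'), ((2, 0, 2, 2), 'bright')]
```

Note that the expensive `cnn` call sits inside the loop, so it runs once per region proposal.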

Fast RCNN

Overview

  • Focuses on improving the speed of RCNN

  • Introduces the ROI Pooling Network (Ross Girshick, Apr 2015)

Main Ideas

  • Achieve speed up by sharing the computationally expensive CNN Processing

  • Introduce the $ f^{(ROIP)} $ ROI Pooling Network (not to be confused with the $ f^{(RP)} $ Region Proposal) which is responsible for

    • mapping from Image Space Bounding Box $ I_{b} $ to Latent Space Bounding Box $ S_{b^{(s)}} $
      • it means computing $ b^{(s)} = (u,v,w,h)^{(s)} $ in Latent Space from $ b = (u,v,w,h) $ in Input Space
        • it can be performed in a deterministic way, considering all the spatial reductions performed by Convolutive Processing (Convolutions + Spatial Pooling)
        • it allows computing the ConvMap $ S_{b^{(s)}} $ corresponding to an Input Region Proposal $ I_{b} $
        • however, it is not possible to apply $ f^{(Cl)} $ directly to $ S_{b^{(s)}} $: the latter can have an arbitrary size while the former requires a fixed size input (as the classification is internally performed with fully connected layers). This is handled by the second function performed by the ROI Pooling Network
    • mapping the variable size $ S_{b^{(s)}} $ into a fixed size $ S^{(p)} $ ConvMap by means of further spatial pooling
  • The speed-up is obtained by making the "Feature Computation Path" Mol start from the "Full Image PP" Mol instead of from a "Fixed Size ROI PP" Mol
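The deterministic mapping from an Input Space box $ b $ to a Latent Space box $ b^{(s)} $ can be sketched as follows. This assumes that all the spatial reductions of the backbone collapse into a single integer total stride (the value 16 used below is only an example), and that the box is rounded outwards so that no covered pixel is lost; both the rounding policy and the name `to_latent_box` are assumptions, not something fixed by these notes.

```python
import math


def to_latent_box(b, stride):
    """Map b = (u, v, w, h) in image pixels to (u, v, w, h) in latent cells."""
    u, v, w, h = b
    u_s = u // stride                        # floor of the left edge
    v_s = v // stride                        # floor of the top edge
    u_end = math.ceil((u + w) / stride)      # ceil of the right edge
    v_end = math.ceil((v + h) / stride)      # ceil of the bottom edge
    # Keep at least one latent cell per axis for very small boxes.
    return (u_s, v_s, max(1, u_end - u_s), max(1, v_end - v_s))


b = (32, 48, 112, 80)          # box in Input Space
b_s = to_latent_box(b, 16)     # box in Latent Space, total stride 16
print(b_s)                     # (2, 3, 7, 5)
```

The resulting size still depends on the input box, which is why a further pooling step is needed before classification.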

Implementation

  1. Region Proposal

    • $ f^{(RP)} = f^{(SS)} $ : the Region Proposal is still implemented with Selective Search (a non-trainable approach)
  2. CNN Processing

    • Change $ S_{B} = f^{(CNN)}(I_{B}) $ with $ S=f^{(CNN)}(I) $ and use

Details

ROI Pooling

Starting from $ S_{ b^{(s)} } $, the ROI Pooling Network computes $ S^{(p)} $, the Fixed Size Region Proposal Latent Representation, which can be easily transformed into the fixed size vector passed to $ f^{(Cl)} $ for Classification. It works according to the following algorithm

  • Assumptions
    • Sizes: $ S_{b^{(s)}} $ ConvMap has size $ w^{(s)} \times h^{(s)} \times c^{(s)} $ while the $ S^{(p)} $ Pooled ConvMap has size $ w^{(p)} \times h^{(p)} \times c^{(p)} $ with $ w^{(p)} < w^{(s)} $ and $ h^{(p)} < h^{(s)} $
    • Typically $ S^{(p)} $ is square
  • The subregion sizes $ \{w', h'\} $ are computed as the integer division of $ \{w,h\}^{(s)} $ by $ \{w,h\}^{(p)} $ respectively. The Input ConvMap is thus divided into a set $ \{ R_{i,j} \}_{i=1,...,w^{(p)}, j=1,...,h^{(p)}} $ of mostly equally sized subregions (up to the integer division approximation) so that $ S_{b^{(s)}} = \bigcup_{i=1,...,w^{(p)}, \; j=1,...,h^{(p)}} R_{i,j} $, with a one-to-one relationship between $ R_{i,j} $ and the $ (i,j) $ element of the $ S^{(p)} $ ConvMap
  • Finally Max Pooling is performed on each $ R_{i,j} $, setting the corresponding element in the Pooled ConvMap

$$ S^{(p)}(i,j) = f^{(MaxPooling)}(R_{i,j}) \quad \forall i = 1,...,w^{(p)}, \; j=1,...,h^{(p)} $$
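The pooling rule above can be sketched in pure Python for a single channel. The sketch assumes the region map is a list of rows, that it is at least as large as the pooled output along each axis, and that the last subregion along each axis absorbs the remainder of the integer division; `bin_edges` and `roi_pool` are hypothetical helper names.

```python
def bin_edges(size, bins):
    """Split [0, size) into `bins` intervals; the last one absorbs the remainder."""
    step = size // bins
    edges = [k * step for k in range(bins)]
    edges.append(size)
    return edges


def roi_pool(S_b, w_p, h_p):
    """Max-pool a variable-size 2D map into a fixed h_p x w_p map (list of rows)."""
    h_s, w_s = len(S_b), len(S_b[0])
    cols = bin_edges(w_s, w_p)
    rows = bin_edges(h_s, h_p)
    pooled = []
    for j in range(h_p):                     # vertical bin index
        out_row = []
        for i in range(w_p):                 # horizontal bin index
            region = [row[cols[i]:cols[i + 1]] for row in S_b[rows[j]:rows[j + 1]]]
            out_row.append(max(v for r in region for v in r))
        pooled.append(out_row)               # stored as pooled[j][i]
    return pooled


# A 7-wide, 5-high map whose value at column c, row r is 7 * r + c.
S_b = [[7 * r + c for c in range(7)] for r in range(5)]
print(roi_pool(S_b, 2, 2))   # [[9, 13], [30, 34]]
```

Whatever the size of the input map, the output always has $ w^{(p)} \times h^{(p)} $ elements, which is what the classifier requires. With several channels the same pooling would be repeated independently per channel.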

Figure: ROI Pooling 1

  • The $ S $ Full Latent ConvMap

Figure: ROI Pooling 3

  • The $ S_{b^{(s)}} $ Region Proposal ConvMap with size $ w^{(s)}=7, h^{(s)}=5 $

Figure: ROI Pooling 5

  • Considering $ S^{(p)} $ has $ w^{(p)}=2, h^{(p)}=2 $, 4 Regions $ R_{i,j} $ are needed (indexed from 0 here). With the integer divisions $ w^{(s)} / w^{(p)} = 3 $ and $ h^{(s)} / h^{(p)} = 2 $, and the last Region along each axis absorbing the remainder, the association is
    • $ R_{0,0} = f^{(ROI)}( S_{b^{(s)}}; 0,0,3,2 ) $
    • $ R_{1,0} = f^{(ROI)}( S_{b^{(s)}}; 3,0,4,2 ) $
    • $ R_{0,1} = f^{(ROI)}( S_{b^{(s)}}; 0,2,3,3 ) $
    • $ R_{1,1} = f^{(ROI)}( S_{b^{(s)}}; 3,2,4,3 ) $
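The association listed above can be reproduced with a few lines of Python. The sketch assumes 0-based indices $ (i,j) $, with $ i $ along the width and $ j $ along the height, and the same convention as the example (the last Region along each axis takes the remainder); `roi_bins` is a hypothetical helper name.

```python
def roi_bins(w_s, h_s, w_p, h_p):
    """Return {(i, j): (u, v, w, h)} for every subregion R_{i,j}."""
    w_step, h_step = w_s // w_p, h_s // h_p   # integer divisions
    bins = {}
    for j in range(h_p):
        for i in range(w_p):
            u, v = i * w_step, j * h_step
            # The last bin along each axis extends to the border of the map.
            w = w_s - u if i == w_p - 1 else w_step
            h = h_s - v if j == h_p - 1 else h_step
            bins[(i, j)] = (u, v, w, h)
    return bins


bins = roi_bins(7, 5, 2, 2)
for (i, j), b in sorted(bins.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(f"R_{i},{j} -> {b}")
# R_0,0 -> (0, 0, 3, 2)
# R_1,0 -> (3, 0, 4, 2)
# R_0,1 -> (0, 2, 3, 3)
# R_1,1 -> (3, 2, 4, 3)
```

The four boxes tile the whole $ 7 \times 5 $ Region Proposal ConvMap without overlap: their areas (6 + 8 + 9 + 12) add up to 35.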