# Preliminaries

## Notation

### Spaces

• $\mathcal{I}$ : Raw Image Space
  • an image can be represented as a 3D Tensor $I \in \mathbb{R}^{w \times h \times c}$, with Channel Dimension $c = 3$ assuming an RGB Input
• $\mathcal{B}$ : Bounding Box Space
  • each element $b \in \mathcal{B}$ has the form $b = (u,v,w,h)$, with origin $(u,v)$ and size $w \times h$, and identifies a group of pixels in the image $I$
  • $B = \{ b_{i} \}_{i=1,\dots,N}$ : Group of Bounding Boxes
  • $I \in \mathcal{I}$ represents a generic image
  • $I_{B} \subset I$ represents the Bounding Boxes applied to $I$ according to $B$
• $\mathcal{S}$ : Latent Image Space
  • it results from the CNN Processing: it is typically the 2D Spatial Tensor, with a certain Channel Depth, obtained after all the Convolutive Processing (Convolutions + NonLin e.g. ReLU + Spatial Reduction Operators e.g. MaxPooling), just before it gets transformed into the $d^{(bottleneck)}$ Bottleneck Feature Descriptor
• $\mathcal{D}$ : Bottleneck Feature Space
  • it is typically a $d$ Dimensional Space $\mathbb{R}^{d}$
• $\mathcal{L}$ : Label Space
  • it is typically a finite set of semantic labels

### Functions

• $f^{(ROI)} : \mathbb{R}^{w \times h \times c} \times \mathcal{B} \rightarrow \mathbb{R}^{w' \times h' \times c}$ with $w' < w$ and $h' < h$ : Gets a Spatial ROI from a Tensor Space (the Channel Dimension is kept the same)
  • $f^{(ROI)}(I; b)$ applies to the Input Tensor $I \in \mathbb{R}^{w \times h \times c}$ the ROI identified by $b \in \mathcal{B}$, with $b = ( u_{0}, v_{0}, w', h' )$
• $f^{(CNN)} : \mathcal{I} \rightarrow \mathcal{S}$ : Performs the CNN Processing to compute the Latent Representation
• $f^{(Cl)} : \mathcal{D} \rightarrow \mathcal{L}$ : Performs the Classification starting from the Bottleneck Feature, typically a 1D Tensor of some fixed length
• $f^{(MaxPooling)} : \mathbb{R}^{w \times h} \rightarrow \mathbb{R}$ : Represents the Spatial Max Pooling Operator
  • it is defined as

$$f^{(MaxPooling)}(R) = \max_{\substack{i=1,\dots,w \\ j=1,\dots,h}} R(i,j)$$
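The two operators $f^{(ROI)}$ and $f^{(MaxPooling)}$ can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `f_roi` and `f_max_pooling` are hypothetical, and it assumes that the first array axis follows $u$ (width) and the second follows $v$ (height), matching the $w \times h \times c$ ordering used above.

```python
import numpy as np

def f_roi(tensor, box):
    """Crop the spatial ROI b = (u, v, w, h) from a (w, h, c) tensor.

    Assumption: axis 0 is indexed by u and axis 1 by v; channels are untouched.
    """
    u, v, w, h = box
    return tensor[u:u + w, v:v + h, ...]

def f_max_pooling(region):
    """Reduce a 2D (w, h) region to its single largest value."""
    return region.max()

# Toy single-channel "image" with w=4, h=3, c=1 and values 0..11
image = np.arange(12, dtype=float).reshape(4, 3, 1)
patch = f_roi(image, (1, 0, 2, 2))     # 2 x 2 spatial crop, channel kept
peak = f_max_pooling(patch[..., 0])    # max over the cropped region
```

Here `patch` keeps the channel axis, as in the definition of $f^{(ROI)}$, while `f_max_pooling` is applied to one channel at a time.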

# RCNN

## Main Idea

The RCNN consists of 3 main blocks running sequentially:

1. the Region Proposal $f^{(RP)} : \mathcal{I} \rightarrow \mathcal{B}$, which in the original RCNN formulation relies on the Selective Search Algorithm
2. the Feature Computation $f^{(CNN)} : \mathcal{I}_{B} \rightarrow \mathcal{S}$, relying on some CNN Backend (e.g. VGG)
3. the Classifier $f^{(Cl)} : \mathcal{D} \rightarrow \mathcal{L}$, which in the original implementation relies on a Shallow Classifier like SVM

## Implementation

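1. Compute the Region Proposal Set $B = \{ b_{i} \}_{i=1,\dots,N}$
2. Compute the Latent Representation $S_{i} = f^{(CNN)}(I_{b_{i}})$ for each selected BBox
3. Assign the Semantic Label $L_{i} = f^{(Cl)}(d_{i})$, where $d_{i} \in \mathcal{D}$ is the Bottleneck Feature Descriptor obtained from $S_{i}$
4. The final result is $\{ (b_{i}, L_{i}) \}_{i=1,\dots,N}$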
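The four steps above can be sketched as a short loop. This is a minimal sketch, not the original implementation: `toy_propose`, `toy_cnn` and `toy_classify` are hypothetical stand-ins for Selective Search, the CNN Backend and the SVM, and the warping of each crop to a fixed input size is left out for brevity.

```python
import numpy as np

def rcnn(image, propose, cnn, classify):
    """Run the three RCNN blocks in sequence; the CNN runs once per proposal."""
    results = []
    for (u, v, w, h) in propose(image):              # 1. region proposals b_i
        crop = image[u:u + w, v:v + h, :]            #    I_{b_i}: crop of the image
        s = cnn(crop)                                # 2. latent representation S_i
        d = np.asarray(s).reshape(-1)                #    flattened descriptor d_i
        results.append(((u, v, w, h), classify(d)))  # 3. semantic label L_i
    return results                                   # 4. list of (b_i, L_i)

# Hypothetical stand-ins, only to make the sketch runnable
def toy_propose(image):
    return [(0, 0, 2, 2), (2, 2, 2, 2)]

def toy_cnn(crop):
    return crop.mean(axis=(0, 1))                    # one value per channel

def toy_classify(d):
    return "bright" if d.mean() > 0.5 else "dark"

image = np.zeros((4, 4, 3))
image[2:, 2:, :] = 1.0                               # bright bottom-right corner
detections = rcnn(image, toy_propose, toy_cnn, toy_classify)
```

The loop makes the cost structure visible: `cnn` is called once for every proposal, which is the expense that Fast RCNN removes.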

# Fast RCNN

## Overview

• Focuses on making RCNN faster

• Introduces the ROI Pooling Network (Ross Girshick, Apr 2015)

## Main Ideas

• Achieve a speed-up by sharing the computationally expensive CNN Processing

• Introduce the $f^{(ROIPool)}$ ROI Pooling Network (written with its own symbol, since $f^{(RP)}$ already denotes the Region Proposal), which is responsible for

  • mapping from the Image Space Bounding Box $I_{b}$ to the Latent Space Bounding Box $S_{b^{(s)}}$
    • this means computing $b^{(s)} = (u,v,w,h)^{(s)}$ in Latent Space from $b = (u,v,w,h)$ in Input Space
    • it can be performed in a deterministic way, considering all the spatial reductions performed by the Convolutive Processing (Convolutions + Spatial Pooling)
    • it allows computing the ConvMap $S_{b^{(s)}}$ corresponding to an Input Region Proposal $I_{b}$
    • however, it is not possible to apply $f^{(Cl)}$ directly to $S_{b^{(s)}}$: the latter can have any size, while the former requires a fixed size input (the classification is internally performed with fully connected layers). This is handled by the second function of the ROI Pooling Network
  • mapping the variable size $S_{b^{(s)}}$ into a fixed size $S^{(p)}$ ConvMap by means of further spatial pooling
• The sharing is obtained by making the "Feature Computation Path" Mol start from the "Full Image PP" Mol instead of from a "Fixed Size ROI PP" Mol
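The deterministic mapping from $b$ to $b^{(s)}$ can be illustrated with a small helper. This is a sketch under explicit assumptions: the whole Convolutive Processing is summarised by a single cumulative `stride` (16 is used here, as would result from four stride-2 poolings with size-preserving convolutions), and the rounding rule (floor for the origin, ceil for the far corner) is one possible convention rather than the one prescribed by the method. The name `to_latent_box` is hypothetical.

```python
import math

def to_latent_box(box, stride):
    """Map b = (u, v, w, h) in Input Space to b^(s) in Latent Space.

    Assumption: the network reduces both spatial axes by the same total stride.
    """
    u, v, w, h = box
    u_s = math.floor(u / stride)
    v_s = math.floor(v / stride)
    # Round the far corner up so the latent box still covers the input box
    w_s = max(1, math.ceil((u + w) / stride) - u_s)
    h_s = max(1, math.ceil((v + h) / stride) - v_s)
    return (u_s, v_s, w_s, h_s)

latent_box = to_latent_box((64, 32, 112, 80), stride=16)
```

With these numbers the latent box has size $7 \times 5$, so different proposals generally give ConvMaps of different sizes, which is why the second pooling step is needed.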

## Implementation

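1. Region Proposal

   • $f^{(RP)} = f^{(SS)}$ : the Region Proposal is still implemented with Selective Search (a non-trainable approach)
2. CNN Processing

   • Replace the per-region computation $S_{B} = f^{(CNN)}(I_{B})$ with the single full-image computation $S=f^{(CNN)}(I)$ and use ROI Pooling to extract each region's Latent Representation from the shared $S$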

## Details

### ROI Pooling

Starting from $S_{ b^{(s)} }$, the ROI Pooling Network computes $S^{(p)}$, the Fixed Size Latent Representation of the Region Proposal, which can easily be transformed into the fixed size vector passed to $f^{(Cl)}$ for Classification. It proceeds according to the following algorithm

• Assumptions
  • Sizes: the $S_{b^{(s)}}$ ConvMap has size $w^{(s)} \times h^{(s)} \times c^{(s)}$ while the $S^{(p)}$ Pooled ConvMap has size $w^{(p)} \times h^{(p)} \times c^{(p)}$ with $w^{(p)} < w^{(s)}$ and $h^{(p)} < h^{(s)}$ (the pooling acts on each channel separately, so $c^{(p)} = c^{(s)}$)
  • Typically $S^{(p)}$ is square
• The subregion size $\{w', h'\}$ is computed by integer division of $\{w,h\}^{(s)}$ by $\{w,h\}^{(p)}$ respectively, so the Input ConvMap gets divided into a set $\{ R_{i,j} \}_{i=1,\dots,w^{(p)};\, j=1,\dots,h^{(p)}}$ of mostly equally sized subregions (up to the integer division approximation), such that $S_{b^{(s)}} = \bigcup_{i=1,\dots,w^{(p)};\, j=1,\dots,h^{(p)}} R_{i,j}$, with a one-to-one relationship between $R_{i,j}$ and the element $(i,j)$ of the $S^{(p)}$ ConvMap
• Finally, Max Pooling is performed on each subregion to set the corresponding element of the Pooled ConvMap

$$S^{(p)}(i,j) = f^{(MaxPooling)}(R_{i,j}) \quad \forall\, i = 1,\dots,w^{(p)},\; j=1,\dots,h^{(p)}$$

• The $S$ Full Latent ConvMap

• The $S_{b^{(s)}}$ Region Proposal ConvMap with size $w^{(s)}=7, h^{(s)}=5$

• Considering $S^{(p)}$ has $w^{(p)}=2, h^{(p)}=2$, 4 Regions $R_{i,j}$ are needed. With the integer divisions $\lfloor w^{(s)} / w^{(p)} \rfloor = \lfloor 7/2 \rfloor = 3$ and $\lfloor h^{(s)} / h^{(p)} \rfloor = \lfloor 5/2 \rfloor = 2$, the association (using zero-based indices, with the last region along each axis absorbing the remainder) is
  • $R_{0,0} = f^{(ROI)}( S_{b^{(s)}}; 0,0,3,2 )$
  • $R_{1,0} = f^{(ROI)}( S_{b^{(s)}}; 3,0,4,2 )$
  • $R_{0,1} = f^{(ROI)}( S_{b^{(s)}}; 0,2,3,3 )$
  • $R_{1,1} = f^{(ROI)}( S_{b^{(s)}}; 3,2,4,3 )$
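The ROI Pooling algorithm and the $7 \times 5 \rightarrow 2 \times 2$ example can be reproduced with the following NumPy sketch. It is an illustration under stated assumptions rather than a reference implementation: `roi_pool` is a hypothetical name, indices are zero-based, axis 0 follows $u$ and axis 1 follows $v$, the last region along each axis absorbs the remainder of the integer division, and the input is assumed to be at least as large as the requested output.

```python
import numpy as np

def roi_pool(s_b, out_w, out_h):
    """Max-pool a (w_s, h_s, c) ConvMap into a fixed (out_w, out_h, c) ConvMap.

    Returns the pooled map and the (u, v, w, h) box of every subregion R_{i,j}.
    Assumption: w_s >= out_w and h_s >= out_h.
    """
    w_s, h_s, c = s_b.shape
    step_w, step_h = w_s // out_w, h_s // out_h      # integer division
    pooled = np.empty((out_w, out_h, c), dtype=s_b.dtype)
    boxes = {}
    for i in range(out_w):
        for j in range(out_h):
            u, v = i * step_w, j * step_h
            # The last region along each axis absorbs the division remainder
            w = w_s - u if i == out_w - 1 else step_w
            h = h_s - v if j == out_h - 1 else step_h
            boxes[(i, j)] = (u, v, w, h)
            # Max over the spatial axes, independently for each channel
            pooled[i, j] = s_b[u:u + w, v:v + h].max(axis=(0, 1))
    return pooled, boxes

# 7 x 5 single-channel ConvMap with values 0..34, pooled to 2 x 2
s_b = np.arange(7 * 5, dtype=float).reshape(7, 5, 1)
pooled, boxes = roi_pool(s_b, 2, 2)
```

The `boxes` dictionary holds the same four regions listed in the example, and `pooled` always has the requested $2 \times 2$ spatial size regardless of the input size, which is what lets it feed the fixed size classifier.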