[Object Detection] SSD



    • 0 Preliminaries
    • 1 What is SSD
    • 2 The SSD Framework
      • 2.1 Key Concepts
        • 2.1.1 Default Boxes
        • 2.1.2 Predicting Object Categories and Box Locations
        • 2.1.3 Why "MultiBox"
      • 2.2 Base Network
      • 2.3 Added Layers
      • 2.4 Overall Architecture
    • 3 Training Strategies
      • 3.1 Hard Negative Mining
      • 3.2 Data Augmentation
    • 4 Experimental Results
      • 4.1 Main Results
      • 4.2 Ablation Study
    • 5 Post-processing at Inference
    • 6 Related Work

0 Preliminaries

  • FPS: frames per second, a measure of inference speed; higher is better
  • coarse feature maps: low-resolution feature maps
  • SSD300: indicates that the input image resolution is 300 x 300
  • conv4_3: the 3rd convolutional layer (layer) of the 4th convolutional block (block); see the figure below for an example
  • jaccard overlap: intersection over union (IoU)

[Figure: example of the conv block / layer naming convention, e.g. conv4_3 in VGG16]

1 What is SSD

Single Shot MultiBox Detector: a single-shot, multi-box detector that needs no region proposals and belongs to the one-stage, single-network family. It runs in real time, achieving higher accuracy than Faster R-CNN and faster detection than YOLOv1. The accuracy (mAP) and speed (FPS) comparison is shown in the figure below:

[Figure: mAP vs. FPS comparison of SSD, Faster R-CNN, and YOLO]

2 The SSD Framework

2.1 Key Concepts

2.1.1 Default Boxes

During the convolutional forward pass we obtain many feature maps. At each location of a given feature map, centered on that location, we can lay out a set of boxes with different aspect ratios and different scales. These boxes are called default boxes.
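To make this concrete, below is a minimal NumPy sketch of laying out default boxes on one square feature map, using the paper's parameterization w = s·√(ar), h = s/√(ar). The function name and the particular scale / aspect-ratio values are only illustrative, not from any official implementation.

```python
import numpy as np

def make_default_boxes(fmap_size, scale, aspect_ratios):
    """Generate center-form default boxes (cx, cy, w, h) in relative
    coordinates for one square feature map of side `fmap_size`."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            # box centers sit at the center of each feature map cell
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                # the paper's parameterization: w = s*sqrt(ar), h = s/sqrt(ar)
                boxes.append([cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)])
    return np.array(boxes)

# e.g. a 4x4 feature map, scale 0.4, three aspect ratios -> 4*4*3 = 48 boxes
print(make_default_boxes(4, 0.4, [1.0, 2.0, 0.5]).shape)  # (48, 4)
```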

2.1.2 Predicting Object Categories and Box Locations

Once the default boxes are laid out, small convolutional filters are applied at each location to predict, for every default box, the class scores and the box offsets.
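A minimal PyTorch sketch of such predictors, assuming a 512-channel feature map like conv4_3 and k = 4 default boxes per cell (the layer names are mine, not from the paper's released code):

```python
import torch
import torch.nn as nn

num_classes, k = 21, 4          # 20 VOC classes + background; k default boxes per cell
in_channels = 512               # e.g. the conv4_3 feature map of VGG16

# one 3x3 filter bank predicts class scores, another predicts the 4 box offsets
cls_head = nn.Conv2d(in_channels, k * num_classes, kernel_size=3, padding=1)
loc_head = nn.Conv2d(in_channels, k * 4, kernel_size=3, padding=1)

feat = torch.randn(1, in_channels, 38, 38)   # a dummy 38x38 feature map
print(cls_head(feat).shape)  # torch.Size([1, 84, 38, 38]) = k*num_classes channels
print(loc_head(feat).shape)  # torch.Size([1, 16, 38, 38]) = k*4 channels
```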

2.1.3 Why "MultiBox"

First, feature maps of different sizes are taken from different stages of the convolutional network, and at every location of each feature map a set of default boxes with different aspect ratios and scales is laid out. Then, for each default box, the network predicts box offsets and per-class confidences (scores). This is illustrated below:

[Figure: default boxes of different scales and aspect ratios on different feature maps, each with predicted offsets and class confidences]

During training, the default boxes must first be matched to the ground truth boxes.
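The matching rule in the paper is: match each ground truth box to the default box with the highest jaccard overlap, and additionally match any default box whose overlap with some ground truth exceeds 0.5. A rough NumPy sketch of this (function names are mine):

```python
import numpy as np

def jaccard(boxes_a, boxes_b):
    """IoU between two sets of corner-form boxes (x1, y1, x2, y2)."""
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    inter = np.prod(np.clip(rb - lt, 0, None), axis=2)
    area_a = np.prod(boxes_a[:, 2:] - boxes_a[:, :2], axis=1)
    area_b = np.prod(boxes_b[:, 2:] - boxes_b[:, :2], axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match(defaults, gts, threshold=0.5):
    """Return, per default box, the index of the matched gt box, or -1."""
    overlap = jaccard(defaults, gts)           # (num_defaults, num_gts)
    matches = np.where(overlap.max(axis=1) > threshold,
                       overlap.argmax(axis=1), -1)
    # additionally, force-match each gt to its best default box
    matches[overlap.argmax(axis=0)] = np.arange(gts.shape[0])
    return matches

defaults = np.array([[0.0, 0.0, 0.5, 0.5], [0.4, 0.4, 0.9, 0.9]])
gts      = np.array([[0.35, 0.35, 0.95, 0.95]])
print(match(defaults, gts))   # [-1  0]: only the second default box matches
```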

2.2 Base Network

The backbone is VGG16, with a few modifications (a small sketch follows the list):

  1. fc6 and fc7 of VGG16 are converted into convolutional layers;
  2. pool5 is changed from 2 x 2 - s2 to 3 x 3 - s1 (kernel = 3 x 3, stride = 1);
  3. the atrous algorithm is used to fill the "holes" (the later ablation study shows it is more or less optional for accuracy, but it helps speed a lot);
  4. all dropout layers and fc8 of VGG16 are removed.
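A small PyTorch sketch of modifications 1–3; the channel counts and dilation value follow common SSD ports of VGG16 and are assumptions here, not something stated in this note.

```python
import torch
import torch.nn as nn

# pool5 changed from 2x2-s2 to 3x3-s1 so the spatial resolution is kept
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# fc6 becomes a 3x3 convolution with dilation ("atrous") to fill the holes,
# fc7 becomes a 1x1 convolution
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)

x = torch.randn(1, 512, 19, 19)       # conv5_3 output of an SSD300-style VGG16
x = pool5(x)
x = torch.relu(conv6(x))
x = torch.relu(conv7(x))
print(x.shape)                         # torch.Size([1, 1024, 19, 19])
```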

2.3 Added Layers

  1. Multi-scale feature maps: the base network is truncated at the end and extra convolutional feature layers are appended, yielding feature maps of progressively smaller sizes, so that objects of different sizes can be detected.

  2. Convolutional predictors: each feature layer used for prediction has its own set of convolutional filters (they are not shared across layers). Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. The filters for class scores and for box locations are also separate: if a feature cell has k default boxes, there are (4 + class_nums) * k filters applied around each cell, and an m x n feature map therefore yields (4 + class_nums) * k * m * n outputs.

    Points 1 and 2 are illustrated in the figure below (a small sanity check of the numbers follows after it):

    [Figure: the extra feature layers and their convolutional predictors added on top of the base network]
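As a sanity check on these numbers, the short script below plugs in the SSD300 feature map sizes and per-location default box counts from the paper and recovers the familiar total of 8732 default boxes:

```python
# Per-location output count is (num_classes + 4) * k; summing k over all
# locations of the six SSD300 feature maps gives the total number of boxes.
num_classes = 21                              # 20 VOC classes + background
fmap_sizes = [38, 19, 10, 5, 3, 1]            # SSD300 feature map sides
boxes_per_cell = [4, 6, 6, 6, 4, 4]           # k for each feature map

total_boxes = sum(s * s * k for s, k in zip(fmap_sizes, boxes_per_cell))
print(total_boxes)                            # 8732

for s, k in zip(fmap_sizes, boxes_per_cell):
    print(f"{s}x{s} map: {(num_classes + 4) * k} output channels, "
          f"{(num_classes + 4) * k * s * s} predicted values")
```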

2.4 Overall Architecture

[Figure: the overall SSD network architecture]

Matching between ground truth boxes and default boxes is needed during training, but not at inference time.

3 Training Strategies

3.1 Hard Negative Mining

Hard negative mining, literally "mining hard negatives", is also commonly called hard example mining.

The paper uses this method to keep the ratio of positives to negatives at 1:3 (the negatives with the highest confidence loss are kept), which prevents the model from drifting toward predicting everything as negative.
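A rough PyTorch sketch of this selection, assuming we already have a per-default-box confidence loss and a mask of positive (matched) boxes; the helper name is mine, not from the paper's code.

```python
import torch

def hard_negative_mask(conf_loss, positive_mask, neg_pos_ratio=3):
    """Select the negatives with the highest confidence loss so that
    negatives : positives is at most `neg_pos_ratio` : 1.
    conf_loss:      (num_boxes,) per-default-box classification loss
    positive_mask:  (num_boxes,) bool, True where a box matched a ground truth
    """
    num_pos = int(positive_mask.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~positive_mask).sum()))

    # never pick a positive box as a "hard negative"
    neg_loss = conf_loss.clone()
    neg_loss[positive_mask] = -float("inf")

    _, idx = neg_loss.sort(descending=True)
    negative_mask = torch.zeros_like(positive_mask)
    negative_mask[idx[:num_neg]] = True
    return positive_mask | negative_mask   # boxes that contribute to the loss

loss = torch.rand(8732)
pos = torch.rand(8732) > 0.99              # pretend ~1% of boxes are positives
print(hard_negative_mask(loss, pos).sum())  # roughly 4x the number of positives
```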


3.2 Data Augmentation

To make the model more robust to objects of various sizes and shapes (aspect ratios), every training image is processed with one of the following three options:

  1. use the original image, i.e. no processing at all;
  2. sample a patch under a constraint: the minimum jaccard overlap with the objects must be at least some value (0.1, 0.3, 0.5, 0.7, or 0.9);
  3. sample a patch at random.

In all cases the sampled patch size lies in [0.1, 1] of the original image size and its aspect ratio in [0.5, 2]. The sampled patches naturally differ in size and shape, so before being fed to the model they are resized to a fixed size. Note that, according to the paper, if the center of a ground truth box lies inside the sampled patch, the overlapped part of that ground truth box (with the patch) is kept.
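A rough NumPy sketch of the constrained sampling option, working in relative coordinates. Whether the minimum or the maximum overlap with the objects is enforced varies between implementations, and the helper names and the interpretation of "size" as a relative side length are my simplifications.

```python
import random
import numpy as np

def jaccard_1toN(patch, boxes):
    """IoU between one patch and N ground-truth boxes (corner form, relative)."""
    lt = np.maximum(patch[:2], boxes[:, :2])
    rb = np.minimum(patch[2:], boxes[:, 2:])
    inter = np.prod(np.clip(rb - lt, 0, None), axis=1)
    area_p = np.prod(patch[2:] - patch[:2])
    area_b = np.prod(boxes[:, 2:] - boxes[:, :2], axis=1)
    return inter / (area_p + area_b - inter)

def sample_patch(gt_boxes, max_trials=50):
    """Sample a patch whose jaccard overlap with the objects meets a randomly
    chosen threshold, loosely following the paper's sampling strategy."""
    min_iou = random.choice([0.1, 0.3, 0.5, 0.7, 0.9])
    for _ in range(max_trials):
        scale = random.uniform(0.1, 1.0)   # patch size relative to the image
        ar = random.uniform(0.5, 2.0)      # patch aspect ratio
        w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)
        if w > 1 or h > 1:
            continue
        x, y = random.uniform(0, 1 - w), random.uniform(0, 1 - h)
        patch = np.array([x, y, x + w, y + h])
        if jaccard_1toN(patch, gt_boxes).min() >= min_iou:
            return patch
    return np.array([0.0, 0.0, 1.0, 1.0])  # fall back to the whole image

gts = np.array([[0.2, 0.2, 0.6, 0.7]])     # one hypothetical ground-truth box
print(sample_patch(gts))
```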

In this paper data augmentation is crucial: the ablation study later shows that with everything else kept fixed, simply removing the data augmentation drops mAP from 74.3% to 65.5%, a striking difference.

Data augmentation also helps SSD considerably on small objects, but that gain comes from an additional augmentation trick: the image is first placed on an enlarged canvas (a "zoom out" / expansion operation) before applying one of the three options above, which ultimately improves mAP by 2%–3%; the price is that training needs more iterations.

4 Experimental Results

4.1 Main Results

The training sets are PASCAL VOC2007 trainval, PASCAL VOC2012 trainval, and, in one setting, COCO trainval (used for pre-training).

The models used for comparison are Fast R-CNN and Faster R-CNN, whose input images have a shortest side of at least 600 pixels.

The test set is PASCAL VOC2007 test.

Higher mAP is better.

[Figure: detection results on the PASCAL VOC2007 test set]

The results above show that:

  • SSD is more accurate than both R-CNN variants;
  • SSD performs better with a larger input resolution;
  • a larger training set also improves results;
  • the best result is SSD512 trained on 07+12+COCO, reaching 81.6% mAP (the model is first pre-trained on COCO and then fine-tuned on VOC07 and VOC12).

Why pre-train on COCO and then fine-tune on 07 and 12?

The reason is simple: the test set is VOC07, and 07 and 12 come from the same kind of data. Whether in experiments or in real applications, the final calibration and evaluation should always be done on the target domain.

Results on other datasets: with only the dataset and a few hyperparameters changed, the observations are similar. One point worth noting is that SSD does lag behind Faster R-CNN on small objects, which is probably because Faster R-CNN refines boxes in two steps (once in the RPN and again in Fast R-CNN); the rest is not repeated here.

4.2 Ablation Study

The ablation experiments are easy to interpret, so here are the conclusions directly:

  • Data augmentation is crucial.
  • More default box shapes is better.
  • Atrous is faster.
  • Multiple output layers at different resolution is better.

5 Post-processing at Inference

Non-maximum suppression (NMS), which can be loosely understood as keeping the good predicted boxes and throwing away all the rest. In the paper, boxes are first filtered with a confidence threshold of 0.01, then NMS with a jaccard overlap of 0.45 is applied per class, and the top 200 detections per image are kept.
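A minimal NumPy sketch of greedy per-class NMS with those settings (0.45 overlap threshold, top 200 detections); this is a generic implementation, not the paper's code.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45, top_k=200):
    """Greedy non-maximum suppression over corner-form boxes (x1, y1, x2, y2).
    Returns the indices of the kept boxes, highest-scoring first."""
    order = scores.argsort()[::-1][:top_k]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        if order.size == 1:
            break
        rest = order[1:]
        lt = np.maximum(boxes[best, :2], boxes[rest, :2])
        rb = np.minimum(boxes[best, 2:], boxes[rest, 2:])
        inter = np.prod(np.clip(rb - lt, 0, None), axis=1)
        area_best = np.prod(boxes[best, 2:] - boxes[best, :2])
        area_rest = np.prod(boxes[rest, 2:] - boxes[rest, :2], axis=1)
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= iou_threshold]   # drop boxes that overlap too much
    return keep

boxes  = np.array([[0.1, 0.1, 0.5, 0.5], [0.12, 0.1, 0.52, 0.5], [0.6, 0.6, 0.9, 0.9]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # keeps indices 0 and 2; the second box is suppressed
```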

6 Related Work

The paper's own summary is excellent, so it is quoted directly below. It is worth knowing thoroughly; wherever something is confusing, go back to the original paper to deepen your understanding.

There are two established classes of methods for object detection in images, one based on sliding windows and the other based on region proposal classification. Before the advent of convolutional neural networks, the state of the art for those two approaches – Deformable Part Model (DPM) and Selective Search – had comparable performance. However, after the dramatic improvement brought on by R-CNN, which combines selective search region proposals and convolutional network based post-classification, region proposal object detection methods became prevalent. The original R-CNN approach has been improved in a variety of ways. The first set of approaches improve the quality and speed of post-classification, since it requires the classification of thousands of image crops, which is expensive and time-consuming. SPPnet speeds up the original R-CNN approach significantly. It introduces a spatial pyramid pooling layer that is more robust to region size and scale and allows the classification layers to reuse features computed over feature maps generated at several image resolutions. Fast R-CNN extends SPPnet so that it can fine-tune all layers end-to-end by minimizing a loss for both confidences and bounding box regression, which was first introduced in MultiBox for learning objectness.

The second set of approaches improve the quality of proposal generation using deep neural networks. In the most recent works like MultiBox, the Selective Search region proposals, which are based on low-level image features, are replaced by proposals generated directly from a separate deep neural network. This further improves the detection accuracy but results in a somewhat complex setup, requiring the training of two neural networks with a dependency between them. Faster R-CNN replaces selective search proposals by ones learned from a region proposal network (RPN), and introduces a method to integrate the RPN with Fast R-CNN by alternating between finetuning shared convolutional layers and prediction layers for these two networks. This way region proposals are used to pool mid-level features and the final classification step is less expensive. Our SSD is very similar to the region proposal network (RPN) in Faster R-CNN in that we also use a fixed set of (default) boxes for prediction, similar to the anchor boxes in the RPN. But instead of using these to pool features and evaluate another classifier, we simultaneously produce a score for each object category in each box. Thus, our approach avoids the complication of merging RPN with Fast R-CNN and is easier to train, faster, and straightforward to integrate in other tasks.

Another set of methods, which are directly related to our approach, skip the proposal step altogether and predict bounding boxes and confidences for multiple categories directly. OverFeat, a deep version of the sliding window method, predicts a bounding box directly from each location of the topmost feature map after knowing the confidences of the underlying object categories. YOLO uses the whole topmost feature map to predict both confidences for multiple categories and bounding boxes (which are shared for these categories). Our SSD method falls in this category because we do not have the proposal step but use the default boxes. However, our approach is more flexible than the existing methods because we can use default boxes of different aspect ratios on each feature location from multiple feature maps at different scales. If we only use one default box per location from the topmost feature map, our SSD would have similar architecture to OverFeat; if we use the whole topmost feature map and add a fully connected layer for predictions instead of our convolutional predictors, and do not explicitly consider multiple aspect ratios, we can approximately reproduce YOLO.

