How to evaluate the ILSVRC 2016 competition results

Large Scale Visual Recognition Challenge 2016 (ILSVRC2016)
Yellow background = winner in this task according to this metric; authors are willing to reveal the method
White background = authors are willing to reveal the method
Grey background = authors chose not to reveal the method
Italics = authors requested entry not participate in competition
Object detection (DET)
Task 1a: Object detection with provided training data
Ordered by number of categories won
Entry description
Number of object categories won
Ensemble of 6 models using provided data
Ensemble A of 3 RPN and 6 FRCN models, mAP is 67 on val2
Ensemble B of 3 RPN and 5 FRCN models, mean AP is 66.9, median AP is 69.3 on val2
submission_1
submission_2
Trimps-Soushen
Ensemble 2
360 MCG-ICT-CAS_DET
9 models ensemble with validation and 2 iterations
360 MCG-ICT-CAS_DET
Baseline: Faster R-CNN with Res200
Best single model, mAP is 65.1 on val2
Ensemble of 2 Models
360 MCG-ICT-CAS_DET
9 models ensemble
360 MCG-ICT-CAS_DET
Trimps-Soushen
Ensemble 1
360 MCG-ICT-CAS_DET
res200 dasc obj sink impneg seg
Single model (pre-activation ResNet Faster R-CNN on TensorFlow; still in training, 1/3 of total epochs finished)
KAIST-SLSP
2 models ensemble with box rescoring
ensemble of ResNet101, ResNet152 based Faster RCNN
KAIST-SLSP
2 models ensemble
Faceall-BUPT
ensemble plan B; validation mAP 52.28
Faceall-BUPT
ensemble plan A; validation mAP 52.24
Faceall-BUPT
multi-…; validation mAP 51.73
Ensemble Detection Model E3
Combined 500×500 with 300×300 model
Ensemble Detection Model E1
ToConcoctPellucid
Ensemble of ResNet-101 and ResNet-50, followed by prediction pooling using box voting
Self-implemented SSD 500×500 model with ResNet-101
ToConcoctPellucid
Ensemble of different topologies of ResNet-101 and ResNet-50, followed by prediction pooling using box voting
ToConcoctPellucid
ResNet-101 Faster-RCNN single model
Faceall-BUPT
validation mAP 49.30
Self-implemented SSD 300×300 model with ResNet-152
Ensemble Detection Model E2
ensemble FRCN and SSD based on Resnet101 networks.
Ensemble of Deep learning model based on VGG16 & ResNet
Single Detection Model S1
hustvision
convbox-googlenet
A deconv-SSD network with input size 300×300.
OutOfMemory
ResNet-152 FasterRCNN
A single model, Faster R-CNN baseline, continuous iterations (~230K)
BUAA ERCACAT
combined model for detection
BUAA ERCACAT
A single model for detection
A single model, Faster R-CNN baseline, discontinuous iterations (~600K)
Single GBD-Net model using provided data
Single Cluster-Net using provided data
Trimps-Soushen
Single model
detection algorithm 1
Single model A using ResNet for detection
Single model B using ResNet for detection
Ordered by mean average precision
Entry description
Number of object categories won
Ensemble of 6 models using provided data
Ensemble A of 3 RPN and 6 FRCN models, mAP is 67 on val2
Ensemble B of 3 RPN and 5 FRCN models, mean AP is 66.9, median AP is 69.3 on val2
Best single model, mAP is 65.1 on val2
Single GBD-Net model using provided data
Trimps-Soushen
Ensemble 2
Single Cluster-Net using provided data
360 MCG-ICT-CAS_DET
9 models ensemble with validation and 2 iterations
360 MCG-ICT-CAS_DET
9 models ensemble
submission_1
submission_2
360 MCG-ICT-CAS_DET
360 MCG-ICT-CAS_DET
Baseline: Faster R-CNN with Res200
Trimps-Soushen
Single model
Trimps-Soushen
Ensemble 1
360 MCG-ICT-CAS_DET
res200 dasc obj sink impneg seg
Ensemble of 2 Models
Single model (pre-activation ResNet Faster R-CNN on TensorFlow; still in training, 1/3 of total epochs finished)
KAIST-SLSP
2 models ensemble with box rescoring
ensemble of ResNet101, ResNet152 based Faster RCNN
KAIST-SLSP
2 models ensemble
detection algorithm 1
Faceall-BUPT
ensemble plan B; validation mAP 52.28
Faceall-BUPT
ensemble plan A; validation mAP 52.24
Faceall-BUPT
multi-…; validation mAP 51.73
Ensemble Detection Model E3
Combined 500×500 with 300×300 model
Ensemble Detection Model E1
ToConcoctPellucid
Ensemble of ResNet-101 and ResNet-50, followed by prediction pooling using box voting
Self-implemented SSD 500×500 model with ResNet-101
ToConcoctPellucid
Ensemble of different topologies of ResNet-101 and ResNet-50, followed by prediction pooling using box voting
ToConcoctPellucid
ResNet-101 Faster-RCNN single model
Faceall-BUPT
validation mAP 49.30
Single model A using ResNet for detection
Single model B using ResNet for detection
Self-implemented SSD 300×300 model with ResNet-152
Ensemble Detection Model E2
ensemble FRCN and SSD based on Resnet101 networks.
Ensemble of Deep learning model based on VGG16 & ResNet
Single Detection Model S1
hustvision
convbox-googlenet
A deconv-SSD network with input size 300×300.
OutOfMemory
ResNet-152 FasterRCNN
A single model, Faster R-CNN baseline, continuous iterations (~230K)
BUAA ERCACAT
combined model for detection
BUAA ERCACAT
A single model for detection
A single model, Faster R-CNN baseline, discontinuous iterations (~600K)
Task 1b: Object detection with additional training data
Ordered by number of categories won
Entry description
Description of outside data used
Number of object categories won
Our model using our labeled landmarks on ImageNet Det data
We used the labeled landmarks on ImageNet Det data
Trimps-Soushen
Ensemble 3
With extra annotations.
submission_4
refine the training data: add neglected labels, remove noisy labels for multi-instance images
submission_3
refine the training data: add neglected labels, remove noisy labels for multi-instance images
submission_5
refine the training data: add neglected labels, remove noisy labels for multi-instance images
DPAI Vison
multi-model ensemble, multiple classifier ensemble
add extra data for classes with num < 1000
DPAI Vison
multi-model ensemble, multiple context classifier ensemble
add extra data for classes with num < 1000
DPAI Vison
multi-model ensemble, extra classifier
add extra data for classes with num < 1000
DPAI Vison
multi-model ensemble, one-scale context classifier
add extra data for classes with num < 1000
DPAI Vison
multi-model ensemble
add extra data for classes with num < 1000
Ordered by mean average precision
Entry description
Description of outside data used
Number of object categories won
Our model using our labeled landmarks on ImageNet Det data
We used the labeled landmarks on ImageNet Det data
Trimps-Soushen
Ensemble 3
With extra annotations.
submission_4
refine the training data: add neglected labels, remove noisy labels for multi-instance images
submission_3
refine the training data: add neglected labels, remove noisy labels for multi-instance images
submission_5
refine the training data: add neglected labels, remove noisy labels for multi-instance images
DPAI Vison
multi-model ensemble, multiple classifier ensemble
add extra data for classes with num < 1000
DPAI Vison
multi-model ensemble, multiple context classifier ensemble
add extra data for classes with num < 1000
DPAI Vison
multi-model ensemble, extra classifier
add extra data for classes with num < 1000
DPAI Vison
multi-model ensemble, one-scale context classifier
add extra data for classes with num < 1000
DPAI Vison
multi-model ensemble
add extra data for classes with num < 1000
Object localization (LOC)
Task 2a: Classification localization with provided training data
Ordered by localization error
Entry description
Localization error
Classification error
Trimps-Soushen
Ensemble 3
Trimps-Soushen
Ensemble 4
Trimps-Soushen
Ensemble 2
Trimps-Soushen
Ensemble 1
Ensemble of 3 Faster R-CNN models for localization
Ensemble of 4 Faster R-CNN models for localization
prefer multi box prediction with refine
prefer multi class prediction
CU-DeepLink
GrandUnion Fused-scale EnsembleNet
CU-DeepLink
GrandUnion Basic Ensemble
CU-DeepLink
GrandUnion Multi-scale EnsembleNet
KAISTNIA_ETRI
Ensembles B (further tuned in class-dependent models I)
CU-DeepLink
GrandUnion Class-reweighted Ensemble with Per-instance Normalization
CU-DeepLink
GrandUnion Class-reweighted Ensemble
KAISTNIA_ETRI
Ensembles A (further tuned in class-dependent model I )
KAISTNIA_ETRI
Ensembles B
KAISTNIA_ETRI
Ensembles A
KAISTNIA_ETRI
Ensembles C
prefer multi box prediction without refine
3 model only for classification
single model only for classification
Faceall-BUPT
Single localization network (II) fine-tuned with object-level annotations of training data.
Faceall-BUPT
Ensemble of 5 models for classification, single model for localization.
Faceall-BUPT
Ensemble of 3 models for classification, single model for localization.
Two models for classification, localization model is fixed. The top-5 cls-only error on validation is 0.0645. The top-5 cls-loc error on validation is 0.4029.
Faceall-BUPT
Single localization network (I) fine-tuned with object-level annotations of training data.
DGIST-KAIST
Weighted sum #1 (five models)
DGIST-KAIST
Averaging four models
For classification, we merge two ResNet models, the top-5 cls-error on validation is 0.0639. For localization, we use a single faster RCNN model with ResNet, the top-5 cls-loc error on validation is 0.4025.
Ensemble C, weighted average, tuned on val. [No bounding box results]
Ensemble B, weighted average, tuned on val. [No bounding box results]
Ensemble A, simple average. [No bounding box results]
Ensemble C, weighted average. [No bounding box results]
Ensemble B, weighted average. [No bounding box results]
SIIT_KAIST-TECHWIN
Ensemble B
SIIT_KAIST-TECHWIN
Ensemble C
SIIT_KAIST-TECHWIN
Ensemble A
SIIT_KAIST-TECHWIN
Single model
DEEPimagine
ImagineNet ensemble for classification only [ALL]
DEEPimagine
ImagineNet ensemble for classification only [PART#2]
DEEPimagine
ImagineNet ensemble for classification only [PART#1]
NEU_SMILELAB
An ensemble of five models. Top-5 error 3.92% on validation set.
NEU_SMILELAB
An ensemble of six models. Top-5 error 4.24% on validation set.
NEU_SMILELAB
A single resnet-200 layer trained with small batch size. Top-5 error 4.57% on validation set.
NEU_SMILELAB
Our single model with a partition of the 1000 classes. Top-5 error 7.62% on validation set.
DGIST-KAIST
Weighted sum #2 (five models)
DGIST-KAIST
Averaging five models
DGIST-KAIST
Averaging six models
Ordered by classification error
Entry description
Classification error
Localization error
Trimps-Soushen
Ensemble 2
Trimps-Soushen
Ensemble 3
Trimps-Soushen
Ensemble 4
Ensemble C, weighted average, tuned on val. [No bounding box results]
CU-DeepLink
GrandUnion Fused-scale EnsembleNet
CU-DeepLink
GrandUnion Multi-scale EnsembleNet
CU-DeepLink
GrandUnion Basic Ensemble
Ensemble B, weighted average, tuned on val. [No bounding box results]
CU-DeepLink
GrandUnion Class-reweighted Ensemble
CU-DeepLink
GrandUnion Class-reweighted Ensemble with Per-instance Normalization
Ensemble C, weighted average. [No bounding box results]
Trimps-Soushen
Ensemble 1
Ensemble A, simple average. [No bounding box results]
3 model only for classification
Ensemble B, weighted average. [No bounding box results]
KAISTNIA_ETRI
Ensembles A
KAISTNIA_ETRI
Ensembles C
KAISTNIA_ETRI
Ensembles B
DGIST-KAIST
Weighted sum #1 (five models)
DGIST-KAIST
Weighted sum #2 (five models)
prefer multi class prediction
KAISTNIA_ETRI
Ensembles A (further tuned in class-dependent model I )
KAISTNIA_ETRI
Ensembles B (further tuned in class-dependent models I)
DGIST-KAIST
Averaging five models
DGIST-KAIST
Averaging six models
DGIST-KAIST
Averaging four models
SIIT_KAIST-TECHWIN
Ensemble B
SIIT_KAIST-TECHWIN
Ensemble A
SIIT_KAIST-TECHWIN
Ensemble C
prefer multi box prediction with refine
prefer multi box prediction without refine
DEEPimagine
ImagineNet ensemble for classification only [ALL]
DEEPimagine
ImagineNet ensemble for classification only [PART#2]
single model only for classification
DEEPimagine
ImagineNet ensemble for classification only [PART#1]
SIIT_KAIST-TECHWIN
Single model
Ensemble of 3 Faster R-CNN models for localization
Ensemble of 4 Faster R-CNN models for localization
NEU_SMILELAB
An ensemble of five models. Top-5 error 3.92% on validation set.
NEU_SMILELAB
An ensemble of six models. Top-5 error 4.24% on validation set.
NEU_SMILELAB
A single resnet-200 layer trained with small batch size. Top-5 error 4.57% on validation set.
Faceall-BUPT
Ensemble of 5 models for classification, single model for localization.
Faceall-BUPT
Ensemble of 3 models for classification, single model for localization.
Faceall-BUPT
Single localization network (I) fine-tuned with object-level annotations of training data.
Faceall-BUPT
Single localization network (II) fine-tuned with object-level annotations of training data.
For classification, we merge two ResNet models, the top-5 cls-error on validation is 0.0639. For localization, we use a single faster RCNN model with ResNet, the top-5 cls-loc error on validation is 0.4025.
Two models for classification, localization model is fixed. The top-5 cls-only error on validation is 0.0645. The top-5 cls-loc error on validation is 0.4029.
NEU_SMILELAB
Our single model with a partition of the 1000 classes. Top-5 error 7.62% on validation set.
Task 2b: Classification localization with additional training data
Ordered by localization error
Entry description
Description of outside data used
Localization error
Classification error
Trimps-Soushen
Ensemble 5
With extra annotations.
prefer multi box prediction
ensemble includes one model trained on CLS + Places2 (1365)
prefer multi class prediction
ensemble includes one model trained on CLS + Places2 (1365)
Ordered by classification error
Entry description
Description of outside data used
Classification error
Localization error
Trimps-Soushen
Ensemble 5
With extra annotations.
prefer multi class prediction
ensemble includes one model trained on CLS + Places2 (1365)
prefer multi box prediction
ensemble includes one model trained on CLS + Places2 (1365)
Object detection from video (VID)
Task 3a: Object detection from video with provided training data
Ordered by number of categories won
Entry description
Number of object categories won
cascaded region regression tracking
cascaded region regression tracking
4-model ensemble with Multi-Context Suppression and Motion-Guided Propagation
Trimps-Soushen
Ensemble 2
MCG-ICT-CAS
ResNet101 and ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo
MCG-ICT-CAS
ResNet101 and ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification
MCG-ICT-CAS
ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification
MCG-ICT-CAS
ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo
MCG-ICT-CAS
ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification
Trimps-Soushen
Ensemble 3
KAIST-SLSP
set 1 (ensemble with 2 models w/ various post-processing, including multiple object tracking w/ beta = 0.2)
NUS_VISENZE
fused ssd vgg resnet nms
We use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video. Then we utilize the contextual information of the video to reduce the noise and add the missing detections.
Object detection using temporal and contextual information
object detection without contextual information
Faceall-BUPT
Faster R-CNN, brute-force detection, DET data only; mAP on val is 53.51
CIGIT_Media
adopt a new method for merging the scores of R-FCN and SSD detectors
CIGIT_Media
object detection from video without tracking
SSD ResNet-101, 0.01 confidence threshold
SSD ResNet-101, 0.1 confidence threshold
SSD ResNet-101, 0.2 confidence threshold
SSD with ResNet-101, filtered by NMS with a 0.6 overlap threshold and a 0.1 confidence threshold
SSD with ResNet-101, filtered by NMS with a 0.6 overlap threshold and a 0.02 confidence threshold
SIS ITMO University
Our model takes into account spatial and temporal information from several previous frames.
We only use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video.
4-model ensemble without MCS & MGP
Single GBD-Net with Multi-Context Suppression & Motion-Guided Propagation
Ordered by mean average precision
Entry description
Number of object categories won
cascaded region regression tracking
cascaded region regression tracking
4-model ensemble with Multi-Context Suppression and Motion-Guided Propagation
4-model ensemble without MCS & MGP
MCG-ICT-CAS
ResNet101 and ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo
Single GBD-Net with Multi-Context Suppression & Motion-Guided Propagation
MCG-ICT-CAS
ResNet101 and ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification
MCG-ICT-CAS
ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification
Trimps-Soushen
Ensemble 2
MCG-ICT-CAS
ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo
MCG-ICT-CAS
ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification
Trimps-Soushen
Ensemble 3
KAIST-SLSP
set 1 (ensemble with 2 models w/ various post-processing, including multiple object tracking w/ beta = 0.2)
NUS_VISENZE
fused ssd vgg resnet nms
We use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video. Then we utilize the contextual information of the video to reduce the noise and add the missing detections.
Object detection using temporal and contextual information
object detection without contextual information
Faceall-BUPT
Faster R-CNN, brute-force detection, DET data only; mAP on val is 53.51
CIGIT_Media
adopt a new method for merging the scores of R-FCN and SSD detectors
CIGIT_Media
object detection from video without tracking
SSD ResNet-101, 0.01 confidence threshold
SSD ResNet-101, 0.1 confidence threshold
SSD ResNet-101, 0.2 confidence threshold
SSD with ResNet-101, filtered by NMS with a 0.6 overlap threshold and a 0.1 confidence threshold
SSD with ResNet-101, filtered by NMS with a 0.6 overlap threshold and a 0.02 confidence threshold
SIS ITMO University
Our model takes into account spatial and temporal information from several previous frames.
We only use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video.
Task 3b: Object detection from video with additional training data
Ordered by number of categories won
Entry description
Description of outside data used
Number of object categories won
cascaded region regression tracking
proposal network is finetuned from COCO
cascaded region regression tracking
proposal network is finetuned from COCO
Trimps-Soushen
Ensemble 6
Extra data from the ImageNet dataset (outside of ILSVRC2016)
ITLab-Inha
An ensemble for detection, MCMOT for tracking
pre-trained model from COCO detection, extra data collected by ourselves (100 images per class)
DPAI Vison
single model
extra data
DPAI Vison
single model and iteration regression
extra data
VGG-16 Faster R-CNN
Imagenet DET dataset
Ensemble of 6 models
Imagenet DET dataset
Ensemble of 7 models
Imagenet DET dataset
Ordered by mean average precision
Entry description
Description of outside data used
Number of object categories won
cascaded region regression tracking
proposal network is finetuned from COCO
cascaded region regression tracking
proposal network is finetuned from COCO
ITLab-Inha
An ensemble for detection, MCMOT for tracking
pre-trained model from COCO detection, extra data collected by ourselves (100 images per class)
Trimps-Soushen
Ensemble 6
Extra data from the ImageNet dataset (outside of ILSVRC2016)
DPAI Vison
single model
extra data
DPAI Vison
single model and iteration regression
extra data
VGG-16 Faster R-CNN
Imagenet DET dataset
Ensemble of 6 models
Imagenet DET dataset
Ensemble of 7 models
Imagenet DET dataset
Task 3c: Object detection/tracking from video with provided training data
Entry description
4-model ensemble
cascaded region regression tracking
Single GBD-Net
MCG-ICT-CAS
ResNet101 and ResNet200 models for detection, optical flow for tracking, Coherent tubelet reclassification, MDNet tracking
MCG-ICT-CAS
ResNet101 and ResNet200 models for detection, optical flow for tracking, Coherent tubelet reclassification, MDNet tracking
MCG-ICT-CAS
ResNet101 models for detection, optical flow for tracking, Coherent tubelet reclassification, MDNet tracking
MCG-ICT-CAS
ResNet101 models for detection, optical flow for tracking, Coherent tubelet reclassification
MCG-ICT-CAS
ResNet101 and ResNet200 models for detection, optical flow for tracking, Coherent tubelet reclassification
KAIST-SLSP
set 1 (ensemble with 2 models w/ various post-processing, including multiple object tracking w/ beta = 0.2)
CIGIT_Media
object detection from video with tracking
CIGIT_Media
adopt a new method for merging the scores of R-FCN and SSD detectors
a simple track with ssd_resnet101
NUS_VISENZE
17Sept_result_final_ss_ssd_resnet_nms_fused
NUS_VISENZE
a simple track with ssd_resnet101 with 0.1 confidence
a simple track with ssd_resnet101 with 0.2 confidence
NUS_VISENZE
fused 3 models with tracking
NUS_VISENZE
fused 3 models with tracking max 8 classes
This is the longest run without error.
This had some error; I don't know if it's complete.
Task 3d: Object detection/tracking from video with additional training data
Entry description
Description of outside data used
cascaded region regression tracking
proposal network is finetuned from COCO
ITLab-Inha
An ensemble for detection, MCMOT for tracking
pre-trained model from COCO detection, extra data collected by ourselves (100 images per class)
Scene Classification (Scene)
Entry description
Top-5 classification error
Model ensemble 2
Model ensemble 3
Model ensemble 1
Trimps-Soushen
With extra data.
Trimps-Soushen
Ensemble 2
SIAT_MMLAB
10 models fusion
SIAT_MMLAB
7 models fusion
SIAT_MMLAB
fusion with softmax
SIAT_MMLAB
learning weights with cnn
SIAT_MMLAB
6 models fusion
Trimps-Soushen
Ensemble 4
Trimps-Soushen
Ensemble 3
Single model B
Single model A
Product of 5 ensembles (top-5)
Product of 3 ensembles (top-5)
Sum of 3 ensembles (top-5)
Sum of 5 ensembles (top-3)
Single ensemble of 5 models (top-5)
Four models
Three models
Samsung Research America: General Purpose Acceleration Group
Simple Ensemble, 3 Inception v3 models w/ various hyper-param changes, 32 multi-crop (60.11 top-1, 88.98 top-5 on val)
Fusion with average strategy (12 models)
Fusion with scoring strategy (14 models)
Fusion with average strategy (13 models)
weighted average1 at scale level using greedy search
weighted average at model level using greedy search
weighted average2 at scale level using greedy search
Fusion with scoring strategy (13 models)
Fusion with scoring strategy (12 models)
simple average using models in entry 3
Samsung Research America: General Purpose Acceleration Group
Model A0, weakly scaled, multi-crop. (59.61 top-1, 88.64 top-5 on val)
Samsung Research America: General Purpose Acceleration Group
Ensemble B, 3 Inception v3 models w/ various hyper-param changes and Inception v4 res2, 128 multi-crop
average on base models
Samsung Research America: General Purpose Acceleration Group
Model A2, weakly scaled, single-crop & mirror. (58.84 top-1, 88.09 top-5 on val)
Samsung Research America: General Purpose Acceleration Group
Model A1, weakly scaled, single-crop. (58.65 top-1, 88.07 top-5 on val)
Trimps-Soushen
Ensemble 1
ensemble model 1
single model
ensemble model 2
ensemble by learned weights – 1
ensemble by product strategy
ensemble by learned weights – 2
ensemble by average strategy
Ensemble of two ResNet-50 with balanced sampling
Ensemble of Model I and II
single model result of 69
ensemble by product strategy (without specialist models)
Model II with adjustment
single model result of 66
Ensemble of Model I and II with adjustment
SJTU-ReadSense
Ensemble 5 models with learnt weights
SJTU-ReadSense
Ensemble 5 models with weighted validation accuracies
A combination of CNN models based on researched influential factors
SJTU-ReadSense
Ensemble 6 models with learnt weights
SJTU-ReadSense
Ensemble 4 models with learnt weights
A combination of CNN models with a strategy w.r.t. validation accuracy
Based on VGG16, features are extracted from multiple layers. An ROI proposal network is not applied; every neuron from each feature layer is the center of an ROI candidate.
SIIT_KAIST
101-depth single model (val. error 12.90%)
DPAI Vison
An ensemble model
spectral clustering on confusion matrix
fusion of 4 models with average strategy
inception shortcut CNN
MP_multiCNN_multiscale
inception shortcut CNN
Viz Insight
Multiple Deep Metaclassifiers
FeatureFusion_2L
FeatureFusion_3L
DPAI Vison
Single Model
2 models with size of 288
Faceall-BUPT
A single model with 150 crops
A Single Model
SJTU-ReadSense
A single model (based on Inception-BN) trained on the Places365-Challenge dataset
OceanVision
A result obtained by VGG-16
OceanVision
A result obtained by alexnet
OceanVision
A result obtained by googlenet
GoogLeNet model trained on the LSUN dataset and fine-tuned on Places2
Vladimir Iglovikov
VGG16 trained on 128×128
Vladimir Iglovikov
VGG19 trained on 128×128
Vladimir Iglovikov
average of VGG16 and VGG19 trained on 128×128
Vladimir Iglovikov
ResNet-50 trained on 128×128
VGG16 4D lstm
Scene Parsing
Entry description
Average of mIoU and pixel accuracy
SenseCUSceneParsing
ensemble more models on trainval data
SenseCUSceneParsing
dense ensemble model on trainval data
SenseCUSceneParsing
ensemble model on trainval data
SenseCUSceneParsing
ensemble model on train data
Multiple models, multiple scales, refined with CRFs
Multiple models, multiple scales
Single model, multiple scales
Multiple models, single scale
360 MCG-ICT-CAS_SP
fusing 152-, 101-, and 200-layer front models with global context aggregation, iterative boosting and high-resolution training
Single model, single scale
SenseCUSceneParsing
best single model on train data
360 MCG-ICT-CAS_SP
fusing 152-, 101-, and 200-layer front models with global context aggregation, iterative boosting and high-resolution training, with some models adding a local refinement network before fusion
360 MCG-ICT-CAS_SP
fusing 152-, 101-, and 200-layer front models with global context aggregation, iterative boosting and high-resolution training, with some models adding a local refinement network before and after fusion
360 MCG-ICT-CAS_SP
152-layer front model with global context aggregation, iterative boosting and high-resolution training
ensemble of 5 models, bilateral filter, 42.7 mIoU on val set
ensemble of 5 models, guided filter, 42.5 mIoU on val set
casia_iva_model4:DeepLab, Multi-Label
casia_iva_model3:DeepLab, OA-Seg, Multi-Label
casia_iva_model5:Aug_data,DeepLab, OA-Seg, Multi-Label
Fusion of models from two sources (Train and TrainVal)
6 ResNet-initialized models (models are trained from TrainVal)
8 ResNet-initialized models and 2 VGG-initialized models (with different BN statistics)
ensemble by joint categories and guided filter, 42.7 on val set
8 ResNet-initialized models and 2 VGG-initialized models (models are trained from Train only)
8 ResNet-initialized models and 2 VGG-initialized models (models are trained from TrainVal)
Ensemble models
ensemble by joint categories and bilateral filter, 42.8 on val set
ACRV-Adelaide
use DenseCRF
single model, 41.3 mIoU on valset
DPAI Vison
different denseCRF parameters of 3 models(B)
Single model
ACRV-Adelaide
an ensemble
360 MCG-ICT-CAS_SP
baseline, 152-layer front model with iterative boosting
casia_iva_model2:DeepLab, OA-Seg
DPAI Vison
different denseCRF parameters of 3 models(C)
DPAI Vison
average ensemble of 3 segmentation models
DPAI Vison
different denseCRF parameters of 3 models(A)
casia_iva_model1:DeepLab
scene parsing network 5
scene parsing network
scene parsing network 3
ACRV-Adelaide
a single model
scene parsing network 2
SYSU_HCP-I2_Lab
cascade nets
SYSU_HCP-I2_Lab
DCNN with skipping layers
SYSU_HCP-I2_Lab
DeepLab_CRF
SYSU_HCP-I2_Lab
Pixel normalization networks
SYSU_HCP-I2_Lab
S-LAB-IIE-CAS
Multi-Scale CNN Bbox_Refine FixHole
S-LAB-IIE-CAS
Multi-Scale CNN Bbox_Refine FixHole
S-LAB-IIE-CAS
Combined with the results of other models
S-LAB-IIE-CAS
Combined with the results of other models
S-LAB-IIE-CAS
Multi-Scale CNN Attention
S-LAB-IIE-CAS
Multi-Scale CNN Attention
trained with training set and val set
NUS-AIPARSE
trained with training set only
NUS-AIPARSE
Model fusion of ResNet101 and DilatedNet, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models.
Model fusion of ResNet101 and FCN, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models.
Model fusion of ResNet101, FCN and DilatedNet, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models.
Faceall-BUPT
6 models fine-tuned from pre-trained FCN-8s and DilatedNet with 3 different image sizes.
NUS-AIPARSE
Faceall-BUPT
We use six models fine-tuned from pre-trained FCN-8s and DilatedNet with 3 different image sizes. The pixel-wise accuracy is 76.94% and the mean class-wise IoU is 0.3552.
Model fusion of ResNet101, FCN and DilatedNet, with data augmentation, fine-tuned from places2 scene classification/parsing 2016 pretrained models.
S-LAB-IIE-CAS
Multi-Scale CNN Bbox_Refine
S-LAB-IIE-CAS
Multi-Scale CNN Bbox_Refine
Faceall-BUPT
3 models fine-tuned from pre-trained FCN-8s with 3 different image sizes.
Faceall-BUPT
3 models fine-tuned from pre-trained DilatedNet with 3 different image sizes.
Model fusion of FCN and DilatedNet, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models.
S-LAB-IIE-CAS
Multi-Scale CNN
S-LAB-IIE-CAS
Multi-Scale CNN
Multiscale-FCN-CRFRNN
Multi-scale CRF-RNN
Faceall-BUPT
One model fine-tuned from pre-trained DilatedNet with image size 384×384. The pixel-wise accuracy is 75.14% and the mean class-wise IoU is 0.3291.
Deep Cognition Labs
Modified DeepLab VGG16 with CRF
FCN-8s with classification
NuistParsing
SegNet Smoothing
SegNet trained on ADE20k CRF
Fine-tuned version of ParseNet
Fine-tuned version of ParseNet
Team information
Team members
360 MCG-ICT-CAS_SP
Rui Zhang (1,2)
Min Lin (1)
Sheng Tang (2)
Yu Li (1,2)
YunPeng Chen (3)
YongDong Zhang (2)
JinTao Li (2)
YuGang Han (1)
ShuiCheng Yan (1,3)
(1) Qihoo 360
(2) Multimedia Computing Group,Institute of Computing Technology,Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China
(3) National University of Singapore (NUS)
Technique Details for the Scene Parsing Task:
There are two core and general contributions for our scene parsing system: 1) Local-refinement-network for object boundary refinement, and 2) Iterative-boosting-network for overall parsing refinement.
These two networks collaboratively refine the parsing results from two perspectives, and the details are as below:
1) Local-refinement-network for object boundary refinement. This network takes the original image and the K object probability maps (each for one of the K classes) as inputs, and the output is m*m feature maps indicating how each of the m*m neighbors propagates the probability vector to the center point for local refinement. It works similarly in spirit to bounding-box refinement in the object detection task, but here it locally refines the object boundary instead of the object bounding box (see the sketch after item 2).
2) Iterative-boosting-network for overall parsing refinement. This network takes the original image and the K object probability maps (each for one of the K classes) as inputs, and the output is the refined probability maps for all classes. It iteratively boosts the parsing results in a global way.
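A minimal PyTorch sketch of how networks 1) and 2) could be wired together; the layer widths, the m = 3 neighbourhood, and the reuse of a single refiner across boosting iterations are illustrative assumptions, not the authors' actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalRefinementNet(nn.Module):
    """For every pixel, predict m*m weights saying how strongly each of its
    m x m neighbours propagates its class-probability vector to the centre
    pixel (the local-refinement idea in 1))."""
    def __init__(self, num_classes, m=3, hidden=64):
        super().__init__()
        self.m = m
        self.body = nn.Sequential(
            nn.Conv2d(3 + num_classes, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, m * m, 1),             # one weight per neighbour
        )

    def forward(self, image, probs):
        # image: (B, 3, H, W); probs: (B, K, H, W) class-probability maps
        w = F.softmax(self.body(torch.cat([image, probs], dim=1)), dim=1)  # (B, m*m, H, W)
        B, K, H, W = probs.shape
        # gather every pixel's m x m neighbourhood of class probabilities
        neigh = F.unfold(probs, self.m, padding=self.m // 2).view(B, K, self.m ** 2, H, W)
        return (neigh * w.unsqueeze(1)).sum(dim=2)   # locally refined (B, K, H, W)

def iterative_boost(refiner, image, probs, num_iters=2):
    """Stand-in for the iterative-boosting idea in 2): repeatedly feed the
    current probability maps back through a refinement network."""
    for _ in range(num_iters):
        probs = torch.softmax(refiner(image, probs), dim=1)
    return probs
```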
Also two other tricks are used as below:
1) Global context aggregation: The scene classification information may potentially provide the global context information for decision as well as capture the co-occurrence relationship between scene and object/stuff in scene. Thus, we add the features from an independent scene classification model trained on ILSVRC 2016 Scene Classification dataset into our scene parsing system as contexts.
2) Multi-scale scheme: Considering the limited amount of training data and the various scales of objects in different training samples, we use multi-scale data augmentation in both the training and inference stages. High-resolution models are also trained on magnified images to capture details and small objects.
360 MCG-ICT-CAS_DET
Yu Li (1,2),
Sheng Tang (2),
Min Lin (1),
Rui Zhang (1,2),
YunPeng Chen (3),
YongDong Zhang (2),
JinTao Li (2),
YuGang Han (1),
ShuiCheng Yan (1,3)
(1) Qihoo 360,
(2) Multimedia Computing Group,Institute of Computing Technology,Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China,
(3) National University of Singapore (NUS)
The new contributions of this system are three-fold: 1) Implicit sub-categories of background class, 2) Sink class when necessary, and 3) new semantic segmentation features.
For training:
1) Implicit sub-categories of background class: for Faster-RCNN [1], the “background” class is considered as ONE class, treated equally with the other individual object classes, but it is quite diverse and impossible to describe as one pattern. Thus we use K output nodes, namely K patterns, to implicitly represent the sub-categories of the background class, which considerably improves the identification capability for the background class.
(2) Sink class when necessary: It is often the case that the ground-truth class has low probability, and thus the result is incorrect since the probabilities of all classes sum to 1. To address this issue and improve the chance for a low-probability ground-truth class to win, we add a so-called “sink” class, which takes away some probability mass when the ground-truth class has low probability, pushing the other classes to even lower probabilities than the ground truth and allowing the ground truth to win. We also propose to use the sink class in the loss function only when necessary, namely when the ground-truth class is not in the top-k list (a combined sketch of (1) and (2) follows after this list).
(3) New semantic segmentation features: On one hand, motivated by [2], we generate weakly supervised segmentation feature which is used to train region proposal scoring functions and make the gradient flow among all branches. On the other hand, an independent segmentation model trained on ILSVRC Scene Parsing dataset is used to provide feature for our detection network, which is supposed to bring in both stuff and object information for decision.
(4) Dilation as context: Motivated by widely used dilated convolution [3] in segmentation, we introduce dilated convolutional layers (initialized as identity mapping) to obtain effective context for training.
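A hedged sketch of how contributions (1) and (2) above could be realised in the classification head; the number of background sub-categories, the top-k threshold, and the exact loss wiring are assumptions rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def fold_background(logits, num_bg):
    """(1) The first num_bg logits implicitly model sub-categories of the
    'background' class; sum their probabilities back into one background score."""
    probs = F.softmax(logits, dim=1)
    bg = probs[:, :num_bg].sum(dim=1, keepdim=True)
    return torch.cat([bg, probs[:, num_bg:]], dim=1)      # (B, 1 + num_object_classes)

def sink_class_loss(logits, target, k=5):
    """(2) The last logit is a 'sink' class. It enters the softmax only for samples
    whose ground-truth class is not already in the top-k, so it can absorb probability
    mass from competing classes and let the ground-truth class win."""
    obj_logits = logits[:, :-1]                            # real classes only
    in_topk = (obj_logits.topk(k, dim=1).indices == target.unsqueeze(1)).any(dim=1)
    loss_plain = F.cross_entropy(obj_logits, target, reduction='none')
    loss_sink = F.cross_entropy(logits, target, reduction='none')   # sink included
    return torch.where(in_topk, loss_plain, loss_sink).mean()
```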
For testing:
We utilize box refinement, box voting, multi-scale testing, co-occurrence refinement, and model ensembling to benefit the inference stage.
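As an illustration of one of these test-time steps, a minimal NumPy sketch of box voting; the 0.5 IoU threshold and the score-weighted averaging are assumptions about the exact recipe:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def box_voting(kept_boxes, all_boxes, all_scores, iou_thr=0.5):
    """Replace each NMS survivor by the score-weighted average of every
    pre-NMS detection that overlaps it by at least iou_thr."""
    voted = []
    for box in kept_boxes:
        mask = iou(box, all_boxes) >= iou_thr      # always includes the box itself
        w = all_scores[mask]
        voted.append((all_boxes[mask] * w[:, None]).sum(axis=0) / w.sum())
    return np.stack(voted)
```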
References:
[1] Ren, Shaoqing, et al. “Faster R-CNN: Towards real-time object detection with region proposal networks.” Advances in neural information processing systems. 2015.
[2] Gidaris, Spyros, and Nikos Komodakis. “Object detection via a multi-region and semantic segmentation-aware cnn model.” Proceedings of the IEEE International Conference on Computer Vision. 2015.
[3] Yu, Fisher, and Vladlen Koltun. “Multi-scale context aggregation by dilated convolutions.” International Conference on Learning Representations. 2016.
Ankan Bansal
Before training on LSUN, this network was trained using the Places205 dataset. The model was trained till it saturated at around 85% (Top-1) accuracy on the validation dataset of the LSUN challenge. Then the model was fine-tuned on the 365 categories in the Places2 challenge.
We did not use the trained models provided by the organisers to initialise our network.
References:
[1] Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[2] Yu, Fisher, et al. “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop.” arXiv preprint arXiv: (2015).
ACRV-Adelaide
Guosheng Lin
Anton van den Hengel
Ian Reid
Affiliations: ACRV; University of Adelaide
Our method is based on multi-level information fusion. We generate multi-level representation of the input image and develop a number of fusion networks with different architectures.
Our models are initialized from the pre-trained residual nets [1] with 50 and 101 layers. A part of the network design in our system is inspired by the multi-scale network with pyramid pooling described in [2] and the FCN network in [3]. Our system achieves good performance on the validation set. The IoU score on the validation set is 40.3 when using a single model, which is clearly better than the reported results of the baseline methods in [4]. Applying DenseCRF [5] slightly improves the result. We are preparing a technical report on our method and it will be available on arXiv soon.
References:
[1] “Deep Residual Learning for Image Recognition”, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR 2016.
[2] “Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation”, Guosheng Lin, Chunhua Shen, Anton van den Hengel, Ian Reid. CVPR 2016
[3] “Fully convolutional networks for semantic segmentation”, J. Long, E. Shelhamer, T. Darrell. CVPR 2015
[4] “Semantic Understanding of Scenes through ADE20K Dataset” B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:
[5] “Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials”, Philipp Krähenbühl, Vladlen Koltun. NIPS 2012.
Zifeng Wu, University of Adelaide
Chunhua Shen, University of Adelaide
Anton van den Hengel, University of Adelaide
We have trained networks with different newly designed structures. One of them performs as well as the Inception-Residual-v2 network in the classification task. It was further tuned for several epochs using the Places365 dataset, which finally obtained even better results on the validation set in the segmentation task. As for FCNs, we mostly followed the settings in our previous technical reports [1, 2]. The best result was obtained by combining the FCNs initialized using two pre-trained networks.
[1] High-performance Semantic Segmentation Using Very Deep Fully Convolutional Networks. https://arxiv.org/abs/
[2] Bridging Category-level and Instance-level Semantic Image Segmentation. https://arxiv.org/abs/
Romain Vial (VA Master Intern Student)
Zhu Hongyuan (VA Scientist)
Su Bolan (ex ASTAR Scientist)
Shijian Lu (VA Head)
Our system localizes and recognizes objects from various scales, positions and classes. It takes into account spatial (local and global) and temporal information from several previous frames.
The model has been trained on both the training and validation set. We achieve a final score on the validation set of 76.5% mAP.
Andrea Ferri
This is the result of my thesis: implementing a deep learning environment on a computational server and developing object tracking in video with TensorFlow, suitable for the ImageNet VID challenge.
BUAA ERCACAT
Biao Leng (Beihang University), Guanglu Song (Beihang University), Cheng Xu (Beihang University), Jiongchao Jin (Beihang University), Zhang Xiong (Beihang University)
Our group utilizes two image object detection architectures, namely Fast R-CNN [2] and Faster R-CNN [1], for the task of object detection. The Faster R-CNN detection system can be divided into two modules: an RPN (region proposal network), a fully convolutional network that proposes regions to tell the detector where to focus in an image, and a Fast R-CNN detector that uses the region proposals and classifies the objects in them.
Our training model is based on the VGG-16 model, and we utilize a combined model for higher RPN recall.
[1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv:.
[2] Ross Girshick. “Fast R-CNN: Fast Region-based Convolutional Networks for Object Detection”, ICCV 2015.
Jun Fu,Jing Liu,Xinxin Zhu,Longteng Guo,Zhenwei Shen,Zhiwei Fang,Hanqing Lu
We implement image semantic segmentation based on the fused result of three deep models: DeepLab [1], OA-Seg [2], and the officially released model for this challenge. DeepLab is trained with the ResNet-101 framework and is further improved with object proposals and multi-scale prediction combination. OA-Seg is trained with VGG, in which object proposals and multi-scale supervision are considered. We augment the training data with multi-scale and mirrored variants for both of the above models. We additionally employ multi-label annotations for images to refine the segmentation results.
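A minimal sketch of the fusion step, assuming each of the three models produces per-pixel class score maps that are (weighted-)averaged before the argmax; the equal default weights are an assumption:

```python
import numpy as np

def fuse_parsing(score_maps, weights=None):
    """score_maps: list of (K, H, W) per-pixel class score arrays, one per model
    (e.g. DeepLab, OA-Seg, and the official baseline). Returns the fused label map."""
    if weights is None:
        weights = [1.0 / len(score_maps)] * len(score_maps)
    fused = sum(w * s for w, s in zip(weights, score_maps))
    return fused.argmax(axis=0)          # (H, W) predicted class index per pixel
```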
[1] Liang-Chieh Chen et al., DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, arXiv:, 2016
[2] Yuhang Wang et al., Objectness-aware Semantic Segmentation, accepted by ACM Multimedia, 2016.
Choong Hwan Choi (KAIST)
Ensemble of Deep learning model based on VGG16 & ResNet
Based on VGG16, features are extracted from multiple layers. An ROI proposal network is not applied; every neuron from each feature layer is the center of an ROI candidate.
References:
[1] Liu, Wei, et al., “SSD: Single Shot MultiBox Detector”
[2] K. Simonyan, A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”
[3] Kaiming He, et al., “Deep Residual Learning for Image Recognition”
CIGIT_Media
Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences
We present a simple method combining still image object detection and object tracking for the ImageNet VID task. Object detection is first performed on each frame of the video, and the detected targets are then tracked through the nearby frames. Each tracked target is also assigned a detection score by the object detector. According to the scores, non-maximum suppression (NMS) is applied to all the detected and tracked targets on each frame to obtain the VID results. To improve the performance, we actually employ two state-of-the-art detectors for still image object detection, i.e. the R-FCN detector and the SSD detector. We run the above steps for both detectors independently and combine the respective results into the final ones through NMS.
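A rough sketch of the per-frame fusion described above, with a standard greedy NMS written out inline; the [x1, y1, x2, y2, score, class] detection layout is an assumption:

```python
import numpy as np

def greedy_nms(boxes, scores, thr):
    """Standard greedy NMS: keep the highest-scoring box, drop overlaps >= thr."""
    order, keep = scores.argsort()[::-1], []
    while order.size:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        ovr = inter / (area(boxes[i:i + 1]) + area(boxes[order[1:]]) - inter)
        order = order[1:][ovr < thr]
    return np.array(keep, dtype=int)

def merge_detectors(frame_dets_rfcn, frame_dets_ssd, nms_thr=0.5):
    """Per frame: pool both detectors' outputs and keep per-class NMS survivors."""
    merged = []
    for det_a, det_b in zip(frame_dets_rfcn, frame_dets_ssd):
        dets = np.concatenate([det_a, det_b], axis=0)       # (N, 6) rows
        keep = []
        for c in np.unique(dets[:, 5]):
            idx = np.flatnonzero(dets[:, 5] == c)
            keep.extend(idx[greedy_nms(dets[idx, :4], dets[idx, 4], nms_thr)])
        merged.append(dets[np.array(sorted(keep), dtype=int)])
    return merged
```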
[1] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016.
[2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg. SSD: Single Shot MultiBox Detector. arXiv 2016.
[3] K. Kang, W. Ouyang, H. Li, and X. Wang. Object Detection from Video Tubelets with Convolutional Neural Networks. CVPR 2016.
Seongmin Kang
Seonghoon Kim
Heungwoo Han
Our model is based on Faster R-CNN [1].
A pre-activation residual network [2] trained on the ILSVRC 2016 dataset is modified for the detection task.
Heavy data augmentation is applied. OHEM [3] and atrous convolution are also applied.
All of them are implemented on TensorFlow with multi-GPU training [4]. To meet the deadline, the detection model was trained for just 1/3 of the training epochs we had planned.
[1] Shaoqing Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, NIPS, 2015
[2] Kaiming He et al., Identity Mappings in Deep Residual Networks, ECCV, 2016
[3] Abhinav Shrivastava et al., Training Region-based Object Detectors with Online Hard Example Mining, CVPR, 2016
[4] Martín Abadi et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org
CU-DeepLink
Major team members
——————-
Xingcheng Zhang ^1
Zhizhong Li ^1
Yang Shuo ^1
Yuanjun Xiong ^1
Yubin Deng ^1
Xiaoxiao Li ^1
Kai Chen ^1
Yingrui Wang ^2
Chen Huang ^1
Tong Xiao ^1
Wanshen Feng ^2
Xinyu Pan ^1
Yunxiang Ge ^1
Hang Song ^1
Yujun Shen ^1
Boyang Deng ^1
Ruohui Wang ^1
Supervisors
Dahua Lin ^1
Chen Change Loy ^1
Wenzhi Liu ^2
Shengen Yan ^2
1 – Multimedia Lab, The Chinese University of Hong Kong.
2 – SenseTime Inc.
Classification
——————
Our classification framework is built on top of Google's Inception-ResNet-v2 (IR-v2) [1]. We combined several important techniques, which together lead to a substantial performance gain.
1. We developed a novel building block, called “PolyInception”. Each PolyInception can be considered as a meta-module that integrates multiple inception modules via K-way polynomial composition. In this way, we substantially improve a module's expressive power. Also, to facilitate the propagation of gradients across a very deep network, we retain an identity path [2] for each PolyInception.
2. At the core of our framework is the Grand Models. Each grand model comprises three sections operating on different spatial resolutions. Each section is a stack of multiple PolyInception modules. To achieve optimal overall performance (within a certain computational budget), we rebalance the number of modules across the sections.
3. Most of our grand models contain over 500 layers. Whereas they demonstrate remarkable model capacity, we observed notable overfitting at later stage of the training process. To overcome this difficulty, we adopted Stochastic Depth [3] for regularization.
4. We trained 20 Grand Models, some deeper and others wider. These models constitute a performant yet diverse ensemble. The single most powerful Grand Model reached a top-5 classification error of 4.27% (single crop) on the validation set.
5. Given each image, the class label predictions are produced in two steps. First, multiple crops at 8 scales are generated. Predictions are respectively made on these crops, which are subsequently combined via a novel scheme called selective pooling. The multi-crop predictions generated by individual models are finally integrated to reach the final prediction. In particular, we explored two different integration strategies, namely ensemble-net (a two-layer neural-network designed to integrate predictions) and class-dependent model reweighting. With these ensemble techniques, we reached a top-5 classification error below 2.8% on the validation set.
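Selective pooling, the ensemble-net, and class-dependent reweighting are the authors' own schemes and are not reproduced here; the sketch below shows only the plain multi-crop, multi-scale, multi-model averaging baseline that such schemes replace (crop handling is illustrative):

```python
import numpy as np

def ensemble_predict(models, crops_per_scale):
    """models: callables mapping a (num_crops, ...) batch of crops to
    (num_crops, 1000) softmax scores as NumPy arrays.
    crops_per_scale: one crop batch per scale (8 scales in the description above).
    Plain averaging over crops, scales and models; selective pooling, the
    ensemble-net and class reweighting would replace these simple means."""
    per_model = []
    for model in models:
        scale_preds = [model(crops).mean(axis=0) for crops in crops_per_scale]
        per_model.append(np.mean(scale_preds, axis=0))      # (1000,) per model
    fused = np.mean(per_model, axis=0)
    return fused.argsort()[::-1][:5]                        # top-5 class indices
```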
Localization
—————–
Our localization framework is a pipeline comprised of Region Proposal Networks (RPN) and R-CNN models.
1. We trained two RPNs with different design parameters based on ResNet.
2. Given an image, 300 bounding box proposals are derived based on the RPNs, using multi-scale NMS pooling.
3. We also trained four R-CNN models, respectively based on ResNet-101, ResNet-269, Extended IR-v2, and one of our Grand Models. These R-CNNs are used to predict how likely a bounding box belongs to each class as well as to refine the bounding box (via bounding box regression).
4. The four R-CNN models form an ensemble. Their predictions (on both class scores and refined bounding boxes) are integrated via average pooling. Given a class label, the refined bounding box with the highest score for that class is used as the result.
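A schematic of step 4, assuming each R-CNN in the ensemble returns, for the same 300 proposals, class scores of shape (300, K) and class-specific regressed boxes of shape (300, K, 4):

```python
import numpy as np

def localize(rcnn_scores, rcnn_boxes, class_id):
    """rcnn_scores: list of (300, K) class-score arrays, one per R-CNN model.
    rcnn_boxes:  list of (300, K, 4) class-specific regressed boxes.
    Average both across the ensemble, then return the refined box whose
    averaged score for class_id is highest."""
    scores = np.mean(rcnn_scores, axis=0)      # (300, K)
    boxes = np.mean(rcnn_boxes, axis=0)        # (300, K, 4)
    best = scores[:, class_id].argmax()
    return boxes[best, class_id], scores[best, class_id]
```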
Deep Learning Framework
—————–
Both our classification and localization frameworks are implemented using Parrots, a new Deep Learning framework developed internally by ourselves (from scratch). Parrots is featured with a highly scalable distributed training scheme, a memory manager that supports dynamic memory reuse, and a parallel preprocessing pipeline. With this framework, the training time is substantially reduced. Also, with the same GPU memory capacity, much larger networks can be accommodated.
References
—————–
[1] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning”. arXiv:. 2016.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv:. 2016.
[3] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Weinberger. “Deep Networks with Stochastic Depth”. arXiv:. 2016.
Affiliation: The Chinese University of Hong Kong, SenseTime Group Limited
Compared with CUImage submission in ILSVRC 2015, the new components are as follows.
(1) The models are pretrained for 1000-class object detection task using the approach in [a] but adapted to the fast-RCNN for faster detection speed.
(2) The region proposal is obtained using the improved version of CRAFT in [b].
(3) A GBD network [c] with 269 layers is fine-tuned on 200 detection classes with the gated bidirectional network (GBD-Net), which passes messages between features from different support regions during both feature learning and feature extraction. The GBD-Net is found to bring ~3% mAP improvement on the baseline 269 model and ~5% mAP improvement on the Batch normalized GoogleNet.
(4) For handling their long-tail distribution problem, the 200 classes are clustered. Different from the original implementation in [d] that learns several models, a single model is learned, where different clusters have both shared and distinguished feature representations.
(5) Ensemble of the models using the approaches mentioned above leads to the final result in the provided data track.
(6) For the external data track, we propose object detection with landmarks. Compared to the standard bounding-box-centric approach, our landmark-centric approach provides more structural information and can be used to improve both the localization and classification steps in object detection. Based on the landmark annotations provided in [e], we annotate 862 landmarks from 200 categories on the training set. Then we use them to train a CNN regressor to predict the landmark position and visibility for each proposal in testing images. In the classification step, we use landmark pooling on top of the fully convolutional network, where features around each landmark are mapped to a confidence score of the corresponding category. The landmark-level classification can be naturally combined with standard bounding-box-level classification to get the final detection result.
(7) Ensemble of the models using the approaches mentioned above leads to the final result in the external data track. The fastest publicly available multi-GPU Caffe code is our strong support [f].
[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C. Loy, X. Tang, “DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection,” CVPR 2015.
[b] Yang, B., Yan, J., Lei, Z., Li, S. Z. “Craft objects from images.” CVPR 2016.
[c] X. Zeng, W. Ouyang, B. Yang, J. Yan, X. Wang, “Gated Bi-directional CNN for Object Detection,” ECCV 2016.
[d] Ouyang, W., Wang, X., Zhang, C., Yang, X. Factors in Finetuning Deep Model for Object Detection with Long-tail Distribution. CVPR 2016.
[e] Wanli Ouyang, Hongyang Li, Xingyu Zeng, and Xiaogang Wang, “Learning Deep Representation with Large-scale Attributes”, In Proc. ICCV 2015.
[f] /yjxiong/caffe
The Chinese University of Hong Kong, SenseTime Group Limited
(1) The models are pretrained for 200-class detection task using the approach in [a] but adapted to the fast-RCNN for faster detection speed.
(2) The region proposal is obtained by a separately-trained ResNet-269 model.
(3) A GBD network [b] with 269 layers is fine-tuned on 200 detection classes of the DET task and then on the 30 classes of the VID task. It passes messages between features from different support regions during both feature learning and feature extraction. The GBD-Net is found to bring ~3% mAP improvement on the baseline 269 model.
(4) Based on detection boxes of individual frames, tracklet proposals are efficiently generated by trained bounding box regressors. An LSTM network is integrated into the network to learn temporal appearance variation.
(5) Multi-context suppression and motion-guided propagation in [c] are utilized to post-process the per-frame detection results. They result in a ~3.5% mAP improvement on the validation set.
(6) Ensemble of the models using the approaches mentioned above leads to the final result in the provided data track.
(7) For the VID with tracking task, we modified an online multiple object tracking algorithm [d]. The tracking-by-detection algorithm utilizes our per-frame detection results and generates tracklets for different objects.
The fastest publicly available multi-GPU caffe code is our strong support [e].
[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C. Loy, X. Tang, “DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection,” CVPR 2015.
[b] X. Zeng, W. Ouyang, B. Yang, J. Yan, X. Wang, “Gated Bi-directional CNN for Object Detection,” ECCV 2016.
[c] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, W. Ouyang, “T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos”, arXiv:
[d] J. H. Yoon, C.-R. Lee, M.-H. Yang, K.-J. Yoon, “Online Multi-Object Tracking via Structural Constraint Event Aggregation”, CVPR 2016
[e] /yjxiong/caffe
Deep Cognition Labs
Mandeep Kumar, Deep Cognition Labs
Krishna Kishore, Deep Cognition Labs
Rajendra Singh, Deep Cognition Labs
We present these results for the scene parsing task; they were acquired using a modified DeepLab VGG16 network along with a CRF.
DEEPimagine
Sung-soo Park (DEEPimagine corp.)
Hyoung-jin Moon (DEEPimagine corp.)
Contact email:
1. Model design
– Wide Residual SWAPOUT network
– Inception Residual SWAPOUT network
– We focused on the model multiplicity with many shallow networks
- We adopted a SWAPOUT architecture
2. Ensemble
– Fully convolutional dense crop
- Variant parameter model ensemble
[1] “Swapout: Learning an ensemble of deep architectures”
Saurabh Singh, Derek Hoiem, David Forsyth
[2] “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning”
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi
[3] “Deep Residual Learning for Image Recognition”
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Heechul Jung*(DGIST/KAIST), Youngsoo Kim*(KAIST), Byungju Kim(KAIST), Jihun Jung(DGIST), Junkwang Kim(DGIST), Junho Yim(KAIST), Min-Kook Choi(DGIST), Yeakang Lee(KAIST), Soon Kwon(DGIST), Woo Young Jung(DGIST), Junmo Kim(KAIST)
* indicates equal contribution.
We basically use nine networks. Networks consist of one 200-layer ResNet, one Inception-ResNet v2, one Inception v3 Net, two 212-layer ResNets and four Branched-ResNets.
Networks are trained for 95 epochs except Inception-ResNet v2 and Inception v3.
Ensemble A takes an average of one 212-layer ResNet, two Branched-ResNets and one Inception-ResNet v2.
Ensemble B takes a weighted sum over one 212-layer ResNet, two Branched-ResNets and one Inception-ResNet v2.
Ensemble C takes an average of one 200-layer ResNet, two 212-layer ResNets, two Branched-ResNets, one Inception v3 and one Inception-ResNet v2. It achieves a top-5 error rate of 3.16% for 20000 validation images.
Ensemble D takes an averaged result on all nine networks. We submit only classification results.
References:
[1] He, Kaiming, et al. “Deep residual learning for image recognition.” arXiv preprint arXiv: (2015).
[2] He, Kaiming, et al. “Identity mappings in deep residual networks.” arXiv preprint arXiv: (2016).
[3] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. “Inception-v4, inception-resnet and the impact of residual connections on learning.” arXiv preprint arXiv: (2016).
[4] Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” arXiv preprint arXiv: (2015).
[5] Sermanet, Pierre, et al. “Overfeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv: (2013).
[6] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv: (2015).
Acknowledgement
– DGIST was funded by the Ministry of Science, ICT and Future Planning.
– KAIST was funded by Hanwha Techwin CO., LTD.
DGIST-KAIST
Heechul Jung(DGIST/KAIST), Jihun Jung(DGIST), Junkwang Kim(DGIST), Min-Kook Choi(DGIST), Soon Kwon(DGIST), Junmo Kim(KAIST), Woo Young Jung(DGIST)
We basically use an ensemble model of state-of-the-art architectures [1,2,3,4] as follows:
[1] He, Kaiming, et al. “Deep residual learning for image recognition.” arXiv preprint arXiv: (2015).
[2] He, Kaiming, et al. “Identity mappings in deep residual networks.” arXiv preprint arXiv: (2016).
[3] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. “Inception-v4, inception-resnet and the impact of residual connections on learning.” arXiv preprint arXiv: (2016).
[4] Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” arXiv preprint arXiv: (2015).
We train five deep neural networks: two 212-layer ResNets, a 224-layer ResNet, an Inception-v3, and an Inception-ResNet-v2. The models are linearly combined by a weighted sum of class probabilities, with the weights chosen on the validation set to obtain an appropriate contribution from each model (see the sketch below).
- This work was funded by the Ministry of Science, ICT and Future Planning.
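A minimal sketch of the weighted probability combination described above; the weight fitting on the validation set is shown as a simple accuracy-driven greedy search, which is an assumption about the exact procedure:

```python
import numpy as np

def weighted_ensemble(val_probs, val_labels, test_probs, step=0.05):
    """val_probs / test_probs: lists of (N, C) class-probability arrays, one per
    model. Adjust per-model weights to raise top-1 accuracy on the validation
    set, then apply the weights to the test predictions."""
    m = len(val_probs)
    weights = np.ones(m) / m

    def acc(w):
        fused = sum(wi * p for wi, p in zip(w, val_probs))
        return (fused.argmax(axis=1) == val_labels).mean()

    best = acc(weights)
    for i in range(m):                       # one greedy pass over the models
        for delta in (step, -step):
            trial = weights.copy()
            trial[i] = max(trial[i] + delta, 0.0)
            if acc(trial) > best:
                weights, best = trial, acc(trial)
    fused_test = sum(wi * p for wi, p in zip(weights, test_probs))
    return weights, fused_test.argmax(axis=1)
```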
DPAI Vison
Object detection: Chris Li, Savion Zhao, Bin Liu, Yuhang He, Lu Yang, Cena Liu
Scene classification: Lu Yang, Yuhang He, Cena Liu, Bin Liu, Bo Yu
Scene parsing: Bin Liu, Lu Yang, Yuhang He, Cena Liu, Bo Yu, Chris Li, Xiongwei Xia
Object detection from video: Bin Liu, Cena Liu, Savion Zhao, Yuhang He, Chris Li
Object detection: Our method is based on Faster R-CNN and an extra classifier. (1) Data processing: data equalization by deleting many examples in three dominating classes (person, dog, and bird), and adding extra data for classes with training data less than 1000. (2) COCO pre-training. (3) Iterative bounding box regression, multi-scale (train/test), and random image flipping (train/test). (4) Multi-model ensemble: ResNet-101 and Inception-v3. (5) An extra classifier with 200 classes, which helps to promote recall and refine the detection scores of the final boxes (a sketch follows below).
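A hedged sketch of step (5), where detection scores are refined with an independent image-level classifier over the 200 DET classes; the geometric-mean combination is an assumption, not the team's stated formula:

```python
import numpy as np

def rescore_with_classifier(det_scores, det_classes, image_cls_probs):
    """det_scores: (N,) detector confidences; det_classes: (N,) integer class ids;
    image_cls_probs: (200,) probabilities from the extra whole-image classifier.
    Boost or suppress each box according to how likely its class is present."""
    return np.sqrt(det_scores * image_cls_probs[det_classes])
```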
[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:, 2015.
[2] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems.
Scene classification: We trained the models with Caffe [1] and use an ensemble of Inception-V3 [2] and Inception-V4 [3]; in total we integrated four models. The top-1 error on validation is 0.431 and the top-5 error is 0.129. The single model is a modified Inception-V3 [2], with a top-1 error of 0.434 and a top-5 error of 0.133 on validation.
[1] Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:. 2014.
[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:, 2015.
[3] C. Szegedy, S. Ioffe, V. Vanhoucke. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv preprint arXiv:, 2016.
Scene parsing: We trained 3 models on a modified DeepLab [1] (Inception-v3, ResNet-101, ResNet-152) and only used the ADEChallengeData2016 [2] data. Multi-scale inputs, image crops, image flipping, and contrast transformations are used for data augmentation, and a dense CRF is used as post-processing to refine object boundaries. On validation, combining the 3 models achieves 0.3966 mIoU and 0.7924 pixel accuracy.
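For reference, the two reported numbers (mIoU and pixel accuracy) follow the standard definitions and can be computed from a per-class confusion matrix, as in the generic sketch below (not the authors' code).

```python
import numpy as np

def parsing_metrics(conf):
    """Compute pixel accuracy and mean IoU from a confusion matrix.

    conf[i, j] = number of pixels with ground-truth class i predicted as class j.
    """
    pixel_acc = np.diag(conf).sum() / conf.sum()
    union = conf.sum(axis=0) + conf.sum(axis=1) - np.diag(conf)
    valid = union > 0                 # ignore classes absent from GT and prediction
    iou = np.diag(conf)[valid] / union[valid]
    return pixel_acc, iou.mean()
```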
[1] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:, 2016.
Object detection from video: Our method is based on Faster R-CNN plus an extra classifier. We train Faster R-CNN with ResNet-101 on the provided training data. We also train an extra classifier over the 30 classes, which helps to improve recall and refine the detection scores of the final boxes.
Savion DP.co
We fine-tune the detection models using the DET training set and the val1 set. The val2 set is used for validation.
Data processing to nearly equal amounts: since some categories have many more images than others, we process the initial data so that each category has a roughly equal number of images.
We use a ResNet-101 Faster R-CNN model. The networks are pre-trained on the 1000-class ImageNet classification set and fine-tuned on the DET data.
Box refinement: in Faster R-CNN, the final output is a regressed box that differs from its proposal box. For inference, we therefore pool a new feature from the regressed box and obtain a new classification score and a new regressed box. We combine these 300 new predictions with the original 300 predictions. Non-maximum suppression (NMS) is applied to the union set of predicted boxes using an IoU threshold of 0.3 (see the sketch after this list).
Multi-scale testing: in our current implementation, we compute conv feature maps on an image pyramid, where the image's shorter sides are 300, 450, and 600.
Multi-scale anchors: we add two anchor scales to the original anchor scales of Faster R-CNN.
Test-time flipping: we flip each image and combine the results with those from the original image.
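A generic sketch of the NMS step mentioned above (merging the original and regressed predictions with an IoU threshold of 0.3) is given below, assuming boxes in [x1, y1, x2, y2] format with per-box scores; it is illustrative, not the team's implementation.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression over a union set of predictions.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array.
    Returns indices of the boxes that are kept.
    """
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

# Hypothetical merging of the 300 original and 300 regressed predictions:
# all_boxes = np.vstack([orig_boxes, refined_boxes])
# all_scores = np.concatenate([orig_scores, refined_scores])
# kept = nms(all_boxes, all_scores, iou_thresh=0.3)
```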
We use 5 models with different input scales and different network structures as basic models. They are derived from GoogleNet, VGGNet and ResNet.
We also utilize the idea of dark knowledge [1] to train several specialist models, and use these specialist models to reassign probability scores and refine the basic outputs.
Our final results are based on the ensemble of refined outputs.
[1] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:, 2015.
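For background on the dark-knowledge step, the sketch below shows the soft-target distillation loss of [1]: cross-entropy against the teacher's temperature-softened probabilities, mixed with the ordinary hard-label loss. The temperature and mixing weight here are illustrative defaults; the specialist-training details of this entry are not reproduced.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target distillation loss in the style of Hinton et al. [1]."""
    n = student_logits.shape[0]
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T) + 1e-12)
    # soft term is scaled by T^2 so its gradient magnitude matches the hard term
    soft = -(p_teacher * log_p_student_T).sum(axis=1).mean() * (T * T)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    hard = -log_p_student[np.arange(n), labels].mean()
    return alpha * soft + (1 - alpha) * hard
```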
Cheng Zhou
Li Jiancheng
Lin Zhihui
Lin Zhiguan
All are from Tsinghua University, Graduate School at Shenzhen, Lab F205, China.
Our team has five student members from Tsinghua University, Graduate School at Shenzhen, Lab F205, China. We participated in two sub-tasks of the ILSVRC2016 & COCO challenge: Scene Parsing and Object Detection from Video. This is the first time we have taken part in this competition.
Two of the members focused on Scene Parsing. They mainly applied several model fusion algorithms to well-known and effective CNN models such as ResNet [1], FCN [2] and DilatedNet [3, 4], and used a CRF to capture more context and improve the classification accuracy and mean IoU. Since the images are large, each image is downsampled before being fed to the network. In addition, we used vertical mirroring for data augmentation. The Places2 scene classification 2016 pretrained model was used to fine-tune ResNet-101 and FCN, while DilatedNet was fine-tuned from the Places2 scene parsing 2016 pretrained model [5]. Late fusion and the CRF were added at the end.
For object detection from video, the biggest challenge is that there are more than 2 million images at very high resolution in total. We did not consider Fast R-CNN [6]-like models, since they would require much more training and testing time. Instead we chose SSD [7], an effective and efficient framework for object detection. We used ResNet-101 as the base model, although it is slower than VGGNet [8]; at test time it achieves about 10 FPS on a single GTX TITAN X GPU. However, there are more than 700 thousand images in the test set, so testing took a long time. For the tracking task we have a dynamic adjustment algorithm, but it needs a ResNet-101 model to score each patch and runs at less than 1 FPS, so we could not apply it to the test set. For the submission, we used a simple method to filter noisy proposals and track the objects.
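The entry does not describe its "simple method" any further; purely as an illustration of one such simple approach, the sketch below links per-frame detections into tracks by IoU with the previous frame and drops short, unmatched tracks. It assumes boxes in [x1, y1, x2, y2] format and is not the team's actual method.

```python
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def link_tracks(frames, iou_thresh=0.5, min_len=3):
    """Greedily link per-frame detections into tracks and drop short ones.

    frames: list over time; each element is a list of boxes for that frame.
    Returns a list of tracks, each a list of (frame_index, box) pairs.
    """
    tracks = []
    for t, boxes in enumerate(frames):
        for box in boxes:
            best, best_iou = None, iou_thresh
            for tr in tracks:
                last_t, last_box = tr[-1]
                if last_t == t - 1 and iou(last_box, box) >= best_iou:
                    best, best_iou = tr, iou(last_box, box)
            if best is not None:
                best.append((t, box))
            else:
                tracks.append([(t, box)])
    return [tr for tr in tracks if len(tr) >= min_len]
```

References: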
[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:, 2015.
[2] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:, 2016.
[4] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[5] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:
[6] Girshick R. Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision.
[7] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint arXiv:, 2015.
[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:, 2014.
Faceall-BUPT
Xuankun HUANG, BUPT, CHINA
Jiangqi ZHANG, BUPT, CHINA
Zhiqun HE, BUPT, CHINA
Junfei ZHUANG, BUPT, CHINA
Zesang HUANG, BUPT, CHINA
Yongqiang Yao, BUPT, CHINA
Kun HU, BUPT, CHINA
Fengye XIONG, BUPT, CHINA
Hongliang BAI, Beijing Faceall co., LTD
Wenjian FENG, Beijing Faceall co., LTD
Yuan DONG, BUPT, CHINA
# Classification/Localization
We trained ResNet-101, ResNet-152 and Inception-v3 models for object classification. Multi-view testing and a model ensemble are used to generate the final classification results.
For the localization task, we trained a Region Proposal Network (RPN) to generate proposals for each image, and fine-tuned two models with object-level annotations for the 1,000 classes; a background class is also added to the network. At test time, the RPN produces 300 regions per image, and each region is classified by the fine-tuned model into one of the 1,001 classes. The final bounding box is generated by merging the bounding rectangles of three regions.
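A minimal sketch of that final merging step (the bounding rectangle enclosing several regions), assuming boxes in [x1, y1, x2, y2] format; the names are illustrative, not the team's code.

```python
import numpy as np

def enclosing_box(boxes):
    """Return the smallest rectangle containing all given boxes.

    boxes: (N, 4) array of [x1, y1, x2, y2] regions, e.g. the top-3
    proposals for the predicted class.
    """
    boxes = np.asarray(boxes)
    return [boxes[:, 0].min(), boxes[:, 1].min(),
            boxes[:, 2].max(), boxes[:, 3].max()]
```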
# Object detection
We utilize Faster R-CNN with the publicly available ResNet-101. On top of the baseline, we adopt multi-scale RoIs to obtain features containing richer context information. For testing, we use 3 scales and merge the results using the simple strategy introduced last year. No validation data is used for training, and flipped images are used in onl
