January 2020
tl;dr: Generate relationship graphs based on associative embedding.
This paper is based on and is almost concurrent with associative embedding.
The object heatmap already has some of the preliminary structure of CenterNet and may have inspired it.
The paper proposed a technique for supervising an unordered set of network outputs.
Note the difference between feature vectors and embeddings. In this paper, the feature vectors are those generated by the Stacked Hourglass backbone, and the embeddings are 8-dim vectors, generated by fc layers, that represent IDs.
The biggest difference between the AE paper and pixel-to-graph:
- AE paper only defines clusters of keypoints. There is no connection between clusters.
- pixel-to-graph can represent arbitrary edges between two nodes.
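To make the contrast concrete: because every edge carries its own source and destination ID embeddings, an edge can be attached to any pair of vertices by looking up the closest vertex embedding. Below is a minimal sketch of that lookup, assuming nearest-neighbor matching in Euclidean distance; the function name `resolve_edges` and the toy 2-dim embeddings are hypothetical and only for illustration (the paper uses 8-dim embeddings).

```python
import math

def resolve_edges(vertex_embs, edges):
    """Map each edge to a (src_index, dst_index) pair of vertices.

    vertex_embs: list of ID embeddings, one per detected vertex.
    edges: list of (src_emb, dst_emb) pairs predicted for each relation.
    Assumption: an endpoint refers to the vertex with the nearest embedding.
    """
    def nearest(emb):
        # Index of the vertex whose embedding is closest to `emb`.
        return min(range(len(vertex_embs)),
                   key=lambda i: math.dist(vertex_embs[i], emb))

    return [(nearest(src), nearest(dst)) for src, dst in edges]


# Toy example with three well-separated vertices.
vertices = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
relations = [([0.5, -0.2], [9.1, 0.3]),   # vertex 0 -> vertex 1
             ([0.2, 9.7], [0.1, 0.1])]    # vertex 2 -> vertex 0
resolved = resolve_edges(vertices, relations)
```

The result is a list of directed index pairs, so the same vertex can appear in any number of edges, which a pure clustering scheme cannot express.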
This paper is full of tricks to train neural networks!
- Two heatmaps to predict vertices (detections) and edges (relations).
- Individual feature vectors are extracted from top heatmap locations.
- Feed into fc layers to predict vertex properties (class ID, bbox, ID) and edge properties (class ID, src ID, dst ID).
- The IDs are represented in 8-dim embeddings.
- The graph is a directed graph as many relations are directed (larger than, on top of, etc)
- Loss on the ID embeddings:
- pull: an L2 loss that pulls together the embedding of one vertex and the source/target embeddings of all edges connected to that vertex.
- push: a hinge loss that pushes the embeddings of different vertices apart (changed from a Gaussian loss to improve convergence)
- There may be multiple objects and relations per pixel, so 3 object slots and 6 relation slots are defined per output pixel (the output resolution may be lower than the input resolution). It is hard to enforce a fixed mapping between slots and ground truth, but the correct loss can still be applied by using Hungarian matching.
- Reference vector: the one-hot class and bbox anchor of a given GT object (similar for relationship prediction). Hungarian matching against the outputs finds the best match, and the correct loss can then be applied. <-- This seems to have much to do with the initial seed of the training process. Hungarian matching seems to pick out small initial advantages and reinforce them, so the outputs converge better to the GT.
- Relation/edges are grounded in the middle of the vertices (bbox centers).
- Output resolution: different detections can share the same pixel. The higher the output resolution, the smaller the chance of center collision.
- The output dimension d is increased to 8 from 1 in the original AE paper. This improves convergence.
- The feature vector from one pixel can generate multiple object ID/prediction/embeddings.
- Prior detections can be incorporated by formatting the object detections as a two-channel input, where one channel has a one-hot activation at the center of each bbox and the other provides the binary mask of the box. Multiple boxes can be displayed on these two channels, with the second channel indicating the union of their masks.
- If there are too many bboxes and this gets too crowded, then separate by bbox anchors. To reduce computation cost, these additional inputs are not integrated in the input layer but rather incorporated after several layers of convolution and pooling.
- Sparse supervision: the relationship annotations may not be exhaustive, so the relationship heatmap is not densely supervised. Negative locations are subsampled, first to alleviate the data imbalance problem, and second to avoid falsely penalizing unannotated detections.
- Missing annotation is also important for evaluation. Instead of using AP, they used Recall@K proposals.
- Questions and notes on how to improve/revise the current work
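The pull/push grouping loss from the notes above can be sketched in a few lines. This is an illustrative sketch, not the paper's code: the function names, the averaging over pairs, and the default `margin` value are assumptions made here for the example.

```python
import math

def pull_loss(vertex_emb, endpoint_embs):
    """Mean squared L2 distance between a vertex embedding and the
    source/target embeddings of the edges connected to that vertex."""
    if not endpoint_embs:
        return 0.0
    total = 0.0
    for emb in endpoint_embs:
        total += sum((a - b) ** 2 for a, b in zip(vertex_emb, emb))
    return total / len(endpoint_embs)

def push_loss(vertex_embs, margin=8.0):
    """Hinge penalty on every pair of distinct vertices whose embeddings
    are closer than `margin` (assumed hyperparameter), averaged over pairs."""
    total, pairs = 0.0, 0
    for i in range(len(vertex_embs)):
        for j in range(i + 1, len(vertex_embs)):
            dist = math.dist(vertex_embs[i], vertex_embs[j])
            total += max(0.0, margin - dist)  # zero once far enough apart
            pairs += 1
    return total / pairs if pairs else 0.0


# Toy 2-dim example (the paper uses 8-dim embeddings).
pull = pull_loss([0.0, 0.0], [[0.0, 0.0], [3.0, 4.0]])
push = push_loss([[0.0, 0.0], [3.0, 4.0]], margin=8.0)
```

The hinge term stops contributing gradient as soon as two vertices are separated by the margin, which is one plausible reason it is easier to optimize than a Gaussian-shaped penalty.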
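The slot assignment described in the notes above (matching GT reference vectors to the unordered output slots of a pixel) can be illustrated as follows. This sketch swaps in a brute-force search over permutations for the Hungarian algorithm; with only a handful of slots per pixel it finds the same minimum-cost assignment. The function name `match_slots`, the squared-distance cost, and the toy vectors are assumptions for illustration.

```python
from itertools import permutations

def match_slots(slot_outputs, references):
    """Assign each GT reference vector to a distinct output slot so that
    the total squared distance is minimal.

    Returns (best_cost, assignment) where assignment[r] is the slot index
    matched to reference r. Brute force stands in for Hungarian matching.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best_cost, best_perm = float("inf"), ()
    # Every ordered choice of len(references) distinct slots.
    for perm in permutations(range(len(slot_outputs)), len(references)):
        cost = sum(sq_dist(slot_outputs[s], references[r])
                   for r, s in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(best_perm)


# Three output slots at one pixel, two GT objects (toy one-hot classes).
slots = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
refs = [[1.0, 0.0], [0.0, 1.0]]
cost, assignment = match_slots(slots, refs)
```

Once the assignment is known, the class, bbox, and embedding losses are applied between each reference and its matched slot; unmatched slots receive no object target.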
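The two-channel encoding of prior detections mentioned in the notes above can be sketched like this. The function name `encode_boxes`, the inclusive integer box format `(x0, y0, x1, y1)`, and the rounding of the center are assumptions for this example.

```python
def encode_boxes(boxes, height, width):
    """Rasterize prior detections into two channels.

    centers: 1 at each box center, 0 elsewhere.
    mask:    1 wherever any box covers the pixel (union of box masks).
    boxes use inclusive integer pixel coordinates (x0, y0, x1, y1).
    """
    centers = [[0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2  # assumed rounding: floor
        centers[cy][cx] = 1
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                mask[y][x] = 1
    return centers, mask


# Two overlapping boxes on a 5 x 6 grid.
centers, mask = encode_boxes([(0, 0, 2, 2), (2, 2, 4, 3)], height=5, width=6)
```

Because the mask channel is a union, overlapping boxes become indistinguishable there, which is why the notes suggest separating crowded boxes by anchor.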