Weakly Supervised Roadside Object Segmentation Using Maps
Summary
Publicly available detailed maps offer semantic information about objects that appear in street view images. We investigated whether this information can be exploited to automatically create object detection datasets on which state-of-the-art (SOTA) object detection methods can be trained. To accomplish this, we use the Dutch BGT (Basisregistratie Grootschalige Topografie) and DTB (Digitaal Topografisch Bestand) maps, which are detailed maps of the Netherlands containing the locations of a large number of street objects. Using this location information, we find nearby street view images and focus on the expected position of each object within the image. These images are then collected and labeled with the objects on the map, which allows us to automatically create a dataset with image-wide labels. We also investigated whether bounding box and semantic segmentation results could be obtained from this dataset. For this purpose, we used the weakly supervised learning technique of class activation maps (CAMs). CAMs indicate which features in an image lead to a positive or negative classification and give a rough location of those features. To improve the accuracy of the CAMs, we investigated two new methods. First, because we used a ResNet-50 network as our SOTA object detection method, we have an additional point from which CAMs can be generated. Normally, CAMs are generated from the last convolutional layer of a convolutional neural network (CNN). In the ResNet-50 network, however, an additional layer can also be exploited to generate CAMs, namely the last ’add’ layer of the network. Second, we investigated whether depth information combined with the map information could be used to automatically generate approximate object masks that improve overall performance. We found that our method of automatically generating datasets from maps produces a large number of noisy labels.
This noise stems from mistakes in the maps, from changes in the environment, and from dynamic obstructions in the road environment, e.g. a truck blocking the view of an object. Even with these faults, we achieve decent image-wide detection performance, with an average F-score of 0.66 over 20 different objects and a score of 0.92 for the best-performing object. However, the bounding box and semantic segmentation results generated with CAMs were poor. Using the last ’add’ layer for CAM generation increased performance, but not enough to reach acceptable bounding box and semantic segmentation results. Furthermore, the automatically generated masks did not significantly improve results: although a 7% increase in performance was observed at the semantic segmentation level, this was far from what is required for usable results. We therefore conclude that, while the method shows promise, further research is required to obtain usable bounding box and semantic segmentation results. Our main suggestion for future work is to investigate how the method performs when less noise is present in the dataset.
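To make the CAM step more concrete, the following is a minimal sketch of how a class activation map and a rough bounding box could be derived from it. It assumes the classic CAM formulation, in which the feature maps of a chosen layer are combined using the classifier weights of one class; the summary does not state the exact formulation used, so this is an illustration rather than the method itself. The function names, the toy feature values, and the 0.5 threshold are all hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
def class_activation_map(features, class_weights):
    """Weighted sum over channels: cam[y][x] = sum_c w_c * features[y][x][c].

    features: H x W x C nested lists taken from the chosen layer
              (assumption: e.g. the last convolutional or last 'add' layer).
    class_weights: length-C classifier weights for a single class.
    """
    return [
        [sum(w * f for w, f in zip(class_weights, pixel)) for pixel in row]
        for row in features
    ]


def normalize(cam):
    """Rescale the map to [0, 1]; a constant map becomes all zeros."""
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / span for v in row] for row in cam]


def bounding_box(cam, threshold=0.5):
    """Return (row_min, col_min, row_max, col_max) of all cells at or above
    the threshold, or None if no cell qualifies. The threshold is a
    hypothetical choice, not a value taken from the text."""
    hits = [
        (r, c)
        for r, row in enumerate(cam)
        for c, v in enumerate(row)
        if v >= threshold
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 3 x 3 feature map with 2 channels (made-up values for illustration).
features = [
    [[0, 1], [0, 1], [0, 1]],
    [[0, 1], [4, 0], [3, 0]],
    [[0, 1], [0, 1], [0, 1]],
]
weights = [1.0, -0.5]  # hypothetical classifier weights for one class

cam = normalize(class_activation_map(features, weights))
box = bounding_box(cam)
print(box)
```

In this toy case the two strongly activated cells in the middle row determine the box. On real images the map has the low spatial resolution of the chosen layer and would first be upsampled to the image size, which is one reason boxes and masks derived from CAMs tend to be coarse.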