Deep Learning for Object Relationships: Applications to Road Safety and Bin Picking

Koh, Auguste Lawrence Whelan

Files

Koh_Auguste.pdf (15.9 MB)

Date

2024-01-17

Authors

Koh, Auguste Lawrence Whelan

Advisor

Fieguth, Paul

Publisher

University of Waterloo

Abstract

Estimating the relationships between objects is fundamental to problems that require an understanding of a scene captured by a camera. This thesis explores the object-relationship theme in two contexts: (i) identifying the relative placement of objects for bin picking in potentially cluttered scenes, and (ii) estimating 3D distances between vehicles from wide-angle video. Bin picking generally refers to the task of picking up an object in a bin with a robotic arm, given some measurement of the scene. One of its challenges is that the objects in the scene may be piled up or cluttered. If the target object is partially occluded by other objects, it becomes difficult not only to detect the object, but also to decide how to clear the way to it, or how to grasp and lift it without damaging or excessively displacing neighbouring objects. This thesis presents a deep-learning module to help deal with potentially cluttered scenes. To account for neighbouring objects when estimating the relationship between two objects, a graph-network architecture was designed and implemented. The architecture relies on the bounding boxes and feature maps output by an upstream detector to estimate the relationships between pairs of detected objects. The initial edge and vertex attributes of the graph network are bounding-box coordinates (or values derived from them) and feature-map crops. In addition, definitions of precision and recall tailored to this problem are proposed for comparing a ground-truth graph to a predicted graph. Finally, the proposed architecture was evaluated against a baseline model on two existing datasets: one containing computer-rendered images and one containing real images.
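The abstract does not spell out the tailored precision and recall definitions, so the following is only a generic sketch of how a predicted relationship graph might be scored against a ground-truth graph. It assumes that relationships are directed edges between object indices, that a predicted edge counts as correct only when it matches a ground-truth edge exactly, and that an empty set scores 1.0 by convention. The name `edge_precision_recall` and the `Edge` type are hypothetical and do not come from the thesis.

```python
from typing import Set, Tuple

# Hypothetical representation: a directed edge (source, target) between object
# indices, e.g. "object `source` rests on object `target`". This is an
# assumption for illustration, not the encoding used in the thesis.
Edge = Tuple[int, int]


def edge_precision_recall(truth: Set[Edge], predicted: Set[Edge]) -> Tuple[float, float]:
    """Plain set-based precision and recall over directed edges.

    This is the standard definition, not the tailored one proposed in the thesis.
    """
    true_positives = len(truth & predicted)
    # Assumed convention: with nothing predicted (or nothing to find), score 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Three ground-truth relationships; the prediction gets one right and
# reverses the direction of another, which counts as a miss here.
truth = {(0, 1), (1, 2), (0, 2)}
predicted = {(0, 1), (2, 1)}
precision, recall = edge_precision_recall(truth, predicted)
```

In this toy case one of the two predicted edges is correct and one of the three ground-truth edges is recovered. A tailored definition could differ, for example in how reversed or partially correct relationships are treated.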
The problem of estimating distances between vehicles is motivated by the more general problem of estimating the risk of accidents at a given intersection or road segment. The number of traffic accidents per year in Canada, although generally decreasing, is still substantial. Estimating the risk of accidents at a given intersection or road segment could give municipalities actionable information for deciding which intersections or road segments to prioritize for improvement in order to increase road safety. To estimate this risk, tracking the number of close calls or near misses is preferable to merely tracking the number of accidents: it does not require the observer to wait for accidents to occur, and close calls are presumably much more frequent than actual accidents. To determine whether a close call has occurred, one could simply refer to the distance between two vehicles; although this is not a perfect metric for detecting close calls, it is a starting point that is simple and easy to interpret. As such, this thesis addresses the more specific problem of estimating distances between any two detected vehicles in wide-angle video. The wide-angle nature of the images or videos introduces a difficulty, as it challenges a core assumption of standard convolutional neural networks, that of translational equivariance. A size-estimation model that uses spherical convolutions was evaluated on a simple, artificial dataset, and the results showed that spherical convolutions outperformed standard planar convolutions in the tested scenario. In addition to this work, a deep-learning module that estimates distances between vehicles, given bounding-box coordinates and an image, is proposed. An ablation study was performed on this distance-estimating architecture; its results quantify the performance gain attributable to using pixel information in addition to bounding-box coordinates.
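As a rough illustration of the distance-based close-call idea described above, the sketch below flags pairs of vehicles whose Euclidean distance falls below a threshold. It assumes that 3D vehicle positions in metres are already available, which is the hard part that the thesis addresses with a learned model. The function name `close_pairs`, the vehicle identifiers, and the 6 m threshold are hypothetical and do not come from the thesis.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]  # assumed (x, y, z) coordinates in metres


def close_pairs(positions: Dict[str, Position], threshold_m: float) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, distance) for each vehicle pair closer than threshold_m."""
    flagged = []
    # Sort by vehicle id so the order of the output is deterministic.
    for (id_a, pos_a), (id_b, pos_b) in combinations(sorted(positions.items()), 2):
        distance = math.dist(pos_a, pos_b)  # Euclidean distance (Python 3.8+)
        if distance < threshold_m:
            flagged.append((id_a, id_b, distance))
    return flagged


# Hypothetical positions for one video frame.
positions = {
    "car_1": (0.0, 0.0, 0.0),
    "car_2": (3.0, 4.0, 0.0),
    "truck_1": (30.0, 0.0, 0.0),
}
flagged = close_pairs(positions, threshold_m=6.0)
```

Here only `car_1` and `car_2`, which are 5 m apart, fall under the threshold. As the abstract notes, distance alone is not a perfect close-call metric; a practical system would likely also consider quantities such as relative speed and heading.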

Keywords

deep learning, computer vision, object relationships, traffic safety, bin picking

URI

http://hdl.handle.net/10012/20238

Collections

Theses
Systems Design Engineering
