Evaluating Self-Supervised Vision Transformers For Traffic-Sign Classification Using The “DINO” Method
Computer vision plays an essential role in perceiving the surrounding environment in the field of autonomous driving. One of the main focuses of this field is the reliable detection and classification of traffic signs, in order to abide by traffic laws and provide a safe autonomous product. Many machine-learning-based approaches exist for this detection and classification task. In this thesis, a self-supervised method for training a vision transformer, a recent deep learning architecture, called “self-distillation with no labels” (DINO) is discussed and evaluated on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. Moreover, the method is evaluated using a small dataset from a prototype vehicle at the Dahlem Center for Machine Learning. In total, three models are evaluated: a model pretrained on ImageNet-1K; a model initialized with the weights of the first model and further trained on the GTSRB dataset; and a model trained from scratch exclusively on the GTSRB dataset. With each of these models, a k-NN classification over the 43 classes of the GTSRB dataset is performed. The first model achieves average precision and recall of 88.61% and 84.06%, respectively. The second model yields better averages of 97.77% precision and 96.37% recall. The third model performs comparatively worse, with average precision and recall of 77.91% and 72.46%.
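The k-NN evaluation protocol mentioned above can be sketched as follows. This is a minimal illustration, not the thesis code: in the actual evaluation the feature vectors would come from a frozen DINO-trained vision transformer backbone, whereas here randomly generated class-clustered vectors stand in for those embeddings, and the embedding dimension of 384 (ViT-Small) is an assumption.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
num_classes, dim = 43, 384  # 43 GTSRB classes; 384-dim embeddings (assumed ViT-Small)

# Placeholder for DINO features: one cluster center per traffic-sign class,
# with small Gaussian noise standing in for per-image embedding variation.
centers = rng.normal(size=(num_classes, dim))

def make_split(n_per_class):
    X = np.concatenate(
        [c + 0.1 * rng.normal(size=(n_per_class, dim)) for c in centers]
    )
    y = np.repeat(np.arange(num_classes), n_per_class)
    return X, y

X_train, y_train = make_split(20)
X_test, y_test = make_split(5)

# Weighted k-NN over the frozen features; k=5 is an illustrative choice.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Macro-averaged precision and recall, i.e. averaged over the 43 classes,
# matching the per-class averages reported in the evaluation.
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")
print(f"macro precision: {precision:.4f}, macro recall: {recall:.4f}")
```

Swapping the synthetic `make_split` for real embeddings extracted from the frozen backbone reproduces the evaluation structure used for all three models.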