This study investigates the efficacy of modern deep learning architectures for fine-grained image classification, focusing on dog breed recognition. Using the Stanford Dogs Dataset, we evaluate the performance of Vision Transformer (ViT), VGG-16, and ResNet-50 models, aiming to surpass the benchmark set by Hsu (2015) with conventional convolutional neural networks (CNNs). The ViT architecture, adapted from transformers originally designed for natural language processing, represents a modern approach to image classification: it treats an image as a sequence of patch tokens rather than a grid of pixels. Our results show substantial accuracy gains over the baseline established by Hsu (2015): VGG-16 achieved 65% test accuracy, ResNet-50 achieved 84%, and, surprisingly, ViT outperformed both at 91%. These findings suggest that transformer architectures can handle smaller-scale datasets with fine-grained categories. The study contributes to the growing body of research on the viability of transformer models across image classification tasks and motivates further exploration as the architecture continues to evolve.
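The patch-tokenization step that distinguishes ViT from the CNNs above can be sketched as follows. This is a minimal NumPy illustration of turning an image into a token sequence (the step before ViT's learned linear projection and position embeddings); it is not the training code used in the study, and the function name and patch size are illustrative assumptions.

```python
import numpy as np

def image_to_patch_tokens(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patch tokens,
    mirroring the ViT patch-embedding step before the linear projection.
    (Hypothetical helper for illustration, not from the repository.)"""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "dims must divide evenly"
    # Reshape into a grid of non-overlapping patches...
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    # ...bring the patch-grid axes to the front...
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, cols, P, P, C)
    # ...and flatten each patch into a single token vector.
    return patches.reshape(-1, patch_size * patch_size * c)

# A standard 224x224 RGB input yields (224/16)^2 = 196 tokens of dim 16*16*3 = 768.
tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

The transformer then attends over all 196 tokens jointly, which is why ViT has no built-in locality bias and was expected to need large datasets; the 91% result above suggests it can still work well on a smaller fine-grained dataset.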
dlongert/dog_image_classification