1.6 Computer Vision – How AI Learnt to See

The development of ‘learning to see’ in AI systems, known as computer vision, is a fascinating field that has made enormous progress in recent decades. Let’s trace this development chronologically and discuss the most important milestones.

1. Beginnings (1950s – 1970s):

The first attempts to teach computers to ‘see’ began in the 1950s. Marvin Minsky and Seymour Papert (1969) argued in their book ‘Perceptrons’ that simple neural networks were not capable of solving many important problems of pattern recognition [1]. This initially led to a slowdown in research in this area.

2. Basic algorithms (1970s – 1990s):

Important foundations were laid in the 1970s and 1980s:

  • Canny (1986) developed the Canny edge detector, a fundamental algorithm for detecting edges in images [2].
  • Lowe (1999) introduced SIFT (Scale-Invariant Feature Transform), a robust algorithm for recognising and describing local features in images [3].
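
The core of such edge detectors, an image-gradient estimate followed by thresholding, can be sketched in a few lines of NumPy. The following is a deliberately simplified illustration, not the full Canny pipeline (it omits Gaussian smoothing, non-maximum suppression and hysteresis thresholding); the function names are my own:

```python
import numpy as np

def sobel_gradients(img):
    """Convolve with 3x3 Sobel kernels to approximate the image gradients."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            gx[y, x] = np.sum(kx * patch)
            gy[y, x] = np.sum(ky * patch)
    return gx, gy

def simple_edges(img, threshold=1.0):
    """Mark pixels whose gradient magnitude exceeds a threshold."""
    gx, gy = sobel_gradients(img.astype(float))
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold

# Toy image: dark left half, bright right half -> one vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = simple_edges(img)
```

The thresholded gradient magnitude fires exactly along the brightness discontinuity, which is the basic signal every edge detector builds on.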

3. Rise of machine learning (1990s – 2000s):

During this phase, machine learning methods were increasingly used for image recognition tasks:

  • Viola and Jones (2001) developed a real-time face detection algorithm that used AdaBoost for feature selection and a cascade of simple classifiers [4].
  • Support Vector Machines (SVMs) were also widely used for classification tasks in image processing during this period.
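
One reason Viola and Jones achieved real-time speed is the integral image, which lets the sum of any rectangular region, and hence any Haar-like feature, be evaluated in constant time from four array lookups. A minimal NumPy sketch (the function names are my own):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; padded with a leading row/column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of any rectangle in constant time via four lookups."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
# A Haar-like feature: the difference between two adjacent rectangles.
feature = rect_sum(ii, 0, 0, 2, 2) - rect_sum(ii, 2, 0, 2, 2)
```

Because every feature costs only a handful of lookups, thousands of candidate features can be evaluated per image window, which is what makes the AdaBoost cascade fast enough for real time.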

4. Deep Learning Revolution (2010s – today):

The real breakthrough came with the rediscovery and improvement of deep learning techniques:

  • Krizhevsky et al. (2012) presented AlexNet, a deep convolutional neural network (CNN) that won the ImageNet competition by a large margin and heralded the deep learning revolution in computer vision [5].
  • He et al. (2015) presented ResNet, an architecture whose residual (skip) connections made it possible to train very deep networks and further improve accuracy [6].
  • Redmon et al. (2015) developed YOLO (You Only Look Once), a real-time object detection algorithm that frames detection as a single regression problem [7].
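
The central idea behind ResNet, learning a residual F(x) that is added back onto the input x, fits in a few lines. The following toy NumPy sketch uses my own naming and is not the original implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = x + F(x): the block only has to learn a correction to the identity."""
    return x + w2 @ relu(w1 @ x)

d = 4
rng = np.random.default_rng(0)
x = rng.normal(size=d)

# With zero-initialised weights the block is exactly the identity mapping,
# which is why stacking many residual blocks does not degrade the signal
# and very deep networks remain trainable.
w_zero = np.zeros((d, d))
y = residual_block(x, w_zero, w_zero)
```

The identity shortcut means gradients can flow unchanged through every block, sidestepping the vanishing-gradient problem that plagued earlier very deep networks.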

5. Current developments:

The latest advances in computer vision are impressive:

  • Self-supervised learning: Chen et al. (2020) introduced SimCLR, showing how powerful visual representations can be learnt without manual annotations [8].
  • Vision Transformers: Dosovitskiy et al. (2021) adapted the Transformer architecture, originally developed for NLP, to image processing tasks and achieved impressive results [9].
  • Multimodal models: Radford et al. (2021) presented CLIP, a model that can process images and text together and has a remarkable generalisation capability [10].
  • Generative models: Ramesh et al. (2022) presented DALL-E 2, a model that can generate high-quality images from text descriptions [11].
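
To make the CLIP idea concrete: the model embeds images and text into a shared space using two separate encoders, then scores image–text pairs by cosine similarity. The sketch below substitutes hand-picked toy vectors for real encoder outputs; everything here is purely illustrative:

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Pairwise cosine similarities between image and text embeddings."""
    a = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return a @ b.T

# Toy embeddings: in a real CLIP model these would come from the trained
# image and text encoders; here they are invented for illustration.
image_emb = np.array([[1.0, 0.0, 0.1],   # image of a dog
                      [0.0, 1.0, 0.1]])  # image of a cat
text_emb = np.array([[0.9, 0.1, 0.0],    # caption "a dog"
                     [0.1, 0.9, 0.0]])   # caption "a cat"

sims = cosine_similarity_matrix(image_emb, text_emb)
best_caption = sims.argmax(axis=1)  # best-matching caption per image
```

Ranking arbitrary candidate captions this way is what gives CLIP its zero-shot classification ability: no task-specific training is needed, only a list of textual class descriptions.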

6. Challenges and future directions:

Despite the impressive progress made, there are still many unresolved challenges:

  • Interpretability: There is a growing interest in the development of interpretable models, as discussed by Rudin (2019) [12].
  • Robustness: The susceptibility of deep learning models to adversarial attacks, as shown by Goodfellow et al. (2015), remains an important research topic [13].
  • Efficiency: The development of energy-efficient models for use on mobile devices and in real-time applications is an active area of research.
  • Ethical aspects: The increasing use of facial recognition and other surveillance technologies raises important ethical issues that need to be addressed.
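
Goodfellow et al.’s fast gradient sign method (FGSM) illustrates how small the required perturbation can be: each input dimension is shifted by ±ε in the direction that increases the loss. For a toy linear scorer the gradient is simply the weight vector, so the whole attack is a one-liner (the weights and inputs below are invented for illustration):

```python
import numpy as np

def fgsm_perturb(x, grad, epsilon):
    """Fast Gradient Sign Method: step of size epsilon along the sign of the
    loss gradient with respect to the input."""
    return x + epsilon * np.sign(grad)

# Toy linear "classifier": positive score -> class 1, negative -> class 0.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, 0.4])   # score = 0.3 - 0.2 + 0.2 = 0.3 -> class 1

# To push the score down, step against its gradient (which is just w).
x_adv = fgsm_perturb(x, -w, epsilon=0.25)

score_clean = w @ x             # positive: original prediction
score_adv = w @ x_adv           # negative: prediction flipped
```

Even though no coordinate moves by more than 0.25, the prediction flips, which is exactly the fragility that adversarial-robustness research tries to remove from deep models.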

To summarise, the development of ‘learning to see’ in AI systems has been an impressive journey from simple edge detectors to complex, multifunctional visual systems. Current systems are approaching or even surpassing human performance in many tasks. Future research is likely to focus on improving robustness, efficiency and interpretability, while at the same time opening up new application areas.


[1] Minsky, M., & Papert, S. (1969). Perceptrons: An introduction to computational geometry. MIT Press.

[2] Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 679-698.

[3] Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on computer vision (Vol. 2, pp. 1150-1157).

[4] Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition (Vol. 1, pp. I-I).

[5] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.

[6] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

[7] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2015). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779-788).

[8] Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (pp. 1597-1607). PMLR.

[9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2021). An image is worth 16×16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.

[10] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.

[11] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.

[12] Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206-215.

[13] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
