Hello people!
If you have been on the lookout, Google released the Gemini 2.0 Image model, and it’s far better than other AI models (well, most of the time). Images generated by this model often don’t have that weird AI feeling, and generation is really fast.
(But still, it falls short when I ask it to make a photo of mine look a bit better.)
Anyway, the way AI sees these images is rather interesting. It’s common knowledge that AI can’t see things the way we do, but it can still generate images, as well as edit them on request.
What happens is that AI doesn’t really understand the images it processes like a human brain does; instead, it breaks things down into numbers, which can be a pretty wild concept.
Here's a simple breakdown of how that works:
Pixels, Pixels everywhere!
Every image we see is made up of pixels.
Each pixel is just numbers. In digital images, those numbers represent color, brightness, and sometimes even transparency. For example, a red pixel might be [255, 0, 0] in RGB values, meaning full red, no green, no blue.
This means that when an AI “looks” at an image, all it sees is a long sequence of numbers describing each pixel’s color and brightness. Technically, it’s not looking at the image in the human sense; it’s just reading data, as always.
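To make that concrete, here’s a minimal Python sketch (assuming NumPy) of a hand-built 2x2 “image”, just to show that a picture really is nothing but a grid of numbers:

```python
import numpy as np

# A tiny 2x2 RGB "image": every pixel is just three numbers (red, green, blue).
image = np.array([
    [[255, 0, 0], [0, 255, 0]],      # top row: a red pixel, a green pixel
    [[0, 0, 255], [255, 255, 255]],  # bottom row: a blue pixel, a white pixel
], dtype=np.uint8)

print(image.shape)  # (2, 2, 3) -> height, width, 3 color channels
print(image[0, 0])  # [255   0   0] -> the "full red" pixel from above
```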
Breaking down the images
Well, AI can read the image, but how does it identify or generate the elements in it? That’s where things get interesting.
AI doesn’t look at the entire image all at once. Instead, it uses something called convolutional neural networks (CNNs), which work like scanners. Rather than staring at the whole picture, a CNN looks at small parts of it, one piece at a time, to understand what’s going on.
Think of it like examining a puzzle. Instead of seeing the finished picture, you focus on each piece to figure out where it fits. This process is called convolution (that’s where the name CNN comes from), and it helps the AI focus on the important features of the image (like edges or lines) without getting distracted by everything at once.
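If you want to see that sliding-window idea in code, here’s a rough Python sketch with a made-up scan_with_kernel helper (NumPy assumed). It’s nowhere near how a real CNN is implemented, but it shows the “one small piece at a time” mechanic:

```python
import numpy as np

def scan_with_kernel(image, kernel):
    """Slide a small kernel over the image, one patch at a time (hypothetical helper)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # one small "puzzle piece" of the picture
            out[i, j] = np.sum(patch * kernel)  # one number summarizing that piece
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # a toy 5x5 grayscale image
kernel = np.full((3, 3), 1 / 9)                   # a plain 3x3 averaging kernel

print(scan_with_kernel(image, kernel))  # a 3x3 map: one summary value per patch
```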
And once the CNN starts scanning the image, it’s looking for features.
Features are like patterns that the AI can use to understand what’s in the image. For instance, it might spot edges (places where the color changes suddenly), textures, or shapes, based on what it learned from its training data.
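And here’s roughly what a single hand-made edge detector looks like, assuming NumPy and SciPy. Real CNNs learn their kernels from training data instead of having someone write them by hand:

```python
import numpy as np
from scipy.signal import convolve2d

# A 5x5 grayscale patch: dark on the left, bright on the right.
patch = np.array([
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
], dtype=float)

# A hand-made vertical-edge kernel: it responds where color changes suddenly left-to-right.
edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

response = convolve2d(patch, edge_kernel, mode="valid")
print(response)  # large-magnitude values where the window crosses the edge, zero elsewhere
```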
(Ever seen something called a data annotator? Data annotators also help train AI models to recognize images by labeling them. For example, if the image is a cat, the annotator labels it as a cat along with its features, like “fur”, “whiskers” and so on. Agh, now I need to pet a cat.)
Breaking it down in depth
Just like it doesn’t look at every part of the image at once, it doesn’t extract every level of detail at once either.
In the first few layers, the AI focuses on the simplest patterns, like colors or edges. But as the image moves through the deeper layers of the network, the features it spots become more complex, combining those basic patterns into objects (or parts of objects).
This helps the AI understand the image in increasing detail, making sure it doesn’t miss any important parts. The method isn’t foolproof, but it still works great.
And by the time the image reaches the final layer, the AI can make an educated guess about what it represents, based on the features it has identified. Ta-da!
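To make the layer-by-layer idea concrete, here’s a minimal, made-up CNN sketch using PyTorch. The layer sizes and the 10-class output are assumptions for illustration, but the shape of the idea is the same: simple layers first, a final guess at the end:

```python
import torch
from torch import nn

# Early conv layers pick up simple patterns; the deeper one combines them;
# the final linear layer turns the gathered features into a guess over 10 classes.
tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: edges, color blobs
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: combinations of those patterns
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # final layer: the "educated guess"
)

fake_image = torch.rand(1, 3, 32, 32)  # one random 32x32 RGB "image"
scores = tiny_cnn(fake_image)
print(scores.shape)                    # torch.Size([1, 10]) -> one score per class
```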
Summing up
In short, AI processes images by breaking them down into pixels and then analyzing them layer by layer. And of course, it doesn’t “see” images; it reads the patterns and the data. Which simply means that AI can’t replace artists, nor should it.
See you next time!