How does DeepSeek work: An inside look
A bit about what's going on behind the scenes, simplified.
Hello people!
Welcome to the Programming and Doodles blog. Today we’ll be talking about DeepSeek in depth, including its architecture and, most importantly, how it differs from OpenAI’s ChatGPT.
At a glance,
DeepSeek is an open-source large language model (or, as we call them, an LLM), developed by a Chinese AI research company. It’s designed to compete with models like OpenAI’s GPT series, especially through its latest R1 model.
Behind the scenes, it’s built with an MoE (Mixture of Experts) architecture, incorporating transformer layers for natural language processing.
It predicts multiple words at once (instead of one by one like ChatGPT and some other LLMs do), uses smarter memory tricks (summarizing key points instead of writing everything down as ChatGPT does), and is trained on both English and Chinese data (which might have been stolen, according to OpenAI’s CEO), making it strong in coding, math, and, in general, reasoning.
If any of the above jargon feels fancy, don’t worry, everything’s explained down below; keep reading.
The Architecture of DeepSeek
One of the reasons DeepSeek became so popular is that it’s not just another ChatGPT clone. It’s unique, and because of that uniqueness, it’s faster, cheaper, and more efficient.
One such unique feature is that DeepSeek-V3 has 671 billion total parameters, but only 37 billion parameters are activated per token (word).
Why does this matter? You see, this means it doesn’t use all of its resources at once; only the necessary parts of the model are used, which makes it faster and more efficient than other LLMs.
What are these parameters?
Think of parameters like this: you’re baking a cake, and the recipe calls for ingredients like flour, sugar, and eggs, but the exact amounts vary depending on the cake you want to make. Parameters are like the specific measurements of those ingredients. If you tweak the sugar amount, the cake might be sweeter; if you adjust the amount of flour, it might be denser or fluffier.
Similarly, in AI models, parameters control how the model processes information. They are the internal settings that the model adjusts during training to make better predictions or generate accurate responses.
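To make that concrete, here’s a minimal Python sketch (using PyTorch) that counts the parameters of a tiny toy network. This is purely an illustration of what “parameters” are; it has nothing to do with DeepSeek’s actual code.

```python
# A toy model, purely for illustration -- nothing from DeepSeek here.
import torch.nn as nn

tiny_model = nn.Sequential(
    nn.Linear(16, 32),  # 16*32 weights + 32 biases
    nn.ReLU(),
    nn.Linear(32, 4),   # 32*4 weights + 4 biases
)

# Every weight and bias is a "parameter" the model tunes during training.
total = sum(p.numel() for p in tiny_model.parameters())
print(total)  # 676 -- DeepSeek-V3 has roughly 671 billion of these
```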
1. Mixture of Experts (MoE) architecture
Most AI models, like the earlier versions of ChatGPT (GPT-3), use a monolithic transformer architecture at their core. This means that every single part of the model is active all the time, even when it’s not needed.
But here’s the catch: some newer models, like GPT-4, are also rumored to use a Mixture of Experts architecture. The difference? How they use it.
DeepSeek’s MoE design is hyper-specialized. Instead of treating every task like a five-alarm fire, it activates only the most relevant “experts” in its network for each input.
Simply put, in bullet points:
Instead of using all of its parameters at once, DeepSeek only activates a subset of experts for each task.
This (obviously) reduces computation waste and makes DeepSeek run faster and cheaper.
Think of this like a team of specialists: instead of asking every professor in a university about a math problem, you go to the math department, not the biology or psychology departments. A toy sketch of this routing is shown below.
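Here’s that routing idea as a toy PyTorch sketch. It’s my own simplified illustration, assuming a made-up 8-expert layer with top-2 routing; it is not DeepSeek’s actual implementation.

```python
# Toy Mixture-of-Experts routing (illustrative only, not DeepSeek's code):
# a router scores the experts for each input, and only the top-k experts run.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # scores each expert
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(n_experts)  # the "specialists"
        )
        self.top_k = top_k

    def forward(self, x):  # x: one token's hidden vector, shape (dim,)
        scores = self.router(x)
        weights, picked = torch.topk(scores.softmax(dim=-1), self.top_k)
        # Only the chosen experts compute; the rest stay idle, saving work.
        return sum(w * self.experts[int(i)](x) for w, i in zip(weights, picked))

out = ToyMoE()(torch.randn(64))
print(out.shape)  # torch.Size([64])
```

In this toy version, only two of the eight expert layers run for a given input; the other six do no work at all, which is where the compute savings come from.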
2. Multi-head Latent Attention (MLA)
DeepSeek uses Multi-head Latent Attention (MLA for short) instead of the standard multi-head self-attention that ChatGPT uses.
Simply put, this means that instead of keeping track of everything in memory, MLA compresses and stores only the most important details from past interactions.
Think of it as reading a book. DeepSeek doesn’t memorize every word and write it down; instead, it summarizes and stores the key ideas. ChatGPT, on the other hand, tries to memorize and write down every word, which makes it slower and less efficient in comparison. Well, that makes sense.
This also makes DeepSeek a better model for long conversations, as it doesn’t drift away from context and produce chaotic outputs when handling complex discussions.
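Here’s a heavily simplified sketch of the compression idea, with toy sizes I made up for illustration. Real MLA is considerably more involved than this; the point is only that a small latent summary gets cached instead of the full-size keys and values.

```python
# Illustrative sketch of the idea behind Multi-head Latent Attention
# (heavily simplified, not DeepSeek's actual MLA): cache a small "latent"
# summary of each past token and expand it only when attention needs it.
import torch
import torch.nn as nn

dim, latent_dim = 1024, 64                 # toy sizes, chosen for illustration

compress = nn.Linear(dim, latent_dim)      # big hidden state -> small latent
expand_k = nn.Linear(latent_dim, dim)      # latent -> key, when needed
expand_v = nn.Linear(latent_dim, dim)      # latent -> value, when needed

hidden = torch.randn(dim)                  # one past token's hidden state
cache_entry = compress(hidden)             # only 64 numbers stored, not 1024

# Later, while attending over the conversation history:
k, v = expand_k(cache_entry), expand_v(cache_entry)
```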
3. Multi-token prediction (MTP)
If you have read this article where I explain ChatGPT’s behind the scenes, you might remember that ChatGPT predicts one word at a time.
When we ask ChatGPT a question, it generates one token (roughly one word) at a time. Sure, its beautiful front end makes it look like it’s messaging with you like a real person in real time, but that’s not what’s really happening.
It’s also similar to the game of “20 Questions”, where you gradually build the answer based on each previous guess.
DeepSeek, on the other hand, uses Multi-Token Prediction (MTP), which predicts multiple words at once and also allows pre-planning sentences, making text generation smoother and faster compared to other models.
Think of it like typing on your phone: instead of predicting just the next word, it suggests entire phrases. Chaotic when it happens to us, yes, but in AI generation, this is much faster.
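Here’s a toy illustration of that idea in Python. The sizes and the two-heads setup are assumptions of mine purely for the sketch; this is not DeepSeek’s MTP module.

```python
# Toy sketch of multi-token prediction (illustrative, not DeepSeek's code):
# one shared hidden state feeds several small heads, each guessing a
# different future token, instead of a single next-token head.
import torch
import torch.nn as nn

vocab, dim, n_future = 50_000, 512, 2       # made-up toy sizes

hidden = torch.randn(dim)                   # the model's state after "The cat"
heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_future))

# Each head proposes a token one step further ahead (t+1, t+2, ...).
predictions = [head(hidden).argmax().item() for head in heads]
print(predictions)   # two token ids, drafted in a single pass
```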
4. FP8 Mixed Precision
One of the biggest challenges with training AI models is GPU memory and cost.
DeepSeek solves this by using FP8 mixed-precision training. That means it stores numbers in a smaller format (8-bit floating point instead of the usual 16- or 32-bit). This saves GPU memory, allowing training on less expensive hardware.
In my view, the best benefit of this method is that it allows DeepSeek to be trained with fewer resources than GPT-4, yet achieve similar performance.
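Here’s a rough way to see the memory saving in PyTorch. This is just a storage comparison (assuming a recent PyTorch build that ships the FP8 dtypes), not DeepSeek’s training code, which mixes precisions much more carefully than this.

```python
# Rough illustration of why lower precision saves memory:
# the same tensor stored in 32-, 16-, and 8-bit floating point.
# Needs a recent PyTorch version that includes FP8 dtypes.
import torch

weights = torch.randn(1_000_000)                        # ~4 MB in FP32

fp32 = weights.element_size()                           # 4 bytes per number
fp16 = weights.to(torch.float16).element_size()         # 2 bytes per number
fp8  = weights.to(torch.float8_e4m3fn).element_size()   # 1 byte per number

print(fp32, fp16, fp8)   # 4 2 1 -- FP8 quarters the memory of FP32
```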
5. Load balancing
A common problem with AI models is their uneven workload distribution. Some parts of the model work too hard, while others do nothing.
To fix this bottleneck, DeepSeek uses auxiliary-loss-free load balancing, which evenly distributes the workload and thereby prevents “traffic jams” in AI processing. It improves stability as well, avoiding sudden performance drops.
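Here’s a toy sketch of the bias-nudging idea described in the DeepSeek-V3 paper: each expert gets a bias added to its routing score, and the bias is adjusted so overworked experts become slightly less attractive and idle ones slightly more so. The constants and the exact update rule below are my own simplification for illustration, not the paper’s.

```python
# Toy sketch of "auxiliary-loss-free" load balancing (simplified, not
# DeepSeek's real code): a per-expert bias steers routing toward idle experts.
import torch

n_experts, top_k, step = 8, 2, 0.01
bias = torch.zeros(n_experts)                        # one bias per expert

def route(scores):
    global bias
    picked = torch.topk(scores + bias, top_k).indices  # bias steers selection
    load = torch.zeros(n_experts)
    load[picked] = 1.0
    # Nudge: busy experts get a lower bias, idle experts a higher one.
    bias = bias - step * (load - load.mean())
    return picked

print(route(torch.randn(n_experts)))                 # indices of chosen experts
```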
And that’s about it for the behind-the-scenes. If you have more time to research DeepSeek, a read of the DeepSeek-V3 paper on GitHub is totally worth it.
Why do people choose DeepSeek over ChatGPT?
The main reason, as with any other tool, is cost. Nobody’s comfortable paying $200/month when there’s a free, open-source alternative. Its ability to run locally has also made it popular with a lot of programmers and developers.
But for me, there’s another reason: DeepSeek feels unbiased and direct. This isn’t a proper thing for a Python developer to say, since it has no technical basis and is about a “feeling”, but I gotta put it here.
For example, I created new accounts on both DeepSeek and ChatGPT (so there’s nothing about me in memory at all) and told them I’m the founder of an AI/ML startup, looking for someone to work with me as head of content. Then I copy-pasted my resume, personal website, and LinkedIn profile.
As I guessed, ChatGPT’s response was a bit of a mess. It probably missed the part of my message that said “Give me a direct reply, recruit them or not.”
Based on Chenuli Jayasinghe's profile and website, she seems like a strong candidate for the Head of Content position, especially if you're looking for someone with a mix of technical expertise in AI/ML and a proven track record in content creation.
— ChatGPT’s response
DeepSeek, on the other hand, started with my request:
Recruitment Recommendation: Recruit Her
— DeepSeek’s response
Note: both of them did pay attention to the she/her pronouns on the LinkedIn profile.
Both responses were also followed by a summary of the web content, resume, and LinkedIn, but overall, I’d prefer DeepSeek’s straight-to-the-point reply.
(not because it told me to recruit myself, of course)
Summing up
While we can’t say it’s the best LLM out there, DeepSeek does earn some bonus points for its cost-effectiveness and efficiency. Refer to the following image of performance benchmark results for DeepSeek and ChatGPT for more information.
But also be aware that DeepSeek’s policy states it stores your information on servers in China for “further training” of the chatbot. While it’s not something to panic about (most applications follow the same principle, despite not being overly open about it), it’s best to take precautions and run the model locally through a service like Ollama.
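If you want to try that, here’s roughly what it looks like with the ollama Python package, assuming you’ve installed and started Ollama, pulled a DeepSeek model (for example deepseek-r1), and installed the package with pip install ollama. Treat it as a sketch rather than a definitive setup guide.

```python
# Rough sketch: chatting with a locally pulled DeepSeek model via Ollama.
# Assumes the Ollama app is running and `ollama pull deepseek-r1` was done.
import ollama

response = ollama.chat(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "Explain Mixture of Experts in one line."}],
)
print(response["message"]["content"])
```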
If you loved this article, make sure to subscribe using your email, so you can read all my content inside your inbox without missing any!