<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>transformers on Antonio Space</title>
    <link>/tags/transformers/</link>
    <description>Recent content in transformers on Antonio Space</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Sun, 22 Feb 2026 14:00:00 +0800</lastBuildDate><atom:link href="/tags/transformers/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>LLMs Are Math</title>
      <link>/posts/llm-math/</link>
      <pubDate>Sun, 22 Feb 2026 14:00:00 +0800</pubDate>
      
      <guid>/posts/llm-math/</guid>
      <description>“AI feels magical, until you realize it’s mostly linear algebra.”
When people interact with LLMs, it can feel like intelligence: understanding, creativity, reasoning.
But under the hood?
It’s math!
Not magic. Not consciousness. Not a digital brain.
Just math — and beautiful math at that.
1) Everything starts with vectors LLMs don’t “understand” words the way humans do. They convert text into vectors — lists of numbers.</description>
      <content>&lt;blockquote&gt;
&lt;p&gt;“AI feels magical, until you realize it’s mostly linear algebra.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When people interact with LLMs, it can feel like intelligence:
understanding, creativity, reasoning.&lt;/p&gt;
&lt;p&gt;But under the hood?&lt;/p&gt;
&lt;p&gt;It’s math!&lt;/p&gt;
&lt;p&gt;Not magic. Not consciousness. Not a digital brain.&lt;/p&gt;
&lt;p&gt;Just math — and beautiful math at that.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;1-everything-starts-with-vectors&#34;&gt;1) Everything starts with vectors&lt;/h2&gt;
&lt;p&gt;LLMs don’t “understand” words the way humans do. They convert text into &lt;strong&gt;vectors&lt;/strong&gt; — lists of numbers.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;/images/embedding-space.png&#34; alt=&#34;Toy embedding space (2D projection)&#34;&gt;&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;#34;king&amp;#34;  -&amp;gt; [0.21, -0.84, 1.33, ..., 0.02]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;#34;queen&amp;#34; -&amp;gt; [0.25, -0.79, 1.40, ..., 0.04]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Each token becomes a point in a high-dimensional space (often hundreds or thousands of dimensions).
The wild part is that meaning becomes geometry. Relationships show up as vector arithmetic:&lt;/p&gt;
&lt;p&gt;$$
\text{king} - \text{man} + \text{woman} \approx \text{queen}
$$&lt;/p&gt;
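&lt;p&gt;A minimal sketch of that arithmetic, using made-up 3-D vectors in place of real embeddings (real models use hundreds or thousands of dimensions):&lt;/p&gt;

```python
import numpy as np

# Toy "embeddings" -- these particular numbers are invented for illustration.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.4]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(("queen", "apple"), key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```

With real embeddings the analogy is only approximate, but the mechanism is the same: relationships between meanings show up as directions in the vector space.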
&lt;p&gt;That’s linear algebra working in semantic space.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;2-matrices-are-the-real-workhorses&#34;&gt;2) Matrices are the real workhorses&lt;/h2&gt;
&lt;p&gt;If vectors are points, matrices are transformations.&lt;/p&gt;
&lt;p&gt;Here’s a visual intuition: a matrix transforms a grid.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;/images/grid-original.png&#34; alt=&#34;Original grid&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;/images/grid-transformed.png&#34; alt=&#34;Grid after linear transform (matrix W)&#34;&gt;&lt;/p&gt;
&lt;p&gt;A neural network layer is often described as:&lt;/p&gt;
&lt;p&gt;$$
y = xW + b
$$&lt;/p&gt;
&lt;p&gt;Where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$x$ is an input vector&lt;/li&gt;
&lt;li&gt;$W$ is a weight matrix (millions or billions of learned numbers)&lt;/li&gt;
&lt;li&gt;$b$ is a bias vector&lt;/li&gt;
&lt;li&gt;$y$ is the transformed output&lt;/li&gt;
&lt;/ul&gt;
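&lt;p&gt;That layer is one line of numpy. The sizes below are tiny and made up; real models just use much bigger matrices:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# One layer: y = xW + b, with toy sizes.
d_in, d_out = 4, 3
x = rng.standard_normal(d_in)            # input vector
W = rng.standard_normal((d_in, d_out))   # weight matrix (learned during training)
b = rng.standard_normal(d_out)           # bias vector (also learned)

y = x @ W + b  # one matrix multiply plus one add: the whole layer
print(y.shape)  # (3,)

# The "parameter count" is just the number of entries in W and b:
print(W.size + b.size)  # 15
```

Scale those dimensions up into the thousands, stack many such layers, and the entry counts add up to billions of parameters.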
&lt;p&gt;When people say:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“This model has 70 billion parameters.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They mean:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“There are 70 billion numbers in matrices (and vectors) inside the model.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Training is “just” learning those numbers.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;3-attention-is-still-just-math-dot-products--softmax&#34;&gt;3) Attention is still just math (dot-products + softmax)&lt;/h2&gt;
&lt;p&gt;Modern LLMs are based on the Transformer architecture (introduced in 2017). The key idea is &lt;strong&gt;attention&lt;/strong&gt;:
the model computes how strongly each token should relate to every other token.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;/images/attention-heatmap.png&#34; alt=&#34;Toy attention heatmap&#34;&gt;&lt;/p&gt;
&lt;p&gt;The core formula (scaled dot-product attention) is:&lt;/p&gt;
&lt;p&gt;$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V
$$&lt;/p&gt;
&lt;p&gt;What that means in plain English:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;compute similarities via dot-products ($QK^T$)&lt;/li&gt;
&lt;li&gt;divide by $\sqrt{d}$ so the scores don’t blow up&lt;/li&gt;
&lt;li&gt;normalize into probabilities with softmax&lt;/li&gt;
&lt;li&gt;mix the values ($V$) using those probabilities&lt;/li&gt;
&lt;/ul&gt;
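&lt;p&gt;The four steps above fit in a few lines. This is a bare single-head sketch with random toy matrices, not a full Transformer layer:&lt;/p&gt;

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # similarities via dot products, scaled
    weights = softmax(scores)      # each row becomes a probability distribution
    return weights @ V             # mix the values using those probabilities

rng = np.random.default_rng(0)
n_tokens, d = 5, 8
Q, K, V = (rng.standard_normal((n_tokens, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 8)
```

Each output row is a weighted blend of the value vectors, with the weights telling you how strongly that token "attends" to every other token.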
&lt;p&gt;It’s matrix multiplication, normalization, and more multiplication.&lt;/p&gt;
&lt;p&gt;Math all the way down.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;4-llms-predict-the-next-token-probability-not-certainty&#34;&gt;4) LLMs predict the next token (probability, not certainty)&lt;/h2&gt;
&lt;p&gt;At its core, an LLM is a probability machine.&lt;/p&gt;
&lt;p&gt;Given a prompt like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“The sky is”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;it produces a probability distribution over the next token.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;/images/softmax-probabilities.png&#34; alt=&#34;Softmax probabilities&#34;&gt;&lt;/p&gt;
&lt;p&gt;The final layer uses &lt;strong&gt;softmax&lt;/strong&gt; to convert scores (“logits”) into probabilities:&lt;/p&gt;
&lt;p&gt;$$
p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$&lt;/p&gt;
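&lt;p&gt;Here’s that formula in plain Python, with invented logits for the "The sky is" example (a real model would produce scores over its whole vocabulary):&lt;/p&gt;

```python
import math

def softmax(logits):
    """p_i = e^{z_i} / sum_j e^{z_j}, turning raw scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the next token after "The sky is"
logits = {"blue": 4.0, "falling": 1.5, "green": 0.5}
probs = softmax(list(logits.values()))
for token, p in zip(logits, probs):
    print(f"{token}: {p:.3f}")
```

The outputs always sum to 1, and the highest logit gets the highest probability, which is why sampling from this distribution mostly picks "blue".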
&lt;p&gt;Training minimizes &lt;strong&gt;cross-entropy loss&lt;/strong&gt; — a way to measure how wrong the predicted distribution is compared to the true next token:&lt;/p&gt;
&lt;p&gt;$$
\mathcal{L} = -\sum_i y_i \log(p_i)
$$&lt;/p&gt;
&lt;p&gt;Then optimization (gradient descent) adjusts parameters to reduce that loss.&lt;/p&gt;
&lt;p&gt;Which is… calculus and optimization.&lt;/p&gt;
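&lt;p&gt;The whole training loop can be miniaturized. This toy example (made-up logits, three "tokens", token 0 as the true answer) runs gradient descent directly on the logits, using the standard identity that the gradient of cross-entropy after softmax is simply $p - y$:&lt;/p&gt;

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(probs, true_idx):
    """L = -sum_i y_i log(p_i); with a one-hot y this collapses to -log p_true."""
    return -math.log(probs[true_idx])

logits = [0.5, 1.0, -0.2]  # invented starting scores
true_idx = 0               # pretend token 0 is the true next token
lr = 0.5                   # learning rate

initial_loss = cross_entropy(softmax(logits), true_idx)
for _ in range(20):
    p = softmax(logits)
    # Gradient of the loss with respect to the logits: (p - y)
    grad = [p[i] - (1.0 if i == true_idx else 0.0) for i in range(len(logits))]
    logits = [z - lr * g for z, g in zip(logits, grad)]

final_loss = cross_entropy(softmax(logits), true_idx)
print(final_loss < initial_loss)  # True
```

Real training does exactly this, just with backpropagation carrying the gradient through billions of parameters instead of three numbers.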
&lt;hr&gt;
&lt;h2 id=&#34;5-so-where-does-intelligence-come-from&#34;&gt;5) So where does “intelligence” come from?&lt;/h2&gt;
&lt;p&gt;There is no single place in the model that contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;grammar rules&lt;/li&gt;
&lt;li&gt;facts about Spain&lt;/li&gt;
&lt;li&gt;knowledge about GPUs&lt;/li&gt;
&lt;li&gt;a hard-coded reasoning engine&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Instead, those behaviors &lt;strong&gt;emerge&lt;/strong&gt; from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;linear algebra (vectors + matrices)&lt;/li&gt;
&lt;li&gt;non-linear functions&lt;/li&gt;
&lt;li&gt;probability distributions&lt;/li&gt;
&lt;li&gt;gradient-based optimization&lt;/li&gt;
&lt;li&gt;scale (lots of data + lots of parameters)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s the surprising part:&lt;/p&gt;
&lt;p&gt;Not that it “thinks” like us — but that math at scale can produce behavior that &lt;em&gt;feels like thinking&lt;/em&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;6-why-this-matters-if-youre-learning-ai&#34;&gt;6) Why this matters if you’re learning AI&lt;/h2&gt;
&lt;p&gt;It’s easy to feel overwhelmed by buzzwords:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transformers&lt;/li&gt;
&lt;li&gt;RLHF&lt;/li&gt;
&lt;li&gt;fine-tuning&lt;/li&gt;
&lt;li&gt;agents&lt;/li&gt;
&lt;li&gt;multimodal models&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But the foundation is compact:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Linear algebra&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Probability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Calculus&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimization&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you understand:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;what vectors represent&lt;/li&gt;
&lt;li&gt;what matrix multiplication does&lt;/li&gt;
&lt;li&gt;what a derivative tells you&lt;/li&gt;
&lt;li&gt;what a probability distribution means&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;you understand a huge chunk of modern AI.&lt;/p&gt;
&lt;p&gt;The rest is mostly engineering choices and scale.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;7-llms-are-math&#34;&gt;7) LLMs are math&lt;/h2&gt;
&lt;p&gt;There’s something empowering about this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;AI isn’t mystical.&lt;/li&gt;
&lt;li&gt;It isn’t unreachable.&lt;/li&gt;
&lt;li&gt;It isn’t reserved for a “priesthood”.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It’s math.&lt;/p&gt;
&lt;p&gt;And math is learnable.&lt;/p&gt;
&lt;p&gt;The next time you see a model produce a surprisingly elegant answer, remember:&lt;/p&gt;
&lt;p&gt;Behind those words is a giant pile of matrices multiplying vectors at insane speed.&lt;/p&gt;
&lt;p&gt;And somehow… that’s enough.&lt;/p&gt;
</content>
    </item>
    
  </channel>
</rss>
