
满五的博客

Louis Aeilot's Blog

https://blog.aeilot.top/

https://blog.aeilot.top/index.xml (RSS feed URL)

CS231n Lecture Note II: Linear Classifiers

Given the disadvantages of the KNN algorithm, we need a more powerful approach. The new approach has two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels.

Score Function

The score function maps the pixel values of an image to confidence scores for each class.

As before, let's assume a training dataset of images $\mathbf{x}_i \in \mathbf{R}^D$, each associated with a label $y_i$. Here $i = 1 \dots N$ and $y_i \in 1 \dots K$. That is, we have $N$ examples (each with dimensionality $D$) and $K$ distinct categories.

We will define the score function $f : \mathbf{R}^D \mapsto \mathbf{R}^K$ that maps the raw image pixels to class scores.

Linear Classifier

We start with arguably the simplest possible function, a linear mapping:

$$f(\mathbf{x}_i, \mathbf{W}, \mathbf{b}) = \mathbf{W}\mathbf{x}_i + \mathbf{b}$$

In the above equation, we assume that the image $\mathbf{x}_i$ has all of its pixels flattened out into a single column vector of shape [D x 1]. The matrix $\mathbf{W}$ (of size [K x D]) and the vector $\mathbf{b}$ (of size [K x 1]) are the parameters of the function.

The parameters in $\mathbf{W}$ are often called the weights, and $\mathbf{b}$ is called the bias vector because it influences the output scores without interacting with the actual data $\mathbf{x}_i$.

The input data are given and fixed. The goal is to set $\mathbf{W}$ and $\mathbf{b}$ in such a way that the computed scores match the ground truth labels across the whole training set.

Bias Tricks

We can combine the two sets of parameters into a single matrix that holds both of them by extending the vector $\mathbf{x}_i$ with one additional dimension that always holds the constant $1$, a default bias dimension. The bias $\mathbf{b}$ then becomes an extra column of $\mathbf{W}$, so $\mathbf{W}$ has size [K x (D + 1)] and $\mathbf{x}_i$ has shape [(D + 1) x 1].
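As a rough illustration of the score function and the bias trick described above, here is a minimal NumPy sketch. The sizes `K` and `D`, the random data, and the names `W_ext` and `x_ext` are assumptions made up for this example; they do not come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 3, 4                      # hypothetical sizes: 3 classes, 4 "pixels"
W = rng.standard_normal((K, D))  # weights, shape [K x D]
b = rng.standard_normal(K)       # bias, one entry per class
x = rng.standard_normal(D)       # one flattened image, shape [D]

# Score function with a separate bias: f(x, W, b) = W x + b
scores = W @ x + b

# Bias trick: append b as an extra column of W and a constant 1 to x
W_ext = np.hstack([W, b[:, None]])  # shape [K x (D + 1)]
x_ext = np.append(x, 1.0)           # shape [D + 1]
scores_ext = W_ext @ x_ext

# Both forms should produce the same K class scores
print(np.allclose(scores, scores_ext))
```

Folding the bias into the weight matrix means later code only has to track one parameter array instead of two.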
With the bias folded into $\mathbf{W}$, the score function simplifies to a single matrix multiplication:

$$f(\mathbf{x}_i, \mathbf{W}) = \mathbf{W}\mathbf{x}_i$$

Image Data Preprocessing

In machine learning, it is very common practice to normalize the input features. In particular, it is important to center your data by subtracting the mean from every feature.

Loss Function

We will develop the Multiclass Support Vector Machine (SVM) loss.

The score function takes the pixels and computes the vector $f(\mathbf{x}_i, \mathbf{W})$ of class scores, which we will abbreviate to $\mathbf{s}$ (short for scores).

The Multiclass SVM loss for the i-th example is then formalized as follows:

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$

An incorrect class $j$ contributes to the loss only when its score $s_j$ is not at least $\Delta$ below the score of the correct class; the sum accumulates these margin violations.

In summary, the SVM loss function wants the score of the correct class $y_i$ to be larger than the incorrect class scores by at least $\Delta$ (delta).

The threshold-at-zero function $\max(0, -)$ is often called the hinge loss.

There is also the squared hinge loss SVM (or L2-SVM), which uses the form $\max(0, -)^2$ and penalizes violated margins more strongly.

Regularization

The weights that achieve a given data loss are not unique: if some $\mathbf{W}$ classifies every example with zero loss, then so does any multiple $\alpha\mathbf{W}$ with $\alpha > 1$. We wish to encode some preference for a certain set of weights $\mathbf{W}$ over others to remove this ambiguity.

We can do so by extending the loss function with a regularization penalty $R(\mathbf{W})$.

The most common regularization penalty is the squared L2 norm, which discourages large weights through an elementwise quadratic penalty over all parameters:

$$R(\mathbf{W}) = \sum_k \sum_l W_{k,l}^2$$

Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the data loss (the average of the losses $L_i$ over all examples) and the regularization loss.
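The hinge loss and the L2 penalty defined above can be sketched in NumPy roughly as follows. The function name `svm_loss`, the argument name `reg` (standing in for the regularization strength), and the toy scores are assumptions chosen for illustration. The identity weight matrix is used only so that the class scores equal the input row and the arithmetic is easy to follow by hand.

```python
import numpy as np

def svm_loss(W, X, y, delta=1.0, reg=0.0):
    """Average multiclass hinge loss plus reg * sum(W**2).

    W: [K x D] weights, X: [N x D] examples (one per row), y: [N] integer labels.
    """
    N = X.shape[0]
    scores = X @ W.T                             # class scores, shape [N x K]
    correct = scores[np.arange(N), y][:, None]   # correct-class score per row
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(N), y] = 0.0               # the sum skips j == y_i
    data_loss = margins.sum() / N                # average of the L_i
    reg_loss = reg * np.sum(W * W)               # squared L2 penalty
    return data_loss + reg_loss

# Toy example: with W = I the scores are simply [13, -7, 11]; class 0 is correct.
W = np.eye(3)
X = np.array([[13.0, -7.0, 11.0]])
y = np.array([0])

loss_no_reg = svm_loss(W, X, y, delta=10.0)          # max(0, -10) + max(0, 8)
loss_reg = svm_loss(W, X, y, delta=10.0, reg=0.5)    # adds 0.5 * 3
print(loss_no_reg)  # 8.0
print(loss_reg)     # 9.5
```

Only the third class violates the margin here (its score of 11 is within 10 of the correct score 13), so it alone contributes to the data loss.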
Putting the two components together, the full loss is:

$$L = \underbrace{\frac{1}{N} \sum_i L_i}_{\text{data loss}} + \underbrace{\lambda R(\mathbf{W})}_{\text{regularization loss}}$$

Or in full form:

$$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max(0, f(\mathbf{x}_i; \mathbf{W})_j - f(\mathbf{x}_i; \mathbf{W})_{y_i} + \Delta) \right] + \lambda \sum_k \sum_l W_{k,l}^2$$

Penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. It keeps the weights small and spread out, which can improve the performance of the classifier on test images and lead to less overfitting. In other words, it discourages the model from fitting the training data too closely.

Note that due to the regularization penalty we can never achieve a loss of exactly 0.0 on all examples, since that would require $\mathbf{W} = 0$.

Practical Considerations

Setting Delta: It turns out that this hyperparameter can safely be set to $\Delta = 1.0$ in all cases. (The exact value of the margin between the scores is in some sense meaningless because the weights can shrink or stretch the differences arbitrarily; the real tradeoff is controlled by $\lambda$.)

Binary Support Vector Machine: For two classes with labels $y_i \in \{-1, 1\}$, the loss for the i-th example can be written as

$$L_i = C \max(0, 1 - y_i \mathbf{w}^T \mathbf{x}_i) + R(\mathbf{W})$$

$C$ in this formulation and $\lambda$ in our formulation control the same tradeoff and are related through the reciprocal relation $C \propto \frac{1}{\lambda}$.

Other Multiclass SVM formulations: The Multiclass SVM presented in this section is one of several ways of formulating the SVM over multiple classes. Another commonly used form is the One-Vs-All (OVA) SVM, which trains an independent binary SVM for each class vs. all other classes. Related, but less common in practice, is the All-vs-All (AVA) strategy.
The last formulation you may see is a Structured SVM, which maximizes the margin between the score of the correct class and the score of the highest-scoring incorrect runner-up class.

Softmax Classifier

In the Softmax classifier, we interpret the scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \hspace{1cm} \text{or equivalently} \hspace{1cm} L_i = -f_{y_i} + \log\sum_j e^{f_j}$$

where we use the notation $f_j$ to mean the j-th element of the vector of class scores $\mathbf{f}$.

The function $f_j(\mathbf{z}) = \frac{e^{z_j}}{\sum_k e^{z_k}}$ is called the softmax function: it takes a vector of arbitrary real-valued scores (in $\mathbf{z}$) and squashes it to a vector of values between zero and one that sum to one.

Information Theory View

The cross-entropy between a "true" distribution $p$ and an estimated distribution $q$ is defined as:

$$H(p, q) = -\sum_x p(x) \log q(x)$$

Minimizing the cross-entropy is equivalent to minimizing the KL divergence, because the two differ only by the entropy of $p$:

$$H(p, q) = H(p) + D_{KL}(p||q)$$

Here the true distribution $p$ is fixed: it puts all of its probability mass on the correct class, so its entropy $H(p)$ is zero. Minimizing the cross-entropy is therefore the same as pushing the predicted distribution $q$ to match the true distribution $p$.

The objective of the Softmax loss is to make the model output a probability distribution where the correct class has a probability very close to 1.0 and all other classes are close to 0.0.

Information Theory Supplementary

Information entropy measures the uncertainty or unpredictability of a random variable.
The less probable an outcome is, the more information is gained when it occurs, and entropy is the expected information over all outcomes. Conversely, if one outcome has a probability of 1 (certainty), the entropy is 0.

For a discrete random variable $X$ with possible outcomes $\{x_1, ..., x_n\}$ and probabilities $P(x_i)$, the entropy $H(X)$ is defined as:

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)$$

Cross-entropy measures the average cost of using distribution $q$ to represent distribution $p$.

Since $H(p)$ does not depend on $q$, minimizing the cross-entropy $H(p, q)$ is mathematically equivalent to minimizing the KL divergence. It pushes the predicted distribution $q$ to become as close as possible to the true distribution $p$.

Probabilistic View

$$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$$

The formula maps raw scores to the range $(0, 1)$ such that the sum of all class probabilities equals 1.

Using the cross-entropy loss during training is equivalent to minimizing the negative log likelihood of the correct class, that is, to maximum likelihood estimation.

Numeric Stability

The exponentials $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ can be very large, and dividing large numbers can be numerically unstable, so it is important to use a normalization trick:

$$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$$

A common choice for $C$ is to set $\log C = -\max_j f_j$. This simply states that we should shift the values inside the vector $f$ so that the highest value is zero.
In code:

```python
import numpy as np

f = np.array([123, 456, 789])  # example with 3 classes, each having a large score
p = np.exp(f) / np.sum(np.exp(f))  # Bad: numeric problem, potential blowup

# Instead, first shift the values of f so that the highest number is 0:
f -= np.max(f)  # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f))  # safe to do, gives the correct answer
```

Note

To be precise, the SVM classifier uses the hinge loss, sometimes also called the max-margin loss. The Softmax classifier uses the cross-entropy loss.

SVM vs Softmax

Both classifiers start from the same outputs $f(\mathbf{x}_i, \mathbf{W})$. The SVM interprets them as class scores, and its loss function encourages the correct class to have a score higher by a margin than the other class scores.

The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently, the negative of it to be low).

The Softmax classifier provides "probabilities" for each class.

These "probabilities" depend on the regularization strength. They are better thought of as confidences where the ordering of the scores is interpretable but the absolute values are not.

In practice, SVM and Softmax are usually comparable.

Compared to the Softmax classifier, the SVM is a more local objective.

The Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability, and the loss would always get better.

The SVM, however, is happy once the margins are satisfied and does not micromanage the exact scores beyond this constraint.
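To make the last point concrete, the following sketch evaluates both losses on three score vectors that all satisfy the SVM margin with $\Delta = 1$. The helper names `hinge_loss` and `cross_entropy_loss` and the score values are assumptions chosen for illustration, not part of the lecture code.

```python
import numpy as np

def hinge_loss(scores, y, delta=1.0):
    """Multiclass SVM loss for one example with correct class index y."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0                      # the sum skips the correct class
    return margins.sum()

def cross_entropy_loss(scores, y):
    """Softmax cross-entropy loss for one example, with the stability shift."""
    shifted = scores - np.max(scores)     # highest score becomes 0
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[y]

y = 0  # the first class is the correct one in every example
examples = [
    np.array([10.0, 9.0, 9.0]),      # margin met exactly
    np.array([10.0, -2.0, 3.0]),     # margin met comfortably
    np.array([10.0, -20.0, -20.0]),  # margin met by a wide gap
]
losses = [(hinge_loss(s, y), cross_entropy_loss(s, y)) for s in examples]

for s, (svm, softmax) in zip(examples, losses):
    print(s, "hinge:", svm, "cross-entropy:", softmax)
```

The hinge loss is zero for all three vectors, while the cross-entropy loss stays positive and keeps shrinking as the gap between the correct score and the others widens.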

