How the loss and cost functions got their forms in logistic regression

In logistic regression, the loss function $\mathcal{L}$ and the cost function $J$ take the forms

$$\mathcal{L}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$$

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \,\right]$$

where $\hat{y} = \sigma(w^T x + b)$ is the sigmoid of the linear combination of the features $x$ written in matrix form, with $w$ as the vector of feature weights and $b$ as the intercept term. The sigmoid of some value $z$ is defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
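
As a quick illustration, here is a minimal Python sketch (using NumPy) of how $\hat{y}$ could be computed; the weights, intercept, and feature values below are made up purely for demonstration:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid: maps any real number to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, intercept, and a single feature vector
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([2.0, 1.0, -0.5])

# y_hat is the sigmoid of the linear combination w.x + b
y_hat = sigmoid(np.dot(w, x) + b)
print(y_hat)  # a probability between 0 and 1
```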
To understand why $\mathcal{L}$ and $J$ take such forms, first note that $\hat{y}$ is the probability that the binary classification variable $y$ is a positive example ($y = 1$); that is, $\hat{y} = P(y = 1 \mid x)$. This means that for a single example $(x, y)$, $P(y \mid x)$ takes the form $P(y \mid x) = \hat{y}$ if $y = 1$. The other condition is when $y = 0$, where we want the predicted probability $\hat{y}$ to be equal to zero, so that $P(y \mid x) = 1 - \hat{y}$. To satisfy both of these conditions, we define $P(y \mid x)$:

$$P(y \mid x) = \begin{cases} \hat{y} & \text{if } y = 1 \\ 1 - \hat{y} & \text{if } y = 0 \end{cases}$$

Another way to write $P(y \mid x)$ is as follows:

$$P(y \mid x) = \hat{y}^{\,y} \, (1 - \hat{y})^{\,1 - y}$$

You can check this for yourself by plugging in $y = 1$ or $y = 0$.
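
If you'd rather check it numerically, a tiny sketch like this (with an arbitrary value for $\hat{y}$) shows the compact form reproducing both cases:

```python
y_hat = 0.8  # an arbitrary predicted probability

def p_of_y_given_x(y, y_hat):
    # Compact form: y_hat^y * (1 - y_hat)^(1 - y)
    return y_hat**y * (1 - y_hat)**(1 - y)

print(p_of_y_given_x(1, y_hat))  # 0.8, i.e. y_hat       (case y = 1)
print(p_of_y_given_x(0, y_hat))  # 0.2, i.e. 1 - y_hat   (case y = 0)
```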

Now, let’s consider the idea of applying a logarithm to probabilities. The probability defined above for a single example is simple enough, but if you’d like to do maximum likelihood estimation for a set of examples, then we have to consider some “overall probability” value. In probability theory, when two events $A$ and $B$ are independent, the probability of them both occurring (their intersection) is given by

$$P(A \cap B) = P(A)\,P(B)$$

In most cases, it’s safe to assume that every example in your training set occurred independently of the others. Thus the probability for a training set of $m$ samples is

$$P(\text{all } y \mid \text{all } x) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)})$$
We could work with the probabilities directly, but using the logarithm of probabilities has several practical advantages. One advantage is that multiplication is more computationally expensive than addition; we’ll see shortly how applying the log turns the product above into a sum. Also, computers have limited floating-point precision, so when a computer multiplies a very large set of probabilities (recall that they are valued between zero and one), you may end up with a value very close or equal to zero, which is far from ideal. You avoid this behaviour by working with the logarithm instead.
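
Here is a small sketch that makes the underflow problem concrete, comparing the direct product of many made-up probabilities with the sum of their logarithms:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 0.99, size=100_000)  # made-up probabilities

# The direct product underflows to exactly 0.0 in double precision
print(np.prod(probs))         # 0.0

# The sum of logs is a large negative number, with no underflow
print(np.sum(np.log(probs)))
```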

Let’s apply the logarithm to $P(y \mid x)$. Using the facts that $\log(ab) = \log a + \log b$ and $\log(a^b) = b \log a$, we get

$$\log P(y \mid x) = y \log \hat{y} + (1 - y) \log (1 - \hat{y})$$

This is similar in form to the loss function of logistic regression for a single example, except for the minus sign. Logistic regression includes the minus sign so that maximizing this log-probability corresponds to minimizing the cost function.
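
Translated into code, the loss for a single example is just the negative of this log-probability. A minimal sketch:

```python
import numpy as np

def loss(y_hat, y):
    # L(y_hat, y) = -[ y*log(y_hat) + (1 - y)*log(1 - y_hat) ]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The loss is small when the prediction agrees with the label...
print(loss(0.9, 1))  # ~0.105
# ...and large when it does not.
print(loss(0.9, 0))  # ~2.303
```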

How about $\log P(\text{all } y \mid \text{all } x)$? By applying the multiplication rule for probabilities and then the logarithm, it turns out to be just the sum of the log-probabilities, i.e. the negative sum of the loss functions, over every training example:

$$\log P(\text{all } y \mid \text{all } x) = \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}) = -\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$

Look familiar? Recall from above that the cost function has the form

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \,\right] = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$

Thus, $\log P(\text{all } y \mid \text{all } x)$ is the cost function without the negative sign and without the scaling factor $\frac{1}{m}$:

$$J(w, b) = -\frac{1}{m} \log P(\text{all } y \mid \text{all } x)$$
The factor $\frac{1}{m}$ found in $J$ is there for scaling purposes: it lets us treat the cost function as the average of the loss functions over all training examples.
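
Putting the pieces together, here is a sketch with made-up data and parameters showing that $J$ is the average of the per-example losses and, equivalently, $-\frac{1}{m}$ times the log-probability of the whole training set:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up training set (m = 4 examples, 2 features) and parameters
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3], [1.5, 1.5]])
y = np.array([1, 0, 0, 1])
w = np.array([0.8, -0.4])
b = 0.1

y_hat = sigmoid(X @ w + b)  # predictions for all m examples
losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

J = np.mean(losses)  # cost: average of the per-example losses
log_prob_all = np.sum(np.log(y_hat**y * (1 - y_hat)**(1 - y)))

print(J)
print(-log_prob_all / len(y))  # same value as J
```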
