In logistic regression, the loss function $\mathcal{L}$ and the cost function $J$ take the forms

$$\mathcal{L}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$$

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)$$
where $\hat{y} = \sigma(w^{\top} x + b)$ is the sigmoid of the linear combination of the features written in vector form, with $w$ as the vector of the feature weights and $b$ as the intercept term. The sigmoid of some value $z$ is defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
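To make the prediction step concrete, here is a minimal NumPy sketch of computing $\hat{y}$ for a single example. The weights, intercept, and feature values below are made up purely for illustration; they are not part of the derivation.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values, chosen only for demonstration.
w = np.array([0.5, -1.2, 0.3])   # vector of feature weights
b = 0.1                          # intercept term
x = np.array([1.0, 0.4, -2.0])   # feature vector of a single example

y_hat = sigmoid(np.dot(w, x) + b)  # predicted probability that y = 1
print(y_hat)
```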
To understand why $\mathcal{L}$ and $J$ take such forms, first note that $\hat{y}$ is the probability of the binary classification variable $y$ being a positive example ($y = 1$). This means that for a single example $x$, the probability $P(y \mid x)$ takes the form $\hat{y}$ if $y = 1$. The other condition is when $y = 0$, where the probability of observing the example is $1 - \hat{y}$. To satisfy both of these conditions, we define $P(y \mid x)$ as:

$$P(y \mid x) = \begin{cases} \hat{y} & \text{if } y = 1 \\ 1 - \hat{y} & \text{if } y = 0 \end{cases}$$
Another way to write $P(y \mid x)$ is as follows:

$$P(y \mid x) = \hat{y}^{\,y} \, (1 - \hat{y})^{\,1 - y}$$
You can check this for yourself using $y = 1$ or $y = 0$.
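If you'd rather check it numerically, this short Python snippet compares the compact expression against the piecewise definition for both labels, using an arbitrary predicted probability:

```python
# Arbitrary predicted probability, chosen only for the check.
y_hat = 0.8

for y in (0, 1):
    compact = y_hat**y * (1 - y_hat)**(1 - y)   # y_hat^y * (1 - y_hat)^(1 - y)
    piecewise = y_hat if y == 1 else 1 - y_hat  # the case-by-case definition
    print(y, compact, piecewise)                # both expressions agree for each label
```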
Now, let’s consider the idea of applying a logarithm to probabilities. The probability defined above for a single example is simple enough, but if you’d like to do maximum likelihood estimation for a set of examples, then you have to consider some “overall probability” value. In probability theory, when two events $A$ and $B$ are independent, the probability of them both occurring (their intersection) is given by

$$P(A \cap B) = P(A) \, P(B)$$
In most cases, it’s safe to assume that the examples in your training set occurred independently of one another. Thus the overall probability for a training set of $m$ samples is

$$P = \prod_{i=1}^{m} P\left(y^{(i)} \mid x^{(i)}\right)$$
We could use these probabilities directly, but working with the logarithm of probabilities has several practical advantages. One advantage is that multiplication is more computationally expensive than addition; we’ll see shortly how applying the log turns the product above into a sum. Also, computers have limited floating-point precision, so when a computer multiplies a very large set of probabilities (recall that they’re valued between zero and one), you may end up with a value very close or equal to zero, which is far from ideal. You won’t encounter this behaviour if you apply the logarithm.
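Here is a small sketch of that underflow problem: multiplying a few thousand made-up probabilities directly underflows to zero in double precision, while summing their logarithms stays well-behaved. The number of probabilities and their range are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=5000)  # 5000 made-up probabilities in (0, 1)

print(np.prod(p))         # underflows to 0.0 in double precision
print(np.sum(np.log(p)))  # log of the same product, computed safely as a sum
```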
Let’s apply the logarithm to $P(y \mid x)$. Using the facts that $\log(ab) = \log a + \log b$ and $\log(a^{b}) = b \log a$, we get

$$\log P(y \mid x) = y \log \hat{y} + (1 - y) \log (1 - \hat{y})$$
This is similar in form to the loss function of logistic regression for a single example, except for the minus sign: $\mathcal{L}(\hat{y}, y) = -\log P(y \mid x)$. Logistic regression includes the minus sign so that maximizing the log-probability becomes a problem of minimizing the cost function.
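As a quick sanity check, here is a sketch of the per-example loss evaluated on an arbitrary prediction, showing that it is exactly the negative log-probability:

```python
import numpy as np

def loss(y_hat, y):
    """Logistic regression loss for one example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Arbitrary illustration: a positive example predicted with probability 0.9.
print(loss(0.9, 1))   # small loss, since the prediction is confident and correct
print(-np.log(0.9))   # same number: -log P(y = 1 | x)
```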
How about for $m$ examples? By applying the multiplication rule for probabilities, it turns out that $\log P$ is just the sum of the per-example log-probabilities, i.e. the loss functions of every training example without their minus signs:

$$\log P = \sum_{i=1}^{m} \log P\left(y^{(i)} \mid x^{(i)}\right) = \sum_{i=1}^{m} \left[\, y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log \left(1 - \hat{y}^{(i)}\right) \,\right] = -\sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)$$
Look familiar? Recall from above that the cost function has the form

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log \left(1 - \hat{y}^{(i)}\right) \,\right]$$
Thus, $\log P$ is the cost function without the negative sign and without the scaling factor $\frac{1}{m}$:

$$J(w, b) = -\frac{1}{m} \log P$$
The factor $\frac{1}{m}$ found in $J$ is there for scaling purposes, and allows us to treat the cost function as the average of the loss functions over all training examples.
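Putting it all together, here is a minimal vectorized sketch, on a tiny made-up training set, that computes $J$ as the average of the per-example losses and checks that it equals $-\frac{1}{m} \log P$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """Cost J(w, b): the average loss over all m training examples."""
    y_hat = sigmoid(X @ w + b)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.mean()

# Tiny made-up training set: m = 4 examples with 2 features each.
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [0.0, 0.3]])
y = np.array([1, 0, 1, 0])
w = np.array([0.2, -0.1])
b = 0.05

y_hat = sigmoid(X @ w + b)
log_P = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # log of the overall probability
print(cost(w, b, X, y))   # J(w, b)
print(-log_P / len(y))    # -(1/m) * log P: the same value
```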