Basic Calculus for AI, part 5

The point of calculus – or most math, frankly – is to come up with cool shortcuts that bypass complicated, repetitive, and error-prone techniques. The idea of slopes and gradients, for example, makes it unnecessary to go through every .. single .. coordinate .. of a function to find its minima. Likewise, the chain rule is a shortcut for calculating the derivatives of composite functions – functions nested inside other functions.

A.6 The Chain Rule

DL uses the chain rule for backpropagation, the algorithm it uses to compute the gradients it learns from.
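To see that connection in action, here's a minimal sketch, assuming PyTorch is available: calling `backward()` makes autograd apply the chain rule automatically, which is backpropagation in miniature. The hand-computed comparison value uses the derivative we'll work out in the example below.

```python
# A minimal sketch, assuming PyTorch is installed.
# Autograd applies the chain rule for us: this is
# backpropagation on a one-node "network".
import torch

x = torch.tensor(2.0, requires_grad=True)
F = (x**3 - 1) ** 2   # composite function F(x) = f(g(x))
F.backward()          # the chain rule runs here

# Hand-derived chain-rule result for comparison: 6x^2(x^3 - 1)
manual = 6 * x.item()**2 * (x.item()**3 - 1)
print(x.grad.item(), manual)  # both print 168.0
```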

The formal definition is:


If \(g\) is differentiable at \(x\) and \(f\) is differentiable at \(g(x)\), then the composite function \(F = f \circ g\) defined by \(F(x) = f(g(x))\) is differentiable at \(x\), and \(F'\) is given by the product \(F'(x) = f'(g(x))\,g'(x)\).


That’s straight from the course slide, but it hurts the brain. What does it mean?
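In plain terms: take the derivative of the outer function (evaluated at the inner function, left untouched), then multiply by the derivative of the inner function. In Leibniz notation, with \(y = f(u)\) and \(u = g(x)\), the same statement reads:

$$ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} $$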

A.6.1 Example

$$ F(x) = (x^3 - 1)^2 $$

Ooh look. An exponent that works on a bracketed expression. That screams that a function is being passed to another function.

Here the outer function is \(f(u) = u^2\) and the inner function is \(g(x) = x^3 - 1\), so \(f'(u) = 2u\) and \(g'(x) = 3x^2\). The derivative is:

$$\begin{aligned} f'(g(x))\,g'(x) &= 2(x^3 - 1) \cdot 3x^2 \\ &= 6x^2(x^3 - 1) \end{aligned}$$
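A quick numerical spot-check never hurts. Here's a minimal sketch in plain Python that compares the chain-rule result against a centered finite difference at a few points:

```python
# A minimal sketch: verify F'(x) = 6x^2(x^3 - 1) numerically
# with a centered finite difference. No libraries needed.

def F(x):
    return (x**3 - 1) ** 2

def F_prime(x):
    return 6 * x**2 * (x**3 - 1)  # chain-rule result

h = 1e-6
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    approx = (F(x + h) - F(x - h)) / (2 * h)  # centered difference
    print(f"x={x:5.1f}  chain rule={F_prime(x):12.4f}  numeric={approx:12.4f}")
```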

Just to show it works, let’s do the derivative the hard way, without the chain rule:

$$\begin{aligned} F(x) &= (x^3 - 1)^2 \\ &= (x^3 - 1)(x^3 - 1) \\ &= x^6 - x^3 - x^3 + 1 \\ &= x^6 - 2x^3 + 1 \\ F'(x) &= 6x^5 - 6x^2 \\ &= 6x^2(x^3 - 1) \end{aligned}$$
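Same answer both ways. If SymPy happens to be installed, the agreement can also be confirmed symbolically; a sketch:

```python
# A minimal sketch, assuming SymPy is installed: confirm that
# SymPy's derivative of F matches both hand-derived forms exactly.
import sympy as sp

x = sp.symbols("x")
F = (x**3 - 1) ** 2

dF = sp.diff(F, x)                    # SymPy applies the chain rule
chain_rule = 6 * x**2 * (x**3 - 1)    # our factored result
hard_way = 6 * x**5 - 6 * x**2        # result from expanding first

print(sp.simplify(dF - chain_rule))   # 0
print(sp.simplify(dF - hard_way))     # 0
```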


That wraps up differential calculus. I won’t cover integral calculus unless it turns out to be genuinely necessary for understanding area under the curve.