Softplus and softminus
The softplus function is a smooth approximation to the ReLU activation function, and is sometimes used in the neural networks in place of ReLU.
\[\operatorname{softplus}(x) = \log(1 + e^{x})\]It is actually closely related to the sigmoid function. As $x \to -\infty$, the two functions become identical.
\[\operatorname{sigmoid}(x) = \frac{1}{1 + e^{-x}}\]The softplus function also has a relatively unknown sibling, called softminus.
\[\operatorname{softminus}(x) = x - \operatorname{softplus}(x)\]As $x \to +\infty$, it becomes identical to $\operatorname{sigmoid}(x) - 1$. In the following plots, you can clearly see the similarities between softplus & softminus and sigmoid.
Furthermore, there is also an inverse softplus function that does the transformation $x = \operatorname{softplusinv}(\operatorname{softplus}(x))$.
\[\operatorname{softplusinv}(x) = \log(e^{x} - 1)\]Using $\operatorname{softplusinv}(x)$ as an additive constant allows you to adjust the $y$-intercept of the softplus function. For instance, $\operatorname{softplus}(x + \operatorname{softplusinv}(1))$ returns a function with $y$-intercept = 1.
The inverse of the sigmoid function is called logit. As such, $x = \operatorname{logit}(\operatorname{sigmoid}(x))$.
\[\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)\]As these functions involve $\exp(x)$ and $\log(x)$, sometimes you might run into numerical stability issues. For instance, this happens to $\exp(x)$ when $x$ is too large; for $\log(x)$ when $x$ is close to zero. The following are the safer expressions of softplus and softminus that should help avoid those issues.
\[\operatorname{softplus}(x) = \max(0, x) + \log(1 + e^{-|x|})\] \[\operatorname{softminus}(x) = \min(0, x) - \log(1 + e^{-|x|})\]While for the sigmoid function, you can simply call the hyperbolic tangent function, because $\tanh(x)$ is just a scaled $\operatorname{sigmoid}(x)$.
\[\operatorname{sigmoid}(x) = \frac{1}{2} \left[1 + \tanh\left(\frac{x}{2}\right)\right]\]As a reminder, $\tanh(x)$ is defined as:
\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}}\]All these functions are easily written with NumPy.
import numpy as np
softplus = lambda x: np.log1p(np.exp(x))
softminus = lambda x: x - softplus(x)
sigmoid = lambda x: 1 / (1 + np.exp(-x))
one_minus_sigmoid = lambda x: 1 / (1 + np.exp(x))
logit = lambda x: np.log(x) - np.log1p(-x)
softplusinv = lambda x: np.log(np.expm1(x))
softminusinv = lambda x: x - np.log(-np.expm1(x))
safe_softplus = lambda x: x * (x >= 0) + np.log1p(np.exp(-np.abs(x)))
safe_softminus = lambda x: x * (x < 0) - np.log1p(np.exp(-np.abs(x)))
safe_sigmoid = lambda x: 0.5 * (1 + np.tanh(0.5 * x))
safe_one_minus_sigmoid = lambda x: 0.5 * (1 + np.tanh(0.5 * -x))
Softplus is also used to compute the log probabilities used in the binary cross-entropy loss function.
\[\operatorname{log prob}_{1}(x) = \log(p(x)) = -\operatorname{softplus}(-x)\] \[\operatorname{log prob}_{0}(x) = \log(1-p(x)) = -\operatorname{softplus}(x)\]The subscripts “0” and “1” are the class labels, and $p(x)$ is the probability of being class “1”. Substitute $p(x) = \operatorname{sigmoid}(x)$ to get the above results. The following plot shows the log prob curves.