This notebook is a complement to the tutorial on operators.
Logical connectives in LTN are grounded using fuzzy semantics. However, while all fuzzy logic operators make sense when simply querying the language, not every operator is equally suited for learning in LTN. This notebook details the motivations behind the choice of some operators over others.
import ltn
import tensorflow as tf
One can access the most common fuzzy logic operators in `ltn.fuzzy_ops`. They are implemented using TensorFlow primitives.
We here compare two conjunction operators: the product t-norm $u \land_{\mathrm{prod}} v = uv$ and the Łukasiewicz t-norm $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$.
The two operators obviously convey different meanings, but each can make sense depending on the intent of the query.
x1 = tf.constant(0.4)
x2 = tf.constant(0.7)
# the stable keyword is explained at the end of the notebook
and_prod = ltn.fuzzy_ops.And_Prod(stable=False)
and_luk = ltn.fuzzy_ops.And_Luk()
print(and_prod(x1,x2))
print(and_luk(x1,x2))
tf.Tensor(0.28, shape=(), dtype=float32) tf.Tensor(0.100000024, shape=(), dtype=float32)
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])
# the stable keyword is explained at the end of the notebook
forall_min = ltn.fuzzy_ops.Aggreg_Min()
forall_pME = ltn.fuzzy_ops.Aggreg_pMeanError(p=4, stable=False)
print(forall_min(xs))
print(forall_pME(xs))
tf.Tensor(0.1, shape=(), dtype=float32) tf.Tensor(0.31339926, shape=(), dtype=float32)
While all operators are suitable in a querying setting, this is not the case in a learning setting. Indeed, many fuzzy logic operators have derivatives that are not suitable for gradient-based algorithms. For more details, read van Krieken et al., Analyzing Differentiable Fuzzy Logic Operators, 2020.
We here give simple illustrations of such gradient issues.
Some operators have vanishing gradients on some part of their domains.
e.g. in $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$, if $u+v-1 < 0$, the gradients vanish.
x1 = tf.constant(0.3)
x2 = tf.constant(0.5)
with tf.GradientTape() as tape:
tape.watch(x1)
tape.watch(x2)
y = and_luk(x1,x2)
res = y.numpy()
gradients = [v.numpy() for v in tape.gradient(y,[x1,x2])]
print(res)
print(gradients)
0.0 [0.0, 0.0]
Some operators have gradients propagating to only one input at a time, meaning that all other inputs will not benefit from learning at this step.
e.g. in $\min(u_1,\dots,u_n)$.
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])
with tf.GradientTape() as tape:
tape.watch(xs)
y = forall_min(xs)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.1 [0. 0. 0. 0. 0. 0. 0. 1.]
Some operators have exploding gradients on some part of their domains.
e.g. in $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$, on the edge case where all inputs are $1.0$.
xs = tf.constant([1.,1.,1.])
with tf.GradientTape() as tape:
tape.watch(xs)
y = forall_pME(xs,p=4)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
1.0 [nan nan nan]
Our general recommendation is to use the following "product configuration":
- negation: the standard negation $\lnot u = 1-u$,
- conjunction: the product t-norm $u \land v = uv$,
- disjunction: the product t-conorm (probabilistic sum) $u \lor v = u+v-uv$,
- implication: the Reichenbach implication $u \rightarrow v = 1 - u + uv$,
- existential quantification ($\exists$): the generalized mean (p-mean) $\mathrm{pM}(u_1,\dots,u_n) = \biggl( \frac{1}{n} \sum\limits_{i=1}^n u_i^p \biggr)^{\frac{1}{p}}$,
- universal quantification ($\forall$): the generalized mean of deviations w.r.t. the max (p-mean error) $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$.
As is, this configuration is not fully exempt from issues:
- the product t-norm has vanishing gradients on the edge case $u=v=0$;
- the product t-conorm has vanishing gradients on the edge case $u=v=1$;
- the Reichenbach implication has vanishing gradients on the edge case $u=0$, $v=1$;
- p-mean has exploding gradients on the edge case $u_1=\dots=u_n=0$;
- p-mean error has exploding gradients on the edge case $u_1=\dots=u_n=1$.
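As a sanity check, the binary vanishing-gradient edge cases can be verified with hand-derived partial derivatives. This is a minimal pure-Python sketch, not the `ltn` implementation:

```python
def grad_and_prod(u, v):
    # product t-norm u*v : d/du = v, d/dv = u
    return (v, u)

def grad_or_probsum(u, v):
    # probabilistic sum u + v - u*v : d/du = 1 - v, d/dv = 1 - u
    return (1 - v, 1 - u)

def grad_implies_reichenbach(u, v):
    # Reichenbach implication 1 - u + u*v : d/du = v - 1, d/dv = u
    return (v - 1, u)

print(grad_and_prod(0.0, 0.0))             # (0.0, 0.0): vanishing at u=v=0
print(grad_or_probsum(1.0, 1.0))           # (0.0, 0.0): vanishing at u=v=1
print(grad_implies_reichenbach(0.0, 1.0))  # (0.0, 0.0): vanishing at u=0, v=1
```

On each edge case, both partial derivatives are exactly zero, so a gradient-based learner receives no signal there.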
However, all these issues happen only on edge cases and can easily be fixed using the following "trick": operators with issues at an input of $0$ are computed on $\pi_0(u) = (1-\epsilon)u + \epsilon$, and operators with issues at an input of $1$ are computed on $\pi_1(u) = (1-\epsilon)u$,
where $\epsilon$ is a small positive value (e.g. $1\mathrm{e}{-5}$).
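To illustrate, here is a minimal pure-Python sketch (not the library implementation) of the p-mean error with the stabilizing rescaling $u' = (1-\epsilon)u$ applied to its inputs, together with its hand-derived analytic gradient:

```python
def pmean_error(us, p=4, eps=0.0):
    # stable trick: pull inputs away from 1 with u' = (1 - eps) * u
    us = [(1 - eps) * u for u in us]
    return 1 - (sum((1 - u) ** p for u in us) / len(us)) ** (1 / p)

def pmean_error_grad(us, p=4, eps=0.0):
    # hand-derived partial derivatives d pME / d u_i, including the
    # chain-rule factor (1 - eps) introduced by the rescaling
    us = [(1 - eps) * u for u in us]
    n = len(us)
    m = sum((1 - u) ** p for u in us) / n
    # with eps = 0 and all inputs exactly 1, m = 0 and m ** (1/p - 1)
    # blows up -- the exploding-gradient edge case shown above
    return [(1 - eps) * (1 - u) ** (p - 1) * m ** (1 / p - 1) / n for u in us]

print(pmean_error([1., 1., 1.], eps=1e-5))       # ~0.99999 instead of 1.0
print(pmean_error_grad([1., 1., 1.], eps=1e-5))  # finite, ~0.3333 each
```

With $\epsilon > 0$, the all-ones input no longer hits the singular point, and the gradient stays finite, matching the stabilized output of the library operator below.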
This gives us stabilized versions of such operators. One can trigger the stable behavior with the boolean keyword `stable`. One can set a default value for `stable` when initializing the operator, or use a different value at each call of the operator.
xs = tf.constant([1.,1.,1.])
with tf.GradientTape() as tape:
tape.watch(xs)
# the exploding gradient problem is solved
y = forall_pME(xs,p=4,stable=True)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.9999 [0.33329996 0.33329996 0.33329996]
The hyperparameter $p$ offers flexibility in writing more or less strict formulas, e.g. to account for outliers in the data depending on the application. Note, however, that its value can have strong implications for training.
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])
with tf.GradientTape() as tape:
tape.watch(xs)
y = forall_pME(xs,p=4)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.31339926 [0. 0. 0. 0.0482733 0.13246194 0.19772749 0.19772749 0.28152987]
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])
with tf.GradientTape() as tape:
tape.watch(xs)
y = forall_pME(xs,p=20)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.1815753 [0.0000000e+00 0.0000000e+00 0.0000000e+00 1.0733546e-05 6.4146910e-03 8.1100456e-02 8.1100456e-02 7.6018697e-01]
While it can be tempting to set a high value for $p$ when querying, in a learning setting this quickly turns the operator into a "single-passing" one that focuses too much on outliers at each step (i.e., the gradient concentrates on one input at this step, potentially harming the training of the others). We therefore recommend not setting $p$ too high when learning.
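The "single-passing" tendency described above can be quantified with a small pure-Python sketch (hand-derived gradients, not the library code) that measures the fraction of the total gradient of $\mathrm{pME}$ concentrated on the single worst input as $p$ grows:

```python
def pme_grads(us, p):
    # hand-derived partial derivatives of pME(u_1, ..., u_n) wrt each u_i
    n = len(us)
    m = sum((1 - u) ** p for u in us) / n
    return [(1 - u) ** (p - 1) * m ** (1 / p - 1) / n for u in us]

us = [1., 1., 1., 0.5, 0.3, 0.2, 0.2, 0.1]
for p in (2, 4, 20):
    g = pme_grads(us, p)
    # fraction of the total gradient mass focused on the worst input (0.1)
    print(p, max(g) / sum(g))
```

As $p$ increases, the share of the gradient taken by the worst input grows toward 1, which is exactly the concentration visible in the $p=20$ output above.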