This notebook is a complement to the tutorial on operators.

Logical connectives in LTN are grounded using fuzzy semantics. However, while all fuzzy logic operators make sense when simply *querying* the language, not every operator is equally suited for *learning* in LTN. This notebook explains why some operators are preferred over others.

In [1]:

```
import logictensornetworks as ltn
import numpy as np
import tensorflow as tf
```

One can access the most common fuzzy logic operators in `ltn.fuzzy_ops`. They are implemented using TensorFlow primitives.

Here we compare:

- the product t-norm: $u \land_{\mathrm{prod}} v = uv$,
- the Lukasiewicz t-norm: $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$,
- the minimum aggregator: $\min(u_1,\dots,u_n)$,
- the p-mean error aggregator (generalized mean of the deviations w.r.t. the truth): $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$.
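
As a cross-check, the four definitions above can be written in a few lines of plain Python. This is only a minimal sketch that does not use `ltn`: the function names (`prod_tnorm`, `luk_tnorm`, `agg_min`, `agg_pme`) are hypothetical and chosen for illustration, and the inputs are ordinary floats rather than tensors.

```python
# Plain-Python versions of the four formulas above (illustration only, not the ltn API).

def prod_tnorm(u, v):
    """Product t-norm: u * v."""
    return u * v

def luk_tnorm(u, v):
    """Lukasiewicz t-norm: max(u + v - 1, 0)."""
    return max(u + v - 1.0, 0.0)

def agg_min(us):
    """Minimum aggregator."""
    return min(us)

def agg_pme(us, p):
    """p-mean error: 1 - (mean of (1 - u_i)^p)^(1/p)."""
    mean_dev = sum((1.0 - u) ** p for u in us) / len(us)
    return 1.0 - mean_dev ** (1.0 / p)

u, v = 0.4, 0.7
us = [1., 1., 1., 0.5, 0.3, 0.2, 0.2, 0.1]
print(prod_tnorm(u, v))   # approximately 0.28
print(luk_tnorm(u, v))    # approximately 0.1
print(agg_min(us))        # 0.1
print(agg_pme(us, p=4))   # approximately 0.313
```

With these inputs, the minimum reports only the worst value, while the p-mean error with $p=4$ gives a softer summary that still accounts for the other inputs.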

Each operator conveys a different meaning, but all of them can make sense depending on the intent of the query.

In [9]:

```
x1 = tf.constant(0.4)
x2 = tf.constant(0.7)
# the stable keyword is explained at the end of the notebook
and_prod = ltn.fuzzy_ops.And_Prod(stable=False)
and_luk = ltn.fuzzy_ops.And_Luk()
print(and_prod(x1,x2))
print(and_luk(x1,x2))
```

In [16]:

```
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])
forall_min = ltn.fuzzy_ops.Aggreg_Min()
# the stable keyword is explained at the end of the notebook
forall_pME = ltn.fuzzy_ops.Aggreg_pMeanError(p=4, stable=False)
print(forall_min(xs))
print(forall_pME(xs))
```

While all operators are suitable in a querying setting, this is not the case in a learning setting. Indeed, many fuzzy logic operators have derivatives that are unsuitable for gradient-based algorithms. For more details, read van Krieken et al., *Analyzing Differentiable Fuzzy Logic Operators*, 2020.

Here we give simple illustrations of such gradient issues.

Some operators have vanishing gradients on some part of their domains.

For example, in $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$, the gradients vanish whenever $u+v-1 < 0$.

In [13]:

```
x1 = tf.constant(0.3)
x2 = tf.constant(0.5)
with tf.GradientTape() as tape:
    tape.watch(x1)
    tape.watch(x2)
    y = and_luk(x1,x2)
res = y.numpy()
gradients = [v.numpy() for v in tape.gradient(y,[x1,x2])]
print(res)
print(gradients)
```

Some operators have gradients propagating to only one input at a time, meaning that all other inputs will not benefit from learning at this step.

e.g. in $\min(u_1,\dots,u_n)$.

In [15]:

```
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])
with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_min(xs)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
```

Some operators have exploding gradients on some part of their domains.

e.g. in $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$, on the edge case where all inputs are $1.0$.

In [18]:

```
xs = tf.constant([1.,1.,1.])
with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_pME(xs,p=4)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
```

Our general recommendation is to use the following "product configuration":

- not: the standard negation $\lnot u = 1-u$,
- and: the product t-norm $u \land v = uv$,
- or: the product t-conorm (probabilistic sum) $u \lor v = u+v-uv$,
- implication: the Reichenbach implication $u \rightarrow v = 1 - u + uv$,
- existential quantification ("exists"): the generalized mean (p-mean) $\mathrm{pM}(u_1,\dots,u_n) = \biggl( \frac{1}{n} \sum\limits_{i=1}^n u_i^p \biggr)^{\frac{1}{p}} \qquad p \geq 1$,
- universal quantification ("for all"): the generalized mean of "the deviations w.r.t. the truth" (p-mean error) $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}} \qquad p \geq 1$.

As is, this configuration is not fully exempt from issues. The product t-norm has vanishing gradients on the edge case $u=v=0$. The product t-conorm has vanishing gradients on the edge case $u=v=1$. The Reichenbach implication has vanishing gradients on the edge case $u=0$, $v=1$. p-mean has exploding gradients on the edge case $u_1=\dots=u_n=0$. p-mean error has exploding gradients on the edge case $u_1=\dots=u_n=1$.
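
The edge cases listed above can be read directly off the partial derivatives. The following is a short derivation from the formulas of the product configuration, not taken from the LTN implementation. The derivatives with respect to $v$ for the t-norm and t-conorm follow by symmetry.

$$
\frac{\partial (uv)}{\partial u} = v, \qquad
\frac{\partial (u+v-uv)}{\partial u} = 1-v, \qquad
\frac{\partial (1-u+uv)}{\partial u} = v-1, \qquad
\frac{\partial (1-u+uv)}{\partial v} = u
$$

$$
\frac{\partial\, \mathrm{pM}}{\partial u_i} = \frac{1}{n}\, u_i^{p-1} \biggl( \frac{1}{n} \sum\limits_{j=1}^n u_j^p \biggr)^{\frac{1}{p}-1}, \qquad
\frac{\partial\, \mathrm{pME}}{\partial u_i} = \frac{1}{n}\, (1-u_i)^{p-1} \biggl( \frac{1}{n} \sum\limits_{j=1}^n (1-u_j)^p \biggr)^{\frac{1}{p}-1}
$$

For $p > 1$, the exponent $\frac{1}{p}-1$ is negative, so the second factor diverges when the mean inside the parentheses reaches $0$, which is exactly the case $u_1=\dots=u_n=0$ for pM and $u_1=\dots=u_n=1$ for pME.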

However, all these issues happen on edge cases and can easily be fixed using the following "trick":

- if the edge case happens when an input $u$ is $0$, we modify every input with $u' = (1-\epsilon)u+\epsilon$,
- if the edge case happens when an input $u$ is $1$, we modify every input with $u' = (1-\epsilon)u$,

where $\epsilon$ is a small positive value (e.g. $1\mathrm{e}{-5}$).
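
The effect of the trick can be sketched in plain Python for the p-mean error, whose edge case is at $1$. The helper `pme_grad` below is hypothetical and written only for illustration: it evaluates the analytic gradient of pME by hand after applying $u' = (1-\epsilon)u$, and it is not how `ltn` computes gradients.

```python
def pme_grad(us, p, eps=0.0):
    """Analytic gradient of pME w.r.t. the original inputs,
    after shifting every input with u' = (1 - eps) * u."""
    n = len(us)
    shifted = [(1.0 - eps) * u for u in us]
    mean_dev = sum((1.0 - u) ** p for u in shifted) / n
    # Negative exponent: plain Python raises ZeroDivisionError when mean_dev == 0.
    outer = mean_dev ** (1.0 / p - 1.0)
    # The factor (1 - eps) comes from the chain rule through u' = (1 - eps) * u.
    return [(1.0 - eps) * (1.0 - u) ** (p - 1) * outer / n for u in shifted]

us = [1.0, 1.0, 1.0]

try:
    pme_grad(us, p=4)              # no stabilization: the mean deviation is exactly 0
    unstable_ok = True
except ZeroDivisionError:
    unstable_ok = False

stable_grad = pme_grad(us, p=4, eps=1e-5)
print(unstable_ok)   # False
print(stable_grad)   # each entry is close to 1/3
```

Plain Python raises an exception where a tensor library would typically return `inf` or `nan`, but the cause is the same: a zero raised to a negative power. With the shift, every input receives a finite gradient of about $1/n$.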

This gives us stabilized versions of such operators. One can trigger the stable behavior by using the boolean keyword `stable`. One can set a default value for `stable` when initializing the operator, or can use different values at each call of the operator.

In [19]:

```
xs = tf.constant([1.,1.,1.])
with tf.GradientTape() as tape:
    tape.watch(xs)
    # the exploding gradient problem is solved
    y = forall_pME(xs,p=4,stable=True)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
```

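$p$ offers flexibility in writing more or less strict formulas, to account for outliers in the data depending on the application: the higher $p$ is, the closer the p-mean error gets to the minimum, and the more weight the lowest truth values receive. Note that this can have strong implications for training.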

In [20]:

```
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])
with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_pME(xs,p=4)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
```

In [21]:

```
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])
with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_pME(xs,p=20)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
```

While it can be tempting to set a high value for $p$ when querying, in a learning setting this can quickly lead to a "single-passing" operator that focuses too much on outliers at each step (i.e., gradients overfitting one input at this step, potentially harming the training of the others). We recommend against setting $p$ too high when learning.
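
To make the "single-passing" effect concrete, one can look at how the pME gradient is shared among the inputs. From the formula of pME, the gradient with respect to $u_i$ is proportional to $(1-u_i)^{p-1}$, so the share each input receives can be computed by hand. The sketch below is plain Python with a hypothetical helper `pme_grad_shares`; it does not call `ltn`.

```python
def pme_grad_shares(us, p):
    """Fraction of the total pME gradient received by each input.
    The gradient w.r.t. u_i is proportional to (1 - u_i)^(p - 1)."""
    weights = [(1.0 - u) ** (p - 1) for u in us]
    total = sum(weights)
    return [w / total for w in weights]

us = [1., 1., 1., 0.5, 0.3, 0.2, 0.2, 0.1]
share_p4 = pme_grad_shares(us, p=4)[-1]    # share of the lowest input (0.1)
share_p20 = pme_grad_shares(us, p=20)[-1]
print(round(share_p4, 2))    # 0.33
print(round(share_p20, 2))   # 0.82
```

With $p=4$ the lowest input receives about a third of the gradient, whereas with $p=20$ it receives more than four fifths, leaving little signal for the other inputs at that step.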