Complementary Notebook: Appropriate Operators to Approximate Connectives and Quantifiers

This notebook is a complement to the tutorial on operators.

Logical connectives in LTN are grounded using fuzzy semantics. However, while all fuzzy logic operators make sense when simply querying the language, not every operator is equally suited for learning in LTN. This notebook details the motivations behind the choice of some operators over others.

In [1]:
import logictensornetworks as ltn
import numpy as np
import tensorflow as tf

Querying

One can access the most common fuzzy logic operators in ltn.fuzzy_ops. They are implemented using TensorFlow primitives.

We here compare

  • the product t-norm: $u \land_{\mathrm{prod}} v = uv$,
  • the Lukasiewicz t-norm: $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$,
  • the minimum aggregator: $\min(u_1,\dots,u_n)$,
  • the p-mean error aggregator (generalized mean of the deviations w.r.t. the truth): $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$.

Each operator conveys a markedly different meaning, but each can make sense depending on the intent of the query.

In [9]:
x1 = tf.constant(0.4)
x2 = tf.constant(0.7)

# the stable keyword is explained at the end of the notebook
and_prod = ltn.fuzzy_ops.And_Prod(stable=False)
and_luk = ltn.fuzzy_ops.And_Luk()

print(and_prod(x1,x2))
print(and_luk(x1,x2))
tf.Tensor(0.28, shape=(), dtype=float32)
tf.Tensor(0.100000024, shape=(), dtype=float32)
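As a sanity check, the two conjunction values above can be recomputed directly from the formulas; a minimal sketch in plain TensorFlow:

# recomputing the conjunctions directly from their definitions
print(x1*x2)                       # product t-norm: 0.4*0.7 = 0.28
print(tf.maximum(x1+x2-1., 0.))    # Lukasiewicz t-norm: max(0.4+0.7-1, 0) = 0.1 (up to float error)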
In [16]:
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])

# the stable keyword is explained at the end of the notebook
forall_min = ltn.fuzzy_ops.Aggreg_Min()
forall_pME = ltn.fuzzy_ops.Aggreg_pMeanError(p=4, stable=False)

print(forall_min(xs))
print(forall_pME(xs))
tf.Tensor(0.1, shape=(), dtype=float32)
tf.Tensor(0.31339914, shape=(), dtype=float32)
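Likewise, the aggregated values can be re-derived from the definitions; a minimal sketch (the numbers match the outputs above up to floating-point error):

# recomputing the aggregations directly from their definitions
print(tf.reduce_min(xs))                          # minimum aggregator: 0.1
print(1. - tf.reduce_mean((1.-xs)**4.)**(1./4.))  # p-mean error with p=4: ~0.3134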

Learning

While all operators are suitable in a querying setting, this is not the case in a learning setting. Indeed, many fuzzy logic operators have derivatives that are not suitable for gradient-based algorithms. For more details, read van Krieken et al., Analyzing Differentiable Fuzzy Logic Operators, 2020.

We here give simple illustrations of such gradient issues.

1. Vanishing Gradients

Some operators have vanishing gradients on some part of their domains.

e.g. in $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$, if $u+v-1 < 0$, the gradients vanish.

In [13]:
x1 = tf.constant(0.3)
x2 = tf.constant(0.5)

with tf.GradientTape() as tape:
    tape.watch(x1)
    tape.watch(x2)
    y = and_luk(x1,x2)
res = y.numpy()
gradients = [v.numpy() for v in tape.gradient(y,[x1,x2])]
print(res)
print(gradients)
0.0
[0.0, 0.0]
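For contrast, when $u+v-1 > 0$ the Lukasiewicz t-norm does propagate gradients (both partial derivatives equal $1$); a minimal sketch:

x1 = tf.constant(0.7)
x2 = tf.constant(0.6)

with tf.GradientTape() as tape:
    tape.watch(x1)
    tape.watch(x2)
    y = and_luk(x1,x2)  # 0.7+0.6-1 = 0.3 > 0, so no vanishing gradient here
print(y.numpy())                                      # ~0.3
print([v.numpy() for v in tape.gradient(y,[x1,x2])])  # [1.0, 1.0]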

2. Single-Passing Gradients

Some operators propagate gradients to only one input at a time, meaning that, at a given training step, none of the other inputs receive any learning signal.

e.g. in $\min(u_1,\dots,u_n)$.

In [15]:
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])

with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_min(xs)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.1
[0. 0. 0. 0. 0. 0. 0. 1.]

3. Exploding Gradients

Some operators have exploding gradients on some part of their domains.

e.g. in $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$, on the edge case where all inputs are $1.0$.

In [18]:
xs = tf.constant([1.,1.,1.])

with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_pME(xs,p=4)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
1.0
[nan nan nan]
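The NaN can be traced back to the partial derivative of the aggregator, $\frac{\partial\, \mathrm{pME}}{\partial u_i} = \frac{1}{n} \Bigl( \frac{1}{n} \sum_{j=1}^n (1-u_j)^p \Bigr)^{\frac{1}{p}-1} (1-u_i)^{p-1}$: when all inputs equal $1$, the first factor diverges (zero raised to a negative power) while the last factor is zero, and this $\infty \cdot 0$ is evaluated numerically as NaN.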

Stable Product Configuration

Product Configuration

Our general recommendation is to use the following "product configuration" (a minimal sketch of these operators, written directly from the formulas, follows the list):

  • not: the standard negation $\lnot u = 1-u$,
  • and: the product t-norm $u \land v = uv$,
  • or: the product t-conorm (probabilistic sum) $u \lor v = u+v-uv$,
  • implication: the Reichenbach implication $u \rightarrow v = 1 - u + uv$,
  • existential quantification ("exists"): the generalized mean (p-mean) $\mathrm{pM}(u_1,\dots,u_n) = \biggl( \frac{1}{n} \sum\limits_{i=1}^n u_i^p \biggr)^{\frac{1}{p}} \qquad p \geq 1$,
  • universal quantification ("for all"): the generalized mean of "the deviations w.r.t. the truth" (p-mean error) $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}} \qquad p \geq 1$.
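The following sketch writes the product configuration directly from the formulas above in plain TensorFlow; it is for illustration only (the library provides these operators as classes in ltn.fuzzy_ops), and the function names below are made up:

# the product configuration written directly from the formulas (illustrative names)
def f_not(u): return 1.-u
def f_and(u,v): return u*v
def f_or(u,v): return u+v-u*v
def f_implies(u,v): return 1.-u+u*v
def f_exists(us,p): return tf.reduce_mean(us**p)**(1./p)
def f_forall(us,p): return 1.-tf.reduce_mean((1.-us)**p)**(1./p)

print(f_implies(tf.constant(0.4),tf.constant(0.7)))                # 1 - 0.4 + 0.4*0.7 = 0.88
print(f_forall(tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1]),p=4.))  # ~0.3134, as above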

"Stable"

As is, this configuration is not fully exempt from issues:

  • the product t-norm has vanishing gradients on the edge case $u=v=0$,
  • the product t-conorm has vanishing gradients on the edge case $u=v=1$,
  • the Reichenbach implication has vanishing gradients on the edge case $u=0$, $v=1$,
  • p-mean has exploding gradients on the edge case $u_1=\dots=u_n=0$,
  • p-mean error has exploding gradients on the edge case $u_1=\dots=u_n=1$.
However, all these issues happen on edge cases and can easily be fixed using the following "trick":

  • if the edge case happens when an input $u$ is $0$, we modify every input with $u' = (1-\epsilon)u+\epsilon$,
  • if the edge case happens when an input $u$ is $1$, we modify every input with $u' = (1-\epsilon)u$,

where $\epsilon$ is a small positive value (e.g. $1\mathrm{e}{-5}$).
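For instance, applying the second rule by hand to the exploding-gradient example above already restores finite gradients; a minimal sketch (the rescaling is done manually here, with $\epsilon = 1\mathrm{e}{-5}$):

xs = tf.constant([1.,1.,1.])
eps = 1e-5

with tf.GradientTape() as tape:
    tape.watch(xs)
    # apply the trick manually: push the inputs slightly away from the edge case
    y = forall_pME((1-eps)*xs, p=4)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)        # close to 1
print(gradients)  # finite gradients (roughly 1/3 each) instead of nan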

This gives us stabilized versions of such operators. One can trigger the stable behavior via the boolean keyword argument stable. A default value for stable can be set when initializing the operator, and it can be overridden at each call of the operator.

In [19]:
xs = tf.constant([1.,1.,1.])

with tf.GradientTape() as tape:
    tape.watch(xs)
    # the exploding gradient problem is solved
    y = forall_pME(xs,p=4,stable=True)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.9999
[0.3333 0.3333 0.3333]

The hyper-parameter $p$ in the generalized means

The exponent $p$ offers flexibility in writing more or less strict formulas, to account for outliers in the data, depending on the application. Note, however, that the choice of $p$ has strong implications for training, as illustrated below.

In [20]:
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])

with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_pME(xs,p=4)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.31339914
[0.         0.         0.         0.0482733  0.13246194 0.19772746
 0.19772746 0.28152987]
In [21]:
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])

with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_pME(xs,p=20)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.18157518
[0.0000000e+00 0.0000000e+00 0.0000000e+00 1.0733546e-05 6.4146915e-03
 8.1100404e-02 8.1100404e-02 7.6018733e-01]

While it can be tempting to set a high value for $p$ when querying, in a learning setting this can quickly lead to a "single-passing" operator: the gradients concentrate on one (outlier) input at each step, potentially harming the training of the others. We therefore recommend not setting $p$ too high when learning.
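A complementary way to see this: as $p$ grows, the p-mean error approaches the minimum aggregator, i.e., exactly the single-passing operator discussed earlier; a minimal sketch:

xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])

# as p grows, pME approaches the min aggregator (single-passing behavior)
print(forall_pME(xs,p=4))     # ~0.313
print(forall_pME(xs,p=20))    # ~0.182
print(forall_pME(xs,p=100))   # close to the minimum
print(forall_min(xs))         # 0.1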