Complementary Notebook: Appropriate Operators to Approximate Connectives and Quantifiers

This notebook is a complement to the tutorial on operators.

Logical connectives in LTN are grounded using fuzzy semantics. However, while all fuzzy logic operators make sense when simply querying the language, not every operator is equally suited for learning in LTN. This notebook details the motivations behind the choice of some operators over others.

In [1]:
import logictensornetworks as ltn
import numpy as np
import tensorflow as tf

Querying

One can access the most common fuzzy logic operators in ltn.fuzzy_ops. They are implemented using TensorFlow primitives.

We here compare

  • the product t-norm: $u \land_{\mathrm{prod}} v = uv$,
  • the Lukasiewicz t-norm: $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$,
  • the minimum aggregator: $\min(u_1,\dots,u_n)$,
  • the p-mean error aggregator (generalized mean of the deviations w.r.t. the truth): $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$.

Each operator conveys a markedly different meaning, but each can make sense depending on the intent of the query.

In [9]:
x1 = tf.constant(0.4)
x2 = tf.constant(0.7)

# the stable keyword is explained at the end of the notebook
and_prod = ltn.fuzzy_ops.And_Prod(stable=False)
and_luk = ltn.fuzzy_ops.And_Luk()

print(and_prod(x1,x2))
print(and_luk(x1,x2))
tf.Tensor(0.28, shape=(), dtype=float32)
tf.Tensor(0.100000024, shape=(), dtype=float32)
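As a sanity check, the two conjunction values above can be recomputed directly from the formulas; a minimal sketch in plain TensorFlow:

# recomputing the conjunctions directly from their definitions
print(x1*x2)                       # product t-norm: 0.4*0.7 = 0.28
print(tf.maximum(x1+x2-1., 0.))    # Lukasiewicz t-norm: max(0.4+0.7-1, 0) = 0.1 (up to float error)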
In [16]:
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])

# the stable keyword is explained at the end of the notebook
forall_min = ltn.fuzzy_ops.Aggreg_Min()
forall_pME = ltn.fuzzy_ops.Aggreg_pMeanError(p=4, stable=False)

print(forall_min(xs))
print(forall_pME(xs))
tf.Tensor(0.1, shape=(), dtype=float32)
tf.Tensor(0.31339914, shape=(), dtype=float32)
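Likewise, the aggregated values can be re-derived from the definitions; a minimal sketch (the numbers match the outputs above up to floating-point error):

# recomputing the aggregations directly from their definitions
print(tf.reduce_min(xs))                          # minimum aggregator: 0.1
print(1. - tf.reduce_mean((1.-xs)**4.)**(1./4.))  # p-mean error with p=4: ~0.3134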

Learning

While all operators are suitable in a querying setting, this is not the case in a learning setting. Indeed, many fuzzy logic operators have derivatives that are not suitable for gradient-based algorithms. For more details, read van Krieken et al., Analyzing Differentiable Fuzzy Logic Operators, 2020.

We here give simple illustrations of such gradient issues.

1. Vanishing Gradients

Some operators have vanishing gradients on some part of their domains.

e.g. in $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$, if $u+v-1 < 0$, the gradients vanish.

In [13]:
x1 = tf.constant(0.3)
x2 = tf.constant(0.5)

with tf.GradientTape() as tape:
    tape.watch(x1)
    tape.watch(x2)
    y = and_luk(x1,x2)
res = y.numpy()
gradients = [v.numpy() for v in tape.gradient(y,[x1,x2])]
print(res)
print(gradients)
0.0
[0.0, 0.0]
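For contrast, when $u+v-1 > 0$ the Lukasiewicz t-norm does propagate gradients (both partial derivatives equal $1$); a minimal sketch:

x1 = tf.constant(0.7)
x2 = tf.constant(0.6)

with tf.GradientTape() as tape:
    tape.watch(x1)
    tape.watch(x2)
    y = and_luk(x1,x2)  # 0.7+0.6-1 = 0.3 > 0, so no vanishing gradient here
print(y.numpy())                                      # ~0.3
print([v.numpy() for v in tape.gradient(y,[x1,x2])])  # [1.0, 1.0]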

2. Single-Passing Gradients

Some operators propagate gradients to only one input at a time, meaning that, at a given training step, none of the other inputs receive any learning signal.

e.g. in $\min(u_1,\dots,u_n)$.

In [15]:
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])

with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_min(xs)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.1
[0. 0. 0. 0. 0. 0. 0. 1.]

3. Exploding Gradients

Some operators have exploding gradients on some part of their domains.

e.g. in $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$, on the edge case where all inputs are $1.0$.

In [18]:
xs = tf.constant([1.,1.,1.])

with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_pME(xs,p=4)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
1.0
[nan nan nan]
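The NaN can be traced back to the partial derivative of the aggregator, $\frac{\partial\, \mathrm{pME}}{\partial u_i} = \frac{1}{n} \Bigl( \frac{1}{n} \sum_{j=1}^n (1-u_j)^p \Bigr)^{\frac{1}{p}-1} (1-u_i)^{p-1}$: when all inputs equal $1$, the first factor diverges (zero raised to a negative power) while the last factor is zero, and this $\infty \cdot 0$ is evaluated numerically as NaN.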

Stable Product Configuration

Product Configuration

Our general recommendation is to use the following "product configuration" (a minimal sketch of these operators, written directly from the formulas, follows the list):

  • not: the standard negation $\lnot u = 1-u$,
  • and: the product t-norm $u \land v = uv$,
  • or: the product t-conorm (probabilistic sum) $u \lor v = u+v-uv$,
  • implication: the Reichenbach implication $u \rightarrow v = 1 - u + uv$,
  • existential quantification ("exists"): the generalized mean (p-mean) $\mathrm{pM}(u_1,\dots,u_n) = \biggl( \frac{1}{n} \sum\limits_{i=1}^n u_i^p \biggr)^{\frac{1}{p}} \qquad p \geq 1$,
  • universal quantification ("for all"): the generalized mean of "the deviations w.r.t. the truth" (p-mean error) $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}} \qquad p \geq 1$.
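The following sketch writes the product configuration directly from the formulas above in plain TensorFlow; it is for illustration only (the library provides these operators as classes in ltn.fuzzy_ops), and the function names below are made up:

# the product configuration written directly from the formulas (illustrative names)
def f_not(u): return 1.-u
def f_and(u,v): return u*v
def f_or(u,v): return u+v-u*v
def f_implies(u,v): return 1.-u+u*v
def f_exists(us,p): return tf.reduce_mean(us**p)**(1./p)
def f_forall(us,p): return 1.-tf.reduce_mean((1.-us)**p)**(1./p)

print(f_implies(tf.constant(0.4),tf.constant(0.7)))                # 1 - 0.4 + 0.4*0.7 = 0.88
print(f_forall(tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1]),p=4.))  # ~0.3134, as above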

"Stable"

As is, this configuration is not fully exempt from issues:

  • the product t-norm has vanishing gradients on the edge case $u=v=0$,
  • the product t-conorm has vanishing gradients on the edge case $u=v=1$,
  • the Reichenbach implication has vanishing gradients on the edge case $u=0$, $v=1$,
  • p-mean has exploding gradients on the edge case $u_1=\dots=u_n=0$,
  • p-mean error has exploding gradients on the edge case $u_1=\dots=u_n=1$.
However, all these issues happen on edge cases and can easily be fixed using the following "trick":

  • if the edge case happens when an input $u$ is $0$, we modify every input with $u' = (1-\epsilon)u+\epsilon$,
  • if the edge case happens when an input $u$ is $1$, we modify every input with $u' = (1-\epsilon)u$,

where $\epsilon$ is a small positive value (e.g. $1\mathrm{e}{-5}$).
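For instance, applying the second rule by hand to the exploding-gradient example above already restores finite gradients; a minimal sketch (the rescaling is done manually here, with $\epsilon = 1\mathrm{e}{-5}$):

xs = tf.constant([1.,1.,1.])
eps = 1e-5

with tf.GradientTape() as tape:
    tape.watch(xs)
    # apply the trick manually: push the inputs slightly away from the edge case
    y = forall_pME((1-eps)*xs, p=4)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)        # close to 1
print(gradients)  # finite gradients (roughly 1/3 each) instead of nan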

This gives us stabilized versions of such operators. One can trigger the stable behavior via the boolean keyword argument stable. A default value for stable can be set when initializing the operator, and it can be overridden at each call of the operator.

In [19]:
xs = tf.constant([1.,1.,1.])

with tf.GradientTape() as tape:
    tape.watch(xs)
    # the exploding gradient problem is solved
    y = forall_pME(xs,p=4,stable=True)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.9999
[0.3333 0.3333 0.3333]

The hyper-parameter $p$ in the generalized means

The exponent $p$ offers flexibility in writing more or less strict formulas, to account for outliers in the data, depending on the application. Note, however, that the choice of $p$ has strong implications for training, as illustrated below.

In [20]:
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])

with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_pME(xs,p=4)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.31339914
[0.         0.         0.         0.0482733  0.13246194 0.19772746
 0.19772746 0.28152987]
In [21]:
xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])

with tf.GradientTape() as tape:
    tape.watch(xs)
    y = forall_pME(xs,p=20)
res = y.numpy()
gradients = tape.gradient(y,xs).numpy()
print(res)
print(gradients)
0.18157518
[0.0000000e+00 0.0000000e+00 0.0000000e+00 1.0733546e-05 6.4146915e-03
 8.1100404e-02 8.1100404e-02 7.6018733e-01]

While it can be tempting to set a high value for $p$ when querying, in a learning setting this can quickly lead to a "single-passing" operator: the gradients concentrate on one (outlier) input at each step, potentially harming the training of the others. We therefore recommend not setting $p$ too high when learning.
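A complementary way to see this: as $p$ grows, the p-mean error approaches the minimum aggregator, i.e., exactly the single-passing operator discussed earlier; a minimal sketch:

xs = tf.constant([1.,1.,1.,0.5,0.3,0.2,0.2,0.1])

# as p grows, pME approaches the min aggregator (single-passing behavior)
print(forall_pME(xs,p=4))     # ~0.313
print(forall_pME(xs,p=20))    # ~0.182
print(forall_pME(xs,p=100))   # close to the minimum
print(forall_min(xs))         # 0.1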