Additive attention layer, a.k.a. Bahdanau-style attention
layer_additive_attention(
object,
use_scale = TRUE,
...,
causal = FALSE,
dropout = 0
)

object: What to call the new Layer instance with. Typically a keras
Model, another Layer, or a tf.Tensor/KerasTensor. If object is
missing, the Layer instance is returned; otherwise, layer(object) is
returned.
use_scale: If TRUE, will create a variable to scale the attention scores.

...: Standard layer arguments.

causal: Boolean. Set to TRUE for decoder self-attention. Adds a mask such
that position i cannot attend to positions j > i. This prevents the
flow of information from the future towards the past.

dropout: Float between 0 and 1. Fraction of the units to drop for the attention scores.
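For context, a minimal sketch of wiring the layer into a functional model; the sequence lengths, feature dimension, and dropout rate below are illustrative assumptions, not defaults from the signature above.

library(keras)

# Illustrative shapes: Tq = 8 query timesteps, Tv = 12 value timesteps, dim = 16.
query <- layer_input(shape = c(8, 16))
value <- layer_input(shape = c(12, 16))

# With object missing, the Layer instance itself is returned ...
attention <- layer_additive_attention(use_scale = TRUE, dropout = 0.1)

# ... and it can then be called on a list of (query, value) tensors.
attended <- attention(list(query, value))

model <- keras_model(inputs = list(query, value), outputs = attended)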
Inputs are a query tensor of shape [batch_size, Tq, dim], a value tensor of
shape [batch_size, Tv, dim], and a key tensor of shape
[batch_size, Tv, dim]. The calculation follows these steps:

1. Reshape query and key into shapes [batch_size, Tq, 1, dim]
   and [batch_size, 1, Tv, dim] respectively.

2. Calculate scores with shape [batch_size, Tq, Tv] as a non-linear
   sum: scores = tf$reduce_sum(tf$tanh(query + key), axis = -1L).

3. Use scores to calculate a distribution with shape
   [batch_size, Tq, Tv]: distribution = tf$nn$softmax(scores).

4. Use distribution to create a linear combination of value with
   shape [batch_size, Tq, dim]: return tf$matmul(distribution, value).
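As a rough illustration of these steps, the same computation can be sketched directly with the tensorflow R package. The tensor sizes are arbitrary, and reusing value as the key is an illustrative simplification here.

library(tensorflow)

batch_size <- 2L; Tq <- 3L; Tv <- 4L; dim <- 5L

query <- tf$random$normal(shape(batch_size, Tq, dim))
value <- tf$random$normal(shape(batch_size, Tv, dim))
key <- value  # illustrative: reuse the value tensor as the key

# Step 1: reshape query and key so they broadcast over Tq and Tv.
q <- tf$reshape(query, shape(batch_size, Tq, 1L, dim))
k <- tf$reshape(key, shape(batch_size, 1L, Tv, dim))

# Step 2: non-linear additive scores, shape [batch_size, Tq, Tv].
scores <- tf$reduce_sum(tf$tanh(q + k), axis = -1L)

# Step 3: softmax over the value axis gives the attention distribution.
distribution <- tf$nn$softmax(scores)

# Step 4: linear combination of value, shape [batch_size, Tq, dim].
output <- tf$matmul(distribution, value)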