Performs multi-headed attention from from_tensor to to_tensor.
This is an implementation of multi-headed attention based on "Attention Is
All You Need". If from_tensor and to_tensor are the same, then
this is self-attention. Each timestep in from_tensor attends to the
corresponding sequence in to_tensor, and returns a fixed-width vector.
This function first projects from_tensor into a "query" tensor and
to_tensor into "key" and "value" tensors. These are (effectively) a
list of tensors of length num_attention_heads, where each tensor is of
shape [batch_size, seq_length, size_per_head]. Then, the query and key
tensors are dot-producted and scaled. These are softmaxed to obtain attention
probabilities. The value tensors are then interpolated by these
probabilities, then concatenated back to a single tensor and returned.
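For intuition, the core per-head computation can be sketched in a few lines of
base R (illustrative only; the actual layer operates on TensorFlow tensors,
uses learned dense projections, and handles batching and masking):

softmax <- function(x) { e <- exp(x - max(x)); e / sum(e) }

set.seed(42)
from_seq_length <- 3; to_seq_length <- 4; size_per_head <- 8

# Stand-ins for the projected "query", "key", and "value" tensors for one
# example in the batch and one attention head.
query <- matrix(rnorm(from_seq_length * size_per_head), from_seq_length)
key   <- matrix(rnorm(to_seq_length * size_per_head), to_seq_length)
value <- matrix(rnorm(to_seq_length * size_per_head), to_seq_length)

# Scaled dot-product attention scores: [from_seq_length, to_seq_length].
scores <- (query %*% t(key)) / sqrt(size_per_head)

# Softmax each row to get attention probabilities, then use them to
# interpolate the value vectors.
probs   <- t(apply(scores, 1, softmax))
context <- probs %*% value   # [from_seq_length, size_per_head]
dim(context)                 # 3 8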
attention_layer(
from_tensor,
to_tensor,
attention_mask = NULL,
num_attention_heads = 1L,
size_per_head = 512L,
query_act = NULL,
key_act = NULL,
value_act = NULL,
attention_probs_dropout_prob = 0,
initializer_range = 0.02,
do_return_2d_tensor = FALSE,
batch_size = NULL,
from_seq_length = NULL,
to_seq_length = NULL
)

from_tensor: Float Tensor of shape [batch_size, from_seq_length,
from_width].

to_tensor: Float Tensor of shape [batch_size, to_seq_length, to_width].

attention_mask: (Optional) Integer Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1. (A
sketch of how the mask is applied appears after this argument list.)

num_attention_heads: Integer; number of attention heads.

size_per_head: Integer; size of each attention head.

query_act: (Optional) Activation function for the query transform.

key_act: (Optional) Activation function for the key transform.

value_act: (Optional) Activation function for the value transform.

attention_probs_dropout_prob: (Optional) Numeric; dropout probability of
the attention probabilities.

initializer_range: Numeric; range of the weight initializer.

do_return_2d_tensor: Logical. If TRUE, the output will be of shape
[batch_size * from_seq_length, num_attention_heads * size_per_head]. If
FALSE, the output will be of shape [batch_size, from_seq_length,
num_attention_heads * size_per_head].
batch_size: (Optional) Integer; if the input is 2D, this must be the batch
size of the 3D version of the from_tensor and to_tensor.

from_seq_length: (Optional) Integer; if the input is 2D, this must be the
seq length of the 3D version of the from_tensor.

to_seq_length: (Optional) Integer; if the input is 2D, this must be the
seq length of the 3D version of the to_tensor.
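To make the attention_mask behavior concrete: the usual trick (used by the
reference BERT implementation, and sketched here in base R purely for
illustration) is to convert the 0/1 mask into a large negative additive bias
on the scores before the softmax, so that masked positions receive essentially
zero attention probability:

softmax <- function(x) { e <- exp(x - max(x)); e / sum(e) }

scores <- matrix(rnorm(2 * 4), nrow = 2)   # [from_seq_length, to_seq_length]
mask   <- matrix(c(1, 1, 1, 0,             # position 1 may attend to 1:3
                   1, 1, 0, 0),            # position 2 may attend to 1:2
                 nrow = 2, byrow = TRUE)

adder <- (1 - mask) * -10000               # effectively -infinity where mask == 0
probs <- t(apply(scores + adder, 1, softmax))
round(probs, 3)                            # masked columns get ~0 probability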
The function returns a float Tensor of shape [batch_size, from_seq_length,
num_attention_heads * size_per_head]. If do_return_2d_tensor is
TRUE, it will be flattened to shape [batch_size * from_seq_length,
num_attention_heads * size_per_head].
In practice, the multi-headed attention is done with transposes and reshapes rather than with actual separate tensors.
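The idea, sketched below with base-R arrays purely for illustration (note that
R fills arrays in column-major order while TensorFlow reshapes row-major, so
this demonstrates the shapes rather than the exact element layout), is that one
combined projection of shape [batch_size * seq_length, num_attention_heads *
size_per_head] is reshaped and transposed into a per-head layout instead of
being split into num_attention_heads separate tensors:

batch_size          <- 2
seq_length          <- 3
num_attention_heads <- 4
size_per_head       <- 5

# One combined projection for all heads: [batch_size * seq_length, heads * size].
combined <- matrix(
  rnorm(batch_size * seq_length * num_attention_heads * size_per_head),
  nrow = batch_size * seq_length
)

# View the same numbers as a 4D array ...
x <- array(combined,
           dim = c(batch_size, seq_length, num_attention_heads, size_per_head))

# ... and permute to [batch_size, num_attention_heads, seq_length, size_per_head],
# so every head can be processed with one batched matrix multiplication.
per_head <- aperm(x, c(1, 3, 2, 4))
dim(per_head)   # 2 4 3 5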
# Maybe add examples later. For now, this is only called from
# within transformer_model(), so refer to that function.