On the flexibility of dense layers in Keras
Keras gives its dense layers a great deal of flexibility in the input shapes they accept. In this more advanced post, I'll illustrate this and rationalise why it happens. If you are unfamiliar with shapes or the way Keras handles them, my posts Shapes - an introduction and A closer look at shapes might help.
Let's dive right in with an example.
import tensorflow as tf
# initialise 4 input tensors
# the tensors only contain ones
a = tf.ones((1, 2))
b = tf.ones((1, 2, 3))
c = tf.ones((1, 2, 3, 4))
d = tf.ones((1, 2, 3, 4, 5))
# add the inputs to a list
inputs = [a,b,c,d]
# make a different dense layer for each input
# since they have different shapes
dense_layers = [tf.keras.layers.Dense(2) for _ in inputs]
# run the input through its dense layer and print the shape
for i, layer in zip(inputs, dense_layers):
    print(f"Using input of shape {i.shape}")
    prediction = layer(i)
    print(f"Output shape {prediction.shape}")
    weights, bias = layer.get_weights()
    print(f"The weights shape is {weights.shape}")
    print("---")
Using input of shape (1, 2)
Output shape (1, 2)
The weights shape is (2, 2)
---
Using input of shape (1, 2, 3)
Output shape (1, 2, 2)
The weights shape is (3, 2)
---
Using input of shape (1, 2, 3, 4)
Output shape (1, 2, 3, 2)
The weights shape is (4, 2)
---
Using input of shape (1, 2, 3, 4, 5)
Output shape (1, 2, 3, 4, 2)
The weights shape is (5, 2)
---
In this example, we instantiate four input tensors of varying grade, that is, with a varying number of dimensions (TensorFlow calls this the rank). The minimum grade allowed is two, since Keras layers always expect the first dimension to be the batch dimension. In these runs, grade 5 was also the highest grade the layer accepted, though this upper limit may vary between TensorFlow versions.
The printed output shows that Keras allows us to apply a dense layer to inputs of varying grade. Whichever input we give it, the layer is applied over the last dimension, replacing its size with the number of units in the layer. The weights shapes illustrate this further: the kernel always has shape (last input dimension, units), no matter how many other dimensions the input has. Furthermore, since we initialised the inputs as ones, let's have a look at the output of the third input of shape (1,2,3,4) so we can see how the weights are repeatedly applied.
tf.Tensor(
[[[[0.63791275 0.08311558]
   [0.63791275 0.08311558]
   [0.63791275 0.08311558]]

  [[0.63791275 0.08311558]
   [0.63791275 0.08311558]
   [0.63791275 0.08311558]]]], shape=(1, 2, 3, 2), dtype=float32)
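To make "applied over the last dimension" concrete, here is a minimal sketch that reproduces a dense layer's output by hand. It assumes the layer keeps its defaults (no activation, a bias term) and contracts the last input axis with the kernel using tf.tensordot. The variable names are mine, and this is an illustration of the arithmetic rather than a claim about how Keras implements it internally.

```python
import numpy as np
import tensorflow as tf

# random input this time, so every position holds different values
x = tf.random.normal((1, 2, 3, 4))

layer = tf.keras.layers.Dense(2)
y = layer(x)  # the first call builds the weights

kernel, bias = layer.get_weights()  # shapes (4, 2) and (2,)

# contract the last axis of x with the first axis of the kernel,
# leaving all leading axes untouched, then add the bias
manual = tf.tensordot(x, kernel, axes=[[3], [0]]) + bias

print(f"Layer output shape {y.shape}, manual output shape {manual.shape}")
print(f"Outputs match: {np.allclose(y.numpy(), manual.numpy(), atol=1e-5)}")
```

With an all-ones input, every position produces the same vector (the column sums of the kernel plus the bias), which is why all rows in the printed tensor above are identical.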
It's not always this easy
This might seem perfectly expected, but Keras is usually much less tolerant. Have a look at the following example, which uses recurrent layers.
import tensorflow as tf
a = tf.ones((1, 2))
b = tf.ones((1, 2, 3))
c = tf.ones((1, 2, 3, 4))
inputs = [a,b,c]
rec_layers = [tf.keras.layers.SimpleRNN(2) for _ in inputs]
# run the input through its recurrent layer and print the shape
for i, layer in zip(inputs, rec_layers):
    print(f"Using input of shape {i.shape} ", end="")
    try:
        prediction = layer(i)
        print(f"worked! Output shape is {prediction.shape}")
    except Exception as err:
        print("didn't work!")
        print(err)
    print("---")
Using input of shape (1, 2) didn't work!
Input 0 of layer simple_rnn_9 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [1, 2]
---
Using input of shape (1, 2, 3) worked! Output shape is (1, 2)
---
Using input of shape (1, 2, 3, 4) didn't work!
Input 0 of layer simple_rnn_11 is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: [1, 2, 3, 4]
---
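Only the input with grade 3 worked! Keras interprets an input of shape (a,b,c) as a batch of a items, each with b time steps, and each time step holding c features. A recurrent layer has no sensible way to read any other grade.
Dense layers have no such restriction. Because the weights depend only on the size of the final dimension, we can even reuse a single dense layer across inputs of different grades, as long as that final dimension stays the same.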
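If we do want to feed the other two tensors to a recurrent layer, we have to bring them to grade 3 ourselves. The sketch below shows one way to do that with tf.expand_dims and tf.reshape. How the axes are mapped to time steps and features is purely an assumption for illustration; which mapping is meaningful depends on the data.

```python
import tensorflow as tf

a = tf.ones((1, 2))
c = tf.ones((1, 2, 3, 4))

# treat each of the 2 values in `a` as a time step with a single feature
a3 = tf.expand_dims(a, axis=-1)  # shape (1, 2, 1)

# merge the last two axes of `c`: 2 time steps with 3 * 4 = 12 features each
c3 = tf.reshape(c, (1, 2, 12))  # shape (1, 2, 12)

output_shapes = []
for x in (a3, c3):
    # a fresh layer per input, since the feature sizes differ
    rnn = tf.keras.layers.SimpleRNN(2)
    out = rnn(x)
    output_shapes.append(tuple(out.shape))
    print(f"Input shape {x.shape} gives output shape {out.shape}")
```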
import tensorflow as tf
# Same as the first example, but
# the final dimension is kept 2
a = tf.ones((1, 2))
b = tf.ones((1, 2, 2))
c = tf.ones((1, 2, 3, 2))
d = tf.ones((1, 2, 3, 4, 2))
inputs = [a,b,c,d]
# The dense layer, only one is needed
# (named `dense` so it does not shadow the input tensor `d`)
dense = tf.keras.layers.Dense(2)
# run each input through the shared dense layer and print the shape
for i in inputs:
    print(f"Using input of shape {i.shape}")
    prediction = dense(i)
    print(f"Output shape {prediction.shape}")
    weights, bias = dense.get_weights()
    print(f"The weights shape is {weights.shape}")
    print("---")
Using input of shape (1, 2)
Output shape (1, 2)
The weights shape is (2, 2)
---
Using input of shape (1, 2, 2)
Output shape (1, 2, 2)
The weights shape is (2, 2)
---
Using input of shape (1, 2, 3, 2)
Output shape (1, 2, 3, 2)
The weights shape is (2, 2)
---
Using input of shape (1, 2, 3, 4, 2)
Output shape (1, 2, 3, 4, 2)
The weights shape is (2, 2)
---
This illustrates the versatility of the dense layer: the same layer can be applied to any tensor of a supported grade (2 to 5 in the runs above), provided the final dimension stays the same.
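The condition on the final dimension is easy to check. In this small sketch (the variable names are mine), a dense layer is first built on an input whose last dimension is 2 and then called on an input of the same grade whose last dimension is 3. I catch a general Exception, as in the recurrent example, because the exact error type and message may differ between versions.

```python
import tensorflow as tf

dense = tf.keras.layers.Dense(2)

# the first call builds a kernel of shape (2, 2)
dense(tf.ones((1, 2, 3, 2)))
kernel_shape = dense.get_weights()[0].shape

# same grade, but the final dimension is now 3 instead of 2
try:
    dense(tf.ones((1, 2, 3, 3)))
    failed = False
    print("worked!")
except Exception as err:
    failed = True
    print("didn't work!")
    print(err)
```

Once the kernel exists, its first dimension is fixed, so only the leading dimensions of the input are free to change.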
Why is it like this?
Why, though, are we allowed to do this? There are many situations in which we want to pass a higher grade tensor into a dense layer. A simple example would be the final layer of a CTC model. Such a model ends with a bidirectional recurrent layer which returns a sequence with one item per time step. Each time step gets passed through a dense layer for the desired softmax output. The flexibility of the dense layers is helpful.
import tensorflow as tf
# our input of 10 timesteps
i = tf.ones((1,10,2))
# two rnn layers which return sequences, the second of which goes backwards
rnn_f = tf.keras.layers.SimpleRNN(10, return_sequences=True)
rnn_b = tf.keras.layers.SimpleRNN(10, return_sequences=True, go_backwards=True)
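rnn_f_out = rnn_f(i)
# go_backwards=True returns the time steps in reversed order,
# so flip them back to line up with the forward output
rnn_b_out = tf.reverse(rnn_b(i), axis=[1])
# the shapes are the same
assert rnn_f_out.shape == rnn_b_out.shape
print(f"Both outputs have the shape {rnn_f_out.shape}")
# add them together for the output of the bidirectional layer
# (reusing the outputs instead of running both layers again)
rnn_output = rnn_f_out + rnn_b_out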
assert rnn_output.shape == rnn_b_out.shape
print(f"Output shape of the bidirectional rnn layer is still {rnn_output.shape}")
# add a dense layer for our desired softmax output
# in this case an alphabet of 4 symbols + the CTC blank (the dash)
final_dense = tf.keras.layers.Dense(5, activation="softmax")
model_output = final_dense(rnn_output)
print(f"The output shape is {model_output.shape}")
Both outputs have the shape (1, 10, 10)
Output shape of the bidirectional rnn layer is still (1, 10, 10)
The output shape is (1, 10, 5)
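Keras also ships a Bidirectional wrapper that handles the backward pass and the merging for us. The sketch below rebuilds the example with it, assuming merge_mode="sum" to mirror the manual addition above (the wrapper's default is to concatenate instead). It also checks that the softmax is applied per time step, over the last dimension.

```python
import numpy as np
import tensorflow as tf

# our input of 10 timesteps
i = tf.ones((1, 10, 2))

# forward and backward SimpleRNN, summed per time step
bidirectional = tf.keras.layers.Bidirectional(
    tf.keras.layers.SimpleRNN(10, return_sequences=True),
    merge_mode="sum",
)
final_dense = tf.keras.layers.Dense(5, activation="softmax")

rnn_output = bidirectional(i)
model_output = final_dense(rnn_output)

print(f"Output shape of the bidirectional rnn layer is {rnn_output.shape}")
print(f"The output shape is {model_output.shape}")

# each time step carries its own probability distribution
sums = tf.reduce_sum(model_output, axis=-1).numpy()
print(f"Every time step sums to one: {np.allclose(sums, 1.0, atol=1e-5)}")
```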
This versatility matters because dense layers are an essential element of deep neural networks across all sorts of architectures. Without this design choice, the same layer could not serve both a feed-forward neural network and a CTC model without extra reshaping. Transformers operate on tensors of shape (batch_size, num_heads, seq_len, depth) during dot product attention calculation, but they don't apply a dense layer to them. Please let me know if you know any models that pass tensors of grade 4 or 5 through a dense layer!
Thanks for reading! I hope this post expanded on my previous shape-related posts and gave a bit of intuition into the flexibility we have when using dense layers in Keras, or at least some novel ways of playing around with data and experimenting to learn more.