In [None]:
import random
import numpy as np
import matplotlib.pyplot as plt
import tensorflow

### Activation Functions

#### Sigmoid

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

In [None]:
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

In [None]:
[sigmoid(x) for x in [0, 1, 2, -3, -4, 5]]

In [None]:
plt.rcParams["figure.figsize"] = (4,3)  # default figure size: 4x3 inches

x_values = np.arange(-5, 5, 0.1)
plt.plot(x_values, [sigmoid(x) for x in x_values])
plt.xlim(-5, 5)
plt.ylim(0, 1)
plt.title("Sigmoid Unit")
plt.xlabel("Input")
plt.ylabel("Activation")
plt.grid()
plt.show()

#### Linear

Linear($x$) = $x$

In [None]:
def linear(x):
    return x

In [None]:
[linear(x) for x in [0, 1, 2, -3, -4, 5]]

In [None]:
x_values = np.arange(-2, 2, 0.1)
plt.plot(x_values, [linear(x) for x in x_values])
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.title("Linear Unit")
plt.xlabel("Input")
plt.ylabel("Activation")
plt.grid()
plt.show()

#### ReLU (Rectified Linear Unit)

ReLU($x$) = $\max(x, 0)$

In [None]:
def ReLU(x):
    return max(x, 0)

In [None]:
[ReLU(x) for x in [0, 1, 2, -3, -4, 5]]

In [None]:
x_values = np.arange(-2, 2, 0.1)
plt.plot(x_values, [ReLU(x) for x in x_values])
plt.xlim(-2, 2)
plt.ylim(-0.5, 2)
plt.title("Rectified Linear Unit (ReLU)")
plt.xlabel("Input")
plt.ylabel("Activation")
plt.grid()
plt.show()

#### Softmax

$ \mathbf{x} = [x_1, x_2, x_3, \ldots , x_n]$

Softmax($\mathbf{x}$)$_i$ $= {\large \dfrac{e^{x_i}}{{\LARGE \Sigma_{j}} ~ e^{x_j} }}$ &nbsp; for $i = 1, \ldots, n$

In [None]:
def softmax(x_values):
    powers = [math.exp(x) for x in x_values]
    total = sum(powers)
    return [ex/total for ex in powers]

In [None]:
softmax([1,1,1,1,1])

In [None]:
sum(softmax([1,1,1,1,1]))

In [None]:
softmax([0, 1, 2, -3, -4, 5])

In [None]:
sum(softmax([0, 1, 2, -3, -4, 5]))

In [None]:
softmax([6,5,5,5])

In [None]:
sum(softmax([6,5,5,5]))

The softmax output values always sum to 1.0, so the output can be interpreted as a **probability distribution**.  The name "softmax" is short for "soft argmax".  The greater the difference between the maximum value in the input vector and the other values, the more closely the output of softmax approximates a one-hot vector.

In [None]:
softmax([30,10,7,100,40,25,3,9])

In [None]:
sum(softmax([30,10,7,100,40,25,3,9]))
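
One practical caveat: the `softmax` function above calls `math.exp` on the raw inputs, and `math.exp` raises an `OverflowError` once its argument gets much larger than about 700. A common workaround, sketched below under the hypothetical name `stable_softmax` (not part of the original notebook), is to subtract the largest input from every value first. This does not change the result, because the common factor cancels between the numerator and the denominator.

```python
import math

def stable_softmax(x_values):
    # Subtracting the maximum leaves the result unchanged mathematically:
    # the factor e^(-max) appears in both numerator and denominator.
    largest = max(x_values)
    powers = [math.exp(x - largest) for x in x_values]
    total = sum(powers)
    return [p / total for p in powers]

print(stable_softmax([6, 5, 5, 5]))
print(stable_softmax([1000, 1000]))  # math.exp(1000) on its own would overflow
```

For inputs of ordinary size, the two versions should agree up to floating-point rounding.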

[Interactive demo of softmax](http://neuralnetworksanddeeplearning.com/chap3.html#eqtn78) &nbsp; (from Michael Nielsen's book [*Neural Networks and Deep Learning*](http://neuralnetworksanddeeplearning.com))

## The Fashion MNIST Dataset

This dataset is like the Digits Classification dataset, except that the images are of "fashion items" such as shirts, dresses, sneakers, and so on, instead of handwritten digits. There are 10 different categories of items in all:

* 0 = T-shirt/top
* 1 = Trouser
* 2 = Pullover
* 3 = Dress
* 4 = Coat
* 5 = Sandal
* 6 = Shirt
* 7 = Sneaker
* 8 = Bag
* 9 = Ankle boot

The Keras documentation for the Fashion dataset is available [here](https://keras.io/api/datasets/fashion_mnist).

### Examining the Fashion MNIST Data

In [None]:
from tensorflow.keras.datasets import fashion_mnist

In [None]:
(train_images,train_labels), (test_images,test_labels) = fashion_mnist.load_data()

In [None]:
train_images.shape

In [None]:
train_images.dtype

Spend some time examining the images in the dataset. You can use the `plt.imshow` function to display an image, with the optional argument `cmap=` specifying a "colormap". See [this link](https://matplotlib.org/stable/gallery/color/colormap_reference.html) for a full list of available colormaps.

In [None]:
plt.imshow(train_images[0], cmap='gray');

In [None]:
def show_random_image():
    n = random.randrange(len(train_images))  # random index into the training set
    plt.imshow(train_images[n], cmap='gray')

In [None]:
show_random_image()

<hr>

#### Exercise 1
The category labels for the Digit Classification dataset are integers in the range 0-9, corresponding to the 10 types of handwritten digits. The Fashion dataset also uses integers to specify categories, but these categories correspond to different [types of clothing](https://keras.io/api/datasets/fashion_mnist/) rather than digits.

Write a Python function called `fashion_category` that takes a label number in the range 0-9 as input and returns a short string description of the category. For example, calling the function with 7 should return <tt>'sneaker'</tt>. Then, using `fashion_category` as a helper, modify `show_random_image` so that it prints the image number and category label above the image.

In [None]:
def fashion_category(n):
    pass

<hr>

#### Exercise 2
Write some Python code to determine how many examples of each image category there are in the training data. For instance, how many examples are there of sneakers?

<hr>

### Preparing the Data: Creating the Target Vectors

We need to create "one-hot" target vectors of length 10 for each classification category.  In a "one-hot" vector, all values are 0 except for a single 1 at the location corresponding to the target category.  For example, the target vector for images of dresses (category 3) would be [0,0,0,1,0,0,0,0,0,0].  Use the `to_categorical` function to create an array of shape (60000, 10) called `train_targets` containing the one-hot target vectors for all of the `train_images`.

In [None]:
from tensorflow.keras.utils import to_categorical

In [None]:
to_categorical(3, num_classes=10)
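
To make the idea of a one-hot vector concrete, here is a minimal plain-Python sketch of the encoding. The names `one_hot` and `toy_labels` are hypothetical and used only for illustration; this is not how `to_categorical` is implemented, and the exercise above should still be done with `to_categorical`.

```python
def one_hot(label, num_classes=10):
    # All zeros except a single 1.0 at the index given by the label.
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

toy_labels = [3, 0, 9]  # made-up labels, not taken from the dataset
toy_targets = [one_hot(label) for label in toy_labels]

for label, target in zip(toy_labels, toy_targets):
    print(label, target)
```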

<hr>

### Preparing the Data: Normalizing the Input Images

The next step is to convert the input images from arrays of integers in the range 0-255 into "normalized" arrays of floating-point values in the range 0-1.  If <i>a</i> is an array of integers, calling <i>a</i><tt>.astype('float32')</tt> will return a new array of floating-point values, which can then be divided by 255 to create the normalized array.  After doing this, make sure that your new version of <tt>train_images</tt> is of type <tt>float32</tt>, with minimum and maximum values of 0.0 and 1.0.

In [None]:
train_images.dtype, np.min(train_images), np.max(train_images)
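
The conversion described above can be tried out on a small array before applying it to the full dataset. The sketch below uses a made-up 2 &times; 2 array called `toy_images` (a hypothetical stand-in, not real image data) and checks the same three properties: type, minimum, and maximum.

```python
import numpy as np

# A tiny stand-in for an image array (hypothetical values, not real data).
toy_images = np.array([[0, 128], [64, 255]], dtype='uint8')

# Convert to floating point first, then scale into the range 0-1.
toy_normalized = toy_images.astype('float32') / 255

print(toy_normalized.dtype, np.min(toy_normalized), np.max(toy_normalized))
```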

<hr>

### Building the Neural Network

Now it's time to build the network shown below. Your network should include a Flatten layer that automatically converts a 28 &times; 28 input image into a 784-element vector.

<img src="http://science.slc.edu/jmarshall/bioai/images/fashion-mnist-network-with-flatten.png" width="50%">

For now, the hidden layer and output layer should each use the <tt>'sigmoid'</tt> activation function. The network's loss function should be <tt>'mean_squared_error'</tt>, the optimizer should be <tt>'SGD'</tt> (stochastic gradient descent), and the evaluation metrics should be <tt>['accuracy']</tt>. Define a function `build_network` that builds and returns a new network each time it is called.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

In [None]:
def build_network():
    network = Sequential()
    # add new layers
    # compile the network
    return network

In [None]:
network = build_network()

In [None]:
network.summary()
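
As a cross-check on the summary, the number of trainable parameters in a fully connected layer is one weight per (input, unit) pair plus one bias per unit; a Flatten layer has no parameters of its own. The sketch below assumes a single hidden layer of 30 units, a figure taken from the experiments later in this notebook rather than read off the network diagram, so adjust `hidden_units` if your network differs. The helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units

n_pixels = 28 * 28   # length of the flattened input vector
hidden_units = 30    # assumption: see the note above
n_outputs = 10       # one output unit per category

hidden = dense_params(n_pixels, hidden_units)
output = dense_params(hidden_units, n_outputs)
print(hidden, output, hidden + output)
```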

<hr>

### Training the Network

Train your network for 30 epochs, plot the training history graphs, and then evaluate the performance of your network on both the training and testing data. (For the testing data, you will first need to normalize the test images and create the target vectors in the same way you prepared the training data.) For convenience, the `plot_history` function is defined for you below.

In [None]:
def plot_history(history):
    loss_values = history.history['loss']
    accuracy_values = history.history['accuracy']
    epoch_nums = range(1, len(loss_values)+1)
    plt.figure(figsize=(12,4)) # width, height in inches
    plt.subplot(1, 2, 1)
    plt.plot(epoch_nums, loss_values, 'r')
    plt.title("Training loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.subplot(1, 2, 2)
    plt.plot(epoch_nums, accuracy_values, 'b')
    plt.title("Training accuracy")
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.show()

In [None]:
history = network.fit(train_images, train_targets, epochs=30)

In [None]:
network.evaluate(train_images, train_targets)

The code below will construct a list of test image numbers that the network got wrong, and then display a random selection of the images that were incorrectly classified.

In [None]:
outputs = network.predict(test_images)

In [None]:
predictions = [np.argmax(vector) for vector in outputs]

In [None]:
wrong = [i for i in range(len(test_images)) if predictions[i] != test_labels[i]]

In [None]:
plt.figure(figsize=(12,12))  # (width, height) in inches
rows, columns = 5, 6
for i in range(1, columns*rows+1):
    w = random.choice(wrong)
    image = test_images[w]
    correct_label = test_labels[w]
    prediction = predictions[w]
    plt.subplot(rows, columns, i)
    plt.title(f'"{prediction}"  (correct: {correct_label})')
    plt.axis('off')
    plt.imshow(image, cmap='gray')

<hr>

### Improving the Performance of the Network

Can you improve the performance of your network? There are many possibilities to explore. You could try using a different number of hidden units. You could add another hidden layer to the network. You could also try different activation functions, loss functions, or optimizers.

#### Try varying the number of hidden units

1. Create a new network with **50 hidden units** and re-train it for the same number of epochs as before. Does the network's performance on the training data improve? What about its performance on the test data?

2. Create a new network with **100 hidden units** (or some other number) and repeat the experiment. At what point does adding more hidden units start to degrade the network's performance on the test data?

3. Create a new network with **two hidden layers of 15 units each**. This network still has a total of 30 hidden units as before, except arranged in two layers instead of one. How does this new configuration affect the performance on the test data?

4. Expand the size of each hidden layer to **30 units each**, for a total of 60 hidden units. Does this change the network's performance in a significant way?

5. Can you find an "ideal" number of hidden units (and layers) that gives the best performance?

#### Try varying the loss function and activation function

For a multi-category classification task like recognizing the MNIST Digits or Fashion images, it is usually better to use the **softmax** activation function on the output layer, together with the **categorical_crossentropy** loss function.  Furthermore, using the **relu** activation function on hidden layers often gives better results than the sigmoid function.

Create a new version of your network with a <tt>'categorical_crossentropy'</tt> loss function, a <tt>'softmax'</tt> activation function on the output layer, and a <tt>'relu'</tt> activation function on the hidden layer, and compare its performance on the test data with your previous experiments.
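
One way to see why the choice of loss function matters is to compare the two losses on a single example. The sketch below is a plain-Python illustration with made-up output vectors, not the Keras implementation. For outputs between 0 and 1, the mean squared error stays bounded, while the cross-entropy grows without limit as the probability assigned to the correct category approaches 0, so a confidently wrong answer is penalized much more heavily.

```python
import math

def mean_squared_error(target, output):
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)

def categorical_crossentropy(target, output):
    # Only the entry where the target is 1 contributes to the sum.
    return -sum(t * math.log(o) for t, o in zip(target, output) if t > 0)

target = [0, 0, 1]                   # the correct category is index 2
mild_miss = [0.3, 0.3, 0.4]          # hypothetical network outputs
confident_miss = [0.98, 0.01, 0.01]  # hypothetical network outputs

for name, output in [("mild", mild_miss), ("confident", confident_miss)]:
    print(name, mean_squared_error(target, output),
          categorical_crossentropy(target, output))
```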


#### Try varying the optimizer

Keras provides a number of optimizers, each a different learning algorithm for updating the weights and biases of a network during training. All are mathematical variations on the basic principles of backpropagation and gradient descent. In practice, RMSprop often gives better performance than standard stochastic gradient descent (SGD).  Adam is often comparable to RMSprop, and Nadam often performs at least as well as Adam, although the best choice depends on the task.

Optimizer|Description
--|--
**SGD** | Stochastic Gradient Descent with learning rate and momentum
**Adagrad** | Adaptive Subgradient Descent: uses the $L_2$ norm
**Adamax** | Variant of Adam based on the $L_{\infty}$ norm
**RMSprop** | Another norm-based improvement on Adagrad
**Adam** | Adaptive Moment Estimation: RMSprop + standard momentum
**Nadam** | RMSprop + Nesterov momentum

Experiment with different optimizers to see how they impact the network's performance.
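
To give a feel for what "learning rate and momentum" mean in the SGD row of the table, here is a minimal sketch of the update rule applied to a one-variable function, $f(w) = (w - 3)^2$, whose minimum is at $w = 3$. The function and the names `minimize` and `gradient` are hypothetical illustrations. The sketch uses the exact gradient at every step, so it shows plain gradient descent rather than the stochastic mini-batch version that Keras runs during training.

```python
def minimize(gradient, start, learning_rate=0.1, momentum=0.0, steps=200):
    w, velocity = start, 0.0
    for _ in range(steps):
        # The velocity is a decaying running sum of past gradient steps;
        # with momentum=0.0 this reduces to ordinary gradient descent.
        velocity = momentum * velocity - learning_rate * gradient(w)
        w = w + velocity
    return w

def gradient(w):
    # Derivative of f(w) = (w - 3)**2
    return 2 * (w - 3)

print(minimize(gradient, start=0.0))                # no momentum
print(minimize(gradient, start=0.0, momentum=0.9))  # with momentum
```

Both runs should end up close to 3. Varying `learning_rate`, `momentum`, and `steps` shows how each setting changes the path taken to get there.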

#### Try increasing the number of training epochs

The previous experiments all assumed a training time of 30 epochs. What happens if you increase the length of training time, by increasing the number of epochs, while keeping all of the other experimental parameters the same? Not surprisingly, the performance of the network on the training data will likely improve (resulting in a lower loss value and higher accuracy). However, what happens to the performance on the test data as you continue to increase the number of epochs? Can you find a "sweet spot" &mdash; that is, the number of epochs that results in the highest overall performance on the test data?