The state of the art for non-linearities is to use rectified linear units (ReLU) instead of the sigmoid function in deep neural networks. What are the advantages? I know that training a network is faster when ReLU is used, but what else makes it preferable?

How does that improve the neural network? Why do we say that ReLU is an activation function? Isn't softmax the activation function for neural networks?
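For reference, here is a minimal sketch of the three functions I am asking about (written with NumPy; the library choice is my own assumption, since no framework is named above):

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): passes positive values through, zeros out negatives
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes each value into (0, 1); saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Normalizes a vector into a probability distribution (entries sum to 1);
    # typically applied to the output layer, not as a hidden-layer non-linearity
    shifted = x - np.max(x)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = np.array([-2.0, 0.5, 3.0])
print(relu(z))     # [0.  0.5 3. ]
print(sigmoid(z))  # approx. [0.1192 0.6225 0.9526]
print(softmax(z))  # approx. [0.0062 0.0754 0.9184]
```

So ReLU and sigmoid both act element-wise on a layer's pre-activations, while softmax couples the whole output vector, which is why I am unsure whether they play the same role.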