This is a very interesting question and has an even more interesting solution.
Conceptual stuff:
Each voice sample has a feature representation known as the Mel-frequency cepstrum. Looking at its coefficients (MFCCs) is the key to classifying different emotions.
When these coefficient vectors are plotted, each emotion tends to occupy its own region of the feature space.
To learn more about applying MFCCs to emotion recognition, see: Emotion Detection Using MFCC and Cepstrum Features
Now, on to the interesting part: machine learning.
First, extract the MFCC feature matrix for each voice sample and collapse it into a 1×13 vector per audio clip, for example by averaging each coefficient over time. [As per the research literature, the first 13 coefficients primarily determine the emotion!]
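A minimal sketch of that extraction step, assuming the librosa library and a hypothetical list of (file path, emotion label) pairs; each clip's full MFCC matrix is averaged over time to get one 13-dimensional vector:

```python
import numpy as np
import librosa

N_MFCC = 13  # first 13 coefficients, as suggested by the research literature

def mfcc_vector(path):
    """Load an audio file and return a 13-element MFCC vector averaged over time."""
    y, sr = librosa.load(path, sr=None)                      # keep the file's native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)   # shape: (13, n_frames)
    return np.mean(mfcc, axis=1)                             # collapse frames -> shape (13,)

# Hypothetical dataset: (audio file, emotion label) pairs
samples = [("angry_01.wav", "anger"), ("sad_01.wav", "sad"), ("neutral_01.wav", "neutral")]

X = np.array([mfcc_vector(path) for path, _ in samples])     # one 13-dim row per clip
y_labels = [label for _, label in samples]
```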
Next, label each vector with its corresponding emotion.
Split the dataset into 75% training and 25% test data. Feed the vectors into a feed-forward neural network trained with backpropagation, with a single hidden layer of about 13 × 1.5 ≈ 20 nodes and a softmax output layer. Then, voilà, you have yourself a voice-based emotion recognition system.
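One way to sketch that last step is with scikit-learn's MLPClassifier, a feed-forward network trained by backpropagation that uses a softmax output for multi-class problems. This assumes the X / y_labels arrays from the extraction sketch above and is not the exact network from my project:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Encode the emotion strings ("anger", "sad", "neutral") as integer classes
encoder = LabelEncoder()
y = encoder.fit_transform(y_labels)

# 75% training / 25% test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Standardize the MFCC features; their coefficients span very different ranges
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Feed-forward net with one hidden layer of ~13 * 1.5 = 20 nodes;
# the multi-class output is softmax by default
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=42)
clf.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Predict the emotion of a brand-new clip (hypothetical file name)
new_vec = scaler.transform([mfcc_vector("new_clip.wav")])
print("Predicted emotion:", encoder.inverse_transform(clf.predict(new_vec))[0])
```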
I did this project recently using the SAVEE database, classifying three emotions (anger, sadness and neutral). I got pretty good accuracy and was able to predict the emotion of a brand-new test case. The project was done in Python; I will be uploading the code to GitHub soon, so watch out. :)