Human evaluation of model-generated text is accurate, but too expensive and slow for the purpose of model development. Evaluating the output of such systems automatically saves time, accelerates further research on text generation tasks, and is not subject to human bias. We provide an in-depth review and comparison of traditional metrics based on n-gram word matching and recently published metrics that compare textual embeddings. We also report the correlations of these metrics with human evaluation.
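To make the comparison concrete, the sketch below (not from the paper) contrasts an n-gram matching metric (BLEU via nltk) with an embedding-based metric (BERTScore via the bert-score package) and correlates each with illustrative human ratings; the sentences, ratings, and package choices are assumptions for illustration only.

```python
# Minimal sketch: n-gram metric vs. embedding-based metric, correlated with
# hypothetical human ratings. Data and scores are illustrative assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score
from scipy.stats import pearsonr

references = [
    "the cat sat on the mat",
    "a dog barked at the mailman",
    "she poured coffee into the cup",
    "the train arrived ten minutes late",
]
candidates = [
    "a cat was sitting on the mat",
    "the dog barked loudly at a mailman",
    "coffee was poured into her cup",
    "the train was delayed by ten minutes",
]
human_ratings = [0.8, 0.9, 0.7, 0.85]  # hypothetical adequacy judgments in [0, 1]

# n-gram word matching: BLEU counts overlapping n-grams between candidate and reference
smooth = SmoothingFunction().method1
bleu = [
    sentence_bleu([ref.split()], cand.split(), smoothing_function=smooth)
    for ref, cand in zip(references, candidates)
]

# embedding comparison: BERTScore matches contextual token embeddings
_, _, f1 = bert_score(candidates, references, lang="en", verbose=False)
bertscore = f1.tolist()

# correlation of each automatic metric with the human ratings
print("BLEU vs. human:      r = %.3f" % pearsonr(bleu, human_ratings)[0])
print("BERTScore vs. human: r = %.3f" % pearsonr(bertscore, human_ratings)[0])
```

In practice, such correlations are computed over much larger sets of system outputs and human judgments; this snippet only illustrates the two families of metrics being compared.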