A vision-based smart building control system relies on human action recognition to determine the number of occupants inside the building and their respective motion types or paths. Using this information, a control system can be designed that automatically optimises environmental conditions within the building. The information obtained also supports other domains, such as security and surveillance, interactive applications with the environment, and content-based video analysis. Current work on the recognition task is divided into two processes: human action extraction and human action classification. This paper starts by identifying the distinction between global and local extraction methods. These extraction methods draw features, such as silhouettes, colour, edges, motion and interest points, from images for analysing observed human actions. For human action classification, two key methods, the k-nearest neighbour approach and the hidden Markov model, are presented and discussed. Lastly, the paper provides a brief summary highlighting gaps and possible milestones for future work. © 2012 IEEE.
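To make the k-nearest neighbour classification step concrete, the following is a minimal sketch, not the paper's implementation: extracted features are reduced to hypothetical two-dimensional vectors (e.g. a mean motion magnitude and a silhouette aspect ratio, both invented here for illustration), and a query vector is labelled by majority vote among its k closest training examples under Euclidean distance.

```python
import math
from collections import Counter

def knn_classify(query, train, k=3):
    """Label a feature vector by majority vote among its k nearest
    training examples, using Euclidean distance."""
    neighbours = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical labelled features: (mean motion magnitude, silhouette aspect ratio)
train = [
    ((0.1, 2.5), "standing"),
    ((0.2, 2.4), "standing"),
    ((1.8, 1.2), "walking"),
    ((2.1, 1.1), "walking"),
    ((4.0, 1.0), "running"),
    ((4.3, 0.9), "running"),
]

print(knn_classify((2.0, 1.15), train))  # → walking
```

The choice of k trades noise robustness against locality: a larger k smooths over mislabelled neighbours but can blur the boundary between similar actions such as walking and running.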