The Effect of Sample Size on the Efficiency of Count Data Models: Application to Marriage Data
Abstract
Although sample size requirements are common in many multivariate analysis techniques as one of the measures taken to ensure the robustness of such techniques, such requirements have not been of interest in the area of count data models. This study therefore investigated the effect of sample size on the efficiency of six commonly used count data models, namely the Poisson regression model (PRM), the Negative binomial regression model (NBRM), the Zero-inflated Poisson model (ZIP), the Zero-inflated negative binomial model (ZINB), the Poisson hurdle model (PHM) and the Negative binomial hurdle model (NBHM). The data used in this study were sourced from Data First and were collected by Statistics South Africa through the Marriage and Divorce database. PRM, NBRM, ZIP, ZINB, PHM and NBHM were applied to ten randomly selected samples ranging in size from 4392 to 43916 and differing in size by 10%. The six models were compared using the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), Vuong's test for over-dispersion, McFadden RSQ, Mean Square Error (MSE) and Mean Absolute Deviation (MAD). The results revealed that, in general, the Negative Binomial-based models outperformed the Poisson-based models. However, the results did not reveal an effect of sample size on the efficiency of the models, since AIC, BIC, Vuong's test for over-dispersion, McFadden RSQ, MSE and MAD did not change consistently as the sample size increased.
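The sampling design and some of the comparison criteria named above can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes that the ten sample sizes are 10%, 20%, ..., 100% of the largest sample (43916), which reproduces the reported smallest size of 4392, and it uses the standard textbook definitions of AIC, BIC, MSE and MAD. All function names are hypothetical, and the model fitting itself is left out.

```python
import math
import random

FULL_N = 43916  # largest sample size reported in the abstract


def sample_sizes(full_n=FULL_N, steps=10):
    """Sizes at 10%, 20%, ..., 100% of full_n, rounded half up (assumed design)."""
    return [(2 * full_n * k + steps) // (2 * steps) for k in range(1, steps + 1)]


def draw_sample(records, size, seed):
    """Simple random sample without replacement; seeded for reproducibility."""
    return random.Random(seed).sample(records, size)


def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_lik


def mse(observed, predicted):
    """Mean of squared differences between observed and predicted counts."""
    return sum((y - m) ** 2 for y, m in zip(observed, predicted)) / len(observed)


def mad(observed, predicted):
    """Mean of absolute differences between observed and predicted counts."""
    return sum(abs(y - m) for y, m in zip(observed, predicted)) / len(observed)


sizes = sample_sizes()
```

Under these assumptions, each fitted model would contribute one log-likelihood and one vector of predicted counts per sample size, and the criteria could then be tabulated against `sizes` to look for a trend.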
Copyright (c) 2017 Volition Tlhalitshi Montshiwa, Ntebogang Dinah Moroke
This work is licensed under a Creative Commons Attribution 4.0 International License.