The Effect of Sample Size on the Efficiency of Count Data Models: Application to Marriage Data

Volition Tlhalitshi Montshiwa; Ntebogang Dinah Moroke

doi:10.22610/jebs.v9i3(J).1742

Volition Tlhalitshi Montshiwa, North West University
Ntebogang Dinah Moroke, North West University


Keywords: Poisson regression, Negative binomial regression, Zero-inflated Poisson, Zero-inflated negative binomial, Poisson hurdle, Negative binomial hurdle

Abstract

Sample size requirements are common in many multivariate analysis techniques as one of the measures taken to ensure the robustness of such techniques, but such requirements have received little attention in the area of count data models. This study therefore investigated the effect of sample size on the efficiency of six commonly used count data models: the Poisson regression model (PRM), the Negative binomial regression model (NBRM), the Zero-inflated Poisson model (ZIP), the Zero-inflated negative binomial model (ZINB), the Poisson hurdle model (PHM) and the Negative binomial hurdle model (NBHM). The data used in this study were sourced from Data First and were collected by Statistics South Africa through the Marriage and Divorce database. PRM, NBRM, ZIP, ZINB, PHM and NBHM were applied to ten randomly selected samples ranging in size from 4392 to 43916 and increasing in steps of 10%. The six models were compared using the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), Vuong's test for over-dispersion, McFadden RSQ, the Mean Square Error (MSE) and the Mean Absolute Deviation (MAD). The results revealed that, in general, the Negative Binomial-based models outperformed the Poisson-based models. However, the results did not reveal a clear effect of sample size on the efficiency of the models, since AIC, BIC, Vuong's test for over-dispersion, McFadden RSQ, MSE and MAD did not change consistently as the sample size increased.
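The sampling scheme and several of the comparison criteria named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that the ten sample sizes are 10%, 20%, ..., 100% of the full 43916 records, that AIC, BIC and McFadden RSQ follow their standard textbook definitions, and that MSE and MAD are averages over observed-minus-fitted counts. All function names are hypothetical, the counts are toy values rather than the marriage data, and the fitted model is an intercept-only Poisson chosen only for brevity.

```python
import math

FULL_N = 43916  # full data set size quoted in the abstract


def sample_sizes(full_n, steps=10):
    """Sample sizes at 10%, 20%, ..., 100% of full_n (assumed scheme)."""
    return [round(full_n * k / steps) for k in range(1, steps + 1)]


def poisson_loglik(y, mu):
    """Poisson log-likelihood of counts y given positive fitted means mu."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))


def aic(loglik, k):
    """Akaike Information Criterion for a model with k parameters."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik


def mcfadden_r2(loglik_model, loglik_null):
    """McFadden pseudo R-squared: 1 - lnL(model) / lnL(null)."""
    return 1 - loglik_model / loglik_null


def mse(y, mu):
    """Mean squared difference between observed and fitted counts."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / len(y)


def mad(y, mu):
    """Mean absolute difference between observed and fitted counts."""
    return sum(abs(yi - mi) for yi, mi in zip(y, mu)) / len(y)


if __name__ == "__main__":
    print(sample_sizes(FULL_N))

    # Toy counts (illustrative only); an intercept-only Poisson model has
    # the sample mean as its maximum likelihood estimate of the mean.
    y = [0, 1, 2, 0, 3, 1, 0, 4]
    mu = [sum(y) / len(y)] * len(y)
    ll = poisson_loglik(y, mu)
    print("AIC:", aic(ll, k=1))
    print("BIC:", bic(ll, k=1, n=len(y)))
    print("MSE:", mse(y, mu), "MAD:", mad(y, mu))
```

Repeating the last step for each model and each sample size would give a table of criteria by sample size, which is the kind of comparison the abstract describes. Smaller AIC, BIC, MSE and MAD values and larger McFadden RSQ values indicate a better fit.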


