L. Borrajo, R. Cao Abad, S. Olhede, S. Chandna
It is often argued that in big data settings "numbers speak for themselves". However, some authors have recently questioned the validity of this idea because of the common presence of sampling bias, and several problems arising from ignoring this bias have been reported. This work takes a fully nonparametric approach. The problem of estimating the probability mass function is studied for categorical data, both when the biasing weight function is known (unrealistic) and when it is unknown (realistic). In addition to the big-but-biased sample, a small simple random sample from the real population is considered. An estimator involving both samples is proposed to remedy the problem of ignoring the weight function. Asymptotic expressions for the mean squared error of this estimator are considered, which lead to asymptotic formulas for the optimal smoothing parameters. A dataset on food allergies is used to illustrate the performance of the estimator.
Keywords: biased data, big data, categorical data, sampling bias, smoothing parameter
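As a rough illustration of the known-weight case mentioned in the abstract, the sketch below reweights the category counts of a biased sample. It assumes the standard biased-sampling model, in which category x is observed with probability proportional to w(x) p(x); the function name `debiased_pmf` and its arguments are hypothetical, and this is not the authors' estimator. The unknown-weight case, which also uses the small simple random sample and smoothing parameters, is not covered here.

```python
from collections import Counter


def debiased_pmf(biased_sample, weight):
    """Estimate the population pmf from a biased categorical sample.

    Assumption (not from the abstract): category x is observed with
    probability proportional to weight[x] * p(x), and weight[x] > 0
    is known for every observed category.
    """
    counts = Counter(biased_sample)
    # Divide each observed count by its known sampling weight w(x).
    adjusted = {x: c / weight[x] for x, c in counts.items()}
    # Renormalise so the estimated probabilities sum to one.
    total = sum(adjusted.values())
    return {x: v / total for x, v in adjusted.items()}


# Hypothetical toy data: category "a" is three times as likely to be sampled.
sample = ["a"] * 6 + ["b"] * 2
estimate = debiased_pmf(sample, {"a": 3.0, "b": 1.0})
```

In this toy sample the raw frequencies (0.75 and 0.25) overstate category "a"; after dividing by the weights, both categories receive the same estimated probability.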
Scheduled
Session J03 Nonparametric Statistics
May 31, 2018, 10:20
Room 2