Environmental Health and Biostatistics and Computing Groups

Department of Community Medicine

The University of Hong Kong

Short-term effects of ambient air pollution on public health in Hong Kong - a follow-up study

A consultancy report submitted to the Environmental Protection Department, Hong Kong

(Operation Manual)

February 1998

OPERATION MANUAL

Data management and data analysis procedures

a) input data format and medium

The data may be obtained through various media, e.g. meteorological data in floppy disk files provided by the Observatory, air pollutant data on CD-ROM or in electronic data files from the Environmental Protection Department, and hospital admission data on Unix magnetic tape from the Hospital Authority. The data must first be reorganized into a fixed format.

The data are then read into S-plus objects in a sub-directory (one for each batch of data). The source programmes are attached (Source Programmes 1.1, 2.1, 3.1 and 3.2). The format of the data is documented in these Source Programmes.
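As an illustration of reading fixed-format records, a minimal sketch follows, written in Python rather than S-plus. The column layout, the field names and the station code "CW" are hypothetical and are not taken from the Source Programmes, which remain the reference for the actual formats.

```python
from datetime import datetime

# Hypothetical fixed-width layout (start, end) for one air pollutant record.
# The real layout is the one defined in the attached Source Programmes.
FIELDS = {"date": (0, 8), "station": (8, 11), "value": (11, 17)}


def parse_line(line):
    """Split one fixed-width line into typed fields; blank value means missing."""
    raw = {name: line[start:end].strip() for name, (start, end) in FIELDS.items()}
    return {
        "date": datetime.strptime(raw["date"], "%Y%m%d").date(),
        "station": raw["station"],
        "value": float(raw["value"]) if raw["value"] else None,
    }


# Two made-up records: the second has no reading for that day.
lines = [
    "19950101CW   45.2",
    "19950102CW       ",
]
records = [parse_line(line) for line in lines]
for record in records:
    print(record)
```

Keeping missing readings as an explicit missing value (here `None`) at this stage leaves the later decision on replacement open.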

b) examination and cleaning data

This is an important step to ensure the quality of the data. The following tables are recommended to check the validity and reliability of the data:

i) meteorological data

- tabulation of monthly average of meteorological measures by year

ii) air pollutant data

- tabulation of monthly average of air pollution data by monitoring stations by year

iii) hospital admission data

- tabulation of total number of in-patient discharges by hospital for each financial year

As far as possible, the data should be summarized in a way which facilitates comparison with published reports, for final validation and confirmation of the data.
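The monthly tabulations recommended above can be sketched as follows, in Python rather than S-plus. The dates and readings are made-up values for illustration, and skipping missing readings when averaging is an assumption of this sketch, not the replacement method of the report.

```python
from collections import defaultdict
from datetime import date

# Made-up daily readings (date, value); None marks a missing reading.
daily = [
    (date(1995, 1, 1), 17.0),
    (date(1995, 1, 2), 19.0),
    (date(1995, 2, 1), 16.0),
    (date(1996, 1, 1), 15.0),
    (date(1996, 1, 2), None),
    (date(1996, 1, 3), 18.0),
]


def monthly_means(records):
    """Return {(year, month): mean of the non-missing values in that month}."""
    cells = defaultdict(lambda: [0.0, 0])  # running [sum, count] per cell
    for day, value in records:
        if value is None:
            continue  # assumption: missing readings are left out of the mean
        cell = cells[(day.year, day.month)]
        cell[0] += value
        cell[1] += 1
    return {key: total / count for key, (total, count) in cells.items()}


table = monthly_means(daily)

# One row per year, one column per month; "-" where no data are available.
for year in sorted({y for y, _ in table}):
    row = [table.get((year, month)) for month in range(1, 13)]
    print(year, " ".join("     -" if v is None else f"{v:6.1f}" for v in row))
```

A table laid out by year and month makes gaps and implausible values easy to spot against published monthly figures.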

c) missing data definition and replacement

There are usually no missing data for the meteorological data. There are, however, many missing values in the pollution data. The definition and method of replacement for missing data are described in Table 7 of the HKU report and the source programmes are attached (Source Programmes 2.2 and 2.3).

After checking for completeness of the hospital admission data, the only missing data arise from the unavailability of the discharge diagnosis at the time of data analysis. Missing data of this kind can be minimized by waiting for several months, so as to capture patients with a long length of stay in hospital as well as those in hospitals which return the diagnosis and the ICD code late. (It is advisable to obtain data at a date three months later than the required period of the study.)

d) descriptive statistics and graphs

As a final assessment for the validity of the data the following descriptive statistical tables and graphs will be made:

i) number of hospital admissions by disease groups and by age groups

ii) summary statistics of meteorological data, pollution data and hospital admission data (Source Programme 2.4)

iii) Spearman's rank correlation coefficient between any two of the pollution, meteorological and health outcome measures

iv) time series plots of each data series

After examining the tables and graphs, the data can be confirmed and the final analysis can proceed.
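Item (iii) above can be illustrated with a minimal Python sketch (not the S-plus code used in the study). Tied values receive average ranks, and the coefficient is taken as the Pearson correlation of the ranks. The two short series and their names are invented for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; both series must have non-zero variance."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rank correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Invented daily values: a pollutant concentration and an admission count.
pollutant = [50, 62, 62, 80, 45]
admissions = [30, 36, 33, 41, 28]
print(round(spearman(pollutant, admissions), 3))
```

Because only ranks are used, the coefficient is unchanged by any monotonic transformation of either series, which is convenient for skewed pollutant concentrations.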

e) data analysis

The following standard and essential analyses will be performed:

i) Mean, standard deviation, minimum, 25th percentile, median, 75th percentile and maximum of the meteorological, air pollution and hospital admissions data are calculated
- S-plus command used: summary(), sqrt(var())

ii) Spearman's rank correlation coefficients are estimated to examine the monotonic (rank-based) relationship between any two measures
- S-Plus command used: cor.test(..., method="spearman")

iii) Multiple regression is used to obtain the multiple R squared value
- S-plus command used: lsfit()

iv) Poisson regression adjusted for over-dispersion is used to model the various health outcomes against air pollution concentrations and other covariates.
- S-Plus command used: glm(..., family=quasi(link="log",var="mu"),...)

v) The Akaike Information Criterion (AIC) is computed and used to identify the model with the best cumulative lag. The AIC is the sum of the deviance and twice the number of degrees of freedom used in fitting the model. It can be thought of as the deviance with a penalty added to take account of the number of parameters in the model. In choosing between models, the model with the smaller AIC is preferred.
- S-plus command and self-written S-Plus function used:

AIC <- deviance(glm.object) + 2*(length(glm.object$fitted) - glm.object$df.resid)

where glm.object can be obtained from above procedure (iv).

Then the model with the minimum value, min(AIC), is selected.

vi) Principal component analysis is used to generate a composite score of the 4 different pollutants
- S-plus command used: princomp()

vii) Interaction effects between co-pollutants, and between each pollutant and the 4 seasons, are tested

viii) The autocorrelation function is used to estimate the serial correlation among the residuals of each health outcome after modelling.
- S-Plus command used: acf(resid(glm.object))
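Steps (iv), (v) and (viii) above can be sketched together as follows, in Python rather than S-plus. This is a simplified illustration under stated assumptions, not the study's programme: the model has a single pollutant covariate, single-day lags stand in for the cumulative lags of the report, deviance residuals are assumed for the autocorrelation check, and the data are invented.

```python
import math


def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu) = b0 + b1*x by iteratively reweighted least squares."""
    mu = [yi + 0.5 for yi in y]  # starting values, kept away from zero
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Working response and weights for the log link.
        z = [math.log(m) + (yi - m) / m for yi, m in zip(y, mu)]
        w = mu
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        mu = [math.exp(new_b0 + new_b1 * xi) for xi in x]
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1, mu


def unit_deviance(yi, m):
    """Poisson deviance contribution of one observation."""
    term = yi * math.log(yi / m) if yi > 0 else 0.0
    return 2.0 * (term - (yi - m))


def acf(series, max_lag):
    """Sample autocorrelations at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [s - mean for s in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


# Invented data: pollutant in units of 10 ug/m3, daily admission counts.
pollutant = [4.1, 5.3, 6.0, 3.2, 2.8, 4.7, 7.1, 6.4, 5.0, 3.9, 4.4, 5.8]
admissions = [30, 34, 38, 41, 29, 26, 33, 45, 40, 35, 28, 31]

n_params = 2                 # intercept and pollutant slope
y = admissions[2:]           # the same outcome days for every candidate lag
results = {}
for lag in (0, 1, 2):
    x = pollutant[2 - lag: len(pollutant) - lag]
    b0, b1, mu = fit_poisson(x, y)
    deviance = sum(unit_deviance(yi, m) for yi, m in zip(y, mu))
    pearson_chi2 = sum((yi - m) ** 2 / m for yi, m in zip(y, mu))
    results[lag] = {
        "slope": b1,
        "mu": mu,
        # Over-dispersion estimate: Pearson chi-square / residual df.
        "dispersion": pearson_chi2 / (len(y) - n_params),
        # AIC as defined in step (v): deviance + 2 * number of parameters.
        "aic": deviance + 2 * n_params,
    }

best_lag = min(results, key=lambda k: results[k]["aic"])
best_mu = results[best_lag]["mu"]
residuals = [math.copysign(math.sqrt(max(unit_deviance(yi, m), 0.0)), yi - m)
             for yi, m in zip(y, best_mu)]
residual_acf = acf(residuals, max_lag=3)
print("best lag:", best_lag)
print("residual ACF:", [round(r, 3) for r in residual_acf])
```

Because every candidate model here has the same number of parameters, the AIC comparison reduces to a comparison of deviances; the penalty term matters when models of different size are compared.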

Advice for future study

a) coverage of mortality/morbidity data

- include all mortality in Hong Kong

b) deletion and addition of parameters

- include other pollutants e.g. carbon monoxide, total suspended particulates and PM2.5

c) additional hospitals

- include Caritas Medical Centre, which is also a hospital with referral-based admissions from its A&E department

d) additional data analysis

- generalized additive model
- harvesting effects
- dose response relationship

e) geographical variations

- hospital admission rate by Tertiary Planning Unit (TPU)
- clustering of TPU with monitoring stations
- modelling hospital admission rates with air pollutant concentrations and with covariates from each TPU

Suggestions and guidelines for future study

Because of the large volume of data and the complexity of the data analysis methods involved, the statistical package S-plus on the Unix platform is recommended, as described in the protocol for APHEA II. We used S-plus on the Unix platform for the data management and data analysis of the study.

To be in line with the new trend in developing new hypotheses and methodological insights in the APHEA II project, the following should be the focus in any future study relevant to the Hong Kong situation:

1. investigation of the dose-response relationships using methods (like non-parametric smoothing) which allow a better identification of non-linearities in the shape of the curves and of possible thresholds;

2. exploration of new methodological approaches to develop a better understanding of how premature the deaths caused by air pollution are (harvesting or mortality displacement) and of the effect of harvesting on estimates of the size of the effect parameters; and

3. investigation of regional differences and explanation of heterogeneous effect estimates via modelling techniques by taking advantage of the extended data-base.

In this study, using only two years of data, we can only study the linear effect of air pollutants without taking account of a possible threshold. We need to examine the residual plot for each statistical model more closely, identify the sources of unexplained variation, autocorrelation and harvesting effects, and adjust for them where present. These are important issues in obtaining an explanatory model for the effects of air pollutant concentrations on hospital admissions and deaths. Although we found that air pollutants were related to acute hospital admissions and deaths, we should be cautious at this stage, as the methodology is still under development in other parts of the world and studies on the acute health effects in Hong Kong are still in their infancy.