A STUDY OF SHORT-TERM EFFECTS OF AMBIENT AIR POLLUTION ON PUBLIC HEALTH

APPENDIX 1

STEPS INVOLVED IN MANIPULATING THE EPD DATA FILES

The standard form of the EPD data files is as follows:

There are files from Jan 1993 through to June 1996 on PC diskettes, but we will not be using the entire data set.

< The time period selected for the first stage of analysis is Jan1994 to Dec 1995.

< The air pollution parameters being used are: Sulphur Dioxide (SO2), Nitrogen dioxide (NO2), TEOM and OZONE (O3).

< All stations will be used except for Mongkok

This protocol should only be carried out for those data that belong to the above data subset.

< Temperature and humidity will be extracted from the Royal Observatory data files seperately.

File format:

Each disk hold the data for a 3 month period (usually), for all monitoring stations.

On each disk, each file corresponds to one station for that period and lists the data for all pollutants, one at a time.

The files have created by EPD using the program "Q" (DOS-program)

and have the form - 'filename'.MTS

We now have a copy of this program so that we can look at the files in their original format

This format can be opened and read by EXCEL.

There is a mixture of both textural information (Month, station name, pollutant etc) and numeric data in each file.

For the tabulated numeric data, the columns are in the following sequence - day no., 24 columns of hourly values, mean daily value - taking up 26 columns altogether.

XXXX = signifies invalid data

blanks = signifies missing data

(For our purposes, both are treated as missing)

NB.

For SO2, O3 and CO - the first hourly value is always missing due to equipment calibration

For NO2, NO, NOX - the first 2 hourly values are always missing, except at Sham Shui Po station where only one value is missing (different equipment)

STEP 1.
Import a given data file into EXCEL, making sure that the columns of data match.. Some additional columns may need to be created and others widened to fit the text etc.


STEP 2. - Completeness criteria
Check the number of hourly values missing for each day (No. of 24 one-hourly values) and type this value into column 27

Only exception - For OZONE, check the number of hourly values within the period 6am to 7pm that are missing and type this into column 27. (6am -7pm period corresponds to the 7th-18th hourly values inclusive)

(day no.) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 (mean)


STEP 3. - Calculation of summary statistics for each day.
NB: If the number of missing hourly values (value calculated above and typed into column 27) is greater than 6, the daily summary statistics should not be calculated and a code of 999 inserted into the respective columns instead. For Ozone, if more than 3 of the 12 values (between 6am-7pm) are missing, type 999 in these columns.

The following summary statistics should be calculated and the values inserted into subsequent columns, in order, starting from column 28.

For each of the following pollutants- Sulphur Dioxide (SO2), Nitrogen dioxide (NO2), TEOM and OZONE (O3) calculate the:

median (24 hr), mean(24 hr), standard deviation(24 hr), maximum(24 hr), 95 centile(24 hr),.

In addition, Ozone requires 2 extra statistics to be calculated and typed into the subsequent column for this parameter (column 33)

For OZONE calculate the 8-hour mean (9am-5pm)

NB. average of 8 hourly values (10th-17th columns of hourly values inclusive - see below)

NB. At least 6 of these 8 values have to be available to calculate this mean

(day no.) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 (mean)


STEP 4: File rearrangement (to be done by one person)
Rearrange/combine the data into the following format in a new file.

each file = one pollutant, summary statistics for the entire time series (Jan 1994-Dec 1995)

with each station? set of summary statistics side by side in different column sets.

Can now delete the textural data that describes station name etc, as the station name and parameter name should be in the filename. eg. The new filename could be of the form (SO2.exl)

eg. SO2.exl

Station code TW TW TW TW TW TW MK MK etc.
sub-parameter No. missing median (24 hr) mean
(24 hr)
std. dev (24 hr) Max. (24 hr) 95th centile (24 hr) No. missing meian (24 hr) sequence repeated for other stations
010194                  
020194                  
|                  
311295                  


eg. ozone.exl

Station code TW TW TW TW TW TW TW MK MK etc.
sub-parameter No. missing median (24 hr) mean
(24 hr)
std. dev (24 hr) Max. (24 hr) 95th centile (24 hr) mean
(8 hr)
No. missing meian (24 hr) sequence repeated
010194                    
020194                    
|                    
311295                    


STEP 5
Calculate % of missing daily values for each parameter for each station, for the following periods and insert values in separate rows

Jan-Dec 1994

Jan-Dec 1995

Jan 1994-Dec 1995

eg. ozone.exl

Station code TW TW TW TW TW TW TW MK MK etc.
sub-parameter No. missing median (24 hr) mean
(24 hr)
std. dev (24 hr) Max. (24 hr) 95th centile (24 hr) mean
(8 hr)
No. missing meian (24 hr) sequence repeated for other stations
010194                    
020194                    
|                    
311295                    
% missing 94                    
% missing 95                    
%missing
Jan94-Dec95
                   


STEP 6

Decision needs to be made as to whether to include a particular station in the aggregated data set following the 25% completeness rule in the protocol (S14). "If a station has more than 25% of the values for the whole period of analysis missing, it is excluded (for that parameter)".


Step 7 - missing value imputation

For those stations with some missing data, but less than 25%, the possibility of filling in the missing values (imputation) is considered so as to produce a complete air pollutant time series. The method is based on estimating the missing value/s, using the available measurements from the other monitoring stations on the same day/s. A regression approach is used (see protocol, p. S14).

output - table without any missing values

If less than 25% of the values for a given 3-month season are missing for a particular station (j), use the following to calculate the estimated value for day i at station j (EVi)

E Vi = E x mean of the daily measurements for the missing station over a 3- month period

Average of the means of the remaining, non-missing stations in the same period

where E = mean of the non-missing measurements for the other stations on day i


Step 8

Aggregate the data for each sub-parameter, across all included monitoring stations to produce an aggregated data file of the following format.

columns = sub-parameters (SO2 median; SO2mean; SO2sd; SO2max; SO295; NO2 median....etc.)

rows = starting from day 1 (01 Jan 1994) through to 30 Jun 1996.

eg. agg-data.exl

sub-parameter SO2 SO2 SO2 SO2 SO2 TW NO2 NO2 etc.
Mean across all stations median (24 hr) mean
(24 hr)
std. dev
(24 hr)
Max.
(24 hr)
95 th
centile
(24 hr)
mean
(8hr)
median
(24 hr)
mean
(24 hr)
remaining parameters
010194                  
020194                  
|                  
311295                  
% missing Jan94-Dec95                  
overall mean Jan94-Dec95                  


Modified protocol for temperature and humidity data.

For these two parameters, the hourly values are not available. The summary statistics of daily mean, maximum and minimum are provided on two disks for the period January 1994 to December 1995. If less than 24 hourly values were available for a particular day, these statistics are still provided, but are flagged with the symbol #. The symbol XXX designates that no data are available for that day.

There are 7 monitoring stations covering the territory (King's Park, Lau Fau Shan, Wong Chuk Hang, Shatin, Tuen Mun, Ta Kwu Ling and Tseung Kwan O). The data are basically complete with an entire series of daily values being provided for all stations except for Ta Kwu Ling, where three days have no data available.

Steps involved:

Step 1 - as per EPD data

Step 2 - delete # symbol

It serves no purpose in this analysis. Without the hourly values, there is no way of assessing whether the daily summary statistics have been calculated using the completeness criteria of 75 %. Similarly, the summary statistics other than mean, maximum and minimum, such as the standard deviation, median and various centiles can not be computed.

Step 3 - Produce a summary table in the following format for both temperature and humidity.

eg. temp.exl

Station code KP KP KP LFS LFS etc.
sub-parameter mean
(24 hr)
Max.
(24hr)
Minimum
(24 hr)
mean
(24 hr)
Max.
(24hr)
sequence repeated for other stations
010194            
020194            
|            
311295            

Step 4 - Produce an aggregated data file with the means across all stations for each sub-parameter

eg. agg-temp.exl

sub-parameter Temp. Temp. Temp. Hum. Hum. Hum.
Mean across all stations

mean
(24 hr)

Max.
(24hr)

Min.
(24 hr)

mean
(24 hr)

Max.
(24 hr)

Min.
(24 hr.)

010194            
020194            
|            
311295            

% missing Jan 94 - Dec 95

           
overall mean Jan 94 - Dec 95            

Step 5 - Append above table to EPD data file


STEPS INVOLVED IN MANIPULATING THE HA DATA FILES

The HA hospital admissions data originate from two major database systems IPAS and MRAS (details provided in inception report - extract these)

The data were downloaded by the HA hospital by hospital, separately for each system in a text file format.

The primary data units consist of individual hospital admission records in the order of Patient's ID, episode number and admission date)

The time period for which computerised data are available varies between each hospital and there is a significant overlap of records between the two systems.

Step 1 - Data retrieval

Convert all data files from text file format to d-base format as they become available.

Step 2 - Data checking and re-formatting

Check selected data fields within each database for logical ranges and useability.

Adjust data formats to make them compatible between IPAS and MRAS

Step 3 - Finalise inclusivity criteria for time period, hospitals, ICD codes and sources of admission

Time period - 1 January, 1994 to 30 June, 1996. Initial analysis performed on data up to December 1995

Hospitals - list criteria and the final selection. Most of them have 24 hour emergency department.

Some were not included because of insufficient data. Eg. CMC had no computerised data until Oct., 1994.

ICD codes - as per inception report.

Sources of admission - records representing transfers from other hospitals were deleted (eg. source code = 5 )

Step 4 - Develop and use SPSS program to check for and delete duplicated records

Step 5 - merging of data and creation of new variables

Merge records contained in IPAS and MRAS data bases and join all files from different hospitals into one file. Create new variable, admission age (YY, MM, DD)

Step 6 - Prepare data files for time series analysis

Use SPSS to produce frequency tables (no. of admissions) for each ICD group of codes for each day.

Step 7 - Input SPSS-generated frequency data into EXCEL to facilitate the production of graphs etc.

Step 8 - Append HA data to EPD data file in EXCEL

Step 9 - Convert Excel file to text file format for statistical modelling.

 

Back to topdot_clear.gifBackTable of Content