|
APPENDIX
1
STEPS
INVOLVED IN MANIPULATING THE EPD DATA FILES
The standard
form of the EPD data files is as follows:
There
are files from Jan 1993 through to June 1996 on PC diskettes,
but we will not be using the entire data set.
<
The time period selected for the first stage of analysis is
Jan1994 to Dec 1995.
<
The air pollution parameters being used are: Sulphur Dioxide
(SO2), Nitrogen dioxide (NO2), TEOM and OZONE (O3).
<
All stations will be used except for Mongkok
This protocol
should only be carried out for those data that belong to the
above data subset.
<
Temperature and humidity will be extracted from the Royal
Observatory data files seperately.
File format:
Each disk
hold the data for a 3 month period (usually), for all monitoring
stations.
On each
disk, each file corresponds to one station for that
period and lists the data for all pollutants, one at a time.
The files
have created by EPD using the program "Q" (DOS-program)
and have
the form - 'filename'.MTS
We now
have a copy of this program so that we can look at the files
in their original format
This format
can be opened and read by EXCEL.
There
is a mixture of both textural information (Month, station
name, pollutant etc) and numeric data in each file.
For the
tabulated numeric data, the columns are in the following sequence
- day no., 24 columns of hourly values, mean daily value -
taking up 26 columns altogether.
XXXX =
signifies invalid data
blanks
= signifies missing data
(For our
purposes, both are treated as missing)
NB.
For SO2,
O3 and CO - the first hourly value is always missing due to
equipment calibration
For NO2,
NO, NOX - the first 2 hourly values are always missing, except
at Sham Shui Po station where only one value is missing (different
equipment)
STEP 1.
Import
a given data file into EXCEL, making sure that the columns
of data match.. Some additional columns may need to be created
and others widened to fit the text etc.
STEP 2. - Completeness criteria
Check the number of hourly values missing for each
day (No. of 24 one-hourly values) and type this value into
column 27
Only exception
- For OZONE, check the number of hourly values within the
period 6am to 7pm that are missing and type this into column
27. (6am -7pm period corresponds to the 7th-18th hourly values
inclusive)
(day no.)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 (mean)
STEP 3. - Calculation of summary statistics for each day.
NB: If the number of missing hourly values (value calculated
above and typed into column 27) is greater than 6,
the daily summary statistics should not be calculated and
a code of 999 inserted into the respective columns instead.
For Ozone, if more than 3 of the 12 values (between 6am-7pm)
are missing, type 999 in these columns.
The following
summary statistics should be calculated and the values inserted
into subsequent columns, in order, starting from column 28.
For each
of the following pollutants- Sulphur Dioxide (SO2), Nitrogen
dioxide (NO2), TEOM and OZONE (O3) calculate the:
median
(24 hr), mean(24 hr), standard deviation(24 hr), maximum(24
hr), 95 centile(24 hr),.
In
addition, Ozone requires 2 extra statistics to be calculated
and typed into the subsequent column for this parameter (column
33)
For OZONE
calculate the 8-hour mean (9am-5pm)
NB. average
of 8 hourly values (10th-17th columns of hourly values inclusive
- see below)
NB. At
least 6 of these 8 values have to be available to calculate
this mean
(day no.)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 (mean)
STEP 4:
File rearrangement (to be done by one person)
Rearrange/combine
the data into the following format in a new file.
each file
= one pollutant, summary statistics for the entire time series
(Jan 1994-Dec 1995)
with each
station? set of summary statistics side by side in different
column sets.
Can now
delete the textural data that describes station name etc,
as the station name and parameter name should be in the filename.
eg. The new filename could be of the form (SO2.exl)
eg. SO2.exl
| Station
code |
TW |
TW |
TW |
TW |
TW |
TW |
MK |
MK |
etc. |
| sub-parameter |
No.
missing |
median
(24 hr) |
mean
(24 hr) |
std.
dev (24 hr) |
Max.
(24 hr) |
95th
centile (24 hr) |
No.
missing |
meian
(24 hr) |
sequence
repeated for other stations |
| 010194 |
|
|
|
|
|
|
|
|
|
| 020194 |
|
|
|
|
|
|
|
|
|
| | |
|
|
|
|
|
|
|
|
|
| 311295 |
|
|
|
|
|
|
|
|
|
eg. ozone.exl
| Station
code |
TW |
TW |
TW |
TW |
TW |
TW |
TW |
MK |
MK |
etc. |
| sub-parameter |
No.
missing |
median
(24 hr) |
mean
(24 hr) |
std.
dev (24 hr) |
Max.
(24 hr) |
95th
centile (24 hr) |
mean
(8 hr) |
No.
missing |
meian
(24 hr) |
sequence
repeated |
| 010194 |
|
|
|
|
|
|
|
|
|
|
| 020194 |
|
|
|
|
|
|
|
|
|
|
| | |
|
|
|
|
|
|
|
|
|
|
| 311295 |
|
|
|
|
|
|
|
|
|
|
STEP 5
Calculate
% of missing daily values for each parameter for each station,
for the following periods and insert values in separate rows
Jan-Dec
1994
Jan-Dec
1995
Jan 1994-Dec
1995
eg. ozone.exl
| Station
code |
TW |
TW |
TW |
TW |
TW |
TW |
TW |
MK |
MK |
etc. |
| sub-parameter |
No.
missing |
median
(24 hr) |
mean
(24 hr) |
std.
dev (24 hr) |
Max.
(24 hr) |
95th
centile (24 hr) |
mean
(8 hr) |
No.
missing |
meian
(24 hr) |
sequence
repeated for other stations |
| 010194 |
|
|
|
|
|
|
|
|
|
|
| 020194 |
|
|
|
|
|
|
|
|
|
|
| | |
|
|
|
|
|
|
|
|
|
|
| 311295 |
|
|
|
|
|
|
|
|
|
|
| %
missing 94 |
|
|
|
|
|
|
|
|
|
|
| %
missing 95 |
|
|
|
|
|
|
|
|
|
|
%missing
Jan94-Dec95
|
|
|
|
|
|
|
|
|
|
|
STEP 6
Decision
needs to be made as to whether to include a particular station
in the aggregated data set following the 25% completeness
rule in the protocol (S14). "If a station has more than 25%
of the values for the whole period of analysis missing, it
is excluded (for that parameter)".
Step 7
- missing value imputation
For those
stations with some missing data, but less than 25%, the possibility
of filling in the missing values (imputation) is considered
so as to produce a complete air pollutant time series. The
method is based on estimating the missing value/s, using the
available measurements from the other monitoring stations
on the same day/s. A regression approach is used (see protocol,
p. S14).
output
- table without any missing values
If less
than 25% of the values for a given 3-month season are missing
for a particular station (j), use the following to calculate
the estimated value for day i at station j (EVi)
E Vi =
E x mean of the daily measurements for the missing station
over a 3- month period
Average
of the means of the remaining, non-missing stations in the
same period
where
E = mean of the non-missing measurements for the other stations
on day i
Step 8
Aggregate
the data for each sub-parameter, across all included monitoring
stations to produce an aggregated data file of the following
format.
columns
= sub-parameters (SO2 median; SO2mean; SO2sd; SO2max; SO295;
NO2 median....etc.)
rows =
starting from day 1 (01 Jan 1994) through to 30 Jun 1996.
eg. agg-data.exl
| sub-parameter |
SO2 |
SO2 |
SO2 |
SO2 |
SO2 |
TW |
NO2 |
NO2 |
etc. |
| Mean
across all stations |
median
(24 hr) |
mean
(24 hr) |
std.
dev
(24 hr) |
Max.
(24 hr) |
95
th
centile
(24 hr) |
mean
(8hr) |
median
(24 hr) |
mean
(24 hr) |
remaining
parameters |
| 010194 |
|
|
|
|
|
|
|
|
|
| 020194 |
|
|
|
|
|
|
|
|
|
| | |
|
|
|
|
|
|
|
|
|
| 311295 |
|
|
|
|
|
|
|
|
|
| %
missing Jan94-Dec95
|
|
|
|
|
|
|
|
|
|
| overall
mean Jan94-Dec95 |
|
|
|
|
|
|
|
|
|
Modified protocol for temperature and humidity data.
For these
two parameters, the hourly values are not available. The summary
statistics of daily mean, maximum and minimum are provided
on two disks for the period January 1994 to December 1995.
If less than 24 hourly values were available for a particular
day, these statistics are still provided, but are flagged
with the symbol #. The symbol XXX designates that no data
are available for that day.
There
are 7 monitoring stations covering the territory (King's Park,
Lau Fau Shan, Wong Chuk Hang, Shatin, Tuen Mun, Ta Kwu Ling
and Tseung Kwan O). The data are basically complete with an
entire series of daily values being provided for all stations
except for Ta Kwu Ling, where three days have no data available.
Steps
involved:
Step 1
- as per EPD data
Step 2
- delete # symbol
It serves
no purpose in this analysis. Without the hourly values, there
is no way of assessing whether the daily summary statistics
have been calculated using the completeness criteria of 75
%. Similarly, the summary statistics other than mean, maximum
and minimum, such as the standard deviation, median and various
centiles can not be computed.
Step 3
- Produce a summary table in the following format for both
temperature and humidity.
eg. temp.exl
| Station
code |
KP |
KP |
KP |
LFS |
LFS |
etc. |
| sub-parameter |
mean
(24
hr) |
Max.
(24hr)
|
Minimum
(24 hr) |
mean
(24
hr) |
Max.
(24hr)
|
sequence
repeated for other stations |
| 010194 |
|
|
|
|
|
|
| 020194 |
|
|
|
|
|
|
| | |
|
|
|
|
|
|
| 311295 |
|
|
|
|
|
|
Step 4
- Produce an aggregated data file with the means across all
stations for each sub-parameter
eg. agg-temp.exl
| sub-parameter |
Temp. |
Temp. |
Temp. |
Hum. |
Hum. |
Hum. |
| Mean
across all stations |
mean
(24 hr)
|
Max.
(24hr)
|
Min.
(24 hr)
|
mean
(24 hr)
|
Max.
(24 hr)
|
Min.
(24 hr.)
|
| 010194 |
|
|
|
|
|
|
| 020194 |
|
|
|
|
|
|
| | |
|
|
|
|
|
|
| 311295 |
|
|
|
|
|
|
|
%
missing Jan
94 -
Dec 95
|
|
|
|
|
|
|
| overall
mean Jan 94 - Dec 95 |
|
|
|
|
|
|
Step 5
- Append above table to EPD data file
STEPS
INVOLVED IN MANIPULATING THE HA DATA FILES
The HA
hospital admissions data originate from two major database
systems IPAS and MRAS (details provided in inception report
- extract these)
The data
were downloaded by the HA hospital by hospital, separately
for each system in a text file format.
The primary
data units consist of individual hospital admission records
in the order of Patient's ID, episode number and admission
date)
The time
period for which computerised data are available varies between
each hospital and there is a significant overlap of records
between the two systems.
Step 1
- Data retrieval
Convert
all data files from text file format to d-base format as they
become available.
Step 2
- Data checking and re-formatting
Check
selected data fields within each database for logical ranges
and useability.
Adjust
data formats to make them compatible between IPAS and MRAS
Step 3
- Finalise inclusivity criteria for time period, hospitals,
ICD codes and sources of admission
Time
period - 1 January, 1994 to 30 June, 1996. Initial analysis
performed on data up to December 1995
Hospitals
- list criteria and the final selection. Most of them have
24 hour emergency department.
Some were
not included because of insufficient data. Eg. CMC had no
computerised data until Oct., 1994.
ICD codes
- as per inception report.
Sources
of admission - records representing transfers from other
hospitals were deleted (eg. source code = 5 )
Step 4
- Develop and use SPSS program to check for and delete duplicated
records
Step 5
- merging of data and creation of new variables
Merge
records contained in IPAS and MRAS data bases and join all
files from different hospitals into one file. Create new variable,
admission age (YY, MM, DD)
Step 6
- Prepare data files for time series analysis
Use SPSS
to produce frequency tables (no. of admissions) for each ICD
group of codes for each day.
Step 7
- Input SPSS-generated frequency data into EXCEL to facilitate
the production of graphs etc.
Step 8
- Append HA data to EPD data file in EXCEL
Step 9
- Convert Excel file to text file format for statistical modelling.
|