/* CMA 2019-20 FD update: use 2019-20 and keep HDD in the sample
In this do file we include the 19-20 outturn APR data in the CMA provisional findings base models:
- include 19-20 in the SWB merge
- retain the year 19-20 in the sample
- drop SVE and HDD up to 18-19 but retain them both in 19-20.
*/
*This do file has the following stages:
*1.- Data preparation
*1a.- Import the master dataset from Excel and assign labels and codes for all variables.
*1b.- Aggregate data for SWB in 2016/17 and 2017/18 and save a temporary file.
*In the period from 2011/12 to 2015/16 South West and Bournemouth Water are treated as separate entities.
*1c.- Incorporate the temporary file into the master file.
*2.- Variables generation using codes. Each variable code has a label assigned to it.
*2a.- Dependent variables (costs)
*2b.- Independent variables (cost drivers)
*3.- Structure the dataset in a panel dataset format
*4.- Calculate correlation coefficients and export them into Excel
*5.- Prepare the macros for different regression specifications
*6.- Run regressions and tests
*6a) Pooled OLS
*6b) Random effects
*7.- Export the following results:
*7a) Export a matrix for the coefficients that are used used in feeder models 2 and 4.
*7b) Export coefficients, significance levels and statistical tests into Excel
*8.- Add a day and date stamp
*************************************************************************************************************************************************************************************
*========================================================================================
*1.- Data preparation
*========================================================================================
*1a.- Import the master dataset from Excel and assign labels and codes for all variables.
clear
matrix drop _all
set more off
set matsize 800
estimates drop _all
macro drop _all
import excel "O:\OFWSHARE\Cost assessment\PR19\Update PR19 models with 2019-20 data\Impact of using CMA modelling approach (CMA FD)\FM_WW1.xlsx", sheet("Stata dataset - water")
*The first row of the dataset display variables' historical codes for the period between 2011/12 and 2017/18.
*The second row displays the variables' updated codes corresponding to the period 2018/19 to 2020/21 to 2024/25.
*The third row provides a corresponding label of each variable.
*Firstly, we eliminate the historical codes
drop in 2
*Next, we assign names and labels with the first and second rows respectively to each variable using the following loop:
foreach var of varlist * {
label variable `var' "`=`var'[2]'"
rename `var' `=`var'[1]'
}
*As we have identified the variable names and codes for each indicator, we no longer need the first two rows:
drop in 1/2
*Then can destring all variables
*We generate two variables, "start" and "end" so we always destring the relevant variables no matter what order we have:
g start = ""
g end = ""
order companycode codecombine financialyear start
destring start-end, replace force
drop start end
*Next we rename the variables that were from external sources:
rename C_CD0001_PR19WW1 popdensity
rename C_CD0002_PR19WW1 popsparsity
rename C_CD0005_PR19WW1 wedensitywater
rename C_CD0022_PR19WW1 welogdensitywater
rename C_CD0023_PR19WW1 welogsqdensitywater
rename C_CD0011_PR19WW1 populationwater
rename C_CD0012_PR19WW1 rwasoc1
rename C_CD0013_PR19WW1 rwasoc2
rename C_CD0014_PR19WW1 cpihadj
*1b.- BWH and SWT are separated entities from 2011/12 to 2015/16. In 2016/17 they merge to form SWB.
*For modelling purposes, we consider the merging companies as a merged entity (SWB) for the years 2016/17 and 2017/18 (ie merging years) and as separate entities (BWH and SWT) for the previous years.
*However, regional wage and the density measures developed by Ofwat are reported separately for the merging years.
*We estimate regional wage and density measures in the merging years for SWB as the weighted average of BWH and SWT.
*We use our population estimates for both companies as weights to determine the size of each company.
*The following preserve restore functionality allows to compute the weights for SWT and BWH in 2016/17, aggregate the data, and save a temporary file.
*Then, this temporary file will then be appended to the master file.
preserve
*Keep only SWT, BWH and SWB data for the merging years
keep if inlist(companycode,"SWT","BWH","SWB") & inlist(financialyear,"2016-17","2017-18","2018-19","2019-20")
*Assign the merging code to all observations
replace companycode = "SWB"
*Next, allocate all variables on the left hand side that are string (companycode financialyear as well as wages and density measures).
order companycode codecombine financialyear popdensity popsparsity rwasoc1 rwasoc2 we* cpih*
*Now we calculate the weighted averages
foreach x of varlist popdensity popsparsity rwasoc1 rwasoc2 we*{
g product`x' = (populationwater)*`x'
bysort financialyear: egen sumproduct`x' = sum(product`x')
bysort financialyear: egen sumactivity`x' = sum(populationwater)
g wa`x' = sumproduct`x'/sumactivity`x'
replace wa`x' = . if wa`x' == 0
replace `x' = wa`x'
drop product* sumproduct* sumactivity* wa*
}
*Afterwards, we sum up all the remaining variables (and keep only one data point for weighted averages computed above) and save a temporary file
collapse (sum) WS1001WR-populationwater (min) popdensity popsparsity rwasoc1 rwasoc2 cpihadj we*, by(companycode financialyear)
tempfile SWB
save `SWB', replace
restore
*1c.- We append the SWB data to the master file.
*Before doing so we need to remove the existing disaggregated data in 2016/17 and 2017/18 for BWH and SWT and SWB to avoid duplication
drop if inlist(companycode,"SWT","BWH","SWB") & inlist(financialyear,"2016-17","2017-18","2018-19","2019-20")
append using `SWB'
*We check the number of osbervations per company and financial year to show there are observations for SWB only from 2016/17 to 2020/21 to 2024/25.
*That leaves no observations in the sample for the econometric models.
tab companycode financialyear, m
*========================================================================================
*2.- Variables generation using the codes. Each variable code has a label assigned to it.
*========================================================================================
*2a.- Dependent variables (costs)
*Modelled base costs are defined as:
*The sum of the following cost categories:
*a) Power (WS1001)
*b) Income treated as negative expenditure (WS1002)
*c) Bulk supply (WS1004)
*d) Renewals expensed in year - infra (WS1005)
*e) Renewals expensed in year - noninfra (WS1006)
*f) Other operating expenditure excluding renewals (WS1007)
*g) Maintaining the long term of capability of the assets - infra (WS1012)
*h) Maintaining the long term of capability of the assets - noninfra (WS1013)
*i) New developments enhancement costs (W3009)
*j) New connections enhancement costs (WS2004)
*k) Addressing low pressure enhancement costs (W3002)
*Minus the following cost categories:
*i) Traffic Management Act (W3032)
*j) Statutory water softening (W3036)
*Based on the description above, we used the following codes to estimate modelled base costs (botex)
*The following acronyms are used for each level of aggregation:
*botex stands for base costs used for econometric modelling
*wr: water resources
*twd: treated water distribution
*npw: network plus water
*wrp: water resources plus (which includes water resources, raw water distribution and water treatment)
*ww: wholesale water
*As an example, botexww stands for wholesale water base costs
g botexwr = WS1001WR + WS1002WR + WS1004WR + WS1005WR + WS1006WR + WS1007WR + WS1012WR + WS1013WR + W3009WR + WS2004WR + W3002WR - W3032WR - W3036WR
g botextwd = WS1001TWD + WS1002TWD + WS1004TWD + WS1005TWD + WS1006TWD + WS1007TWD + WS1012TWD + WS1013TWD + W3009TWD + WS2004TWD + W3002TWD - W3032TWD - W3036TWD - APP28RR_W0002 - APP28RR_W0003
g botexww = WS1001CAW + WS1002CAW + WS1004CAW + WS1005CAW + WS1006CAW + WS1007CAW + WS1012CAW + WS1013CAW + W3009CAW + WS2004CAW + W3002CAW - W3032CAW - W3036CAW - APP28RR_W0002 - APP28RR_W0003
g botexnpw = botexww - botexwr
g botexwrp = botexww - botextwd
*2b.- Independent variables (cost drivers)
*Independent variables have the following terminology:
*Those starting with "pct" refer to a variables expressed in percentage terms. For these variables we use a scale from 0 to 100. So 80 means 80%.
*Variables are transformed in logs with the exception of the percent variables.
*Scale variables
*Number of connected properties (for all different supply chain levels)
g properties = (BN2221 + BN2161) * 1000
*Lengths of main (for twd, npw and ww)
g lengthsofmain = BN1100 + BN11590
*Specific network characteristics variables for each element of the supply chain
*Water treatment
*% proportion of water treated in water treatment works with complexity levels 3-6
g watertreated = CPMW0098 + CPMW0104 + CPMW0110 + CPMW0116 + CPMW0165 + CPMW0166 + CPMW0167 + CPMW0027 + CPMW0033 + CPMW0039 + CPMW0045 + CPMW0185 + CPMW0197 + CPMW0198
g watertreated36 = CPMW0116 + CPMW0165 + CPMW0166 + CPMW0167 + CPMW0045 + CPMW0185 + CPMW0197 + CPMW0198
g pctwatertreated36 = (watertreated36 / watertreated) *100
*Weighted average level of treatment complexity (WAC)
g wac = (1*(CPMW0098+CPMW0027)/watertreated) + (2*(CPMW0104+CPMW0033)/watertreated) + (3*(CPMW0110+CPMW0039)/watertreated) + (4*(CPMW0116+CPMW0045)/watertreated) + (5*(CPMW0165+CPMW0185)/watertreated) + (6*(CPMW0166+CPMW0197)/watertreated) + (7*(CPMW0167+CPMW0198)/watertreated)
*Treated water distribution
*Number of boosters per lenghts of main as a proxy for pumping activity
g boosterperlength = BN11390 / lengthsofmain
*The next step is to keep the relevant variables for logs and real terms transformation where applicable.
keep botex* /// Dependent variables
properties lengthsofmain /// Scale variables
pctwatertreated36 wac /// Complexity, water treatment
boosterperlength /// Complexity, treated water distribution
wedensitywater /// Density (wedensitywater stands for weighted average density measure developed by Ofwat)
cpihadj companycode financialyear /// Variables to set panel dataset as well as cpih adjustments (cphadj) to deflate costs.
*Convert costs real terms using the CPIH adjustments and 2017/18 as bease year
foreach x of varlist botex*{
replace `x' = `x' * cpihadj
rename `x' real`x'
}
*Transform variables to log. We don't trasform those variables expressed in percentage.
*To allow for flexibility we generate two temporary variables that mark the start and end of the range to be converted to log
g temp1=1
g temp2=1
*Then place those variables in the left that are not being transformed
order companycode financialyear cpihadj pct* temp1
*Transform to natural logs
foreach x of varlist temp1-temp2 {
g ln`x' = ln(`x')
}
drop temp* lntemp*
*Calculate the quadratic term for weighted average density
g lnwedensitywater2= lnwedensitywater^2
*==================================================
*3.- We structure the dataset in a panel dataset format
*==================================================
*First assign numerical values to company codes and financial years:
*For company code:
encode companycode, g (id)
*For financial year:
g year=real(substr(financialyear,1,4))+1
*Then we set the dataset as panel
xtset id year
*Even though we do not use any time trend or years dummies in our models for consultation, we generate these variables for the readers to test.
*Time trend
egen timetrend = group(financialyear)
*Time dummies
tab year, g(dummyyear)
*Also create our codecombine (id year) combination for references. We will need this variable for influential observations identification
egen codecombine = concat(companycode year)
*We don't consider the forecast period for econometric models (2020-21 to 2024-25)
drop if year > 2020
*We also do not consider HDD or SVE for 2017/18 or 2018/19
drop if companycode == "SVE" & year < 2020
drop if companycode == "HDD" & year < 2020
drop if companycode == "SVT" & year == 2020
drop if companycode == "DVW" & year == 2020
*The relevant data can be exported into Excel as a QA check with our calculations of costs and costs drivers in feeder model 1.
export excel companycode financialyear codecombine realbotex* properties lengthsofmain pctwatertreated36 wac boosterperlength wedensitywater using "O:\OFWSHARE\Cost assessment\PR19\Update PR19 models with 2019-20 data\Impact of using CMA modelling approach (CMA FD)\WW - Costs and costs drivers.xlsx", sheet("Costs and costs drivers") sheetmodify firstrow(variables)
*=================================================================
*4.- Calculate correlation coefficients and export them into Excel
*=================================================================
corr *
putexcel set "O:\OFWSHARE\Cost assessment\PR19\Update PR19 models with 2019-20 data\Impact of using CMA modelling approach (CMA FD)\WW Correlation Matrix.xls", replace
putexcel A1 = matrix(r(C)), names
*=================================================================
*5.- Prepare the macros for different regression specifications
*=================================================================
*Label the variables for easy understanding:
*Dependent variables
label var lnrealbotexwrp "Water resources plus botex in logs"
label var lnrealbotextwd "Treated water distribution botex in logs"
label var lnrealbotexww "Wholesale water botex in logs"
*Independent variables
label var lnproperties "Number of connected properties in logs"
label var lnlengthsofmain "Lenghts of main"
label var pctwatertreated36 "Percent of water treated in water treatment works with complexity levels 3 to 6"
label var lnwac "Water treatment complexity index in logs"
label var lnboosterperlength "Number of booster pumping stations per lenghts of main in logs"
label var lnwedensitywater "Weighted average population density in logs"
label var lnwedensitywater2 "Squared term of weighted average population density in logs"
*Water resources plus
global reg1 lnrealbotexwrp lnproperties pctwatertreated36 lnwedensitywater lnwedensitywater2
global reg2 lnrealbotexwrp lnproperties lnwac lnwedensitywater lnwedensitywater2
*Treated water distribution
global reg3 lnrealbotextwd lnlengthsofmain lnboosterperlength lnwedensitywater lnwedensitywater2
*Wholesale water
global reg4 lnrealbotexww lnproperties pctwatertreated36 lnboosterperlength lnwedensitywater lnwedensitywater2
global reg5 lnrealbotexww lnproperties lnwac lnboosterperlength lnwedensitywater lnwedensitywater2
*=================================================================
*6.- Run regressions and tests
*=================================================================
*6a) Pooled OLS (regressions have been added for reference but we opted to use random effects in all specifications)
forvalues i = 1(1)5{
*Run Pooled OLS with clustered standard errors at company level
eststo ols`i' : reg ${reg`i'}, vce(cluster id)
*Adjusted R squared
estadd scalar R_squared = e(r2_a): ols`i'
*VIF test
vif
estadd scalar VIF_statistic = r(vif_1): ols`i'
*RESET Test
ovtest
estadd scalar RESET_P_value = r(p): ols`i'
*Add title to the regression
estadd local Econometric_model = "Pooled OLS": ols`i'
*Drop irrelevant variables created in the loop
keep companycode - codecombine
}
*6b) Random effects
forvalues i = 1(1)5{
*Run Pooled random effect regressions with clustered standard errors at company level and store results.
eststo re`i': xtreg ${reg`i'}, re vce(cluster id)
*We will need predicted and actual costs in levels for the RESET test.
predict yhatre`i' , xb
replace yhatre`i' = exp(yhatre`i')
*Overall R squared
estadd scalar R_squared = e(r2_o): re`i'
*The following code lines calculates the RESET test as Stata does as this is not available for RE models.
quietly sum yhatre`i'
g nmlz_xb`i' = (yhatre`i'-r(mean))/r(sd)
g xb2`i' = (nmlz_xb`i')^2
g xb3`i' = (nmlz_xb`i')^3
g xb4`i' = (nmlz_xb`i')^4
xtreg ${reg`i'} xb2`i' xb3`i' xb4`i'
test xb2`i' xb3`i' xb4`i'
estadd scalar RESET_P_value = r(p): re`i'
*Add title to the regression
estadd local Econometric_model = "Random Effects": re`i'
*Drop irrelevant variables
keep companycode - codecombine
}
*==================================================
*7.- Export results and matrices into Excel
*==================================================
*7a) Export a matrix for the coefficients that are used in feeder models 2 and 4.
estout * using "O:\OFWSHARE\Cost assessment\PR19\Update PR19 models with 2019-20 data\Impact of using CMA modelling approach (CMA FD)\WW coefficients.xls" , replace ///
stats(depvar)
*7b) Export coefficients, significance levels and statistical tests into Excel
estout * using "O:\OFWSHARE\Cost assessment\PR19\Update PR19 models with 2019-20 data\Impact of using CMA modelling approach (CMA FD)\WW results.xls" , replace ///
cells(b(star fmt(3)) p(par({ }))) starlevels( * 0.10 ** 0.05 *** 0.010) ///
stats(Econometric_model depvar N vce R_squared VIF_statistic RESET_P_value) mlabels(,titles)
eststo clear
*==================================================
*8.-Add a day and date stamp
*==================================================
g run_time_stamp = "$S_TIME"
g run_date_stamp = "$S_DATE"
export excel run_date_stamp run_time_stamp using "O:\OFWSHARE\Cost assessment\PR19\Update PR19 models with 2019-20 data\Impact of using CMA modelling approach (CMA FD)\WW coefficients datestamp.xls", cell (A1) sheetmodify firstrow(variables)