*In this do file we present the econometric models for IAP in PR19
*This do file has the following stages:
*1.- Data preparation
*1a.- Import the master dataset from Excel and assign labels and codes for all variables.
*1b.- Aggregate data for SWB in 2016/17 and 2017/18 and save a temporary file.
*In the period from 2011/12 to 2015/16 South West and Bournemouth Water are treated as separate entities.
*1c.- Incorporate the temporary file into the master file.
*2.- Variables generation using codes. Each variable code has a label assigned to it.
*2a.- Dependent variables (costs)
*2b.- Independent variables (cost drivers)
*3.- Structure the dataset in a panel dataset format
*4.- Calculate correlation coefficients and export them into Excel
*5.- Prepare the macros for different regression specifications
*6.- Run regressions and tests
*6a) Pooled OLS
*6b) Random effects
*7.- Export the following results:
*7a) Export a matrix for the coefficients that are used in models PR19CA002 and PR19CA004.
*7b) Export coefficients, significance levels and statistical tests into Excel
*8.- Add a day and date stamp
*************************************************************************************************************************************************************************************
*========================================================================================
*1.- Data preparation
*========================================================================================
*1a.- Import the master dataset from Excel and assign labels and codes for all variables.
clear
matrix drop _all
set more off
set matsize 800
estimates drop _all
macro drop _all
import excel "FM_WW1.xlsx", sheet("Stata dataset - wholesale water") /*In this command you need to choose the location where you want to save it*/
*The first row of the dataset display variables' historical codes for the period between 2011/12 and 2017/18.
*The second row displays the variables' updated codes corresponding to the period 2018/19 to 2020/21 to 2024/25.
*The third row provides a corresponding label of each variable.
*Firstly, we eliminate the historical codes
drop in 2
*Next, we assign names and labels with the first and second rows, respectively, to each variable using the following loop:
foreach var of varlist * {
label variable `var' "`=`var'[2]'"
rename `var' `=`var'[1]'
}
*As we have identified the variable names and codes for each indicator, we no longer need the first two rows:
drop in 1/2
*Then can destring all variables
*We generate two variables, "start" and "end" so we always destring the relevant variables no matter what order they are in:
g start = ""
g end = ""
order companycode codecombine financialyear start
destring start-end, replace force
drop start end
*1b.- BWH and SWT are separated entities from 2011/12 to 2015/16. In 2016/17 they merge to form SWB.
*For modelling purposes, we consider the merging companies as a merged entity (SWB) for the years 2016/17 and 2017/18 (ie merging years) and as separate entities (BWH and SWT) for the previous years.
*However, regional wage and the density measures developed by Ofwat are reported separately for the merging years.
*We estimate regional wage and density measures in the merging years for SWB as the weighted average of BWH and SWT.
*We use our population estimates for both companies as weights to determine the size of each company.
*The following preserve restore functionality allows to compute the weights for SWT and BWH in 2016/17, aggregate the data, and save a temporary file.
*Then, this temporary file will then be appended to the master file.
preserve
*Keep only SWT, BWH and SWB data for the merging years
keep if inlist(companycode,"SWT","BWH","SWB") & inlist(financialyear,"2016-17","2017-18")
*Assign the merging code to all observations
replace companycode = "SWB"
*Next, allocate all variables on the left hand side that are string (companycode financialyear as well as wages and density measures).
order companycode codecombine financialyear popdensity popsparsity rwasoc1 rwasoc2 we* cpih*
*Now we calculate the weighted averages
foreach x of varlist popdensity popsparsity rwasoc1 rwasoc2 we*{
g product`x' = (populationwater)*`x'
bysort financialyear: egen sumproduct`x' = sum(product`x')
bysort financialyear: egen sumactivity`x' = sum(populationwater)
g wa`x' = sumproduct`x'/sumactivity`x'
replace wa`x' = . if wa`x' == 0
replace `x' = wa`x'
drop product* sumproduct* sumactivity* wa*
}
*Afterwards, we sum up all the remaining variables (and keep only one data point for weighted averages computed above) and save a temporary file
collapse (sum) WS1001WR-populationwater (min) popdensity popsparsity rwasoc1 rwasoc2 cpihadj we*, by(companycode financialyear)
tempfile SWB
save `SWB', replace
restore
*1c.- We append the SWB data to the master file.
*Before doing so we need to remove the existing disaggregated data in 2016/17 and 2017/18 for BWH and SWT and SWB to avoid duplication
drop if inlist(companycode,"SWT","BWH","SWB") & inlist(financialyear,"2016-17","2017-18")
append using `SWB'
*We check the number of osbervations per company and financial year to show there are observations for SWB only from 2016/17 to 2020/21 to 2024/25.
*That leaves no observations in the sample for the econometric models.
tab companycode financialyear, m
*========================================================================================
*2.- Variables generation using the codes. Each variable code has a label assigned to it.
*========================================================================================
*2a.- Dependent variables (costs)
*Base costs (botex) are defined as:
*The sum of the following cost categories:
*a) Power
*b) Abstraction charges/ Discharge consents but NOT for water resources
*c) Bulk supply
*d) Renewals expensed in year - infra
*e) Renewals expensed in year - noninfra
*f) Other operating expenditure excluding renewals
*g) Maintaining the long term of capability of the assets - infra
*h) Maintaining the long term of capability of the assets - noninfra
*Minus the following cost categories:
*i) Traffic Management Act
*j) Statutory water softening
*Based on the description above, we used the following codes to estimate base costs (botex)
*The following acronyms are used for each level of aggregation:
*botex stands for base costs used for econometric modelling
*wr: water resources
*twd: treated water distribution
*npw: network plus water
*wrp: water resources plus (which includes water resources, raw water distribution and water treatment)
*ww: wholesale water
*As an example, botexww stands for wholesale water base costs
g botexwr = WS1001WR + WS1004WR + WS1005WR + WS1006WR + WS1007WR + WS1012WR + WS1013WR - W3032WR - W3036WR
g botextwd = WS1001TWD + WS1003TWD + WS1004TWD + WS1005TWD + WS1006TWD + WS1007TWD + WS1012TWD + WS1013TWD - W3032TWD - W3036TWD
g botexww = WS1001CAW + WS1003CAW + WS1004CAW + WS1005CAW + WS1006CAW + WS1007CAW + WS1012CAW + WS1013CAW - W3032CAW - W3036CAW - WS1003WR
g botexnpw = botexww - botexwr
g botexwrp = botexww - botextwd
*2b.- Independent variables (cost drivers)
*Independent variables have the following abreviations:
*Those starting with "pct" refer to a variables expressed in percentage terms. For these variables we use a scale from 0 to 100. So 80 means 80%.
*Variables are transformed in logs with the exception of the percent variables.
*Scale variables
*Number of connected properties (for all different supply chain levels)
g properties = (BN2221 + BN2161) * 1000
*Lengths of main (for twd, npw and ww)
g lengthsofmain = BN1100 + BN11590
*Specific network characteristics variables for each element of the supply chain
*Water treatment
*% proportion of water treated in water treatment works with complexity levels 3-6
g watertreated = CPMW0098 + CPMW0104 + CPMW0110 + CPMW0116 + CPMW0165 + CPMW0166 + CPMW0167 + CPMW0027 + CPMW0033 + CPMW0039 + CPMW0045 + CPMW0185 + CPMW0197 + CPMW0198
g watertreated36 = CPMW0116 + CPMW0165 + CPMW0166 + CPMW0167 + CPMW0045 + CPMW0185 + CPMW0197 + CPMW0198
g pctwatertreated36 = (watertreated36 / watertreated) *100
*Weighted average level of treatment complexity (WAC)
g wac = (1*(CPMW0098+CPMW0027)/watertreated) + (2*(CPMW0104+CPMW0033)/watertreated) + (3*(CPMW0110+CPMW0039)/watertreated) + (4*(CPMW0116+CPMW0045)/watertreated) + (5*(CPMW0165+CPMW0185)/watertreated) + (6*(CPMW0166+CPMW0197)/watertreated) + (7*(CPMW0167+CPMW0198)/watertreated)
*Treated water distribution
*Number of boosters per lenghts of main as a proxy for pumping activity
g boosterperlength = BN11390 / lengthsofmain
*The next step is to keep the relevant variables for logs and real terms transformation where applicable.
keep botex* /// Dependent variables
properties lengthsofmain /// Scale variables
pctwatertreated36 wac /// Complexity, water treatment
boosterperlength /// Complexity, treated water distribution
wedensitywater /// Density (wedensitywater stands for weighted average density measure developed by Ofwat)
cpihadj companycode financialyear /// Variables to set panel dataset as well as cpih adjustments (cphadj) to deflate costs.
*Convert costs to real terms using the CPIH adjustments and 2017/18 as base year
foreach x of varlist botex*{
replace `x' = `x' * cpihadj
rename `x' real`x'
}
*Transform variables to log. We don't transform the variables expressed in percentage.
*To allow for flexibility we generate two temporary variables that mark the start and end of the range to be converted to log
g temp1=1
g temp2=1
*Then place those variables on the left that are not being transformed
order companycode financialyear cpihadj pct* temp1
*Transform to natural logs
foreach x of varlist temp1-temp2 {
g ln`x' = ln(`x')
}
drop temp* lntemp*
*Calculate the quadratic term for weighted average density
g lnwedensitywater2= lnwedensitywater^2
*==================================================
*3.- We structure the dataset in a panel dataset format
*==================================================
*First assign numerical values to company codes and financial years:
*For company code:
encode companycode, g (id)
*For financial year:
g year=real(substr(financialyear,1,4))+1
*Then we set the dataset as panel
xtset id year
*Even though we do not use any time trend or years dummies in our models for consultation, we generate these variables for the readers to test.
*Time trend
egen timetrend = group(financialyear)
*Time dummies
tab year, g(dummyyear)
*Also create our codecombine (id year) combination for references. We will need this variable for influential observations identification
egen codecombine = concat(companycode year)
*We don't consider the forecast period for econometric models:
drop if year > 2018
*We also do not consider HDD or SVE for 2017/18
drop if inlist(companycode,"SVE","HDD")
*The relevant data can be exported into Excel as a QA check with our calculations of costs and costs drivers in feeder model PR19CA001.
export excel companycode financialyear codecombine realbotex* properties lengthsofmain pctwatertreated36 wac boosterperlength wedensitywater using "WW - Costs and costs drivers.xlsx", sheet("Costs and costs drivers") sheetreplace firstrow(variables) /*In this command you have to write the right location you save it*/
*=================================================================
*4.- Calculate correlation coefficients and export them into Excel
*=================================================================
corr *
putexcel set "WW Correlation Matrix.xls", replace /*In this command you need to choose the location where you want to save it*/
putexcel A1 = matrix(r(C)), names
*=================================================================
*5.- Prepare the macros for different regression specifications
*=================================================================
*Label the variables for easy understanding:
*Dependent variables
label var lnrealbotexwrp "Water resources plus botex in logs"
label var lnrealbotextwd "Treated water distribution botex in logs"
label var lnrealbotexww "Wholesale water botex in logs"
*Independent variables
label var lnproperties "Number of connected properties in logs"
label var lnlengthsofmain "Lenghts of main"
label var pctwatertreated36 "Percent of water treated in water treatment works with complexity levels 3 to 6"
label var lnwac "Water treatment complexity index in logs"
label var lnboosterperlength "Number of booster pumping stations per lenghts of main in logs"
label var lnwedensitywater "Weighted average population density in logs"
label var lnwedensitywater2 "Squared term of weighted average population density in logs"
*Water resources plus
global reg1 lnrealbotexwrp lnproperties pctwatertreated36 lnwedensitywater lnwedensitywater2
global reg2 lnrealbotexwrp lnproperties lnwac lnwedensitywater lnwedensitywater2
*Treated water distribution
global reg3 lnrealbotextwd lnlengthsofmain lnboosterperlength lnwedensitywater lnwedensitywater2
*Wholesale water
global reg4 lnrealbotexww lnproperties pctwatertreated36 lnboosterperlength lnwedensitywater lnwedensitywater2
global reg5 lnrealbotexww lnproperties lnwac lnboosterperlength lnwedensitywater lnwedensitywater2
*=================================================================
*6.- Run regressions and tests
*=================================================================
*6a) Pooled OLS (regressions have been added for reference but we opted to use random effects in all specifications)
forvalues i = 1(1)5{
*Run Pooled OLS with clustered standard errors at company level
eststo ols`i' : reg ${reg`i'}, vce(cluster id)
*Adjusted R squared
estadd scalar R_squared = e(r2_a): ols`i'
*VIF test
vif
estadd scalar VIF_statistic = r(vif_1): ols`i'
*RESET Test
ovtest
estadd scalar RESET_P_value = r(p): ols`i'
*Add title to the regression
estadd local Econometric_model = "Pooled OLS": ols`i'
*Drop irrelevant variables created in the loop
keep companycode - codecombine
}
*6b) Random effects
forvalues i = 1(1)5{
*Run Pooled random effect regressions with clustered standard errors at company level and store results.
eststo re`i': xtreg ${reg`i'}, re vce(cluster id)
*We will need predicted and actual costs in levels for the RESET test.
predict yhatre`i' , xb
replace yhatre`i' = exp(yhatre`i')
*Overall R squared
estadd scalar R_squared = e(r2_o): re`i'
*The following code lines calculates the RESET test as Stata does as this is not available for RE models.
quietly sum yhatre`i'
g nmlz_xb`i' = (yhatre`i'-r(mean))/r(sd)
g xb2`i' = (nmlz_xb`i')^2
g xb3`i' = (nmlz_xb`i')^3
g xb4`i' = (nmlz_xb`i')^4
xtreg ${reg`i'} xb2`i' xb3`i' xb4`i'
test xb2`i' xb3`i' xb4`i'
estadd scalar RESET_P_value = r(p): re`i'
*Add title to the regression
estadd local Econometric_model = "Random Effects": re`i'
*Drop irrelevant variables
keep companycode - codecombine
}
*==================================================
*7.- Export results and matrices into Excel
*==================================================
*7a) Export a matrix for the coefficients that are used in feeder models PR19CA002 and PR19CA004.
estout * using "WW coefficients.xls" , replace /// /*In this command you need to choose the location where you want to save it*/
stats(depvar)
*7b) Export coefficients, significance levels and statistical tests into Excel
estout * using "WW results.xls" , replace /// /*In this command you need to choose the location where you want to save it*/
cells(b(star fmt(3)) p(par({ }))) starlevels( * 0.10 ** 0.05 *** 0.010) ///
stats(Econometric_model depvar N vce R_squared VIF_statistic RESET_P_value) mlabels(,titles)
eststo clear
*==================================================
*8.-Add a day and date stamp
*==================================================
g run_time_stamp = "$S_TIME"
g run_date_stamp = "$S_DATE"
export excel run_date_stamp run_time_stamp using "WW coefficients datestamp.xls", cell (A1) sheetmodify firstrow(variables) /*In this command you need to choose the location where you want to save it*/