*In this do file we present the econometric models for PR19 final determinations
*This do file has the following stages:
*1.- Data preparation
*2.- Variables generation using codes. Each variable code has a label assigned to it.
*2a.- Dependent variables (costs)
*2b.- Independent variables (cost drivers)
*3.- Structure the dataset in a panel dataset format
*4.- Calculate correlation coefficients and export them into Excel
*5.- Prepare the macros for different regression specifications
*6.- Run regressions and tests
*6a) Pooled OLS
*6b) Random effects
*7.- Export the following results:
*7a) Export a matrix for the coefficients that are used in feeder models 2 and 4.
*7b) Export coefficients, significance levels and statistical tests into Excel
*8.- Add a day and date stamp
*************************************************************************************************************************************************************************************
*========================================================================================
*1.- Data preparation
*========================================================================================
*1a.- Import the master dataset from Excel and assign labels and codes for all variables.
clear
matrix drop _all
set more off
set matsize 800
estimates drop _all
macro drop _all
cd "..."
import excel "...\Bioresources_Sep22Paper_Dataset.xlsx", sheet("Stata dataset") firstrow
*The first row of the dataset display variables' historical codes for the period between 2011/12 and 2017/18.
*The second row displays the variables' updated codes corresponding to the period 2018/19 to 2020/21 to 2024/25.
*The third row provides a corresponding label for each variable.
*Firstly, we eliminate the historical codes
* drop in 2
*Next, we assign names and labels with the first and second rows respectively to each variable using the following loop:
foreach var of varlist * {
label variable `var' "`=`var'[1]'"
* rename `var' `=`var'[1]'
}
*As we have identified the variable names and codes for each indicator, we no longer need the first two rows:
drop in 1
*Then can destring all variables
*We generate two variables, "start" and "end" so we always destring the relevant variables no matter what order we have:
g start = ""
g end = ""
order companycode codecombine financialyear start
destring start-end, replace force
drop start end
*Next, we rename the variables that were from external sources
rename C_CD0003_PR19WWW1 popdensity
rename C_CD0004_PR19WWW1 popsparsity
rename C_CD0008_PR19WWW1 wedensitywastewater
rename C_CD0009_PR19WWW1 welogdensitywastewater
rename C_CD0010_PR19WWW1 welogsqdensitywastewater
rename C_CD0015_PR19WWW1 populationwastewater
rename C_CD0016_PR19WWW1 rwasoc1
rename C_CD0017_PR19WWW1 rwasoc2
rename C_CD0014_PR19WW1 cpihadj
*1.b - Keep SVH, drop SVE, SVT and HDD
drop if inlist(companycode, "SVE", "HDD", "SVT")
*We check the number of osbervations per company and financial year
tab companycode financialyear, m
*========================================================================================
*2.- Variables generation using the codes. Each variable code has a label assigned to it.
*========================================================================================
*2a.- Dependent variables (costs)
*Modelled base costs are defined as:
*The sum of the following cost categories:
*a) Power (WS1001)
*b) Income treated as negative expenditure (WS1002)
*c) Service charges/ Discharge consents (WS1003)
*d) Bulk discharge (WS1004)
*e) Renewals expensed in year - infra (WS1005)
*f) Renewals expensed in year - noninfra (WS1006)
*g) Other operating expenditure excluding renewals (WS1007)
*h) Maintaining the long term of capability of the assets - infra (WS1012)
*i) Maintaining the long term of capability of the assets - noninfra (WS1013)
*j) Transfer private sewers and pumping stations (S3024)
*k) Smoothed New development and enw connections enhancement costs (S3020)
*l) Smoothed Growth at sewage treatment works (excluding sludge) (S3021)
*m) Smoothed Reduce flooding risk for properties (S3023)
*Minus the following cost categories:
*n) Traffic Management Act (W3032)
*o) Industrial Emissions Directorate (W3036)
*Minus the diversion costs for sewage collection only
*p) Diversions expenditure - wastewater NRSWA (APP28RR_WW0002)
*q) Diversions expenditure - wastewater Other non-s185 (APP28RR_WW0003)
*Based on the description above, we used the following codes to calculate base costs (botex)
*The following acronyms are used for each level of aggregation:
*botex stands for base costs used for econometric modelling
*swc: sewage collection
*swt: sewage treatment
*br: bioresources
*brp: bioresources plus (which includes bioresources and sewage treatment)
*npww: network plus wastewater
*www: wholesale wastewater
*As an example, botexwww stands for wholesale wastewater base costs
*Before calculating costs please note that data for trasnfer private sewer and pumping stations costs for some companies are missing. This should be read as zero and not as missing.
replace S3024STP=0 if S3024STP==. & financialyear=="2017-18"
replace S3024SDT=0 if S3024SDT==. & financialyear=="2017-18"
replace S3024SDD=0 if S3024SDD==. & financialyear=="2017-18"
/* This section has been commented out because in the new FM_WWW1 we transform TMA and IED costs before importing the dataset to Stata
*Traffic Managament Act costs and Industrial Emissions Directorate costs were given in £000s for the period 2011/12 to 2016/17 and also for 2017/18 for SVT
foreach x of varlist W3032* W3036*{
replace `x' = `x'/1000 if inlist(financialyear,"2011-12","2012-13","2013-14","2014-15","2015-16","2016-17")
replace `x' = `x'/1000 if companycode == "SVT" & inlist(financialyear,"2017-18","2018-19")
}
*/
*As a first step we calculate modelled base costs without enhancement
g botexbr = BM602STP + BM836STP + BM631STP + BM140STP + BM839ISTP + BM839NISTP + BM839OSTP + BC30945STP + CS00036STP + S3024STP + /// /* sludge transport */
BM602SDT + BM836SDT + BM631SDT + BM140SDT + BM839ISDT + BM839NISDT + BM839OSDT + BC30945SDT + CS00036SDT + S3024SDT + /// /* sludge treatment */
BM602SDD + BM836SDD + BM631SDD + BM140SDD + BM839ISDD + BM839NISDD + BM839OSDD + BC30945SDD + CS00036SDD + S3024SDD /// /* sludge disposal */
- W3032SL - W3036SL + BACKCAST
*For year 2020-21 onwards, we change the calculation of the dependent variable to
* (i) remove transferred private sewers and pumping stations from the calculation (not reported from 2020-21);
* (ii) remove non-section 185 diversions from the calculation of SWC (as these costs are not included in base from 2020-21)
* (iii) add developer services opex, Growth at STW opex and Risk of sewer flooding opex separately for 2020-21. Note there is no developer services expenditure for bioresources
replace botexbr = (BM602STP + BM836STP + BM631STP + BM140STP + BM839ISTP + BM839NISTP + BM839OSTP + BC30945STP + CS00036STP + WWS2034STP + WWS2038STP /// /* sludge transport */
+ BM602SDT + BM836SDT + BM631SDT + BM140SDT + BM839ISDT + BM839NISDT + BM839OSDT + BC30945SDT + CS00036SDT + WWS2034SDT + WWS2038SDT /// /* sludge treatment */
+ BM602SDD + BM836SDD + BM631SDD + BM140SDD + BM839ISDD + BM839NISDD + BM839OSDD + BC30945SDD + CS00036SDD + WWS2034SDD + WWS2038SDD /// /* sludge disposal */
- W3032SL - W3036SL + BACKCAST) if financialyear == "2020-21"
*2b.- Independent variables (cost drivers)
*Independent variables have the following terminology:
*Those starting with "pct" refer to a variables expressed in percentage terms. For these variables we use a scale from 0 to 100. So 80 means 80%.
*Variables are transformed in logs with the exception of the percent variables.
*Scale variables
*Number of properties
g properties = BN1178 * 1000
*Sludge produced
g sludgeprod = MP05611
*Load
g load = STWD128
*Sewer length
g sewerlength = BN13535_21 + BN13528
*Specific network characteristics variables for each element of the supply chain
*Sewage collection
*Pumping station capacity per sewer lenght
g pumpingcapperlength = S4029 / sewerlength
*Sewage treatment
*% of load with ammonia consent below 3mg
g pctnh3below3mg = ((STWDA121 + STWDA122_21) / (load)) * 100
*% of load treated at bands 1-3 (smaller bands tend to have higher unit cost)
g pctbands13 = ((STWD012_21 + STWD026_21 + STWD040_21) / (load)) * 100
*% of load treated at bands 6 (larger bands tend to have lower unit cost due to economies of scale)
g pctbands6 = ((STWD108_21) / (load)) * 100
*Number of sewage treatment works per property (sewage treatment)
g swtwperpro = STWC115_21 / properties
*Properties density
g density = properties / sewerlength
*The next step is to keep the relevant variables for logs and real terms transformation where applicable.
keep botex* /// Dependent variables and enhancement costs
properties sewerlength sludgeprod load /// Scale variables
pumpingcapperlength /// Complexity, sewage collection
pctnh3below3mg pctbands13 pctbands6 swtwperpro /// Complexity, sewage treatment
density wedensitywastewater /// Density (wedensitywastewater stands for weighted average density measyre developed by Ofwat)
cpihadj companycode financialyear /// Variables to set panel dataset as well as cpih adjustments (cphadj) to deflate costs.
*Convert modelled base costs and enhancement costs in real terms using the CPIH adjustments and 2017/18 as bease year
foreach x of varlist botex* {
replace `x' = `x' * cpihadj
rename `x' real`x'
}
*Transform variables to log. We don't trasform those variables expressed in percentage.
*To allow for flexibility we generate two temporary variables that mark the start and end of the range to be converted to log
g temp1=1
g temp2=1
*Then place those variables in the left that are not being transformed
order companycode financialyear cpihadj pct* temp1
*Transform to natural logs
foreach x of varlist temp1-temp2 {
g ln`x' = ln(`x')
}
drop temp* lntemp*
*Calculate the quadratic term for weighted average density
g lnwedensitywastewater2= lnwedensitywastewater^2
*==================================================
*3.- We structure the dataset in a panel dataset format
*==================================================
*First assign numerical values to company codes and financial years:
*For company code:
encode companycode, g (id)
*For financial year:
g year=real(substr(financialyear,1,4))+1
*Then we set the dataset as panel
xtset id year
*Even though we do not use any time trend or years dummies in our models for consultation, we generate these variables for the reader to test.
*Time trend
egen timetrend = group(financialyear)
*Time dummies
tab year, g(dummyyear)
*Also create our codecombine (id year) combination for references. We will need this variable for influential observations identification
egen codecombine = concat(companycode year)
*We don't consider the forecast period for econometric models:
* drop if year > 2021 /* this is not needed anymore since we no longer have forecast data in FM1 */
*We also do not consider HDD or SVE for 2019-20, and SVT for 2019-20
*drop if inlist(companycode,"SVE","HDD")
*drop if companycode == "SVT" & year > 2019
*drop if companycode == "SVH" & year < 2020
*Use SVH as continuation of SVT, as per CMA FD methodology
*replace companycode = "SVT" if companycode == "SVH"
*replace id = 8 if id == 7
*replace codecombine = "SVT2020" if codecombine == "SVH2020"
*replace codecombine = "SVT2021" if codecombine == "SVH2021"
*The relevant data can be exported into Excel as a QA check with our calculations of costs and costs drivers in feeder model 1.
export excel companycode financialyear codecombine realbotex* sewerlength sludgeprod load pumpingcapperlength pctnh3below3mg pctbands13 pctbands6 swtwperpro density wedensitywastewater lnwedensitywastewater2 using "Costs and costs drivers backcast.xlsx", sheet("Costs and costs drivers") sheetmodify firstrow(variables)
* Plot data charts
/*
graph bar (sum) realbotexwww, over(year) blabel(total, format(%9.0f)) title("Industry wholesale wastewater modelled base costs, £m")
graph export "O:\OFWSHARE\Cost assessment\PR24\Update with 2020-21 data\Charts\wwtotalbotex.png", replace
graph bar (sum) realbotexww, over(year) blabel(total, size(vsmall) format(%9.0f)) by(, title("Wholesale wastewater botex by company, £m")) by(companycode)
graph export "O:\OFWSHARE\Cost assessment\PR24\Update with 2020-21 data\Charts\wwwtotalbotexbycompany.png", replace
*/
*=================================================================
*4.- Calculate correlation coefficients and export them into Excel
*=================================================================
corr *
putexcel set "WWW Correlation Matrix.xls", replace
putexcel A1 = matrix(r(C)), names
*=================================================================
*5.- Prepare the macros for different regression specifications
*=================================================================
* Label variables for easy understanding
*Dependent variables
label var realbotexbr "Bioresources botex, £m"
*Independent variables
label var properties "Number of connected properties"
label var sludgeprod "Sludge produced"
label var load "Load entering the STWs"
label var sewerlength "Total sewer length"
label var pumpingcapperlength "Pumping capacity/km of sewer"
label var pctnh3below3mg "Percentage of load with ammonia below 3mg/l"
label var swtwperpro "Number of STWs per property"
label var pctbands13 "Percentage of load treated in STWs bands 1-3"
label var pctbands6 "Percentage of load treated in STWs band 6"
label var density "Number of properties per km of sewer length"
label var wedensitywastewater "Weighted average population density"
* Bioresources*
global reg1 lnrealbotexbr lnsludgeprod lnwedensitywastewater pctbands13
global reg2 lnrealbotexbr lnsludgeprod lnswtwperpro
*=================================================================
*6.- Run regressions and tests
*=================================================================
*6a) Pooled OLS (regressions have been added for reference but we opted to use random effects in all specifications)
forvalues i = 1(1)2{
*Run Pooled OLS with clustered standard errors at company level
eststo ols`i' : reg ${reg`i'}, vce(cluster id)
*Adjusted R squared
estadd scalar R_squared = e(r2_a): ols`i'
*VIF test
vif
estadd scalar VIF_statistic = r(vif_1): ols`i'
*RESET Test
ovtest
estadd scalar RESET_P_value = r(p): ols`i'
*Add title to the regression
estadd local Econometric_model = "Pooled OLS": ols`i'
*Drop irrelevant variables created in the loop
keep companycode - codecombine
}
*6b) Random effects
forvalues i = 1(1)2{
*Run Pooled random effect regressions with clustered standard errors at company level and store results.
eststo re`i': xtreg ${reg`i'}, re vce(cluster id)
*We will need predicted and actual costs in levels for the RESET test.
predict yhatre`i' , xb
replace yhatre`i' = exp(yhatre`i')
*Overall R squared
estadd scalar R_squared = e(r2_o): re`i'
*The following code lines calculates the RESET test as Stata does as this is not available for RE models.
quietly sum yhatre`i'
g nmlz_xb`i' = (yhatre`i'-r(mean))/r(sd)
g xb2`i' = (nmlz_xb`i')^2
g xb3`i' = (nmlz_xb`i')^3
g xb4`i' = (nmlz_xb`i')^4
xtreg ${reg`i'} xb2`i' xb3`i' xb4`i'
test xb2`i' xb3`i' xb4`i'
estadd scalar RESET_P_value = r(p): re`i'
*Add title to the regression
estadd local Econometric_model = "Random Effects": re`i'
*Drop irrelevant variables
keep companycode - codecombine
}
*==================================================
*9.- Export results and matrices into Excel
*==================================================
*Export a matrix for the coefficients that are used in feeder models 2 and 4.
estout * using ".\WWW coefficients - backcast.xls" , replace ///
stats(depvar)
estout using "WW Results - backcast.xls", replace ///
cells(b(star fmt(3)) p(par({ }))) starlevels( * 0.10 ** 0.05 *** 0.010) ///
stats(Econometric_model depvar N vce R_squared RESET_P_value) mlabels(,titles)
*==================================================
*10.-Add a day and date stamp
*==================================================
g run_time_stamp = "$S_TIME"
g run_date_stamp = "$S_DATE"
export excel run_date_stamp run_time_stamp using ".\WWW coefficients datestamp - backcast.xls", cell (A1) sheetmodify firstrow(variables)