*In this do file we present the econometric models for Draft Determinations in PR19 *This do file has the following stages: *1.- Data preparation *2.- Variables generation using codes. Each variable code has a label assigned to it. *2a.- Dependent variables (costs) *2b.- Independent variables (cost drivers) *3.- Structure the dataset in a panel dataset format *4.- Calculate correlation coefficients and export them into Excel *5.- Prepare the macros for different regression specifications *6.- Run regressions and tests *6a) Pooled OLS *6b) Random effects *7.- Export the following results: *7a) Export a matrix for the coefficients that are used in feeder models 2 and 4. *7b) Export coefficients, significance levels and statistical tests into Excel *8.- Add a day and date stamp ************************************************************************************************************************************************************************************* *======================================================================================== *1.- Data preparation *======================================================================================== *1a.- Import the master dataset from Excel and assign labels and codes for all variables. clear matrix drop _all set more off set matsize 800 estimates drop _all macro drop _all import excel "FM_WWW1.xlsx", sheet("Stata dataset - wastewater") /*In this command you need to choose the location of the file you want to open*/ *The first row of the dataset display variables' historical codes for the period between 2011/12 and 2017/18. *The second row displays the variables' updated codes corresponding to the period 2018/19 to 2020/21 to 2024/25. *The third row provides a corresponding label for each variable. *Firstly, we eliminate the historical codes drop in 2 *Next, we assign names and labels with the first and second rows respectively to each variable using the following loop: foreach var of varlist * { label variable `var' "`=`var'[2]'" rename `var' `=`var'[1]' } *As we have identified the variable names and codes for each indicator, we no longer need the first two rows: drop in 1/2 *Then can destring all variables *We generate two variables, "start" and "end" so we always destring the relevant variables no matter what order we have: g start = "" g end = "" order companycode codecombine financialyear start destring start-end, replace force drop start end *Next, we rename the variables that were from external sources rename C_CD0003_PR19WWW1 popdensity rename C_CD0004_PR19WWW1 popsparsity rename C_CD0008_PR19WWW1 wedensitywastewater rename C_CD0009_PR19WWW1 welogdensitywastewater rename C_CD0010_PR19WWW1 welogsqdensitywastewater rename C_CD0015_PR19WWW1 populationwastewater rename C_CD0016_PR19WWW1 rwasoc1 rename C_CD0017_PR19WWW1 rwasoc2 rename C_CD0014_PR19WW1 cpihadj *======================================================================================== *2.- Variables generation using the codes. Each variable code has a label assigned to it. *======================================================================================== *2a.- Dependent variables (costs) *Modelled base costs are defined as: *The sum of the following cost categories: *a) Power (WS1001) *b) Income treated as negative expenditure (WS1002) *c) Service charges/ Discharge consents (WS1003) *d) Bulk discharge (WS1004) *e) Renewals expensed in year - infra (WS1005) *f) Renewals expensed in year - noninfra (WS1006) *g) Other operating expenditure excluding renewals (WS1007) *h) Maintaining the long term of capability of the assets - infra (WS1012) *i) Maintaining the long term of capability of the assets - noninfra (WS1013) *j) Transfer private sewers and pumping stations (S3024) *k) Smoothed New development and enw connections enhancement costs (S3020) *l) Smoothed Growth at sewage treatment works (excluding sludge) (S3021) *m) Smoothed Reduce flooding risk for properties (S3023) *Minus the following cost categories: *n) Traffic Management Act (W3032NP) *o) Industrial Emissions Directorate (W3036NP) *Based on the description above, we used the following codes to calculate base costs (botex) *The following acronyms are used for each level of aggregation: *botex stands for base costs used for econometric modelling *swc: sewage collection *swt: sewage treatment *br: bioresources *brp: bioresources plus (which includes bioresources and sewage treatment) *npww: network plus wastewater *www: wholesale wastewater *As an example, botexwww stands for wholesale wastewater base costs *Before calculating costs please note that data for trasnfer private sewer and pumping stations costs for some companies are missing. This should be read as zero and not as missing. replace S3024bioresources=0 if S3024bioresources==. & financialyear=="2017-18" *Traffic Managament Act costs and Industrial Emissions Directorate costs were given in £000s for the period 2011/12 to 2016/17 and also for 2017/18 for SVT foreach x of varlist W3032* W3036*{ replace `x' = `x'/1000 if inlist(financialyear,"2011-12","2012-13","2013-14","2014-15","2015-16","2016-17") replace `x' = `x'/1000 if companycode == "SVT" & financialyear == "2017-18" } *As a first step we calculate modelled base costs without enhancement g botexswc = WWS1001SC + WWS1002SC + WWS1003SC + WWS1004SC + WWS1005SC + WWS1006SC + WWS1007SC + WWS1012SC + WWS1013SC + S3024SC - W3032NPSC - W3036NPSC g botexswt = WWS1001ST + WWS1002ST + WWS1003ST + WWS1004ST + WWS1005ST + WWS1006ST + WWS1007ST + WWS1012ST + WWS1013ST + S3024ST - W3032NPST - W3036NPST g botexbr = BM802bioresources + BM831bioresources + BM836bioresources + BM140bioresources + BM839Ibioresources + BM839NIbioresources + BM839Obioresources + BC30945bioresources + CS00036bioresources + S3024bioresources - W3032SL - W3036SL g botexbrp = botexbr + botexswt g botexnpww = botexswc + botexswt g botexwww = botexnpww + botexbr *2b.- Independent variables (cost drivers) *Independent variables have the following terminology: *Those starting with "pct" refer to a variables expressed in percentage terms. For these variables we use a scale from 0 to 100. So 80 means 80%. *Variables are transformed in logs with the exception of the percent variables. *Scale variables *Number of properties g properties = BN1178 * 1000 *Sludge produced g sludgeprod = MP05611 *Load g load = STWDA126 *Sewer length g sewerlength = BN13535 + BN13528 *Specific network characteristics variables for each element of the supply chain *Sewage collection *Pumping station capacity per sewer lenght g pumpingcapperlength = S4029 / sewerlength *Sewage treatment *% of load with ammonia consent below 3mg g pctnh3below3mg = ((STWDA121 + STWDA122) / (load)) * 100 *% of load treated at bands 1-3 (smaller bands tend to have higher unit cost) g pctbands13 = ((STWDA006+STWDA020+STWDA034) / (load)) * 100 *% of load treated at bands 6 (larger bands tend to have lower unit cost due to economies of scale) g pctbands6 = ((STWDA106) / (load)) * 100 *Number of sewage treatment works per property (sewage treatment) g swtwperpro = STWC115 / properties *Properties density g density = properties / sewerlength *The next step is to keep the relevant variables for logs and real terms transformation where applicable. keep botex* S3020* S3021* S3023* /// Dependent variables and enhancement costs properties sewerlength sludgeprod load /// Scale variables pumpingcapperlength /// Complexity, sewage collection pctnh3below3mg pctbands13 pctbands6 swtwperpro /// Complexity, sewage treatment density wedensitywastewater /// Density (wedensitywastewater stands for weighted average density measyre developed by Ofwat) cpihadj companycode financialyear /// Variables to set panel dataset as well as cpih adjustments (cphadj) to deflate costs. *Convert modelled base costs and enhancement costs in real terms using the CPIH adjustments and 2017/18 as bease year foreach x of varlist botex* S3020* S3021* S3023*{ replace `x' = `x' * cpihadj rename `x' real`x' } *Now smooth real enhancement costs foreach x of varlist realS3020* realS3021* realS3023*{ bysort companycode: egen smoothed`x' = mean(`x') if inlist(financialyear,"2011-12","2012-13","2013-14","2014-15","2015-16","2016-17","2017-18") replace smoothed`x' = `x' if smoothed`x' == . } replace realbotexswc = realbotexswc + smoothedrealS3020SC + smoothedrealS3021SC + smoothedrealS3023SC replace realbotexswt = realbotexswt + smoothedrealS3020ST + smoothedrealS3021ST + smoothedrealS3023ST replace realbotexbr = realbotexbr + smoothedrealS3020bioresources + smoothedrealS3021bioresources + smoothedrealS3023bioresources replace realbotexbrp = realbotexbr + realbotexswt replace realbotexnpww = realbotexswc + realbotexswt replace realbotexwww = realbotexnpww + realbotexbr *Drop enhancement costs drop realS3020* realS3021* realS3023* smoothed* *Transform variables to log. We don't trasform those variables expressed in percentage. *To allow for flexibility we generate two temporary variables that mark the start and end of the range to be converted to log g temp1=1 g temp2=1 *Then place those variables in the left that are not being transformed order companycode financialyear cpihadj pct* temp1 *Transform to natural logs foreach x of varlist temp1-temp2 { g ln`x' = ln(`x') } drop temp* lntemp* *================================================== *3.- We structure the dataset in a panel dataset format *================================================== *First assign numerical values to company codes and financial years: *For company code: encode companycode, g (id) *For financial year: g year=real(substr(financialyear,1,4))+1 *Then we set the dataset as panel xtset id year *Even though we do not use any time trend or years dummies in our models for consultation, we generate these variables for the reader to test. *Time trend egen timetrend = group(financialyear) *Time dummies tab year, g(dummyyear) *Also create our codecombine (id year) combination for references. We will need this variable for influential observations identification egen codecombine = concat(companycode year) *We don't consider the forecast period for econometric models: drop if year > 2018 *We also do not consider HDD or SVE for 2017/18 drop if inlist(companycode,"SVE","HDD") *The relevant data can be exported into Excel as a QA check with our calculations of costs and costs drivers in feeder model 1. export excel companycode financialyear codecombine realbotex* sewerlength sludgeprod load pumpingcapperlength pctnh3below3mg pctbands13 pctbands6 swtwperpro density wedensitywastewater using "WWW - Costs and costs drivers.xlsx", sheet("Costs and costs drivers") sheetmodify firstrow(variables) /*In this command you have to write the right location of the master file where you save it*/ *================================================================= *4.- Calculate correlation coefficients and export them into Excel *================================================================= corr * putexcel set "WWW Correlation Matrix.xls", replace /*In this command you have to write the right location of the master file where you save it*/ putexcel A1 = matrix(r(C)), names *================================================================= *5.- Prepare the macros for different regression specifications *================================================================= * Label variables for easy understanding *Dependent variables label var realbotexbr "Bioresources botex, £m" label var realbotexswt "Sewage treatment botex, £m" label var realbotexswc "Sewage collection botex, £m" label var realbotexbrp "Bioresources plus botex, £m" label var realbotexnpww "Wastewater network plus botex, £m" label var realbotexwww "Wholesale wastewater botex, £m" *Independent variables label var properties "Number of connected properties" label var sludgeprod "Sludge produced" label var load "Load entering the STWs" label var sewerlength "Total sewer length" label var pumpingcapperlength "Pumping capacity/km of sewer" label var pctnh3below3mg "Percentage of load with ammonia below 3mg/l" label var swtwperpro "Number of STWs per property" label var pctbands13 "Percentage of load treated in STWs bands 1-3" label var pctbands6 "Percentage of load treated in STWs band 6" label var density "Number of properties per km of sewer length" label var wedensitywastewater "Weighted average population density" * Sewage Collection global reg1 lnrealbotexswc lnsewerlength lnpumpingcapperlength lndensity global reg2 lnrealbotexswc lnsewerlength lnpumpingcapperlength lnwedensitywastewater * Sewage Treatment* global reg3 lnrealbotexswt lnload pctbands13 pctnh3below3mg global reg4 lnrealbotexswt lnload pctbands6 pctnh3below3mg * Bioresources* global reg5 lnrealbotexbr lnsludgeprod lnwedensitywastewater pctbands13 global reg6 lnrealbotexbr lnsludgeprod lnswtwperpro * Bioresources plus * global reg7 lnrealbotexbrp lnload pctbands13 pctnh3below3mg global reg8 lnrealbotexbrp lnload pctbands6 pctnh3below3mg *================================================================= *6.- Run regressions and tests *================================================================= *6a) Pooled OLS (regressions have been added for reference but we opted to use random effects in all specifications) forvalues i = 1(1)8{ *Run Pooled OLS with clustered standard errors at company level eststo ols`i' : reg ${reg`i'}, vce(cluster id) *Adjusted R squared estadd scalar R_squared = e(r2_a): ols`i' *VIF test vif estadd scalar VIF_statistic = r(vif_1): ols`i' *RESET Test ovtest estadd scalar RESET_P_value = r(p): ols`i' *Add title to the regression estadd local Econometric_model = "Pooled OLS": ols`i' *Drop irrelevant variables created in the loop keep companycode - codecombine } *6b) Random effects forvalues i = 1(1)8{ *Run Pooled random effect regressions with clustered standard errors at company level and store results. eststo re`i': xtreg ${reg`i'}, re vce(cluster id) *We will need predicted and actual costs in levels for the RESET test. predict yhatre`i' , xb replace yhatre`i' = exp(yhatre`i') *Overall R squared estadd scalar R_squared = e(r2_o): re`i' *The following code lines calculates the RESET test as Stata does as this is not available for RE models. quietly sum yhatre`i' g nmlz_xb`i' = (yhatre`i'-r(mean))/r(sd) g xb2`i' = (nmlz_xb`i')^2 g xb3`i' = (nmlz_xb`i')^3 g xb4`i' = (nmlz_xb`i')^4 xtreg ${reg`i'} xb2`i' xb3`i' xb4`i' test xb2`i' xb3`i' xb4`i' estadd scalar RESET_P_value = r(p): re`i' *Add title to the regression estadd local Econometric_model = "Random Effects": re`i' *Drop irrelevant variables keep companycode - codecombine } *================================================== *7.- Export results and matrices into Excel *================================================== *7a) Export a matrix for the coefficients that are used used in feeder models 2 and 4. estout * using "WWW coefficients.xls" , replace /// /*In this command you have to write the right location of the master file where you save it*/ stats(depvar) *7b) Export coefficients, significance levels and statistical tests into Excel estout * using "WWW results.xls" , replace /// /*In this command you have to write the right location of the master file where you save it*/ cells(b(star fmt(3)) p(par({ }))) starlevels( * 0.10 ** 0.05 *** 0.010) /// stats(Econometric_model depvar N vce R_squared VIF_statistic RESET_P_value) mlabels(,titles) eststo clear *================================================== *8.-Add a day and date stamp *================================================== g run_time_stamp = "$S_TIME" g run_date_stamp = "$S_DATE" export excel run_date_stamp run_time_stamp using "WWW coefficients datestamp.xls", cell (A1) sheetmodify firstrow(variables) /*In this command you have to write the right location of the master file where you save it*/