*In this do file we present the econometric models for IAP in PR19 *This do file has the following stages: *1.- Data preparation *2.- Variables generation using codes. Each variable code has a label assigned to it. *2a.- Dependent variables (costs) *2b.- Independent variables (cost drivers) *3.- Structure the dataset in a panel dataset format *4.- Calculate correlation coefficients and export them into Excel *5.- Prepare the macros for different regression specifications *6.- Run regressions and tests *6a) Pooled OLS *6b) Random effects *7.- Export the following results: *7a) Export a matrix for the coefficients that are used in feeder models 2 and 4. *7b) Export coefficients, significance levels and statistical tests into Excel *8.- Add a day and date stamp ************************************************************************************************************************************************************************************* *======================================================================================== *1.- Data preparation *======================================================================================== *1a.- Import the master dataset from Excel and assign labels and codes for all variables. clear matrix drop _all set more off set matsize 800 estimates drop _all macro drop _all import excel "FM_WWW1.xlsx", sheet("Stata dataset - wastewater") /*In this command you have to write the right location of the master file where you save it*/ *The first row of the dataset display variables' historical codes for the period between 2011/12 and 2017/18. *The second row displays the variables' updated codes corresponding to the period 2018/19 to 2020/21 to 2024/25. *The third row provides a corresponding label for each variable. *Firstly, we eliminate the historical codes drop in 2 *Next, we assign names and labels to the first and second rows, respectively, to each variable using the following loop: foreach var of varlist * { label variable `var' "`=`var'[2]'" rename `var' `=`var'[1]' } *As we have identified the variable names and codes for each indicator, we no longer need the first two rows: drop in 1/2 *Then we can destring all variables *We generate two variables, "start" and "end" so we always destring the relevant variables no matter what order they are in: g start = "" g end = "" order companycode codecombine financialyear start destring start-end, replace force drop start end *======================================================================================== *2.- Variables generation using the codes. Each variable code has a label assigned to it. *======================================================================================== *2a.- Dependent variables (costs) *Base costs (botex) are defined as: *The sum of the following cost categories: *a) Power *b) Income treated as negative expenditure for bioresources ONLY *c) Service charges/ Discharge consents *d) Bulk discharge *e) Renewals expensed in year - infra *f) Renewals expensed in year - noninfra *g) Other operating expenditure excluding renewals *h) Maintaining the long term of capability of the assets - infra *i) Maintaining the long term of capability of the assets - noninfra *j) Transfer private sewers and pumping stations *Minus the following cost categories: *i) Traffic Management Act *j) Industrial Emissions Directorate *Based on the description above, we used the following codes to calculate base costs (botex) *The following acronyms are used for each level of aggregation: *botex stands for base costs used for econometric modelling *swc: sewage collection *swt: sewage treatment *br: bioresources *brp: bioresources plus (which includes bioresources and sewage treatment) *npww: network plus wastewater *www: wholesale wastewater *As an example, botexwww stands for wholesale wastewater base costs *Before calculating costs please note that data for trasnfer private sewer and pumping stations costs for some companies are blank. This should be read as zero and not as missing. replace S3024bioresources=0 if S3024bioresources==. & financialyear=="2017-18" g botexswc = (WWS1001SC + WWS1003SC + WWS1004SC + WWS1005SC + WWS1006SC + WWS1007SC + WWS1012SC + WWS1013SC + S3024SC) - (W3032NPSC / 1000) - (W3036NPSC / 1000) g botexswt = (WWS1001ST + WWS1003ST + WWS1004ST + WWS1005ST + WWS1006ST + WWS1007ST + WWS1012ST + WWS1013ST + S3024ST) - (W3032NPST / 1000) - (W3036NPST / 1000) g botexbr = (BM802bioresources + BM831bioresources + BM836bioresources + BM140bioresources + BM839Ibioresources + BM839NIbioresources + BM839Obioresources + BC30945bioresources + CS00036bioresources + S3024bioresources) - (W3032SL / 1000) - (W3036SL / 1000) g botexbrp = botexbr + botexswt g botexnpww = botexswc + botexswt g botexwww = botexnpww + botexbr *2b.- Independent variables (cost drivers) *Independent variables have the following abreviation: *Those starting with "pct" refer to a variables expressed in percentage terms. For these variables we use a scale from 0 to 100. So 80 means 80%. *Variables are transformed in logs with the exception of the percent variables. *Scale variables *Number of properties g properties = BN1178 * 1000 *Sludge produced g sludgeprod = MP05611 *Load g load = STWDA126 *Sewer length g sewerlength = BN13535 + BN13528 *Specific network characteristics variables for each element of the supply chain *Sewage collection *Pumping station capacity per sewer lenght g pumpingcapperlength = S4029 / sewerlength *Sewage treatment *% of load with ammonia consent below 3mg g pctnh3below3mg = ((STWDA121 + STWDA122) / (STWDA126)) * 100 *% of load treated at bands 1-3 (smaller bands tend to have higher unit cost) g pctbands13 = ((STWDA006+STWDA020+STWDA034) / (load)) * 100 *% of load treated at bands 6 (larger bands tend to have lower unit cost due to economies of scale) g pctbands6 = ((STWDA106) / (load)) * 100 *Number of sewage treatment works per property (sewage treatment) g swtwperpro = STWC115 / properties *Properties density g density = properties / sewerlength *The next step is to keep the relevant variables for logs and real terms transformation where applicable. keep botex* /// Dependent variables properties sewerlength sludgeprod load /// Scale variables pumpingcapperlength /// Complexity, sewage collection pctnh3below3mg pctbands13 pctbands6 swtwperpro /// Complexity, sewage treatment density wedensitywastewater /// Density (wedensitywastewater stands for weighted average density measyre developed by Ofwat) cpihadj companycode financialyear /// Variables to set panel dataset as well as cpih adjustments (cphadj) to deflate costs. *Convert costs real terms using the CPIH adjustments and 2017/18 as bease year foreach x of varlist botex*{ replace `x' = `x' * cpihadj rename `x' real`x' } *Transform variables to log. We don't trasform the variables expressed in percentage. *To allow for flexibility we generate two temporary variables that mark the start and end of the range to be converted to log g temp1=1 g temp2=1 *Then place those variables in the left that are not being transformed order companycode financialyear cpihadj pct* temp1 *Transform to natural logs foreach x of varlist temp1-temp2 { g ln`x' = ln(`x') } drop temp* lntemp* *================================================== *3.- We structure the dataset in a panel dataset format *================================================== *First assign numerical values to company codes and financial years: *For company code: encode companycode, g (id) *For financial year: g year=real(substr(financialyear,1,4))+1 *Then we set the dataset as panel xtset id year *Even though we do not use any time trend or years dummies in our models for consultation, we generate these variables for the reader to test. *Time trend egen timetrend = group(financialyear) *Time dummies tab year, g(dummyyear) *Also create our codecombine (id year) combination for references. We will need this variable for influential observations identification egen codecombine = concat(companycode year) *We don't consider the forecast period for econometric models: drop if year > 2018 *We also do not consider HDD or SVE for 2017/18 drop if inlist(companycode,"SVE","HDD") *The relevant data can be exported into Excel as a QA check with our calculations of costs and costs drivers in feeder model PR19CA005 export excel companycode financialyear codecombine realbotex* sewerlength sludgeprod load pumpingcapperlength pctnh3below3mg pctbands13 pctbands6 swtwperpro density wedensitywastewater using "WWW - Costs and costs drivers.xlsx", sheet("Costs and costs drivers") sheetreplace firstrow(variables) /*In this command you have to write the right location you save it*/ *================================================================= *4.- Calculate correlation coefficients and export them into Excel *================================================================= corr * putexcel set "WWW Correlation Matrix.xls", replace /*In this command you need to choose location where you want to save it*/ putexcel A1 = matrix(r(C)), names *================================================================= *5.- Prepare the macros for different regression specifications *================================================================= * Label variables for easy understanding *Dependent variables label var realbotexbr "Bioresources botex, £m" label var realbotexswt "Sewage treatment botex, £m" label var realbotexswc "Sewage collection botex, £m" label var realbotexbrp "Bioresources plus botex, £m" label var realbotexnpww "Wastewater network plus botex, £m" label var realbotexwww "Wholesale wastewater botex, £m" *Independent variables label var properties "Number of connected properties" label var sludgeprod "Sludge produced" label var load "Load entering the STWs" label var sewerlength "Total sewer length" label var pumpingcapperlength "Pumping capacity/km of sewer" label var pctnh3below3mg "Percentage of load with ammonia below 3mg/l" label var swtwperpro "Number of STWs per property" label var pctbands13 "Percentage of load treated in STWs bands 1-3" label var pctbands6 "Percentage of load treated in STWs band 6" label var density "Number of properties per km of sewer length" label var wedensitywastewater "Weighted average population density" * Sewage Collection global reg1 lnrealbotexswc lnsewerlength lnpumpingcapperlength lndensity global reg2 lnrealbotexswc lnsewerlength lnpumpingcapperlength lnwedensitywastewater * Sewage Treatment* global reg3 lnrealbotexswt lnload pctbands13 pctnh3below3mg global reg4 lnrealbotexswt lnload pctbands6 pctnh3below3mg * Bioresources* global reg5 lnrealbotexbr lnsludgeprod lnwedensitywastewater global reg6 lnrealbotexbr lnsludgeprod lnswtwperpro * Bioresources plus * global reg7 lnrealbotexbrp lnload pctbands13 pctnh3below3mg global reg8 lnrealbotexbrp lnload pctbands6 pctnh3below3mg *================================================================= *6.- Run regressions and tests *================================================================= *6a) Pooled OLS (regressions have been added for reference but we opted to use random effects in all specifications) forvalues i = 1(1)8{ *Run Pooled OLS with clustered standard errors at company level eststo ols`i' : reg ${reg`i'}, vce(cluster id) *Adjusted R squared estadd scalar R_squared = e(r2_a): ols`i' *VIF test vif estadd scalar VIF_statistic = r(vif_1): ols`i' *RESET Test ovtest estadd scalar RESET_P_value = r(p): ols`i' *Add title to the regression estadd local Econometric_model = "Pooled OLS": ols`i' *Drop irrelevant variables created in the loop keep companycode - codecombine } *6b) Random effects forvalues i = 1(1)8{ *Run Pooled random effect regressions with clustered standard errors at company level and store results. eststo re`i': xtreg ${reg`i'}, re vce(cluster id) *We will need predicted and actual costs in levels for the RESET test. predict yhatre`i' , xb replace yhatre`i' = exp(yhatre`i') *Overall R squared estadd scalar R_squared = e(r2_o): re`i' *The following code lines calculates the RESET test as Stata does as this is not available for RE models. quietly sum yhatre`i' g nmlz_xb`i' = (yhatre`i'-r(mean))/r(sd) g xb2`i' = (nmlz_xb`i')^2 g xb3`i' = (nmlz_xb`i')^3 g xb4`i' = (nmlz_xb`i')^4 xtreg ${reg`i'} xb2`i' xb3`i' xb4`i' test xb2`i' xb3`i' xb4`i' estadd scalar RESET_P_value = r(p): re`i' *Add title to the regression estadd local Econometric_model = "Random Effects": re`i' *Drop irrelevant variables keep companycode - codecombine } *================================================== *7.- Export results and matrices into Excel *================================================== *7a) Export a matrix for the coefficients that are used used in PR19CA06 and PR19CA08. estout * using "WWW coefficients.xls" , replace /// /*In this command you need to choose location where you want to save it*/ stats(depvar) *7b) Export coefficients, significance levels and statistical tests into Excel estout * using "WWW results.xls" , replace /// /*In this command you need to choose location where you want to save it*/ cells(b(star fmt(3)) p(par({ }))) starlevels( * 0.10 ** 0.05 *** 0.010) /// stats(Econometric_model depvar N vce R_squared VIF_statistic RESET_P_value) mlabels(,titles) eststo clear *================================================== *8.-Add a day and date stamp *================================================== g run_time_stamp = "$S_TIME" g run_date_stamp = "$S_DATE" export excel run_date_stamp run_time_stamp using "WWW coefficients datestamp.xls", cell (A1) sheetmodify firstrow(variables) /*In this command you need to choose location where you want to save it*/