*In this do file we present the econometric models for Draft Determinations in PR19
*This do file has the following stages:

 *1.- Data preparation
  
 *2.- Variables generation using codes. Each variable code has a label assigned to it.  
   
   *2a.- Dependent variables (costs)
   *2b.- Independent variables (cost drivers)
 
 *3.- Structure the dataset in a panel dataset format  
 
 *4.- Calculate correlation coefficients and export them into Excel 
 
 *5.- Prepare the macros for different regression specifications 
 
 *6.- Run regressions and tests
   *6a) Pooled OLS
   *6b) Random effects
    
 *7.- Export the following results:
	*7a) Export a matrix for the coefficients that are used in feeder models 2 and 4.
	*7b) Export coefficients, significance levels and statistical tests into Excel 

 *8.- Add a day and date stamp


*************************************************************************************************************************************************************************************


 *======================================================================================== 
 *1.- Data preparation
 *========================================================================================

   *1a.- Import the master dataset from Excel and assign labels and codes for all variables. 
clear
matrix drop _all
set more off
set matsize 800
estimates drop _all
macro drop _all

import excel "FM_WWW1.xlsx", sheet("Stata dataset - wastewater") /*In this command you need to choose the location of the file you want to open*/

	*The first row of the dataset display variables' historical codes for the period between 2011/12 and 2017/18. 
	*The second row displays the variables' updated codes corresponding to the period 2018/19 to 2020/21 to 2024/25.  
	*The third row provides a corresponding label for each variable. 

	*Firstly, we eliminate the historical codes 
drop in 2 	
	
	*Next, we assign names and labels with the first and second rows respectively to each variable using the following loop: 
	
foreach var of varlist * {
  label  variable  `var' "`=`var'[2]'"
  rename `var' `=`var'[1]'
}

	*As we have identified the variable names and codes for each indicator, we no longer need the first two rows:
drop in 1/2
	
	*Then can destring all variables 
	 *We generate two variables, "start" and "end" so we always destring the relevant variables no matter what order we have:
g start		= ""
g end		= ""
order 		companycode codecombine financialyear start
destring 	start-end, replace force
drop 		start end

*Next, we rename the variables that were from external sources

rename	C_CD0003_PR19WWW1	popdensity
rename	C_CD0004_PR19WWW1	popsparsity
rename	C_CD0008_PR19WWW1	wedensitywastewater
rename	C_CD0009_PR19WWW1	welogdensitywastewater
rename	C_CD0010_PR19WWW1	welogsqdensitywastewater
rename	C_CD0015_PR19WWW1	populationwastewater
rename	C_CD0016_PR19WWW1	rwasoc1
rename	C_CD0017_PR19WWW1	rwasoc2
rename	C_CD0014_PR19WW1	cpihadj


*========================================================================================
 *2.- Variables generation using the codes. Each variable code has a label assigned to it.  
 *======================================================================================== 
   
    *2a.- Dependent variables (costs)
	  
	  *Modelled base costs are defined as:
	   
	   *The sum of the following cost categories:
	  
		*a) Power																(WS1001)
		*b) Income treated as negative expenditure								(WS1002)
		*c) Service charges/ Discharge consents									(WS1003)
		*d) Bulk discharge														(WS1004)
		*e) Renewals expensed in year - infra									(WS1005)
		*f) Renewals expensed in year - noninfra								(WS1006)
		*g) Other operating expenditure excluding renewals						(WS1007)
		*h) Maintaining the long term of capability of the assets - infra		(WS1012)
		*i) Maintaining the long term of capability of the assets - noninfra	(WS1013)
		*j) Transfer private sewers and pumping stations						(S3024)
		*k) Smoothed New development and enw connections enhancement costs		(S3020)
		*l) Smoothed Growth at sewage treatment works (excluding sludge)		(S3021)
		*m) Smoothed Reduce flooding risk for properties						(S3023)
	
	   *Minus the following cost categories:
		
		*n) Traffic Management Act												(W3032NP) 
		*o) Industrial Emissions Directorate									(W3036NP)
		
		*Based on the description above, we used the following codes to calculate base costs (botex)
	  
	  *The following acronyms are used for each level of aggregation:
	    *botex stands for base costs used for econometric modelling
	    *swc: 	sewage collection
	    *swt: 	sewage treatment
	    *br: 	bioresources
	    *brp: 	bioresources plus (which includes bioresources and sewage treatment)
	    *npww: 	network plus wastewater
	    *www:   wholesale wastewater
		*As an example, botexwww stands for wholesale wastewater base costs	
	
	  *Before calculating costs please note that data for trasnfer private sewer and pumping stations costs for some companies are missing. This should be read as zero and not as missing.
	replace S3024bioresources=0 if S3024bioresources==. & financialyear=="2017-18"

	  *Traffic Managament Act costs and Industrial Emissions Directorate costs were given in £000s for the period 2011/12 to 2016/17 and also for 2017/18 for SVT
foreach x of varlist W3032* W3036*{
replace `x' = `x'/1000 if inlist(financialyear,"2011-12","2012-13","2013-14","2014-15","2015-16","2016-17")
replace `x' = `x'/1000 if companycode == "SVT" & financialyear == "2017-18"
}

	  *As a first step we calculate modelled base costs without enhancement

	g botexswc		=	WWS1001SC  		  + WWS1002SC		  + WWS1003SC		   	+ WWS1004SC		   	+ WWS1005SC 			+ WWS1006SC 		   + WWS1007SC 		    + WWS1012SC 		  + WWS1013SC 		    + S3024SC  			- W3032NPSC - W3036NPSC 
	g botexswt		=	WWS1001ST		  + WWS1002ST	   	  + WWS1003ST		   	+ WWS1004ST		   	+ WWS1005ST 			+ WWS1006ST 		   + WWS1007ST 		    + WWS1012ST 		  + WWS1013ST 		    + S3024ST  			- W3032NPST - W3036NPST 
	g botexbr		=	BM802bioresources + BM831bioresources + BM836bioresources	+ BM140bioresources + BM839Ibioresources 	+  BM839NIbioresources + BM839Obioresources + BC30945bioresources + CS00036bioresources + S3024bioresources - W3032SL   - W3036SL 
	g botexbrp 		= 	botexbr	 	+ botexswt
	g botexnpww 	=   botexswc 	+ botexswt
	g botexwww 		= 	botexnpww 	+ botexbr

	
    *2b.- Independent variables (cost drivers)	
	   
	   *Independent variables have the following terminology:
	      *Those starting with "pct" refer to a variables expressed in percentage terms. For these variables we use a scale from 0 to 100. So 80 means 80%. 
		  *Variables are transformed in logs with the exception of the percent variables. 

       *Scale variables
	  
	      *Number of properties
	g properties	=   BN1178 * 1000
	      *Sludge produced
	g sludgeprod	= 	MP05611
	      *Load
	g load			=   STWDA126
	      *Sewer length
	g sewerlength 	=	BN13535 + BN13528  
	
	   *Specific network characteristics variables for each element of the supply chain
	
	     *Sewage collection 
	 		
		   *Pumping station capacity per sewer lenght
	g pumpingcapperlength	= S4029 / sewerlength
	
	   *Sewage treatment 
	   	   *% of load with ammonia consent below 3mg
	g pctnh3below3mg	= ((STWDA121 + STWDA122) / (load)) * 100
	  	   *% of load treated at bands 1-3 (smaller bands tend to have higher unit cost)
	g pctbands13	 	= ((STWDA006+STWDA020+STWDA034) / (load)) * 100
	  	   *% of load treated at bands 6 (larger bands tend to have lower unit cost due to economies of scale)
	g pctbands6			= ((STWDA106) / (load)) * 100
			   *Number of sewage treatment works per property (sewage treatment)
	g swtwperpro		= 	STWC115 / properties
	
   	   *Properties density
	g density		 	= properties / sewerlength  


  *The next step is to keep the relevant variables for logs and real terms transformation where applicable.	
	
keep  botex* S3020* S3021* S3023*						/// Dependent variables and enhancement costs	
	  properties sewerlength sludgeprod load			/// Scale variables
	  pumpingcapperlength 			 					/// Complexity, sewage collection
	  pctnh3below3mg pctbands13 pctbands6 swtwperpro	/// Complexity, sewage treatment	
	  density wedensitywastewater						/// Density (wedensitywastewater stands for weighted average density measyre developed by Ofwat)
	  cpihadj companycode financialyear					/// Variables to set panel dataset as well as cpih adjustments (cphadj) to deflate costs.
  
  *Convert modelled base costs and enhancement costs in real terms using the CPIH adjustments and 2017/18 as bease year
	foreach x of varlist botex* S3020* S3021* S3023*{
		replace `x'	= `x' * cpihadj
		rename  `x' real`x'
	}  
 
 	
   *Now smooth real enhancement costs	
	foreach x of varlist realS3020* realS3021* realS3023*{
		bysort companycode: egen smoothed`x' = mean(`x') if inlist(financialyear,"2011-12","2012-13","2013-14","2014-15","2015-16","2016-17","2017-18")
		replace smoothed`x' = `x' if smoothed`x' == .
		}	
	
	replace realbotexswc 	= realbotexswc 	+ 	smoothedrealS3020SC 			+ 	smoothedrealS3021SC 			+	smoothedrealS3023SC
	replace realbotexswt	= realbotexswt 	+	smoothedrealS3020ST 			+ 	smoothedrealS3021ST 			+ 	smoothedrealS3023ST
	replace realbotexbr		= realbotexbr	+	smoothedrealS3020bioresources 	+ 	smoothedrealS3021bioresources 	+ 	smoothedrealS3023bioresources
	replace realbotexbrp 	= realbotexbr	+ 	realbotexswt
	replace realbotexnpww 	= realbotexswc 	+ 	realbotexswt
	replace realbotexwww 	= realbotexnpww	+ 	realbotexbr 
	
	*Drop enhancement costs 
	drop realS3020* realS3021* realS3023* smoothed* 
 
 
   *Transform variables to log. We don't trasform those variables expressed in percentage. 
    
	  *To allow for flexibility we generate two temporary variables that mark the start and end of the range to be converted to log
	g temp1=1
	g temp2=1
	
	  *Then place those variables in the left that are not being transformed 
	order companycode financialyear cpihadj pct* temp1
	
	  *Transform to natural logs
	foreach x of varlist temp1-temp2 {
		g ln`x' = ln(`x')
	} 

	drop temp* lntemp*    
  
  
 
*================================================== 
 *3.- We structure the dataset in a panel dataset format 
 *================================================== 

    *First assign numerical values to company codes and financial years:
		
		*For company code:
  encode companycode, g (id)
		
		*For financial year:
  g year=real(substr(financialyear,1,4))+1
          
		*Then we set the dataset as panel
  xtset id year
 
        *Even though we do not use any time trend or years dummies in our models for consultation, we generate these variables for the reader to test.  
  		 *Time trend
  egen timetrend 				= group(financialyear)
		 *Time dummies
  tab  year, g(dummyyear)
  
		*Also create our codecombine (id year) combination for references. We will need this variable for influential observations identification
  egen codecombine = concat(companycode year)
 
 		*We don't consider the forecast period for econometric models:
  drop if year > 2018
		
		*We also do not consider HDD or SVE for 2017/18
  drop if inlist(companycode,"SVE","HDD")
  
        *The relevant data can be exported into Excel as a QA check with our calculations of costs and costs drivers in feeder model 1. 
  export excel companycode financialyear codecombine realbotex* sewerlength sludgeprod load pumpingcapperlength pctnh3below3mg pctbands13 pctbands6 swtwperpro density wedensitywastewater using "WWW - Costs and costs drivers.xlsx", sheet("Costs and costs drivers") sheetmodify firstrow(variables) /*In this command you have to write the right location of the master file where you save it*/


 *=================================================================
 *4.- Calculate correlation coefficients and export them into Excel 
 *================================================================= 
corr *
putexcel set "WWW Correlation Matrix.xls", replace /*In this command you have to write the right location of the master file where you save it*/
putexcel A1 = matrix(r(C)), names   

 *=================================================================
 *5.- Prepare the macros for different regression specifications 
 *=================================================================  

 * Label variables for easy understanding
	
	*Dependent variables
	label var realbotexbr 			"Bioresources botex, £m"
	label var realbotexswt 			"Sewage treatment botex, £m"
	label var realbotexswc 			"Sewage collection botex, £m"
	label var realbotexbrp 			"Bioresources plus botex, £m"
	label var realbotexnpww 		"Wastewater network plus botex, £m"
	label var realbotexwww 			"Wholesale wastewater botex, £m"
	
	*Independent variables
	label var properties 			"Number of connected properties"
	label var sludgeprod 			"Sludge produced"
	label var load 					"Load entering the STWs"
	label var sewerlength 			"Total sewer length"
	label var pumpingcapperlength 	"Pumping capacity/km of sewer"
	label var pctnh3below3mg 		"Percentage of load with ammonia below 3mg/l"
	label var swtwperpro 			"Number of STWs per property"
	label var pctbands13 			"Percentage of load treated in STWs bands 1-3"
	label var pctbands6 			"Percentage of load treated in STWs band 6"
	label var density 				"Number of properties per km of sewer length"
	label var wedensitywastewater 	"Weighted average population density"

	* Sewage Collection
	global reg1  lnrealbotexswc lnsewerlength 	lnpumpingcapperlength lndensity
	global reg2  lnrealbotexswc lnsewerlength 	lnpumpingcapperlength lnwedensitywastewater

	* Sewage Treatment*
	global reg3  lnrealbotexswt  lnload		 	pctbands13 pctnh3below3mg
	global reg4  lnrealbotexswt  lnload		 	pctbands6  pctnh3below3mg

	* Bioresources*
	global reg5  lnrealbotexbr 	 lnsludgeprod 	lnwedensitywastewater	pctbands13
	global reg6  lnrealbotexbr 	 lnsludgeprod 	lnswtwperpro

	* Bioresources plus *
	global reg7  lnrealbotexbrp  lnload		 	pctbands13 				pctnh3below3mg
	global reg8  lnrealbotexbrp  lnload		 	pctbands6  				pctnh3below3mg	

	
 *=================================================================	
 *6.- Run regressions and tests
 *=================================================================
 
   *6a) Pooled OLS (regressions have been added for reference but we opted to use random effects in all specifications)
forvalues i = 1(1)8{
	
	*Run Pooled OLS with clustered standard errors at company level
	eststo  ols`i'   : reg ${reg`i'}, vce(cluster id) 
		
	*Adjusted R squared
	estadd scalar R_squared = e(r2_a): ols`i'

	*VIF test
	vif
	estadd scalar VIF_statistic = r(vif_1): ols`i'	
 
	*RESET Test
	ovtest
	estadd scalar RESET_P_value = r(p): ols`i' 
	
	*Add title to the regression
	estadd local Econometric_model = "Pooled OLS": ols`i'

	*Drop irrelevant variables created in the loop
	keep companycode - codecombine
}

   *6b) Random effects

forvalues i = 1(1)8{
	
	*Run Pooled random effect regressions with clustered standard errors at company level and store results. 
	eststo re`i': xtreg ${reg`i'}, re vce(cluster id)
	
	*We will need predicted and actual costs in levels for the RESET test.
	predict yhatre`i'		, xb
 	replace yhatre`i'     	= exp(yhatre`i')
	
	*Overall R squared
	estadd scalar R_squared = e(r2_o): re`i'

	*The following code lines calculates the RESET test as Stata does as this is not available for RE models.
	quietly sum yhatre`i'						  
	g nmlz_xb`i' 	= (yhatre`i'-r(mean))/r(sd)  	  
	g xb2`i' 		= (nmlz_xb`i')^2
	g xb3`i' 		= (nmlz_xb`i')^3
	g xb4`i' 		= (nmlz_xb`i')^4
	xtreg ${reg`i'} xb2`i' xb3`i' xb4`i'
	test 			xb2`i' xb3`i' xb4`i'
	estadd scalar RESET_P_value = r(p): re`i'	

	*Add title to the regression
	estadd local Econometric_model = "Random Effects": re`i'
	
	*Drop irrelevant variables
	keep companycode - codecombine
}
   
 *==================================================
 *7.- Export results and matrices into Excel	
 *==================================================
 	*7a) Export a matrix for the coefficients that are used used in feeder models 2 and 4.
	estout * 		using "WWW coefficients.xls"	, replace /// /*In this command you have to write the right location of the master file where you save it*/
					stats(depvar)
	*7b) Export coefficients, significance levels and statistical tests into Excel 
	estout * 		using "WWW results.xls"		, replace /// /*In this command you have to write the right location of the master file where you save it*/
					cells(b(star fmt(3)) p(par({ }))) starlevels( * 0.10 ** 0.05 *** 0.010) 								  ///
					stats(Econometric_model depvar N vce R_squared VIF_statistic RESET_P_value) mlabels(,titles)
	eststo clear
	

 *==================================================
 *8.-Add a day and date stamp
 *==================================================
 
	g run_time_stamp = "$S_TIME" 
	g run_date_stamp = "$S_DATE"
	
export excel run_date_stamp run_time_stamp using "WWW coefficients datestamp.xls", cell (A1) sheetmodify firstrow(variables) /*In this command you have to write the right location of the master file where you save it*/