*In this do file we present the econometric models for IAP in PR19
*This do file has the following stages:

 *1.- Data preparation
  
 *2.- Variables generation using codes. Each variable code has a label assigned to it.  
   
   *2a.- Dependent variables (costs)
   *2b.- Independent variables (cost drivers)
 
 *3.- Structure the dataset in a panel dataset format  
 
 *4.- Calculate correlation coefficients and export them into Excel 
 
 *5.- Prepare the macros for different regression specifications 
 
 *6.- Run regressions and tests
   *6a) Pooled OLS
   *6b) Random effects
    
 *7.- Export the following results:
	*7a) Export a matrix for the coefficients that are used in feeder models 2 and 4.
	*7b) Export coefficients, significance levels and statistical tests into Excel 

 *8.- Add a day and date stamp


*************************************************************************************************************************************************************************************


 *======================================================================================== 
 *1.- Data preparation
 *========================================================================================

   *1a.- Import the master dataset from Excel and assign labels and codes for all variables. 
clear
matrix drop _all
set more off
set matsize 800
estimates drop _all
macro drop _all

import excel "FM_WWW1.xlsx", sheet("Stata dataset - wastewater") /*In this command you have to write the right location of the master file where you save it*/

	*The first row of the dataset display variables' historical codes for the period between 2011/12 and 2017/18. 
	*The second row displays the variables' updated codes corresponding to the period 2018/19 to 2020/21 to 2024/25.  
	*The third row provides a corresponding label for each variable. 

	*Firstly, we eliminate the historical codes 
drop in 2 	
	
	*Next, we assign names and labels to the first and second rows, respectively, to each variable using the following loop: 
	
foreach var of varlist * {
  label  variable  `var' "`=`var'[2]'"
  rename `var' `=`var'[1]'
}

	*As we have identified the variable names and codes for each indicator, we no longer need the first two rows:
drop in 1/2
	
	*Then we can destring all variables 
	 *We generate two variables, "start" and "end" so we always destring the relevant variables no matter what order they are in:
g start		= ""
g end		= ""
order 		companycode codecombine financialyear start
destring 	start-end, replace force
drop 		start end


*========================================================================================
 *2.- Variables generation using the codes. Each variable code has a label assigned to it.  
 *======================================================================================== 
   
    *2a.- Dependent variables (costs)
	  
	  *Base costs (botex) are defined as:
	   
	   *The sum of the following cost categories:
	  
		*a) Power
		*b) Income treated as negative expenditure for bioresources ONLY
		*c) Service charges/ Discharge consents
		*d) Bulk discharge
		*e) Renewals expensed in year - infra
		*f) Renewals expensed in year - noninfra
		*g) Other operating expenditure excluding renewals
		*h) Maintaining the long term of capability of the assets - infra
		*i) Maintaining the long term of capability of the assets - noninfra
		*j) Transfer private sewers and pumping stations
		
	   *Minus the following cost categories:
		
		*i) Traffic Management Act
		*j) Industrial Emissions Directorate
		
		*Based on the description above, we used the following codes to calculate base costs (botex)
	  
	  *The following acronyms are used for each level of aggregation:
	    *botex stands for base costs used for econometric modelling
	    *swc: 	sewage collection
	    *swt: 	sewage treatment
	    *br: 	bioresources
	    *brp: 	bioresources plus (which includes bioresources and sewage treatment)
	    *npww: 	network plus wastewater
	    *www:   wholesale wastewater
		*As an example, botexwww stands for wholesale wastewater base costs	
	
	  *Before calculating costs please note that data for trasnfer private sewer and pumping stations costs for some companies are blank. This should be read as zero and not as missing.
	replace S3024bioresources=0 if S3024bioresources==. & financialyear=="2017-18"

	g botexswc		=	(WWS1001SC		   					   + WWS1003SC		   	+ WWS1004SC		   	+ WWS1005SC 			+ WWS1006SC 		   + WWS1007SC 		    + WWS1012SC 		  + WWS1013SC 		    + S3024SC)  			- (W3032NPSC / 1000) - (W3036NPSC / 1000)
	g botexswt		=	(WWS1001ST		   					   + WWS1003ST		   	+ WWS1004ST		   	+ WWS1005ST 			+ WWS1006ST 		   + WWS1007ST 		    + WWS1012ST 		  + WWS1013ST 		    + S3024ST)  			- (W3032NPST / 1000) - (W3036NPST / 1000)
	g botexbr		=	(BM802bioresources + BM831bioresources + BM836bioresources	+ BM140bioresources + BM839Ibioresources 	+  BM839NIbioresources + BM839Obioresources + BC30945bioresources + CS00036bioresources + S3024bioresources)  	- (W3032SL   / 1000) - (W3036SL	  / 1000) 
	g botexbrp 		= 	botexbr	 	+ botexswt
	g botexnpww 	=   botexswc 	+ botexswt
	g botexwww 		= 	botexnpww 	+ botexbr

    *2b.- Independent variables (cost drivers)	
	   
	   *Independent variables have the following abreviation:
	      *Those starting with "pct" refer to a variables expressed in percentage terms. For these variables we use a scale from 0 to 100. So 80 means 80%. 
		  *Variables are transformed in logs with the exception of the percent variables. 

       *Scale variables
	  
	      *Number of properties
	g properties	=   BN1178 * 1000
	      *Sludge produced
	g sludgeprod	= 	MP05611
	      *Load
	g load			=   STWDA126   
	      *Sewer length
	g sewerlength 	=	BN13535 + BN13528  
	
	   *Specific network characteristics variables for each element of the supply chain
	
	     *Sewage collection 
	 		
		   *Pumping station capacity per sewer lenght
	g pumpingcapperlength	= S4029 / sewerlength
	
	     *Sewage treatment 
	   	   *% of load with ammonia consent below 3mg
	g pctnh3below3mg	= ((STWDA121 + STWDA122) / (STWDA126)) * 100
	  	   *% of load treated at bands 1-3 (smaller bands tend to have higher unit cost)
	g pctbands13	 	= ((STWDA006+STWDA020+STWDA034) / (load)) * 100
	  	   *% of load treated at bands 6 (larger bands tend to have lower unit cost due to economies of scale)
	g pctbands6			= ((STWDA106) / (load)) * 100
			   *Number of sewage treatment works per property (sewage treatment)
	g swtwperpro		= 	STWC115 / properties
	
			
   	     *Properties density
	g density		 	= properties / sewerlength  


  *The next step is to keep the relevant variables for logs and real terms transformation where applicable.	
	
keep  botex*											/// Dependent variables	
	  properties sewerlength sludgeprod load			/// Scale variables
	  pumpingcapperlength 								/// Complexity, sewage collection
	  pctnh3below3mg pctbands13 pctbands6 swtwperpro	/// Complexity, sewage treatment	
	  density wedensitywastewater						/// Density (wedensitywastewater stands for weighted average density measyre developed by Ofwat)
	  cpihadj companycode financialyear					/// Variables to set panel dataset as well as cpih adjustments (cphadj) to deflate costs.
  
  *Convert costs real terms using the CPIH adjustments and 2017/18 as bease year
	foreach x of varlist botex*{
		replace `x'	= `x' * cpihadj
		rename  `x' real`x'
	}  
 
  *Transform variables to log. We don't trasform the variables expressed in percentage. 
    
	  *To allow for flexibility we generate two temporary variables that mark the start and end of the range to be converted to log
	g temp1=1
	g temp2=1
	
	  *Then place those variables in the left that are not being transformed 
	order companycode financialyear cpihadj pct* temp1
	
	  *Transform to natural logs
	foreach x of varlist temp1-temp2 {
		g ln`x' = ln(`x')
	} 

	drop temp* lntemp*    
  
  
 *================================================== 
 *3.- We structure the dataset in a panel dataset format 
 *================================================== 

    *First assign numerical values to company codes and financial years:
		
		*For company code:
  encode companycode, g (id)
		
		*For financial year:
  g year=real(substr(financialyear,1,4))+1
          
		*Then we set the dataset as panel
  xtset id year
 
        *Even though we do not use any time trend or years dummies in our models for consultation, we generate these variables for the reader to test.  
  		 *Time trend
  egen timetrend 				= group(financialyear)
		 *Time dummies
  tab  year, g(dummyyear)
  
		*Also create our codecombine (id year) combination for references. We will need this variable for influential observations identification
  egen codecombine = concat(companycode year)
 
 		*We don't consider the forecast period for econometric models:
  drop if year > 2018
		
		*We also do not consider HDD or SVE for 2017/18
  drop if inlist(companycode,"SVE","HDD")
  
        *The relevant data can be exported into Excel as a QA check with our calculations of costs and costs drivers in feeder model PR19CA005
		 
  export excel companycode financialyear codecombine realbotex* sewerlength sludgeprod load pumpingcapperlength pctnh3below3mg pctbands13 pctbands6 swtwperpro density wedensitywastewater using "WWW - Costs and costs drivers.xlsx", sheet("Costs and costs drivers") sheetreplace firstrow(variables) /*In this command you have to write the right location you save it*/


 *=================================================================
 *4.- Calculate correlation coefficients and export them into Excel 
 *================================================================= 
corr *
putexcel set "WWW Correlation Matrix.xls", replace /*In this command you need to choose location where you want to save it*/
putexcel A1 = matrix(r(C)), names   

 *=================================================================
 *5.- Prepare the macros for different regression specifications 
 *=================================================================  

 * Label variables for easy understanding
	
	*Dependent variables
	label var realbotexbr 			"Bioresources botex, £m"
	label var realbotexswt 			"Sewage treatment botex, £m"
	label var realbotexswc 			"Sewage collection botex, £m"
	label var realbotexbrp 			"Bioresources plus botex, £m"
	label var realbotexnpww 		"Wastewater network plus botex, £m"
	label var realbotexwww 			"Wholesale wastewater botex, £m"
	
	*Independent variables
	label var properties 			"Number of connected properties"
	label var sludgeprod 			"Sludge produced"
	label var load 					"Load entering the STWs"
	label var sewerlength 			"Total sewer length"
	label var pumpingcapperlength 	"Pumping capacity/km of sewer"
	label var pctnh3below3mg 		"Percentage of load with ammonia below 3mg/l"
	label var swtwperpro 			"Number of STWs per property"
	label var pctbands13 			"Percentage of load treated in STWs bands 1-3"
	label var pctbands6 			"Percentage of load treated in STWs band 6"
	label var density 				"Number of properties per km of sewer length"
	label var wedensitywastewater 	"Weighted average population density"

	* Sewage Collection
	global reg1  lnrealbotexswc lnsewerlength 	lnpumpingcapperlength lndensity
	global reg2  lnrealbotexswc lnsewerlength 	lnpumpingcapperlength lnwedensitywastewater

	* Sewage Treatment*
	global reg3  lnrealbotexswt  lnload		 	pctbands13 pctnh3below3mg
	global reg4  lnrealbotexswt  lnload		 	pctbands6  pctnh3below3mg

	* Bioresources*
	global reg5  lnrealbotexbr 	 lnsludgeprod 	lnwedensitywastewater
	global reg6  lnrealbotexbr 	 lnsludgeprod 	lnswtwperpro

	* Bioresources plus *
	global reg7  lnrealbotexbrp  lnload		 	pctbands13 				pctnh3below3mg
	global reg8  lnrealbotexbrp  lnload		 	pctbands6  				pctnh3below3mg	
		
 *=================================================================	
 *6.- Run regressions and tests
 *=================================================================
 
   *6a) Pooled OLS (regressions have been added for reference but we opted to use random effects in all specifications)
forvalues i = 1(1)8{
	
	*Run Pooled OLS with clustered standard errors at company level
	eststo  ols`i'   : reg ${reg`i'}, vce(cluster id) 
		
	*Adjusted R squared
	estadd scalar R_squared = e(r2_a): ols`i'

	*VIF test
	vif
	estadd scalar VIF_statistic = r(vif_1): ols`i'	
 
	*RESET Test
	ovtest
	estadd scalar RESET_P_value = r(p): ols`i' 
	
	*Add title to the regression
	estadd local Econometric_model = "Pooled OLS": ols`i'

	*Drop irrelevant variables created in the loop
	keep companycode - codecombine
}

   *6b) Random effects

forvalues i = 1(1)8{
	
	*Run Pooled random effect regressions with clustered standard errors at company level and store results. 
	eststo re`i': xtreg ${reg`i'}, re vce(cluster id)
	
	*We will need predicted and actual costs in levels for the RESET test.
	predict yhatre`i'		, xb
 	replace yhatre`i'     	= exp(yhatre`i')
	
	*Overall R squared
	estadd scalar R_squared = e(r2_o): re`i'

	*The following code lines calculates the RESET test as Stata does as this is not available for RE models.
	quietly sum yhatre`i'						  
	g nmlz_xb`i' 	= (yhatre`i'-r(mean))/r(sd)  	  
	g xb2`i' 		= (nmlz_xb`i')^2
	g xb3`i' 		= (nmlz_xb`i')^3
	g xb4`i' 		= (nmlz_xb`i')^4
	xtreg ${reg`i'} xb2`i' xb3`i' xb4`i'
	test 			xb2`i' xb3`i' xb4`i'
	estadd scalar RESET_P_value = r(p): re`i'	

	*Add title to the regression
	estadd local Econometric_model = "Random Effects": re`i'
	
	*Drop irrelevant variables
	keep companycode - codecombine
}
   
 *==================================================
 *7.- Export results and matrices into Excel	
 *==================================================
 	*7a) Export a matrix for the coefficients that are used used in PR19CA06 and PR19CA08.
	estout * 		using "WWW coefficients.xls"	, replace /// /*In this command you need to choose location where you want to save it*/
					stats(depvar)
	*7b) Export coefficients, significance levels and statistical tests into Excel 
	estout * 		using "WWW results.xls"		, replace /// /*In this command you need to choose location where you want to save it*/
					cells(b(star fmt(3)) p(par({ }))) starlevels( * 0.10 ** 0.05 *** 0.010) 								  ///
					stats(Econometric_model depvar N vce R_squared VIF_statistic RESET_P_value) mlabels(,titles)
	eststo clear
	

 *==================================================
 *8.-Add a day and date stamp
 *==================================================
 
	g run_time_stamp = "$S_TIME" 
	g run_date_stamp = "$S_DATE"
	
export excel run_date_stamp run_time_stamp using "WWW coefficients datestamp.xls", cell (A1) sheetmodify firstrow(variables) /*In this command you need to choose location where you want to save it*/