*In this do file we present the wholesale water econometric models for PR19 Final determinations
*This do file has the following stages:
*1.- Data preparation
*1a.- Import the master dataset from Excel and assign labels and codes for all variables.
*1b.- Aggregate data for SWB from 2016/17 onwards and save a temporary file.
*In the period from 2011/12 to 2015/16 South West and Bournemouth Water are treated as separate entities.
*1c.- Incorporate the temporary file into the master file.
*2.- Variables generation using codes. Each variable code has a label assigned to it.
*2a.- Dependent variables (costs)
*2b.- Independent variables (cost drivers)
*3.- Structure the dataset in a panel dataset format
*4.- Calculate correlation coefficients and export them into Excel
*5.- Prepare the macros for different regression specifications
*6.- Run regressions and tests
*6a) Pooled OLS
*6b) Random effects
*7.- Export the following results:
*7a) Export a matrix for the coefficients that are used in feeder models 2 and 4.
*7b) Export coefficients, significance levels and statistical tests into Excel
*8.- Add a time and date stamp
*************************************************************************************************************************************************************************************
*========================================================================================
*1.- Data preparation
*========================================================================================
*1a.- Import the master dataset from Excel and assign labels and codes for all variables.
clear
matrix drop _all
set more off
set matsize 800
estimates drop _all
macro drop _all
import excel "FM_WW1.xlsx", sheet("Stata dataset - wholesale water") /*In this command you need to choose the location of the file you want to open*/
*The first row of the dataset displays the variables' updated codes, corresponding to the period from 2018/19 to 2024/25.
*The second row displays the variables' historical codes for the period between 2011/12 and 2017/18.
*The third row provides a corresponding label for each variable.
*Firstly, we eliminate the historical codes
drop in 2
*Next, we assign names and labels to each variable using the first and second remaining rows respectively, via the following loop:
foreach var of varlist * {
	label variable `var' "`=`var'[2]'"
	rename `var' `=`var'[1]'
}
*As we have now assigned the variable names and labels, we no longer need the first two rows:
drop in 1/2
*Then we can destring all variables.
*We generate two variables, "start" and "end", so that we always destring the relevant variables regardless of the variable order:
g start = ""
g end = ""
order companycode codecombine financialyear start
destring start-end, replace force
drop start end
*Next we rename the variables that came from external sources:
rename C_CD0001_PR19WW1 popdensity
rename C_CD0002_PR19WW1 popsparsity
rename C_CD0005_PR19WW1 wedensitywater
rename C_CD0022_PR19WW1 welogdensitywater
rename C_CD0023_PR19WW1 welogsqdensitywater
rename C_CD0011_PR19WW1 populationwater
rename C_CD0012_PR19WW1 rwasoc1
rename C_CD0013_PR19WW1 rwasoc2
rename C_CD0014_PR19WW1 cpihadj
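*Optional check (illustrative addition, not part of the original feeder model): after the
*destring step only the identifier variables should still be stored as strings, so listing
*the remaining string variables flags any stray text left in the numeric columns.
ds, has(type string)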
*1b.- BWH and SWT are separate entities from 2011/12 to 2015/16. In 2016/17 they merge to form SWB.
*For modelling purposes, we consider the merging companies as a merged entity (SWB) from 2016/17 onwards (i.e. the merging years) and as separate entities (BWH and SWT) for the previous years.
*However, the regional wage and density measures developed by Ofwat are reported separately for BWH and SWT in the merging years.
*We estimate the regional wage and density measures for SWB in the merging years as the weighted average of BWH and SWT.
*We use our population estimates for both companies as weights to determine the size of each company.
*The following preserve/restore block allows us to compute the weights for SWT and BWH in the merging years, aggregate the data, and save a temporary file.
*This temporary file is then appended to the master file.
preserve
*Keep only SWT, BWH and SWB data for the merging years
keep if inlist(companycode,"SWT","BWH","SWB") & inlist(financialyear,"2016-17","2017-18","2018-19")
*Assign the merged company code to all observations
replace companycode = "SWB"
*Next, place the string identifiers (companycode, codecombine, financialyear) and the wage and density measures on the left hand side.
order companycode codecombine financialyear popdensity popsparsity rwasoc1 rwasoc2 we* cpih*
*Now we calculate the weighted averages
foreach x of varlist popdensity popsparsity rwasoc1 rwasoc2 we* {
	g product`x' = (populationwater)*`x'
	bysort financialyear: egen sumproduct`x' = sum(product`x')
	bysort financialyear: egen sumactivity`x' = sum(populationwater)
	g wa`x' = sumproduct`x'/sumactivity`x'
	replace wa`x' = . if wa`x' == 0
	replace `x' = wa`x'
	drop product* sumproduct* sumactivity* wa*
}
*Afterwards, we sum up all the remaining variables (keeping only one data point for the weighted averages computed above) and save a temporary file
collapse (sum) WS1001WR-populationwater (min) popdensity popsparsity rwasoc1 rwasoc2 cpihadj we*, by(companycode financialyear)
tempfile SWB
save `SWB', replace
restore
*1c.- We append the SWB data to the master file.
*Before doing so we need to remove the existing disaggregated data for BWH, SWT and SWB in the merging years to avoid duplication
drop if inlist(companycode,"SWT","BWH","SWB") & inlist(financialyear,"2016-17","2017-18","2018-19")
append using `SWB'
*We check the number of observations per company and financial year to show there are observations for SWB only from 2016/17 onwards.
*BWH and SWT therefore contribute no observations to the sample for the econometric models from 2016/17 onwards.
tab companycode financialyear, m
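*Optional check (illustrative addition, not part of the original feeder model): list the
*appended SWB observations for the merging years to confirm that the population-weighted
*wage and density measures are populated.
list companycode financialyear wedensitywater rwasoc1 rwasoc2 if companycode == "SWB" & inlist(financialyear,"2016-17","2017-18","2018-19"), noobs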
*========================================================================================
*2.- Variables generation using the codes. Each variable code has a label assigned to it.
*========================================================================================
*2a.- Dependent variables (costs)
*Modelled base costs are defined as:
*The sum of the following cost categories:
*a) Power (WS1001)
*b) Income treated as negative expenditure (WS1002)
*c) Bulk supply (WS1004)
*d) Renewals expensed in year - infra (WS1005)
*e) Renewals expensed in year - noninfra (WS1006)
*f) Other operating expenditure excluding renewals (WS1007)
*g) Maintaining the long term capability of the assets - infra (WS1012)
*h) Maintaining the long term capability of the assets - noninfra (WS1013)
*i) New developments enhancement costs (W3009)
*j) New connections enhancement costs (WS2004)
*k) Addressing low pressure enhancement costs (W3002)
*Minus the following cost categories:
*l) Traffic Management Act (W3032)
*m) Statutory water softening (W3036)
*Minus the diversion costs for treated water distribution only:
*n) Diversions expenditure - NRSWA (APP28RR_W0002)
*o) Diversions expenditure - Other non-s185 (APP28RR_W0003)
*Based on the description above, we use the following codes to estimate modelled base costs (botex).
*The following acronyms are used for each level of aggregation:
*botex stands for base costs used for econometric modelling
*wr: water resources
*twd: treated water distribution
*npw: network plus water
*wrp: water resources plus (which includes water resources, raw water distribution and water treatment)
*ww: wholesale water
*As an example, botexww stands for wholesale water base costs
g botexwr = WS1001WR + WS1002WR + WS1004WR + WS1005WR + WS1006WR + WS1007WR + WS1012WR + WS1013WR + W3009WR + WS2004WR + W3002WR - W3032WR - W3036WR
g botextwd = WS1001TWD + WS1002TWD + WS1004TWD + WS1005TWD + WS1006TWD + WS1007TWD + WS1012TWD + WS1013TWD + W3009TWD + WS2004TWD + W3002TWD - W3032TWD - W3036TWD - APP28RR_W0002 - APP28RR_W0003
g botexww = WS1001CAW + WS1002CAW + WS1004CAW + WS1005CAW + WS1006CAW + WS1007CAW + WS1012CAW + WS1013CAW + W3009CAW + WS2004CAW + W3002CAW - W3032CAW - W3036CAW - APP28RR_W0002 - APP28RR_W0003
g botexnpw = botexww - botexwr
g botexwrp = botexww - botextwd
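*Optional check (illustrative addition, not part of the original feeder model): the base cost
*variables are converted to logs further below, so zero or negative botex values would drop out
*of the estimation sample. The loop below flags any such observations.
foreach x of varlist botex* {
	quietly count if `x' <= 0 & !missing(`x')
	if r(N) > 0 display as error "`x': " r(N) " non-positive observation(s)"
}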
*2b.- Independent variables (cost drivers)
*Independent variables have the following terminology:
*Those starting with "pct" refer to variables expressed in percentage terms. For these variables we use a scale from 0 to 100, so 80 means 80%.
*Variables are transformed into logs with the exception of the percentage variables.
*Scale variables
*Number of connected properties (for all different supply chain levels)
g properties = (BN2221 + BN2161) * 1000
*Lengths of main (for twd, npw and ww)
g lengthsofmain = BN1100 + BN11590
*Specific network characteristics variables for each element of the supply chain
*Water treatment
*Proportion (%) of water treated in water treatment works with complexity levels 3-6
g watertreated = CPMW0098 + CPMW0104 + CPMW0110 + CPMW0116 + CPMW0165 + CPMW0166 + CPMW0167 + CPMW0027 + CPMW0033 + CPMW0039 + CPMW0045 + CPMW0185 + CPMW0197 + CPMW0198
g watertreated36 = CPMW0116 + CPMW0165 + CPMW0166 + CPMW0167 + CPMW0045 + CPMW0185 + CPMW0197 + CPMW0198
g pctwatertreated36 = (watertreated36 / watertreated) * 100
*Weighted average level of treatment complexity (WAC)
g wac = (1*(CPMW0098+CPMW0027)/watertreated) + (2*(CPMW0104+CPMW0033)/watertreated) + (3*(CPMW0110+CPMW0039)/watertreated) + (4*(CPMW0116+CPMW0045)/watertreated) + (5*(CPMW0165+CPMW0185)/watertreated) + (6*(CPMW0166+CPMW0197)/watertreated) + (7*(CPMW0167+CPMW0198)/watertreated)
*Treated water distribution
*Number of boosters per length of main as a proxy for pumping activity
g boosterperlength = BN11390 / lengthsofmain
*The next step is to keep the relevant variables for the log and real terms transformations where applicable.
keep botex* /// Dependent variables
	properties lengthsofmain /// Scale variables
	pctwatertreated36 wac /// Complexity, water treatment
	boosterperlength /// Complexity, treated water distribution
	wedensitywater /// Density (wedensitywater stands for the weighted average density measure developed by Ofwat)
	cpihadj companycode financialyear // Variables to set the panel dataset as well as the cpih adjustment (cpihadj) to deflate costs
*Convert costs to real terms using the CPIH adjustments and 2017/18 as base year
foreach x of varlist botex* {
	replace `x' = `x' * cpihadj
	rename `x' real`x'
}
*Transform variables to logs. We do not transform the variables expressed in percentages.
*To allow for flexibility we generate two temporary variables that mark the start and end of the range to be converted to logs
g temp1 = 1
g temp2 = 1
*Then place the variables that are not being transformed on the left
order companycode financialyear cpihadj pct* temp1
*Transform to natural logs
foreach x of varlist temp1-temp2 {
	g ln`x' = ln(`x')
}
drop temp* lntemp*
*Calculate the quadratic term for weighted average density
g lnwedensitywater2 = lnwedensitywater^2
*==================================================
*3.- We structure the dataset in a panel dataset format
*==================================================
*First assign numerical values to company codes and financial years:
*For company code:
encode companycode, g(id)
*For financial year:
g year = real(substr(financialyear,1,4)) + 1
*Then we set the dataset as a panel
xtset id year
*Even though we do not use any time trend or year dummies in our models, we generate these variables for readers to test.
*Time trend
egen timetrend = group(financialyear)
*Time dummies
tab year, g(dummyyear)
*Also create our codecombine (company code and year) combination for reference. We will need this variable to identify influential observations
egen codecombine = concat(companycode year)
*We do not consider the forecast period for the econometric models:
drop if year > 2019
*We also do not consider HDD or SVE for 2017/18 and 2018/19
drop if inlist(companycode,"SVE","HDD")
*The relevant data can be exported into Excel as a QA check against our calculations of costs and cost drivers in feeder model 1.
export excel companycode financialyear codecombine realbotex* properties lengthsofmain pctwatertreated36 wac boosterperlength wedensitywater using "WW - Costs and costs drivers.xlsx", sheet("Costs and costs drivers") sheetmodify firstrow(variables) /*In this command you need to choose the location where you want to save it*/
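*Optional check (illustrative addition, not part of the original feeder model): describe the
*panel pattern after the sample restrictions above to confirm which company-year observations
*remain in the estimation sample.
xtdescribe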
*=================================================================
*4.- Calculate correlation coefficients and export them into Excel
*=================================================================
corr *
putexcel set "WW Correlation Matrix.xls", replace /*In this command you need to choose the location where you want to save it*/
putexcel A1 = matrix(r(C)), names
*=================================================================
*5.- Prepare the macros for different regression specifications
*=================================================================
*Label the variables for easy understanding:
*Dependent variables
label var lnrealbotexwrp "Water resources plus botex in logs"
label var lnrealbotextwd "Treated water distribution botex in logs"
label var lnrealbotexww "Wholesale water botex in logs"
*Independent variables
label var lnproperties "Number of connected properties in logs"
label var lnlengthsofmain "Lengths of main in logs"
label var pctwatertreated36 "Percent of water treated in water treatment works with complexity levels 3 to 6"
label var lnwac "Water treatment complexity index in logs"
label var lnboosterperlength "Number of booster pumping stations per length of main in logs"
label var lnwedensitywater "Weighted average population density in logs"
label var lnwedensitywater2 "Squared term of weighted average population density in logs"
*Water resources plus
global reg1 lnrealbotexwrp lnproperties pctwatertreated36 lnwedensitywater lnwedensitywater2
global reg2 lnrealbotexwrp lnproperties lnwac lnwedensitywater lnwedensitywater2
*Treated water distribution
global reg3 lnrealbotextwd lnlengthsofmain lnboosterperlength lnwedensitywater lnwedensitywater2
*Wholesale water
global reg4 lnrealbotexww lnproperties pctwatertreated36 lnboosterperlength lnwedensitywater lnwedensitywater2
global reg5 lnrealbotexww lnproperties lnwac lnboosterperlength lnwedensitywater lnwedensitywater2
*=================================================================
*6.- Run regressions and tests
*=================================================================
*6a) Pooled OLS (these regressions are included for reference; we opted to use random effects in all specifications)
forvalues i = 1(1)5 {
	*Run pooled OLS with clustered standard errors at company level
	eststo ols`i': reg ${reg`i'}, vce(cluster id)
	*Adjusted R squared
	estadd scalar R_squared = e(r2_a): ols`i'
	*VIF test
	vif
	estadd scalar VIF_statistic = r(vif_1): ols`i'
	*RESET test
	ovtest
	estadd scalar RESET_P_value = r(p): ols`i'
	*Add title to the regression
	estadd local Econometric_model = "Pooled OLS": ols`i'
	*Drop any variables created in the loop
	keep companycode - codecombine
}
*6b) Random effects
forvalues i = 1(1)5 {
	*Run random effects regressions with clustered standard errors at company level and store results.
	eststo re`i': xtreg ${reg`i'}, re vce(cluster id)
	*We will need predicted costs in levels for the RESET test.
	predict yhatre`i', xb
	replace yhatre`i' = exp(yhatre`i')
	*Overall R squared
	estadd scalar R_squared = e(r2_o): re`i'
	*The following lines calculate the RESET test in the same way Stata does, as this test is not available for RE models.
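	*(Illustrative explanatory note, not part of the original file) The steps below normalise the
	*predicted costs in levels, add their second, third and fourth powers as extra regressors, and
	*jointly test whether these added terms are zero; a low p-value points to possible functional
	*form misspecification.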
	quietly sum yhatre`i'
	g nmlz_xb`i' = (yhatre`i' - r(mean))/r(sd)
	g xb2`i' = (nmlz_xb`i')^2
	g xb3`i' = (nmlz_xb`i')^3
	g xb4`i' = (nmlz_xb`i')^4
	xtreg ${reg`i'} xb2`i' xb3`i' xb4`i'
	test xb2`i' xb3`i' xb4`i'
	estadd scalar RESET_P_value = r(p): re`i'
	*Add title to the regression
	estadd local Econometric_model = "Random Effects": re`i'
	*Drop the variables created in the loop
	keep companycode - codecombine
}
*==================================================
*7.- Export results and matrices into Excel
*==================================================
*7a) Export a matrix for the coefficients that are used in feeder models 2 and 4.
estout * using "WW coefficients.xls", replace /// /*In this command you need to choose the location where you want to save it*/
	stats(depvar)
*7b) Export coefficients, significance levels and statistical tests into Excel
estout * using "WW results.xls", replace /// /*In this command you need to choose the location where you want to save it*/
	cells(b(star fmt(3)) p(par({ }))) starlevels( * 0.10 ** 0.05 *** 0.010) ///
	stats(Econometric_model depvar N vce R_squared VIF_statistic RESET_P_value) mlabels(,titles)
eststo clear
*==================================================
*8.- Add a time and date stamp
*==================================================
g run_time_stamp = "$S_TIME"
g run_date_stamp = "$S_DATE"
export excel run_date_stamp run_time_stamp using "WW coefficients datestamp.xls", cell(A1) sheetmodify firstrow(variables) /*In this command you need to choose the location where you want to save it*/