﻿
*In this do file we present the residential retail econometric models for PR19 Final Determinations

*This do file has the following stages

*1.- DATA PREPARATION
	*1a. - Pre-preparations to alter setting and clear any data from storage.
	*1b. - Import master dataset from Excel
	*1c. - Assign labels and codes for all variables.
	*1d. - Delete years and companies not used in the modelling
	
*2.- Variables generation using codes. Each variable code has a label assigned to it.   
   *2a.- Unit conversions of variables
   *2b.- Generate unit costs to use as the dependent variable
   *2c.- Transform variable to logs
 
*3.- Structure the dataset in a panel dataset format  
 
*4.- Prepare the macros for different regression specifications
	*4a. - OTHER (TOTEX MINUS DEBT) COSTS
	*4b. - BAD DEBT RELATED COSTS
	*4c. - TOTAL COSTS
 
*5.- Run regressions and tests
 
*6.- Export the following results:
	*6a. - Export a matrix for the coefficients that are used in feeder models 2 and 4.
	*6b. - Export coefficients, significance levels and statistical tests into Excel 
	*6c. - Add a day and date stamp
	
************************************************************************************************************
*1. DATA PREPARATION
************************************************************************************************************	
*1a.- Pre-preparations
	clear
	set more off
	set matsize 800
	matrix drop _all
	estimates drop _all  
   
	set type double, perm /*Set so that all new variables that are created must be in double format 
						  (definition: this data storage option has about 16 digits of accuracy with a trade
						  off of a larger do-file size) rather than float format (definition: this default 
						  data storage option has only 7 digits of accuracy) in order to ensure accuracy to the
						  last decimal point (espeically important when doing the transformations from £ millions 
						  to £s) */
	

*1b. - Choose the location to source data and save regression outputs. Import the master dataset from Excel
	cd "O:\OFWSHARE\PR19 Modelling\Model runs\FD\Model Run 8\Cost Assessment\Final_07_11_19\Cost Assessment"
	
	import excel "FM_RR1.xlsx", ///
			cellrange(A2:AK215) sheet("real_statafile") firstrow
				
*1c. - Assign labels and codes for all variables.
	*Firstly, we assign labels with the second row to each variable using the following loop:
	foreach var of varlist * {
		label  variable  `var' "`=`var'[2]'"
	}
	
	*Next, we drop the first four rows not containing data
	drop in 1/4

	*Add names to columns that were previously unnamed
	rename A codecombine
	rename B companycode
	rename PR19boncode financialyear
	
	/*Then can destring all variables. We generate two variables, "start" and "end" so we always destring the
	  relevant variables no matter what order we have: */	
	g start=""
	g end=""
	order codecombine companycode financialyear start
	destring start-end, replace 
	
*1d.- Create the dataset used for modelling (i.e. delete unused variables, years and companies).
	*For residential retail data, only data from 2013-14 to 2018-19 is used for econometric modelling.
	drop if financialyear == "2010-11"
	drop if financialyear == "2011-12"
	drop if financialyear == "2012-13"
	
	drop if financialyear == "2019-20"
	drop if financialyear == "2020-21"
	drop if financialyear == "2021-22"
	drop if financialyear == "2022-23"
	drop if financialyear == "2023-24"
	drop if financialyear == "2024-25"
	
	*Create an unbalanced panel data. Model SWT + BWH for 2013-14 to 2015-16 and SWB for 2016-17, 2017-18 and 2018-19
	/* Bournemouth Water (BWH) and South West Water (SWT) were separate entities from 2011/12 to 2015/16. 
	In 2016/17, they merged to form South West Bournemouth (SWB). To reflect this in our modelling, we 
	consider the the separate entitities (SWT and BWH) for 2013/14 to 2015/16 and merged for 2016/17 and
	2017/18 */
	drop if companycode=="SWB" & financialyear=="2013-14"
	drop if companycode=="SWB" & financialyear=="2014-15"
	drop if companycode=="SWB" & financialyear=="2015-16"
	drop if companycode=="SWT" & financialyear=="2016-17"
	drop if companycode=="BWH" & financialyear=="2016-17"
	
	*We also do not consider Hafren Dyfrdwy (HDD) or Severn Trent England (SVE) but only its previous entities, SVT and DVW
	drop if companycode=="HDD"
	drop if companycode=="SVE"
	drop if companycode=="SVH"
	
	
***********************************************************************************************************
*2. Variables generation using the codes. Each variable code has a label assigned to it.
************************************************************************************************************	
*2a.- Unit conversion of variables	
	*For variables in £ millions:
	global Vars_in_million 	///
	/*Ofwat*/ 	DC_t sTC_tr sOC_tr ///

	*For variables in 000s:
	global Vars_in_000 ///
	/*Ofwat*/	hh_t 
	
	
	*For variables in %:
	global Vars_in_perc ///
	/*Ofwat*/	hhm_hh hhdu_hh ///
	/*External*/ 	incomescore eq_lpcf62 totalmigration ///
 
	*For variables already in appropriate units:
	global Vars_in_units ///
	/*Ofwat*/	rev_hh ///
	
	*Million * 1,000,000
	foreach v of varlist $Vars_in_million {
		summ `v'
		rename `v' `v'_million
		g `v' = `v'_million*1000000
		label variable `v' "`v' (units)"
	}

	*Thousands * 1,000
	foreach v of varlist $Vars_in_000 {
		summ `v'
		rename `v' `v'_000
		g `v' = `v'_000*1000
		label variable `v' "`v' (units)"
	}

	*Percentage * 100
	foreach v of varlist $Vars_in_perc {
		summ `v'
		rename `v' `v'_perc
		g `v' = `v'_perc
		label variable `v' "`v' (%)"
	}
	summ $Vars_in_perc
	
	*Units * 1 (no change, only label)
	foreach v of varlist $Vars_in_units {
		summ `v'
		rename `v' `v'_unit
		g `v' = `v'_unit*1
		label variable `v' "`v' (£/hh)"
	}	
	
	
*2b.- Generate unit cost dependent variables
	/* We model at 3 levels of aggregation
		i) Total costs per customer. The acronym is sTC_tr;
		ii) Bad debt related costs per customer. The acronym is DC_t;
		iii) Other costs per customer. The acronym is sOC_tr.

	i) Total costs per customer are calculated in the following way.
		(customer services + debt management + doubtful debt + meter reading + other operating expenditure 
		+ depreciation smoothed over 5 years + recharges costs net of recharges income
		+ local authority rates + exceptional items when reported separately) / number of connected households
		
		third party costs and pension deficit repair costs are excluded.	
	ii) Bad debt related costs per customer = (doubtful debt + debt management costs) / number of connected households.
	iii) Other costs per customer = Total costs per customer - bad debt related costs per customer                   */
	g DC_hh = DC_t / hh_t
	label variable DC_hh "Bad debt related cost per household (£/hh)"
	
	g sTC_hh = sTC_tr / hh_t
	label variable sTC_hh "Total cost per household with smoothed depreciation(£/hh)"
	
	g sOC_hh = sOC_tr / hh_t
	label variable sOC_hh "Total cost less debt per household with smoothed depreciation(£/hh)"
	
*2c.- Transform variables to log. We don't trasform those variables expressed in percentage	
	foreach v of varlist hh_t rev_hh DC_hh sOC_hh sTC_hh DC_t sTC_tr sOC_tr {
	g ln`v' = ln(`v')
	}
	
	/*Our independent variables are denoted by the following acronyms:
		*hhdu_hh: % of dual service households
		*hhm_hh: % of metered households
		*lnhh_t: log (number of connected households)
		*lnrev_hh: log (real average bill size £/hh)
		*eq_lpcf62: % of households with default (Equifax variable)
		*incomescore: % of households income deprived (income score from IMD)
		*totalmigration: % of transient houesholds (total internal and international migration) */
		
***********************************************************************************************************
*3.- We structure the dataset in a panel dataset format 
************************************************************************************************************	
*First assign numerical values to company codes and financial years:
	*For company code:
	encode companycode, g(id)
	
	*For financial year:
	g year=.
		replace year=2014 if financialyear=="2013-14"
		replace year=2015 if financialyear=="2014-15"
		replace year=2016 if financialyear=="2015-16"
		replace year=2017 if financialyear=="2016-17"
		replace year=2018 if financialyear=="2017-18"
		replace year=2019 if financialyear=="2018-19"
		
	*Then we set the dataset as a panel
	xtset id year
	
************************************************************************************************************
*4.- Prepare the macros for different regression specifications
************************************************************************************************************
*4a. OTHER (TOTEX MINUS BAD DEBT) COSTS
	*i Totex less debt per household = proportion of dual service households + proportion of metered households   			
	global regROC2 		lnsOC_hh ///
						hhdu_hh hhm_hh ///
							 				
	/*ii Totex less debt per household = proportion of dual service households + proportion of metered households
	     + number of connected households */   				
	global regROC4 		lnsOC_hh ///
						hhdu_hh hhm_hh lnhh_t ///
												
						
 *4b. BAD DEBT RELATED COSTS
	*i. bad debt per household = average bill size + probability of default (using Equifax variable)
	global regRDC1		lnDC_hh ///
						lnrev_hh eq_lpcf62				
									
	*ii. bad debt per household =  average bill size + probability of default (using IMD) + transience			
	global regRDC20 	lnDC_hh ///
						lnrev_hh incomescore totalmigration			
											
*4c. TOTAL COSTS
	*i. totex per household = average bill size + proportion of metered households + probability of default
	global regRTC3		lnsTC_hh ///
						lnrev_hh hhm_hh eq_lpcf62 ///
																
  	/* ii. totex per household = average bill size + proportion of metered households + probability of default 
	      + number of connected households */							 					
	global regRTC4		lnsTC_hh ///
						lnrev_hh hhm_hh eq_lpcf62 lnhh_t ///			
						
	/* iii. totex per household = average bill size + proportion of metered households + probability of default 
	       + transience + number of connected households */		
	global regRTC8		lnsTC_hh ///
						lnrev_hh hhm_hh totalmigration incomescore lnhh_t ///						
						
************************************************************************************************************
*5.- Run regressions and tests
************************************************************************************************************
*5a. Create a macro
	global MODELS ROC2 ROC4 RDC1 RDC20 RTC3 RTC4 RTC8
	
			
*5b. - Run Random effects (RE)
	foreach v in $MODELS {

		*Run pooled random effect regressions with clustered standard errors at company level and store results.
		eststo  re`v'   : xtreg ${reg`v'}, re vce(cluster id)
		
		*We will need predicted and actual costs in levels for the RESET test.
		predict yhatre`v' 	, xb
		predict ere`v'		, ue
		
		*RESET - the following lines calculate the RESET test as Stata does not have a command for RE models
		quietly sum yhatre`v'						  
		g nmlz_xb`v' = (yhatre`v'-r(mean))/r(sd)  	  
		g xb2`v' = (nmlz_xb`v')^2
		g xb3`v' = (nmlz_xb`v')^3
		g xb4`v' = (nmlz_xb`v')^4
		xtreg ${reg`v'} xb2`v' xb3`v' xb4`v'
		test xb2`v' xb3`v' xb4`v'
		estadd scalar RESET_P_value = r(p): re`v'	
		
		*Overall adjusted R-squared
		estadd scalar Adj_R_squared = e(r2_o): re`v'
	
		*Add title for regressions
		estadd local Estimation_method = "RE": re`v'
	}	
	
************************************************************************************************************ 
*6.- Export results to Excel
************************************************************************************************************* 
*6a. - Export a matrix for the coefficients that are used in feeder models 2 and 4 (ie FM_RR2 and FM_RR4).
	estout * using "./RR-coefficients.xls", replace cells(b(fmt(10)))
	
*6b. - Export coefficients, significance levels and statistical tests into Excel.
	estout * using "./RR-coefficients-stars.xls", replace ///
		cells(b(star fmt(3)) ///
		p(par({ }))) ///
		starlevels( * 0.10 ** 0.05 *** 0.010) ///
		stats 	( 			  ///
				Estimation_method ///
				depvar 		  ///
				Adj_R_squared ///
				VIF_statistic ///
				RESET_P_value ///
				N 			  ///
				) 			  ///
		mlabels(,titles)	  ///
	
	eststo clear 
		
*6c. - Export time and date of run
	g run_time_stamp = "$S_TIME" 
	g run_date_stamp = "$S_DATE"
	
	export excel run_date_stamp run_time_stamp using "./RR-coefficients-datestamp.xls", cell (A1) sheetmodify firstrow(variables)
	
*************************************************************************************************************