*In this do file we present the residential retail econometric models for PR19 Final Determinations *This do file has the following stages *1.- DATA PREPARATION *1a. - Pre-preparations to alter setting and clear any data from storage. *1b. - Import master dataset from Excel *1c. - Assign labels and codes for all variables. *1d. - Delete years and companies not used in the modelling *2.- Variables generation using codes. Each variable code has a label assigned to it. *2a.- Unit conversions of variables *2b.- Generate unit costs to use as the dependent variable *2c.- Transform variable to logs *3.- Structure the dataset in a panel dataset format *4.- Prepare the macros for different regression specifications *4a. - OTHER (TOTEX MINUS DEBT) COSTS *4b. - BAD DEBT RELATED COSTS *4c. - TOTAL COSTS *5.- Run regressions and tests *6.- Export the following results: *6a. - Export a matrix for the coefficients that are used in feeder models 2 and 4. *6b. - Export coefficients, significance levels and statistical tests into Excel *6c. - Add a day and date stamp ************************************************************************************************************ *1. DATA PREPARATION ************************************************************************************************************ *1a.- Pre-preparations clear set more off set matsize 800 matrix drop _all estimates drop _all set type double, perm /*Set so that all new variables that are created must be in double format (definition: this data storage option has about 16 digits of accuracy with a trade off of a larger do-file size) rather than float format (definition: this default data storage option has only 7 digits of accuracy) in order to ensure accuracy to the last decimal point (espeically important when doing the transformations from £ millions to £s) */ *1b. - Choose the location to source data and save regression outputs. Import the master dataset from Excel cd "O:\OFWSHARE\PR19 Modelling\Model runs\FD\Model Run 8\Cost Assessment\Final_07_11_19\Cost Assessment" import excel "FM_RR1.xlsx", /// cellrange(A2:AK215) sheet("real_statafile") firstrow *1c. - Assign labels and codes for all variables. *Firstly, we assign labels with the second row to each variable using the following loop: foreach var of varlist * { label variable `var' "`=`var'[2]'" } *Next, we drop the first four rows not containing data drop in 1/4 *Add names to columns that were previously unnamed rename A codecombine rename B companycode rename PR19boncode financialyear /*Then can destring all variables. We generate two variables, "start" and "end" so we always destring the relevant variables no matter what order we have: */ g start="" g end="" order codecombine companycode financialyear start destring start-end, replace *1d.- Create the dataset used for modelling (i.e. delete unused variables, years and companies). *For residential retail data, only data from 2013-14 to 2018-19 is used for econometric modelling. drop if financialyear == "2010-11" drop if financialyear == "2011-12" drop if financialyear == "2012-13" drop if financialyear == "2019-20" drop if financialyear == "2020-21" drop if financialyear == "2021-22" drop if financialyear == "2022-23" drop if financialyear == "2023-24" drop if financialyear == "2024-25" *Create an unbalanced panel data. Model SWT + BWH for 2013-14 to 2015-16 and SWB for 2016-17, 2017-18 and 2018-19 /* Bournemouth Water (BWH) and South West Water (SWT) were separate entities from 2011/12 to 2015/16. In 2016/17, they merged to form South West Bournemouth (SWB). To reflect this in our modelling, we consider the the separate entitities (SWT and BWH) for 2013/14 to 2015/16 and merged for 2016/17 and 2017/18 */ drop if companycode=="SWB" & financialyear=="2013-14" drop if companycode=="SWB" & financialyear=="2014-15" drop if companycode=="SWB" & financialyear=="2015-16" drop if companycode=="SWT" & financialyear=="2016-17" drop if companycode=="BWH" & financialyear=="2016-17" *We also do not consider Hafren Dyfrdwy (HDD) or Severn Trent England (SVE) but only its previous entities, SVT and DVW drop if companycode=="HDD" drop if companycode=="SVE" drop if companycode=="SVH" *********************************************************************************************************** *2. Variables generation using the codes. Each variable code has a label assigned to it. ************************************************************************************************************ *2a.- Unit conversion of variables *For variables in £ millions: global Vars_in_million /// /*Ofwat*/ DC_t sTC_tr sOC_tr /// *For variables in 000s: global Vars_in_000 /// /*Ofwat*/ hh_t *For variables in %: global Vars_in_perc /// /*Ofwat*/ hhm_hh hhdu_hh /// /*External*/ incomescore eq_lpcf62 totalmigration /// *For variables already in appropriate units: global Vars_in_units /// /*Ofwat*/ rev_hh /// *Million * 1,000,000 foreach v of varlist $Vars_in_million { summ `v' rename `v' `v'_million g `v' = `v'_million*1000000 label variable `v' "`v' (units)" } *Thousands * 1,000 foreach v of varlist $Vars_in_000 { summ `v' rename `v' `v'_000 g `v' = `v'_000*1000 label variable `v' "`v' (units)" } *Percentage * 100 foreach v of varlist $Vars_in_perc { summ `v' rename `v' `v'_perc g `v' = `v'_perc label variable `v' "`v' (%)" } summ $Vars_in_perc *Units * 1 (no change, only label) foreach v of varlist $Vars_in_units { summ `v' rename `v' `v'_unit g `v' = `v'_unit*1 label variable `v' "`v' (£/hh)" } *2b.- Generate unit cost dependent variables /* We model at 3 levels of aggregation i) Total costs per customer. The acronym is sTC_tr; ii) Bad debt related costs per customer. The acronym is DC_t; iii) Other costs per customer. The acronym is sOC_tr. i) Total costs per customer are calculated in the following way. (customer services + debt management + doubtful debt + meter reading + other operating expenditure + depreciation smoothed over 5 years + recharges costs net of recharges income + local authority rates + exceptional items when reported separately) / number of connected households third party costs and pension deficit repair costs are excluded. ii) Bad debt related costs per customer = (doubtful debt + debt management costs) / number of connected households. iii) Other costs per customer = Total costs per customer - bad debt related costs per customer */ g DC_hh = DC_t / hh_t label variable DC_hh "Bad debt related cost per household (£/hh)" g sTC_hh = sTC_tr / hh_t label variable sTC_hh "Total cost per household with smoothed depreciation(£/hh)" g sOC_hh = sOC_tr / hh_t label variable sOC_hh "Total cost less debt per household with smoothed depreciation(£/hh)" *2c.- Transform variables to log. We don't trasform those variables expressed in percentage foreach v of varlist hh_t rev_hh DC_hh sOC_hh sTC_hh DC_t sTC_tr sOC_tr { g ln`v' = ln(`v') } /*Our independent variables are denoted by the following acronyms: *hhdu_hh: % of dual service households *hhm_hh: % of metered households *lnhh_t: log (number of connected households) *lnrev_hh: log (real average bill size £/hh) *eq_lpcf62: % of households with default (Equifax variable) *incomescore: % of households income deprived (income score from IMD) *totalmigration: % of transient houesholds (total internal and international migration) */ *********************************************************************************************************** *3.- We structure the dataset in a panel dataset format ************************************************************************************************************ *First assign numerical values to company codes and financial years: *For company code: encode companycode, g(id) *For financial year: g year=. replace year=2014 if financialyear=="2013-14" replace year=2015 if financialyear=="2014-15" replace year=2016 if financialyear=="2015-16" replace year=2017 if financialyear=="2016-17" replace year=2018 if financialyear=="2017-18" replace year=2019 if financialyear=="2018-19" *Then we set the dataset as a panel xtset id year ************************************************************************************************************ *4.- Prepare the macros for different regression specifications ************************************************************************************************************ *4a. OTHER (TOTEX MINUS BAD DEBT) COSTS *i Totex less debt per household = proportion of dual service households + proportion of metered households global regROC2 lnsOC_hh /// hhdu_hh hhm_hh /// /*ii Totex less debt per household = proportion of dual service households + proportion of metered households + number of connected households */ global regROC4 lnsOC_hh /// hhdu_hh hhm_hh lnhh_t /// *4b. BAD DEBT RELATED COSTS *i. bad debt per household = average bill size + probability of default (using Equifax variable) global regRDC1 lnDC_hh /// lnrev_hh eq_lpcf62 *ii. bad debt per household = average bill size + probability of default (using IMD) + transience global regRDC20 lnDC_hh /// lnrev_hh incomescore totalmigration *4c. TOTAL COSTS *i. totex per household = average bill size + proportion of metered households + probability of default global regRTC3 lnsTC_hh /// lnrev_hh hhm_hh eq_lpcf62 /// /* ii. totex per household = average bill size + proportion of metered households + probability of default + number of connected households */ global regRTC4 lnsTC_hh /// lnrev_hh hhm_hh eq_lpcf62 lnhh_t /// /* iii. totex per household = average bill size + proportion of metered households + probability of default + transience + number of connected households */ global regRTC8 lnsTC_hh /// lnrev_hh hhm_hh totalmigration incomescore lnhh_t /// ************************************************************************************************************ *5.- Run regressions and tests ************************************************************************************************************ *5a. Create a macro global MODELS ROC2 ROC4 RDC1 RDC20 RTC3 RTC4 RTC8 *5b. - Run Random effects (RE) foreach v in $MODELS { *Run pooled random effect regressions with clustered standard errors at company level and store results. eststo re`v' : xtreg ${reg`v'}, re vce(cluster id) *We will need predicted and actual costs in levels for the RESET test. predict yhatre`v' , xb predict ere`v' , ue *RESET - the following lines calculate the RESET test as Stata does not have a command for RE models quietly sum yhatre`v' g nmlz_xb`v' = (yhatre`v'-r(mean))/r(sd) g xb2`v' = (nmlz_xb`v')^2 g xb3`v' = (nmlz_xb`v')^3 g xb4`v' = (nmlz_xb`v')^4 xtreg ${reg`v'} xb2`v' xb3`v' xb4`v' test xb2`v' xb3`v' xb4`v' estadd scalar RESET_P_value = r(p): re`v' *Overall adjusted R-squared estadd scalar Adj_R_squared = e(r2_o): re`v' *Add title for regressions estadd local Estimation_method = "RE": re`v' } ************************************************************************************************************ *6.- Export results to Excel ************************************************************************************************************* *6a. - Export a matrix for the coefficients that are used in feeder models 2 and 4 (ie FM_RR2 and FM_RR4). estout * using "./RR-coefficients.xls", replace cells(b(fmt(10))) *6b. - Export coefficients, significance levels and statistical tests into Excel. estout * using "./RR-coefficients-stars.xls", replace /// cells(b(star fmt(3)) /// p(par({ }))) /// starlevels( * 0.10 ** 0.05 *** 0.010) /// stats ( /// Estimation_method /// depvar /// Adj_R_squared /// VIF_statistic /// RESET_P_value /// N /// ) /// mlabels(,titles) /// eststo clear *6c. - Export time and date of run g run_time_stamp = "$S_TIME" g run_date_stamp = "$S_DATE" export excel run_date_stamp run_time_stamp using "./RR-coefficients-datestamp.xls", cell (A1) sheetmodify firstrow(variables) *************************************************************************************************************