
Exploring energy certificates of buildings through unsupervised data mining techniques

Evelina Di Corso and Tania Cerquitelli
Dipartimento di Automatica e Informatica, Politecnico di Torino, Torino, Italy
Email: {evelina.dicorso, tania.cerquitelli}@polito.it

Marco Savino Piscitelli and Alfonso Capozzoli
Dipartimento di Energia, Politecnico di Torino, Torino, Italy
Email: {marco.piscitelli, alfonso.capozzoli}@polito.it

Abstract: Energy Certificates of Buildings (ECB) provide interesting information on the standard energy performance and on the thermo-physical and geometrical properties of existing buildings. The analysis of such data collections is challenging due to the data volume and the heterogeneity of the attributes. This paper presents EPICA, a data mining framework to automatically explore a collection of ECB and extract interesting knowledge items. To this aim, EPICA first reduces the data dimensionality through Principal Component Analysis; then a clustering algorithm is exploited to discover groups of ECB with similar features. Each group is then locally characterized by a set of relevant generalized association rules able to summarize interesting relations among the variables influencing the energy performance of buildings at different levels of granularity. Experimental results, obtained on real data collected from an energy certification dataset related to the Piedmont Region, in North-Western Italy, show the effectiveness of EPICA in extracting a manageable set of human-readable knowledge items characterizing the groups of buildings with different energy performance levels.

Index Terms: Data exploration; machine learning algorithms; high dimensional data.

978-1-5386-3066-2/17 $31.00 © 2017 IEEE. DOI 10.1109/iThings-GreenCom-CPSCom-SmartData.2017.152

I. INTRODUCTION

Energy efficiency is a growing policy priority for many diverse applications (e.g., buildings, IoT-based devices, wireless networks) to reduce wasteful energy consumption. Thus, a large volume of energy-related data has been continuously collected in different domains. Many research activities have been devoted to addressing energy efficiency with different final goals: (i) to facilitate proactive energy-saving services [21], (ii) to optimize resource usage [20], (iii) to reduce emissions [14], and (iv) to characterize energy efficiency in buildings [6] through the analysis of either data streams of energy consumption in buildings [2], [9], [10] or data related to energy certificates of buildings [7], [8], [11].

In this paper we focus on Energy Certificates of Buildings (ECB), which provide interesting information on building energy performance. Authority planners and designers are mainly interested in identifying groups of buildings with homogeneous properties, together with their energy performance, to (i) understand which buildings consume the most energy and why, and (ii) evaluate the potential effects achievable by implementing retrofit actions. Innovative data analytics techniques should be designed and evaluated to efficiently support this decision-making process. Extracting actionable knowledge from ECB data is a multi-step process that requires considerable interaction between an energy scientist and a computer scientist. Both scientists assume a key role in the analytics process and should be very closely involved in designing innovative and efficient algorithms.

This paper presents EPICA (Energy PIedmont Certificate Analysis), a multi-tiered data mining framework to discover interesting knowledge items from a collection of Energy Certificates of Buildings (ECB). The multi-tiered architecture has been proposed to deal effectively with high dimensional data characterized by a large variety of building and system features. EPICA exploits two exploratory and unsupervised data mining techniques (a clustering algorithm and a generalized association rule mining approach) after reducing the data dimensionality through Principal Component Analysis. The cluster analysis allows the discovery of groups of ECB with similar thermo-physical features, while the generalized association rules summarize the energy performance of buildings at different levels of granularity. This process, supported by domain expertise, makes it possible to perform a detailed characterization of a large number of ECB. The unsupervised approach allows the discovery of reference values of the attributes driving a certain level of building energy performance. As a case study, a real collection of ECB related to the Piedmont Region, in North-Western Italy, was analyzed. The experimental results show the effectiveness of EPICA in dealing with high dimensional data and in discovering a manageable set of human-readable knowledge items for each group of buildings with a given energy performance label.

In this paper, Section II gives an overview of the EPICA analytics framework with a thorough description of its main components. Section III discusses the experimental results obtained on real data related to ECB, and Section IV draws conclusions and presents future developments of this work.

II. THE EPICA ANALYTICS FRAMEWORK

EPICA has been tailored to analyze any collection of ECB. This kind of data is high dimensional because a large number of attributes are specified for each energy certificate. Applying an exploratory data mining algorithm (e.g., cluster analysis, pattern mining) to such data is challenging due to the high variability and dimensionality of the data. To deal effectively with such data, EPICA combines different data analysis techniques with the aim of reducing data complexity and discovering interesting knowledge items easily exploitable by domain experts. Figure 1 shows the overall architecture of EPICA, which includes three main components: (i) data preprocessing, (ii) cluster analysis and (iii) cluster characterization. A detailed description of each component is given in the following subsections.

Fig. 1: The EPICA architecture

Data description. EPICA has been validated on real data of energy certificates of buildings. The data, related to buildings and flat units located in the Piedmont Region (North-Western Italy), were collected on a Web platform developed by CSI Piemonte (the Information System Consortium) and are regulated by the Piedmont Region authority (Sustainable Energy Development Sector). We focused on ECB of the residential category; in particular, only certificates of flats have been analyzed. Table I reports, for each energy certificate, the main subset of available attributes (with the corresponding notation and units) grouped into 4 categories (building characteristics, efficiency of the subsystems for space heating, system efficiency and energy performance). The data set contains information on building envelope and system features and on the primary energy demand for space heating and domestic hot water (DHW) for each flat. The primary energy demand is calculated in 'standard rating' conditions, according to EN ISO 13790¹, UNI/TS 11300-1², and UNI/TS 11300-2³.

¹ ISO (International Organization for Standardization) EN ISO 13790:2008: Energy performance of buildings, calculation of energy use for space heating and cooling. EU, Brussels, Belgium.
² UNI (Ente Nazionale Italiano di Unificazione) UNI/TS 11300-1:2008: Energy performance of buildings, Part 1: evaluation of energy need for space heating and cooling. Milan, Italy (in Italian).
³ UNI (Ente Nazionale Italiano di Unificazione) UNI/TS 11300-2:2008: Energy performance of buildings, Part 2: evaluation of primary energy and system efficiencies for space heating and domestic hot water production. Milan, Italy (in Italian).

TABLE I: A subset of attributes characterizing ECB

Category name | Attribute name [Units] | Notation
--------------|------------------------|---------
Building characteristics | Floor area [m²] | A
Building characteristics | Heated volume [m³] | V
Building characteristics | Heat transfer surface [m²] | S
Building characteristics | Aspect ratio [m⁻¹] | S/V
Building characteristics | Average U-value of the vertical opaque envelope [W/m²K] | U_o
Building characteristics | Average U-value of the windows [W/m²K] | U_w
Building characteristics | Average ceiling height [m] | H
Efficiency of the subsystems for space heating | Emission subsystem efficiency [-] | η_e
Efficiency of the subsystems for space heating | Distribution subsystem efficiency [-] | η_d
Efficiency of the subsystems for space heating | Control subsystem efficiency [-] | η_rg
Efficiency of the subsystems for space heating | Generation subsystem efficiency [-] | η_gn
System efficiency | Average global efficiency for space heating [-] | η_h
System efficiency | Average global efficiency for domestic hot water [-] | η_w
Energy performance | Normalized primary energy demand for space heating [kWh/m²] | hPED_h
Energy performance | Normalized primary energy demand for domestic hot water [kWh/m²] | hPED_w
Energy performance | Normalized primary energy demand for space heating and domestic hot water [kWh/m²] | hPED_h,w
Energy performance | Energy performance label [-] | EPL

A. The preprocessing component

The knowledge extraction step in EPICA is preceded by a preprocessing phase, which aims to reduce the data dimensionality so that a more tractable dataset is analyzed. Preprocessing entails the following two steps: (i) correlation analysis and (ii) principal component analysis.

Correlation analysis. Since correlated attributes have similar impacts on the analysis process, we removed them to reduce the space and time complexity of the data mining algorithms. EPICA leverages the correlation matrix to analyze the dependence between multiple variables at the same time. For each pair of attributes, EPICA computes the Pearson correlation coefficient. Since the higher the coefficient, the stronger the correlation, for each pair of attributes with a Pearson correlation higher than 0.80 EPICA keeps only the attribute that is more correlated with energy consumption. This step smooths the effect of noisy data, thus improving the effectiveness of the subsequent analytics steps.

Principal Component Analysis (PCA) is a statistical procedure that converts a set of observations into a set of values of linearly uncorrelated variables (i.e., principal components) through an orthogonal transformation. The new axes coincide with the directions of maximum variation of the original observations. Let X be the data matrix in which each row represents an ECB. EPICA computes the correlation matrix of X and its decomposition. The eigenvectors of this matrix are known as principal components. The first principal component has the largest possible variance, while each subsequent component has the highest possible variance under the constraint that it is orthogonal to the previous ones. A plot of the principal components visualizes the amount of variance explained by using only k principal components (for further detail see Section III-A). In the new orthogonal axes, EPICA selects the k eigenvectors that correspond to the k largest eigenvalues.
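The two preprocessing steps (correlation analysis followed by PCA) can be sketched as follows. This is a minimal NumPy illustration rather than the RapidMiner workflow used in the experiments: the function names `drop_correlated` and `pca_by_variance`, the use of the signed (not absolute) Pearson coefficient for the 0.80 threshold, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def drop_correlated(X, y, names, threshold=0.80):
    """For each attribute pair whose Pearson correlation exceeds `threshold`,
    keep only the attribute that is more correlated with the target `y`."""
    corr = np.corrcoef(X, rowvar=False)
    n = X.shape[1]
    # Absolute correlation of every attribute with the energy-related target.
    target_corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n)])
    dropped = set()
    for i in range(n):
        for j in range(i + 1, n):
            if i in dropped or j in dropped:
                continue
            if corr[i, j] > threshold:
                dropped.add(i if target_corr[i] < target_corr[j] else j)
    keep = [j for j in range(n) if j not in dropped]
    return X[:, keep], [names[j] for j in keep]

def pca_by_variance(X, explained=0.80):
    """Project X on the fewest principal components (eigenvectors of the
    correlation matrix) whose cumulative explained variance reaches `explained`."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratio, explained) + 1)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize, consistent with the correlation matrix
    return Z @ eigvecs[:, :k], k

# Synthetic toy data (hypothetical): A and V are strongly correlated, U is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
v = a + 0.3 * rng.normal(size=200)
u = rng.normal(size=200)
y = v + 0.1 * rng.normal(size=200)             # stand-in for energy consumption
X = np.column_stack([a, v, u])

X_sel, kept = drop_correlated(X, y, ["A", "V", "U"])
scores, k = pca_by_variance(X_sel)
```

In this toy run the pair (A, V) exceeds the threshold and A is discarded because V tracks the target more closely; the remaining attributes are then projected on the first `k` components.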

B. The cluster analysis component

EPICA integrates a partitional clustering algorithm, the K-means algorithm [13], to identify a set of groups of ECB characterized by similar properties. The analysis is applied to the data modeled through the principal components (identified by PCA in the previous step), and the distance between two energy certificates is computed as the Euclidean distance. The K-means algorithm subdivides the input dataset into K groups, where K is a user-specified parameter. Each group is represented by its centroid, computed as the average of all the energy certificates in the cluster.⁴ K-means is the most popular clustering algorithm, although it is biased towards clusters with a spherical shape. However, it identifies a good cluster set in a limited computational time.

To automatically set a good K value, EPICA exploits the self-tuning strategy proposed in [12], based on Silhouette-based indices, to overcome the local optimality choice of K-means. Specifically, to automatically compare and rank the partitions of energy certificates identified with different K values, EPICA selects the optimal value of K according to the highest values assumed by both the Silhouette index and the weighted Silhouette index. The Silhouette index [18] measures both intra-cluster cohesion and inter-cluster separation by evaluating the appropriateness of the assignment of an energy certificate to one cluster rather than to another. The weighted Silhouette index [12], computed on a given set of energy certificates, is the ratio between (i) the sum of the percentages of energy certificates in each positive bin, each weighted with an integer value (weights range from w_min = 1 to w_max = 10, where the highest weight is assigned to the first bin [1 - 0.9] and so on), and (ii) the overall sum of the weights. The higher the weighted Silhouette index, the better the identified partition.
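The selection of K through the Silhouette and weighted Silhouette indices can be sketched as below. This is a self-contained NumPy illustration under several assumptions, not the strategy of [12] as implemented in EPICA: the helper names are hypothetical, K-means uses a deterministic farthest-point initialization (instead of random initial centroids) purely to make the example reproducible, the weighted index follows a literal reading of the definition above, and the two indices are combined by ranking on the mean Silhouette first and the weighted Silhouette second.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    # Farthest-point initialization: an assumption made for reproducibility.
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)           # assign each point to the closest centroid
        updated = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centroids[c] for c in range(k)])
        if np.allclose(updated, centroids):    # stop when centroids no longer change
            break
        centroids = updated
    return labels

def silhouette_values(X, labels):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue                           # singleton cluster: silhouette left at 0
        a = dist[i, own].sum() / (own.sum() - 1)   # cohesion (self-distance is 0)
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

def weighted_silhouette(s, n_bins=10):
    # Positive bins of width 0.1; weight 10 for the highest bin down to 1 for the lowest.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    weights = np.arange(1, n_bins + 1)
    counts, _ = np.histogram(s[s > 0], bins=edges)
    fractions = counts / len(s)
    return float((weights * fractions).sum() / weights.sum())

# Toy data (hypothetical): three well-separated groups in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 6):
    s = silhouette_values(X, kmeans(X, k))
    scores[k] = (float(s.mean()), weighted_silhouette(s))
best_k = max(scores, key=scores.get)           # rank by mean silhouette, then weighted silhouette
```

On real certificates the candidate range of K and the way the two indices are traded off would need to follow the experimental setting; the toy range used here is only illustrative.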

⁴ First, the algorithm sets K initial centroids, chosen randomly. Then each point is assigned to the closest centroid. Next, the centroids are recalculated. The algorithm repeats the previous two steps until the centroids no longer change.

C. The cluster characterization component

Clusters are anonymous groups of ECB characterized by their principal components extracted through PCA, but human-readable results are much more valuable to domain experts. To this aim, EPICA enriches the clusters with the original features characterizing the energy certificates and provides two kinds of human-readable knowledge: (i) general attribute-based statistics and (ii) high-level correlations among data represented through generalized association rules [19].

General attribute-based statistics. EPICA characterizes the cluster set through different methods to highlight the quality of the identified partition (e.g., well-separated and cohesive groups of ECB): (1) the singular value decomposition (SVD) to show the cluster set in a graphical and friendly way, (2) statistics-based characteristics for each cluster, and (3) the boxplot distribution [17] for the top three principal components. For further details see Section III-B.

High-level correlations among data. EPICA discovers correlations, in terms of generalized association rules, from each cluster identified by the K-means algorithm and enriches them with the original data features. Generalized association rule mining [19] is an exploratory data mining technique that has largely been used to extract hidden correlations among data items at different granularity levels. To introduce the concept of generalized association rule, we first recall the notion of generalized itemset. In the context of a relational database storing ECB, a generalized itemset is a set of items (attribute, value) and/or generalized items (attribute, generalized value), all belonging to distinct attributes, where the generalized value is defined through a taxonomy. A taxonomy is a forest of generalization trees, each one representing a hierarchy of aggregations defined on an attribute domain. Note that traditional (non-generalized) itemsets [3] are a special case of generalized itemsets in which all items assume values in the lowest level of the corresponding taxonomy (i.e., a set of generalization trees). Figure 2 shows an example of the generalization tree for the Average Global Efficiency of the heating and domestic hot water system attribute. Leaf nodes are labeled with values in the Average Global Efficiency attribute domain (after a discretization step), while non-leaf nodes are aggregations of leaf nodes labeled with distinct values (values not in the attribute domain for categorical attributes, or wider value ranges in the case of discretized attributes).

Fig. 2: Example of a generalization tree

The traditional (non-generalized) itemset {(Aspect Ratio, (0.2, 0.45]), (Average Global Efficiency, (0.75, 0.85])} indicates that the items (Aspect Ratio, (0.2, 0.45]) and (Average Global Efficiency, (0.75, 0.85]) co-occur in the analyzed data. Instead, the generalized itemset {(Aspect Ratio, (0.2, 0.55]), (Average Global Efficiency, (0.75, 1.0])} generalizes the former itemset by aggregating item values according to the generalization trees built on the Aspect Ratio and Average Global Efficiency attributes. A generalized item matches a given record if its value corresponds to, or is an aggregation of, the value of any item of the energy certificate (at any abstraction level). The support of a generalized itemset in a relational dataset is an established quality index, computed as the percentage of dataset records matched by all of its items.

Let X and Y be two disjoint generalized itemsets. A generalized association rule is represented in the form X → Y, where X and Y are the body (antecedent) and the head (consequent) of the rule, respectively. To rank the most interesting rules, EPICA uses four quality indices named support, confidence, lift and conviction [16]. The rule support is the percentage of records containing both X and Y.
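A small worked example may help make the matching of generalized items and the four quality indices concrete. The sketch below assumes the standard textbook definitions (confidence = supp(X ∪ Y) / supp(X), lift = confidence / supp(Y), conviction = (1 - supp(Y)) / (1 - confidence)); the function names, the representation of items as intervals or value sets, and the four toy records are hypothetical and do not come from the analyzed dataset.

```python
def item_matches(value, item):
    """An item is either an interval (low, high], for discretized numeric
    attributes, or a set of admissible categorical values (a generalized value)."""
    if isinstance(item, tuple):
        low, high = item
        return low < value <= high
    return value in item

def matches(record, itemset):
    return all(attr in record and item_matches(record[attr], item)
               for attr, item in itemset.items())

def support(records, itemset):
    # Fraction of records matched by all items of the itemset.
    return sum(matches(r, itemset) for r in records) / len(records)

def rule_metrics(records, body, head):
    supp_xy = support(records, {**body, **head})
    supp_x = support(records, body)
    supp_y = support(records, head)
    conf = supp_xy / supp_x if supp_x else 0.0
    lift = conf / supp_y if supp_y else float("nan")
    conviction = float("inf") if conf == 1 else (1 - supp_y) / (1 - conf)
    return {"support": supp_xy, "confidence": conf, "lift": lift, "conviction": conviction}

# Hypothetical toy records.
records = [
    {"aspect_ratio": 0.30, "global_eff": 0.80, "label": "B"},
    {"aspect_ratio": 0.50, "global_eff": 0.90, "label": "A"},
    {"aspect_ratio": 0.40, "global_eff": 0.60, "label": "D"},
    {"aspect_ratio": 0.70, "global_eff": 0.95, "label": "C"},
]

leaf_itemset = {"aspect_ratio": (0.2, 0.45), "global_eff": (0.75, 0.85)}
generalized_itemset = {"aspect_ratio": (0.2, 0.55), "global_eff": (0.75, 1.0)}
high_performance = {"label": {"A+", "A", "B"}}   # generalized label value

leaf_support = support(records, leaf_itemset)
generalized_support = support(records, generalized_itemset)
metrics = rule_metrics(records, generalized_itemset, high_performance)
```

The generalized itemset covers a superset of the records matched by the leaf-level itemset, which is why its support can only be equal or higher.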

The rule confidence is the conditional probability that the consequent Y is true given the antecedent X. Given a set of transactions D, EPICA finds all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the user-specified support and confidence thresholds. The lift index [16] measures the (symmetric) correlation between the antecedent and the consequent of the extracted rules. When a rule has lift equal to one, the occurrences of the antecedent and of the consequent are independent, so X and Y are not correlated. Lift values above 1 (below 1) show a positive (negative) correlation between itemsets X and Y. EPICA ranks rules according to their lift value to focus on the subset of most positively correlated rules. The conviction index is an asymmetric measure, proposed in [5] to tackle some of the weaknesses of confidence and lift. It is inspired by the logical definition of implication and attempts to measure the degree of implication of a rule.

Taxonomy generation. A taxonomy can be either provided by a domain expert or inferred from the data. EPICA generates a taxonomy from the data under analysis through the CART (Classification And Regression Tree) algorithm [4]. The variables are analyzed individually, using as response variable the annual primary energy demand normalized on floor area, calculated in the standard climatic conditions of the Piedmont city of Turin. The splits identified through the regression tree are used as aggregate values in the corresponding generalization tree. The annual primary energy demand is calculated from the net energy demand by considering the thermal losses through the different subsystems (emission, control, distribution, generation) related to both space heating and domestic hot water production. In the Piedmont region, residential buildings with an energy demand lower than 82.00 kWh/m² are considered high-performing buildings (energy performance labels A+, A, B).
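The idea of deriving aggregation boundaries for a single attribute from regression-tree splits can be illustrated with a hand-written, single-variable split search. This is only a sketch in the spirit of CART (splits chosen to minimize the within-node sum of squared errors of the response), not the implementation used by EPICA; the function names, the `min_leaf` and `depth` parameters, and the toy values are assumptions.

```python
import numpy as np

def best_split(x, y, min_leaf=2):
    """Threshold on x minimizing the summed squared error of y in the two
    resulting groups (the usual regression-tree criterion)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_sse, best_thr = np.inf, None
    for i in range(min_leaf, len(x) - min_leaf + 1):
        if x[i] == x[i - 1]:
            continue                           # no threshold between equal values
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_thr = sse, (x[i - 1] + x[i]) / 2
    return best_thr

def split_points(x, y, depth=2, min_leaf=2):
    """Recursively collect split thresholds, returned in increasing order.
    Deeper splits delimit finer intervals (lower taxonomy levels)."""
    if depth == 0 or len(x) < 2 * min_leaf:
        return []
    thr = best_split(x, y, min_leaf)
    if thr is None:
        return []
    left = x <= thr
    return (split_points(x[left], y[left], depth - 1, min_leaf)
            + [thr]
            + split_points(x[~left], y[~left], depth - 1, min_leaf))

# Hypothetical toy data: global efficiency vs. normalized primary energy demand.
efficiency = np.array([0.50, 0.55, 0.60, 0.65, 0.80, 0.85, 0.90, 0.95])
demand = np.array([200.0, 195.0, 205.0, 198.0, 90.0, 85.0, 95.0, 88.0])

top_level = split_points(efficiency, demand, depth=1)   # coarsest aggregation boundary
finer = split_points(efficiency, demand, depth=2)       # boundaries for a lower level
```

In this toy case the first split separates low-efficiency, high-demand flats from high-efficiency, low-demand ones; a production setting would add a stopping rule (for example a minimum error reduction) so that uninformative deeper splits are not turned into taxonomy levels.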
Indeed, the normalized primary energy demand is selected as the response variable of the CART in order to find, for each variable, different levels of aggregation representative of flats with similar energy performance.

Rule classes. Exploring the results of the rule extraction process can be a challenging task, because the number of mined rules can be very high. To ease the manual exploration of the results and to allow experts to focus their attention on a subset of interesting rules, EPICA categorizes the extracted rules into four classes, as shown in Table II. The meaning of a rule is determined by its class, which relates only a few attributes characterizing ECB to the energy performance label. Specifically, all the classes model the correlations among three attributes and the energy performance label, and the rules are locally extracted from the content of each cluster. The subsets of considered attributes are reported in Table II. In more detail, rules of classes C1, C2 and C3 model how the Average U-value of vertical opaque envelope and the Average U-value of windows influence the energy performance label together with either the Average global efficiency, the Heat transfer surface or the Aspect ratio. This kind of knowledge allows the characterization of different types of building energy profiles according to the physical characteristics of the building. Rules of class C4, instead, model how the Aspect ratio and the Average global efficiency influence the building energy performance label together with the Average Ceiling Height of the buildings. In this case, we linked the compactness of the buildings and their heat transfer surface to characterize their energy performance.

TABLE II: Rule Classes

Cid | Rule class
----|-----------
C1 | {Average U-value of vertical opaque envelope, Average U-value of windows, Average global efficiency} → {Energy performance label}
C2 | {Average U-value of vertical opaque envelope, Average U-value of windows, Heat transfer surface} → {Energy performance label}
C3 | {Average U-value of vertical opaque envelope, Average U-value of windows, Aspect ratio} → {Energy performance label}
C4 | {Aspect ratio, Average global efficiency, Average Ceiling Height} → {Energy performance label}

III. EXPERIMENTAL RESULTS

We experimentally evaluated EPICA on a real collection of ECB gathered in the Piedmont region, North-Western Italy, in 2013. The dataset includes 9101 energy certificates, each one characterized by a large variety of features. The analyzed buildings, both detached houses and flats in condos, are distributed across the Piedmont region in 25 different cities. The energy certificates were issued in the first six months of 2013.

The experimental validation has been designed to address three main issues related to the effectiveness of EPICA in: (i) correctly performing the data dimensionality reduction, (ii) identifying well-separated and cohesive groups of ECB with similar properties, and (iii) discovering a manageable set of patterns, in terms of generalized association rules, to compactly characterize each group of buildings. Based on the experimental evaluation discussed in Sections III-B and III-C, the parameter setting K = 9, minconf = 1%, minsup = 0.1%, minlift = 1.1 has been used as the reference default configuration for EPICA. To address the problem of centroid initialization for the K-means algorithm, we performed multiple runs with randomly chosen initial centroids and the number of iterations set to 100. The open source RapidMiner toolkit [1] has been used for the correlation analysis, the Principal Component Analysis and the generalized association rule extraction. The MATLAB toolkit [15] has been used to analyze the data distribution through boxplots.
The Apache Spark framework has been used to perform the cluster analysis, exploiting the K-means algorithm available in MLlib. Experiments were performed on a 2.66-GHz Intel(R) Core(TM)2 Quad PC with 8 GB of main memory running a standalone Apache Spark 1.3.0 framework.

A. Performance of the EPICA data dimensionality reduction

Here we present a subset of the performed experiments to show the ability of EPICA to identify correlated attributes, and the performance of PCA in terms of selected principal components and their characterization. The correlation analysis performed before PCA reduces the percentage of noisy data in the data under analysis. We experimentally verified that PCA performs better after applying the correlation analysis than when it is applied directly to the original dataset (without the correlation analysis). Specifically, the correlation analysis step smooths the effect of noisy data, improving the benefit of PCA.

EPICA exploits the correlation matrix to analyze the dependence between multiple variables at the same time. The correlation matrix shown in Table III contains the correlation coefficients between pairs of a subset (only 7 out of 20) of the numerical attributes under analysis. Values are computed as discussed in Section II-A. A generic element (i, j) of the correlation matrix models the correlation between the attribute in row i and the one in column j. Correlation coefficients always lie in the range [-1, 1]. A positive value (in (0, 1]) implies a positive correlation between attributes i and j: large (small) values of attribute i tend to be associated with large (small) values of attribute j. A negative value (in [-1, 0)) means a negative or inverse association: large values of i tend to be associated with small values of j and vice versa. A value near 0 indicates weakly correlated data. The matrix is symmetric (i.e., the correlation of column i with column j is the same as the correlation of column j with column i), and the elements on its diagonal are always 1 since they represent the correlation of an attribute with itself.

The results reported in Table III highlight three strong correlations: (1) a positive and strong correlation (0.96) between Heated volume, i.e., the volume of a building measured by its external dimensions, and Floor area, i.e., the total area of all enclosed spaces measured according to the internal face of the external walls; (2) a high correlation, greater than 0.85, between Floor area and Heat transfer surface; and (3) a high correlation between Heated volume and Heat transfer surface. Since highly correlated attributes behave similarly, for each pair of attributes highlighted in the matrix one attribute (the one less correlated with the building energy consumption) is removed from the analysis to reduce both the computational cost and the cardinality of the extracted knowledge. Based on the above results, we do not consider Floor area and Heat transfer surface in the subsequent analysis process. To sum up the overall results of the Pearson correlation analysis, EPICA analyzes 20 numerical attributes and selects 16 of them by disregarding 4 correlated attributes. This step smooths the effect of noisy data, improving the principal component analysis.

As the next analytics step in the preprocessing component, EPICA exploits PCA to address the data dimensionality reduction. This component receives as input all the features selected through the correlation analysis phase (i.e., 16 features) and identifies the principal components explaining 80% of the data variability. This percentage of explained variability can be calculated via the variance, which is an index of the dispersion of the data. To capture 80% of the data variability in the considered dataset, 10 principal components need to be considered in the subsequent analysis steps. Table IV shows the most representative attributes characterizing the first three components selected through PCA. For each attribute, the corresponding representative weight is also reported. For example, the first component mainly represents Average U-value of vertical opaque envelope, Average U-value of windows and Average global efficiency of the heating system with weights 0.41, -0.36 and -0.35 respectively.

B. Performance of the EPICA data clustering

In EPICA, groups of energy certificates are identified by analyzing the most interesting principal components characterizing the collection, selected through PCA, with the K-means algorithm. To automatically set the K-means input parameter K, EPICA exploits the method described in Section II-B. For all K values in the range [4, 30], the identified partitions are evaluated through both the Silhouette and the weighted Silhouette indices, and the corresponding values are plotted against K. The optimal value of K has been selected at the coordinates where both the Silhouette and the weighted Silhouette assume their highest values, i.e., where the best partition of well-separated and cohesive clusters is identified. Based on this method, we set K = 9. EPICA characterizes the cluster set through: (1) the singular value decomposition (SVD) to show the results in a graphical and friendly way, (2) statistics-based characteristics for the cluster set, and (3) the boxplot distribution for the top three principal components.

SVD representation of the cluster set. Figure 3 shows the SVD decomposition of the identified cluster set. SVD is a matrix factorization method that factorizes the input data matrix into three matrices. It can be easily exploited to reduce the data dimensionality by only considering the most representative attributes. All clusters in Figure 3 are well-separated, thus K-means is able to identify a good partition of ECB.
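As an illustration of how an SVD can yield a two-dimensional view of a clustered dataset, the sketch below factorizes a data matrix into three matrices and keeps the coordinates along the two largest singular values. The function name `svd_project`, the centering of the data before factorization, and the random toy matrix are assumptions of this example; the plotting step is omitted.

```python
import numpy as np

def svd_project(X, n_dims=2):
    """Factorize the centered data as Xc = U * diag(s) * Vt and return the
    coordinates of each row along the first `n_dims` singular directions."""
    Xc = X - X.mean(axis=0)                    # centering is an assumption of this sketch
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = U[:, :n_dims] * s[:n_dims]        # equivalent to Xc @ Vt[:n_dims].T
    return coords, (U, s, Vt)

# Hypothetical toy matrix: 50 certificates described by 10 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
coords, (U, s, Vt) = svd_project(X)
```

The resulting `coords` array can be drawn as a scatter plot with one color per cluster label to inspect visually how well the groups are separated.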

Fig. 3: Cluster set representation through SVD Statistics-based characteristics for the cluster set. To characterize the cluster set, we enriched each cluster with the energy performance label deﬁned by the Italian legislation. In Italy there are 8 energy performance labels {A+, A, B, C, D, E, F, G} corresponding to different energy

995

Attribute name Floor area (A) Heated volume (V) Average ceiling height (H) Heat transfer surface (S) Aspect ratio (S/V) Average U-value of vertical opache envelope (Uo ) Average U-value of windows (Uw )

A 1.00 0.96 -0.09 0.81 0.08 -0.08 -0.11

V 0.96 1.00 0.14 0.84 0.09 -0.07 -0.10

H -0.09 0.14 1.00 0.13 0.05 0.02 0.03

S 0.81 0.84 0.13 1.00 0.56 -0.11 -0.12

S/V 0.08 0.09 0.05 0.56 1.00 -0.10 -0.08

Uo -0.08 -0.07 0.02 -0.11 -0.10 1.00 0.45

Uw -0.11 -0.10 0.03 -0.12 -0.08 0.45 1.00

TABLE III: Correlation matrix between a subset of couples of numerical attributes in the considered dataset. p1 attribute Average Global Efﬁciency Average U-value of the vertical opaque envelope Average U-value of windows

weight 0.41 -0.36 -0.35

p2 attribute Building energy need Aspect Ratio Emission subsystem efﬁciency

weight 0.56 0.52 -0.26

p3 attribute Generation subsystem efﬁciency Bolier size Average Ceiling Height

weight -0.54 -0.39 -0.28

TABLE IV: The most representative attributes characterizing the ﬁrst three components selected through PCA

ClusterID

# of certiﬁcates

Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6 Cluster7 Cluster8 Cluster9

1708 1059 629 69 1174 1431 787 667 1577

% of certiﬁcates with the majority energy performance label 77% - {CD} 69% - {CD} 70% - {EFG} 65% - {CD} 56% - {CD} 71% - {CD} 59% - {EFG} 65% - {A+AB} 96% - {EFG}

New energy performance label medium medium low medium medium medium low high low

TABLE V: Statistics-based characteristics for the cluster set

use intensity of buildings. The 8 energy performance labels correspond to different ranges of the normalized primary energy demand calculated in the standard climatic conditions of the Piedmont city of Turin. In EPICA we grouped these 8 labels into 3 groups, {(A+AB), (CD), (EFG)}⁵, representing more general groups of buildings with high, medium and low energy performance, respectively. These classes have been identified with the support of energy domain experts. The main characteristics of each cluster, in terms of number of certificates, percentage of records with the majority energy performance label and the corresponding description, are reported in Table V. EPICA identified 5 clusters that are homogeneous in size, 3 medium-sized clusters and 1 small cluster. Cluster8 represents the group of ECB with high energy performance; Cluster1, Cluster2, Cluster4, Cluster5 and Cluster6 are groups of certificates with medium energy performance; and the other three clusters (Cluster3, Cluster7 and Cluster9) include the subset of certificates with low energy performance. The same building energy performance label can result from different combinations of values assumed by the most influential attributes.

Analysis of the data distribution among clusters through boxplots. Figure 4 shows the distribution (through boxplots [17]) of the top three principal components in the nine discovered clusters. A boxplot (box-and-whisker plot) graphically shows groups of numerical data through their quartiles. It sums up the data distribution through a few numbers (i.e., median, quartiles, min and max values) modeling the frequency distribution. The median summarizes the central tendency of the distribution and, compared to the quartiles, provides information about the asymmetry of the distribution. The quartiles give an indication of the variability through the interquartile range. The set of clusters is characterized by both negative and positive skewness, and the groups of data are quite different. In the case of positive skewness, the observations are concentrated at the lowest values, while in the case of negative skewness, the observations are concentrated at the highest ones. To better understand the differences among clusters, boxplots of different components should be jointly analyzed. Specifically, Figure 4a shows the impact of the first principal component in the found cluster set. Cluster1 and Cluster8 have quite high median values, while Cluster3 and Cluster4 have lower median values. Cluster2 and Cluster6 have a negative skewness, (Q3 − Me) < (Me − Q1), where Me is the median, Q1 the first quartile and Q3 the third quartile. Data are more concentrated between the median and the third quartile, as the same percentage of observations falls in a smaller range. These clusters have higher values related to the first component than Cluster3, which instead has a positive skewness due to the presence of lower values. Figures 4b and 4c show the impact of the second and third principal components in the found cluster set. These distributions also demonstrate that the identified set of clusters is quite well-separated and cohesive.
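The quartile-based reading of a boxplot can be illustrated with a minimal Python sketch. It computes the five-number summary and classifies the asymmetry by comparing (Q3 − Me) with (Me − Q1), the same comparison used to read Figure 4. The function names, the choice of the "inclusive" quantile method and the two sample lists are assumptions made for illustration only; they are not taken from EPICA or from the analyzed data.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of numbers."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def quartile_skew(values):
    """Classify asymmetry by comparing (Q3 - Me) with (Me - Q1)."""
    _, q1, median, q3, _ = five_number_summary(values)
    upper, lower = q3 - median, median - q1
    if upper < lower:
        return "negative"  # data packed between the median and Q3
    if upper > lower:
        return "positive"  # data packed between Q1 and the median
    return "symmetric"


# Hypothetical scores of two clusters on the first principal component.
left_tailed = [0.1, 0.9, 1.5, 1.8, 2.0, 2.1, 2.2]
right_tailed = [-2.2, -2.1, -2.0, -1.8, -1.5, -0.9, -0.1]

if __name__ == "__main__":
    for name, sample in [("left_tailed", left_tailed), ("right_tailed", right_tailed)]:
        print(name, five_number_summary(sample), quartile_skew(sample))
```

Different quartile conventions can give slightly different cut points on small samples, so the classification should be read as indicative when the two distances are close.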

⁵ Piedmont regional legislation. Legge regionale 28 maggio 2007 n. 13 (Disposizioni in materia di rendimento energetico nell’edilizia) art. 6 e art. 21, comma 1, lettere d), e) ed f).

Fig. 4: Characterization of the cluster set through the distribution of the top three components: (a) p1 principal component, (b) p2 principal component, (c) p3 principal component.

C. Cluster characterization through generalized patterns

EPICA also characterizes the cluster set by extracting generalized association rules from each cluster locally. The knowledge discovery process is driven by a taxonomy (generated as discussed in Section II-C) and threshold values for all quality indices (i.e., support, confidence, lift and conviction). The generalization trees generated for the Average global efficiency and Average U-value of vertical opaque envelope attributes are reported in Table VI. Since the Average global efficiency of buildings is an attribute assuming real and positive values, the corresponding generalization tree includes leaf values (4 leaves, as shown in Table VI), each one associated with a range of non-overlapping values. The tree also includes two aggregate values (intermediate nodes) and the root, which includes all values in the corresponding domain. The generalization tree of the Average U-value of the windows includes 8 leaves (8 non-overlapping ranges: [1.1, 1.85], (1.85, 2.05], (2.05, 2.25], (2.25, 2.45], (2.45, 2.75], (2.75, 3.35], (3.35, 4.85], (4.85, 5.5]) and 4 aggregate values (i.e., [1.1, 2.05], (2.05, 2.45], (2.45, 3.35], (3.35, 5.5]), while the generalization tree of Aspect ratio includes 6 leaves ([0.2, 0.45], (0.45, 0.55], (0.55, 0.75], (0.75, 0.85], (0.85, 1.35], (1.35, 1.55]) and 3 aggregate values (i.e., [0.2, 0.55], (0.55, 0.85], (0.85, 1.55]).

Attribute name | Leaf values | Aggregate value | Domain
Average global efficiency | [0.4-0.65], (0.65-0.75] | [0.4-0.75] | [0.4-1.0]
Average global efficiency | (0.75-0.85], (0.85-1.0] | (0.75-1.0] | [0.4-1.0]
Average U-value of vertical opaque envelope | [0.15-0.35], (0.35-0.45] | [0.15-0.45] | [0.15-1.1]
Average U-value of vertical opaque envelope | (0.45-0.55], (0.55-0.65] | (0.45-0.65] | [0.15-1.1]
Average U-value of vertical opaque envelope | (0.65-0.85], (0.85-1.1] | (0.65-1.1] | [0.15-1.1]

TABLE VI: Generalization trees for the Average global efficiency and Average U-value of vertical opaque envelope attributes

We recommend that users set low support and confidence threshold values (e.g., 0.1% and 1%, respectively) to avoid pruning some interesting rules with low confidence but high lift and conviction values. We also recommend a minimum lift threshold equal to 1.1 to prune both negatively correlated and uncorrelated item combinations. No threshold has been set for the conviction quality index. Tables VII and VIII show a subset of interesting generalized rules extracted from Cluster8 and Cluster9, respectively. We categorized rules according to the classes defined in Table II and we report an interesting rule for each class. Cluster8 includes certificates of buildings with a high energy performance label, while Cluster9 is related to buildings with a low energy performance label. Generalized rules include both traditional and generalized items. Thus, the rule set includes more specific correlations among traditional items, more general correlations among generalized items, and cross-level correlations (rules with some traditional items and other generalized items). Generalized items are reported in bold in Tables VII and VIII.

CId | RId | Rule body | Rule head | Supp % | Conf % | Lift | Conv
C1 | R1 | (Average Global Efficiency, (0.85-1.0]); (Avg U-value of opaque envelope, [0.15-0.45]); (Average U-value of windows, [1.1-1.85]) | A+ | 0.88 | 81.92 | 1.50 | 2.99
C2 | R2 | (Average U-value of windows, [1.1-1.85]); (Heat transfer surface, [75.0-83.35]); (Avg U-value of opaque envelope, [0.15-0.45]) | B | 3.39 | 76.67 | 1.34 | 1.83
C3 | R3 | (Avg U-value of opaque envelope, [0.15-0.35]); (Average U-value of windows, [1.1-1.85]); (Aspect Ratio, [0.55-0.85]) | B | 18.11 | 76.88 | 1.34 | 1.85
C4 | R4 | (Average Global Efficiency, (0.75-1.0]); (Average Ceiling Height, [3.75-4.05]); (Aspect Ratio, [0.2-0.55]) | A+AB | 6.19 | 95.45 | 1.47 | 7.71

TABLE VII: Subset of hierarchical rules characterizing Cluster8

CId | RId | Rule body | Rule head | Supp % | Conf % | Lift | Conv
C1 | R5 | (Avg U-value of opaque envelope, [0.65-1.0]); (Average Global Efficiency, [0.55-0.75]); (Average U-value of windows, (2.45-2.75]) | EFG | 4.21 | 98.77 | 1.07 | 6.39
C2 | R6 | (Avg U-value of opaque envelope, [0.65-1.0]); (Aspect Ratio, (0.85-1.55]); (Average U-value of windows, (3.35-5.5]) | G | 8.52 | 58.06 | 1.54 | 1.48
C3 | R7 | (Avg U-value of opaque envelope, [0.65-1.0]); (Aspect Ratio, (0.55-0.85]); (Average U-value of windows, (3.35-5.5]) | F | 1.26 | 50.00 | 1.52 | 1.34
C4 | R8 | (Average Ceiling Height, (3.75-5.5]); (Aspect Ratio, (0.75-0.85]); (Average Global Efficiency, (0.55-0.65]) | G | 1.26 | 77.42 | 2.05 | 2.75

TABLE VIII: Subset of hierarchical rules characterizing Cluster9

Rule R1 (in Table VII) captures the main correlations among two specific items ((Average Global Efficiency, (0.85-1.0]), (Average U-value of windows, [1.1-1.85])) and a generalized item ((Avg U-value of opaque envelope, [0.15-0.45])) characterizing the high energy performance buildings (with energy performance label A+). Rule R4 (in Table VII) models high-level correlations among average global efficiency, average ceiling height and aspect ratio characterizing the general class of high performance buildings (A+AB). Rule R5 (in Table VIII) captures the main generalized items characterizing low performance buildings (with energy performance label EFG). Specifically, buildings characterized by (Avg U-value of opaque envelope, [0.65-1.0]); (Average Global Efficiency, [0.55-0.75]); (Average U-value of windows, (2.45-2.75]) are low performance buildings, thus the corresponding energy certificates receive as energy performance label either E, F or G. Rule R8 (in Table VIII) instead conveys more specific knowledge, since it captures the main specific items ((Average Ceiling Height, (3.75-5.5]); (Aspect Ratio, (0.75-0.85]); (Average Global Efficiency, (0.55-0.65])) characterizing the set of buildings with low energy performance label (G).
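The two ingredients of the rule extraction step, item generalization through the taxonomy and rule evaluation through the quality indices, can be sketched in a few lines of Python. The interval edges below are transcribed from Table VI for the Average global efficiency attribute, and the default thresholds are the recommended ones (support 0.1%, confidence 1%, lift 1.1). All function and variable names are hypothetical. The index formulas are the standard definitions of support, confidence, lift and conviction, assumed here to match those adopted by EPICA, and the counts in the usage example are invented.

```python
from bisect import bisect_left

# Interval edges for "Average global efficiency", transcribed from Table VI.
LEAF_EDGES = [0.4, 0.65, 0.75, 0.85, 1.0]
AGGREGATE_EDGES = [0.4, 0.75, 1.0]


def interval_of(value, edges):
    """Label of the interval of `edges` containing `value`.

    The first interval is closed on both sides and the others are
    left-open, mirroring the [a-b], (b-c] notation of Table VI.
    """
    if not edges[0] <= value <= edges[-1]:
        raise ValueError(f"{value} is outside the attribute domain")
    i = max(1, bisect_left(edges, value))
    left = "[" if i == 1 else "("
    return f"{left}{edges[i - 1]}-{edges[i]}]"


def rule_quality(n_total, n_body, n_head, n_both):
    """Support, confidence, lift and conviction of a rule body -> head."""
    support = n_both / n_total
    confidence = n_both / n_body
    head_support = n_head / n_total
    lift = confidence / head_support
    if confidence == 1:
        conviction = float("inf")  # the rule is never violated
    else:
        conviction = (1 - head_support) / (1 - confidence)
    return support, confidence, lift, conviction


def conviction_from(confidence, lift):
    """Conviction implied by confidence and lift (both as fractions)."""
    head_support = confidence / lift
    return (1 - head_support) / (1 - confidence)


def passes_thresholds(support, confidence, lift,
                      min_support=0.001, min_confidence=0.01, min_lift=1.1):
    """Apply the recommended minimum thresholds; conviction is not filtered."""
    return (support >= min_support
            and confidence >= min_confidence
            and lift >= min_lift)


if __name__ == "__main__":
    # A leaf item and its generalization at the aggregate level.
    print(interval_of(0.9, LEAF_EDGES), "->", interval_of(0.9, AGGREGATE_EDGES))
    # Invented counts: 1000 certificates, 100 match the body,
    # 400 carry the head label, 80 match both.
    print(rule_quality(1000, 100, 400, 80))
    # Confidence and lift reported for rule R4 in Table VII.
    print(round(conviction_from(0.9545, 1.47), 2))
```

With the confidence and lift reported for R4 in Table VII, the implied conviction is about 7.71, which agrees with the value listed in the table. The check relies on rounded inputs, so small deviations are to be expected for other rules.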

IV. CONCLUSION

This paper presented EPICA, a multi-tiered analytics framework to characterize a collection of ECB through unsupervised techniques. EPICA also exploited different methods to address data dimensionality reduction and to smooth the effect of noisy data. The proposed methodological framework makes it possible to discover useful knowledge about building energy performance according to physical driving variables. This knowledge has a direct impact in facilitating both the selection of targeted retrofitting strategies and financial investment planning at the regional level. In addition, the obtained results are easily understandable and exploitable also by non-expert users. As future work, we plan to integrate supervised data mining techniques into EPICA (e.g., deep learning algorithms, decision trees and Bayesian networks) for a quick and flexible estimation of the energy performance label of buildings based on their features and a few influencing physical variables.

ACKNOWLEDGMENT

The authors express their gratitude to Giovanni Nuvoli (Settore Sviluppo Energetico Sostenibile Regione Piemonte) and to CSI Piemonte.

REFERENCES

[1] R. M. P. The Rapid Miner Project for Machine Learning. Available: http://rapid-i.com/ Last access on March 2017.
[2] A. Acquaviva, D. Apiletti, A. Attanasio, E. Baralis, L. Bottaccioli, F. B. Castagnetti, T. Cerquitelli, S. Chiusano, E. Macii, D. Martellacci, and E. Patti. Energy signature analysis: Knowledge at your fingertips. In 2015 IEEE International Congress on Big Data, New York City, NY, USA, June 27 - July 2, 2015, pages 543–550, 2015.
[3] E. Baralis, T. Cerquitelli, S. Chiusano, and A. Grand. P-mine: Parallel itemset mining on large datasets. In Workshops Proceedings of the 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 266–271, 2013.
[4] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC Press, 1984.
[5] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD Rec., 26(2):255–264, June 1997.
[6] A. Capozzoli, T. Cerquitelli, and M. S. Piscitelli. Enhancing energy efficiency in buildings through innovative data analytics technologies. In Pervasive Computing: Next Generation Platforms for Intelligent Data Collection, pages 353–389. Elsevier, 2016.
[7] A. Capozzoli, D. Grassi, M. S. Piscitelli, and G. Serale. Discovering knowledge from a residential building stock through data mining analysis for engineering sustainability. Energy Procedia, 83:370–379, 2015.
[8] A. Capozzoli, G. Serale, M. S. Piscitelli, and D. Grassi. Data mining for energy analysis of a large data set of flats. Proceedings of the Institution of Civil Engineers. Engineering Sustainability, 170:3–18, 2017.
[9] T. Cerquitelli. Predicting large scale fine grain energy consumption. Energy Procedia, 111:1079–1088, 2017.
[10] T. Cerquitelli and E. Di Corso. Characterizing thermal energy consumption through exploratory data mining algorithms. In Proceedings of the Workshops of the EDBT/ICDT 2016 Joint Conference, EDBT/ICDT Workshops 2016, Bordeaux, France, March 15, 2016, 2016.
[11] G. Dall’O, L. Sarto, N. Sanna, V. Tonetti, and M. Ventura. On the use of an energy certification database to create indicators for energy planning purposes: Application in Northern Italy. Energy Policy, 85(C):207–217, 2015.
[12] E. Di Corso, T. Cerquitelli, and F. Ventura. Self-tuning techniques for large scale cluster analysis on textual data collections. In Proceedings of the 32nd Annual ACM Symposium on Applied Computing, Marrakesh, Morocco, April 3rd-7th, 2017, pages 771–776, 2017.
[13] B.-H. Juang and L. Rabiner. The segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(9):1639–1641, Sep 1990.
[14] X. Li, L. Nie, and S. Chen. Approximate dynamic programming based data center resource dynamic scheduling for energy optimization. In IEEE iThings/GreenCom/CPSCom 2014, Taipei, Taiwan, September 1-3, 2014, pages 494–501, 2014.
[15] MathWorks. MATLAB and Simulink for Technical Computing. Available: www.mathworks.com Last access on March 2017.
[16] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2006.
[17] S. M. Ross. Introduction to probability and statistics for engineers and scientists (2. ed.). Academic Press, 2000.
[18] P. J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
[19] R. Srikant and R. Agrawal. Mining generalized association rules. In Proceedings of the 21th International Conference on Very Large Data Bases, VLDB ’95, pages 407–419, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[20] I. Takouna, E. Alzaghoul, and C. Meinel. Robust virtual machine consolidation for efficient energy and performance in virtualized data centers. In IEEE iThings/GreenCom/CPSCom 2014, Taipei, Taiwan, September 1-3, 2014, pages 470–477, 2014.
[21] C. Wu, W. Chen, Y. Tseng, L. Fu, and C. Lu. Anticipatory reasoning for a proactive context-aware energy saving system. In IEEE iThings/GreenCom/CPSCom 2014, Taipei, Taiwan, September 1-3, 2014, pages 228–234, 2014.
