# Theoretical and Empirical Analysis of ReliefF and RReliefF

A broad spectrum of successful uses calls for an especially careful investigation of the various features of Relief algorithms. In this paper we theoretically and empirically investigate and discuss how and why they work, their theoretical and practical properties, their parameters, what kind of dependencies they detect, how they scale up to large numbers of examples and features, how to sample data for them, how robust they are to noise, how irrelevant and redundant attributes influence their output, and how different metrics influence them.

Keywords: attribute estimation, feature selection, Relief algorithm, classification, regression

Robnik Šikonja and Kononenko

## 1. Introduction

The problem of estimating the quality of attributes (features) is an important issue in machine learning. Several important tasks in the process of machine learning, e.g., feature subset selection, constructive induction, and decision and regression tree building, contain the attribute estimation procedure as their (crucial) ingredient. In many learning problems there are hundreds or thousands of potential features describing each input object.

The majority of learning methods do not behave well in these circumstances because, from a statistical point of view, examples with many irrelevant but noisy features provide very little information. Feature subset selection is the task of choosing a small subset of features that ideally is necessary and sufficient to describe the target concept. To decide which features to retain and which to discard we need a reliable and practically efficient method of estimating their relevance to the target concept.

In constructive induction we face a similar problem. In order to enhance the power of the representation language and construct new knowledge we introduce new features. Typically many candidate features are generated, and again we need to decide which features to retain and which to discard. Estimating the relevance of the features to the target concept is certainly one of the major components of such a decision procedure. Decision and regression trees are popular description languages for representing knowledge in machine learning.

While constructing a tree, the learning algorithm at each interior node selects the splitting rule (feature) which divides the problem space into two separate subspaces. To select an appropriate splitting rule the learning algorithm has to evaluate several possibilities and decide which would partition the given (sub)problem most appropriately. The estimation of the quality of the splitting rules seems to be of principal importance. The problem of feature (attribute) estimation has received much attention in the literature. There are several measures for estimating attributes' quality.

If the target concept is a discrete variable (the classification problem) these are, e.g., information gain (Hunt et al., 1966), Gini index (Breiman et al., 1984), distance measure (Mantaras, 1989), j-measure (Smyth and Goodman, 1990), Relief (Kira and Rendell, 1992b), ReliefF (Kononenko, 1994), and MDL (Kononenko, 1995); the χ² and G statistics are also used. If the target concept is presented as a real valued function (numeric class and the regression problem) then the estimation heuristics are, e.g., the mean squared and the mean absolute error (Breiman et al., 1984), and RReliefF (Robnik Šikonja and Kononenko, 1997).

The majority of the heuristic measures for estimating the quality of the attributes assume the conditional (upon the target variable) independence of the attributes and are therefore less appropriate in problems which possibly involve much feature interaction. Relief algorithms (Relief, ReliefF and RReliefF) do not make this assumption. They are efficient, aware of the contextual information, and can correctly estimate the quality of attributes in problems with strong dependencies between attributes.

While Relief algorithms have commonly been viewed as feature subset selection methods that are applied in a preprocessing step before the model is learned (Kira and Rendell, 1992b) and are among the most successful preprocessing algorithms to date (Dietterich, 1997), they are actually general feature estimators and have been used successfully in a variety of settings: to select splits in the building phase of decision tree learning (Kononenko et al., 1997), to select splits and guide the constructive induction in learning of the regression trees (Robnik Šikonja and Kononenko, 1997), as an attribute weighting method (Wettschereck et al., 1997), and also in inductive logic programming (Pompe and Kononenko, 1995).

The broad spectrum of successful uses calls for an especially careful investigation of the various features Relief algorithms have: how and why they work, what kind of dependencies they detect, how they scale up to large numbers of examples and features, how to sample data for them, how robust they are to noise, how irrelevant and duplicate attributes influence their output, and what effect different metrics have. In this work we address these questions as well as some other more theoretical issues regarding attribute estimation with Relief algorithms. In Section 2 we present the Relief algorithms and discuss some theoretical issues. We conduct some experiments to illustrate these issues. We then turn (Section 3) to the practical issues on the use of ReliefF and try to answer the above questions (Section 4). Section 5 discusses applicability of Relief algorithms for various tasks.

In Section 6 we conclude with open problems on both empirical and theoretical fronts.

We assume that examples $I_1, I_2, \ldots, I_n$ in the instance space are described by a vector of attributes $A_i$, $i = 1, \ldots, a$, where $a$ is the number of explanatory attributes, and are labelled with the target value $\tau_j$. The examples are therefore points in the $a$-dimensional space. If the target value is categorical we call the modelling task classification and if it is numerical we call the modelling task regression.

    Algorithm Relief
    Input: for each training instance a vector of attribute values and the class value
    Output: the vector W of estimations of the qualities of attributes

    1. set all weights W[A] := 0.0;
    2. for i := 1 to m do begin
    3.     randomly select an instance R_i;
    4.     find nearest hit H and nearest miss M;
    5.     for A := 1 to a do
    6.         W[A] := W[A] - diff(A, R_i, H)/m + diff(A, R_i, M)/m;
    7. end;

Figure 1. Pseudo code of the basic Relief algorithm

## 2. Relief family of algorithms

In this section we describe the Relief algorithms and discuss their similarities and differences.

First we present the original Relief algorithm (Kira and Rendell, 1992b), which was limited to classification problems with two classes. We give an account of how and why it works. We discuss its extension ReliefF (Kononenko, 1994), which can deal with multiclass problems. The improved algorithm is more robust and also able to deal with incomplete and noisy data. Then we show how ReliefF was adapted for continuous class (regression) problems and describe the resulting RReliefF algorithm (Robnik Šikonja and Kononenko, 1997).

After the presentation of the algorithms we tackle some theoretical issues about what Relief output actually is.

### 2.1. Relief – basic ideas

A key idea of the original Relief algorithm (Kira and Rendell, 1992b), given in Figure 1, is to estimate the quality of attributes according to how well their values distinguish between instances that are near to each other. For that purpose, given a randomly selected instance $R_i$ (line 3), Relief searches for its two nearest neighbors: one from the same class, called nearest hit H, and the other from the different class, called nearest miss M (line 4).

It updates the quality estimation W[A] for all attributes A depending on their values for $R_i$, M, and H (lines 5 and 6). If instances $R_i$ and H have different values of the attribute A then the attribute A separates two instances with the same class, which is not desirable, so we decrease the quality estimation W[A]. On the other hand, if instances $R_i$ and M have different values of the attribute A then the attribute A separates two instances with different class values, which is desirable, so we increase the quality estimation W[A]. The whole process is repeated m times, where m is a user-defined parameter.

    Algorithm ReliefF
    Input: for each training instance a vector of attribute values and the class value
    Output: the vector W of estimations of the qualities of attributes

    1.  set all weights W[A] := 0.0;
    2.  for i := 1 to m do begin
    3.      randomly select an instance R_i;
    4.      find k nearest hits H_j;
    5.      for each class C ≠ class(R_i) do
    6.          from class C find k nearest misses M_j(C);
    7.      for A := 1 to a do
    8.          W[A] := W[A] - sum_{j=1..k} diff(A, R_i, H_j)/(m·k) +
    9.              sum_{C ≠ class(R_i)} [ P(C)/(1 - P(class(R_i))) · sum_{j=1..k} diff(A, R_i, M_j(C)) ]/(m·k);
    10. end;

Figure 2. Pseudo code of ReliefF algorithm

Function diff(A, $I_1$, $I_2$) calculates the difference between the values of the attribute A for two instances $I_1$ and $I_2$. For nominal attributes it was originally defined as:

$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0; & \mathrm{value}(A, I_1) = \mathrm{value}(A, I_2) \\ 1; & \text{otherwise} \end{cases} \tag{1}$$

and for numerical attributes as:

$$\mathrm{diff}(A, I_1, I_2) = \frac{|\mathrm{value}(A, I_1) - \mathrm{value}(A, I_2)|}{\max(A) - \min(A)} \tag{2}$$

The function diff is also used for calculating the distance between instances to find the nearest neighbors. The total distance is simply the sum of distances over all attributes (Manhattan distance).
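The basic Relief loop of Figure 1, together with the diff function of Equations (1) and (2), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `relief`, `diff`, `ranges` and `seed` are ours, instances are plain tuples, every class is assumed to contain at least two instances, and ties between equally near neighbors are broken by index order rather than at random.

```python
import random

def diff(a, x, y, value_range=None):
    """Difference in attribute a between instances x and y, in [0, 1]."""
    if value_range is None:                      # nominal attribute, Equation (1)
        return 0.0 if x[a] == y[a] else 1.0
    lo, hi = value_range                         # numerical attribute, Equation (2)
    return abs(x[a] - y[a]) / (hi - lo) if hi > lo else 0.0

def relief(X, y, m, ranges=None, seed=0):
    """Basic two-class Relief; returns one weight per attribute."""
    rng = random.Random(seed)
    n_attr = len(X[0])
    # ranges[a] is None for a nominal attribute, (min, max) for a numerical one
    ranges = ranges if ranges is not None else [None] * n_attr

    def distance(u, v):                          # Manhattan distance built from diff
        return sum(diff(a, u, v, ranges[a]) for a in range(n_attr))

    W = [0.0] * n_attr
    for _ in range(m):
        i = rng.randrange(len(X))                # line 3: random instance R_i
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        H = X[min(hits, key=lambda j: distance(X[i], X[j]))]    # line 4: nearest hit
        M = X[min(misses, key=lambda j: distance(X[i], X[j]))]  # line 4: nearest miss
        for a in range(n_attr):                  # lines 5 and 6: weight update
            W[a] += (diff(a, X[i], M, ranges[a]) - diff(a, X[i], H, ranges[a])) / m
    return W

# Toy data (hypothetical): attribute 0 determines the class, attribute 1 is irrelevant.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]
print(relief(X, y, m=4))
```

On this toy data the first attribute always differs at the nearest miss and never at the nearest hit, so it receives the maximal weight, while the irrelevant attribute is penalized.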

The original Relief can deal with nominal and numerical attributes. However, it cannot deal with incomplete data and is limited to two-class problems. Its extension, which solves these and other problems, is called ReliefF.

### 2.2. ReliefF – extension

The ReliefF (Relief-F) algorithm (Kononenko, 1994) (see Figure 2) is not limited to two-class problems, is more robust, and can deal with incomplete and noisy data. Similarly to Relief, ReliefF randomly selects an instance $R_i$ (line 3), but then searches for k of its nearest neighbors from the same class, called nearest hits $H_j$ (line 4), and also k nearest neighbors from each of the different classes, called nearest misses $M_j(C)$ (lines 5 and 6). It updates the quality estimation W[A] for all attributes A depending on their values for $R_i$, hits $H_j$ and misses $M_j(C)$ (lines 7, 8 and 9). The update formula is similar to that of Relief (lines 5 and 6 in Figure 1), except that we average the contribution of all the hits and all the misses. The contribution for each class of the misses is weighted with the prior probability of that class P(C) (estimated from the training set).
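The class weights applied to the misses in line 9 of Figure 2 can be illustrated with a small helper. This is a hedged sketch: the function name `miss_class_weights` is hypothetical, and the priors are simply the relative class frequencies in the list of labels passed in.

```python
from collections import Counter

def miss_class_weights(labels, current_class):
    """Weight P(C) / (1 - P(class(R_i))) for every class C other than class(R_i)."""
    n = len(labels)
    prior = {c: count / n for c, count in Counter(labels).items()}
    remaining = 1.0 - prior[current_class]   # total prior mass of the miss classes
    return {c: p / remaining for c, p in prior.items() if c != current_class}

# Hypothetical class distribution: 5 instances of 'a', 3 of 'b', 2 of 'c'.
weights = miss_class_weights(["a"] * 5 + ["b"] * 3 + ["c"] * 2, "a")
print(weights)
```

Dividing by the remaining prior mass makes the weights of the miss classes sum to one, so the averaged miss contribution stays on the same [0, 1] scale as the hit contribution.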

Since we want the contributions of hits and misses in each step to be in [0, 1] and also symmetric (we explain reasons for that below) we have to ensure that the misses' probability weights sum to 1. As the class of hits is missing in the sum we have to divide each probability weight by the factor $1 - P(\mathrm{class}(R_i))$ (which represents the sum of probabilities for the misses' classes). The process is repeated m times.

Selection of k hits and misses is the basic difference to Relief and ensures greater robustness of the algorithm concerning noise. The user-defined parameter k controls the locality of the estimates. For most purposes it can be safely set to 10 (see (Kononenko, 1994) and discussion below).

To deal with incomplete data we change the diff function. Missing values of attributes are treated probabilistically. We calculate the probability that two given instances have different values for the given attribute, conditioned on the class value:

• if one instance (e.g., $I_1$) has unknown value:

$$\mathrm{diff}(A, I_1, I_2) = 1 - P(\mathrm{value}(A, I_2) \mid \mathrm{class}(I_1)) \tag{3}$$

• if both instances have unknown value:

$$\mathrm{diff}(A, I_1, I_2) = 1 - \sum_{V}^{\#\mathrm{values}(A)} \bigl( P(V \mid \mathrm{class}(I_1)) \times P(V \mid \mathrm{class}(I_2)) \bigr) \tag{4}$$

Conditional probabilities are approximated with relative frequencies from the training set.

### 2.3. RReliefF – in regression

We finish the description of the algorithmic family with RReliefF (Regressional ReliefF) (Robnik Šikonja and Kononenko, 1997). First we theoretically explain what the Relief algorithm actually computes. Relief's estimate W[A] of the quality of attribute A is an approximation of the following difference of probabilities (Kononenko, 1994):

$$W[A] = P(\text{diff. value of } A \mid \text{nearest inst. from diff. class}) - P(\text{diff. value of } A \mid \text{nearest inst. from same class}) \tag{5}$$

The positive updates of the weights (line 6 in Figure 1 and line 9 in Figure 2) are actually forming the estimate of the probability that the attribute discriminates between the instances with different class values, while the negative updates (line 6 in Figure 1 and line 8 in Figure 2) are forming the probability that the attribute separates the instances with the same class value.

In regression problems the predicted value $\tau(\cdot)$ is continuous, therefore (nearest) hits and misses cannot be used. To solve this difficulty, instead of requiring the exact knowledge of whether two instances belong to the same class or not, a kind of probability that the predicted values of two instances are different is introduced. This probability can be modelled with the relative distance between the predicted (class) values of two instances. Still, to estimate W[A] in (5), information about the sign of each contributed term is missing (where do hits end and misses start). In the following derivation Equation (5) is reformulated, so that it can be directly evaluated using the probability that predicted values of two instances are different.

If we rewrite

$$P_{diffA} = P(\text{different value of } A \mid \text{nearest instances}) \tag{6}$$

$$P_{diffC} = P(\text{different prediction} \mid \text{nearest instances}) \tag{7}$$

and

$$P_{diffC|diffA} = P(\text{diff. prediction} \mid \text{diff. value of } A \text{ and nearest instances}) \tag{8}$$

we obtain from (5) using Bayes' rule:

$$W[A] = \frac{P_{diffC|diffA}\, P_{diffA}}{P_{diffC}} - \frac{(1 - P_{diffC|diffA})\, P_{diffA}}{1 - P_{diffC}} \tag{9}$$

Therefore, we can estimate W[A] by approximating terms defined by Equations 6, 7 and 8. This can be done by the algorithm in Figure 3. Similarly to ReliefF we select a random instance $R_i$ (line 3) and its k nearest instances $I_j$ (line 4).

The weights for different prediction value $\tau(\cdot)$ (line 6), different attribute (line 8), and different prediction & different attribute (lines 9 and 10) are collected in $N_{dC}$, $N_{dA}[A]$, and $N_{dC\&dA}[A]$, respectively. The final estimation of each attribute W[A] (Equation (9)) is computed in lines 14 and 15.

    Algorithm RReliefF
    Input: for each training instance a vector of attribute values x and predicted value τ(x)
    Output: vector W of estimations of the qualities of attributes

    1.  set all N_dC, N_dA[A], N_dC&dA[A], W[A] to 0;
    2.  for i := 1 to m do begin
    3.      randomly select instance R_i;
    4.      select k instances I_j nearest to R_i;
    5.      for j := 1 to k do begin
    6.          N_dC := N_dC + diff(τ(·), R_i, I_j) · d(i, j);
    7.          for A := 1 to a do begin
    8.              N_dA[A] := N_dA[A] + diff(A, R_i, I_j) · d(i, j);
    9.              N_dC&dA[A] := N_dC&dA[A] + diff(τ(·), R_i, I_j) ·
    10.                 diff(A, R_i, I_j) · d(i, j);
    11.         end;
    12.     end;
    13. end;
    14. for A := 1 to a do
    15.     W[A] := N_dC&dA[A]/N_dC - (N_dA[A] - N_dC&dA[A])/(m - N_dC);

Figure 3. Pseudo code of RReliefF algorithm

The term d(i, j) in Figure 3 (lines 6, 8 and 10) takes into account the distance between the two instances $R_i$ and $I_j$. The rationale is that closer instances should have greater influence, so we exponentially decrease the influence of the instance $I_j$ with the distance from the given instance $R_i$:

$$d(i, j) = \frac{d_1(i, j)}{\sum_{l=1}^{k} d_1(i, l)} \tag{10}$$

and

$$d_1(i, j) = e^{-\left(\frac{\mathrm{rank}(R_i, I_j)}{\sigma}\right)^2} \tag{11}$$

where $\mathrm{rank}(R_i, I_j)$ is the rank of the instance $I_j$ in a sequence of instances ordered by the distance from $R_i$ and $\sigma$ is a user-defined parameter controlling the influence of the distance. Since we want to stick to the probabilistic interpretation of the results we normalize the contribution of each of the k nearest instances by dividing it by the sum of all k contributions.
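The accumulation in Figure 3 and the rank-based weights of Equations (10) and (11) can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `rank_weights` and `rrelieff` are ours, all attributes are assumed numerical, ties between equally distant neighbors are broken by index order, and the data are assumed to be such that $N_{dC}$ is neither 0 nor m.

```python
import math
import random

def rank_weights(k, sigma):
    """d(i, j) for ranks 1..k: Equation (11) normalized as in Equation (10)."""
    d1 = [math.exp(-((rank / sigma) ** 2)) for rank in range(1, k + 1)]
    total = sum(d1)
    return [value / total for value in d1]

def rrelieff(X, tau, m, k, sigma, seed=0):
    """RReliefF for numerical attributes; returns one weight per attribute."""
    rng = random.Random(seed)
    n, n_attr = len(X), len(X[0])
    lo = [min(x[a] for x in X) for a in range(n_attr)]
    hi = [max(x[a] for x in X) for a in range(n_attr)]
    t_lo, t_hi = min(tau), max(tau)

    def diff_attr(a, u, v):                      # Equation (2)
        return abs(u[a] - v[a]) / (hi[a] - lo[a]) if hi[a] > lo[a] else 0.0

    def diff_tau(i, j):                          # same normalization for the prediction
        return abs(tau[i] - tau[j]) / (t_hi - t_lo) if t_hi > t_lo else 0.0

    d = rank_weights(k, sigma)
    n_dc, n_da, n_dcda = 0.0, [0.0] * n_attr, [0.0] * n_attr
    for _ in range(m):
        i = rng.randrange(n)                     # line 3
        nearest = sorted(                        # line 4: k nearest by Manhattan distance
            (j for j in range(n) if j != i),
            key=lambda j: sum(diff_attr(a, X[i], X[j]) for a in range(n_attr)),
        )[:k]
        for rank0, j in enumerate(nearest):      # lines 5 to 12
            dc = diff_tau(i, j)
            n_dc += dc * d[rank0]
            for a in range(n_attr):
                da = diff_attr(a, X[i], X[j])
                n_da[a] += da * d[rank0]
                n_dcda[a] += dc * da * d[rank0]
    # lines 14 and 15: Equation (9) evaluated from the accumulated weights
    return [n_dcda[a] / n_dc - (n_da[a] - n_dcda[a]) / (m - n_dc) for a in range(n_attr)]

# Toy regression problem (hypothetical): the prediction equals the first attribute.
X = [(i, j) for i in range(5) for j in range(5)]
tau = [x[0] for x in X]
print(rrelieff(X, tau, m=25, k=4, sigma=2.0))
```

Because the per-instance weights sum to one, each of the m iterations contributes a total weight of one, which is why line 15 can use m as the normalizing total.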

The reason for using ranks instead of actual distances is that actual distances are problem dependent, while by using ranks we assure that the nearest (and each subsequent) instance always has the same impact on the weights. ReliefF was using a constant influence of all k nearest instances $I_j$ from the instance $R_i$. For this we should define $d_1(i, j) = 1/k$. Discussion about different distance functions can be found in following sections.

### 2.4. Computational complexity

For n training instances and a attributes Relief (Figure 1) makes O(m · n · a) operations.

The most complex operation is selection of the nearest hit and miss, as we have to compute the distances between R and all the other instances, which takes O(n · a) comparisons. Although ReliefF (Figure 2) and RReliefF (Figure 3) look more complicated, their asymptotic complexity is the same as that of original Relief, i.e., O(m · n · a). The most complex operation within the main for loop is selection of k nearest instances. For it we have to compute distances from all the instances to R, which can be done in O(n · a) steps for n instances.

This is the most complex operation, since O(n) is needed to build a heap, from which k nearest instances are extracted in O(k log n) steps, but this is less than O(n · a). The k-d (k-dimensional) tree data structure (Bentley, 1975; Sedgewick, 1990) is a generalization of the binary search tree, which instead of one key uses k keys (dimensions). The root of the tree contains all the instances. Each interior node has two successors and splits instances recursively into two groups according to one of k dimensions. The recursive splitting stops when there are fewer than a predefined number of instances in a node.

For n instances we can build the tree, where the split on each dimension maximizes the variance in that dimension and instances are divided into groups of approximately the same size, in time proportional to O(k · n · log n). With such a tree, called an optimized k-d tree, we can find t nearest instances to the given instance in O(log n) steps (Friedman et al., 1975). If we use a k-d tree to implement the search for nearest instances we can reduce the complexity of all three algorithms to O(a · n · log n) (Robnik Šikonja, 1998). For Relief we first build the optimized k-d tree (outside the main loop) in O(a · n · log n) steps, so we need only O(m · a) steps in the loop, and the total complexity of the algorithm is now the complexity of the preprocessing, which is O(a · n · log n). The required sample size m is related to the problem complexity (and not to the number of instances) and is typically much more than log n, so asymptotically we have reduced the complexity of the algorithm. Also, it does not make sense to use a sample size m larger than the number of instances n. The computational complexity of ReliefF and RReliefF using k-d trees is the same as that of Relief.

They need O(a · n · log n) steps to build the k-d tree, and in the main loop they select t nearest neighbors in log n steps and update weights in O(t · a), but O(m(t · a + log n)) is asymptotically less than the preprocessing, which means that the complexity is reduced to O(a · n · log n). This analysis shows that the Relief family of algorithms is actually in the same order of complexity as multikey sort algorithms. Several authors have observed that the use of k-d trees becomes inefficient with increasing number of attributes (Friedman et al., 1975; Deng and Moore, 1995; Moore et al., 1997) and this was confirmed for the Relief family of algorithms as well (Robnik Šikonja, 1998).

Kira and Rendell (Kira and Rendell, 1992b) consider m an arbitrarily chosen constant and claim that the complexity of Relief is O(a · n). If we accept their argument then the complexity of ReliefF and RReliefF is also O(a · n), and the above analysis using k-d trees is useless. However, if we want to obtain sensible and reliable results with Relief algorithms then the required sample size m is related to the problem complexity and is not constant, as we will show below.

### 2.5. General framework of Relief algorithms

By rewriting Equation (5) into a form suitable also for regression

$$W[A] = P(\text{diff. value of } A \mid \text{near inst. with diff. prediction}) - P(\text{diff. value of } A \mid \text{near inst. with same prediction}) \tag{12}$$

we see that we are actually dealing with (dis)similarities of attributes and prediction values (of near instances). A generalization of Relief algorithms would take into account the similarity of the predictions $\tau$ and of the attributes A and combine them into a generalized weight:

$$W_G[A] = \sum_{I_1, I_2 \in \mathcal{I}} \mathrm{similarity}(\tau, I_1, I_2) \cdot \mathrm{similarity}(A, I_1, I_2) \tag{13}$$

where $I_1$ and $I_2$ are appropriate samples drawn from the instance population $\mathcal{I}$. If we use a [0, 1] normalized similarity function (like, e.g., diff) then with these weights we can model the following probabilities:

• P(similar A | similar τ), P(dissimilar A | similar τ), and
• P(similar A | dissimilar τ), P(dissimilar A | dissimilar τ).

In the probabilistic framework we can write:

$$P(\text{similar } A \mid \text{similar } \tau) + P(\text{dissimilar } A \mid \text{similar } \tau) = 1 \tag{14}$$

$$P(\text{similar } A \mid \text{dissimilar } \tau) + P(\text{dissimilar } A \mid \text{dissimilar } \tau) = 1 \tag{15}$$

so it is sufficient to compute one of each pair of probabilities from above and still get all the information. Let us think for a moment what we intuitively expect from a good attribute estimator. In our opinion good attributes separate instances with different prediction values and do not separate instances with close prediction values. These considerations are fulfilled by taking one term from each group of probabilities from above and combining them in a sensible way. If we rewrite Relief's weight from Equation (12):

$$W[A] = 1 - P(\text{similar } A \mid \text{dissimilar } \tau) - 1 + P(\text{similar } A \mid \text{similar } \tau) = P(\text{similar } A \mid \text{similar } \tau) - P(\text{similar } A \mid \text{dissimilar } \tau)$$

we see that this is actually what Relief algorithms do: they reward an attribute for not separating similar prediction values and punish it for not separating different prediction values. The similarity function used by Relief algorithms is $\mathrm{similarity}(A, I_1, I_2) = -\mathrm{diff}(A, I_1, I_2)$, which enables an intuitive probability based interpretation of results. We could get variations of the Relief estimator by taking different similarity functions and by combining the computed probabilities in a different way.

For example, the Contextual Merit (CM) algorithm (Hong, 1997) uses only the instances with different prediction values and therefore it takes only the first term of Equation (12) into account. As a result CM only rewards an attribute if it separates different prediction values and ignores the additional information which the similar prediction values offer. Consequently CM is less sensitive than Relief algorithms are: e.g., in parity problems with three important attributes CM separates important from unimportant attributes by a factor of 2 to 5, and only 1.05 to 1.9 with numerical attributes, while with ReliefF under the same conditions this factor is over 100. Part of the trouble CM has with numerical attributes also comes from the fact that it does not take the second term into account, namely it does not punish attributes for separating similar prediction values. As numerical attributes are very likely to do that, CM has to use other techniques to confront this effect.

### 2.6. Relief and impurity functions

Estimations of Relief algorithms are strongly related to impurity functions (Kononenko, 1994). When the number of nearest neighbors increases, i.e., when we eliminate the requirement that the selected instance is the nearest, Equation (5) becomes

$$W[A] = P(\text{different value of } A \mid \text{different class}) - P(\text{different value of } A \mid \text{same class})$$

If we rewrite

$$P_{eqval} = P(\text{equal value of } A)$$

$$P_{samecl} = P(\text{same class})$$

$$P_{samecl|eqval} = P(\text{same class} \mid \text{equal value of } A)$$

we obtain using Bayes' rule:

$$W[A] = \frac{P_{samecl|eqval}\, P_{eqval}}{P_{samecl}} - \frac{(1 - P_{samecl|eqval})\, P_{eqval}}{1 - P_{samecl}} \tag{16}$$

For sampling with replacement in the strict sense the following equalities hold:

$$P_{samecl} = \sum_{C} P(C)^2$$

$$P_{samecl|eqval} = \sum_{V} \left( \frac{P(V)^2}{\sum_{V} P(V)^2} \times \sum_{C} P(C \mid V)^2 \right)$$

Using the above equalities we obtain:

$$W[A] = \frac{P_{eqval} \times \mathit{Ginigain}(A)}{P_{samecl}\,(1 - P_{samecl})} \tag{17}$$

where

$$\mathit{Ginigain}(A) = \sum_{V} \left( \frac{P(V)^2}{\sum_{V} P(V)^2} \times \sum_{C} P(C \mid V)^2 \right) - \sum_{C} P(C)^2 \tag{18}$$

is highly correlated with the Gini-index gain (Breiman et al., 1984) for classes C and values V of attribute A. The difference is that instead of the factor

$$\frac{P(V)^2}{\sum_{V} P(V)^2}$$

the Gini-index gain uses

$$\frac{P(V)}{\sum_{V} P(V)} = P(V)$$

Equation (17) (which we call myopic ReliefF) shows strong correlation of Relief's weights with the Gini-index gain. The probability $P_{eqval} = \sum_{V} P(V)^2$ that two instances have the same value of attribute A in Equation (17) is a kind of normalization factor for multi-valued attributes. Impurity functions tend to overestimate multi-valued attributes and various normalization heuristics are needed to avoid this tendency (e.g., gain ratio (Quinlan, 1986), distance measure (Mantaras, 1989), and binarization of attributes (Cestnik et al., 1987)). Equation (17) shows that Relief exhibits an implicit normalization effect. Another deficiency of Gini-index gain is that its values tend to decrease with the increasing number of classes.
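Equations (17) and (18) can be evaluated directly from relative frequencies. The following sketch is only an illustration under stated assumptions: the function name `myopic_relief` is ours, the attribute is nominal, probabilities are estimated as relative frequencies of the given sample, and at least two classes must be present so that the denominator of Equation (17) is nonzero.

```python
from collections import Counter

def myopic_relief(values, classes):
    """Myopic ReliefF estimate of Equation (17) for one nominal attribute."""
    n = len(values)
    p_v = {v: count / n for v, count in Counter(values).items()}
    p_c = {c: count / n for c, count in Counter(classes).items()}
    joint = Counter(zip(values, classes))

    p_eqval = sum(p * p for p in p_v.values())    # P(equal value of A)
    p_samecl = sum(p * p for p in p_c.values())   # P(same class)
    # P(same class | equal value): each value V is weighted by P(V)^2 / sum_V P(V)^2
    p_samecl_eqval = sum(
        (p_v[v] ** 2 / p_eqval)
        * sum((joint[(v, c)] / (n * p_v[v])) ** 2 for c in p_c)
        for v in p_v
    )
    gini_gain = p_samecl_eqval - p_samecl         # Equation (18)
    return p_eqval * gini_gain / (p_samecl * (1.0 - p_samecl))

# Hypothetical two-class sample: one attribute that determines the class, one that does not.
classes = ["a", "a", "b", "b"]
print(myopic_relief([0, 0, 1, 1], classes))
print(myopic_relief([0, 1, 0, 1], classes))
```

An attribute that determines the class obtains the maximal estimate, while an attribute that is independent of the class obtains zero, in line with the reading of Equation (17) as a normalized impurity gain.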

The denominator, which is a constant factor in Equation (17) for a given attribute, again serves as a kind of normalization, and therefore Relief's estimates do not exhibit such strange behavior as Gini-index gain does. This normalization effect remains even if Equation (17) is used as a (myopic) attribute estimator. The detailed bias analysis of various attribute estimation algorithms including Gini-index gain and myopic ReliefF can be found in (Kononenko, 1995). The above derivation eliminated from the probabilities the condition that the instances are the nearest.

If we put it back we can interpret Relief's estimates as the average over local estimates in smaller parts of the instance space. This enables Relief to take into account the context of other attributes, i.e., the conditional dependencies between the attributes given the predicted value, which can be detected in the context of locality. From the global point of view, these dependencies are hidden due to the effect of averaging over all training instances, and exactly this makes the impurity functions myopic.

The impurity functions use correlation between the attribute and the class disregarding the context of other attributes. This is the same as using the global point of view and disregarding local peculiarities. The power of Relief is its ability to exploit information locally, taking the context into account, but still to provide the global view.

Figure 4. ReliefF's estimates of informative attribute are deteriorating with increasing number of nearest neighbors in parity domain. (Plot: ReliefF's estimate against the number of nearest neighbors, with one curve for an informative and one for a random attribute.)

We illustrate this in Figure 4, which shows the dependency of ReliefF's estimate on the number of nearest neighbors taken into account. The estimates are for the parity problem with two informative attributes, 10 random attributes, and 200 examples. The dotted line shows how ReliefF's estimate of one of the informative attributes becomes more and more myopic with the increasing number of the nearest neighbors and how the informative attribute eventually becomes indistinguishable from the unimportant attributes.

The negative estimate of random attributes with small numbers of neighbors is a consequence of a slight asymmetry between hits and misses. Recall that the Relief algorithm (Figure 1) randomly selects an instance R and its nearest instance from the same class H and from a different class M. Random attributes with different values at R and H get a negative update. Random attributes with different values at R and M get a positive update. With a larger number of nearest neighbors the positive and negative updates are equiprobable and the quality estimates of random attributes are zero.

The miss has a different class value, therefore there has to be at least some difference also in the values of the important attributes. The sum of the differences in the values of attributes forms the distance, therefore if there is a difference in the values of the important attribute and also in the values of some random attributes, such instances are less likely to be in the nearest neighborhood. This is especially so when we are considering only a small number of nearest instances.

The positive update of random attributes is therefore less likely than the negative update and the total sum of all updates is slightly negative.

### 2.7. Relief's weights as the portion of explained concept changes

We analyze the behavior of Relief when the number of examples approaches infinity, i.e., when the problem space is densely covered with examples. We present the necessary definitions and prove that Relief's quality estimates can be interpreted as the ratio between the number of the explained changes in the concept and the number of examined instances. The exact form of this property differs between Relief, ReliefF and RReliefF. We start with Relief and claim that in a classification problem, as the number of examples goes to infinity, Relief's weights for each attribute converge to the ratio between the number of class label changes the attribute is responsible for and the number of examined instances. If a certain change can be explained in several different ways, all the ways share the credit for it in the quality estimate. If several attributes are involved in one way of the explanation, all of them get the credit in their quality estimate.

We formally present the definitions and the property.

DEFINITION 2.1. Let $B(I)$ be the set of instances from $\mathcal{I}$ nearest to the instance $I \in \mathcal{I}$ which have different prediction value $\tau$ than $I$:

$$B(I) = \{Y \in \mathcal{I};\ \mathrm{diff}(\tau, I, Y) > 0 \ \wedge\ Y = \arg\min_{Y \in \mathcal{I}} \delta(I, Y)\} \tag{19}$$

Let $b(I)$ be a single instance from the set $B(I)$ and $p(b(I))$ the probability that it is randomly chosen from $B(I)$. Let $A(I, b(I))$ be the set of attributes with different values at instances $I$ and $b(I)$:

$$A(I, b(I)) = \{A \in \mathcal{A};\ b(I) \in B(I) \ \wedge\ \mathrm{diff}(A, I, b(I)) > 0\} \tag{20}$$

We say that attributes $A \in A(I, b(I))$ are responsible for the change of the predicted value of the instance $I$ to the predicted value of $b(I)$, as the change of their values is one of the minimal number of changes required for changing the predicted value of $I$ to $b(I)$. If the sets $A(I, b(I))$ are different we say that there are different ways to explain the changes of the predicted value of $I$ to the predicted value of $b(I)$. The probability of a certain way is equal to the probability that $b(I)$ is selected from $B(I)$. Let $A(I)$ be the union of sets $A(I, b(I))$:

$$A(I) = \bigcup_{b(I) \in B(I)} A(I, b(I)) \tag{21}$$

We say that the attributes $A \in A(I)$ are responsible for the change of the predicted value of the instance $I$, as the change of their values is the minimal necessary change of the attributes' values of $I$ required to change its predicted value. Let the quantity of this responsibility take into account the change of the predicted value and the change of the attribute:

$$r_A(I, b(I)) = p(b(I)) \cdot \mathrm{diff}(\tau, I, b(I)) \cdot \mathrm{diff}(A, I, b(I)) \tag{22}$$

The ratio between the responsibility of the attribute A for the predicted values of the set of cases S and the cardinality m of that set is therefore:

$$R_A = \frac{1}{m} \sum_{I \in S} r_A(I, b(I)) \tag{23}$$

PROPERTY 2.1. Let the concept be described with the attributes $A \in \mathcal{A}$ and n noiseless instances $I \in \mathcal{I}$; let $S \subseteq \mathcal{I}$ be the set of randomly selected instances used by Relief (line 3 in Figure 1) and let m be the cardinality of that set. If Relief randomly selects the nearest instances from all possible nearest instances then for its quality estimate W[A] the following property holds:

$$\lim_{n \to \infty} W[A] = R_A \tag{24}$$

The quality estimate of the attribute can therefore be explained as the ratio of the predicted value changes the attribute is responsible for to the number of the examined instances.

Proof.

Equation (24) can be explained if we look into the spatial representation. There are a number of different characteristic regions of the problem space which we usually call peaks. The Relief algorithm selects an instance $R$ from $S$, compares its attribute value and predicted value with those of its nearest instances selected from the set $\mathcal{I}$ (line 6 on Figure 1), and then updates the quality estimates according to these values. For Relief this means: $W[A] := W[A] + \mathrm{diff}(A, R, M)/m - \mathrm{diff}(A, R, H)/m$, where $M$ is the nearest instance from the different class and $H$ is the nearest instance from the same class.

When the number of examples is sufficient ($n \to \infty$), $H$ must be from the same characteristic region as $R$ and its attribute values converge to the values of the instance $R$. The contribution of the term $-\mathrm{diff}(A, R, H)$ to $W[A]$ in the limit is therefore 0. Only the terms $\mathrm{diff}(A, R, M)$ contribute to $W[A]$. The instance $M$ is a randomly selected nearest instance with a different prediction than $R$, therefore in noiseless problems there must be at least some difference in the values of the attributes, and $M$ is therefore an instance of $b(R)$ selected with probability $p(M)$.

As $M$ has a different prediction value than $R$, the value $\mathrm{diff}(\tau, R, M) = 1$. The attributes with different values at $R$ and $M$ constitute the set $A(R, M)$. The contribution of $M$ to $W[A]$ for the attributes from $A(R, M)$ equals $\mathrm{diff}(A, R, M)/m = \mathrm{diff}(\tau, R, M) \cdot \mathrm{diff}(A, R, M)/m$ with probability $p(M)$. Relief selects $m$ instances $I \in S$ and for each $I$ randomly selects its nearest miss $b(I)$ with probability $p(b(I))$. The sum of updates of $W[A]$ for each attribute is therefore $\sum_{I \in S} p(b(I))\, \mathrm{diff}(\tau, I, b(I))\, \mathrm{diff}(A, I, b(I))/m = R_A$, and this concludes the proof.
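The update rule used in the proof can be made concrete with a minimal sketch of basic two-class Relief for nominal attributes. This is an illustration under stated assumptions, not the authors' implementation: the names `diff`, `distance`, `nearest` and `relief` are hypothetical, ties between equally near instances are broken by index rather than randomly, every class is assumed to contain at least two instances, and the instances to examine are passed explicitly as `sample` so that the run is reproducible.

```python
def diff(a, b):
    """Nominal diff: 0 if the two attribute values are equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def distance(x, y):
    """Sum of per-attribute differences between two instances."""
    return sum(diff(a, b) for a, b in zip(x, y))


def nearest(X, r, candidates):
    """Index of the candidate closest to instance r (first one wins ties)."""
    return min(candidates, key=lambda j: distance(X[r], X[j]))


def relief(X, y, sample):
    """Two-class Relief: W[A] += diff(A,R,M)/m - diff(A,R,H)/m for each R."""
    m = len(sample)
    W = [0.0] * len(X[0])
    for r in sample:
        hits = [j for j in range(len(X)) if j != r and y[j] == y[r]]
        misses = [j for j in range(len(X)) if y[j] != y[r]]
        h = nearest(X, r, hits)      # nearest hit H
        mi = nearest(X, r, misses)   # nearest miss M
        for a in range(len(W)):
            W[a] += diff(X[r][a], X[mi][a]) / m - diff(X[r][a], X[h][a]) / m
    return W


# Toy problem: the class equals the first attribute, the second is irrelevant.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]
W = relief(X, y, sample=[0, 1, 2, 3])
```

In this toy problem the first attribute differs on every nearest miss and never on a nearest hit, so it receives the weight 1, while the second attribute differs only on nearest hits and receives the weight -1.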

Let us show an example which illustrates the idea. We have a Boolean problem where the class value is defined as $\tau = (A_1 \wedge A_2) \vee (A_1 \wedge A_3)$. Table I gives a tabular description of the problem. The rightmost column shows which of the attributes is responsible for the change of the predicted value.

Table I. Tabular description of the concept $\tau = (A_1 \wedge A_2) \vee (A_1 \wedge A_3)$ and the responsibility of the attributes for the change of the predicted value.

| line | A1 | A2 | A3 | τ | responsible attributes |
|------|----|----|----|---|------------------------|
| 1 | 1 | 1 | 1 | 1 | A1 |
| 2 | 1 | 1 | 0 | 1 | A1 or A2 |
| 3 | 1 | 0 | 1 | 1 | A1 or A3 |
| 4 | 1 | 0 | 0 | 0 | A2 or A3 |
| 5 | 0 | 1 | 1 | 0 | A1 |
| 6 | 0 | 1 | 0 | 0 | A1 |
| 7 | 0 | 0 | 1 | 0 | A1 |
| 8 | 0 | 0 | 0 | 0 | (A1, A2) or (A1, A3) |

In line 1 we say that $A_1$ is responsible for the class assignment because changing its value to 0 would change $\tau$ to 0, while changing only one of $A_2$ or $A_3$ would leave $\tau$ unchanged. In line 2 changing either $A_1$ or $A_2$ would change $\tau$ too, so $A_1$ and $A_2$ represent two ways to change $\tau$ and also share the responsibility. Similarly we explain lines 3 to 7, while in line 8 changing only one attribute is not enough for $\tau$ to change. However, changing $A_1$ and $A_2$, or $A_1$ and $A_3$, changes $\tau$. Therefore the minimal number of required changes is 2 and the credit (and the updates in the algorithm) goes to both $A_1$ and $A_2$, or to both $A_1$ and $A_3$. There are 8 peaks in this problem which are equiprobable, so $A_1$ gets the estimate $\frac{4 + 2 \cdot \frac{1}{2} + 2 \cdot \frac{1}{2}}{8} = \frac{3}{4} = 0.75$ (it is alone responsible for lines 1, 5, 6, and 7, shares the credit for lines 2 and 3, and cooperates in both credits for line 8). $A_2$ (and similarly $A_3$) gets the estimate $\frac{2 \cdot \frac{1}{2} + \frac{1}{2}}{8} = \frac{3}{16} = 0.1875$ (it shares the responsibility for lines 2 and 4 and cooperates in one half of line 8). Figure 5 shows the estimates of the quality of the attributes for this problem for Relief (and also ReliefF).
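The arithmetic of this example can be checked by enumerating the whole problem space. The sketch below is only an illustration of Equations (22) and (23): the names `tau` and `responsibilities` are hypothetical, the nearest misses are assumed to be equiprobable (so $p(b(I)) = 1/|B(I)|$), and the sample $S$ is taken to be all 8 instances.

```python
from fractions import Fraction
from itertools import product


def tau(x):
    """Concept from Table I: (A1 and A2) or (A1 and A3)."""
    a1, a2, a3 = x
    return int((a1 and a2) or (a1 and a3))


def responsibilities(concept, n_attrs):
    """R_A of Equation (23) with S equal to the complete Boolean space."""
    instances = list(product([0, 1], repeat=n_attrs))
    total = [Fraction(0)] * n_attrs
    for inst in instances:
        misses = [y for y in instances if concept(y) != concept(inst)]
        dists = [sum(a != b for a, b in zip(inst, y)) for y in misses]
        d_min = min(dists)
        nearest = [y for y, d in zip(misses, dists) if d == d_min]  # B(I)
        p = Fraction(1, len(nearest))  # each nearest miss equally likely
        for y in nearest:
            for k in range(n_attrs):
                if inst[k] != y[k]:
                    total[k] += p      # diff(tau) = diff(A) = 1 here
    return [t / len(instances) for t in total]


R = responsibilities(tau, 3)  # [Fraction(3, 4), Fraction(3, 16), Fraction(3, 16)]
```

The exact fractions 3/4, 3/16 and 3/16 agree with the values 0.75 and 0.1875 derived above.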

As we wanted to scatter the concept we added, besides the three important attributes, also five random binary attributes to the problem description. We can observe that as we increase the number of examples the estimate for $A_1$ converges to 0.75, while the estimates for $A_2$ and $A_3$ converge to 0.1875, as we expected. The reason for the rather slow convergence is in the random sampling of the examples, so we need many more examples than the complete description of the problem (256 examples).

Figure 5. The estimates of the attributes by Relief and ReliefF are converging to the ratio between class labels they explain and the number of examined instances. (Plot of ReliefF's estimate against the number of examples, 0 to 7000, for A1, A2 and A3, with the asymptotic values 0.75 for A1 and 0.1875 for A2 and A3.)

For ReliefF this property is somewhat different. Recall that in this algorithm we search nearest misses from each of the classes and weight their contributions with prior probabilities of the classes (line 9).

We have to define the responsibility for the change of the class value $c_i$ to the class value $c_j$.

DEFINITION 2.2. Let $B_j(I)$ be the set of instances from $\mathcal{I}$ nearest to the instance $I \in \mathcal{I}$, $\tau(I) = c_i$, with prediction value $c_j$, $c_j \neq c_i$:

$$B_j(I) = \{Y \in \mathcal{I};\ \tau(Y) = c_j \ \wedge\ Y = \arg\min_{Y \in \mathcal{I}} \delta(I, Y)\} \tag{25}$$

Let $b_j(I)$ be a single instance from the set $B_j(I)$ and $p(b_j(I))$ the probability that it is randomly chosen from $B_j(I)$. Let $A(I, b_j(I))$ be the set of attributes with different values at instances $I$ and $b_j(I)$:

$$A(I, b_j(I)) = \{A \in \mathcal{A};\ b_j(I) \in B_j(I) \ \wedge\ \mathrm{diff}(A, I, b_j(I)) > 0\} \tag{26}$$

We say that the attributes $A \in A(I, b_j(I))$ are responsible for the change of the predicted value of the instance $I$ to the predicted value of $b_j(I)$, as the change of their values is one of the minimal number of changes required for changing the predicted value of $I$ to $b_j(I)$. If the sets $A(I, b_j(I))$ are different we say that there are different ways to explain the changes of the predicted value of $I$ to the predicted value of $b_j(I)$. The probability of a certain way is equal to the probability that $b_j(I)$ is selected from $B_j(I)$.

Let $A_j(I)$ be the union of the sets $A(I, b_j(I))$:

$$A_j(I) = \bigcup_{b_j(I) \in B_j(I)} A(I, b_j(I)) \tag{27}$$

We say that the attributes $A \in A_j(I)$ are responsible for the change of the predicted value $c_i$ of the instance $I$ to the value $c_j \neq c_i$, as the change of their values is the minimal necessary change of the attributes' values of $I$ required to change its predicted value to $c_j$. Let the quantity of this responsibility take into account the change of the predicted value and the change of the attribute:

$$r_{A_j}(I, b_j(I)) = p(b_j(I)) \cdot \mathrm{diff}(\tau, I, b_j(I)) \cdot \mathrm{diff}(A, I, b_j(I)) \tag{28}$$

The ratio between the responsibility of the attribute $A$ for the change of predicted values from $c_i$ to $c_j$ for the set of cases $S$ and the cardinality $m$ of that set is thus:

$$R_A(i, j) = \frac{1}{m} \sum_{I \in S} r_{A_j}(I, b_j(I)) \tag{29}$$

PROPERTY 2.2. Let $p(c_i)$ represent the prior probability of the class $c_i$. Under the conditions of Property 2.1, algorithm ReliefF behaves as:

$$\lim_{n \to \infty} W[A] = \sum_{i=1}^{c} \sum_{\substack{j=1 \\ j \neq i}}^{c} \frac{p(c_i)\, p(c_j)}{1 - p(c_i)}\, R_A(i, j) \tag{30}$$

We can therefore explain the quality estimates as the ratio of class value changes the attribute is responsible for to the number of the examined instances, weighted with the prior probabilities of the class values.

The proof is similar to the proof of Property 2.1. Algorithm ReliefF selects an instance $R \in S$ (line 3 on Figure 2). The probability of $R$ being labelled with class value $c_i$ is equal to the prior probability of that value, $p(c_i)$. The algorithm then searches for $k$ nearest instances from the same class and $k$ nearest instances from each of the other classes (line 6 on Figure 2) and then updates the quality estimates $W[A]$ according to these values (lines 8 and 9 on Figure 2). As the number of examples is sufficient ($n \to \infty$), the instances $H_j$ must be from the same characteristic region as $R$ and their attribute values converge to the attribute values of the instance $R$. The contribution of nearest hits to $W[A]$ in the limit is therefore 0. Only nearest misses contribute to $W[A]$. The instances $M_j$ are randomly selected nearest instances with a different prediction than $R$, therefore in noiseless problems there must be at least some difference in the values of the attributes, and all $M_j$ are therefore instances of $b_j(R)$ selected with probability $p_j(M)$. The contributions of the $k$ instances are weighted with the probabilities $p(b_j(R))$.

As the $M_j$ have a different prediction value than $R$, the value $\mathrm{diff}(\tau, R, M_j) = 1$. The attributes with different values at $R$ and $M_j$ constitute the set $A_j(R, b_j(R))$. The contribution of the instances $M_j$ to $W[A]$ for the attributes from $A_j(R, b_j(R))$ equals $\sum_{j=1, j \neq i}^{c} \frac{p(c_j)}{1 - p(c_i)}\, p(b_j(R))\, \mathrm{diff}(A, R_i, b_j(R))/m$. ReliefF selects $m$ instances $I \in S$, where $p(c_i) \cdot m$ of them are labelled with $c_i$. For each $I$ it randomly selects its nearest misses $b_j(I)$, $j \neq i$, with probabilities $p(b_j(I))$.

The sum of updates of $W[A]$ for each attribute is therefore:

$$W[A] = \sum_{I \in S} \sum_{\substack{j=1 \\ j \neq i}}^{c} \frac{p(c_j)}{1 - p(c_i)}\, p(b_j(I))\, \mathrm{diff}(\tau, I, b_j(I))\, \mathrm{diff}(A, I, b_j(I))/m = \sum_{I \in S} \sum_{\substack{j=1 \\ j \neq i}}^{c} \frac{p(c_j)}{1 - p(c_i)}\, r_{A_j}(I, b_j(I))/m$$

This can be rewritten as the sum over the class values, as only the instances with class value $c_i$ contribute to $R_A(i, j)$:

$$= \sum_{i=1}^{c} p(c_i) \sum_{\substack{j=1 \\ j \neq i}}^{c} \frac{p(c_j)}{1 - p(c_i)}\, R_A(i, j) = \sum_{i=1}^{c} \sum_{\substack{j=1 \\ j \neq i}}^{c} \frac{p(c_i)\, p(c_j)}{1 - p(c_i)}\, R_A(i, j),$$

which we wanted to prove.

COROLLARY 2.3. In two class problems where the diff function is symmetric, $\mathrm{diff}(A, I_1, I_2) = \mathrm{diff}(A, I_2, I_1)$, Property 2.2 is equivalent to Property 2.1.

Proof. As diff is symmetric, the responsibility is also symmetric: $R_A(i, j) = R_A(j, i)$. Let us rewrite Equation (30) with $p(c_1) = p$ and $p(c_2) = 1 - p$. By taking into account that we are dealing with only two classes we get $\lim_{n \to \infty} W[A] = \frac{p(1-p)}{1-p} R_A(1, 2) + \frac{(1-p)p}{p} R_A(2, 1) = p\, R_A(1, 2) + (1 - p)\, R_A(2, 1) = R_A(1, 2) = R_A$.

In a Boolean noiseless case, as in the example presented above, the fastest convergence would be with only 1 nearest neighbor. With more nearest neighbors (the default with the algorithms ReliefF and RReliefF) we need more examples to see this effect, as all of them have to be from the same/nearest peak.
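Equation (30) and Corollary 2.3 can also be checked numerically. The following is a minimal sketch, assuming the class priors are given as a list and the responsibilities $R_A(i, j)$ as a square matrix; the function name `relieff_limit` and the numbers in the usage example are hypothetical.

```python
def relieff_limit(priors, R):
    """Right-hand side of Equation (30).

    priors[i] is p(c_i); R[i][j] is R_A(i, j). Diagonal entries are ignored.
    """
    c = len(priors)
    total = 0.0
    for i in range(c):
        for j in range(c):
            if j != i:
                total += priors[i] * priors[j] / (1.0 - priors[i]) * R[i][j]
    return total


# Two classes with a symmetric responsibility: the limit reduces to R_A.
r_a = 0.75
limit = relieff_limit([0.3, 0.7], [[0.0, r_a], [r_a, 0.0]])
```

With two classes the weights $p(c_i) p(c_j) / (1 - p(c_i))$ reduce to $p$ and $1 - p$, so `limit` equals `r_a` up to rounding, whatever prior is chosen.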

The interpretation of the quality estimates with the ratio of the explained changes in the concept is true for RReliefF as well, as it also computes Equation (12); however, the updates are proportional to the size of the difference in the prediction value. The exact formulation and proof remain for further work.

Note that the sum of the expressions (24) and (30) for all attributes is usually greater than 1. In certain peaks there is more than one attribute responsible for the class assignment, i.e., the minimal number of attribute changes required for changing the value of the class is greater than 1 (e.g., line 8 in Table I). The total number of the explanations is therefore greater than (or equal to) the number of the inspected instances. As Relief algorithms normalize the weights with the number of the inspected instances $m$ and not with the total number of possible explanations, the quality estimates are not proportional to the attributes' responsibility but rather present a portion of the explained changes. For the estimates to represent the proportions we would have to change the algorithm and thereby lose the probabilistic interpretation of the attributes' weights.

When we omit the assumption of a sufficient number of examples, the estimates of the attributes can be greater than their asymptotic values, because instances more distant than the minimal number of required changes might be selected into the set of nearest instances, and the attributes might be positively updated also when they are responsible for changes which are more distant than the minimal number of required changes.

The behavior of the Relief algorithm in the limit (Property 2.1) is the same as the asymptotic behavior of the algorithm Contextual Merit (CM) (Hong, 1997), which uses only the contribution of nearest misses. In multi class problems CM searches for nearest instances from a different class disregarding the actual class they belong to, while ReliefF selects an equal number of instances from each of the different classes and normalizes their contribution with their prior probabilities. The idea is that the algorithm should estimate the ability of attributes to separate each pair of the classes regardless of which two classes are closest to each other.

It was shown (Kononenko, 1994) that this approach is superior, and the same normalization factors also occur in the asymptotic behavior of ReliefF given by Property 2.2. Based solely on the asymptotic properties one could come to the, in our opinion, wrong conclusion that it is sufficient for estimation algorithms to consider only nearest instances with a different prediction value. While the nearest instances with the same prediction have no effect when the number of instances is unlimited, they nevertheless play an important role in problems of practical sizes.

Clearly the interpretation of Relief's weights as the ratio of explained concept changes is more comprehensible than the interpretation with the difference of two probabilities. The responsibility for the explained changes of the predicted value is intuitively clear. Equation (24) is non-probabilistic, unconditional, and contains a simple ratio, which can be understood taking the unlimited number of examples into account. The actual quality estimates of the attributes in a given problem are therefore approximations of these ideal estimates, which occur only with an abundance of data.

3. Parameters of ReliefF and RReliefF

In this section we address different parameters of ReliefF and RReliefF: the impact of different distance measures, the use of numerical attributes, how distance can be taken into account, the number of nearest neighbors used and the number of iterations. The datasets not defined in the text and used in our demonstrations and tests are briefly described in the Appendix.

3.1. METRICS

The $\mathrm{diff}(A_i, I_1, I_2)$ function calculates the difference between the values of the attribute $A_i$ for two instances $I_1$ and $I_2$.

The sum of differences over all attributes is used to determine the distance between two instances in the nearest neighbors calculation:

$$\delta(I_1, I_2) = \sum_{i=1}^{a} \mathrm{diff}(A_i, I_1, I_2) \tag{31}$$

This looks quite simple and parameterless; however, in instance based learning there are a number of feature weighting schemes which assign different weights to the attributes in the total sum:

$$\delta(I_1, I_2) = \sum_{i=1}^{a} w(A_i)\, \mathrm{diff}(A_i, I_1, I_2) \tag{32}$$

ReliefF's estimates of attributes' quality can be successfully used as such weights (Wettschereck et al., 1997).
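Equations (31) and (32) can be sketched in a few lines. The helper names below (`diff_nominal`, `make_diff_numeric`, `distance`) are hypothetical, and the two diff functions are assumptions about the definitions (1) and (2) referred to in the text: 0/1 for nominal values and the absolute difference normalized by the value range for numerical ones.

```python
def diff_nominal(v1, v2):
    """Assumed nominal diff: 0 for equal values, 1 otherwise."""
    return 0.0 if v1 == v2 else 1.0


def make_diff_numeric(lo, hi):
    """Assumed numerical diff: |v1 - v2| normalized by the range hi - lo."""
    span = hi - lo

    def diff_numeric(v1, v2):
        return abs(v1 - v2) / span

    return diff_numeric


def distance(x, y, diffs, weights=None):
    """Equation (31), or Equation (32) when attribute weights are given."""
    if weights is None:
        weights = [1.0] * len(diffs)
    return sum(w * d(a, b) for w, d, a, b in zip(weights, diffs, x, y))


# One numerical attribute with range 0..7 and one nominal attribute.
diffs = [make_diff_numeric(0, 7), diff_nominal]
plain = distance((2, "a"), (5, "b"), diffs)                 # 3/7 + 1
weighted = distance((2, "a"), (5, "b"), diffs, [0.0, 1.0])  # nominal only
```

Passing ReliefF's own estimates as `weights` would give the feature-weighted distance mentioned above; the weights in the usage example are arbitrary.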

Another possibility is to form a metric in a different way:

$$\delta(I_1, I_2) = \left( \sum_{i=1}^{a} \mathrm{diff}(A_i, I_1, I_2)^p \right)^{\frac{1}{p}} \tag{33}$$

which for $p = 1$ gives the Manhattan distance and for $p = 2$ the Euclidean distance. In our use of Relief algorithms we never noticed any significant difference in the estimations using these two metrics. For example, on the regression problems from the UCI repository (Murphy and Aha, 1995) (8 tasks: Abalone, Auto-mpg, Autoprice, CPU, Housing, PWlinear, Servo, and Wisconsin breast cancer) the average (linear) correlation coefficient is 0.998 and the (Spearman's) rank correlation coefficient is 0.990.

3.2. NUMERICAL ATTRIBUTES

If we use the diff function as defined by (1) and (2) we run into the problem of underestimating numerical attributes. Let us illustrate this by taking two instances with 2 and 5 being their values of attribute $A_i$, respectively. If $A_i$ is a nominal attribute, the value of $\mathrm{diff}(A_i, 2, 5) = 1$, since the two categorical values are different. If $A_i$ is a numerical attribute, $\mathrm{diff}(A_i, 2, 5) = \frac{|2 - 5|}{7} \approx 0.43$.

Relief algorithms use the results of the diff function to update their weights, therefore with this form of diff numerical attributes are underestimated. Estimations of the attributes in the Modulo-8-2 data set (see the definition by Equation 43) by RReliefF in the left hand side of Table II illustrate this effect. Values of each of the 10 attributes are integers in the range 0-7. Half of the attributes are treated as nominal and half as numerical; each numerical attribute is an exact match of one of the nominal attributes. The predicted value is the sum of the 2 important attributes modulo 8: $\tau = (I_1 + I_2) \bmod 8$. We can see that nominal attributes get approximately double the score of their numerical counterparts. As a result, not only are important numerical attributes underestimated but numerical random attributes are also overestimated, which reduces the separability of the two groups of attributes.

Table II. Estimations of attributes in the Modulo-8-2 dataset assigned by RReliefF. Left hand estimations are for the diff function defined by Equations (1) and (2), while the right hand estimations are for the diff function using thresholds (Equation (34)).

| Attribute | no ramp | ramp |
|-----------|---------|------|
| Important-1, nominal | 0.193 | 0.436 |
| Important-2, nominal | 0.196 | 0.430 |
| Random-1, nominal | -0.100 | -0.200 |
| Random-2, nominal | -0.105 | -0.207 |
| Random-3, nominal | -0.106 | -0.198 |
| Important-1, numerical | 0.096 | 0.436 |
| Important-2, numerical | 0.094 | 0.430 |
| Random-1, numerical | -0.042 | -0.200 |
| Random-2, numerical | -0.044 | -0.207 |
| Random-3, numerical | -0.043 | -0.198 |

We can overcome this problem with the ramp function as proposed by (Hong, 1994; Hong, 1997). It can be defined as a generalization of the diff function for the numerical attributes (see Figure 6):

Figure 6. Ramp function. (Plot of $\mathrm{diff}(A, I_1, I_2)$ against $d = |\mathrm{value}(A, I_1) - \mathrm{value}(A, I_2)|$, with the thresholds $t_{eq}$ and $t_{diff}$ marked on the horizontal axis.)

$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0 & ;\ d \leq t_{eq} \\ 1 & ;\ d > t_{diff} \\ \dfrac{d - t_{eq}}{t_{diff} - t_{eq}} & ;\ t_{eq} < d \leq t_{diff} \end{cases} \tag{34}$$

where $d = |\mathrm{value}(A, I_1) - \mathrm{value}(A, I_2)|$ presents the distance between the attribute values of two instances, and $t_{eq}$ and $t_{diff}$ are two user definable threshold values; $t_{eq}$ is the maximum distance between two attribute values to still consider them equal, and $t_{diff}$ is the minimum distance between attribute values to still consider them different.

If we set $t_{eq} = 0$ and $t_{diff} = \max(A) - \min(A)$ we obtain (2) again. Estimations of attributes in the Modulo-8-2 data set by RReliefF using the ramp function are in the right hand side of Table II. The thresholds are set to their default values: 5% and 10% of the length of the attribute's value interval for $t_{eq}$ and $t_{diff}$, respectively. We can see that the estimates for nominal attributes and their numerical counterparts are identical. The threshold values can be set by the user for each attribute individually, which is especially appropriate when we are dealing with measured attributes.
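The ramp function of Equation (34) translates directly into code. The following is a minimal sketch; the function name `ramp_diff` and the argument names are hypothetical.

```python
def ramp_diff(v1, v2, t_eq, t_diff):
    """Ramp diff of Equation (34) for a numerical attribute."""
    d = abs(v1 - v2)
    if d <= t_eq:
        return 0.0                       # close enough to count as equal
    if d > t_diff:
        return 1.0                       # far enough to count as different
    return (d - t_eq) / (t_diff - t_eq)  # linear ramp in between


value_range = 7.0  # attribute values 0..7, as in Modulo-8-2

# t_eq = 0 and t_diff = max(A) - min(A) reproduce the normalized difference.
plain = ramp_diff(2, 5, 0.0, value_range)

# Default thresholds: 5% and 10% of the length of the value interval.
ramped = ramp_diff(2, 5, 0.05 * value_range, 0.10 * value_range)
```

With the default thresholds and integer values in the range 0-7, any two distinct values differ by at least 1, which exceeds $t_{diff} = 0.7$, so the ramp returns 1 exactly as the nominal diff would. This is consistent with the identical estimates in the right hand side of Table II.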

Thresholds can be learned in advance considering the context (Ricci and Avesani, 1995) or automatically set to sensible defaults (Domingos, 1997). The sigmoidal function could also be used, but its parameters do not have such a straightforward interpretation. In general, if the user has some additional information about the character of a certain attribute, she/he can supply the appropriate diff function to (R)ReliefF. We use the ramp function in the results reported throughout this work.

3.3. TAKING DISTANCE INTO ACCOUNT

In instance based learning it is often considered useful to give more impact to the near instances than to the far ones, i.e., to weight their impact inversely proportional to their distance from the query point. RReliefF is already taking the distance into account through Equations (10) and (11). By default we are using 70 nearest neighbors and exponentially decrease their influence with increasing distance from the query point. ReliefF originally used a constant influence of $k$ nearest neighbors with $k$ set to some small number (usually 10).

We believe that the former approach is less risky (as it turned out in a real world application (Dalaka et al., 2000)), because by taking more near neighbors we reduce the risk of the following pathological case: we have a large number of instances and a mix of nominal and numerical attributes where numerical attributes prevail; it is possible that all the nearest neighbors are closer than 1, so that there are no nearest neighbors with differences in values of a certain nominal attribute. If this happens in a large part of the problem space this attribute gets zero weight (or at least a small and unreliable one).

By taking more nearest neighbors with appropriately weighted influence we eliminate this problem. ReliefF can be adjusted to take distance into account by changing the way it updates its weights (lines 8 and 9 in Figure 2):

$$W[A] := W[A] - \frac{1}{m} \sum_{j=1}^{k} \mathrm{diff}(A, R, H_j)\, d(R, H_j) + \frac{1}{m} \sum_{C \neq class(R)} \frac{P(C)}{1 - P(class(R))} \sum_{j=1}^{k} \mathrm{diff}(A, R, M_j(C))\, d(R, M_j(C)) \tag{35}$$

The distance factor of two instances $d(I_1, I_2)$ is defined with Equations (10) and (11). The actual influence of the near instances is normalized: as we want a probabilistic interpretation of the results, each random query point should give an equal contribution.

Therefore we normalize the contributions of each of its $k$ nearest instances by dividing them with the sum of all $k$ contributions in Equation (10). However, by using ranks instead of actual distances we might lose the intrinsic self normalization contained in the distances between instances of the given problem. If we wish to use the actual distances we only change Equation (11):

$$d_1(i, j) = \frac{1}{\sum_{l=1}^{a} \mathrm{diff}(A_l, R_i, I_j)} \tag{36}$$

We might also use some other decreasing function of the distance, e.g., the square of the sum in the above expression, if we wish to emphasize the influence of the distance:

$$d_1(i, j) = \frac{1}{\left( \sum_{l=1}^{a} \mathrm{diff}(A_l, R_i, I_j) \right)^2} \tag{37}$$

The differences in estimations can be substantial, although the average correlation coefficients between estimations and ranks over regression datasets from UCI obtained with RReliefF are high, as shown in Table III.

Table III. Linear correlation coefficients between estimations and ranks over 8 UCI regression datasets. We compare RReliefF using Equations (11), (36) and (37).

| Problem | Eqs. (11) and (36): ρ | Eqs. (11) and (36): r | Eqs. (11) and (37): ρ | Eqs. (11) and (37): r | Eqs. (36) and (37): ρ | Eqs. (36) and (37): r |
|---------|-------|-------|-------|-------|-------|-------|
| Abalone | 0.969 | 0.974 | 0.991 | 0.881 | 0.929 | 0.952 |
| Auto-mpg | -0.486 | 0.174 | 0.389 | -0.321 | 0.143 | 0.357 |
| Autoprice | 0.844 | 0.775 | 0.933 | 0.819 | 0.749 | 0.945 |
| CPU | 0.999 | 0.990 | 0.990 | 0.943 | 1.000 | 0.943 |
| Housing | 0.959 | 0.830 | 0.937 | 0.341 | 0.181 | 0.769 |
| Servo | 0.88 | 0.999 | 0.985 | 0.800 | 1.000 | 0.800 |
| Wisconsin | 0.778 | 0.842 | 0.987 | 0.645 | 0.743 | 0.961 |
| Average | 0.721 | 0.798 | 0.888 | 0.587 | 0.678 | 0.818 |

The reason for the substantial deviation in the Auto-mpg problem is the sensitivity of the algorithm to the number of nearest neighbors when using actual distances.
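The three influence functions compared in Table III can be sketched as follows. Equation (11) itself is not reproduced in this section, so the rank-based form $e^{-(\mathrm{rank}/\sigma)^2}$ used in `rank_influence` is an assumption, as are all function names; the inverse-distance variants follow Equations (36) and (37) and assume strictly positive distances.

```python
import math


def rank_influence(ranks, sigma=20.0):
    """Assumed form of Equation (11): exponential decay with the rank."""
    return [math.exp(-((r / sigma) ** 2)) for r in ranks]


def inverse_distance_influence(distances, power=1):
    """Equation (36) for power=1 and Equation (37) for power=2."""
    return [1.0 / (d ** power) for d in distances]


def normalize(raw):
    """Normalization of Equation (10): the k influences sum to 1."""
    total = sum(raw)
    return [v / total for v in raw]


distances = [1.0, 2.0, 4.0]  # hypothetical distances to 3 nearest instances
by_rank = normalize(rank_influence([1, 2, 3]))
by_distance = normalize(inverse_distance_influence(distances))
by_squared = normalize(inverse_distance_influence(distances, power=2))
```

In this toy case the rank-based weights are nearly uniform for the first few neighbors when $\sigma = 20$, whereas the inverse-distance weights depend directly on the actual distances, which is one way to see why the two schemes can disagree.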

While with expression (11) we exponentially decrease the influence according to the number of nearest neighbors, Equations (36) and (37) use the inverse of the distance, so instances at a greater distance may also have a substantial influence. With actual distances and 70 nearest instances in this problem we get a myopic estimate which is uncorrelated with the non-myopic estimate. So, if we are using actual distances we have to use a moderate number of nearest neighbors or test several settings for it.

3.4. NUMBER OF NEAREST NEIGHBORS

While the number of nearest neighbors used is related to the distance as described above, there are still some other issues to be discussed, namely how sensitive Relief algorithms are to the number of nearest neighbors used (lines 4, 5, and 6 in Figure 2 and line 4 in Figure 3). The optimal number of nearest neighbors is problem dependent, as we illustrate in Figure 7, which shows ReliefF's estimates for four important and one of the random attributes in the Boolean domain defined as:

$$\text{Bool-Simple}: \quad C = (A_1 \oplus A_2) \vee (A_3 \wedge A_4) \tag{38}$$

Figure 7. ReliefF's estimates and the number of nearest neighbors. (Plot of ReliefF's estimate against the number of nearest neighbors, 0 to 120, for A1, A2, A3, A4 and Random.)

We know that $A_1$ and $A_2$ are more important for the determination of the class value than $A_3$ and $A_4$ (the attributes' values are equiprobable). ReliefF recognizes this with up to 60 nearest neighbors (there are 200 instances). With more than 80 nearest neighbors used the global view prevails and the strong conditional dependency between $A_1$ and $A_2$ is no longer detected. If we increase the number of instances from 200 to 900 we obtain a similar picture as in Figure 7, except that the crossing point moves from 70 to 250 nearest neighbors.

The above example is quite illustrative: it shows that ReliefF is robust in the number of nearest neighbors as long as it remains relatively small. If it is too small it may not be robust enough, especially with more complex or noisy concepts. If the influence of all neighbors is equal disregarding their distance to the query point, the proposed default value is 10 (Kononenko, 1994). If we do take distance into account we use 70 nearest neighbors with exponentially decreasing influence ($\sigma = 20$ in Equation (11)).

In a similar problem with CM algorithm (Hong, 1997) it is suggested that using log n nearest neighbors gives satisfactory results in practice. However we have to emphasize that this is problem dependent and especially related to the problem complexity, the amount of noise and the number of available instances. Another solution to this problem is to compute estimates for all possible numbers of nearest neighbors and take the highest estimate of each attribute as its ? nal result. In this way we avoid the danger of accidentally missing an important attribute.
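The idea of scoring every neighborhood size and keeping the highest estimate can be sketched as follows. This is a simplified two-class illustration for a single attribute and a single query instance, not the authors' procedure: the names `estimates_for_all_k` and `best_over_k` are hypothetical, the neighbors are assumed to be already sorted by increasing distance, and all $k$ neighbors are given equal influence.

```python
def estimates_for_all_k(hit_diffs, miss_diffs):
    """Contribution of one query instance for k = 1, 2, ..., k_max.

    hit_diffs[j] and miss_diffs[j] are the diff values of one attribute for
    the (j+1)-th nearest hit and miss; running sums avoid recomputation.
    """
    k_max = min(len(hit_diffs), len(miss_diffs))
    out = []
    hit_sum = miss_sum = 0.0
    for k in range(1, k_max + 1):
        hit_sum += hit_diffs[k - 1]
        miss_sum += miss_diffs[k - 1]
        out.append((miss_sum - hit_sum) / k)
    return out


def best_over_k(weights_by_k):
    """Final estimate of each attribute: its highest weight over all k."""
    rows = list(weights_by_k.values())
    return [max(col) for col in zip(*rows)]


per_k = estimates_for_all_k([0, 0, 1], [1, 1, 0])
final = best_over_k({1: [0.2, -0.1], 10: [0.3, -0.2]})
```

Because the maximum is taken separately for each attribute, the final estimates of different attributes may come from different values of $k$.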

Because all attributes receive a somewhat higher score, we risk that some differences would be blurred, and we increase the computational complexity. The former risk can be resolved later on in the process of investigating the domain by producing a graph similar to Figure 7 showing the dependencies of ReliefF's estimates on the number of nearest neighbors. The computational complexity increases from $O(m \cdot n \cdot a)$ to $O(m \cdot n \cdot (a + \log n))$ due to sorting of the instances by distance. In the algorithm we also have to do some additional bookkeeping, e.g., keep the score for each attribute and each number of nearest instances.

3.5. SAMPLE SIZE AND NUMBER OF ITERATIONS

Estimates of Relief algorithms are actually statistical estimates, i.e., the algorithms collect the evidence for (non)separation of similar instances by the attribute across the problem space. To provide reliable estimates the coverage of the problem space must be appropriate. The sample has to cover enough representative boundaries between the prediction values. There is an obvious trade off between the use of more instances and the efficiency of computation.

Whenever we have large datasets, sampling is one of the possible solutions to make the problem tractable. If a dataset is reasonably large and we want to speed up computations we suggest selecting all the available instances ($n$ in the complexity calculations) and rather controlling the number of iterations with the parameter $m$ (line 2 in Figures 2 and 3). As it is non-trivial to select a representative sample of the unknown problem space, our decision is in favor of the (possibly) sparse coverage of the more representative space rather than the dense coverage of the (possibly) non-representative sample.

Figure 8. RReliefF's estimates and the number of iterations (m) on Cosinus-Lin dataset. (Plot of RReliefF's estimate against the number of iterations, 0 to 400, for A1, A2, A3, Random1 and Random2.)

Figure 8 illustrates the behavior of RReliefF's estimates changing with the number of iterations on the Cosinus-Lin dataset with 10 random attributes and 1000 examples. We see that after the initial variation, at around 20 iterations the estimates settle to stable values, except for difficulties at detecting finer differences, e.g., the quality difference between $A_1$ and $A_3$ is not resolved until around 300 iterations ($A_1$ within the cosine function controls the sign of the expression, while $A_3$ with the coefficient 3 controls the amplitude of the function). We should note that this is quite typical behavior and usually we get stable estimates after 20-50 iterations. However, if we want to refine the estimates we have to iterate further. The question of how many more iterations we need is problem dependent. We try to answer this question for some chosen problems in Section 4.5.

4. Analysis of performance

In this Section we investigate some practical issues on the use of ReliefF and RReliefF: what kind of dependencies they detect, how they scale up to a large number of examples and features, how many iterations we need for reliable estimation, how robust they are regarding the noise, and how irrelevant and duplicate attributes influence their output. For ReliefF some of these questions have been tackled in a limited scope in (Kononenko, 1994; Kononenko et al., 1997) and for RReliefF in (Robnik Šikonja and Kononenko, 1997). Before we analyze these issues we have to define some useful measures and concepts for performance analysis, comparison and explanation.

4.1. USEFUL DEFINITIONS AND CONCEPTS

4.1.1. Concept variation

We are going to examine the abilities of ReliefF and RReliefF to recognize and rank important attributes for a set of different problems. As Relief algorithms are based on the nearest neighbor paradigm they are capable of detecting classes of problems for which this paradigm holds. Therefore we will define a measure of concept difficulty, called concept variation, based on the nearest neighbor principle. Concept variation (Rendell and Seshu, 1990) is a measure of problem difficulty based on the nearest neighbor paradigm. If many pairs of neighboring examples do not belong to the same class, then the variation is high and the problem is difficult (Pérez and Rendell, 1996; Vilalta, 1999). It is defined on $a$-dimensional Boolean concepts as

$$V_a = \frac{1}{a 2^a} \sum_{i=1}^{a} \sum_{neigh(X, Y, i)} \mathrm{diff}(C, X, Y) \tag{39}$$

where the inner summation is taken over all $2^a$ pairs of the neighboring instances $X$ and $Y$ that differ only in their $i$-th attribute. Division by $a 2^a$ converts the double sum to an average and normalizes the variation to the $[0, 1]$ range. The two constant concepts (which have all class values 0 and 1, respectively) have variation 0, and the two parity concepts of order $a$ have variation 1.
Random concepts have variation around 0.5, because the neighbors of any given example are approximately evenly split between the two classes. A variation around 0.5 can thus be considered high and the problem difficult.

This definition of the concept variation is limited to Boolean problems and demands the description of the whole problem space. The modified definition by (Vilalta, 1999) encompasses the same idea but is defined also for numeric and non-binary attributes. It uses only a sample of examples, which makes the variation possible to compute in real world problems as well. Instead of all pairs of the nearest examples it uses a distance metric to weight the contribution of each example to the concept variation. The problem with this definition is that it uses the contributions of all instances and thereby loses the information on locality. We propose another definition which borrows from both mentioned above:

$$V = \frac{1}{m \cdot a} \sum_{i=1}^{m} \sum_{j=1}^{a} \mathrm{diff}(C, \mathrm{neigh}(X_i, Y, j)) \tag{40}$$

where $Y$ has to be the nearest instance different from $X_i$ in attribute $j$. Similarly to the original definition we are counting the number of differences in each dimension, but we are doing so only on a sample of the $m$ examples available. We are using the diff function, which allows handling of multi-valued and numerical attributes. Note that in the case of an $a$-dimensional Boolean concept with all $2^a$ instances available Equation (40) is equivalent to Equation (39). We present the behaviors of different definitions of the concept variation in Figure 9.
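For Boolean concepts with the whole problem space available, where Equations (39) and (40) coincide, the concept variation can be computed by flipping each attribute of each instance in turn. The following is a minimal sketch; the function name `concept_variation` and the example concepts are hypothetical.

```python
from itertools import product


def concept_variation(concept, a):
    """Equation (39) on the complete a-dimensional Boolean space.

    For every instance X and attribute i, the neighbor Y differs from X only
    in attribute i; the class differences are averaged over a * 2**a pairs.
    """
    total = 0
    for x in product([0, 1], repeat=a):
        for i in range(a):
            y = list(x)
            y[i] = 1 - y[i]                      # flip the i-th attribute
            total += int(concept(x) != concept(tuple(y)))
    return total / (a * 2 ** a)


v_parity = concept_variation(lambda x: sum(x) % 2, 4)   # parity of order 4
v_const = concept_variation(lambda x: 0, 4)             # constant concept
v_xor2 = concept_variation(lambda x: x[0] ^ x[1], 3)    # parity of order 2
```

The parity concept of full order reaches the maximal variation 1, the constant concept gives 0, and a parity of order 2 on 3 attributes gives 2/3, which illustrates the linear growth of the original definition with the parity order.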

We have measured the concept variations on the parity concepts with orders 2 to 9 on a dataset with 9 attributes and 512 randomly sampled examples. By the original definition (Rendell and Seshu, 1990) the concept variation increases linearly with the parity order. For the modified definition (Vilalta, 1999) the concept variation remains almost constant (0.5), while our definition (V with 1 nearest) exhibits behavior similar to the original. If we did not sample the examples but rather generated the whole problem space (which is what the original definition uses), V would behave exactly as the original definition. If in our definition (Equation (40)) we average the contribution of the 10 nearest instances that differ from the instance $X_i$ in the j-th attribute (V with 10 nearest in Figure 9), the behavior becomes more similar to that of (Vilalta, 1999). This indicates that locality is crucial in sensible definitions of the concept variation. Note that we are using the diff function, so if prediction values are used instead of class values, this definition can be used for regression problems as well.

Figure 9. Concept variation by different definitions on parity concepts of orders 2 to 9 and 512 examples. (Horizontal axis: parity order; vertical axis: concept variation; curves: (Rendell & Seshu, 1990), (Vilalta, 1999), V with 1 nearest, V with 10 nearest.)

4.1.2. Performance measures

In our experimental scenario below we run ReliefF and RReliefF on a number of different problems and observe

- if their estimates distinguish between important attributes (conveying some information about the concept) and unimportant attributes, and
- if their estimates rank important attributes correctly (attributes which have stronger influence on prediction values should be ranked higher).
In estimating the success of Relief algorithms we use the following measures:

Separability s is the difference between the lowest estimate of the important attributes and the highest estimate of the unimportant attributes:

$$s = W_{I_{worst}} - W_{R_{best}} \qquad (41)$$

We say that a heuristic is successful in separating the important from the unimportant attributes if $s > 0$.

Usability u is the difference between the highest estimates of the important and unimportant attributes:

$$u = W_{I_{best}} - W_{R_{best}} \qquad (42)$$

We say that estimates are useful if u is greater than 0 (we are getting at least some information from the estimates, e.g., the best important attribute could be used as the split in a tree-based model). It holds that $u \geq s$.

4.1.3. Other attribute estimators for comparison

For some problems we want to compare the performance of ReliefF and RReliefF with other attribute estimation heuristics. We have chosen the most widely used ones. For classification this is the gain ratio (used in, e.g., C4.5 (Quinlan, 1993)) and for regression it is the mean squared error (MSE) of the average prediction value (used in, e.g., CART (Breiman et al., 1984)). Note that MSE, unlike Relief algorithms and gain ratio, assigns lower weights to better attributes. To make its s and u curves comparable to those of RReliefF we actually report separability and usability with the sign reversed.

4.2. Some typical problems and dependencies

We use artificial datasets in the empirical analysis because we want to control the environment: in real-world datasets we do not fully understand the problem and the relation of the attributes to the target variable.

Therefore we do not know what a correct output of the feature estimation should be and we cannot evaluate the quality estimates of the algorithms. We mostly use variants of parity-like problems because these are the most difficult problems within the nearest neighbor paradigm. We try to control the difficulty of the concepts (which we measure with the concept variation) and therefore we introduce many variants with various degrees of concept variation. We also use some non-parity-like problems and demonstrate the performance of Relief algorithms on them. We did not find another conceptually different class of problems on which the Relief algorithms would exhibit significantly different behavior.

4.2.1. Sum by modulo concepts

We start our presentation of the abilities of ReliefF and RReliefF with concepts based on summation by modulo. Sum by modulo p problems are integer generalizations of the parity concept, which is the special case where attributes are Boolean and the class is defined by modulo 2. In general, each Modulo-p-I problem is described by a set of attributes with integer values in the range [0, p). The predicted value $\tau(X)$ is the sum of the I important attributes by modulo p:

$$\text{Modulo-}p\text{-}I: \quad \tau(X) = \Big( \sum_{i=1}^{I} X_i \Big) \bmod p \qquad (43)$$

Let us start with the base case, i.e., Boolean problems (p = 2). As an illustrative example we show problems with parity of 2 to 8 attributes ($I \in [2, 8]$) on a data set described with 9 attributes and 512 examples (a complete description of the domain). Figure 10 shows the s curve for this problem (the u curve is identical as we have a complete description of the domain). In this and all figures below each point on the graph is an average of 10 runs.
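A Modulo-p-I dataset as described by Equation (43) can be generated with a few lines. This is a minimal sketch, assuming uniformly random attribute values and NumPy's random generator; the function name, argument names, and the choice to put the important attributes first are illustrative and not taken from the paper.

```python
import numpy as np


def modulo_problem(p, n_important, n_random, n_examples, seed=0):
    """Sample a Modulo-p-I problem: integer attributes in [0, p), target
    equal to the sum of the first n_important attributes modulo p."""
    rng = np.random.default_rng(seed)
    n_attributes = n_important + n_random
    X = rng.integers(0, p, size=(n_examples, n_attributes))
    y = X[:, :n_important].sum(axis=1) % p
    return X, y


# Parity of order 3 on 9 Boolean attributes is the special case p = 2.
X, y = modulo_problem(p=2, n_important=3, n_random=6, n_examples=512)
```

Note that this samples examples at random, as in the experiments with randomly generated examples; the complete description of the Boolean domain would instead enumerate all 512 attribute combinations.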

We can see that the separability of the attributes decreases with increasing difficulty of the problem for parity orders 2, 3, and 4. At order 5, when more than half of the attributes are important, the separability becomes negative, i.e., we are no longer capable of separating the important from the unimportant attributes. The reason is that we are using more than one nearest neighbor (one nearest neighbor would always produce a positive s curve on this noiseless problem), and as the number of peaks in the problem increases with $2^I$ while the number of examples remains constant (512), we have fewer and fewer examples per peak.

At I = 5, when we get negative s, the number of nearest examples from the neighboring peaks with distance 1 (different parity) surpasses the number of nearest examples from the target peak. An interesting point is I = 8 (only one random attribute is left), where s becomes positive again. The reason for this is that the number of nearest examples from the target peak and from the neighboring peaks with distance 2 (with the same parity!) surpasses the number of nearest examples from the neighboring peaks with distance 1 (different parity).

Figure 10. Separability on parity concepts of orders 2 to 8 and all 512 examples. (Horizontal axis: parity order; vertical axis: separability.)

A sufficient number of examples per peak is crucial for reliable estimations with ReliefF, as we show in Figure 11. The bottom s and u curves show exactly the same problem as above (in Figure 10), but in this case the problem is not described with all 512 examples but rather with 512 randomly generated examples. The s scores are slightly lower than in Figure 10 as we have in effect decreased the number of different examples (to 63.2% of the total, which is the expected share of distinct examples among 512 independent random draws from 512 possibilities, $1 - (1 - 1/512)^{512} \approx 1 - 1/e$). The top s and u curves show the same problem but with 8 times more examples (4096). We can observe that with that many examples the separability for all problems is positive.

Figure 11. Separability and usability on parity concepts of orders 2 to 8 and randomly sampled 512 or 4096 examples. (Horizontal axis: parity order; vertical axis: separability, usability; curves: separability and usability on 4096 examples, separability and usability on 512 examples.)

In the next problem p increases while the number of important attributes and the number of examples are fixed (to 2 and 512, respectively).

Figure 12. Separability for ReliefF and RReliefF on modulo classification and regression problems with changing modulo. (Horizontal axis: modulo; vertical axis: separability; curves: regression and classification problems, each with nominal and with numerical attributes.)

Two curves at the bottom of Figure 12 show separability (usability is very similar and is omitted for clarity) for the classification problem (there are p classes) and thus we can see the performance of ReliefF. The attributes can be treated as nominal or numerical; however, the two curves show similar behavior, i.e., separability is decreasing with increasing modulo. This is expected as the complexity of the problems is increasing with the number of classes, attribute values, and peaks.

Figure 13. Separability for ReliefF and RReliefF on modulo 5 problems with changing the number of important attributes. (Horizontal axis: number of important attributes; vertical axis: separability; curves: classification problems, regression problems.)

The number of attribute values and classes is increasing with p, while the number of peaks is increasing with $p^I$ (polynomial increase). Again, more examples would shift positive values of s further to the right. A slight but important difference between the separability for nominal and numerical attributes shows that numerical attributes convey more information in this problem. The function diff is 1 for any two different values of a nominal attribute, while for numerical attributes diff returns the relative numerical difference, which is more informative.
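The difference between the two treatments of an attribute can be made concrete with a short sketch. It assumes that the relative numerical difference is the absolute difference divided by the attribute's value range; the function names and the explicit `lo`/`hi` arguments are hypothetical.

```python
def diff_nominal(v1, v2):
    """Nominal attribute: any two different values are maximally different."""
    return 0.0 if v1 == v2 else 1.0


def diff_numeric(v1, v2, lo, hi):
    """Numerical attribute: absolute difference relative to the value range
    [lo, hi] of the attribute (assumed normalization)."""
    return abs(v1 - v2) / (hi - lo)


# Modulo 5 attribute with values 0..4: the values 2 and 3 are as different
# as 0 and 4 when treated as nominal, but much closer when treated as numbers.
nominal_close = diff_nominal(2, 3)
nominal_far = diff_nominal(0, 4)
numeric_close = diff_numeric(2, 3, lo=0, hi=4)
numeric_far = diff_numeric(0, 4, lo=0, hi=4)
```

Under the nominal treatment both pairs get the same difference, so the ordering of the values is invisible to the estimator; the numerical treatment keeps it.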

The same modulo problems can be viewed as regression problems and the attributes can again be interpreted as nominal or numerical. Two curves at the top of Figure 12 show separability for the modulo problem formulated as a regression problem (RReliefF is used). We get positive s values for larger modulo compared to the classification problem, and if the attributes are treated as numerical the separability does not decrease with modulo at all. The reason is that the classification problems were actually more difficult. We tried to predict p separate classes (e.g., results 2 and 3 are completely different in classification), while in regression we model numerical values (2 and 3 are different relative to the scale). Another interesting problem arises if we fix the modulo to a small number (e.g., p = 5) and vary the number of important attributes. Figure 13 shows s curves for 4096 examples and 10 random attributes. At modulo 5 there are no visible differences in the performance for nominal and numerical attributes, therefore we give curves for nominal attributes only. The s curves decrease rapidly with increasing I.

Note that the problem complexity (number of peaks) is increasing with $p^I$ (exponentially). Modulo problems are examples of difficult problems in the sense that their concept variation is high. Note that impurity-based measures such as Gain ratio are not capable of separating important from random attributes for any of the problems described above.

4.2.2. MONK's problems

We present results of attribute estimation on the well-known and popular MONK's problems (Thrun et al., 1991), which consist of three binary classification problems based on a common description by six attributes.

$A_1$, $A_2$, and $A_4$ can take the values 1, 2, or 3, $A_3$ and $A_6$ can take the values 1 or 2, and $A_5$ can take one of the values 1, 2, 3, or 4. Altogether there are 432 examples, but we randomly generated training subsets of the original size to estimate the attributes in each of the tasks, respectively.

- Problem M1: 124 examples for the problem: $(A_1 = A_2) \lor (A_5 = 1)$
- Problem M2: 169 examples for the problem: exactly two attributes have value 1
- Problem M3: 122 examples with 5% noise (misclassifications) for the problem: $(A_5 = 3 \land A_4 = 1) \lor (A_5 \neq 4 \land A_2 \neq 3)$.

We generated 10 random samples of the specified size for each of the problems and compared the estimates of ReliefF and Gain ratio. Table IV reports the results.

Table IV. Estimations of attributes in three MONK's databases for ReliefF and Gain ratio. The results are averages over 10 runs.

| Attribute | M1 ReliefF | M1 Gain r. | M2 ReliefF | M2 Gain r. | M3 ReliefF | M3 Gain r. |
|---|---|---|---|---|---|---|
| A1 | 0.054 | 0.003 | 0.042 | 0.006 | -0.013 | 0.004 |
| A2 | 0.056 | 0.004 | 0.034 | 0.006 | 0.324 | 0.201 |
| A3 | -0.023 | 0.003 | 0.053 | 0.001 | -0.016 | 0.003 |
| A4 | -0.016 | 0.007 | 0.039 | 0.004 | -0.005 | 0.008 |
| A5 | 0.208 | 0.160 | 0.029 | 0.007 | 0.266 | 0.183 |
| A6 | -0.020 | 0.002 | 0.043 | 0.001 | -0.016 | 0.003 |
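The three MONK's target concepts can be written as simple predicates over the complete 432-example domain. The sketch below is only illustrative: the function names are hypothetical, M3 is taken in the usual form of the benchmark (inequalities in the second conjunct), and neither the 5% class noise of M3 nor the random sampling of training subsets is modeled.

```python
from itertools import product

# Value ranges: A1, A2, A4 in {1, 2, 3}; A3, A6 in {1, 2}; A5 in {1, 2, 3, 4}.
DOMAIN = list(product(range(1, 4), range(1, 4), range(1, 3),
                      range(1, 4), range(1, 5), range(1, 3)))


def m1(a1, a2, a3, a4, a5, a6):
    """M1: (A1 = A2) or (A5 = 1)."""
    return a1 == a2 or a5 == 1


def m2(a1, a2, a3, a4, a5, a6):
    """M2: exactly two attributes have value 1."""
    return (a1, a2, a3, a4, a5, a6).count(1) == 2


def m3(a1, a2, a3, a4, a5, a6):
    """M3 without noise: (A5 = 3 and A4 = 1) or (A5 != 4 and A2 != 3)."""
    return (a5 == 3 and a4 == 1) or (a5 != 4 and a2 != 3)


# Number of positive examples of each concept in the complete domain.
positives = {f.__name__: sum(f(*x) for x in DOMAIN) for f in (m1, m2, m3)}
```

The predicates make the attribute relevance explicit: M1 depends only on A1, A2, and A5, M2 depends on all six attributes, and M3 depends only on A2, A4, and A5, which is the ground truth against which the estimates in Table IV are judged.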

For the first problem we see that ReliefF separates the important attributes ($A_1$, $A_2$, and $A_5$) from the unimportant ones, while Gain ratio does not recognize $A_1$ and $A_2$ as important attributes in this task. In the second problem, where all the attributes are important, ReliefF assigns them all positive weights. It favors attributes with fewer values as they convey more information. Gain ratio does the opposite: it favors attributes with more values. In the third problem ReliefF and Gain ratio behave similarly: they separate important attributes from unimportant ones and rank them equally ($A_2$, $A_5$, and $A_4$).

4.2.3. Linear and nonlinear problems

In typical regression problems linear dependencies are mixed with some nonlinear dependencies. We investigate problems of this type. The problems are described with numerical attributes with values from the [0, 1] interval and 1000 instances. Besides I important attributes there are also 10 random attributes in each problem. We start with linear dependencies and create problems of the form:

$$\text{LinInc-}I: \quad \tau = \sum_{j=1}^{I} j \cdot A_j \qquad (44)$$

The attributes with a larger coefficient have a stronger influence on the prediction value and should be estimated as more important.

Table V. Estimations of the best random attribute ($R_{best}$) and all informative attributes in LinInc-I problems for RReliefF (RRF) and MSE. RReliefF assigns higher scores and MSE assigns lower scores to better attributes.

| Attr. | LinInc-2 RRF | LinInc-2 MSE | LinInc-3 RRF | LinInc-3 MSE | LinInc-4 RRF | LinInc-4 MSE | LinInc-5 RRF | LinInc-5 MSE |
|---|---|---|---|---|---|---|---|---|
| Rbest | -0.040 | 0.424 | -0.023 | 1.098 | -0.009 | 2.421 | -0.010 | 4.361 |
| A1 | 0.230 | 0.345 | 0.028 | 1.021 | -0.018 | 2.342 | -0.007 | 4.373 |
| A2 | 0.461 | 0.163 | 0.154 | 0.894 | 0.029 | 2.093 | 0.014 | 4.181 |
| A3 | | | 0.286 | 0.572 | 0.110 | 1.833 | 0.039 | 3.777 |
| A4 | | | | | 0.180 | 1.508 | 0.054 | 3.380 |
| A5 | | | | | | | 0.139 | 2.837 |

Table V reports quality estimates of attributes for RReliefF and MSE.
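As a worked check of the separability and usability measures (Equations (41) and (42)), the sketch below applies them to two columns of Table V. The function names are hypothetical, and negating the MSE scores is one reading of "reporting separability and usability with the sign reversed" for an estimator where lower is better.

```python
def separability(important, unimportant):
    """s: lowest estimate among important attributes minus the highest
    estimate among unimportant attributes (higher estimate = better)."""
    return min(important) - max(unimportant)


def usability(important, unimportant):
    """u: highest estimate among important attributes minus the highest
    estimate among unimportant attributes."""
    return max(important) - max(unimportant)


# RReliefF on LinInc-4 (Table V): A1..A4 and the best random attribute.
rrf_important = [-0.018, 0.029, 0.110, 0.180]
rrf_random = [-0.009]
s_rrf = separability(rrf_important, rrf_random)
u_rrf = usability(rrf_important, rrf_random)

# MSE on LinInc-5 (Table V): lower is better, so the scores are negated.
mse_important = [-w for w in (4.373, 4.181, 3.777, 3.380, 2.837)]
mse_random = [-4.361]
s_mse = separability(mse_important, mse_random)
u_mse = usability(mse_important, mse_random)
```

Both separabilities come out slightly negative (about -0.009 for RReliefF on LinInc-4 and about -0.012 for MSE on LinInc-5) because in each case the best random attribute scores better than A1, while both usabilities stay clearly positive.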

We see that for small differences between the importance of attributes both RReliefF and MSE are successful in recognizing this and ranking them correctly. When the differences between the importance of attributes become larger (in LinInc-4 and LinInc-5), it is possible that, due to random fluctuations in the data, one of the random attributes is estimated as better than the least important informative attribute ($A_1$). This happens in LinInc-4 for RReliefF and in LinInc-5 for MSE. The behavior of s and u is illustrated in Figure 14.

Figure 14. Separability and usability on LinInc concepts for 1000 examples. (Horizontal axis: number of important attributes I; vertical axis: separability, usability; curves: separability and usability for RReliefF, separability and usability for MSE.)

Another form of linear dependencies we are going to investigate is

$$\text{LinEq-}I: \quad \tau = \sum_{j=1}^{I} A_j \qquad (45)$$

Here all the attributes are given equal importance and we want to see how many important attributes we can afford. Figure 15 shows separability and usability curves for RReliefF and MSE. We see that separability is decreasing for RReliefF and becomes negative with 10 important attributes.

This is not surprising considering the properties of Relief: each attribute gets its weight according to the portion of explained function values (see Section 2.7), so by increasing the number of important attributes their weights decrease and approach zero. The same is true for RReliefF's usability which, however, becomes negative much later. MSE estimates each attribute separately and is therefore not susceptible to this kind of defect; however, by increasing the number of important attributes the probability of assigning one of them a low score increases, and so the s curve becomes negative.
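The LinInc and LinEq problems (Equations (44) and (45)) and an attribute-by-attribute MSE estimate can be sketched as follows. This is an illustration under assumptions: attributes are sampled uniformly from [0, 1], the important attributes come first, and the MSE of an attribute is taken to be the lowest weighted mean squared error around the two branch means over all binary splits on that attribute, in the spirit of CART. The function names are hypothetical.

```python
import numpy as np


def linear_problem(n_important, n_random=10, n_examples=1000,
                   increasing=True, seed=0):
    """LinInc-I (increasing=True) or LinEq-I (increasing=False) with
    attributes drawn uniformly from [0, 1]; important attributes first."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_examples, n_important + n_random))
    if increasing:
        coef = np.arange(1, n_important + 1)  # coefficient j for A_j
    else:
        coef = np.ones(n_important)           # all coefficients equal to 1
    y = X[:, :n_important] @ coef
    return X, y


def split_mse(x, y):
    """Lowest weighted MSE of the two branch means over all binary splits
    of one numerical attribute x (lower means a better attribute)."""
    y = y[np.argsort(x)]
    n = y.size
    left_n = np.arange(1, n)                  # sizes of the left branch
    left_sum = np.cumsum(y)[:-1]
    left_sq = np.cumsum(y ** 2)[:-1]
    right_n = n - left_n
    right_sum = y.sum() - left_sum
    right_sq = (y ** 2).sum() - left_sq
    # Sum of squared errors around the mean of each branch, for every split.
    sse = (left_sq - left_sum ** 2 / left_n) \
        + (right_sq - right_sum ** 2 / right_n)
    return sse.min() / n


X, y = linear_problem(n_important=2)          # a LinInc-2 problem
mse = [split_mse(X[:, j], y) for j in range(X.shape[1])]
```

Each attribute is scored without looking at the others, which is why this kind of estimate is not diluted when more important attributes are added, unlike the Relief weights.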

If we increase the number of examples to 4000, RReliefF's s curve becomes negative at 16 important attributes while the behavior of MSE does not change.

Figure 15. Separability and usability on LinEq concepts for 1000 examples. (Horizontal axis: number of important attributes I; vertical axis: separability, usability; curves: separability and usability for RReliefF, separability and usability for MSE.)

We end our analysis with non-linear dependencies where the prediction is defined as

$$\text{Cosinus-Hills}: \quad \tau = \cos 2\pi (A_1^2 + 3 A_2^2)$$