DATA BALANCING METHODS BY FUZZY ROUGH SETS
Tran Thanh Huyen1,2*, Le Ba Dung3, Mai Dinh4
Abstract: The robustness of rough set theory in data cleansing has been demonstrated in
many studies. Recently, fuzzy rough sets have also been used to deal with imbalanced data in two
ways. The first is a combination of fuzzy rough instance selection and balancing
methods. The second uses different criteria to clean the majority and minority
classes of imbalanced data. This work is an extension of the second method, which was
presented in [16]. The paper presents a complete study of the second method together with some
proposed algorithms. It focuses mainly on binary classification with kNN and SVM for
imbalanced data. Experiments and comparisons among related methods confirm the pros
and cons of each method with respect to classification accuracy and time consumption.
Keywords: Rough Set theory; Fuzzy-rough sets; Granular computing; Imbalanced data; Instance selection.
1. INTRODUCTION
Rough set theory was first introduced by Pawlak [21, 22] in the early 1980s as a machine
learning method for knowledge acquisition from data. It provides a mathematical
approach to dealing with inconsistency between items in datasets. It can consequently be
used for pattern extraction, decision rule generation, feature selection and data reduction.
In the case of data reduction/selection, the main ideas are extracting fuzzy memberships of
positive regions [23] and choosing the instances with large membership degrees
(degrees have to be larger than given thresholds) for training phases [3, 13, 28]. It can thus reduce
noise by identifying and eliminating low quality instances in balanced datasets.
Another issue that may cause an inconsistency problem is imbalanced data [15]. A
dataset is imbalanced when the numbers of instances in some classes are much larger
than in others. Such classes are called majority classes. The classes with small
cardinality are referred to as minority classes.
There are two possible approaches to using fuzzy rough sets [23] to select instances
from imbalanced datasets. The first is a combination of balancing methods and a rough set
based noise removal technique [24, 25, 29]. In these approaches, fuzzy rough sets are
first used to remove low quality instances. Then, a well-known balancing technique
called "Synthetic Minority Oversampling Technique (SMOTE)" [4] is employed to form
a candidate set. Finally, fuzzy-rough sets are used again to select quality instances from
the candidate set.
The second approach uses different criteria to define different thresholds for the
majority and minority classes [16, 27]. By using a small threshold for the minority class, more
items from minority classes can be selected. Besides, the research in [16] introduces another
method to deal with highly imbalanced data by keeping all instances in the minority class while
removing or relabeling instances in the majority class. The experimental results in those studies show
considerable improvement in classification performance. However, the absence of
methods for generating candidates and optimizing thresholds limits their application in practice.
There are also some studies that use fuzzy-rough sets as classification methods for
imbalanced data [26, 30]. However, building a classifier is not in the scope of this paper.
This research presents a complete study of using fuzzy rough sets to select instances in
imbalanced datasets without any balancing techniques. The paper is structured as follows:
Section 2 reviews the original rough set theory and an extension called fuzzy-rough sets;
in Section 3, the fuzzy rough instance selection approach to dealing with inconsistency is
discussed along with its issues; the proposed algorithms of this research, with methods
to choose parameters, are introduced in Section 4; experiments with results and
comparisons are discussed in Section 5; finally, Section 6 concludes the study.
2. THEORETICAL BACKGROUND
2.1. Information Systems and Tolerance Relation
An information system is represented as a data table. Each row of this table represents
an instance of an object such as people, things, etc. Information about every object is
described by object attribute (feature) values.
An information system in rough set studies is formally defined as a pair $I = (U, A)$,
where $U$ is a non-empty finite set of objects called the universe and $A$ is a non-empty
finite set of attributes such that $f_a : U \to V_a$ for every $a \in A$ [21, 22]. The non-empty
discrete value set $V_a$ is called the domain of $a$. The original rough set theory deals with
complete information systems in which $f_a(x)$ is a precise value for every $x \in U$ and $a \in A$.
If $U$ contains at least one object with an unknown value, then $I$ is called an incomplete
information system, otherwise complete [14]. In incomplete information systems,
unknown values are denoted by the special symbol "$*$" and are supposed to be contained in
the set $V_a$.
Any information system taking the form $I = (U, A \cup \{d\})$ is called a decision table,
where $d \notin A$ is called a decision (or label) and the elements of $A$ are called conditions. Let
$V_d = \{d_1, \dots, d_k\}$ denote the value set of the decision attribute; decision $d$ then determines a
set of partitions $C_1, C_2, \dots, C_k$ of universe $U$, where $C_i = \{x \in U \mid f_d(x) = d_i\}$, $1 \le i \le k$. Set $C_i$
is called the $i$-th decision class or concept on $U$. We assume that every object in $U$ has a
certain decision value in $V_d$.
Formally, in incomplete information systems, the relation $TOR_P(x, y)$, $P \subseteq A$, denotes a
binary relation between objects that are possibly equivalent in terms of the values of
attributes in $P$ [14]. The relation is reflexive and symmetric, but need not be
transitive. Let $T_P(x) = \{y \in U \mid TOR_P(y, x)\}$ be the set of all objects that are equivalent to $x$
by $P$, which is then called an equivalence class. The family of all equivalence classes on
$U$ based on an equivalence relation is referred to as a category and is denoted by
$U / TOR_P$.
From equivalence classes, Kryszkiewicz [14] defined an approximation space that
contains lower and upper approximations, denoted by $\underline{appr}\,X$ and $\overline{appr}\,X$ respectively, of a
set $X \subseteq U$ as follows:

$$\underline{appr}_P X = \bigcup \{ T_P(x) \mid x \in U, T_P(x) \subseteq X \} = \{ x \in U \mid T_P(x) \subseteq X \} \quad (1)$$
$$\overline{appr}_P X = \bigcup \{ T_P(x) \mid x \in U, T_P(x) \cap X \neq \emptyset \} = \{ x \in U \mid T_P(x) \cap X \neq \emptyset \} \quad (2)$$
In decision table $I = (U, A \cup \{d\})$, the positive region $POS_P(X)$ of class $X$ in terms of
attribute set $P \subseteq A$ is defined as:

$$POS_P(X) = \bigcup_{x \in X} \underline{appr}_P\, T_d(x) \quad (3)$$
Apart from using tolerance relations to define a rough set in incomplete information
systems, there are also numerous studies [11, 18, 19, 20] that deal with incomplete or
imperfect information systems in which data are not described by precise and crisp values.
2.2. Fuzzy Rough Set
A classical (crisp) set is normally defined as a collection of elements $x \in X$ that can
be finite, countable or over-countable. Each single element can either belong to or not
belong to a set $X' \subseteq X$. For a fuzzy set, a characteristic function allows various degrees
of membership for the elements of a given set. A fuzzy set $\mathbf{X}$ in $X$ is then a set of ordered
pairs [34]:

$$\mathbf{X} = \{ (x, \mu_{\mathbf{X}}(x)) \mid x \in X \} \quad (4)$$

where $\mu_{\mathbf{X}}(x) \in [0, 1]$ is called the membership function or grade of membership (also
the degree of compatibility or degree of truth) of $x$ in $\mathbf{X}$.
In crisp sets, for the two special sets $\emptyset$ and $U$, the approximations are simply defined as
$\mu_{\underline{appr}_R U}(x) = 1$ and $\mu_{\underline{appr}_R \emptyset}(x) = 0$. Based on the two equivalent definitions, lower and upper
approximations may be interpreted as follows: an element $x$ belongs to the lower
approximation $\underline{appr}_R X$ if all elements equivalent to $x$ belong to $X$. In other words, $x$
belongs to the lower approximation of $X$ if any element not in $X$ is not equivalent to $x$,
namely, $\mu_R(x, y) = 0$. Likewise, $x$ belongs to the upper approximation of $X$ if some
element $y \in X$ satisfies $\mu_R(x, y) = 1$.
Now the notion of rough sets in crisp sets is extended to include fuzzy sets. Let $\mu_{\mathbf{X}}$
and $\mu_R$ denote the membership functions of the set $\mathbf{X}$ and of the fuzzy relation
$R = \{((x, y), \mu_R(x, y)) \mid (x, y) \in U \times U\}$, respectively. The fuzzy approximation space [23] of
fuzzy set $\mathbf{X}$ on $X$ in terms of fuzzy relation $\mu_R(x, y)$ can be defined as follows:

$$\mu_{\underline{appr}_R \mathbf{X}}(x) = \inf_{y \in U} \mathcal{I}(\mu_R(x, y), \mu_{\mathbf{X}}(y)) \quad (5)$$

$$\mu_{\overline{appr}_R \mathbf{X}}(x) = \sup_{y \in U} \mathcal{T}(\mu_R(x, y), \mu_{\mathbf{X}}(y)) \quad (6)$$

where $\mathcal{I}$ and $\mathcal{T}$ are a fuzzy implicator¹ and a triangular norm (t-norm)², respectively.
¹A fuzzy implicator is a function $\mathcal{I} : [0,1] \times [0,1] \to [0,1]$ which satisfies the following properties:
$\mathcal{I}(0, 0) = 1$, $\mathcal{I}(1, a) = a$; $\mathcal{I}$ is decreasing in the first argument and increasing in the second argument.
From the above equations, it can be noticed that the lower and upper approximation
memberships of one instance depend strongly on only one other instance, since they are
defined by minimum and maximum. To soften this definition, Cornelis et al. [5] suggest
a fuzzy-rough set definition using ordered weighted averaging (OWA) aggregation [32]. An
OWA operator $F_W$ is a mapping $F : \mathbb{R}^n \to \mathbb{R}$ using a weight vector $W = \langle w_1, \dots, w_n \rangle$ such that

$$F_W(a_1, \dots, a_n) = \sum_{i=1}^{n} w_i b_i \quad (7)$$

where $b_i$ is the $i$-th largest of the values in $a_1, \dots, a_n$.
An OWA operator is bounded by the minimum and maximum operators; it can thus
soften them. Suppose that there are two weight vectors
$W^{\min} = \langle w_1^{\min}, \dots, w_n^{\min} \rangle$ and $W^{\max} = \langle w_1^{\max}, \dots, w_n^{\max} \rangle$, where $w_1^{\min} \le \dots \le w_n^{\min}$ and $w_1^{\max} \ge \dots \ge w_n^{\max}$;
we can then define OWA fuzzy rough sets as follows:

$$\mu_{\underline{appr}_R \mathbf{X}}(x) = F_{W^{\min}}\big( \mathcal{I}(\mu_R(x, y), \mu_{\mathbf{X}}(y)) \big)_{y \in U} \quad (8)$$

$$\mu_{\overline{appr}_R \mathbf{X}}(x) = F_{W^{\max}}\big( \mathcal{T}(\mu_R(x, y), \mu_{\mathbf{X}}(y)) \big)_{y \in U} \quad (9)$$
The effectiveness of using OWA to define fuzzy rough sets for dealing with imbalance has
been demonstrated in the literature [26, 30]. In fact, the definition of a fuzzy relation on an attribute
set depends on the individual system. One example of defining relations to deal
with incomplete information systems is shown in [17]. Further investigation of the
combination of fuzzy and rough sets can be found in [5, 8, 9, 23, 33].
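To make the OWA construction concrete, the following minimal numpy sketch computes the memberships in equations (8) and (9) for all instances at once. It is an illustration only: the Lukasiewicz implicator and t-norm used here are an assumption (the paper only fixes the t-norm later, in Section 5), and the inputs `R` and `mu_X` are hypothetical.

```python
import numpy as np

def owa(values, weights):
    # OWA aggregation (Eq. 7): weights are applied to values sorted descending.
    return np.sum(np.sort(values)[::-1] * weights)

def owa_approximations(R, mu_X, w_min, w_max):
    """OWA fuzzy-rough approximations (Eqs. 8-9).
    R: (n, n) fuzzy relation matrix with R[i, j] = mu_R(x_i, x_j);
    mu_X: (n,) class membership vector; w_min increasing, w_max decreasing."""
    n = len(mu_X)
    lower, upper = np.empty(n), np.empty(n)
    for i in range(n):
        impl = np.minimum(1.0 - R[i] + mu_X, 1.0)   # Lukasiewicz implicator
        tnorm = np.maximum(R[i] + mu_X - 1.0, 0.0)  # Lukasiewicz t-norm
        lower[i] = owa(impl, w_min)                 # softened infimum
        upper[i] = owa(tnorm, w_max)                # softened supremum
    return lower, upper
```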
3. ROUGH SELECTION AND PROBLEM STATEMENT
Significant research on instance selection using rough set models was introduced in the
work of Caballero et al. [3]. For this purpose, the authors calculated approximations and
positive regions for each class in the training sets. Instances for the training phases are then
selected in two ways. In the first method, they try to delete all instances in the boundary region.
The second employs a nearest neighborhood algorithm to relabel instances in the boundary
region. The issues associated with these approaches were raised in Jensen and Cornelis'
study [13]. Another possible limitation is that they deal only with crisp sets of attributes
in decision tables.
To deal with the problems identified in the study of Caballero et al., an approach
called Fuzzy-Rough Instance Selection (FRIS) has been proposed [13]. The idea of
this approach is to use memberships of positive regions to determine which instances
should be kept and which should be discarded for training. First, the method
calculates the relations among instances and then measures the quality of each instance by its
positive region membership. An instance is removed if its membership degree is
less than a specified threshold. For the same purpose, another approach to measuring
²A t-norm is a function $\mathcal{T} : [0,1] \times [0,1] \to [0,1]$ which satisfies the following properties.
Commutativity: $\mathcal{T}(a, b) = \mathcal{T}(b, a)$; monotonicity: $\mathcal{T}(a, b) \le \mathcal{T}(c, d)$ if $a \le c$ and $b \le d$;
associativity: $\mathcal{T}(a, \mathcal{T}(b, c)) = \mathcal{T}(\mathcal{T}(a, b), c)$; the number 1 acts as an identity element: $\mathcal{T}(a, 1) = a$.
the quality of instances was introduced in [28]. Such approaches will undoubtedly refine the
positive regions with quality instances.
However, one limitation to be considered is that only one threshold is used for
all classes. In some applications, such as medical disease prediction, it is not uncommon
to be interested in some specific groups, disease groups for example, rather than the
others (healthy groups). When only one threshold is used, once noise is eliminated,
valuable instances in the interesting classes also disappear. Figure 1 illustrates the approximations
of two datasets, namely "vehicle3" and "yeast6"³. In "vehicle3", if we use one threshold
for the lower approximations of both classes, the number of deleted items could be the
same for the negative and positive classes. In contrast, if the threshold is 0.5 in dataset "yeast6",
the system will remove almost all data in the positive class.
Figure 1. Distribution of approximations of the two datasets.
Some approaches take advantage of fuzzy-rough sets to qualify the instance sets formed by
SMOTE [24, 25, 29]. The issues above may therefore be avoided. The steps of these
methodologies can be summarized as follows (a sketch is given after the list):
1. Use fuzzy-rough sets to calculate the quality of every instance and choose
instances of high quality.
2. Use SMOTE to create artificial instances from the first step and add them to the dataset.
3. Use fuzzy-rough sets to eliminate low quality instances from the dataset after the
second step.
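The following sketch shows the shape of this three-step pipeline, assuming the SMOTE implementation from the third-party imbalanced-learn package; `quality_fn` and the thresholds `t1`, `t2` are hypothetical stand-ins for the quality measures of the individual papers, which differ between steps 1 and 3 as discussed below.

```python
import numpy as np
from imblearn.over_sampling import SMOTE  # third-party balancing library

def fuzzy_rough_smote_pipeline(X, y, quality_fn, t1, t2):
    """Step 1: drop low-quality originals; step 2: oversample with SMOTE;
    step 3: re-filter the balanced candidate set."""
    keep = quality_fn(X, y) >= t1
    X1, y1 = X[keep], y[keep]
    X2, y2 = SMOTE().fit_resample(X1, y1)
    keep = quality_fn(X2, y2) >= t2
    return X2[keep], y2[keep]
```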
There are some differences among these methods. In Ramentol et al.'s study, named
SMOTE-RSB∗ [24, 25], the first step does not appear, and in step 3 the algorithm removes only
low quality instances from the artificial instance set. In Verbiest's research [29], the quality
measurements differ between steps 1 and 3: the first measures performance for imbalanced
data while the last deals with balanced information. They therefore call the algorithm FRIPS-
SMOTE-FRBPS (hereinafter referred to as FRPSS in this paper). It is also claimed in [29]
that FRPSS produces better training sets than SMOTE-RSB∗.
On the other hand, measuring the quality of each instance and applying different thresholds
for the majority and minority classes can be a good approach for imbalanced datasets [16]. The
³Properties of the two datasets are described in Table 1.
idea is that choosing different thresholds results in removing more data from the majority
classes. In [27], each class has its own threshold for positive regions, so that
method can be placed in this group when applied to imbalanced datasets. In [16],
another method for sub-sampling the majority classes and oversampling the minority class was
proposed based on this idea. Some experiments have already shown considerable
improvement in classification performance.
There are two issues that need to be solved. Firstly, in some datasets, the quality
measurements of all instances are the same within each class; therefore it cannot be decided
which instances should be removed or retained. Secondly, in the previous study of
multiple-threshold fuzzy rough instance selection, there are no methods to optimize the
thresholds.
4. MULTIPLE THRESHOLDS FUZZY ROUGH INSTANCE SELECTION
As stated in the last section, the study in [16] was not complete. This section first
revises the two algorithms proposed in [16] and introduces a new method with some
mathematical definitions. It then presents methods to increase the standard deviation
so that the approximations are spread over a wider range of values. Last, we show the
methods of choosing thresholds for the algorithms.
4.1. The algorithms
In decision table $I = (U, A \cup \{d\})$, let $R_a$ and $\mu_{R_a}(x, y)$ denote a fuzzy relation
and its membership function, respectively, between objects $x, y \in U$ on attribute $a \in A$.
Then the membership function of the relation on an attribute set $P \subseteq A$ is defined as:

$$\mu_{R_P}(x, y) = \mathcal{T}_{a \in P}\, \mu_{R_a}(x, y) \quad (10)$$

where $\mathcal{T}$ is a t-norm.
In fact, there are many ways to calculate the membership function of a fuzzy relation
between two instances. It is also possible to use distance-based similarity, such as
normalized Euclidean distance. It depends on the characteristics of each system.
The membership functions of the approximation spaces for an object set $X \subseteq U$ on
attribute set $P \subseteq A$ can be defined by either equations (5) and (6) or (8) and (9). In this
study, as mentioned in Section 2, we must note that the decision table $I = (U, A \cup \{d\})$
is a single-label decision table such that $f_d(x)$ is a precise value. In fuzzy sets, for a
single-label decision table, $\mu_{C_i}(x) > 0$ if $f_d(x) = d_i$ and $\mu_{C_i}(x) = 0$ otherwise.
In selecting learning instances, we first define a function to measure the quality of an
instance for each class $X$ in terms of relation $R$:

$$\gamma_{\mathbf{X}}^{R}(x) = \alpha\, \mu_{\underline{appr}_R \mathbf{X}}(x) + (1 - \alpha)\, \mu_{\overline{appr}_R \mathbf{X}}(x) \quad (11)$$

where $\alpha \in [0, 1]$ and $\mathbf{X}$ is the fuzzy set on $X$. In the above equation, note that $x$ may or may
not belong to $X$; this is where this approach differs from other approaches. When
comparing $\gamma_{\mathbf{X}}^{R}(x)$ for $x \in X$, we can assume that $\alpha = 1$ as in the study of Jensen et al.
[13] and $\alpha = 0.5$ as in the approach of Verbiest et al. [29].
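As a small illustration, the quality measure in equation (11) is a convex combination of the two approximation memberships; a minimal sketch, reusing the `owa_approximations` helper sketched in Section 2:

```python
def instance_quality(lower, upper, alpha=0.5):
    # Eq. (11): gamma(x) = alpha * lower(x) + (1 - alpha) * upper(x).
    # `lower` and `upper` are the OWA approximation membership vectors.
    return alpha * lower + (1.0 - alpha) * upper
```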
In this paper, we focus on developing methods for balancing data with binary classes.
$X$ and $Y$ denote the two classes of an imbalanced dataset: the former is the minority class
and the latter is the majority class. The terms positive/negative classes are sometimes used
interchangeably with minority/majority classes in this article.
Now, let $t_X$ and $t_Y$ be the selection thresholds for the minority and majority classes, let $\mathbf{X}$
and $\mathbf{Y}$ be the fuzzy sets on minority class $X$ and majority class $Y$, respectively, and let $R$
denote the fuzzy relation on the universe. The set of instances selected for the training
phase can be defined as:

$$S = \{ x \in X \mid \gamma_{\mathbf{X}}^{R}(x) \ge t_X \} \cup \{ x \in Y \mid \gamma_{\mathbf{Y}}^{R}(x) \ge t_Y \} \quad (12)$$

The algorithm to select instances can then be described as shown in Algorithm 1.
Algorithm 1 - MFRIS1 - Choosing or eliminating instances for both majority and
minority classes
Require:
  $X, Y$: the minority and majority classes;
  $t_X, t_Y$: the selection thresholds for the minority and majority classes.
Ensure: Decision table $(S, A \cup \{d\})$
  Calculate the quality measurement of all instances for their classes
  $S \leftarrow \emptyset$
  for $x \in X$ do
    if $\gamma_{\mathbf{X}}^{R}(x) \ge t_X$ then
      $S \leftarrow S \cup \{x\}$
    end if
  end for
  for $x \in Y$ do
    if $\gamma_{\mathbf{Y}}^{R}(x) \ge t_Y$ then
      $S \leftarrow S \cup \{x\}$
    end if
  end for
In the first algorithm, depending on the label of each instance, the quality measurement of
the instance with respect to its class is compared with the minority or majority
threshold. This means that an instance will be deleted from the training set if it is of
low quality, even if it is in the minority class. The different thresholds may help us keep
more instances of the minority class if necessary.
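A minimal sketch of MFRIS1 as a boolean selection mask, assuming the per-instance qualities have already been computed (all names here are illustrative, not from the paper):

```python
import numpy as np

def mfris1(q_min, q_maj, is_minority, t_x, t_y):
    """Eq. (12): keep minority instances whose minority-class quality reaches
    t_x and majority instances whose majority-class quality reaches t_y.
    q_min, q_maj: per-instance qualities gamma_X(x), gamma_Y(x);
    is_minority: boolean mask of minority instances."""
    return (is_minority & (q_min >= t_x)) | (~is_minority & (q_maj >= t_y))
```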
In addition, we also propose a second algorithm to select instances. The idea of
this method is to keep all instances in the minority class while removing or relabeling
some instances in the majority class. To describe the algorithm, we first introduce some
definitions.
Let $\mathbf{X}, \mathbf{Y}$ denote the fuzzy sets on $X, Y$, respectively, and let $t_X$ and $t_Y$ be the thresholds for
the minority and majority classes. The set of instances whose labels could
be changed can be defined as:

$$S_{Y \to X} = \{ x \in Y \mid \gamma_{\mathbf{Y}}^{R}(x) < t_Y \wedge \gamma_{\mathbf{X}}^{R}(x) \ge t_X \} \quad (13)$$

From the above definition, there are some instances in the majority class whose labels
could be changed to the minority class label. We can then re-calculate the class
membership functions of an instance $x \in S_{Y \to X}$ as $\mu_{\mathbf{X}}(x) = \mu_{\mathbf{Y}}(x)$; $\mu_{\mathbf{Y}}(x) = 0$. Finally, the
selected instances for the training phase can be defined as:

$$S = X \cup \{ x \in Y \mid \gamma_{\mathbf{Y}}^{R}(x) \ge t_Y \} \cup S_{Y \to X} \quad (14)$$
The algorithm MFRIS2 is then described as shown in Algorithm 2.
Algorithm 2 - MFRIS2 - Choosing, removing or relabelling instances for majority
classes
Require:
  $X, Y$: the minority and majority classes;
  $t_X, t_Y$: the selection thresholds for the minority and majority classes.
Ensure: Decision table $(S, A \cup \{d\})$
  Calculate the quality measurement of all instances in the majority class for all
  classes
  $S_Y \leftarrow \emptyset$; $S_{Y \to X} \leftarrow \emptyset$
  for $x \in Y$ do
    if $\gamma_{\mathbf{Y}}^{R}(x) \ge t_Y$ then
      $S_Y \leftarrow S_Y \cup \{x\}$
    else if $\gamma_{\mathbf{X}}^{R}(x) \ge t_X$ then
      $\mu_{\mathbf{X}}(x) \leftarrow \mu_{\mathbf{Y}}(x)$; $\mu_{\mathbf{Y}}(x) \leftarrow 0$
      $S_{Y \to X} \leftarrow S_{Y \to X} \cup \{x\}$
    end if
  end for
  $S \leftarrow X \cup S_Y \cup S_{Y \to X}$
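A compact sketch of MFRIS2 over label arrays (the names and array layout are illustrative assumptions):

```python
import numpy as np

def mfris2(q_min, q_maj, y, minority_label, t_x, t_y):
    """Eqs. (13)-(14): keep the whole minority class; in the majority class,
    keep high-quality instances and relabel those that are low quality for Y
    but high quality for X. Returns (selection mask, updated labels)."""
    new_y = y.copy()
    is_maj = (y != minority_label)
    keep = ~is_maj | (q_maj >= t_y)                    # X plus quality Y items
    relabel = is_maj & (q_maj < t_y) & (q_min >= t_x)  # S_{Y->X}, Eq. (13)
    new_y[relabel] = minority_label
    return keep | relabel, new_y
```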
It was discussed in our earlier papers that the second algorithm works well on highly
imbalanced data. This is due to the oversampling of the minority class and the sub-sampling
of the majority side. However, if the quality of the original instances in the positive class is low, it
may consequently cause low performance. Therefore, at this point, another algorithm is
introduced for such kinds of datasets.
First, the set of negative instances whose labels need to be changed is defined by Equation (13). The quality
memberships of all instances, including the original and synthesized items, are then compared
with the thresholds to choose a valuable training set. The selected set is defined as follows:

$$S = \{ x \in X \cup S_{Y \to X} \mid \gamma_{\mathbf{X}}^{R}(x) \ge t_X \} \cup \{ x \in Y \mid \gamma_{\mathbf{Y}}^{R}(x) \ge t_Y \} \quad (15)$$
The algorithm MFRIS3 is then depicted as shown in Algorithm 3.
Algorithm 3 - MFRIS3 - Choosing instances for positive, removing or relabeling
instances for majority classes
Require:
  $X, Y$: the minority and majority classes;
  $t_X, t_Y$: the selection thresholds for the minority and majority classes.
Ensure: Decision table $(S, A \cup \{d\})$
  Calculate the quality measurement of all instances for all classes
  $S_X \leftarrow \emptyset$; $S_Y \leftarrow \emptyset$; $S_{Y \to X} \leftarrow \emptyset$
  for $x \in Y$ do
    if $\gamma_{\mathbf{Y}}^{R}(x) \ge t_Y$ then
      $S_Y \leftarrow S_Y \cup \{x\}$
    else if $\gamma_{\mathbf{X}}^{R}(x) \ge t_X$ then
      $\mu_{\mathbf{X}}(x) \leftarrow \mu_{\mathbf{Y}}(x)$; $\mu_{\mathbf{Y}}(x) \leftarrow 0$
      $S_{Y \to X} \leftarrow S_{Y \to X} \cup \{x\}$
    end if
  end for
  for $x \in X$ do
    if $\gamma_{\mathbf{X}}^{R}(x) \ge t_X$ then
      $S_X \leftarrow S_X \cup \{x\}$
    end if
  end for
  $S \leftarrow S_X \cup S_Y \cup S_{Y \to X}$
4.2. Thresholds Optimization and Granular Tuning
In [16], the researchers only illustrated improvement in classification performance by
changing the thresholds for the minority and majority classes. However, at that stage,
they had to choose thresholds manually for each dataset. Furthermore, in some cases, the
qualities of instances with respect to their classes are not distinguishable, which makes it
impossible to cleanse the dataset. The second row of Figure 2 shows an example of a histogram of
negative class approximations with low dispersion: all instances in the negative class have
the same average approximation membership of 0.8. Therefore, while optimizing
thresholds, we try to spread out the membership values whenever such issues occur.
Because instances are chosen or removed by comparing quality degrees with
thresholds, it is possible to use the qualities of the instances themselves as threshold candidates [29],
and then use a leave-one-out strategy to test and compute accuracy on the learning data.
However, this results in thousands of train-and-test rounds when the number of instances is large.
Figure 2. Histograms showing the distribution of approximations and average approximations for a dataset.
For the algorithms in this paper, we divide the qualities of the instances into groups by a
limited number of cuts and use these cuts as threshold candidates. For a dataset with
low dispersion of approximations/instance qualities, we first need to spread out the
degrees by changing the relations among instances. To do this, the membership
function of the relation between two objects is defined as follows:

$$\mu_{R_a}(x, y) = 1 - \frac{\delta_a(x, y)}{\sigma} \quad (16)$$

where $\delta_a(x, y) \in [0, 1]$ is the normalized distance between $x$ and $y$, $\sigma \in (0, 1]$ is called the
granularity, and memberships are truncated to $[0, 1]$.
For discrete values, $\mu_{R_a}(x, y) = 1$ if $f_a(x) = f_a(y)$, $f_a(x) = *$ or $f_a(y) = *$, and
$\mu_{R_a}(x, y) = 0$ otherwise. In the case of continuous values,

$$\delta_a(x, y) = \frac{|f_a(x) - f_a(y)|}{l(a)}$$

where $l(a)$ is the range of the value domain of $a$.
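A sketch of this relation construction for a continuous attribute, under the truncation assumption noted for equation (16):

```python
import numpy as np

def normalized_distance(values):
    # delta_a(x, y) = |f_a(x) - f_a(y)| / l(a) for one continuous attribute.
    rng = values.max() - values.min()
    d = np.abs(values[:, None] - values[None, :])
    return d / rng if rng > 0 else np.zeros_like(d)

def relation_membership(delta, sigma):
    # Eq. (16) with truncation: a smaller sigma makes the relation drop to zero
    # faster, which spreads the resulting instance qualities further apart.
    return np.clip(1.0 - delta / sigma, 0.0, 1.0)
```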
Figure 3. Standard deviation for the ozone_one_hr dataset.
Figure 3 depicts the change in the standard deviation of the negative instance qualities for the
ozone_one_hr dataset. The higher the standard deviation, the more spread out the data points
are. However, it is unnecessary to find the granularity that yields the maximal standard
deviation. For the three algorithms in this research, it suffices to find a granularity
that broadens the instance quality data points to some extent. Returning to Figure 2, the
first row shows that with granularity equal to 0.125, the approximations on the dataset
spread out more than with granularity equal to 1.0. Algorithm 4 shows the
method to tune the granularity.
Algorithm 4 - Tuning granularity
Require: $X, Y$: the minority and majority classes;
  $ngroups$: the number of threshold candidates.
Ensure: Granular quality measurements
  $\sigma \leftarrow 1$; $maxgroups \leftarrow 0$
  for $\sigma$ from 1 down to 0, step $-0.05$ do
    Calculate the quality measurements with $\sigma$
    $cut \leftarrow make\_cut(\{\gamma_{\mathbf{Y}}^{R}(x) \mid x \in Y\}, ngroups)$
    if $cut > maxgroups$ then
      $maxgroups \leftarrow cut$; store the qualities
    end if
    if $cut \ge ngroups$ then
      break
    end if
  end for
Note that granularity in this research differs from the study in [28]. In that
approach, the authors try to find a maximal granularity in $[0, \infty)$ for each instance: every instance
has its own granularity, by which the instance fully belongs to the positive region. In
this research, there is only one granularity for the whole system, used to spread out the
instance qualities.
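A sketch of this tuning loop, where `qualities_for_sigma` is a hypothetical callback recomputing the majority-class qualities for a given granularity, and the count of distinct quality levels stands in for the paper's make_cut candidate count:

```python
import numpy as np

def tune_granularity(qualities_for_sigma, ngroups, step=0.05):
    """Algorithm 4 sketch: scan sigma from 1 downward, keeping the granularity
    that yields the most distinct quality levels, and stop early once enough
    threshold candidates are available."""
    best_sigma, best_cuts = 1.0, 0
    for sigma in np.arange(1.0, 0.0, -step):
        q = qualities_for_sigma(sigma)
        cuts = len(np.unique(np.round(q, 6)))  # distinct quality levels
        if cuts > best_cuts:
            best_sigma, best_cuts = sigma, cuts
        if cuts >= ngroups:
            break
    return best_sigma
```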
The tuned granularity is then used to find optimal thresholds as depicted in Algorithm 5.
Algorithm 5 - Optimizing thresholds
Require:
  $X, Y$: the minority and majority classes;
  $ngroups$: the number of threshold candidates for each class;
  $\gamma_{\mathbf{X}}^{R}(x), \gamma_{\mathbf{Y}}^{R}(x), x \in U$: the qualities for both $X$ and $Y$.
Ensure: Optimal thresholds
  if MFRIS1 then
    $c(t_X) \leftarrow cut(\{\gamma_{\mathbf{X}}^{R}(x) \mid x \in X\}, ngroups)$
    $c(t_Y) \leftarrow cut(\{\gamma_{\mathbf{Y}}^{R}(x) \mid x \in Y\}, ngroups)$
  else if MFRIS2 then
    $c(t_X) \leftarrow cut(\{\gamma_{\mathbf{X}}^{R}(x) \mid x \in Y\}, ngroups)$
    $c(t_Y) \leftarrow cut(\{\gamma_{\mathbf{Y}}^{R}(x) \mid x \in Y\}, ngroups)$
  else if MFRIS3 then
    $c(t_X) \leftarrow cut(\{\gamma_{\mathbf{X}}^{R}(x) \mid x \in X\}, ngroups) \cup cut(\{\gamma_{\mathbf{X}}^{R}(x) \mid x \in Y\}, ngroups)$
    $c(t_Y) \leftarrow cut(\{\gamma_{\mathbf{Y}}^{R}(x) \mid x \in Y\}, ngroups)$
  end if
  for $t_X \in c(t_X)$, $t_Y \in c(t_Y)$ do
    $TrainSet \leftarrow MFRIS(U, t_X, t_Y)$ by the chosen algorithm
    for $x \in U$ do
      $TrainSet \leftarrow TrainSet - \{x\}$
      Determine the decision of $x$ based on $\mu_R(x, y), y \in TrainSet$
    end for
    Calculate $AUC(t_X, t_Y)$
  end for
  Optimize: $(t_X, t_Y) = \arg\max AUC(t_X, t_Y)$
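A grid-search sketch of this optimization, where `select_fn` and `loo_auc` are hypothetical callbacks wrapping one of the MFRIS algorithms and a leave-one-out AUC evaluation, respectively:

```python
from itertools import product

def optimize_thresholds(cand_tx, cand_ty, select_fn, loo_auc):
    """Algorithm 5 sketch: try every candidate threshold pair, score the
    selected training set by leave-one-out AUC, and return the best pair."""
    best_tx, best_ty, best_auc = None, None, -1.0
    for tx, ty in product(cand_tx, cand_ty):
        auc = loo_auc(select_fn(tx, ty))
        if auc > best_auc:
            best_tx, best_ty, best_auc = tx, ty, auc
    return best_tx, best_ty
```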
5. EXPERIMENTS
5.1. Setting up experiments
Experiments were conducted on datasets from the KEEL dataset repository [1] and the UCI
Machine Learning Repository [2]. The datasets and their properties are shown in Table 1. In
this table, "IR" shows the imbalance ratio between the numbers of instances in the majority and
minority classes. Each dataset is divided into three parts for both the minority and majority
classes. For each train/test run, we used two parts to form the training data and the rest were
used as testing instances.
Experiments were conducted using R running in RStudio.
Table 1. Datasets for experiments.
Datasets Instances Features IR
iris0 150 4 2.00
Indian Liver Patient 579 10 2.51
vehicle3 846 18 2.99
glass-0-1-2-3 vs 4-5-6 214 9 3.20
new-thyroid1 215 5 5.14
newthyroid2 215 5 5.14
segment0 2308 19 6.02
glass6 214 9 6.38
paw02a-800-7-30-BI 800 2 7.00
glass-0-1-5_vs_2 172 9 9.12
yeast-0-2-5-6 vs 3-7-8-9 1004 8 9.14
yeast-0-2-5-7-9 vs 3-6-8 1004 8 9.14
ecoli-0-4-6 vs 5 203 6 9.15
ecoli-0-6-7 vs 5 220 6 10.00
ecoli-0-1-4-7 vs 2-3-5-6 336 7 10.59
ecoli-0-1 vs 5 240 6 11.00
glass2 214 9 11.59
cleveland-0 vs 4 177 13 12.62
ecoli-0-1-4-6 vs 5 280 6 13.00
ozone eight hr 1847 72 13.43
shuttle-c0-vs-c4 1829 9 13.87
glass4 214 9 15.46
abalone9-18 731 8 16.40
yeast-1-4-5-8 vs 7 693 8 22.10
glass5 214 9 22.78
kr-vs-k-one vs fifteen 2244 6 27.77
yeast4 1484 8 28.10
winequality-red-4 1599 11 29.17
yeast-1-2-8-9 vs 7 947 8 30.57
ozone one hr 1848 72 31.42
yeast5 1484 8 32.73
winequality-red-8 vs 6 656 11 35.44
ecoli-0-1-3-7 vs 2-6 280 7 39.00
yeast6 1484 8 41.40
winequality-red-8 vs 6-7 855 11 46.50
abalone-19 vs 10-11-12-13 1622 8 49.69
shuttle-2 vs 5 3316 9 66.67
winequality-red-3 vs 5 691 11 68.10
poker-8-9 vs 5 2075 10 82.00
The training data are first cleaned by the instance selection algorithms. For the rough set based
instance selection methods, we measure the fuzzy tolerance relation on an attribute by the
membership function $\mu_{R_a}(x, y)$ as discussed in Section 4.2.
The fuzzy tolerance relation on a set of attributes can be calculated by combining the
relations on each attribute with Lukasiewicz's t-norm $\mathcal{T}(x_1, x_2) = \max(x_1 + x_2 - 1, 0)$. For
the methods which use OWA fuzzy rough sets, the OWA operators are set as follows:

$$W^{\min}: \quad w_{n+1-i}^{\min} = \frac{2^{3-i}}{2^3 - 1} \text{ for } i = 1, 2, 3 \text{ and } w_j^{\min} = 0 \text{ for } j = 1, \dots, n - 3,$$

$$W^{\max}: \quad w_i^{\max} = \frac{2^{3-i}}{2^3 - 1} \text{ for } i = 1, 2, 3 \text{ and } w_i^{\max} = 0 \text{ for } i = 4, \dots, n,$$

where $n$ is the number of instances in the dataset.
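In other words, the three non-zero weights are $4/7$, $2/7$ and $1/7$. A small helper to build these vectors (illustrative, intended for use with the `owa_approximations` sketch from Section 2):

```python
import numpy as np

def owa_weights(n, k=3):
    """Exponential OWA weights: the k non-zero weights are 2^(k-1)/(2^k - 1),
    ..., 1/(2^k - 1). W_max places them on the largest values (decreasing),
    W_min mirrors them onto the smallest values (increasing)."""
    w = 2.0 ** np.arange(k - 1, -1, -1) / (2.0 ** k - 1)  # [4/7, 2/7, 1/7]
    w_max = np.concatenate([w, np.zeros(n - k)])
    w_min = w_max[::-1].copy()
    return w_min, w_max
```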
For our approach, we also use OWA fuzzy rough sets with the above operators and set
$\alpha = 0.5$ to calculate $\gamma_{\mathbf{X}}^{R}$. That is the same as the instance quality measurement in FRIPS-
SMOTE-FRBPS in terms of distinguishing instance qualities.