ABSTRACT—Most of the Outlier detection

algorithms in data mining are used to find outliers in static databases. Those algorithms

are generally inappropriate for detecting outliers in dynamic databases where

data continuously arrives in the form of streams such as sensor data.

Association rule based outlier detection method can be applied to streamed data

where frequent item sets are evaluated internally. One of the outlier detection

approaches used for static databases include clustering based method, where

K-means clustering algorithm is used internally for discovering outliers from

various static databases. In this paper, we propose two approaches for outlier

detection. One is to use association rules based method for dynamic databases

and the other is to use pruning based local outlier detection method, which

internally uses K-means clustering method for static databases. Experiments on

various data sets are performed to detect the deviant data effectively in fewer

computations.

Keywords—outlier detection,

static databases, dynamic databases, association rules, k-means clustering.

1. INTRODUCTION

An outlier is

referred to be a data point or an object which differs from other data points

or objects within a data set. One of the important data mining techniques

commonly used in detecting deviant data present in a data set is Outlier

detection. Examples of outlier detection includes credit card fraud detection,

network intrusion detection, stock market analysis, etc., where efficient

algorithms are used in resolving those problems.

The characteristic

feature of streamed data is that data arrives continuously and in a time based

order. Hence they are used in various applications like feeding of sensor data,

measurement of network traffic. Generally, various outlier detection algorithms

are applied on static databases. Because their work is to check whole data sets

in an iterative manner to find top outliers or global outliers. But detecting

outliers in streamed data is difficult for two reasons:

1.

The data sets in streamed data

are unbounded, which prevents scanning the entire data stream.

2.

Streamed data require quick

responses for their request and the process has to be completed in a limited

time. Multiple scans for the data stream are not possible.

In static

databases, distance-based techniques are used to model the data points and

outliers are determined. Various distance functions used are Euclidean

distance, Manhattan distance and Minkowski distance. In clustering based

outlier detection method, K-means algorithm is the most popular one, where the

data set is to be divided into clusters. The objects or data points which lie

near to the centroid of the cluster are assigned to the cluster and those

objects or data points which lie far away to the centroid of the cluster are

the outliers. While evaluating the degree of outlierness of an object, the

objects within a cluster are obtained, and they are pruned from each cluster

during outlier detection. This results in the reduction of computational time.

In this paper, we

have proposed two approaches for outlier detection, one involving streaming

data and the other involving static databases. In the first approach, an

algorithm employing association rules to check streamed data for find out the

outliers is proposed. This algorithm defines a bounded set of transactions for

a data stream, which in turn derives association rules from the transactions.

The transactions containing outliers will be evaluated based on those rules.

For instance, if {P,Q} à {R} is an association rule with a

high confidence value of 85%, then a transaction

is an outlier

because the item R is not appearing in the transaction. A transaction is said

to contain an outlier, if it does not contain an item which it is supposed to contain

based on the high confidence association rule.

In the second

approach, a static database is taken and among various data points, a distance

function is applied on them and various clusters are formed. The objects which

lie near to the centroid of the cluster are pruned to evaluate outliers. Then a

distance-based measure i.e., a Local Distance-based Outlier Factor (LDOF) is

applied to the remaining objects, whose task is to find whether an object is an

outlier or not. Once n outliers in a

data set are found, then the top n

objects among n are reported as

outliers.

The remaining

sections of this paper are organized as follows: Section 2 provides related

works on outlier detection. Section 3 describes an association rule based

outlier detection method. Section 4 describes a pruning based local

distance-based outlier factor outlier detection method. Section 5 provides

experimental results. Section 6 concludes the paper.

2. RELATED WORK

A. Data Models in Streamed Data

Streamed data when

compared with the traditional static databases contain unbounded data sets and

it is difficult to choose a finite number of transactions 1 for mining. In

order to resolve this problem, two data models were proposed:

1.

The accumulative data model

which takes the transactions from the starting to the current time into

consideration. If any new transaction arrives, then it is accumulated into the

database.

2.

The sliding window model which

is used to form a bounded data set of transactions. When mining is to be

performed, then only the transactions that are in the sliding window are taken

into consideration. The results of mining are updated as soon as the window

passes by.

B.

Association Rules and Outliers in Streamed Data

The estDec method

3 is considered which internally uses a sliding window model, and it finds

frequent item sets within the current window. Initially, the estDec method

performs a check on the transactions that are currently in the sliding window

and builds a prefix-tree P by

considering the frequent item sets. Once the window slides to the next bucket,

then the estDec method finds new frequent item sets and it performs either

insertion or deletion of item sets into P

and the process continues iteratively. To obtain association rules, the

prefix-tree is traversed for obtaining subsets of frequent item sets. A

traversal stack is maintained to derive association rules, which stores a

frequent itemset in the prefix-tree 1 and derives all possible association

rules from the itemset.

A transaction 1

is said to contain an outlier if some items were supposed to be present in a

transaction but they are actually not present there. To evaluate 1 whether a

transaction has an outlier, an outlier degree is defined.

C. Distance-Based Outlier Detection Techniques

Knorr and Ng 4

introduced a definition to distance-based outlier detection techniques 2. An object x in a data set D is a DB(y, dist)

– outlier if at least fraction y of the objects in D lie at a greater distance

than dist from x 2.

Angiulli and

Pizzuti 5-7 proposed an outlier detection method based on the concept of

objects’ neighbourhood, where outliers are determined after ranking each object

based on calculating sum of the distances from its k-nearest neighbors.

Breunig, Kriegel

and Ng 8 proposed a new definition named as Local Outlier Factor (LOF) to

each object, which gives its degree of outlierness. To evaluate LOF for an

object, a restricted neighbourhood of it is considered by comparing the density

of the object with its neighbors.

Zhang, Hutter and

Jin 9 proposed a local distance-based outlier detection method for finding

outlier objects. For every object, the local distance-based outlier factor 2

(LDOF) evaluates the degree to which it deviates from its nearest objects.

Calculation of LDOF for all the objects is complex i.e., O(N2), where N

specifies number of objects in a data set.

ough set theory

3 was first proposed by Z. Pawlak, which is used to deal with uncertain,

vague and incomplete data. The main idea is based on the indiscernibility

relation that describes the indistinguishability of objects. If one cannot

represent an object in the form of a crisp set, then such object can be

represented by a rough set with the approximate representation of knowledge. A

rough set gives the lower and upper approximation of the set that has

incompletely specified knowledge.

3.

ASSOCIATION RULE BASED OUTLIER DETECTION METHOD

A. Overview

Based on the 10

work, a sliding window is taken for making a bounded data stream. After

considering transactions within the window, a prefix-tree is built and then

association rules are derived from the tree by traversing it iteratively in

multiple scans. Once a transaction’s items are not found in the association

rules list, then such transaction may be an outlier.

The following

figure shows a sample traversal stack

containing items

and their counts, with the arrow indicating the top of the stack:

item

Count

in

C(i1,i2,….in)

in-1

C(i1,i2,….in-1)

in-2

C(i1,i2,….in-2)

…

…

i2

C(i1,i2)

i1

C(i1)

Figure 1. A sample

traversal stack

The following

definitions are regarding the outlier degree given by Narita and Katigawa in

10:

1 Definition 1. Let t be a transaction and R

be a set of association rules, and then t’s

associative closure t+ is

given as below:

(1)

1 Definition 2. Let t

be a transaction and R be a set of

association rules, and t+ is

the t’s associative closure. Then the

outlier degree of t is given as

below:

OutlierDegree(t) = ||/|| (2)

The range of outlier degree lies between

0 and 1.

Definition 3. A transaction is said to be an outlier transaction if its

outlier degree is greater than or equal to minimum outlier degree, which is

considered to be a threshold.

The following steps

provide a way to detect outliers in transaction data sets:

1.

Build a prefix tree

that keeps track of frequent item sets within a sliding window.

2.

Derive a set of

association rules with high confidence value from the prefix tree.

3.

Evaluate outlier

transactions from the transaction set within the sliding window.

4.

Divide outlier

transactions into two parts, one containing unobserved frequent item sets and

the other containing infrequent items.

B. Example

Let us consider a

transaction data set 1 consisting of 10 transactions in a sliding window,

each consisting of the items 1 Book, Joke, milk, bacon, Chocos, egg. The

following table gives the 10 transactions and the items in each transaction:

1 Table 1. A

Sample Transaction Data set

TID

Items

1

Book, Joke, Milk

2

Book, Chocos,

Joke, Milk

3

Book, Joke, Milk

4

Bacon, Book,

Chocos, Egg, Milk

5

Bacon, Book,

Chocos, Egg, Joke, Milk

6

Book, Chocos,

Joke, Milk

7

Bacon, Book,

Egg, Milk

8

Bacon, Book,

Egg, Joke, Milk

9

Book, Joke, Milk

10

Bacon, Egg, Milk

For deriving

association rules from table 1, the values of minimum confidence and minimum

support are set to 80% and 50% respectively. The association rules derived

after setting the above values are given in table 2, where each association

rule has a high confidence value.

1 Table 2.

Association Rules Derived from Table 1

RID

Rule

1

{Joke} à {Book}

2

{Joke, Milk} à {Book}

3

{Joke} à {Book, Milk}

4

{Bacon} à {Egg}

5

{Bacon, Milk} à {Egg}

6

{Bacon} à {Egg, Milk}

7

{Milk} à {Book}

From (1), the association

closure t+ is for transaction 2 will be

0.33.

According to

definition 3, 1 if the minimum outlier degree is set to 0.3, then transaction

2 will be an outlier. Similarly, transaction 10 will also be an outlier since

it does not follow any of the high confidence association rule given in table

2.

4.

PRUNING BASED LDOF OUTLIER DETECTION METHOD

A. Overview

Based on the 9

work, we are using Local Distance-based Outlier Factor (LDOF) measure which

specifies by how much an object is deviating from its neighbours. If LDOF value

is high, then such an object is said to be more deviating from its neighbours

and it can be considered to be an outlier. For an object p, the LDOF value can be given as:

LDOF (p) = (3)

where specifies the k-nearest neighbour distance of p

and specifies the k-nearest neighbour inner distance of p.

For an object p, the distance is equal to the average distance from

p to all objects in its nearest

neighbours’ set.

= (4)

For an object p, the distance is equal to the average distance

among all the objects in its nearest neighbours’ set.

= (5)

B. Procedure

The major drawback

of LDOF algorithm is that it is more expensive with respect to the

computations, because for each object in the data set we have to calculate

LDOF. To resolve this drawback, we have proposed a K-means algorithm to form

clusters. After clusters are formed, radius for each cluster is calculated and

the objects whose distance from the centroid is less than the radius of the

cluster are pruned. Then LDOF for the remaining (unpruned) objects is calculated,

which reduces the total computations. Among the objects, the top-n objects with higher LDOF values are

considered as outliers.

The following

steps are involved in performing the pruning based LDOF outlier detection

algorithm:

1.

Formation of clusters: Here we consider

the entire data set and apply K-means algorithm to generate clusters and the

radius for each cluster is calculated.

2.

Clusters with minimal objects: If a

cluster is found to contain less number of objects than the number of outliers

required, then pruning is avoided in the cluster.

3.

Pruning objects within a cluster:

Distance for each object in a cluster from its centroid is calculated. If the

distance is less than the cluster radius, then the object is pruned.

4.

Evaluation of outliers: For the unpruned

objects in each cluster, LDOF is calculated and the ton-n objects with high LDOF value are declared as outliers.

The following

algorithm gives the steps involved in pruning based LDOF outlier detection

algorithm:

begin

OutlierDetectio(DS,c,it,n)

set

X ß Kmeans(c,it,DS)

for each cluster Cj in X do

Radiusj

ß radius(Cj)

end for

if |Cj| > n then

for

each object pi in Cj do

if

distance(pi,oj)

< Radiusj then
prune(pi)
else
Add pi to U
end
if
end
for
else
for
each object pi in Cj do
Add pi to U
end
for
end if
for each object pi in Cj do
calculate LDOF (pi)
end for
sort
the objects based on their LDOF (pi)
values
choose
n objects among the highest LDOF (pi) values and display them
as outliers.
end
OutlierDetection()
The computational
complexity of our algorithm is c*it*N
+c*np + (w*N) 2, where c is the number of clusters that are to be generated, it is the number of iterations needed
and N is the number of objects in the
DS data set, np is the
average number of objects in each cluster, w
represents the fraction of objects that are unpruned.
5. EXPERIMENTAL RESULTS
In this section,
we perform two experiments one regarding the outlier detection by deriving
association rules from a streamed data and the other regarding the outlier
detection by calculating LDOF through pruning. The experiments are performed on
a PC with Intel 2.3GHz processor and 2GB of memory.
In this
experiment, we consider a supermarket data set consisting of 4627 transactions
and 217 attributes representing items in the supermarket. For deriving 1
association rules, both the minimum support and minimum confidence are set to
30% and 80% respectively. Among 217 attributes, we have considered only 18
attributes.
The number of
association rules is set to 8. The association rules derived from the data set
are given in table 3, which consists of rule id and the rules which have high
confidence value.
Table 3.
Association Rules Derived From Supermarket Data Set
Rule ID
Rule
1
{biscuits,
vegetables} à {Book, cake}
2
{total} à {Book, cake}
3
{biscuits,
milk-cream} à {Book, cake}
4
{biscuits,
fruit} à {Book, cake}
5
{biscuits,
frozen-foods} à {Book, cake}
6
{frozen-foods,
fruit} à {Book, cake}
7
{frozen-foods,
milk-cream} à {Book, cake}
8
{baking-needs,
milk-cream} à {Book, cake}
The confidence
value for the rules 1-4 is 0.84 and the confidence value for the rules 5-8 is
0.83. As per the association rules, if a transaction contains biscuits and
vegetables, then it should contain both Book and cake. If any one of Book and
cake is found to be missing, then such a transaction is said to be an outlier.
Similarly, transactions which do not follow above high confidence association
rules are said to be outliers.
6.
CONCLUSIONS
In this paper, we
have considered outlier detection problem in two varieties of databases, one
involving data streams, where data arrives continuously and also in a time
based order, and the other involving static databases. We have provided two
algorithms for the problem.
For streamed data,
a sliding window is considered to make the data items in a database bounded.
Then for all the transactions in the bounded data set, a prefix-tree is taken
and items are added to it, and it serves as a stack. Association rules are
derived from the transactions data set and the items which do not obey the
association rules are declared as outliers.
For static
databases, pruning based outlier detection is performed, which internally uses
K-means clustering algorithm followed by 2 local distance-based outlier
factor for each object within a cluster. The objects with high LDOF values are
declared as outliers.
Finally, we hope
this work will attain further interest in various problem areas of data mining
such as text mining, multimedia data mining etc.
ACKNOWLEDGEMENTS
The authors would
like to acknowledge Li-Jen Kao, Yo-Ping Huang, P. Rajendra, D. Jatindra Kumar
and N. Sukumar for their valuable discussions.
REFERENCES
1 Li-Jen Kao, Yo-Ping Huang, "Association rules based
algorithm for identifying outlier transactions in data stream," IEEE Int'l Conference on Systems, Man, and
Cybernetics, Oct. 14-17, 2012.
2 P. Rajendra, D. Jatindra
Kumar, N. Sukumar, "An outlier detection method based on clustering," Int'l Conference on Emerging Applications
of Information Technology, 2011.
3 J.H.
Chang and W.S. Lee, "Finding recent frequent item sets adaptively over online
data streams," in Proc. Of the 9th
ACM SIGKDD, Washington, DC, USA, pp.487-492, August
2003.
4 E.M. Knorr and R.T. Ng, "Algorithms for
mining distance-based outliers in large databases," In Proc. 24th Int. Conf. Very Large Data Bases, VLDB,
pages 392-403, 1998.
5 F. Angiulli, S. Basta, and C. Pizzuti,
"Distance-based detection and prediction of outliers," IEEE Transactions on Knowledge and Data Engineering, 18:145-160,
2006.
6 F. Angiulli and C. Pizzuti, "Fast outlier detection
in high dimensional spaces," In PKDD '02:
Proceedings of the 6th European Conference on Principles of Data
Mining and Knowledge Discovery, pages 15-26, 2002.
7
F. Angiulli and C. Pizzuti,
"Outlier mining in large high-dimensional data sets," IEEE Transactions in Knowledge and Data Engineering, 17:203-215,
2005.
8
M.M. Breunig, H.-P. Kriegel,
R.T. Ng, and J. Sander, "LOF: identifying density-based local outliers," SIGMOD Rec., 29(2):93-104, 2000.
9
K. Zhang, M. Hutter, and H.
Jin, "A new local distance-based outlier detection approach for scattered
real-world data," In PAKDD '09:
Proceedings of the 13th Pacific-Asia Conference on Advances in
Knowledge Discovery and Data Mining, pages 813-822, 2009.
10
K. Narita and H. Katigawa,
"Outlier detection for transactional Databases using association rules," in Proc. of the 9th Int'l Conf. On
Web-Age Information Management, Zhangjiajie, Hunan, pp.373-380, July 2008.