
ABSTRACT—Most of the outlier detection
algorithms in data mining are used to find outliers in static databases. Those algorithms
are generally inappropriate for detecting outliers in dynamic databases, where
data continuously arrives in the form of streams, such as sensor data.
An association rule based outlier detection method can be applied to streamed data,
where frequent item sets are evaluated internally. One of the outlier detection
approaches used for static databases is the clustering based method, where
the K-means clustering algorithm is used internally for discovering outliers in
various static databases. In this paper, we propose two approaches for outlier
detection: one uses an association rule based method for dynamic databases,
and the other uses a pruning based local outlier detection method, which
internally uses K-means clustering, for static databases. Experiments on
various data sets show that the deviant data are detected effectively in fewer
computations.

Keywords—outlier detection,
static databases, dynamic databases, association rules, k-means clustering.



1. INTRODUCTION

An outlier is
a data point or object that differs significantly from the other data points
or objects within a data set. Outlier detection is one of the important data
mining techniques commonly used for detecting such deviant data in a data set.
Examples of outlier detection applications include credit card fraud detection,
network intrusion detection, stock market analysis, etc., where efficient
algorithms are required to solve these problems.

The characteristic
feature of streamed data is that data arrives continuously and in a time-based
order. Hence, streams appear in various applications such as sensor data feeds
and network traffic measurement. Most outlier detection algorithms are applied
to static databases, since they scan the whole data set iteratively
to find the top (global) outliers. Detecting
outliers in streamed data is difficult for two reasons:

The data sets in streamed data
are unbounded, which prevents scanning the entire data stream.

Streamed data require quick
responses to requests, and processing has to be completed in a limited
time; multiple scans of the data stream are not possible.

In static
databases, distance-based techniques are used to model the data points and
determine outliers. Commonly used distance functions are the Euclidean,
Manhattan and Minkowski distances. In clustering based
outlier detection, the K-means algorithm is the most popular one: the
data set is divided into clusters, the objects or data points that lie
near the centroid of a cluster are assigned to that cluster, and those
objects or data points that lie far away from every centroid are
the outliers. While evaluating the degree of outlierness of an object, the
objects lying well within a cluster are pruned from each cluster
during outlier detection, which reduces the computational time.
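To make the clustering-based idea concrete, here is a minimal Python sketch, not the paper's implementation: `kmeans` and `centroid_outlier_candidates` are our own toy names, and a point whose distance to its centroid exceeds the cluster's mean radius is merely kept as an outlier candidate.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means on coordinate tuples: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: attach each point to its nearest centroid
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, labels

def centroid_outlier_candidates(points, k=2):
    """Flag points farther from their centroid than the cluster's mean radius."""
    centroids, labels = kmeans(points, k)
    candidates = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        dists = [math.dist(p, centroids[c]) for p in members]
        radius = sum(dists) / len(dists)  # mean distance to centroid
        candidates += [p for p, d in zip(members, dists) if d > radius]
    return candidates
```

A point sitting away from a tight cluster ends up beyond the mean radius and is flagged as a candidate for the later outlierness evaluation.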

In this paper, we
propose two approaches for outlier detection, one for streaming
data and the other for static databases. In the first approach, we propose an
algorithm that employs association rules to check streamed data for
outliers. The algorithm defines a bounded set of transactions over
the data stream and derives association rules from those transactions.
Transactions are then evaluated against those rules to find outliers.
For instance, if {P, Q} → {R} is an association rule with a
high confidence value of 85%, then a transaction containing P and Q but not R
is an outlier, because the item R does not appear even though the rule says it
should. In general, a transaction is said to contain an outlier if it does not
contain an item that it is supposed to contain according to a high-confidence
association rule.
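The rule-violation check just described is simple to state in code; a minimal sketch (`violates_rule` is our own name, not from the cited work):

```python
def violates_rule(transaction, antecedent, consequent):
    """A transaction violates X -> Y if it contains every item of X
    but is missing at least one item of Y."""
    t = set(transaction)
    return set(antecedent) <= t and not set(consequent) <= t

# A transaction holding P and Q but not R violates {P, Q} -> {R},
# so it is flagged as a potential outlier.
```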

In the second
approach, a static database is taken, a distance function is applied to the
data points, and clusters are formed. The objects that
lie near the centroid of a cluster are pruned before evaluating outliers. A
distance-based measure, the Local Distance-based Outlier Factor (LDOF), is then
applied to the remaining objects to decide whether each object is an
outlier or not. The LDOF values of the remaining objects are computed, and the
top n objects are reported as outliers.

The remaining
sections of this paper are organized as follows: Section 2 provides related
works on outlier detection. Section 3 describes an association rule based
outlier detection method. Section 4 describes a pruning based local
distance-based outlier factor outlier detection method. Section 5 provides
experimental results. Section 6 concludes the paper.


2. RELATED WORKS

A. Data Models in Streamed Data

Streamed data, compared with
traditional static databases, contain unbounded data sets, and
it is difficult to choose a finite number of transactions [1] for mining. To
resolve this problem, two data models were proposed:

The accumulative data model,
which takes the transactions from the start of the stream up to the current
time into consideration. If a new transaction arrives, it is accumulated into
the data set.

The sliding window model, which
forms a bounded data set of transactions. When mining is
performed, only the transactions that are currently in the sliding window are
taken into consideration, and the mining results are updated as the window
slides.
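The sliding window model can be sketched with a fixed-size deque; a minimal illustration (the class name is ours, not from the cited work):

```python
from collections import deque

class SlidingWindow:
    """Keeps only the most recent `size` transactions of a stream;
    mining is rerun on this bounded set whenever the window moves."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def push(self, transaction):
        self.window.append(transaction)  # oldest transaction drops out

    def current(self):
        return list(self.window)
```

Pushing five transactions into a window of size three leaves only the last three available for mining.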

B. Association Rules and Outliers in Streamed Data

The estDec method
[3] internally uses a sliding window model and finds
frequent item sets within the current window. Initially, the estDec method
checks the transactions that are currently in the sliding window
and builds a prefix-tree P from the frequent item sets. Once the window slides
to the next bucket, the estDec method finds new frequent item sets, performs
either insertion or deletion of item sets in P,
and the process continues iteratively. To obtain association rules, the
prefix-tree is traversed to obtain the subsets of the frequent item sets. A
traversal stack is maintained to derive association rules; it stores a
frequent item set from the prefix-tree [1], and all possible association
rules are derived from that item set.

A transaction [1]
is said to contain an outlier if some items were supposed to be present in the
transaction but are actually absent. To evaluate [1] whether a
transaction contains an outlier, an outlier degree is defined.

C. Distance-Based Outlier Detection Techniques

Knorr and Ng [4]
introduced a definition for distance-based outlier detection techniques [2]:
an object x in a data set D is a DB(y, dist)-outlier if at least a fraction y
of the objects in D lie at a distance greater than dist from x [2].
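Knorr and Ng's definition translates directly into code; a naive sketch assuming Euclidean distance on coordinate tuples (the function name is ours):

```python
import math

def is_db_outlier(x, data, frac, dist):
    """x is a DB(frac, dist)-outlier if at least a fraction `frac` of the
    objects in `data` lie farther than `dist` from x (Euclidean distance)."""
    far = sum(1 for o in data if math.dist(x, o) > dist)
    return far >= frac * len(data)
```

For example, in a data set clustered near the origin plus a distant point, only the distant point satisfies the definition for reasonable choices of frac and dist.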

Angiulli and
Pizzuti [5-7] proposed an outlier detection method based on the concept of an
object's neighbourhood, where outliers are determined by ranking each object
by the sum of the distances to its k nearest neighbors.

Breunig, Kriegel,
Ng and Sander [8] proposed a measure named the Local Outlier Factor (LOF) for
each object, which gives its degree of outlierness. To evaluate the LOF of an
object, a restricted neighbourhood is considered, comparing the density
around the object with that around its neighbors.

Zhang, Hutter and
Jin [9] proposed a local distance-based outlier detection method for finding
outlier objects. For every object, the local distance-based outlier factor [2]
(LDOF) evaluates the degree to which it deviates from its nearest objects.
Calculating the LDOF for all objects is expensive, i.e., O(N²), where N
is the number of objects in the data set.

Rough set theory
[3] was first proposed by Z. Pawlak and is used to deal with uncertain,
vague and incomplete data. The main idea is based on the indiscernibility
relation, which describes the indistinguishability of objects. If an object
cannot be represented as a crisp set, it can be represented by a rough set,
which gives an approximate representation of knowledge. A
rough set provides the lower and upper approximations of a set whose
knowledge is incompletely specified.


3. ASSOCIATION RULE BASED OUTLIER DETECTION METHOD

A. Overview

Based on the work in [10],
a sliding window is used to make the data stream bounded. From the
transactions within the window, a prefix-tree is built, and
association rules are derived from the tree by traversing it iteratively in
multiple scans. If a transaction's items are not covered by the association
rules list, that transaction may be an outlier.

The following
figure shows a sample traversal stack containing items and their counts, with
the arrow indicating the top of the stack:

Figure 1. A sample
traversal stack

The following
definitions regarding the outlier degree were given by Narita and Kitagawa in
[10].

[1] Definition 1. Let t be a transaction and R
be a set of association rules. Then t's
associative closure t+ is the smallest superset of t that is closed under R:

t+ = smallest set S ⊇ t such that for every rule X → Y in R, X ⊆ S implies Y ⊆ S        (1)

[1] Definition 2. Let t
be a transaction, R be a set of
association rules, and t+ be
t's associative closure. Then the
outlier degree of t is given as

OutlierDegree(t) = |t+ \ t| / |t+|                                        (2)

The range of outlier degree lies between
0 and 1.

Definition 3. A transaction is said to be an outlier transaction if its
outlier degree is greater than or equal to the minimum outlier degree, which is
a user-specified threshold.
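Definitions 1-3 can be sketched as a fixpoint computation; a minimal Python version (function names are ours), where each rule is an (antecedent, consequent) pair of sets:

```python
def associative_closure(transaction, rules):
    """Smallest superset of `transaction` closed under the rules:
    whenever every item of an antecedent X is present, all items of
    the consequent Y are added, until nothing changes (Definition 1)."""
    closure = set(transaction)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if set(antecedent) <= closure and not set(consequent) <= closure:
                closure |= set(consequent)
                changed = True
    return closure

def outlier_degree(transaction, rules):
    """|t+ \\ t| / |t+|  (Definition 2)."""
    t = set(transaction)
    plus = associative_closure(t, rules)
    return len(plus - t) / len(plus)
```

With rules {Milk} → {Book} and {Bacon} → {Egg}, the transaction {Bacon, Milk} closes to {Bacon, Milk, Book, Egg}, giving an outlier degree of 2/4 = 0.5, which exceeds a minimum outlier degree of 0.3 (Definition 3).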

The following steps
provide a way to detect outliers in transaction data sets:

Build a prefix tree
that keeps track of frequent item sets within a sliding window.

Derive a set of
association rules with high confidence value from the prefix tree.

Evaluate outlier
transactions from the transaction set within the sliding window.

Divide outlier
transactions into two parts, one containing unobserved frequent item sets and
the other containing infrequent items.

B. Example

Let us consider a
transaction data set [1] consisting of 10 transactions in a sliding window,
over the items [1] Book, Joke, Milk, Bacon, Chocos and Egg. The
following table gives the 10 transactions and the items in each transaction:




[1] Table 1. A Sample Transaction Data Set

TID   Items
1     Book, Joke, Milk
2     Book, Chocos, Joke, Milk
3     Book, Joke, Milk
4     Bacon, Book, Chocos, Egg, Milk
5     Bacon, Book, Chocos, Egg, Joke, Milk
6     Book, Chocos, Joke, Milk
7     Bacon, Book, Egg, Milk
8     Bacon, Book, Egg, Joke, Milk
9     Book, Joke, Milk
10    Bacon, Egg, Milk

For deriving
association rules from Table 1, the minimum confidence and minimum
support are set to 80% and 50%, respectively. The association rules derived
with these values are given in Table 2; each association
rule has a high confidence value.
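For illustration, rules like those in Table 2 can be derived by brute force on toy data; this sketch (our own code, not the estDec method, and far too slow for real streams) enumerates small itemsets and keeps rules meeting the support and confidence thresholds:

```python
from itertools import combinations

def derive_rules(transactions, min_sup=0.5, min_conf=0.8, max_len=3):
    """Brute-force rule derivation: enumerate itemsets up to `max_len`,
    keep those meeting min support, then emit rules X -> Y whose
    confidence sup(X ∪ Y) / sup(X) meets min_conf. Toy data only."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    support = {}
    for size in range(1, max_len + 1):
        for itemset in combinations(items, size):
            count = sum(1 for t in transactions if set(itemset) <= set(t))
            if count / n >= min_sup:
                support[frozenset(itemset)] = count / n
    rules = []
    for itemset, sup in support.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), size):
                ante = frozenset(ante)
                if ante in support and sup / support[ante] >= min_conf:
                    rules.append((set(ante), set(itemset - ante)))
    return rules
```

On a tiny transaction list where Joke almost always co-occurs with Book, the rule {Joke} → {Book} is produced.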

[1] Table 2. Association Rules Derived from Table 1

{Joke} → {Book}
{Joke, Milk} → {Book}
{Joke} → {Book, Milk}
{Bacon} → {Egg}
{Bacon, Milk} → {Egg}
{Bacon} → {Egg, Milk}
{Milk} → {Book}


From (1), the associative closure t+ of transaction 2 is computed, and from
(2) its outlier degree is obtained. According to
Definition 3 [1], if the minimum outlier degree is set to 0.3, then transaction
2 is an outlier. Similarly, transaction 10 is also an outlier, since
it does not follow the high-confidence association rules given in Table 2.


4. PRUNING BASED LDOF OUTLIER DETECTION METHOD

A. Overview

Based on the work in
[9], we use the Local Distance-based Outlier Factor (LDOF), a measure that
specifies by how much an object deviates from its neighbours. If the LDOF value
of an object is high, the object deviates strongly from its neighbours
and can be considered an outlier. For an object p, the LDOF value is given as:

        LDOF(p) = d̄(p) / D̄(p)                                (3)

where d̄(p) specifies the k-nearest-neighbour distance of p
and D̄(p) specifies the k-nearest-neighbour inner distance of p.

For an object p, the distance d̄(p) is the average distance from
p to all objects in its nearest-neighbour set N(p):

        d̄(p) = (1/k) · Σ_{q ∈ N(p)} dist(p, q)                (4)

For an object p, the distance D̄(p) is the average distance
among all the objects in its nearest-neighbour set:

        D̄(p) = (1/(k(k−1))) · Σ_{q,r ∈ N(p), q ≠ r} dist(q, r)        (5)
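Equations (3)-(5) can be sketched directly in Python; a minimal version assuming Euclidean distance and distinct points (the function name is ours; averaging over unordered neighbour pairs gives the same value as the k(k−1) ordered-pair average in (5)):

```python
import math
from itertools import combinations

def ldof(p, data, k):
    """LDOF of point p: average distance to its k nearest neighbours (4)
    divided by the average pairwise distance among those neighbours (5)."""
    neighbours = sorted((q for q in data if q != p),
                        key=lambda q: math.dist(p, q))[:k]
    knn_dist = sum(math.dist(p, q) for q in neighbours) / k
    inner = [math.dist(q, r) for q, r in combinations(neighbours, 2)]
    inner_dist = sum(inner) / len(inner)
    return knn_dist / inner_dist
```

A point inside a uniform cluster gets an LDOF near 1, while a point far from the cluster gets a much larger value, matching the intuition that high LDOF indicates an outlier.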

B. Procedure

The major drawback
of the LDOF algorithm is its computational expense, because the LDOF has to be
calculated for every object in the data set. To address this drawback, we use
the K-means algorithm to form
clusters. After the clusters are formed, the radius of each cluster is
calculated, and the objects whose distance from the centroid is less than the
radius of their cluster are pruned. The LDOF is then calculated only for the
remaining (unpruned) objects, which reduces the total computation. Among these
objects, the top n objects with the highest LDOF values are
reported as outliers.

The following
steps are involved in the pruning based LDOF outlier detection method:

Formation of clusters: The K-means algorithm is applied to
the entire data set to generate clusters, and the
radius of each cluster is calculated.

Clusters with few objects: If a
cluster contains fewer objects than the number of outliers
required, pruning is skipped for that cluster.

Pruning objects within a cluster:
The distance of each object in a cluster from its centroid is calculated. If
the distance is less than the cluster radius, the object is pruned.

Evaluation of outliers: For the unpruned
objects in each cluster, the LDOF is calculated, and the top n objects with the highest LDOF values are declared outliers.
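The pruning procedure above can be sketched as follows; a toy Python version that assumes clusters are already available as (centroid, members) pairs from a K-means step, and reuses a minimal LDOF (all names are ours):

```python
import math
from itertools import combinations

def ldof(p, data, k=3):
    """Average distance to p's k nearest neighbours divided by the
    average pairwise distance among those neighbours."""
    nbrs = sorted((q for q in data if q != p), key=lambda q: math.dist(p, q))[:k]
    knn = sum(math.dist(p, q) for q in nbrs) / k
    inner = [math.dist(q, r) for q, r in combinations(nbrs, 2)]
    return knn / (sum(inner) / len(inner))

def pruned_ldof_outliers(data, clusters, n):
    """clusters: list of (centroid, members) pairs. Prune members closer
    to their centroid than the cluster's mean radius, compute LDOF only
    for the rest, and report the top-n by LDOF."""
    unpruned = []
    for centroid, members in clusters:
        if len(members) <= n:  # too few objects: skip pruning
            unpruned += members
            continue
        dists = [math.dist(p, centroid) for p in members]
        radius = sum(dists) / len(dists)
        unpruned += [p for p, d in zip(members, dists) if d >= radius]
    scored = sorted(unpruned, key=lambda p: ldof(p, data), reverse=True)
    return scored[:n]
```

Because the LDOF is evaluated only for the unpruned objects, the expensive step runs on a fraction w of the data, as reflected in the (w·N)² term of the algorithm's complexity.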

The following
algorithm gives the steps involved in the pruning based LDOF outlier detection method:


OutlierDetection(c, it, DS, n)
begin
    X ← Kmeans(c, it, DS)
    for each cluster Cj in X do
        Radiusj ← radius(Cj)
    end for
    for each cluster Cj in X do
        if |Cj| > n then
            for each object pi in Cj do
                if dist(pi, centroid(Cj)) < Radiusj then
                    prune(pi)
                else
                    add pi to U
                end if
            end for
        else
            for each object pi in Cj do
                add pi to U
            end for
        end if
    end for
    for each object pi in U do
        calculate LDOF(pi)
    end for
    sort the objects based on their LDOF(pi) values
    choose the n objects with the highest LDOF(pi) values and display them as outliers
end OutlierDetection()

The computational complexity of our algorithm is c·it·N + c·np + (w·N)², where c is the number of clusters to be generated, it is the number of iterations needed, N is the number of objects in the data set DS, np is the average number of objects in each cluster, and w is the fraction of objects that remain unpruned.

5. EXPERIMENTAL RESULTS

In this section, we perform two experiments: one regarding outlier detection by deriving association rules from streamed data, and the other regarding outlier detection by calculating the LDOF after pruning. The experiments are performed on a PC with an Intel 2.3 GHz processor and 2 GB of memory.

In the first experiment, we consider a supermarket data set consisting of 4627 transactions and 217 attributes representing items in the supermarket. For deriving [1] association rules, the minimum support and minimum confidence are set to 30% and 80%, respectively. Among the 217 attributes, we have considered only 18. The number of association rules is set to 8. The association rules derived from the data set are given in Table 3, which lists each rule id and the corresponding high-confidence rule.

Table 3. Association Rules Derived from the Supermarket Data Set

Rule ID   Rule
1         {biscuits, vegetables} → {Book, cake}
2         {total} → {Book, cake}
3         {biscuits, milk-cream} → {Book, cake}
4         {biscuits, fruit} → {Book, cake}
5         {biscuits, frozen-foods} → {Book, cake}
6         {frozen-foods, fruit} → {Book, cake}
7         {frozen-foods, milk-cream} → {Book, cake}
8         {baking-needs, milk-cream} → {Book, cake}

The confidence value for rules 1-4 is 0.84, and the confidence value for rules 5-8 is 0.83. According to these rules, if a transaction contains biscuits and vegetables, then it should contain both Book and cake; if either Book or cake is missing, the transaction is said to be an outlier. Similarly, any transaction that does not follow the above high-confidence association rules is said to be an outlier.

6. CONCLUSIONS

In this paper, we have considered the outlier detection problem for two varieties of databases: one involving data streams, where data arrives continuously and in a time-based order, and the other involving static databases. We have provided two algorithms for the problem. For streamed data, a sliding window is used to make the data items bounded. For all the transactions in the bounded data set, a prefix-tree is built, and its items are accessed through a traversal stack. Association rules are derived from the transaction data set, and the transactions that do not obey the association rules are declared outliers. For static databases, pruning based outlier detection is performed, which internally uses the K-means clustering algorithm followed by the [2] local distance-based outlier factor for each object within a cluster. The objects with high LDOF values are declared outliers. Finally, we hope this work will attract further interest in various problem areas of data mining, such as text mining and multimedia data mining.

ACKNOWLEDGEMENTS

The authors would like to acknowledge Li-Jen Kao, Yo-Ping Huang, P. Rajendra, D. Jatindra Kumar and N. Sukumar for their valuable discussions.

REFERENCES

[1] Li-Jen Kao and Yo-Ping Huang, "Association rules based algorithm for identifying outlier transactions in data stream," IEEE Int'l Conference on Systems, Man, and Cybernetics, Oct. 14-17, 2012.
[2] P. Rajendra, D. Jatindra Kumar and N. Sukumar, "An outlier detection method based on clustering," Int'l Conference on Emerging Applications of Information Technology, 2011.
[3] J.H. Chang and W.S. Lee, "Finding recent frequent itemsets adaptively over online data streams," in Proc. of the 9th ACM SIGKDD, Washington, DC, USA, pp. 487-492, August 2003.
[4] E.M. Knorr and R.T. Ng, "Algorithms for mining distance-based outliers in large databases," in Proc. 24th Int. Conf. on Very Large Data Bases (VLDB), pp. 392-403, 1998.
[5] F. Angiulli, S. Basta and C. Pizzuti, "Distance-based detection and prediction of outliers," IEEE Transactions on Knowledge and Data Engineering, 18:145-160, 2006.
[6] F. Angiulli and C. Pizzuti, "Fast outlier detection in high dimensional spaces," in PKDD '02: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, pp. 15-26, 2002.
[7] F. Angiulli and C. Pizzuti, "Outlier mining in large high-dimensional data sets," IEEE Transactions on Knowledge and Data Engineering, 17:203-215, 2005.
[8] M.M. Breunig, H.-P. Kriegel, R.T. Ng and J. Sander, "LOF: identifying density-based local outliers," SIGMOD Rec., 29(2):93-104, 2000.
[9] K. Zhang, M. Hutter and H. Jin, "A new local distance-based outlier detection approach for scattered real-world data," in PAKDD '09: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pp. 813-822, 2009.
[10] K. Narita and H. Kitagawa, "Outlier detection for transactional databases using association rules," in Proc. of the 9th Int'l Conf. on Web-Age Information Management, Zhangjiajie, Hunan, pp. 373-380, July 2008.