Slicing: A New Approach
for Privacy Preserving Data Publishing
ABSTRACT:
Several anonymization techniques, such
as generalization and bucketization, have been designed for privacy preserving microdata
publishing. Recent work has shown that generalization loses a considerable amount
of information, especially for high-dimensional data. Bucketization, on the
other hand, does not prevent membership disclosure and does not apply to data
that do not have a clear separation between quasi-identifying attributes and
sensitive attributes. In this paper, we present a novel technique called slicing,
which partitions the data both horizontally and vertically. We show that
slicing preserves better data utility than generalization and can be used for
membership disclosure protection. Another important advantage of slicing is
that it can handle high-dimensional data. We show how slicing can be used for
attribute disclosure protection and develop an efficient algorithm for
computing the sliced data that obey the ℓ-diversity requirement. Our workload
experiments confirm that slicing preserves better utility than generalization and
is more effective than bucketization in workloads involving the sensitive
attribute. Our experiments also demonstrate that slicing can be used to prevent
membership disclosure.
EXISTING SYSTEM:
Several microdata anonymization
techniques have been proposed. The most popular ones are generalization for
k-anonymity and bucketization for ℓ-diversity. In both approaches, attributes
are partitioned into three categories:
1) Some attributes are identifiers that
can uniquely identify an individual, such as Name or Social Security Number;
2) Some attributes are Quasi Identifiers
(QI), which the adversary may already know (possibly from other publicly
available databases) and which, when taken together, can potentially identify
an individual, e.g., Birthdate, Sex, and Zipcode;
3) Some attributes are Sensitive
Attributes (SAs), which are unknown to the adversary and are considered
sensitive, such as Disease and Salary.
In both generalization and
bucketization, one first removes identifiers from the data and then partitions
tuples into buckets. The two techniques differ in the next step. Generalization
transforms the QI-values in each bucket into “less specific but semantically
consistent” values so that tuples in the same bucket cannot be distinguished by
their QI values. In bucketization, one separates the SAs from the QIs by
randomly permuting the SA values in each bucket.
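As a rough illustration of how the two techniques diverge after bucketing, the sketch below applies each one to a single toy bucket. The attribute names, the sample records, and the helper names generalize and bucketize are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import random

# Hypothetical toy bucket: age and sex are QIs, disease is the SA.
BUCKET = [
    {"age": 22, "sex": "M", "disease": "flu"},
    {"age": 29, "sex": "F", "disease": "dyspepsia"},
]

def generalize(bucket, numeric_qis, categorical_qis):
    """Replace every QI value by a bucket-wide range or a wildcard."""
    out = []
    for row in bucket:
        new = dict(row)
        for attr in numeric_qis:
            low = min(r[attr] for r in bucket)
            high = max(r[attr] for r in bucket)
            new[attr] = f"[{low}-{high}]"
        for attr in categorical_qis:
            values = {r[attr] for r in bucket}
            # A single shared value can stay; otherwise suppress to "*".
            new[attr] = next(iter(values)) if len(values) == 1 else "*"
        out.append(new)
    return out

def bucketize(bucket, sensitive, rng):
    """Keep QI values unchanged; permute the SA values within the bucket."""
    sa_values = [r[sensitive] for r in bucket]
    rng.shuffle(sa_values)
    return [{**row, sensitive: value} for row, value in zip(bucket, sa_values)]

generalized = generalize(BUCKET, ["age"], ["sex"])
bucketized = bucketize(BUCKET, "disease", random.Random(0))
```

After generalization both rows carry the same QI values, so they cannot be told apart by QIs. After bucketization the QI values are still exact, and only the link between QIs and the SA is randomized.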
DISADVANTAGES OF EXISTING SYSTEM:
Generalization for k-anonymity loses a
considerable amount of information, especially for high-dimensional data.
Bucketization does not prevent
membership disclosure. Because bucketization publishes the QI values in their original
forms, an adversary can find out whether an individual has a record in the
published data or not.
Bucketization requires a clear
separation between QIs and SAs. However, in many data sets, it is unclear which
attributes are QIs and which are SAs.
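The membership-disclosure weakness of bucketization described above can be sketched in a few lines. Because QI values are published unchanged, an adversary who knows someone's QI values only needs an exact lookup. The records, attribute names, and the function name appears_in_published are hypothetical.

```python
def appears_in_published(published_rows, known_qi):
    """True if some published row carries exactly the QI values the adversary knows."""
    return any(
        all(row.get(attr) == value for attr, value in known_qi.items())
        for row in published_rows
    )

# Hypothetical bucketized release: QI values appear in their original form,
# only the disease column has been permuted within the bucket.
bucketized = [
    {"age": 22, "sex": "M", "zipcode": "47906", "disease": "dyspepsia"},
    {"age": 29, "sex": "F", "zipcode": "47905", "disease": "flu"},
]

# QI values the adversary is assumed to know about two individuals.
alice = {"age": 29, "sex": "F", "zipcode": "47905"}
bob = {"age": 40, "sex": "M", "zipcode": "47906"}

alice_present = appears_in_published(bucketized, alice)
bob_present = appears_in_published(bucketized, bob)
```

In this toy release the lookup succeeds for the first individual and fails for the second, which is exactly the membership information the adversary is after.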
PROPOSED SYSTEM:
We introduce a novel data anonymization technique
called slicing to improve the current state of the art. Slicing partitions the
data set both vertically and horizontally. Vertical partitioning is done by
grouping attributes into columns based on the correlations among the attributes.
Each column contains a subset of attributes that are highly correlated.
Horizontal partitioning is done by grouping tuples into buckets. Finally,
within each bucket, values in each column are randomly permuted (or sorted) to
break the linking between different columns.
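The three steps above (vertical partitioning, horizontal partitioning, and per-column permutation) can be sketched as follows. This is a minimal sketch under stated assumptions: the column grouping is supplied by the caller rather than derived from correlations, buckets are formed from consecutive rows, and the name slice_table and the sample data are hypothetical.

```python
import random

def slice_table(rows, columns, bucket_size, rng):
    """Return a list of buckets; each bucket holds one permuted value list per column.

    columns: list of attribute-name tuples (the vertical partition).
    Buckets are consecutive runs of rows, a simplification made for this sketch.
    """
    sliced = []
    for start in range(0, len(rows), bucket_size):
        bucket = rows[start:start + bucket_size]
        pieces = []
        for col in columns:
            # Values of attributes in the same column stay together as one tuple.
            values = [tuple(row[attr] for attr in col) for row in bucket]
            # Permuting each column independently breaks cross-column links.
            rng.shuffle(values)
            pieces.append(values)
        sliced.append(pieces)
    return sliced

ROWS = [
    {"age": 22, "sex": "M", "zipcode": "47906", "disease": "dyspepsia"},
    {"age": 22, "sex": "F", "zipcode": "47906", "disease": "flu"},
    {"age": 52, "sex": "F", "zipcode": "47905", "disease": "bronchitis"},
    {"age": 54, "sex": "M", "zipcode": "47302", "disease": "gastritis"},
]
COLUMNS = [("age", "sex"), ("zipcode", "disease")]
sliced = slice_table(ROWS, COLUMNS, bucket_size=2, rng=random.Random(1))
```

Within a bucket, each (age, sex) pair and each (zipcode, disease) pair is still an original combination, but the published order no longer says which pair in one column belongs with which pair in the other.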
The basic idea of slicing is to break
the association across columns, but to preserve the association within each column.
This reduces the dimensionality of the data and preserves better utility than
generalization and bucketization. Slicing preserves utility because it groups
highly correlated attributes together, and preserves the correlations between
such attributes. Slicing protects privacy because it breaks the associations
between uncorrelated attributes, which are infrequent and thus identifying.
Note that when the data set contains QIs and one SA, bucketization has to break
their correlation; slicing, on the other hand, can group some QI attributes
with the SA, preserving attribute correlations with the sensitive attribute. The
key intuition behind slicing's privacy protection is that the slicing
process ensures that, for any tuple, there are generally multiple matching
buckets.
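To make the notion of matching buckets concrete, the sketch below assumes that a tuple matches a bucket when, for every column, the tuple's values on that column appear among the bucket's values for that column. The data layout, the sample values, and the name matching_buckets are assumptions for illustration.

```python
def matching_buckets(row, sliced, columns):
    """Return the indexes of the sliced buckets that the given tuple matches."""
    matches = []
    for index, bucket in enumerate(sliced):
        if all(
            tuple(row[attr] for attr in col) in piece
            for col, piece in zip(columns, bucket)
        ):
            matches.append(index)
    return matches

COLUMNS = [("age", "sex"), ("zipcode", "disease")]

# Hypothetical sliced table: one list of value tuples per column, per bucket.
SLICED = [
    [[(22, "M"), (22, "F")], [("47906", "flu"), ("47906", "dyspepsia")]],
    [[(22, "M"), (52, "F")], [("47906", "flu"), ("47905", "gastritis")]],
]

t1 = {"age": 22, "sex": "M", "zipcode": "47906", "disease": "flu"}
t2 = {"age": 52, "sex": "F", "zipcode": "47905", "disease": "gastritis"}
t3 = {"age": 60, "sex": "M", "zipcode": "47906", "disease": "flu"}

m1 = matching_buckets(t1, SLICED, COLUMNS)
m2 = matching_buckets(t2, SLICED, COLUMNS)
m3 = matching_buckets(t3, SLICED, COLUMNS)
```

Here the first tuple is consistent with both buckets, so an adversary cannot tell which bucket it came from. The third tuple matches none, which suggests it is not in the published data.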
ADVANTAGES OF PROPOSED SYSTEM:
1. We introduce a novel data anonymization technique called slicing to improve the
current state of the art.
2. We show that slicing can be effectively used for preventing attribute disclosure,
based on the privacy requirement of ℓ-diversity.
3. We develop an efficient algorithm for computing the sliced table that satisfies
ℓ-diversity. Our algorithm partitions attributes into columns, applies column
generalization, and partitions tuples into buckets. Attributes that are
highly correlated are in the same column.
4. We conduct extensive workload experiments. Our results confirm that slicing
preserves much better data utility than generalization. In workloads involving
the sensitive attribute, slicing is also more effective than bucketization. In
some classification experiments, slicing shows better performance than using
the original data (which may overfit the model). Our experiments also show the
limitations of bucketization in membership disclosure protection and slicing
remedies these limitations.
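Item 3 places highly correlated attributes in the same column, which requires some measure of correlation between two attributes. The text here does not say which measure is used, so the sketch below assumes one common choice for categorical data, the mean-square contingency coefficient, which ranges from 0 (independent) to 1 (fully dependent). The function name and the sample values are hypothetical.

```python
from collections import Counter

def mean_square_contingency(xs, ys):
    """Mean-square contingency coefficient of two equally long categorical lists."""
    n = len(xs)
    fx = Counter(xs)               # marginal counts of the first attribute
    fy = Counter(ys)               # marginal counts of the second attribute
    fxy = Counter(zip(xs, ys))     # joint counts
    d = min(len(fx), len(fy))
    if d < 2:
        return 0.0                 # a constant attribute carries no association
    total = 0.0
    for x, cx in fx.items():
        for y, cy in fy.items():
            expected = (cx / n) * (cy / n)
            observed = fxy[(x, y)] / n
            total += (observed - expected) ** 2 / expected
    return total / (d - 1)

# Fully dependent pair versus an independent pair (toy values).
dependent = mean_square_contingency(["a", "a", "b", "b"], [1, 1, 2, 2])
independent = mean_square_contingency(["a", "a", "b", "b"], [1, 2, 1, 2])
```

Pairwise scores like these could then feed a clustering step that groups attributes into columns; that step is not shown here.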
MODULES:
This project consists of the following modules:
· Dataset Extraction
· Generalization
· Bucketization
· Multi-Set Generalization
· Slicing
· Graph Generation
MODULES DESCRIPTION
DATASET EXTRACTION:
The Dataset Extraction module extracts the dataset
and stores it in the database for future use. The dataset is first selected,
then split into separate records, which are stored in a table in the user
database.
GENERALIZATION:
The Generalization module performs a 2-anonymity process. The generalization
approach uses the identifier data and the quasi-identifiers. Here the attribute
age is treated as the identifier and gender as the quasi-identifier. The
generalized data is derived from the original data, and the records are
stored in two buckets.
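A 2-anonymity result such as the one this module produces can be checked with a short sketch: every combination of quasi-identifier values must be shared by at least k records. The function name is_k_anonymous and the sample records are assumptions, not part of the project code.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of QI values occurs in at least k rows."""
    groups = Counter(tuple(row[attr] for attr in quasi_identifiers) for row in rows)
    return all(size >= k for size in groups.values())

# Hypothetical generalized output stored in two buckets of two records each.
generalized_rows = [
    {"age": "[20-29]", "gender": "*", "workclass": "Private"},
    {"age": "[20-29]", "gender": "*", "workclass": "State-gov"},
    {"age": "[50-59]", "gender": "*", "workclass": "Private"},
    {"age": "[50-59]", "gender": "*", "workclass": "Self-emp"},
]

ok_for_2 = is_k_anonymous(generalized_rows, ["age", "gender"], k=2)
ok_for_3 = is_k_anonymous(generalized_rows, ["age", "gender"], k=3)
```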
BUCKETIZATION:
The Bucketization module performs
a 2-diversity process. The bucketization approach uses the quasi-identifiers.
Here workclass is the attribute used. The bucketized data
is derived from the original data, and the records are stored in two
buckets.
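The 2-diversity condition can be illustrated with the simplest variant, distinct ℓ-diversity, in which every bucket must contain at least ℓ different sensitive values. This variant is chosen here for brevity and is an assumption; the paper may use a stricter, probability-based definition. The function name and records are hypothetical.

```python
def is_distinct_l_diverse(buckets, sensitive, l):
    """True if each bucket holds at least l distinct sensitive values."""
    return all(len({row[sensitive] for row in bucket}) >= l for bucket in buckets)

# Two hypothetical buckets, with disease as the sensitive attribute.
buckets = [
    [{"age": 22, "disease": "flu"}, {"age": 29, "disease": "dyspepsia"}],
    [{"age": 52, "disease": "flu"}, {"age": 54, "disease": "gastritis"}],
]
# A bucket in which everyone shares one disease reveals it outright.
weak_buckets = [
    [{"age": 22, "disease": "flu"}, {"age": 29, "disease": "flu"}],
]

diverse = is_distinct_l_diverse(buckets, "disease", l=2)
weak = is_distinct_l_diverse(weak_buckets, "disease", l=2)
```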
MULTI-SET GENERALIZATION:
The Multi-Set Generalization module performs a 2-anonymity process. The
multi-set generalization approach uses the identifier data and the
quasi-identifiers. Here the attribute age is treated as the identifier, and
gender and workclass as the quasi-identifiers. The multi-set generalized data is
derived from the original data, and the records are stored in two buckets.
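As a hedged sketch of what multi-set generalization produces, the code below replaces each quasi-identifier value in a bucket with the multiset of values found in that bucket, written as value:count pairs. The output format, the function name multiset_generalize, and the sample bucket are assumptions for illustration.

```python
from collections import Counter

def multiset_generalize(bucket, quasi_identifiers):
    """Replace each QI value by the bucket's multiset of values for that attribute."""
    summaries = {}
    for attr in quasi_identifiers:
        counts = Counter(row[attr] for row in bucket)
        summaries[attr] = ", ".join(
            f"{value}:{count}" for value, count in sorted(counts.items())
        )
    # Every row in the bucket receives the same summary per QI attribute.
    return [{**row, **summaries} for row in bucket]

bucket = [
    {"age": 22, "gender": "M", "workclass": "Private"},
    {"age": 22, "gender": "F", "workclass": "State-gov"},
    {"age": 33, "gender": "F", "workclass": "Private"},
]
result = multiset_generalize(bucket, ["age", "gender"])
```

Unlike a plain range such as [22-33], the multiset keeps the exact value distribution inside the bucket, so less information is lost while rows in the bucket remain indistinguishable on those attributes.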
SLICING:
Slicing partitions the data set both vertically and
horizontally. Slicing preserves better data utility than generalization and can
be used for membership disclosure protection. This module uses the following
sub-modules:
· Attribute Partition and Columns
· Tuple Partition and Buckets
· Slicing
· Column Generalization
· Matching Buckets
GRAPH GENERATION:
The Graph Generation module compares the
classification accuracy of the original data, generalization,
bucketization, and slicing. Slicing shows better accuracy than generalization. When the target
attribute is the sensitive attribute, slicing even performs better than
bucketization.
SYSTEM CONFIGURATION:
HARDWARE REQUIREMENTS:
· Processor - Pentium III
· Speed - 1.1 GHz
· RAM - 256 MB (min)
· Hard Disk - 20 GB
· Floppy Drive - 1.44 MB
· Keyboard - Standard Windows Keyboard
· Mouse - Two or Three Button Mouse
· Monitor - SVGA
SOFTWARE REQUIREMENTS:
· Operating System : Windows 95/98/2000/XP
· Front End : Java / J2EE
· Back End : MySQL
· Tool : NetBeans 7.0
REFERENCE:
Tiancheng Li, Ninghui Li, Jian Zhang,
and Ian Molloy, “Slicing: A New Approach for Privacy Preserving Data
Publishing”, IEEE Transactions on Knowledge and Data Engineering,
vol. 24, no. 3, March 2012.