Data Mining in Identity Crime Prevention
Intelligent Techniques to Combat White-Collar Crime

Identity Crime

Wikipedia has a very relevant article on identity theft/fraud which explains why certain countries are more susceptible to it than others, what the consequences are, how it has become worse, and what precautions can be taken by an individual to mitigate the chances of becoming a victim.

There are different definitions of identity theft and identity fraud, which usually differ from country to country [1.url][2.url][3.url]. For our purposes, we adopt the term identity crime to encompass both real identity theft and synthetic identity fraud.

About Our Group

We aim to find a formal data mining framework, with efficient and effective methods/techniques, to discover the illegal activities of professional identity fraudsters. These people are motivated by the high financial rewards, and by the minimal risk and effort associated with exploiting the weaknesses of business processes in many organisations. As a result, one can anticipate that there are already many highly experienced, organised, and sophisticated fraudsters in operation, using commercially available or stolen identity data for their criminal purposes. In reaction to these illegal activities, this project aims to challenge and extend existing data mining-based fraud detection methods/techniques, and to propose new and better ones, demonstrating them on consumer credit application fraud.

Hesperus is a new experimental fraud detection system written for credit applications. It is based on the idea that any successful fraudster(s), within certain time frames, will exhibit consistent, communal, temporal, spatial, and persistent characteristics which are distinguishable from those of normal credit applications. It goes beyond the conventional industry technique of verifying ID number, address, and phone number.

Brief Project Outline
Research Problems
 - Dynamic Nature of Credit Application Data
 - Unexploited Temporal Information
 - Weakness in Anomaly Detection
 - Significant Time Delay

 
Procedures
 - Data Representation
 - Performance Measures
 - Graph-based Data Mining
 - Visualisation
 - Finer-Grained Models
 - Game Theory
 - Relational Learning
 - Other Methods and Techniques

Our Current Work
Section 4

 

Phua C, Lee V, Gayler R and Smith K (2006) "Temporal Representation in Spike Detection of Sparse Personal Identity Streams", Proceedings of PAKDD06 Workshop on Intelligence and Security Informatics (WISI06), [.url].

 

Section 3

 

"Adaptive Communal Detection of Adversarial Identity Crime", in preparation. (Several million real examples - all available data, to verify results of previous paper, to demonstrate random choice of parameter + systematic choice of parameter value - all automatic, and to build and apply naive Bayesian classifiers on multi-attribute links)

 

*** Up to this stage, our work seems to be similar to commercial (trademarked and patented) systems: Jeff Jonas's NORA [1.url][2.url], IBM Entity Analytic Solutions [3.url], IDAnalytics' GTAD [.url], and Identity Systems' Identity Search Server [.url] ***

*** Identity attribute values can be safely and easily anonymised with secure cryptographic hash functions (which convert variable-length human-readable strings to fixed-length alphanumeric ones), such as those in the .NET System.Security.Cryptography namespace [.url] ***
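
As a rough illustration of the anonymisation idea above, here is a minimal sketch in Python rather than .NET. It is not the code used in Hesperus. The choice of a keyed hash (HMAC-SHA-256), the normalisation step, and the names SECRET_KEY, normalise and anonymise are all assumptions made for this example.

```python
import hashlib
import hmac

# Hypothetical secret key (an assumption of this sketch); in practice it
# would be kept separately from the anonymised data.
SECRET_KEY = b"replace-with-a-private-key"


def normalise(value: str) -> str:
    """Trim, collapse internal whitespace and lower-case a raw attribute value."""
    return " ".join(value.split()).lower()


def anonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a variable-length string to a fixed-length token of 64 hex characters."""
    digest = hmac.new(key, normalise(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


if __name__ == "__main__":
    # Values that normalise to the same string receive the same token.
    for raw in ["John Smith", "  john   SMITH ", "12 Example Street"]:
        print(repr(raw), "->", anonymise(raw))
```

Exact matches between attribute values survive this kind of hashing, but approximate string similarity does not, so any fuzzy comparison has to be carried out before the values are hashed.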

 

The most recent (2007) version of the PowerPoint slides on the IWMESD06 paper, which have been presented at the IWMESD06 workshop, the Australian Tax Office (ATO), and the Seoul National University (SNU) Data Mining Laboratory [.ppt]

 

A short and very fuzzy homemade video presentation of the CASS algorithm [.avi]

 

Phua C, Gayler R, Lee V and Smith K (2006) "Communal Detection of Implicit Personal Identity Streams", Proceedings of ICDM06 Workshop on Mining Evolving and Streaming Data (IWMESD06). (Few million real examples, with almost complete automation of data extraction, processing, and performance evaluation)

 

Phua C, Gayler R, Lee V and Smith K (2006) "On the Communal Analysis Suspicion Scoring for Identity Crime in Streaming Credit Applications", European Journal of Operational Research, submitted. (Few hundred thousand real data examples, extended version of conference paper) [contact via email]

 

Phua C, Gayler R, Lee V and Smith K (2005) "On the Approximate Communal Fraud Scoring of Credit Applications", Proceedings of Credit Scoring and Credit Control IX, [.pdf]. (Synthetic data examples)

 

Free synthetic data generators/code for data mining experiments: FEBRL v0.3 [.url] (Python), ACORA [.url] (Perl), RIDDLE [.url] (C), QUEST [.url] (C++)

 

The Search Systems Free Public Records Directory [.url]

 

Section 2

 

Phua C, Lee V and Smith K (2006) "The Personal Name Problem and a Recommended Data Mining Solution", Encyclopedia of Data Warehousing and Mining (2nd Edition), accepted.

 

Publicly available names database: [.url]

 

Other interesting web-based gender guessing using given / first / fore / Christian names: Multicultural [.url], Mandarin (hanyu pinyin) [.url], Hindu [.url], United States [.url]

 

Visualisation: NameVoyager [.url]

 

Humorous names: [.url]

 

We are working in collaboration with Sam Chapman. SimMetrics (incorporated into Hesperus) is a Java and C# .NET open source library of similarity or distance metrics, e.g. Levenshtein distance, that provide float-based similarity measures between string data. All metrics can return consistent measures as well as unbounded similarity scores. Phonetics is part of SimMetrics [.url]
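
To make the notion of a float-based similarity measure concrete, here is a small self-contained Python sketch of Levenshtein distance and one common way of scaling it into the range 0 to 1. It does not use or reproduce the SimMetrics API. The function names and the normalisation by the longer string's length are assumptions of this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1], where 1.0 means identical strings.
    Dividing by the longer length is an assumption of this sketch."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    for left, right in [("smith", "smyth"), ("kitten", "sitting")]:
        print(left, right, levenshtein(left, right), similarity(left, right))
```

A bounded score of this kind is easier to compare across attributes of different lengths than a raw edit distance, which is why it suits name and address matching.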

 

IBM Global Name Recognition [.url]

 

Section 1

 

Phua C, Lee V, Smith K, and Gayler R (2005) "A Comprehensive Survey of Data Mining-based Fraud Detection Research", Artificial Intelligence Review, submitted. [.pdf [DRAFT] (v1.2)]

 

Our online identity fraud and data mining-related bibliography mostly contains a mixture of academic and non-academic pointers to existing/possible techniques, knowledge, and commercial databases on identity fraud prevention/detection. Cite as: Phua C (200x) "Identity Fraud and Data Mining Bibliography", Version Number, Monash University, Accessed From [.bib (v0.1)][.bib (v0.2)]

 

Shared between Tom Fawcett and our group is an online fraud detection bibliography. Cite as: Fawcett T and Phua C (200x) "Fraud Detection Bibliography", Accessed From [.url]. The newest, 2006 version is available here [.bib (v3)]

 

Richard Derrig has compiled an extensive Insurance Fraud Research Register [.url]

 

Previous Work

 

Phua C, Alahakoon D, and Lee V (2004) "Minority Report in Fraud Detection: Classification of Skewed Data", ACM SIGKDD Explorations: Special Issue on Imbalanced Data Sets, 6(1), pp50-59. [.pdf]

 

Phua C (2003) Investigative Data Mining in Fraud Detection: Transforming Minority Report from Science Fiction to Science Fact, Honours Thesis Defence Slides and Poster, Monash University, Australia. [.ppt]

 

Phua C (2003) Investigative Data Mining in Fraud Detection, Unpublished Honours Thesis, Monash University, Australia. [.pdf]

 

About This Website

Please note that this webpage will be updated at least once every 3 to 4 months. For some unknown reason, this webpage is best viewed in Mozilla web browsers.

  • [中文版][한국어 버전]

  • Do drop us a line if you have done any past research, or are currently doing any research, in this area.

  • Please also send us any omissions from the bibliographies.

  • Any other comments about the contents of this website can be directed to: Clifton - first_name(dot)last_name(at)infotech.monash.edu.au

     


Last Updated: 09-Feb-2007 03:17:15 PM

Maintained by: Clifton

Copyright © 2004-6 All rights reserved

 

Official Disclaimer
This is a personal page published by the author. The ideas and information expressed on it have not been approved or authorised by Monash University or Baycorp Advantage either explicitly or implicitly. In no event shall Monash University and Baycorp Advantage be liable for any damages whatsoever resulting from any action arising in connection with the use of this information or its publication, including any action for infringement of copyright or defamation.