## Data Mining vs. Statistics

Pavel Brusilovsky

### Objectives

* Intro to Data Mining

* Data Mining vs. Statistics

* Data Mining vs. Text Mining

* Applications of Data Mining

### What is the Taxonomy of Data Mining?

* Data mining taxonomy, based on application

- Data Mining

- Text Mining

- Web Mining

- Image Mining...

* Data mining taxonomy, based on the usage of domain knowledge:

- Verification-driven data mining

* Is associated with traditional quantitative approaches that permit a decision maker to express and verify organizational and personal domain knowledge

- Discovery-driven data mining

* Is tied to knowledge discovery technology capable of automatically discovering previously unknown patterns hidden in the data

- Combining both classes creates a synergy that can produce meaningful and reliable results that neither class of data mining could obtain on its own

* Data mining taxonomy, based on estimation paradigm:

- supervised learning

- unsupervised learning

### What is the difference between "Search" and "Discover"?

Source:

http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt

### Example: Amazon.com purchase suggestion

### Data Mining and Related Fields

### Is Data Mining an extension of Statistics?

* Data Mining and Statistics: mutual fertilization with convergence

* Statistical Data Mining (Graduate course, George Mason University)

* Statistical Data Mining and Knowledge Discovery (Hardcover) by Hamparsum Bozdogan (Editor)

- An overview of Bayesian and frequentist issues that arise in multivariate statistical modeling involving data mining

* Data Mining with Stepwise Regression (Dean Foster, Wharton School)

- use interactions to capture non-linearities

- use Bonferroni adjustment to pick variables to include

- use the sandwich estimator to get robust standard errors
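The first two points above can be sketched in a few lines. This is only an illustration under stated assumptions, not Foster's actual procedure: the helper names (`candidate_terms`, `interaction_values`, `bonferroni_select`), the column names, and the p-values are all hypothetical. The sketch builds main effects plus pairwise interactions as candidate terms, then keeps only those whose p-value falls below the Bonferroni threshold `alpha / m`, where `m` is the number of candidates tested.

```python
from itertools import combinations


def candidate_terms(names):
    """Main effects plus all pairwise interactions, each as a tuple of column names."""
    mains = [(n,) for n in names]
    pairs = list(combinations(names, 2))
    return mains + pairs


def interaction_values(rows, term):
    """Multiply the named columns together for each row (rows are dicts)."""
    values = []
    for row in rows:
        prod = 1.0
        for name in term:
            prod *= row[name]
        values.append(prod)
    return values


def bonferroni_select(p_values, alpha=0.05):
    """Keep terms whose p-value is below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [term for term, p in p_values.items() if p < threshold]


names = ["x1", "x2", "x3"]          # hypothetical predictor names
terms = candidate_terms(names)      # 3 main effects + 3 interactions

rows = [{"x1": 1.0, "x2": 2.0, "x3": 3.0}]
x1_x3 = interaction_values(rows, ("x1", "x3"))  # product column for x1 * x3

# Hypothetical p-values, made up purely to show the thresholding step.
p_values = {
    ("x1",): 0.001,
    ("x2",): 0.03,
    ("x3",): 0.40,
    ("x1", "x2"): 0.004,
    ("x1", "x3"): 0.20,
    ("x2", "x3"): 0.012,
}
selected = bonferroni_select(p_values, alpha=0.05)
print(selected)
```

With six candidates the threshold is 0.05 / 6, so terms that would pass an unadjusted 0.05 cutoff (such as `x2` here) are rejected. Adding interactions grows the candidate set quickly, which is why the adjustment matters.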

### What are Data Mining Myths?

* Myth 1: Data mining automatically discovers hidden patterns in your data

* Myth 2: Data mining is designed for business analysts who are not professionals in quantitative fields

* Myth 3: Data mining findings can be easily translated into decision-maker actions

* Myth 4: Data mining encompasses decision analysis/decision support technology

### What are the logical steps of Data Mining?

SEMMA methodology (SAS Enterprise Miner)

* The core process of conducting a data mining study includes the following steps (SEMMA):

- Sample

- Explore

- Modify

- Model

- Assess

* SEMMA is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining

* SEMMA is focused on the model development aspects of data mining

### CRoss-Industry Standard Process for Data Mining (CRISP-DM)

SPSS Clementine

Six phases of CRISP-DM:

1. Business understanding

2. Data understanding

3. Data preparation

4. Modeling

5. Evaluation

6. Model deployment

www.crisp-dm.org

### Statistics vs. Data Mining: Concepts

### What is Breiman Uncertainty Principle?

Breiman uncertainty principle:

Accuracy * Interpretability = Breiman's constant

Breiman uncertainty principle means that:

The higher a method's accuracy, the lower its interpretability, and vice versa

### What are great Data Mining Ideas?

Injecting randomness into the function estimation procedure

Bagging (Breiman, 1996):

- Apply the same unstable algorithm to different samples (with replacement) of the original data
- Different samples yield different models
- The average of the predictions of these models might be better than the predictions from any single model
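
The bagging steps above can be sketched as follows. This is a minimal illustration, not Breiman's reference implementation: the one-split regression stump used as the unstable learner, the toy step-function data, and all function names are assumptions made for the example.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    best = None
    # Try every observed x except the largest as a split point.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # only one distinct x in the sample: constant model
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Fit the same learner to n_models bootstrap samples of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagged(models, x):
    """Average the predictions of the individual models."""
    return mean(predict_stump(m, x) for m in models)


# Toy data (hypothetical): a step from 0 to 1 at x = 10.
xs = list(range(20))
ys = [0.0 if x < 10 else 1.0 for x in xs]

models = bagged_stumps(xs, ys)
low = predict_bagged(models, 0)
high = predict_bagged(models, 19)
print(low, high)
```

Each bootstrap sample may miss points near the step, so the individual stumps place their thresholds differently. The averaged prediction combines them.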

Boosting (Friedman, Hastie, and Tibshirani, 1999):

- Each model is based on the same original data
- The first individual model is fit to the original data
- For the second model, subtract the predicted value from the original target value, and use the difference as the target value to train the second model
- For the third model, subtract the weighted average of the previous models' predictions from the original target value, and use the difference as the target value to train the third model, and so on.
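
The residual-fitting idea above can be sketched as follows. This is a simplified illustration for squared-error regression, not the exact algorithm from the cited paper: the stump learner, the fixed `learning_rate` weight applied to every model, the toy data, and the function names are assumptions made for the example.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # only one distinct x: constant model
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit each new stump to what the weighted earlier stumps left unexplained."""
    models = []
    residuals = list(ys)  # the first model sees the original target
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        # New target = old target minus this model's weighted prediction.
        residuals = [r - learning_rate * predict_stump(stump, x)
                     for x, r in zip(xs, residuals)]
    return models


def predict_boosted(models, x, learning_rate=0.5):
    """Weighted sum of the individual models' predictions."""
    return sum(learning_rate * predict_stump(m, x) for m in models)


def training_sse(models, xs, ys, learning_rate=0.5):
    return sum((y - predict_boosted(models, x, learning_rate)) ** 2
               for x, y in zip(xs, ys))


# Toy data (hypothetical): y = x^2 on a small grid.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

models = boost(xs, ys)
sse_1 = training_sse(models[:1], xs, ys)   # error after the first model only
sse_20 = training_sse(models, xs, ys)      # error after all 20 rounds
print(sse_1, sse_20)
```

Every model is trained on the same inputs; only the target changes from round to round, which is the contrast with bagging.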

### What are the best Data Mining Conferences?

Annual SAS Data Mining Technology Conference

- The world's largest data mining conference that balances theory and practice

Annual International Conference on Knowledge Discovery and Data Mining (KDD)

- Sponsored by the American Association for Artificial Intelligence (AAAI)

Annual International Salford Systems Data Mining Conference

- Focusing on solving real world challenges
- Business Applications of CART, MARS, TreeNet, and Random Forest
- Keynote speakers: Jerome Friedman (Stanford University) and Leo Breiman (University of California, Berkeley)

### What are the best data mining tools?

- Salford Systems Tools (CART, Random Forest, MARS, TreeNet)
- SAS Enterprise Miner/Text Miner
- SPSS Clementine
- Megaputer Intelligence PolyAnalyst

### References (Data Mining)

Randall Matignon (2007), Neural Network Modeling Using SAS Enterprise Miner, SAS® Institute Inc.

David J. Hand (1998), Data Mining: Statistics and More?, The American Statistician, Vol. 52, No. 2

http://www.amstat.org/publications/tas/hand.pdf

Jerome H. Friedman (1997), Data Mining and Statistics: What's the Connection?, Proceedings of the 29th Symposium on the Interface: Computing Science and Statistics, Houston, Texas

Doug Wielenga (2007), Identifying and Overcoming Common Data Mining Mistakes, SAS Global Forum Paper 073-2007

Nathan Treloar (2002), Text Mining: Tools, Techniques, and Applications

http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt