Data Mining, Chapter 1

What is Big Data?
“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” - Gartner

“Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” - McKinsey & Company

Data mining
People have been analysing and investigating data for centuries.

Statistics
Mean, Variance, Correlation, Distribution …

Today, data are often far beyond human comprehension.
Diversity, Volume, Dimensionality

Definition
Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.

Not a fully automatic process
Human intervention is often unavoidable.
Domain Knowledge
Data Collection and Pre-processing

Synonym: Knowledge Discovery

Data Integration & Analysis

Process of Data Mining

DM Techniques - Classification
“Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics (referred to as variables) and based on a training set of previously labeled items.”

Given a training set {(x1, y1), …, (xn, yn)}, produce a classifier (a function) that maps any unseen object x to its class label y.

Algorithms
Decision Trees
K-Nearest Neighbours
Neural Networks
Support Vector Machines
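
As an illustration of the definition above, here is a minimal sketch of one of the listed algorithms, K-Nearest Neighbours, in plain Python. The function name `knn_classify`, the toy data, and the choice of Euclidean distance with majority voting are assumptions made for this example, not something the notes specify.

```python
import math
from collections import Counter


def knn_classify(training_set, x, k=3):
    """Predict the label of x by majority vote among its k nearest training items."""
    # Sort the labeled items by Euclidean distance to the query point and keep k.
    neighbours = sorted(training_set, key=lambda item: math.dist(item[0], x))[:k]
    # Count the labels of those neighbours and return the most common one.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: (features, label) pairs.
training_set = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

prediction = knn_classify(training_set, (1.1, 0.9), k=3)
```

Here the "classifier" is simply the function `knn_classify` with the training set bound to it; other algorithms in the list build an explicit model instead.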

Applications
Churn Prediction
Medical Diagnosis

Classification Boundaries

Overfitting – Classification

Confusion Matrix

TPR (true positive rate) = TP / (TP + FN)

TNR (true negative rate) = TN / (TN + FP)

Accuracy = (TP + TN) / (P + N), where P = TP + FN and N = TN + FP
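
The three formulas can be checked on a small example. The sketch below assumes binary labels with 1 as the positive class, and reads P and N as the numbers of actual positives (TP + FN) and actual negatives (TN + FP). The function names and the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for a binary problem."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1  # actual positive predicted as negative
        elif pred == positive:
            fp += 1  # actual negative predicted as positive
        else:
            tn += 1
    return tp, fn, fp, tn


def rates(tp, fn, fp, tn):
    """Return (TPR, TNR, accuracy) from the four confusion-matrix cells."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return tpr, tnr, accuracy


# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
tpr, tnr, accuracy = rates(*counts)
```

With these labels the counts are TP = 3, FN = 1, FP = 2, TN = 4, so TPR = 0.75, TNR = 4/6 and accuracy = 0.7.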

Receiver Operating Characteristic


DM Techniques - Clustering
“Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.”

Distance Metrics
Euclidean Distance
Manhattan Distance
Mahalanobis Distance
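
A minimal sketch of the three metrics in plain Python follows. The function names are chosen for this example, and the Mahalanobis version assumes the caller supplies the inverse covariance matrix as a list of rows rather than estimating it from data.

```python
import math


def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def mahalanobis(x, y, inv_cov):
    """sqrt(d^T * inv_cov * d) with d = x - y; inv_cov is a list of rows."""
    d = [a - b for a, b in zip(x, y)]
    # Multiply inv_cov by d row by row, then take the dot product with d.
    weighted = [sum(row[j] * d[j] for j in range(len(d))) for row in inv_cov]
    return math.sqrt(sum(di * wi for di, wi in zip(d, weighted)))


x, y = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]  # inverse covariance of uncorrelated unit-variance data

d_euclid = euclidean(x, y)
d_manhattan = manhattan(x, y)
d_mahalanobis = mahalanobis(x, y, identity)
```

With the identity matrix as inverse covariance, the Mahalanobis distance reduces to the Euclidean distance; other matrices rescale each direction by how much the data vary along it.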

Algorithms
K-Means
Sequential Leader
Affinity Propagation
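
To make the first algorithm in the list concrete, here is a small K-Means sketch in plain Python. Using the first k points as initial centroids is an assumption made to keep the example deterministic; practical implementations usually pick the initial centroids at random or with a seeding scheme.

```python
import math


def k_means(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
```

On this toy data the two groups are recovered after a few iterations, even though both initial centroids start inside the same group.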

Applications
Market Research
Image Segmentation
Social Network Analysis

Hierarchical Clustering

DM Techniques - Association Rules

DM Techniques - Regression

Overfitting – Regression

Data Preprocessing
Real data are often surprisingly dirty.
A Major Challenge for Data Mining

Typical Issues
Missing Attribute Values
Different Coding/Naming Schemes
Infeasible Values
Inconsistent Data
Outliers

Data Quality
Accuracy
Completeness
Consistency
Interpretability
Credibility
Timeliness

Data Cleaning
Fill in missing values.
Correct inconsistent data.
Identify outliers and noisy data.
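
Two of these cleaning steps can be sketched with the standard library. The choices below are assumptions for illustration only: missing values are marked with None and filled with the median of the observed values, and outliers are flagged by a z-score above 2. The notes do not prescribe either rule.

```python
import statistics


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def flag_outliers(values, threshold=2.0):
    """Return the values whose z-score (population std) exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical age column with one missing entry and one infeasible value.
ages = [23, 25, None, 24, 26, 22, 250, 24]

filled = fill_missing(ages)
outliers = flag_outliers(filled)
```

The median is used for filling because a single extreme value such as 250 would pull the mean far away from the typical ages.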

Data Integration
Combine data from different sources.

Data Transformation
Normalization
Aggregation
Type Conversion
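
Normalization is the easiest of these to show in code. The sketch below uses min-max scaling to the range [0, 1], which is one common variant; the function name and the sample incomes are assumptions for this example.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical income column.
incomes = [20000, 35000, 50000, 80000]
normalized = min_max_normalize(incomes)
```

After scaling, attributes measured in very different units contribute comparably to distance-based methods.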

Data Reduction
Feature Selection
Sampling

Privacy Protection
Data: A Double-Edged Sword
People can benefit greatly from data analysis.
The consequence of information leakage can be catastrophic.

People may be reluctant to give sensitive information due to privacy concerns.
Drug use, taxes, sexuality …

How can we find out the percentage of people with a certain attribute?
The interviewer should not know the true answer of each respondent.

Randomized Response
Used in structured survey research.
Can maintain the confidentiality of respondents.
Two questions are presented:
Q1: I have the attribute A.
Q2: I do not have the attribute A.

The respondent uses a random device to:
Answer Q1 with probability p.
Answer Q2 with probability 1-p.
The interviewer does not know which question was answered.
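
The scheme above still lets the interviewer estimate the overall percentage. If a fraction π of respondents has attribute A and everyone answers the selected question truthfully, the probability of a "yes" is p·π + (1-p)·(1-π), which can be solved for π whenever p ≠ 0.5. The sketch below is a simulation under those assumptions; the function names, the seed, and the parameter values are made up for illustration.

```python
import random


def estimate_proportion(yes_fraction, p):
    """Solve yes_fraction = p*pi + (1 - p)*(1 - pi) for pi."""
    if p == 0.5:
        raise ValueError("p = 0.5 makes the answers carry no information")
    return (yes_fraction + p - 1) / (2 * p - 1)


def simulate_survey(true_proportion, p, n, seed=0):
    """Simulate n truthful respondents and return the observed fraction of 'yes'."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        has_attribute = rng.random() < true_proportion
        answers_q1 = rng.random() < p  # outcome of the random device
        # Q1 is "I have A", Q2 is "I do not have A"; the answer is truthful.
        said_yes = has_attribute if answers_q1 else not has_attribute
        yes += said_yes
    return yes / n


# Hypothetical survey: 30% truly have the attribute, Q1 is chosen with p = 0.7.
observed = simulate_survey(true_proportion=0.3, p=0.7, n=100_000)
estimate = estimate_proportion(observed, p=0.7)
```

The estimate gets noisier as p approaches 0.5, which is the price paid for stronger protection of individual answers.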

Cloud Computing

Why bother with so many different algorithms?

No algorithm is always superior to others.

No parameter setting is optimal over all problems.

Look for the best match between problem and algorithm.
Experience
Trial and Error

Factors to consider:
Applicability
Computational Complexity
Interpretability

Always start with the simple ones.

Grouping
