Data preprocessing

Data preprocessing

1) data cleaning : 누락 데이터 채우기, 노이즈 제거 , 아웃라이어 제거, inconsistency 수정

- Noisy Data : random error or variance in a measured variable => "SMOOTH" by Binning

First, sort data and partition into (equal-frequency) bins then can smooth by bin means, bin median or bin boundaries

: also can smooth by Regression, Clustering, Combined computer and human inspection

ex) 다음 데이터(4,8,15,21,21,24,25,28,34)를 빈의 평균값을 이용하여 smooth 한다면 :

=> 9개의 데이터 포인트가 있으니 3개씩 3개의 bin으로 나누어 준다.

bin1: 4,8,15 * smooth by bin mean: bin1: 9,9,9

bin2: 21,21,24 bin2: 22,22,22

bin3: 25,28,34 bin3: 29,29,29

2) data integration : integration of multiple database

- Redundant attributes may be able to be detected by correlation analysis and covariance analysis

상관 분석을 통해서 데이터를 합친다.

3) data reduction : obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (almost the same) analytical results

(1) dimensionality reduction: Wavelet transforms, Principal Components Analysis (PCA)

(2) numerosity reduction: Regression and Log-Linear Models, Histograms, clustering, sampling

4) data transformation and data discretization: smoothing, attribute/feature construction normalization, discretization

저작자표시 (새창열림)

'산업공학 > Data Analytics' 카테고리의 다른 글

Data transformation (1)	2024.03.05
Data integration (1)	2024.03.05
FP Growth (0)	2024.03.05
Apriori algorithm (0)	2024.03.05
Closed Patterns and Max-Patterns (0)	2024.03.05

실리콘밸리의 비개발자 직장인

Data preprocessing

'산업공학 > Data Analytics' 카테고리의 다른 글

티스토리툴바

Data preprocessing

'산업공학 > Data Analytics' 카테고리의 다른 글

관련글

티스토리툴바