[1] Dong Chee-hwa, Wang Guoyin, Yongxi, et al. Normalized PCA Algorithm Based on Spark[J]. Journal of Zhengzhou University (Engineering Science), 2017, 38(05): 7-12. [doi:10.13705/j.issn.1671-6833.2017.05.001]
Journal of Zhengzhou University (Engineering Science) [ISSN 1671-6833 / CN 41-1339/T]
Volume: 38
Issue: 2017, No. 05
Page numbers: 7-12
Publication date: 2017-09-26
- Title: Normalized PCA Algorithm Based on Spark
- Author(s): Dong Chee-hwa (1); Wang Guoyin (2); Yongxi (3); Shi Xiaoyu (2); Li Qingliang (4)
- Affiliations: 1. Institute of Electronic Information Technology, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714; 2. University of Chinese Academy of Sciences, Beijing 100049; 3. Institute of Electronic Information Technology, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714; 4. Ministry of Water Resources Information Center, Beijing 100053; 5. Xichang Satellite Launch Center, Wenchang, Hainan 571300
- Keywords: PCA; Spark; distributed; normalization
- DOI: 10.13705/j.issn.1671-6833.2017.05.001
- Abstract: Principal Component Analysis (PCA) is a well-known model for dimensionality reduction in data mining; it transforms the original variables into a small number of composite indices. In this paper, we study the principle of PCA, the distributed architecture of Spark, and the distributed-matrix PCA algorithm in Spark's MLlib. We then improve on that design and present a new algorithm named SNPCA (Spark's Normalized Principal Component Analysis), which computes the principal components together with the data normalization process. Benchmarks on multicore CPUs demonstrate the effectiveness of SNPCA.
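
As a rough illustration of what computing principal components together with normalization can look like, the following single-machine NumPy sketch standardizes each column (z-score) and then eigendecomposes the covariance matrix of the standardized data, which equals the correlation matrix of the raw data. It is not the SNPCA implementation from the paper and does not use Spark; the function name `normalized_pca`, the choice of z-score normalization, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def normalized_pca(X, k):
    """Z-score each column of X, then project onto the top-k principal components.

    Hypothetical helper for illustration; not the paper's SNPCA algorithm.
    """
    X = np.asarray(X, dtype=float)
    n_rows = X.shape[0]

    # Normalization step (assumed here to be z-score standardization).
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled to avoid division by zero
    Z = (X - mean) / std

    # Covariance of the standardized data (the correlation matrix of X).
    cov = (Z.T @ Z) / (n_rows - 1)

    # eigh is appropriate because cov is symmetric; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]  # indices of the k largest eigenvalues
    components = eigvecs[:, order]

    scores = Z @ components  # data projected onto the principal axes
    return scores, components, eigvals[order]


# Synthetic data whose columns differ in scale by orders of magnitude.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 1000.0])

scores, components, variances = normalized_pca(X, 2)
```

Without the normalization step, the third column would dominate the leading component purely because of its scale; standardizing first puts all variables on an equal footing before the components are extracted.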