Research on Parallel Support Vector Machine Based on Spark Big Data Platform

Yao Huimin

Research output: Contribution to journalArticlepeer-review

7 Citations (SciVal)

Abstract

With the development of cloud computing and distributed cluster technology, the concept of big data has been expanded and extended in terms of capacity and value, and machine learning technology has also received unprecedented attention in recent years. Traditional machine learning algorithms cannot solve the problem of effective parallelization, so a parallelization support vector machine based on Spark big data platform is proposed. Firstly, the big data platform is designed with Lambda architecture, which is divided into three layers: Batch Layer, Serving Layer, and Speed Layer. Secondly, in order to improve the training efficiency of support vector machines on large-scale data, when merging two support vector machines, the "special points"other than support vectors are considered, that is, the points where the nonsupport vectors in one subset violate the training results of the other subset, and a cross-validation merging algorithm is proposed. Then, a parallelized support vector machine based on cross-validation is proposed, and the parallelization process of the support vector machine is realized on the Spark platform. Finally, experiments on different datasets verify the effectiveness and stability of the proposed method. Experimental results show that the proposed parallelized support vector machine has outstanding performance in speed-up ratio, training time, and prediction accuracy.

Original languageEnglish
Article number7998417
JournalScientific Programming
Volume2021
DOIs
Publication statusPublished - 17 Dec 2021
Externally publishedYes

ASJC Scopus subject areas

  • Software
  • Computer Science Applications

Fingerprint

Dive into the research topics of 'Research on Parallel Support Vector Machine Based on Spark Big Data Platform'. Together they form a unique fingerprint.

Cite this