Big Data and Machine Learning: A Case Study with Bump Boost
With the growth of computing power and computing possibilities, especially the rise of cloud computing, ever more data accumulates, a phenomenon commonly referred to as Big Data. This development creates a need for scalable algorithms. Machine learning has always had an emphasis on scalability, but few well-scaling algorithms are known, and often scalability is achieved only through approximation. In this thesis, we enhance the Bump Boost and Multi Bump Boost algorithms through a well-structured parallelization. We show that, with increasing data set sizes, the algorithms are able to reach almost perfect scalability. Furthermore, we investigate empirically how suitable Big Data frameworks, namely Apache Spark and Apache Flink, are for implementing Bump Boost and Multi Bump Boost.