Machine Learning and Deep Learning Frameworks and Libraries for Large-Scale Data Mining: A Survey

6 min read

Paper Link

Published online: 19 January 2019

This survey examines the current progress, advantages, and limitations of ML/DL frameworks and libraries for large-scale data mining.

1. Introduction

This section establishes key definitions and the hierarchical relationship between AI, ML, and DL technologies.

Core Definitions

Data Mining (DM)

The core stage of the knowledge discovery process that aims to extract interesting and potentially useful information from data (Goodfellow et al. 2016; Mierswa 2017).

Artificial Intelligence (AI)

Any technique that aims to enable computers to mimic human behaviour, including machine learning, natural language processing (NLP), language synthesis, computer vision, robotics, sensor analysis, optimization and simulation.

Machine Learning (ML)

A subset of AI techniques that enables computer systems to learn from previous experience (i.e. data observations) and improve their behaviour for a given task. ML techniques include Support Vector Machines (SVM), decision trees, Bayes learning, k-means clustering, association rule learning, regression, neural networks, and many more.

Neural Networks (NNs)

A subset of ML techniques, loosely inspired by biological neural networks. They are usually described as a collection of connected units, called artificial neurons, organized in layers.

Deep Learning (DL)

A subset of NN techniques that makes multi-layer computation feasible in practice. Typical DL architectures include deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), and many more.

2. Machine Learning and Deep Learning

2.1 Machine Learning process

The classic Data Mining process follows the CRISP-DM methodology:

(Figure: CRISP-DM methodology flowchart showing the six phases of the data mining process.)

  1. Business Understanding - Define objectives and requirements
  2. Data Understanding - Collect and explore initial data
  3. Data Preparation - Clean and transform data for modeling
  4. Modeling - Select and apply modeling techniques
  5. Evaluation - Assess model quality and effectiveness
  6. Deployment - Deploy the model into a production environment

2.2 Neural Networks and Deep Learning

This section introduces classic neural network architectures.

3. Accelerated computing

Hardware acceleration includes GPU, FPGA, and TPU technologies.

New computational schemas help reduce memory usage and accelerate computation:

Optimization Techniques

  1. Sparse Computation

    Utilizes sparse representations throughout the neural network, lowering memory requirements and speeding up computation.

  2. Low Precision Data Types

    Data types smaller than 32 bits (e.g., half-precision float or integer), with experimentation extending to 1-bit computation (Courbariaux et al. 2016). This approach speeds up algebraic calculations and greatly decreases memory consumption at the cost of slightly reduced model accuracy (Iandola et al. 2016; Markidis et al. 2018). In recent years, most DL frameworks have begun supporting 16-bit and 8-bit computation (Harris 2016; Andres et al. 2018). A combined sketch of both techniques follows this list.
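
As a rough, combined sketch of both techniques (not code from the survey; it assumes NumPy and SciPy are installed), the snippet below stores a mostly-zero weight matrix in sparse CSR form and then casts the stored values to half precision, comparing memory footprints:

```python
import numpy as np
from scipy import sparse

# A dense float32 "weight matrix" in which ~99% of entries are zero.
rng = np.random.default_rng(0)
dense = rng.random((1000, 1000), dtype=np.float32)
dense[dense < 0.99] = 0.0

# Sparse computation: CSR stores only the non-zero entries.
csr = sparse.csr_matrix(dense)

# Low precision: halve the per-value cost again with float16 storage.
csr_fp16 = csr.astype(np.float16)

print(f"dense float32:  {dense.nbytes / 1e6:6.2f} MB")
print(f"sparse float32: {csr.data.nbytes / 1e6:6.2f} MB (values only)")
print(f"sparse float16: {csr_fp16.data.nbytes / 1e6:6.2f} MB (values only)")

# The sparse form still supports the usual algebra, e.g. a matrix-vector product.
x = rng.random(1000, dtype=np.float32)
y = csr @ x
```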

Scalability Challenges and Solutions

Large-scale DL development faces significant constraints:

  • Memory Limitations: GPU memory size remains a bottleneck (e.g., the 32 GB limit of NVIDIA Volta)
  • Scalability Solutions: Multi-GPU and distributed-GPU approaches have led to advances in:
    • Data parallelism (see the sketch after this list)
    • Model parallelism
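
As a minimal illustration (my sketch, not from the survey), PyTorch's nn.DataParallel wrapper provides single-node data parallelism: each visible GPU receives a full replica of the model and a slice of the input batch. Model parallelism would instead place different layers on different devices.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

if torch.cuda.device_count() > 1:
    # Data parallelism: every GPU holds a replica and processes part of the
    # batch; gradients are combined so the replicas stay in sync.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

batch = torch.randn(64, 512, device=device)
output = model(batch)  # shape (64, 10), however many GPUs split the work
print(output.shape)
```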

Accelerated Libraries

Manufacturers enhance hardware configurations with many-core accelerators and provide highly optimized libraries with primitives, algorithms, and functions to access GPU parallel processing power.

Key Libraries:

  • NVIDIA CUDA - Parallel computing platform
  • NVIDIA cuDNN - Deep neural network primitives
  • Intel MKL - Math kernel library
  • OpenCL - Open computing language
  • AMD ROCm - Open-source platform
  • OpenMP - Multi-processing API
  • Open MPI - Message passing interface

Hybrid Parallelism:

OpenMP handles parallelism within multi-core nodes while MPI manages parallelism between nodes, enabling efficient distributed computing. The sketch below illustrates this pattern from Python.
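
The same hybrid pattern can be sketched from Python (my example, assuming mpi4py and an MPI runtime are installed): each MPI rank computes locally, where NumPy's BLAS backend (e.g., MKL) may use OpenMP threads internally, and MPI combines the partial results across ranks.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Intra-node parallelism: this matrix product is multi-threaded by the
# BLAS backend (e.g., MKL with OpenMP threads) with no explicit code here.
a = np.random.rand(500, 500)
local = (a @ a.T).sum()

# Inter-node parallelism: MPI sums the per-rank partial results.
total = comm.allreduce(local, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks, combined result: {total:.2f}")
```

Launched with, e.g., `mpirun -n 4 python hybrid_sketch.py`, this runs one rank per process while the heavy linear algebra inside each rank remains multi-threaded.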

4. Machine Learning frameworks and libraries

The authors present numerous open-source ML/DL packages and frameworks.

4.2 Deep Learning Frameworks and Libraries

4.2.12 Deep Learning Wrapper Libraries

Major technology companies have established clear pathways for transitioning from research prototypes to production systems:

Google's Approach:

  • Keras for rapid prototyping and experimentation (sketched below)
  • TensorFlow for production deployment
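
A minimal sketch of that prototyping style (standard Keras API; the tiny model and toy data are mine):

```python
import numpy as np
from tensorflow import keras

# A small classifier defined in a few lines -- the rapid-prototyping
# appeal of Keras.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Toy data, just to make the sketch runnable end to end.
x = np.random.rand(256, 20).astype("float32")
y = np.random.randint(0, 3, size=256)
model.fit(x, y, epochs=2, batch_size=32, verbose=0)
```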

Facebook's Strategy:

  • PyTorch for research and prototyping (sketched below)
  • Caffe2 for production systems
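
The corresponding PyTorch style (again my sketch) is an explicit, eager training loop; every step is ordinary Python, which is what makes it convenient for research and debugging:

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Toy data for a self-contained example.
x = torch.randn(256, 20)
y = torch.randint(0, 3, (256,))

for _ in range(5):
    optimizer.zero_grad()        # reset gradients from the previous step
    loss = loss_fn(model(x), y)  # forward pass, eagerly executed
    loss.backward()              # backpropagation
    optimizer.step()             # parameter update
print(f"final loss: {loss.item():.3f}")
```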

4.3 Machine Learning and Deep Learning frameworks and libraries with MapReduce

4.3.1 Deeplearning4j

Advantages:

  • Java-based framework

Disadvantages:

  • Java/Scala are not mainstream languages for DL/ML
  • H2O appears to be more popular in the community

4.3.2 Apache Spark MLlib and Spark ML

The authors emphasize that implementing even a seemingly simple algorithm (e.g., distributed multi-label kNN) for large-scale data mining is not trivial (Ramirez-Gallego et al. 2017; Gonzalez-Lopez et al. 2018). It requires a deep understanding of the underlying scalable and distributed environment (e.g., Apache Spark), including its data and process management, as well as solid programming skills. ML algorithms for large-scale data mining therefore differ in complexity and implementation from general-purpose ones, as the sketch below suggests.
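
To make this concrete, even a deliberately small Spark ML pipeline (standard PySpark API; the toy data is mine) already involves a SparkSession, a feature-assembly stage, and distributed DataFrames:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy tabular data; Spark ML expects features assembled into one vector column.
df = spark.createDataFrame(
    [(0.0, 1.2, 0.5), (1.0, -0.3, 2.1), (0.0, 0.8, -1.0), (1.0, -1.1, 1.7)],
    ["label", "x1", "x2"],
)

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show()
spark.stop()
```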

Advantages:

  • Extensive package ecosystem
  • In-memory processing capabilities

Disadvantages:

  • Primarily handles tabular data
  • High memory consumption
  • MLlib/ML is still under development

4.3.3 H2O, Sparkling Water and Deep Water

Sparkling Water: integrates H2O with Apache Spark

Deep Water: extends H2O with DL backends (TensorFlow, MXNet, and Caffe)

Advantages:

  • Widely adopted in industry
  • Infrastructure-optimized algorithms
  • Aims to automate ML/DL workflows through its web-based UI

Disadvantages:

  • Flow UI: the web-based UI for H2O does not support direct interaction with Spark
  • H2O is more general purpose and targets different problems than dedicated DL libraries (e.g., TensorFlow or DL4J)
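
For comparison, a minimal H2O sketch in Python (standard h2o package API; the toy frame is mine). Note that it talks to a local H2O cluster rather than to Spark directly:

```python
import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()  # start (or connect to) a local H2O cluster

# Tiny toy frame; real use would import data directly into the cluster.
frame = h2o.H2OFrame({
    "x1": [0.1, 0.9, 0.2, 0.8, 0.3, 0.7],
    "x2": [1.0, 0.1, 0.9, 0.2, 0.8, 0.3],
    "label": ["a", "b", "a", "b", "a", "b"],
})
frame["label"] = frame["label"].asfactor()  # mark the target as categorical

model = H2ODeepLearningEstimator(hidden=[8, 8], epochs=5)
model.train(x=["x1", "x2"], y="label", training_frame=frame)
print(model.model_performance(frame))

h2o.cluster().shutdown()
```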

5. Conclusions

This survey reveals several key insights about the current state and future of ML/DL frameworks for large-scale data mining:

Framework Development and Ecosystem

  1. Industry-Academic Partnership: Most frameworks emerge from enterprises or research institutions, combining theoretical rigor with practical applicability.
  2. Standardization: All major frameworks now provide high-level Deep Learning wrappers, reducing implementation complexity.

Technology Maturity

  1. Ongoing Evolution: Apache Spark, Apache Flink, and Cloudera Oryx 2 remain in active development, indicating continued innovation in distributed computing platforms.
  2. Scalability Trade-offs:
    • Vertical scalability faces memory size constraints
    • Horizontal scalability encounters inter-node latency bottlenecks

Accessibility and Adoption

  1. Universal Capability: Both modern and traditional tools can handle large-scale data processing.
  2. Language Convergence: Python has emerged as the dominant language, with universal support across major tools (performance comparisons with native languages remain underexplored).

Future Directions

  1. Interactive Intelligence: The trajectory points toward frameworks incorporating interactive analysis and visualization capabilities to enhance decision-making processes.