Data Science Project Workflow

From the sensor to the cloud, from the dataset to the smart service: the DA-Cluster's architecture covers every stage of the CfADS data science workflow. The hardware infrastructure scales with model complexity, so the development of machine learning solutions, training processes and data processing steps can all grow with the demands of a project.


Encrypted Data Transfer

Several protocols can be used to transfer data points generated by smart IoT devices and/or database systems to the cluster. In machine-to-machine (M2M) communication, data is always transmitted in encrypted form. The following protocols are typically used to implement data transfer pipelines:

  • OPC UA
  • MQTT
  • REST
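
As an illustration of the REST variant, the following minimal sketch prepares an encrypted transfer of a single data point over HTTPS. It uses only the Python standard library. The endpoint URL, the field names of the payload and the function names are assumptions made for this example and are not part of the DA-Cluster's documented interface.

```python
import json
import ssl
import urllib.request

# Hypothetical ingestion endpoint; replace with the real cluster address.
INGEST_URL = "https://da-cluster.example.org/api/v1/measurements"


def build_tls_context() -> ssl.SSLContext:
    """Return a client TLS context that verifies the server certificate."""
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def build_request(device_id: str, value: float, unit: str) -> urllib.request.Request:
    """Wrap one data point in a JSON POST request (nothing is sent yet)."""
    payload = json.dumps({"device_id": device_id, "value": value, "unit": unit})
    return urllib.request.Request(
        INGEST_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> int:
    """Send the request over HTTPS and return the HTTP status code."""
    with urllib.request.urlopen(
        request, context=build_tls_context(), timeout=10
    ) as response:
        return response.status
```

Calling `send(build_request("sensor-01", 21.5, "degC"))` would transmit the data point. Because the default context checks both the certificate chain and the host name, the transfer fails rather than falling back to an unverified connection.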


Data Storage & Backup System

Databases and a distributed file system are available on the server side to store incoming data. The Hadoop framework provides the basis for smooth and efficient data processing as well as parallel, scalable computation. The DA-Cluster thus offers reliable, high-performance data access for a range of uses and fields of application. Depending on the underlying problem and the structure of the data, different types of database systems (SQL or NoSQL) can be used to handle incoming data streams properly.
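
The choice between SQL and NoSQL can be sketched with a small example. Here an in-memory SQLite database stands in for a relational system, and a plain list of JSON strings stands in for a document-oriented store. Both stand-ins, the table layout and all sample values are assumptions made for illustration; they do not describe the databases actually deployed on the cluster.

```python
import json
import sqlite3

# Readings with a fixed schema fit a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, ts TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("sensor-01", "2024-01-01T00:00:00", 21.5),  # illustrative values
        ("sensor-01", "2024-01-01T00:01:00", 22.5),
        ("sensor-02", "2024-01-01T00:00:00", 19.0),
    ],
)
# SQL aggregates directly over the structured columns.
avg_by_device = dict(
    conn.execute(
        "SELECT device_id, AVG(temperature) FROM readings GROUP BY device_id"
    ).fetchall()
)

# Messages whose fields differ per device fit a document-style store better.
documents = [
    json.dumps({"device_id": "robot-07", "joints": [0.1, 0.4], "state": "idle"}),
    json.dumps({"device_id": "press-03", "force_kn": 120.0}),
]
# Each document carries its own set of fields; no shared schema is required.
fields_by_device = {
    doc["device_id"]: sorted(doc.keys()) for doc in map(json.loads, documents)
}
```

In this sketch the relational side benefits from a known column layout, while the document side accepts heterogeneous messages without a schema change.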


Data Analysis & Feature Engineering

Preparation, preprocessing and analysis of data are common steps in most data science projects. Widely used tools and frameworks are available to support each of these steps. In every project, the tool stack is chosen according to the individual requirements that arise from the structure of the data and the underlying problem. The hardware resources of the cluster keep processing efficient, especially when complex algorithms are applied to large amounts of data (e.g. distributed MapReduce). The results of feature engineering can be presented and visualized with Jupyter Notebooks, which provide interactive, web-based visualization.
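
The map-reduce pattern mentioned above can be sketched in a few lines. The example computes two simple features, the mean and the population variance, from data that is split into partitions. It runs sequentially in plain Python; on a cluster, the map step would run in parallel on the nodes that hold the partitions. The function names and sample values are assumptions made for this illustration, and the code does not use the Hadoop API.

```python
from functools import reduce


def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(partitions):
    """Compute the global mean and population variance from all partitions."""
    count, total, total_sq = reduce(
        combine, map(map_partition, partitions), (0, 0.0, 0.0)
    )
    if count == 0:
        raise ValueError("no data")
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance


# Illustrative data, split into three partitions of unequal size.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
```

Only the three-number summary of each partition has to be exchanged between nodes, which is what makes the pattern suitable for large datasets.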


Model Development & Training

Several state-of-the-art machine learning frameworks (e.g. TensorFlow, Keras, scikit-learn) are available. Training processes for complex models and deep neural networks can be distributed across all available cluster nodes, so the complete hardware infrastructure, and especially the high-performance GPU nodes, can be used to accelerate learning. This makes the DA-Cluster a suitable platform for complex and data-intensive machine learning projects.
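
One common way to distribute training is synchronous data parallelism: every worker computes gradients on its own shard of the data, the gradients are averaged, and all workers apply the same update. The following sketch shows the idea for a linear model in plain Python. It is a framework-independent illustration under that assumption; it does not use the TensorFlow or Keras API, and the workers are simulated by a loop on a single machine. All names and sample values are chosen for this example.

```python
def shard_gradient(shard, w, b):
    """Summed squared-error gradient for y = w*x + b on one worker's shard."""
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in shard)
    grad_b = sum(2.0 * (w * x + b - y) for x, y in shard)
    return grad_w, grad_b, len(shard)


def train(shards, steps=2000, learning_rate=0.05):
    """Fit w and b with synchronous data-parallel gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Each worker would run this on its own node; here it is a loop.
        results = [shard_gradient(shard, w, b) for shard in shards]
        n = sum(r[2] for r in results)
        # All-reduce step: sum the gradients and divide by the total sample count.
        grad_w = sum(r[0] for r in results) / n
        grad_b = sum(r[1] for r in results) / n
        # Every worker applies the same update, so the models stay identical.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Illustrative data generated from y = 2x + 1, split across two workers.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
w, b = train(shards)
```

Because the averaged gradient equals the gradient over the full dataset, the result matches single-node training while the gradient computation is spread over the workers.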


Smart Services

Smart services connect the DA-Cluster to the outside world and serve as the interface between the cluster and its users. They are used to deploy machine learning solutions to online software systems that are implemented as virtual machines or virtual environments. On the client side, users can interact with the developed machine learning models in different ways. Smart services are usually accessible through a web interface, with access restricted by login and/or secure VPN connections.
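
The request handling of such a service can be sketched as follows. The example checks a bearer token before passing the input to a model and returns an HTTP status code together with a JSON body. The token store, the threshold model and all function names are hypothetical placeholders; a real smart service would rely on its own login system and on a trained model.

```python
import hmac
import json

# Hypothetical credential store; a real service would use its login system.
VALID_TOKENS = {"demo-token"}


def is_authorized(headers):
    """Check for a known bearer token in the Authorization header."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer":
        return False
    # Constant-time comparison avoids leaking token contents via timing.
    return any(
        hmac.compare_digest(token.encode("utf-8"), valid.encode("utf-8"))
        for valid in VALID_TOKENS
    )


def predict(features):
    """Placeholder model: flags a reading above a fixed threshold as anomalous."""
    return {"anomaly": features["temperature"] > 80.0}


def handle_request(headers, body):
    """Return (HTTP status, JSON body) for one prediction request."""
    if not is_authorized(headers):
        return 401, json.dumps({"error": "unauthorized"})
    try:
        result = predict(json.loads(body))
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "invalid input"})
    return 200, json.dumps(result)
```

Keeping the handler a pure function of headers and body makes it easy to place behind any web framework or VPN gateway.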


Applications in the Context of Industry 4.0

  • Smart Services
    • Predictive/Prescriptive Maintenance
    • Predictive Scheduling
    • Forecasting
    • Anomaly Detection
  • Robotics
  • Digital Twins
  • Manufacturing Execution Systems (MES)