Title: Semi-Structured Data Pipeline Preprocessing for Medical Data

Introduction

What is semi-structured data:

Semi-structured data does not fit neatly into traditional structured data models such as relational databases, yet it also lacks the complete flexibility of unstructured data such as free-text documents (Engelen & Hoos, 2019). CSV (comma-separated values) files, which are commonly used to store tabular data, can be semi-structured in several ways. For instance, different rows in the same file may contain different numbers of columns, so the file has no single consistent structure. This variability in format is a hallmark of semi-structured data in CSV files (Engelen & Hoos, 2019).
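
The row-level inconsistency described above can be measured before any modeling is attempted. The following is a minimal sketch, assuming a small made-up CSV sample; the column names, the values, and the helper name `profile_row_widths` are hypothetical and are not taken from any dataset used in this paper.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: the header has 4 fields, but two rows deviate.
SAMPLE = """patient_id,age,diagnosis,blood_pressure
P001,54,hypertension,140/90
P002,61,diabetes
P003,47,asthma,120/80,follow-up in 6 weeks
P004,39,healthy,118/76
"""

def profile_row_widths(text):
    """Return (expected_width, width_counts, irregular_rows) for CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    expected = len(header)
    widths = Counter()
    irregular = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        widths[len(row)] += 1
        if len(row) != expected:
            irregular.append((line_no, len(row)))
    return expected, widths, irregular

expected, widths, irregular = profile_row_widths(SAMPLE)
print("expected columns:", expected)
print("rows per column count:", dict(widths))
print("irregular rows (line, columns):", irregular)
```

A profile like this gives a quick indication of how far a file departs from a strict tabular layout, which helps decide how much preprocessing is needed.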

How semi-structured data can affect the machine learning and data analytics process:

The presence of semi-structured data in CSV files can have significant implications for machine learning and data analytics processes. Machine learning algorithms often rely on structured data formats to train models effectively. When dealing with semi-structured data in CSV files, preprocessing steps become crucial to handle the variability in data structure and ensure that the machine learning models can interpret the data correctly (Engelen & Hoos, 2019). This preprocessing may involve tasks such as data cleaning, normalization, and imputation to address missing values or inconsistencies in the data format.
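
As an illustration of these preprocessing steps, the sketch below pads or truncates each row to the header width (cleaning), fills missing numeric values with the column mean (imputation), and rescales the column to the range [0, 1] (normalization). It is only one possible approach under stated assumptions: the sample data, the column names, and the helpers `load_padded` and `impute_and_scale` are hypothetical, and mean imputation is chosen for brevity rather than because it is appropriate for every medical variable.

```python
import csv
import io
from statistics import mean

# Hypothetical sample with a missing age, an extra field, and a short row.
SAMPLE = """patient_id,age,diagnosis
P001,54,hypertension
P002,,diabetes
P003,47,asthma,unexpected note
P004
P005,39,healthy
"""

def load_padded(text):
    """Cleaning: force every row to the header width (truncate or pad)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for row in reader:
        row = [cell.strip() for cell in row[:len(header)]]
        row += [""] * (len(header) - len(row))
        rows.append(dict(zip(header, row)))
    return rows

def impute_and_scale(rows, column):
    """Mean imputation followed by min-max normalization to [0, 1]."""
    observed = [float(r[column]) for r in rows if r[column] != ""]
    fill = mean(observed)
    values = [float(r[column]) if r[column] != "" else fill for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    for r, v in zip(rows, values):
        r[column] = (v - lo) / span
    return rows

rows = impute_and_scale(load_padded(SAMPLE), "age")
for r in rows:
    print(r)
```

Truncating extra fields discards information, so a real pipeline might instead log such rows or move the overflow into a separate notes column.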

Moreover, the presence of semi-structured data in CSV files can pose challenges for feature extraction and selection in machine learning tasks. Since the data may not adhere to a strict schema, identifying relevant features for predictive modeling becomes more complex. Feature engineering, a critical aspect of machine learning, may require additional effort to extract meaningful information from semi-structured CSV data (Engelen & Hoos, 2019).

Semi-structured CSV data also affects data analytics. Traditional analytics techniques designed for structured data may struggle to handle the variability inherent in semi-structured formats. Analysts may need tools and methods specifically tailored to semi-structured data in order to derive valuable insights from CSV files with irregular structures (Engelen & Hoos, 2019).

Researchers have explored various approaches to address the challenges posed by semi-structured data in machine learning and data analytics. For instance, semi-supervised learning techniques have been proposed to leverage both labeled and unlabeled data for training models, which can be particularly useful when dealing with semi-structured data that may have inconsistencies or missing labels (Engelen & Hoos, 2019). By incorporating unlabeled data alongside labeled data, semi-supervised learning algorithms aim to improve model performance and generalization on semi-structured datasets.
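
To make the idea of combining labeled and unlabeled data concrete, the sketch below implements self-training, one simple semi-supervised scheme, with a one-dimensional nearest-centroid classifier. It is a toy illustration and not the method of any cited work: the data, the `min_margin` confidence threshold, and all function names are assumptions introduced here.

```python
from statistics import mean

def centroids(labeled):
    """Mean feature value per class from (value, label) pairs."""
    classes = {label for _, label in labeled}
    return {c: mean([v for v, label in labeled if label == c]) for c in classes}

def predict(cents, x):
    """Return (label, margin); assumes at least two classes are present."""
    ranked = sorted(cents, key=lambda c: abs(x - cents[c]))
    best, second = ranked[0], ranked[1]
    margin = abs(x - cents[second]) - abs(x - cents[best])
    return best, margin

def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Pseudo-label confident unlabeled points, refit, and repeat."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(labeled)
        accepted, remaining = [], []
        for x in pool:
            label, margin = predict(cents, x)
            if margin >= min_margin:
                accepted.append((x, label))
            else:
                remaining.append(x)
        if not accepted:
            break  # nothing confident enough; stop early
        labeled.extend(accepted)
        pool = remaining
    return centroids(labeled), pool

# Hypothetical data: two labeled points and five unlabeled ones.
labeled = [(1.0, "low"), (9.0, "high")]
unlabeled = [1.5, 2.0, 8.0, 8.5, 5.2]
cents, still_unlabeled = self_train(labeled, unlabeled)
print(cents, still_unlabeled)
```

Points near a class centroid are absorbed into the training set, while the ambiguous point between the two classes stays unlabeled because its margin never reaches the threshold.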

Furthermore, regularization methods like Virtual Adversarial Training (VAT) have been investigated as a means to enhance supervised and semi-supervised learning tasks on diverse datasets, including those with semi-structured characteristics (Miyato et al., 2019). By introducing adversarial perturbations to the input data, VAT aims to improve the robustness and generalization of machine learning models, which can be beneficial when working with semi-structured data in CSV files that exhibit variability in their format.
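
For reference, the central quantities of VAT can be written as follows, in notation adapted from Miyato et al. (2019). Here $D$ is a divergence between predictive distributions (for example, the KL divergence), $\hat{\theta}$ denotes the current parameters treated as constants, $\epsilon$ bounds the perturbation norm, $\mathcal{D}_l$ and $\mathcal{D}_{ul}$ are the labeled and unlabeled sets of sizes $N_l$ and $N_{ul}$, $\ell$ is the supervised loss, and $\alpha$ is the regularization weight.

```latex
\begin{align}
r_{\mathrm{vadv}} &= \operatorname*{arg\,max}_{r:\ \|r\|_2 \le \epsilon}
  D\big[\, p(y \mid x, \hat{\theta}),\ p(y \mid x + r, \hat{\theta}) \,\big] \\
\mathrm{LDS}(x, \theta) &= D\big[\, p(y \mid x, \hat{\theta}),\ p(y \mid x + r_{\mathrm{vadv}}, \theta) \,\big] \\
\mathcal{R}_{\mathrm{vadv}}(\mathcal{D}_l, \mathcal{D}_{ul}, \theta) &=
  \frac{1}{N_l + N_{ul}} \sum_{x \in \mathcal{D}_l \cup \mathcal{D}_{ul}} \mathrm{LDS}(x, \theta) \\
\text{objective} &= \ell(\mathcal{D}_l, \theta) + \alpha\, \mathcal{R}_{\mathrm{vadv}}(\mathcal{D}_l, \mathcal{D}_{ul}, \theta)
\end{align}
```

Because the perturbation $r_{\mathrm{vadv}}$ is computed from the model's own predictions rather than from labels, the regularizer can be evaluated on unlabeled examples as well as labeled ones.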

In summary, semi-structured data in CSV files presents unique challenges and opportunities for machine learning and data analytics. Researchers continue to explore innovative techniques to handle semi-structured data effectively, emphasizing the importance of preprocessing, feature engineering, and specialized algorithms to extract valuable insights from CSV files with irregular data structures.

References:

Engelen, J., & Hoos, H. (2019). A survey on semi-supervised learning. Machine Learning, 109(2), 373-440. https://doi.org/10.1007/s10994-019-05855-6

Miyato, T., Maeda, S., Koyama, M., & Ishii, S. (2019). Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 1979-1993. https://doi.org/10.1109/tpami.2018.2858821

Auto ML Disadvantages.