DATA PIPELINE IN PYTHON AND WITH SCIKIT-LEARN

Linh Mai
5 min read · Dec 12, 2020

In this new tech era, machine learning has become familiar to many more people as it shapes and simplifies countless aspects of our daily lives, such as work, travel, and communication. Building a machine learning model typically involves four main steps: 1) data acquisition, 2) data pre-processing, 3) training and testing the model, and 4) evaluating the model. The workflow behaves like a pipe: the output of one step becomes the input of the next. This is where a tool like a pipeline comes in handy to keep the machine learning process running smoothly while making it less tedious and time-consuming.

WHAT IS A DATA PIPELINE?

A data pipeline automates machine learning workflows by chaining together the steps that move and transform data. The mechanism is very similar to something you rely on every day: a water pipeline. A data pipeline consists of three main elements: a source, a series of processing steps, and a destination. Raw data moves from point A (the data producers) to point B (the data consumers) through a series of intermediate steps, and it is these steps that form the data pipeline. Common steps include data cleansing, augmentation, enrichment, filtering, grouping, and aggregation.
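
As a toy illustration of this idea in plain Python (with made-up records), a source, a few processing steps, and a destination might look like this:

# A toy sketch of the source -> processing steps -> destination idea behind a data pipeline
raw_records = [' 42 ', 'n/a', ' 17 ', ' 99 ']          # source: raw data from the producers

def cleanse(records):
    # drop records that cannot be parsed
    return [r.strip() for r in records if r.strip().isdigit()]

def transform(records):
    # convert the cleansed strings into numbers
    return [int(r) for r in records]

def aggregate(values):
    # reduce the transformed values to a single summary
    return sum(values)

destination = aggregate(transform(cleanse(raw_records)))  # destination: data for the consumers
print(destination)  # 158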

Reading this definition, many people might think that a machine learning pipeline is a one-way operation. In practice, machine learning pipelines are often iterative: the steps are repeated to continuously improve the accuracy of the model until a successful algorithm is achieved.

One type of pipeline automates the flow of data into the model: it transforms the incoming data and feeds it into a model whose outputs can be analyzed later. Another type splits the machine learning workflow into independent, reusable, modular parts that can then be pipelined together to create models. This second type makes building machine learning models more efficient and simpler by cutting out redundant work.
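
As a rough sketch of the second, modular style (the helper make_model_pipeline is a name made up for this illustration), the same preprocessing recipe can be reused with different final estimators:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def make_model_pipeline(estimator):
    # reuse the same preprocessing step with any final estimator
    return Pipeline([('scaler', StandardScaler()), ('clf', estimator)])

logreg_model = make_model_pipeline(LogisticRegression())
tree_model = make_model_pipeline(DecisionTreeClassifier())
# Each pipeline can now be fitted and evaluated independently.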

WHY USE DATA PIPELINES?

In this tech era, an enormous amount of data is created every day, and incorporating data pipelines into the workflow does far more good than harm. While a data pipeline is not necessary for every company, the tool is especially useful for organizations that:

  • Generate, rely on, or store large amounts of data or various data sources
  • Maintain siloed data sources
  • Need real-time or highly advanced data analysis
  • Use cloud storage for their data

In a business setting, multiple departments use different apps for different purposes. For instance, the sales team uses Salesforce to manage leads, the marketing department utilizes HubSpot and Marketo for marketing automation, and the product team uses MongoDB to store customer information. Without data pipelines to consolidate data from these sources into a common destination, data silos can arise: collections of information that are isolated from, and inaccessible to, other parts of the organization. Data silos make extracting data for business analysis difficult and error-prone, for example by introducing data redundancy. Unifying the data workflows within pipelines offers several advantages:

  • Employees, regardless of department, work from the same data at all times, which eliminates conflicts or debates about the data source or its validity,
  • The business can save a lot by reducing the time and money spent on data wrangling, manipulation, and analysis,
  • Everyone within the company has the flexibility to explore and analyze the data without worrying about breaking anything

DATA PIPELINE WITH SCIKIT-LEARN

According to the documentation by scikit-learn:

Pipeline of transforms with a final estimator.

In other words, sklearn.pipeline.Pipeline is usually understood as a sequence of transformers (objects implementing fit and transform), followed by a final predictor (an object implementing fit and predict, and optionally predict_proba, decision_function, and so on).

A pipeline is created by combining different steps so that data flows sequentially through the aggregate model. It exposes the standard fitting and predicting interface, which makes the training process much more organized.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
pipeline.fit(X_train, y_train)
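
Once fitted, the whole pipeline behaves like a single estimator: calling predict or score on it first scales the new data and then runs the SVC, so nothing has to be transformed by hand:

y_pred = pipeline.predict(X_test)        # the scaler fitted on X_train is applied automatically
print(pipeline.score(X_test, y_test))    # mean accuracy on the hold-out set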

Some examples of objects that can fit into a pipeline:

  • Imputers: using SimpleImputer or KNNImputer when the data has missing values
  • Encoders: using LabelEncoder or OneHotEncoder when working with non-binary categorical data
  • NLP Vectorizers: using CountVectorizer, TfidfVectorizer, or HashingVectorizer when working with NLP data
  • Numerical Transformations: using standardization, normalization, and min-max scalers (the sketch after this list combines a few of these in one pipeline)
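
As an illustration, the sketch below (with made-up column names) combines an imputer, an encoder, and a scaler into a single pipeline by routing numeric and categorical columns through a ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['age', 'income']    # hypothetical numeric columns
categorical_features = ['city']         # hypothetical categorical column

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

model = Pipeline([('preprocessor', preprocessor),
                  ('clf', LogisticRegression())])
# model.fit(X_train, y_train) would then impute, encode, scale, and fit in one call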

EXAMPLE: USING PIPELINE IN MACHINE LEARNING

Below is a simple example of using a data pipeline together with GridSearchCV. First, we import Pipeline from imblearn and define a helper function that runs the grid search:

from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def gridsearch_cv(clf, params, X_train, y_train, cv):
    # wrap the classifier in a pipeline so its hyperparameters
    # can be addressed as 'clf__<param>' in the parameter grid
    pipeline = Pipeline([('clf', clf)])
    gs = GridSearchCV(pipeline, params, cv=cv, n_jobs=-1,
                      scoring='f1', return_train_score=True)
    gs.fit(X_train, y_train)
    return gs

After that, the parameter grid needs to be defined for the classifier that we want to use. In this example, we will use logistic regression. The clf__ prefix in each key tells GridSearchCV to route that parameter to the pipeline step named 'clf'.

logistic_regression_params = {'clf__solver': ['liblinear'],
                              'clf__C': [0.1, 1, 10],
                              'clf__penalty': ['l2', 'l1']}

We use a train-test split to verify that our model works well before we deploy it. The model is trained on the training set, and the test set is used to estimate how well it performs on new data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=42)

We also need to set the number of splits for the stratified k-fold cross-validation that the grid search will use:

from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Now we can use the pipeline-based helper to run GridSearchCV with the logistic regression model:

from sklearn.linear_model import LogisticRegression

logis_reg = LogisticRegression(C=1.0)
gs_logis = gridsearch_cv(logis_reg, logistic_regression_params,
                         X_train, y_train, kf)
predictions = gs_logis.predict(X_test)
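
Because the grid search optimizes the f1 score, a quick check of the tuned model on the hold-out set (a small sketch using the objects created above) could look like this:

from sklearn.metrics import f1_score

print(gs_logis.best_params_)            # best hyperparameter combination found by the search
print(f1_score(y_test, predictions))    # f1 score of the tuned model on the hold-out set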

CONCLUSION

The use of data pipelines will continue to grow to accommodate larger data volumes and more transformation steps. Data pipelines help enforce the desired order of the application's steps, creating a convenient and time-saving workflow and ensuring the reproducibility of the work. Learning to use data pipelines can be a little tricky the first few times, but once you get the hang of it, they become a tremendously powerful tool, especially for machine learning.
