{ "cells": [ { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-04T13:44:39.449161Z", "start_time": "2019-12-04T13:44:39.443285Z" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning: the Iris dataset\n", "\n", "If you want to try out this notebook with a live Python kernel, use mybinder:\n", "\n", "\"https://mybinder.org/badge_logo.svg\"\n", "\n", "While `vaex.ml` does not yet implement predictive models, we provide wrappers to powerful libraries (e.g. [Scikit-learn](https://scikit-learn.org/), [xgboost](https://xgboost.readthedocs.io/)) and make them work efficiently with `vaex`. `vaex.ml` does implement a variety of standard data transformers (e.g. PCA, numerical scalers, categorical encoders) and a very efficient KMeans algorithm that take full advantage of `vaex`.\n", "\n", "The following is a simple example of how to use `vaex.ml`. We will be using the well-known Iris dataset to build a model that distinguishes between the three Iris species ([Iris setosa](https://en.wikipedia.org/wiki/Iris_setosa), [Iris virginica](https://en.wikipedia.org/wiki/Iris_virginica) and [Iris versicolor](https://en.wikipedia.org/wiki/Iris_versicolor)).\n", "\n", "Let's start by importing the common libraries, then loading and inspecting the data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2020-01-14T14:31:16.402715Z", "start_time": "2020-01-14T14:31:14.887730Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# sepal_length sepal_width petal_length petal_width class_
0 5.9 3.0 4.2 1.5 1
1 6.1 3.0 4.6 1.4 1
2 6.6 2.9 4.6 1.3 1
3 6.7 3.3 5.7 2.1 2
4 5.5 4.2 1.4 0.2 0
... ... ... ... ... ...
145 5.2 3.4 1.4 0.2 0
146 5.1 3.8 1.6 0.2 0
147 5.8 2.6 4.0 1.2 1
148 5.7 3.8 1.7 0.3 0
149 6.2 2.9 4.3 1.3 1
" ], "text/plain": [ "# sepal_length sepal_width petal_length petal_width class_\n", "0 5.9 3.0 4.2 1.5 1\n", "1 6.1 3.0 4.6 1.4 1\n", "2 6.6 2.9 4.6 1.3 1\n", "3 6.7 3.3 5.7 2.1 2\n", "4 5.5 4.2 1.4 0.2 0\n", "... ... ... ... ... ...\n", "145 5.2 3.4 1.4 0.2 0\n", "146 5.1 3.8 1.6 0.2 0\n", "147 5.8 2.6 4.0 1.2 1\n", "148 5.7 3.8 1.7 0.3 0\n", "149 6.2 2.9 4.3 1.3 1" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import vaex\n", "import vaex.ml\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "df = vaex.datasets.iris()\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Splitting the data into _train_ and _test_ sets should be done immediately, before any manipulation is done on the data. `vaex.ml` contains a `train_test_split` method which creates shallow copies of the main DataFrame, meaning that no extra memory is used when defining the train and test sets. Note that the `train_test_split` method does an ordered split of the main DataFrame to create the two sets. In some cases, one may need to shuffle the data first.\n", "\n", "If shuffling is required, we recommend the following:\n", "```\n", "df.shuffle().export(\"shuffled.hdf5\")\n", "df = vaex.open(\"shuffled.hdf5\")\n", "df_train, df_test = df.ml.train_test_split(test_size=0.2)\n", "```\n", "\n", "In the present scenario, the dataset is already shuffled, so we can simply do the split right away." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-01-14T14:31:16.430056Z", "start_time": "2020-01-14T14:31:16.404209Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/ml/__init__.py:209: UserWarning: Make sure the DataFrame is shuffled\n", " warnings.warn('Make sure the DataFrame is shuffled')\n" ] } ], "source": [ "# Ordered split into train and test sets\n", "df_train, df_test = df.ml.train_test_split(test_size=0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As this is a very simple tutorial, we will just use the columns already provided as features for training the model." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2020-01-14T14:31:17.525079Z", "start_time": "2020-01-14T14:31:17.520285Z" } }, "outputs": [ { "data": { "text/plain": [ "['sepal_length', 'sepal_width', 'petal_length', 'petal_width']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features = df_train.column_names[:4]\n", "features" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-04T14:06:45.512795Z", "start_time": "2019-12-04T14:06:45.510575Z" } }, "source": [ "## PCA\n", "\n", "The `vaex.ml` module contains several classes for dataset transformations that are commonly used to pre-process data prior to building a model. These include numerical feature scalers, category encoders, and [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) transformations. We have adopted the [scikit-learn](https://scikit-learn.org/stable/) API, meaning that all transformers have the `.fit` and `.transform` methods. \n", "\n", "Let's apply a PCA transformation to the training set. There is no need to scale the data beforehand, since the PCA also normalizes the data." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2020-01-14T14:31:18.975817Z", "start_time": "2020-01-14T14:31:18.926724Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# sepal_length sepal_width petal_length petal_width class_ PCA_0 PCA_1 PCA_2 PCA_3
0 5.4 3.0 4.5 1.5 1 -0.5819340944906611 -0.5192084328455534 -0.4079706950207428 -0.22843325658378022
1 4.8 3.4 1.6 0.2 0 2.628040487885542 -0.05578001049524599 -0.09961452867004605 -0.14960589756342935
2 6.9 3.1 4.9 1.5 1 -1.438496521671396 0.5307778852279289 0.32322065776316616 -0.0066478967991949744
3 4.4 3.2 1.3 0.2 0 3.00633586736142 -0.41909744036887703 -0.17571839830952185 -0.05420541515837107
4 5.6 2.8 4.9 2.0 2 -1.1948465297428466 -0.6200295372229213 -0.4751905348367903 0.08724845774327505
... ... ... ... ... ... ... ... ... ...
115 5.2 3.4 1.4 0.2 0 2.6608856211270933 0.2619681501203415 0.12886483875694454 0.06429707648769989
116 5.1 3.8 1.6 0.2 0 2.561545765055359 0.4288927940763031 -0.18633294617759266 -0.20573646329612738
117 5.8 2.6 4.0 1.2 1 -0.22075578997244774 -0.40152336651555137 0.25417836518749715 0.04952191889168374
118 5.7 3.8 1.7 0.3 0 2.23068249078231 0.826166758833374 0.07863720599424912 0.0004035597987264161
119 6.2 2.9 4.3 1.3 1 -0.6256358184862005 0.023930474333675168 0.21203674475657858 -0.0077954052328795265
" ], "text/plain": [ "# sepal_length sepal_width petal_length petal_width class_ PCA_0 PCA_1 PCA_2 PCA_3\n", "0 5.4 3.0 4.5 1.5 1 -0.5819340944906611 -0.5192084328455534 -0.4079706950207428 -0.22843325658378022\n", "1 4.8 3.4 1.6 0.2 0 2.628040487885542 -0.05578001049524599 -0.09961452867004605 -0.14960589756342935\n", "2 6.9 3.1 4.9 1.5 1 -1.438496521671396 0.5307778852279289 0.32322065776316616 -0.0066478967991949744\n", "3 4.4 3.2 1.3 0.2 0 3.00633586736142 -0.41909744036887703 -0.17571839830952185 -0.05420541515837107\n", "4 5.6 2.8 4.9 2.0 2 -1.1948465297428466 -0.6200295372229213 -0.4751905348367903 0.08724845774327505\n", "... ... ... ... ... ... ... ... ... ...\n", "115 5.2 3.4 1.4 0.2 0 2.6608856211270933 0.2619681501203415 0.12886483875694454 0.06429707648769989\n", "116 5.1 3.8 1.6 0.2 0 2.561545765055359 0.4288927940763031 -0.18633294617759266 -0.20573646329612738\n", "117 5.8 2.6 4.0 1.2 1 -0.22075578997244774 -0.40152336651555137 0.25417836518749715 0.04952191889168374\n", "118 5.7 3.8 1.7 0.3 0 2.23068249078231 0.826166758833374 0.07863720599424912 0.0004035597987264161\n", "119 6.2 2.9 4.3 1.3 1 -0.6256358184862005 0.023930474333675168 0.21203674475657858 -0.0077954052328795265" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pca = vaex.ml.PCA(features=features, n_components=4)\n", "df_train = pca.fit_transform(df_train)\n", "df_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of the PCA `.fit_transform` method is a shallow copy of the DataFrame which contains the resulting columns of the transformation, in this case the PCA components, as virtual columns. Since virtual columns are evaluated lazily, the transformed DataFrame takes virtually no additional memory! So while this example uses only 120 samples, it would work in the same way even for millions or billions of samples.\n", "\n", "## Gradient boosting trees\n", "\n", "Now let's train a gradient boosting model. 
While `vaex.ml` does not currently include this type of model, we support the popular boosted-tree libraries [xgboost](https://xgboost.readthedocs.io/en/latest/), [lightgbm](https://lightgbm.readthedocs.io/en/latest/), and [catboost](https://catboost.ai/). In this tutorial we will use the `lightgbm` classifier." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2020-01-14T14:31:40.074344Z", "start_time": "2020-01-14T14:31:39.899968Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# sepal_length sepal_width petal_length petal_width class_ PCA_0 PCA_1 PCA_2 PCA_3 prediction
0 5.4 3.0 4.5 1.5 1 -0.5819340944906611 -0.5192084328455534 -0.4079706950207428 -0.22843325658378022 1
1 4.8 3.4 1.6 0.2 0 2.628040487885542 -0.05578001049524599 -0.09961452867004605 -0.14960589756342935 0
2 6.9 3.1 4.9 1.5 1 -1.438496521671396 0.5307778852279289 0.32322065776316616 -0.0066478967991949744 1
3 4.4 3.2 1.3 0.2 0 3.00633586736142 -0.41909744036887703 -0.17571839830952185 -0.05420541515837107 0
4 5.6 2.8 4.9 2.0 2 -1.1948465297428466 -0.6200295372229213 -0.4751905348367903 0.08724845774327505 2
... ... ... ... ... ... ... ... ... ... ...
115 5.2 3.4 1.4 0.2 0 2.6608856211270933 0.2619681501203415 0.12886483875694454 0.06429707648769989 0
116 5.1 3.8 1.6 0.2 0 2.561545765055359 0.4288927940763031 -0.18633294617759266 -0.20573646329612738 0
117 5.8 2.6 4.0 1.2 1 -0.22075578997244774 -0.40152336651555137 0.25417836518749715 0.04952191889168374 1
118 5.7 3.8 1.7 0.3 0 2.23068249078231 0.826166758833374 0.07863720599424912 0.0004035597987264161 0
119 6.2 2.9 4.3 1.3 1 -0.6256358184862005 0.023930474333675168 0.21203674475657858 -0.0077954052328795265 1
" ], "text/plain": [ "# sepal_length sepal_width petal_length petal_width class_ PCA_0 PCA_1 PCA_2 PCA_3 prediction\n", "0 5.4 3.0 4.5 1.5 1 -0.5819340944906611 -0.5192084328455534 -0.4079706950207428 -0.22843325658378022 1\n", "1 4.8 3.4 1.6 0.2 0 2.628040487885542 -0.05578001049524599 -0.09961452867004605 -0.14960589756342935 0\n", "2 6.9 3.1 4.9 1.5 1 -1.438496521671396 0.5307778852279289 0.32322065776316616 -0.0066478967991949744 1\n", "3 4.4 3.2 1.3 0.2 0 3.00633586736142 -0.41909744036887703 -0.17571839830952185 -0.05420541515837107 0\n", "4 5.6 2.8 4.9 2.0 2 -1.1948465297428466 -0.6200295372229213 -0.4751905348367903 0.08724845774327505 2\n", "... ... ... ... ... ... ... ... ... ... ...\n", "115 5.2 3.4 1.4 0.2 0 2.6608856211270933 0.2619681501203415 0.12886483875694454 0.06429707648769989 0\n", "116 5.1 3.8 1.6 0.2 0 2.561545765055359 0.4288927940763031 -0.18633294617759266 -0.20573646329612738 0\n", "117 5.8 2.6 4.0 1.2 1 -0.22075578997244774 -0.40152336651555137 0.25417836518749715 0.04952191889168374 1\n", "118 5.7 3.8 1.7 0.3 0 2.23068249078231 0.826166758833374 0.07863720599424912 0.0004035597987264161 0\n", "119 6.2 2.9 4.3 1.3 1 -0.6256358184862005 0.023930474333675168 0.21203674475657858 -0.0077954052328795265 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import lightgbm\n", "import vaex.ml.sklearn\n", "\n", "# Features on which to train the model\n", "train_features = df_train.get_column_names(regex='PCA_.*')\n", "# The target column\n", "target = 'class_'\n", "\n", "# Instantiate the LightGBM Classifier\n", "booster = lightgbm.sklearn.LGBMClassifier(num_leaves=5, \n", " max_depth=5, \n", " n_estimators=100,\n", " random_state=42)\n", "\n", "# Make it a vaex transformer (for the automagic pipeline and lazy predictions)\n", "model = vaex.ml.sklearn.Predictor(features=train_features, \n", " target=target,\n", " model=booster, \n", " prediction_name='prediction')\n", "\n", "# Train and predict\n", 
"model.fit(df=df_train)\n", "df_train = model.transform(df=df_train)\n", "\n", "df_train" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-04T14:40:52.347957Z", "start_time": "2019-12-04T14:40:52.345452Z" } }, "source": [ "Notice that after training the model, we use the `.transform` method to obtain a shallow copy of the DataFrame which contains the prediction of the model, in the form of a virtual column. This makes it easy to evaluate the model and to create various diagnostic plots. If required, one can call the `.predict` method, which will result in an in-memory `numpy.array` holding the predictions.\n", "\n", "## Automatic pipelines\n", "\n", "Assuming we are happy with the performance of the model, we can continue and apply our transformations and model to the test set. Unlike other libraries, we do not need to explicitly create a pipeline here in order to propagate the transformations. In fact, with `vaex` and `vaex.ml`, a pipeline is created automatically as one explores the data. Each `vaex` DataFrame contains a _state,_ which is a (serializable) object containing information about all transformations applied to the DataFrame (filtering, creation of new virtual columns, transformations).\n", "\n", "Recall that the outputs of both the PCA transformation and the boosted model were in fact virtual columns, and thus are stored in the state of `df_train`. All we need to do is apply this state to another similar DataFrame (e.g. the test set), and all the changes will be propagated." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2020-01-14T14:31:24.204646Z", "start_time": "2020-01-14T14:31:24.114304Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# sepal_length sepal_width petal_length petal_width class_ PCA_0 PCA_1 PCA_2 PCA_3 prediction
0 5.9 3.0 4.2 1.5 1 -0.4978687101343986 -0.11289245880584761 -0.11962601206069637 0.0625954090178564 1
1 6.1 3.0 4.6 1.4 1 -0.8754765898560835 -0.03902402119573594 0.022944044447894815 -0.14143773065379384 1
2 6.6 2.9 4.6 1.3 1 -1.0228803632878913 0.2503709022470443 0.4130613754204865 -0.030391911559003282 1
3 6.7 3.3 5.7 2.1 2 -2.2544508624315838 0.3431374410700749 -0.28908707579214765 -0.07059175451207655 2
4 5.5 4.2 1.4 0.2 0 2.632289228948536 1.020394958612415 -0.20769510079946696 -0.13744144140286718 0
... ... ... ... ... ... ... ... ... ... ...
25 5.5 2.5 4.0 1.3 1 -0.16189655085432594 -0.6871827581512436 0.09773053160021669 0.07093166682594204 1
26 5.8 2.7 3.9 1.2 1 -0.12526327170089271 -0.3148233189949767 0.19720893202789733 0.060419826927667064 1
27 4.4 2.9 1.4 0.2 0 2.8918941837640526 -0.6426744898497139 0.006171795874510444 0.007700652884580328 0
28 4.5 2.3 1.3 0.3 0 2.850207707200544 -0.9710397723109179 0.38501428492268475 0.377723418991853 0
29 6.9 3.2 5.7 2.3 2 -2.405639277483925 0.4027072938482219 -0.22944817803540973 0.17443211711742812 2
" ], "text/plain": [ "# sepal_length sepal_width petal_length petal_width class_ PCA_0 PCA_1 PCA_2 PCA_3 prediction\n", "0 5.9 3.0 4.2 1.5 1 -0.4978687101343986 -0.11289245880584761 -0.11962601206069637 0.0625954090178564 1\n", "1 6.1 3.0 4.6 1.4 1 -0.8754765898560835 -0.03902402119573594 0.022944044447894815 -0.14143773065379384 1\n", "2 6.6 2.9 4.6 1.3 1 -1.0228803632878913 0.2503709022470443 0.4130613754204865 -0.030391911559003282 1\n", "3 6.7 3.3 5.7 2.1 2 -2.2544508624315838 0.3431374410700749 -0.28908707579214765 -0.07059175451207655 2\n", "4 5.5 4.2 1.4 0.2 0 2.632289228948536 1.020394958612415 -0.20769510079946696 -0.13744144140286718 0\n", "... ... ... ... ... ... ... ... ... ... ...\n", "25 5.5 2.5 4.0 1.3 1 -0.16189655085432594 -0.6871827581512436 0.09773053160021669 0.07093166682594204 1\n", "26 5.8 2.7 3.9 1.2 1 -0.12526327170089271 -0.3148233189949767 0.19720893202789733 0.060419826927667064 1\n", "27 4.4 2.9 1.4 0.2 0 2.8918941837640526 -0.6426744898497139 0.006171795874510444 0.007700652884580328 0\n", "28 4.5 2.3 1.3 0.3 0 2.850207707200544 -0.9710397723109179 0.38501428492268475 0.377723418991853 0\n", "29 6.9 3.2 5.7 2.3 2 -2.405639277483925 0.4027072938482219 -0.22944817803540973 0.17443211711742812 2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "state = df_train.state_get()\n", "df_test.state_set(state)\n", "\n", "df_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Production\n", "\n", "Now `df_test` contains all the transformations we applied on the training set (`df_train`), including the model prediction. The transfer of state from one DataFrame to another can be extremely valuable for putting models in production.\n", "\n", "## Performance\n", "Finally, let's check the model performance." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2020-01-14T14:31:27.603150Z", "start_time": "2020-01-14T14:31:27.590066Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy: 100.0%\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "acc = accuracy_score(y_true=df_test.class_.values, y_pred=df_test.prediction.values)\n", "acc *= 100.\n", "print(f'Test set accuracy: {acc}%')" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-04T15:03:26.881910Z", "start_time": "2019-12-04T15:03:26.872187Z" } }, "source": [ "The model achieves a perfect accuracy of 100% on the test set. This is not surprising, as this problem is rather easy: doing a PCA transformation on the features nicely separates the three flower species. Plotting the first two PCA axes and colouring the samples according to their class already shows an almost perfect separation." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2020-01-14T14:31:28.960052Z", "start_time": "2020-01-14T14:31:28.793818Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(8, 4))\n", "df_test.scatter(df_test.PCA_0, df_test.PCA_1, c_expr=df_test.class_, s=50)\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 2 }