{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Handling missing or invalid data\n", "\n", "Data in the real world is seldom clean and never perfect. It often happens that we end up with \"missing\" or \"invalid\" data. There are countless reasons for why data can be missing: an instrument failed to make a recording in the real world, there was a temporary or no connection between the instrument and the computer storing the readings, maybe our scraper failed to gather all of the data, or our tracking tool did not manage to record all events.. this list can go on and on.\n", "\n", "In addition to this, during our analysis we can sometimes make a wrong turn and \"corrupt\" our data by dividing by zero, or taking the logarithm of a negative number. In addition, a sensor or a human may record invalid values that we want to highlight in a special way.\n", "\n", "In Vaex we have 3 ways of representing these special values:\n", "\n", " - \"missing\" or \"masked\" values;\n", " - \"not a number\" or `nan` values;\n", " - \"not available\" or `na` values.\n", "\n", " If you have used Vaex, you may have noticed some DataFrame methods, Expression methods, or method arguments referencing \"missing\", \"nan\", \"na\". Here are some examples:\n", "\n", "| \"missing\" | \"nan\" | \"na\" |\n", "| ------------------- | --------------- | -------------- |\n", "| `df.dropmissing` | `df.dropnan` | `df.dropna` |\n", "| `df.x.countmissing` | `df.x.countnan` | `df.x.countna` |\n", "| `df.x.ismissing` | `df.x.isnan` | `df.x.isna` |\n", "| `df.x.fillmissing` | `df.x.fillnan` | `df.x.fillna` |\n", "\n", "In what follows we will explain the difference between these 3 types of values, when they should be used, and why does Vaex makes the distinction between them. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## \"nan\" vs \"missing\" vs \"na\"\n", "\n", "### Summary (TLDR;)\n", "\n", "The following table summarizes the differences between missing values, `nan` values and `na`:\n", "\n", "| | missing or masked values | Not a number (`nan`) | Not available (`na`) |\n", "| --------- | ----------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- |\n", "| `dtype` | Any `dtype` | Float | Any `dtype` but only truly relevant for float |\n", "| Meaning | Total absence of data | Data is present, but is corrupted or can not be represented in numeric form (e.g. `log(-5)`) | Union of missing and `nan` values |\n", "| Use case | Sensor did not make a measurement | Sensor made a measurement but the data is corrupted, or mathematical transformation leads to an invalid / non-numerical values | It is up to the user to decide | |\n", "\n", "### Not a number or `nan`\n", "\n", "Many data practitioners, perhaps erroneously, interchangeably use the term `nan` and the term missing values. In fact `nan` values are commonly used as sentinel values to generally indicate invalid data. This is inaccurate because `nan` values are in fact special float values. `nan` is a shorthand for \"not a number\", which is meant to indicate a value that is not a number in a sequence of floats, and thus in itself is not missing. 
"So why are `nan` values treated as synonymous with missing values? It is hard to tell. One guess is that data practitioners found `numpy.nan` a convenient shortcut for representing a \"missing\" or invalid value in arrays. Numpy does have a proper way of indicating missing values via masked arrays (more on that in the next section), but for many that API is less convenient and requires additional knowledge of how to handle those array types. This effect was likely reinforced by Pandas, in which for a long time `nan` values were the only way to indicate both invalid/corrupted and truly missing data.\n", "\n", "### Missing or masked values\n", "\n", "Perhaps a better way to mark the absence of data is via missing or masked values. Python itself has a special object to indicate missing or no data: the `None` object, which has its own `NoneType` type. The `None` object in Python is equivalent to the `NULL` value in SQL.\n", "\n", "Modern data analysis libraries also implement their own ways of indicating missing values. For arrays that have missing data, Numpy implements so-called \"masked arrays\". When constructing such an array, in addition to the data one is also required to provide a boolean mask. A `True` value in the mask array indicates that the corresponding element in the data array is missing. In the example below, the last element of the masked array is missing:\n", "\n", "```python\n", "import numpy as np\n", "\n", "data = [23, 31, 0]\n", "mask = [False, False, True]\n", "\n", "my_masked_array = np.ma.masked_array(data, mask)\n", "```\n", "\n", "Pyarrow also implements a `null` type to indicate missing values in its data structures. Unlike Numpy, which uses byte masks, Pyarrow uses bitmasks to indicate missing data, which makes it more memory efficient. Note that in Pyarrow a mask value of 1 means that the data is present, while 0 indicates missing data. Similarly to Vaex, Pyarrow also makes the distinction between `nan` and `null` values.\n", "\n", "In more recent versions, Pandas also implements a `pd.NA` value to indicate missing values, which can be used in arrays or Series of various types, not just float.\n", "\n", "In Vaex, missing data are `null` values if the underlying array is backed by Pyarrow, and masked values if the underlying array is a Numpy masked array.\n", "\n", "When are missing values used in practice? They are used to indicate data that was simply not collected, e.g. a sensor was scheduled to make a reading but did not, or a doctor was supposed to make a scan of a patient but did not.\n", "\n", "To contrast with `nan` values: missing or masked values indicate a complete absence of data, while `nan` values indicate the presence of data that can not be interpreted numerically. This is a subtle but sometimes important distinction to make.\n", "\n", "### Not available or `na`\n", "\n", "Vaex also implements methods referring to `na`, which stands for \"not available\" and is the union of both `nan` and missing values, as the short sketch below illustrates.\n", "\n",
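"As a rough illustration (a hypothetical toy DataFrame, not taken from the Vaex documentation), the `na` methods simply combine the other two notions:\n", "\n", "```python\n", "import vaex\n", "import numpy as np\n", "\n", "# a float column with one masked (missing) value and one nan value\n", "x = np.ma.array(data=[1.0, 2.0, np.nan], mask=[False, True, False])\n", "df = vaex.from_arrays(x=x)\n", "\n", "print(df.x.countmissing())  # expected: 1 (the masked value)\n", "print(df.x.countnan())      # expected: 1 (the nan value)\n", "print(df.x.countna())       # expected: 2 (missing and nan combined)\n", "```\n", "\n",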
"This only really matters when dealing with Expressions of float type, since that is the only type that can have both missing and `nan` values. Of course, if you do not need to distinguish between `nan` and missing values in your code, you can use the methods that refer to `na` to cover both cases and simplify development.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Examples\n", "\n", "Let us consider the following DataFrame:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": {
" ], "text/plain": [ " # x y z w\n", " 0 1.0 10 Reggie Miller {'city': 'Indianapolis', 'team': 'Pacers'}\n", " 1 -- 20 Michael Jordan --\n", " 2 3.0 -- -- {'city': 'Dallas', 'team': 'Mavericks'}\n", " 3 4.0 40 -- --\n", " 4 nan 50 Kobe Bryant {'city': 'Los Angeles', 'team': 'Lakers'}" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import vaex\n", "import numpy as np\n", "import pyarrow as pa\n", "\n", "x = np.ma.array(data=[1, 0, 3, 4, np.nan], mask=[False, True, False, False, False])\n", "y = pa.array([10, 20, None, 40, 50])\n", "z = pa.array(['Reggie Miller', 'Michael Jordan', None, None, 'Kobe Bryant'])\n", "w = pa.array([\n", " {'city': 'Indianapolis', 'team': 'Pacers'}, \n", " None,\n", " {'city': 'Dallas', 'team': 'Mavericks'},\n", " None, \n", " {'city': 'Los Angeles', 'team': 'Lakers'}\n", " ])\n", "\n", "df = vaex.from_arrays(x=x, y=y, z=z, w=w)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `df` contains a float column `x` which in turn contains both a missing (masked) value and a `nan` value. The columns `y`, `z`, and `w` which are of `dtype` int, string, and struct respectively can only contain masked values in addition to their nominal type. \n", "\n", "For example, if we want to drop all rows with missing values from the entire DataFrame, we can use the `dropmissing` method:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
" ], "text/plain": [ " # x y z w\n", " 0 1 10 Reggie Miller {'city': 'Indianapolis', 'team': 'Pacers'}\n", " 1 nan 50 Kobe Bryant {'city': 'Los Angeles', 'team': 'Lakers'}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropmissing()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that all missing (masked) values are dropped, but the `nan` value in column `x` is still present since it is not technically \"missing\". \n", "\n", "If we want drop all `nan` values from the DataFrame we can do so via the corresponding `dropnan` method:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
" ], "text/plain": [ " # x y z w\n", " 0 1.0 10 Reggie Miller {'city': 'Indianapolis', 'team': 'Pacers'}\n", " 1 -- 20 Michael Jordan --\n", " 2 3.0 -- -- {'city': 'Dallas', 'team': 'Mavericks'}\n", " 3 4.0 40 -- --" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropnan()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we see that the `nan` value from the column `x` is no longer in the DataFrame, but all the other missing values are still there. \n", "\n", "If we simply want to get rid of all values that are not available for us to use directly, we can use the `dropna` method:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
" ], "text/plain": [ " # x y z w\n", " 0 1 10 Reggie Miller {'city': 'Indianapolis', 'team': 'Pacers'}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we see that only rows containing valid data entries remain. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.12 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "2b337e1aa502f5cea9a92c761ad75d3ab5045107ee3446fdbe7f873d4f1936e7" } } }, "nbformat": 4, "nbformat_minor": 2 }