Handling missing or invalid data¶
Data in the real world is seldom clean and never perfect. It often happens that we end up with “missing” or “invalid” data. There are countless reasons for why data can be missing: an instrument failed to make a recording in the real world, there was a temporary or no connection between the instrument and the computer storing the readings, maybe our scraper failed to gather all of the data, or our tracking tool did not manage to record all events.. this list can go on and on.
In addition to this, during our analysis we can sometimes make a wrong turn and “corrupt” our data by dividing by zero, or taking the logarithm of a negative number. In addition, a sensor or a human may record invalid values that we want to highlight in a special way.
In Vaex we have 3 ways of representing these special values:
“missing” or “masked” values;
“not a number” or
nan
values;“not available” or
na
values.
If you have used Vaex, you may have noticed some DataFrame methods, Expression methods, or method arguments referencing “missing”, “nan”, “na”. Here are some examples:
“missing” 
“nan” 
“na” 













In what follows we will explain the difference between these 3 types of values, when they should be used, and why does Vaex makes the distinction between them.
“nan” vs “missing” vs “na”¶
Summary (TLDR;)¶
The following table summarizes the differences between missing values, nan
values and na
:
missing or masked values 
Not a number ( 
Not available ( 



Any 
Float 
Any 
Meaning 
Total absence of data 
Data is present, but is corrupted or can not be represented in numeric form (e.g. 
Union of missing and 
Use case 
Sensor did not make a measurement 
Sensor made a measurement but the data is corrupted, or mathematical transformation leads to an invalid / nonnumerical values 
It is up to the user to decide 
Not a number or nan
¶
Many data practitioners, perhaps erroneously, interchangeably use the term nan
and the term missing values. In fact nan
values are commonly used as sentinel values to generally indicate invalid data. This is inaccurate because nan
values are in fact special float values. nan
is a shorthand for “not a number”, which is meant to indicate a value that is not a number in a sequence of floats, and thus in itself is not missing. It is used to represent values that are undefined
mathematically, such as 0/0
or log(5)
, or for data that does exist but is corrupted or can not be represented in numerical form. Note that there is no such corresponding value for integers for example, or for nonnumeric types such as string.
In Python one can use nan
values via the math
standard library (e.g.: math.nan
) or via the numpy
library (e.g.: numpy.nan
).
So why are nan
values synonymous with missing values? It is hard to tell. One guess is that data practitioners found using numpy.nan
a convenient shortcut to representing an “missing” or invalid value in arrays. Numpy does have a proper way of indicating a missing values via masked arrays (more on that in the next section), but for many that API can be less convenient and requires an addition knowledge of how to handle those array types. This effect might have been more enhanced by
Pandas, in which for a long time nan
values were the only way to indicate both invalid/corrupted and truly missing data.
Missing or masked values¶
Perhaps a better way to mark the absence of data is via missing or masked values. Python itself has a special object to indicate missing or no data, and that is the None
object, which has its own NoneType
type. The None
object in Python is equivalent to the NULL
value in SQL.
Modern data analysis libraries also implement their own ways of indicating missing values. For arrays that have missing data, Numpy implements socalled “masked arrays”. When constructing the arrays, in addition to data one is also required to provide a boolean mask. A True
value in the mask array, indicates that the corresponding element in the data array is missing. In the example below, the last element of the masked array is missing:
import numpy as np
data = [23, 31, 0]
mask = [False, False, True]
my_masked_array = np.ma.masked_array(data, mask)
Pyarrow also implements a null
type to indicate missing values in their data structures. Unlike Numpy that uses bytemasks, Pyarrow uses bitmasks to indicate missing data which make it more memory efficient. Note that in Pyarrow, if the mask has a value of 1 it means that the data is present, while 0 indicates missing data. Similarly to Vaex, Pyarrow also makes the distinction between nan
and null
values.
In more recent versions, Pandas also implements a pd.NA
value to indicate missing values, which can be used in arrays or Series various types and not just float.
In Vaex, missing data are null
values if the underlying array is backed by Pyarrow, and masked values if the underlying array is a Numpy masked array.
When are missing values used in practice? They are used to indicate data that was not collected, i.e. a sensor was scheduled to make a reading but it did not, or a doctor was supposed to make scan of a patient but they did not.
To contrast with nan
values: missing or masked values indicate a complete absence of data, while nan
values indicate the presence of data that can not be interpreted numerically. This can be a subtle but sometimes an important distinction to make
Not available or na
¶
Vaex also implements methods referring to na
which stands for Not available”, and is a union of both nan
and missing values. This only really matters when dealing with Expressions of float type, since that is the only type that can have both missing and nan
values. Of course if you do not make the distinction between nan
and missing values in your code, use can use methods that refer to na
to encompass both cases and simplify development.
Examples¶
Let us consider the following DataFrame:
[1]:
import vaex
import numpy as np
import pyarrow as pa
x = np.ma.array(data=[1, 0, 3, 4, np.nan], mask=[False, True, False, False, False])
y = pa.array([10, 20, None, 40, 50])
z = pa.array(['Reggie Miller', 'Michael Jordan', None, None, 'Kobe Bryant'])
w = pa.array([
{'city': 'Indianapolis', 'team': 'Pacers'},
None,
{'city': 'Dallas', 'team': 'Mavericks'},
None,
{'city': 'Los Angeles', 'team': 'Lakers'}
])
df = vaex.from_arrays(x=x, y=y, z=z, w=w)
df
[1]:
#  x  y  z  w 

0  1.0  10  Reggie Miller  {'city': 'Indianapolis', 'team': 'Pacers'} 
1    20  Michael Jordan   
2  3.0      {'city': 'Dallas', 'team': 'Mavericks'} 
3  4.0  40     
4  nan  50  Kobe Bryant  {'city': 'Los Angeles', 'team': 'Lakers'} 
The df
contains a float column x
which in turn contains both a missing (masked) value and a nan
value. The columns y
, z
, and w
which are of dtype
int, string, and struct respectively can only contain masked values in addition to their nominal type.
For example, if we want to drop all rows with missing values from the entire DataFrame, we can use the dropmissing
method:
[2]:
df.dropmissing()
[2]:
#  x  y  z  w 

0  1  10  Reggie Miller  {'city': 'Indianapolis', 'team': 'Pacers'} 
1  nan  50  Kobe Bryant  {'city': 'Los Angeles', 'team': 'Lakers'} 
We see that all missing (masked) values are dropped, but the nan
value in column x
is still present since it is not technically “missing”.
If we want drop all nan
values from the DataFrame we can do so via the corresponding dropnan
method:
[3]:
df.dropnan()
[3]:
#  x  y  z  w 

0  1.0  10  Reggie Miller  {'city': 'Indianapolis', 'team': 'Pacers'} 
1    20  Michael Jordan   
2  3.0      {'city': 'Dallas', 'team': 'Mavericks'} 
3  4.0  40     
Now we see that the nan
value from the column x
is no longer in the DataFrame, but all the other missing values are still there.
If we simply want to get rid of all values that are not available for us to use directly, we can use the dropna
method:
[4]:
df.dropna()
[4]:
#  x  y  z  w 

0  1  10  Reggie Miller  {'city': 'Indianapolis', 'team': 'Pacers'} 
Now we see that only rows containing valid data entries remain.