API documentation for vaex library


Quick lists#

Opening/reading in your data.#

vaex.open(path[, convert, progress, ...])

Open a DataFrame from file given by path.

vaex.open_many(filenames)

Open a list of filenames, and return a DataFrame with all DataFrames concatenated.

vaex.from_arrays(**arrays)

Create an in memory DataFrame from numpy arrays.

vaex.from_arrow_dataset(arrow_dataset)

Create a DataFrame from an Apache Arrow dataset.

vaex.from_arrow_table(table)

Creates a vaex DataFrame from an arrow Table.

vaex.from_ascii(path[, seperator, names, ...])

Create an in-memory DataFrame from an ASCII file (whitespace separated by default).

vaex.from_astropy_table(table)

Create a vaex DataFrame from an Astropy Table.

vaex.from_csv(filename_or_buffer[, ...])

Load a CSV file as a DataFrame, and optionally convert to an HDF5 file.

vaex.from_csv_arrow(file[, read_options, ...])

Fast CSV reader using Apache Arrow.

vaex.from_dataset(dataset)

Create a Vaex DataFrame from a Vaex Dataset

vaex.from_dict(data)

Create an in memory dataset from a dict with column names as keys and list/numpy-arrays as values

vaex.from_items(*items)

Create an in memory DataFrame from numpy arrays, in contrast to from_arrays this keeps the order of columns intact (for Python < 3.6).

vaex.from_json(path_or_buffer[, orient, ...])

A method to read a JSON file using pandas, and convert to a DataFrame directly.

vaex.from_pandas(df[, name, copy_index, ...])

Create an in memory DataFrame from a pandas DataFrame.

vaex.from_records(records[, array_type, ...])

Create a dataframe from a list of dict.

Visualizations.#

vaex.viz.DataFrameAccessorViz.heatmap([x, ...])

Visualize data in a 2D histogram/heatmap.

vaex.viz.DataFrameAccessorViz.histogram([x, ...])

Plot a histogram.

vaex.viz.DataFrameAccessorViz.scatter(x, y)

Visualize small amounts of data in 2D using a scatter plot.

Statistics.#

vaex.dataframe.DataFrame.correlation(x[, y, ...])

Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.count([expression, ...])

Count the number of non-NaN values (or all, if expression is None or "*").

vaex.dataframe.DataFrame.cov(x[, y, binby, ...])

Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.max(expression[, ...])

Calculate the maximum for given expressions, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.mean(expression[, ...])

Calculate the mean for expression, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.median_approx(...)

Calculate the median, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.min(expression[, ...])

Calculate the minimum for given expressions, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.minmax(expression)

Calculate the minimum and maximum for expressions, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.mode(expression[, ...])

Calculate/estimate the mode.

vaex.dataframe.DataFrame.mutual_information(x)

Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.std(expression[, ...])

Calculate the standard deviation for the given expression, possibly on a grid defined by binby.

vaex.dataframe.DataFrame.unique(expression)

Returns all unique values.

vaex.dataframe.DataFrame.var(expression[, ...])

Calculate the sample variance for the given expression, possibly on a grid defined by binby.

vaex-core#

Vaex is a library for dealing with larger than memory DataFrames (out of core).

The most important class (datastructure) in vaex is the DataFrame. A DataFrame is obtained by either opening the example dataset:

>>> import vaex
>>> df = vaex.example()

Or using open() to open a file.

>>> df1 = vaex.open("somedata.hdf5")
>>> df2 = vaex.open("somedata.fits")
>>> df3 = vaex.open("somedata.arrow")
>>> df4 = vaex.open("somedata.csv")

Or connecting to a remote server:

>>> df_remote = vaex.open("http://try.vaex.io/nyc_taxi_2015")

A few strong features of vaex are:

  • Performance: works with huge tabular data, processing over a billion (> 10^9) rows/second.

  • Expression system / Virtual columns: compute on the fly, without wasting ram.

  • Memory efficient: no memory copies when doing filtering/selections/subsets.

  • Visualization: directly supported, a one-liner is often enough.

  • User friendly API: you will only need to deal with a DataFrame object, and tab completion plus docstrings will help you out (df.mean<tab>); it feels very similar to Pandas.

  • Very fast statistics on N dimensional grids such as histograms, running mean, heatmaps.

Follow the tutorial at https://docs.vaex.io/en/latest/tutorial.html to learn how to use vaex.

vaex.concat(dfs, resolver='flexible') DataFrame[source]#

Concatenate a list of DataFrames.

Parameters:

resolver – How to resolve schema conflicts, see DataFrame.concat().

vaex.delayed(f)[source]#

Decorator to transparently accept delayed computation.

Example:

>>> delayed_sum = ds.sum(ds.E, binby=ds.x, limits=limits,
...                      shape=4, delay=True)
>>> @vaex.delayed
... def total_sum(sums):
...     return sums.sum()
>>> sum_of_sums = total_sum(delayed_sum)
>>> ds.execute()
>>> sum_of_sums.get()

See the tutorial for a more complete example: https://docs.vaex.io/en/latest/tutorial.html#Parallel-computations

vaex.example()[source]#

Result of an N-body simulation of the accretion of 33 satellite galaxies into a Milky Way dark matter halo.

Data was created by Helmi & de Zeeuw 2000. The data contains the position (x, y, z), velocities (vx, vy, vz), the energy (E), the angular momentum (L, Lz) and iron content (FeH) of the particles.

Return type:

DataFrame

vaex.from_arrays(**arrays) DataFrameLocal[source]#

Create an in memory DataFrame from numpy arrays.

Example

>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_arrays(x=x, y=y)
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
>>> some_dict = {'x': x, 'y': y}
>>> vaex.from_arrays(**some_dict)  # in case you have your columns in a dict
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
Parameters:

arrays – keyword arguments with arrays

Return type:

DataFrame

vaex.from_arrow_dataset(arrow_dataset) DataFrame[source]#

Create a DataFrame from an Apache Arrow dataset.

vaex.from_arrow_table(table) DataFrame[source]#

Creates a vaex DataFrame from an arrow Table.

Parameters:

as_numpy – Will lazily cast columns to a NumPy ndarray.

Return type:

DataFrame

vaex.from_ascii(path, seperator=None, names=True, skip_lines=0, skip_after=0, **kwargs)[source]#

Create an in-memory DataFrame from an ASCII file (whitespace separated by default).

>>> ds = vaex.from_ascii("table.asc")
>>> ds = vaex.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])
Parameters:
  • path – file path

  • seperator – value separator, by default whitespace; use “,” for comma-separated values. (Note that the parameter name itself is spelled seperator.)

  • names – If True, the first line is used for the column names, otherwise provide a list of strings with names

  • skip_lines – skip lines at the start of the file

  • skip_after – skip lines at the end of the file

  • kwargs

Return type:

DataFrame

vaex.from_astropy_table(table)[source]#

Create a vaex DataFrame from an Astropy Table.

vaex.from_csv(filename_or_buffer, copy_index=False, chunk_size=None, convert=False, fs_options={}, progress=None, fs=None, **kwargs)[source]#

Load a CSV file as a DataFrame, and optionally convert to an HDF5 file.

Parameters:
  • filename_or_buffer (str or file) – CSV file path or file-like

  • copy_index (bool) – copy index when source is read via Pandas

  • chunk_size (int) –

    if the CSV file is too big to fit in the memory this parameter can be used to read CSV file in chunks. For example:

    >>> import vaex
    >>> for i, df in enumerate(vaex.from_csv('taxi.csv', chunk_size=100_000)):
    ...     df = df[df.passenger_count < 6]
    ...     df.export_hdf5(f'taxi_{i:02}.hdf5')
    

  • convert (bool or str) – convert files to an HDF5 file for optimization; can also be a path. The CSV file will be read in chunks, using either the provided chunk_size argument or a default size. Each chunk is saved as a separate HDF5 file, then all of them are combined into one HDF5 file, so for a big CSV file you will need at least double the extra space on disk. The default chunk_size for converting is 5 million rows, which corresponds to around 1 GB of memory for the NYC Taxi dataset.

  • progress – (Only applies when convert is not False) True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • kwargs – extra keyword arguments, currently passed to Pandas read_csv function, but the implementation might change in future versions.

Returns:

DataFrame
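
The progress contract described above (a callable that receives one float between 0 and 1 and cancels the work by returning False) can be sketched without vaex. In this minimal sketch, run_with_progress is a hypothetical stand-in for a chunked conversion loop and is not part of vaex; only the shape of the callback follows the description above.

```python
def make_progress_callback(cancel_after=0.5):
    """Return a callback that records progress and cancels at a threshold."""
    seen = []

    def on_progress(fraction):
        seen.append(fraction)
        print(f"progress: {fraction:.0%}")
        # Returning False asks the caller to cancel the calculation.
        return fraction < cancel_after

    return on_progress, seen


def run_with_progress(n_chunks, progress):
    """Hypothetical stand-in for a chunked conversion (NOT vaex code)."""
    for i in range(1, n_chunks + 1):
        if progress(i / n_chunks) is False:
            return False  # cancelled by the callback
    return True  # ran to completion


callback, seen = make_progress_callback(cancel_after=0.5)
completed = run_with_progress(4, callback)
```

With four chunks the callback is called at 0.25 and 0.5, and the second call returns False, so the loop stops halfway.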

vaex.from_dataset(dataset: Dataset) DataFrame[source]#

Create a Vaex DataFrame from a Vaex Dataset

vaex.from_dict(data)[source]#

Create an in memory dataset from a dict with column names as keys and list/numpy-arrays as values

Example

>>> data = {'A':[1,2,3],'B':['a','b','c']}
>>> vaex.from_dict(data)
  #    A    B
  0    1   'a'
  1    2   'b'
  2    3   'c'
Parameters:

data – A dict of {column:[value, value,…]}

Return type:

DataFrame

vaex.from_items(*items)[source]#

Create an in memory DataFrame from numpy arrays, in contrast to from_arrays this keeps the order of columns intact (for Python < 3.6).

Example

>>> import vaex, numpy as np
>>> x = np.arange(5)
>>> y = x ** 2
>>> vaex.from_items(('x', x), ('y', y))
  #    x    y
  0    0    0
  1    1    1
  2    2    4
  3    3    9
  4    4   16
Parameters:

items – list of [(name, numpy array), …]

Return type:

DataFrame

vaex.from_json(path_or_buffer, orient=None, precise_float=False, lines=False, copy_index=False, **kwargs)[source]#

A method to read a JSON file using pandas, and convert to a DataFrame directly.

Parameters:
  • path_or_buffer (str) – a valid JSON string or file-like, default: None. The string could be a URL. Valid URL schemes include http, ftp, s3, gcs, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json

  • orient (str) – Indication of expected JSON string format. Allowed values are split, records, index, columns, and values.

  • precise_float (bool) – Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (False) is to use fast but less precise builtin functionality

  • lines (bool) – Read the file as a json object per line.

Return type:

DataFrame
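
To make the orient and lines options more concrete, the sketch below builds the same two rows in the two most common layouts using only the standard json module: a single JSON array of row objects (what pandas calls orient="records") and one JSON object per line (lines=True). It does not call vaex or pandas; the variable names are illustrative.

```python
import json

rows = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]

# orient="records": one JSON array holding one object per row.
records_text = json.dumps(rows)

# lines=True: one JSON object per line (often called JSON Lines).
lines_text = "\n".join(json.dumps(row) for row in rows)

# Both layouts decode to the same rows.
from_records = json.loads(records_text)
from_lines = [json.loads(line) for line in lines_text.splitlines()]
```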

vaex.from_pandas(df, name='pandas', copy_index=False, index_name='index')[source]#

Create an in memory DataFrame from a pandas DataFrame.

Parameters:
  • df (pandas.DataFrame) – Pandas DataFrame

  • name – unique name for the DataFrame


>>> import vaex, pandas as pd
>>> df_pandas = pd.read_csv('test.csv')
>>> df = vaex.from_pandas(df_pandas)
Return type:

DataFrame

vaex.from_records(records: List[Dict], array_type='arrow', defaults={}) DataFrame[source]#

Create a dataframe from a list of dict.

Warning

This is for convenience only, for performance pass arrays to from_arrays() for instance.

Parameters:
  • array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

  • defaults (dict) – default values if a record has a missing entry

vaex.open(path, convert=False, progress=None, shuffle=False, fs_options={}, fs=None, *args, **kwargs)[source]#

Open a DataFrame from file given by path.

Example:

>>> df = vaex.open('sometable.hdf5')
>>> df = vaex.open('somedata*.csv', convert='bigdata.hdf5')
Parameters:
  • path (str or list) – local or absolute path to file, or glob string, or list of paths

  • convert – Uses dataframe.export when convert is a path. If True, convert=path+'.hdf5'. The conversion is skipped if the input file or conversion argument did not change.

  • progress – (Only applies when convert is not False) True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • shuffle (bool) – shuffle converted DataFrame or not

  • fs_options (dict) – Extra arguments passed to an optional file system if needed. See below

  • group – (optional) Specify the group to be read from an HDF5 file. By default this is set to “/table”.

  • fs – Apache Arrow FileSystem object, or FSSpec FileSystem object, if specified, fs_options should be empty.

  • args – extra arguments for file readers that need it

  • kwargs – extra keyword arguments

Returns:

return a DataFrame on success, otherwise None

Return type:

DataFrame

Note: From version 4.14.0 vaex.open() will lazily read CSV files. If you prefer to read the entire CSV file into memory, use vaex.from_csv() or vaex.from_csv_arrow() instead.

Cloud storage support:

Vaex supports streaming of HDF5 files from Amazon AWS S3 and Google Cloud Storage. Files are by default cached in $HOME/.vaex/file-cache/(s3|gs) such that successive access is as fast as native disk access.

Amazon AWS S3 options:

The following common fs_options are used for S3 access:

  • anon: Use anonymous access or not (false by default). (Allowed values are: true,True,1,false,False,0)

  • anonymous - Alias for anon

  • cache: Use the disk cache or not, only set to false if the data should be accessed once. (Allowed values are: true,True,1,false,False,0)

  • access_key - AWS access key, if not provided will use the standard env vars, or the ~/.aws/credentials file

  • secret_key - AWS secret key, similar to access_key

  • profile - If multiple profiles are present in ~/.aws/credentials, pick this one instead of ‘default’, see https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html

  • region - AWS Region, e.g. ‘us-east-1’, will be determined automatically if not provided.

  • endpoint_override - URL/ip to connect to, instead of AWS, e.g. ‘localhost:9000’ for minio

All fs_options can also be encoded in the file path as a query string.

Examples:

>>> df = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5', fs_options={'anonymous': True})
>>> df = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5?anon=true')
>>> df = vaex.open('s3://mybucket/path/to/file.hdf5', fs_options={'access_key': my_key, 'secret_key': my_secret_key})
>>> df = vaex.open(f's3://mybucket/path/to/file.hdf5?access_key={my_key}&secret_key={my_secret_key}')
>>> df = vaex.open('s3://mybucket/path/to/file.hdf5?profile=myproject')
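
The equivalence between a query string and an fs_options dict can be illustrated with the standard library alone. The helper split_fs_options below is hypothetical and only shows the idea; how vaex parses the path internally may differ.

```python
from urllib.parse import urlsplit, parse_qsl


def split_fs_options(path):
    """Split 'scheme://bucket/key?opt=val' into (path without query, options)."""
    parts = urlsplit(path)
    options = dict(parse_qsl(parts.query))  # values stay strings, e.g. 'false'
    bare_path = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return bare_path, options


bare_path, options = split_fs_options(
    "s3://mybucket/path/to/file.hdf5?profile=myproject&cache=false"
)
```

Note that values decoded from a query string are strings, which is presumably why string spellings such as true/True/1 are listed as allowed values above.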

Google Cloud Storage options:

The following fs_options are used for GCP access:

Examples:

>>> df = vaex.open('gs://vaex-data/airlines/us_airline_data_1988_2019.hdf5', fs_options={'token': None})
>>> df = vaex.open('gs://vaex-data/airlines/us_airline_data_1988_2019.hdf5?token=anon')
>>> df = vaex.open('gs://vaex-data/testing/xys.hdf5?token=anon&cache=False')
vaex.open_many(filenames)[source]#

Open a list of filenames, and return a DataFrame with all DataFrames concatenated.

The filenames can be of any format that is supported by vaex.open(), namely hdf5, arrow, parquet, csv, etc.

Parameters:

filenames (list[str]) – list of filenames/paths

Return type:

DataFrame

vaex.register_function(scope=None, as_property=False, name=None, on_expression=True, df_accessor=None, multiprocessing=False)[source]#

Decorator to register a new function with vaex.

If on_expression is True, the function will be available as a method on an Expression, where the first argument will be the expression itself.

If df_accessor is given, it is added as a method to that dataframe accessor (see e.g. vaex/geo.py)

Example:

>>> import vaex
>>> df = vaex.example()
>>> @vaex.register_function()
>>> def invert(x):
>>>     return 1/x
>>> df.x.invert()
>>> import numpy as np
>>> df = vaex.from_arrays(departure=np.arange('2015-01-01', '2015-12-05', dtype='datetime64'))
>>> @vaex.register_function(as_property=True, scope='dt')
>>> def dt_relative_day(x):
>>>     return vaex.functions.dt_dayofyear(x)/365.
>>> df.departure.dt.relative_day
vaex.vconstant(value, length, dtype=None, chunk_size=1024)[source]#

Creates a virtual column with constant values, which uses 0 memory.

Parameters:
  • value – The value with which to fill the column

  • length – The length of the column, i.e. the number of rows it should contain.

  • dtype – The preferred dtype for the column.

  • chunk_size – Could be used to optimize the performance (evaluation) of this column.

vaex.vrange(start, stop, step=1, dtype='f8')[source]#

Creates a virtual column which is the equivalent of numpy.arange, but uses 0 memory

Parameters:
  • start (int) – Start of interval. The interval includes this value.

  • stop (int) – End of interval. The interval does not include this value.

  • step (int) – Spacing between values.

  • dtype – The preferred dtype for the column.
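
The idea of a column that takes (almost) no memory can be illustrated with Python's built-in range, which likewise stores only start, stop and step and computes each value on demand. This is an analogy under that assumption, not vaex's implementation, and the size comparison relies on CPython's sys.getsizeof.

```python
import sys

# A range object stores only start/stop/step, so its footprint does not
# grow with the number of values it represents (CPython).
small = range(0, 10)
huge = range(0, 10**9)
same_footprint = sys.getsizeof(small) == sys.getsizeof(huge)

# Values are computed on demand instead of being read from a stored array.
value_at_index = huge[123_456_789]
length = len(huge)
```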

Aggregation and statistics#

class vaex.stat.Expression[source]#

Bases: object

Describes an expression for a statistic

calculate(ds, binby=[], shape=256, limits=None, selection=None)[source]#

Calculate the statistic for a Dataset

vaex.stat.correlation(x, y)[source]#

Creates a correlation statistic

vaex.stat.count(expression='*')[source]#

Creates a count statistic

vaex.stat.covar(x, y)[source]#

Creates a covariance statistic

vaex.stat.mean(expression)[source]#

Creates a mean statistic

vaex.stat.std(expression)[source]#

Creates a standard deviation statistic

vaex.stat.sum(expression)[source]#

Creates a sum statistic

class vaex.agg.AggregatorDescriptorKurtosis(name, expression, short_name='kurtosis', selection=None, edges=False)[source]#

Bases: AggregatorDescriptorMulti

class vaex.agg.AggregatorDescriptorMean(name, expressions, short_name='mean', selection=None, edges=False)[source]#

Bases: AggregatorDescriptorMulti

class vaex.agg.AggregatorDescriptorMulti(name, expressions, short_name, selection=None, edges=False)[source]#

Bases: AggregatorDescriptor

Uses multiple operations/aggregations to calculate the final aggregation

class vaex.agg.AggregatorDescriptorSkew(name, expression, short_name='skew', selection=None, edges=False)[source]#

Bases: AggregatorDescriptorMulti

class vaex.agg.AggregatorDescriptorStd(name, expression, short_name='var', ddof=0, selection=None, edges=False)[source]#

Bases: AggregatorDescriptorVar

class vaex.agg.AggregatorDescriptorVar(name, expression, short_name='var', ddof=0, selection=None, edges=False)[source]#

Bases: AggregatorDescriptorMulti

vaex.agg.all(expression=None, selection=None)[source]#

Aggregator that returns True when all of the values in the group are True, or when all of the data in the group is valid (i.e. not missing values or np.nan). The aggregator returns False if there is no data in the group when the selection argument is used.

Parameters:
  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

vaex.agg.any(expression=None, selection=None)[source]#

Aggregator that returns True when any of the values in the group are True, or when there is any data in the group that is valid (i.e. not missing values or np.nan). The aggregator returns False if there is no data in the group when the selection argument is used.

Parameters:
  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

vaex.agg.count(expression='*', selection=None, edges=False)[source]#

Creates a count aggregation

vaex.agg.first(expression, order_expression=None, selection=None, edges=False)[source]#

Creates a first aggregation.

Parameters:
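  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • order_expression – Order the values in the bins by this expression.

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

  • edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller than the limits at 1, and larger at -1)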

vaex.agg.kurtosis(expression, selection=None, edges=False)[source]#

Create a kurtosis aggregation.

vaex.agg.last(expression, order_expression=None, selection=None, edges=False)[source]#

Creates a last aggregation.

Parameters:
  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y .

  • order_expression – Order the values in the bins by this expression.

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

  • edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller than the limits at 1, and larger at -1)

class vaex.agg.list(expression, selection=None, dropna=False, dropnan=False, dropmissing=False, edges=False)[source]#

Bases: AggregatorDescriptorBasic

Aggregator that returns a list of values belonging to the specified expression.

Parameters:
  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

  • dropmissing – Drop rows with missing values

  • dropnan – Drop rows with NaN values

  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller than the limits at 1, and larger at -1)

vaex.agg.max(expression, selection=None, edges=False)[source]#

Creates a max aggregation

vaex.agg.mean(expression, selection=None, edges=False)[source]#

Creates a mean aggregation

vaex.agg.min(expression, selection=None, edges=False)[source]#

Creates a min aggregation

vaex.agg.nunique(expression, dropna=False, dropnan=False, dropmissing=False, selection=None, edges=False)[source]#

Aggregator that calculates the number of unique items per bin.

Parameters:
  • expression – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • dropmissing – Drop rows with missing values

  • dropnan – Drop rows with NaN values

  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False)

vaex.agg.skew(expression, selection=None, edges=False)[source]#

Create a skew aggregation.

vaex.agg.std(expression, ddof=0, selection=None, edges=False)[source]#

Creates a standard deviation aggregation

vaex.agg.sum(expression, selection=None, edges=False)[source]#

Creates a sum aggregation

vaex.agg.var(expression, ddof=0, selection=None, edges=False)[source]#

Creates a variance aggregation
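
The ddof argument of std and var is not defined in the text above; the sketch below assumes it follows the usual "delta degrees of freedom" convention, where the sum of squared deviations is divided by N - ddof. It uses only the standard statistics module, and the variance helper is illustrative rather than vaex code.

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0]

# ddof=0: divide the sum of squared deviations by N (population variance).
var_ddof0 = statistics.pvariance(data)

# ddof=1: divide by N - 1 (sample variance).
var_ddof1 = statistics.variance(data)


def variance(values, ddof=0):
    """Variance with an explicit ddof, dividing by N - ddof."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - ddof)
```

For this sample the squared deviations sum to 5, giving 5/4 with ddof=0 and 5/3 with ddof=1.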

Caching#

(Currently experimental, use at own risk) Vaex can cache task results, such as aggregations, or the internal hashmaps used for groupby to make recurring calculations much faster, at the cost of calculating cache keys and storing/retrieving the cached values.

Internally, Vaex calculates fingerprints (such as hashes of data, or file paths and mtimes) to create cache keys that are similar across processes, such that a restart of a process will most likely result in similar hash keys.
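
As a rough illustration of a fingerprint built from a file path and its modification time, the sketch below hashes both into a key that is stable across processes as long as the file is untouched. The helper file_fingerprint is hypothetical and is not how vaex computes its keys.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path):
    """Hash the absolute path and mtime (hypothetical helper, not vaex API)."""
    stat = os.stat(path)
    token = f"{os.path.abspath(path)}:{stat.st_mtime_ns}"
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "w") as f:
        f.write("x,y\n1,2\n")

    key_first = file_fingerprint(path)
    key_again = file_fingerprint(path)  # unchanged file, same key

    # Simulate a later modification by moving the mtime ten seconds ahead.
    stat = os.stat(path)
    os.utime(path, ns=(stat.st_atime_ns, stat.st_mtime_ns + 10_000_000_000))
    key_modified = file_fingerprint(path)
```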

Caches can be turned on globally, or used as a context manager:

>>> import vaex
>>> df = vaex.example()
>>> vaex.cache.memory_infinite()  # cache on globally
<cache restore context manager>
>>> vaex.cache.is_on()
True
>>> vaex.cache.off() # cache off globally
<cache restore context manager>
>>> vaex.cache.is_on()
False
>>> with vaex.cache.memory_infinite():
...     df.x.sum()  # calculated with cache
array(-20884.64307324)
>>> vaex.cache.is_on()
False

The functions vaex.cache.set() and vaex.cache.get() simply look up the values in a global dict (vaex.cache.cache), but can be set for more complex behaviour.

A good library to use for in-memory caching is cachetools (https://pypi.org/project/cachetools/)

>>> import vaex
>>> import cachetools
>>> df = vaex.example()
>>> vaex.cache.cache = cachetools.LRUCache(1_000_000_000)  # 1gb cache

Configure using environment variables#

See Configuration for more configuration options.

Especially when using the vaex server, it can be useful to turn on caching externally using environment variables.

$ VAEX_CACHE=disk VAEX_CACHE_DISK_SIZE_LIMIT="10GB" python -m vaex.server

Will enable caching using vaex.cache.disk() and configure it to use at max 10 GB of disk space.

When using Vaex in combination with Flask or Plotly Dash, and using gunicorn for scaling, it can be useful to use a multilevel cache, where the first cache is small but low latency (and private to each process), and the second is a higher latency disk cache that is shared among all processes.

$ VAEX_CACHE="memory,disk" VAEX_CACHE_DISK_SIZE_LIMIT="10GB" VAEX_CACHE_MEMORY_SIZE_LIMIT="1GB" gunicorn -w 16 app:server
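
The multilevel lookup described above can be sketched in plain Python: check the small private cache first, fall back to the shared one, and promote hits. TwoLevelCache is a hypothetical illustration in which ordinary dicts stand in for the memory and disk caches; it is not vaex's implementation.

```python
class TwoLevelCache:
    """Look in a small, fast cache first and fall back to a larger, slower one."""

    def __init__(self, fast, slow):
        self.fast = fast  # e.g. a per-process in-memory mapping
        self.slow = slow  # e.g. a disk-backed mapping shared between processes

    def get(self, key, default=None):
        if key in self.fast:
            return self.fast[key]
        if key in self.slow:
            value = self.slow[key]
            self.fast[key] = value  # promote so the next lookup is fast
            return value
        return default

    def set(self, key, value):
        self.fast[key] = value
        self.slow[key] = value


shared = {}  # stands in for the shared disk cache
worker_a = TwoLevelCache({}, shared)
worker_b = TwoLevelCache({}, shared)

worker_a.set("sum(x)", -20884.64)
hit_from_shared = worker_b.get("sum(x)")  # found in the shared level
promoted = "sum(x)" in worker_b.fast      # now also in worker_b's fast level
```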

vaex.cache.disk(clear=False, size_limit='10GB', eviction_policy='least-recently-stored')[source]#

Stores cached values using the diskcache library.

See the configuration details under configuration of cache and configuration of paths.

Parameters:
vaex.cache.get(key, default=None, type=None)[source]#

Looks up the cache value for the key, or returns the default

Will return None if the cache is turned off.

Parameters:
  • key (str) – Cache key.

  • default – Return when cache is on, but key not in cache

  • type – Currently unused.

vaex.cache.is_on()[source]#

Returns True when caching is enabled

vaex.cache.memory(maxsize='1GB', classname='LRUCache', clear=False)[source]#

Sets a memory cache using cachetools (https://cachetools.readthedocs.io/).

Calling multiple times with clear=False will keep the current cache (useful in notebook usage).

Parameters:
  • maxsize (int or str) – Max size of cache in bytes (or use a string like ‘128MB’)

  • classname (str) – classname in the cachetools library used for the cache (e.g. LRUCache, MRUCache).

  • clear (bool) – If True, will always set a new cache; if False, it will keep the current cache when it is of the same type.

vaex.cache.memory_infinite(clear=False)[source]#

Sets a dict as the cache, creating an infinite cache.

Calling multiple times with clear=False will keep the current cache (useful in notebook usage)

vaex.cache.off()[source]#

Turns off caching, or turns it off temporarily when used as a context manager

>>> import vaex
>>> df = vaex.example()
>>> vaex.cache.memory_infinite()  # cache on
<cache restore context manager>
>>> with vaex.cache.off():
...     df.x.sum()  # calculated without cache
array(-20884.64307324)
>>> df.x.sum()  # calculated with cache
array(-20884.64307324)
>>> vaex.cache.off()  # cache off
<cache restore context manager>
>>> df.x.sum()  # calculated without cache
array(-20884.64307324)
vaex.cache.redis(client=None)[source]#

Uses Redis for caching.

Parameters:

client – Redis client, if None, will call redis.Redis()

vaex.cache.set(key, value, type=None, duration_wallclock=None)[source]#

Set a cache value

Useful for more advanced strategies, where we want to have different behaviour based on the type and costs. Implementations can set this function to override the default behaviour:

>>> import vaex
>>> vaex.cache.memory_infinite()  
>>> def my_smart_cache_setter(key, value, type=None, duration_wallclock=None):
...     if duration_wallclock >= 0.1:  # skip fast calculations
...         vaex.cache.cache[key] = value
...
>>> vaex.cache.set = my_smart_cache_setter
Parameters:
  • key (str) – key for caching

  • type – Currently unused.

  • duration_wallclock (float) – Time spent on calculating the result (in wallclock time).

  • value – Any value, typically needs to be pickleable (unless stored in memory)

DataFrame class#

class vaex.dataframe.DataFrame(name=None, executor=None)[source]#

Bases: object

All local or remote datasets are encapsulated in this class, which provides a pandas like API to your dataset.

Each DataFrame (df) has a number of columns, and a number of rows, the length of the DataFrame.

All DataFrames can have multiple ‘selections’, and all calculations are done on the whole DataFrame (default) or for a selection. The following example shows how to use selections.

>>> df.select("x < 0")
>>> df.sum(df.y, selection=True)
>>> df.sum(df.y, selection=[df.x < 0, df.x > 0])
__dataframe__(nan_as_null: bool = False, allow_copy: bool = True)[source]#
__delitem__(item)[source]#

Alias of df.drop(item, inplace=True)

__getitem__(item)[source]#

Convenient way to get expressions, (shallow) copies of a few columns, or to apply filtering.

Example:

>>> df['Lz']  # the expression 'Lz'
>>> df['Lz/2'] # the expression 'Lz/2'
>>> df[["Lz", "E"]] # a shallow copy with just two columns
>>> df[df.Lz < 0]  # a shallow copy with the filter Lz < 0 applied
__init__(name=None, executor=None)[source]#
__iter__()[source]#

Iterator over the column names.

__len__()[source]#

Returns the number of rows in the DataFrame (filtering applied).

__repr__()[source]#

Return repr(self).

__setitem__(name, value)[source]#

Convenient way to add a virtual column / expression to this DataFrame.

Example:

>>> import vaex, numpy as np
>>> df = vaex.example()
>>> df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)
>>> df.r
<vaex.expression.Expression(expressions='r')> instance at 0x121687e80 values=[2.9655450396553587, 5.77829281049018, 6.99079603950256, 9.431842752707537, 0.8825613121347967 ... (total 330000 values) ... 7.453831761514681, 15.398412491068198, 8.864250273925633, 17.601047186042507, 14.540181524970293]
__str__()[source]#

Return str(self).

__weakref__#

list of weak references to the object

add_column(name, f_or_array, dtype=None)[source]#

Add an in memory array as a column.

add_variable(name, expression, overwrite=True, unique=True)[source]#

Add a variable to a DataFrame.

A variable may refer to other variables, and virtual columns and expression may refer to variables.

Example

>>> df.add_variable('center', 0)
>>> df.add_virtual_column('x_prime', 'x-center')
>>> df.select('x_prime < 0')
Parameters:
  • name (str) – name of virtual variable

  • expression – expression for the variable

add_virtual_column(name, expression, unique=False)[source]#

Add a virtual column to the DataFrame.

Example:

>>> df.add_virtual_column("r", "sqrt(x**2 + y**2 + z**2)")
>>> df.select("r < 10")
Parameters:
  • name (str) – name of virtual column

  • expression – expression for the column

  • unique (str) – if name is already used, make it unique by adding a postfix, e.g. _1, or _2

apply(f, arguments=None, vectorize=False, multiprocessing=True)[source]#

Apply a function on a per row basis across the entire DataFrame.

Example:

>>> import vaex
>>> df = vaex.example()
>>> def func(x, y):
...     return (x+y)/(x-y)
...
>>> df.apply(func, arguments=[df.x, df.y])
Expression = lambda_function(x, y)
Length: 330,000 dtype: float64 (expression)
-------------------------------------------
     0  -0.460789
     1    3.90038
     2  -0.642851
     3   0.685768
     4  -0.543357
Parameters:
  • f – The function to be applied

  • arguments – List of arguments to be passed on to the function f.

  • vectorize – Call f with arrays instead of scalars (for better performance).

  • multiprocessing (bool) – Use multiple processes to avoid the GIL (Global interpreter lock).

Returns:

A function that is lazily evaluated.

byte_size(selection=False, virtual=False)[source]#

Return the size in bytes the whole DataFrame requires (or the selection), respecting the active_fraction.

cat(i1, i2, format='html')[source]#

Display the DataFrame from row i1 till i2

For format, see https://pypi.org/project/tabulate/

Parameters:
  • i1 (int) – Start row

  • i2 (int) – End row.

  • format (str) – Format to use, e.g. ‘html’, ‘plain’, ‘latex’

close()[source]#

Close any possible open file handles or other resources, the DataFrame will not be in a usable state afterwards.

property col#

Gives direct access to the columns only (useful for tab completion).

Convenient when working with ipython in combination with small DataFrames, since this gives tab-completion.

Columns can be accessed by their names, which are attributes. The attributes are currently expressions, so you can do computations with them.

Example

>>> df = vaex.example()
>>> df.plot(df.col.x, df.col.y)
column_count(hidden=False)[source]#

Returns the number of columns (including virtual columns).

Parameters:

hidden (bool) – If True, include hidden columns in the tally

Returns:

Number of columns in the DataFrame

combinations(expressions_list=None, dimension=2, exclude=None, **kwargs)[source]#

Generate a list of combinations for the possible expressions for the given dimension.

Parameters:
  • expressions_list – list of list of expressions, where the inner list defines the subspace

  • dimension – if given, generates a subspace with all possible combinations for that dimension

  • exclude – list of

correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, delay=False, progress=None, array_type=None)[source]#

Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby.

The x and y arguments can be single expressions or lists of expressions:
  • If x and y are single expressions, it computes the correlation between x and y;

  • If x is a list of expressions and y is a single expression, it computes the correlation between each expression in x and the expression in y;

  • If x is a list of expressions and y is None, it computes the correlation matrix amongst all expressions in x;

  • If x is a list of tuples of length 2, it computes the correlation for the specified dimension pairs;

  • If x and y are lists of expressions, it computes the correlation matrix defined by the two expression lists.

Example:

>>> import vaex
>>> df = vaex.example()
>>> df.correlation("x**2+y**2+z**2", "-log(-E+1)")
array(0.6366637382215669)
>>> df.correlation("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 0.40594394,  0.69868851,  0.61394099,  0.65266318])
>>> df.correlation(x=['x', 'y', 'z'])
array([[ 1.        , -0.06668907, -0.02709719],
       [-0.06668907,  1.        ,  0.03450365],
       [-0.02709719,  0.03450365,  1.        ]])
>>> df.correlation(x=['x', 'y', 'z'], y=['E', 'Lz'])
array([[-0.01116315, -0.00369268],
       [-0.0059848 ,  0.02472491],
       [ 0.01428211, -0.05900035]])
Parameters:
  • x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
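
The formula cov[x,y]/(std[x]*std[y]) can be illustrated without vaex. The following is a minimal plain-Python sketch, assuming population (divide by N) normalisation; the helper name pearson_correlation is hypothetical and this is not the library's implementation.

```python
import math


def pearson_correlation(xs, ys):
    """Return cov[x, y] / (std[x] * std[y]) using population normalisation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: mean of the products of the deviations from the means.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov_xy / (std_x * std_y)


# Perfectly linear data gives a coefficient of +1 or -1 (up to rounding).
r_up = pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = pearson_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the same normalisation appears in the numerator and the denominator, the choice between dividing by N or N-1 does not change the coefficient.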

count(expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None, array_type=None)[source]#

Count the number of non-NaN values (or all, if expression is None or “*”).

Example:

>>> df.count()
330000
>>> df.count("*")
330000.0
>>> df.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])
Parameters:
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller values at 1, and larger values at -1)

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
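
To make the roles of binby, limits and shape concrete, here is a rough plain-Python sketch of a one-dimensional binned count. The function binned_count is hypothetical, and the half-open interval [min, max) is an assumption; the library's treatment of values exactly on the upper limit may differ.

```python
def binned_count(values, limits, shape):
    """Count non-NaN values in `shape` equal-width bins spanning `limits`."""
    vmin, vmax = limits
    width = (vmax - vmin) / shape
    counts = [0] * shape
    for v in values:
        if v != v:  # NaN is the only value not equal to itself
            continue
        if vmin <= v < vmax:  # assumed half-open range
            index = min(int((v - vmin) / width), shape - 1)
            counts[index] += 1
    return counts


data = [0.5, 1.5, 1.7, 3.9, float("nan"), 4.0, -1.0]
counts = binned_count(data, limits=(0.0, 4.0), shape=4)
```

Here the NaN is skipped, and 4.0 and -1.0 fall outside the limits, so only four of the seven values are counted.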

cov(x, y=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby.

Either x and y are expressions, e.g.:

>>> df.cov("x", "y")

Or only the x argument is given with a list of expressions, e.g.:

>>> df.cov(["x", "y", "z"])

Example:

>>> df.cov("x", "y")
array([[ 53.54521742,  -3.8123135 ],
       [ -3.8123135 ,  60.62257881]])
>>> df.cov(["x", "y", "z"])
array([[ 53.54521742,  -3.8123135 ,  -0.98260511],
       [ -3.8123135 ,  60.62257881,   1.21381057],
       [ -0.98260511,   1.21381057,  25.55517638]])
>>> df.cov("x", "y", binby="E", shape=2)
array([[[  9.74852878e+00,  -3.02004780e-02],
        [ -3.02004780e-02,   9.99288215e+00]],
       [[  8.43996546e+01,  -6.51984181e+00],
        [ -6.51984181e+00,   9.68938284e+01]]])
Parameters:
  • x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • y – if previous argument is not a list, this argument should be given

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimensions are of shape (2,2)
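
The structure of the returned matrix (variances on the diagonal, pairwise covariances off the diagonal) can be sketched in plain Python. The helper covariance_matrix is hypothetical and assumes population (divide by N) normalisation, which may not match the library exactly.

```python
def covariance_matrix(columns):
    """Return the covariance matrix of equally long columns as nested lists."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    matrix = []
    for col_a, mean_a in zip(columns, means):
        row = []
        for col_b, mean_b in zip(columns, means):
            # Entry (a, b) is the mean product of deviations of a and b.
            total = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
            row.append(total / n)
        matrix.append(row)
    return matrix


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
m = covariance_matrix([x, y])
```

For two columns the result has shape (2, 2), matching the "last dimensions are of shape (2,2)" note above.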

covar(x, y, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Calculate the covariance cov[x,y] between x and y, possibly on a grid defined by binby.

Example:

>>> df.covar("x**2+y**2+z**2", "-log(-E+1)")
array(52.69461456005138)
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)")/(df.std("x**2+y**2+z**2") * df.std("-log(-E+1)"))
0.63666373822156686
>>> df.covar("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 10.17387143,  51.94954078,  51.24902796,  20.2163929 ])
Parameters:
  • x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

data_type(expression, array_type=None, internal=False, axis=0)[source]#

Return the data type for the given expression; if it is not a column, the first row will be evaluated to get the data type.

Example:

>>> df = vaex.from_scalars(x=1, s='Hi')
Parameters:
  • array_type (str) – ‘numpy’, ‘arrow’ or None, to indicate if the data type should be converted

  • axis (int) – If a nested type (like list), it will return the value_type of the nested type, axis levels deep.

delete_variable(name)[source]#

Deletes a variable from a DataFrame.

delete_virtual_column(name)[source]#

Deletes a virtual column from a DataFrame.

describe(strings=True, virtual=True, selection=None)[source]#

Give a description of the DataFrame.

>>> import vaex
>>> df = vaex.example()[['x', 'y', 'z']]
>>> df.describe()
                 x          y          z
dtype      float64    float64    float64
count       330000     330000     330000
missing          0          0          0
mean    -0.0671315 -0.0535899  0.0169582
std        7.31746    7.78605    5.05521
min       -128.294   -71.5524   -44.3342
max        271.366    146.466    50.7185
>>> df.describe(selection=df.x > 0)
                   x         y          z
dtype        float64   float64    float64
count         164060    164060     164060
missing       165940    165940     165940
mean         5.13572 -0.486786 -0.0868073
std          5.18701   7.61621    5.02831
min      1.51635e-05  -71.5524   -44.3342
max          271.366   78.0724    40.2191
Parameters:
  • strings (bool) – Describe string columns or not

  • virtual (bool) – Describe virtual columns or not

  • selection – Optional selection to use.

Returns:

Pandas dataframe

diff(periods=1, column=None, fill_value=None, trim=False, inplace=False, reverse=False)[source]#

Calculate the difference between the current row and the row offset by periods

Parameters:
  • periods (int) – Which row to take the difference with

  • column (str or list[str]) – Column or list of columns to use (default is all).

  • fill_value – Value to use instead of missing values.

  • trim (bool) – Do not include rows that would otherwise have missing values

  • reverse (bool) – When true, calculate row[periods] - row[current]

  • inplace – If True, make modifications to self, otherwise return a new DataFrame
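
As a rough illustration of periods, fill_value and reverse, the sketch below works on a plain Python list. It assumes the pandas-like convention in which row i is compared with row i - periods; the function row_diff is hypothetical, and the library may define the offset direction differently.

```python
def row_diff(values, periods=1, fill_value=None, reverse=False):
    """Difference between each row and the row `periods` positions earlier."""
    result = []
    for i, current in enumerate(values):
        j = i - periods  # assumed convention: compare with an earlier row
        if j < 0 or j >= len(values):
            # No partner row exists, so the result is missing.
            result.append(fill_value)
        elif reverse:
            result.append(values[j] - current)
        else:
            result.append(current - values[j])
    return result


squares = [1, 4, 9, 16]
forward = row_diff(squares)
backward = row_diff(squares, reverse=True)
two_back = row_diff(squares, periods=2, fill_value=0)
```

With trim=True the rows that received fill_value here would be left out of the result instead.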

drop(columns, inplace=False, check=True)[source]#

Drop columns (or a single column).

Parameters:
  • columns – List of columns or a single column name

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

  • check – When true, it will check if the column is used in virtual columns or the filter, and hide it instead.

drop_filter(inplace=False)[source]#

Removes all filters from the DataFrame

dropinf(column_names=None, how='any')[source]#

Create a shallow copy of a DataFrame, with filtering set using isinf.

Parameters:
  • column_names – The columns to consider, default: all (real, non-virtual) columns

  • how (str) – One of (“any”, “all”). If “any”, then drop rows where any of the columns are inf. If “all”, then drop rows where all of the columns are inf.

Return type:

DataFrame

dropmissing(column_names=None, how='any')[source]#

Create a shallow copy of a DataFrame, with filtering set using ismissing.

Parameters:
  • column_names – The columns to consider, default: all (real, non-virtual) columns

  • how (str) – One of (“any”, “all”). If “any”, then drop rows where any of the columns are missing. If “all”, then drop rows where all of the columns are missing.

Return type:

DataFrame

dropna(column_names=None, how='any')[source]#

Create a shallow copy of a DataFrame, with filtering set using isna.

Parameters:
  • column_names – The columns to consider, default: all (real, non-virtual) columns

  • how (str) – One of (“any”, “all”). If “any”, then drop rows where any of the columns are na. If “all”, then drop rows where all of the columns are na.

Return type:

DataFrame

dropnan(column_names=None, how='any')[source]#

Create a shallow copy of a DataFrame, with filtering set using isnan.

Parameters:
  • column_names – The columns to consider, default: all (real, non-virtual) columns

  • how (str) – One of (“any”, “all”). If “any”, then drop rows where any of the columns are nan. If “all”, then drop rows where all of the columns are nan.

Return type:

DataFrame
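
The how argument shared by dropinf, dropmissing, dropna and dropnan can be sketched with plain tuples as rows. This is only an illustration of the "any" versus "all" rule for NaN values; the helper drop_nan_rows is hypothetical and does not reproduce the library's lazy filtering.

```python
import math


def is_nan(value):
    """True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)


def drop_nan_rows(rows, how="any"):
    """Drop rows where any (or all) of the values are NaN."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    check = any if how == "any" else all
    return [row for row in rows if not check(is_nan(v) for v in row)]


nan = float("nan")
rows = [(1.0, 2.0), (nan, 3.0), (nan, nan)]
kept_any = drop_nan_rows(rows, how="any")  # drops both rows containing NaN
kept_all = drop_nan_rows(rows, how="all")  # drops only the all-NaN row
```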

property dtypes#

Gives a Pandas series object containing all numpy dtypes of all columns (except hidden).

evaluate(expression, i1=None, i2=None, out=None, selection=None, filtered=True, array_type=None, parallel=True, chunk_size=None, progress=None)[source]#

Evaluate an expression, and return a numpy array with the results for the full column or a part of it.

Note that this is not how vaex should be used, since it means a copy of the data needs to fit in memory.

To get partial results, use i1 and i2

Parameters:
  • expression (str) – Name/expression to evaluate

  • i1 (int) – Start row index, default is the start (0)

  • i2 (int) – End row index, default is the length of the DataFrame

  • out (ndarray) – Output array, to which the result may be written (may be used to reuse an array, or write to a memory mapped array)

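  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False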

  • selection – selection to apply

Returns:

evaluate_iterator(expression, s1=None, s2=None, out=None, selection=None, filtered=True, array_type=None, parallel=True, chunk_size=None, prefetch=True, progress=None)[source]#

Generator to efficiently evaluate expressions in chunks (number of rows).

See DataFrame.evaluate() for other arguments.

Example:

>>> import vaex
>>> df = vaex.example()
>>> for i1, i2, chunk in df.evaluate_iterator(df.x, chunk_size=100_000):
...     print(f"Total of {i1} to {i2} = {chunk.sum()}")
...
Total of 0 to 100000 = -7460.610158279056
Total of 100000 to 200000 = -4964.85827154921
Total of 200000 to 300000 = -7303.271340043915
Total of 300000 to 330000 = -2424.65234724951
Parameters:
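  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False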

  • prefetch – Prefetch/compute the next chunk in parallel while the current value is yielded/returned.
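
The (i1, i2, chunk) triples yielded in the example above follow a simple pattern that can be sketched on a plain list. The generator iter_chunks is hypothetical and leaves out the parallel evaluation and prefetching that the real method performs.

```python
def iter_chunks(values, chunk_size):
    """Yield (start, end, chunk) triples covering `values` in order."""
    for i1 in range(0, len(values), chunk_size):
        # The last chunk may be shorter than chunk_size.
        i2 = min(i1 + chunk_size, len(values))
        yield i1, i2, values[i1:i2]


chunks = list(iter_chunks(list(range(7)), chunk_size=3))
total = sum(sum(chunk) for _, _, chunk in chunks)
```

Aggregating per chunk, as in the sum above, keeps only one chunk in memory at a time, which is the point of iterating rather than calling evaluate on the full column.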

evaluate_variable(name)[source]#

Evaluates the variable given by name.

execute()[source]#

Execute all delayed jobs.

async execute_async()[source]#

Async version of execute

extract()[source]#

Return a DataFrame containing only the filtered rows.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

The resulting DataFrame may be more efficient to work with when the original DataFrame is heavily filtered (contains just a small number of rows).

If no filtering is applied, it returns a trimmed view. For the returned df, len(df) == df.length_original() == df.length_unfiltered()

Return type:

DataFrame

fillna(value, column_names=None, prefix='__original_', inplace=False)[source]#

Return a DataFrame, where missing values/NaN are filled with ‘value’.

The original columns will be renamed, and by default they will be hidden columns. No data is lost.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Note

Note that filters will be ignored (since they may change), you may want to consider running extract() first.

Example:

>>> import vaex
>>> import numpy as np
>>> x = np.array([3, 1, np.nan, 10, np.nan])
>>> df = vaex.from_arrays(x=x)
>>> df_filled = df.fillna(value=-1, column_names=['x'])
>>> df_filled
  #    x
  0    3
  1    1
  2   -1
  3   10
  4   -1
Parameters:
  • value (float) – The value to use for filling nan or masked values.

  • fill_na (bool) – If True, fill np.nan values with value.

  • fill_masked (bool) – If True, fill masked values with value.

  • column_names (list) – List of column names in which to fill missing values.

  • prefix (str) – The prefix to give the original columns.

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

filter(expression, mode='and')[source]#

General version of df[<boolean expression>] to modify the filter applied to the DataFrame.

See DataFrame.select() for usage of selection.

Note that using df = df[<boolean expression>], one can only narrow the filter (i.e. only fewer rows can be selected). Using the filter method, and a different boolean mode (e.g. “or”) one can actually cause more rows to be selected. This differs greatly from numpy and pandas for instance, which can only narrow the filter.

Example:

>>> import vaex
>>> import numpy as np
>>> x = np.arange(10)
>>> df = vaex.from_arrays(x=x, y=x**2)
>>> df
#    x    y
0    0    0
1    1    1
2    2    4
3    3    9
4    4   16
5    5   25
6    6   36
7    7   49
8    8   64
9    9   81
>>> dff = df[df.x<=2]
>>> dff
#    x    y
0    0    0
1    1    1
2    2    4
>>> dff = dff.filter(dff.x >=7, mode="or")
>>> dff
#    x    y
0    0    0
1    1    1
2    2    4
3    7   49
4    8   64
5    9   81
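
The effect of mode can be reproduced with plain boolean masks. This sketch mirrors the example above using Python lists; the variable names are illustrative only, and it does not reflect how the library stores filters internally.

```python
x = list(range(10))

# Existing filter, equivalent to df[df.x <= 2].
current_mask = [v <= 2 for v in x]
# New boolean expression passed to filter().
new_mask = [v >= 7 for v in x]

# mode="and" can only narrow the filter; mode="or" can widen it.
mask_and = [a and b for a, b in zip(current_mask, new_mask)]
mask_or = [a or b for a, b in zip(current_mask, new_mask)]

rows_and = [v for v, keep in zip(x, mask_and) if keep]
rows_or = [v for v, keep in zip(x, mask_or) if keep]
```

The "or" result contains the rows 0, 1, 2, 7, 8 and 9, as in the output above, while "and" leaves no rows because no value satisfies both conditions.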
fingerprint(dependencies=None, treeshake=False)[source]#

Id that uniquely identifies a dataframe (cross runtime).

Parameters:
  • dependencies (set[str]) – set of column, virtual column, function or selection names to be used.

  • treeshake (bool) – Get rid of unused variables before calculating the fingerprint.

first(expression, order_expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None, array_type=None)[source]#

Return the first element of a binned expression, where the values in each bin are sorted by order_expression.

Example:

>>> import vaex
>>> df = vaex.example()
>>> df.first(df.x, df.y, shape=8)
>>> df.first(df.x, df.y, shape=8, binby=[df.y])
array([-4.81883764, 11.65378   ,  9.70084476, -7.3025589 ,  4.84954977,
        8.47446537, -5.73602629, 10.18783   ])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • order_expression – Order the values in the bins by this expression.

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller values at 1, and larger values at -1)

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Ndarray containing the first elements.

Return type:

numpy.array
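
A plain-Python sketch of the idea: within each bin, keep the value whose ordering key is smallest. The function binned_first is hypothetical; treating "first" as the smallest order_expression value and using half-open limits are both assumptions.

```python
def binned_first(values, order, bin_by, limits, shape):
    """Per bin, return the value with the smallest ordering key (or None)."""
    vmin, vmax = limits
    width = (vmax - vmin) / shape
    best = [None] * shape  # each entry becomes an (order_key, value) pair
    for value, key, b in zip(values, order, bin_by):
        if not (vmin <= b < vmax):
            continue  # outside the limits, not assigned to any bin
        index = min(int((b - vmin) / width), shape - 1)
        if best[index] is None or key < best[index][0]:
            best[index] = (key, value)
    return [None if entry is None else entry[1] for entry in best]


values = [10.0, 20.0, 30.0, 40.0]
order = [3, 1, 2, 0]
bin_by = [0.1, 0.2, 1.5, 5.0]
firsts = binned_first(values, order, bin_by, limits=(0.0, 2.0), shape=2)
```

Flipping the comparison to keep the largest ordering key would give the corresponding "last" element per bin.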

get_active_fraction()[source]#

Value in the range (0, 1], to work only with a subset of rows.

get_column_names(virtual=True, strings=True, hidden=False, regex=None, dtype=None)[source]#

Return a list of column names

Example:

>>> import vaex
>>> df = vaex.from_scalars(x=1, x2=2, y=3, s='string')
>>> df['r'] = (df.x**2 + df.y**2)**2
>>> df.get_column_names()
['x', 'x2', 'y', 's', 'r']
>>> df.get_column_names(virtual=False)
['x', 'x2', 'y', 's']
>>> df.get_column_names(regex='x.*')
['x', 'x2']
>>> df.get_column_names(dtype='string')
['s']
Parameters:
  • virtual – If False, skip virtual columns

  • hidden – If False, skip hidden columns

  • strings – If False, skip string columns

  • regex – Only return column names matching the (optional) regular expression

  • dtype – Only return column names with the given dtype. Can be a single or a list of dtypes.

Return type:

list of str
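
The regex filter in the example above can be approximated with the standard re module. This sketch assumes the pattern is matched from the start of each name (re.match semantics); the helper filter_names is hypothetical.

```python
import re


def filter_names(names, regex=None):
    """Return the names matching `regex`, or all names when regex is None."""
    if regex is None:
        return list(names)
    pattern = re.compile(regex)
    # Assumption: the pattern is anchored at the start of the name.
    return [name for name in names if pattern.match(name)]


names = ["x", "x2", "y", "s", "r"]
x_names = filter_names(names, regex="x.*")
```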

get_current_row()[source]#

Individual rows can be ‘picked’, this is the index (integer) of the current row, or None if nothing is picked.

get_names(hidden=False)[source]#

Return a list of column names and variable names.

get_private_dir(create=False)[source]#

Each DataFrame has a directory where files are stored for metadata etc.

Example

>>> import vaex
>>> ds = vaex.example()
>>> ds.get_private_dir()
'/Users/users/breddels/.vaex/dfs/_Users_users_breddels_vaex-testing_data_helmi-dezeeuw-2000-10p.hdf5'
Parameters:

create (bool) – if True, it will create the directory if it does not exist

get_selection(name='default')[source]#

Get the current selection object (mostly for internal use atm).

get_variable(name)[source]#

Returns the variable given by name, it will not evaluate it.

For evaluation, see DataFrame.evaluate_variable(), see also DataFrame.set_variable()

has_current_row()[source]#

Returns True/False if there currently is a picked row.

has_selection(name='default')[source]#

Returns True if there is a selection with the given name.

head(n=10)[source]#

Return a shallow copy of a DataFrame with the first n rows.

head_and_tail_print(n=5)[source]#

Display the first and last n elements of a DataFrame.

healpix_count(expression=None, healpix_expression=None, healpix_max_level=12, healpix_level=8, binby=None, limits=None, shape=128, delay=False, progress=None, selection=None)[source]#

Count non-missing values for expression on an array which represents healpix data.

Parameters:
  • expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows

  • healpix_expression – {healpix_max_level}

  • healpix_max_level – {healpix_max_level}

  • healpix_level – {healpix_level}

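  • binby – List of expressions for constructing a binned grid; these dimensions follow the first healpix dimension.

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False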

Returns:

is_category(column)[source]#

Returns true if column is a category.

is_local()[source]#

Returns True if the DataFrame is local, False when a DataFrame is remote.

is_masked(column)[source]#

Return whether a column is a masked (numpy.ma) column.

kurtosis(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]#

Calculate the kurtosis for the given expression, possibly on a grid defined by binby.

Example:

>>> df.kurtosis('vz')
0.33414303
>>> df.kurtosis("vz", binby=["E"], shape=4)
array([0.35286113, 0.14455428, 0.52955107, 5.06716345])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
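
For reference, one common definition of kurtosis is the excess (Fisher) form, the fourth central moment divided by the squared variance, minus 3. The sketch below implements that definition in plain Python; that the library uses exactly this convention is an assumption, and the helper excess_kurtosis is hypothetical.

```python
def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 * m2) - 3.0


k = excess_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
```

Under this convention a normal distribution has a kurtosis of 0, and flatter distributions such as the evenly spaced sample above give negative values.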

last(expression, order_expression=None, binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None, array_type=None)[source]#

Return the last element of a binned expression, where the values in each bin are sorted by order_expression.

Parameters:
  • expression – The value to be placed in the bin.

  • order_expression – Order the values in the bins by this expression.

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • edges – Currently for internal use only (it includes NaNs and values outside the limits at the borders: NaN at 0, smaller values at 1, and larger values at -1)

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Ndarray containing the last elements.

Return type:

numpy.array

length_original()[source]#

The full length of the DataFrame, independent of active_fraction or filtering. This is the real length of the underlying ndarrays.

length_unfiltered()[source]#

The length of the arrays that should be considered (respecting active range), but without filtering.

limits(expression, value=None, square=False, selection=None, delay=False, progress=None, shape=None)[source]#

Calculate the [min, max] range for expression, as described by value, which is ‘minmax’ by default.

If value is a list of the form [minvalue, maxvalue], it is simply returned, this is for convenience when using mixed forms.

Example:

>>> import vaex
>>> df = vaex.example()
>>> df.limits("x")
array([-128.293991,  271.365997])
>>> df.limits("x", "99.7%")
array([-28.86381927,  28.9261226 ])
>>> df.limits(["x", "y"])
(array([-128.293991,  271.365997]), array([ -71.5523682,  146.465836 ]))
>>> df.limits(["x", "y"], "99.7%")
(array([-28.86381927,  28.9261226 ]), array([-28.60476934,  28.96535249]))
>>> df.limits(["x", "y"], ["minmax", "90%"])
(array([-128.293991,  271.365997]), array([-13.37438402,  13.4224423 ]))
>>> df.limits(["x", "y"], ["minmax", [0, 10]])
(array([-128.293991,  271.365997]), [0, 10])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • value – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns:

List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list

limits_percentage(expression, percentage=99.73, square=False, selection=False, progress=None, delay=False)[source]#

Calculate the [min, max] range for expression, containing approximately a percentage of the data as defined by percentage.

The range is symmetric around the median, i.e., for a percentage of 90, this gives the same results as:

Example:

>>> df.limits_percentage("x", 90)
array([-12.35081376,  12.14858052])
>>> df.percentile_approx("x", 5), df.percentile_approx("x", 95)
(array([-12.36813152]), array([ 12.13275818]))

NOTE: this value is approximated by calculating the cumulative distribution on a grid.

NOTE 2: The values above are not exactly the same, since percentile and limits_percentage do not share the same code

Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • percentage (float) – Value between 0 and 100

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

Returns:

List in the form [[xmin, xmax], [ymin, ymax], …. ,[zmin, zmax]] or [xmin, xmax] when expression is not a list
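
The relation between percentage and the two percentiles (90 maps to the 5th and 95th) can be written out explicitly. This plain-Python sketch sorts the data and interpolates linearly, whereas the library approximates the cumulative distribution on a grid, so results will differ slightly; both helper names are hypothetical.

```python
def percentile_bounds(percentage):
    """Map a central percentage to (lower, upper) percentiles."""
    tail = (100.0 - percentage) / 2.0
    return tail, 100.0 - tail


def limits_from_percentage(values, percentage):
    """Return (low, high) containing the central `percentage` of the data."""
    data = sorted(values)

    def percentile(p):
        # Linear interpolation between the two closest ranks.
        position = (len(data) - 1) * p / 100.0
        lower = int(position)
        upper = min(lower + 1, len(data) - 1)
        fraction = position - lower
        return data[lower] + (data[upper] - data[lower]) * fraction

    low_p, high_p = percentile_bounds(percentage)
    return percentile(low_p), percentile(high_p)


bounds = percentile_bounds(90)
limits = limits_from_percentage(list(range(101)), 90)
```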

materialize(column=None, inplace=False, virtual_column=None)[source]#

Turn columns into native CPU format for optimal performance at cost of memory.

Warning

This may use a lot of memory, be mindful.

Virtual columns will be evaluated immediately, and all real columns will be cached in memory when used for the first time.

Example for virtual column:

>>> x = np.arange(1,4)
>>> y = np.arange(2,5)
>>> df = vaex.from_arrays(x=x, y=y)
>>> df['r'] = (df.x**2 + df.y**2)**0.5 # 'r' is a virtual column (computed on the fly)
>>> df = df.materialize('r')  # now 'r' is a 'real' column (i.e. a numpy array)

Example with parquet file:

>>> df = vaex.open('somewhatslow.parquet')
>>> df.x.sum()  # slow
>>> df = df.materialize()
>>> df.x.sum()  # slow, but will fill the cache
>>> df.x.sum()  # as fast as possible, will use memory

Parameters:
  • column – string or list of strings with column names to materialize, all columns when None

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

  • virtual_column – for backward compatibility

max(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]#

Calculate the maximum for given expressions, possibly on a grid defined by binby.

Example:

>>> df.max("x")
array(271.365997)
>>> df.max(["x", "y"])
array([ 271.365997,  146.465836])
>>> df.max("x", binby="x", shape=5, limits=[-10, 10])
array([-6.00010443, -2.00002384,  1.99998057,  5.99983597,  9.99984646])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

mean(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]#

Calculate the mean for expression, possibly on a grid defined by binby.

Example:

>>> df.mean("x")
-0.067131491264005971
>>> df.mean("(x**2+y**2)**0.5", binby="E", shape=4)
array([  2.43483742,   4.41840721,   8.26742458,  15.53846476])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

median_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=256, percentile_limits='minmax', selection=False, delay=False, progress=None)[source]#

Calculate the median, possibly on a grid defined by binby.

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’

  • percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

min(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]#

Calculate the minimum for given expressions, possibly on a grid defined by binby.

Example:

>>> df.min("x")
array(-128.293991)
>>> df.min(["x", "y"])
array([-128.293991 ,  -71.5523682])
>>> df.min("x", binby="x", shape=5, limits=[-10, 10])
array([-9.99919128, -5.99972439, -1.99991322,  2.0000093 ,  6.0004878 ])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

minmax(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Calculate the minimum and maximum for expressions, possibly on a grid defined by binby.

Example:

>>> df.minmax("x")
array([-128.293991,  271.365997])
>>> df.minmax(["x", "y"])
array([[-128.293991 ,  271.365997 ],
           [ -71.5523682,  146.465836 ]])
>>> df.minmax("x", binby="x", shape=5, limits=[-10, 10])
array([[-9.99919128, -6.00010443],
           [-5.99972439, -2.00002384],
           [-1.99991322,  1.99998057],
           [ 2.0000093 ,  5.99983597],
           [ 6.0004878 ,  9.99984646]])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic, the last dimension is of shape (2)
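
To make the binby, limits and shape arguments concrete, the following pure-Python sketch mimics what a binned minmax computes. It is an illustration under stated assumptions, not vaex's implementation: the helper name binned_minmax is hypothetical, bins are assumed to be equal-width and half-open, and rows whose binning value falls outside limits are assumed to be skipped.

```python
import math

def binned_minmax(values, by, limits, shape):
    """Min and max of `values` per bin of `by` (hypothetical helper, not vaex API)."""
    lo, hi = limits
    width = (hi - lo) / shape
    # Start each bin as [+inf, -inf] so any real value replaces it.
    result = [[math.inf, -math.inf] for _ in range(shape)]
    for v, b in zip(values, by):
        if not (lo <= b < hi):
            continue  # assumption: rows outside the limits are skipped
        i = min(int((b - lo) / width), shape - 1)
        result[i][0] = min(result[i][0], v)
        result[i][1] = max(result[i][1], v)
    return result

x = [-9.5, -7.0, -3.0, -2.5, 0.5, 1.0, 4.0, 9.0]
# Analogous in spirit to df.minmax("x", binby="x", shape=5, limits=[-10, 10])
grid = binned_minmax(x, by=x, limits=(-10, 10), shape=5)
```

The result holds one [min, max] pair per bin, which is what the trailing dimension of shape (2) in the return value refers to.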

mode(expression, binby=[], limits=None, shape=256, mode_shape=64, mode_limits=None, progressbar=False, selection=None)[source]#

Calculate/estimate the mode.

mutual_information(x, y=None, dimension=2, mi_limits=None, mi_shape=256, binby=[], limits=None, shape=128, sort=False, selection=False, delay=False)[source]#

Estimate the mutual information between x and y on a grid with shape mi_shape and mi_limits, possibly on a grid defined by binby.

The x and y arguments can be single expressions or lists of expressions:

  • If x and y are single expressions, it computes the mutual information between x and y;

  • If x is a list of expressions and y is a single expression, it computes the mutual information between each expression in x and the expression in y;

  • If x is a list of expressions and y is None, it computes the mutual information matrix amongst all expressions in x;

  • If x is a list of tuples of length 2, it computes the mutual information for the specified dimension pairs;

  • If x and y are lists of expressions, it computes the mutual information matrix defined by the two expression lists.

If sort is True, the mutual information is returned in sorted (descending) order and the list of expressions is returned in the same order.

Example:

>>> import vaex
>>> df = vaex.example()
>>> df.mutual_information("x", "y")
array(0.1511814526380327)
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]])
array([ 0.15118145,  0.18439181,  1.07067379])
>>> df.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]], sort=True)
(array([ 1.07067379,  0.18439181,  0.15118145]),
[['E', 'Lz'], ['x', 'z'], ['x', 'y']])
>>> df.mutual_information(x=['x', 'y', 'z'])
array([[3.53535106, 0.06893436, 0.11656418],
       [0.06893436, 3.49414866, 0.14089177],
       [0.11656418, 0.14089177, 3.96144906]])
>>> df.mutual_information(x=['x', 'y', 'z'], y=['E', 'Lz'])
array([[0.32316291, 0.16110026],
       [0.36573065, 0.17802792],
       [0.35239151, 0.21677695]])
Parameters:
  • x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • binby – List of expressions for constructing a binned grid

  • sort – return mutual information in sorted (descending) order, and also return the corresponding list of expressions when sort is True

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
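
As a rough picture of what estimating mutual information on a grid means, the sketch below bins two sequences into a 2D histogram and sums p(x,y) * log(p(x,y) / (p(x) p(y))) over the occupied cells. This is a generic textbook estimator, not vaex's code: the function name is hypothetical, and the use of natural logarithms (nats) is an assumption.

```python
import math
from collections import Counter

def mutual_information_grid(x, y, bins, x_limits, y_limits):
    """Estimate I(x; y) in nats from a bins-by-bins 2D histogram (illustrative only)."""
    def bin_index(value, limits):
        lo, hi = limits
        i = int((value - lo) / (hi - lo) * bins)
        return min(max(i, 0), bins - 1)  # clamp edge values into the grid

    joint = Counter((bin_index(a, x_limits), bin_index(b, y_limits))
                    for a, b in zip(x, y))
    n = sum(joint.values())
    px, py = Counter(), Counter()
    for (i, j), c in joint.items():
        px[i] += c  # marginal counts along x
        py[j] += c  # marginal counts along y
    mi = 0.0
    for (i, j), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) written in terms of counts
        mi += p_xy * math.log(p_xy * n * n / (px[i] * py[j]))
    return mi

same = [0, 1, 2, 3] * 25
mi_identical = mutual_information_grid(same, same, 4, (0, 4), (0, 4))
mi_independent = mutual_information_grid([0, 0, 1, 1], [0, 1, 0, 1], 2, (0, 2), (0, 2))
```

A variable paired with itself yields its own binned entropy, which is why the diagonal of the matrix in the example is much larger than the off-diagonal entries.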

property nbytes#

Alias for df.byte_size(), see DataFrame.byte_size().

nop(expression=None, progress=False, delay=False)[source]#

Evaluates expression or a list of expressions, and drops the result. Useful for benchmarking, since vaex is usually lazy.

Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

Returns:

None

percentile_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=1024, percentile_limits='minmax', selection=False, delay=False, progress=None)[source]#

Calculate the percentile given by percentage, possibly on a grid defined by binby.

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits.

Example:

>>> df.percentile_approx("x", 10), df.percentile_approx("x", 90)
(array([-8.3220355]), array([ 7.92080358]))
>>> df.percentile_approx("x", 50, binby="x", shape=5, limits=[-10, 10])
array([[-7.56462982],
           [-3.61036641],
           [-0.01296306],
           [ 3.56697863],
           [ 7.45838367]])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • percentile_limits – description for the min and max values to use for the cumulative histogram, should currently only be ‘minmax’

  • percentile_shape – shape for the array where the cumulative histogram is calculated on, integer type

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
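
The approximation described in the NOTE can be pictured with a small pure-Python sketch: build a histogram with percentile_shape bins between the minimum and maximum, accumulate it, and interpolate inside the bin where the cumulative count crosses the requested percentage. The function name and the linear interpolation step are assumptions made for illustration; vaex's own implementation may differ in detail.

```python
def percentile_from_histogram(values, percentage, percentile_shape=1024):
    """Approximate a percentile via a cumulative histogram (illustrative only)."""
    lo, hi = min(values), max(values)  # plays the role of percentile_limits='minmax'
    if lo == hi:
        return lo
    width = (hi - lo) / percentile_shape
    counts = [0] * percentile_shape
    for v in values:
        i = min(int((v - lo) / width), percentile_shape - 1)
        counts[i] += 1
    target = percentage / 100.0 * len(values)
    cumulative = 0
    for i, c in enumerate(counts):
        if c and cumulative + c >= target:
            # Interpolate linearly inside the bin where the target is crossed.
            fraction = (target - cumulative) / c
            return lo + (i + fraction) * width
        cumulative += c
    return hi

data = list(range(1001))  # 0, 1, ..., 1000; the exact median is 500
approx_median = percentile_from_histogram(data, 50, percentile_shape=100)
bin_width = (max(data) - min(data)) / 100
```

The error of such an estimate is bounded by roughly one bin width, so a larger percentile_shape trades memory for accuracy.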

plot2d_contour(x=None, y=None, what='count(*)', limits=None, shape=256, selection=None, f='identity', figsize=None, xlabel=None, ylabel=None, aspect='auto', levels=None, fill=False, colorbar=False, colorbar_label=None, colormap=None, colors=None, linewidths=None, linestyles=None, vmin=None, vmax=None, grid=None, show=None, **kwargs)#

Plot contours on a 2D grid.

Parameters:
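  • x – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • y – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]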

  • what – What to plot, count(*) will show a N-d histogram, mean(‘x’), the mean of the x column, sum(‘x’) the sum, std(‘x’) the standard deviation, correlation(‘vx’, ‘vy’) the correlation coefficient. Can also be a list of values, like [‘count(x)’, std(‘vx’)], (by default maps to column)

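  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections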

  • f – transform values by: ‘identity’ does nothing, ‘log’ or ‘log10’ will show the log of the value

  • figsize – (x, y) tuple passed to plt.figure for setting the figure size

  • xlabel – label of the x-axis (defaults to param x)

  • ylabel – label of the y-axis (defaults to param y)

  • aspect – the aspect ratio of the figure

  • levels – the contour levels to be passed on to plt.contour or plt.contourf

  • colorbar – plot a colorbar or not

  • colorbar_label – the label of the colourbar (defaults to param what)

  • colormap – matplotlib colormap to pass on to plt.contour or plt.contourf

  • colors – the colours of the contours

  • linewidths – the widths of the contours

  • linestyles – the style of the contour lines

  • vmin – instead of automatic normalization, scale the data between vmin and vmax

  • vmax – see vmin

  • grid – {grid}

  • show

plot3d(x, y, z, vx=None, vy=None, vz=None, vwhat=None, limits=None, grid=None, what='count(*)', shape=128, selection=[None, True], f=None, vcount_limits=None, smooth_pre=None, smooth_post=None, grid_limits=None, normalize='normalize', colormap='afmhot', figure_key=None, fig=None, lighting=True, level=[0.1, 0.5, 0.9], opacity=[0.01, 0.05, 0.1], level_width=0.1, show=True, **kwargs)[source]#

Use at own risk, requires ipyvolume

plot_bq(x, y, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, **kwargs)[source]#

Deprecated: use plot_widget

plot_widget(x, y, limits=None, f='identity', **kwargs)[source]#

Deprecated: use df.widget.heatmap

propagate_uncertainties(columns, depending_variables=None, cov_matrix='auto', covariance_format='{}_{}_covariance', uncertainty_format='{}_uncertainty')[source]#

Propagates uncertainties (full covariance matrix) for a set of virtual columns.

Covariance matrix of the depending variables is guessed by finding columns prefixed by “e” or “e_” or postfixed by “_error”, “_uncertainty”, “e” and “_e”. Off diagonals (covariance or correlation) by postfixes with “_correlation” or “_corr” for correlation or “_covariance” or “_cov” for covariances. (Note that x_y_cov = x_e * y_e * x_y_correlation.)

Example

>>> import vaex, numpy as np
>>> df = vaex.from_scalars(x=1, y=2, e_x=0.1, e_y=0.2)
>>> df["u"] = df.x + df.y
>>> df["v"] = np.log10(df.x)
>>> df.propagate_uncertainties([df.u, df.v])
>>> df.u_uncertainty, df.v_uncertainty
Parameters:
  • columns – list of columns for which to calculate the covariance matrix.

  • depending_variables – If not given, it is found out automatically, otherwise a list of columns which have uncertainties.

  • cov_matrix – List of list with expressions giving the covariance matrix, in the same order as depending_variables. If ‘full’ or ‘auto’, the covariance matrix for the depending_variables will be guessed, where ‘full’ gives an error if an entry was not found.
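
For the example values above, the expected uncertainties can be checked by hand with standard first-order (linear) error propagation. The sketch assumes that this is the rule being applied and that x and y are uncorrelated (no x_y_correlation column is present); neither assumption is confirmed by this page, and the variable names are chosen for illustration only.

```python
import math

# Values from the example: x=1, y=2 with uncertainties e_x=0.1, e_y=0.2.
x, y = 1.0, 2.0
e_x, e_y = 0.1, 0.2
x_y_correlation = 0.0  # assumption: x and y are uncorrelated

# Covariance from correlation, as in the note: x_y_cov = x_e * y_e * x_y_correlation.
x_y_cov = e_x * e_y * x_y_correlation

# u = x + y  ->  du/dx = 1, du/dy = 1
u_uncertainty = math.sqrt(e_x**2 + e_y**2 + 2 * x_y_cov)

# v = log10(x)  ->  dv/dx = 1 / (x * ln 10), dv/dy = 0
dv_dx = 1.0 / (x * math.log(10))
v_uncertainty = abs(dv_dx) * e_x
```

With a non-zero correlation, the cross term 2 * x_y_cov would increase or decrease the uncertainty on u accordingly.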

remove_virtual_meta()[source]#

Removes the file with the virtual column etc, it does not change the current virtual columns etc.

rename(name, new_name, unique=False)[source]#

Renames a column or variable, and rewrites expressions such that they refer to the new name.

rolling(window, trim=False, column=None, fill_value=None, edge='right')[source]#

Create a vaex.rolling.Rolling rolling window object

Parameters:
  • window (int) – Size of the rolling window.

  • trim (bool) – Trim off begin or end of dataframe to avoid missing values

  • column (str or list[str]) – Column name or column names of columns affected (None for all)

  • fill_value (any) – Scalar value to use for data outside of existing rows.

  • edge (str) – Where the edge of the rolling window is for the current row.

sample(n=None, frac=None, replace=False, weights=None, random_state=None)[source]#

Returns a DataFrame with a random set of rows

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Provide either n or frac.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df
  #  s      x
  0  a      1
  1  b      2
  2  c      3
  3  d      4
>>> df.sample(n=2, random_state=42) # 2 random rows, fixed seed
  #  s      x
  0  b      2
  1  d      4
>>> df.sample(frac=1, random_state=42) # 'shuffling'
  #  s      x
  0  c      3
  1  a      1
  2  d      4
  3  b      2
>>> df.sample(frac=1, replace=True, random_state=42) # useful for bootstrap (may contain repeated samples)
  #  s      x
  0  d      4
  1  a      1
  2  a      1
  3  d      4
Parameters:
  • n (int) – number of samples to take (default 1 if frac is None)

  • frac (float) – fraction of rows to take

  • replace (bool) – If true, a row may be drawn multiple times

  • weights (str or expression) – (unnormalized) probability that a row can be drawn

  • random_state (int or RandomState) – Sets a seed or RandomState for reproducibility. When None, a random seed is chosen.

Returns:

Returns a new DataFrame with a shallow copy/view of the underlying data

Return type:

DataFrame
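
The interplay of n, frac, replace, weights and random_state can be sketched as choosing row indices, which also reflects that only a view of the rows is taken. The helper below is hypothetical and uses Python's random module rather than NumPy's RandomState, so the rows it picks will not match the outputs shown above; rounding frac * length to an integer is likewise an assumption.

```python
import random

def sample_indices(length, n=None, frac=None, replace=False,
                   weights=None, random_state=None):
    """Pick row indices following the n/frac/replace/weights options (illustrative only)."""
    if n is None:
        n = 1 if frac is None else int(round(frac * length))
    rng = random.Random(random_state)  # same seed -> same rows
    rows = list(range(length))
    if replace:
        # A row may be drawn multiple times; weights need not be normalized.
        return rng.choices(rows, weights=weights, k=n)
    if weights is None:
        return rng.sample(rows, n)
    # Weighted draw without replacement: draw one row at a time and remove it.
    remaining = list(weights)
    picked = []
    for _ in range(n):
        i = rng.choices(range(len(rows)), weights=remaining, k=1)[0]
        picked.append(rows.pop(i))
        remaining.pop(i)
    return picked

shuffled = sample_indices(4, frac=1, random_state=42)                 # a permutation
bootstrap = sample_indices(4, frac=1, replace=True, random_state=42)  # may repeat rows
```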

schema()[source]#

Similar to df.dtypes, but returns a dict

schema_arrow(reduce_large=False)[source]#

Similar to schema(), but returns an arrow schema

Parameters:

reduce_large (bool) – change large_string to normal string

select(boolean_expression, mode='replace', name='default', executor=None)[source]#

Perform a selection, defined by the boolean expression, and combined with the previous selection using the given mode.

Selections are recorded in a history tree, per name, undo/redo can be done for them separately.
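
How a mode combines the new boolean mask with the previous selection can be sketched with plain Python lists. This is an illustration, not vaex's internals: the function name is hypothetical, and reading subtract as "previous and not new" is an assumption.

```python
def combine_selection(previous, new, mode="replace"):
    """Combine two boolean row masks (illustrative only; not vaex's internals)."""
    if mode == "replace":
        return list(new)
    operations = {
        "and": lambda p, n: p and n,
        "or": lambda p, n: p or n,
        "xor": lambda p, n: p != n,
        "subtract": lambda p, n: p and not n,  # assumption: previous minus new
    }
    op = operations[mode]
    return [op(p, n) for p, n in zip(previous, new)]

previous = [True, True, False, False]
new = [True, False, True, False]
combined = {mode: combine_selection(previous, new, mode)
            for mode in ("replace", "and", "or", "xor", "subtract")}
```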

Parameters:
  • boolean_expression (str) – Any valid column expression, with comparison operators

  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract

  • name (str) – history tree or selection ‘slot’ to use

  • executor

Returns:

select_box(spaces, limits, mode='replace', name='default')[source]#

Select a n-dimensional rectangular box bounded by limits.

The following examples are equivalent:

>>> df.select_box(['x', 'y'], [(0, 10), (0, 1)])
>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
Parameters:
  • spaces – list of expressions

  • limits – sequence of shape [(x1, x2), (y1, y2)]

  • mode

  • name

Returns:

select_circle(x, y, xc, yc, r, mode='replace', name='default', inclusive=True)[source]#

Select a circular region centred on xc, yc, with a radius of r.

Example:

>>> df.select_circle('x','y',2,3,1)
Parameters:
  • x – expression for the x space

  • y – expression for the y space

  • xc – location of the centre of the circle in x

  • yc – location of the centre of the circle in y

  • r – the radius of the circle

  • name – name of the selection

  • mode

Returns:

select_ellipse(x, y, xc, yc, width, height, angle=0, mode='replace', name='default', radians=False, inclusive=True)[source]#

Select an elliptical region centred on xc, yc, with a certain width, height and angle.

Example:

>>> df.select_ellipse('x','y', 2, -1, 5,1, 30, name='my_ellipse')
Parameters:
  • x – expression for the x space

  • y – expression for the y space

  • xc – location of the centre of the ellipse in x

  • yc – location of the centre of the ellipse in y

  • width – the width of the ellipse (diameter)

  • height – the height of the ellipse (diameter)

  • angle – (degrees) orientation of the ellipse, counter-clockwise measured from the y axis

  • name – name of the selection

  • mode

Returns:

select_inverse(name='default', executor=None)[source]#

Invert the selection, i.e. what is selected will not be, and vice versa

Parameters:
  • name (str)

  • executor

Returns:

select_lasso(expression_x, expression_y, xsequence, ysequence, mode='replace', name='default', executor=None)[source]#

For performance reasons, a lasso selection is handled differently.

Parameters:
  • expression_x (str) – Name/expression for the x coordinate

  • expression_y (str) – Name/expression for the y coordinate

  • xsequence – list of x numbers defining the lasso, together with y

  • ysequence

  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract

  • name (str)

  • executor

Returns:

select_non_missing(drop_nan=True, drop_masked=True, column_names=None, mode='replace', name='default')[source]#

Create a selection that selects rows having non missing values for all columns in column_names.

The name reflects Pandas, no rows are really dropped, but a mask is kept to keep track of the selection

Parameters:
  • drop_nan – drop rows when there is a NaN in any of the columns (will only affect float values)

  • drop_masked – drop rows when there is a masked value in any of the columns

  • column_names – The columns to consider, default: all (real, non-virtual) columns

  • mode (str) – Possible boolean operator: replace/and/or/xor/subtract

  • name (str) – history tree or selection ‘slot’ to use

Returns:

select_nothing(name='default')[source]#

Select nothing.

select_rectangle(x, y, limits, mode='replace', name='default')[source]#

Select a 2d rectangular box in the space given by x and y, bounded by limits.

Example:

>>> df.select_rectangle('x', 'y', [(0, 10), (0, 1)])
Parameters:
  • x – expression for the x space

  • y – expression for the y space

  • limits – sequence of shape [(x1, x2), (y1, y2)]

  • mode

selected_length()[source]#

Returns the number of rows that are selected.

selection_can_redo(name='default')[source]#

Can selection name be redone?

selection_can_undo(name='default')[source]#

Can selection name be undone?

selection_redo(name='default', executor=None)[source]#

Redo selection, for the name.

selection_undo(name='default', executor=None)[source]#

Undo selection, for the name.

set_active_fraction(value)[source]#

Sets the active_fraction, sets the picked row to None, and removes the selection.

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_active_range(i1, i2)[source]#

Sets the active range (from row i1 to row i2), sets the picked row to None, and removes the selection.

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_current_row(value)[source]#

Set the current row, and emit the signal signal_pick.

set_selection(selection, name='default', executor=None)[source]#

Sets the selection object

Parameters:
  • selection – Selection object

  • name – selection ‘slot’

  • executor

Returns:

set_variable(name, expression_or_value, write=True)[source]#

Set the variable to an expression or value defined by expression_or_value.

Example

>>> df.set_variable("a", 2.)
>>> df.set_variable("b", "a**2")
>>> df.get_variable("b")
'a**2'
>>> df.evaluate_variable("b")
4.0
Parameters:
  • name – Name of the variable

  • expression_or_value – value or expression

  • write – write variable to meta file

shift(periods, column=None, fill_value=None, trim=False, inplace=False)[source]#

Shift a column or multiple columns by periods amounts of rows.

Parameters:
  • periods (int) – Shift column forward (when positive) or backwards (when negative)

  • column (str or list[str]) – Column or list of columns to shift (default is all).

  • fill_value – Value to use instead of missing values.

  • trim (bool) – Do not include rows that would otherwise have missing values

  • inplace – If True, make modifications to self, otherwise return a new DataFrame
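
The effect of periods, fill_value and trim can be sketched on a plain Python list. The helper is hypothetical, and it assumes that a positive periods moves values towards later rows, with the vacated rows treated as missing.

```python
def shift_column(values, periods, fill_value=None, trim=False):
    """Shift a list by `periods` rows (illustrative only; not vaex's implementation)."""
    n = len(values)
    if periods >= 0:
        k = min(periods, n)
        shifted = [fill_value] * k + list(values[:n - k])
        kept = range(k, n)       # rows that received a real value
    else:
        k = min(-periods, n)
        shifted = list(values[k:]) + [fill_value] * k
        kept = range(0, n - k)
    if trim:
        # Drop the rows that would otherwise hold missing values.
        return [shifted[i] for i in kept]
    return shifted

forward = shift_column([1, 2, 3, 4], 1)
backward = shift_column([1, 2, 3, 4], -1)
filled = shift_column([1, 2, 3, 4], 1, fill_value=0)
trimmed = shift_column([1, 2, 3, 4], 1, trim=True)
```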

shuffle(random_state=None)[source]#

Shuffle order of rows (equivalent to df.sample(frac=1))

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c']), x=np.arange(1,4))
>>> df
  #  s      x
  0  a      1
  1  b      2
  2  c      3
>>> df.shuffle(random_state=42)
  #  s      x
  0  a      1
  1  b      2
  2  c      3
Parameters:

random_state (int or RandomState) – Sets a seed or RandomState for reproducibility. When None, a random seed is chosen.

Returns:

Returns a new DataFrame with a shallow copy/view of the underlying data

Return type:

DataFrame

skew(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]#

Calculate the skew for the given expression, possibly on a grid defined by binby.

Example:

>>> df.skew("vz")
0.02116528
>>> df.skew("vz", binby=["E"], shape=4)
array([-0.069976  , -0.01003445,  0.05624177, -2.2444322 ])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

sort(by, ascending=True)[source]#

Return a sorted DataFrame, sorted by the expression ‘by’.

Both ‘by’ and ‘ascending’ arguments can be lists. Note that missing/nan/NA values will always be pushed to the end, no matter the sorting order.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Note

Note that filtering will be ignored (since filters may change), you may want to consider running extract() first.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df['y'] = (df.x-1.8)**2
>>> df
  #  s      x     y
  0  a      1  0.64
  1  b      2  0.04
  2  c      3  1.44
  3  d      4  4.84
>>> df.sort('y', ascending=False)  # Note: passing '(x-1.8)**2' gives the same result
  #  s      x     y
  0  d      4  4.84
  1  c      3  1.44
  2  a      1  0.64
  3  b      2  0.04
Parameters:
  • by (str or expression or list of str/expressions) – expression to sort by.

  • ascending (bool or list of bools) – ascending (default, True) or descending (False).
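
The rule that missing values end up last regardless of sort order can be sketched by sorting row indices, which also mirrors the idea that only a view of the data is made. The helper name is hypothetical and the code is a plain-Python illustration, not how vaex sorts internally.

```python
import math

def sort_rows(values, ascending=True):
    """Return row indices sorted by value, with None/NaN always last (illustrative only)."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    present = [i for i, v in enumerate(values) if not is_missing(v)]
    missing = [i for i, v in enumerate(values) if is_missing(v)]
    # Only the non-missing rows take part in the ordering.
    present.sort(key=lambda i: values[i], reverse=not ascending)
    return present + missing

y = [0.64, float("nan"), 1.44, 0.04, None, 4.84]
order_ascending = sort_rows(y)
order_descending = sort_rows(y, ascending=False)
```

In both orders the rows holding NaN and None (indices 1 and 4) come after all rows with real values.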

split(into=None)[source]#

Returns a list containing ordered subsets of the DataFrame.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Example:

>>> import vaex
>>> df = vaex.from_arrays(x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for dfs in df.split(into=0.3):
...     print(dfs.x.values)
...
[0 1 2]
[3 4 5 6 7 8 9]
>>> for dfs in df.split(into=[0.2, 0.3, 0.5]):
...     print(dfs.x.values)
[0 1]
[2 3 4]
[5 6 7 8 9]
Parameters:

into (int/float/list) – If float will split the DataFrame in two, the first of which will have a relative length as specified by this parameter. When a list, will split into as many portions as elements in the list, where each element defines the relative length of that portion. Note that such a list of fractions will always be re-normalized to 1. When an int, split DataFrame into n dataframes of equal length (last one may deviate), if len(df) < n, it will return len(df) DataFrames.
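
How the three forms of into translate to row ranges can be sketched as follows. The helper is hypothetical: treating an int as n equal fractions and rounding the cumulative fractions to row boundaries are assumptions, so the exact boundaries vaex picks for uneven splits may differ.

```python
def split_ranges(length, into):
    """Return (start, stop) row ranges for each portion (illustrative only)."""
    if isinstance(into, int):
        fractions = [1.0] * min(into, length)  # n equal portions, at most one per row
    elif isinstance(into, float):
        fractions = [into, 1.0 - into]         # two portions
    else:
        fractions = list(into)
    total = sum(fractions)                     # re-normalize the fractions to 1
    edges, cumulative = [0], 0.0
    for f in fractions:
        cumulative += f / total
        edges.append(int(round(cumulative * length)))
    edges[-1] = length                         # guard against floating-point drift
    return list(zip(edges[:-1], edges[1:]))

two_parts = split_ranges(10, 0.3)
three_parts = split_ranges(10, [0.2, 0.3, 0.5])
renormalized = split_ranges(10, [2, 3, 5])
```

For a DataFrame of 10 rows these ranges correspond to the portions printed in the example: rows 0 to 2 and 3 to 9 for into=0.3, and rows 0 to 1, 2 to 4 and 5 to 9 for into=[0.2, 0.3, 0.5].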

split_random(into, random_state=None)[source]#

Returns a list containing random portions of the DataFrame.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Example:

>>> import vaex, numpy as np
>>> np.random.seed(111)
>>> df = vaex.from_arrays(x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for dfs in df.split_random(into=0.3, random_state=42):
...     print(dfs.x.values)
...
[8 1 5]
[0 7 2 9 4 3 6]
>>> for dfs in df.split_random(into=[0.2, 0.3, 0.5], random_state=42):
...     print(dfs.x.values)
[8 1]
[5 0 7]
[2 9 4 3 6]
Parameters:
  • into (int/float/list) – If float will split the DataFrame in two, the first of which will have a relative length as specified by this parameter. When a list, will split into as many portions as elements in the list, where each element defines the relative length of that portion. Note that such a list of fractions will always be re-normalized to 1. When an int, split DataFrame into n dataframes of equal length (last one may deviate), if len(df) < n, it will return len(df) DataFrames.

  • random_state (int or RandomState) – Sets a seed or RandomState for reproducibility. When None, a random seed is chosen.

Returns:

A list of DataFrames.

Return type:

list

state_load(file, use_active_range=False, keep_columns=None, set_filter=True, trusted=True, fs_options=None, fs=None)[source]#

Load a state previously stored by DataFrame.state_write(), see also DataFrame.state_set().

Parameters:
  • file (str) – filename (ending in .json or .yaml)

  • use_active_range (bool) – Whether to use the active range or not.

  • keep_columns (list) – List of columns that should be kept if the state to be set contains fewer columns.

  • set_filter (bool) – Set the filter from the state (default), or leave the filter as it is.

  • fs_options (dict) – arguments to pass to the file system handler (s3fs or gcsfs)

  • fs – Pass a file system object directly, see vaex.open()

state_write(file, fs_options=None, fs=None)[source]#

Write the internal state to a json or yaml file (see DataFrame.state_get())

Example

>>> import vaex
>>> df = vaex.from_scalars(x=1, y=2)
>>> df['r'] = (df.x**2 + df.y**2)**0.5
>>> df.state_write('state.json')
>>> print(open('state.json').read())
{
"virtual_columns": {
    "r": "(((x ** 2) + (y ** 2)) ** 0.5)"
},
"column_names": [
    "x",
    "y",
    "r"
],
"renamed_columns": [],
"variables": {
    "pi": 3.141592653589793,
    "e": 2.718281828459045,
    "km_in_au": 149597870.7,
    "seconds_per_year": 31557600
},
"functions": {},
"selections": {
    "__filter__": null
},
"ucds": {},
"units": {},
"descriptions": {},
"description": null,
"active_range": [
    0,
    1
]
}
>>> df.state_write('state.yaml')
>>> print(open('state.yaml').read())
active_range:
- 0
- 1
column_names:
- x
- y
- r
description: null
descriptions: {}
functions: {}
renamed_columns: []
selections:
  __filter__: null
ucds: {}
units: {}
variables:
  pi: 3.141592653589793
  e: 2.718281828459045
  km_in_au: 149597870.7
  seconds_per_year: 31557600
virtual_columns:
  r: (((x ** 2) + (y ** 2)) ** 0.5)
Parameters:
  • file (str) – filename (ending in .json or .yaml)

  • fs_options (dict) – arguments to pass to the file system handler (s3fs or gcsfs)

  • fs – Pass a file system object directly, see vaex.open()
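
Because the state is plain JSON (or YAML), it can be inspected without vaex. The sketch below parses a trimmed copy of the JSON shown above with the standard library and separates real from virtual columns; the variable names are illustrative only.

```python
import json

# A trimmed copy of the JSON state printed in the example above.
state_text = """
{
  "virtual_columns": {"r": "(((x ** 2) + (y ** 2)) ** 0.5)"},
  "column_names": ["x", "y", "r"],
  "selections": {"__filter__": null},
  "active_range": [0, 1]
}
"""

state = json.loads(state_text)
virtual_columns = state["virtual_columns"]

# Columns that are listed but have no expression are backed by real data.
real_columns = [name for name in state["column_names"] if name not in virtual_columns]

print(real_columns)
print(virtual_columns["r"])
```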

std(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, array_type=None)[source]#

Calculate the standard deviation for the given expression, possibly on a grid defined by binby

>>> df.std("vz")
110.31773397535071
>>> df.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
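
What "on a grid defined by binby" means can be sketched in plain Python for the one-dimensional case: rows are assigned to shape equal-width bins of the binby expression between the limits, and the statistic is computed per bin. binned_std is a hypothetical helper, and the use of the population standard deviation (and NaN for empty bins) is an assumption, not a statement about vaex internals.

```python
import statistics


def binned_std(values, bin_values, limits, shape):
    """Standard deviation of `values` per equal-width bin of `bin_values`."""
    low, high = limits
    width = (high - low) / shape
    bins = [[] for _ in range(shape)]
    for value, b in zip(values, bin_values):
        if low <= b < high:  # rows outside the limits are ignored in this sketch
            index = min(shape - 1, int((b - low) / width))
            bins[index].append(value)
    # An empty bin has no defined standard deviation; report NaN for it.
    return [statistics.pstdev(rows) if rows else float("nan") for rows in bins]


vz = [1.0, 3.0, 10.0, 14.0]
r = [0.5, 1.5, 2.5, 3.5]
print(binned_std(vz, r, limits=[0, 4], shape=2))
```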

sum(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, edges=False, array_type=None)[source]#

Calculate the sum for the given expression, possibly on a grid defined by binby

Example:

>>> df.sum("L")
304054882.49378014
>>> df.sum("L", binby="E", shape=4)
array([  8.83517994e+06,   5.92217598e+07,   9.55218726e+07,
                 1.40008776e+08])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
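
The progress parameter accepts a callable that receives a fraction between 0 and 1 and cancels the calculation by returning False. A minimal sketch of such a callable follows; make_progress_callback and run_with_progress are hypothetical names, and run_with_progress only stands in for a library computation so that the sketch runs on its own.

```python
import time


def make_progress_callback(timeout_seconds):
    """Build a progress callable that cancels once `timeout_seconds` have elapsed."""
    start = time.monotonic()
    seen = []

    def on_progress(fraction):
        seen.append(fraction)  # fraction is a float between 0 and 1
        # Returning False asks the caller to cancel the calculation.
        return (time.monotonic() - start) < timeout_seconds

    return on_progress, seen


def run_with_progress(n_steps, progress):
    """Hypothetical driver: reports progress after each step, stops on False."""
    for step in range(1, n_steps + 1):
        if progress(step / n_steps) is False:
            return False  # cancelled
    return True


callback, fractions = make_progress_callback(timeout_seconds=60)
finished = run_with_progress(4, callback)
print(finished, fractions)
```

With vaex, a callable like on_progress would be passed as progress=on_progress to a method such as sum.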

tail(n=10)[source]#

Return a shallow copy of a DataFrame with the last n rows.

take(indices, filtered=True, dropfilter=True)[source]#

Returns a DataFrame containing only rows indexed by indices

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Example:

>>> import vaex, numpy as np
>>> df = vaex.from_arrays(s=np.array(['a', 'b', 'c', 'd']), x=np.arange(1,5))
>>> df.take([0,2])
 #  s      x
 0  a      1
 1  c      3
Parameters:
  • indices – sequence (list or numpy array) with row numbers

  • filtered – (for internal use) The indices refer to the filtered data.

  • dropfilter – (for internal use) Drop the filter, set to False when indices refer to unfiltered, but may contain rows that still need to be filtered out.

Returns:

DataFrame which is a shallow copy of the original data.

Return type:

DataFrame

to_arrays(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type=None)[source]#

Return a list of ndarrays

Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in length of this size

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

list of arrays

to_arrow_table(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, reduce_large=False)[source]#

Returns an arrow Table object containing the arrays corresponding to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in length of this size

  • reduce_large (bool) – If possible, cast large_string to normal string

Returns:

pyarrow.Table object, or an iterator of them when chunk_size is given

to_astropy_table(column_names=None, selection=None, strings=True, virtual=True, index=None, parallel=True)[source]#

Returns an astropy table object containing the ndarrays corresponding to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • index – if this column is given it is used for the index of the DataFrame

Returns:

astropy.table.Table object

to_dask_array(chunks='auto')[source]#

Lazily expose the DataFrame as a dask.array

Example

>>> df = vaex.example()
>>> A = df[['x', 'y', 'z']].to_dask_array()
>>> A
dask.array<vaex-df-1f048b40-10ec-11ea-9553, shape=(330000, 3), dtype=float64, chunksize=(330000, 3), chunktype=numpy.ndarray>
>>> A+1
dask.array<add, shape=(330000, 3), dtype=float64, chunksize=(330000, 3), chunktype=numpy.ndarray>
Parameters:

chunks – How to chunk the array, similar to dask.array.from_array().

Returns:

dask.array.Array object.

to_dict(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type=None)[source]#

Return a dict containing the ndarray corresponding to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in length of this size

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

dict

to_items(column_names=None, selection=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type=None)[source]#

Return a list of [(column_name, ndarray), …] pairs where the ndarray corresponds to the evaluated data

Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in length of this size

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

list of (name, ndarray) pairs, or an iterator of such lists when chunk_size is given

to_pandas_df(column_names=None, selection=None, strings=True, virtual=True, index_name=None, parallel=True, chunk_size=None, array_type=None)[source]#

Return a pandas DataFrame containing the ndarray corresponding to the evaluated data

If index_name is given, that column is used for the index of the DataFrame.

Example

>>> df_pandas = df.to_pandas_df(["x", "y", "z"])
>>> df_copy = vaex.from_pandas(df_pandas)
Parameters:
  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • index_name – if this column is given it is used for the index of the DataFrame

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in length of this size

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

pandas.DataFrame object, or an iterator of them when chunk_size is given

to_records(index=None, selection=None, column_names=None, strings=True, virtual=True, parallel=True, chunk_size=None, array_type='python')[source]#

Return a list of [{column_name: value}, …] “records” where each dict is an evaluated row.

Parameters:
  • index – an index to use to get the record of a specific row when provided

  • column_names – list of column names, to export, when None DataFrame.get_column_names(strings=strings, virtual=virtual) is used

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • strings – argument passed to DataFrame.get_column_names when column_names is None

  • virtual – argument passed to DataFrame.get_column_names when column_names is None

  • parallel – Evaluate the (virtual) columns in parallel

  • chunk_size – Return an iterator with cuts of the object in length of this size

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

list of [{column_name:value}, …] records
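
The container shapes returned by to_arrays, to_dict, to_items and to_records differ only in how the same columns are arranged. The plain-Python sketch below shows that relationship with lists standing in for ndarrays; it does not call vaex, and the variable names are illustrative.

```python
# Two columns of three rows, standing in for evaluated column data.
columns = {"x": [1, 2, 3], "y": [1.5, 2.5, 3.5]}

as_arrays = list(columns.values())   # shape of to_arrays: one array per column
as_dict = dict(columns)              # shape of to_dict: name -> array
as_items = list(columns.items())     # shape of to_items: (name, array) pairs

# Shape of to_records: one dict per row, built by walking the columns in step.
as_records = [dict(zip(columns, row)) for row in zip(*columns.values())]

print(as_items)
print(as_records)
```

Records are row-oriented, so building them touches every value; the column-oriented forms are the cheaper choice for large data.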

trim(inplace=False)[source]#

Return a DataFrame, where all columns are ‘trimmed’ by the active range.

For the returned DataFrame, df.get_active_range() returns (0, df.length_original()).

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Parameters:

inplace – If True, make modifications to self, otherwise return a new DataFrame

Return type:

DataFrame

ucd_find(ucds, exclude=[])[source]#

Find a set of columns (names) which have the ucd, or part of the ucd.

Prefixed with a ^, it will only match the first part of the ucd.

Example

>>> df.ucd_find(['pos.eq.ra', 'pos.eq.dec'])
['RA', 'DEC']
>>> df.ucd_find(['pos.eq.ra', 'doesnotexist'])
>>> df.ucds[df.ucd_find('pos.eq.ra')]
'pos.eq.ra;meta.main'
>>> df.ucd_find('meta.main')
'dec'
>>> df.ucd_find('^meta.main')
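
The matching rule described above (substring match by default, prefix match when the pattern starts with ^) can be sketched without vaex. find_column_by_ucd and the UCD table are hypothetical; returning the first matching column is an assumption made for this sketch.

```python
def find_column_by_ucd(ucds_by_column, ucd, exclude=()):
    """Return the first column whose UCD matches `ucd`, or None."""
    for name, column_ucd in ucds_by_column.items():
        if name in exclude:
            continue
        if ucd.startswith("^"):
            # "^" anchors the pattern to the start of the column's UCD.
            matched = column_ucd.startswith(ucd[1:])
        else:
            # Otherwise any part of the UCD may match.
            matched = ucd in column_ucd
        if matched:
            return name
    return None


# Hypothetical column -> UCD table.
ucds = {
    "RA": "pos.eq.ra;meta.main",
    "DEC": "pos.eq.dec;meta.main",
    "parallax": "pos.parallax",
}

print(find_column_by_ucd(ucds, "meta.main"))
print(find_column_by_ucd(ucds, "^meta.main"))
print(find_column_by_ucd(ucds, "^pos.eq", exclude=["RA"]))
```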
unique(expression, return_inverse=False, dropna=False, dropnan=False, dropmissing=False, progress=False, selection=None, axis=None, delay=False, limit=None, limit_raise=True, array_type='python')[source]#

Returns all unique values.

Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • return_inverse – Return the inverse mapping from unique values to original values.

  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • dropnan – Drop rows with NaN values

  • dropmissing – Drop rows with missing values

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • axis (int) – Axis over which to determine the unique elements (None will flatten arrays or lists)

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • limit (int) – Limit the amount of results

  • limit_raise (bool) – Raise vaex.RowLimitException when limit is exceeded, or return at maximum ‘limit’ amount of results.

  • array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list
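
The difference between dropnan, dropmissing and dropna can be shown with a small stand-alone sketch, in which None stands in for a missing value. unique_values is a hypothetical helper; the order of the result (first appearance) is an assumption and should not be relied on for the library.

```python
import math


def unique_values(values, dropnan=False, dropmissing=False, dropna=False):
    """Unique values, optionally dropping NaN and/or missing (None) entries."""
    seen = []
    has_nan = False
    for value in values:
        is_missing = value is None
        is_nan = isinstance(value, float) and math.isnan(value)
        if is_missing and (dropmissing or dropna):
            continue
        if is_nan and (dropnan or dropna):
            continue
        if is_nan:
            # NaN never compares equal to itself, so track it separately.
            if not has_nan:
                has_nan = True
                seen.append(value)
        elif value not in seen:
            seen.append(value)
    return seen


data = [1.0, float("nan"), None, 1.0, 2.0, float("nan")]
print(unique_values(data))
print(unique_values(data, dropnan=True))
print(unique_values(data, dropna=True))
```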

unit(expression, default=None)[source]#

Returns the unit (an astropy.units.Unit object) for the expression.

Example

>>> import vaex
>>> df = vaex.example()
>>> df.unit("x")
Unit("kpc")
>>> df.unit("x*L")
Unit("km kpc2 / s")
Parameters:
  • expression – Expression, which can be a column name

  • default – if no unit is known, it will return this

Returns:

The resulting unit of the expression

Return type:

astropy.units.Unit

validate_expression(expression)[source]#

Validate an expression (may throw Exceptions)

var(expression, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None, array_type=None)[source]#

Calculate the sample variance for the given expression, possibly on a grid defined by binby

Example:

>>> df.var("vz")
12170.002429456246
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 15271.90481083,   7284.94713504,   3738.52239232,   1449.63418988])
>>> df.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)**0.5
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
Parameters:
  • expression – expression or list of expressions, e.g. df.x, ‘x’, or [‘x’, ‘y’]

  • binby – List of expressions for constructing a binned grid

  • limits – description for the min and max values for the expressions, e.g. ‘minmax’ (default), ‘99.7%’, [0, 10], or a list of, e.g. [[0, 10], [0, 20], ‘minmax’]

  • shape – shape for the array where the statistic is calculated on, if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • delay – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • array_type – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

Returns:

Numpy array with the given shape, or a scalar when no binby argument is given, with the statistic

DataFrameLocal class#

class vaex.dataframe.DataFrameLocal(dataset=None, name=None)[source]#

Bases: DataFrame

Base class for DataFrames that work with local file/data

__array__(dtype=None, parallel=True)[source]#

Gives a full memory copy of the DataFrame into a 2d numpy array of shape (n_rows, n_columns). Note that the memory order is fortran, so all values of 1 column are contiguous in memory for performance reasons.

Note this returns the same result as:

>>> np.array(ds)

If any of the columns contain masked arrays, the masks are ignored (i.e. the masked elements are returned as well).

__call__(*expressions, **kwargs)[source]#

The local implementation of DataFrame.__call__()

__getstate__()[source]#

Helper for pickle.

__init__(dataset=None, name=None)[source]#
as_arrow()[source]#

Lazily cast all columns to arrow, except object types.

as_numpy(strict=False)[source]#

Lazily cast all numerical columns to numpy.

If strict is True, it will also cast non-numerical types.

binby(by=None, agg=None, sort=False, copy=True, delay=False, progress=None)[source]#

Return a BinBy or DataArray object when agg is not None

The binby operation does not return a ‘flat’ DataFrame, instead it returns an N-d grid in the form of an xarray.

Parameters:
  • agg (dict, list or agg) – Aggregate operation in the form of a string, vaex.agg object, a dictionary where the keys indicate the target column names, and the values the operations, or a list of aggregates. When not given, it will return the binby object.

  • copy (bool) – Copy the dataframe (shallow, does not cost memory) so that the fingerprint of the original dataframe is not modified.

  • delay (bool) – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns:

DataArray or BinBy object.

categorize(column, min_value=0, max_value=None, labels=None, inplace=False)[source]#

Mark column as categorical.

This may help speed up calculations using integer columns between a range of [min_value, max_value].

If max_value is not given, both min_value and max_value are calculated from the data.

Example:

>>> import vaex
>>> df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
>>> df = df.categorize('year', min_value=2012, max_value=2019)
>>> df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
>>> df
  #    year    weekday
  0    2012          0
  1    2015          4
  2    2019          6
>>> df.is_category('year')
True
Parameters:
  • column – column to assume is categorical.

  • min_value – minimum integer value (if max_value is not given, this is calculated)

  • max_value – maximum integer value (if max_value is not given, this is calculated)

  • labels – Labels to associate to each value between min_value and max_value, list(range(min_value, max_value+1)) by default

  • inplace – If True, make modifications to self, otherwise return a new DataFrame
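
How min_value, max_value and labels relate can be sketched in plain Python: each integer value maps to a zero-based code relative to min_value, and the code indexes the labels. category_codes is a hypothetical helper that only illustrates the mapping, not how vaex stores categories.

```python
def category_codes(values, min_value, max_value, labels=None):
    """Map integer values in [min_value, max_value] to codes and labels."""
    if labels is None:
        # Default labels are simply the integer values themselves.
        labels = list(range(min_value, max_value + 1))
    codes = [value - min_value for value in values]
    return codes, labels


year_codes, year_labels = category_codes([2012, 2015, 2019], 2012, 2019)
weekday_codes, weekday_labels = category_codes(
    [0, 4, 6], 0, 6, labels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
)

print(year_codes)
print([weekday_labels[code] for code in weekday_codes])
```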

compare(other, report_missing=True, report_difference=False, show=10, orderby=None, column_names=None)[source]#

Compare two DataFrames and report their difference, use with care for large DataFrames

concat(*others, resolver='flexible') DataFrame[source]#

Concatenates multiple DataFrames, adding the rows of the other DataFrame to the current, returned in a new DataFrame.

In the case of resolver=’flexible’, when not all columns have the same names, the missing data is filled with missing values.

In the case of resolver=’strict’ all datasets need to have matching column names.

Parameters:
  • others – The other DataFrames that are concatenated with this DataFrame

  • resolver (str) – How to resolve schema conflicts, ‘flexible’ or ‘strict’.

Returns:

New DataFrame with the rows concatenated
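
The two resolver modes can be illustrated with dicts of lists standing in for DataFrames. concat_columns is a hypothetical helper; None plays the role of a missing value, and raising ValueError in strict mode is an assumption about the error type.

```python
def concat_columns(frames, resolver="flexible"):
    """Concatenate dicts of column name -> list, row-wise."""
    # Collect all column names in order of first appearance.
    names = []
    for frame in frames:
        for name in frame:
            if name not in names:
                names.append(name)

    if resolver == "strict":
        # Every frame must have exactly the same column names.
        for frame in frames:
            if set(frame) != set(names):
                raise ValueError("column names do not match")
    elif resolver != "flexible":
        raise ValueError(f"unknown resolver: {resolver!r}")

    result = {name: [] for name in names}
    for frame in frames:
        n_rows = len(next(iter(frame.values()))) if frame else 0
        for name in names:
            # A column absent from this frame is filled with missing values.
            result[name].extend(frame.get(name, [None] * n_rows))
    return result


a = {"x": [1, 2], "y": [3, 4]}
b = {"x": [5]}
print(concat_columns([a, b]))
```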

copy(column_names=None, treeshake=False)[source]#

Make a shallow copy of a DataFrame. One can also specify a subset of columns.

This is a fairly cheap operation, since no memory copies of the underlying data are made.

Note

Note that no copy of the underlying data is made, only a view/reference is made.

Parameters:
  • column_names (list) – A subset of columns to use for the DataFrame copy. If None, all the columns are copied.

  • treeshake (bool) – Get rid of variables not used.

Return type:

DataFrame

property data#

Gives direct access to the data as numpy arrays.

Convenient when working with IPython in combination with small DataFrames, since this gives tab-completion. Only real columns (i.e. not virtual columns) can be accessed; to get the data from virtual columns, use DataFrame.evaluate(…).

Columns can be accessed by their names, which are attributes. The attributes are of type numpy.ndarray.

Example:

>>> df = vaex.example()
>>> r = np.sqrt(df.data.x**2 + df.data.y**2)
export(path, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None, **kwargs)[source]#

Exports the DataFrame to a file depending on the file extension.

E.g. if the filename ends in .hdf5, df.export_hdf5 is called.
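
Dispatch on the file extension can be sketched as a lookup table. Only the .hdf5 entry is stated in this documentation; the other extension-to-method pairs below are assumptions based on the method names, and pick_exporter is a hypothetical helper.

```python
import os

# Assumed mapping from file extension to export method name.
EXPORTERS = {
    ".hdf5": "export_hdf5",
    ".arrow": "export_arrow",
    ".feather": "export_feather",
    ".parquet": "export_parquet",
    ".csv": "export_csv",
    ".json": "export_json",
    ".fits": "export_fits",
}


def pick_exporter(path):
    """Return the name of the export method that matches the path's extension."""
    extension = os.path.splitext(path)[1].lower()
    try:
        return EXPORTERS[extension]
    except KeyError:
        raise ValueError(f"unsupported file extension: {extension!r}") from None


print(pick_exporter("results/stars.hdf5"))
```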

Parameters:
  • path (str) – path for file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration, if supported.

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}

export_arrow(to, progress=None, chunk_size=1048576, parallel=True, reduce_large=True, fs_options=None, fs=None, as_stream=True)[source]#

Exports the DataFrame to a file or stream written with arrow

Parameters:
  • to – filename, file object, or pyarrow.RecordBatchStreamWriter, pyarrow.RecordBatchFileWriter or pyarrow.parquet.ParquetWriter

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • reduce_large (bool) – If True, convert arrow large_string type to string type

  • as_stream (bool) – Write as an Arrow stream if true, else a file. see also https://arrow.apache.org/docs/format/Columnar.html?highlight=arrow1#ipc-file-format

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}

export_csv(path, progress=None, chunk_size=1048576, parallel=True, backend='pandas', **kwargs)[source]#

Exports the DataFrame to a CSV file.

Parameters:
  • path (str) – path to the file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • backend (str) – Which backend to use, either ‘pandas’ or ‘arrow’. Arrow is considerably faster, but pandas is more flexible.

  • kwargs – additional keyword arguments are passed to the backends. See DataFrameLocal.export_csv_pandas() and DataFrameLocal.export_csv_arrow() for more details.

export_csv_arrow(to, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None)[source]#

Exports the DataFrame to a CSV file via PyArrow.

Parameters:
  • to (str) – path to the file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}

  • fs – Pass a file system object directly, see vaex.open()

export_csv_pandas(path, progress=None, chunk_size=1048576, parallel=True, **kwargs)[source]#

Exports the DataFrame to a CSV file via pandas.

Parameters:
  • path (str) – Path for file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel – Evaluate the (virtual) columns in parallel

  • kwargs – Extra keyword arguments to be passed on pandas.DataFrame.to_csv()

export_feather(to, parallel=True, reduce_large=True, compression='lz4', fs_options=None, fs=None)[source]#

Exports the DataFrame to an arrow file using the feather file format version 2

Feather is exactly represented as the Arrow IPC file format on disk, but also supports compression.

see also https://arrow.apache.org/docs/python/feather.html

Parameters:
  • to – filename or file object

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • reduce_large (bool) – If True, convert arrow large_string type to string type

  • compression – Can be one of ‘zstd’, ‘lz4’ or ‘uncompressed’

  • fs_options – see vaex.open() e.g. for S3 {“profile”: “myproject”}

  • fs – Pass a file system object directly, see vaex.open()

export_fits(path, progress=None)[source]#

Exports the DataFrame to a fits file that is compatible with TOPCAT colfits format

Parameters:
  • path (str) – path for file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

export_hdf5(path, byteorder='=', progress=None, chunk_size=1048576, parallel=True, column_count=1, writer_threads=0, group='/table', mode='w')[source]#

Exports the DataFrame to a vaex hdf5 file

Parameters:
  • path (str) – path for file

  • byteorder (str) – = for native, < for little endian and > for big endian

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • column_count (int) – How many columns to evaluate and export in parallel (>1 requires fast random access, like an SSD drive).

  • writer_threads (int) – Use threads for writing or not, only useful when column_count > 1.

  • group (str) – Write the data into a custom group in the hdf5 file.

  • mode (str) – If set to “w” (write), an existing file will be overwritten. If set to “a”, one can append additional data to the hdf5 file, but it needs to be in a different group.
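
The three byteorder characters follow a common convention that Python's struct module also uses, which makes their effect easy to see on a single 32-bit integer. This sketch only illustrates the convention; it does not show how vaex writes the file.

```python
import struct
import sys

value = 1

little = struct.pack("<i", value)  # least significant byte first
big = struct.pack(">i", value)     # most significant byte first
native = struct.pack("=i", value)  # whatever this machine uses

print(little, big)
print(sys.byteorder, native == little)
```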

export_json(to, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None)[source]#

Exports the DataFrame to a JSON file.

Parameters:
  • to – filename or file object

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel – Evaluate the (virtual) columns in parallel

  • fs_options – see vaex.open() e.g. for S3 {“profile”: “myproject”}

  • fs – Pass a file system object directly, see vaex.open()

export_many(path, progress=None, chunk_size=1048576, parallel=True, max_workers=None, fs_options=None, fs=None, **export_kwargs)[source]#

Export the DataFrame to multiple files of the same type in parallel.

The path will be formatted using the i parameter (which is the chunk index).

Example:

>>> import vaex
>>> df = vaex.open('my_big_dataset.hdf5')
>>> print(f'number of rows: {len(df):,}')
number of rows: 193,938,982
>>> df.export_many(path='my/destination/folder/chunk-{i:05}.arrow')
>>> df_single_chunk = vaex.open('my/destination/folder/chunk-00001.arrow')
>>> print(f'number of rows: {len(df_single_chunk):,}')
number of rows: 1,048,576
>>> df_all_chunks = vaex.open('my/destination/folder/chunk-*.arrow')
>>> print(f'number of rows: {len(df_all_chunks):,}')
number of rows: 193,938,982
Parameters:
  • path (str) – Path for file, formatted by chunk index i (e.g. ‘chunk-{i:05}.parquet’)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • max_workers (int) – Number of workers/threads to use for writing in parallel

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}
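
How the path template expands can be sketched with str.format. chunk_paths is a hypothetical helper; it assumes one file per chunk of chunk_size rows and a chunk index that starts at 0, neither of which is stated explicitly above.

```python
import math


def chunk_paths(path_template, n_rows, chunk_size=1048576):
    """List the file names a template would produce, one per chunk."""
    n_chunks = math.ceil(n_rows / chunk_size)
    # The template is formatted with the chunk index `i`.
    return [path_template.format(i=i) for i in range(n_chunks)]


paths = chunk_paths("chunk-{i:05}.parquet", n_rows=193_938_982)
print(len(paths), paths[0], paths[-1])
```

Zero-padding the index (as in {i:05}) keeps the files in order when they are listed or matched with a wildcard.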

export_parquet(path, progress=None, chunk_size=1048576, parallel=True, fs_options=None, fs=None, **kwargs)[source]#

Exports the DataFrame to a parquet file.

Note: This may require that all of the data fits into memory (memory mapped data is an exception).

Parameters:
  • path (str) – path for file

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}

  • fs – Pass a file system object directly, see vaex.open()

  • kwargs – Extra keyword arguments to be passed on to pyarrow.parquet.ParquetWriter.

Returns:

export_partitioned(path, by, directory_format='{key}={value}', progress=None, chunk_size=1048576, parallel=True, fs_options={}, fs=None)[source]#

Experimental: export files using hive partitioning.

If no extension is found in the path, we assume parquet files. Otherwise you can specify the format like a format string, where {i} is a zero-based index, {uuid} a unique id, and {subdir} the Hive key=value directory.

Example paths:
  • ‘/some/dir/{subdir}/{i}.parquet’

  • ‘/some/dir/{subdir}/fixed_name.parquet’

  • ‘/some/dir/{subdir}/{uuid}.parquet’

Parameters:
  • path – directory where to write the files to.

  • by (str or list of str) – Which column(s) to partition by.

  • directory_format (str) – format string for directories, default ‘{key}={value}’ for Hive layout.

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • chunk_size (int) – Number of rows to be written to disk in a single iteration

  • parallel (bool) – Evaluate the (virtual) columns in parallel

  • fs_options (dict) – see vaex.open() e.g. for S3 {“profile”: “myproject”}
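
How directory_format and the path placeholders combine can be sketched with ordinary string formatting. The helper below is hypothetical (it is not a vaex function) and assumes a single partition column; it only shows how a Hive-style key=value directory ends up in the final path.

```python
directory_format = '{key}={value}'
path_template = '/some/dir/{subdir}/{i}.parquet'


def partition_path(key, value, i):
    """Build one output path for a partition (illustrative helper, not vaex API)."""
    # The Hive directory name is the formatted key/value pair.
    subdir = directory_format.format(key=key, value=value)
    # {i} is the zero-based file index within the export.
    return path_template.format(subdir=subdir, i=i)


example_paths = [partition_path('year', year, 0) for year in (2014, 2015)]
print(example_paths)
```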

groupby(by=None, agg=None, sort=False, ascending=True, assume_sparse='auto', row_limit=None, copy=True, progress=None, delay=False)[source]#

Return a GroupBy object, or a DataFrame when agg is not None.

Examples:

>>> import vaex
>>> import numpy as np
>>> np.random.seed(42)
>>> x = np.random.randint(1, 5, 10)
>>> y = x**2
>>> df = vaex.from_arrays(x=x, y=y)
>>> df.groupby(df.x, agg='count')
#    x    y_count
0    3          4
1    4          2
2    1          3
3    2          1
>>> df.groupby(df.x, agg=[vaex.agg.count('y'), vaex.agg.mean('y')])
#    x    y_count    y_mean
0    3          4         9
1    4          2        16
2    1          3         1
3    2          1         4
>>> df.groupby(df.x, agg={'z': [vaex.agg.count('y'), vaex.agg.mean('y')]})
#    x    z_count    z_mean
0    3          4         9
1    4          2        16
2    1          3         1
3    2          1         4

Example using datetime:

>>> import vaex
>>> import numpy as np
>>> t = np.arange('2015-01-01', '2015-02-01', dtype=np.datetime64)
>>> y = np.arange(len(t))
>>> df = vaex.from_arrays(t=t, y=y)
>>> df.groupby(vaex.BinnerTime.per_week(df.t)).agg({'y' : 'sum'})
#  t                      y
0  2015-01-01 00:00:00   21
1  2015-01-08 00:00:00   70
2  2015-01-15 00:00:00  119
3  2015-01-22 00:00:00  168
4  2015-01-29 00:00:00   87
Parameters:
  • agg (dict, list or agg) – Aggregate operation in the form of a string, a vaex.agg object, a dictionary where the keys indicate the target column names and the values the operations, or a list of aggregates. When not given, it will return the groupby object.

  • sort (bool) – Sort columns for which we group by.

  • ascending (bool or list of bools) – ascending (default, True) or descending (False).

  • assume_sparse (bool or str) – Assume that when grouping by multiple keys, that the existing pairs are sparse compared to the cartesian product. If ‘auto’, let vaex decide (e.g. a groupby with 10_000 rows but only 4*3=12 combinations does not matter much to compress into say 8 existing combinations, and will save another pass over the data)

  • row_limit (int) – Limits the resulting dataframe to the number of rows (default is not to check, only works when assume_sparse is True). Throws a vaex.RowLimitException when the condition is not met.

  • copy (bool) – Copy the dataframe (shallow, does not cost memory) so that the fingerprint of the original dataframe is not modified.

  • delay (bool) – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

Returns:

DataFrame or GroupBy object.
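
The weekly sums in the datetime example can be cross-checked with NumPy alone. This is a sketch, assuming the weekly bins are counted from the first timestamp (which holds for this data, since it starts on 2015-01-01); it does not show how vaex bins internally.

```python
import numpy as np

t = np.arange('2015-01-01', '2015-02-01', dtype=np.datetime64)
y = np.arange(len(t))

# Assumption: week 0 starts at the first timestamp, each bin spans 7 days.
week_index = (t - t[0]) // np.timedelta64(7, 'D')

# Sum y within each weekly bin.
weekly_sum = np.bincount(week_index.astype(np.int64), weights=y).astype(np.int64)
print(weekly_sum.tolist())
```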

hashed(inplace=False) DataFrame[source]#

Return a DataFrame with a hashed dataset

is_local()[source]#

The local implementation of DataFrame.is_local(), always returns True.

join(other, on=None, left_on=None, right_on=None, lprefix='', rprefix='', lsuffix='', rsuffix='', how='left', allow_duplication=False, prime_growth=False, cardinality_other=None, inplace=False)[source]#

Return a DataFrame joined with other DataFrames, matched by columns/expression on/left_on/right_on

If neither on/left_on/right_on is given, the join is done by simply adding the columns (i.e. on the implicit row index).

Note: The filters will be ignored when joining, the full DataFrame will be joined (since filters may change). If either DataFrame is heavily filtered (contains just a small number of rows) consider running DataFrame.extract() first.

Example:

>>> import numpy as np
>>> import vaex
>>> a = np.array(['a', 'b', 'c'])
>>> x = np.arange(1,4)
>>> ds1 = vaex.from_arrays(a=a, x=x)
>>> b = np.array(['a', 'b', 'd'])
>>> y = x**2
>>> ds2 = vaex.from_arrays(b=b, y=y)
>>> ds1.join(ds2, left_on='a', right_on='b')
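
What a left join on a == b means for this data can be sketched in plain Python. This is not how vaex implements the join; it assumes the right-hand keys are unique and uses None for the missing values of unmatched rows.

```python
a = ['a', 'b', 'c']
x = [1, 2, 3]
b = ['a', 'b', 'd']
y = [1, 4, 9]

# Lookup from the right-hand key to its row index (keys assumed unique).
right_index = {key: i for i, key in enumerate(b)}

joined = []
for key, x_value in zip(a, x):
    j = right_index.get(key)
    # Every left row is kept; unmatched rows get missing right-hand values.
    joined.append((key, x_value,
                   b[j] if j is not None else None,
                   y[j] if j is not None else None))
print(joined)
```
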
Parameters:
  • other – Other DataFrame to join with (the right side)

  • on – default key for the left table (self)

  • left_on – key for the left table (self), overrides on

  • right_on – key for the right table (other), overrides on

  • lprefix – prefix to add to the left column names in case of a name collision

  • rprefix – similar for the right

  • lsuffix – suffix to add to the left column names in case of a name collision

  • rsuffix – similar for the right

  • how – how to join: ‘left’ keeps all rows on the left and adds columns (with possible missing values); ‘right’ is similar, with self and other swapped; ‘inner’ only returns rows which overlap.

  • allow_duplication (bool) – Allow duplication of rows when the joined column contains non-unique values.

  • cardinality_other (int) – Number of unique elements (or estimate of) for the other table.

  • prime_growth (bool) – Growth strategy for the hashmaps used internally, can improve performance in some cases (e.g. integers with low bits unused).

  • inplace – If True, make modifications to self, otherwise return a new DataFrame

Returns:

label_encode(column, values=None, inplace=False, lazy=False)#

Deprecated: use ordinal_encode

Encode column as ordinal values and mark it as categorical.

The existing column is renamed to a hidden column and replaced by a numerical column with values between [0, len(values)-1].

Parameters:

lazy – When False, it will materialize the ordinal codes.

length(selection=False)[source]#

Get the length of the DataFrame, for the selection or the whole DataFrame.

If selection is False, it returns len(df).

TODO: Implement this in DataFrameRemote, and move the method up in DataFrame.length()

Parameters:

selection – When True, will return the number of selected rows

Returns:

ordinal_encode(column, values=None, inplace=False, lazy=False)[source]#

Encode column as ordinal values and mark it as categorical.

The existing column is renamed to a hidden column and replaced by a numerical column with values between [0, len(values)-1].

Parameters:

lazy – When False, it will materialize the ordinal codes.
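
The idea of ordinal encoding can be sketched without vaex: each distinct value is assigned an integer code between 0 and len(values)-1. The ordering of values below (sorted) is an assumption made for the illustration only.

```python
column = ['red', 'red', 'blue', 'red', 'green']

# Assumed ordering of the distinct values; illustrative only.
values = sorted(set(column))

# Map every value to its position in `values`.
code_of = {value: code for code, value in enumerate(values)}
codes = [code_of[value] for value in column]
print(values, codes)
```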

selected_length(selection='default')[source]#

The local implementation of DataFrame.selected_length()

shallow_copy(virtual=True, variables=True)[source]#

Creates a (shallow) copy of the DataFrame.

It will link to the same data, but will have its own state, e.g. virtual columns, variables, selection etc.

property values#

Gives a full memory copy of the DataFrame into a 2d numpy array of shape (n_rows, n_columns). Note that the memory order is Fortran, so all values of one column are contiguous in memory, for performance reasons.

Note this returns the same result as:

>>> np.array(ds)

If any of the columns contain masked arrays, the masks are ignored (i.e. the masked elements are returned as well).
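
The effect of Fortran memory order can be checked with NumPy. The sketch below builds a small (n_rows, n_columns) array by hand, as a stand-in for the array this property returns; the column data is made up.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

# Stack the columns into shape (n_rows, n_columns) using Fortran order.
values = np.asfortranarray(np.column_stack([x, y]))

# In Fortran order, a single column is one contiguous block of memory.
first_column = values[:, 0]
print(values.shape, values.flags['F_CONTIGUOUS'], first_column.flags['C_CONTIGUOUS'])
```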

Date/time operations#

class vaex.expression.DateTime(expression)[source]#

Bases: object

DateTime operations

Usually accessed using e.g. df.birthday.dt.dayofweek

__init__(expression)[source]#
__weakref__#

list of weak references to the object

property date#

Return the date part of the datetime value

Returns:

an expression containing the date portion of a datetime value

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.date
Expression = dt_date(date)
Length: 3 dtype: datetime64[D] (expression)
-------------------------------------------
0  2009-10-12
1  2016-02-11
2  2015-11-12
property day#

Extracts the day from a datetime sample.

Returns:

an expression containing the day extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.day
Expression = dt_day(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  12
1  11
2  12
property day_name#

Returns the day names of a datetime sample in English.

Returns:

an expression containing the day names extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.day_name
Expression = dt_day_name(date)
Length: 3 dtype: str (expression)
---------------------------------
0    Monday
1  Thursday
2  Thursday
property dayofweek#

Obtain the day of the week with Monday=0 and Sunday=6

Returns:

an expression containing the day of week.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.dayofweek
Expression = dt_dayofweek(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  0
1  3
2  3
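
The Monday=0 convention is the same one used by datetime.weekday() in the Python standard library, so the values above can be cross-checked without vaex. This is only a sanity check on the sample dates, not a description of the vaex implementation.

```python
from datetime import datetime

samples = ['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22']
dates = [datetime.fromisoformat(s) for s in samples]

# weekday() counts from Monday=0 to Sunday=6.
weekdays = [d.weekday() for d in dates]
print(weekdays)
```
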
property dayofyear#

The ordinal day of the year.

Returns:

an expression containing the ordinal day of the year.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.dayofyear
Expression = dt_dayofyear(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  285
1   42
2  316
floor(freq, *args)#

Perform floor operation on an expression for a given frequency.

Parameters:

freq – The frequency level to floor the index to. Must be a fixed frequency like ‘S’ (second), or ‘H’ (hour), but not ‘ME’ (month end).

Returns:

an expression containing the floored datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.floor("H")
Expression = dt_floor(date, 'H')
Length: 3 dtype: datetime64[ns] (expression)
--------------------------------------------
0  2009-10-12 03:00:00.000000000
1  2016-02-11 10:00:00.000000000
2  2015-11-12 11:00:00.000000000
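
Flooring to a fixed frequency can be illustrated with NumPy by casting to a coarser datetime unit. This sketch assumes an hourly frequency and dates after the Unix epoch; it is an analogy to floor("H"), not the code vaex runs.

```python
import numpy as np

date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'],
                dtype='datetime64[s]')

# Casting to hour resolution discards minutes and seconds; cast back for display.
floored = date.astype('datetime64[h]').astype('datetime64[s]')
floored_strings = np.datetime_as_string(floored).tolist()
print(floored_strings)
```
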
property halfyear#

Return the half-year of the date. Values can be 1 and 2, for the first and second half of the year respectively.

Returns:

an expression containing the half-year extracted from the datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.halfyear
Expression = dt_halfyear(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  2
1  1
2  2
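
The half-year can be derived from the month number with integer arithmetic, and the quarter follows the same pattern. The formulas below are an assumption that is consistent with the sample output, not taken from the vaex source.

```python
months = [10, 2, 11]  # months of the three sample dates

# Months 1-6 map to 1, months 7-12 map to 2.
halfyear = [(m - 1) // 6 + 1 for m in months]
# Months 1-3 map to 1, ..., months 10-12 map to 4.
quarter = [(m - 1) // 3 + 1 for m in months]
print(halfyear, quarter)
```
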
property hour#

Extracts the hour out of a datetime sample.

Returns:

an expression containing the hour extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.hour
Expression = dt_hour(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0   3
1  10
2  11
property is_leap_year#

Check whether a year is a leap year.

Returns:

an expression which evaluates to True if a year is a leap year, and to False otherwise.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.is_leap_year
Expression = dt_is_leap_year(date)
Length: 3 dtype: bool (expression)
----------------------------------
0  False
1   True
2  False
property minute#

Extracts the minute out of a datetime sample.

Returns:

an expression containing the minute extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.minute
Expression = dt_minute(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  31
1  17
2  34
property month#

Extracts the month out of a datetime sample.

Returns:

an expression containing the month extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.month
Expression = dt_month(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  10
1   2
2  11
property month_name#

Returns the month names of a datetime sample in English.

Returns:

an expression containing the month names extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.month_name
Expression = dt_month_name(date)
Length: 3 dtype: str (expression)
---------------------------------
0   October
1  February
2  November
property quarter#

Return the quarter of the date. Values range from 1-4.

Returns:

an expression containing the quarter extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.quarter
Expression = dt_quarter(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  4
1  1
2  4
property second#

Extracts the second out of a datetime sample.

Returns:

an expression containing the second extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.second
Expression = dt_second(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0   0
1  34
2  22
strftime(date_format)#

Returns a formatted string from a datetime sample.

Returns:

an expression containing a formatted string extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.strftime("%Y-%m")
Expression = dt_strftime(date, '%Y-%m')
Length: 3 dtype: object (expression)
------------------------------------
0  2009-10
1  2016-02
2  2015-11
property weekofyear#

Returns the week ordinal of the year.

Returns:

an expression containing the week ordinal of the year, extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.weekofyear
Expression = dt_weekofyear(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  42
1   6
2  46
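
The sample output matches ISO 8601 week numbers, which the standard library exposes through isocalendar(). Whether vaex follows ISO numbering in every edge case (for example around New Year) is an assumption here; the sketch only confirms the three sample dates.

```python
from datetime import datetime

samples = ['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22']

# isocalendar() returns (ISO year, ISO week number, ISO weekday).
iso_weeks = [datetime.fromisoformat(s).isocalendar()[1] for s in samples]
print(iso_weeks)
```
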
property year#

Extracts the year out of a datetime sample.

Returns:

an expression containing the year extracted from a datetime column.

Example:

>>> import vaex
>>> import numpy as np
>>> date = np.array(['2009-10-12T03:31:00', '2016-02-11T10:17:34', '2015-11-12T11:34:22'], dtype=np.datetime64)
>>> df = vaex.from_arrays(date=date)
>>> df
  #  date
  0  2009-10-12 03:31:00
  1  2016-02-11 10:17:34
  2  2015-11-12 11:34:22
>>> df.date.dt.year
Expression = dt_year(date)
Length: 3 dtype: int64 (expression)
-----------------------------------
0  2009
1  2016
2  2015

Expression class#

class vaex.expression.Expression(ds, expression, ast=None, _selection=False)[source]#

Bases: object

Expression class

__abs__()[source]#

Returns the absolute value of the expression

__bool__()[source]#

Cast expression to boolean. Only supports (<expr1> == <expr2> and <expr1> != <expr2>)

The main use case for this is to support assigning to traitlets. e.g.:

>>> bool(expr1 == expr2)

This will return True when expr1 and expr2 are exactly the same (in string representation). And similarly for:

>>> bool(expr1 != expr2)

All other cases will return True.

__eq__(b)#

Return self==value.

__ge__(b)#

Return self>=value.

__getitem__(slicer)[source]#

Provides row and optional field access (struct arrays) via bracket notation.

Examples:

>>> import vaex
>>> import pyarrow as pa
>>> array = pa.StructArray.from_arrays(arrays=[[1, 2, 3], ["a", "b", "c"]], names=["col1", "col2"])
>>> df = vaex.from_arrays(array=array, integer=[5, 6, 7])
>>> df
#       array                       integer
0       {'col1': 1, 'col2': 'a'}        5
1       {'col1': 2, 'col2': 'b'}        6
2       {'col1': 3, 'col2': 'c'}        7
>>> df.integer[1:]
Expression = integer
Length: 2 dtype: int64 (column)
-------------------------------
0  6
1  7
>>> df.array[1:]
Expression = array
Length: 2 dtype: struct<col1: int64, col2: string> (column)
-----------------------------------------------------------
0  {'col1': 2, 'col2': 'b'}
1  {'col1': 3, 'col2': 'c'}
>>> df.array[:, "col1"]
Expression = struct_get(array, 'col1')
Length: 3 dtype: int64 (expression)
-----------------------------------
0  1
1  2
2  3
>>> df.array[1:, ["col1"]]
Expression = struct_project(array, ['col1'])
Length: 2 dtype: struct<col1: int64> (expression)
-------------------------------------------------
0  {'col1': 2}
1  {'col1': 3}
>>> df.array[1:, ["col2", "col1"]]
Expression = struct_project(array, ['col2', 'col1'])
Length: 2 dtype: struct<col2: string, col1: int64> (expression)
---------------------------------------------------------------
0  {'col2': 'b', 'col1': 2}
1  {'col2': 'c', 'col1': 3}
__gt__(b)#

Return self>value.

__hash__ = None#
__init__(ds, expression, ast=None, _selection=False)[source]#
__le__(b)#

Return self<=value.

__lt__(b)#

Return self<value.

__ne__(b)#

Return self!=value.

__or__(b)#

Return self|value.

__repr__()[source]#

Return repr(self).

__ror__(b)#

Return value|self.

__str__()[source]#

Return str(self).

__weakref__#

list of weak references to the object

abs(**kwargs)#

Lazy wrapper around numpy.abs

apply(f, vectorize=False, multiprocessing=True)[source]#

Apply a function along all values of an Expression.

Shorthand for df.apply(f, arguments=[expression]), see DataFrame.apply()

Example:

>>> df = vaex.example()
>>> df.x
Expression = x
Length: 330,000 dtype: float64 (column)
---------------------------------------
     0  -0.777471
     1    3.77427
     2    1.37576
     3   -7.06738
     4   0.243441
>>> def func(x):
...     return x**2
>>> df.x.apply(func)
Expression = lambda_function(x)
Length: 330,000 dtype: float64 (expression)
-------------------------------------------
     0   0.604461
     1    14.2451
     2    1.89272
     3    49.9478
     4  0.0592637
Parameters:
  • f – A function to be applied on the Expression values

  • vectorize – Call f with arrays instead of scalars (for better performance).

  • multiprocessing (bool) – Use multiple processes to avoid the GIL (Global interpreter lock).

Returns:

A function that is lazily evaluated when called.
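
The difference between the two calling conventions selected by vectorize can be sketched with NumPy. The loop below stands in for calling f once per scalar, the direct call for passing whole arrays; how vaex splits the data into chunks is not shown.

```python
import numpy as np


def func(x):
    return x**2


x = np.array([-0.777471, 3.77427, 1.37576])

# vectorize=False: f is called once per scalar value.
per_scalar = np.array([func(value) for value in x])

# vectorize=True: f is called with whole arrays, which avoids the Python loop.
per_array = func(x)
print(per_scalar, per_array)
```

Both conventions give the same result only if f works on arrays as well as on scalars, as func does here.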

arccos(**kwargs)#

Lazy wrapper around numpy.arccos

arccosh(**kwargs)#

Lazy wrapper around numpy.arccosh

arcsin(**kwargs)#

Lazy wrapper around numpy.arcsin

arcsinh(**kwargs)#

Lazy wrapper around numpy.arcsinh

arctan(**kwargs)#

Lazy wrapper around numpy.arctan

arctan2(**kwargs)#

Lazy wrapper around numpy.arctan2

arctanh(**kwargs)#

Lazy wrapper around numpy.arctanh

as_arrow()#

Lazily convert to Apache Arrow array type

as_numpy(strict=False)#

Lazily convert to NumPy ndarray type

property ast#

Returns the abstract syntax tree (AST) of the expression

clip(**kwargs)#

Lazy wrapper around numpy.clip

copy(df=None)[source]#

Efficiently copies an expression.

Expression objects have both a string and AST representation. Creating the AST representation involves parsing the expression, which is expensive.

Using copy will deepcopy the AST when the expression was already parsed.

Parameters:

df – DataFrame for which the expression will be evaluated (self.df if None)

cos(**kwargs)#

Lazy wrapper around numpy.cos

cosh(**kwargs)#

Lazy wrapper around numpy.cosh

count(binby=[], limits=None, shape=128, selection=False, delay=False, edges=False, progress=None)[source]#

Shortcut for ds.count(expression, …), see Dataset.count

countmissing()[source]#

Returns the number of missing values in the expression.

countna()[source]#

Returns the number of Not Available (N/A) values in the expression. This includes missing values and np.nan values.

countnan()[source]#

Returns the number of NaN values in the expression.
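
The three counts differ only in what they treat as absent. A NumPy masked array gives a small illustration, assuming masked entries play the role of missing values; vaex also covers other kinds of missing data that this sketch leaves out.

```python
import numpy as np

data = np.ma.array([1.0, np.nan, 3.0, 4.0], mask=[False, False, True, False])

mask = np.ma.getmaskarray(data)
is_missing = mask                          # masked entries
is_nan = np.isnan(data.data) & ~mask       # NaN among the unmasked entries
is_na = is_missing | is_nan                # either of the two

counts = (int(is_missing.sum()), int(is_nan.sum()), int(is_na.sum()))
print(counts)  # (missing, nan, na)
```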

deg2rad(**kwargs)#

Lazy wrapper around numpy.deg2rad

dependencies()[source]#

Get all dependencies of this expression, including ourselves

digitize(**kwargs)#

Lazy wrapper around numpy.digitize

dot_product(b)#

Compute the dot product between a and b.

Parameters:
  • a – A list of Expressions or a list of values (e.g. a vector)

  • b – A list of Expressions or a list of values (e.g. a vector)

Returns:

Vaex expression

property dt#

Gives access to datetime operations via DateTime

exp(**kwargs)#

Lazy wrapper around numpy.exp

expand(stop=[])[source]#

Expand the expression such that no virtual columns occur, only normal columns.

Example:

>>> import vaex
>>> import numpy as np
>>> df = vaex.example()
>>> r = np.sqrt(df.x**2 + df.y**2)
>>> r.expand().expression
'sqrt(((x ** 2) + (y ** 2)))'
expm1(**kwargs)#

Lazy wrapper around numpy.expm1

fillmissing(value)[source]#

Returns an array where missing values are replaced by value.

See ismissing() for the definition of missing values.

fillna(value)#

Returns an array where NA values are replaced by value. See isna() for the definition of NA values.

fillnan(value)#

Returns an array where NaN values are replaced by value. See isnan() for the definition of NaN values.

format(format)#

Uses http://www.cplusplus.com/reference/string/to_string/ for formatting

hashmap_apply(hashmap, check_missing=False)#

Apply values to hashmap. If check_missing is True, missing values in the hashmap will be translated to null/missing values.

isfinite(**kwargs)#

Lazy wrapper around numpy.isfinite

isin(values, use_hashmap=True)[source]#

Lazily tests if each value in the expression is present in values.

Parameters:
  • values – List/array of values to check

  • use_hashmap – use a hashmap or not (especially faster when values contains many elements)

Returns:

Expression with the lazy expression.
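
The benefit of a hashmap for membership tests can be sketched with a Python set, which offers constant-time lookups on average. This is an analogy for use_hashmap=True, not the vaex implementation, and the data is made up.

```python
values = {2, 3, 5}            # a set plays the role of the hashmap here
column = [1, 2, 3, 4, 5]

# One hash lookup per element, independent of how many values there are.
mask = [item in values for item in column]
print(mask)
```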

isinf(**kwargs)#

Lazy wrapper around numpy.isinf

ismissing()#

Returns True where there are missing values (masked arrays), missing strings or None

isna()#

Returns a boolean expression indicating if the values are Not Available (missing or NaN).

isnan()#

Returns a boolean expression that is True where the values are NaN.

kurtosis(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Shortcut for df.kurtosis(expression, …), see DataFrame.kurtosis

log(**kwargs)#

Lazy wrapper around numpy.log

log10(**kwargs)#

Lazy wrapper around numpy.log10

log1p(**kwargs)#

Lazy wrapper around numpy.log1p

map(mapper, nan_value=None, missing_value=None, default_value=None, allow_missing=False, axis=None)[source]#

Map values of an expression or in memory column according to an input dictionary or a custom callable function.

Example:

>>> import vaex
>>> df = vaex.from_arrays(color=['red', 'red', 'blue', 'red', 'green'])
>>> mapper = {'red': 1, 'blue': 2, 'green': 3}
>>> df['color_mapped'] = df.color.map(mapper)
>>> df
#  color      color_mapped
0  red                   1
1  red                   1
2  blue                  2
3  red                   1
4  green                 3
>>> import numpy as np
>>> df = vaex.from_arrays(type=[0, 1, 2, 2, 2, np.nan])
>>> df['role'] = df['type'].map({0: 'admin', 1: 'maintainer', 2: 'user', np.nan: 'unknown'})
>>> df
#    type  role
0       0  admin
1       1  maintainer
2       2  user
3       2  user
4       2  user
5     nan  unknown
>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(type=[0, 1, 2, 2, 2, 4])
>>> df['role'] = df['type'].map({0: 'admin', 1: 'maintainer', 2: 'user'}, default_value='unknown')
>>> df
#    type  role
0       0  admin
1       1  maintainer
2       2  user
3       2  user
4       2  user
5       4  unknown
Parameters:
  • mapper – dict like object used to map the values from keys to values

  • nan_value – value to be used when a nan is present (and not in the mapper)

  • missing_value – value to be used when there is a missing value

  • default_value – value to be used when a value is not in the mapper (like dict.get(key, default))

  • allow_missing – used to signal that values in the mapper should map to a masked array with missing values, assumed True when default_value is not None.

  • axis (bool) – Axis over which to determine the unique elements (None will flatten arrays or lists)

Returns:

A vaex expression

Return type:

vaex.expression.Expression
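
The role of default_value mirrors dict.get(key, default), as the parameter description notes. A plain-Python sketch of the last example above, with no vaex involved:

```python
mapper = {0: 'admin', 1: 'maintainer', 2: 'user'}
types = [0, 1, 2, 2, 2, 4]

# Values that are absent from the mapper fall back to the default.
roles = [mapper.get(value, 'unknown') for value in types]
print(roles)
```
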
property masked#

Alias to df.is_masked(expression)

max(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Shortcut for ds.max(expression, …), see Dataset.max

maximum(**kwargs)#

Lazy wrapper around numpy.maximum

mean(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Shortcut for ds.mean(expression, …), see Dataset.mean

min(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Shortcut for ds.min(expression, …), see Dataset.min

minimum(**kwargs)#

Lazy wrapper around numpy.minimum

minmax(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Shortcut for ds.minmax(expression, …), see Dataset.minmax

nop()[source]#

Evaluates the expression and drops the result; useful for benchmarking, since vaex is usually lazy.

notna()#

Opposite of isna

nunique(dropna=False, dropnan=False, dropmissing=False, selection=None, axis=None, limit=None, limit_raise=True, progress=None, delay=False)[source]#

Counts number of unique values, i.e. len(df.x.unique()) == df.x.nunique().

Parameters:
  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • dropnan – Drop rows with NaN values

  • dropmissing – Drop rows with missing values

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • axis (bool) – Axis over which to determine the unique elements (None will flatten arrays or lists)

  • limit (int) – Limit the amount of results

  • limit_raise (bool) – Raise vaex.RowLimitException when limit is exceeded, or return at maximum ‘limit’ amount of results.

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress, calculations are cancelled when this callable returns False

  • delay (bool) – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

rad2deg(**kwargs)#

Lazy wrapper around numpy.rad2deg

round(**kwargs)#

Lazy wrapper around numpy.round

searchsorted(**kwargs)#

Lazy wrapper around numpy.searchsorted

sin(**kwargs)#

Lazy wrapper around numpy.sin

sinc(**kwargs)#

Lazy wrapper around numpy.sinc

sinh(**kwargs)#

Lazy wrapper around numpy.sinh

skew(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Shortcut for df.skew(expression, …), see DataFrame.skew

sqrt(**kwargs)#

Lazy wrapper around numpy.sqrt

std(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Shortcut for ds.std(expression, …), see Dataset.std

property str#

Gives access to string operations via StringOperations

property str_pandas#

Gives access to string operations via StringOperationsPandas (using Pandas Series)

property struct#

Gives access to struct operations via StructOperations

sum(axis=None, binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Sum elements over given axis.

If no axis is given, it will sum over all axes.

For non list elements, this is a shortcut for ds.sum(expression, …), see Dataset.sum.

>>> import vaex
>>> import pyarrow as pa
>>> list_data = [1, 2, None], None, [], [1, 3, 4, 5]
>>> df = vaex.from_arrays(some_list=pa.array(list_data))
>>> df.some_list.sum().item()  # will sum over all axis
16
>>> df.some_list.sum(axis=1).tolist()  # sums the list elements
[3, None, 0, 13]
Parameters:

axis (int) – Axis over which to determine the unique elements (None will flatten arrays or lists)
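
The two results in the example can be reproduced in plain Python, which makes the handling of missing entries explicit. The helper row_sum is hypothetical and assumes that missing elements are skipped and that a missing list yields a missing sum, as the output above suggests.

```python
list_data = [[1, 2, None], None, [], [1, 3, 4, 5]]


def row_sum(row):
    """Sum one list, skipping missing elements (illustrative helper)."""
    if row is None:
        return None  # a missing list gives a missing sum
    return sum(item for item in row if item is not None)


per_row = [row_sum(row) for row in list_data]                 # like axis=1
total = sum(value for value in per_row if value is not None)  # like axis=None
print(per_row, total)
```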

tan(**kwargs)#

Lazy wrapper around numpy.tan

tanh(**kwargs)#

Lazy wrapper around numpy.tanh

property td#

Gives access to timedelta operations via TimeDelta

to_arrow(convert_to_native=False)[source]#

Convert to Apache Arrow array (will byteswap/copy if convert_to_native=True).

to_numpy(strict=True)[source]#

Return a numpy representation of the data

to_pandas_series()[source]#

Return a pandas.Series representation of the expression.

Note: Pandas is likely to make a memory copy of the data.

to_string()#

Cast/convert to string, same as expression.astype(‘str’)

tolist(i1=None, i2=None)[source]#

Short for expr.evaluate().tolist()

property transient#

If this expression is not transient (e.g. on disk) optimizations can be made

unique(dropna=False, dropnan=False, dropmissing=False, selection=None, axis=None, limit=None, limit_raise=True, array_type='list', progress=None, delay=False)[source]#

Returns all unique values.

Parameters:
  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • dropnan – Drop rows with NaN values

  • dropmissing – Drop rows with missing values

  • selection – Name of selection to use (or True for the ‘default’), or all the data (when selection is None or False), or a list of selections

  • axis (int) – Axis over which to determine the unique elements (None will flatten arrays or lists)

  • limit (int) – Limit the number of results

  • limit_raise (bool) – Raise vaex.RowLimitException when limit is exceeded, or return at most ‘limit’ results.

  • array_type (str) – Type of output array, possible values are None (keep as is), “numpy” (ndarray), “xarray” for a xarray.DataArray, or “list”/”python” for a Python list

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False

  • delay (bool) – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

value_counts(dropna=False, dropnan=False, dropmissing=False, ascending=False, progress=False, axis=None, delay=False)[source]#

Computes counts of unique values.

WARNING:
  • If the expression/column is not categorical, it will be converted on the fly

  • dropna is False by default, it is True by default in pandas

Parameters:
  • dropna – Drop rows with Not Available (NA) values (NaN or missing values).

  • dropnan – Drop rows with NaN values

  • dropmissing – Drop rows with missing values

  • ascending – when False (default), the most frequently occurring item is reported first

  • progress – True to display a progress bar, or a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False

  • axis (int) – Axis over which to determine the unique elements (None will flatten arrays or lists)

  • delay (bool) – Do not return the result, but a proxy for asynchronous calculations (currently only for internal use)

Returns:

Pandas series containing the counts

var(binby=[], limits=None, shape=128, selection=False, delay=False, progress=None)[source]#

Shortcut for ds.var(expression, …), see Dataset.var

variables(ourself=False, expand_virtual=True, include_virtual=True)[source]#

Return a set of variables this expression depends on.

Example:

>>> import numpy as np
>>> import vaex
>>> df = vaex.example()
>>> r = np.sqrt(df.x**2 + df.y**2)
>>> r.variables()
{'x', 'y'}

where(x, y, dtype=None)#

Return the values row-wise chosen from x or y depending on the condition.

This is a useful function when you want to create a column based on some condition. If the condition is True, the value from x is taken, and otherwise the value from y is taken. An easy way to think about the syntax is df.func.where(“if”, “then”, “else”). Please see the example below.

Note: if the function is used as a method of an expression, that expression is assumed to be the condition.

Parameters:
  • condition – A boolean expression

  • x – A single value or an expression, the value passed on if the condition is satisfied.

  • y – A single value or an expression, the value passed on if the condition is not satisfied.

  • dtype – Optionally specify the dtype of the resulting expression

Return type:

Expression

Example:

>>> import vaex
>>> df = vaex.from_arrays(x=[0, 1, 2, 3])
>>> df['y'] = df.func.where(df.x >= 2, df.x, -1)
>>> df
#    x    y
0    0   -1
1    1   -1
2    2    2
3    3    3

Geo operations#

class vaex.geo.DataFrameAccessorGeo(df)[source]#

Bases: object

Geometry/geographic helper methods

Example:

>>> df_xyz = df.geo.spherical2cartesian(df.longitude, df.latitude, df.distance)
>>> df_xyz.x.mean()

__init__(df)[source]#
__weakref__#

list of weak references to the object

bearing(lon1, lat1, lon2, lat2, bearing='bearing', inplace=False)[source]#

Calculates a bearing, based on http://www.movable-type.co.uk/scripts/latlong.html
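For orientation, the initial great-circle bearing formula described on the referenced page can be sketched in plain Python as follows. This is an illustration of the formula only: the function name `initial_bearing`, the use of degrees for input and output, and the normalisation to [0, 360) are assumptions made here, not a statement of how the vaex method is implemented.

```python
import math


def initial_bearing(lon1, lat1, lon2, lat2):
    """Initial great-circle bearing from point 1 to point 2.

    Hypothetical helper: inputs are in degrees, output is in degrees,
    normalised to the range [0, 360) with 0 = north and 90 = east.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    # East-west and north-south components of the direction of travel.
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0
```

Travelling due east along the equator gives a bearing of about 90, and due west gives about 270.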

cartesian2spherical(x='x', y='y', z='z', alpha='l', delta='b', distance='distance', radians=False, center=None, center_name='solar_position', inplace=False)[source]#

Convert cartesian to spherical coordinates.
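The underlying geometry can be sketched in plain Python for a single point. The helper name `cartesian_to_spherical` is hypothetical, the point is assumed not to be at the origin, and the angle conventions (azimuth from `atan2`, polar angle measured from the x-y plane) are the usual ones rather than something confirmed from the vaex source.

```python
import math


def cartesian_to_spherical(x, y, z, radians=False):
    """Return (alpha, delta, distance) for one point (hypothetical helper).

    alpha is the azimuth in the x-y plane, delta the angle above that
    plane (-90..90 degrees, or -pi/2..pi/2 when radians is True).
    Assumes the point is not at the origin.
    """
    distance = math.sqrt(x * x + y * y + z * z)
    alpha = math.atan2(y, x)
    delta = math.asin(z / distance)
    if not radians:
        alpha, delta = math.degrees(alpha), math.degrees(delta)
    return alpha, delta, distance
```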

Parameters:
  • x

  • y

  • z

  • alpha

  • delta – name for polar angle, ranges from -90 to 90 (or -pi/2 to pi/2 when radians is True).

  • distance

  • radians

  • center

  • center_name

Returns:

cartesian_to_polar(x='x', y='y', radius_out='r_polar', azimuth_out='phi_polar', propagate_uncertainties=False, radians=False, inplace=False)[source]#

Convert cartesian to polar coordinates

Parameters:
  • x – expression for x

  • y – expression for y

  • radius_out – name for the virtual column for the radius

  • azimuth_out – name for the virtual column for the azimuth angle

  • propagate_uncertainties – If true, will propagate errors for the new virtual columns, see propagate_uncertainties() for details

  • radians – if True, azimuth is in radians, defaults to degrees

Returns:

inside_polygon(y, px, py)#

Test if points defined by x and y are inside the polygon px, py

Example:

>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4])
>>> px = np.array([1.5, 2.5, 2.5, 1.5])
>>> py = np.array([2.5, 2.5, 3.5, 3.5])
>>> df['inside'] = df.geo.inside_polygon(df.x, df.y, px, py)
>>> df
#    x    y  inside
0    1    2  False
1    2    3  True
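2    3    4  False

Parameters:
  • x – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • px – list of x coordinates for the polygon

  • py – list of y coordinates for the polygon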

Returns:

Expression, which is true if point is inside, else false.
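A common way to perform this kind of test is ray casting with the even-odd rule: count how many polygon edges a horizontal ray from the point crosses. The sketch below shows that technique for a single point in plain Python. Whether vaex uses this exact algorithm is not stated here, and the helper name `point_in_polygon` is hypothetical.

```python
def point_in_polygon(x, y, px, py):
    """Even-odd ray casting test for one point (hypothetical helper).

    px and py are sequences with the polygon's vertex coordinates.
    Points exactly on an edge may be classified either way.
    """
    inside = False
    n = len(px)
    j = n - 1  # index of the previous vertex
    for i in range(n):
        # Does the edge j -> i straddle the horizontal line through y?
        if (py[i] > y) != (py[j] > y):
            # x coordinate at which the edge crosses that line.
            x_cross = px[j] + (y - py[j]) * (px[i] - px[j]) / (py[i] - py[j])
            if x < x_cross:
                inside = not inside
        j = i
    return inside
```

Applied to the square and the three points of the example above, this gives False, True and False, matching the `inside` column.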

inside_polygons(y, pxs, pys, any=True)#

Test if points defined by x and y are inside all or any of the polygons pxs, pys

Example:

>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4])
>>> px = np.array([1.5, 2.5, 2.5, 1.5])
>>> py = np.array([2.5, 2.5, 3.5, 3.5])
>>> df['inside'] = df.geo.inside_polygons(df.x, df.y, [px, px + 1], [py, py + 1], any=True)
>>> df
#    x    y  inside
0    1    2  False
1    2    3  True
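2    3    4  True

Parameters:
  • x – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • pxs – list of N ndarrays with x coordinates for the polygon, N is the number of polygons

  • pys – list of N ndarrays with y coordinates for the polygon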

  • any – return true if in any polygon, or all polygons

Returns:

Expression, which is true if point is inside, else false.

inside_which_polygon(y, pxs, pys)#

Find in which polygon (0 based index) a point resides

Example:

>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4])
>>> px = np.array([1.5, 2.5, 2.5, 1.5])
>>> py = np.array([2.5, 2.5, 3.5, 3.5])
>>> df['polygon_index'] = df.geo.inside_which_polygon(df.x, df.y, [px, px + 1], [py, py + 1])
>>> df
#    x    y  polygon_index
0    1    2  --
1    2    3  0
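2    3    4  1

Parameters:
  • x – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • pxs – list of N ndarrays with x coordinates for the polygon, N is the number of polygons

  • pys – list of N ndarrays with y coordinates for the polygon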

Returns:

Expression, 0 based index to which polygon the point belongs (or missing/masked value)

inside_which_polygons(x, y, pxss, pyss=None, any=True)[source]#

Find in which set of polygons (0 based index) a point resides.

If any=True, it will be the first matching polygon set index, if any=False, it will be the first index that matches all polygons in the set.

>>> import vaex
>>> import numpy as np
>>> df = vaex.from_arrays(x=[1, 2, 3], y=[2, 3, 4])
>>> px = np.array([1.5, 2.5, 2.5, 1.5])
>>> py = np.array([2.5, 2.5, 3.5, 3.5])
>>> polygonA = [px, py]
>>> polygonB = [px + 1, py + 1]
>>> pxs = [[polygonA, polygonB], [polygonA]]
>>> df['polygon_index'] = df.geo.inside_which_polygons(df.x, df.y, pxs, any=True)
>>> df
#    x    y  polygon_index
0    1    2  --
1    2    3  0
2    3    4  0
>>> df['polygon_index'] = df.geo.inside_which_polygons(df.x, df.y, pxs, any=False)
>>> df
#    x    y  polygon_index
0    1    2  --
1    2    3  1
2    3    4  --

Parameters:
  • x – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • y – expression in the form of a string, e.g. ‘x’ or ‘x+y’ or vaex expression object, e.g. df.x or df.x+df.y

  • pxss – list of N ndarrays with x coordinates for the polygon, N is the number of polygons

  • pyss – list of N ndarrays with y coordinates for the polygon; if None, the last dimension of the x arrays should have length 2 (i.e. hold the x and y coordinates)

  • any – test if the point is in any polygon (logical or), or in all polygons (logical and)

Returns:

Expression, 0 based index to which polygon the point belongs (or missing/masked value)

project_aitoff(alpha, delta, x, y, radians=True, inplace=False)[source]#

Add aitoff (https://en.wikipedia.org/wiki/Aitoff_projection) projection
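The projection formulas from the linked Wikipedia article can be sketched for a single point as below. The helper name `aitoff` is hypothetical, angles are taken in radians, and the output is left unscaled. Any scaling or sign convention applied by the vaex method is not covered by this sketch.

```python
import math


def aitoff(alpha, delta):
    """Aitoff projection of one point (hypothetical helper).

    alpha is the longitude in [-pi, pi] and delta the latitude in
    [-pi/2, pi/2], both in radians. Returns unscaled (x, y).
    """
    a = math.acos(math.cos(delta) * math.cos(alpha / 2.0))
    # Unnormalised sinc, sin(a) / a, with its limit 1 at a = 0.
    sinc_a = math.sin(a) / a if a != 0.0 else 1.0
    x = 2.0 * math.cos(delta) * math.sin(alpha / 2.0) / sinc_a
    y = math.sin(delta) / sinc_a
    return x, y
```

The centre of the map projects to (0, 0), the point at longitude pi on the equator to about (pi, 0), and the north pole to about (0, pi/2).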

Parameters:
  • alpha – azimuth angle

  • delta – polar angle

  • x – output name for x coordinate

  • y – output name for y coordinate

  • radians – input and output in radians (True), or degrees (False)

Returns:

project_gnomic(alpha, delta, alpha0=0, delta0=0, x='x', y='y', radians=False, postfix='', inplace=False)[source]#

Adds a gnomonic projection to the DataFrame
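As a rough illustration, the textbook gnomonic projection of a single point about a centre (alpha0, delta0) can be written as below. The helper name `gnomonic` is hypothetical, and the sign convention of x and the handling of points on the far hemisphere are assumptions. The vaex method may differ in these details.

```python
import math


def gnomonic(alpha, delta, alpha0=0.0, delta0=0.0, radians=False):
    """Textbook gnomonic projection of one point (hypothetical helper)."""
    if not radians:
        alpha, delta, alpha0, delta0 = (
            math.radians(v) for v in (alpha, delta, alpha0, delta0)
        )
    dalpha = alpha - alpha0
    # Cosine of the angular distance between the point and the centre.
    cos_c = (math.sin(delta0) * math.sin(delta)
             + math.cos(delta0) * math.cos(delta) * math.cos(dalpha))
    if cos_c <= 0.0:
        raise ValueError("point is 90 degrees or more from the projection centre")
    x = math.cos(delta) * math.sin(dalpha) / cos_c
    y = (math.cos(delta0) * math.sin(delta)
         - math.sin(delta0) * math.cos(delta) * math.cos(dalpha)) / cos_c
    return x, y
```

The projection centre maps to (0, 0), and a point 45 degrees away along the equator maps to x of about 1, since the projected distance is the tangent of the angular distance.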

rotation_2d(x, y, xnew, ynew, angle_degrees, propagate_uncertainties=False, inplace=False)[source]#

Rotation in 2d.
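The standard anti-clockwise rotation by an angle given in degrees can be sketched for a single point as follows. The helper name `rotate_2d` is hypothetical. It returns the rotated pair, whereas the method documented here adds the result as columns.

```python
import math


def rotate_2d(x, y, angle_degrees):
    """Rotate the point (x, y) anti-clockwise about the origin (hypothetical helper)."""
    theta = math.radians(angle_degrees)
    xnew = x * math.cos(theta) - y * math.sin(theta)
    ynew = x * math.sin(theta) + y * math.cos(theta)
    return xnew, ynew
```

Rotating (1, 0) by 90 degrees gives approximately (0, 1).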

Parameters:
  • x (str) – Name/expression of x column

  • y (str) – Name/expression of y column

  • xnew (str) – name of transformed x column

  • ynew (str) – name of transformed y column

  • angle_degrees (float) – rotation in degrees, anti clockwise

Returns:

spherical2cartesian(alpha, delta, distance, xname='x', yname='y', zname='z', propagate_uncertainties=False, center=[0, 0, 0], radians=False, inplace=False)[source]#

Convert spherical to cartesian coordinates.
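For a single point, the conversion can be sketched in plain Python as below. The helper name `spherical_to_cartesian` is hypothetical, and treating `center` as an offset added to the result is an assumption made for this sketch.

```python
import math


def spherical_to_cartesian(alpha, delta, distance,
                           center=(0.0, 0.0, 0.0), radians=False):
    """Return (x, y, z) for one point (hypothetical helper).

    delta is the angle above the x-y plane (-90 at the south pole,
    90 at the north pole). center is assumed to be added as an offset.
    """
    if not radians:
        alpha, delta = math.radians(alpha), math.radians(delta)
    x = distance * math.cos(delta) * math.cos(alpha) + center[0]
    y = distance * math.cos(delta) * math.sin(alpha) + center[1]
    z = distance * math.sin(delta) + center[2]
    return x, y, z
```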

Parameters:
  • alpha

  • delta – polar angle, ranging from the -90 (south pole) to 90 (north pole)

  • distance – radial distance, determines the units of x, y and z

  • xname

  • yname

  • zname

  • propagate_uncertainties – If true, will propagate errors for the new virtual columns, see propagate_uncertainties() for details

  • center

  • radians

Returns:

New dataframe (if inplace is False) with new x,y,z columns

velocity_cartesian2polar(x='x', y='y', vx='vx', radius_polar=None, vy='vy', vr_out='vr_polar', vazimuth_out='vphi_polar', propagate_uncertainties=False, inplace=False)[source]#

Convert cartesian to polar velocities.
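The usual decomposition of a planar velocity into radial and azimuthal components can be sketched for a single point as follows. The helper name `velocity_cartesian_to_polar` is hypothetical, the point is assumed not to be at the origin, and taking the azimuthal component as positive in the anti-clockwise direction is an assumption made for this sketch.

```python
import math


def velocity_cartesian_to_polar(x, y, vx, vy, radius_polar=None):
    """Return (vr, vphi) for one point (hypothetical helper).

    vr is the velocity along the radius and vphi the velocity
    perpendicular to it, positive anti-clockwise. A precomputed
    radius can be passed in to avoid recomputing it.
    """
    r = math.hypot(x, y) if radius_polar is None else radius_polar
    vr = (x * vx + y * vy) / r
    vphi = (x * vy - y * vx) / r
    return vr, vphi
```

A point at (1, 0) moving with velocity (0, 2) has no radial motion and an azimuthal velocity of 2.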

Parameters:
  • x

  • y

  • vx

  • radius_polar – Optional expression for the radius, may lead to a better performance when given.

  • vy

  • vr_out

  • vazimuth_out

  • propagate_uncertainties – If true, will propagate errors for the new virtual columns, see propagate_uncertainties() for details

Returns:

velocity_cartesian2spherical(x='x', y