I have a massive CSV file that does not fit into memory all at once. How do I convert it to HDF5?¶

New in 4.14:

Backed by Apache Arrow, Vaex supports lazy reading of CSV files simply with:

df = vaex.open('./my_data/my_big_file.csv')


In this way you can work with arbitrarily large CSV files, with the same API as if you were working with HDF5, Apache Arrow, or Apache Parquet files.

For performance reasons, we do recommend converting large CSV files to either the HDF5 or the Apache Arrow format. This is simply done via:

df = vaex.open('./my_data/my_big_file.csv', convert='./my_data/my_big_file.hdf5')


One can also convert a large CSV file to Apache Parquet in the very same way, in order to save disk space.

Prior to 4.14:

df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)


When the above line is executed, Vaex reads the CSV file in chunks, converting each chunk to a temporary HDF5 file on disk. All temporary files are then concatenated into a single HDF5 file, and the temporary files are deleted. The size of the individual chunks can be specified via the chunk_size argument.

Note: this is still possible in newer versions of Vaex, but it is not the most performant approach.
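The chunk-and-concatenate strategy described above can be sketched in plain Python. This is only an illustrative stdlib sketch, not Vaex's actual implementation: it writes CSV chunks rather than HDF5, and the function name and chunk size are made up for the example.

```python
import csv
import os
import tempfile

def _write_chunk(rows):
    """Write one chunk of rows to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix='.csv')
    with os.fdopen(fd, 'w', newline='') as tmp:
        csv.writer(tmp).writerows(rows)
    return path

def convert_in_chunks(csv_path, out_path, chunk_size):
    """Read csv_path in chunks of chunk_size rows, write each chunk to a
    temporary file, then concatenate them into out_path and clean up.
    Mirrors the strategy described above, using plain CSV instead of HDF5."""
    temp_paths = []
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_size:
                temp_paths.append(_write_chunk(chunk))
                chunk = []
        if chunk:  # flush the final, possibly smaller, chunk
            temp_paths.append(_write_chunk(chunk))
    # Concatenate all temporary chunk files into the final output,
    # then delete the temporaries.
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(header)
        for path in temp_paths:
            with open(path, newline='') as part:
                writer.writerows(csv.reader(part))
            os.remove(path)  # delete the temporary chunk file
```

At no point does this hold more than one chunk in memory, which is what makes the approach work for files larger than RAM.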

Why can’t I open a HDF5 file that was exported from a pandas DataFrame using .to_hdf?¶

When one uses the pandas .to_hdf method, the output HDF5 file has a row-based format. Vaex, on the other hand, expects column-based HDF5 files. This allows for more efficient reading of individual data columns, which is far more commonly required in data science applications.
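The difference between the two layouts can be illustrated with a toy sketch in pure Python (this is a conceptual analogy, not the actual HDF5 on-disk format):

```python
# Toy illustration of row-based vs column-based storage (not real HDF5).
rows = [(1, 'Sally', 27), (2, 'Tom', 29), (3, 'Maria', 31)]

# Row-based: the values of one column are interleaved with all the others,
# so reading the 'age' column means touching every record.
ages_from_rows = [record[2] for record in rows]

# Column-based: each column is stored contiguously, so reading one
# column is a single contiguous access and the other columns are
# never touched at all.
columns = {
    'id':   [1, 2, 3],
    'name': ['Sally', 'Tom', 'Maria'],
    'age':  [27, 29, 31],
}
ages_from_columns = columns['age']

assert ages_from_rows == ages_from_columns == [27, 29, 31]
```

On disk the same principle applies: a columnar file lets vaex memory-map and scan a single column without reading the rest of the file.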

One can easily export a pandas DataFrame to a vaex-friendly HDF5 file:

vaex_df = vaex.from_pandas(pandas_df, copy_index=False)
vaex_df.export_hdf5('my_data.hdf5')


What is the optimal file format to use with vaex?¶

What is “optimal” may depend on what one is trying to achieve. A quick summary would be:

vaex shines when the data is in a memory-mappable file format, namely HDF5, Apache Arrow, or FITS. We say a file can be memory mapped if it has the same structure in memory as it has on disk. Although any file can be memory mapped, there is no advantage to doing so if the data requires deserialisation.

In principle, HDF5 and Arrow should give the same performance, and for files that fit into memory they do. For single files that are larger than the available RAM, our tests show that HDF5 is faster; how much faster will likely depend on your system and on the quantity and type of data. This performance difference may be caused by converting bit masks to byte masks, or by flattening chunked Arrow arrays. We expect it to disappear in the future.

If your data is spread among multiple files that are concatenated on the fly, the performance of HDF5 and Arrow is expected to be the same. Our tests show better performance when all the data is contained in a single file, compared to multiple files.

The Arrow file format allows seamless interoperability with other ecosystems. If your use-case requires sharing data with other ecosystems, e.g. Java, the Arrow file format is the way to go.

vaex also supports Parquet. Parquet is compressed, so memory mapping brings no advantage: there is always a performance penalty, since the data needs to be decompressed before it is used. Parquet does, however, allow lazy reading of the data, which is decompressed on the fly, so vaex can easily work with Parquet files that are larger than RAM. We recommend Parquet when one wants to save disk space. It can also be convenient when reading from slow I/O sources, such as spinning hard drives or cloud storage. Note that by using df.materialize one can get the same performance as with HDF5 or Arrow files, at the cost of memory or disk space.
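The trade-off that df.materialize exploits can be sketched with the stdlib zlib module. This is only an analogy for compressed (Parquet-like) versus materialized data, not how vaex or Parquet actually store values:

```python
import zlib

# Some repetitive "column" data; compresses well, like typical Parquet pages.
raw = ','.join(str(i) for i in range(1000)).encode()
compressed = zlib.compress(raw)

# The compressed form is smaller on "disk"...
assert len(compressed) < len(raw)

def read_compressed():
    # ...but every read from the compressed source pays a decode cost.
    return zlib.decompress(compressed)

# "Materializing" decompresses once and keeps the result in memory,
# trading memory for repeated-read speed.
materialized = zlib.decompress(compressed)

def read_materialized():
    return materialized  # direct access, no decoding

assert read_compressed() == read_materialized() == raw
```

Repeated reads from the materialized copy skip the per-access decompression, which is why materializing a Parquet-backed column brings its scan performance in line with HDF5 or Arrow.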

Technically vaex can read data from CSV and JSON sources, but then the data is put in memory and the usage is not optimal. We warmly recommend converting these, and any other data sources, to the HDF5, Arrow, or Parquet file format, depending on your use-case or preference.

Why can’t I add a new column after filtering a vaex DataFrame?¶

Unlike other libraries, vaex does not copy or modify the data. After a filtering operation, for example:

df2 = df[df.x > 5]


df2 still contains all of the data present in df. The difference is that the columns of df2 are lazily indexed: only the rows for which the filtering condition is satisfied are displayed or used. This means that, in principle, one can turn filters on and off as needed.

To manually add a new column to the filtered df2 DataFrame, one needs to call the df2.extract() method first. This drops the lazy indexing, making the length of df2 equal to its filtered length.

Here is a short example:

[1]:

import vaex
import numpy as np

df = vaex.from_dict({'id': np.array([1, 2, 3, 4]),
                     'name': np.array(['Sally', 'Tom', 'Maria', 'John'])})

df2 = df[df.id > 2]
df2 = df2.extract()

df2['age'] = np.array([27, 29])
df2

[1]:

#    id  name   age
0     3  Maria   27
1     4  John    29
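The lazy-indexing idea behind the filter/extract distinction can be sketched as a toy class. This is a pure-Python illustration of the concept, not vaex's implementation: a filtered view keeps the full data plus a list of selected row indices, and extract() drops that indirection by copying only the selected rows.

```python
class LazyFilteredView:
    """Toy sketch of a lazily filtered table: keeps the full column data
    and a list of selected row indices instead of copying any rows."""

    def __init__(self, columns, indices=None):
        self.columns = columns  # full, unfiltered column data
        n = len(next(iter(columns.values())))
        self.indices = list(range(n)) if indices is None else indices

    def filter(self, column, predicate):
        # Keep only the indices whose value satisfies the predicate;
        # the underlying columns are never copied or modified.
        kept = [i for i in self.indices if predicate(self.columns[column][i])]
        return LazyFilteredView(self.columns, kept)

    def extract(self):
        # Drop the lazy indexing by copying just the selected rows, so
        # new columns of matching length can safely be added afterwards.
        copied = {name: [col[i] for i in self.indices]
                  for name, col in self.columns.items()}
        return LazyFilteredView(copied)

    def __len__(self):
        return len(self.indices)


df = LazyFilteredView({'id': [1, 2, 3, 4],
                       'name': ['Sally', 'Tom', 'Maria', 'John']})
df2 = df.filter('id', lambda v: v > 2)
# The view is shorter, but the underlying data is still the full table.
assert len(df2) == 2 and len(df2.columns['id']) == 4
df3 = df2.extract()
# After extract(), only the selected rows remain.
assert df3.columns == {'id': [3, 4], 'name': ['Maria', 'John']}
```

This is why adding a column to df2 before extract() fails in spirit: a new column of length 2 cannot line up with underlying columns of length 4 while the row indirection is still in place.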