Vaex can cache task results, such as aggregations, or the internal hashmaps used for
groupby operations to make recurring calculations much faster, at the cost of calculating cache keys and storing/retrieving the cached values.
Internally, Vaex calculates fingerprints (e.g. hashes of data, or file paths and mtimes) to create cache keys that are stable across processes, so that restarting a process will most likely produce the same cache keys.
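To make the fingerprint idea concrete, here is a minimal sketch using only the standard library. It is not Vaex's actual implementation: the helpers `file_fingerprint` and `cache_key` are hypothetical names, and the choice of SHA-256 over path, size and mtime is an assumption made for illustration. It shows why such a key survives a process restart (it depends only on the file's state and the task description) and why it changes when the file is modified.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path):
    """Hypothetical helper: hash the absolute path, size and mtime of a file."""
    stat = os.stat(path)
    payload = f"{os.path.abspath(path)}|{stat.st_size}|{stat.st_mtime_ns}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_key(task_name, path):
    """Hypothetical helper: combine a task description with the file fingerprint."""
    payload = f"{task_name}|{file_fingerprint(path)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Create a throwaway data file to fingerprint.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("age\n22\n38\n")
    demo_path = f.name

key_first = cache_key("mean(age)", demo_path)
key_again = cache_key("mean(age)", demo_path)  # same file state, same task
key_other = cache_key("sum(age)", demo_path)   # same file, different task

# Simulate a later modification by moving the mtime forward by 10 seconds.
stat = os.stat(demo_path)
os.utime(demo_path, ns=(stat.st_atime_ns, stat.st_mtime_ns + 10_000_000_000))
key_modified = cache_key("mean(age)", demo_path)

os.remove(demo_path)
```

Because nothing in the key depends on process-local state such as object ids, a second process computing the same task on the same unmodified file would arrive at the same key.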
Caches can be turned on globally like this:
    import vaex
    df = vaex.datasets.titanic()
    vaex.cache.memory()  # turn the cache on globally
One can verify that the cache is turned on via:
    vaex.cache.is_on()
The cache can be globally turned off again:
    vaex.cache.off()
The cache can also be turned on with a context manager; it is turned off again when the block exits. Here we use a disk cache. A disk cache is shared among processes, which makes it well suited to processes that restart, or to using Vaex in a web service with multiple workers. Consider the following example:
    with vaex.cache.disk(clear=True):
        print(df.age.mean())  # The very first time the mean is computed

    # outside of the context manager, the cache is still off
    vaex.cache.is_on()

    with vaex.cache.disk():
        print(df.age.mean())  # The second time the result is read from the cache
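The behaviour of a cache that is active only inside a `with` block can be sketched in plain Python. This is an illustration of the general pattern, not Vaex's code: `memory_cache`, `is_on` and `cached_mean` are hypothetical names, and a plain dict stands in for the real cache backend.

```python
import contextlib

_cache = None  # None means caching is off (assumption of this sketch)


def is_on():
    """Report whether a cache is currently active."""
    return _cache is not None


@contextlib.contextmanager
def memory_cache():
    """Turn an in-memory cache on for the duration of the with block."""
    global _cache
    previous = _cache
    _cache = {}
    try:
        yield _cache
    finally:
        _cache = previous  # restore whatever state was active before


calls = []  # records every time the mean is actually computed


def cached_mean(values):
    """Compute a mean, consulting the cache first when one is active."""
    key = ("mean", tuple(values))
    if _cache is not None and key in _cache:
        return _cache[key]
    calls.append(key)
    result = sum(values) / len(values)
    if _cache is not None:
        _cache[key] = result
    return result


ages = [22.0, 38.0, 26.0]
with memory_cache():
    first = cached_mean(ages)   # computed and stored
    second = cached_mean(ages)  # served from the cache
    on_inside = is_on()
on_outside = is_on()            # the previous (off) state is restored
```

Restoring the previous state in `finally` means the cache is switched back even if the block raises. A disk-backed variant would differ mainly in where the values are stored, which is what lets them outlive the process.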