{ "cells": [ { "cell_type": "raw", "metadata": { "raw_mimetype": "text/html" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Vaex introduction in 11 minutes\n", "\n", "[Because vaex goes up to 11](https://en.wikipedia.org/wiki/Up_to_eleven)\n", "\n", "If you want to try out this notebook with a live Python kernel, use mybinder.\n", "\n", "## DataFrame\n", "Central to Vaex is the DataFrame (similar to, but more efficient than, a Pandas DataFrame), and we often use the variable `df` to represent it. A DataFrame is an efficient representation of a large tabular dataset, and has:\n", "\n", " * A number of columns, say `x`, `y` and `z`, which are:\n", "   * Backed by a NumPy array;\n", "   * Wrapped by an expression system, e.g. `df.x`, `df['x']` or `df.col.x` is an Expression;\n", "   * Able to perform lazy computations, e.g. `df.x * np.sin(df.y)` does nothing until the result is needed.\n", " * A set of virtual columns, which are columns backed by a (lazy) computation, e.g. `df['r'] = df.x/df.y`\n", " * A set of selections that can be used to explore the dataset, e.g. `df.select(df.x < 0)`\n", " * Filtered DataFrames, which do not copy the data, e.g. `df_negative = df[df.x < 0]`\n", " \n", "Let's start with an example dataset, which is included in Vaex." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2020-04-15T18:56:49.933975Z", "start_time": "2020-04-15T18:56:48.859098Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# id x y z vx vy vz E L Lz FeH
0 0 1.2318683862686157 -0.39692866802215576 -0.598057746887207 301.1552734375 174.05947875976562 27.42754554748535 -149431.40625 407.38897705078125 333.9555358886719 -1.0053852796554565
1 23 -0.16370061039924622 3.654221296310425 -0.25490644574165344 -195.00022888183594 170.47216796875 142.5302276611328 -124247.953125 890.2411499023438 684.6676025390625 -1.7086670398712158
2 32 -2.120255947113037 3.326052665710449 1.7078403234481812 -48.63423156738281 171.6472930908203 -2.079437255859375 -138500.546875 372.2410888671875 -202.17617797851562 -1.8336141109466553
3 8 4.7155890464782715 4.5852508544921875 2.2515437602996826 -232.42083740234375 -294.850830078125 62.85865020751953 -60037.0390625 1297.63037109375 -324.6875 -1.4786882400512695
4 16 7.21718692779541 11.99471664428711 -1.064562201499939 -1.6891745328903198 181.329345703125 -11.333610534667969 -83206.84375 1332.7989501953125 1328.948974609375 -1.8570483922958374
... ... ... ... ... ... ... ... ... ... ... ...
329,995 21 1.9938701391220093 0.789276123046875 0.22205990552902222 -216.92990112304688 16.124420166015625 -211.244384765625 -146457.4375 457.72247314453125 203.36758422851562 -1.7451677322387695
329,996 25 3.7180912494659424 0.721337616443634 1.6415337324142456 -185.92160034179688 -117.25082397460938 -105.4986572265625 -126627.109375 335.0025634765625 -301.8370056152344 -0.9822322130203247
329,997 14 0.3688507676124573 13.029608726501465 -3.633934736251831 -53.677146911621094 -145.15771484375 76.70909881591797 -84912.2578125 817.1375732421875 645.8507080078125 -1.7645612955093384
329,998 18 -0.11259264498949051 1.4529125690460205 2.168952703475952 179.30865478515625 205.79710388183594 -68.75872802734375 -133498.46875 724.000244140625 -283.6910400390625 -1.8808952569961548
329,999 4 20.796220779418945 -3.331387758255005 12.18841552734375 42.69000244140625 69.20479583740234 29.54275131225586 -65519.328125 1843.07470703125 1581.4151611328125 -1.1231083869934082
# x y z r
0 1.2318683862686157 -0.39692866802215576 -0.598057746887207 1.425736665725708
1 -0.16370061039924622 3.654221296310425 -0.25490644574165344 3.666757345199585
2 -2.120255947113037 3.326052665710449 1.7078403234481812 4.298235893249512
3 4.7155890464782715 4.5852508544921875 2.2515437602996826 6.952032566070557
4 7.21718692779541 11.99471664428711 -1.064562201499939 14.03902816772461
... ... ... ... ...
329,995 1.9938701391220093 0.789276123046875 0.22205990552902222 2.155872344970703
329,996 3.7180912494659424 0.721337616443634 1.6415337324142456 4.127851963043213
329,997 0.3688507676124573 13.029608726501465 -3.633934736251831 13.531896591186523
329,998 -0.11259264498949051 1.4529125690460205 2.168952703475952 2.613041877746582
329,999 20.796220779418945 -3.331387758255005 12.18841552734375 24.333894729614258
" ], "text/plain": [ "# x y z r\n", "0 1.2318683862686157 -0.39692866802215576 -0.598057746887207 1.425736665725708\n", "1 -0.16370061039924622 3.654221296310425 -0.25490644574165344 3.666757345199585\n", "2 -2.120255947113037 3.326052665710449 1.7078403234481812 4.298235893249512\n", "3 4.7155890464782715 4.5852508544921875 2.2515437602996826 6.952032566070557\n", "4 7.21718692779541 11.99471664428711 -1.064562201499939 14.03902816772461\n", "... ... ... ... ...\n", "329,995 1.9938701391220093 0.789276123046875 0.22205990552902222 2.155872344970703\n", "329,996 3.7180912494659424 0.721337616443634 1.6415337324142456 4.127851963043213\n", "329,997 0.3688507676124573 13.029608726501465 -3.633934736251831 13.531896591186523\n", "329,998 -0.11259264498949051 1.4529125690460205 2.168952703475952 2.613041877746582\n", "329,999 20.796220779418945 -3.331387758255005 12.18841552734375 24.333894729614258" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)\n", "df[['x', 'y', 'z', 'r']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selections and filtering\n", "Vaex can be efficient when exploring subsets of the data, for instance to remove outliers or to inspect only a part of the data. Instead of making copies, Vaex internally keeps track of which rows are selected."
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2020-04-15T18:56:49.992283Z", "start_time": "2020-04-15T18:56:49.983254Z" } }, "outputs": [ { "data": { "text/plain": [ "array([-0.16370061, -2.120256 , -7.7843747 , ..., -8.126636 ,\n", " -3.9477386 , -0.11259264], dtype=float32)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.select(df.x < 0)\n", "df.evaluate(df.x, selection=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selections are useful when you frequently modify the portion of the data you want to visualize, or when you want to efficiently compute statistics on several portions of the data.\n", "\n", "Alternatively, you can create filtered DataFrames. This is similar to filtering in Pandas, except that Vaex does not copy the data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2020-04-15T18:56:50.056954Z", "start_time": "2020-04-15T18:56:49.994132Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
# x y z r
0 -0.16370061039924622 3.654221296310425 -0.25490644574165344 3.666757345199585
1 -2.120255947113037 3.326052665710449 1.7078403234481812 4.298235893249512
2 -7.784374713897705 5.989774703979492 -0.682695209980011 9.845809936523438
3 -3.5571861267089844 5.413629055023193 0.09171556681394577 6.478376865386963
4 -20.813940048217773 -3.294677495956421 13.486607551574707 25.019264221191406
... ... ... ... ...
166,274 -2.5926425457000732 -2.871671676635742 -0.18048334121704102 3.8730955123901367
166,275 -0.7566012144088745 2.9830434322357178 -6.940553188323975 7.592250823974609
166,276 -8.126635551452637 1.1619765758514404 -1.6459038257598877 8.372657775878906
166,277 -3.9477386474609375 -3.0684902667999268 -1.5822702646255493 5.244411468505859
166,278 -0.11259264498949051 1.4529125690460205 2.168952703475952 2.613041877746582