" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Columns\n", "The above preview shows that this dataset contains more than 300,000 rows, and columns named x, y, z (positions), vx, vy, vz (velocities), E (energy), L (angular momentum), and an id (subgroup of samples). When we print out a column, we can see that it is not a NumPy array, but an Expression." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-04-15T18:56:49.942305Z", "start_time": "2020-04-15T18:56:49.935689Z" } }, "outputs": [ { "data": { "text/plain": [ "Expression = x\n", "Length: 330,000 dtype: float32 (column)\n", "---------------------------------------\n", " 0 1.23187\n", " 1 -0.163701\n", " 2 -2.12026\n", " 3 4.71559\n", " 4 7.21719\n", " ... \n", "329995 1.99387\n", "329996 3.71809\n", "329997 0.368851\n", "329998 -0.112593\n", "329999 20.7962" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.x # df.col.x and df['x'] are equivalent; df.col.x is friendlier for tab completion, df['x'] for programmatic access" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can use the `.values` attribute to get an in-memory representation of an expression. The same attribute is available on a DataFrame as well." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2020-04-15T18:56:49.949040Z", "start_time": "2020-04-15T18:56:49.944388Z" } }, "outputs": [ { "data": { "text/plain": [ "array([ 1.2318684 , -0.16370061, -2.120256 , ..., 0.36885077,\n", " -0.11259264, 20.79622 ], dtype=float32)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.x.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most NumPy functions (ufuncs) can be applied to expressions. Instead of computing a result immediately, they return a new expression." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2020-04-15T18:56:49.958000Z", "start_time": "2020-04-15T18:56:49.950537Z" } }, "outputs": [ { "data": { "text/plain": [ "Expression = sqrt((((x ** 2) + (y ** 2)) + (z ** 2)))\n", "Length: 330,000 dtype: float32 (expression)\n", "-------------------------------------------\n", " 0 1.42574\n", " 1 3.66676\n", " 2 4.29824\n", " 3 6.95203\n", " 4 14.039\n", " ... \n", "329995 2.15587\n", "329996 4.12785\n", "329997 13.5319\n", "329998 2.61304\n", "329999 24.3339" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.sqrt(df.x**2 + df.y**2 + df.z**2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Virtual columns\n", "Sometimes it is convenient to store an expression as a column. " ], "text/plain": [ "# x y z r\n", "0 -0.16370061039924622 3.654221296310425 -0.25490644574165344 3.666757345199585\n", "1 -2.120255947113037 3.326052665710449 1.7078403234481812 4.298235893249512\n", "2 -7.784374713897705 5.989774703979492 -0.682695209980011 9.845809936523438\n", "3 -3.5571861267089844 5.413629055023193 0.09171556681394577 6.478376865386963\n", "4 -20.813940048217773 -3.294677495956421 13.486607551574707 25.019264221191406\n", "... ... ... ... 
...\n", "166,274 -2.5926425457000732 -2.871671676635742 -0.18048334121704102 3.8730955123901367\n", "166,275 -0.7566012144088745 2.9830434322357178 -6.940553188323975 7.592250823974609\n", "166,276 -8.126635551452637 1.1619765758514404 -1.6459038257598877 8.372657775878906\n", "166,277 -3.9477386474609375 -3.0684902667999268 -1.5822702646255493 5.244411468505859\n", "166,278 -0.11259264498949051 1.4529125690460205 2.168952703475952 2.613041877746582" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_negative = df[df.x < 0]\n", "df_negative[['x', 'y', 'z', 'r']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Statistics on N-d grids\n", "\n", "A core feature of Vaex is the extremely efficient calculation of statistics on N-dimensional grids. This is particularly useful for visualising large datasets." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2020-04-15T18:56:50.088048Z", "start_time": "2020-04-15T18:56:50.059487Z" } }, "outputs": [ { "data": { "text/plain": [ "(array(330000), array(-0.0632868), array(-5.18457762))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.count(), df.mean(df.x), df.mean(df.x, selection=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to SQL's GROUP BY, Vaex uses the binby concept, which tells Vaex to calculate a statistic on a regular grid (for performance reasons)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2020-04-15T18:56:50.106014Z", "start_time": "2020-04-15T18:56:50.089858Z" } }, "outputs": [ { "data": { "text/plain": [ "array([1374, 1350, 1459, 1618, 1706, 1762, 1852, 2007, 2240, 2340, 2610,\n", " 2840, 3126, 3337, 3570, 3812, 4216, 4434, 4730, 4975, 5332, 5800,\n", " 6162, 6540, 6805, 7261, 7478, 7642, 7839, 8336, 8736, 8279, 8269,\n", " 8824, 8217, 7978, 7541, 7383, 7116, 6836, 
6447, 6220, 5864, 5408,\n", " 4881, 4681, 4337, 4015, 3799, 3531, 3320, 3040, 2866, 2629, 2488,\n", " 2244, 1981, 1905, 1734, 1540, 1437, 1378, 1233, 1186])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts_x = df.count(binby=df.x, limits=[-10, 10], shape=64)\n", "counts_x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This results in a NumPy array with the counts in 64 bins evenly spaced between x = -10 and x = 10. "text/plain": [ "