{ "cells": [ { "cell_type": "markdown", "id": "76effe6b-09bd-424f-8296-f07f8d08936f", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "(utilities)=\n", "# Utilities\n", "\n", "[`xyzpy`](xyzpy) provides a number of utilities that might be generally\n", "useful when generating data. These are:\n", "\n", "* {class}`~xyzpy.Timer`\n", "* {func}`~xyzpy.benchmark`\n", "* {class}`~xyzpy.Benchmarker`\n", "\n", "for timing and comparing functions, and then:\n", "\n", "* {class}`~xyzpy.RunningStatistics`\n", "* {func}`~xyzpy.estimate_from_repeats`\n", "\n", "for collecting running statistics and estimating quantities from repeats." ] }, { "cell_type": "code", "execution_count": 1, "id": "12d83242-844f-45aa-90d0-75fa1aaa604c", "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_formats = ['svg']\n", "\n", "import numpy as np\n", "\n", "import xyzpy as xyz" ] }, { "cell_type": "markdown", "id": "f4b4ac0b-ea68-4e60-a390-95275f1a15ec", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "## Timing\n", "\n", "### Simple timing with ``Timer``\n", "\n", "This is a super simple context manager for very roughly timing a statement that runs once:" ] }, { "cell_type": "code", "execution_count": 2, "id": "e1d77d3a-3557-4440-a27f-1836d99eed41", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.27247190475463867" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with xyz.Timer() as timer:\n", "    A = np.random.randn(512, 512)\n", "    el, ev = np.linalg.eig(A)\n", "\n", "timer.interval" ] }, { "cell_type": "markdown", "id": "e43460c2-1a7c-4b33-aee6-9e215b68b1f9", "metadata": {}, "source": [ "If you run this a few times you might notice some big fluctuations.\n", "\n", "\n", "### Advanced timing with ``benchmark``\n", "\n", "This is a more advanced and accurate function that wraps ``timeit`` under the hood.\n", "It offers, however, a convenient interface that accepts callables and 
sensibly manages\n", "how many repeats to do etc.:" ] }, { "cell_type": "code", "execution_count": 3, "id": "83811004-6141-4054-a0a1-ba6777a90605", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.16060145798837766" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def setup(n=512):\n", "    return np.random.randn(n, n)\n", "\n", "\n", "def foo(A):\n", "    return np.linalg.eig(A)\n", "\n", "\n", "xyz.benchmark(foo, setup=setup)" ] }, { "cell_type": "markdown", "id": "cac3f6a8-9b6b-4bfe-b754-9686e4b7cd52", "metadata": {}, "source": [ "Or we can specify the size ``n`` to benchmark with as well:" ] }, { "cell_type": "code", "execution_count": 4, "id": "c9767fd8-64f5-4364-b841-c685b4863bf3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.786840959044639" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xyz.benchmark(foo, setup=setup, n=1024)" ] }, { "cell_type": "markdown", "id": "232205ab-6232-4203-823c-14eca6baccdb", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "This calls ``foo(setup(n))`` under the hood.\n", "Generally the ``setup`` and ``n`` arguments are optional -\n", "including them or not allows switching between the following\n", "underlying patterns:\n", "\n", "```python\n", "foo()\n", "foo(n)\n", "foo(setup())\n", "foo(setup(n))\n", "```\n", "\n", "Supply ``starmap=True`` if you want ``foo(*setup(n))``, and\n", "see {func}`~xyzpy.benchmark` for other options, e.g. the\n", "minimum time and number of repeats to aim for.\n", "\n", "\n", "### Comparing performance with ``Benchmarker``\n", "\n", "Building on top of {func}`~xyzpy.benchmark` and combining it with\n", "the functionality of a {class}`~xyzpy.Harvester` gives us a very nice\n", "way to compare the performance of various functions, or 'kernels'.\n", "\n", "As an example here we'll compare ``python``, ``numpy`` and ``numba``\n", "for computing ``sum(x**2)**0.5``." 
] }, { "cell_type": "code", "execution_count": 5, "id": "bde8f297-3306-4bda-991f-8171d6743e47", "metadata": {}, "outputs": [], "source": [ "import numba as nb\n", "\n", "\n", "def setup(n):\n", "    return np.random.randn(n)\n", "\n", "\n", "def python_square_sum(xs):\n", "    y = 0.0\n", "    for x in xs:\n", "        y += x**2\n", "    return y**0.5\n", "\n", "\n", "def numpy_square_sum(xs):\n", "    return (xs**2).sum() ** 0.5\n", "\n", "\n", "@nb.njit\n", "def numba_square_sum(xs):\n", "    y = 0.0\n", "    for x in xs:\n", "        y += x**2\n", "    return y**0.5" ] }, { "cell_type": "markdown", "id": "8dc7acd0-6480-425a-a277-9fa0727eb377", "metadata": {}, "source": [ "The ``setup`` function will be supplied to each kernel. First we can\n", "check that they all give the same answer:" ] }, { "cell_type": "code", "execution_count": 6, "id": "d5d93fa6-3502-4f54-871e-707d7476ece4", "metadata": {}, "outputs": [], "source": [ "xs = setup(100)" ] }, { "cell_type": "code", "execution_count": 7, "id": "9dc6e848-8b20-4a37-8876-1b9ec049c0c8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(9.97523320851365)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "python_square_sum(xs)" ] }, { "cell_type": "code", "execution_count": 8, "id": "008ddfb2-b4b7-4d82-b459-8109bfa5edca", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(9.97523320851365)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numpy_square_sum(xs)" ] }, { "cell_type": "code", "execution_count": 9, "id": "28f8de98-269c-46bd-9a9c-bfc0c2e3d6b9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9.97523320851365" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numba_square_sum(xs)" ] }, { "cell_type": "markdown", "id": "9d26ec79-4336-46cc-a083-26d04725a09b", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Then we can set up a {class}`~xyzpy.utils.Benchmarker` object 
to compare these with:" ] }, { "cell_type": "code", "execution_count": 10, "id": "4917fbfb-38ed-406c-adb0-e64c3f4512fd", "metadata": {}, "outputs": [], "source": [ "kernels = [\n", "    python_square_sum,\n", "    numpy_square_sum,\n", "    numba_square_sum,\n", "]\n", "\n", "benchmarker = xyz.Benchmarker(\n", "    kernels, setup=setup, benchmark_opts={\"min_t\": 0.01}\n", ")" ] }, { "cell_type": "markdown", "id": "f43e0776-54c4-4749-a8bd-a6bc9c2afb13", "metadata": {}, "source": [ "Next we run a set of problem sizes:" ] }, { "cell_type": "code", "execution_count": 11, "id": "ab53d456-e0f5-4dc4-9e01-38392398ad16", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|##########| 30/30 [00:01<00:00, 21.85it/s, {'n': 1024, 'kernel': 'numba_square_sum'}] \n" ] } ], "source": [ "sizes = [2**i for i in range(1, 11)]\n", "\n", "benchmarker.run(sizes, verbosity=2)" ] }, { "cell_type": "markdown", "id": "91789e72-f4e2-41fe-9b3f-76d5a022e4c2", "metadata": {}, "source": [ "Which we can then automatically plot:" ] }, { "cell_type": "code", "execution_count": 12, "id": "f2f93434-0e7e-4da1-ae8f-3313af9a69ae", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "" ], "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "text/plain": [ "(
,\n", " array([[]], dtype=object))" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "benchmarker.plot()" ] }, { "cell_type": "markdown", "id": "0f1471cf-a1ab-48a4-9b31-d70c371ba04e", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Under the hood {class}`~xyzpy.Benchmarker` collects and aggregates results\n", "using a {class}`~xyzpy.Harvester`. This means that subsequent runs\n", "of different sizes will be automatically merged. Additionally, if you\n", "initialize the benchmarker with a ``dataname``, the results will be\n", "stored in an on-disk dataset.\n", "\n", "\n", "## Estimation\n", "\n", "### Efficiently collect running statistics\n", "\n", "Sometimes it is convenient to collect statistics on-the-fly, rather than storing\n", "all the values and computing statistics afterwards. The\n", "{class}`~xyzpy.RunningStatistics` object can be used for this purpose:" ] }, { "cell_type": "code", "execution_count": 13, "id": "ce238f2a-5a95-4b15-a29a-9c1a980e49bb", "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "stats = xyz.RunningStatistics()\n", "total = 0.0\n", "\n", "# don't know how many `x` we'll generate, and won't keep them\n", "while total < 100:\n", "    x = random.random()\n", "    total += x\n", "\n", "    stats.update(x)" ] }, { "cell_type": "markdown", "id": "0177f739-7ec6-45dc-b2f9-72096c1053e3", "metadata": {}, "source": [ "We can now check a variety of information about the values generated:" ] }, { "cell_type": "code", "execution_count": 14, "id": "ef53a8bd-0106-4830-a00d-25ee63f0ef72", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Count: 207\n", " Mean: 0.48428927341941447\n", " Variance: 0.07836627923673178\n", "Standard Deviation: 0.27993977787504903\n", " Error on the mean: 0.019457159584961168\n", " Relative Error: 0.04017673042308015\n" ] } ], "source": [ "print(\" Count: {}\".format(stats.count))\n", "print(\" Mean: 
{}\".format(stats.mean))\n", "print(\" Variance: {}\".format(stats.var))\n", "print(\"Standard Deviation: {}\".format(stats.std))\n", "print(\" Error on the mean: {}\".format(stats.err))\n", "print(\" Relative Error: {}\".format(stats.rel_err))" ] }, { "cell_type": "markdown", "id": "545e6fa7-2499-4555-b941-1ddd66919647", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "For performance, {class}`~xyzpy.RunningStatistics` is a ``numba`` compiled class,\n", "and can also be updated using an iterable very efficiently:" ] }, { "cell_type": "code", "execution_count": 15, "id": "d0109867-1c62-4e24-bcae-44f296854c3d", "metadata": {}, "outputs": [], "source": [ "xs = (random.random() for _ in range(10000))" ] }, { "cell_type": "code", "execution_count": 16, "id": "270be390-7a1c-4341-86d5-2d60177580fd", "metadata": {}, "outputs": [], "source": [ "stats.update_from_it(xs)" ] }, { "cell_type": "code", "execution_count": 17, "id": "f9203949-8b8e-42ae-9481-32aa733329a0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10207" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.count" ] }, { "cell_type": "markdown", "id": "48180d84-89a2-4333-8db0-84c037e7d399", "metadata": {}, "source": [ "The relative error should now be much smaller:" ] }, { "cell_type": "code", "execution_count": 18, "id": "3e7af838-dae0-4923-95d2-fd1469886b76", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.005704083543763922" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.rel_err" ] }, { "cell_type": "markdown", "id": "33532a38-f2c7-4e32-ba52-dfde994e1dd3", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "### Estimating Repeat Quantities\n", "\n", "Another common scenario is when you have a function that returns\n", "a noisy estimate, which you would like to estimate to some\n", "relative error. 
The function {func}`~xyzpy.estimate_from_repeats`\n", "provides this functionality, building on {class}`~xyzpy.RunningStatistics`.\n", "\n", "As an example, imagine we want to estimate the sum of ``n`` uniformly distributed\n", "numbers to a relative error of 0.01%:" ] }, { "cell_type": "code", "execution_count": 19, "id": "4d6f5da0-9f91-4e79-8074-3bf4c7368fcc", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "32432it [00:00, 285928.59it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "RunningStatistics(mean=5.00007(50)e+02, count=32433)\n" ] } ], "source": [ "def rand_n_sum(n):\n", "    return np.random.rand(n).sum()\n", "\n", "\n", "stats = xyz.estimate_from_repeats(rand_n_sum, n=1000, rtol=0.0001, verbosity=1)" ] }, { "cell_type": "markdown", "id": "6aaa8021-01cc-4547-a552-630e8190c5de", "metadata": {}, "source": [ "We can then query the returned ``RunningStatistics`` object:" ] }, { "cell_type": "code", "execution_count": 20, "id": "d9184618-1191-425a-a219-f8893653971c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(500.00662606906917)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.mean" ] }, { "cell_type": "code", "execution_count": 21, "id": "e1d11c41-740f-4f94-9d7b-bdc3010d3725", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(0.00010019919813211347)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.rel_err" ] }, { "cell_type": "markdown", "id": "2a834feb-f48a-4cad-81db-3bbf73da0669", "metadata": {}, "source": [ "Which looks as expected." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3" } }, "nbformat": 4, "nbformat_minor": 4 }