Utilities#

xyzpy provides a number of utilities that might be generally useful when generating data:

- Timer and benchmark for timing and comparing functions;
- RunningStatistics and estimate_from_repeats for collecting running statistics and estimating quantities from repeats.

%config InlineBackend.figure_formats = ['svg']

import xyzpy as xyz
import numpy as np

Timing#

Simple timing with Timer#

This is a super simple context manager for very roughly timing a statement that runs once:

with xyz.Timer() as timer:

    A = np.random.randn(512, 512)
    el, ev = np.linalg.eig(A)

timer.interval
0.42705488204956055

If you run this a few times you might notice some big fluctuations.
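Conceptually, a timer like this is just a thin wrapper around a wall clock. A minimal sketch of the idea (SimpleTimer here is a hypothetical stand-in, not xyzpy's actual implementation):

```python
import time

class SimpleTimer:
    """Rough single-shot timer: record a start time on entry,
    store the elapsed interval on exit."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.interval = time.perf_counter() - self._start
        return False

with SimpleTimer() as t:
    sum(i * i for i in range(100_000))

t.interval  # a small positive float
```

Because only a single run is measured, the result includes whatever else the machine happened to be doing at that moment, hence the fluctuations.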

Advanced timing with benchmark#

This is a more advanced and accurate function that wraps timeit under the hood. It offers, however, a convenient interface that accepts callables and sensibly manages how many repeats to perform, etc.:

def setup(n=512):
    return np.random.randn(n, n)

def foo(A):
    return np.linalg.eig(A)

xyz.benchmark(foo, setup=setup)
0.21258469700114802

Or we can specify the size n to benchmark with as well:

xyz.benchmark(foo, setup=setup, n=1024)
0.7750730779953301

This calls foo(setup(n)) under the hood. Both the setup and n arguments are optional - including or omitting them switches between the following underlying patterns:

foo()
foo(n)
foo(setup())
foo(setup(n))

Supply starmap=True if you want foo(*setup(n)), and see benchmark() for other options, e.g. the minimum time and number of repeats to aim for.
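The dispatch between these patterns could be sketched roughly as follows (call_pattern is a hypothetical illustration of the logic, not xyzpy's internals):

```python
def call_pattern(foo, setup=None, n=None, starmap=False):
    """Sketch of how the four calling patterns above might be
    dispatched based on which arguments are supplied."""
    if setup is None:
        # no setup: foo() or foo(n)
        args = () if n is None else (n,)
    else:
        # with setup: foo(setup()) or foo(setup(n))
        made = setup() if n is None else setup(n)
        # starmap=True unpacks setup's result: foo(*setup(n))
        args = made if starmap else (made,)
    return foo(*args)

# e.g. the foo(setup(n)) pattern:
call_pattern(len, setup=lambda n: list(range(n)), n=5)  # -> 5
```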

Comparing performance with Benchmarker#

Building on top of benchmark() and combining it with the functionality of a Harvester() gives us a very nice way to compare the performance of various functions, or ‘kernels’.

As an example here we’ll compare python, numpy and numba for computing sum(x**2)**0.5.

import numba as nb

def setup(n):
    return np.random.randn(n)

def python_square_sum(xs):
    y = 0.0
    for x in xs:
        y += x**2
    return y**0.5

def numpy_square_sum(xs):
    return (xs**2).sum()**0.5

@nb.njit
def numba_square_sum(xs):
    y = 0.0
    for x in xs:
        y += x**2
    return y**0.5

The same setup function will be supplied to each kernel. First, we can check that they all give the same answer:

xs = setup(100)
python_square_sum(xs)
9.749946780493362
numpy_square_sum(xs)
9.749946780493364
numba_square_sum(xs)
9.749946780493362
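Note that sum(x**2)**0.5 is just the Euclidean norm, so NumPy's built-in can serve as an independent reference when checking agreement:

```python
import numpy as np

xs = np.random.randn(100)

# the pure-numpy kernel from above, checked against np.linalg.norm
kernel_result = (xs**2).sum() ** 0.5
reference = np.linalg.norm(xs)
close = np.isclose(kernel_result, reference)
```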

Then we can set up a Benchmarker object to compare these with:

kernels = [
    python_square_sum,
    numpy_square_sum,
    numba_square_sum,
]

benchmarker = xyz.Benchmarker(
    kernels,
    setup=setup,
    benchmark_opts={'min_t': 0.01}
)

Next we run a set of problem sizes:

sizes = [2**i for i in range(1, 11)]

benchmarker.run(sizes, verbosity=2)
{'n': 1024, 'kernel': 'numba_square_sum'}: 100%|##########| 30/30 [00:01<00:00, 17.43it/s] 

Which we can then automatically plot:

benchmarker.plot()
(<Figure size 300x300 with 1 Axes>,
 array([[<Axes: xlabel='n', ylabel='time'>]], dtype=object))

Under the hood Benchmarker collects and aggregates results using a Harvester. This means that subsequent runs of different sizes will be automatically merged. Additionally, if you initialize the benchmarker with a dataname, the results will be stored in an on-disk dataset.

Estimation#

Efficiently collect running statistics#

Sometimes it is convenient to collect statistics on-the-fly, rather than storing all the values and computing statistics afterwards. The RunningStatistics object can be used for this purpose:

import random

stats = xyz.RunningStatistics()
total = 0.0

# don't know how many `x` we'll generate, and won't keep them
while total < 100:
    x = random.random()
    total += x

    stats.update(x)

We can now check a variety of information about the values generated:

print("             Count: {}".format(stats.count))
print("              Mean: {}".format(stats.mean))
print("          Variance: {}".format(stats.var))
print("Standard Deviation: {}".format(stats.std))
print(" Error on the mean: {}".format(stats.err))
print("    Relative Error: {}".format(stats.rel_err))
             Count: 199
              Mean: 0.5031722149433184
          Variance: 0.08545735406675721
Standard Deviation: 0.2923308982416282
 Error on the mean: 0.020722787940669462
    Relative Error: 0.041184285072266666
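The constant-memory update that makes this possible is essentially Welford's online algorithm. A pure-Python sketch of the idea (xyzpy's actual implementation and variance convention may differ):

```python
class Welford:
    """Track count, mean and variance in O(1) memory per update."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def var(self):
        # sample variance (assumed convention for this sketch)
        return self._m2 / (self.count - 1)

w = Welford()
for x in [1.0, 2.0, 3.0, 4.0]:
    w.update(x)

w.mean, w.var  # (2.5, 1.666...)
```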

For performance, RunningStatistics is a numba-compiled class, and can also be updated very efficiently from an iterable:

xs = (random.random() for _ in range(10000))
stats.update_from_it(xs)
stats.count
10199

The relative error should now be much smaller:

stats.rel_err
0.0056581110276361585

Estimating Repeat Quantities#

Another common scenario is when you have a function that returns a noisy estimate, which you would like to estimate to some relative error. The function estimate_from_repeats() provides this functionality, building on RunningStatistics.
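The control flow here can be sketched as a loop that keeps sampling until the error on the mean is small enough relative to the mean itself (estimate_to_rtol is a hypothetical illustration using an inline running mean and variance, not xyzpy's actual implementation):

```python
import math
import random

def estimate_to_rtol(fn, rtol, min_count=8):
    """Repeatedly call fn() until the standard error on the mean
    drops below rtol * |mean|, then return (mean, count)."""
    count, mean, m2 = 0, 0.0, 0.0
    while True:
        x = fn()
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        if count >= min_count:
            # standard error of the mean from the running variance
            err = math.sqrt(m2 / (count - 1)) / math.sqrt(count)
            if err <= rtol * abs(mean):
                return mean, count

random.seed(42)
mean, count = estimate_to_rtol(random.random, rtol=0.01)
# mean should be close to 0.5, the expectation of random.random()
```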

As an example, imagine we want to estimate the sum of n uniformly distributed numbers to a relative error of 0.1%:

def rand_n_sum(n):
    return np.random.rand(n).sum()

stats = xyz.estimate_from_repeats(rand_n_sum, n=1000, rtol=0.0001, verbosity=1)
32637it [00:00, 92507.22it/s]
RunningStatistics(mean=4.99971(50)e+02, count=32638)

We can then query the returned RunningStatistics object:

stats.mean
499.9706134376662
stats.rel_err
0.00010019753062400043

Which looks as expected.