Utilities

xyzpy provides a number of utilities that might be generally useful when generating data. These are:

For timing and comparing functions. And then:

for collecting running statistics and estimating quantities from repeats.

%config InlineBackend.figure_formats = ['svg']

import numpy as np

import xyzpy as xyz

Timing

Simple timing with Timer

This is a super simple context manager for very roughly timing a statement that runs once:

with xyz.Timer() as timer:
    A = np.random.randn(512, 512)
    el, ev = np.linalg.eig(A)

timer.interval
0.27247190475463867

If you run this a few times you might notice some big fluctuations.

Advanced timing with benchmark

This is a more advanced and accurate function that wraps timeit under the hood. If offers however a convenient interface that accepts callables and sensibly manages how many repeats to do etc.:

def setup(n=512):
    return np.random.randn(n, n)


def foo(A):
    return np.linalg.eig(A)


xyz.benchmark(foo, setup=setup)
0.16060145798837766

Or we can specfic the size n to benchmark with as well:

xyz.benchmark(foo, setup=setup, n=1024)
0.786840959044639

Which is calling foo(setup(n)) under the hood. Generally the setup and n arguments are optional - including them or not allows switching between the following underlying patterns:

foo()
foo(n)
foo(setup())
foo(setup(n))

Supply starmap=True if you want foo(*setup(n)), and see benchmark() for other options, e.g. the minimum time and number of repeats to aim for.

Comparing performance with Benchmarker

Building on top of benchmark() and combining it with the functionality of a Harvester() gives us a very nice way to compare the performance of various functions, or ‘kernels’.

As an example here we’ll compare python, numpy and numba for computing sum(x**2)**0.5.

import numba as nb


def setup(n):
    return np.random.randn(n)


def python_square_sum(xs):
    y = 0.0
    for x in xs:
        y += x**2
    return y**0.5


def numpy_square_sum(xs):
    return (xs**2).sum() ** 0.5


@nb.njit
def numba_square_sum(xs):
    y = 0.0
    for x in xs:
        y += x**2
    return y**0.5

The setup function will be supplied to each, we can check they first give the same answer:

xs = setup(100)
python_square_sum(xs)
np.float64(9.97523320851365)
numpy_square_sum(xs)
np.float64(9.97523320851365)
numba_square_sum(xs)
9.97523320851365

Then we can set up a Benchmarker object to compare these with:

kernels = [
    python_square_sum,
    numpy_square_sum,
    numba_square_sum,
]

benchmarker = xyz.Benchmarker(
    kernels, setup=setup, benchmark_opts={"min_t": 0.01}
)

Next we run a set of problem sizes:

sizes = [2**i for i in range(1, 11)]

benchmarker.run(sizes, verbosity=2)
100%|##########| 30/30 [00:01<00:00, 21.85it/s, {'n': 1024, 'kernel': 'numba_square_sum'}] 

Which we can then automatically plot:

benchmarker.plot()
_images/f6294d89c70313ab416799c7b741f40d3b1fb31cdc1b6972b42860588d30e6a5.svg
(<Figure size 300x300 with 1 Axes>,
 array([[<Axes: xlabel='n', ylabel='time'>]], dtype=object))

Under the hood Benchmarker collects and aggregates results using a Harvester. This means that subsequent runs of different sizes will be automatically merged. Additionally, if you initialize the benchmarker with a dataname, the results will be stored in a on-disk dataset.

Estimation

Efficiently collect running statistics

Sometimes it is convenient to collect statistics on-the-fly, rather than storing all the values and computing statistics afterwards. The RunningStatistics object can be used for this purpose:

import random

stats = xyz.RunningStatistics()
total = 0.0

# don't know how many `x` we'll generate, and won't keep them
while total < 100:
    x = random.random()
    total += x

    stats.update(x)

We can now check a variety of information about the values generated:

print("             Count: {}".format(stats.count))
print("              Mean: {}".format(stats.mean))
print("          Variance: {}".format(stats.var))
print("Standard Deviation: {}".format(stats.std))
print(" Error on the mean: {}".format(stats.err))
print("    Relative Error: {}".format(stats.rel_err))
             Count: 207
              Mean: 0.48428927341941447
          Variance: 0.07836627923673178
Standard Deviation: 0.27993977787504903
 Error on the mean: 0.019457159584961168
    Relative Error: 0.04017673042308015

For performance, RunningStatistics is a numba compiled class, and can also be updated using an iterable very efficiently:

xs = (random.random() for _ in range(10000))
stats.update_from_it(xs)
stats.count
10207

The relative error should now be much smaller:

stats.rel_err
0.005704083543763922

Estimating Repeat Quantities

Another common scenario is when you have a function that returns a noisy estimate, which you would like to estimate to some relative error. The function estimate_from_repeats() provides this functionality, building on RunningStatistics.

As an example, imagine we want to estimate the sum of n uniformly distributed numbers to a relative error of 0.1%:

def rand_n_sum(n):
    return np.random.rand(n).sum()


stats = xyz.estimate_from_repeats(rand_n_sum, n=1000, rtol=0.0001, verbosity=1)
32432it [00:00, 285928.59it/s]
RunningStatistics(mean=5.00007(50)e+02, count=32433)

We can then query the returned RunningStatistics object:

stats.mean
np.float64(500.00662606906917)
stats.rel_err
np.float64(0.00010019919813211347)

Which looks as expected.