Utilities#

xyzpy provides a number of utilities that might be generally useful when generating data: Timer, benchmark and Benchmarker, for timing and comparing functions, and then RunningStatistics and estimate_from_repeats, for collecting running statistics and estimating quantities from repeats.
```python
%config InlineBackend.figure_formats = ['svg']

import xyzpy as xyz
import numpy as np
```
Timing#

Simple timing with Timer#

This is a super simple context manager for very roughly timing a statement that runs once:
```python
with xyz.Timer() as timer:
    A = np.random.randn(512, 512)
    el, ev = np.linalg.eig(A)

timer.interval
```

```
0.42705488204956055
```
If you run this a few times you might notice some big fluctuations.
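One way to see these fluctuations concretely is with the standard library's timeit module (which benchmark, described below, wraps); for example:

```python
import timeit

# time the same eigendecomposition several times - individual runs can
# fluctuate considerably, so repeating (and e.g. taking the minimum)
# gives a more stable figure
times = timeit.repeat(
    "np.linalg.eig(A)",
    setup="import numpy as np; A = np.random.randn(256, 256)",
    repeat=5,
    number=1,
)
best = min(times)
```

The spread between `max(times)` and `min(times)` is often surprisingly large for a single-shot measurement.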
Advanced timing with benchmark#

This is a more advanced and accurate function that wraps timeit under the hood. It offers, however, a convenient interface that accepts callables and sensibly manages how many repeats to do etc.:
```python
def setup(n=512):
    return np.random.randn(n, n)

def foo(A):
    return np.linalg.eig(A)

xyz.benchmark(foo, setup=setup)
```

```
0.21258469700114802
```
Or we can specify the size n to benchmark with as well:
```python
xyz.benchmark(foo, setup=setup, n=1024)
```

```
0.7750730779953301
```
Which is calling foo(setup(n)) under the hood. Generally the setup and n arguments are optional - including them or not allows switching between the following underlying patterns:

```python
foo()
foo(n)
foo(setup())
foo(setup(n))
```
Supply starmap=True if you want foo(*setup(n)), and see benchmark() for other options, e.g. the minimum time and number of repeats to aim for.
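The choice between these patterns can be illustrated with a short sketch (this just mimics the dispatch described above, it is not xyzpy's actual internals):

```python
def call(foo, setup=None, n=None, starmap=False):
    # choose the calling pattern based on which arguments are supplied
    if setup is not None:
        args = setup() if n is None else setup(n)
        return foo(*args) if starmap else foo(args)
    return foo() if n is None else foo(n)

# one example of each pattern
r1 = call(lambda: "no args")                      # foo()
r2 = call(lambda n: n * 2, n=3)                   # foo(n)
r3 = call(len, setup=lambda: [1, 2, 3])           # foo(setup())
r4 = call(len, setup=lambda n: [0] * n, n=4)      # foo(setup(n))
r5 = call(lambda a, b: a + b,
          setup=lambda n: (n, n), n=5,
          starmap=True)                           # foo(*setup(n))
```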
Comparing performance with Benchmarker#

Building on top of benchmark() and combining it with the functionality of a Harvester() gives us a very nice way to compare the performance of various functions, or ‘kernels’. As an example, here we’ll compare python, numpy and numba for computing sum(x**2)**0.5.
```python
import numba as nb

def setup(n):
    return np.random.randn(n)

def python_square_sum(xs):
    y = 0.0
    for x in xs:
        y += x**2
    return y**0.5

def numpy_square_sum(xs):
    return (xs**2).sum()**0.5

@nb.njit
def numba_square_sum(xs):
    y = 0.0
    for x in xs:
        y += x**2
    return y**0.5
```
The setup function will be supplied to each; we can first check that they give the same answer:
```python
xs = setup(100)

python_square_sum(xs)
```

```
9.749946780493362
```

```python
numpy_square_sum(xs)
```

```
9.749946780493364
```

```python
numba_square_sum(xs)
```

```
9.749946780493362
```
Then we can set up a Benchmarker object to compare these with:

```python
kernels = [
    python_square_sum,
    numpy_square_sum,
    numba_square_sum,
]

benchmarker = xyz.Benchmarker(
    kernels,
    setup=setup,
    benchmark_opts={'min_t': 0.01},
)
```
Next we run a set of problem sizes:

```python
sizes = [2**i for i in range(1, 11)]

benchmarker.run(sizes, verbosity=2)
```

```
{'n': 1024, 'kernel': 'numba_square_sum'}: 100%|##########| 30/30 [00:01<00:00, 17.43it/s]
```
Which we can then automatically plot:

```python
benchmarker.plot()
```

```
(<Figure size 300x300 with 1 Axes>,
 array([[<Axes: xlabel='n', ylabel='time'>]], dtype=object))
```
Under the hood, Benchmarker collects and aggregates results using a Harvester. This means that subsequent runs of different sizes will be automatically merged. Additionally, if you initialize the benchmarker with a dataname, the results will be stored in an on-disk dataset.
Estimation#

Efficiently collect running statistics#

Sometimes it is convenient to collect statistics on-the-fly, rather than storing all the values and computing statistics afterwards. The RunningStatistics object can be used for this purpose:
```python
import random

stats = xyz.RunningStatistics()

total = 0.0
# don't know how many `x` we'll generate, and won't keep them
while total < 100:
    x = random.random()
    total += x
    stats.update(x)
```
We can now check a variety of information about the values generated:

```python
print("             Count: {}".format(stats.count))
print("              Mean: {}".format(stats.mean))
print("          Variance: {}".format(stats.var))
print("Standard Deviation: {}".format(stats.std))
print(" Error on the mean: {}".format(stats.err))
print("    Relative Error: {}".format(stats.rel_err))
```

```
             Count: 199
              Mean: 0.5031722149433184
          Variance: 0.08545735406675721
Standard Deviation: 0.2923308982416282
 Error on the mean: 0.020722787940669462
    Relative Error: 0.041184285072266666
```
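These fields are related by the usual formulas, err = std / sqrt(count) and rel_err = err / |mean|, and such statistics can be maintained in a single pass with Welford's online algorithm, which is presumably similar to what RunningStatistics does internally (an assumption). A minimal pure-python sketch of the idea:

```python
import math
import random

class OnlineStats:
    # Welford-style running statistics (a sketch, not the actual
    # numba-compiled xyzpy implementation)
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def var(self):
        # sample variance
        return self._m2 / (self.count - 1)

    @property
    def std(self):
        return math.sqrt(self.var)

    @property
    def err(self):
        # standard error on the mean: std / sqrt(count)
        return self.std / math.sqrt(self.count)

    @property
    def rel_err(self):
        return self.err / abs(self.mean)

random.seed(42)
s = OnlineStats()
for _ in range(1000):
    s.update(random.random())
```

For uniform samples on [0, 1) this converges towards mean 0.5 and variance 1/12, matching the figures above.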
For performance, RunningStatistics is a numba compiled class, and can also be updated using an iterable very efficiently:
```python
xs = (random.random() for _ in range(10000))
stats.update_from_it(xs)

stats.count
```

```
10199
```
The relative error should now be much smaller:
```python
stats.rel_err
```

```
0.0056581110276361585
```
Estimating Repeat Quantities#

Another common scenario is when you have a function that returns a noisy estimate, which you would like to estimate to some relative error. The function estimate_from_repeats() provides this functionality, building on RunningStatistics. As an example, imagine we want to estimate the sum of n uniformly distributed numbers to a relative error of 0.01%:
```python
def rand_n_sum(n):
    return np.random.rand(n).sum()

stats = xyz.estimate_from_repeats(rand_n_sum, n=1000, rtol=0.0001, verbosity=1)
```

```
32637it [00:00, 92507.22it/s]
RunningStatistics(mean=4.99971(50)e+02, count=32638)
```
We can then query the returned RunningStatistics object:

```python
stats.mean
```

```
499.9706134376662
```

```python
stats.rel_err
```

```
0.00010019753062400043
```
Which looks as expected.
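The core loop presumably looks something like the following (a minimal sketch under that assumption, not xyzpy's actual implementation): keep calling the function, update running statistics, and stop once the relative error on the mean falls below rtol.

```python
import math
import random

def estimate(fn, rtol, min_count=32):
    # repeat fn(), maintaining Welford running statistics, until the
    # relative error on the mean falls below rtol
    count, mean, m2 = 0, 0.0, 0.0
    while True:
        x = fn()
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        if count >= min_count:
            # standard error on the mean
            err = math.sqrt(m2 / (count - 1)) / math.sqrt(count)
            if err < rtol * abs(mean):
                return mean, count

# estimate the sum of 100 uniform numbers to 0.1% relative error,
# using the standard library rather than numpy for self-containedness
random.seed(0)
mean, count = estimate(
    lambda: sum(random.random() for _ in range(100)), rtol=0.001
)
```

Note the number of repeats needed scales like (std / (rtol * mean))**2, so very tight tolerances can require many samples.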