Tutorial
This is the decu tutorial. Follow along for a tour of decu's main
features.
Example Project
After you have installed decu in your (virtual) environment, you can
follow along with this example project. This example will showcase the
three main sets of functionality that decu provides: project
organization, bookkeeping, and inspection. It is best if you follow along
by copying and pasting the source code we show below in a temporary
project.
Project organization: decu init
The first step is to get our file system set up to run an experimental
computation project. Navigate to an empty directory where your project is
going to live. Call this directory root_dir. Now execute
$ decu init
Initialized empty decu project directory in <root_dir>
decu follows a somewhat strict template of what a project filesystem
should look like, and with $ decu init, all we're doing is setting up
root_dir to reflect decu's organization. If you now ls your root_dir
you will see something like the following.
root_dir
    src
    data
    logs
    pics
    results
The purpose of each directory created by decu is straightforward. The
data directory is assumed to contain raw or processed data files. The
logs directory will store the logs of running experiments. The pics and
results directories will hold plots, figures, and result files generated
by your decu scripts. The src directory will contain said scripts.
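To make the layout concrete, the following sketch recreates the same five directories with Python's standard library. It only illustrates the resulting structure and is not how decu init is implemented; the helper name make_layout is hypothetical.

```python
from pathlib import Path

# Directory names taken from the listing above.
LAYOUT = ("src", "data", "logs", "pics", "results")


def make_layout(root):
    """Create the five project directories under root (hypothetical helper)."""
    root = Path(root)
    for name in LAYOUT:
        # exist_ok=True makes the call safe to repeat on an existing project.
        (root / name).mkdir(parents=True, exist_ok=True)
    # Return the directory names that now exist, sorted for easy comparison.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling make_layout on an empty directory returns the five names in alphabetical order.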
Project bookkeeping: decu exec
With this directory structure, we can now start coding our computational
experiments. Create a file script.py inside root_dir/src that contains
the following.
import decu

class MyScript(decu.Script):
    @decu.experiment(data_param='data')
    def exp(self, data, param, param2):
        return [x**param + param2 for x in data]

    def main(self):
        data = range(100)
        result = self.exp(data, 2.72, 3.14)
In script.py we subclass decu.Script, define an experiment method exp,
and call it on some data from the main method. Note that exp is
decorated with @decu.experiment, which requires us to specify which of
exp's parameters is treated as the data input. All other parameters are
treated as 'experimental parameters'. More on experimental parameters
later. That's basically all that this code does.
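One way to picture the split between the data input and the experimental parameters is with inspect.signature from the standard library. The sketch below is an assumption about the idea, not decu's actual code; split_params is a hypothetical helper and exp is written as a plain function for brevity.

```python
import inspect


def split_params(func, data_param, *args, **kwargs):
    """Bind a call's arguments by name, then separate out the data input."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    params = dict(bound.arguments)
    # Whatever is left after removing the data input is an
    # 'experimental parameter'.
    data = params.pop(data_param)
    return data, params


def exp(data, param, param2):
    return [x**param + param2 for x in data]


data, params = split_params(exp, "data", range(100), 2.72, 3.14)
print(params)  # {'param': 2.72, 'param2': 3.14}
```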
What could we expect decu to do in this simple example? Bookkeeping! Note
that we didn't save the results of our experiment to disk, or use log
files or print calls to document what the script is doing. This was not
an oversight; it was on purpose. In fact, decu will do all of this for
us.
Before running the above script, the directory structure should look as follows.
root_dir
    src
        script.py
    data
    logs
    pics
    results
NOTE: After having called $ decu init in root_dir, all successive calls
to decu should be made from root_dir itself. That is, do not cd into
src/ and then call decu. All console commands from here on will be done
from root_dir.
Simple run
Now cd to root_dir again and run our script through decu.
$ decu exec src/script.py
After executing script.py, we can now take a look at what happened to
root_dir.
root_dir
    src
        script.py
    data
    logs
        log_file1.txt
    pics
    results
        result_file1--0.txt
Here, log_file1.txt and result_file1--0.txt stand in for the real file
names, which include the date and time of execution and the name of the
script that generated these files, among other information.
This is (one of) the main features of decu. We needn't specify what
information we want to log, or the file name where we want to save our
experimental results. In fact, we didn't even need to manually save the
results to disk ourselves. decu will take care of the bookkeeping.
To see more specifically what decu saves to the log file, do
$ cat logs/log_file1.txt
[<time>]INFO: Starting exp--0 with {'param': 2.72, 'param2': 3.14}.
[<time>]INFO: Finished exp--0. Took 4e-05.
[<time>]INFO: Wrote results of exp--0 to results/result_file1.txt.
Since our script is very simple, decu just needed to log three
lines. However, you will find there's a trove of information here. Without
writing a single line of logging or I/O code, we now have:
- a unique file with a time-stamped recount of what our script did,
- a record of the experimental parameters with which exp was called,
- the time it took to run exp,
- a file containing the results of running exp with the recorded
  parameters. This file contains in its name the name of the script and
  the function that generated its contents.
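The bookkeeping listed above can be pictured as a decorator that wraps each experiment call. The following is a minimal sketch under stated assumptions, not decu's implementation: the decorator name bookkept, the temporary results directory, and the one-value-per-line text format are all hypothetical. The per-call counter produces suffixes such as --0.

```python
import functools
import inspect
import itertools
import logging
import tempfile
import time
from pathlib import Path

logging.basicConfig(format="[%(asctime)s]%(levelname)s: %(message)s",
                    level=logging.INFO)
log = logging.getLogger("bookkeeping_sketch")


def bookkept(data_param, results_dir):
    """Hypothetical decorator: log parameters and run time, save the result."""
    def decorator(func):
        sig = inspect.signature(func)
        run_ids = itertools.count()  # 0, 1, 2, ... one id per call

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            name = f"{func.__name__}--{next(run_ids)}"
            params = dict(sig.bind(*args, **kwargs).arguments)
            params.pop(data_param)  # the data input is not logged
            log.info("Starting %s with %s.", name, params)
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            log.info("Finished %s. Took %.2gs.", name, elapsed)
            # Assumed storage format: one value per line in a text file.
            out = Path(results_dir) / f"{name}.txt"
            out.write_text("\n".join(map(str, result)))
            log.info("Wrote results of %s to %s.", name, out)
            return result
        return wrapper
    return decorator


results_dir = tempfile.mkdtemp()  # stand-in for the project's results directory


@bookkept(data_param="data", results_dir=results_dir)
def exp(data, param, param2):
    return [x**param + param2 for x in data]


first = exp(range(100), 2, 0)
second = exp(range(100), 3, 1)
```

After the two calls, the temporary directory holds exp--0.txt and exp--1.txt, and the log shows a Starting, Finished, and Wrote line for each.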
To understand why log_file1.txt logs the call to exp as exp--0, we need
to modify script.py a little.
Multiple runs
import decu

class MyScript(decu.Script):
    @decu.experiment(data_param='data')
    def exp(self, data, param, param2):
        return [x**param + param2 for x in data]

    def main(self):
        data = range(100)
        result = self.exp(data, 2.72, 3.14)
        params = [(data, x, y) for x, y in zip([1, 2, 3], [-1, 0, 1])]
        result2 = decu.run_parallel(self.exp, params)
We have included further experiments now. We are calling the same method
exp but with a different choice of parameters each time.
decu.run_parallel(method, params) will call method(*l) for each element
l in params, and it will do so by using Python's multiprocessing
library, which means that these experiments will be run in parallel.
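The calling convention method(*l) corresponds to what the standard library calls starmap. The sketch below illustrates it with multiprocessing.pool.ThreadPool, which uses threads so that the snippet runs as a plain script; multiprocessing.Pool offers the same starmap method but starts separate processes. This is not decu's implementation, and run_all is a hypothetical name.

```python
from multiprocessing.pool import ThreadPool


def exp(data, param, param2):
    return [x**param + param2 for x in data]


def run_all(method, params):
    """Call method(*l) for every l in params, spreading calls over a pool."""
    with ThreadPool() as pool:
        # starmap unpacks each tuple into positional arguments and
        # returns the results in the same order as params.
        return pool.starmap(method, params)


data = range(5)
params = [(data, x, y) for x, y in zip([1, 2, 3], [-1, 0, 1])]
results = run_all(exp, params)
print(results[1])  # [0, 1, 4, 9, 16]
```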
To execute this new version, we do
$ decu exec src/script.py
First of all, take a look at the current state of root_dir, after a
second run of our script.
root_dir
    src
        script.py
    data
    logs
        log_file1.txt
        log_file2.txt
    pics
    results
        result_file1--0.txt
        result_file2--0.txt
        result_file2--1.txt
        result_file2--2.txt
        result_file2--3.txt
Here's what happened: decu created a new log file for this second
experimental run, log_file2.txt. It also generated one result file for
each of the experiments we ran. To understand the contents of the new
result files, we need only read the new log file. (Your output may be
slightly different in the order of the lines.)
$ cat logs/log_file2.txt
[<time>]INFO: Starting exp--0 with {'param': 2.72, 'param2': 3.14}.
[<time>]INFO: Finished exp--0. Took 0.0004s.
[<time>]INFO: Wrote results of exp--0 to results/2017-10-20 18:21:28.374962--script--exp--0.txt.
[<time>]INFO: Starting exp--2 with {'param': 2, 'param2': 0}.
[<time>]INFO: Starting exp--1 with {'param': 1, 'param2': -1}.
[<time>]INFO: Starting exp--3 with {'param': 3, 'param2': 1}.
[<time>]INFO: Finished exp--2. Took 4e-05s.
[<time>]INFO: Finished exp--1. Took 5e-05s.
[<time>]INFO: Finished exp--3. Took 6e-05s.
[<time>]INFO: Wrote results of exp--2 to results/2017-10-20 18:21:28.374962--script--exp--2.txt.
[<time>]INFO: Wrote results of exp--1 to results/2017-10-20 18:21:28.374962--script--exp--1.txt.
[<time>]INFO: Wrote results of exp--3 to results/2017-10-20 18:21:28.374962--script--exp--3.txt.
The first three lines are familiar. They correspond to the call to exp
that we had before, and they provide similar information. We now know that
result_file2--0.txt contains the result of running exp with parameters
{'param': 2.72, 'param2': 3.14}. Then we used run_parallel to run exp
over a list of parameters, and we get the last nine lines, which contain
similar information as before, but for the three additional times we called
exp.
Observe that each time we call exp, the log file includes two dashes
followed by a number, --x. This is a way of identifying different calls
to the same experiment, and serves to tell which result file belongs to
which experiment call. Note that the result file names also contain the
same identifier at the end. Without this identifier, our result files
would all overwrite each other, or else there would be no way to tell
which contains the result of which experiment call. In the example above,
we need only match the identifier in a result file name with the
experiment identifiers in the log file to know
- the method that generated the result file,
- the time it took to generate the file, and
- the experimental parameters that were used to generate it.
In other words, with these identifiers we can cross-reference results and
experiment runs by using the log file. The run identifiers are guaranteed
to be unique for each call to exp, even when using parallelism with
run_parallel.
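The cross-referencing described here can also be done programmatically. The sketch below parses log lines in the format shown above and maps each result file to the parameters that produced it. The regular expressions and the function name cross_reference are assumptions based only on the sample log output, not part of decu.

```python
import ast
import re

# Three lines copied from the sample log above.
LOG = """\
[<time>]INFO: Starting exp--2 with {'param': 2, 'param2': 0}.
[<time>]INFO: Finished exp--2. Took 4e-05s.
[<time>]INFO: Wrote results of exp--2 to results/2017-10-20 18:21:28.374962--script--exp--2.txt.
"""

STARTED = re.compile(r"Starting (\S+) with (\{.*\})\.$")
WROTE = re.compile(r"Wrote results of (\S+) to (.+)\.$")


def cross_reference(log_text):
    """Map each result file path to the parameters that produced it."""
    params, files = {}, {}
    for line in log_text.splitlines():
        started = STARTED.search(line)
        wrote = WROTE.search(line)
        if started:
            # The parameters are logged as a Python dict literal.
            params[started.group(1)] = ast.literal_eval(started.group(2))
        elif wrote:
            files[wrote.group(1)] = wrote.group(2)
    # Join the two tables on the run identifier, e.g. 'exp--2'.
    return {path: params[run] for run, path in files.items()}


table = cross_reference(LOG)
```

Here table has a single entry whose key is the result file path and whose value is {'param': 2, 'param2': 0}.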
Oh, never mind the fact that we just used multiprocessing to run our
experiments in parallel, with no additional imports and in a single call of
run_parallel.
Figures
So now decu is handling logging, directory bookkeeping, cross-referencing
experimental parameters and results, and parallelism, in 12 lines of
Python (two of which are empty, BTW). But there's more!
Say now that you want to plot a pretty picture from your results. Enter the
@figure decorator, used in the following version of script.py.
import decu
import matplotlib.pyplot as plt

class MyScript(decu.Script):
    @decu.experiment(data_param='data')
    def exp(self, data, param, param2):
        return [x**param + param2 for x in data]

    @decu.figure()
    def fig(self, data, results):
        for res in results:
            plt.semilogy(data, res)

    def main(self):
        data = range(100)
        result = self.exp(data, 2.72, 3.14)
        params = [(data, x, y) for x, y in zip([1, 2, 3], [-1, 0, 1])]
        result2 = decu.run_parallel(self.exp, params)
        self.fig(data, result2)
After importing matplotlib, we have added the fig method, which we have
decorated with @decu.figure() and which we call at the end of main.
You know the drill now:
$ decu exec src/script.py
The root_dir should now look as follows.
root_dir
    src
        script.py
    data
    logs
        log_file1.txt
        log_file2.txt
        log_file3.txt
    pics
        fig_file1.png
    results
        result_file1--0.txt
        result_file2--0.txt
        result_file2--1.txt
        result_file2--2.txt
        result_file2--3.txt
        result_file3--0.txt
        result_file3--1.txt
        result_file3--2.txt
        result_file3--3.txt
As before, we get our log file log_file3.txt and four result files. We
also get our plot inside pics/. "But wait!", I hear you say, "we never
saved the plot to disk!" Exactly. You can open this file to convince
yourself of how wonderful decu is.
You can also read the log file to see that it mentions which method
(fig) generated the new figure file.
Project debugging: decu inspect
Oh, darn. We forgot to add a title to our plot. After we have modified our
fig function to include a nice title, we can generate the new plot in a
number of ways. First, we can run the whole thing again. This becomes
increasingly cumbersome (and sometimes outright impossible) if exp takes
too long to run, as it often does in real life. Second, since we have the
result file, we can pop into a Python interpreter, read the result from
disk, and call fig again. This would require us to load not only the
result, but also the file script.py, and then instantiate the class
MyScript. How tedious.
OR, we can use decu to do exactly that.
$ decu inspect results/result_file3*
import decu
import numpy as np
import src.script as script
script = script.MyScript('root_dir', 'script')
# loaded result
In [1]: data = range(100)
In [2]: script.fig(data, result)
In [3]: exit
If you have the necessary data in a file, then another possibility is:
$ decu inspect result_dict=results/result_file3* --data=data/data_file1.txt
import decu
import numpy as np
import src.script as script
script = script.MyScript('/tmp/root_dir', 'script')
# loaded result
# loaded data
In [1]: script.fig(data, result)
In [2]: exit
And in that case, yet another possibility, and the most efficient one, is
to use the -c flag:
$ decu inspect -c "script.fig(data, result)" \
result_dict=results/result_file3* \
--data=data/data_file1.txt
import decu
import numpy as np
import src.script as script
script = script.MyScript('/tmp/root_dir', 'script')
# loaded result
# loaded data
script.fig(data, result)
exit
$
The -c flag takes an arbitrary string, executes it after the initial file
loading is done, and then quits IPython. This makes it possible to fix our
figure in a single decu call, without having to drop to an interpreter or
manually load anything. Nifty, yes?
The final result is: