branchTrace.tex

\chapter{Branch Trace} \label{Branch Trace}


\section{Instruction Delta Tracing} \label{Delta Tracing}

Instruction delta tracing, also known as branch tracing, works by
tracking execution from a known start address by sending information
about the deltas taken by the program. Deltas are typically introduced
by jump, call, return and branch type instructions, although
interrupts and exceptions are also types of delta.

Instruction trace delta modes provide an efficient encoding of an
instruction sequence by exploiting the deterministic way the processor
behaves based on the program it is executing. The approach relies on
an offline copy of the program being available to the decoder, so it
is generally unsuitable for either dynamic (self-modifying) programs
or those where access to the program binary is prohibited. There is no
need for either assembly or high-level source code to be available,
although such source code will aid the debugger in presenting the
decoded trace.

This approach can be extended to cope with small sections of
deterministically dynamic code by arranging for the decoder to request
instruction memory from the target. Memory lookups generally lead to a
prohibitive reduction in performance, although they are suitable for
examining modest jump tables, such as the exception/interrupt vector
pointers of an operating system which may be adjusted at boot up and
when services are registered.  Both static and dynamically linked
programs can be traced using this approach. Statically linked programs
are straightforward as they generally operate in a known address
space, often mapping directly to physical memory. Dynamically linked
programs require the debugger to keep track of memory allocation
operations using either trace or stop-mode debugging.

\subsection{Sequential Instructions} \label{Sequential Instructions}

For instruction set architectures where all instructions are executed
unconditionally or at least their execution can be determined based on
the program, the instructions between the deltas are assumed to be
executed sequentially. This characteristic means that there is no need
to report them via the trace, only whether the branches were taken or not
and the address of taken indirect jump.

\subsection{Uninferable PC Discontinuity} \label{uninfpc}

If the program counter is changed by an amount that cannot be
inferred from the execution binary, the trace decoder needs to be
given the destination address (i.e. the address of the next valid
instruction).  Examples of this are indirect jumps, where
the next instruction address is determined by the contents of a
register rather than a constant embedded in the source code.

\subsection{Branches} \label{branches}

When a branch occurs, the decoder must be informed of whether it was
taken or not.  For a direct branch, this is sufficient.  There are no
indirect branches in RISC-V; an indirect jump is an uninferable PC
discontinuity.

\subsection{Interrupts and Exceptions} \label{interruptsexceptions}

Interrupts are a different type of delta, they generally occur
asynchronously to the program's execution rather than intentionally as
a result of a specific instruction or event. Exceptions can be thought
of in the same way, even though they can be typically linked back to a
specific instruction address.  The decoder generally does not know
where an interrupt occurs in the instruction sequence, so the trace
must report the address where normal program flow ceased, as well as
give an indication of the asynchronous destination which may be as
simple as reporting the exception type.  When an interrupt or
exception occurs, or the processor is halted, the final instruction
executed beforehand must be traced.  Following this, for an interrupt
or exception, the next valid instruction address (the first of the
interrupt or exception handler) must be traced in order to instruct the
trace decoder to classify the instruction as an indirect jump even
if it is not.

\subsection{Synchronization} \label{synchronization}

In order to make the trace robust there needs to be regular
synchronization points within the trace. Synchronization is made by
sending a full valued instruction address (and potentially a context
identifier). The decoder and debugger may also benefit from sending
the reason for synchronising. The frequency of synchronization is a
trade-off between robustness and trace bandwidth.

The instruction trace encoder needs to synchronise fully:

\begin{itemize}

\item After a reset.
  \item When tracing starts.
\item If the instruction is the first of an interrupt service routine or
exception handler (hardware context change).
\item After a prolonged period of time.
\end{itemize}

\subsection{Optional and run-time configurable modes} \label{optional}

The following modes are optional, and if present must be run-time selectable.  The 
active run-time options must be reported in the \textit{te\_support} packet, which is issued by the
encoder whenever the encoder configuration is changed.

\subsubsection{Full address} \label{sec:full-address}

All packet formats apart from format 3 output addresses in differential form by default.
An option to output full addresses for all packet formats is a useful debugging aid for 
software decoder developers.  It will always result in less efficient trace encoding.

\subsubsection{Implicit exception} \label{sec:implicit-exception}

The exception handler base address is specified by \textbf{\textit{utvec/stvec/mtvec}}, and 
in some RISC-V implementations the lower address bits can be specified by 
\textbf{\textit{ucause/scause/mcause}}.  
By default, both these values are reported when an exception or interrrupt occurs, 
via the the format 3, subformat 1 packet.  The 'implicit exception' option omits the 
trap handler address, and will improve efficiency in cases where the decoder can infer 
the address of the trap handler from just the exception cause.

\subsubsection{Implicit return} \label{sec:implicit-return}

Although a function return is usually an indirect jump, well behaved programs following a 
calling convention return to the point in the program from which the function was called, 
and as such it is possible to determine the execution path without being explicitly notified 
of the destination address of the return.  The 'implicit return' option can result in very 
significant improvements in trace encoder efficiency.  It utilizes a counter to 
keeps track of the number of nested calls being traced.  The counter increments on calls (but not tail calls),
and decrements on returns (see Section~\ref{Jump Classes} for definitions).  The counter will not 
over or underflow, and is reset to 0 whenever a format 3 \textit{te\_inst} packet is sent.  Returns will be 
treated as inferable and will not generate a trace packet if the count is non-zero (i.e. the associated call 
was already reported in a \textit{te\_inst} message).

\subsubsection{Branch prediction} \label{sec:branch-prediction}

Whilst recording the taken/not-taken status of each branch in a branch map is efficient, there are 
some cases where this can result in a relatively large volume of trace.  For example:

\begin{itemize}
  \item Executing tight loops of straight-line code.  Each iteration of the loop will add a bit to the branch map;
  \item Sitting in an idle loop waiting for an interrupt.  This produces large amounts of trace when nothing of 
  any interest is actually happening!  
  \item Breakpoints, which in some implementations also spin in an idle loop.
\end{itemize}

The prediction scheme implemented in the encoder will need to be modelled in the decoder software.  
The predictor shall comprise a a lookup table of 2\textsuperscript{N} entries, where N is specified by a parameter.  
Each entry is indexed by bits N:1 of the instruction address (or N+1:2 if compressed instructions aren't supported), 
and each contains a 2-bit prediction state:
\begin{itemize}
  \item 00: predict 0, transition to 01 if prediction fails;
  \item 01: predict 0, transition to 00 if prediction succeeds, else 11;
  \item 11: predict 1, transition to 10 if prediction fails;
  \item 10: predict 1, transition to 11 if prediction succeeds, else 00.
\end{itemize}

We could also consider the gShare predictor (see Hennessy \& Patterson).  Some further experimentation is needed
to determine the benefits of different lookup table sizes and predictor algorithms.