Analysis with Instruction-Based Sampling

This section is a brief introduction to using Instruction-Based Sampling (IBS). It assumes that a CodeAnalyst project is already open, either created by following the directions under Creating a CodeAnalyst Project or opened from an existing CodeAnalyst project. It also assumes that session settings have been established and CodeAnalyst is ready to profile an application.

Collecting IBS Data

A drop-down list of the available profile configurations is included in the CodeAnalyst toolbar.

  1. Select the Instruction-based sampling profile configuration. This profile configuration enables collection of both IBS fetch and IBS op data.

  1. Click the Start button in the toolbar or select Profile > Start to begin profiling. CodeAnalyst starts data collection and launches the application program previously specified in the session settings. The session status displays in the status bar in the lower left corner of the CodeAnalyst window. Session progress displays in the lower right corner. The blank window that opens is the console window in which the application program "classic" is running.

When data collection completes, CodeAnalyst processes the IBS performance data and creates a new session under "IBS Sessions" in the session management area at the left-hand side of the CodeAnalyst window. Results are displayed in the System Data, System Graph, and System Tasks tabs. These tabs behave like their TBP and EBP counterparts. The tables and graph display the number of IBS-derived events that were sampled by the performance monitoring hardware.

 

Reviewing IBS Results

CodeAnalyst reports IBS performance data as IBS-derived events. See Instruction-Based Sampling-Derived Events for descriptions of the IBS-derived events.

Although IBS-derived events look similar to performance monitoring counter (PMC) events, the sampling method is quite different. PMC event samples measure the actual number of particular hardware events that occurred during the measurement period. IBS-derived events report the number of IBS fetch or op samples for which a particular hardware event was observed. Consider three IBS-derived events: IBS all op samples, IBS branch op, and IBS mispredicted branch op.

The IBS all op samples derived event is a count of all IBS op samples that were taken. The IBS branch op derived event is the number of IBS op samples where the monitored macro-op was a branch. These samples are a subset of all the IBS op samples. The IBS mispredicted branch op derived event is the number of IBS op samples where the monitored branch mispredicted. These samples are a subset of the IBS branch op samples. A PMC event can count the actual number of branches (event select 0x0C2). By contrast, it would be incorrect to say that the IBS branch op derived event reports the number of branches, because the sampling basis is different: it reports the number of sampled ops that were branches.
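The subset relationship among these three derived events can be sketched in a few lines of Python. This is only an illustration: the sample records are invented, and the field names (`is_branch`, `mispredicted`) are hypothetical and do not reflect CodeAnalyst's internal data format.

```python
from dataclasses import dataclass


@dataclass
class OpSample:
    """One hypothetical IBS op sample (field names are illustrative only)."""
    is_branch: bool = False
    mispredicted: bool = False


def derive_branch_events(samples):
    """Count the three IBS-derived events described above from op samples."""
    all_ops = len(samples)
    branch_ops = sum(1 for s in samples if s.is_branch)
    # A mispredicted branch sample is, by definition, also a branch sample.
    mispredicted = sum(1 for s in samples if s.is_branch and s.mispredicted)
    return {
        "all_ops": all_ops,
        "branch_ops": branch_ops,
        "mispredicted_branch_ops": mispredicted,
    }


# Four invented samples: two non-branch ops, one branch, one mispredicted branch.
samples = [
    OpSample(),
    OpSample(is_branch=True),
    OpSample(is_branch=True, mispredicted=True),
    OpSample(),
]
events = derive_branch_events(samples)
```

Each count is a number of samples, not a number of hardware events, so the values are meaningful mainly in proportion to one another.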

The "All Data" view shows the number of occurrences of all IBS-derived events. Instruction-Based Sampling collects a wide range of performance data in a single run. When both IBS fetch and op data are collected, the "All Data" view displays information for over 60 IBS-derived events. CodeAnalyst provides several predefined views that display IBS-derived events in logically related groups. The available views are:

Most software developers will be interested in the overall breakdown of IBS ops, branch operations, load/store operations, data cache behavior, and data translation lookaside buffer behavior. The breakdown of local/remote accesses through the Northbridge can provide information about the efficiency of memory access on non-uniform memory access (NUMA) platforms.

  1. Select IBS fetch instruction cache from the drop-down list of views.

IBS information about instruction cache (IC) behavior displays. IC-related IBS-derived events are shown, including the number of IBS fetch samples for attempted and completed fetch operations, the number of fetch samples that indicated an IC miss, and the total IBS fetch latency. An attempted fetch is a request to obtain instruction data from cache or system memory. A fetch attempt may be speculative. A completed fetch actually delivers instruction data to the decoder. The delivered data may go unused if a branch operation redirects the pipeline at a later time. Finally, the view also includes two computed performance measurements: the IC miss ratio (the number of IBS IC misses divided by the number of IBS attempted fetches) and the average fetch latency. Fetch latency is the number of cycles from when a fetch is initiated to when the fetch is either completed or aborted. (An aborted fetch is a fetch operation that does not complete and deliver instruction data to the decoder.)
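The two computed measurements can be sketched as follows. The function and parameter names are hypothetical. The IC miss ratio follows the definition given above; the text does not state the denominator for the average fetch latency, so this sketch assumes it is the number of attempted fetch samples.

```python
def ic_miss_ratio(ic_miss_samples, attempted_fetch_samples):
    """IBS IC misses divided by IBS attempted fetches (0.0 if no fetches)."""
    if attempted_fetch_samples == 0:
        return 0.0
    return ic_miss_samples / attempted_fetch_samples


def average_fetch_latency(total_latency_cycles, attempted_fetch_samples):
    """Total fetch latency divided by attempted fetches.

    Assumption: the denominator is the attempted-fetch sample count;
    the documentation does not say which count is used.
    """
    if attempted_fetch_samples == 0:
        return 0.0
    return total_latency_cycles / attempted_fetch_samples


# Invented counts for illustration only.
miss_ratio = ic_miss_ratio(250, 1000)
avg_latency = average_fetch_latency(8000, 1000)
```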

  1. Select IBS All ops from the drop-down list of views.

The "IBS All ops" view displays. This view is an overall summary of the collected IBS op samples. It shows the total number of IBS op samples, the number of op samples taken for branch operations, and the number of samples for ops that performed a memory load and/or store operation. Tag-to-retire time is the number of cycles from when an op was selected (tagged) for sampling to when the op retired. Completion-to-retire time is the number of cycles from when an op completed (finished execution) to when the op retired. Total and average tag-to-retire and completion-to-retire times are shown in the next view.

  1. Select the "IBS MEM data cache" view from the drop-down list of views.

The "IBS MEM data cache" view is displayed. This view shows information related to data cache (DC) behavior. The number of sampled IBS load/store operations is shown along with a breakdown of the number of loads and the number of stores. The number of IBS samples where the load/store operation missed in the data cache is shown. The DC miss rate (DC misses divided by the total number of op samples) and DC miss ratio (DC misses divided by the number of load/store operations) are also displayed.
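The difference between the DC miss rate and the DC miss ratio is only the denominator, which a short sketch makes explicit. The function names and the counts are hypothetical and chosen for illustration.

```python
def dc_miss_rate(dc_miss_samples, all_op_samples):
    """DC misses divided by the total number of IBS op samples."""
    return dc_miss_samples / all_op_samples if all_op_samples else 0.0


def dc_miss_ratio(dc_miss_samples, load_store_samples):
    """DC misses divided by the number of sampled load/store operations."""
    return dc_miss_samples / load_store_samples if load_store_samples else 0.0


# Invented counts: 4000 op samples, 2000 of them load/store, 500 DC misses.
rate = dc_miss_rate(500, 4000)
ratio = dc_miss_ratio(500, 2000)
```

Because load/store samples are a subset of all op samples, the miss ratio is never smaller than the miss rate for the same data.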

  1. Select the "IBS MEM data TLB" view from the drop-down list of views.

The "IBS MEM data TLB" view is displayed. This view shows information related to data translation lookaside buffer (DTLB) behavior. The number of sampled IBS load/store operations is shown along with a breakdown of the number of load operations and the number of store operations. AMD processors use a two-level DTLB. Address translation may hit in the L1 DTLB, miss in the L1 DTLB and hit in the L2 DTLB ("L1M L2H"), or miss in both levels of the DTLB ("L1M L2M"). The performance penalty for a miss in both levels is relatively high. Nearly half of the sampled load/store operations incurred a miss at both levels of the DTLB. This is the performance culprit in the sample program, classic, which performs a "textbook" implementation of matrix multiplication.

Drilling Down Into IBS Data

To find the source of the performance issue in the example program, we need to drill down into the classic module.

  1. Double-click on the module name classic in the System Data table.

A list of functions within classic is displayed with the IBS information for each function. CodeAnalyst retains the "IBS MEM data TLB" view. The function "multiply_matrices" has the most load/store activity and incurs the bulk of the DTLB misses.

  1. Double-click on the function multiply_matrices in the Module Data table, i.e., the table of functions within classic.

The source code for the function "multiply_matrices" is displayed with the IBS information for each source line in the function. Most load/store activity occurs at line 65, which is the statement within the nested loops. This is the statement that reads an element from each of the two operand matrices and computes the running sum of the product of the elements. The DTLB misses are caused by the large strides taken through matrix_b. With nearly every iteration the program touches a different page, thereby causing a miss in the DTLB.
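The effect of the stride can be illustrated with a small address simulation. This is a sketch under stated assumptions, not the code of the sample program: the matrix dimension (1000), the element size (8 bytes), the 4 KiB page size, row-major storage, and page-aligned base addresses are all assumed, since the text does not give them.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages
ELEM_SIZE = 8     # assumed 8-byte (double-precision) elements
N = 1000          # assumed matrix dimension; not stated in the text


def page_changes(addresses):
    """Count how often consecutive accesses land on different pages."""
    pages = [addr // PAGE_SIZE for addr in addresses]
    return sum(1 for prev, cur in zip(pages, pages[1:]) if prev != cur)


# Innermost loop of a textbook multiply, for fixed i = j = 0:
#     total += matrix_a[i][k] * matrix_b[k][j]   for k in range(N)
# Byte offsets are relative to each matrix's (assumed page-aligned) base.
a_offsets = [(0 * N + k) * ELEM_SIZE for k in range(N)]  # a[0][k]: unit stride
b_offsets = [(k * N + 0) * ELEM_SIZE for k in range(N)]  # b[k][0]: one row per step

a_changes = page_changes(a_offsets)
b_changes = page_changes(b_offsets)
```

Under these assumptions the row-wise walk through the first operand crosses a page boundary once in 1000 iterations, while the column-wise walk through matrix_b moves 8000 bytes per iteration and therefore lands on a different page every time.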

  1. Click the expand box "+" to the left of line 65.

The disassembled instructions for source line 65 are displayed along with the IBS data for each instruction. IBS load/store information is attributed to each instruction that performs either a memory read or write operation. Sources of performance-robbing DTLB misses are precisely identified.

  1. Select IBS BR branch from the drop-down list of views.

The "IBS BR branch" view displays. This view shows the number of IBS branch op samples and indicates if the branch operation mispredicted and/or was taken. Note that only the conditional jump instruction at the end of the innermost loop is marked as a branch instruction. This example further illustrates the precision offered by Instruction-Based Sampling.

 
