### DoD-Focused Benchmarks/Metrics, Toolchain and Debug Recommendations

for Workshop on Suite of Embedded Applications and Kernels

Jeffrey Smith, PhD

1 June 2014



### Agenda

### Involvement with PNNL with applications discussion



[Figure: career timeline from 1976 to today. Legible labels include "Related CR&D e.g. RASSP, SHARE, ANTS, PCA, SDR"; the remaining labels are truncated in the source.]

- Points of intersection with SEAK
  - Evaluating, benchmarking and classifying embedded systems
  - Compiler, synthesizers and optimization tools
  - Modeling of application and system behavior
- Discuss lessons learned and recommendations w.r.t. the above intersections

BAE SYSTEMS

- Wide Area Persistent Surveillance Poses Extreme Computation Challenges
  - Processing 1.8 billion pixels, at 12 fps, generates on the order of 600 Gb/s (around 6 petabytes of video data per day)
- Benchmark creation through progressively increasing processing pipeline depth and mission capacities
  - Low SWaP processing that scales with Breadth (N objects), Complexity per object (M parameters), Depth (H hierarchy levels), and Data Rates (R) of higher level exploitation missions
  - Characterize load (N, M, H, R) for processing pipeline stages, e.g. image pre-processing, segmentation, classification, tracking
  - SWaP constraints of tactical exploitation systems (e.g. UAVs, DCGS-A Intelligence Fusion Servers) for a range of mission capabilities, e.g., generating dots from pixels, tracks from dots, activity patterns from tracks, and threat analysis from patterns
- Other Challenge Domains Are Similarly Characterized
  - Software Defined Radios
  - Tactical SIGINT Payload
  - Intelligence, Surveillance & Recon
  - Distributed Networked, Adaptive Electronic Surveillance

- Monitor Vessel Behavior
- RT Defense Against Networked Mobile and Spectrum Dynamic Emitters
- Continuous, Predictive Course of Action Analysis and Execution Monitoring
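The wide-area surveillance data-rate figures quoted above can be cross-checked with a few lines of arithmetic. This is a minimal sketch: the pixel count, frame rate and 600 Gb/s rate come from the slide, while the bits-per-pixel value is only implied by those numbers and is not stated in the source. Petabytes are taken as decimal (10^15 bytes), which is an assumption.

```python
# Figures taken from the slide
PIXELS_PER_FRAME = 1.8e9        # 1.8 billion pixels
FRAMES_PER_SECOND = 12          # 12 fps
STREAM_BITS_PER_S = 600e9       # "on the order of 600 Gb/s"
SECONDS_PER_DAY = 86_400

# Pixel throughput the pipeline must sustain
pixel_rate = PIXELS_PER_FRAME * FRAMES_PER_SECOND

# Bits per pixel implied by the quoted rate (derived, not stated on the slide)
implied_bits_per_pixel = STREAM_BITS_PER_S / pixel_rate

# Daily volume in decimal petabytes (assumption: 1 PB = 1e15 bytes)
petabytes_per_day = STREAM_BITS_PER_S / 8 * SECONDS_PER_DAY / 1e15

print(f"pixel rate:        {pixel_rate:.3g} pixels/s")
print(f"implied bits/pixel: {implied_bits_per_pixel:.1f}")
print(f"daily volume:      {petabytes_per_day:.2f} PB/day")
```

The result lands close to the "around 6 petabytes per day" quoted on the slide, so the two headline figures are mutually consistent.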

#### ARGUS-IS sensor



Persistics data processing pipeline



### Need New Paradigm to Harness Effective Code Parallelism


- Amdahl's Law predicts a limit on parallel processing speedup, which may yield low efficiency in HPC; concurrent execution of independent parallel applications may need more energy
- Poor parallel application software architectures degrade performance well below Amdahl limits
  - Unbalanced processing load distributions
  - Productivity limits redesign in conventional, hand coded, developments
  - Balance interprocessor communication loading and data dependent wait times
  - Kernels depend on dataset scaling
- Progress has reversed from the earlier ability to automatically and verifiably generate parallel and concurrent execution of military applications from graphical specifications



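$$S = \frac{1}{(1-P) + \frac{P}{N}}$$

where $P$ is the parallelizable fraction of the code and $N$ is the number of processors.

Upper bound on parallel compute performance depends on exploiting ratio of serial to parallel code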
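To make the bound concrete, the following sketch evaluates Amdahl's Law in its usual form, S = 1 / ((1 - P) + P / N), together with the resulting parallel efficiency S / N. The parallel fraction of 0.95 and the core counts are illustrative assumptions, not values from the talk.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup for parallel fraction p (0..1) running on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1.0 / ((1.0 - p) + p / n)


def efficiency(p: float, n: int) -> float:
    """Fraction of the n processors' capacity that is usefully employed."""
    return amdahl_speedup(p, n) / n


P = 0.95  # illustrative assumption: 95% of the work parallelizes
for n in (1, 16, 256, 4096):
    print(f"N={n:5d}  speedup={amdahl_speedup(P, n):6.2f}  "
          f"efficiency={efficiency(P, n):6.1%}")

# As N grows, speedup approaches 1 / (1 - P), here 20x
print(f"limit = {1.0 / (1.0 - P):.1f}x")
```

Efficiency falls quickly as cores are added, which is the low-efficiency concern raised in the bullets above.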

Metric – common comparison between best hand-coded and tool generated benchmarks

#### Methods to Conceptualize/Apply High Performance Data Flow Applications



\* Conceptualization outliers/combos e.g. ZPL (accelerated arrays), BlueSpec (accelerated functions and objects), Matlab (variant of data flow) "yellow grouping", ...

## **Toolchain based Recommendations**


#### Current

- Model generated specifications and partial code
- Peta-op computers with thousands of threads with manual code generation, verification, configuration and load balancing
- Tools proven for parallel computing but no longer supported to extend to multi-core or cloud environment
- Highly parallel computers, e.g. Tilera, with manually constructed parallelization methods e.g. MPI
- Emergence of automatic parallelization frameworks distributing common operations

#### Recommendations

- Functionality captured in graphical spec to support practical design optimization; help with how to evaluate new applications/algorithms differently
  - Rapid partitioning, configuration and evaluation of model driven code generation
  - Design iteration to near Amdahl performance limitations
  - Support modern RASSP-like program (metrics, support, benchmarks)
- HP inter- and intra-kernel interface/comm mechanism spanning vendor approaches
- General use, improved {pre}compiler technology (e.g. directed profile-guided optimization, adaptive compilers, etc.)

### Processing Flow Hasn't Changed for EW, SIGINT & Comms

- High-rate sensors outputting multi-gigabyte data streams
  - Improvements in sensors are constantly increasing the volume of data
- Flexibility provided by digital processing is pushing A/D closer to the sensor
- Very high performance, extremely low-latency front-end pre-processing is performed to process the raw data and extract the signal/information of interest
  - Processing typically requires each data sample to be processed
- Once the signal or information of interest is extracted, a microprocessor performs further lower-rate processing
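The two-stage flow described above can be illustrated with a toy sketch: a front end that touches every sample and passes on only samples of interest, followed by a back end that works at the much lower detection rate. The function names, the magnitude-threshold detector and the sample values are hypothetical stand-ins, not the processing used in any fielded system.

```python
from typing import Iterable, Iterator, Optional, Tuple

Detection = Tuple[int, float]  # (sample index, sample value)


def front_end(samples: Iterable[float], threshold: float) -> Iterator[Detection]:
    """Touch every sample; forward only those whose magnitude exceeds threshold."""
    for index, value in enumerate(samples):
        if abs(value) > threshold:
            yield index, value


def back_end(detections: Iterable[Detection]) -> dict:
    """Lower-rate processing that sees only the extracted detections."""
    hits = list(detections)
    peak: Optional[Detection] = max(hits, key=lambda h: abs(h[1])) if hits else None
    return {"count": len(hits), "peak": peak}


samples = [0.1, -0.2, 3.0, 0.05, -4.5, 0.3]  # made-up raw data
summary = back_end(front_end(samples, threshold=1.0))
print(summary)
```

The front end's cost scales with the raw sample rate, while the back end's cost scales only with the number of detections, which is why the two stages are typically mapped to different hardware.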



### Current and Projected Digital Processing Architecture Challenges



## What are SEAK challenges

- Non-reliable computations
  - Map to near-threshold operating regions
  - Fabrication issues
- Small reliable LWCs
- Need faster solution than end-end simulations at transistor/gate/module level
- Need to correlate high-level state with gate-level state

- Net Count = 100,000
- Depth = 1,024
- Comprehensive coverage:
  - Single faults  $\rightarrow$  100,000 \* 1,024 = ~100M simulations
- Each simulation runs for ~3 \* 1,024 = 3K cycles
- Each cycle is ~100,000 LUT ops
- Each LUT op is ~10 processor instructions
- $\rightarrow$  Need 100M \* 3K \* 100K \* 10 = $3 \times 10^{17}$ instructions
- One computer has ~100,000 MIPS
- $\rightarrow$  ~35 days of running time
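The estimate above can be reproduced directly from the stated inputs. This sketch uses the unrounded products, so it comes out slightly higher than the slide's rounded figure of about 35 days; the constant names are introduced here for illustration.

```python
# Inputs as stated on the slide
NET_COUNT = 100_000            # nets where a single fault can be injected
DEPTH = 1_024                  # cycles at which a fault can be injected
LUT_OPS_PER_CYCLE = 100_000    # LUT evaluations per simulated cycle
INSTR_PER_LUT_OP = 10          # processor instructions per LUT evaluation
MIPS = 100_000                 # one computer, millions of instructions/s

simulations = NET_COUNT * DEPTH                  # one run per (net, cycle) pair
cycles_per_simulation = 3 * DEPTH                # each run lasts ~3x the depth
instructions = (simulations * cycles_per_simulation
                * LUT_OPS_PER_CYCLE * INSTR_PER_LUT_OP)

seconds = instructions / (MIPS * 1_000_000)
days = seconds / 86_400

print(f"simulations:  {simulations:,}")
print(f"instructions: {instructions:.3e}")
print(f"running time: {days:.1f} days")
```

Using the rounded values on the slide (100M simulations, 3K cycles) gives 3 x 10^17 instructions and roughly 35 days; the exact products give a little over 36 days. Either way, exhaustive single-fault coverage at gate level takes over a month on one machine.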


- Inject faults anywhere at any level
  - Need high-level simulator capable of injecting and simulating effects of (multiple) probabilistic faults at low level
    - Simulate low level effects of faults given models of low-level gates/transistors where a fault could be injected
  - Only need to run a gate level simulation cycle when we want to introduce a fault.
    - All the other cycles in a simulation can be run at RTL (or BSV)
- Correlate high-level state with gate-level state to realistically debug
  - Run bsim instead of RTL by interrupting the BlueSpec simulator to compute and insert a fault.
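The idea of running a detailed cycle only when a fault is introduced can be sketched with a toy model. Everything here is a hypothetical stand-in: `high_level_step` plays the role of the fast RTL/BSV model, `gate_level_step` plays the role of a detailed cycle that flips one bit, and the 8-bit state update is an arbitrary choice. It does not use or represent the Bluespec simulator's actual interface.

```python
from typing import Optional, Tuple

WIDTH = 8
MASK = (1 << WIDTH) - 1


def high_level_step(state: int) -> int:
    """Fast behavioural model of one clock cycle (stand-in for RTL/BSV)."""
    return (state * 5 + 1) & MASK


def gate_level_step(state: int, faulty_bit: int) -> int:
    """Stand-in for a detailed cycle: normal update, then a single-bit flip."""
    return high_level_step(state) ^ (1 << faulty_bit)


def run(cycles: int, fault: Optional[Tuple[int, int]] = None,
        seed: int = 0) -> Tuple[int, int]:
    """Simulate; fault is (cycle, bit). Returns (final state, detailed cycles)."""
    state, detailed_cycles = seed, 0
    for cycle in range(cycles):
        if fault is not None and cycle == fault[0]:
            state = gate_level_step(state, fault[1])  # slow path, used once
            detailed_cycles += 1
        else:
            state = high_level_step(state)            # fast path
    return state, detailed_cycles


golden, _ = run(100)
faulty, detailed = run(100, fault=(40, 3))
print(f"golden={golden:#04x} faulty={faulty:#04x} detailed cycles={detailed}")
```

Only one of the 100 cycles takes the slow path, yet the final state can still be compared against the fault-free run to see whether the fault propagated.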

- Reduce cost and time from concept to development and maintenance
- Main technological currency



| Goal                             | Current state                                                                                                                                       | Need                                                                             |
|----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|
| Improved cost                    | High-level languages & automatic parallelization aid **targeting** parallel platforms, but **do not improve debugging** of parallel programs.      | Need to speed parallel system debugging & maintenance, not just development      |
| Scaling performance with # cores | Programmers today avoid parallelism opportunities to simplify debugging. Unsurprising with debugging for M instruction streams ≥ M times difficulty.* | Need to unleash parallel opportunities without increased cost of bugs.           |
| Security                         | Most vulnerabilities due to bugs. Isolating & repairing bugs is a central element of security. Parallel bugs most difficult & rapidly expanding.    | Need improvement just to keep up; leap has potential for large impact.           |
| Security                         | Online defense techniques limited by need to be low impact to applications & systems                                                                | Need full instrumentation & control, invisible to applications for security leap |

As the number of cores explodes & programming tools mature, debugging tractability becomes the bottleneck to realizing gains from parallel platforms; low-impact instrumentation & control is a potential key enabler for security.

\* Openshaw and Turton, *High Performance Computing and the Art of Parallel Programming.*

### Current Debugging Approaches Provide Little Support - Especially for Parallel and Optimized Codes

| Debugging Parallel Code                                    | Example                                                | Debugging Optimized Code                             | Example                          |
|------------------------------------------------------------|--------------------------------------------------------|------------------------------------------------------|----------------------------------|
| Require serialization prior to debugging                   | gdb                                                    | Turn off optimizations when debugging                | gdb, Microsoft, Borland, others  |
| Execute single thread until it blocks, then switch threads | MS Research's CHESS                                    | Debugger "hides" transforms and provides *transparency* | Zellweger PhD dissertation, 1983 |
| Replace thread model with deterministic concurrency model  | George Mason U's MM concurrency test and debug library | Visualize compiler transforms performed              | Convex Computer, 1992            |
| Focus on data sharing faults                               | Intel Go-Parallel                                      | Debug optimized pseudo-code not original source      | IBM mods to gdb                  |

#### Programmers have little to work with

- Very old results
- Still in the lab
- Piecemeal solutions

Poor debugging support will continue to be a major drain on programmer productivity

### Why DARPA? Business Unlikely to Solve Due to Business Economics; Large DoD Payoff

#### Debugging economics: business less sensitive to high debug costs

Debug costs per revenue dollar low for mass market commercial software

VS.

DoD measures acquisition cost/product so debug costs amortize at high rate

#### Also

DoD systems larger, more complex

DoD mission critical systems have to be more fully debugged

 Most commercial software can afford to have users be beta testers

#### Open source tools: little to no investment by business

Former development tools companies either subsumed or gone: Borland, Corel, Symantec, Rational

No venture investors will touch a proposed new tools venture: little or no expected return on investment.

Developers expect to get tools for free (e.g. Eclipse)