# Embedded Systems Design: A Unified Hardware/Software Introduction

#### Chapter 7 Digital Camera Example

#### Introduction

- Putting it all together
  - General-purpose processor
  - Single-purpose processor
    - Custom
    - Standard
  - Memory
  - Interfacing
- Knowledge applied to designing a simple digital camera
  - General-purpose vs. single-purpose processors
  - Partitioning of functionality among different processor types

Embedded Systems Design: A Unified Hardware/Software Introduction, (c) 2000 Vahid/Givargis


#### Outline

- Introduction to a simple digital camera
- Designer's perspective
- Requirements specification
- Design
  - Four implementations


#### Introduction to a simple digital camera

- Captures images
- Stores images in digital format
  - No film
  - Multiple images stored in camera
    - Number depends on amount of memory and bits used per image
- Downloads images to PC
- Only recently possible
  - Systems-on-a-chip
    - Multiple processors and memories on one IC
  - High-capacity flash memory
- Very simple description used for example
  - Many more features with real digital camera
    - Variable size images, image deletion, digital stretching, zooming in and out, etc.


#### Designer's perspective

- Two key tasks
  - Processing images and storing in memory
    - When shutter pressed:
      - Image captured
      - Converted to digital form by charge-coupled device (CCD)
      - Compressed and archived in internal memory
  - Uploading images to PC
    - Digital camera attached to PC
    - Special software commands camera to transmit archived images serially


#### Zero-bias error

- Manufacturing errors cause cells to measure slightly above or below actual light intensity
- Error typically same across columns, but different across rows
- Some of the leftmost columns are blocked by black paint to detect zero-bias error
  - A reading other than 0 in blocked cells is zero-bias error
  - Each row is corrected by subtracting the average error found in blocked cells for that row
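The per-row correction described above can be sketched in C. This is only a minimal sketch under stated assumptions: the row layout (2 covered cells stored first, followed by 64 visible cells), the use of `int` for readings, and the names `NUM_COVERED`, `NUM_VISIBLE`, and `zero_bias_correct_row` are all hypothetical and not taken from the slides.

```c
#include <stddef.h>

#define NUM_COVERED 2   /* assumed number of blacked-out cells per row */
#define NUM_VISIBLE 64  /* assumed number of light-sensitive cells per row */

/* Corrects one CCD row.
 * raw[0 .. NUM_COVERED-1] hold readings from covered cells (ideally 0);
 * the remaining NUM_VISIBLE entries hold readings from visible cells.
 * Writes the corrected visible cells to out[] and returns the bias removed. */
int zero_bias_correct_row(const int raw[NUM_COVERED + NUM_VISIBLE],
                          int out[NUM_VISIBLE])
{
    int sum = 0;
    for (size_t c = 0; c < NUM_COVERED; c++)
        sum += raw[c];
    int bias = sum / NUM_COVERED;            /* average error for this row */

    for (size_t c = 0; c < NUM_VISIBLE; c++)
        out[c] = raw[NUM_COVERED + c] - bias; /* subtract it from every visible cell */
    return bias;
}
```

Calling the function once per row gives each row its own bias, which matches the observation that the error differs across rows.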




#### Charge-coupled device (CCD)

- Special sensor that captures an image
- Light-sensitive silicon solid-state device composed of many cells

When exposed to light, each cell becomes electrically charged. This charge can then be converted to an 8-bit value, where 0 represents no exposure and 255 represents very intense exposure of that cell to light.

Some of the columns are covered with a black strip of paint. The light intensity of these pixels is used for zero-bias adjustments of all the cells.



The electronic circuitry, when commanded, discharges the cells, activates the electromechanical shutter, and then reads the 8-bit charge value of each cell. These values can be clocked out of the CCD by external logic through a standard parallel bus interface.


#### Compression

- Store more images
- Transmit image to PC in less time
- JPEG (Joint Photographic Experts Group)
  - Popular standard format for representing digital images in a compressed form
  - Provides for a number of different modes of operation
  - Mode used in this chapter provides high compression ratios using DCT (discrete cosine transform)
  - Image data divided into blocks of 8 x 8 pixels
  - 3 steps performed on each block
    - DCT
    - Quantization
    - Huffman encoding


#### DCT step

- Transforms original 8 x 8 block into a cosine-frequency domain
  - Upper-left corner values represent more of the essence of the image
  - Lower-right corner values represent finer details
    - Can reduce precision of these values and retain reasonable image quality
- FDCT (Forward DCT) formula
  - C(h) = if (h == 0) then 1/sqrt(2) else 1.0
    - Auxiliary function used in main function F(u,v)
  - $F(u,v) = \frac{1}{4} \times C(u) \times C(v) \times \sum_{x=0}^{7} \sum_{y=0}^{7} D_{xy} \times \cos(\pi(2x+1)u/16) \times \cos(\pi(2y+1)v/16)$
    - Gives encoded pixel at row u, column v
    - $D_{xy}$ is original pixel value at row x, column y
- IDCT (Inverse DCT)
  - Reverses process to obtain original block (not needed for this design)
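The FDCT formula above can be written directly as four nested loops. The following is a minimal floating-point sketch, not the chapter's own code: the function names and the small cosine lookup table (used here so that no math library is needed) are assumptions made for illustration.

```c
/* cos(pi * n / 16) for n = 0..8; other angles follow by symmetry. */
static const double COS_Q[9] = {
    1.0, 0.98078528, 0.92387953, 0.83146961, 0.70710678,
    0.55557023, 0.38268343, 0.19509032, 0.0
};

/* Returns cos(pi * n / 16) for any n >= 0 using the table above. */
static double cos_pi_16(int n)
{
    n %= 32;                                    /* cos(pi*n/16) has period 32 in n */
    if (n > 16) n = 32 - n;                     /* cos(2*pi - a) = cos(a) */
    return n <= 8 ? COS_Q[n] : -COS_Q[16 - n];  /* cos(pi - a) = -cos(a) */
}

/* Auxiliary function C(h): 1/sqrt(2) for h == 0, otherwise 1.0. */
static double dct_c(int h)
{
    return h == 0 ? 0.70710678 : 1.0;
}

/* Forward DCT of one 8 x 8 block: in[x][y] is D_xy, out[u][v] is F(u,v). */
void fdct_8x8(double in[8][8], double out[8][8])
{
    for (int u = 0; u < 8; u++) {
        for (int v = 0; v < 8; v++) {
            double sum = 0.0;
            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    sum += in[x][y]
                         * cos_pi_16((2 * x + 1) * u)
                         * cos_pi_16((2 * y + 1) * v);
            out[u][v] = 0.25 * dct_c(u) * dct_c(v) * sum;
        }
    }
}
```

For a block in which every pixel has the same value, only the upper-left output F(0,0) is non-zero, which illustrates why the upper-left corner carries the "essence" of the block.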


#### Huffman encoding step

- Serialize 8 x 8 block of pixels
  - Values are converted into single list using zigzag pattern



- Perform Huffman encoding
  - More frequently occurring pixels assigned short binary code
  - Longer binary codes left for less frequently occurring pixels
- Each pixel in serial list converted to Huffman encoded values
  - Much shorter list, thus compression
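The zigzag serialization in the first bullet can be sketched as a walk over the anti-diagonals of the block. The traversal direction used here is the one JPEG conventionally uses; since the slide's figure is not reproduced in this text, that choice and the name `zigzag_serialize` are assumptions.

```c
/* Serializes an 8 x 8 block into a 64-entry list in zigzag order.
 * Cells on the same anti-diagonal share the same row + column sum s;
 * the walking direction alternates from one diagonal to the next. */
void zigzag_serialize(short in[8][8], short out[64])
{
    int idx = 0;
    for (int s = 0; s <= 14; s++) {
        int lo = s < 8 ? 0 : s - 7;   /* smallest valid row on this diagonal */
        int hi = s < 8 ? s : 7;       /* largest valid row on this diagonal */
        if (s % 2 == 0) {
            /* even diagonals: move up and to the right (row decreasing) */
            for (int r = hi; r >= lo; r--)
                out[idx++] = in[r][s - r];
        } else {
            /* odd diagonals: move down and to the left (row increasing) */
            for (int r = lo; r <= hi; r++)
                out[idx++] = in[r][s - r];
        }
    }
}
```

Because the upper-left values come first and the (often small) lower-right values come last, similar values tend to end up next to each other in the list.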


#### Quantization step

- Achieve high compression ratio by reducing image quality
  - Reduce bit precision of encoded data
    - Fewer bits needed for encoding
    - One way is to divide all values by a power of 2
      - Simple right shifts can do this
  - Dequantization would reverse process for decompression
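A shift-based quantizer for one block might look like the sketch below. The shift amount `QUANT_SHIFT` and the function names are hypothetical. Because right-shifting a negative value is implementation-defined in C, this sketch shifts the magnitude and restores the sign, which is one possible choice rather than the chapter's stated method.

```c
#define QUANT_SHIFT 3   /* assumed: divide by 2^3 = 8 */

/* Divides v by 2^n using a right shift on the magnitude (rounds toward zero). */
static short shift_toward_zero(short v, int n)
{
    return (short)(v >= 0 ? v >> n : -((-v) >> n));
}

/* Quantization: reduce the precision of every value in the block. */
void quantize_block(short block[8][8])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            block[r][c] = shift_toward_zero(block[r][c], QUANT_SHIFT);
}

/* Dequantization: scale back up; the discarded low bits are not recovered. */
void dequantize_block(short block[8][8])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            block[r][c] = (short)(block[r][c] * (1 << QUANT_SHIFT));
}
```

A value of 100 becomes 12 after quantization and 96 after dequantization; the difference of 4 is the quality lost in exchange for fewer bits.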




#### Huffman encoding example

- Pixel frequencies on left
  - Pixel value -1 occurs 15 times
  - Pixel value 14 occurs 1 time
- Build Huffman tree from bottom up
  - Create one leaf node for each pixel value and assign frequency as node's value
  - Create an internal node by joining any two nodes whose sum is a minimal value
    - This sum is the internal node's value
  - Repeat until complete binary tree
- Traverse tree from root to leaf to obtain binary code for leaf's pixel
  - Append 0 for left traversal, 1 for right traversal
- Huffman encoding is reversible
  - No code is a prefix of another code
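The bottom-up tree construction can be sketched with a simple array of nodes. To stay short, this sketch computes only the length of each symbol's code (the number of edges from leaf to root); the actual bit patterns would follow from the left/right choices during traversal. `MAX_SYMBOLS` and `huffman_code_lengths` are hypothetical names, and the repeated linear search for the two lightest roots is chosen for clarity, not speed.

```c
#define MAX_SYMBOLS 64   /* assumed upper bound on distinct pixel values */

/* Computes Huffman code lengths for n symbols (1 <= n <= MAX_SYMBOLS). */
void huffman_code_lengths(const unsigned freq[], int n, int length[])
{
    unsigned weight[2 * MAX_SYMBOLS];  /* leaves first, then internal nodes */
    int parent[2 * MAX_SYMBOLS];       /* -1 while a node is still a root */
    int count = n;

    for (int i = 0; i < n; i++) {      /* one leaf per symbol */
        weight[i] = freq[i];
        parent[i] = -1;
    }

    /* Each merge joins the two lightest roots under a new internal node. */
    for (int merge = 0; merge < n - 1; merge++) {
        int a = -1, b = -1;            /* a: lightest root, b: second lightest */
        for (int i = 0; i < count; i++) {
            if (parent[i] != -1) continue;
            if (a == -1 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b == -1 || weight[i] < weight[b]) { b = i; }
        }
        weight[count] = weight[a] + weight[b];  /* sum is the internal node's value */
        parent[count] = -1;
        parent[a] = parent[b] = count;
        count++;
    }

    /* Code length of a symbol = depth of its leaf in the finished tree. */
    for (int i = 0; i < n; i++) {
        length[i] = 0;
        for (int p = parent[i]; p != -1; p = parent[p])
            length[i]++;
    }
}
```

Symbols with high frequencies end up near the root and therefore receive short codes, while rare symbols sit deep in the tree and receive long ones.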

Huffman tree (figure)

#### Archive step

- Record starting address and image size
  - Can use linked list
- One possible way to archive images
  - If max number of images archived is N:
    - Set aside memory for N addresses and N image-size variables
    - Keep a counter for location of next available address
    - Initialize addresses and image-size variables to 0
    - Set global memory address to N x 4
      - Assuming addresses, image-size variables occupy N x 4 bytes
    - First image archived starting at address N x 4
    - Global memory address updated to N x 4 + (compressed image size)
- Memory requirement based on N, image size, and average compression ratio
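The bookkeeping described above can be sketched as follows. Everything concrete here is an assumption for illustration: `MAX_IMAGES` plays the role of N, the memory is a plain byte array, addresses and sizes are taken to be 2 bytes each (so the table occupies N x 4 bytes), and for simplicity the table is kept in separate C arrays while the first N x 4 bytes of memory are merely reserved for it.

```c
#include <string.h>

#define MAX_IMAGES  4                  /* N: assumed max number of archived images */
#define MEM_SIZE    1024               /* assumed size of archive memory in bytes */
#define TABLE_BYTES (MAX_IMAGES * 4)   /* space reserved for addresses and sizes */

static unsigned char  memory[MEM_SIZE];        /* stands in for the flash memory */
static unsigned short start_addr[MAX_IMAGES];  /* starting address of each image */
static unsigned short image_size[MAX_IMAGES];  /* compressed size of each image */
static int            next_slot;               /* counter for next available entry */
static unsigned short global_addr;             /* next free byte in memory */

/* Initializes addresses and sizes to 0 and points past the reserved table. */
void archive_init(void)
{
    memset(start_addr, 0, sizeof start_addr);
    memset(image_size, 0, sizeof image_size);
    next_slot = 0;
    global_addr = TABLE_BYTES;
}

/* Archives one compressed image; returns its slot, or -1 if it does not fit. */
int archive_store(const unsigned char *data, unsigned short size)
{
    if (next_slot >= MAX_IMAGES || global_addr + size > MEM_SIZE)
        return -1;
    memcpy(&memory[global_addr], data, size);
    start_addr[next_slot] = global_addr;
    image_size[next_slot] = size;
    global_addr = (unsigned short)(global_addr + size);
    return next_slot++;
}
```

With N = 4, the first image starts at address 16, and a 100-byte image moves the global memory address to 116.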


#### Requirements Specification

- System's requirements: what system should do
  - Nonfunctional requirements
    - Constraints on design metrics (e.g., "should use 0.001 watt or less")
  - Functional requirements
    - System's behavior (e.g., "output X should be input Y times 2")
  - Initial specification may be very general and come from marketing dept.
    - E.g., short document detailing market need for a low-end digital camera that:
      - captures and stores at least 50 low-res images and uploads to PC,
      - costs around \$100 with single medium-size IC costing less than \$25,
      - has battery life as long as possible,
      - has expected sales volume of 200,000 if market entry < 6 months,
        - 100,000 if between 6 and 12 months,
        - insignificant sales beyond 12 months


#### Uploading to PC

- When connected to PC and upload command received
  - Read images from memory
  - Transmit serially using UART
  - While transmitting
    - Reset pointers, image-size variables and global memory pointer accordingly


#### Nonfunctional requirements

- Design metrics of importance based on initial specification
  - Performance: time required to process image
  - Power: measure of avg. electrical energy consumed while processing
  - Energy: battery lifetime (power x time)
- Constrained metrics
  - Values *must* be below (sometimes above) certain threshold
- Optimization metrics
  - Improved as much as possible to improve product
- Metric can be both constrained and optimization


#### CNTRL (controller) module

- Heart of the system
- CntrlInitialize for consistency with other modules only
- CntrlCaptureImage uses CCDPP module to input image and place in buffer
- CntrlCompressImage breaks the 64 x 64 buffer into 8 x 8 blocks and performs FDCT on each block using the CODEC module
  - Also performs quantization on each block
- CntrlSendImage transmits encoded image serially using UART module

```
#define SZ_COL         64
#define NUM_ROW_BLOCKS (SZ_ROW / 8)
#define NUM_COL_BLOCKS (SZ_COL / 8)
static short buffer[SZ_ROW][SZ_COL], i, j, k, l, temp;

/* fragment: filling the buffer from the CCDPP module */
for(j=0; j<SZ_COL; j++)
    buffer[i][j] = CcdppPopPixel();

/* fragment: pushing each 8 x 8 block to the CODEC module */
for(i=0; i<NUM_ROW_BLOCKS; i++)
    for(j=0; j<NUM_COL_BLOCKS; j++) {
        for(k=0; k<8; k++)
            for(l=0; l<8; l++)
                CodecPushPixel(...
```

#### Putting it all together

- Main initializes all modules, then uses CNTRL module to capture, compress, and transmit one image
- This system-level model can be used for extensive experimentation
  - Bugs much easier to correct here rather than in later models

```
int main(int argc, char *argv[]) {
    char *uartOutputFileName = argc > 1 ? argv[1] : "uart_out.txt";
    /* initialize the modules */
    UartInitialize(uartOutputFileName);
    CcdppInitialize();
    CodecInitialize();
    CntrlInitialize();
    /* capture, compress, and transmit one image */
    CntrlCaptureImage();
    CntrlCompressImage();
    CntrlSendImage();
    return 0;
}
```

#### Design

- Determine system's architecture
  - Processors
    - Any combination of single-purpose (custom or standard) or general-purpose processors
  - Memories, buses
- Map functionality to that architecture
  - Multiple functions on one processor
  - One function on one or more processors
- Implementation
  - A particular architecture and mapping
  - Solution space is set of all implementations
- Starting point
  - Low-end general-purpose processor connected to flash memory
    - All functionality mapped to software running on processor
    - Usually satisfies power, size, and time-to-market constraints
    - If timing constraint not satisfied then later implementations could:
      - use single-purpose processors for time-critical functions
      - rewrite functional specification

#### Implementation 1: Microcontroller alone

- Low-end processor could be Intel 8051 microcontroller
- Total IC cost including NRE about \$5
- Well below 200 mW power
- Time-to-market about 3 months
- However, one image per second not possible
  - 12 MHz, 12 cycles per instruction
    - Executes one million instructions per second
  - CcdppCapture has nested loops resulting in 4096 (64 x 64) iterations
    - ~100 assembly instructions each iteration
    - 409,600 (4096 x 100) instructions per image
    - Nearly half of budget for reading image alone
  - Would be over budget after adding compute-intensive DCT and Huffman encoding

#### Implementation 2: Microcontroller and CCDPP

- CCDPP function implemented on custom single-purpose processor
  - Improves performance: fewer microcontroller cycles
  - Increases NRE cost and time-to-market
  - Easy to implement
    - Simple datapath
    - Few states in controller
- Simple UART easy to implement as single-purpose processor also
- EEPROM for program memory and RAM for data memory added as well


#### UART

- UART in idle mode until invoked
  - UART invoked when 8051 executes store instruction with UART's enable register as target address
    - Memory-mapped communication between 8051 and all single-purpose processors
    - Lower 8 bits of memory address for RAM
    - Upper 8 bits of memory address for memory-mapped I/O devices
- Start state transmits 0 indicating start of byte transmission, then transitions to Data state
- Data state sends 8 bits serially, then transitions to Stop state
- Stop state transmits 1 indicating transmission done, then transitions back to idle mode
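The Start, Data, and Stop states together put ten bits on the line for every byte. A minimal software sketch of that framing is shown below; the function name `uart_frame` is hypothetical, and sending the least significant data bit first is an assumption (a common UART convention that the slide does not state).

```c
/* Fills bits[] with the ten line levels sent for one byte:
 * one start bit (0), eight data bits, one stop bit (1). */
void uart_frame(unsigned char byte, unsigned char bits[10])
{
    bits[0] = 0;                                       /* Start state */
    for (int i = 0; i < 8; i++)                        /* Data state */
        bits[1 + i] = (unsigned char)((byte >> i) & 1u); /* assumed LSB first */
    bits[9] = 1;                                       /* Stop state */
}
```

In the hardware version each of these levels would be held on the transmit line for one bit period before the FSMD advances.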


FSMD description of UART

#### Microcontroller

- Synthesizable version of Intel 8051 available
  - Written in VHDL
  - Captured at register transfer level (RTL)
- Fetches instruction from ROM
- Decodes using Instruction Decoder
- ALU executes arithmetic operations
  - Source and destination registers reside in RAM
- Special data movement instructions used to load and store externally
- Special program generates VHDL description of ROM from output of C compiler/linker

#### CCDPP

- Hardware implementation of zero-bias operations
- Interacts with external CCD chip
  - CCD chip resides external to our SOC mainly because combining CCD with ordinary logic not feasible
- Internal buffer, B, memory-mapped to 8051
- Variables R, C are buffer's row, column indices
- GetRow state reads in one row from CCD to B
  - 66 bytes: 64 pixels + 2 blacked-out pixels
- ComputeBias state computes bias for that row and stores in variable Bias
- FixBias state iterates over same row subtracting Bias from each element
- NextRow transitions to GetRow for repeat of process on next row or to Idle state when all 64 rows completed

FSMD description of CCDPP









#### Implementation 3: Microcontroller and CCDPP/Fixed-Point DCT

- 9.1 seconds still doesn't meet performance constraint of 1 second
- DCT operation prime candidate for improvement
  - Execution of implementation 2 shows microprocessor spends most cycles here
  - Could design custom hardware like we did for CCDPP
    - More complex so more design effort
  - Instead, will speed up DCT functionality by modifying behavior


#### Fixed-point arithmetic

- Integer used to represent a real number
  - Constant number of integer's bits represents fractional portion of real number
    - More bits, more accurate the representation
  - Remaining bits represent portion of real number before decimal point
- Translating a real constant to a fixed-point representation
  - Multiply real value by 2 ^ (# of bits used for fractional part)
  - Round to nearest integer
  - E.g., represent 3.14 as 8-bit integer with 4 bits for fraction
    - 2^4 = 16
    - 3.14 x 16 = 50.24 ≈ 50 = 00110010
    - 16 (2^4) possible values for fraction, each represents 0.0625 (1/16)
    - Last 4 bits (0010) = 2
    - 2 x 0.0625 = 0.125
    - 3 (0011) + 0.125 = 3.125 ≈ 3.14 (more bits for fraction would increase accuracy)
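The translation steps above can be sketched as two small C functions. `FRAC_BITS`, `to_fixed`, and `from_fixed` are hypothetical names; the conversion uses floating point only because it would normally happen once, when constants are prepared, not on the 8051 itself.

```c
#define FRAC_BITS 4   /* bits used for the fractional part, as in the example */

/* Real constant to fixed point: multiply by 2^FRAC_BITS, round to nearest. */
int to_fixed(double real)
{
    double scaled = real * (1 << FRAC_BITS);
    return (int)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

/* Fixed point back to a real value: each unit is worth 1 / 2^FRAC_BITS. */
double from_fixed(int fixed)
{
    return (double)fixed / (1 << FRAC_BITS);
}
```

With these definitions, `to_fixed(3.14)` yields 50 and `from_fixed(50)` yields 3.125, reproducing the worked example.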


#### DCT floating-point cost

- Floating-point cost
  - DCT uses ~260 floating-point operations per pixel transformation
  - 4096 (64 x 64) pixels per image
  - 1 million floating-point operations per image
  - No floating-point support with Intel 8051
    - Compiler must emulate
      - Generates procedures for each floating-point operation
        - mult, add
      - Each procedure uses tens of integer operations
  - Thus, > 10 million integer operations per image
  - Procedures increase code size
- Fixed-point arithmetic can improve on this


#### Fixed-point arithmetic operations

- Addition
  - Simply add integer representations
  - E.g., 3.14 + 2.71 = 5.85
    - 3.14 → 50 = 00110010
    - 2.71 → 43 = 00101011
    - 50 + 43 = 93 = 01011101
    - 5 (0101) + 13 (1101) x 0.0625 = 5.8125 ≈ 5.85
- Multiply
  - Multiply integer representations
  - Shift result right by # of bits in fractional part
  - E.g., 3.14 x 2.71 = 8.5094
    - 50 x 43 = 2150 = 100001100110
    - 2150 >> 4 = 10000110
    - 8 (1000) + 6 (0110) x 0.0625 = 8.375 ≈ 8.5094
- Range of real values used limited by bit widths of possible resulting values
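Both operations reduce to ordinary integer instructions, which is why they suit a processor without floating-point support. The sketch below uses hypothetical names (`FRAC_BITS`, `fixed_add`, `fixed_mul`) and assumes non-negative operands, since right-shifting a negative value is implementation-defined in C.

```c
#define FRAC_BITS 4   /* bits used for the fractional part, as in the examples */

/* Addition: the integer representations are simply added. */
int fixed_add(int a, int b)
{
    return a + b;
}

/* Multiplication: the raw product carries 2 * FRAC_BITS fractional bits,
 * so it is shifted right by FRAC_BITS to return to the original format. */
int fixed_mul(int a, int b)
{
    return (a * b) >> FRAC_BITS;
}
```

Using the slide's operands, `fixed_add(50, 43)` gives 93 (5.8125) and `fixed_mul(50, 43)` gives 134 (8.375). Note that 134 no longer fits in a signed 8-bit integer, which is the range limitation mentioned in the last bullet.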






#### Implementation 3: Microcontroller and CCDPP/Fixed-Point DCT

- Analysis of implementation 3
  - Use same analysis techniques as implementation 2
  - Total execution time for processing one image:
    - 1.5 seconds
  - Power consumption:
    - 0.033 watt (same as 2)
  - Energy consumption:
    - 0.050 joule (1.5 s x 0.033 watt)
    - Battery life 6x longer!!
  - Total chip area:
    - 90,000 gates
    - 8,000 fewer gates (less memory needed for code)


#### CODEC design

- 4 memory-mapped registers
  - C\_DATAI\_REG/C\_DATAO\_REG used to push/pop 8 x 8 block into and out of CODEC
  - C\_CMND\_REG used to command CODEC
    - Writing 1 to this register invokes CODEC
  - C\_STAT\_REG indicates CODEC done and ready for next block
    - Polled in software
- Direct translation of C code to VHDL for actual hardware implementation
  - Fixed-point version used
- CODEC module in software changed similarly to UART/CCDPP in implementation 2


#### Implementation 4: Microcontroller and CCDPP/DCT

- Analysis of implementation 4
  - Total execution time for processing one image:
    - 0.099 seconds (well under 1 sec)
  - Power consumption:
    - 0.040 watt
    - Increase over 2 and 3 because SOC has another processor
  - Energy consumption:
    - 0.0040 joule (0.099 s x 0.040 watt)
    - Battery life 12x longer than previous implementation!!
  - Total chip area:
    - 128,000 gates
    - Significant increase over previous implementations


#### Summary

- Digital camera example
  - Specifications in English and executable language
  - Design metrics: performance, power and area
- Several implementations
  - Microcontroller: too slow
  - Microcontroller and coprocessor: better, but still too slow
  - Fixed-point arithmetic: almost fast enough
  - Additional coprocessor for compression: fast enough, but expensive and hard to design
- Tradeoffs between hw/sw: the main lesson of this book!


#### Summary of implementations

|                      | Implementation 2 | Implementation 3 | Implementation 4 |
|----------------------|------------------|------------------|------------------|
| Performance (second) | 9.1              | 1.5              | 0.099            |
| Power (watt)         | 0.033            | 0.033            | 0.040            |
| Size (gate)          | 98,000           | 90,000           | 128,000          |
| Energy (joule)       | 0.30             | 0.050            | 0.0040           |

- Implementation 3
  - Close in performance
  - Cheaper
  - Less time to build
- Implementation 4
  - Great performance and energy consumption
  - More expensive and may miss time-to-market window
    - If DCT designed ourselves then increased NRE cost and time-to-market
    - If existing DCT purchased then increased IC cost
- Which is better?
