# Implementation of VLSI Architecture for Lifting Scheme Based DWT

## Rahul Umesh Kale, MD. Manan Mujahid, N. Venkataramana

Abstract: To The discrete wavelet transform (DWT) plays a central role in a number of signal and image processing applications. Owing to its importance in real-time signal processing systems, its first hardware implementation has been proposed. Subsequently, significant research effort has been made to optimize DWT/inverse DWT (IDWT) implementation, like architectures based on the folded digit-serial approach and low-complexity architectures with a reduced number of multipliers. However, these hardware architectures do not adequately address the power and area consumption issues, which often are the two most important metrics in today's high-performance signal processing systems. The main power consuming operation in DWT/IDWT computation is filtering, which requires a significant number of multiplications.

The lifting scheme is a new algorithm proposed for the implementation of the wavelet transform. It can reduce the computational complexity of DWT involved with the convolution implementation. Furthermore, the extra memory required to store the results of the convolution can also be reduced by in place computation of the wavelet coefficient with the lifting scheme.

The lifting scheme consists of the following three steps associated with the lifting scheme based DWT for the one-dimensional signal:(1) Split step: The input samples are split into even samples and odd samples,(2) Predict step (P): The even samples are multiplied by the predict factor and then the results are added to the odd samples to generate the detailed coefficients;(3) Update step (U): The detailed coefficients computed by the predict step are multiplied by the update factors and then the results are added to the even samples to get the coarse coefficients. One of the elegant features of the lifting scheme is that the inverse transform is a mirror of the forward transform.

Index Terms: Discrete Wavelet Transform, Lifting schemes, VLSI architectures, Inverse lifting scheme.

#### I. INTRODUCTION

The Discrete Wavelet Transform (DWT) plays a major role in the fields of signal analysis, computer vision, object recognition, image compression and video compression standard. The advantage of DWT over other traditional transformations is that it performs multi resolution analysis of signals with localization both in time and frequency as described by Mallat [1]. At present, many VLSI architectures for the 2-D DWT have been proposed to meet the requirements of real-time processing. The implementation of DWT in practical system has issues DWT needs extra memory for storing the intermediate computational results.

Manuscript Received November 2013.

Mr. Rahul Umesh Kale. MTech (ECE), SCET, Hyderabad, India .

**Prof. MD. Manan Mujahid, Asst**. Prof. & H.O.D, Dept. of ECE, SCET Hyderabad, India.

Moreover, for real time image compression, DWT has to process massive amounts of data at high speeds. The use of software implementation of DWT image compression provides flexibility for manipulation but it may not meet timing constraints in certain applications. Hardware implementation of DWT has practical obstacles. First, is that the high cost of hardware implementation of multipliers? Filter bank implementation of DWT contains two FIR filters [2].

The 3D Discrete Wavelet Transform (DWT) is widely used method for these medical imaging systems because of perfect reconstruction property. DWT can decompose the signals into different sub bands with both time and frequency information and facilitate to arrive at high compression ratio. DWT architecture, in general, reduces the memory requirements and increases the speed of communication by breaking up the image into the blocks [6].

Recently, a methodology for implementing lifting based DWT has been proposed because of lifting based DWT has many advantages over convolution based one. The lifting structure largely reduces the number of multiplication and accumulation where filter bank architectures can take advantage of many low power constant multiplication algorithms.

In this paper, we present a brief description of lifting scheme. In Section 2 discuss about what DWT advantages & disadvantages of DWT over DCT. Discuss about the lifting scheme in section 3.Section 4 discuss the some architecture & their comparison. Finally, brief summaries are given in Section 4 to conclude the paper

#### **II. IDEA ABOUT WAVELET TRANSFORMS**

An easy way to comply with the conference paper formatting requirements is to use this document as a template and simply type your text into it.

#### A. Wavelet Transforms

Pa Wavelets are functions generated from one single function (basis function) called the prototype or mother wavelet by dilations (scaling) and translations (shifts) in time (frequency) domain. If the mother wavelet is denoted by w(t) = w(t)

$$\Psi(t)$$
, the other wavelets  $\Psi_{a,b}(t)$  can be represented as,  
 $\Psi_{a,b}(t) = \frac{1}{\sqrt{|a|}} \Psi\left(\frac{t-b}{a}\right)$ 
(2.1)

$$\psi(t) = \psi_{1,0}(t)$$

... (2.2)



Retrieval Number: L05181111213/2013©BEIESP

Prof. Mr. N. Venkataramana ,Asst. Prof., Dept. of ECE, SCET, Hyderabad, India.

D

For any arbitrary  $a \neq 1$  and b = 0, it is possible to derive that,

$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}} \psi\left(\frac{t-b}{a}\right). \qquad \dots (2.3)$$

As shown in Eq.3,  $\Psi_{a,0}(t)$  is nothing but a time-scaled (by a) and amplitude-scaled (by  $\sqrt{|a|}$  ) version of the mother wavelet function  $\psi$  t in Eq. 2.



Fig 2.1 (a) A mother wavelet  
(b) 
$$\psi\left(\frac{t}{\alpha}\right): 0 < \alpha < 1$$
, and (c)  $\psi\left(\frac{t}{\alpha}\right): \alpha > 1$ .

Figure 2.1 shows an illustration of a mother wavelet and its dilations in the time domain with the dilation parameter  $a = \alpha$ . For the mother wavelet  $\psi(t)$  shown in figure 1.1(a), a contraction of the signal in the time axis when  $\alpha$  < 1 is shown in figure 2.1(b) and expansion of the signal in the time axis when  $\alpha > 1$  is shown in figure 2.1(c). Based on this definition of wavelets, the wavelet transform (WT) of a function (signal) f(t) is mathematically represented by [1]

$$W(a,b) = \int_{-\infty}^{\infty} \psi_{a,b}(t) f(t) dt \qquad \dots (2.4)$$

#### **B.** Continuous Wavelet Transforms

A continuous wavelet transform is used to divide a continuous-time function into wavelets. Unlike Fourier transform, the continuous wavelet transform possesses the ability to construct a time-frequency representation of a signal that offers very good time and frequency localization. The continuous wavelet transform is defined as [2].

The transformed signal  $f_{WT}(a,b)$  is a function of the dilation parameter 'a' and the translation parameter 'b'. The mother wavelet is denoted by  ${}^{{\ensuremath{\mathcal V}}}$  , the \* indicates that the complex conjugate is used in case of a complex wavelet. The signal energy is normalized at every scale by dividing the

wavelet coefficients by  $\overline{\sqrt{|a|}}$ . This ensures that the wavelets have the same energy at every scale.

# C. Discrete Wavelet Transforms

One drawback of the CWT is that the representation of the signal is often redundant, since a and b are continuous over R (the real number). The original signal can be completely reconstructed by a sample version of W(a, b). Typically, we sample W(a, b) in dyadic grid, i.e.[3]

$$a = 2^{-m} \text{ And } b = n2^{-m}$$
  
m,  $n \in \mathbb{Z}$ , and Z is the set of positive integer  
$$WT_{\psi} f(m, n) = \int_{0}^{\infty} f(t) \psi_{m, n}^{*}(t) dt \quad \dots (2.6)$$

Where 
$$\psi_{m,n}(t) = 2^{-m} \psi(2^m t - n)$$
 Is the dilated and

translated version of the mother wavelet  $\Psi(t)$ .

The transform shown in Eq. 7 is called the wavelet series, which is analogous to the Fourier series because the input function f(t) is still a continuous function whereas the transform coefficients are discrete. This is often called the discrete time wavelet transform (DTWT). For digital signal or image processing applications executed by a digital computer, the input signal f(t) needs to be discrete in nature because of the digital sampling of the original data, which is represented by a finite number bits. When the input function f(t) as well as the wavelet parameters a and b are represented in discrete form, the transformation is commonly referred to as the discrete wavelet transform (DWT) of the signal f(t). The discrete wavelet transform (DWT) became a very versatile signal processing tool after Mallat [3] proposed the multiresolution representation of signals based on wavelet decomposition.



Fig. 2.2 Three level decomposition of an image

Fig 2.2 shows the [3] shows the comprises two stages: stage 1 performs horizontal filtering and stage 2 performs vertical filtering. In the first-level decomposition, the size of the input image is N\* N, and the outputs are the three sub bands LH, HL, and HH, of size N/2\*N/2. In the second-level decomposition, the input is the LL band and the outputs are the three sub bands LLLH, LLHL, and LLHH, of size N/4\*N/4.The multi-level 2-D DWT can be extended in an analogous manner. The advantage of DWT is that it performs multi resolution analysis of signals with localization both in

time and frequency. The disadvantage of DWT to require extra memory to store

& Sciences Publication

Published By:



convolution results & complex hardware.

#### III. DWT WITH LIFTING SCHEME

The lifting scheme is a well known method for constructing bi-orthogonal wavelets. The main difference with the classical construction is that it does not rely on the Fourier transform. The lifting scheme can be used to construct second generation wavelets, wavelets which are not necessarily translates and dilates of one function. The lifting scheme can be used in situations where no Fourier transform is available. The lifting scheme provides an alternative. The lifting scheme is a technique for both, designing wavelets and performing the discrete wavelet transform.

Constructed entirely in spatial domain and based on the theory of biorthogonal wavelet filter banks with perfect reconstruction, lifting scheme can easily build up a gradually improved multi-resolution analysis through iterative primal lifting and dual lifting. It turns out that lifting scheme outperforms the classical especially in effective implementation, such as convenient construction, in-place calculation, lower computational complexity and simple inverse transform, etc. With lifting, we can also build wavelets with more vanishing moments and/or more smoothness, contributing to its flexible adaptively and non-linearity. The lifting scheme consists of the following three steps to decompose the samples, namely, splitting, predicting, and updating.

(1) **Split step**: The input samples split into even samples and odd samples.

(2) **Predict step** (**P**): The even samples are multiplied by the predict factor and then the results are added to the odd samples to generate the detailed coefficients.

(3) Update step (U): The detailed coefficients computed by the predict step are multiplied by the update factors and then the results are added to the even samples to get the coarse Coeffici EVEN SAMPLES



ODD SAMPLES

# Fig.3.1 Forward Lifting Wavelet Transform

**SPLIT:** In this step, the data is divided into ODD and EVEN elements.

$$s_k^{(0)} = x_{2i}^{(0)}, d_k^{(0)} = x_{2i+1}^{(0)}$$
 ... (3.1)

Where  $x_i$  is the input sequence.

 $S_k$  Represents even samples

 $d_k$  Represents odd samples



ODD SAMPLES

r Represents the level of decomposition.

## Fig.3.2: Inverse lifting wavelet transforms

## IV. ARCHITECTURES OF LIFTING SCHEME

#### A. Basic functional units for lifting scheme

Different kinds of lifting-based DWT architectures can be constructed by combining the three basic lifting elements. Most of the applicable DWTs like (9, 7) and (5,3) wavelets consist of processing units, his unit is called the processing element (PE). The processing nodes A, B and C are input samples which arrive successively. To implement the predict unit, A and C receive even samples while B receives odd samples. On the other hand, for the update unit, A and C are odd samples and B receives even samples. Now, the structure can be used to implement (5, 3) and (9, 7) wavelets is shown in Fig.4.2 & Fig.4.3. In this architecture each white circle represents a PE







Fig. 4.3: Lifting structure for (9, 7) wavelet



#### **B.** Folded Architecture

The folded structure is an alternative for the direct mapped architectures as in Fig.4.4 .by which the lifting-based structures can be designed systematically. In folded structure, the output of the PE unit is fed back through the delay registers to the PE's input. By adding different numbers of delay registers and coefficients with PE, the structure for different wavelets can be designed



Fig. 4.4: Folded architecture for (5, 3) wavelet

For example the folded structure for (5, 3) and (9, 7) wavelets have two and four delay registers, respectively. Also the coefficients for (5, 3) wavelet are -1/2-7 and 1/2-7 while for (9, 7) they area,b,c,d. The architecture can be reconfigured so that computation of the two phases can be interleaved by selection of appropriate data by the multiplexers. As a result, two delay registers (D) are needed in each lifting step in order to properly schedule the data in each phase. Based on the phase of interleaved computation, the coefficient for multiplier M1 is either a or c, and similarly the coefficient for multiplier M2 is b or d. The hardware utilization of this architecture is always 100% and the critical path in the multiplier can be reduced. The architecture for (9,7) filter is shown in Fig.4.5



**Fig. 4.5:** Folded architecture for (9, 7) wavelet

#### C. Flipping Structure

While conventional lifting-based architectures require fewer arithmetic operations, they sometimes have long critical paths.The critical path of the lifting-based architecture for the (9,7) filter is 4Tm + 8Ta while that of the convolution implementation is Tm + 4Ta. One way of improving this is by pipelining which results in a significant increase in the number of registers. For instance, to pipeline the lifting-based (9,7) filter such that the critical path is Tm + 2Ta, 6 additional registers are required. Huang et al. proposed a very efficient way of solving the timing accumulation problem. The basic idea is to remove the multiplications along the critical path by scaling the remaining paths by the inverse of the multiplier coefficients.[3]



Fig. .4.6: Flipping architecture .

- (a) Two connected computing units.
- (b) Flipping computing units.
- (c) After splitting the computing units and merging the multipliers.

# **D.** Recursive Architecture

The conventional DWT architectures compute the i<sup>th</sup> level of decomposition upon completion of the  $(i - 1)^{th}$  level of decomposition. For a finite-length input signal, the number of input samples is always greater than the total number of intermediate low-frequency coefficients to be processed at the second and higher stages. The same data path can be used to interleave the calculation of the higher stage DWT coefficients while the first-stage coefficients are being calculated.. This is the basic principle of recursive architecture . Here computations in higher levels of decomposition are initiated as soon as enough intermediate data in low-frequency sub band is available for computation. The basic circuit elements used in this architecture are delay elements, multipliers and MAC units which are in turn designed using a multiplier, an adder and two shifters. The multiplexers M1 and M2 select the even and odd samples of the input data as needed by the lifting scheme. S1, S2 and S3 are the control signals for data flow of the architecture.

The select signal (S1) of the multiplexers is set to 0 for the first level of computation and is set to 1 during the second or third level computation. The switches S2 and S3 select the input data for the second and third level of computation. The multiplexer M3 selects the delayed samples for each level of decomposition based on the clocked signals shown in Fig.4.7. A recursive architecture for multilevel 1D implementation of the (5, 3) filter has been proposed in . The architecture has hardware complexity and the control signals are very complex and it is regular.





#### E. Comparison of performance of 1D Architectures

A summary of the hardware and timing requirements of the different (9, 7) filter

implementations for data size N is presented in table.1. The



hardware complexity has been compared with respect to the data path. An estimate of the controller complexity has also been included. The timing performance has been compared with respect to two parameters: the number of clock cycles to compute L levels of decomposition and the delay in the critical path.

The term Tm stands for the delay of a multiplier, Ta the delay of an added. In terms of hardware complexity, the folded architecture in is the simplest and the DSP-based architecture in is the most complex. All other architectures have comparable hardware complexity and primarily differ in the number of registers and multiplexer circuitry. The control complexity of the architecture in is very simple. In contrast, the number of switches, multiplexers and control signals used in the architectures of is quite large. The control complexity of the remaining architectures is moderate. In terms of timing performance, the architectures in are all pipelined architectures having the highest throughput (1/Tm).

The architecture in has fewer cycles since it is RPA based, but its clock period is higher. Finally, all the architectures with the exception of compute all the outputs of one level before starting computations of the next level. The architecture in is the only one that adopts an RPA based approach and intersperses the computations of the higher levels with those of the first level. So it is likely that the memory requirements of would be lower than the others.

# V. CONCLUSION

The lifting scheme has a few advantages over the classical implementation of the wavelet transforms: it offers faster implementation, and it easily implements reversible integer-to-integer wavelet transforms. Integer wavelet transforms when implemented via lifting scheme have better computational efficiency and lower memory requirements.

In this paper, we have discussed some architectures of Wavelet. Transform based on the hardware complexities and computational time for the different architectures using Lifting schemes. The architectures represented vary from folded, flipping to recursive. This is useful for exploring a new method of pipelined architectures capable of handling multiple data streams suitable for application in image and video processing multimedia real time applications.

# REFERENCES

- S.Mallat, (1989), "A theory for multiresolution signal decomposition: the wavelet representation", IEEE Trans. Pattern Anal. Mach. Intel. 11, 674–693.
- Chao Cheng and Keshab K. Parhi,(2008), ".High-Speed VLSI Implementation of 2-D Discrete Wavelet Transform", IEEE Transactions on Signal Processing, Vol. 56, No. 1.
- Usha Bhanu.N and Dr.A. Chilambuchelvan," A Detailed Survey on VLSI Architectures for Lifting based DWT for efficient hardware implementation" International Journal of LSI design & Communication Systems (VLSICS) Vol.3, No.2, April 2012
- C.J Lian, K.F. Chen, H.H. Chen, and L.G. Chen, (2001), "Lifting Based Discrete Wavelet Transform Architecture for JPEG2000", IEEE International Symposium on Circuits and Systems, Sydney, Australia.
- Naseer M. Basheer, Mustafa Mushtak Mohammed," Design and FPGA Implementation of a Lifting scheme 2D DWT Architecture" International Journal of Recent Architecture for Image Processing" Progress In Electromagnetic Research Symposium Proceedings, Moscow, Russia, August 18-21 Technology and Engineering (IJRTE) Volume-2, Issue- 1, March 2013
- Karthikeyan A , Saranya P, Jayashree N ," An Efficient VLSI Architecture for 3D DWT Using Lifting Scheme", International

Retrieval Number: L05181111213/2013©BEIESP

Journal of Engineering Science and Innovative Technology (IJESIT) Volume 2, Issue 1, January 2013

- Cheng-yi Xionga,b,?, Jian-hua Houa,b, Jin-wen Tianb,Jian Liub," Efficient array architectures for multidiscrete wavelet transforms", signal Processing 87 (2007) 1089–1099
- I. Daubechies, W. Sweldens,(1998), "Factoring wavelet transform into lifting steps", J. Fourier Anal. Appl. 4, 247–269

| Arc<br>hitec<br>ture | Data path                                            | Timin<br>Requirer<br>No. of<br>cycles | -                           | Control<br>complexity |
|----------------------|------------------------------------------------------|---------------------------------------|-----------------------------|-----------------------|
| Fold<br>ed           | 4 MUL,2<br>Scaling<br>MUL ,8<br>A,8 R, 6<br>D        | 4+2N(<br>1 -<br>1/2 <sup>L</sup> )    | 2T <sub>m</sub><br>+2T<br>a | Moderate              |
| Flip<br>ping         | 4 MUL,2<br>Scaling<br>MUL,8<br>A,10 R, 6<br>S        | 5+2N(<br>1 -<br>1/2 <sup>L</sup> )    | T <sub>m</sub>              | Moderate              |
| Rec<br>ursi<br>ve    | 4 MUL,2<br>Scaling<br>MUL ,4<br>A,7 R, 3<br>D, 6 Mux | N+L+2<br>L                            | 4T<br>m+<br>8Ta             | Complex               |

I. Comparison of various 1D DWT architectures

