Fork of mbed-dsp. CMSIS-DSP library of supporting NEON

Dependents:   mbed-os-example-cmsis_dsp_neon

Fork of mbed-dsp by mbed official

Information

Japanese version is available in lower part of this page.
このページの後半に日本語版が用意されています.

CMSIS-DSP of supporting NEON

What is this ?

A library for CMSIS-DSP of supporting NEON.
We supported the NEON to CMSIS-DSP Ver1.4.3(CMSIS V4.1) that ARM supplied, has achieved the processing speed improvement.
If you use the mbed-dsp library, you can use to replace this library.
CMSIS-DSP of supporting NEON is provied as a library.

Library Creation environment

CMSIS-DSP library of supporting NEON was created by the following environment.

  • Compiler
    ARMCC Version 5.03
  • Compile option switch[C Compiler]
   -DARM_MATH_MATRIX_CHECK -DARM_MATH_ROUNDING -O3 -Otime --cpu=Cortex-A9 --littleend --arm 
   --apcs=/interwork --no_unaligned_access --fpu=vfpv3_fp16 --fpmode=fast --apcs=/hardfp 
   --vectorize --asm
  • Compile option switch[Assembler]
   --cpreproc --cpu=Cortex-A9 --littleend --arm --apcs=/interwork --no_unaligned_access 
   --fpu=vfpv3_fp16 --fpmode=fast --apcs=/hardfp


Effects of NEON support

In the data which passes to each function, large size will be expected more effective than small size.
Also if the data is a multiple of 16, effect will be expected in every function in the CMSIS-DSP.


NEON対応CMSIS-DSP

概要

NEON対応したCMSIS-DSPのライブラリです。
ARM社提供のCMSIS-DSP Ver1.4.3(CMSIS V4.1)をターゲットにNEON対応を行ない、処理速度向上を実現しております。
mbed-dspライブラリを使用している場合は、本ライブラリに置き換えて使用することができます。
NEON対応したCMSIS-DSPはライブラリで提供します。

ライブラリ作成環境

NEON対応CMSIS-DSPライブラリは、以下の環境で作成しています。

  • コンパイラ
    ARMCC Version 5.03
  • コンパイルオプションスイッチ[C Compiler]
   -DARM_MATH_MATRIX_CHECK -DARM_MATH_ROUNDING -O3 -Otime --cpu=Cortex-A9 --littleend --arm 
   --apcs=/interwork --no_unaligned_access --fpu=vfpv3_fp16 --fpmode=fast --apcs=/hardfp 
   --vectorize --asm
  • コンパイルオプションスイッチ[Assembler]
   --cpreproc --cpu=Cortex-A9 --littleend --arm --apcs=/interwork --no_unaligned_access 
   --fpu=vfpv3_fp16 --fpmode=fast --apcs=/hardfp


NEON対応による効果について

CMSIS-DSP内の各関数へ渡すデータは、小さいサイズよりも大きいサイズの方が効果が見込めます。
また、16の倍数のデータであれば、CMSIS-DSP内のどの関数でも効果が見込めます。


Revision:
3:7a284390b0ce
Parent:
2:da51fb522205
Child:
5:a912b042151f
--- a/cmsis_dsp/arm_math.h	Thu May 30 17:10:11 2013 +0100
+++ b/cmsis_dsp/arm_math.h	Fri Nov 08 13:45:10 2013 +0000
@@ -1,33 +1,41 @@
-/* ----------------------------------------------------------------------   
- * Copyright (C) 2010-2011 ARM Limited. All rights reserved.   
- *   
- * $Date:        15. February 2012  
- * $Revision: 	V1.1.0  
- *   
- * Project: 	    CMSIS DSP Library   
- * Title:	     	arm_math.h
- *   
- * Description:	 Public header file for CMSIS DSP Library
- *   
- * Target Processor: Cortex-M4/Cortex-M3/Cortex-M0
- *  
- * Version 1.1.0 2012/02/15 
- *    Updated with more optimizations, bug fixes and minor API changes.  
- *  
- * Version 1.0.10 2011/7/15 
- *    Big Endian support added and Merged M0 and M3/M4 Source code.  
- *   
- * Version 1.0.3 2010/11/29  
- *    Re-organized the CMSIS folders and updated documentation.   
- *    
- * Version 1.0.2 2010/11/11   
- *    Documentation updated.    
- *   
- * Version 1.0.1 2010/10/05    
- *    Production release and review comments incorporated.   
- *   
- * Version 1.0.0 2010/09/20    
- *    Production release and review comments incorporated.   
+/* ----------------------------------------------------------------------
+* Copyright (C) 2010-2013 ARM Limited. All rights reserved.
+*
+* $Date:        17. January 2013
+* $Revision:    V1.4.1
+*
+* Project:      CMSIS DSP Library
+* Title:        arm_math.h
+*
+* Description:  Public header file for CMSIS DSP Library
+*
+* Target Processor: Cortex-M4/Cortex-M3/Cortex-M0
+*
+* Redistribution and use in source and binary forms, with or without
+* modification, are permitted provided that the following conditions
+* are met:
+*   - Redistributions of source code must retain the above copyright
+*     notice, this list of conditions and the following disclaimer.
+*   - Redistributions in binary form must reproduce the above copyright
+*     notice, this list of conditions and the following disclaimer in
+*     the documentation and/or other materials provided with the
+*     distribution.
+*   - Neither the name of ARM LIMITED nor the names of its contributors
+*     may be used to endorse or promote products derived from this
+*     software without specific prior written permission.
+*
+* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+* FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+* COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+* ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+* POSSIBILITY OF SUCH DAMAGE.
  * -------------------------------------------------------------------- */
 
 /**
@@ -35,10 +43,10 @@
    *
    * <b>Introduction</b>
    *
-   * This user manual describes the CMSIS DSP software library, 
+   * This user manual describes the CMSIS DSP software library,
    * a suite of common signal processing functions for use on Cortex-M processor based devices.
    *
-   * The library is divided into a number of functions each covering a specific category:  
+   * The library is divided into a number of functions each covering a specific category:
    * - Basic math functions
    * - Fast math functions
    * - Complex math functions
@@ -51,41 +59,7 @@
    * - Interpolation functions
    *
    * The library has separate functions for operating on 8-bit integers, 16-bit integers,
-   * 32-bit integer and 32-bit floating-point values. 
-   *
-   * <b>Pre-processor Macros</b> 
-   * 
-   * Each library project have differant pre-processor macros. 
-   * 
-   * - UNALIGNED_SUPPORT_DISABLE: 
-   * 
-   * Define macro UNALIGNED_SUPPORT_DISABLE, If the silicon does not support unaligned memory access    
-   * 
-   * - ARM_MATH_BIG_ENDIAN: 
-   * 
-   * Define macro ARM_MATH_BIG_ENDIAN to build the library for big endian targets. By default library builds for little endian targets. 
-   * 
-   * - ARM_MATH_MATRIX_CHECK: 
-   * 
-   * Define macro ARM_MATH_MATRIX_CHECK for checking on the input and output sizes of matrices 
-   * 
-   * - ARM_MATH_ROUNDING: 
-   * 
-   * Define macro ARM_MATH_ROUNDING for rounding on support functions
-   *
-   * - ARM_MATH_CMx:
-   *
-   * Define macro ARM_MATH_CM4 for building the library on Cortex-M4 target, ARM_MATH_CM3 for building library on Cortex-M3 target
-   * and ARM_MATH_CM0 for building library on cortex-M0 target.
-   * 
-   * - __FPU_PRESENT:
-   *
-   * Initialize macro __FPU_PRESENT = 1 when building on FPU supported Targets. Enable this macro for M4bf and M4lf libraries 
-   *
-   * <b>Toolchain Support</b>
-   *
-   * The library has been developed and tested with MDK-ARM version 4.23. 
-   * The library is being tested in GCC and IAR toolchains and updates on this activity will be made available shortly.
+   * 32-bit integer and 32-bit floating-point values.
    *
    * <b>Using the Library</b>
    *
@@ -100,33 +74,67 @@
    * - arm_cortexM0b_math.lib (Big endian on Cortex-M3)
    *
    * The library functions are declared in the public file <code>arm_math.h</code> which is placed in the <code>Include</code> folder.
-   * Simply include this file and link the appropriate library in the application and begin calling the library functions. The Library supports single 
-   * public header file <code> arm_math.h</code> for Cortex-M4/M3/M0 with little endian and big endian. Same header file will be used for floating point unit(FPU) variants. 
-   * Define the appropriate pre processor MACRO ARM_MATH_CM4 or  ARM_MATH_CM3 or 
-   * ARM_MATH_CM0 depending on the target processor in the application.
+   * Simply include this file and link the appropriate library in the application and begin calling the library functions. The Library supports single
+   * public header file <code> arm_math.h</code> for Cortex-M4/M3/M0 with little endian and big endian. Same header file will be used for floating point unit(FPU) variants.
+   * Define the appropriate pre processor MACRO ARM_MATH_CM4 or  ARM_MATH_CM3 or
+   * ARM_MATH_CM0 or ARM_MATH_CM0PLUS depending on the target processor in the application.
    *
    * <b>Examples</b>
    *
    * The library ships with a number of examples which demonstrate how to use the library functions.
    *
+   * <b>Toolchain Support</b>
+   *
+   * The library has been developed and tested with MDK-ARM version 4.60.
+   * The library is being tested in GCC and IAR toolchains and updates on this activity will be made available shortly.
+   *
    * <b>Building the Library</b>
    *
    * The library installer contains project files to re build libraries on MDK Tool chain in the <code>CMSIS\\DSP_Lib\\Source\\ARM</code> folder.
    * - arm_cortexM0b_math.uvproj
    * - arm_cortexM0l_math.uvproj
    * - arm_cortexM3b_math.uvproj
-   * - arm_cortexM3l_math.uvproj  
+   * - arm_cortexM3l_math.uvproj
    * - arm_cortexM4b_math.uvproj
    * - arm_cortexM4l_math.uvproj
    * - arm_cortexM4bf_math.uvproj
    * - arm_cortexM4lf_math.uvproj
    *
    *
-   * The project can be built by opening the appropriate project in MDK-ARM 4.23 chain and defining the optional pre processor MACROs detailed above.
+   * The project can be built by opening the appropriate project in MDK-ARM 4.60 chain and defining the optional pre processor MACROs detailed above.
+   *
+   * <b>Pre-processor Macros</b>
+   *
+   * Each library project have differant pre-processor macros.
+   *
+   * - UNALIGNED_SUPPORT_DISABLE:
+   *
+   * Define macro UNALIGNED_SUPPORT_DISABLE, If the silicon does not support unaligned memory access
+   *
+   * - ARM_MATH_BIG_ENDIAN:
+   *
+   * Define macro ARM_MATH_BIG_ENDIAN to build the library for big endian targets. By default library builds for little endian targets.
+   *
+   * - ARM_MATH_MATRIX_CHECK:
+   *
+   * Define macro ARM_MATH_MATRIX_CHECK for checking on the input and output sizes of matrices
+   *
+   * - ARM_MATH_ROUNDING:
+   *
+   * Define macro ARM_MATH_ROUNDING for rounding on support functions
+   *
+   * - ARM_MATH_CMx:
+   *
+   * Define macro ARM_MATH_CM4 for building the library on Cortex-M4 target, ARM_MATH_CM3 for building library on Cortex-M3 target
+   * and ARM_MATH_CM0 for building library on cortex-M0 target, ARM_MATH_CM0PLUS for building library on cortex-M0+ target.
+   *
+   * - __FPU_PRESENT:
+   *
+   * Initialize macro __FPU_PRESENT = 1 when building on FPU supported Targets. Enable this macro for M4bf and M4lf libraries
    *
    * <b>Copyright Notice</b>
    *
-   * Copyright (C) 2010 ARM Limited. All rights reserved.
+   * Copyright (C) 2010-2013 ARM Limited. All rights reserved.
    */
 
 
@@ -258,20 +266,16 @@
 
 #define __CMSIS_GENERIC         /* disable NVIC and Systick functions */
 
-#if defined (TARGET_LPC1768)
-#   define ARM_MATH_CM3    1
-
-#elif defined (TARGET_LPC11U24)
-#   define ARM_MATH_CM0    1
-#endif
-
-
 #if defined (ARM_MATH_CM4)
 #include "core_cm4.h"
 #elif defined (ARM_MATH_CM3)
 #include "core_cm3.h"
 #elif defined (ARM_MATH_CM0)
 #include "core_cm0.h"
+#define ARM_MATH_CM0_FAMILY
+#elif defined (ARM_MATH_CM0PLUS)
+#include "core_cm0plus.h"
+#define ARM_MATH_CM0_FAMILY
 #else
 #include "ARMCM4.h"
 #warning "Define either ARM_MATH_CM4 OR ARM_MATH_CM3...By Default building on ARM_MATH_CM4....."
@@ -373,17 +377,27 @@
   /**
    * @brief definition to read/write two 16 bit values.
    */
-#if defined  (__GNUC__)
-  #define __SIMD32(addr)         (*( int32_t **) & (addr))
-  #define  _SIMD32_OFFSET(addr)  (*( int32_t * )   (addr))
+#if defined __CC_ARM
+#define __SIMD32_TYPE int32_t __packed
+#define CMSIS_UNUSED __attribute__((unused))
+#elif defined __ICCARM__
+#define CMSIS_UNUSED
+#define __SIMD32_TYPE int32_t __packed
+#elif defined __GNUC__
+#define __SIMD32_TYPE int32_t
+#define CMSIS_UNUSED __attribute__((unused))
 #else
-  #define __SIMD32(addr)         (*(__packed                    int32_t **) & (addr))
-  #define  _SIMD32_OFFSET(addr)  (*(__packed                    int32_t * )   (addr))
-#endif 
-
-  #define __SIMD64(addr)  (*(int64_t **) & (addr))
-
-#if defined (ARM_MATH_CM3) || defined (ARM_MATH_CM0)
+#error Unknown compiler
+#endif
+
+#define __SIMD32(addr)  (*(__SIMD32_TYPE **) & (addr))
+#define __SIMD32_CONST(addr)  ((__SIMD32_TYPE *)(addr))
+
+#define _SIMD32_OFFSET(addr)  (*(__SIMD32_TYPE *)  (addr))
+
+#define __SIMD64(addr)  (*(int64_t **) & (addr))
+
+#if defined (ARM_MATH_CM3) || defined (ARM_MATH_CM0_FAMILY)
   /**
    * @brief definition to pack two 16 bit values.
    */
@@ -417,7 +431,7 @@
   /**
    * @brief Clips Q63 to Q31 values.
    */
-  __STATIC_INLINE q31_t clip_q63_to_q31(
+  static __INLINE q31_t clip_q63_to_q31(
   q63_t x)
   {
     return ((q31_t) (x >> 32) != ((q31_t) x >> 31)) ?
@@ -427,7 +441,7 @@
   /**
    * @brief Clips Q63 to Q15 values.
    */
-  __STATIC_INLINE q15_t clip_q63_to_q15(
+  static __INLINE q15_t clip_q63_to_q15(
   q63_t x)
   {
     return ((q31_t) (x >> 32) != ((q31_t) x >> 31)) ?
@@ -437,7 +451,7 @@
   /**
    * @brief Clips Q31 to Q7 values.
    */
-  __STATIC_INLINE q7_t clip_q31_to_q7(
+  static __INLINE q7_t clip_q31_to_q7(
   q31_t x)
   {
     return ((q31_t) (x >> 24) != ((q31_t) x >> 23)) ?
@@ -447,7 +461,7 @@
   /**
    * @brief Clips Q31 to Q15 values.
    */
-  __STATIC_INLINE q15_t clip_q31_to_q15(
+  static __INLINE q15_t clip_q31_to_q15(
   q31_t x)
   {
     return ((q31_t) (x >> 16) != ((q31_t) x >> 15)) ?
@@ -458,7 +472,7 @@
    * @brief Multiplies 32 X 64 and returns 32 bit result in 2.30 format.
    */
 
-  __STATIC_INLINE q63_t mult32x64(
+  static __INLINE q63_t mult32x64(
   q63_t x,
   q31_t y)
   {
@@ -467,20 +481,16 @@
   }
 
 
-#if defined (ARM_MATH_CM0) && defined ( __CC_ARM   )
+#if defined (ARM_MATH_CM0_FAMILY) && defined ( __CC_ARM   )
 #define __CLZ __clz
-#endif
-
-#if defined (ARM_MATH_CM0) && defined ( __TASKING__ )
-/* No need to redefine __CLZ */
-#endif
-
-#if defined (ARM_MATH_CM0) && ((defined (__ICCARM__)) ||(defined (__GNUC__)) )
-
-  __STATIC_INLINE  uint32_t __CLZ(q31_t data);
-
-
-  __STATIC_INLINE uint32_t __CLZ(q31_t data)
+#elif defined (ARM_MATH_CM0_FAMILY) && ((defined (__ICCARM__)) ||(defined (__GNUC__)) || defined (__TASKING__) )
+
+  static __INLINE uint32_t __CLZ(
+  q31_t data);
+
+
+  static __INLINE uint32_t __CLZ(
+  q31_t data)
   {
     uint32_t count = 0;
     uint32_t mask = 0x80000000;
@@ -498,10 +508,10 @@
 #endif
 
   /**
-   * @brief Function to Calculates 1/in(reciprocal) value of Q31 Data type.
-   */
-
-  __STATIC_INLINE uint32_t arm_recip_q31(
+   * @brief Function to Calculates 1/in (reciprocal) value of Q31 Data type.
+   */
+
+  static __INLINE uint32_t arm_recip_q31(
   q31_t in,
   q31_t * dst,
   q31_t * pRecipTable)
@@ -550,9 +560,9 @@
   }
 
   /**
-   * @brief Function to Calculates 1/in(reciprocal) value of Q15 Data type.
-   */
-  __STATIC_INLINE uint32_t arm_recip_q15(
+   * @brief Function to Calculates 1/in (reciprocal) value of Q15 Data type.
+   */
+  static __INLINE uint32_t arm_recip_q15(
   q15_t in,
   q15_t * dst,
   q15_t * pRecipTable)
@@ -603,9 +613,9 @@
   /*
    * @brief C custom defined intrinisic function for only M0 processors
    */
-#if defined(ARM_MATH_CM0)
-
-  __STATIC_INLINE q31_t __SSAT(
+#if defined(ARM_MATH_CM0_FAMILY)
+
+  static __INLINE q31_t __SSAT(
   q31_t x,
   uint32_t y)
   {
@@ -641,19 +651,19 @@
 
   }
 
-#endif /* end of ARM_MATH_CM0 */
+#endif /* end of ARM_MATH_CM0_FAMILY */
 
 
 
   /*
    * @brief C custom defined intrinsic function for M3 and M0 processors
    */
-#if defined (ARM_MATH_CM3) || defined (ARM_MATH_CM0)
+#if defined (ARM_MATH_CM3) || defined (ARM_MATH_CM0_FAMILY)
 
   /*
    * @brief C custom defined QADD8 for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __QADD8(
+  static __INLINE q31_t __QADD8(
   q31_t x,
   q31_t y)
   {
@@ -680,7 +690,7 @@
   /*
    * @brief C custom defined QSUB8 for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __QSUB8(
+  static __INLINE q31_t __QSUB8(
   q31_t x,
   q31_t y)
   {
@@ -710,7 +720,7 @@
   /*
    * @brief C custom defined QADD16 for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __QADD16(
+  static __INLINE q31_t __QADD16(
   q31_t x,
   q31_t y)
   {
@@ -733,7 +743,7 @@
   /*
    * @brief C custom defined SHADD16 for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SHADD16(
+  static __INLINE q31_t __SHADD16(
   q31_t x,
   q31_t y)
   {
@@ -756,7 +766,7 @@
   /*
    * @brief C custom defined QSUB16 for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __QSUB16(
+  static __INLINE q31_t __QSUB16(
   q31_t x,
   q31_t y)
   {
@@ -778,7 +788,7 @@
   /*
    * @brief C custom defined SHSUB16 for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SHSUB16(
+  static __INLINE q31_t __SHSUB16(
   q31_t x,
   q31_t y)
   {
@@ -800,7 +810,7 @@
   /*
    * @brief C custom defined QASX for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __QASX(
+  static __INLINE q31_t __QASX(
   q31_t x,
   q31_t y)
   {
@@ -818,7 +828,7 @@
   /*
    * @brief C custom defined SHASX for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SHASX(
+  static __INLINE q31_t __SHASX(
   q31_t x,
   q31_t y)
   {
@@ -841,7 +851,7 @@
   /*
    * @brief C custom defined QSAX for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __QSAX(
+  static __INLINE q31_t __QSAX(
   q31_t x,
   q31_t y)
   {
@@ -859,7 +869,7 @@
   /*
    * @brief C custom defined SHSAX for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SHSAX(
+  static __INLINE q31_t __SHSAX(
   q31_t x,
   q31_t y)
   {
@@ -881,7 +891,7 @@
   /*
    * @brief C custom defined SMUSDX for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SMUSDX(
+  static __INLINE q31_t __SMUSDX(
   q31_t x,
   q31_t y)
   {
@@ -893,7 +903,7 @@
   /*
    * @brief C custom defined SMUADX for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SMUADX(
+  static __INLINE q31_t __SMUADX(
   q31_t x,
   q31_t y)
   {
@@ -905,7 +915,7 @@
   /*
    * @brief C custom defined QADD for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __QADD(
+  static __INLINE q31_t __QADD(
   q31_t x,
   q31_t y)
   {
@@ -915,7 +925,7 @@
   /*
    * @brief C custom defined QSUB for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __QSUB(
+  static __INLINE q31_t __QSUB(
   q31_t x,
   q31_t y)
   {
@@ -925,7 +935,7 @@
   /*
    * @brief C custom defined SMLAD for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SMLAD(
+  static __INLINE q31_t __SMLAD(
   q31_t x,
   q31_t y,
   q31_t sum)
@@ -938,7 +948,7 @@
   /*
    * @brief C custom defined SMLADX for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SMLADX(
+  static __INLINE q31_t __SMLADX(
   q31_t x,
   q31_t y,
   q31_t sum)
@@ -951,7 +961,7 @@
   /*
    * @brief C custom defined SMLSDX for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SMLSDX(
+  static __INLINE q31_t __SMLSDX(
   q31_t x,
   q31_t y,
   q31_t sum)
@@ -964,7 +974,7 @@
   /*
    * @brief C custom defined SMLALD for M3 and M0 processors
    */
-  __STATIC_INLINE q63_t __SMLALD(
+  static __INLINE q63_t __SMLALD(
   q31_t x,
   q31_t y,
   q63_t sum)
@@ -977,7 +987,7 @@
   /*
    * @brief C custom defined SMLALDX for M3 and M0 processors
    */
-  __STATIC_INLINE q63_t __SMLALDX(
+  static __INLINE q63_t __SMLALDX(
   q31_t x,
   q31_t y,
   q63_t sum)
@@ -990,7 +1000,7 @@
   /*
    * @brief C custom defined SMUAD for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SMUAD(
+  static __INLINE q31_t __SMUAD(
   q31_t x,
   q31_t y)
   {
@@ -1002,7 +1012,7 @@
   /*
    * @brief C custom defined SMUSD for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SMUSD(
+  static __INLINE q31_t __SMUSD(
   q31_t x,
   q31_t y)
   {
@@ -1015,7 +1025,7 @@
   /*
    * @brief C custom defined SXTB16 for M3 and M0 processors
    */
-  __STATIC_INLINE q31_t __SXTB16(
+  static __INLINE q31_t __SXTB16(
   q31_t x)
   {
 
@@ -1024,7 +1034,7 @@
   }
 
 
-#endif /* defined (ARM_MATH_CM3) || defined (ARM_MATH_CM0) */
+#endif /* defined (ARM_MATH_CM3) || defined (ARM_MATH_CM0_FAMILY) */
 
 
   /**
@@ -1524,6 +1534,7 @@
    * @param[in]       *pSrcA points to the first input matrix structure
    * @param[in]       *pSrcB points to the second input matrix structure
    * @param[out]      *pDst points to output matrix structure
+   * @param[in]		  *pState points to the array for storing intermediate results
    * @return     The function returns either
    * <code>ARM_MATH_SIZE_MISMATCH</code> or <code>ARM_MATH_SUCCESS</code> based on the outcome of size checking.
    */
@@ -1539,7 +1550,7 @@
    * @param[in]       *pSrcA  points to the first input matrix structure
    * @param[in]       *pSrcB  points to the second input matrix structure
    * @param[out]      *pDst   points to output matrix structure
-   * @param[in]		  *pState points to the array for storing intermediate results  
+   * @param[in]		  *pState points to the array for storing intermediate results
    * @return     The function returns either
    * <code>ARM_MATH_SIZE_MISMATCH</code> or <code>ARM_MATH_SUCCESS</code> based on the outcome of size checking.
    */
@@ -1721,7 +1732,7 @@
   typedef struct
   {
     q15_t A0;    /**< The derived gain, A0 = Kp + Ki + Kd . */
-#ifdef ARM_MATH_CM0
+#ifdef ARM_MATH_CM0_FAMILY
     q15_t A1;
     q15_t A2;
 #else
@@ -1939,52 +1950,8 @@
   uint32_t blockSize);
 
 
-  /**
-   * @brief Instance structure for the Q15 CFFT/CIFFT function.
-   */
-
-  typedef struct
-  {
-    uint16_t fftLen;                 /**< length of the FFT. */
-    uint8_t ifftFlag;                /**< flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform. */
-    uint8_t bitReverseFlag;          /**< flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output. */
-    q15_t *pTwiddle;                 /**< points to the twiddle factor table. */
-    uint16_t *pBitRevTable;          /**< points to the bit reversal table. */
-    uint16_t twidCoefModifier;       /**< twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table. */
-    uint16_t bitRevFactor;           /**< bit reversal modifier that supports different size FFTs with the same bit reversal table. */
-  } arm_cfft_radix4_instance_q15;
-
-  /**
-   * @brief Instance structure for the Q31 CFFT/CIFFT function.
-   */
-
-  typedef struct
-  {
-    uint16_t fftLen;                 /**< length of the FFT. */
-    uint8_t ifftFlag;                /**< flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform. */
-    uint8_t bitReverseFlag;          /**< flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output. */
-    q31_t *pTwiddle;                 /**< points to the twiddle factor table. */
-    uint16_t *pBitRevTable;          /**< points to the bit reversal table. */
-    uint16_t twidCoefModifier;       /**< twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table. */
-    uint16_t bitRevFactor;           /**< bit reversal modifier that supports different size FFTs with the same bit reversal table. */
-  } arm_cfft_radix4_instance_q31;
-
-
-  /**
-   * @brief Instance structure for the floating-point CFFT/CIFFT function.
-   */
-
-  typedef struct
-  {
-    uint16_t fftLen;                   /**< length of the FFT. */
-    uint8_t ifftFlag;                  /**< flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform. */
-    uint8_t bitReverseFlag;            /**< flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output. */
-    float32_t *pTwiddle;               /**< points to the twiddle factor table. */
-    uint16_t *pBitRevTable;            /**< points to the bit reversal table. */
-    uint16_t twidCoefModifier;         /**< twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table. */
-    uint16_t bitRevFactor;             /**< bit reversal modifier that supports different size FFTs with the same bit reversal table. */
-    float32_t onebyfftLen;                 /**< value of 1/fftLen. */
-  } arm_cfft_radix4_instance_f32;
+
+
 
 
   /**
@@ -2002,6 +1969,43 @@
     uint16_t bitRevFactor;           /**< bit reversal modifier that supports different size FFTs with the same bit reversal table. */
   } arm_cfft_radix2_instance_q15;
 
+  arm_status arm_cfft_radix2_init_q15(
+  arm_cfft_radix2_instance_q15 * S,
+  uint16_t fftLen,
+  uint8_t ifftFlag,
+  uint8_t bitReverseFlag);
+
+  void arm_cfft_radix2_q15(
+  const arm_cfft_radix2_instance_q15 * S,
+  q15_t * pSrc);
+
+
+
+  /**
+   * @brief Instance structure for the Q15 CFFT/CIFFT function.
+   */
+
+  typedef struct
+  {
+    uint16_t fftLen;                 /**< length of the FFT. */
+    uint8_t ifftFlag;                /**< flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform. */
+    uint8_t bitReverseFlag;          /**< flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output. */
+    q15_t *pTwiddle;                 /**< points to the twiddle factor table. */
+    uint16_t *pBitRevTable;          /**< points to the bit reversal table. */
+    uint16_t twidCoefModifier;       /**< twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table. */
+    uint16_t bitRevFactor;           /**< bit reversal modifier that supports different size FFTs with the same bit reversal table. */
+  } arm_cfft_radix4_instance_q15;
+
+  arm_status arm_cfft_radix4_init_q15(
+  arm_cfft_radix4_instance_q15 * S,
+  uint16_t fftLen,
+  uint8_t ifftFlag,
+  uint8_t bitReverseFlag);
+
+  void arm_cfft_radix4_q15(
+  const arm_cfft_radix4_instance_q15 * S,
+  q15_t * pSrc);
+
   /**
    * @brief Instance structure for the Radix-2 Q31 CFFT/CIFFT function.
    */
@@ -2017,6 +2021,42 @@
     uint16_t bitRevFactor;           /**< bit reversal modifier that supports different size FFTs with the same bit reversal table. */
   } arm_cfft_radix2_instance_q31;
 
+  arm_status arm_cfft_radix2_init_q31(
+  arm_cfft_radix2_instance_q31 * S,
+  uint16_t fftLen,
+  uint8_t ifftFlag,
+  uint8_t bitReverseFlag);
+
+  void arm_cfft_radix2_q31(
+  const arm_cfft_radix2_instance_q31 * S,
+  q31_t * pSrc);
+
+  /**
+   * @brief Instance structure for the Q31 CFFT/CIFFT function.
+   */
+
+  typedef struct
+  {
+    uint16_t fftLen;                 /**< length of the FFT. */
+    uint8_t ifftFlag;                /**< flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform. */
+    uint8_t bitReverseFlag;          /**< flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output. */
+    q31_t *pTwiddle;                 /**< points to the twiddle factor table. */
+    uint16_t *pBitRevTable;          /**< points to the bit reversal table. */
+    uint16_t twidCoefModifier;       /**< twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table. */
+    uint16_t bitRevFactor;           /**< bit reversal modifier that supports different size FFTs with the same bit reversal table. */
+  } arm_cfft_radix4_instance_q31;
+
+
+  void arm_cfft_radix4_q31(
+  const arm_cfft_radix4_instance_q31 * S,
+  q31_t * pSrc);
+
+  arm_status arm_cfft_radix4_init_q31(
+  arm_cfft_radix4_instance_q31 * S,
+  uint16_t fftLen,
+  uint8_t ifftFlag,
+  uint8_t bitReverseFlag);
+
   /**
    * @brief Instance structure for the floating-point CFFT/CIFFT function.
    */
@@ -2033,401 +2073,63 @@
     float32_t onebyfftLen;                 /**< value of 1/fftLen. */
   } arm_cfft_radix2_instance_f32;
 
-
-  /**
-   * @brief Processing function for the Q15 CFFT/CIFFT.
-   * @param[in]      *S    points to an instance of the Q15 CFFT/CIFFT structure.
-   * @param[in, out] *pSrc points to the complex data buffer. Processing occurs in-place.
-   * @return none.
-   */
-
-  void arm_cfft_radix4_q15(
-  const arm_cfft_radix4_instance_q15 * S,
-  q15_t * pSrc);
-
-  /**
-   * @brief Processing function for the Q15 CFFT/CIFFT.
-   * @param[in]      *S    points to an instance of the Q15 CFFT/CIFFT structure.
-   * @param[in, out] *pSrc points to the complex data buffer. Processing occurs in-place.
-   * @return none.
-   */
-
-  void arm_cfft_radix2_q15(
-  const arm_cfft_radix2_instance_q15 * S,
-  q15_t * pSrc);
-
-  /**
-   * @brief Initialization function for the Q15 CFFT/CIFFT.
-   * @param[in,out] *S             points to an instance of the Q15 CFFT/CIFFT structure.
-   * @param[in]     fftLen         length of the FFT.
-   * @param[in]     ifftFlag       flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform.
-   * @param[in]     bitReverseFlag flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output.
-   * @return        arm_status     function returns ARM_MATH_SUCCESS if initialization is successful or ARM_MATH_ARGUMENT_ERROR if <code>fftLen</code> is not a supported value.
-   */
-
-  arm_status arm_cfft_radix4_init_q15(
-  arm_cfft_radix4_instance_q15 * S,
-  uint16_t fftLen,
-  uint8_t ifftFlag,
-  uint8_t bitReverseFlag);
-
-  /**
-   * @brief Initialization function for the Q15 CFFT/CIFFT.
-   * @param[in,out] *S             points to an instance of the Q15 CFFT/CIFFT structure.
-   * @param[in]     fftLen         length of the FFT.
-   * @param[in]     ifftFlag       flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform.
-   * @param[in]     bitReverseFlag flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output.
-   * @return        arm_status     function returns ARM_MATH_SUCCESS if initialization is successful or ARM_MATH_ARGUMENT_ERROR if <code>fftLen</code> is not a supported value.
-   */
-
-  arm_status arm_cfft_radix2_init_q15(
-  arm_cfft_radix2_instance_q15 * S,
-  uint16_t fftLen,
-  uint8_t ifftFlag,
-  uint8_t bitReverseFlag);
-
-  /**
-   * @brief Processing function for the Q31 CFFT/CIFFT.
-   * @param[in]      *S    points to an instance of the Q31 CFFT/CIFFT structure.
-   * @param[in, out] *pSrc points to the complex data buffer. Processing occurs in-place.
-   * @return none.
-   */
-
-  void arm_cfft_radix4_q31(
-  const arm_cfft_radix4_instance_q31 * S,
-  q31_t * pSrc);
-
-  /**
-   * @brief  Initialization function for the Q31 CFFT/CIFFT.
-   * @param[in,out] *S             points to an instance of the Q31 CFFT/CIFFT structure.
-   * @param[in]     fftLen         length of the FFT.
-   * @param[in]     ifftFlag       flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform.
-   * @param[in]     bitReverseFlag flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output.
-   * @return        arm_status     function returns ARM_MATH_SUCCESS if initialization is successful or ARM_MATH_ARGUMENT_ERROR if <code>fftLen</code> is not a supported value.
-   */
-
-  arm_status arm_cfft_radix4_init_q31(
-  arm_cfft_radix4_instance_q31 * S,
-  uint16_t fftLen,
-  uint8_t ifftFlag,
-  uint8_t bitReverseFlag);
-
-  /**
-   * @brief Processing function for the Radix-2 Q31 CFFT/CIFFT.
-   * @param[in]      *S    points to an instance of the Radix-2 Q31 CFFT/CIFFT structure.
-   * @param[in, out] *pSrc points to the complex data buffer. Processing occurs in-place.
-   * @return none.
-   */
-
-  void arm_cfft_radix2_q31(
-  const arm_cfft_radix2_instance_q31 * S,
-  q31_t * pSrc);
-
-  /**
-   * @brief  Initialization function for the Radix-2 Q31 CFFT/CIFFT.
-   * @param[in,out] *S             points to an instance of the Radix-2 Q31 CFFT/CIFFT structure.
-   * @param[in]     fftLen         length of the FFT.
-   * @param[in]     ifftFlag       flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform.
-   * @param[in]     bitReverseFlag flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output.
-   * @return        arm_status     function returns ARM_MATH_SUCCESS if initialization is successful or ARM_MATH_ARGUMENT_ERROR if <code>fftLen</code> is not a supported value.
-   */
-
-  arm_status arm_cfft_radix2_init_q31(
-  arm_cfft_radix2_instance_q31 * S,
-  uint16_t fftLen,
-  uint8_t ifftFlag,
-  uint8_t bitReverseFlag);
-
-
-
-  /**
-   * @brief Processing function for the floating-point CFFT/CIFFT.
-   * @param[in]      *S    points to an instance of the floating-point CFFT/CIFFT structure.
-   * @param[in, out] *pSrc points to the complex data buffer. Processing occurs in-place.
-   * @return none.
-   */
-
-  void arm_cfft_radix2_f32(
-  const arm_cfft_radix2_instance_f32 * S,
-  float32_t * pSrc);
-
-  /**
-   * @brief  Initialization function for the floating-point CFFT/CIFFT.
-   * @param[in,out] *S             points to an instance of the floating-point CFFT/CIFFT structure.
-   * @param[in]     fftLen         length of the FFT.
-   * @param[in]     ifftFlag       flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform.
-   * @param[in]     bitReverseFlag flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output.
-   * @return        The function returns ARM_MATH_SUCCESS if initialization is successful or ARM_MATH_ARGUMENT_ERROR if <code>fftLen</code> is not a supported value.
-   */
-
+/* Deprecated */
   arm_status arm_cfft_radix2_init_f32(
   arm_cfft_radix2_instance_f32 * S,
   uint16_t fftLen,
   uint8_t ifftFlag,
   uint8_t bitReverseFlag);
 
-  /**
-   * @brief Processing function for the floating-point CFFT/CIFFT.
-   * @param[in]      *S    points to an instance of the floating-point CFFT/CIFFT structure.
-   * @param[in, out] *pSrc points to the complex data buffer. Processing occurs in-place.
-   * @return none.
-   */
-
-  void arm_cfft_radix4_f32(
-  const arm_cfft_radix4_instance_f32 * S,
+/* Deprecated */
+  void arm_cfft_radix2_f32(
+  const arm_cfft_radix2_instance_f32 * S,
   float32_t * pSrc);
 
   /**
-   * @brief  Initialization function for the floating-point CFFT/CIFFT.
-   * @param[in,out] *S             points to an instance of the floating-point CFFT/CIFFT structure.
-   * @param[in]     fftLen         length of the FFT.
-   * @param[in]     ifftFlag       flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform.
-   * @param[in]     bitReverseFlag flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output.
-   * @return        The function returns ARM_MATH_SUCCESS if initialization is successful or ARM_MATH_ARGUMENT_ERROR if <code>fftLen</code> is not a supported value.
-   */
-
+   * @brief Instance structure for the floating-point CFFT/CIFFT function.
+   */
+
+  typedef struct
+  {
+    uint16_t fftLen;                   /**< length of the FFT. */
+    uint8_t ifftFlag;                  /**< flag that selects forward (ifftFlag=0) or inverse (ifftFlag=1) transform. */
+    uint8_t bitReverseFlag;            /**< flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output. */
+    float32_t *pTwiddle;               /**< points to the Twiddle factor table. */
+    uint16_t *pBitRevTable;            /**< points to the bit reversal table. */
+    uint16_t twidCoefModifier;         /**< twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table. */
+    uint16_t bitRevFactor;             /**< bit reversal modifier that supports different size FFTs with the same bit reversal table. */
+    float32_t onebyfftLen;                 /**< value of 1/fftLen. */
+  } arm_cfft_radix4_instance_f32;
+
+/* Deprecated */
   arm_status arm_cfft_radix4_init_f32(
   arm_cfft_radix4_instance_f32 * S,
   uint16_t fftLen,
   uint8_t ifftFlag,
   uint8_t bitReverseFlag);
 
-
-
-  /*----------------------------------------------------------------------
-   *		Internal functions prototypes FFT function
-   ----------------------------------------------------------------------*/
-
-  /**
-   * @brief  Core function for the floating-point CFFT butterfly process.
-   * @param[in, out] *pSrc            points to the in-place buffer of floating-point data type.
-   * @param[in]      fftLen           length of the FFT.
-   * @param[in]      *pCoef           points to the twiddle coefficient buffer.
-   * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table.
-   * @return none.
-   */
-
-  void arm_radix4_butterfly_f32(
-  float32_t * pSrc,
-  uint16_t fftLen,
-  float32_t * pCoef,
-  uint16_t twidCoefModifier);
-
-  /**
-   * @brief  Core function for the floating-point CIFFT butterfly process.
-   * @param[in, out] *pSrc            points to the in-place buffer of floating-point data type.
-   * @param[in]      fftLen           length of the FFT.
-   * @param[in]      *pCoef           points to twiddle coefficient buffer.
-   * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table.
-   * @param[in]      onebyfftLen      value of 1/fftLen.
-   * @return none.
-   */
-
-  void arm_radix4_butterfly_inverse_f32(
-  float32_t * pSrc,
-  uint16_t fftLen,
-  float32_t * pCoef,
-  uint16_t twidCoefModifier,
-  float32_t onebyfftLen);
-
-  /**
-   * @brief  In-place bit reversal function.
-   * @param[in, out] *pSrc        points to the in-place buffer of floating-point data type.
-   * @param[in]      fftSize      length of the FFT.
-   * @param[in]      bitRevFactor bit reversal modifier that supports different size FFTs with the same bit reversal table.
-   * @param[in]      *pBitRevTab  points to the bit reversal table.
-   * @return none.
-   */
-
-  void arm_bitreversal_f32(
-  float32_t * pSrc,
-  uint16_t fftSize,
-  uint16_t bitRevFactor,
-  uint16_t * pBitRevTab);
-
-  /**
-   * @brief  Core function for the Q31 CFFT butterfly process.
-   * @param[in, out] *pSrc            points to the in-place buffer of Q31 data type.
-   * @param[in]      fftLen           length of the FFT.
-   * @param[in]      *pCoef           points to Twiddle coefficient buffer.
-   * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table.
-   * @return none.
-   */
-
-  void arm_radix4_butterfly_q31(
-  q31_t * pSrc,
-  uint32_t fftLen,
-  q31_t * pCoef,
-  uint32_t twidCoefModifier);
-
-  /**
-   * @brief  Core function for the f32 FFT butterfly process.
-   * @param[in, out] *pSrc            points to the in-place buffer of f32 data type.
-   * @param[in]      fftLen           length of the FFT.
-   * @param[in]      *pCoef        	  points to Twiddle coefficient buffer.
-   * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table.
-   * @return none.
-   */
-
-  void arm_radix2_butterfly_f32(
-  float32_t * pSrc,
-  uint32_t fftLen,
-  float32_t * pCoef,
-  uint16_t twidCoefModifier);
-
-        /**  
-	 * @brief  Core function for the Radix-2 Q31 CFFT butterfly process. 
-	 * @param[in, out] *pSrc            points to the in-place buffer of Q31 data type. 
-	 * @param[in]      fftLen           length of the FFT. 
-	 * @param[in]      *pCoef         	points to Twiddle coefficient buffer.
-	 * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table. 
-	 * @return none. 
-	 */
-
-  void arm_radix2_butterfly_q31(
-  q31_t * pSrc,
-  uint32_t fftLen,
-  q31_t * pCoef,
-  uint16_t twidCoefModifier);
-
-        /**  
-	 * @brief  Core function for the Radix-2 Q15 CFFT butterfly process. 
-	 * @param[in, out] *pSrc            points to the in-place buffer of Q15 data type. 
-	 * @param[in]      fftLen           length of the FFT. 
-	 * @param[in]      *pCoef           points to Twiddle coefficient buffer.
-	 * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table. 
-	 * @return none. 
-	 */
-
-  void arm_radix2_butterfly_q15(
-  q15_t * pSrc,
-  uint32_t fftLen,
-  q15_t * pCoef,
-  uint16_t twidCoefModifier);
-
-        /**  
-	 * @brief  Core function for the Radix-2 Q15 CFFT Inverse butterfly process. 
-	 * @param[in, out] *pSrc            points to the in-place buffer of Q15 data type. 
-	 * @param[in]      fftLen           length of the FFT. 
-	 * @param[in]      *pCoef       	points to Twiddle coefficient buffer.
-	 * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table. 
-	 * @return none. 
-	 */
-
-  void arm_radix2_butterfly_inverse_q15(
-  q15_t * pSrc,
-  uint32_t fftLen,
-  q15_t * pCoef,
-  uint16_t twidCoefModifier);
-
-        /**  
-	 * @brief  Core function for the Radix-2 Q31 CFFT Inverse butterfly process. 
-	 * @param[in, out] *pSrc            points to the in-place buffer of Q31 data type. 
-	 * @param[in]      fftLen           length of the FFT. 
-	 * @param[in]      *pCoef         	points to Twiddle coefficient buffer.
-	 * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table. 
-	 * @return none. 
-	 */
-
-  void arm_radix2_butterfly_inverse_q31(
-  q31_t * pSrc,
-  uint32_t fftLen,
-  q31_t * pCoef,
-  uint16_t twidCoefModifier);
-
-  /**
-   * @brief  Core function for the f32 IFFT butterfly process.
-   * @param[in, out] *pSrc            points to the in-place buffer of f32 data type.
-   * @param[in]      fftLen           length of the FFT.
-   * @param[in]      *pCoef        	  points to Twiddle coefficient buffer.
-   * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table.
-   * @param[in]		 onebyfftLen	  1/fftLenfth
-   * @return none.
-   */
-
-  void arm_radix2_butterfly_inverse_f32(
-  float32_t * pSrc,
-  uint32_t fftLen,
-  float32_t * pCoef,
-  uint16_t twidCoefModifier,
-  float32_t onebyfftLen);
-
-                                                                /**
-   * @brief  Core function for the Q31 CIFFT butterfly process.
-   * @param[in, out] *pSrc            points to the in-place buffer of Q31 data type.
-   * @param[in]      fftLen           length of the FFT.
-   * @param[in]      *pCoef           points to twiddle coefficient buffer.
-   * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table.
-   * @return none.
-   */
-
-  void arm_radix4_butterfly_inverse_q31(
-  q31_t * pSrc,
-  uint32_t fftLen,
-  q31_t * pCoef,
-  uint32_t twidCoefModifier);
-
-  /**
-   * @brief  In-place bit reversal function.
-   * @param[in, out] *pSrc        points to the in-place buffer of Q31 data type.
-   * @param[in]      fftLen       length of the FFT.
-   * @param[in]      bitRevFactor bit reversal modifier that supports different size FFTs with the same bit reversal table
-   * @param[in]      *pBitRevTab  points to bit reversal table.
-   * @return none.
-   */
-
-  void arm_bitreversal_q31(
-  q31_t * pSrc,
-  uint32_t fftLen,
-  uint16_t bitRevFactor,
-  uint16_t * pBitRevTab);
-
-  /**
-   * @brief  Core function for the Q15 CFFT butterfly process.
-   * @param[in, out] *pSrc16          points to the in-place buffer of Q15 data type.
-   * @param[in]      fftLen           length of the FFT.
-   * @param[in]      *pCoef16         points to twiddle coefficient buffer.
-   * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table.
-   * @return none.
-   */
-
-  void arm_radix4_butterfly_q15(
-  q15_t * pSrc16,
-  uint32_t fftLen,
-  q15_t * pCoef16,
-  uint32_t twidCoefModifier);
-
-
-  /**
-   * @brief  Core function for the Q15 CIFFT butterfly process.
-   * @param[in, out] *pSrc16          points to the in-place buffer of Q15 data type.
-   * @param[in]      fftLen           length of the FFT.
-   * @param[in]      *pCoef16         points to twiddle coefficient buffer.
-   * @param[in]      twidCoefModifier twiddle coefficient modifier that supports different size FFTs with the same twiddle factor table.
-   * @return none.
-   */
-
-  void arm_radix4_butterfly_inverse_q15(
-  q15_t * pSrc16,
-  uint32_t fftLen,
-  q15_t * pCoef16,
-  uint32_t twidCoefModifier);
-
-  /**
-   * @brief  In-place bit reversal function.
-   * @param[in, out] *pSrc        points to the in-place buffer of Q15 data type.
-   * @param[in]      fftLen       length of the FFT.
-   * @param[in]      bitRevFactor bit reversal modifier that supports different size FFTs with the same bit reversal table
-   * @param[in]      *pBitRevTab  points to bit reversal table.
-   * @return none.
-   */
-
-  void arm_bitreversal_q15(
-  q15_t * pSrc,
-  uint32_t fftLen,
-  uint16_t bitRevFactor,
-  uint16_t * pBitRevTab);
-
+/* Deprecated */
+  void arm_cfft_radix4_f32(
+  const arm_cfft_radix4_instance_f32 * S,
+  float32_t * pSrc);
+
+  /**
+   * @brief Instance structure for the floating-point CFFT/CIFFT function.
+   */
+
+  typedef struct
+  {
+    uint16_t fftLen;                   /**< length of the FFT. */
+    const float32_t *pTwiddle;         /**< points to the Twiddle factor table. */
+    const uint16_t *pBitRevTable;      /**< points to the bit reversal table. */
+    uint16_t bitRevLength;             /**< bit reversal table length. */
+  } arm_cfft_instance_f32;
+
+  void arm_cfft_f32(
+  const arm_cfft_instance_f32 * S,
+  float32_t * p1,
+  uint8_t ifftFlag,
+  uint8_t bitReverseFlag);
 
   /**
    * @brief Instance structure for the Q15 RFFT/RIFFT function.
@@ -2445,6 +2147,18 @@
     arm_cfft_radix4_instance_q15 *pCfft;          /**< points to the complex FFT instance. */
   } arm_rfft_instance_q15;
 
+  arm_status arm_rfft_init_q15(
+  arm_rfft_instance_q15 * S,
+  arm_cfft_radix4_instance_q15 * S_CFFT,
+  uint32_t fftLenReal,
+  uint32_t ifftFlagR,
+  uint32_t bitReverseFlag);
+
+  void arm_rfft_q15(
+  const arm_rfft_instance_q15 * S,
+  q15_t * pSrc,
+  q15_t * pDst);
+
   /**
    * @brief Instance structure for the Q31 RFFT/RIFFT function.
    */
@@ -2461,6 +2175,18 @@
     arm_cfft_radix4_instance_q31 *pCfft;        /**< points to the complex FFT instance. */
   } arm_rfft_instance_q31;
 
+  arm_status arm_rfft_init_q31(
+  arm_rfft_instance_q31 * S,
+  arm_cfft_radix4_instance_q31 * S_CFFT,
+  uint32_t fftLenReal,
+  uint32_t ifftFlagR,
+  uint32_t bitReverseFlag);
+
+  void arm_rfft_q31(
+  const arm_rfft_instance_q31 * S,
+  q31_t * pSrc,
+  q31_t * pDst);
+
   /**
    * @brief Instance structure for the floating-point RFFT/RIFFT function.
    */
@@ -2477,76 +2203,6 @@
     arm_cfft_radix4_instance_f32 *pCfft;        /**< points to the complex FFT instance. */
   } arm_rfft_instance_f32;
 
-  /**
-   * @brief Processing function for the Q15 RFFT/RIFFT.
-   * @param[in]  *S    points to an instance of the Q15 RFFT/RIFFT structure.
-   * @param[in]  *pSrc points to the input buffer.
-   * @param[out] *pDst points to the output buffer.
-   * @return none.
-   */
-
-  void arm_rfft_q15(
-  const arm_rfft_instance_q15 * S,
-  q15_t * pSrc,
-  q15_t * pDst);
-
-  /**
-   * @brief  Initialization function for the Q15 RFFT/RIFFT.
-   * @param[in, out] *S             points to an instance of the Q15 RFFT/RIFFT structure.
-   * @param[in]      *S_CFFT        points to an instance of the Q15 CFFT/CIFFT structure.
-   * @param[in]      fftLenReal     length of the FFT.
-   * @param[in]      ifftFlagR      flag that selects forward (ifftFlagR=0) or inverse (ifftFlagR=1) transform.
-   * @param[in]      bitReverseFlag flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output.
-   * @return		The function returns ARM_MATH_SUCCESS if initialization is successful or ARM_MATH_ARGUMENT_ERROR if <code>fftLenReal</code> is not a supported value.
-   */
-
-  arm_status arm_rfft_init_q15(
-  arm_rfft_instance_q15 * S,
-  arm_cfft_radix4_instance_q15 * S_CFFT,
-  uint32_t fftLenReal,
-  uint32_t ifftFlagR,
-  uint32_t bitReverseFlag);
-
-  /**
-   * @brief Processing function for the Q31 RFFT/RIFFT.
-   * @param[in]  *S    points to an instance of the Q31 RFFT/RIFFT structure.
-   * @param[in]  *pSrc points to the input buffer.
-   * @param[out] *pDst points to the output buffer.
-   * @return none.
-   */
-
-  void arm_rfft_q31(
-  const arm_rfft_instance_q31 * S,
-  q31_t * pSrc,
-  q31_t * pDst);
-
-  /**
-   * @brief  Initialization function for the Q31 RFFT/RIFFT.
-   * @param[in, out] *S             points to an instance of the Q31 RFFT/RIFFT structure.
-   * @param[in, out] *S_CFFT        points to an instance of the Q31 CFFT/CIFFT structure.
-   * @param[in]      fftLenReal     length of the FFT.
-   * @param[in]      ifftFlagR      flag that selects forward (ifftFlagR=0) or inverse (ifftFlagR=1) transform.
-   * @param[in]      bitReverseFlag flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output.
-   * @return		The function returns ARM_MATH_SUCCESS if initialization is successful or ARM_MATH_ARGUMENT_ERROR if <code>fftLenReal</code> is not a supported value.
-   */
-
-  arm_status arm_rfft_init_q31(
-  arm_rfft_instance_q31 * S,
-  arm_cfft_radix4_instance_q31 * S_CFFT,
-  uint32_t fftLenReal,
-  uint32_t ifftFlagR,
-  uint32_t bitReverseFlag);
-
-  /**
-   * @brief  Initialization function for the floating-point RFFT/RIFFT.
-   * @param[in,out] *S             points to an instance of the floating-point RFFT/RIFFT structure.
-   * @param[in,out] *S_CFFT        points to an instance of the floating-point CFFT/CIFFT structure.
-   * @param[in]     fftLenReal     length of the FFT.
-   * @param[in]     ifftFlagR      flag that selects forward (ifftFlagR=0) or inverse (ifftFlagR=1) transform.
-   * @param[in]     bitReverseFlag flag that enables (bitReverseFlag=1) or disables (bitReverseFlag=0) bit reversal of output.
-   * @return		The function returns ARM_MATH_SUCCESS if initialization is successful or ARM_MATH_ARGUMENT_ERROR if <code>fftLenReal</code> is not a supported value.
-   */
-
   arm_status arm_rfft_init_f32(
   arm_rfft_instance_f32 * S,
   arm_cfft_radix4_instance_f32 * S_CFFT,
@@ -2554,20 +2210,32 @@
   uint32_t ifftFlagR,
   uint32_t bitReverseFlag);
 
-  /**
-   * @brief Processing function for the floating-point RFFT/RIFFT.
-   * @param[in]  *S    points to an instance of the floating-point RFFT/RIFFT structure.
-   * @param[in]  *pSrc points to the input buffer.
-   * @param[out] *pDst points to the output buffer.
-   * @return none.
-   */
-
   void arm_rfft_f32(
   const arm_rfft_instance_f32 * S,
   float32_t * pSrc,
   float32_t * pDst);
 
   /**
+   * @brief Instance structure for the floating-point RFFT/RIFFT function.
+   */
+
+typedef struct
+  {
+    arm_cfft_instance_f32 Sint;      /**< Internal CFFT structure. */
+    uint16_t fftLenRFFT;                        /**< length of the real sequence */
+	float32_t * pTwiddleRFFT;					/**< Twiddle factors real stage  */
+  } arm_rfft_fast_instance_f32 ;
+
+arm_status arm_rfft_fast_init_f32 (
+	arm_rfft_fast_instance_f32 * S,
+	uint16_t fftLen);
+
+void arm_rfft_fast_f32(
+  arm_rfft_fast_instance_f32 * S,
+  float32_t * p, float32_t * pOut,
+  uint8_t ifftFlag);
+
+  /**
    * @brief Instance structure for the floating-point DCT4/IDCT4 function.
    */
 
@@ -3163,7 +2831,7 @@
   q31_t * pDst,
   uint32_t blockSize);
   /**
-   * @brief  Copies the elements of a floating-point vector. 
+   * @brief  Copies the elements of a floating-point vector.
    * @param[in]  *pSrc input pointer
    * @param[out]  *pDst output pointer
    * @param[in]  blockSize number of samples to process
@@ -3175,7 +2843,7 @@
   uint32_t blockSize);
 
   /**
-   * @brief  Copies the elements of a Q7 vector. 
+   * @brief  Copies the elements of a Q7 vector.
    * @param[in]  *pSrc input pointer
    * @param[out]  *pDst output pointer
    * @param[in]  blockSize number of samples to process
@@ -3187,7 +2855,7 @@
   uint32_t blockSize);
 
   /**
-   * @brief  Copies the elements of a Q15 vector. 
+   * @brief  Copies the elements of a Q15 vector.
    * @param[in]  *pSrc input pointer
    * @param[out]  *pDst output pointer
    * @param[in]  blockSize number of samples to process
@@ -3199,7 +2867,7 @@
   uint32_t blockSize);
 
   /**
-   * @brief  Copies the elements of a Q31 vector. 
+   * @brief  Copies the elements of a Q31 vector.
    * @param[in]  *pSrc input pointer
    * @param[out]  *pDst output pointer
    * @param[in]  blockSize number of samples to process
@@ -3210,7 +2878,7 @@
   q31_t * pDst,
   uint32_t blockSize);
   /**
-   * @brief  Fills a constant value into a floating-point vector. 
+   * @brief  Fills a constant value into a floating-point vector.
    * @param[in]  value input value to be filled
    * @param[out]  *pDst output pointer
    * @param[in]  blockSize number of samples to process
@@ -3222,7 +2890,7 @@
   uint32_t blockSize);
 
   /**
-   * @brief  Fills a constant value into a Q7 vector. 
+   * @brief  Fills a constant value into a Q7 vector.
    * @param[in]  value input value to be filled
    * @param[out]  *pDst output pointer
    * @param[in]  blockSize number of samples to process
@@ -3234,7 +2902,7 @@
   uint32_t blockSize);
 
   /**
-   * @brief  Fills a constant value into a Q15 vector. 
+   * @brief  Fills a constant value into a Q15 vector.
    * @param[in]  value input value to be filled
    * @param[out]  *pDst output pointer
    * @param[in]  blockSize number of samples to process
@@ -3246,7 +2914,7 @@
   uint32_t blockSize);
 
   /**
-   * @brief  Fills a constant value into a Q31 vector. 
+   * @brief  Fills a constant value into a Q31 vector.
    * @param[in]  value input value to be filled
    * @param[out]  *pDst output pointer
    * @param[in]  blockSize number of samples to process
@@ -3257,14 +2925,14 @@
   q31_t * pDst,
   uint32_t blockSize);
 
-/**  
- * @brief Convolution of floating-point sequences.  
- * @param[in] *pSrcA points to the first input sequence.  
- * @param[in] srcALen length of the first input sequence.  
- * @param[in] *pSrcB points to the second input sequence.  
- * @param[in] srcBLen length of the second input sequence.  
- * @param[out] *pDst points to the location where the output result is written.  Length srcALen+srcBLen-1.  
- * @return none.  
+/**
+ * @brief Convolution of floating-point sequences.
+ * @param[in] *pSrcA points to the first input sequence.
+ * @param[in] srcALen length of the first input sequence.
+ * @param[in] *pSrcB points to the second input sequence.
+ * @param[in] srcBLen length of the second input sequence.
+ * @param[out] *pDst points to the location where the output result is written.  Length srcALen+srcBLen-1.
+ * @return none.
  */
 
   void arm_conv_f32(
@@ -3274,17 +2942,17 @@
   uint32_t srcBLen,
   float32_t * pDst);
 
-  
-  /**   
-   * @brief Convolution of Q15 sequences.   
-   * @param[in] *pSrcA points to the first input sequence.   
-   * @param[in] srcALen length of the first input sequence.   
-   * @param[in] *pSrcB points to the second input sequence.   
-   * @param[in] srcBLen length of the second input sequence.   
-   * @param[out] *pDst points to the block of output data  Length srcALen+srcBLen-1.   
-   * @param[in]  *pScratch1 points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.   
-   * @param[in]  *pScratch2 points to scratch buffer of size min(srcALen, srcBLen).   
-   * @return none.   
+
+  /**
+   * @brief Convolution of Q15 sequences.
+   * @param[in] *pSrcA points to the first input sequence.
+   * @param[in] srcALen length of the first input sequence.
+   * @param[in] *pSrcB points to the second input sequence.
+   * @param[in] srcBLen length of the second input sequence.
+   * @param[out] *pDst points to the block of output data  Length srcALen+srcBLen-1.
+   * @param[in]  *pScratch1 points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.
+   * @param[in]  *pScratch2 points to scratch buffer of size min(srcALen, srcBLen).
+   * @return none.
    */
 
 
@@ -3298,14 +2966,14 @@
   q15_t * pScratch2);
 
 
-/**  
- * @brief Convolution of Q15 sequences.  
- * @param[in] *pSrcA points to the first input sequence.  
- * @param[in] srcALen length of the first input sequence.  
- * @param[in] *pSrcB points to the second input sequence.  
- * @param[in] srcBLen length of the second input sequence.  
- * @param[out] *pDst points to the location where the output result is written.  Length srcALen+srcBLen-1.  
- * @return none.  
+/**
+ * @brief Convolution of Q15 sequences.
+ * @param[in] *pSrcA points to the first input sequence.
+ * @param[in] srcALen length of the first input sequence.
+ * @param[in] *pSrcB points to the second input sequence.
+ * @param[in] srcBLen length of the second input sequence.
+ * @param[out] *pDst points to the location where the output result is written.  Length srcALen+srcBLen-1.
+ * @return none.
  */
 
   void arm_conv_q15(
@@ -3339,9 +3007,9 @@
    * @param[in] *pSrcB points to the second input sequence.
    * @param[in] srcBLen length of the second input sequence.
    * @param[out] *pDst points to the block of output data  Length srcALen+srcBLen-1.
-   * @param[in]  *pScratch1 points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.   
-   * @param[in]  *pScratch2 points to scratch buffer of size min(srcALen, srcBLen).   
-   * @return none.   
+   * @param[in]  *pScratch1 points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.
+   * @param[in]  *pScratch2 points to scratch buffer of size min(srcALen, srcBLen).
+   * @return none.
    */
 
   void arm_conv_fast_opt_q15(
@@ -3390,16 +3058,16 @@
   q31_t * pDst);
 
 
-    /**   
-   * @brief Convolution of Q7 sequences.   
-   * @param[in] *pSrcA points to the first input sequence.   
-   * @param[in] srcALen length of the first input sequence.   
-   * @param[in] *pSrcB points to the second input sequence.   
-   * @param[in] srcBLen length of the second input sequence.   
-   * @param[out] *pDst points to the block of output data  Length srcALen+srcBLen-1.   
-   * @param[in]  *pScratch1 points to scratch buffer(of type q15_t) of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.   
-   * @param[in]  *pScratch2 points to scratch buffer (of type q15_t) of size min(srcALen, srcBLen).   
-   * @return none.   
+    /**
+   * @brief Convolution of Q7 sequences.
+   * @param[in] *pSrcA points to the first input sequence.
+   * @param[in] srcALen length of the first input sequence.
+   * @param[in] *pSrcB points to the second input sequence.
+   * @param[in] srcBLen length of the second input sequence.
+   * @param[out] *pDst points to the block of output data  Length srcALen+srcBLen-1.
+   * @param[in]  *pScratch1 points to scratch buffer(of type q15_t) of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.
+   * @param[in]  *pScratch2 points to scratch buffer (of type q15_t) of size min(srcALen, srcBLen).
+   * @return none.
    */
 
   void arm_conv_opt_q7(
@@ -3452,18 +3120,18 @@
   uint32_t firstIndex,
   uint32_t numPoints);
 
-    /**   
-   * @brief Partial convolution of Q15 sequences.   
-   * @param[in]       *pSrcA points to the first input sequence.   
-   * @param[in]       srcALen length of the first input sequence.   
-   * @param[in]       *pSrcB points to the second input sequence.   
-   * @param[in]       srcBLen length of the second input sequence.   
-   * @param[out]      *pDst points to the block of output data   
-   * @param[in]       firstIndex is the first output sample to start with.   
-   * @param[in]       numPoints is the number of output points to be computed.   
-   * @param[in]       * pScratch1 points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.   
-   * @param[in]       * pScratch2 points to scratch buffer of size min(srcALen, srcBLen).   
-   * @return  Returns either ARM_MATH_SUCCESS if the function completed correctly or ARM_MATH_ARGUMENT_ERROR if the requested subset is not in the range [0 srcALen+srcBLen-2].   
+    /**
+   * @brief Partial convolution of Q15 sequences.
+   * @param[in]       *pSrcA points to the first input sequence.
+   * @param[in]       srcALen length of the first input sequence.
+   * @param[in]       *pSrcB points to the second input sequence.
+   * @param[in]       srcBLen length of the second input sequence.
+   * @param[out]      *pDst points to the block of output data
+   * @param[in]       firstIndex is the first output sample to start with.
+   * @param[in]       numPoints is the number of output points to be computed.
+   * @param[in]       * pScratch1 points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.
+   * @param[in]       * pScratch2 points to scratch buffer of size min(srcALen, srcBLen).
+   * @return  Returns either ARM_MATH_SUCCESS if the function completed correctly or ARM_MATH_ARGUMENT_ERROR if the requested subset is not in the range [0 srcALen+srcBLen-2].
    */
 
   arm_status arm_conv_partial_opt_q15(
@@ -3530,9 +3198,9 @@
    * @param[out]      *pDst points to the block of output data
    * @param[in]       firstIndex is the first output sample to start with.
    * @param[in]       numPoints is the number of output points to be computed.
-   * @param[in]       * pScratch1 points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.   
-   * @param[in]       * pScratch2 points to scratch buffer of size min(srcALen, srcBLen).   
-   * @return  Returns either ARM_MATH_SUCCESS if the function completed correctly or ARM_MATH_ARGUMENT_ERROR if the requested subset is not in the range [0 srcALen+srcBLen-2].   
+   * @param[in]       * pScratch1 points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.
+   * @param[in]       * pScratch2 points to scratch buffer of size min(srcALen, srcBLen).
+   * @return  Returns either ARM_MATH_SUCCESS if the function completed correctly or ARM_MATH_ARGUMENT_ERROR if the requested subset is not in the range [0 srcALen+srcBLen-2].
    */
 
   arm_status arm_conv_partial_fast_opt_q15(
@@ -3591,18 +3259,18 @@
   uint32_t numPoints);
 
 
-  /**   
-   * @brief Partial convolution of Q7 sequences   
-   * @param[in]       *pSrcA points to the first input sequence.   
-   * @param[in]       srcALen length of the first input sequence.   
-   * @param[in]       *pSrcB points to the second input sequence.   
-   * @param[in]       srcBLen length of the second input sequence.   
-   * @param[out]      *pDst points to the block of output data   
-   * @param[in]       firstIndex is the first output sample to start with.   
-   * @param[in]       numPoints is the number of output points to be computed.   
-   * @param[in]  *pScratch1 points to scratch buffer(of type q15_t) of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.   
-   * @param[in]  *pScratch2 points to scratch buffer (of type q15_t) of size min(srcALen, srcBLen).   
-   * @return  Returns either ARM_MATH_SUCCESS if the function completed correctly or ARM_MATH_ARGUMENT_ERROR if the requested subset is not in the range [0 srcALen+srcBLen-2].   
+  /**
+   * @brief Partial convolution of Q7 sequences
+   * @param[in]       *pSrcA points to the first input sequence.
+   * @param[in]       srcALen length of the first input sequence.
+   * @param[in]       *pSrcB points to the second input sequence.
+   * @param[in]       srcBLen length of the second input sequence.
+   * @param[out]      *pDst points to the block of output data
+   * @param[in]       firstIndex is the first output sample to start with.
+   * @param[in]       numPoints is the number of output points to be computed.
+   * @param[in]  *pScratch1 points to scratch buffer(of type q15_t) of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.
+   * @param[in]  *pScratch2 points to scratch buffer (of type q15_t) of size min(srcALen, srcBLen).
+   * @return  Returns either ARM_MATH_SUCCESS if the function completed correctly or ARM_MATH_ARGUMENT_ERROR if the requested subset is not in the range [0 srcALen+srcBLen-2].
    */
 
   arm_status arm_conv_partial_opt_q7(
@@ -4094,8 +3762,8 @@
    * @brief Initialization function for the Q15 FIR lattice filter.
    * @param[in] *S points to an instance of the Q15 FIR lattice structure.
    * @param[in] numStages  number of filter stages.
-   * @param[in] *pCoeffs points to the coefficient buffer.  The array is of length numStages. 
-   * @param[in] *pState points to the state buffer.  The array is of length numStages. 
+   * @param[in] *pCoeffs points to the coefficient buffer.  The array is of length numStages.
+   * @param[in] *pState points to the state buffer.  The array is of length numStages.
    * @return none.
    */
 
@@ -4662,15 +4330,15 @@
   float32_t * pDst);
 
 
-   /**   
-   * @brief Correlation of Q15 sequences   
-   * @param[in] *pSrcA points to the first input sequence.   
-   * @param[in] srcALen length of the first input sequence.   
-   * @param[in] *pSrcB points to the second input sequence.   
-   * @param[in] srcBLen length of the second input sequence.   
-   * @param[out] *pDst points to the block of output data  Length 2 * max(srcALen, srcBLen) - 1.   
-   * @param[in]  *pScratch points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.   
-   * @return none.   
+   /**
+   * @brief Correlation of Q15 sequences
+   * @param[in] *pSrcA points to the first input sequence.
+   * @param[in] srcALen length of the first input sequence.
+   * @param[in] *pSrcB points to the second input sequence.
+   * @param[in] srcBLen length of the second input sequence.
+   * @param[out] *pDst points to the block of output data  Length 2 * max(srcALen, srcBLen) - 1.
+   * @param[in]  *pScratch points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.
+   * @return none.
    */
   void arm_correlate_opt_q15(
   q15_t * pSrcA,
@@ -4724,7 +4392,7 @@
    * @param[in] *pSrcB points to the second input sequence.
    * @param[in] srcBLen length of the second input sequence.
    * @param[out] *pDst points to the block of output data  Length 2 * max(srcALen, srcBLen) - 1.
-   * @param[in]  *pScratch points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.   
+   * @param[in]  *pScratch points to scratch buffer of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.
    * @return none.
    */
 
@@ -4772,16 +4440,16 @@
 
 
 
- /**   
-   * @brief Correlation of Q7 sequences.   
-   * @param[in] *pSrcA points to the first input sequence.   
-   * @param[in] srcALen length of the first input sequence.   
-   * @param[in] *pSrcB points to the second input sequence.   
-   * @param[in] srcBLen length of the second input sequence.   
-   * @param[out] *pDst points to the block of output data  Length 2 * max(srcALen, srcBLen) - 1.   
-   * @param[in]  *pScratch1 points to scratch buffer(of type q15_t) of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.   
-   * @param[in]  *pScratch2 points to scratch buffer (of type q15_t) of size min(srcALen, srcBLen).   
-   * @return none.   
+ /**
+   * @brief Correlation of Q7 sequences.
+   * @param[in] *pSrcA points to the first input sequence.
+   * @param[in] srcALen length of the first input sequence.
+   * @param[in] *pSrcB points to the second input sequence.
+   * @param[in] srcBLen length of the second input sequence.
+   * @param[out] *pDst points to the block of output data  Length 2 * max(srcALen, srcBLen) - 1.
+   * @param[in]  *pScratch1 points to scratch buffer(of type q15_t) of size max(srcALen, srcBLen) + 2*min(srcALen, srcBLen) - 2.
+   * @param[in]  *pScratch2 points to scratch buffer (of type q15_t) of size min(srcALen, srcBLen).
+   * @return none.
    */
 
   void arm_correlate_opt_q7(
@@ -5027,9 +4695,9 @@
 
   /*
    * @brief  Floating-point sin_cos function.
-   * @param[in]  theta    input value in degrees 
-   * @param[out] *pSinVal points to the processed sine output. 
-   * @param[out] *pCosVal points to the processed cos output. 
+   * @param[in]  theta    input value in degrees
+   * @param[out] *pSinVal points to the processed sine output.
+   * @param[out] *pCosVal points to the processed cos output.
    * @return none.
    */
 
@@ -5040,9 +4708,9 @@
 
   /*
    * @brief  Q31 sin_cos function.
-   * @param[in]  theta    scaled input value in degrees 
-   * @param[out] *pSinVal points to the processed sine output. 
-   * @param[out] *pCosVal points to the processed cosine output. 
+   * @param[in]  theta    scaled input value in degrees
+   * @param[out] *pSinVal points to the processed sine output.
+   * @param[out] *pCosVal points to the processed cosine output.
    * @return none.
    */
 
@@ -5140,7 +4808,7 @@
   /**
    * @defgroup PID PID Motor Control
    *
-   * A Proportional Integral Derivative (PID) controller is a generic feedback control 
+   * A Proportional Integral Derivative (PID) controller is a generic feedback control
    * loop mechanism widely used in industrial control systems.
    * A PID controller is the most commonly used type of feedback controller.
    *
@@ -5159,39 +4827,39 @@
    *
    * \par
    * where \c Kp is proportional constant, \c Ki is Integral constant and \c Kd is Derivative constant
-   * 
-   * \par 
-   * \image html PID.gif "Proportional Integral Derivative Controller" 
+   *
+   * \par
+   * \image html PID.gif "Proportional Integral Derivative Controller"
    *
    * \par
    * The PID controller calculates an "error" value as the difference between
    * the measured output and the reference input.
-   * The controller attempts to minimize the error by adjusting the process control inputs.  
-   * The proportional value determines the reaction to the current error, 
-   * the integral value determines the reaction based on the sum of recent errors, 
+   * The controller attempts to minimize the error by adjusting the process control inputs.
+   * The proportional value determines the reaction to the current error,
+   * the integral value determines the reaction based on the sum of recent errors,
    * and the derivative value determines the reaction based on the rate at which the error has been changing.
    *
-   * \par Instance Structure 
-   * The Gains A0, A1, A2 and state variables for a PID controller are stored together in an instance data structure. 
-   * A separate instance structure must be defined for each PID Controller. 
-   * There are separate instance structure declarations for each of the 3 supported data types. 
-   * 
-   * \par Reset Functions 
-   * There is also an associated reset function for each data type which clears the state array. 
+   * \par Instance Structure
+   * The Gains A0, A1, A2 and state variables for a PID controller are stored together in an instance data structure.
+   * A separate instance structure must be defined for each PID Controller.
+   * There are separate instance structure declarations for each of the 3 supported data types.
+   *
+   * \par Reset Functions
+   * There is also an associated reset function for each data type which clears the state array.
    *
-   * \par Initialization Functions 
-   * There is also an associated initialization function for each data type. 
-   * The initialization function performs the following operations: 
+   * \par Initialization Functions
+   * There is also an associated initialization function for each data type.
+   * The initialization function performs the following operations:
    * - Initializes the Gains A0, A1, A2 from Kp,Ki, Kd gains.
-   * - Zeros out the values in the state buffer.   
-   * 
-   * \par 
-   * Instance structure cannot be placed into a const data section and it is recommended to use the initialization function. 
+   * - Zeros out the values in the state buffer.
+   *
+   * \par
+   * Instance structure cannot be placed into a const data section and it is recommended to use the initialization function.
    *
-   * \par Fixed-Point Behavior 
-   * Care must be taken when using the fixed-point versions of the PID Controller functions. 
-   * In particular, the overflow and saturation behavior of the accumulator used in each function must be considered. 
-   * Refer to the function specific documentation below for usage guidelines. 
+   * \par Fixed-Point Behavior
+   * Care must be taken when using the fixed-point versions of the PID Controller functions.
+   * In particular, the overflow and saturation behavior of the accumulator used in each function must be considered.
+   * Refer to the function specific documentation below for usage guidelines.
    */
 
   /**
@@ -5207,7 +4875,7 @@
    */
 
 
-  __STATIC_INLINE float32_t arm_pid_f32(
+  static __INLINE float32_t arm_pid_f32(
   arm_pid_instance_f32 * S,
   float32_t in)
   {
@@ -5233,16 +4901,16 @@
    * @param[in] in input sample to process
    * @return out processed output sample.
    *
-   * <b>Scaling and Overflow Behavior:</b> 
-   * \par 
-   * The function is implemented using an internal 64-bit accumulator. 
-   * The accumulator has a 2.62 format and maintains full precision of the intermediate multiplication results but provides only a single guard bit. 
-   * Thus, if the accumulator result overflows it wraps around rather than clip. 
-   * In order to avoid overflows completely the input signal must be scaled down by 2 bits as there are four additions. 
-   * After all multiply-accumulates are performed, the 2.62 accumulator is truncated to 1.32 format and then saturated to 1.31 format. 
-   */
-
-  __STATIC_INLINE q31_t arm_pid_q31(
+   * <b>Scaling and Overflow Behavior:</b>
+   * \par
+   * The function is implemented using an internal 64-bit accumulator.
+   * The accumulator has a 2.62 format and maintains full precision of the intermediate multiplication results but provides only a single guard bit.
+   * Thus, if the accumulator result overflows it wraps around rather than clip.
+   * In order to avoid overflows completely the input signal must be scaled down by 2 bits as there are four additions.
+   * After all multiply-accumulates are performed, the 2.62 accumulator is truncated to 1.32 format and then saturated to 1.31 format.
+   */
+
+  static __INLINE q31_t arm_pid_q31(
   arm_pid_instance_q31 * S,
   q31_t in)
   {
@@ -5280,48 +4948,43 @@
    * @param[in] in input sample to process
    * @return out processed output sample.
    *
-   * <b>Scaling and Overflow Behavior:</b> 
-   * \par 
-   * The function is implemented using a 64-bit internal accumulator. 
-   * Both Gains and state variables are represented in 1.15 format and multiplications yield a 2.30 result. 
-   * The 2.30 intermediate results are accumulated in a 64-bit accumulator in 34.30 format. 
-   * There is no risk of internal overflow with this approach and the full precision of intermediate multiplications is preserved. 
-   * After all additions have been performed, the accumulator is truncated to 34.15 format by discarding low 15 bits. 
+   * <b>Scaling and Overflow Behavior:</b>
+   * \par
+   * The function is implemented using a 64-bit internal accumulator.
+   * Both Gains and state variables are represented in 1.15 format and multiplications yield a 2.30 result.
+   * The 2.30 intermediate results are accumulated in a 64-bit accumulator in 34.30 format.
+   * There is no risk of internal overflow with this approach and the full precision of intermediate multiplications is preserved.
+   * After all additions have been performed, the accumulator is truncated to 34.15 format by discarding low 15 bits.
    * Lastly, the accumulator is saturated to yield a result in 1.15 format.
    */
 
-  __STATIC_INLINE q15_t arm_pid_q15(
+  static __INLINE q15_t arm_pid_q15(
   arm_pid_instance_q15 * S,
   q15_t in)
   {
     q63_t acc;
     q15_t out;
 
+#ifndef ARM_MATH_CM0_FAMILY
+    __SIMD32_TYPE *vstate;
+
     /* Implementation of PID controller */
 
-#ifdef ARM_MATH_CM0
-
-    /* acc = A0 * x[n]  */
-    acc = ((q31_t) S->A0) * in;
-
-#else
-
     /* acc = A0 * x[n]  */
     acc = (q31_t) __SMUAD(S->A0, in);
 
-#endif
-
-#ifdef ARM_MATH_CM0
+    /* acc += A1 * x[n-1] + A2 * x[n-2]  */
+    vstate = __SIMD32_CONST(S->state);
+    acc = __SMLALD(S->A1, (q31_t) *vstate, acc);
+
+#else
+    /* acc = A0 * x[n]  */
+    acc = ((q31_t) S->A0) * in;
 
     /* acc += A1 * x[n-1] + A2 * x[n-2]  */
     acc += (q31_t) S->A1 * S->state[0];
     acc += (q31_t) S->A2 * S->state[1];
 
-#else
-
-    /* acc += A1 * x[n-1] + A2 * x[n-2]  */
-    acc = __SMLALD(S->A1, (q31_t) __SIMD32(S->state), acc);
-
 #endif
 
     /* acc += y[n-1] */
@@ -5374,7 +5037,7 @@
    * and <code>Ia + Ib + Ic = 0</code>, in this condition <code>Ialpha</code> and <code>Ibeta</code>
    * can be calculated using only <code>Ia</code> and <code>Ib</code>.
    *
-   * The function operates on a single sample of data and each call to the function returns the processed output. 
+   * The function operates on a single sample of data and each call to the function returns the processed output.
    * The library provides separate functions for Q31 and floating-point data types.
    * \par Algorithm
    * \image html clarkeFormula.gif
@@ -5401,7 +5064,7 @@
    * @return none.
    */
 
-  __STATIC_INLINE void arm_clarke_f32(
+  static __INLINE void arm_clarke_f32(
   float32_t Ia,
   float32_t Ib,
   float32_t * pIalpha,
@@ -5431,7 +5094,7 @@
    * There is saturation on the addition, hence there is no risk of overflow.
    */
 
-  __STATIC_INLINE void arm_clarke_q31(
+  static __INLINE void arm_clarke_q31(
   q31_t Ia,
   q31_t Ib,
   q31_t * pIalpha,
@@ -5478,8 +5141,8 @@
   /**
    * @defgroup inv_clarke Vector Inverse Clarke Transform
    * Inverse Clarke transform converts the two-coordinate time invariant vector into instantaneous stator phases.
-   * 
-   * The function operates on a single sample of data and each call to the function returns the processed output. 
+   *
+   * The function operates on a single sample of data and each call to the function returns the processed output.
    * The library provides separate functions for Q31 and floating-point data types.
    * \par Algorithm
    * \image html clarkeInvFormula.gif
@@ -5506,7 +5169,7 @@
    */
 
 
-  __STATIC_INLINE void arm_inv_clarke_f32(
+  static __INLINE void arm_inv_clarke_f32(
   float32_t Ialpha,
   float32_t Ibeta,
   float32_t * pIa,
@@ -5521,7 +5184,7 @@
   }
 
   /**
-   * @brief  Inverse Clarke transform for Q31 version 
+   * @brief  Inverse Clarke transform for Q31 version
    * @param[in]       Ialpha  input two-phase orthogonal vector axis alpha
    * @param[in]       Ibeta   input two-phase orthogonal vector axis beta
    * @param[out]      *pIa    points to output three-phase coordinate <code>a</code>
@@ -5535,7 +5198,7 @@
    * There is saturation on the subtraction, hence there is no risk of overflow.
    */
 
-  __STATIC_INLINE void arm_inv_clarke_q31(
+  static __INLINE void arm_inv_clarke_q31(
   q31_t Ialpha,
   q31_t Ibeta,
   q31_t * pIa,
@@ -5583,19 +5246,19 @@
    * @defgroup park Vector Park Transform
    *
    * Forward Park transform converts the input two-coordinate vector to flux and torque components.
-   * The Park transform can be used to realize the transformation of the <code>Ialpha</code> and the <code>Ibeta</code> currents 
-   * from the stationary to the moving reference frame and control the spatial relationship between 
+   * The Park transform can be used to realize the transformation of the <code>Ialpha</code> and the <code>Ibeta</code> currents
+   * from the stationary to the moving reference frame and control the spatial relationship between
    * the stator vector current and rotor flux vector.
-   * If we consider the d axis aligned with the rotor flux, the diagram below shows the 
+   * If we consider the d axis aligned with the rotor flux, the diagram below shows the
    * current vector and the relationship from the two reference frames:
    * \image html park.gif "Stator current space vector and its component in (a,b) and in the d,q rotating reference frame"
    *
-   * The function operates on a single sample of data and each call to the function returns the processed output. 
+   * The function operates on a single sample of data and each call to the function returns the processed output.
    * The library provides separate functions for Q31 and floating-point data types.
    * \par Algorithm
    * \image html parkFormula.gif
-   * where <code>Ialpha</code> and <code>Ibeta</code> are the stator vector components,  
-   * <code>pId</code> and <code>pIq</code> are rotor vector components and <code>cosVal</code> and <code>sinVal</code> are the 
+   * where <code>Ialpha</code> and <code>Ibeta</code> are the stator vector components,
+   * <code>pId</code> and <code>pIq</code> are rotor vector components and <code>cosVal</code> and <code>sinVal</code> are the
    * cosine and sine values of theta (rotor flux position).
    * \par Fixed-Point Behavior
    * Care must be taken when using the Q31 version of the Park transform.
@@ -5622,7 +5285,7 @@
    *
    */
 
-  __STATIC_INLINE void arm_park_f32(
+  static __INLINE void arm_park_f32(
   float32_t Ialpha,
   float32_t Ibeta,
   float32_t * pId,
@@ -5639,7 +5302,7 @@
   }
 
   /**
-   * @brief  Park transform for Q31 version 
+   * @brief  Park transform for Q31 version
    * @param[in]       Ialpha input two-phase vector coordinate alpha
    * @param[in]       Ibeta  input two-phase vector coordinate beta
    * @param[out]      *pId   points to output rotor reference frame d
@@ -5656,7 +5319,7 @@
    */
 
 
-  __STATIC_INLINE void arm_park_q31(
+  static __INLINE void arm_park_q31(
   q31_t Ialpha,
   q31_t Ibeta,
   q31_t * pId,
@@ -5712,12 +5375,12 @@
    * @defgroup inv_park Vector Inverse Park transform
    * Inverse Park transform converts the input flux and torque components to two-coordinate vector.
    *
-   * The function operates on a single sample of data and each call to the function returns the processed output. 
+   * The function operates on a single sample of data and each call to the function returns the processed output.
    * The library provides separate functions for Q31 and floating-point data types.
    * \par Algorithm
    * \image html parkInvFormula.gif
-   * where <code>pIalpha</code> and <code>pIbeta</code> are the stator vector components,  
-   * <code>Id</code> and <code>Iq</code> are rotor vector components and <code>cosVal</code> and <code>sinVal</code> are the 
+   * where <code>pIalpha</code> and <code>pIbeta</code> are the stator vector components,
+   * <code>Id</code> and <code>Iq</code> are rotor vector components and <code>cosVal</code> and <code>sinVal</code> are the
    * cosine and sine values of theta (rotor flux position).
    * \par Fixed-Point Behavior
    * Care must be taken when using the Q31 version of the Park transform.
@@ -5741,7 +5404,7 @@
    * @return none.
    */
 
-  __STATIC_INLINE void arm_inv_park_f32(
+  static __INLINE void arm_inv_park_f32(
   float32_t Id,
   float32_t Iq,
   float32_t * pIalpha,
@@ -5759,7 +5422,7 @@
 
 
   /**
-   * @brief  Inverse Park transform for	Q31 version 
+   * @brief  Inverse Park transform for	Q31 version
    * @param[in]       Id        input coordinate of rotor reference frame d
    * @param[in]       Iq        input coordinate of rotor reference frame q
    * @param[out]      *pIalpha  points to output two-phase orthogonal vector axis alpha
@@ -5776,7 +5439,7 @@
    */
 
 
-  __STATIC_INLINE void arm_inv_park_q31(
+  static __INLINE void arm_inv_park_q31(
   q31_t Id,
   q31_t Iq,
   q31_t * pIalpha,
@@ -5835,7 +5498,7 @@
    * Linear interpolation is a method of curve fitting using linear polynomials.
    * Linear interpolation works by effectively drawing a straight line between two neighboring samples and returning the appropriate point along that line
    *
-   * \par 
+   * \par
    * \image html LinearInterp.gif "Linear interpolation"
    *
    * \par
@@ -5855,10 +5518,10 @@
    * sample of data and each call to the function returns a single processed value.
    * <code>S</code> points to an instance of the Linear Interpolate function data structure.
    * <code>x</code> is the input sample value. The functions returns the output value.
-   * 
+   *
    * \par
-   * if x is outside of the table boundary, Linear interpolation returns first value of the table 
-   * if x is below input range and returns last value of table if x is above range.  
+   * if x is outside of the table boundary, Linear interpolation returns first value of the table
+   * if x is below input range and returns last value of table if x is above range.
    */
 
   /**
@@ -5874,7 +5537,7 @@
    *
    */
 
-  __STATIC_INLINE float32_t arm_linear_interp_f32(
+  static __INLINE float32_t arm_linear_interp_f32(
   arm_linear_interp_instance_f32 * S,
   float32_t x)
   {
@@ -5887,14 +5550,14 @@
     float32_t *pYData = S->pYData;               /* pointer to output table */
 
     /* Calculation of index */
-    i = (x - S->x1) / xSpacing;
+    i = (int32_t) ((x - S->x1) / xSpacing);
 
     if(i < 0)
     {
       /* Iniatilize output for below specified range as least output value of table */
       y = pYData[0];
     }
-    else if(i >= S->nValues)
+    else if((uint32_t)i >= S->nValues)
     {
       /* Iniatilize output for above specified range as last output value of table */
       y = pYData[S->nValues - 1];
@@ -5933,7 +5596,7 @@
    */
 
 
-  __STATIC_INLINE q31_t arm_linear_interp_q31(
+  static __INLINE q31_t arm_linear_interp_q31(
   q31_t * pYData,
   q31_t x,
   uint32_t nValues)
@@ -5948,7 +5611,7 @@
     /* Index value calculation */
     index = ((x & 0xFFF00000) >> 20);
 
-    if(index >= (nValues - 1))
+    if(index >= (int32_t)(nValues - 1))
     {
       return (pYData[nValues - 1]);
     }
@@ -5990,12 +5653,12 @@
    *
    * \par
    * Input sample <code>x</code> is in 12.20 format which contains 12 bits for table index and 20 bits for fractional part.
-   * This function can support maximum of table size 2^12. 
+   * This function can support maximum of table size 2^12.
    *
    */
 
 
-  __STATIC_INLINE q15_t arm_linear_interp_q15(
+  static __INLINE q15_t arm_linear_interp_q15(
   q15_t * pYData,
   q31_t x,
   uint32_t nValues)
@@ -6010,7 +5673,7 @@
     /* Index value calculation */
     index = ((x & 0xFFF00000) >> 20u);
 
-    if(index >= (nValues - 1))
+    if(index >= (int32_t)(nValues - 1))
     {
       return (pYData[nValues - 1]);
     }
@@ -6055,7 +5718,7 @@
    */
 
 
-  __STATIC_INLINE q7_t arm_linear_interp_q7(
+  static __INLINE q7_t arm_linear_interp_q7(
   q7_t * pYData,
   q31_t x,
   uint32_t nValues)
@@ -6063,22 +5726,22 @@
     q31_t y;                                     /* output */
     q7_t y0, y1;                                 /* Nearest output values */
     q31_t fract;                                 /* fractional part */
-    int32_t index;                               /* Index to read nearest output values */
+    uint32_t index;                              /* Index to read nearest output values */
 
     /* Input is in 12.20 format */
     /* 12 bits for the table index */
     /* Index value calculation */
-    index = ((x & 0xFFF00000) >> 20u);
+    if (x < 0)
+    {
+      return (pYData[0]);
+    }
+    index = (x >> 20) & 0xfff;
 
 
     if(index >= (nValues - 1))
     {
       return (pYData[nValues - 1]);
     }
-    else if(index < 0)
-    {
-      return (pYData[0]);
-    }
     else
     {
 
@@ -6170,14 +5833,14 @@
    * @defgroup SQRT Square Root
    *
    * Computes the square root of a number.
-   * There are separate functions for Q15, Q31, and floating-point data types.  
+   * There are separate functions for Q15, Q31, and floating-point data types.
    * The square root function is computed using the Newton-Raphson algorithm.
    * This is an iterative algorithm of the form:
    * <pre>
    *      x1 = x0 - f(x0)/f'(x0)
    * </pre>
    * where <code>x1</code> is the current estimate,
-   * <code>x0</code> is the previous estimate and
+   * <code>x0</code> is the previous estimate, and
    * <code>f'(x0)</code> is the derivative of <code>f()</code> evaluated at <code>x0</code>.
    * For the square root function, the algorithm reduces to:
    * <pre>
@@ -6200,21 +5863,19 @@
    * <code>in</code> is negative value and returns zero output for negative values.
    */
 
-  __STATIC_INLINE arm_status arm_sqrt_f32(
+  static __INLINE arm_status arm_sqrt_f32(
   float32_t in,
   float32_t * pOut)
   {
     if(in > 0)
     {
 
-//    #if __FPU_USED
-    #if (__FPU_USED == 1) && defined ( __CC_ARM   )
-        *pOut = __sqrtf(in);
-    #elif (__FPU_USED == 1) && defined ( __TMS_740 )
-        *pOut = __builtin_sqrtf(in);
-    #else
-        *pOut = sqrtf(in);
-    #endif
+//      #if __FPU_USED
+#if (__FPU_USED == 1) && defined ( __CC_ARM   )
+      *pOut = __sqrtf(in);
+#else
+      *pOut = sqrtf(in);
+#endif
 
       return (ARM_MATH_SUCCESS);
     }
@@ -6262,7 +5923,7 @@
    * @brief floating-point Circular write function.
    */
 
-  __STATIC_INLINE void arm_circularWrite_f32(
+  static __INLINE void arm_circularWrite_f32(
   int32_t * circBuffer,
   int32_t L,
   uint16_t * writeOffset,
@@ -6307,7 +5968,7 @@
   /**
    * @brief floating-point Circular Read function.
    */
-  __STATIC_INLINE void arm_circularRead_f32(
+  static __INLINE void arm_circularRead_f32(
   int32_t * circBuffer,
   int32_t L,
   int32_t * readOffset,
@@ -6362,7 +6023,7 @@
    * @brief Q15 Circular write function.
    */
 
-  __STATIC_INLINE void arm_circularWrite_q15(
+  static __INLINE void arm_circularWrite_q15(
   q15_t * circBuffer,
   int32_t L,
   uint16_t * writeOffset,
@@ -6407,7 +6068,7 @@
   /**
    * @brief Q15 Circular Read function.
    */
-  __STATIC_INLINE void arm_circularRead_q15(
+  static __INLINE void arm_circularRead_q15(
   q15_t * circBuffer,
   int32_t L,
   int32_t * readOffset,
@@ -6464,7 +6125,7 @@
    * @brief Q7 Circular write function.
    */
 
-  __STATIC_INLINE void arm_circularWrite_q7(
+  static __INLINE void arm_circularWrite_q7(
   q7_t * circBuffer,
   int32_t L,
   uint16_t * writeOffset,
@@ -6509,7 +6170,7 @@
   /**
    * @brief Q7 Circular Read function.
    */
-  __STATIC_INLINE void arm_circularRead_q7(
+  static __INLINE void arm_circularRead_q7(
   q7_t * circBuffer,
   int32_t L,
   int32_t * readOffset,
@@ -7080,11 +6741,11 @@
   uint32_t numSamples);
 
   /**
-   * @brief Converts the elements of the floating-point vector to Q31 vector. 
-   * @param[in]       *pSrc points to the floating-point input vector 
+   * @brief Converts the elements of the floating-point vector to Q31 vector.
+   * @param[in]       *pSrc points to the floating-point input vector
    * @param[out]      *pDst points to the Q31 output vector
-   * @param[in]       blockSize length of the input vector 
-   * @return none. 
+   * @param[in]       blockSize length of the input vector
+   * @return none.
    */
   void arm_float_to_q31(
   float32_t * pSrc,
@@ -7092,10 +6753,10 @@
   uint32_t blockSize);
 
   /**
-   * @brief Converts the elements of the floating-point vector to Q15 vector. 
-   * @param[in]       *pSrc points to the floating-point input vector 
+   * @brief Converts the elements of the floating-point vector to Q15 vector.
+   * @param[in]       *pSrc points to the floating-point input vector
    * @param[out]      *pDst points to the Q15 output vector
-   * @param[in]       blockSize length of the input vector 
+   * @param[in]       blockSize length of the input vector
    * @return          none
    */
   void arm_float_to_q15(
@@ -7104,10 +6765,10 @@
   uint32_t blockSize);
 
   /**
-   * @brief Converts the elements of the floating-point vector to Q7 vector. 
-   * @param[in]       *pSrc points to the floating-point input vector 
+   * @brief Converts the elements of the floating-point vector to Q7 vector.
+   * @param[in]       *pSrc points to the floating-point input vector
    * @param[out]      *pDst points to the Q7 output vector
-   * @param[in]       blockSize length of the input vector 
+   * @param[in]       blockSize length of the input vector
    * @return          none
    */
   void arm_float_to_q7(
@@ -7227,12 +6888,12 @@
    *           + f(XF, YF+1) * (1-(x-XF))*(y-YF)
    *           + f(XF+1, YF+1) * (x-XF)*(y-YF)
    * </pre>
-   * Note that the coordinates (x, y) contain integer and fractional components.  
+   * Note that the coordinates (x, y) contain integer and fractional components.
    * The integer components specify which portion of the table to use while the
    * fractional components control the interpolation processor.
    *
    * \par
-   * if (x,y) are outside of the table boundary, Bilinear interpolation returns zero output. 
+   * if (x,y) are outside of the table boundary, Bilinear interpolation returns zero output.
    */
 
   /**
@@ -7250,7 +6911,7 @@
   */
 
 
-  __STATIC_INLINE float32_t arm_bilinear_interp_f32(
+  static __INLINE float32_t arm_bilinear_interp_f32(
   const arm_bilinear_interp_instance_f32 * S,
   float32_t X,
   float32_t Y)
@@ -7318,7 +6979,7 @@
   * @return out interpolated value.
   */
 
-  __STATIC_INLINE q31_t arm_bilinear_interp_q31(
+  static __INLINE q31_t arm_bilinear_interp_q31(
   arm_bilinear_interp_instance_q31 * S,
   q31_t X,
   q31_t Y)
@@ -7394,7 +7055,7 @@
   * @return out interpolated value.
   */
 
-  __STATIC_INLINE q15_t arm_bilinear_interp_q15(
+  static __INLINE q15_t arm_bilinear_interp_q15(
   arm_bilinear_interp_instance_q15 * S,
   q31_t X,
   q31_t Y)
@@ -7474,7 +7135,7 @@
   * @return out interpolated value.
   */
 
-  __STATIC_INLINE q7_t arm_bilinear_interp_q7(
+  static __INLINE q7_t arm_bilinear_interp_q7(
   arm_bilinear_interp_instance_q7 * S,
   q31_t X,
   q31_t Y)
@@ -7547,6 +7208,84 @@
    */
 
 
+#if   defined ( __CC_ARM ) //Keil
+//SMMLAR
+  #define multAcc_32x32_keep32_R(a, x, y) \
+  a = (q31_t) (((((q63_t) a) << 32) + ((q63_t) x * y) + 0x80000000LL ) >> 32)
+
+//SMMLSR
+  #define multSub_32x32_keep32_R(a, x, y) \
+  a = (q31_t) (((((q63_t) a) << 32) - ((q63_t) x * y) + 0x80000000LL ) >> 32)
+
+//SMMULR
+  #define mult_32x32_keep32_R(a, x, y) \
+  a = (q31_t) (((q63_t) x * y + 0x80000000LL ) >> 32)
+
+//Enter low optimization region - place directly above function definition
+  #define LOW_OPTIMIZATION_ENTER \
+     _Pragma ("push")         \
+     _Pragma ("O1")
+
+//Exit low optimization region - place directly after end of function definition
+  #define LOW_OPTIMIZATION_EXIT \
+     _Pragma ("pop")
+
+//Enter low optimization region - place directly above function definition
+  #define IAR_ONLY_LOW_OPTIMIZATION_ENTER
+
+//Exit low optimization region - place directly after end of function definition
+  #define IAR_ONLY_LOW_OPTIMIZATION_EXIT
+
+#elif defined(__ICCARM__) //IAR
+ //SMMLA
+  #define multAcc_32x32_keep32_R(a, x, y) \
+  a += (q31_t) (((q63_t) x * y) >> 32)
+
+ //SMMLS
+  #define multSub_32x32_keep32_R(a, x, y) \
+  a -= (q31_t) (((q63_t) x * y) >> 32)
+
+//SMMUL
+  #define mult_32x32_keep32_R(a, x, y) \
+  a = (q31_t) (((q63_t) x * y ) >> 32)
+
+//Enter low optimization region - place directly above function definition
+  #define LOW_OPTIMIZATION_ENTER \
+     _Pragma ("optimize=low")
+
+//Exit low optimization region - place directly after end of function definition
+  #define LOW_OPTIMIZATION_EXIT
+
+//Enter low optimization region - place directly above function definition
+  #define IAR_ONLY_LOW_OPTIMIZATION_ENTER \
+     _Pragma ("optimize=low")
+
+//Exit low optimization region - place directly after end of function definition
+  #define IAR_ONLY_LOW_OPTIMIZATION_EXIT
+
+#elif defined(__GNUC__)
+ //SMMLA
+  #define multAcc_32x32_keep32_R(a, x, y) \
+  a += (q31_t) (((q63_t) x * y) >> 32)
+
+ //SMMLS
+  #define multSub_32x32_keep32_R(a, x, y) \
+  a -= (q31_t) (((q63_t) x * y) >> 32)
+
+//SMMUL
+  #define mult_32x32_keep32_R(a, x, y) \
+  a = (q31_t) (((q63_t) x * y ) >> 32)
+
+  #define LOW_OPTIMIZATION_ENTER __attribute__(( optimize("-O1") ))
+
+  #define LOW_OPTIMIZATION_EXIT
+
+  #define IAR_ONLY_LOW_OPTIMIZATION_ENTER
+
+  #define IAR_ONLY_LOW_OPTIMIZATION_EXIT
+
+#endif
+