Fork of mbed-dsp. CMSIS-DSP library of supporting NEON

Dependents:   mbed-os-example-cmsis_dsp_neon

Fork of mbed-dsp by mbed official

Information

Japanese version is available in lower part of this page.
このページの後半に日本語版が用意されています.

CMSIS-DSP of supporting NEON

What is this ?

A library for CMSIS-DSP of supporting NEON.
We supported the NEON to CMSIS-DSP Ver1.4.3(CMSIS V4.1) that ARM supplied, has achieved the processing speed improvement.
If you use the mbed-dsp library, you can use to replace this library.
CMSIS-DSP of supporting NEON is provied as a library.

Library Creation environment

CMSIS-DSP library of supporting NEON was created by the following environment.

  • Compiler
    ARMCC Version 5.03
  • Compile option switch[C Compiler]
   -DARM_MATH_MATRIX_CHECK -DARM_MATH_ROUNDING -O3 -Otime --cpu=Cortex-A9 --littleend --arm 
   --apcs=/interwork --no_unaligned_access --fpu=vfpv3_fp16 --fpmode=fast --apcs=/hardfp 
   --vectorize --asm
  • Compile option switch[Assembler]
   --cpreproc --cpu=Cortex-A9 --littleend --arm --apcs=/interwork --no_unaligned_access 
   --fpu=vfpv3_fp16 --fpmode=fast --apcs=/hardfp


Effects of NEON support

In the data which passes to each function, large size will be expected more effective than small size.
Also if the data is a multiple of 16, effect will be expected in every function in the CMSIS-DSP.


NEON対応CMSIS-DSP

概要

NEON対応したCMSIS-DSPのライブラリです。
ARM社提供のCMSIS-DSP Ver1.4.3(CMSIS V4.1)をターゲットにNEON対応を行ない、処理速度向上を実現しております。
mbed-dspライブラリを使用している場合は、本ライブラリに置き換えて使用することができます。
NEON対応したCMSIS-DSPはライブラリで提供します。

ライブラリ作成環境

NEON対応CMSIS-DSPライブラリは、以下の環境で作成しています。

  • コンパイラ
    ARMCC Version 5.03
  • コンパイルオプションスイッチ[C Compiler]
   -DARM_MATH_MATRIX_CHECK -DARM_MATH_ROUNDING -O3 -Otime --cpu=Cortex-A9 --littleend --arm 
   --apcs=/interwork --no_unaligned_access --fpu=vfpv3_fp16 --fpmode=fast --apcs=/hardfp 
   --vectorize --asm
  • コンパイルオプションスイッチ[Assembler]
   --cpreproc --cpu=Cortex-A9 --littleend --arm --apcs=/interwork --no_unaligned_access 
   --fpu=vfpv3_fp16 --fpmode=fast --apcs=/hardfp


NEON対応による効果について

CMSIS-DSP内の各関数へ渡すデータは、小さいサイズよりも大きいサイズの方が効果が見込めます。
また、16の倍数のデータであれば、CMSIS-DSP内のどの関数でも効果が見込めます。


Revision:
5:a912b042151f
Parent:
3:7a284390b0ce
--- a/cmsis_dsp/arm_math.h	Mon Jun 23 09:30:09 2014 +0100
+++ b/cmsis_dsp/arm_math.h	Tue Jun 23 06:23:42 2015 +0000
@@ -1,13 +1,13 @@
 /* ----------------------------------------------------------------------
-* Copyright (C) 2010-2013 ARM Limited. All rights reserved.
+* Copyright (C) 2010-2014 ARM Limited. All rights reserved.
 *
-* $Date:        17. January 2013
-* $Revision:    V1.4.1
+* $Date:        12. March 2014
+* $Revision: 	V1.4.3
 *
-* Project:      CMSIS DSP Library
-* Title:        arm_math.h
+* Project: 	    CMSIS DSP Library
+* Title:	    arm_math.h
 *
-* Description:  Public header file for CMSIS DSP Library
+* Description:	Public header file for CMSIS DSP Library
 *
 * Target Processor: Cortex-M4/Cortex-M3/Cortex-M0
 *
@@ -41,7 +41,8 @@
 /**
    \mainpage CMSIS DSP Software Library
    *
-   * <b>Introduction</b>
+   * Introduction
+   * ------------
    *
    * This user manual describes the CMSIS DSP software library,
    * a suite of common signal processing functions for use on Cortex-M processor based devices.
@@ -61,7 +62,8 @@
    * The library has separate functions for operating on 8-bit integers, 16-bit integers,
    * 32-bit integer and 32-bit floating-point values.
    *
-   * <b>Using the Library</b>
+   * Using the Library
+   * ------------
    *
    * The library installer contains prebuilt versions of the libraries in the <code>Lib</code> folder.
    * - arm_cortexM4lf_math.lib (Little endian and Floating Point Unit on Cortex-M4)
@@ -79,31 +81,28 @@
    * Define the appropriate pre processor MACRO ARM_MATH_CM4 or  ARM_MATH_CM3 or
    * ARM_MATH_CM0 or ARM_MATH_CM0PLUS depending on the target processor in the application.
    *
-   * <b>Examples</b>
+   * Examples
+   * --------
    *
    * The library ships with a number of examples which demonstrate how to use the library functions.
    *
-   * <b>Toolchain Support</b>
+   * Toolchain Support
+   * ------------
    *
    * The library has been developed and tested with MDK-ARM version 4.60.
    * The library is being tested in GCC and IAR toolchains and updates on this activity will be made available shortly.
    *
-   * <b>Building the Library</b>
+   * Building the Library
+   * ------------
    *
-   * The library installer contains project files to re build libraries on MDK Tool chain in the <code>CMSIS\\DSP_Lib\\Source\\ARM</code> folder.
-   * - arm_cortexM0b_math.uvproj
-   * - arm_cortexM0l_math.uvproj
-   * - arm_cortexM3b_math.uvproj
-   * - arm_cortexM3l_math.uvproj
-   * - arm_cortexM4b_math.uvproj
-   * - arm_cortexM4l_math.uvproj
-   * - arm_cortexM4bf_math.uvproj
-   * - arm_cortexM4lf_math.uvproj
+   * The library installer contains a project file to re build libraries on MDK-ARM Tool chain in the <code>CMSIS\\DSP_Lib\\Source\\ARM</code> folder.
+   * - arm_cortexM_math.uvproj
    *
    *
-   * The project can be built by opening the appropriate project in MDK-ARM 4.60 chain and defining the optional pre processor MACROs detailed above.
+   * The libraries can be built by opening the arm_cortexM_math.uvproj project in MDK-ARM, selecting a specific target, and defining the optional pre processor MACROs detailed above.
    *
-   * <b>Pre-processor Macros</b>
+   * Pre-processor Macros
+   * ------------
    *
    * Each library project have differant pre-processor macros.
    *
@@ -132,9 +131,27 @@
    *
    * Initialize macro __FPU_PRESENT = 1 when building on FPU supported Targets. Enable this macro for M4bf and M4lf libraries
    *
-   * <b>Copyright Notice</b>
+   * <hr>
+   * CMSIS-DSP in ARM::CMSIS Pack
+   * -----------------------------
+   * 
+   * The following files relevant to CMSIS-DSP are present in the <b>ARM::CMSIS</b> Pack directories:
+   * |File/Folder                   |Content                                                                 |
+   * |------------------------------|------------------------------------------------------------------------|
+   * |\b CMSIS\\Documentation\\DSP  | This documentation                                                     |
+   * |\b CMSIS\\DSP_Lib             | Software license agreement (license.txt)                               |
+   * |\b CMSIS\\DSP_Lib\\Examples   | Example projects demonstrating the usage of the library functions      |
+   * |\b CMSIS\\DSP_Lib\\Source     | Source files for rebuilding the library                                |
+   * 
+   * <hr>
+   * Revision History of CMSIS-DSP
+   * ------------
+   * Please refer to \ref ChangeLog_pg.
    *
-   * Copyright (C) 2010-2013 ARM Limited. All rights reserved.
+   * Copyright Notice
+   * ------------
+   *
+   * Copyright (C) 2010-2014 ARM Limited. All rights reserved.
    */
 
 
@@ -266,6 +283,7 @@
 
 #define __CMSIS_GENERIC         /* disable NVIC and Systick functions */
 
+#if 0
 #if defined (ARM_MATH_CM4)
 #include "core_cm4.h"
 #elif defined (ARM_MATH_CM3)
@@ -276,10 +294,51 @@
 #elif defined (ARM_MATH_CM0PLUS)
 #include "core_cm0plus.h"
 #define ARM_MATH_CM0_FAMILY
+#elif defined (ARM_MATH_CA9)
+#include "core_ca9.h"
 #else
 #include "ARMCM4.h"
 #warning "Define either ARM_MATH_CM4 OR ARM_MATH_CM3...By Default building on ARM_MATH_CM4....."
 #endif
+#else
+#include <arm_neon.h>
+#if !defined(__CC_ARM)
+#define ARM_MATH_CM0_FAMILY
+#endif
+#define __FPU_USED	1
+#define __INLINE	__inline
+#if defined(__CC_ARM)
+#define __QADD		__qadd
+#define __QSUB		__qsub
+#define __QSUB16	__qsub16
+#define __QSUB8		__qsub8
+#define __CLZ		__clz
+#define __SSAT		__ssat
+#define __SMUAD		__smuad
+#define __SMLALD	__smlald
+#define __QADD16	__qadd16
+#define __SHADD16	__shadd16
+#define __SMUADX	__smuadx
+#define __SMUSD		__smusd
+#define __SMUSDX	__smusdx
+#define __QASX		__qasx
+#define __QSAX		__qsax
+#define __SHASX		__shasx
+#define __SHSAX		__shsax
+#define __SHSUB16	__shsub16
+#define __SMLAD		__smlad
+#define __SMLADX	__smladx
+#define __SMLSDX	__smlsdx
+#define __ROR		__ror
+#define __SXTB16	__sxtb16
+#define __SMLALDX	__smlaldx
+#define __QADD8		__qadd8
+#endif
+#define __PKHBT(ARG1, ARG2, ARG3)      ( (((int32_t)(ARG1) <<  0) & (int32_t)0x0000FFFF) | \
+                                         (((int32_t)(ARG2) << ARG3) & (int32_t)0xFFFF0000)  )
+#define __PKHTB(ARG1, ARG2, ARG3)      ( (((int32_t)(ARG1) <<  0) & (int32_t)0xFFFF0000) | \
+                                         (((int32_t)(ARG2) >> ARG3) & (int32_t)0x0000FFFF)  )
+#endif
 
 #undef  __CMSIS_GENERIC         /* enable NVIC and Systick functions */
 #include "string.h"
@@ -305,9 +364,13 @@
    * @brief Macros required for SINE and COSINE Fast math approximations
    */
 
-#define TABLE_SIZE			256
-#define TABLE_SPACING_Q31	0x800000
-#define TABLE_SPACING_Q15	0x80
+#define FAST_MATH_TABLE_SIZE  512
+#define FAST_MATH_Q31_SHIFT   (32 - 10)
+#define FAST_MATH_Q15_SHIFT   (16 - 10)
+#define CONTROLLER_Q31_SHIFT  (32 - 9)
+#define TABLE_SIZE  256
+#define TABLE_SPACING_Q31	   0x400000
+#define TABLE_SPACING_Q15	   0x80
 
   /**
    * @brief Macros required for SINE and COSINE Controller functions
@@ -386,6 +449,9 @@
 #elif defined __GNUC__
 #define __SIMD32_TYPE int32_t
 #define CMSIS_UNUSED __attribute__((unused))
+#elif defined __CSMC__			/* Cosmic */
+#define CMSIS_UNUSED
+#define __SIMD32_TYPE int32_t
 #else
 #error Unknown compiler
 #endif
@@ -483,7 +549,10 @@
 
 #if defined (ARM_MATH_CM0_FAMILY) && defined ( __CC_ARM   )
 #define __CLZ __clz
-#elif defined (ARM_MATH_CM0_FAMILY) && ((defined (__ICCARM__)) ||(defined (__GNUC__)) || defined (__TASKING__) )
+#endif
+
+/* #if defined (ARM_MATH_CM0_FAMILY) && ((defined (__ICCARM__)) ||(defined (__GNUC__)) || defined (__TASKING__) )   */
+#if !defined(__CLZ)
 
   static __INLINE uint32_t __CLZ(
   q31_t data);
@@ -728,8 +797,8 @@
     q31_t sum;
     q31_t r, s;
 
-    r = (short) x;
-    s = (short) y;
+    r = (q15_t) x;
+    s = (q15_t) y;
 
     r = __SSAT(r + s, 16);
     s = __SSAT(((q31_t) ((x >> 16) + (y >> 16))), 16) << 16;
@@ -751,8 +820,8 @@
     q31_t sum;
     q31_t r, s;
 
-    r = (short) x;
-    s = (short) y;
+    r = (q15_t) x;
+    s = (q15_t) y;
 
     r = ((r >> 1) + (s >> 1));
     s = ((q31_t) ((x >> 17) + (y >> 17))) << 16;
@@ -774,8 +843,8 @@
     q31_t sum;
     q31_t r, s;
 
-    r = (short) x;
-    s = (short) y;
+    r = (q15_t) x;
+    s = (q15_t) y;
 
     r = __SSAT(r - s, 16);
     s = __SSAT(((q31_t) ((x >> 16) - (y >> 16))), 16) << 16;
@@ -796,8 +865,8 @@
     q31_t diff;
     q31_t r, s;
 
-    r = (short) x;
-    s = (short) y;
+    r = (q15_t) x;
+    s = (q15_t) y;
 
     r = ((r >> 1) - (s >> 1));
     s = (((x >> 17) - (y >> 17)) << 16);
@@ -819,8 +888,8 @@
 
     sum =
       ((sum +
-        clip_q31_to_q15((q31_t) ((short) (x >> 16) + (short) y))) << 16) +
-      clip_q31_to_q15((q31_t) ((short) x - (short) (y >> 16)));
+        clip_q31_to_q15((q31_t) ((q15_t) (x >> 16) + (q15_t) y))) << 16) +
+      clip_q31_to_q15((q31_t) ((q15_t) x - (q15_t) (y >> 16)));
 
     return sum;
   }
@@ -836,8 +905,8 @@
     q31_t sum;
     q31_t r, s;
 
-    r = (short) x;
-    s = (short) y;
+    r = (q15_t) x;
+    s = (q15_t) y;
 
     r = ((r >> 1) - (y >> 17));
     s = (((x >> 17) + (s >> 1)) << 16);
@@ -860,8 +929,8 @@
 
     sum =
       ((sum +
-        clip_q31_to_q15((q31_t) ((short) (x >> 16) - (short) y))) << 16) +
-      clip_q31_to_q15((q31_t) ((short) x + (short) (y >> 16)));
+        clip_q31_to_q15((q31_t) ((q15_t) (x >> 16) - (q15_t) y))) << 16) +
+      clip_q31_to_q15((q31_t) ((q15_t) x + (q15_t) (y >> 16)));
 
     return sum;
   }
@@ -877,8 +946,8 @@
     q31_t sum;
     q31_t r, s;
 
-    r = (short) x;
-    s = (short) y;
+    r = (q15_t) x;
+    s = (q15_t) y;
 
     r = ((r >> 1) + (y >> 17));
     s = (((x >> 17) - (s >> 1)) << 16);
@@ -896,8 +965,8 @@
   q31_t y)
   {
 
-    return ((q31_t) (((short) x * (short) (y >> 16)) -
-                     ((short) (x >> 16) * (short) y)));
+    return ((q31_t) (((q15_t) x * (q15_t) (y >> 16)) -
+                     ((q15_t) (x >> 16) * (q15_t) y)));
   }
 
   /*
@@ -908,8 +977,8 @@
   q31_t y)
   {
 
-    return ((q31_t) (((short) x * (short) (y >> 16)) +
-                     ((short) (x >> 16) * (short) y)));
+    return ((q31_t) (((q15_t) x * (q15_t) (y >> 16)) +
+                     ((q15_t) (x >> 16) * (q15_t) y)));
   }
 
   /*
@@ -941,8 +1010,8 @@
   q31_t sum)
   {
 
-    return (sum + ((short) (x >> 16) * (short) (y >> 16)) +
-            ((short) x * (short) y));
+    return (sum + ((q15_t) (x >> 16) * (q15_t) (y >> 16)) +
+            ((q15_t) x * (q15_t) y));
   }
 
   /*
@@ -954,8 +1023,8 @@
   q31_t sum)
   {
 
-    return (sum + ((short) (x >> 16) * (short) (y)) +
-            ((short) x * (short) (y >> 16)));
+    return (sum + ((q15_t) (x >> 16) * (q15_t) (y)) +
+            ((q15_t) x * (q15_t) (y >> 16)));
   }
 
   /*
@@ -967,8 +1036,8 @@
   q31_t sum)
   {
 
-    return (sum - ((short) (x >> 16) * (short) (y)) +
-            ((short) x * (short) (y >> 16)));
+    return (sum - ((q15_t) (x >> 16) * (q15_t) (y)) +
+            ((q15_t) x * (q15_t) (y >> 16)));
   }
 
   /*
@@ -980,8 +1049,8 @@
   q63_t sum)
   {
 
-    return (sum + ((short) (x >> 16) * (short) (y >> 16)) +
-            ((short) x * (short) y));
+    return (sum + ((q15_t) (x >> 16) * (q15_t) (y >> 16)) +
+            ((q15_t) x * (q15_t) y));
   }
 
   /*
@@ -993,8 +1062,8 @@
   q63_t sum)
   {
 
-    return (sum + ((short) (x >> 16) * (short) y)) +
-      ((short) x * (short) (y >> 16));
+    return (sum + ((q15_t) (x >> 16) * (q15_t) y)) +
+      ((q15_t) x * (q15_t) (y >> 16));
   }
 
   /*
@@ -1476,6 +1545,49 @@
   const arm_matrix_instance_q31 * pSrcB,
   arm_matrix_instance_q31 * pDst);
 
+  /**
+   * @brief Floating-point, complex, matrix multiplication.
+   * @param[in]       *pSrcA points to the first input matrix structure
+   * @param[in]       *pSrcB points to the second input matrix structure
+   * @param[out]      *pDst points to output matrix structure
+   * @return     The function returns either
+   * <code>ARM_MATH_SIZE_MISMATCH</code> or <code>ARM_MATH_SUCCESS</code> based on the outcome of size checking.
+   */
+
+  arm_status arm_mat_cmplx_mult_f32(
+  const arm_matrix_instance_f32 * pSrcA,
+  const arm_matrix_instance_f32 * pSrcB,
+  arm_matrix_instance_f32 * pDst);
+
+  /**
+   * @brief Q15, complex,  matrix multiplication.
+   * @param[in]       *pSrcA points to the first input matrix structure
+   * @param[in]       *pSrcB points to the second input matrix structure
+   * @param[out]      *pDst points to output matrix structure
+   * @return     The function returns either
+   * <code>ARM_MATH_SIZE_MISMATCH</code> or <code>ARM_MATH_SUCCESS</code> based on the outcome of size checking.
+   */
+
+  arm_status arm_mat_cmplx_mult_q15(
+  const arm_matrix_instance_q15 * pSrcA,
+  const arm_matrix_instance_q15 * pSrcB,
+  arm_matrix_instance_q15 * pDst,
+  q15_t * pScratch);
+
+  /**
+   * @brief Q31, complex, matrix multiplication.
+   * @param[in]       *pSrcA points to the first input matrix structure
+   * @param[in]       *pSrcB points to the second input matrix structure
+   * @param[out]      *pDst points to output matrix structure
+   * @return     The function returns either
+   * <code>ARM_MATH_SIZE_MISMATCH</code> or <code>ARM_MATH_SUCCESS</code> based on the outcome of size checking.
+   */
+
+  arm_status arm_mat_cmplx_mult_q31(
+  const arm_matrix_instance_q31 * pSrcA,
+  const arm_matrix_instance_q31 * pSrcB,
+  arm_matrix_instance_q31 * pDst);
+
 
   /**
    * @brief Floating-point matrix transpose.
@@ -1534,7 +1646,7 @@
    * @param[in]       *pSrcA points to the first input matrix structure
    * @param[in]       *pSrcB points to the second input matrix structure
    * @param[out]      *pDst points to output matrix structure
-   * @param[in]		  *pState points to the array for storing intermediate results
+   * @param[in]		 *pState points to the array for storing intermediate results
    * @return     The function returns either
    * <code>ARM_MATH_SIZE_MISMATCH</code> or <code>ARM_MATH_SUCCESS</code> based on the outcome of size checking.
    */
@@ -2046,7 +2158,6 @@
     uint16_t bitRevFactor;           /**< bit reversal modifier that supports different size FFTs with the same bit reversal table. */
   } arm_cfft_radix4_instance_q31;
 
-
   void arm_cfft_radix4_q31(
   const arm_cfft_radix4_instance_q31 * S,
   q31_t * pSrc);
@@ -5874,8 +5985,12 @@
 #if (__FPU_USED == 1) && defined ( __CC_ARM   )
       *pOut = __sqrtf(in);
 #else
+#if defined(__GNUC__)
+      *pOut = __builtin_sqrtf(in);
+#else
       *pOut = sqrtf(in);
 #endif
+#endif
 
       return (ARM_MATH_SUCCESS);
     }
@@ -6348,7 +6463,7 @@
   void arm_var_q31(
   q31_t * pSrc,
   uint32_t blockSize,
-  q63_t * pResult);
+  q31_t * pResult);
 
   /**
    * @brief  Variance of the elements of a Q15 vector.
@@ -6361,7 +6476,7 @@
   void arm_var_q15(
   q15_t * pSrc,
   uint32_t blockSize,
-  q31_t * pResult);
+  q15_t * pResult);
 
   /**
    * @brief  Root Mean Square of the elements of a floating-point vector.
@@ -7284,6 +7399,24 @@
 
   #define IAR_ONLY_LOW_OPTIMIZATION_EXIT
 
+#elif defined(__CSMC__)		// Cosmic
+//SMMLA
+   #define multAcc_32x32_keep32_R(a, x, y) \
+   a += (q31_t) (((q63_t) x * y) >> 32)
+
+ //SMMLS
+   #define multSub_32x32_keep32_R(a, x, y) \
+   a -= (q31_t) (((q63_t) x * y) >> 32)
+
+//SMMUL
+   #define mult_32x32_keep32_R(a, x, y) \
+   a = (q31_t) (((q63_t) x * y ) >> 32)
+
+#define LOW_OPTIMIZATION_ENTER
+#define LOW_OPTIMIZATION_EXIT
+#define IAR_ONLY_LOW_OPTIMIZATION_ENTER
+#define IAR_ONLY_LOW_OPTIMIZATION_EXIT
+
 #endif