CMSIS_DSP_401

V4.0.1 of the ARM CMSIS DSP libraries. Note that arm_bitreversal2.s, arm_cfft_f32.c and arm_rfft_fast_f32.c had to be removed because arm_bitreversal2.s will not assemble with the online tools, so the fast f32 FFT functions are not yet available. All the other FFT functions are available.

Dependents: MPU9150_Example, fir_f32, MPU9150_nucleo_noni2cdev, ...

FilteringFunctions/arm_correlate_fast_q15.c@0:3d9c67d97d6f, 2014-07-28 (annotated)

Committer: emh203
Date: Mon Jul 28 15:03:15 2014 +0000
Revision: 0:3d9c67d97d6f

1st working commit. Had to remove arm_bitreversal2.s, arm_cfft_f32.c and arm_rfft_fast_f32.c because the .s file will not assemble. For now I removed these functions so we could at least have a library for the other functions.
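
To make the output layout of the routine annotated below easier to follow, here is a minimal reference sketch in plain C. It is not the library's code: `ref_correlate_q15` and `sat_q15` are hypothetical names introduced only for illustration, the sketch uses a 64-bit accumulator with saturation instead of the 2.30 accumulator of the fast version, and it zeroes the whole destination buffer itself. It assumes the layout described in the source comments: 2 * max(srcALen, srcBLen) - 1 output samples, with the padding zeroes at the start when srcALen > srcBLen and at the end when srcALen < srcBLen.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int16_t q15_t;   /* same underlying type as q15_t in arm_math.h */

/* Hypothetical helper: clamp a value to the Q15 range. */
static q15_t sat_q15(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (q15_t)v;
}

/* Hypothetical reference correlation (illustration only, not a CMSIS function).
 * dst must hold 2 * max(aLen, bLen) - 1 samples. */
static void ref_correlate_q15(const q15_t *a, uint32_t aLen,
                              const q15_t *b, uint32_t bLen,
                              q15_t *dst)
{
    assert(aLen > 0u && bLen > 0u);

    int32_t maxLen = (int32_t)((aLen > bLen) ? aLen : bLen);
    size_t outLen = 2u * (size_t)maxLen - 1u;

    /* Positions that receive no lag value stay zero (the padding). */
    memset(dst, 0, outLen * sizeof(q15_t));

    /* Lag d runs from -(bLen - 1) to aLen - 1; r[d] = sum over k of a[d + k] * b[k]. */
    for (int32_t d = 1 - (int32_t)bLen; d < (int32_t)aLen; d++) {
        int64_t acc = 0;   /* 64-bit accumulator, so partial sums do not wrap */
        for (int32_t k = 0; k < (int32_t)bLen; k++) {
            int32_t ia = d + k;
            if (ia >= 0 && ia < (int32_t)aLen) {
                acc += (int32_t)a[ia] * (int32_t)b[k];   /* Q15 * Q15 = Q30 */
            }
        }
        /* Q30 -> Q15; an arithmetic right shift is assumed for negative sums. */
        dst[maxLen - 1 + d] = sat_q15(acc >> 15);
    }
}
```

With a = {0x4000, 0x2000, 0x1000} and b = {0x4000, 0x4000}, this sketch writes {0, 8192, 12288, 6144, 2048}. Swapping the two inputs gives the same values in reverse order, {2048, 6144, 12288, 8192, 0}, which matches the "CORR(x, y) is reverse of CORR(y, x)" remark in the source comments.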

Who changed what in which revision?

User	Revision	Line number	New contents of line
emh203	0:3d9c67d97d6f	1	/* ----------------------------------------------------------------------
emh203	0:3d9c67d97d6f	2	* Copyright (C) 2010-2014 ARM Limited. All rights reserved.
emh203	0:3d9c67d97d6f	3	*
emh203	0:3d9c67d97d6f	4	* $Date: 12. March 2014
emh203	0:3d9c67d97d6f	5	* $Revision: V1.4.3
emh203	0:3d9c67d97d6f	6	*
emh203	0:3d9c67d97d6f	7	* Project: CMSIS DSP Library
emh203	0:3d9c67d97d6f	8	* Title: arm_correlate_fast_q15.c
emh203	0:3d9c67d97d6f	9	*
emh203	0:3d9c67d97d6f	10	* Description: Fast Q15 Correlation.
emh203	0:3d9c67d97d6f	11	*
emh203	0:3d9c67d97d6f	12	* Target Processor: Cortex-M4/Cortex-M3
emh203	0:3d9c67d97d6f	13	*
emh203	0:3d9c67d97d6f	14	* Redistribution and use in source and binary forms, with or without
emh203	0:3d9c67d97d6f	15	* modification, are permitted provided that the following conditions
emh203	0:3d9c67d97d6f	16	* are met:
emh203	0:3d9c67d97d6f	17	* - Redistributions of source code must retain the above copyright
emh203	0:3d9c67d97d6f	18	* notice, this list of conditions and the following disclaimer.
emh203	0:3d9c67d97d6f	19	* - Redistributions in binary form must reproduce the above copyright
emh203	0:3d9c67d97d6f	20	* notice, this list of conditions and the following disclaimer in
emh203	0:3d9c67d97d6f	21	* the documentation and/or other materials provided with the
emh203	0:3d9c67d97d6f	22	* distribution.
emh203	0:3d9c67d97d6f	23	* - Neither the name of ARM LIMITED nor the names of its contributors
emh203	0:3d9c67d97d6f	24	* may be used to endorse or promote products derived from this
emh203	0:3d9c67d97d6f	25	* software without specific prior written permission.
emh203	0:3d9c67d97d6f	26	*
emh203	0:3d9c67d97d6f	27	* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
emh203	0:3d9c67d97d6f	28	* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
emh203	0:3d9c67d97d6f	29	* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
emh203	0:3d9c67d97d6f	30	* FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
emh203	0:3d9c67d97d6f	31	* COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
emh203	0:3d9c67d97d6f	32	* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
emh203	0:3d9c67d97d6f	33	* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
emh203	0:3d9c67d97d6f	34	* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
emh203	0:3d9c67d97d6f	35	* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
emh203	0:3d9c67d97d6f	36	* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
emh203	0:3d9c67d97d6f	37	* ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
emh203	0:3d9c67d97d6f	38	* POSSIBILITY OF SUCH DAMAGE.
emh203	0:3d9c67d97d6f	39	* -------------------------------------------------------------------- */
emh203	0:3d9c67d97d6f	40
emh203	0:3d9c67d97d6f	41	#include "arm_math.h"
emh203	0:3d9c67d97d6f	42
emh203	0:3d9c67d97d6f	43	/**
emh203	0:3d9c67d97d6f	44	* @ingroup groupFilters
emh203	0:3d9c67d97d6f	45	*/
emh203	0:3d9c67d97d6f	46
emh203	0:3d9c67d97d6f	47	/**
emh203	0:3d9c67d97d6f	48	* @addtogroup Corr
emh203	0:3d9c67d97d6f	49	* @{
emh203	0:3d9c67d97d6f	50	*/
emh203	0:3d9c67d97d6f	51
emh203	0:3d9c67d97d6f	52	/**
emh203	0:3d9c67d97d6f	53	* @brief Correlation of Q15 sequences (fast version) for Cortex-M3 and Cortex-M4.
emh203	0:3d9c67d97d6f	54	* @param[in] *pSrcA points to the first input sequence.
emh203	0:3d9c67d97d6f	55	* @param[in] srcALen length of the first input sequence.
emh203	0:3d9c67d97d6f	56	* @param[in] *pSrcB points to the second input sequence.
emh203	0:3d9c67d97d6f	57	* @param[in] srcBLen length of the second input sequence.
emh203	0:3d9c67d97d6f	58	* @param[out] *pDst points to the location where the output result is written. Length 2 * max(srcALen, srcBLen) - 1.
emh203	0:3d9c67d97d6f	59	* @return none.
emh203	0:3d9c67d97d6f	60	*
emh203	0:3d9c67d97d6f	61	* <b>Scaling and Overflow Behavior:</b>
emh203	0:3d9c67d97d6f	62	*
emh203	0:3d9c67d97d6f	63	* \par
emh203	0:3d9c67d97d6f	64	* This fast version uses a 32-bit accumulator with 2.30 format.
emh203	0:3d9c67d97d6f	65	* The accumulator maintains full precision of the intermediate multiplication results but provides only a single guard bit.
emh203	0:3d9c67d97d6f	66	* There is no saturation on intermediate additions.
emh203	0:3d9c67d97d6f	67	* Thus, if the accumulator overflows it wraps around and distorts the result.
emh203	0:3d9c67d97d6f	68	* The input signals should be scaled down to avoid intermediate overflows.
emh203	0:3d9c67d97d6f	69	* Scale down one of the inputs by 1/min(srcALen, srcBLen) to avoid overflow since a
emh203	0:3d9c67d97d6f	70	* maximum of min(srcALen, srcBLen) number of additions is carried internally.
emh203	0:3d9c67d97d6f	71	* The 2.30 accumulator is right shifted by 15 bits and then saturated to 1.15 format to yield the final result.
emh203	0:3d9c67d97d6f	72	*
emh203	0:3d9c67d97d6f	73	* \par
emh203	0:3d9c67d97d6f	74	* See <code>arm_correlate_q15()</code> for a slower implementation of this function which uses a 64-bit accumulator to avoid wrap around distortion.
emh203	0:3d9c67d97d6f	75	*/
emh203	0:3d9c67d97d6f	76
emh203	0:3d9c67d97d6f	77	void arm_correlate_fast_q15(
emh203	0:3d9c67d97d6f	78	q15_t * pSrcA,
emh203	0:3d9c67d97d6f	79	uint32_t srcALen,
emh203	0:3d9c67d97d6f	80	q15_t * pSrcB,
emh203	0:3d9c67d97d6f	81	uint32_t srcBLen,
emh203	0:3d9c67d97d6f	82	q15_t * pDst)
emh203	0:3d9c67d97d6f	83	{
emh203	0:3d9c67d97d6f	84	#ifndef UNALIGNED_SUPPORT_DISABLE
emh203	0:3d9c67d97d6f	85
emh203	0:3d9c67d97d6f	86	q15_t *pIn1; /* inputA pointer */
emh203	0:3d9c67d97d6f	87	q15_t *pIn2; /* inputB pointer */
emh203	0:3d9c67d97d6f	88	q15_t *pOut = pDst; /* output pointer */
emh203	0:3d9c67d97d6f	89	q31_t sum, acc0, acc1, acc2, acc3; /* Accumulators */
emh203	0:3d9c67d97d6f	90	q15_t *px; /* Intermediate inputA pointer */
emh203	0:3d9c67d97d6f	91	q15_t *py; /* Intermediate inputB pointer */
emh203	0:3d9c67d97d6f	92	q15_t *pSrc1; /* Intermediate pointers */
emh203	0:3d9c67d97d6f	93	q31_t x0, x1, x2, x3, c0; /* temporary variables for holding input and coefficient values */
emh203	0:3d9c67d97d6f	94	uint32_t j, k = 0u, count, blkCnt, outBlockSize, blockSize1, blockSize2, blockSize3; /* loop counter */
emh203	0:3d9c67d97d6f	95	int32_t inc = 1; /* Destination address modifier */
emh203	0:3d9c67d97d6f	96
emh203	0:3d9c67d97d6f	97
emh203	0:3d9c67d97d6f	98	/* The algorithm implementation is based on the lengths of the inputs. */
emh203	0:3d9c67d97d6f	99	/* srcB is always made to slide across srcA. */
emh203	0:3d9c67d97d6f	100	/* So srcBLen is always considered as shorter or equal to srcALen */
emh203	0:3d9c67d97d6f	101	/* But CORR(x, y) is reverse of CORR(y, x) */
emh203	0:3d9c67d97d6f	102	/* So, when srcBLen > srcALen, output pointer is made to point to the end of the output buffer */
emh203	0:3d9c67d97d6f	103	/* and the destination pointer modifier, inc is set to -1 */
emh203	0:3d9c67d97d6f	104	/* If srcALen > srcBLen, zero pad has to be done to srcB to make the two inputs of same length */
emh203	0:3d9c67d97d6f	105	/* But to improve the performance,
emh203	0:3d9c67d97d6f	106	* we include zeroes in the output instead of zero padding either of the inputs */
emh203	0:3d9c67d97d6f	107	/* If srcALen > srcBLen,
emh203	0:3d9c67d97d6f	108	* (srcALen - srcBLen) zeroes have to be included at the start of the output buffer */
emh203	0:3d9c67d97d6f	109	/* If srcALen < srcBLen,
emh203	0:3d9c67d97d6f	110	* (srcBLen - srcALen) zeroes have to be included at the end of the output buffer */
emh203	0:3d9c67d97d6f	111	if(srcALen >= srcBLen)
emh203	0:3d9c67d97d6f	112	{
emh203	0:3d9c67d97d6f	113	/* Initialization of inputA pointer */
emh203	0:3d9c67d97d6f	114	pIn1 = (pSrcA);
emh203	0:3d9c67d97d6f	115
emh203	0:3d9c67d97d6f	116	/* Initialization of inputB pointer */
emh203	0:3d9c67d97d6f	117	pIn2 = (pSrcB);
emh203	0:3d9c67d97d6f	118
emh203	0:3d9c67d97d6f	119	/* Number of output samples is calculated */
emh203	0:3d9c67d97d6f	120	outBlockSize = (2u * srcALen) - 1u;
emh203	0:3d9c67d97d6f	121
emh203	0:3d9c67d97d6f	122	/* When srcALen > srcBLen, zero padding is done to srcB
emh203	0:3d9c67d97d6f	123	* to make their lengths equal.
emh203	0:3d9c67d97d6f	124	* Instead, (outBlockSize - (srcALen + srcBLen - 1))
emh203	0:3d9c67d97d6f	125	* number of output samples are made zero */
emh203	0:3d9c67d97d6f	126	j = outBlockSize - (srcALen + (srcBLen - 1u));
emh203	0:3d9c67d97d6f	127
emh203	0:3d9c67d97d6f	128	/* Updating the pointer position to non zero value */
emh203	0:3d9c67d97d6f	129	pOut += j;
emh203	0:3d9c67d97d6f	130
emh203	0:3d9c67d97d6f	131	}
emh203	0:3d9c67d97d6f	132	else
emh203	0:3d9c67d97d6f	133	{
emh203	0:3d9c67d97d6f	134	/* Initialization of inputA pointer */
emh203	0:3d9c67d97d6f	135	pIn1 = (pSrcB);
emh203	0:3d9c67d97d6f	136
emh203	0:3d9c67d97d6f	137	/* Initialization of inputB pointer */
emh203	0:3d9c67d97d6f	138	pIn2 = (pSrcA);
emh203	0:3d9c67d97d6f	139
emh203	0:3d9c67d97d6f	140	/* srcBLen is always considered as shorter or equal to srcALen */
emh203	0:3d9c67d97d6f	141	j = srcBLen;
emh203	0:3d9c67d97d6f	142	srcBLen = srcALen;
emh203	0:3d9c67d97d6f	143	srcALen = j;
emh203	0:3d9c67d97d6f	144
emh203	0:3d9c67d97d6f	145	/* CORR(x, y) = Reverse order(CORR(y, x)) */
emh203	0:3d9c67d97d6f	146	/* Hence set the destination pointer to point to the last output sample */
emh203	0:3d9c67d97d6f	147	pOut = pDst + ((srcALen + srcBLen) - 2u);
emh203	0:3d9c67d97d6f	148
emh203	0:3d9c67d97d6f	149	/* Destination address modifier is set to -1 */
emh203	0:3d9c67d97d6f	150	inc = -1;
emh203	0:3d9c67d97d6f	151
emh203	0:3d9c67d97d6f	152	}
emh203	0:3d9c67d97d6f	153
emh203	0:3d9c67d97d6f	154	/* The function is internally
emh203	0:3d9c67d97d6f	155	* divided into three parts according to the number of multiplications that have to
emh203	0:3d9c67d97d6f	156	* take place between inputA samples and inputB samples. In the first part of the
emh203	0:3d9c67d97d6f	157	* algorithm, the multiplications increase by one for every iteration.
emh203	0:3d9c67d97d6f	158	* In the second part of the algorithm, srcBLen number of multiplications are done.
emh203	0:3d9c67d97d6f	159	* In the third part of the algorithm, the multiplications decrease by one
emh203	0:3d9c67d97d6f	160	* for every iteration.*/
emh203	0:3d9c67d97d6f	161	/* The algorithm is implemented in three stages.
emh203	0:3d9c67d97d6f	162	* The loop counters of each stage are initialized here. */
emh203	0:3d9c67d97d6f	163	blockSize1 = srcBLen - 1u;
emh203	0:3d9c67d97d6f	164	blockSize2 = srcALen - (srcBLen - 1u);
emh203	0:3d9c67d97d6f	165	blockSize3 = blockSize1;
emh203	0:3d9c67d97d6f	166
emh203	0:3d9c67d97d6f	167	/* --------------------------
emh203	0:3d9c67d97d6f	168	* Initializations of stage1
emh203	0:3d9c67d97d6f	169	* -------------------------*/
emh203	0:3d9c67d97d6f	170
emh203	0:3d9c67d97d6f	171	/* sum = x[0] * y[srcBLen - 1]
emh203	0:3d9c67d97d6f	172	* sum = x[0] * y[srcBLen - 2] + x[1] * y[srcBLen - 1]
emh203	0:3d9c67d97d6f	173	* ....
emh203	0:3d9c67d97d6f	174	* sum = x[0] * y[0] + x[1] * y[1] +...+ x[srcBLen - 1] * y[srcBLen - 1]
emh203	0:3d9c67d97d6f	175	*/
emh203	0:3d9c67d97d6f	176
emh203	0:3d9c67d97d6f	177	/* In this stage the MAC operations are increased by 1 for every iteration.
emh203	0:3d9c67d97d6f	178	The count variable holds the number of MAC operations performed */
emh203	0:3d9c67d97d6f	179	count = 1u;
emh203	0:3d9c67d97d6f	180
emh203	0:3d9c67d97d6f	181	/* Working pointer of inputA */
emh203	0:3d9c67d97d6f	182	px = pIn1;
emh203	0:3d9c67d97d6f	183
emh203	0:3d9c67d97d6f	184	/* Working pointer of inputB */
emh203	0:3d9c67d97d6f	185	pSrc1 = pIn2 + (srcBLen - 1u);
emh203	0:3d9c67d97d6f	186	py = pSrc1;
emh203	0:3d9c67d97d6f	187
emh203	0:3d9c67d97d6f	188	/* ------------------------
emh203	0:3d9c67d97d6f	189	* Stage1 process
emh203	0:3d9c67d97d6f	190	* ----------------------*/
emh203	0:3d9c67d97d6f	191
emh203	0:3d9c67d97d6f	192	/* The first loop starts here */
emh203	0:3d9c67d97d6f	193	while(blockSize1 > 0u)
emh203	0:3d9c67d97d6f	194	{
emh203	0:3d9c67d97d6f	195	/* Accumulator is made zero for every iteration */
emh203	0:3d9c67d97d6f	196	sum = 0;
emh203	0:3d9c67d97d6f	197
emh203	0:3d9c67d97d6f	198	/* Apply loop unrolling and compute 4 MACs simultaneously. */
emh203	0:3d9c67d97d6f	199	k = count >> 2;
emh203	0:3d9c67d97d6f	200
emh203	0:3d9c67d97d6f	201	/* First part of the processing with loop unrolling. Compute 4 MACs at a time.
emh203	0:3d9c67d97d6f	202	** a second loop below computes MACs for the remaining 1 to 3 samples. */
emh203	0:3d9c67d97d6f	203	while(k > 0u)
emh203	0:3d9c67d97d6f	204	{
emh203	0:3d9c67d97d6f	205	/* x[0] * y[srcBLen - 4] , x[1] * y[srcBLen - 3] */
emh203	0:3d9c67d97d6f	206	sum = __SMLAD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
emh203	0:3d9c67d97d6f	207	/* x[3] * y[srcBLen - 1] , x[2] * y[srcBLen - 2] */
emh203	0:3d9c67d97d6f	208	sum = __SMLAD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
emh203	0:3d9c67d97d6f	209
emh203	0:3d9c67d97d6f	210	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	211	k--;
emh203	0:3d9c67d97d6f	212	}
emh203	0:3d9c67d97d6f	213
emh203	0:3d9c67d97d6f	214	/* If the count is not a multiple of 4, compute any remaining MACs here.
emh203	0:3d9c67d97d6f	215	** No loop unrolling is used. */
emh203	0:3d9c67d97d6f	216	k = count % 0x4u;
emh203	0:3d9c67d97d6f	217
emh203	0:3d9c67d97d6f	218	while(k > 0u)
emh203	0:3d9c67d97d6f	219	{
emh203	0:3d9c67d97d6f	220	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	221	/* x[0] * y[srcBLen - 1] */
emh203	0:3d9c67d97d6f	222	sum = __SMLAD(*px++, *py++, sum);
emh203	0:3d9c67d97d6f	223
emh203	0:3d9c67d97d6f	224	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	225	k--;
emh203	0:3d9c67d97d6f	226	}
emh203	0:3d9c67d97d6f	227
emh203	0:3d9c67d97d6f	228	/* Store the result in the accumulator in the destination buffer. */
emh203	0:3d9c67d97d6f	229	*pOut = (q15_t) (sum >> 15);
emh203	0:3d9c67d97d6f	230	/* Destination pointer is updated according to the address modifier, inc */
emh203	0:3d9c67d97d6f	231	pOut += inc;
emh203	0:3d9c67d97d6f	232
emh203	0:3d9c67d97d6f	233	/* Update the inputA and inputB pointers for next MAC calculation */
emh203	0:3d9c67d97d6f	234	py = pSrc1 - count;
emh203	0:3d9c67d97d6f	235	px = pIn1;
emh203	0:3d9c67d97d6f	236
emh203	0:3d9c67d97d6f	237	/* Increment the MAC count */
emh203	0:3d9c67d97d6f	238	count++;
emh203	0:3d9c67d97d6f	239
emh203	0:3d9c67d97d6f	240	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	241	blockSize1--;
emh203	0:3d9c67d97d6f	242	}
emh203	0:3d9c67d97d6f	243
emh203	0:3d9c67d97d6f	244	/* --------------------------
emh203	0:3d9c67d97d6f	245	* Initializations of stage2
emh203	0:3d9c67d97d6f	246	* ------------------------*/
emh203	0:3d9c67d97d6f	247
emh203	0:3d9c67d97d6f	248	/* sum = x[0] * y[0] + x[1] * y[1] +...+ x[srcBLen-1] * y[srcBLen-1]
emh203	0:3d9c67d97d6f	249	* sum = x[1] * y[0] + x[2] * y[1] +...+ x[srcBLen] * y[srcBLen-1]
emh203	0:3d9c67d97d6f	250	* ....
emh203	0:3d9c67d97d6f	251	* sum = x[srcALen-srcBLen] * y[0] + x[srcALen-srcBLen+1] * y[1] +...+ x[srcALen-1] * y[srcBLen-1]
emh203	0:3d9c67d97d6f	252	*/
emh203	0:3d9c67d97d6f	253
emh203	0:3d9c67d97d6f	254	/* Working pointer of inputA */
emh203	0:3d9c67d97d6f	255	px = pIn1;
emh203	0:3d9c67d97d6f	256
emh203	0:3d9c67d97d6f	257	/* Working pointer of inputB */
emh203	0:3d9c67d97d6f	258	py = pIn2;
emh203	0:3d9c67d97d6f	259
emh203	0:3d9c67d97d6f	260	/* count is index by which the pointer pIn1 to be incremented */
emh203	0:3d9c67d97d6f	261	count = 0u;
emh203	0:3d9c67d97d6f	262
emh203	0:3d9c67d97d6f	263	/* -------------------
emh203	0:3d9c67d97d6f	264	* Stage2 process
emh203	0:3d9c67d97d6f	265	* ------------------*/
emh203	0:3d9c67d97d6f	266
emh203	0:3d9c67d97d6f	267	/* Stage2 depends on srcBLen as in this stage srcBLen number of MACS are performed.
emh203	0:3d9c67d97d6f	268	* So, to loop unroll over blockSize2,
emh203	0:3d9c67d97d6f	269	* srcBLen should be greater than or equal to 4, to loop unroll the srcBLen loop */
emh203	0:3d9c67d97d6f	270	if(srcBLen >= 4u)
emh203	0:3d9c67d97d6f	271	{
emh203	0:3d9c67d97d6f	272	/* Loop unroll over blockSize2, by 4 */
emh203	0:3d9c67d97d6f	273	blkCnt = blockSize2 >> 2u;
emh203	0:3d9c67d97d6f	274
emh203	0:3d9c67d97d6f	275	while(blkCnt > 0u)
emh203	0:3d9c67d97d6f	276	{
emh203	0:3d9c67d97d6f	277	/* Set all accumulators to zero */
emh203	0:3d9c67d97d6f	278	acc0 = 0;
emh203	0:3d9c67d97d6f	279	acc1 = 0;
emh203	0:3d9c67d97d6f	280	acc2 = 0;
emh203	0:3d9c67d97d6f	281	acc3 = 0;
emh203	0:3d9c67d97d6f	282
emh203	0:3d9c67d97d6f	283	/* read x[0], x[1] samples */
emh203	0:3d9c67d97d6f	284	x0 = *__SIMD32(px);
emh203	0:3d9c67d97d6f	285	/* read x[1], x[2] samples */
emh203	0:3d9c67d97d6f	286	x1 = _SIMD32_OFFSET(px + 1);
emh203	0:3d9c67d97d6f	287	px += 2u;
emh203	0:3d9c67d97d6f	288
emh203	0:3d9c67d97d6f	289	/* Apply loop unrolling and compute 4 MACs simultaneously. */
emh203	0:3d9c67d97d6f	290	k = srcBLen >> 2u;
emh203	0:3d9c67d97d6f	291
emh203	0:3d9c67d97d6f	292	/* First part of the processing with loop unrolling. Compute 4 MACs at a time.
emh203	0:3d9c67d97d6f	293	** a second loop below computes MACs for the remaining 1 to 3 samples. */
emh203	0:3d9c67d97d6f	294	do
emh203	0:3d9c67d97d6f	295	{
emh203	0:3d9c67d97d6f	296	/* Read the first two inputB samples using SIMD:
emh203	0:3d9c67d97d6f	297	* y[0] and y[1] */
emh203	0:3d9c67d97d6f	298	c0 = *__SIMD32(py)++;
emh203	0:3d9c67d97d6f	299
emh203	0:3d9c67d97d6f	300	/* acc0 += x[0] * y[0] + x[1] * y[1] */
emh203	0:3d9c67d97d6f	301	acc0 = __SMLAD(x0, c0, acc0);
emh203	0:3d9c67d97d6f	302
emh203	0:3d9c67d97d6f	303	/* acc1 += x[1] * y[0] + x[2] * y[1] */
emh203	0:3d9c67d97d6f	304	acc1 = __SMLAD(x1, c0, acc1);
emh203	0:3d9c67d97d6f	305
emh203	0:3d9c67d97d6f	306	/* Read x[2], x[3] */
emh203	0:3d9c67d97d6f	307	x2 = *__SIMD32(px);
emh203	0:3d9c67d97d6f	308
emh203	0:3d9c67d97d6f	309	/* Read x[3], x[4] */
emh203	0:3d9c67d97d6f	310	x3 = _SIMD32_OFFSET(px + 1);
emh203	0:3d9c67d97d6f	311
emh203	0:3d9c67d97d6f	312	/* acc2 += x[2] * y[0] + x[3] * y[1] */
emh203	0:3d9c67d97d6f	313	acc2 = __SMLAD(x2, c0, acc2);
emh203	0:3d9c67d97d6f	314
emh203	0:3d9c67d97d6f	315	/* acc3 += x[3] * y[0] + x[4] * y[1] */
emh203	0:3d9c67d97d6f	316	acc3 = __SMLAD(x3, c0, acc3);
emh203	0:3d9c67d97d6f	317
emh203	0:3d9c67d97d6f	318	/* Read y[2] and y[3] */
emh203	0:3d9c67d97d6f	319	c0 = *__SIMD32(py)++;
emh203	0:3d9c67d97d6f	320
emh203	0:3d9c67d97d6f	321	/* acc0 += x[2] * y[2] + x[3] * y[3] */
emh203	0:3d9c67d97d6f	322	acc0 = __SMLAD(x2, c0, acc0);
emh203	0:3d9c67d97d6f	323
emh203	0:3d9c67d97d6f	324	/* acc1 += x[3] * y[2] + x[4] * y[3] */
emh203	0:3d9c67d97d6f	325	acc1 = __SMLAD(x3, c0, acc1);
emh203	0:3d9c67d97d6f	326
emh203	0:3d9c67d97d6f	327	/* Read x[4], x[5] */
emh203	0:3d9c67d97d6f	328	x0 = _SIMD32_OFFSET(px + 2);
emh203	0:3d9c67d97d6f	329
emh203	0:3d9c67d97d6f	330	/* Read x[5], x[6] */
emh203	0:3d9c67d97d6f	331	x1 = _SIMD32_OFFSET(px + 3);
emh203	0:3d9c67d97d6f	332	px += 4u;
emh203	0:3d9c67d97d6f	333
emh203	0:3d9c67d97d6f	334	/* acc2 += x[4] * y[2] + x[5] * y[3] */
emh203	0:3d9c67d97d6f	335	acc2 = __SMLAD(x0, c0, acc2);
emh203	0:3d9c67d97d6f	336
emh203	0:3d9c67d97d6f	337	/* acc3 += x[5] * y[2] + x[6] * y[3] */
emh203	0:3d9c67d97d6f	338	acc3 = __SMLAD(x1, c0, acc3);
emh203	0:3d9c67d97d6f	339
emh203	0:3d9c67d97d6f	340	} while(--k);
emh203	0:3d9c67d97d6f	341
emh203	0:3d9c67d97d6f	342	/* For the next MAC operations, SIMD is not used.
emh203	0:3d9c67d97d6f	343	* So, the 16 bit pointer of inputB, py, is updated */
emh203	0:3d9c67d97d6f	344
emh203	0:3d9c67d97d6f	345	/* If the srcBLen is not a multiple of 4, compute any remaining MACs here.
emh203	0:3d9c67d97d6f	346	** No loop unrolling is used. */
emh203	0:3d9c67d97d6f	347	k = srcBLen % 0x4u;
emh203	0:3d9c67d97d6f	348
emh203	0:3d9c67d97d6f	349	if(k == 1u)
emh203	0:3d9c67d97d6f	350	{
emh203	0:3d9c67d97d6f	351	/* Read y[4] */
emh203	0:3d9c67d97d6f	352	c0 = *py;
emh203	0:3d9c67d97d6f	353	#ifdef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	354
emh203	0:3d9c67d97d6f	355	c0 = c0 << 16u;
emh203	0:3d9c67d97d6f	356
emh203	0:3d9c67d97d6f	357	#else
emh203	0:3d9c67d97d6f	358
emh203	0:3d9c67d97d6f	359	c0 = c0 & 0x0000FFFF;
emh203	0:3d9c67d97d6f	360
emh203	0:3d9c67d97d6f	361	#endif /* #ifdef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	362
emh203	0:3d9c67d97d6f	363	/* Read x[7] */
emh203	0:3d9c67d97d6f	364	x3 = *__SIMD32(px);
emh203	0:3d9c67d97d6f	365	px++;
emh203	0:3d9c67d97d6f	366
emh203	0:3d9c67d97d6f	367	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	368	acc0 = __SMLAD(x0, c0, acc0);
emh203	0:3d9c67d97d6f	369	acc1 = __SMLAD(x1, c0, acc1);
emh203	0:3d9c67d97d6f	370	acc2 = __SMLADX(x1, c0, acc2);
emh203	0:3d9c67d97d6f	371	acc3 = __SMLADX(x3, c0, acc3);
emh203	0:3d9c67d97d6f	372	}
emh203	0:3d9c67d97d6f	373
emh203	0:3d9c67d97d6f	374	if(k == 2u)
emh203	0:3d9c67d97d6f	375	{
emh203	0:3d9c67d97d6f	376	/* Read y[4], y[5] */
emh203	0:3d9c67d97d6f	377	c0 = *__SIMD32(py);
emh203	0:3d9c67d97d6f	378
emh203	0:3d9c67d97d6f	379	/* Read x[7], x[8] */
emh203	0:3d9c67d97d6f	380	x3 = *__SIMD32(px);
emh203	0:3d9c67d97d6f	381
emh203	0:3d9c67d97d6f	382	/* Read x[9] */
emh203	0:3d9c67d97d6f	383	x2 = _SIMD32_OFFSET(px + 1);
emh203	0:3d9c67d97d6f	384	px += 2u;
emh203	0:3d9c67d97d6f	385
emh203	0:3d9c67d97d6f	386	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	387	acc0 = __SMLAD(x0, c0, acc0);
emh203	0:3d9c67d97d6f	388	acc1 = __SMLAD(x1, c0, acc1);
emh203	0:3d9c67d97d6f	389	acc2 = __SMLAD(x3, c0, acc2);
emh203	0:3d9c67d97d6f	390	acc3 = __SMLAD(x2, c0, acc3);
emh203	0:3d9c67d97d6f	391	}
emh203	0:3d9c67d97d6f	392
emh203	0:3d9c67d97d6f	393	if(k == 3u)
emh203	0:3d9c67d97d6f	394	{
emh203	0:3d9c67d97d6f	395	/* Read y[4], y[5] */
emh203	0:3d9c67d97d6f	396	c0 = *__SIMD32(py)++;
emh203	0:3d9c67d97d6f	397
emh203	0:3d9c67d97d6f	398	/* Read x[7], x[8] */
emh203	0:3d9c67d97d6f	399	x3 = *__SIMD32(px);
emh203	0:3d9c67d97d6f	400
emh203	0:3d9c67d97d6f	401	/* Read x[9] */
emh203	0:3d9c67d97d6f	402	x2 = _SIMD32_OFFSET(px + 1);
emh203	0:3d9c67d97d6f	403
emh203	0:3d9c67d97d6f	404	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	405	acc0 = __SMLAD(x0, c0, acc0);
emh203	0:3d9c67d97d6f	406	acc1 = __SMLAD(x1, c0, acc1);
emh203	0:3d9c67d97d6f	407	acc2 = __SMLAD(x3, c0, acc2);
emh203	0:3d9c67d97d6f	408	acc3 = __SMLAD(x2, c0, acc3);
emh203	0:3d9c67d97d6f	409
emh203	0:3d9c67d97d6f	410	c0 = (*py);
emh203	0:3d9c67d97d6f	411	/* Read y[6] */
emh203	0:3d9c67d97d6f	412	#ifdef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	413
emh203	0:3d9c67d97d6f	414	c0 = c0 << 16u;
emh203	0:3d9c67d97d6f	415	#else
emh203	0:3d9c67d97d6f	416
emh203	0:3d9c67d97d6f	417	c0 = c0 & 0x0000FFFF;
emh203	0:3d9c67d97d6f	418	#endif /* #ifdef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	419
emh203	0:3d9c67d97d6f	420	/* Read x[10] */
emh203	0:3d9c67d97d6f	421	x3 = _SIMD32_OFFSET(px + 2);
emh203	0:3d9c67d97d6f	422	px += 3u;
emh203	0:3d9c67d97d6f	423
emh203	0:3d9c67d97d6f	424	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	425	acc0 = __SMLADX(x1, c0, acc0);
emh203	0:3d9c67d97d6f	426	acc1 = __SMLAD(x2, c0, acc1);
emh203	0:3d9c67d97d6f	427	acc2 = __SMLADX(x2, c0, acc2);
emh203	0:3d9c67d97d6f	428	acc3 = __SMLADX(x3, c0, acc3);
emh203	0:3d9c67d97d6f	429	}
emh203	0:3d9c67d97d6f	430
emh203	0:3d9c67d97d6f	431	/* Store the result in the accumulator in the destination buffer. */
emh203	0:3d9c67d97d6f	432	*pOut = (q15_t) (acc0 >> 15);
emh203	0:3d9c67d97d6f	433	/* Destination pointer is updated according to the address modifier, inc */
emh203	0:3d9c67d97d6f	434	pOut += inc;
emh203	0:3d9c67d97d6f	435
emh203	0:3d9c67d97d6f	436	*pOut = (q15_t) (acc1 >> 15);
emh203	0:3d9c67d97d6f	437	pOut += inc;
emh203	0:3d9c67d97d6f	438
emh203	0:3d9c67d97d6f	439	*pOut = (q15_t) (acc2 >> 15);
emh203	0:3d9c67d97d6f	440	pOut += inc;
emh203	0:3d9c67d97d6f	441
emh203	0:3d9c67d97d6f	442	*pOut = (q15_t) (acc3 >> 15);
emh203	0:3d9c67d97d6f	443	pOut += inc;
emh203	0:3d9c67d97d6f	444
emh203	0:3d9c67d97d6f	445	/* Increment the pointer pIn1 index, count by 4 */
emh203	0:3d9c67d97d6f	446	count += 4u;
emh203	0:3d9c67d97d6f	447
emh203	0:3d9c67d97d6f	448	/* Update the inputA and inputB pointers for next MAC calculation */
emh203	0:3d9c67d97d6f	449	px = pIn1 + count;
emh203	0:3d9c67d97d6f	450	py = pIn2;
emh203	0:3d9c67d97d6f	451
emh203	0:3d9c67d97d6f	452
emh203	0:3d9c67d97d6f	453	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	454	blkCnt--;
emh203	0:3d9c67d97d6f	455	}
emh203	0:3d9c67d97d6f	456
emh203	0:3d9c67d97d6f	457	/* If the blockSize2 is not a multiple of 4, compute any remaining output samples here.
emh203	0:3d9c67d97d6f	458	** No loop unrolling is used. */
emh203	0:3d9c67d97d6f	459	blkCnt = blockSize2 % 0x4u;
emh203	0:3d9c67d97d6f	460
emh203	0:3d9c67d97d6f	461	while(blkCnt > 0u)
emh203	0:3d9c67d97d6f	462	{
emh203	0:3d9c67d97d6f	463	/* Accumulator is made zero for every iteration */
emh203	0:3d9c67d97d6f	464	sum = 0;
emh203	0:3d9c67d97d6f	465
emh203	0:3d9c67d97d6f	466	/* Apply loop unrolling and compute 4 MACs simultaneously. */
emh203	0:3d9c67d97d6f	467	k = srcBLen >> 2u;
emh203	0:3d9c67d97d6f	468
emh203	0:3d9c67d97d6f	469	/* First part of the processing with loop unrolling. Compute 4 MACs at a time.
emh203	0:3d9c67d97d6f	470	** a second loop below computes MACs for the remaining 1 to 3 samples. */
emh203	0:3d9c67d97d6f	471	while(k > 0u)
emh203	0:3d9c67d97d6f	472	{
emh203	0:3d9c67d97d6f	473	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	474	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	475	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	476	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	477	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	478
emh203	0:3d9c67d97d6f	479	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	480	k--;
emh203	0:3d9c67d97d6f	481	}
emh203	0:3d9c67d97d6f	482
emh203	0:3d9c67d97d6f	483	/* If the srcBLen is not a multiple of 4, compute any remaining MACs here.
emh203	0:3d9c67d97d6f	484	** No loop unrolling is used. */
emh203	0:3d9c67d97d6f	485	k = srcBLen % 0x4u;
emh203	0:3d9c67d97d6f	486
emh203	0:3d9c67d97d6f	487	while(k > 0u)
emh203	0:3d9c67d97d6f	488	{
emh203	0:3d9c67d97d6f	489	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	490	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	491
emh203	0:3d9c67d97d6f	492	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	493	k--;
emh203	0:3d9c67d97d6f	494	}
emh203	0:3d9c67d97d6f	495
emh203	0:3d9c67d97d6f	496	/* Store the result in the accumulator in the destination buffer. */
emh203	0:3d9c67d97d6f	497	*pOut = (q15_t) (sum >> 15);
emh203	0:3d9c67d97d6f	498	/* Destination pointer is updated according to the address modifier, inc */
emh203	0:3d9c67d97d6f	499	pOut += inc;
emh203	0:3d9c67d97d6f	500
emh203	0:3d9c67d97d6f	501	/* Increment the pointer pIn1 index, count by 1 */
emh203	0:3d9c67d97d6f	502	count++;
emh203	0:3d9c67d97d6f	503
emh203	0:3d9c67d97d6f	504	/* Update the inputA and inputB pointers for next MAC calculation */
emh203	0:3d9c67d97d6f	505	px = pIn1 + count;
emh203	0:3d9c67d97d6f	506	py = pIn2;
emh203	0:3d9c67d97d6f	507
emh203	0:3d9c67d97d6f	508	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	509	blkCnt--;
emh203	0:3d9c67d97d6f	510	}
emh203	0:3d9c67d97d6f	511	}
emh203	0:3d9c67d97d6f	512	else
emh203	0:3d9c67d97d6f	513	{
emh203	0:3d9c67d97d6f	514	/* If srcBLen is less than 4,
emh203	0:3d9c67d97d6f	515	* the blockSize2 loop is not unrolled by 4 */
emh203	0:3d9c67d97d6f	516	blkCnt = blockSize2;
emh203	0:3d9c67d97d6f	517
emh203	0:3d9c67d97d6f	518	while(blkCnt > 0u)
emh203	0:3d9c67d97d6f	519	{
emh203	0:3d9c67d97d6f	520	/* Accumulator is made zero for every iteration */
emh203	0:3d9c67d97d6f	521	sum = 0;
emh203	0:3d9c67d97d6f	522
emh203	0:3d9c67d97d6f	523	/* Loop over srcBLen */
emh203	0:3d9c67d97d6f	524	k = srcBLen;
emh203	0:3d9c67d97d6f	525
emh203	0:3d9c67d97d6f	526	while(k > 0u)
emh203	0:3d9c67d97d6f	527	{
emh203	0:3d9c67d97d6f	528	/* Perform the multiply-accumulate */
emh203	0:3d9c67d97d6f	529	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	530
emh203	0:3d9c67d97d6f	531	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	532	k--;
emh203	0:3d9c67d97d6f	533	}
emh203	0:3d9c67d97d6f	534
emh203	0:3d9c67d97d6f	535	/* Store the result in the accumulator in the destination buffer. */
emh203	0:3d9c67d97d6f	536	*pOut = (q15_t) (sum >> 15);
emh203	0:3d9c67d97d6f	537	/* Destination pointer is updated according to the address modifier, inc */
emh203	0:3d9c67d97d6f	538	pOut += inc;
emh203	0:3d9c67d97d6f	539
emh203	0:3d9c67d97d6f	540	/* Increment the pointer pIn1 index, count by 1 */
emh203	0:3d9c67d97d6f	541	count++;
emh203	0:3d9c67d97d6f	542
emh203	0:3d9c67d97d6f	543	/* Update the inputA and inputB pointers for next MAC calculation */
emh203	0:3d9c67d97d6f	544	px = pIn1 + count;
emh203	0:3d9c67d97d6f	545	py = pIn2;
emh203	0:3d9c67d97d6f	546
emh203	0:3d9c67d97d6f	547	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	548	blkCnt--;
emh203	0:3d9c67d97d6f	549	}
emh203	0:3d9c67d97d6f	550	}
emh203	0:3d9c67d97d6f	551
emh203	0:3d9c67d97d6f	552	/* --------------------------
emh203	0:3d9c67d97d6f	553	* Initializations of stage3
emh203	0:3d9c67d97d6f	554	* -------------------------*/
emh203	0:3d9c67d97d6f	555
emh203	0:3d9c67d97d6f	556	/* sum += x[srcALen-srcBLen+1] * y[0] + x[srcALen-srcBLen+2] * y[1] +...+ x[srcALen-1] * y[srcBLen-2]
emh203	0:3d9c67d97d6f	557	* sum += x[srcALen-srcBLen+2] * y[0] + x[srcALen-srcBLen+3] * y[1] +...+ x[srcALen-1] * y[srcBLen-3]
emh203	0:3d9c67d97d6f	558	* ....
emh203	0:3d9c67d97d6f	559	* sum += x[srcALen-2] * y[0] + x[srcALen-1] * y[1]
emh203	0:3d9c67d97d6f	560	* sum += x[srcALen-1] * y[0]
emh203	0:3d9c67d97d6f	561	*/
emh203	0:3d9c67d97d6f	562
emh203	0:3d9c67d97d6f	563	/* In this stage the MAC operations are decreased by 1 for every iteration.
emh203	0:3d9c67d97d6f	564	The count variable holds the number of MAC operations performed */
emh203	0:3d9c67d97d6f	565	count = srcBLen - 1u;
emh203	0:3d9c67d97d6f	566
emh203	0:3d9c67d97d6f	567	/* Working pointer of inputA */
emh203	0:3d9c67d97d6f	568	pSrc1 = (pIn1 + srcALen) - (srcBLen - 1u);
emh203	0:3d9c67d97d6f	569	px = pSrc1;
emh203	0:3d9c67d97d6f	570
emh203	0:3d9c67d97d6f	571	/* Working pointer of inputB */
emh203	0:3d9c67d97d6f	572	py = pIn2;
emh203	0:3d9c67d97d6f	573
emh203	0:3d9c67d97d6f	574	/* -------------------
emh203	0:3d9c67d97d6f	575	* Stage3 process
emh203	0:3d9c67d97d6f	576	* ------------------*/
emh203	0:3d9c67d97d6f	577
emh203	0:3d9c67d97d6f	578	while(blockSize3 > 0u)
emh203	0:3d9c67d97d6f	579	{
emh203	0:3d9c67d97d6f	580	/* Accumulator is made zero for every iteration */
emh203	0:3d9c67d97d6f	581	sum = 0;
emh203	0:3d9c67d97d6f	582
emh203	0:3d9c67d97d6f	583	/* Apply loop unrolling and compute 4 MACs simultaneously. */
emh203	0:3d9c67d97d6f	584	k = count >> 2u;
emh203	0:3d9c67d97d6f	585
emh203	0:3d9c67d97d6f	586	/* First part of the processing with loop unrolling. Compute 4 MACs at a time.
emh203	0:3d9c67d97d6f	587	** a second loop below computes MACs for the remaining 1 to 3 samples. */
emh203	0:3d9c67d97d6f	588	while(k > 0u)
emh203	0:3d9c67d97d6f	589	{
emh203	0:3d9c67d97d6f	590	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	591	/* sum += x[srcALen - srcBLen + 1] * y[0] , sum += x[srcALen - srcBLen + 2] * y[1] */
emh203	0:3d9c67d97d6f	592	sum = __SMLAD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
emh203	0:3d9c67d97d6f	593	/* sum += x[srcALen - srcBLen + 3] * y[2] , sum += x[srcALen - srcBLen + 4] * y[3] */
emh203	0:3d9c67d97d6f	594	sum = __SMLAD(*__SIMD32(px)++, *__SIMD32(py)++, sum);
emh203	0:3d9c67d97d6f	595
emh203	0:3d9c67d97d6f	596	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	597	k--;
emh203	0:3d9c67d97d6f	598	}
emh203	0:3d9c67d97d6f	599
emh203	0:3d9c67d97d6f	600	/* If the count is not a multiple of 4, compute any remaining MACs here.
emh203	0:3d9c67d97d6f	601	** No loop unrolling is used. */
emh203	0:3d9c67d97d6f	602	k = count % 0x4u;
emh203	0:3d9c67d97d6f	603
emh203	0:3d9c67d97d6f	604	while(k > 0u)
emh203	0:3d9c67d97d6f	605	{
emh203	0:3d9c67d97d6f	606	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	607	sum = __SMLAD(*px++, *py++, sum);
emh203	0:3d9c67d97d6f	608
emh203	0:3d9c67d97d6f	609	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	610	k--;
emh203	0:3d9c67d97d6f	611	}
emh203	0:3d9c67d97d6f	612
emh203	0:3d9c67d97d6f	613	/* Store the result in the accumulator in the destination buffer. */
emh203	0:3d9c67d97d6f	614	*pOut = (q15_t) (sum >> 15);
emh203	0:3d9c67d97d6f	615	/* Destination pointer is updated according to the address modifier, inc */
emh203	0:3d9c67d97d6f	616	pOut += inc;
emh203	0:3d9c67d97d6f	617
emh203	0:3d9c67d97d6f	618	/* Update the inputA and inputB pointers for next MAC calculation */
emh203	0:3d9c67d97d6f	619	px = ++pSrc1;
emh203	0:3d9c67d97d6f	620	py = pIn2;
emh203	0:3d9c67d97d6f	621
emh203	0:3d9c67d97d6f	622	/* Decrement the MAC count */
emh203	0:3d9c67d97d6f	623	count--;
emh203	0:3d9c67d97d6f	624
emh203	0:3d9c67d97d6f	625	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	626	blockSize3--;
emh203	0:3d9c67d97d6f	627	}
emh203	0:3d9c67d97d6f	628
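The unrolled loop above relies on the `__SMLAD` intrinsic, and other parts of this file use `__PKHBT` and `__SMLADX`. As a reading aid, here is a minimal portable sketch of what those operations compute. The names `lane16`, `pkhbt16_model`, `smlad_model` and `smladx_model` are hypothetical and are not part of CMSIS; the sketch assumes two's-complement narrowing and models accumulator overflow as a plain 32-bit wrap, ignoring the Q flag.

```c
#include <assert.h>
#include <stdint.h>

/* Extract one signed 16-bit lane from a packed 32-bit word.
 * Assumes two's-complement narrowing, as mainstream compilers provide. */
static int16_t lane16(uint32_t w, unsigned shift)
{
    assert(shift == 0u || shift == 16u);
    return (int16_t)(uint16_t)(w >> shift);
}

/* Hypothetical model of __PKHBT(lo, hi, 16):
 * bottom halfword comes from lo, top halfword from hi. */
static uint32_t pkhbt16_model(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Hypothetical model of __SMLAD: lane-by-lane dual multiply,
 * both products added to the accumulator (wrapping on overflow). */
static int32_t smlad_model(uint32_t x, uint32_t y, int32_t acc)
{
    int32_t lo = (int32_t)lane16(x, 0u) * lane16(y, 0u);
    int32_t hi = (int32_t)lane16(x, 16u) * lane16(y, 16u);
    return (int32_t)((uint32_t)acc + (uint32_t)lo + (uint32_t)hi);
}

/* Hypothetical model of __SMLADX: same as above, but the halfwords
 * of the second operand are exchanged before multiplying. */
static int32_t smladx_model(uint32_t x, uint32_t y, int32_t acc)
{
    int32_t a = (int32_t)lane16(x, 0u) * lane16(y, 16u);
    int32_t b = (int32_t)lane16(x, 16u) * lane16(y, 0u);
    return (int32_t)((uint32_t)acc + (uint32_t)a + (uint32_t)b);
}
```

With `x = pkhbt16_model(3, -4)` and `y = pkhbt16_model(5, 6)`, `smlad_model(x, y, 10)` adds 3*5 and (-4)*6 to 10, while `smladx_model(x, y, 10)` adds 3*6 and (-4)*5.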
emh203	0:3d9c67d97d6f	629	#else
emh203	0:3d9c67d97d6f	630
emh203	0:3d9c67d97d6f	631	q15_t *pIn1; /* inputA pointer */
emh203	0:3d9c67d97d6f	632	q15_t *pIn2; /* inputB pointer */
emh203	0:3d9c67d97d6f	633	q15_t *pOut = pDst; /* output pointer */
emh203	0:3d9c67d97d6f	634	q31_t sum, acc0, acc1, acc2, acc3; /* Accumulators */
emh203	0:3d9c67d97d6f	635	q15_t *px; /* Intermediate inputA pointer */
emh203	0:3d9c67d97d6f	636	q15_t *py; /* Intermediate inputB pointer */
emh203	0:3d9c67d97d6f	637	q15_t *pSrc1; /* Intermediate pointers */
emh203	0:3d9c67d97d6f	638	q31_t x0, x1, x2, x3, c0; /* temporary variables for holding input and coefficient values */
emh203	0:3d9c67d97d6f	639	uint32_t j, k = 0u, count, blkCnt, outBlockSize, blockSize1, blockSize2, blockSize3; /* loop counter */
emh203	0:3d9c67d97d6f	640	int32_t inc = 1; /* Destination address modifier */
emh203	0:3d9c67d97d6f	641	q15_t a, b;
emh203	0:3d9c67d97d6f	642
emh203	0:3d9c67d97d6f	643
emh203	0:3d9c67d97d6f	644	/* The algorithm implementation is based on the lengths of the inputs. */
emh203	0:3d9c67d97d6f	645	/* srcB is always made to slide across srcA. */
emh203	0:3d9c67d97d6f	646	/* So srcBLen is always treated as shorter than or equal to srcALen. */
emh203	0:3d9c67d97d6f	647	/* But CORR(x, y) is the reverse of CORR(y, x). */
emh203	0:3d9c67d97d6f	648	/* So, when srcBLen > srcALen, the output pointer is made to point to the end of the output buffer */
emh203	0:3d9c67d97d6f	649	/* and the destination pointer modifier, inc, is set to -1. */
emh203	0:3d9c67d97d6f	650	/* If srcALen > srcBLen, srcB would have to be zero padded to make the two inputs the same length. */
emh203	0:3d9c67d97d6f	651	/* But to improve performance,
emh203	0:3d9c67d97d6f	652	* zeroes are included in the output instead of zero padding either of the inputs. */
emh203	0:3d9c67d97d6f	653	/* If srcALen > srcBLen,
emh203	0:3d9c67d97d6f	654	* (srcALen - srcBLen) zeroes have to be included at the start of the output buffer. */
emh203	0:3d9c67d97d6f	655	/* If srcALen < srcBLen,
emh203	0:3d9c67d97d6f	656	* (srcBLen - srcALen) zeroes have to be included at the end of the output buffer. */
emh203	0:3d9c67d97d6f	657	if(srcALen >= srcBLen)
emh203	0:3d9c67d97d6f	658	{
emh203	0:3d9c67d97d6f	659	/* Initialization of inputA pointer */
emh203	0:3d9c67d97d6f	660	pIn1 = (pSrcA);
emh203	0:3d9c67d97d6f	661
emh203	0:3d9c67d97d6f	662	/* Initialization of inputB pointer */
emh203	0:3d9c67d97d6f	663	pIn2 = (pSrcB);
emh203	0:3d9c67d97d6f	664
emh203	0:3d9c67d97d6f	665	/* Number of output samples is calculated */
emh203	0:3d9c67d97d6f	666	outBlockSize = (2u * srcALen) - 1u;
emh203	0:3d9c67d97d6f	667
emh203	0:3d9c67d97d6f	668	/* When srcALen > srcBLen, srcB would have to be zero padded
emh203	0:3d9c67d97d6f	669	* to make the lengths equal.
emh203	0:3d9c67d97d6f	670	* Instead, the first (outBlockSize - (srcALen + srcBLen - 1))
emh203	0:3d9c67d97d6f	671	* output samples are reserved for zeroes. */
emh203	0:3d9c67d97d6f	672	j = outBlockSize - (srcALen + (srcBLen - 1u));
emh203	0:3d9c67d97d6f	673
emh203	0:3d9c67d97d6f	674	/* Advance the output pointer past the samples reserved for zeroes */
emh203	0:3d9c67d97d6f	675	pOut += j;
emh203	0:3d9c67d97d6f	676
emh203	0:3d9c67d97d6f	677	}
emh203	0:3d9c67d97d6f	678	else
emh203	0:3d9c67d97d6f	679	{
emh203	0:3d9c67d97d6f	680	/* Initialization of inputA pointer */
emh203	0:3d9c67d97d6f	681	pIn1 = (pSrcB);
emh203	0:3d9c67d97d6f	682
emh203	0:3d9c67d97d6f	683	/* Initialization of inputB pointer */
emh203	0:3d9c67d97d6f	684	pIn2 = (pSrcA);
emh203	0:3d9c67d97d6f	685
emh203	0:3d9c67d97d6f	686	/* srcBLen is always considered as shorter or equal to srcALen */
emh203	0:3d9c67d97d6f	687	j = srcBLen;
emh203	0:3d9c67d97d6f	688	srcBLen = srcALen;
emh203	0:3d9c67d97d6f	689	srcALen = j;
emh203	0:3d9c67d97d6f	690
emh203	0:3d9c67d97d6f	691	/* CORR(x, y) = Reverse order(CORR(y, x)) */
emh203	0:3d9c67d97d6f	692	/* Hence set the destination pointer to point to the last output sample */
emh203	0:3d9c67d97d6f	693	pOut = pDst + ((srcALen + srcBLen) - 2u);
emh203	0:3d9c67d97d6f	694
emh203	0:3d9c67d97d6f	695	/* Destination address modifier is set to -1 */
emh203	0:3d9c67d97d6f	696	inc = -1;
emh203	0:3d9c67d97d6f	697
emh203	0:3d9c67d97d6f	698	}
emh203	0:3d9c67d97d6f	699
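The branch above swaps the inputs and writes the output backwards (`inc = -1`) because CORR(x, y) is the reverse of CORR(y, x). The following is a minimal sketch of that identity, assuming plain 32-bit integer accumulation with no Q15 scaling and no zero padding. The helper name `corr_direct_ref` is hypothetical and is not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct (unoptimized) correlation of x against y.
 * Writes xLen + yLen - 1 samples to out:
 *   out[n] = sum over k of x[k + n - (yLen - 1)] * y[k]
 * where taps whose x index falls outside [0, xLen) are skipped. */
static void corr_direct_ref(const int16_t *x, size_t xLen,
                            const int16_t *y, size_t yLen,
                            int32_t *out)
{
    assert(xLen >= 1u && yLen >= 1u);
    size_t outLen = xLen + yLen - 1u;

    for (size_t n = 0; n < outLen; n++) {
        int32_t sum = 0;
        for (size_t k = 0; k < yLen; k++) {
            if (k + n < yLen - 1u) {
                continue;            /* x index would be negative */
            }
            size_t xi = k + n - (yLen - 1u);
            if (xi >= xLen) {
                continue;            /* x index past the end of x */
            }
            sum += (int32_t)x[xi] * y[k];
        }
        out[n] = sum;
    }
}
```

Calling it once as `corr_direct_ref(x, xLen, y, yLen, xy)` and once with the inputs exchanged as `corr_direct_ref(y, yLen, x, xLen, yx)` gives two buffers where `xy[n]` equals `yx[outLen - 1 - n]`. That is the property the swapped branch exploits by filling the destination from its last sample towards its first.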
emh203	0:3d9c67d97d6f	700	/* The function is internally
emh203	0:3d9c67d97d6f	701	* divided into three parts according to the number of multiplications that take
emh203	0:3d9c67d97d6f	702	* place between inputA samples and inputB samples. In the first part of the
emh203	0:3d9c67d97d6f	703	* algorithm, the multiplications increase by one for every iteration.
emh203	0:3d9c67d97d6f	704	* In the second part of the algorithm, srcBLen multiplications are done per output sample.
emh203	0:3d9c67d97d6f	705	* In the third part of the algorithm, the multiplications decrease by one
emh203	0:3d9c67d97d6f	706	* for every iteration. */
emh203	0:3d9c67d97d6f	707	/* The algorithm is implemented in three stages.
emh203	0:3d9c67d97d6f	708	* The loop counters of each stage are initialized here. */
emh203	0:3d9c67d97d6f	709	blockSize1 = srcBLen - 1u;
emh203	0:3d9c67d97d6f	710	blockSize2 = srcALen - (srcBLen - 1u);
emh203	0:3d9c67d97d6f	711	blockSize3 = blockSize1;
emh203	0:3d9c67d97d6f	712
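The three block sizes set above partition the srcALen + srcBLen - 1 output samples into a ramp-up, a steady part, and a ramp-down. Below is a minimal sketch of that partition without unrolling, SIMD, or Q15 scaling. It assumes aLen >= bLen, as the code guarantees after the optional swap, and uses a hypothetical function name `corr_three_stage_ref` with 32-bit integer outputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Three-stage correlation sketch; writes aLen + bLen - 1 samples to out. */
static void corr_three_stage_ref(const int16_t *x, size_t aLen,
                                 const int16_t *y, size_t bLen,
                                 int32_t *out)
{
    assert(bLen >= 1u && aLen >= bLen);
    size_t n = 0;

    /* Stage 1: bLen - 1 outputs, MAC count grows from 1 to bLen - 1.
     * x always starts at x[0]; y starts further left each time. */
    for (size_t count = 1; count < bLen; count++) {
        const int16_t *py = y + (bLen - count);
        int32_t sum = 0;
        for (size_t k = 0; k < count; k++) {
            sum += (int32_t)x[k] * py[k];
        }
        out[n++] = sum;
    }

    /* Stage 2: aLen - bLen + 1 outputs, bLen MACs each.
     * y always starts at y[0]; x advances by one per output. */
    for (size_t off = 0; off + bLen <= aLen; off++) {
        int32_t sum = 0;
        for (size_t k = 0; k < bLen; k++) {
            sum += (int32_t)x[off + k] * y[k];
        }
        out[n++] = sum;
    }

    /* Stage 3: bLen - 1 outputs, MAC count shrinks from bLen - 1 to 1.
     * x starts at x[aLen - count]; y always starts at y[0]. */
    for (size_t count = bLen - 1u; count > 0u; count--) {
        const int16_t *px = x + (aLen - count);
        int32_t sum = 0;
        for (size_t k = 0; k < count; k++) {
            sum += (int32_t)px[k] * y[k];
        }
        out[n++] = sum;
    }
}
```

For x = {1, 2, 3, 4, 5} and y = {1, 2, 3}, stage 1 produces two samples, stage 2 three, and stage 3 two, matching blockSize1 = 2, blockSize2 = 3, and blockSize3 = 2.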
emh203	0:3d9c67d97d6f	713	/* --------------------------
emh203	0:3d9c67d97d6f	714	* Initializations of stage1
emh203	0:3d9c67d97d6f	715	* -------------------------*/
emh203	0:3d9c67d97d6f	716
emh203	0:3d9c67d97d6f	717	/* sum = x[0] * y[srcBLen - 1]
emh203	0:3d9c67d97d6f	718	* sum = x[0] * y[srcBLen - 2] + x[1] * y[srcBLen - 1]
emh203	0:3d9c67d97d6f	719	* ....
emh203	0:3d9c67d97d6f	720	* sum = x[0] * y[1] + x[1] * y[2] +...+ x[srcBLen - 2] * y[srcBLen - 1]
emh203	0:3d9c67d97d6f	721	*/
emh203	0:3d9c67d97d6f	722
emh203	0:3d9c67d97d6f	723	/* In this stage the MAC operations are increased by 1 for every iteration.
emh203	0:3d9c67d97d6f	724	The count variable holds the number of MAC operations performed */
emh203	0:3d9c67d97d6f	725	count = 1u;
emh203	0:3d9c67d97d6f	726
emh203	0:3d9c67d97d6f	727	/* Working pointer of inputA */
emh203	0:3d9c67d97d6f	728	px = pIn1;
emh203	0:3d9c67d97d6f	729
emh203	0:3d9c67d97d6f	730	/* Working pointer of inputB */
emh203	0:3d9c67d97d6f	731	pSrc1 = pIn2 + (srcBLen - 1u);
emh203	0:3d9c67d97d6f	732	py = pSrc1;
emh203	0:3d9c67d97d6f	733
emh203	0:3d9c67d97d6f	734	/* ------------------------
emh203	0:3d9c67d97d6f	735	* Stage1 process
emh203	0:3d9c67d97d6f	736	* ----------------------*/
emh203	0:3d9c67d97d6f	737
emh203	0:3d9c67d97d6f	738	/* The first loop starts here */
emh203	0:3d9c67d97d6f	739	while(blockSize1 > 0u)
emh203	0:3d9c67d97d6f	740	{
emh203	0:3d9c67d97d6f	741	/* Accumulator is made zero for every iteration */
emh203	0:3d9c67d97d6f	742	sum = 0;
emh203	0:3d9c67d97d6f	743
emh203	0:3d9c67d97d6f	744	/* Apply loop unrolling and compute 4 MACs simultaneously. */
emh203	0:3d9c67d97d6f	745	k = count >> 2;
emh203	0:3d9c67d97d6f	746
emh203	0:3d9c67d97d6f	747	/* First part of the processing with loop unrolling. Compute 4 MACs at a time.
emh203	0:3d9c67d97d6f	748	** a second loop below computes MACs for the remaining 1 to 3 samples. */
emh203	0:3d9c67d97d6f	749	while(k > 0u)
emh203	0:3d9c67d97d6f	750	{
emh203	0:3d9c67d97d6f	751	/* x[0] * y[srcBLen - 4] , x[1] * y[srcBLen - 3] */
emh203	0:3d9c67d97d6f	752	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	753	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	754	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	755	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	756
emh203	0:3d9c67d97d6f	757	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	758	k--;
emh203	0:3d9c67d97d6f	759	}
emh203	0:3d9c67d97d6f	760
emh203	0:3d9c67d97d6f	761	/* If the count is not a multiple of 4, compute any remaining MACs here.
emh203	0:3d9c67d97d6f	762	** No loop unrolling is used. */
emh203	0:3d9c67d97d6f	763	k = count % 0x4u;
emh203	0:3d9c67d97d6f	764
emh203	0:3d9c67d97d6f	765	while(k > 0u)
emh203	0:3d9c67d97d6f	766	{
emh203	0:3d9c67d97d6f	767	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	768	/* x[0] * y[srcBLen - 1] */
emh203	0:3d9c67d97d6f	769	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	770
emh203	0:3d9c67d97d6f	771	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	772	k--;
emh203	0:3d9c67d97d6f	773	}
emh203	0:3d9c67d97d6f	774
emh203	0:3d9c67d97d6f	775	/* Store the result in the accumulator in the destination buffer. */
emh203	0:3d9c67d97d6f	776	*pOut = (q15_t) (sum >> 15);
emh203	0:3d9c67d97d6f	777	/* Destination pointer is updated according to the address modifier, inc */
emh203	0:3d9c67d97d6f	778	pOut += inc;
emh203	0:3d9c67d97d6f	779
emh203	0:3d9c67d97d6f	780	/* Update the inputA and inputB pointers for next MAC calculation */
emh203	0:3d9c67d97d6f	781	py = pSrc1 - count;
emh203	0:3d9c67d97d6f	782	px = pIn1;
emh203	0:3d9c67d97d6f	783
emh203	0:3d9c67d97d6f	784	/* Increment the MAC count */
emh203	0:3d9c67d97d6f	785	count++;
emh203	0:3d9c67d97d6f	786
emh203	0:3d9c67d97d6f	787	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	788	blockSize1--;
emh203	0:3d9c67d97d6f	789	}
emh203	0:3d9c67d97d6f	790
emh203	0:3d9c67d97d6f	791	/* --------------------------
emh203	0:3d9c67d97d6f	792	* Initializations of stage2
emh203	0:3d9c67d97d6f	793	* ------------------------*/
emh203	0:3d9c67d97d6f	794
emh203	0:3d9c67d97d6f	795	/* sum = x[0] * y[0] + x[1] * y[1] +...+ x[srcBLen-1] * y[srcBLen-1]
emh203	0:3d9c67d97d6f	796	* sum = x[1] * y[0] + x[2] * y[1] +...+ x[srcBLen] * y[srcBLen-1]
emh203	0:3d9c67d97d6f	797	* ....
emh203	0:3d9c67d97d6f	798	* sum = x[srcALen-srcBLen] * y[0] + x[srcALen-srcBLen+1] * y[1] +...+ x[srcALen-1] * y[srcBLen-1]
emh203	0:3d9c67d97d6f	799	*/
emh203	0:3d9c67d97d6f	800
emh203	0:3d9c67d97d6f	801	/* Working pointer of inputA */
emh203	0:3d9c67d97d6f	802	px = pIn1;
emh203	0:3d9c67d97d6f	803
emh203	0:3d9c67d97d6f	804	/* Working pointer of inputB */
emh203	0:3d9c67d97d6f	805	py = pIn2;
emh203	0:3d9c67d97d6f	806
emh203	0:3d9c67d97d6f	807	/* count is index by which the pointer pIn1 to be incremented */
emh203	0:3d9c67d97d6f	808	count = 0u;
emh203	0:3d9c67d97d6f	809
emh203	0:3d9c67d97d6f	810	/* -------------------
emh203	0:3d9c67d97d6f	811	* Stage2 process
emh203	0:3d9c67d97d6f	812	* ------------------*/
emh203	0:3d9c67d97d6f	813
emh203	0:3d9c67d97d6f	814	/* Stage2 depends on srcBLen, as srcBLen MACs are performed per output sample in this stage.
emh203	0:3d9c67d97d6f	815	* So, to unroll the loop over blockSize2,
emh203	0:3d9c67d97d6f	816	* srcBLen should be greater than or equal to 4, so that the srcBLen loop can be unrolled as well */
emh203	0:3d9c67d97d6f	817	if(srcBLen >= 4u)
emh203	0:3d9c67d97d6f	818	{
emh203	0:3d9c67d97d6f	819	/* Loop unroll over blockSize2, by 4 */
emh203	0:3d9c67d97d6f	820	blkCnt = blockSize2 >> 2u;
emh203	0:3d9c67d97d6f	821
emh203	0:3d9c67d97d6f	822	while(blkCnt > 0u)
emh203	0:3d9c67d97d6f	823	{
emh203	0:3d9c67d97d6f	824	/* Set all accumulators to zero */
emh203	0:3d9c67d97d6f	825	acc0 = 0;
emh203	0:3d9c67d97d6f	826	acc1 = 0;
emh203	0:3d9c67d97d6f	827	acc2 = 0;
emh203	0:3d9c67d97d6f	828	acc3 = 0;
emh203	0:3d9c67d97d6f	829
emh203	0:3d9c67d97d6f	830	/* read x[0], x[1], x[2] samples */
emh203	0:3d9c67d97d6f	831	a = *px;
emh203	0:3d9c67d97d6f	832	b = *(px + 1);
emh203	0:3d9c67d97d6f	833
emh203	0:3d9c67d97d6f	834	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	835
emh203	0:3d9c67d97d6f	836	x0 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	837	a = *(px + 2);
emh203	0:3d9c67d97d6f	838	x1 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	839
emh203	0:3d9c67d97d6f	840	#else
emh203	0:3d9c67d97d6f	841
emh203	0:3d9c67d97d6f	842	x0 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	843	a = *(px + 2);
emh203	0:3d9c67d97d6f	844	x1 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	845
emh203	0:3d9c67d97d6f	846	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	847
emh203	0:3d9c67d97d6f	848	px += 2u;
emh203	0:3d9c67d97d6f	849
emh203	0:3d9c67d97d6f	850	/* Apply loop unrolling and compute 4 MACs simultaneously. */
emh203	0:3d9c67d97d6f	851	k = srcBLen >> 2u;
emh203	0:3d9c67d97d6f	852
emh203	0:3d9c67d97d6f	853	/* First part of the processing with loop unrolling. Compute 4 MACs at a time.
emh203	0:3d9c67d97d6f	854	** a second loop below computes MACs for the remaining 1 to 3 samples. */
emh203	0:3d9c67d97d6f	855	do
emh203	0:3d9c67d97d6f	856	{
emh203	0:3d9c67d97d6f	857	/* Read the first two inputB samples and pack them into one word:
emh203	0:3d9c67d97d6f	858	* y[0] and y[1] */
emh203	0:3d9c67d97d6f	859	a = *py;
emh203	0:3d9c67d97d6f	860	b = *(py + 1);
emh203	0:3d9c67d97d6f	861
emh203	0:3d9c67d97d6f	862	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	863
emh203	0:3d9c67d97d6f	864	c0 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	865
emh203	0:3d9c67d97d6f	866	#else
emh203	0:3d9c67d97d6f	867
emh203	0:3d9c67d97d6f	868	c0 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	869
emh203	0:3d9c67d97d6f	870	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	871
emh203	0:3d9c67d97d6f	872	/* acc0 += x[0] * y[0] + x[1] * y[1] */
emh203	0:3d9c67d97d6f	873	acc0 = __SMLAD(x0, c0, acc0);
emh203	0:3d9c67d97d6f	874
emh203	0:3d9c67d97d6f	875	/* acc1 += x[1] * y[0] + x[2] * y[1] */
emh203	0:3d9c67d97d6f	876	acc1 = __SMLAD(x1, c0, acc1);
emh203	0:3d9c67d97d6f	877
emh203	0:3d9c67d97d6f	878	/* Read x[2], x[3], x[4] */
emh203	0:3d9c67d97d6f	879	a = *px;
emh203	0:3d9c67d97d6f	880	b = *(px + 1);
emh203	0:3d9c67d97d6f	881
emh203	0:3d9c67d97d6f	882	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	883
emh203	0:3d9c67d97d6f	884	x2 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	885	a = *(px + 2);
emh203	0:3d9c67d97d6f	886	x3 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	887
emh203	0:3d9c67d97d6f	888	#else
emh203	0:3d9c67d97d6f	889
emh203	0:3d9c67d97d6f	890	x2 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	891	a = *(px + 2);
emh203	0:3d9c67d97d6f	892	x3 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	893
emh203	0:3d9c67d97d6f	894	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	895
emh203	0:3d9c67d97d6f	896	/* acc2 += x[2] * y[0] + x[3] * y[1] */
emh203	0:3d9c67d97d6f	897	acc2 = __SMLAD(x2, c0, acc2);
emh203	0:3d9c67d97d6f	898
emh203	0:3d9c67d97d6f	899	/* acc3 += x[3] * y[0] + x[4] * y[1] */
emh203	0:3d9c67d97d6f	900	acc3 = __SMLAD(x3, c0, acc3);
emh203	0:3d9c67d97d6f	901
emh203	0:3d9c67d97d6f	902	/* Read y[2] and y[3] */
emh203	0:3d9c67d97d6f	903	a = *(py + 2);
emh203	0:3d9c67d97d6f	904	b = *(py + 3);
emh203	0:3d9c67d97d6f	905
emh203	0:3d9c67d97d6f	906	py += 4u;
emh203	0:3d9c67d97d6f	907
emh203	0:3d9c67d97d6f	908	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	909
emh203	0:3d9c67d97d6f	910	c0 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	911
emh203	0:3d9c67d97d6f	912	#else
emh203	0:3d9c67d97d6f	913
emh203	0:3d9c67d97d6f	914	c0 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	915
emh203	0:3d9c67d97d6f	916	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	917
emh203	0:3d9c67d97d6f	918	/* acc0 += x[2] * y[2] + x[3] * y[3] */
emh203	0:3d9c67d97d6f	919	acc0 = __SMLAD(x2, c0, acc0);
emh203	0:3d9c67d97d6f	920
emh203	0:3d9c67d97d6f	921	/* acc1 += x[3] * y[2] + x[4] * y[3] */
emh203	0:3d9c67d97d6f	922	acc1 = __SMLAD(x3, c0, acc1);
emh203	0:3d9c67d97d6f	923
emh203	0:3d9c67d97d6f	924	/* Read x[4], x[5], x[6] */
emh203	0:3d9c67d97d6f	925	a = *(px + 2);
emh203	0:3d9c67d97d6f	926	b = *(px + 3);
emh203	0:3d9c67d97d6f	927
emh203	0:3d9c67d97d6f	928	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	929
emh203	0:3d9c67d97d6f	930	x0 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	931	a = *(px + 4);
emh203	0:3d9c67d97d6f	932	x1 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	933
emh203	0:3d9c67d97d6f	934	#else
emh203	0:3d9c67d97d6f	935
emh203	0:3d9c67d97d6f	936	x0 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	937	a = *(px + 4);
emh203	0:3d9c67d97d6f	938	x1 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	939
emh203	0:3d9c67d97d6f	940	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	941
emh203	0:3d9c67d97d6f	942	px += 4u;
emh203	0:3d9c67d97d6f	943
emh203	0:3d9c67d97d6f	944	/* acc2 += x[4] * y[2] + x[5] * y[3] */
emh203	0:3d9c67d97d6f	945	acc2 = __SMLAD(x0, c0, acc2);
emh203	0:3d9c67d97d6f	946
emh203	0:3d9c67d97d6f	947	/* acc3 += x[5] * y[2] + x[6] * y[3] */
emh203	0:3d9c67d97d6f	948	acc3 = __SMLAD(x1, c0, acc3);
emh203	0:3d9c67d97d6f	949
emh203	0:3d9c67d97d6f	950	} while(--k);
emh203	0:3d9c67d97d6f	951
emh203	0:3d9c67d97d6f	952	/* For the next MAC operations, SIMD is not used.
emh203	0:3d9c67d97d6f	953	* So, the 16 bit pointer of inputB, py, is updated */
emh203	0:3d9c67d97d6f	954
emh203	0:3d9c67d97d6f	955	/* If the srcBLen is not a multiple of 4, compute any remaining MACs here.
emh203	0:3d9c67d97d6f	956	** No loop unrolling is used. */
emh203	0:3d9c67d97d6f	957	k = srcBLen % 0x4u;
emh203	0:3d9c67d97d6f	958
emh203	0:3d9c67d97d6f	959	if(k == 1u)
emh203	0:3d9c67d97d6f	960	{
emh203	0:3d9c67d97d6f	961	/* Read y[4] */
emh203	0:3d9c67d97d6f	962	c0 = *py;
emh203	0:3d9c67d97d6f	963	#ifdef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	964
emh203	0:3d9c67d97d6f	965	c0 = c0 << 16u;
emh203	0:3d9c67d97d6f	966
emh203	0:3d9c67d97d6f	967	#else
emh203	0:3d9c67d97d6f	968
emh203	0:3d9c67d97d6f	969	c0 = c0 & 0x0000FFFF;
emh203	0:3d9c67d97d6f	970
emh203	0:3d9c67d97d6f	971	#endif /* #ifdef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	972
emh203	0:3d9c67d97d6f	973	/* Read x[7] */
emh203	0:3d9c67d97d6f	974	a = *px;
emh203	0:3d9c67d97d6f	975	b = *(px + 1);
emh203	0:3d9c67d97d6f	976
emh203	0:3d9c67d97d6f	977	px++;
emh203	0:3d9c67d97d6f	978
emh203	0:3d9c67d97d6f	979	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	980
emh203	0:3d9c67d97d6f	981	x3 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	982
emh203	0:3d9c67d97d6f	983	#else
emh203	0:3d9c67d97d6f	984
emh203	0:3d9c67d97d6f	985	x3 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	986
emh203	0:3d9c67d97d6f	987	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	988
emh203	0:3d9c67d97d6f	989	px++;
emh203	0:3d9c67d97d6f	990
emh203	0:3d9c67d97d6f	991	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	992	acc0 = __SMLAD(x0, c0, acc0);
emh203	0:3d9c67d97d6f	993	acc1 = __SMLAD(x1, c0, acc1);
emh203	0:3d9c67d97d6f	994	acc2 = __SMLADX(x1, c0, acc2);
emh203	0:3d9c67d97d6f	995	acc3 = __SMLADX(x3, c0, acc3);
emh203	0:3d9c67d97d6f	996	}
emh203	0:3d9c67d97d6f	997
emh203	0:3d9c67d97d6f	998	if(k == 2u)
emh203	0:3d9c67d97d6f	999	{
emh203	0:3d9c67d97d6f	1000	/* Read y[4], y[5] */
emh203	0:3d9c67d97d6f	1001	a = *py;
emh203	0:3d9c67d97d6f	1002	b = *(py + 1);
emh203	0:3d9c67d97d6f	1003
emh203	0:3d9c67d97d6f	1004	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	1005
emh203	0:3d9c67d97d6f	1006	c0 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	1007
emh203	0:3d9c67d97d6f	1008	#else
emh203	0:3d9c67d97d6f	1009
emh203	0:3d9c67d97d6f	1010	c0 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	1011
emh203	0:3d9c67d97d6f	1012	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	1013
emh203	0:3d9c67d97d6f	1014	/* Read x[7], x[8], x[9] */
emh203	0:3d9c67d97d6f	1015	a = *px;
emh203	0:3d9c67d97d6f	1016	b = *(px + 1);
emh203	0:3d9c67d97d6f	1017
emh203	0:3d9c67d97d6f	1018	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	1019
emh203	0:3d9c67d97d6f	1020	x3 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	1021	a = *(px + 2);
emh203	0:3d9c67d97d6f	1022	x2 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	1023
emh203	0:3d9c67d97d6f	1024	#else
emh203	0:3d9c67d97d6f	1025
emh203	0:3d9c67d97d6f	1026	x3 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	1027	a = *(px + 2);
emh203	0:3d9c67d97d6f	1028	x2 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	1029
emh203	0:3d9c67d97d6f	1030	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	1031
emh203	0:3d9c67d97d6f	1032	px += 2u;
emh203	0:3d9c67d97d6f	1033
emh203	0:3d9c67d97d6f	1034	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	1035	acc0 = __SMLAD(x0, c0, acc0);
emh203	0:3d9c67d97d6f	1036	acc1 = __SMLAD(x1, c0, acc1);
emh203	0:3d9c67d97d6f	1037	acc2 = __SMLAD(x3, c0, acc2);
emh203	0:3d9c67d97d6f	1038	acc3 = __SMLAD(x2, c0, acc3);
emh203	0:3d9c67d97d6f	1039	}
emh203	0:3d9c67d97d6f	1040
emh203	0:3d9c67d97d6f	1041	if(k == 3u)
emh203	0:3d9c67d97d6f	1042	{
emh203	0:3d9c67d97d6f	1043	/* Read y[4], y[5] */
emh203	0:3d9c67d97d6f	1044	a = *py;
emh203	0:3d9c67d97d6f	1045	b = *(py + 1);
emh203	0:3d9c67d97d6f	1046
emh203	0:3d9c67d97d6f	1047	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	1048
emh203	0:3d9c67d97d6f	1049	c0 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	1050
emh203	0:3d9c67d97d6f	1051	#else
emh203	0:3d9c67d97d6f	1052
emh203	0:3d9c67d97d6f	1053	c0 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	1054
emh203	0:3d9c67d97d6f	1055	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	1056
emh203	0:3d9c67d97d6f	1057	py += 2u;
emh203	0:3d9c67d97d6f	1058
emh203	0:3d9c67d97d6f	1059	/* Read x[7], x[8], x[9] */
emh203	0:3d9c67d97d6f	1060	a = *px;
emh203	0:3d9c67d97d6f	1061	b = *(px + 1);
emh203	0:3d9c67d97d6f	1062
emh203	0:3d9c67d97d6f	1063	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	1064
emh203	0:3d9c67d97d6f	1065	x3 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	1066	a = *(px + 2);
emh203	0:3d9c67d97d6f	1067	x2 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	1068
emh203	0:3d9c67d97d6f	1069	#else
emh203	0:3d9c67d97d6f	1070
emh203	0:3d9c67d97d6f	1071	x3 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	1072	a = *(px + 2);
emh203	0:3d9c67d97d6f	1073	x2 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	1074
emh203	0:3d9c67d97d6f	1075	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	1076
emh203	0:3d9c67d97d6f	1077	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	1078	acc0 = __SMLAD(x0, c0, acc0);
emh203	0:3d9c67d97d6f	1079	acc1 = __SMLAD(x1, c0, acc1);
emh203	0:3d9c67d97d6f	1080	acc2 = __SMLAD(x3, c0, acc2);
emh203	0:3d9c67d97d6f	1081	acc3 = __SMLAD(x2, c0, acc3);
emh203	0:3d9c67d97d6f	1082
emh203	0:3d9c67d97d6f	1083	/* Read y[6] */
emh203	0:3d9c67d97d6f	1084	c0 = (*py);
emh203	0:3d9c67d97d6f	1085	#ifdef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	1086
emh203	0:3d9c67d97d6f	1087	c0 = c0 << 16u;
emh203	0:3d9c67d97d6f	1088	#else
emh203	0:3d9c67d97d6f	1089
emh203	0:3d9c67d97d6f	1090	c0 = c0 & 0x0000FFFF;
emh203	0:3d9c67d97d6f	1091	#endif /* #ifdef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	1092
emh203	0:3d9c67d97d6f	1093	/* Read x[10] */
emh203	0:3d9c67d97d6f	1094	b = *(px + 3);
emh203	0:3d9c67d97d6f	1095
emh203	0:3d9c67d97d6f	1096	#ifndef ARM_MATH_BIG_ENDIAN
emh203	0:3d9c67d97d6f	1097
emh203	0:3d9c67d97d6f	1098	x3 = __PKHBT(a, b, 16);
emh203	0:3d9c67d97d6f	1099
emh203	0:3d9c67d97d6f	1100	#else
emh203	0:3d9c67d97d6f	1101
emh203	0:3d9c67d97d6f	1102	x3 = __PKHBT(b, a, 16);
emh203	0:3d9c67d97d6f	1103
emh203	0:3d9c67d97d6f	1104	#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
emh203	0:3d9c67d97d6f	1105
emh203	0:3d9c67d97d6f	1106	px += 3u;
emh203	0:3d9c67d97d6f	1107
emh203	0:3d9c67d97d6f	1108	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	1109	acc0 = __SMLADX(x1, c0, acc0);
emh203	0:3d9c67d97d6f	1110	acc1 = __SMLAD(x2, c0, acc1);
emh203	0:3d9c67d97d6f	1111	acc2 = __SMLADX(x2, c0, acc2);
emh203	0:3d9c67d97d6f	1112	acc3 = __SMLADX(x3, c0, acc3);
emh203	0:3d9c67d97d6f	1113	}
emh203	0:3d9c67d97d6f	1114
emh203	0:3d9c67d97d6f	1115	/* Store the result in the accumulator in the destination buffer. */
emh203	0:3d9c67d97d6f	1116	*pOut = (q15_t) (acc0 >> 15);
emh203	0:3d9c67d97d6f	1117	/* Destination pointer is updated according to the address modifier, inc */
emh203	0:3d9c67d97d6f	1118	pOut += inc;
emh203	0:3d9c67d97d6f	1119
emh203	0:3d9c67d97d6f	1120	*pOut = (q15_t) (acc1 >> 15);
emh203	0:3d9c67d97d6f	1121	pOut += inc;
emh203	0:3d9c67d97d6f	1122
emh203	0:3d9c67d97d6f	1123	*pOut = (q15_t) (acc2 >> 15);
emh203	0:3d9c67d97d6f	1124	pOut += inc;
emh203	0:3d9c67d97d6f	1125
emh203	0:3d9c67d97d6f	1126	*pOut = (q15_t) (acc3 >> 15);
emh203	0:3d9c67d97d6f	1127	pOut += inc;
emh203	0:3d9c67d97d6f	1128
emh203	0:3d9c67d97d6f	1129	/* Increment the pointer pIn1 index, count, by 4 */
emh203	0:3d9c67d97d6f	1130	count += 4u;
emh203	0:3d9c67d97d6f	1131
emh203	0:3d9c67d97d6f	1132	/* Update the inputA and inputB pointers for next MAC calculation */
emh203	0:3d9c67d97d6f	1133	px = pIn1 + count;
emh203	0:3d9c67d97d6f	1134	py = pIn2;
emh203	0:3d9c67d97d6f	1135
emh203	0:3d9c67d97d6f	1136
emh203	0:3d9c67d97d6f	1137	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	1138	blkCnt--;
emh203	0:3d9c67d97d6f	1139	}
emh203	0:3d9c67d97d6f	1140
emh203	0:3d9c67d97d6f	1141	/* If the blockSize2 is not a multiple of 4, compute any remaining output samples here.
emh203	0:3d9c67d97d6f	1142	** No loop unrolling is used. */
emh203	0:3d9c67d97d6f	1143	blkCnt = blockSize2 % 0x4u;
emh203	0:3d9c67d97d6f	1144
emh203	0:3d9c67d97d6f	1145	while(blkCnt > 0u)
emh203	0:3d9c67d97d6f	1146	{
emh203	0:3d9c67d97d6f	1147	/* Accumulator is made zero for every iteration */
emh203	0:3d9c67d97d6f	1148	sum = 0;
emh203	0:3d9c67d97d6f	1149
emh203	0:3d9c67d97d6f	1150	/* Apply loop unrolling and compute 4 MACs simultaneously. */
emh203	0:3d9c67d97d6f	1151	k = srcBLen >> 2u;
emh203	0:3d9c67d97d6f	1152
emh203	0:3d9c67d97d6f	1153	/* First part of the processing with loop unrolling. Compute 4 MACs at a time.
emh203	0:3d9c67d97d6f	1154	** a second loop below computes MACs for the remaining 1 to 3 samples. */
emh203	0:3d9c67d97d6f	1155	while(k > 0u)
emh203	0:3d9c67d97d6f	1156	{
emh203	0:3d9c67d97d6f	1157	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	1158	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	1159	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	1160	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	1161	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	1162
emh203	0:3d9c67d97d6f	1163	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	1164	k--;
emh203	0:3d9c67d97d6f	1165	}
emh203	0:3d9c67d97d6f	1166
emh203	0:3d9c67d97d6f	1167	/* If the srcBLen is not a multiple of 4, compute any remaining MACs here.
emh203	0:3d9c67d97d6f	1168	** No loop unrolling is used. */
emh203	0:3d9c67d97d6f	1169	k = srcBLen % 0x4u;
emh203	0:3d9c67d97d6f	1170
emh203	0:3d9c67d97d6f	1171	while(k > 0u)
emh203	0:3d9c67d97d6f	1172	{
emh203	0:3d9c67d97d6f	1173	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	1174	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	1175
emh203	0:3d9c67d97d6f	1176	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	1177	k--;
emh203	0:3d9c67d97d6f	1178	}
emh203	0:3d9c67d97d6f	1179
emh203	0:3d9c67d97d6f	1180	/* Store the result in the accumulator in the destination buffer. */
emh203	0:3d9c67d97d6f	1181	*pOut = (q15_t) (sum >> 15);
emh203	0:3d9c67d97d6f	1182	/* Destination pointer is updated according to the address modifier, inc */
emh203	0:3d9c67d97d6f	1183	pOut += inc;
emh203	0:3d9c67d97d6f	1184
emh203	0:3d9c67d97d6f	1185	/* Increment the pointer pIn1 index, count by 1 */
emh203	0:3d9c67d97d6f	1186	count++;
emh203	0:3d9c67d97d6f	1187
emh203	0:3d9c67d97d6f	1188	/* Update the inputA and inputB pointers for next MAC calculation */
emh203	0:3d9c67d97d6f	1189	px = pIn1 + count;
emh203	0:3d9c67d97d6f	1190	py = pIn2;
emh203	0:3d9c67d97d6f	1191
emh203	0:3d9c67d97d6f	1192	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	1193	blkCnt--;
emh203	0:3d9c67d97d6f	1194	}
emh203	0:3d9c67d97d6f	1195	}
emh203	0:3d9c67d97d6f	1196	else
emh203	0:3d9c67d97d6f	1197	{
emh203	0:3d9c67d97d6f	1198	/* If srcBLen is less than 4,
emh203	0:3d9c67d97d6f	1199	* the blockSize2 loop is not unrolled by 4 */
emh203	0:3d9c67d97d6f	1200	blkCnt = blockSize2;
emh203	0:3d9c67d97d6f	1201
emh203	0:3d9c67d97d6f	1202	while(blkCnt > 0u)
emh203	0:3d9c67d97d6f	1203	{
emh203	0:3d9c67d97d6f	1204	/* Accumulator is made zero for every iteration */
emh203	0:3d9c67d97d6f	1205	sum = 0;
emh203	0:3d9c67d97d6f	1206
emh203	0:3d9c67d97d6f	1207	/* Loop over srcBLen */
emh203	0:3d9c67d97d6f	1208	k = srcBLen;
emh203	0:3d9c67d97d6f	1209
emh203	0:3d9c67d97d6f	1210	while(k > 0u)
emh203	0:3d9c67d97d6f	1211	{
emh203	0:3d9c67d97d6f	1212	/* Perform the multiply-accumulate */
emh203	0:3d9c67d97d6f	1213	sum += ((q31_t) * px++ * *py++);
emh203	0:3d9c67d97d6f	1214
emh203	0:3d9c67d97d6f	1215	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	1216	k--;
emh203	0:3d9c67d97d6f	1217	}
emh203	0:3d9c67d97d6f	1218
emh203	0:3d9c67d97d6f	1219	/* Store the result in the accumulator in the destination buffer. */
emh203	0:3d9c67d97d6f	1220	*pOut = (q15_t) (sum >> 15);
emh203	0:3d9c67d97d6f	1221	/* Destination pointer is updated according to the address modifier, inc */
emh203	0:3d9c67d97d6f	1222	pOut += inc;
emh203	0:3d9c67d97d6f	1223
emh203	0:3d9c67d97d6f	1224	/* Increment the MAC count */
emh203	0:3d9c67d97d6f	1225	count++;
emh203	0:3d9c67d97d6f	1226
emh203	0:3d9c67d97d6f	1227	/* Update the inputA and inputB pointers for next MAC calculation */
emh203	0:3d9c67d97d6f	1228	px = pIn1 + count;
emh203	0:3d9c67d97d6f	1229	py = pIn2;
emh203	0:3d9c67d97d6f	1230
emh203	0:3d9c67d97d6f	1231	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	1232	blkCnt--;
emh203	0:3d9c67d97d6f	1233	}
emh203	0:3d9c67d97d6f	1234	}
emh203	0:3d9c67d97d6f	1235
emh203	0:3d9c67d97d6f	1236	/* --------------------------
emh203	0:3d9c67d97d6f	1237	* Initializations of stage3
emh203	0:3d9c67d97d6f	1238	* -------------------------*/
emh203	0:3d9c67d97d6f	1239
emh203	0:3d9c67d97d6f	1240	/* sum += x[srcALen-srcBLen+1] * y[0] + x[srcALen-srcBLen+2] * y[1] +...+ x[srcALen-1] * y[srcBLen-2]
emh203	0:3d9c67d97d6f	1241	* sum += x[srcALen-srcBLen+2] * y[0] + x[srcALen-srcBLen+3] * y[1] +...+ x[srcALen-1] * y[srcBLen-3]
emh203	0:3d9c67d97d6f	1242	* ....
emh203	0:3d9c67d97d6f	1243	* sum += x[srcALen-2] * y[0] + x[srcALen-1] * y[1]
emh203	0:3d9c67d97d6f	1244	* sum += x[srcALen-1] * y[0]
emh203	0:3d9c67d97d6f	1245	*/
emh203	0:3d9c67d97d6f	1246
emh203	0:3d9c67d97d6f	1247	/* In this stage the MAC operations are decreased by 1 for every iteration.
emh203	0:3d9c67d97d6f	1248	The count variable holds the number of MAC operations performed */
emh203	0:3d9c67d97d6f	1249	count = srcBLen - 1u;
emh203	0:3d9c67d97d6f	1250
emh203	0:3d9c67d97d6f	1251	/* Working pointer of inputA */
emh203	0:3d9c67d97d6f	1252	pSrc1 = (pIn1 + srcALen) - (srcBLen - 1u);
emh203	0:3d9c67d97d6f	1253	px = pSrc1;
emh203	0:3d9c67d97d6f	1254
emh203	0:3d9c67d97d6f	1255	/* Working pointer of inputB */
emh203	0:3d9c67d97d6f	1256	py = pIn2;
emh203	0:3d9c67d97d6f	1257
emh203	0:3d9c67d97d6f	1258	/* -------------------
emh203	0:3d9c67d97d6f	1259	* Stage3 process
emh203	0:3d9c67d97d6f	1260	* ------------------*/
emh203	0:3d9c67d97d6f	1261
emh203	0:3d9c67d97d6f	1262	while(blockSize3 > 0u)
emh203	0:3d9c67d97d6f	1263	{
emh203	0:3d9c67d97d6f	1264	/* Accumulator is made zero for every iteration */
emh203	0:3d9c67d97d6f	1265	sum = 0;
emh203	0:3d9c67d97d6f	1266
emh203	0:3d9c67d97d6f	1267	/* Apply loop unrolling: k is the number of groups of 4 MACs. */
emh203	0:3d9c67d97d6f	1268	k = count >> 2u;
emh203	0:3d9c67d97d6f	1269
emh203	0:3d9c67d97d6f	1270	/* First part of the processing with loop unrolling. Compute 4 MACs at a time.
emh203	0:3d9c67d97d6f	1271	** a second loop below computes MACs for the remaining 1 to 3 samples. */
emh203	0:3d9c67d97d6f	1272	while(k > 0u)
emh203	0:3d9c67d97d6f	1273	{
emh203	0:3d9c67d97d6f	1274	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	1275	sum += ((q31_t) *px++ * *py++);
emh203	0:3d9c67d97d6f	1276	sum += ((q31_t) *px++ * *py++);
emh203	0:3d9c67d97d6f	1277	sum += ((q31_t) *px++ * *py++);
emh203	0:3d9c67d97d6f	1278	sum += ((q31_t) *px++ * *py++);
emh203	0:3d9c67d97d6f	1279
emh203	0:3d9c67d97d6f	1280	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	1281	k--;
emh203	0:3d9c67d97d6f	1282	}
emh203	0:3d9c67d97d6f	1283
emh203	0:3d9c67d97d6f	1284	/* If the count is not a multiple of 4, compute any remaining MACs here.
emh203	0:3d9c67d97d6f	1285	** No loop unrolling is used. */
emh203	0:3d9c67d97d6f	1286	k = count % 0x4u;
emh203	0:3d9c67d97d6f	1287
emh203	0:3d9c67d97d6f	1288	while(k > 0u)
emh203	0:3d9c67d97d6f	1289	{
emh203	0:3d9c67d97d6f	1290	/* Perform the multiply-accumulates */
emh203	0:3d9c67d97d6f	1291	sum += ((q31_t) *px++ * *py++);
emh203	0:3d9c67d97d6f	1292
emh203	0:3d9c67d97d6f	1293	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	1294	k--;
emh203	0:3d9c67d97d6f	1295	}
emh203	0:3d9c67d97d6f	1296
emh203	0:3d9c67d97d6f	1297	/* Store the accumulated result in the destination buffer. */
emh203	0:3d9c67d97d6f	1298	*pOut = (q15_t) (sum >> 15);
emh203	0:3d9c67d97d6f	1299	/* Destination pointer is updated according to the address modifier, inc */
emh203	0:3d9c67d97d6f	1300	pOut += inc;
emh203	0:3d9c67d97d6f	1301
emh203	0:3d9c67d97d6f	1302	/* Update the inputA and inputB pointers for next MAC calculation */
emh203	0:3d9c67d97d6f	1303	px = ++pSrc1;
emh203	0:3d9c67d97d6f	1304	py = pIn2;
emh203	0:3d9c67d97d6f	1305
emh203	0:3d9c67d97d6f	1306	/* Decrement the MAC count */
emh203	0:3d9c67d97d6f	1307	count--;
emh203	0:3d9c67d97d6f	1308
emh203	0:3d9c67d97d6f	1309	/* Decrement the loop counter */
emh203	0:3d9c67d97d6f	1310	blockSize3--;
emh203	0:3d9c67d97d6f	1311	}
emh203	0:3d9c67d97d6f	1312
emh203	0:3d9c67d97d6f	1313	#endif /* #ifndef UNALIGNED_SUPPORT_DISABLE */
emh203	0:3d9c67d97d6f	1314
emh203	0:3d9c67d97d6f	1315	}
emh203	0:3d9c67d97d6f	1316
emh203	0:3d9c67d97d6f	1317	/**
emh203	0:3d9c67d97d6f	1318	* @} end of Corr group
emh203	0:3d9c67d97d6f	1319	*/
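
The stage 3 loop above can be hard to follow because of the 4x unrolling. As a rough illustration only, here is a minimal non-unrolled sketch of the same tail computation. The helper name `corr_stage3_ref` and its parameter names are hypothetical and not part of CMSIS. It assumes the number of stage 3 outputs equals `srcBLen - 1` (the initial value of `count`), and, like the listing, it shifts the 32-bit accumulator right by 15 without saturating.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;   /* same underlying types CMSIS uses for q15/q31 */
typedef int32_t q31_t;

/* Hypothetical reference helper (not a CMSIS function).
 * Computes the tail ("stage 3") correlation outputs:
 *   out[0] = x[xLen-(yLen-1)] * y[0] + ... + x[xLen-1] * y[yLen-2]
 *   ...
 *   out[yLen-2] = x[xLen-1] * y[0]
 * inc is the output stride, as in the listing above. */
static void corr_stage3_ref(const q15_t *x, uint32_t xLen,
                            const q15_t *y, uint32_t yLen,
                            q15_t *out, int32_t inc)
{
    assert(yLen >= 1u && xLen >= yLen - 1u);

    uint32_t count = yLen - 1u;                   /* MACs for the first output */
    const q15_t *start = x + xLen - (yLen - 1u);  /* first x sample used */

    while (count > 0u)
    {
        q31_t sum = 0;

        /* One MAC per overlapping sample pair, no unrolling */
        for (uint32_t i = 0u; i < count; i++)
        {
            sum += (q31_t) start[i] * y[i];
        }

        /* Convert the accumulator back to q15 by a 15-bit shift (no saturation) */
        *out = (q15_t) (sum >> 15);
        out += inc;

        start++;   /* next output starts one sample later in x */
        count--;   /* and uses one MAC fewer */
    }
}
```

For example, with x = {0x4000, 0x2000, 0x1000, 0x0800}, y = {0x4000, 0x4000, 0x4000} and inc = 1, the sketch writes two outputs, 3072 and 1024.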

Repository details

Type:	Library
Created:	28 Jul 2014
Imports:	1167
Forks:	0
Commits:	1
Dependents:	15
Dependencies:	0
Followers:	39
