As part of a data size reduction and chromatogram smoothing PoC project, I needed to implement a Hello World ‘Centred Moving Average’ algorithm on a large number of UV data sets.
First of all, a recap – “What is a Centred Moving Average” ? The terse answer is that it is the average of values over a fixed size yet moving (preceding and following) window of the datapoints. So for a window of 5 and a datapoint set of 1, 2, 3, 3, 3, 4, 1, 6, 9, 21, 18, 13, 10, 7, 6, 5, 4, 2, 3, and 4, then the moving average dataset for the highlighted values 4, 21, and 7 would be (3+3+4+1+6)/5=3.4, (6+9+21+18+13)/5=13.4, and (13+10+7+6+5)/5=8.2. In other words the average of two preceding points, the midpoint, and the two following points for a window size of 5 (or the preceding point, the midpoint, and following point for a window size of 3, and so on).
I had naively expected to use a search engine to search out an existing code implementation, tart it up a bit, and move on with some of the more interesting aspects of the project. Somewhat surprisingly, there is little ‘out there’, and certainly not in the technologies I require, viz. Matlab, Excel, C#, and (Oracle) SQL (two required for quick prototype/demo code implementations, Excel and Oracle, and the other two in the PoC code I had to deliver).
I had to write my own implementations. For a PoC application, I don’t need to worry much about the edge cases, nor the code efficiency, nor even if the independent variable was equidistantly spaced, I just wanted a small handful of basic implementations and to prove it does what I need it to do in downstream processes. During the exercise it came as quite a surprise to me, a seasoned curly brace language and SQL developer, how vastly different the four algorithms were in these four programming languages. A ‘diff’ tool would have found very few similarities in the source code.
In years gone by, Fortran was used for mathematical operations, COBOL for business, etc. Nowadays mainstream programming languages seem to have coalesced to those that use curly braces (C, C++, PHP, Perl, C#, Java, …), and those that don’t (SQL, XSLT, VBA, S and R,….), and the programming language used by the developer match those of the developers’ skill set, and not not necessarily the programming language suited to the task at hand. This brief blog posts shows that it is still the case that the right language/tool should be used for the job.
Note: Despite the coding implementations being trivial, I have uploaded the four examples to github here: https://github.com/michaeldoshea/CentredMovingAverage Use them as you see fit.
Matlab
Not unsurprisingly Matlab was very suited to the task, and immediately became as my trusty reference implementation of the calculation results. The code was also the most minimalistic and easy to understand.
Y = [1 2 3 3 3 4 1 6 9 21 18 13 10 7 6 5 4 2 3 4]; CMA = movmean(Y, [2 2], 'Endpoints', 'discard')
Another MatLab implementation of UV data smoothing using the Savitzky-Golay algorithm was not so trivial.
Excel
Ubiquitous Excel proved again to be very suited to this type of calculation task too, and also easily enabled me to gild the lily a bit by assigning names to a few cells, and performing calculations based on a configurable midpoint offset (the fixed window size, as coded, being 2*midpoint offset + 1).
=IF( AND( ROW()-ROW(FIRSTROW)>=MIDPOINTOFFSET, ROW()<=ROW(LASTROW)-MIDPOINTOFFSET ), AVERAGE( OFFSET($D7,-MIDPOINTOFFSET,0, MIDPOINTOFFSET*2+1,1) ), "" )
C#
A C# LINQ implementation of the Centred Moving Average (as opposed to some sort of horrific piece of logic with arrays, array offsets and loops of loops, and a lot of debugging pain even for a task this simple) could also be achieved in a very readable and maintainable form. Without the use of external libraries or a home-rolled extension methods, the example code below just must be the minimal logic required to implement a generic Centred Moving Average.
using System.Collections.Generic; using System.Linq; namespace ConsoleApplication3841 { class Program { static void Main(string[] args) { var yvals = new List<float>() { 1, 2, 3, 3, 3, 4, 1, 6, 9, 21, 18, 13, 10, 7, 6, 5, 4, 2, 3, 4 }; int midpointOffset = 2; var centredMovingAverage = Enumerable .Range(0, yvals.Count - (2 * midpointOffset + 1) + 1) .Select(n =>, yvals.Skip(n).Take((2 * midpointOffset + 1)) .Average()) .ToList(); } } }
Oracle SQL
Writing this type of algorithm with SQL can be achieved, as demonstrated below, but it is very messy. Despite occasionally finding this type of code in production systems, SQL isn’t really the first language of choice for this type of task. Or the second, third, or fourth choice either, clearly. I suspect leveraging R Services (if using Microsoft SQL Server of course) to perform these types of calculations would be a move in the direction of ‘using the right tool for the job’.
SQL> SQL> SQL> SQL> SQL> WITH movAvg AS (SELECT 2 midPointOffset FROM DUAL), 2 avgCandidates AS ( 3 SELECT x, 4 y, 5 AVG(y) 6 OVER (ORDER BY x ROWS BETWEEN m.midPointOffset PRECEDING AND m.midPointOffset FOLLOWING) centredMovingAvg, 7 ROW_NUMBER() 8 OVER (ORDER BY 1) rn, 9 COUNT(*) 10 OVER (ORDER BY 1) - m.midPointOffset + 1 windowUpper 11 FROM test, 12 movAvg m 13 ) 14 SELECT x, y, centredMovingAvg 15 FROM avgCandidates c, 16 movAvg a 17 WHERE c.rn > a.midPointOffset 18 AND c.rn < c.windowUpper; X Y CENTREDMOVINGAVG ----- ----- ---------------- 3 3 2.40 4 3 3.00 5 3 2.80 6 4 3.40 7 1 4.60 8 6 8.20 9 9 11.00 10 21 13.40 11 18 14.20 12 13 13.80 13 10 10.80 14 7 8.20 15 6 6.40 16 5 4.80 17 4 4.00 18 2 3.60 16 rows selected. SQL> SQL>
— Published by Mike, 12:08:41 05 February 2017
By Month: November 2022, October 2022, August 2022, February 2021, January 2021, December 2020, November 2020, March 2019, September 2018, June 2018, May 2018, April 2018
Apple, C#, Databases, Faircom, General IT Rant, German, Informatics, LINQ, MongoDB, Oracle, Perl, PostgreSQL, SQL, SQL Server, Unit Testing, XML/XSLT
Leave a Reply