SigmaPlot provides seven data smoothing algorithms that should satisfy most smoothing needs: negative exponential, loess, running average, running median, bisquare, inverse square and inverse distance. Each smoother has options that make it very flexible. For example, unequally spaced data that occurs in clumps is better analyzed using the nearest neighbor bandwidth method than a fixed bandwidth method. Also, outlier rejection is available in some smoothers.
Smoothing is used to elicit trends from noisy data. The three examples in Tukey's book Exploratory Data Analysis (Addison-Wesley, 1977) show the need for smoothing beautifully. The trends in the U.S. gold production from 1872 to 1956, Figure 1A, are fairly clear.
The peaks and valleys in the U.S. wheat production, Figure 1B, are less clear. I challenge you to visually find the trends in the annual New York City precipitation data shown in Figure 1C. The loess algorithm will be used to smooth these data sets. “loess” means locally weighted regression. Each point along the smooth curve is obtained from a regression of the data points close to the curve point, with the closest points weighted most heavily. The amount of smoothing, which affects the number of points in the regression, is determined by the user. A separate weighted regression is performed for each point along the smooth curve.
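The description above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SigmaPlot's implementation: the function names `tricube`, `loess_point` and `loess_curve` are hypothetical, the tricube weight is the kernel this note lists for loess, and the rule that the local window reaches the k-th nearest point with k = ceil(proportion × n) is an assumed reading of "sampling proportion".

```python
import math

def tricube(u):
    """Tricube weight for a scaled distance u; zero outside |u| < 1 (assumed support)."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_point(x0, xs, ys, proportion=0.3):
    """Local linear (degree 1) weighted regression evaluated at x0."""
    n = len(xs)
    # Assumption: the sampling proportion sets how many neighbors enter the fit.
    k = min(n, max(2, math.ceil(proportion * n)))
    h = sorted(abs(x - x0) for x in xs)[k - 1]   # distance to the k-th nearest point
    if h == 0.0:
        ws = [1.0 if x == x0 else 0.0 for x in xs]
    else:
        ws = [tricube((x - x0) / h) for x in xs]
    sw = sum(ws)
    if sw == 0.0:
        # All neighbors sit exactly at distance h: fall back to uniform weights.
        ws = [1.0 if abs(x - x0) <= h else 0.0 for x in xs]
        sw = sum(ws)
    # Closed-form weighted least squares for a straight line.
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    if sxx == 0.0:
        return my                                 # only one distinct x has weight
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)

def loess_curve(xs, ys, proportion=0.3, intervals=100):
    """Evaluate the smooth on an evenly spaced grid of intervals + 1 points."""
    lo, hi = min(xs), max(xs)
    grid = [lo + (hi - lo) * i / intervals for i in range(intervals + 1)]
    return grid, [loess_point(g, xs, ys, proportion) for g in grid]

# Unequally spaced x values on an exact straight line y = 2x + 1.
xs = [0.0, 0.5, 1.5, 2.0, 3.5, 4.0, 6.0, 7.0, 7.2, 9.0]
ys = [2.0 * x + 1.0 for x in xs]
grid, smooth = loess_curve(xs, ys, proportion=0.5)
```

Because each local fit is a straight line, exactly linear input such as this is reproduced by the smooth; with noisy input, a larger `proportion` widens each window and gives a smoother curve.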
Figure 1. Data with trends that are increasingly difficult to visualize
Loess smoothed curves for the three examples in Figure 1 are shown in Figure 2. The smoothed curves in Figure 2A and 2B make the trends in the gold and wheat data very clear. The precipitation trend shown in Figure 2C is still difficult to see in the raw data. To confirm the result of the loess smooth, a histogram of average rainfall in ten-year intervals was computed and superimposed on the smooth curve. The histogram and the loess smooth agree well.
The loess smoothing parameters were varied to achieve the best visualization. A polynomial degree of one was used in all cases. A sampling proportion of 0.1 was used in Figures 2A and 2B, and 0.3 in Figure 2C. Since the data was unequally spaced along the x axis, the nearest neighbor bandwidth method was used. The default number of intervals (100) for generating the smooth curve was found to be the best. By default the curve is drawn with straight line segments between the curve points. This sometimes leads to sharp corners in the smooth, so the spline interpolation line type (Smoothed (spline)) was used.
Figure 2. Smoothed curves for data in Figure 1. A ten year average rainfall histogram is also shown in C.
Several of the smoothing methods, including loess, are based on local polynomial regression and the polynomial order is selectable. Increasing the order tends to include more high frequency components in the smooth. The effect of increasing the order from 1 (local linear regressions) to 2 (local quadratic regressions) is shown in Figure 3. The effect is to increase peak height magnitude and introduce additional high frequency components (wiggles) in B. A subsequent increase of the sampling proportion in C results in a smooth very much like the original for order 1 in A.
Figure 3. Effect of increasing the regression polynomial order. The order is 1 and sampling proportion is 0.1 in A. The order is increased to 2 in B and then the sampling proportion is increased to 0.2 in C.
Visualizing spatial relationships in a three dimensional scatter plot can be very difficult. The strongest three dimensional cue is provided by an animated rotation of the data. Since this is not possible in paper publications, we must resort to using drop lines, enclosing the graph with additional axes, etc. Figure 4 shows that a smooth surface also helps. This data describes the reaction characteristics of an isomer of hexane. The smooth surface in B clearly shows the trends with respect to temperature and reaction rate, whereas visualizing them in the scatter plot in A is difficult.
Figure 4. The data trend in A is easily visualized with a loess smoothed surface, B.
This data is relatively sparse, so a large sampling proportion (0.6) was required to avoid oscillations and spikes in the loess surface. A polynomial degree of 1 and the nearest neighbor bandwidth method were used. The Preview feature allows a quick comparison of smoothing methods on a given data set. For this data, essentially equivalent smooth surfaces were obtained with the negative exponential and bisquare methods.
The bandwidth method option is also very useful. The nearest neighbor method works well for unequally spaced data. The data in Figure 3 is unequally spaced in both X and Y. Compare the smoothing results using the nearest neighbor and fixed methods shown in Figure 5. The result shown for the fixed method, with a sampling proportion of 0.8, is about the best that could be obtained by varying the sampling proportion.
Figure 5. Comparison of bandwidth methods for unequally spaced data. Nearest neighbor on the left and fixed on the right.
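A small Python sketch can illustrate why the two bandwidth methods behave differently on clumped data. The definitions here are assumptions made for illustration, not SigmaPlot's documented formulas: the fixed bandwidth is taken as the sampling proportion times the x range, and the nearest neighbor bandwidth as the distance to the k-th nearest point with k = ceil(proportion × n). All function names are hypothetical.

```python
import math

def fixed_bandwidth(xs, proportion):
    """Assumed definition: the same window radius everywhere, a fraction of the x range."""
    return proportion * (max(xs) - min(xs))

def nearest_neighbor_bandwidth(x0, xs, proportion):
    """Assumed definition: radius reaching the k-th nearest point, k = ceil(proportion * n)."""
    k = max(1, math.ceil(proportion * len(xs)))
    return sorted(abs(x - x0) for x in xs)[k - 1]

def points_in_window(x0, xs, h):
    """Count the data points within radius h of the smoothing location x0."""
    return sum(1 for x in xs if abs(x - x0) <= h)

# Two clumps of points with one isolated point between them.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 9.6, 9.7, 9.8, 9.9, 10.0]

h_fixed = fixed_bandwidth(xs, 0.3)
in_clump_fixed = points_in_window(0.2, xs, h_fixed)    # inside the left clump
in_gap_fixed = points_in_window(5.0, xs, h_fixed)      # at the isolated point

h_nn = nearest_neighbor_bandwidth(5.0, xs, 0.3)
in_gap_nn = points_in_window(5.0, xs, h_nn)
```

With the fixed radius, the window at the isolated point contains only that point, so a local regression there is poorly determined. The nearest neighbor radius grows until it contains the requested number of points, which is why it tends to behave better on unequally spaced data.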
Additional Computational Details
Smoothers is a generic name for a variety of techniques that can be used to either smooth a data set by removing undesired high-frequency components (locations of rapid variation, such as noise contamination), or to resample dependent variable values to other independent variable locations using the values of the data at nearby points. The smoothing methods provided in SigmaPlot operate by weighting the data in a neighborhood of the smoothing location and applying linear or non-linear methods to combine the weighted values to produce a smoothed value. These non-parametric smoothing techniques provide a good complement to the parameterized curve/surface fitting facility (Regression Wizard) in SigmaPlot. For data subjected to measurement errors, noise, etc., either method can be used to predict behavior or to estimate true values.
The kernel used in the smoothing computation and the smoothing method are given in the following table.
| Algorithm | Weighting Kernel | Method to Compute Smoothed Value |
| --- | --- | --- |
| Negative Exponential | Gaussian | Polynomial or Loess Fitting |
| Loess | Tricube | Polynomial or Loess Fitting |
| Running Average | Uniform | Mean |
| Running Median | Uniform | Median |
| Bisquare | Biweight | Polynomial Fitting |
| Inverse Square | Cauchy | Mean |
| Inverse Distance | Inverse Distance | Mean |
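The two simplest entries, running average and running median, both use a uniform kernel and differ only in how the values in the window are combined. The sketch below is an illustration under assumptions: the function name `running_smooth` is hypothetical, the window is a fixed number of neighboring points by index, and it is simply truncated at the ends of the data. SigmaPlot's own window and end handling may differ.

```python
from statistics import mean, median

def running_smooth(ys, window=3, combine=mean):
    """Uniform-kernel smoother: combine the values in a centered window of points."""
    half = window // 2
    smoothed = []
    for i in range(len(ys)):
        lo = max(0, i - half)                 # truncate the window at the left end
        hi = min(len(ys), i + half + 1)       # truncate the window at the right end
        smoothed.append(combine(ys[lo:hi]))
    return smoothed

# A short series with one outlier (100).
ys = [1, 2, 100, 4, 5]
by_mean = running_smooth(ys, window=3, combine=mean)
by_median = running_smooth(ys, window=3, combine=median)
```

In this example the outlier pulls three of the running averages above 30, while the running median values stay between 1.5 and 5.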
The equations used for each kernel are:
| Kernel | Kernel Formula (y=0 for 2D Smoothers) |
| --- | --- |
| Uniform | $1$ |
| Biweight | $(1 - x^{2} - y^{2})^{2}$ |
| Tricube | $\left(1 - \left(\sqrt{x^{2}+y^{2}}\right)^{3}\right)^{3}$ |
| Gaussian | $\exp(-x^{2}-y^{2})$ |
| Cauchy | $1/(1+x^{2}+y^{2})$ |
| Inverse Distance (3D only) | $1/\sqrt{x^{2}+y^{2}}$ |
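The kernel formulas translate directly into code. The sketch below is illustrative only, with several assumptions that the tables do not state: x and y are offsets from the smoothing location already divided by the bandwidth, the biweight and tricube kernels are set to zero outside the unit circle, and the hypothetical helper `kernel_mean` shows the "Mean" method as a kernel-weighted average for the 2D case.

```python
import math

def uniform(x, y=0.0):
    return 1.0

def biweight(x, y=0.0):
    r2 = x * x + y * y
    return (1.0 - r2) ** 2 if r2 < 1.0 else 0.0     # zero outside unit circle (assumed)

def tricube(x, y=0.0):
    r = math.sqrt(x * x + y * y)
    return (1.0 - r ** 3) ** 3 if r < 1.0 else 0.0  # zero outside unit circle (assumed)

def gaussian(x, y=0.0):
    return math.exp(-x * x - y * y)

def cauchy(x, y=0.0):
    return 1.0 / (1.0 + x * x + y * y)

def inverse_distance(x, y):
    # 3D smoothers only; undefined when the offset is exactly zero.
    return 1.0 / math.sqrt(x * x + y * y)

def kernel_mean(x0, xs, ys, kernel, h):
    """Kernel-weighted mean at x0; offsets are scaled by the bandwidth h (assumed)."""
    ws = [kernel((x - x0) / h) for x in xs]
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)

# Inverse square smoothing pairs the Cauchy kernel with the mean.
estimate = kernel_mean(1.0, [0.0, 1.0, 2.0], [0.0, 1.0, 2.0], cauchy, 1.0)
```

Here the Cauchy weights are 0.5, 1 and 0.5, so the weighted mean at the middle point is (0 × 0.5 + 1 × 1 + 2 × 0.5) / 2 = 1.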