Primer

Curve fitting made easy
by Marko Ledvij
Scientists and engineers often want to represent empirical data
using a model based on mathematical equations. With the correct
model and calculus, one can determine important characteristics
of the data, such as the rate of change anywhere on the curve (first
derivative), the local minimum and maximum points of the function
(zeros of the first derivative), and the area under the curve (integral).
The models to which data are fitted depend on adjustable parameters,
and the goal of data (or curve) fitting is to find the parameter
values for which the model most closely matches the data.
Take the formula
Y = A*exp(X/X0)
in which X is the independent variable, Y is the
dependent variable, and A and X0 are the
parameters. In an experiment, one typically measures variable Y
at a discrete set of values of variable X. Curve fitting
with this formula serves two general purposes:
- A person believes the formula describes the data and wants
to know what parameter values of A and X0
correspond to the data.
- A person wants to know which of several formulas best describes
the data.
To perform fitting, we define a function of the parameters that
measures the closeness between the data and the model, and we then
minimize this function with respect to the parameters. The parameter
values that minimize the function are the best-fitting parameters. In some
cases, the parameters are the coefficients of the terms of the model,
for example, if the model is polynomial or Gaussian. In the simplest
case, known as linear regression, a straight line is fitted to the
data. In most scientific and engineering models, however, the dependent
variable depends on the parameters in a nonlinear way.
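The idea can be illustrated with a short Python sketch (not from the article): the sum of squared deviations for the model Y = A*exp(X/X0) is minimized over the parameters A and X0. The data values and the choice of SciPy's Nelder-Mead minimizer are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical measured data (placeholders, not from the article).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.3, 5.6, 9.1, 14.8, 24.5])

def model(params, x):
    A, X0 = params
    return A * np.exp(x / X0)

def sum_of_squares(params):
    # Sum of squared vertical deviations between data and model.
    return np.sum((y - model(params, x)) ** 2)

# Minimize the objective starting from a rough initial guess of A and X0.
result = minimize(sum_of_squares, x0=[2.0, 1.5], method="Nelder-Mead")
A_fit, X0_fit = result.x
print(A_fit, X0_fit)
```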
Model selection
One needs a good understanding of the underlying physics, chemistry,
electronics, and other properties of a problem to choose the right
model. No graphing or analysis software can pick a model for the
given data; it can only help in differentiating between models.
Once a model is picked, one can do a rough assessment of it by plotting
the data. At least some agreement is needed between the data and
the model curve. If the model is supposed to represent exponential
growth and the data are monotonically decreasing, the model is obviously
wrong. Many commercial software programs perform linear and nonlinear
regression analysis and curve fitting using the most common models.
The better programs also allow users to specify their own fitting
functions.
A typical example of exponential decay in physics is radioactive
decay. The measurable rate of decay as a function of time (that is,
the number of decays per unit of time) is given by
L(t) = L(0)exp(-λt)
where t is elapsed time and λ
is the probability of decay of one nucleus during one time unit.
By measuring the decay rate as a function of time and fitting the
data to the exponential-decay law, one can infer the constant λ.
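As a hedged illustration (not from the article), the decay constant can be extracted with a least-squares fit in Python; the synthetic data and the use of SciPy's curve_fit are assumptions made for the sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic decay-rate data (placeholders): L(t) = L0 * exp(-lam * t) plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 50)
true_L0, true_lam = 1000.0, 0.35
counts = true_L0 * np.exp(-true_lam * t) + rng.normal(0.0, 10.0, t.size)

def decay(t, L0, lam):
    return L0 * np.exp(-lam * t)

# Fit the exponential-decay law; p0 is a rough initial guess.
popt, pcov = curve_fit(decay, t, counts, p0=[counts[0], 0.1])
L0_fit, lam_fit = popt
print(f"lambda = {lam_fit:.3f}")
```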
As another example, the following Gaussian function describes a
normal distribution:
y(x) = A exp(-(x - x0)²/(2σ²))
Here, x0 is the mean value and σ is the standard
deviation of the distribution. Usually, several measurements of a
quantity x are performed. The results are binned (assigned
to discrete intervals called bins) as a function of x,
such that y(x) represents the number of counts of
the values in the bin centered around x. The mean value and
standard deviation are then obtained by curve fitting.
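A minimal sketch of this binning-and-fitting procedure, assuming synthetic measurements and SciPy's curve_fit (neither is from the article):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical repeated measurements of x (placeholders).
rng = np.random.default_rng(1)
samples = rng.normal(loc=5.0, scale=1.2, size=2000)

# Bin the measurements; use bin centers as the x values.
counts, edges = np.histogram(samples, bins=40)
centers = 0.5 * (edges[:-1] + edges[1:])

def gaussian(x, amp, x0, sigma):
    return amp * np.exp(-(x - x0) ** 2 / (2.0 * sigma ** 2))

# Fit the binned counts; the best-fit x0 and sigma estimate the mean
# and the standard deviation of the distribution.
popt, _ = curve_fit(gaussian, centers, counts, p0=[counts.max(), centers.mean(), 1.0])
amp_fit, x0_fit, sigma_fit = popt
print(x0_fit, abs(sigma_fit))
```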
Sometimes, it is possible to linearize a nonlinear model.
For example, consider the first equation. If we take the logarithm
of both the left- and the right-hand sides, we get
log(Y) = log(A) + X/X0
Hence, log(Y) is a linear function of X. This
means that one could try to fit the logarithm of Y
data to a linear function of X data to find A
and X0. This approach was widely used before
the age of computers. There are problems with it, however,
including the fact that the transformed data usually do not
satisfy the assumptions of linear regression. Linear regression
assumes that the scatter of points around the line follows
a Gaussian distribution, and that the standard deviation is
the same at every value of the independent variable.
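For comparison, a short sketch of the linearization approach, assuming placeholder data that follow Y = A*exp(X/X0); note that fitting log(Y) implicitly reweights the points, which is one reason a direct nonlinear fit is usually preferred today:

```python
import numpy as np

# Placeholder data assumed to follow Y = A * exp(X / X0).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.3, 5.6, 9.1, 14.8, 24.5])

# Linearize: log(Y) = log(A) + X / X0, then fit a straight line to (x, log y).
slope, intercept = np.polyfit(x, np.log(y), 1)
A_lin = np.exp(intercept)     # A = exp(intercept)
X0_lin = 1.0 / slope          # X0 = 1 / slope
print(A_lin, X0_lin)
```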
Figure 1. In linear regression, the correct values of a and b in the
straight-line equation are those that minimize the value of χ².
To find the values of the model's parameters that yield the curve
closest to the data points, one must define a function that measures
the closeness between the data and the model. This function depends
on the method used to do the fitting, and the function is typically
chosen as the sum of the squares of the vertical deviations from each
data point to the curve. (The deviations are first squared and then
added up to avoid cancellations between positive and negative values.)
This method assumes that the measurement errors are independent and
normally distributed with constant standard deviation.
Linear regression
The simplest form of modeling is linear regression, which is the
prediction of one variable from another when the relationship between
them is assumed to be linear. One variable is considered to be independent
and the other dependent. A linear regression line has an equation
of the form
y = a + bx,
where x is the independent variable and y is the
dependent variable. The slope of the line is b, and a is
the intercept (the value of y when x = 0). The goal
is to find the values of a and b that make the line closest to the
data. The chi-square estimator in this case (where the σi
are the measurement errors; if they are not known, they should all be
set to 1) is given by
χ² = Σi=1..N [(yi - a - bxi)/σi]²
χ² is a function of the parameters of the model;
N is the total number of points. We need to find the values
of a and b that minimize χ². Typically, this means
differentiating χ² with respect to the parameters
a and b and setting the derivatives equal to 0. For
a linear model, these conditions can be written in analytic
form as
Σi (yi - a - bxi)/σi² = 0
Σi xi(yi - a - bxi)/σi² = 0
The resulting system of two equations with two unknowns can be
solved explicitly for a and b, producing a regression line like the
one shown in Figure 1.
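A minimal sketch of this closed-form solution, with placeholder data and measurement errors assumed for the example:

```python
import numpy as np

# Placeholder data with assumed measurement errors sigma_i.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.9, 4.1, 4.8, 6.1])
sigma = np.array([0.2, 0.2, 0.3, 0.2, 0.3])

# Weighted sums that appear in the two normal equations.
w = 1.0 / sigma**2
S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()

# Solve the system explicitly for the intercept a and the slope b.
delta = S * Sxx - Sx**2
a = (Sxx * Sy - Sx * Sxy) / delta
b = (S * Sxy - Sx * Sy) / delta
print(a, b)
```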
Nonlinear models
In most cases, the relationship between measured values and
measurement variables is nonlinear. Nonlinear curve fitting
also seeks to find those parameter values that minimize the
deviations between the observed y values and the expected
y values. In nonlinear models, however, one usually cannot solve
the minimization equations for the parameters analytically. Various iterative
procedures are used instead. Nonlinear fitting always starts
with an initial guess of parameter values. One usually looks
at the general shape of the model curve and compares it with
the shape of the data points. For example, look at the sigmoidal
function in Figure 2.
This function has four parameters. Ab and
At are two asymptotic values (values
that the function approaches but never quite reaches) at
small and large x. The curve crosses over between these
two asymptotic values in a region of x values whose
approximate width is w and which is centered around x0.
For the initial estimate of the values of Ab
and At, we can use the average of the lower
third and the upper third of the Y values of data points.
The initial value of x0 is the average of
X values of the crossover area, and the initial value
of w is the width of the region of points between the two
asymptotic values.
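The following sketch applies these heuristics, assuming a Boltzmann-type sigmoid y = Ab + (At - Ab)/(1 + exp(-(x - x0)/w)) (the specific functional form is an assumption, since the article only shows the curve in Figure 2) and synthetic data:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, Ab, At, x0, w):
    # Approaches Ab at small x and At at large x, crossing over near x0
    # over a width of roughly w.
    return Ab + (At - Ab) / (1.0 + np.exp(-(x - x0) / w))

# Placeholder sigmoidal data.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 60)
y = sigmoid(x, 1.0, 9.0, 5.0, 0.8) + rng.normal(0.0, 0.2, x.size)

# Heuristic initial guesses described in the text.
ys = np.sort(y)
n = ys.size
Ab0 = ys[: n // 3].mean()        # average of the lower third of Y values
At0 = ys[-(n // 3):].mean()      # average of the upper third of Y values

# Crossover region: points whose Y lies in the band between the asymptotes.
band = (y > Ab0 + 0.25 * (At0 - Ab0)) & (y < At0 - 0.25 * (At0 - Ab0))
x_cross = x[band]
x0_0 = x_cross.mean()            # center of the crossover region
w0 = x_cross.max() - x_cross.min()  # approximate width of the crossover region

popt, _ = curve_fit(sigmoid, x, y, p0=[Ab0, At0, x0_0, w0])
print(popt)
```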
Figure 2. In nonlinear curve fitting, such as with this sigmoidal
function, minimizing the deviations between the observed and expected
y values requires estimating initial parameter values and using an
iterative procedure.
A good understanding of the selected model is important because
the initial guess of the parameter values must make sense in the
real world. In addition, good fitting software will try to provide
an initial guess for the functions built into it. The nonlinear
fitting procedure is iterative. The most widely used iterative procedure
for nonlinear curve fitting is the Levenberg-Marquardt algorithm,
which good curve-fitting software implements. Starting from some
initial values of the parameters, the method tries to minimize
χ² = Σi=1..N [(yi - f(xi))/σi]²
by successive small variations of the parameter values and reevaluation
of χ² until a minimum
is reached. The function
f is the one to which we are trying to fit the data, and
xi and yi are the X and
Y values of data points. Although the method involves sophisticated
mathematics, end users of typical curve-fitting software usually
only need to initialize parameters and press a button to start
the iterative fitting procedure.
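As an illustration only (the article does not prescribe any particular software), SciPy's least_squares implements the Levenberg-Marquardt method; the model, data, and errors below are placeholders:

```python
import numpy as np
from scipy.optimize import least_squares

# Placeholder data and model y = A * exp(x / X0) with assumed errors.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.3, 5.6, 9.1, 14.8, 24.5])
sigma = np.full_like(y, 0.5)

def residuals(params):
    # Weighted residuals (y_i - f(x_i)) / sigma_i whose sum of squares is chi-square.
    A, X0 = params
    return (y - A * np.exp(x / X0)) / sigma

# Levenberg-Marquardt iteration starting from a rough initial guess.
result = least_squares(residuals, x0=[3.0, 2.5], method="lm")
print(result.x, np.sum(result.fun ** 2))  # best-fit parameters and chi-square
```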
Weighting
Often, curve fitting is done without weighting. In the above formulas
for χ², the weights σi
are all set to 1. If experimental measurement errors are known,
they should be used as the σi.
They then provide an uneven weighting of different experimental
points: the smaller the measurement error for a particular experimental
point, the larger the weight of that point in χ².
Proper curve-fitting software allows users to supply measurement
errors for use in the fitting process. If the experimental errors
are not known exactly, it is sometimes reasonable to make an
assumption.
For example, if you believe that the deviation of the experimental
data from the theoretical curve is proportional to the value of
the data, then you can use
σi = yi
Here, the weight of each point, which is the inverse of the square
of σi, is inversely
proportional to the square of the value of the data point.
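A sketch of this kind of weighting, assuming SciPy's curve_fit and placeholder data whose errors are taken to be proportional to the measured values:

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder exponential-decay data with errors assumed proportional to the value.
rng = np.random.default_rng(3)
t = np.linspace(0.0, 10.0, 40)
y = 800.0 * np.exp(-0.4 * t) * (1.0 + rng.normal(0.0, 0.05, t.size))

def decay(t, L0, lam):
    return L0 * np.exp(-lam * t)

# Passing sigma_i = y_i gives each point a weight proportional to 1 / y_i**2.
popt, pcov = curve_fit(decay, t, y, p0=[y[0], 0.1], sigma=y, absolute_sigma=False)
print(popt)
```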
Interpreting results
In assessing modeling results, we must first see whether
the program can fit the model to the data; that is,
whether the iteration converged to a set of values for
the parameters.
If it did not, we probably made an error, such as choosing
the wrong model or initial values. A look at the graphed
data points and fitted curve can show if the curve is close to
the data. If it is not, then the fit has failed, perhaps
because the iterative procedure did not find a minimum of χ²
or the wrong model was chosen.
The parameter values must also make sense from the standpoint
of the model. If, for example, one of the parameters is temperature
in kelvins and the value obtained from fitting is negative,
we have obtained a physically meaningless result. This is
true even if the curve seemingly does fit the data well. In
this example, we have likely chosen an incorrect model.
There is a quantitative value that describes how well the
model fits the data: χ² divided by the number of degrees of
freedom (the reduced chi-square). Its square root is the standard
deviation of the residuals, which are the vertical distances of the
data points from the curve. The smaller this value, the better the fit.
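A minimal sketch of this computation, with the observed values, fitted values, errors, and parameter count all placeholders:

```python
import numpy as np

# Placeholders: observed values, values predicted by the fitted curve,
# measurement errors, and the number of fitted parameters.
y_obs = np.array([2.2, 2.9, 4.1, 4.8, 6.1])
y_fit = np.array([2.1, 3.0, 4.0, 4.9, 5.9])
sigma = np.ones_like(y_obs)
n_params = 2

chi2 = np.sum(((y_obs - y_fit) / sigma) ** 2)
dof = y_obs.size - n_params            # degrees of freedom
reduced_chi2 = chi2 / dof              # chi-square per degree of freedom
resid_std = np.sqrt(np.sum((y_obs - y_fit) ** 2) / dof)  # std. deviation of residuals
print(reduced_chi2, resid_std)
```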
Figure 3. The residual plot (blue) shows the deviations of the data
points (black) from the fitted curve (red) and, by its nonrandomness,
indicates a possible hidden variable.
An even better quantitative value is R², known
as the goodness of fit. It is computed as the fraction of
the total variation of the Y values of data points that is
attributable to the assumed model curve, and its values typically
range from 0 to 1. Values close to 1 indicate a good fit.
To assess the certainty of the best-fit parameter values, we look
at the obtained standard errors of parameters, which indirectly
tell us how well the curve fits the data. If the standard error
is small, it means that a small variation in the parameter would
produce a curve that fits the data much worse. Therefore, we say
that we know the value of that parameter well. If the standard error
is large, a relatively large variation of that parameter would not
spoil the fit much; therefore, we do not really know the parameter's
value well.
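A sketch of both quantities, assuming SciPy's curve_fit (whose covariance matrix yields the parameter standard errors) and synthetic data:

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder data and model.
rng = np.random.default_rng(4)
x = np.linspace(0.0, 5.0, 30)
y = 2.0 * np.exp(x / 2.0) + rng.normal(0.0, 0.5, x.size)

def model(x, A, X0):
    return A * np.exp(x / X0)

popt, pcov = curve_fit(model, x, y, p0=[1.0, 3.0])

# Standard errors of the parameters: square roots of the covariance-matrix diagonal.
perr = np.sqrt(np.diag(pcov))

# R-squared: fraction of the total variation in y explained by the model.
residuals = y - model(x, *popt)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(popt, perr, r_squared)
```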
Important intervals
Standard errors are related to another quantity, one usually obtained
from the fitting and computed by curve-fitting software: the confidence
interval. It says how good the obtained estimate of the value of
the fitting function is at a particular value of X. The confidence
interval corresponds to a desired level of statistical confidence,
expressed as a percentage, usually 95% or 99%. For a particular value
of X, we can claim, with that level of confidence, that the correct
value of the fitting function lies within the obtained interval. Good
curve-fitting software
computes the confidence interval and draws the confidence interval
lines on the graph with the fitted curve and the data points. The
smaller the confidence interval, the more likely the obtained curve
is close to the real curve.
Another useful quantity usually produced by curve-fitting software
is the prediction interval. Whereas the confidence interval deals
with the estimate of the fitting function, the prediction interval
deals with the data. It is computed with respect to a desired confidence
level p, whose values are usually chosen to be between 95% and 99%.
The prediction interval is the interval of Y values for a
given X value within which p% of all experimental
points in a series of repeated measurements are expected to fall.
The narrower the interval, the better the predictive nature of the
model used. The prediction interval is represented by two curves
lying on opposite sides of the fitted curve.
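For the straight-line case, both intervals can be written down explicitly; the following sketch uses the textbook t-distribution formulas (not given in the article) and placeholder data:

```python
import numpy as np
from scipy import stats

# Placeholder straight-line data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.0, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])

# Ordinary least-squares straight line.
b, a = np.polyfit(x, y, 1)                        # slope, intercept
y_hat = a + b * x
n = x.size
s = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))   # residual standard deviation
Sxx = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, n - 2)                # 95% two-sided critical value

x_new = np.linspace(x.min(), x.max(), 100)
y_new = a + b * x_new
half_conf = t_crit * s * np.sqrt(1.0 / n + (x_new - x.mean()) ** 2 / Sxx)
half_pred = t_crit * s * np.sqrt(1.0 + 1.0 / n + (x_new - x.mean()) ** 2 / Sxx)
# Confidence band: y_new +/- half_conf; prediction band: y_new +/- half_pred.
print(half_conf[:3], half_pred[:3])
```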
The residuals plot provides insight into the quality of the fitted
curve. Residuals are the deviations of the data points from the fitted
curve. Perhaps the most important information obtained is whether
there are any lurking (hidden) variables. In Figure 3, the data
points are black squares, the obtained fitted line is red, and the
blue line represents residuals.
Normally, if the curve fits the data well, one expects the data-point
deviations to be randomly distributed around the fitted curve. But
in this graph, for the data points whose X values are less than about
22, all residuals are positive; that is, the data points are above
the fitted curve. However, for data points with X values above 22,
all residuals are negative and the data points fall below the fitted
curve. This indicates a nonrandom behavior of residuals, which means
that this behavior may be driven by a hidden variable.
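A crude way to quantify this nonrandomness (an illustration, not a method from the article) is to count how often the residuals change sign; residuals that rarely change sign along X suggest a systematic trend:

```python
import numpy as np

# Placeholders: observed data and the values predicted by the fitted curve at the same X.
y_obs = np.array([3.1, 3.6, 4.4, 5.0, 5.3, 5.5, 5.4, 5.6, 5.5, 5.3])
y_fit = np.array([3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5])

residuals = y_obs - y_fit

# Residuals that rarely change sign hint at a systematic deviation
# (and possibly a lurking variable) rather than random scatter.
signs = np.sign(residuals)
sign_changes = np.count_nonzero(signs[1:] != signs[:-1])
print(residuals, sign_changes)
```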
So curve fitting, although it cannot tell you how, can certainly
tell you when your chosen model needs improvement.
Further reading
- OriginLab
- Least Squares Fitting, from MathWorld
- Press, W. H.; et al. Numerical Recipes in C, 2nd ed.;
Cambridge University Press: New York, 1992; 994 pp.; provides
a detailed description of the Levenberg-Marquardt algorithm.
- Bevington, P. R. Data Reduction and Error Analysis for the
Physical Sciences, McGraw-Hill, New York, 1969.
- Draper, N. R.; Smith, H. Applied Regression Analysis,
3rd edition, John Wiley & Sons, 1998.
Biography
Marko Ledvij is a senior
software developer at OriginLab Corp. in Framingham, Massachusetts
(marko@originlab.com).