Fitting Data II =============== In the previous lecture, we did a very simple fit of a straight line to some data, extracting the value of slope of the data. **But how does gnuplot define the** "*Best*" **fit?** This goes under a very old branch of mathematics called "*Linear Regression*" (when fitting linear functions), or "Least Squares*" fitting, or sometimes "*Data Modelling*". In it's simplest form, the procedure works like this. We'll consider a very simple linear fit to three data points. Suppose your experiment gives you these three data points: :: 1.0 0.7 # x1, y1 2.0 0.8 # x2, y2 3.0 1.7 # x3, y3 You define a function `f(x) = a*x + b` which has some parameters that you can change, **a** and **b** in this case. Each different value of `a` and `b` gives a different line; three different possible line examples are shown below. But again, how do you know which line is the *best* fit (*i.e.* comes closest to the data)? For each data point (labeled by *i = 1,2,3*), we compute the **difference** between our guess function :math:`f(x_i) = a x_i + b` (at :math:`x_i`) , and the :math:`y_i` value of the real data. .. image:: imgs/lsdata.jpg This difference is called the *Residual* at :math:`x_i` . The residuals of each data point are shown as the little vertical lines ( | ) in the figure above. Since these can be either positive or negative, we *square them* to get a positive number irregardless of whether the data is above or below the line. Then we sum all the squares of residuals. .. math:: R &= \sum_i \left[ y_i - f(x_i) \right]^2 \\ &= \sum_i \left[ y_i - (ax_i + b)\right]^2 Then: the **fit** program changes the parameters ( 'a' and 'b' in our case, but there could be more) to get the *lowest sum of the Squares* of the Residuals , *min{R}*, hence the name "**Least Squares Fit**". For those who know calculus, we are really doing is taking the derivative of *R* with respect to *a* and *b* and setting those expressions to zero; this finds where *R* has a minimum as we vary *a* and *b*. This gives two equations that we have to solve, sumultaneously. .. math:: \frac{dR}{da} = -2\sum_i \left[ y_i - (ax_i + b)\right]x_i &= 0 \\ \frac{dR}{db} = -2\sum_i \left[ y_i - (ax_i + b)\right] &= 0 These equations are actually easy to solve (email me if you need help). They are really just a *system of two coupled linear* equations, which you could use *Linear Algebra* to solve! .. math:: a \alpha_1 + b \beta_1 &= \gamma_1 \\ a \alpha_2 + b \beta_2 &= \gamma_2 where .. math:: \alpha_1 &= \sum_i x_i^2 \\ \beta_1 &= \sum_i x_i \\ \gamma_1 &= \sum_i x_i y_i \\ \alpha_2 &= \sum_i x_i \\ \beta_2 &= \sum_i = N \\ \gamma_2 &= \sum_i y_i It's easy to solve the equations above for :math:`a` and :math:`b`: .. math:: a &= \frac{\gamma_1\beta_2 - \gamma_2\beta_1}{\alpha_1\beta_2 - \alpha_2\beta_1} = \frac{(\sum_i x_i y_i)(N) - (\sum_i y_i)(\sum_i x_i)}{(\sum_i x_i^2)(N) - (\sum_i x_i)(\sum_i x_i)} \\ ~~&~~ \\ b &= \frac{\gamma_2\alpha_1 - \gamma_1\alpha_2}{\alpha_1\beta_2 - \alpha_2\beta_1} = \frac{(\sum_i y_i)(\sum_i x_i^2)-(\sum_i x_i y_i)(\sum_i x_i)}{(\sum_i x_i^2)(N) - (\sum_i x_i)(\sum_i x_i)} However, gnuplot doesn't actually solve them! It just varies *a* and *b* and sees how much *R* changes. If increasing *a* decreases *R*, then gnuplot increases *a* a little more. Once changing the parameters by a little amount always increases *R*, **fit** stops and prints out the resulting parameters: :: After 4 iterations the fit converged. final sum of squares of residuals : 0.106667 rel. change during last iteration : -3.71111e-09 degrees of freedom (FIT_NDF) : 1 rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.326599 variance of residuals (reduced chisquare) = WSSR/ndf : 0.106667 Final set of parameters Asymptotic Standard Error ======================= ========================== ** a = 0.5 +/- 0.2309 (46.19%) b = 0.0666667 +/- 0.4989 (748.3%) ** This gives the paramerters of the line that is the "*Best*" fit to the data, as well as some information about how good the fit is (how big was the sum of the squares of the residuals). This *Best* line looks like this. It's the one that's closest to the data. .. image:: imgs/lsdatabest.jpg For more information on the *Least Squares* method can be found here: + `http://mathworld.wolfram.com/LeastSquaresFitting.html`_ + `http://www.mathwords.com/l/least_squares_regression_line.htm`_ as well as any good math book on statistics. As you can see from the figure below, we need not limit ourselves to linear functions--you could do the same with almost any function and data. .. image:: imgs/LSparab.jpg Your job as a scientist is to have insight (usually from *theory*) as to which function is correct. In this case, you want a quadratic function. You would do the following :: gnuplot> f(x) = a*x**2 + b*x + c gnuplot> fit f(x) "data2.dat" via a ,b, c gnuplot> plot "data2.dat", f(x) Do you see that we've defined a quadratic function: :math"`a x^2 + b x + c` above? Try this yourself . Here's the data: `data2.dat `_ Here's my fit. .. image:: imgs/parafit.jpg .. topic:: Operator Binding or Precedence Above, when writing about how to enter into gnuplot the fit function for a parabola, I wrote `f(x) = a\*x\*\*2 + b\*x + c` without parentheses. This probably bothers several of you, who would rather see the function written as: `f(x) = a\*(x\*\*2) + b\*x + c` which is "*safer*" . As with most computer math operators (like \* and /), the "exponent" operator `\*\*` acts only on the previous thing. Clearly `(a\*x)\*\*2` is much different than `a\*(x\*\*2)`. I have been a bit lazy and used the fact that the exponent operator "*binds tighter*" than the multiplication operator. When gnuplot reads the line `a\*x\*\*2 + b\*x + c` it first squares `x` before anything else, because `\*\*` has the highest "*operator precedence*" or "*tightest binding*". Then it does the multiplications `a\*(x\*\*2)` and `b\*x` Finally, it does the additions because `+` has lowest precedence. But we could also use: `a\*(x\*\*2) + b\*x + c` Using the "*Using*" command --------------------------- A very *VERY* important feature of gunplot's `plot` and `fit` functions is the **using** argument. It has *MANY* uses, some of which we'll look at below. Plotting data from different columns ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Suppose your data file has separate columns of data that are all measured at the same time. You have an example of this in your "`YOURNAMEcalpop.dat`" file. Go ahead and find it now. Before we look at it in gnuplot however, we will have to edit it slightly. It will look something like this (right?): :: Year Total US Total California 1900 76212168 1485053 1910 92228496 2377549 ... Start emacs and Open `YOURNAMEcalpop.dat`. We have to make the first line (or lines) into a comment, ortherwise gnuplot will bail trying to plot the number "*Year*". Remember how to make a line into a comment? If not see here: `Plotting Data from a File `_. (Simply put a `#` at the beginning of the line. ) Now our file looks like this: :: #Year Total US Total California 1900 76212168 1485053 1910 92228496 2377549 ... We can do a normal plot: :: plot "calpop.dat"` which plots column 1 and 2 on the x and y axes. However look at what this command does :: gnuplot> plot "calpop.dat", "calpop.dat" using 1:3 | .. image:: imgs/calpopusing13.jpg While the first plot is the normal default one--columns 1:2 on x : y , the second part, `calpop.dat" using 1:3`, plots the data in `calpop.dat` again, but this time `using` columns 1 and 3. You can simplify this with the abreviated version: :: gnuplot> plot "calpop.dat", "" u 1:3 The empty `""` means use the same datafile as before (with older versions of gnuplot you will have to supply the datafile name again), and `u` is short for *using*. Also, since the *Key* is in the way of the data in the upper right, let's move it, and give the key items *title*\s other than the default names for the `plot` command. :: gnuplot> set key box left gnuplot> plot "calpop.dat" title "US Population", "" u 1:3 title "CA Population" pointtype 7 .. image:: imgs/calpopusingclean.jpg This looks very nice. Notice that we: + moved the *Key* and put it in a box, with the single line: `set key box left` + added *titles* to the key items with `title "string"`. The title character string must be enclosed in quotes. + changed the *point type* used for displaying the data by specifying `pointtype 7` This long command to gnuplot can be considerably simplified by abreviating `title` and `pointtype` to :: gnuplot> plot "calpop.dat" t "US Population", "" u 1:3 t "CA Population" pt 7 You can see the different `pointtype`s, and some other info, by typing :: gnuplot> test .. image:: imgs/gptest.jpg The *pointtypes* (and *linetypes*) are shown on the right side). | Of course you could do more if you had more columns, such as `plot "datafile" u 1:4, "" u 1:3, "" u 3:2` etc. Notice that the last instance, `u 3:2`, plots the 3rd column as the *x* values and 2nd column as *y* values of the points. Very nice, huh? But now--*Check this out*! | ----------------------------------- | You can modify the data on the fly with *using*! Try these examples: :: gnuplot> plot "calpop.dat", "" u 1:(10*$3) gnuplot> plot "calpop.dat" u 1:(1.0/$2) gnuplot> plot "calpop.dat" u 1:($3/$2) In the above examples we did :: u 1:(10*$3) -> plots 10 times the -value- in column 3 u 1:(1.0/$2) -> plots 1.0/y for column 2 values u 1:($2/$3) -> plots the ratio of the values in columns 3 and 2 | Here's the syntax for modifying data with *using*: **Simple** *i.e.* just plot different columns: + The *using* command takes the form `u 1:4` to plot columns 1=x and 4=y, for example. No parentheses needed to just rearrange the columns used for *x* and *y*. If you want to refer to the **value** in a particular column, so that you can multiply, divide, or act with other math functions on it, etc. **Manipulating column values** in the plot: + instead of a column number [example `u 1:4`], you must enclose the column placeholder in **()**\s. + *inside the ()*\s, you must refer to **value** of a datum in column *n* as **$n**. Like this: `u 1:($4)` + then, you can treat **$n** as the **value** of the datum for that point. Here's an example. Our *Rutherford* :math:`\alpha` *-particle scattering* experiment produces the following data: :: file: ruth.dat: #theta flux -30 27.6 -25 65.4 -20 195.2 -15 762.4 -10 4740.8 -5 14392.6 0 9139.8 5 1957.6 10 361.6 15 113.6 20 35.8 25 23.2 30 12.2 and the following gnuplot commands create this plot: :: gnuplot> set xlabel "angle theta (degrees)" gnuplot> set ylabel "Counts per 1000 sec" gnuplot> plot "ruth.dat" t "Run 1" pt 5 .. image:: imgs/rutherford1.jpg According to `theory `_, these data should follow a curve that looks like this, where A and :math:`\theta_o` are constants (to be solved for): .. math:: N(\theta) = \frac{A}{sin^4[(\theta-\theta_o)/2)} It turns out that the `fit` procedure has a hard time with this because of the singularity when :math:`\theta` is zero. This is a common issue for curve fitting algorithms; if the fit function has a singularity, they often fail. ---------- | However, if we were to plot *the inverse* of our data: :math:`1/N(\theta)`, our fit function to those data would be .. math:: \frac{1}{N}(\theta) = \frac{1}{A} sin^4[(\theta-\theta_o)]/2) This function simply goes to zero as :math:`\theta \rightarrow 0`, instead of blowing up. So the `fit` command is stable. It's really simple to plot the *inverse* of our data from the same file: :: gnuplot> plot "ruth.dat" **u (pi*$1/180):(1.0/$2)** pt 5 Notice that I also converted the angle (in column 1) to radians. Now it's easy to fit this data: :: gnuplot> fit f(x) "ruth.dat" u (pi*$1/180):(1.0/$2) via a,x0 ... After 12 iterations the fit converged. final sum of squares of residuals : 5.97834e-05 rel. change during last iteration : -4.08902e-07 degrees of freedom (FIT_NDF) : 11 rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.00233128 variance of residuals (reduced chisquare) = WSSR/ndf : 5.43486e-06 Final set of parameters Asymptotic Standard Error ======================= ========================== a = 12.394 +/- 0.3675 (2.965%) x0 = -0.0577061 +/- 0.003981 (6.899%) .. image:: imgs/rutherford3.jpg *Shazzam!* | --------------------- | Mathematical Manipulations of Data with *Using* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ With the `using` command, you can do simple or complicated mathematical transformations of your data when you plot it. A very simple example is this. Suppose you you have these two data sets: + `h1.dat `_ + `h2.dat `_ If you plot them, :: gnuplot> plot "h1.dat", "h2.dat" they look like this: .. image:: imgs/h1h2.jpg | The `h1` data shows some interesting structure: clearly there is something happening at around *x = 25*, but it's pretty hard to see what's going on with `h2`. Have a look at the data files by scrolling throuh them with `less`: `less h1.dat` :: ... 1.130134e+01 70 1.230134e+01 79 1.330134e+01 93 1.430134e+01 104 1.530134e+01 134 1.630134e+01 146 1.730134e+01 139 ... and `less h2.dat` :: ... 1.275000e+01 0.426 1.325000e+01 0.346 1.375000e+01 0.324 1.425000e+01 0.292 1.475000e+01 0.252 1.525000e+01 0.198 ... Ah. Now we see what's going on. Where `h1.dat` has values around ~100, `h2.dat`'s values are around ~01. Perhaps we are looking at the energy spectra from two different sources, one much more intense than the other. Let's adjust this with the `using` command and scale the `h2` data. Plot the data this time with: :: gnuplot> plot "h1.dat", "h2.dat" u 1:(10*$2) .. image:: imgs/h1h2x10.jpg Notice the syntax to the *using* (`u`) command: **u 1:(10*$2)**. This means "use the column 1 values as they are for *x*, then take the value of the column 2 entry ($2), multiply it by 10, and use it for the *y* value (10*$2) of the point to be plotted. Now we see something emerging--a little bump at *~x = 10*. We can blow this up even more by increasing our "scale factor" in the `using` command. Try: :: gnuplot> plot "h1.dat", "h2.dat" u 1:(100*$2) gnuplot> plot "h1.dat", "h2.dat" u 1:(400*$2) .. image:: imgs/h1h2x400.jpg Now we can see something important. The `h1` data has a peak around ~25, whereas the `h2` data peaks around ~10. This is a simple use of the `using` feature. | While you need to know what you are doing, know that you can use any of the `built-in math functions `_ with `using`. In fact you can build quite complicated functions of your own and apply them. Here's a very simple example: Suppose your file looked like this: :: # x y 1 1 2 10 3 100 4 1000 the *using* command :: u 1:(2*log10($2)+1)` would plot these data as if they were :: # x new y 1 2*0+1 = 1 2 2*1+1 = 3 3 2*2+1 = 5 | Using and If/Then ~~~~~~~~~~~~~~~~~ Here is another of my favorite handy *using* examples. Make sure you understand what it does. :: plot "calpop.dat" u 1:($1>=1950 ? $2 : 1/0) pt 5, "" u 1:($1<1950 ? $2 : 1/0) pt 7 .. image:: imgs/calpopmasked.jpg | This uses the fact that `1/0`, *division by zero*, evaluates to undefined , so gnuplot ignores that point. The IF/THEN (actually called the "*ternary*") operator works like this: :: y = Conditional ? Value if True : Value if False Ususally, when we write **y = x** we are set **y** to the *value* of **x**. So if the value of *x = 5*, then after the statement *y = x*, the *value of y* is now 5. In the *ternary* operation, in order to determine the value that *y* should be set to, the *Conditional* statement is evaluated. A *Conditional* statement is either **True** or **False**, like `x > 0` or `x <= 10`. If the *Conditional* is *True* then the value used for *y* is the first value--the one between the **?** and the **:**. If the *Conditional* is *False*, then the second value is used, the one after the **:**. In the example above, :: u 1:($1>=1950 ? $2 : 1/0) the first part of the `using` operator `u 1:`, takes the first column of the data file (Year) for the *x* value of the plotted point, as per normal. However the second part `($1>=1950 ? $2 : 1/0` runs the *Conditional* first: `$1>=1950`, and checks if the year ($1 is the value in the *1st*-column) is less than 1950; if YES it is, the *Conditional* returns **True** and the "normal" value as listed in the file: *$2* meaning the value from column 2 is used for the *y* coordinate of the point. If the *Conditional* is *NOT* satisfied (evaluates as **False**), then the *y* coordinate is set to 1/0, which is undefined. So gnuplot ignores this point and does not plot it. The result of the above `using` expression is that only points whose year is greater or equal to 1950 are plotted. To make the plot shown above, with red squares and green circles, plotted two versions of the data, each with its own conditional. For the first data set in the plot, the `u` command asks if the *x* value (the year) is greater or equal to 1950. If so, it is plotted as is, with `pointtype (pt)` 7. In the next data set in the plot, if *x* (the year) is less than 1950, the `u` command returns that point's normal *y* value, and it is plotted (this time with `pt 7`). In the second case, points for years > 1950 are *not* plotted. This is a handy way to suppress plotting or fitting of some part of your data. | --------------- | **More fun with the Ternary** | Finally, one last example. This one goes out to all of you who have to take *Digital Signal Processing*. First, let me introduce a couple interesting math functions. :: floor(x) returns the largest integer less than x. Thus: floor(4.73) = 4 floor(9.99999) = 9 floor(pi) = 3 etc. and the **mod** operation, for which we use the percent symbol: `%` :: For two integers a and b: a % b returns the remainder of integer division. Thus: 7%3 = 1 12%6 = 0 24%5 = 4 etc. A convenient *Conditional* to check if an integer *a* is EVEN, is :: a%2 == 0 If this is **True**, then *a* must be even (divisible by 2 with no remainder). If it's **False**, then the remainder is 1, and the number is ODD. The `mod` operation only works for *integers* however, but you can combine the `floor` operation to extract the integer part of a number, then use that in a `mod` expression. | Now consider this graph. I have `set samples 1000` to get a finer plot with 1000 sample points, rather than the default 100 points. That way, my "edges" are more vertical. If you `set samples 100` back to the default, you'll see what I mean. :: gnuplot> f(x) = floor(x)%2 == 0 ? 0 : 1 gnuplot> g(x) = floor(x/3)%2 == 0 ? 2: 3 gnuplot> set zeroaxis gnuplot> set samples 1000 gnuplot> plot [0:20][-1:5] f(x), g(x) .. image:: imgs/digital.jpg .. _data2.dat: :///home/jhetrick/www/phys27/pages/gnuplot/data2.dat .. _http://www.mathwords.com/l/least_squares_regression_line.htm: http://www.mathwords.com/l/least_squares_regression_line.htm .. _page 4: :///home/jhetrick/www/phys27/pages/gnuplot/page4.html .. _http://mathworld.wolfram.com/LeastSquaresFitting.html: http://mathworld.wolfram.com/LeastSquaresFitting.html