Statistical Techniques in the Data Library: A Tutorial

Statistical techniques are essential tools for analyzing large datasets, so this statistics tutorial covers skills that are essential for many Data Library users.

The Data Library is an online tool that provides access to an abundance of climate-related data via one easy-to-use interface. Ingrid, the programming language on which the Data Library is built, contains a variety of functions that can be used to manipulate data. These Ingrid functions are simple to apply because they are executed directly within the Data Library interface. The wide variety of available functions makes the Data Library beneficial to both novice and advanced users.

While the first Data Library tutorial, Navigating Through the Data Library, is primarily focused on introducing the Data Library to new users, this statistics-based tutorial facilitates the use of statistical functions within the Data Library. Although it concentrates on some advanced techniques, its sections still cover many basic skills.

The following topics are included in the tutorial: measures of central tendency, measures of dispersion, climatologies, standardized anomalies, correlations, climate indices, frequency distributions, singular value decompositions, interpolations, and more. An introduction and detailed real-world example are provided for each statistical function.

One of the most common quantities used to summarize a set of data is its center. The center is a single value chosen to represent a typical, or "normal," value of the dataset; common measures include the mean, median, and mode.
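The Data Library computes these measures in Ingrid; as a language-neutral illustration, here is a minimal Python sketch of the three common measures of center, using the standard `statistics` module on made-up data.

```python
# Three common measures of central tendency on a small sample dataset.
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # arithmetic average of all values
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(mean, median, mode)  # mean=5, median=4.0, mode=3
```

Note that the three measures can disagree, as here; which one best represents "normal" depends on how skewed the data are.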
Both running and weighted averages are important filtering methods for statistical analysis.
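As an illustration (in Python rather than Ingrid), the following sketch implements a centered running mean and a weighted average, two simple smoothing filters; the series values are invented for the example.

```python
# Two basic filtering methods for a time series.

def running_mean(series, window):
    """Centered moving average; returns one value per full window."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def weighted_mean(values, weights):
    """Weighted average: each value contributes in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

series = [1, 2, 4, 8, 16]
smoothed = running_mean(series, 3)        # 3-point running mean
wavg = weighted_mean([10, 20], [1, 3])    # 20 counts three times as much
```

A running mean trades the ends of the record (where no full window exists) for a smoother interior; a weighted average lets nearby or more reliable values count for more.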
Climatology is commonly known as the study of our climate, yet the term encompasses many other important definitions. Climatology is also defined as the long-term average of a given variable, often over time periods of 20-30 years.
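In that second sense, a climatology is simply a per-calendar-month (or per-season) long-term mean. A minimal Python sketch of a monthly climatology, using invented temperature data, could look like this:

```python
# Monthly climatology: the long-term average of each calendar month.

def monthly_climatology(monthly_values):
    """monthly_values: list of (month, value) pairs spanning several years."""
    totals, counts = {}, {}
    for month, value in monthly_values:
        totals[month] = totals.get(month, 0.0) + value
        counts[month] = counts.get(month, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}

# Two years of January and July temperatures (arbitrary illustrative numbers)
obs = [(1, -2.0), (7, 21.0), (1, 0.0), (7, 23.0)]
clim = monthly_climatology(obs)  # {1: -1.0, 7: 22.0}
```

Subtracting such a climatology from the raw data yields anomalies, the starting point for the standardized anomalies covered later in the tutorial.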
It is often important to determine if a set of data is homogeneous before any statistical technique is applied to it. Homogeneous data are drawn from a single population.
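One crude screen for inhomogeneity, not a formal statistical test, is to compare the means of the two halves of a record relative to its overall spread; a large shift suggests the record mixes two populations. A hypothetical Python sketch:

```python
# Crude homogeneity screen: does the mean shift between record halves?
import statistics

def halves_differ(series, threshold=1.0):
    """Return True if the two halves' means differ by more than `threshold`
    standard deviations of the whole record. Not a formal test."""
    half = len(series) // 2
    first, second = series[:half], series[half:]
    spread = statistics.stdev(series)
    shift = abs(statistics.mean(first) - statistics.mean(second))
    return shift / spread > threshold

suspect = halves_differ([1, 2, 1, 2, 9, 10, 9, 10])  # abrupt shift -> True
clean = halves_differ([1, 2, 1, 2, 2, 1, 2, 1])      # no shift -> False
```

Formal homogeneity tests exist and are preferable for real analyses; this sketch only illustrates the idea of checking whether one record behaves like two.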
A random variable or random process is said to be stationary if all of its statistical parameters are independent of time. While most statistical techniques require that data be stationary, most atmospheric processes are visibly nonstationary.
While measures of central tendency are used to estimate "normal" values of a dataset, measures of dispersion are important for describing the spread of the data, or its variation around a central value.
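The most common measures of dispersion are the range, variance, and standard deviation. A minimal Python sketch (illustrative data, standard `statistics` module):

```python
# Three common measures of dispersion for a small sample dataset.
import statistics

data = [4, 8, 6, 5, 3, 7]

rng = max(data) - min(data)        # range: the simplest spread measure
var = statistics.pvariance(data)   # population variance (mean squared deviation)
std = statistics.pstdev(data)      # population standard deviation
```

The standard deviation is usually preferred for description because it carries the same units as the data, while the variance carries squared units.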
Correlation is a measure of the linear association between two variables. A single value, commonly referred to as the correlation coefficient, is often used to describe this association.
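The most common such coefficient is Pearson's r: the covariance of the two variables divided by the product of their standard deviations, yielding a value between -1 and 1. A minimal Python sketch on invented data:

```python
# Pearson correlation coefficient from its definition.
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfect linear relation, r ~ 1
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfect inverse relation, r ~ -1
```

Values near zero indicate little linear association; note that r measures only linear relationships, so a strong nonlinear dependence can still yield r near zero.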
Indices are diagnostic tools used to describe the state of a climate system. Climate indices are most often represented with a time series; each point in time corresponds to one index value.
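One common way to construct such an index (though real indices vary in their exact recipes) is to standardize a time series: subtract its long-term mean and divide by its standard deviation, giving one dimensionless value per time step. A hypothetical Python sketch with invented data:

```python
# A simple climate-index construction: a standardized time series.
import statistics

def index_series(series):
    """Standardized anomalies: (value - mean) / standard deviation."""
    mu = statistics.mean(series)
    sd = statistics.pstdev(series)
    return [(v - mu) / sd for v in series]

idx = index_series([10, 12, 14, 16])  # one index value per time step
```

By construction the resulting index has mean zero and unit standard deviation, which makes different indices directly comparable.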
A frequency distribution is one of the most common graphical tools used to describe a single population. It is a tabulation of the frequencies of each value (or range of values).
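As a language-neutral sketch of the tabulation step (the data are invented), the following Python snippet counts how many values fall into each bin of a fixed width:

```python
# Frequency distribution: count values falling into fixed-width bins.
from collections import Counter

def frequency_distribution(values, bin_width):
    """Map each bin's lower edge to the number of values it contains."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))

freq = frequency_distribution([1, 3, 4, 7, 8, 8, 12], 5)
# {0: 3, 5: 3, 10: 1}: three values in [0,5), three in [5,10), one in [10,15)
```

Plotting these counts as bars gives the familiar histogram; the choice of bin width controls how much detail versus smoothness the picture shows.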
Singular value decomposition (SVD) is quite possibly the most widely used multivariate statistical technique in the atmospheric sciences. The technique was first introduced to meteorology in a 1956 paper by Edward Lorenz, in which he referred to the process as empirical orthogonal function (EOF) analysis. Today, it is also commonly known as principal-component analysis (PCA). All three names are still used, and refer to the same set of procedures within the Data Library.
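As an illustration of the mechanics (in Python with NumPy, not Ingrid, and on a made-up space-time field), EOF analysis amounts to removing the time mean and applying SVD: the right singular vectors are the spatial patterns (EOFs), the scaled left singular vectors are their time series (principal components), and the squared singular values give each mode's share of the variance.

```python
# EOF analysis via SVD on a toy space-time field.
# Rows are time steps, columns are spatial points (invented data).
import numpy as np

field = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0],
                  [3.0, 6.0, 9.0]])

anomalies = field - field.mean(axis=0)      # remove the time mean at each point
u, s, vt = np.linalg.svd(anomalies, full_matrices=False)

eofs = vt                                   # spatial patterns (EOFs)
pcs = u * s                                 # principal-component time series
variance_explained = s**2 / np.sum(s**2)    # fraction of variance per mode
```

In this toy field every point varies in lockstep, so the leading mode explains all of the variance; real fields spread variance across many modes, and one typically retains only the leading few.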
Interpolation is the process of using known data values to estimate unknown data values.
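The simplest case is linear interpolation between two known points, which many gridded-data operations build on. A minimal Python sketch:

```python
# Linear interpolation between two known points (x0, y0) and (x1, y1).

def linear_interpolate(x0, y0, x1, y1, x):
    """Estimate y at x by drawing a straight line through the known points."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

y = linear_interpolate(0, 10, 10, 20, 4)  # 4 is 40% of the way, so y = 14.0
```

More elaborate schemes (bilinear, spline, and others) refine the same idea, but all share the principle of estimating unknown values from surrounding known ones.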