Statistics Tutorial

Statistical Techniques in the Data Library: A Tutorial

Statistical techniques are essential tools for analyzing large datasets; this statistics tutorial therefore covers the topics that matter most to the majority of Data Library users.

The Data Library is an online tool that provides access to a wealth of climate-related data through an easy-to-use interface. Ingrid, the programming language on which the Data Library is built, offers a variety of functions for manipulating data. These Ingrid functions are simple to use because they are executed directly through the Data Library interface. The wide variety of available functions serves novice and experienced Data Library users alike.

While the first Data Library tutorial, Navigating the Data Library, aims mainly to familiarize new users with the Data Library, this statistics tutorial helps users apply statistical functions in the Data Library. Although it focuses on certain advanced techniques, its sections also cover many basic skills.

The tutorial covers the following topics: measures of central tendency, measures of dispersion, climatologies, standardized anomalies, correlations, climate indices, distributions, singular value decomposition, and interpolation. The tutorial for each statistical function consists of an introduction and a practical example.

Measures of Central Tendency

One of the most common quantities used to summarize a set of data is its center. The center is a single value, chosen so that it gives a reasonable approximation of what is "normal" for the data.
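As a small illustration of how two common measures of center can differ, the sketch below compares the mean and the median of a short series. It is written in Python rather than Ingrid, and the rainfall values are hypothetical, chosen only to include one outlier.

```python
import statistics

# Hypothetical monthly rainfall totals (mm); illustrative values only.
rainfall_mm = [12.0, 15.0, 11.0, 14.0, 90.0, 13.0, 16.0]

# Arithmetic mean: pulled upward by the single 90 mm value.
mean_mm = statistics.mean(rainfall_mm)

# Median: the middle value of the sorted data, little affected by the outlier.
median_mm = statistics.median(rainfall_mm)

print(f"mean = {mean_mm:.2f} mm, median = {median_mm:.2f} mm")
```

When a dataset contains extreme values, the median is often the more representative choice of center.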

Running and Weighted Averages

Both running and weighted averages are important filtering methods for statistical analysis.
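A minimal Python sketch of both filters follows. The function names `running_mean` and `weighted_mean` are hypothetical helpers written for this example; they are not Ingrid functions, and the Data Library's own implementation may handle window edges differently.

```python
def running_mean(values, window):
    """Simple running mean: one output value per full window of the input."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def weighted_mean(values, weights):
    """Average in which each value contributes in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


series = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = running_mean(series, 3)                   # 3-point running mean
weighted = weighted_mean([10.0, 20.0], [1.0, 3.0])   # second value counts 3x

print(smoothed, weighted)
```

Note that the running mean returns fewer points than the input because windows that would extend past either end of the series are dropped.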

Climatologies and Standardized Anomalies

Climatology is commonly known as the study of climate, yet the term has other important meanings. Climatology is also defined as the long-term average of a given variable, often over time periods of 20-30 years.
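The sketch below shows one way to compute a monthly climatology and the corresponding standardized anomalies in Python. It assumes the usual definition of a standardized anomaly, the departure from the climatological mean divided by the climatological standard deviation. The records are hypothetical and span only three years to keep the example short.

```python
import statistics
from collections import defaultdict

# Hypothetical (year, month, temperature in deg C) records. A real
# climatology would typically use 20-30 years of data.
records = [
    (2001, 1, 4.0), (2002, 1, 6.0), (2003, 1, 8.0),
    (2001, 7, 24.0), (2002, 7, 25.0), (2003, 7, 26.0),
]

# Group the values by calendar month.
by_month = defaultdict(list)
for _year, month, value in records:
    by_month[month].append(value)

# Climatology: long-term mean for each calendar month.
climatology = {m: statistics.mean(v) for m, v in by_month.items()}

# Sample standard deviation for each calendar month.
spread = {m: statistics.stdev(v) for m, v in by_month.items()}

# Standardized anomaly: (value - monthly mean) / monthly standard deviation.
standardized = {
    (year, month): (value - climatology[month]) / spread[month]
    for year, month, value in records
}

print(climatology)
print(standardized)
```

Because each month is scaled by its own variability, standardized anomalies from different months or locations can be compared directly.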

Data Homogeneity

It is often important to determine if a set of data is homogeneous before any statistical technique is applied to it. Homogeneous data are drawn from a single population.

Stationarity

A random variable or random process is said to be stationary if all of its statistical parameters are independent of time. While most statistical techniques require that data be stationary, most atmospheric processes are visibly nonstationary.

Measures of Dispersion

While measures of central tendency are used to estimate "normal" values of a dataset, measures of dispersion are important for describing the spread of the data, or its variation around a central value.
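For illustration, the following Python sketch computes three common measures of dispersion: the range, the variance, and the standard deviation. The data values are hypothetical, and the population forms of the variance and standard deviation are used here as an assumption; the sample forms (`statistics.variance`, `statistics.stdev`) divide by n - 1 instead.

```python
import statistics

# Hypothetical data values; illustrative only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Range: distance between the largest and smallest values.
data_range = max(values) - min(values)

# Population variance: mean squared deviation from the mean.
variance = statistics.pvariance(values)

# Population standard deviation: square root of the variance,
# expressed in the same units as the data.
std_dev = statistics.pstdev(values)

print(data_range, variance, std_dev)
```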

Correlation

The correlation is defined as the measure of linear association between two variables. A single value, commonly referred to as the correlation coefficient, is often needed to describe this association.
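The correlation coefficient described above is usually the Pearson coefficient, which can be sketched in Python as follows. The helper name `pearson_r` and the sample data are hypothetical and are not part of the Data Library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (proportional to the covariance).
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Root sums of squared deviations (proportional to the standard deviations).
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cross / (dev_x * dev_y)


x = [1.0, 2.0, 3.0, 4.0]
r_positive = pearson_r(x, [2.0, 4.0, 6.0, 8.0])   # perfectly linear, increasing
r_negative = pearson_r(x, [8.0, 6.0, 4.0, 2.0])   # perfectly linear, decreasing

print(r_positive, r_negative)
```

The coefficient ranges from -1 to 1; values near zero indicate little linear association, although a nonlinear relationship may still exist.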

Climate Indices

Indices are diagnostic tools used to describe the state of a climate system. Climate indices are most often represented with a time series; each point in time corresponds to one index value.

Frequency Distributions

A frequency distribution is one of the most common graphical tools used to describe a single population. It is a tabulation of the frequencies of each value (or range of values).
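The tabulation step can be sketched in a few lines of Python. The helper `frequency_table` and the data are hypothetical; the sketch assumes equal-width bins that start at multiples of the bin width, and it uses integer data so that bin edges are exact.

```python
from collections import Counter


def frequency_table(values, bin_width):
    """Count the values in each half-open bin [lower, lower + bin_width)."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    # Floor division maps each value to the lower edge of its bin.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical daily rainfall amounts (mm), as integers.
rain_mm = [1, 3, 7, 12, 14, 18, 21]
table = frequency_table(rain_mm, 10)

for lower, count in table.items():
    print(f"{lower:>3} to {lower + 10:>3}: {count}")
```

Plotting these counts as bars gives the familiar histogram view of the distribution.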

Singular Value Decomposition

Singular value decomposition (SVD) is quite possibly the most widely used multivariate statistical technique in the atmospheric sciences. The technique was first introduced to meteorology in a 1956 paper by Edward Lorenz, in which he referred to the process as empirical orthogonal function (EOF) analysis. Today, it is also commonly known as principal-component analysis (PCA). All three names are still used, and refer to the same set of procedures within the Data Library.
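To make the relationship between SVD, EOFs, and principal components concrete, here is a minimal sketch using NumPy's `numpy.linalg.svd`. It is not the Data Library's procedure: the synthetic data, the array layout (rows are time steps, columns are grid points), and the variable names are all assumptions made for the example.

```python
import numpy as np

# Synthetic data: one spatial pattern with a random amplitude, plus weak noise.
rng = np.random.default_rng(0)
time_steps, grid_points = 50, 4
pattern = np.array([1.0, -1.0, 0.5, 0.0])
amplitude = rng.standard_normal(time_steps)
noise = 0.1 * rng.standard_normal((time_steps, grid_points))
data = np.outer(amplitude, pattern) + noise

# Remove the time mean at each grid point so the analysis works on anomalies.
anomalies = data - data.mean(axis=0)

# Thin SVD: anomalies = u @ diag(s) @ vt
u, s, vt = np.linalg.svd(anomalies, full_matrices=False)

eofs = vt                            # each row is a spatial pattern (EOF)
pcs = u * s                          # each column is a principal-component time series
explained = s**2 / np.sum(s**2)      # fraction of total variance per mode

print("explained variance fractions:", np.round(explained, 3))
```

Multiplying `pcs` by `eofs` reconstructs the anomaly matrix, and keeping only the first few modes gives a low-dimensional approximation of the data.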

Interpolation Techniques

Interpolation is the process of using known data values to estimate unknown data values.
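The simplest case, piecewise-linear interpolation in one dimension, can be sketched in Python as follows. The function `linear_interpolate` and the sample points are hypothetical; other interpolation schemes weight the known values differently.

```python
def linear_interpolate(x_known, y_known, x_new):
    """Estimate y at x_new by linear interpolation between neighboring points.

    x_known must be strictly increasing. Values outside its range are
    rejected rather than extrapolated.
    """
    if len(x_known) != len(y_known) or len(x_known) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    if not x_known[0] <= x_new <= x_known[-1]:
        raise ValueError("x_new lies outside the range of known points")
    for i in range(len(x_known) - 1):
        x0, x1 = x_known[i], x_known[i + 1]
        if x0 <= x_new <= x1:
            # Fractional position of x_new between the two neighbors.
            t = (x_new - x0) / (x1 - x0)
            return y_known[i] + t * (y_known[i + 1] - y_known[i])
    raise ValueError("x_known must be strictly increasing")


# Hypothetical known values at three positions.
x_known = [0.0, 10.0, 20.0]
y_known = [5.0, 15.0, 10.0]

estimate_a = linear_interpolate(x_known, y_known, 5.0)    # halfway between 5 and 15
estimate_b = linear_interpolate(x_known, y_known, 15.0)   # halfway between 15 and 10

print(estimate_a, estimate_b)
```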