R (programming language)

R is a free programming language for statistical computing and statistical graphics. It was modeled on the programming language S and is largely compatible with it. Its developers also drew on the Scheme programming language.

R is part of the GNU Project and is available on many platforms. R is increasingly regarded as the standard language for statistical problems in both commercial and scientific settings (although SAS is also very popular, mainly in the commercial sector). In the TIOBE index of December 2013, R ranked 38th. Employees with good R knowledge who took part in the Dice Tech Salary Survey (2013; 17,236 respondents in total, mainly American employees from the technology sector) had a higher average income than employees with other IT skills.

History

R was developed in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland. They closely followed S, a language developed at Bell Laboratories for processing statistical data, so that the majority of programs written for S run under R. S version 4 in particular was taken into account; the R source code, however, was written from scratch. Another source of inspiration was the Scheme language. The software was first presented to the public in 1993, and since June 1995 R has been licensed under the GNU General Public License.

Properties

R is a case-sensitive (i.e. upper and lower case letters are distinguished) interpreted language that executes user input on the command-line console immediately when the Enter key is pressed. Programs can also be run from scripts. Entering 2 * 4 ↵ looks like this:

> 2 * 4

[1] 8

The output [1] 8 means that 8 is the first element of a result vector. Accordingly, the simplest data structure in R is the vector. Vectors and matrices must in principle contain elements of a single type, for example numeric, complex or character (character strings). Arithmetic operations are applied to all elements of such a data structure. When values of different types are assigned to one vector, the data are converted so that a uniform type results; numbers, for example, become strings.
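The behavior described above can be tried directly in the console. The following is a minimal sketch; the variable names x and mixed are chosen only for illustration.

```r
# A numeric vector; arithmetic is applied element by element.
x <- c(1, 2, 3)
x * 2

# Mixing numbers, strings and logical values in one vector
# converts every element to a character string.
mixed <- c(1, "a", TRUE)
class(mixed)

# A single number is itself a vector of length one.
length(8)
```

The call class(mixed) reports "character", which shows that the number and the logical value were converted to strings.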

In addition to these homogeneous data structures, data frames are often used for the data sets to be analyzed statistically. Data frames have the form of a matrix but may contain columns of different data types.

There are also lists. An R list may contain arbitrary data structures and data types. Many statistical functions return lists.
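A short sketch of both structures follows. The data frame people and the list result are invented for illustration, and stringsAsFactors = FALSE is set explicitly so that the text column stays a character vector in every R version.

```r
# A data frame: columns of equal length, each with its own type.
people <- data.frame(
  name   = c("Ann", "Ben", "Cem"),
  size   = c(176, 166, 172),
  smoker = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
sapply(people, class)

# A list may hold objects of any kind, including a whole data frame.
result <- list(label = "demo", values = 1:3, table = people)
result$values

# The object returned by lm() is an example of a function result
# that is stored as a list.
fit <- lm(size ~ smoker, data = people)
is.list(fit)
names(fit)
```

Individual list elements are accessed with the $ operator, as in result$values or fit$coefficients.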

Packages

The functionality of R can be extended by a large number of packages and adapted to specific statistical problems from various application areas. Many packages can be selected directly from a list available in the R console and installed automatically. The central archive for these packages is the Comprehensive R Archive Network (CRAN). Bioconductor, a software project based on R, provides extensions for bioinformatics, in particular for the analysis of gene expression data. As of January 2013, there were over 4,200 packages on CRAN and over 600 packages on Bioconductor.
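Packages can also be installed from code rather than from the console menu. The following sketch assumes a hypothetical helper named ensure_package and the mirror address https://cloud.r-project.org; install.packages and requireNamespace are the standard functions it relies on.

```r
# Hypothetical helper: make sure a package is available,
# installing it from a CRAN mirror first if it is missing.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # TRUE if the package can now be loaded, FALSE otherwise.
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing is downloaded here.
ok <- ensure_package("stats")
ok
```

For a package that is not yet installed, such as ggplot2 on a fresh installation, the same call would trigger a download from the chosen mirror.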

With PL/R, the language can also be used as an extension of PostgreSQL for server-side programming.

The foreign package makes it possible to read and analyze data from other statistical programs such as SPSS, Stata or Minitab. RMySQL provides an interface to MySQL databases, and XML files can be read using the XML package.
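As a sketch of how the foreign package is used, the following round trip writes a small data frame in Stata format and reads it back. It assumes that foreign is installed (it is normally shipped with R as a recommended package); the data frame d and the temporary file are illustrative only. The package offers read.spss and read.mtp for SPSS and Minitab files in the same way.

```r
library(foreign)

# Small example data set, using values from the correlation example in this article.
d <- data.frame(size = c(176, 166, 172), weight = c(65, 55, 67))

# Write the data in Stata format to a temporary file, then read it back.
path <- tempfile(fileext = ".dta")
write.dta(d, path)
back <- read.dta(path)

str(back)
```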

Two well-known graphics extensions of R are the lattice and ggplot2 packages, which use a higher level of abstraction to make complex graphics quicker to create. lattice is an implementation of the idea of Trellis graphics for the visualization of multivariate data. ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics.
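A minimal sketch of a Trellis-style plot with lattice follows. It assumes that lattice is installed (it is normally shipped with R as a recommended package) and uses the built-in mtcars data set purely as example data.

```r
library(lattice)

# One scatter plot panel per number of cylinders:
# the part after "|" is the conditioning variable.
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            xlab = "Weight", ylab = "Miles per gallon")

# A trellis object is drawn when it is printed.
print(p)
```

The formula interface is the same one used by modeling functions; only the conditioning part after the vertical bar is specific to lattice.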

Under the heading Task Views, CRAN provides a list of currently (March 2014) 33 topic areas, each with an annotated description of the packages relevant to that topic. These are: Bayesian statistics, chemometrics and computational physics, clinical trials, cluster analysis, differential equations, probability distributions, econometrics, environmetrics, design of experiments, finance, genetics, graphics, high-performance and parallel computing, machine learning, medical imaging, meta-analysis, multivariate methods, computational linguistics, numerical mathematics, official statistics and survey methodology, optimization, pharmacokinetics, phylogenetics, psychometrics, reproducible research, robust estimation methods, social sciences, spatial statistics, spatio-temporal statistics, survival analysis, time series analysis, web technologies and services, and probabilistic graphical models.

User interface

R runs in a command-line environment. In addition, there are several graphical user interfaces (GUIs) such as RStudio, Statistiklabor, JGR (pronounced "Jaguar"), RKWard, StatET (Eclipse) and others.

Two graphical user interfaces that are provided as R packages are the R Commander (package name: Rcmdr) and relax. Both offer menu-based access to some important procedures of exploratory and inferential statistics. Standard graphics can also be generated from the menu. The R Commander facilitates data management and helps with writing analysis scripts. It is written to be independent of the operating system. relax is designed specifically to integrate data analysis and the documentation of results in a single document (see Sweave). The Rattle GUI is also distributed as an R package. It is intended as an entry point to data mining projects.

The platform-independent, Java-based user interface JGR ("Java GUI for R") offers assisted command input: it checks, for example, the number of brackets and provides auto-completion. A useful additional extension is the Deducer package, which adds data-handling facilities through a data viewer.

R extensions exist for many editors, such as Tinn, Emacs, TextWrangler, SubEthaEdit or Notepad.

Example

As a simple example, the correlation coefficient between two data series is calculated:

# size is defined as a numeric vector
# using the assignment operator "<-":
size <- c(176, 166, 172, 184, 179, 170, 176)

# weight is defined as a numeric vector:
weight <- c(65, 55, 67, 82, 75, 65, 75)

# Calculate the Pearson correlation coefficient:
cor(weight, size, method = "pearson")

The result is 0.9295038. Text after the number sign "#" is treated by the R interpreter as a comment. Because R is case-sensitive, the names used in the call to cor must match the names assigned above exactly.

As a further analysis, one can perform a linear regression. In R this is done with the lm function, in which the dependent variable is separated from the independent variables by a tilde "~". The summary function then returns the regression coefficients and further statistics:

# Linear regression with weight as the target variable
# The result is stored as reg:
reg <- lm(weight ~ size)

# Output the results of the linear regression:
summary(reg)

Diagrams are also very easy to produce.
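A minimal sketch of such a diagram, assuming the base graphics functions plot and abline and the size and weight vectors from the example above, draws a scatter plot and adds the fitted regression line:

```r
# Data and model from the example above, repeated so the block runs on its own.
size   <- c(176, 166, 172, 184, 179, 170, 176)
weight <- c(65, 55, 67, 82, 75, 65, 75)
reg <- lm(weight ~ size)

# Scatter plot of weight against size.
plot(weight ~ size)

# Add the fitted regression line to the existing plot.
abline(reg)
```

When run non-interactively, R writes the plot to a file instead of opening a window.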

" useR! " is the name of the annual meetings of R users. The first of these events was useR! 2004, which was held in Vienna in May 2004. After 2005 was omitted, the conference was held annually, usually alternating between Europe and North America in different places. Subsequent conferences were:

  • useR! 2006, Vienna, Austria
  • useR! 2011, Coventry, United Kingdom