Introduction to R

Aleix Ruiz de Villa Robert, RugBcn / TSS - Transport Simulation Systems
2014/03/05

What is R?

  • Software for statistical computing
  • Open source
  • Official webpage and CRAN (the Comprehensive R Archive Network, which hosts the packages)
  • Popular in data analysis, machine learning, statistics, bioinformatics, …
  • More than 5000 packages.

History

  • In May 1976, at Bell Labs, the first discussions about software for statistical computing began.
  • The result was called S, implemented mainly by Rick Becker and John Chambers. It was not free.
  • In 1996, in New Zealand, Ross Ihaka and Robert Gentleman developed a system compatible with S, called R.

Pros

  • Free - Big open source community
  • Designed for data analysis/statistics/machine learning, econometrics, bioinformatics (Bioconductor), …
    • Covers the basics of those fields.
    • Includes most new developments.
  • Wide range of applications (see the CRAN Task Views). Among others: finance, social science, clinical trials…
  • Efficient workflow (in contrast to Excel, SPSS, …).
  • Flexible (in contrast to C++, Java, …).
  • In constant development.

Cons

  • R has a steep learning curve.
  • Writing efficient code is sometimes tricky:
    • When possible, use functions from the base package (many of them are implemented in C).
    • Use vectorization.
    • Natural ways of programming are sometimes very inefficient: avoid explicit for loops (use the apply family instead).
  • Computation time: R is an interpreted language.
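The efficiency tips above can be illustrated with a minimal sketch that computes the same sum of squares three ways. The function names (sum_sq_loop, sum_sq_apply, sum_sq_vec) are made up for this example, and the timings depend on the machine and the R version, so none are quoted here.

```r
# Sum of squares of the first 100000 integers, computed three ways.
x <- seq_len(1e5)

# 1. Explicit for loop: every iteration is interpreted.
sum_sq_loop <- function(x) {
  total <- 0
  for (xi in x) {
    total <- total + xi^2
  }
  total
}

# 2. apply family: no explicit loop, but still one R function call per element.
sum_sq_apply <- function(x) sum(sapply(x, function(xi) xi^2))

# 3. Vectorized: `^` and sum() work on the whole vector in compiled code.
sum_sq_vec <- function(x) sum(x^2)

# All three versions agree on the result.
stopifnot(
  isTRUE(all.equal(sum_sq_loop(x), sum_sq_vec(x))),
  isTRUE(all.equal(sum_sq_apply(x), sum_sq_vec(x)))
)

# Compare the elapsed times on your own machine.
print(system.time(sum_sq_loop(x)))
print(system.time(sum_sq_apply(x)))
print(system.time(sum_sq_vec(x)))
```

In practice the apply family mostly buys shorter, clearer code. The large speedups usually come from the vectorized form.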

The audience - Programming skills

The audience - Interests

About this course

Overall

  • Provide code examples.
  • Show a global overview: what can be done with R.

This document

  • Can be used as a quick reference.
  • Provides external references: what to look for and where to look for it.
  • Done with RStudio.
  • * marks content intended for people with programming experience.

References - I

Formal references

  • John Chambers, “Software for Data Analysis: Programming with R” (classic book, basic R language).
  • W.N. Venables and B.D. Ripley, “Modern Applied Statistics with S” (classic book, statistics with R).
  • The R Journal.
  • The Journal of Statistical Software.

Blogs

References - II

Tutorials

Local R Communities

References - III

Help

Basic Elements

GUI

  • RStudio (recommended)
  • Eclipse
  • ESS - Emacs Speaks Statistics
  • Revolution Analytics (commercial): intended for commercial applications and big data.
  • “Windows type”: R Commander, Deducer.

Packages

  • Official packages can be found on CRAN.
  • Very simple installation and loading: install.packages("ggplot2"), require(ggplot2).
  • Wide range of areas: Task View.
  • Making packages is very easy thanks to recent developments in RStudio.
  • The standard documentation is usually hard to read. If you are lucky you will find a vignette (how to use the package, with some examples). For instance, the data.table package has one.
  • Recommended basic packages: ggplot2, reshape, plyr, lubridate, stringr (from Hadley Wickham)
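The install-and-load step described above can be sketched as follows. The helper load_or_install is hypothetical (it is not part of R or of any package) and only installs from CRAN when the package is missing. The demonstration call uses parallel, which ships with R, so nothing is downloaded when this runs.

```r
# Hypothetical helper: load a package, installing it from CRAN first if needed.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # downloads from the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)   # attach the package to the session
  invisible(TRUE)
}

# "parallel" is part of the standard R distribution, so this only loads it.
load_or_install("parallel")

# List the vignettes (long-form guides) that an installed package provides.
# "grid" is used here because it also ships with R.
v <- vignette(package = "grid")
print(v$results[, "Item"])
```

For a CRAN package such as ggplot2 the call would be load_or_install("ggplot2"), which needs network access the first time.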

Data Frames

  • The main data type in R: like a survey table (one row per respondent, one column per question).
  • Columns may have different data types (character, numeric, logical, …)
  • When all columns have the same type we can use a matrix.
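A minimal sketch of the points above, using a small made-up survey table (the names and values are invented for illustration only):

```r
# A data frame whose columns have different types.
survey <- data.frame(
  name   = c("Anna", "Bernat", "Carla"),
  age    = c(31, 45, 27),
  smoker = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE   # keep text as character (the default since R 4.0)
)

str(survey)                    # structure: one line per column
col_types <- sapply(survey, class)
print(col_types)               # the class of each column

# Select rows with a logical condition, as you would filter a table.
older <- survey[survey$age > 30, ]
print(older)

# When every column has the same type, a matrix is enough.
scores <- data.frame(q1 = c(1, 2, 3), q2 = c(4, 5, 6))
m <- as.matrix(scores)
print(dim(m))
```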

Base and stats packages

  • Simple manipulation example: LmExample.R
  • Base plots example: Plots.R
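The example scripts themselves are not reproduced in this document. As a stand-in (not the original LmExample.R or Plots.R), here is a short sketch that fits a linear model and draws a base plot, using the cars data set that ships with R:

```r
# cars: speed (mph) and stopping distance (ft) of 50 cars, built into R.
head(cars)

# Fit a simple linear regression of stopping distance on speed.
fit <- lm(dist ~ speed, data = cars)
print(summary(fit))

# Predict the stopping distance for a new speed value.
new_pred <- predict(fit, newdata = data.frame(speed = 20))
print(new_pred)

# Base graphics: scatter plot plus the fitted regression line.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Stopping distance versus speed")
abline(fit, col = "red", lwd = 2)
```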

New Elements

Visualization - ggplot2

  • ggplot2 package: nice plots in R, by Hadley Wickham. Example: GGPlot2.R
  • Hadley Wickham “ggplot2: Elegant Graphics for Data Analysis” (book).
  • ggmap = ggplot2 + RGoogleMaps. Example: GGMap.R
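As an illustration of the ggplot2 style (a sketch, not the original GGPlot2.R), the following builds a scatter plot layer by layer from the built-in mtcars data set. It assumes ggplot2 is already installed.

```r
library(ggplot2)  # assumes the package has been installed from CRAN

# Map data columns to aesthetics, then add layers with `+`.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +                                       # one point per car
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +    # a linear fit per group
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders",
       title = "Fuel consumption versus weight")

print(p)  # a ggplot object is drawn when printed
```

The plot is an ordinary R object, so it can be stored, modified with further layers, and printed later.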

Facebook Formation of Love

Visualization

  • Packages for managing spatial data (shapefile format, projections, …): sp, maptools, rgdal.
  • Google motion charts with package googleVis. Example: GoolgeVis.R.
  • Interactive charts with rCharts (using D3.js*). Examples.

More on visualization (including R)

Machine Learning

  • Area between statistics, optimization and computer science.
  • Main objective: learn from a historic database to make predictions. Example: MachineLearning.R
  • Successfully applied to (examples from a huge list):
    • Electricity market demand forecasting.
    • Image classification (e.g. number plates).
    • Spam filtering.
    • Recommender systems (e.g. Amazon's book suggestions).
    • Natural language processing.
    • Fraud detection.
    • Medical diagnosis.
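The idea of learning from historical data to make predictions can be sketched in a few lines of base R. This is not the original MachineLearning.R. The data set (mtcars), the 70/30 split, and the choice of a linear model are assumptions made for illustration.

```r
set.seed(42)  # make the random split reproducible

# Split the "historical" data into a training set and a held-out test set.
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Learn: fit a model on the training data only.
model <- lm(mpg ~ wt + hp, data = train)

# Predict: apply the model to cars it has never seen.
pred <- predict(model, newdata = test)

# Evaluate: root mean squared error on the held-out data.
rmse <- sqrt(mean((test$mpg - pred)^2))
print(rmse)
```

Measuring the error on held-out data, rather than on the training data, gives an estimate of how the model would behave on new observations.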

Integration with other languages*

  • Base functions are implemented in C or Fortran.
  • Integration with C++ through the package Rcpp (currently very active and well integrated in RStudio).
  • Integration with Java through the package rJava.
  • R can be called from Python using RPy or PypeR. Python can be called from R with rPython.

High Performance Computing*

  • Parallel computing: two approaches in the base distribution (package parallel)
    • multicore (works on everything but Windows).
    • snow: opens several hidden R sessions and initializes each one with the required data. Wrapper package snowfall.
  • Indexed structures with data.table (instead of data.frame).
  • Byte code compilation with compiler.
  • Use Rcpp.
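A minimal sketch of two of the options above, using only packages that ship with R (parallel and compiler). The functions slow_square and running_total are made up for this example. The multicore-style mclapply relies on forking, so the sketch falls back to a single core on Windows. The snow-style counterpart in the same package is makeCluster together with parLapply.

```r
library(parallel)
library(compiler)

# A deliberately slow task, standing in for real work.
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

# multicore style: fork the current session (not available on Windows).
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(1:8, slow_square, mc.cores = cores)
squares <- unlist(res)
print(squares)

# Byte-code compilation of a loop-heavy function.
running_total <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}
running_total_c <- cmpfun(running_total)

# The compiled version returns the same result.
stopifnot(running_total_c(1000) == running_total(1000))
```

Since R 3.4 functions are byte-compiled automatically by the just-in-time compiler, so an explicit cmpfun call now gains little.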

Big Data*

  • Integration with Hadoop.
  • Dealing with files that do not fit in memory: ff package, bigmemory project.
  • Direct access to databases: RODBC, RMySQL, ROracle, RPostgreSQL, … packages (among others).
  • Revolution Analytics
  • SAP HANA integrates R.

Automatic Reporting

  • LaTeX was traditionally integrated through Sweave, and currently through knitr. Example: Latex.R
  • RMarkdown (from knitr): HTML reporting (like this document).

Automatic Reporting - RMarkdown Example

The standard Gaussian distribution has density function
\[ f(x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2} \]
An example of the empirical distribution of a sample is

x <- rnorm(1000)  # 1000 draws from the standard Gaussian
hist(x)           # histogram of the sample

[Figure: histogram of x]
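To connect the formula with the histogram, the following sketch overlays the theoretical density on a sample. The function name gauss_density is made up for this example. It implements the formula above directly and is checked against R's built-in dnorm.

```r
set.seed(1)        # reproducible sample
x <- rnorm(1000)

# The density from the formula above: f(z) = exp(-z^2 / 2) / sqrt(2 * pi).
gauss_density <- function(z) exp(-z^2 / 2) / sqrt(2 * pi)

# It agrees with R's built-in standard normal density.
stopifnot(isTRUE(all.equal(gauss_density(c(-1, 0, 2)), dnorm(c(-1, 0, 2)))))

# freq = FALSE puts the histogram on the density scale so the curve is comparable.
hist(x, freq = FALSE, main = "Sample versus theoretical density", xlab = "x")
curve(gauss_density(x), add = TRUE, col = "red", lwd = 2)
```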

Web