April 6th 2017

Introduction

About me

  • PhD in mathematics, UAB (X. Tolsa)
  • Worked in:
    • Gaming industry: Cirsa
    • Transportation: TSS
    • Newspapers: LaVanguardia.com
    • Retail: Lidl

Organizations (via Meetup)

  • Co-organizer of 'Grup d'estudi de machine learning de Barcelona'
  • Co-organizer of 'Barcelona R Users Group'

Foundations of data science

Ways into data science

  • Programming
  • Insights
  • Machine Learning
  • Business

About data

Typical difficulties:

  • Having access
  • Quality
  • Management

Programming

Typical skills

  • Many languages (R, Python, Java, C++, JavaScript, Scala)
  • Agile coding (more experiments)
  • Collaborative coding
    • Tools: Git, SVN
    • Easy to read code
  • Computer science coding:
    • testing
    • defensive programming
    • object oriented programming
    • development/production environments

Big Data

  • Big propaganda: does everyone really need it?
  • Big fantasy I: if we collect all the data we will know everything
    • More difficult to manage
    • Counter-productive: we have to be able to digest the information
  • Big fantasy II: if we collect all the data we will not need models

Big Data Infrastructures

  • Tools for dealing with it:
    • Hadoop/Hive
    • Spark
  • Management: in house vs cloud platforms?
    • Amazon Web Services
    • Google Cloud Platform
    • Microsoft Azure
  • Commodity?

Relationship with IT department (personal experience)

  • Typically oriented toward exploitation rather than exploration (which is where data science currently sits)
  • Political interests
  • Different team sizes (IT vs DS)
  • Older departments

Machine learning

Introduction

  • Started in the 1950s and 1960s
  • Combination of statistics/probability, optimization and computing.

Machine learning and statistical views

  • Main focus:
    • Machine learning: prediction
    • Statistics: explanation (and distinguishing random from non-random effects)

Reference: L. Breiman's 'Statistical Modeling: The Two Cultures'

Example: role of multicollinearity

Machine learning and statistical views II

  • Roles:
    • Machine learning: day-to-day operational advantage
    • Statistics: mid-to-long-term roadmap

Example: churn prediction

Model selection

Validation of models via cross-validation: accuracy on unseen observations
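As a rough illustration of the cross-validation idea above, here is a minimal k-fold sketch in plain Python. The synthetic data, the one-variable linear model and the helper names `fit_line` and `k_fold_mse` are assumptions made for this example, not part of the talk.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def k_fold_mse(xs, ys, k=5, seed=0):
    """Mean squared error on held-out folds, averaged over the k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit only on the training part, then score on observations the model has not seen.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out) / len(held_out)
        scores.append(mse)
    return sum(scores) / k


# Synthetic data (assumed for the example): y = 2x + 1 plus Gaussian noise with sd 0.5.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

cv_mse = k_fold_mse(xs, ys)
print(f"5-fold cross-validated MSE: {cv_mse:.3f}")
```

Since the model family matches the way the data were generated, the cross-validated error should land near the noise variance (0.25); a model that is too rigid or too flexible would typically score worse on the held-out folds.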

Parsimony principle

"Other things being equal, simpler models are preferred"

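Vapnik and Chervonenkis, 1970s-1990s: statistical learning theory. With high probability,

\[\mathbb E[\mbox{err}] \leq \hat{\mbox{err}} + I \] where \(\hat{\mbox{err}}\) is the error observed on the sample, \(\mathbb E[\mbox{err}]\) is the expected error on unseen data, and \(I\):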

  • decreases with sample size
  • increases with complexity of the family used for modelling

    • Number of coefficients in linear regression
    • Norm of coefficients in ridge regression
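To make the ridge case concrete, the sketch below uses the closed-form ridge estimate for a single coefficient without intercept, \(b = \sum x_i y_i / (\sum x_i^2 + \lambda)\). The toy data and the function name `ridge_slope` are assumptions chosen for illustration; the point is only that a larger penalty \(\lambda\) restricts the family to smaller coefficient norms.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for y = b * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data (assumed for the example), roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# lam = 0 is ordinary least squares; larger lam shrinks the coefficient toward zero.
penalties = (0.0, 1.0, 10.0, 100.0)
slopes = [ridge_slope(xs, ys, lam) for lam in penalties]
for lam, b in zip(penalties, slopes):
    print(f"lambda = {lam:6.1f}  ->  |b| = {abs(b):.3f}")
```

In practice the penalty is usually chosen by cross-validation, which ties this slide back to model selection.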

No free lunch

For any given learning algorithm, we can construct a probability distribution on which it learns arbitrarily slowly.

ML and statistics perspective

Stats

  • Assume the data-generating process is known and verify that it can be estimated.
  • What if the model you use differs from the data-generating model? This leads to the notion of correct and incorrect models.

ML

  • No hypothesis on the data generation process
  • No good or bad models: only useful ones.
  • Focus on complexity

Many families/influences in machine learning

For instance:

  • Computer scientists: neural networks
  • Statisticians: decision trees/random forests, boosting, \(\ell_1\) penalization
  • Mathematicians: SVM (kernels)

Reference: 'The Elements of Statistical Learning', Hastie, Tibshirani and Friedman

Hot topics from statistics in machine learning today

'Bayesian' statistics:

  • Inference using Bayes formula
  • Inference using priors
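A minimal sketch of both bullets, assuming a Beta prior on a success probability and binomial data (the conjugate Beta-Binomial model). The numbers and the function name `beta_binomial_update` are hypothetical and chosen only for illustration.

```python
def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Bayes' formula for a Beta(prior_a, prior_b) prior and binomial data.

    The posterior is again a Beta distribution, with the observed counts added to the prior.
    """
    return prior_a + successes, prior_b + failures


# Uniform prior Beta(1, 1), then 7 successes and 3 failures are observed.
post_a, post_b = beta_binomial_update(1, 1, successes=7, failures=3)
posterior_mean = post_a / (post_a + post_b)

# A more opinionated prior, Beta(10, 10), pulls the estimate toward 0.5.
strong_a, strong_b = beta_binomial_update(10, 10, successes=7, failures=3)
strong_mean = strong_a / (strong_a + strong_b)

print(f"posterior mean with uniform prior: {posterior_mean:.3f}")
print(f"posterior mean with Beta(10, 10) prior: {strong_mean:.3f}")
```

The same data lead to different posteriors under different priors, which is what "inference using priors" amounts to in the simplest case.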

Reference: 'Machine Learning: A Probabilistic Perspective', K. Murphy

Bayesian hot topics - Gaussian processes

Bayesian hot topics - graphical models

For instance LDA (Latent Dirichlet Allocation) for topic modelling (documents)

Bayesian hot topics - causality

  • Analyzing relationship between built models and their use

Reference: 'Causality' Judea Pearl

Example: churn and selection bias

Insights

Introduction

Objective: find patterns, understand causation

  • Craftsmanship
  • Easy to give numbers, very difficult to link with particular actions
  • Widely needed
  • Political implications and participation from many departments
  • Mid-long term

Team involvement:

  • Passive: reporting
  • Active: interpreting and proposing actions

Reference: 'Statistical Engineering: An Idea Whose Time Has Come?', Hoerl and Snee

Activities

  • Interpreting observational data
  • Designing and analysing experiments
    • A/B testing (frequentist or Bayesian) / clinical trials
    • Causality
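For the frequentist flavour of A/B testing, one common choice is a two-proportion z-test. The sketch below assumes that test and uses made-up conversion counts; the function name `two_proportion_z_test` is hypothetical, and only the standard library is used.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function, then the two-sided tail.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Made-up counts: variant A converts 100 of 1000 visitors, variant B 130 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

A small p-value only says the difference is unlikely under equal conversion rates; whether the uplift justifies an action is the "insights" part of the work.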

Business

Project points of view

  • Motivation
    • Engineering:
      • Given constraint: a time horizon
      • Optimize: maximize the quality of the solution
    • Science:
      • Given constraint: solve the problem
      • Optimize: minimize the time horizon
  • Understand viability
    • Logistically
    • Economically
    • Organizationally
  • Estimate the clients' sensitivity to the outcome: select the problem that matters to them

Role of science in business

Increasing demand for scientific profiles: the ability to deal with complex problems.

Managing complexity

Intrinsic complexity of the field

Managing complexity

Oversimplification in management

Managing complexity

Complexity in implementation

Managing complexity

Complexity in definition of needs:

Type III error: giving the right answer to the wrong question.

Are all the statisticians working at Universitat Autonoma de Barcelona…

                                          ...vallesians?
Thanks!