Web scraping in R

Aleix Ruiz de Villa
TSS - Transport Simulation Systems
RugBcn - Barcelona R Users Group
V Jornadas
December 12th, 2013

Elements of a webpage

  • HTML: content
  • CSS: appearance
  • JavaScript: interacts with the user, modifies the content, ...

Html

Elements

Tree-based structure: elements are nested inside one another. Main building blocks: tags.

<p>: paragraph
<ul>: unordered list
<li>: item of a list
<a>: anchor, hyperlink
<div>: section (generic container)
<h1>: heading (levels <h1> to <h6>)
<table>: table
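
To make the tree structure concrete, here is a minimal sketch in R. It assumes the xml2 package, which the slides do not mention, and a small made-up page; the element contents are purely illustrative.

```r
# Sketch only: xml2 and the sample page below are assumptions, not from the slides.
library(xml2)

doc <- read_html('<html><body>
  <h1>Title</h1>
  <p>Some text with a <a href="http://www.r-project.org">link</a>.</p>
  <ul><li>first</li><li>second</li></ul>
</body></html>')

# The children of <body> are the top-level tags of the page content.
body <- xml_find_first(doc, "/html/body")
child_tags <- xml_name(xml_children(body))

# Tags nested deeper in the tree can be reached from the root as well.
items <- xml_text(xml_find_all(doc, "//li"))
link  <- xml_attr(xml_find_first(doc, "//a"), "href")

print(child_tags)
print(items)
print(link)
```

Each tag is a node in the tree, so navigating a page amounts to moving between parents and children.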

Html

Xpaths

Ways of locating the form element. Example from http://selenium-python.readthedocs.org/



<html>
  <body>
   <form id="loginForm">
    <input name="username" type="text">
    <input name="password" type="password">
    <input name="continue" type="submit" value="Login">
    <input name="continue" type="button" value="Clear">
   </form>
  </body>
</html>


PATH                                         COMMENTS
/html/body/form[1]                           absolute path for 'form'
//form[1]                                    first 'form' in the html
//form[@id='loginForm']                      'form' with attribute id with value 'loginForm'
//form[input/@name='username']               'form' with a child 'input' named 'username'
//form[@id='loginForm']/input[1]             first 'input' in the selected 'form'
//input[@name='continue'][@type='button']    'Clear' button
//form[@id='loginForm']/input[4]             'Clear' button
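
The paths in the table can be tried directly from R. The following is a minimal sketch that assumes the xml2 package (not named in the slides) and parses the login form from a literal string rather than from a live page.

```r
# Sketch only: xml2 is an assumed choice of package.
library(xml2)

page <- read_html('<html>
  <body>
   <form id="loginForm">
    <input name="username" type="text">
    <input name="password" type="password">
    <input name="continue" type="submit" value="Login">
    <input name="continue" type="button" value="Clear">
   </form>
  </body>
</html>')

# 'form' selected by its id attribute
form_id <- xml_attr(xml_find_first(page, "//form[@id='loginForm']"), "id")

# first 'input' inside the selected form
first_name <- xml_attr(
  xml_find_first(page, "//form[@id='loginForm']/input[1]"), "name")

# two different paths that should reach the same 'Clear' button
clear_a <- xml_attr(
  xml_find_first(page, "//input[@name='continue'][@type='button']"), "value")
clear_b <- xml_attr(
  xml_find_first(page, "//form[@id='loginForm']/input[4]"), "value")

print(c(form_id, first_name, clear_a, clear_b))
```

Paths based on attributes (such as id or name) tend to survive page changes better than positional ones like input[4].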

Css

Implementation

  • In-line:
    <p style="color: red;">text</p>
  • Internal:
    <!DOCTYPE html>
    <html>
      <head>
       <style>
        p{
         color: red;
        }
        a{
         color: blue;
        }
       </style>
      </head>
    </html>
  • External:
    In a separate file (called 'style.css' for instance):
    p{
      color: red;
    }
    a{
      color: blue;
    }
    In the html file:
    <!DOCTYPE html>
    <html>
      <head>
       <link rel="stylesheet" href="style.css">
    ...
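
From a scraping point of view, the three implementations leave different traces in the HTML. As a rough sketch, again assuming the xml2 package and a made-up page that combines all three, they could be located like this:

```r
# Sketch only: xml2 and the combined sample page are assumptions.
library(xml2)

page <- read_html('<!DOCTYPE html>
<html>
  <head>
    <link rel="stylesheet" href="style.css">
    <style>
      a { color: blue; }
    </style>
  </head>
  <body>
    <p style="color: red;">text</p>
  </body>
</html>')

# In-line: any element carrying a style attribute
inline_styles <- xml_attr(xml_find_all(page, "//*[@style]"), "style")

# Internal: the text inside <style> elements
internal_css <- xml_text(xml_find_all(page, "//style"))

# External: the files referenced by <link rel="stylesheet">
external_files <- xml_attr(
  xml_find_all(page, "//link[@rel='stylesheet']"), "href")

print(inline_styles)
print(external_files)
```

Note that for an external stylesheet only the file name is in the page; the rules themselves would have to be downloaded separately.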

Css

Applies to.

Simple Example

[Figure: three panels showing the rendered web page, its HTML code, and its CSS file]
