RSMG 10 October 2019: ORR protects the interests of rail and road users

SLIDE 1

ORR protects the interests of rail and road users, improving the safety, value and performance of railways and roads today and in the future

RSMG

10 October 2019

SLIDE 2

Using R to automate reports

Reproducible Analytical Pipelines

Lucy Charlton

SLIDE 3

The problem…

■ Need to produce graphs and commentary every month for an internal report.
■ Data in same format, output in same format
■ Involves the same calculations, such as moving averages
■ Takes at least half a day
■ Manual steps make it time-consuming and prone to errors (and dull)

Process: Data store (SQL) → Calculations in Excel → Power BI → Copy and paste charts into Word → Write commentary in Word → Final report
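As an illustration of the kind of repeated calculation mentioned on this slide, a trailing moving average can be sketched in a few lines. This is a minimal sketch in Python rather than the R code used for the report, which the slides do not show; the function name, the three-month window and the monthly figures are all made up for the example.

```python
from collections import deque


def moving_average(values, window):
    """Return the trailing moving average of `values`.

    Positions that do not yet have a full window are reported as None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # the last `window` values seen so far
    running_total = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            # The oldest value is about to drop out of the window.
            running_total -= recent[0]
        recent.append(value)
        running_total += value
        averages.append(running_total / window if len(recent) == window else None)
    return averages


# Hypothetical monthly figures, for illustration only.
monthly_counts = [10, 12, 14, 16, 18]
print(moving_average(monthly_counts, 3))
```

Keeping the calculation in one function means the same definition is applied every month, which is the consistency the RAP approach is after.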

SLIDE 4

The solution: use RAP

Data store (SQL) → R functions → Output

SLIDE 5

What is R Markdown?

■ R is an open-source (free!) statistical programming language
■ R Markdown is a file format for making dynamic documents with R. An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code.
■ Do all editing and formatting in the R Markdown document.
■ Can include all graphs and tables in one place: easy to update if the data updates.
■ Variety of outputs: Word, PDF, HTML, interactive dashboards, tables, websites…

SLIDE 6

How to do it…

■ Use R Markdown to automatically create graphs and dynamic reports
■ All graphs are consistent and presented in the same way
■ Saves time and no manual calculation!
■ First had to agree on set text for commentary and write code
■ Success! Took less than 10 minutes to run code, edit text and check all figures, compared to the 4+ hours of work it took before. Will be faster next time.

SLIDE 7

Dynamic report

■ Need nice tidy data
■ No more coding in R: update three fields every period.
■ Very easy to amend formatting of report (uses a Word document as template)

SLIDE 8

Pros of using RAP

■ Auditability – records the process of how the report was created
■ Speed – quick and easy to update; can implement small changes across many tables/reports simultaneously, so we save time on repetitive tasks and have more time for interesting analysis
■ Quality – can build QA into the pipeline
■ Knowledge transfer – all the information about how the figures are produced is embedded in the code, which makes it easy to hand over

SLIDE 9

Challenges of using RAP

■ Set aside time for initial code
■ New skills to learn (R, Python, GitHub)
■ IT challenges
■ Finding suitable R packages: some only work for PDF or Word output, not both
■ Might not always want fixed commentary

SLIDE 10

Is RAP for me?

■ Do you repeat the same workflow more than twice?
■ Is it time consuming and hard to replicate without a lot of manual intervention?
■ Do you copy and paste a lot between different software?
■ What is the impact of errors in your spreadsheet or report?
■ Could you reproduce publication statistics from one year ago? Five years ago?

SLIDE 11

How do I get started?

Lots of online resources: DataCamp, Codecademy

GovDataScience Slack domain (rap_collaboration channel): https://govdatascience.slack.com

RAP MOOC, Mat Gregory (GDS): https://www.udemy.com/reproducible-analytical-pipelines/

SLIDE 12

Rail fares index automation and web scraping with Python

Greg Williams

SLIDE 13

Automation of existing processes

The focus of this presentation is on web scraping, but the automation of the production of the Fares Index is included as an example of an alternative application of Python in automating existing processes. This work is part of civil service efforts to drive efficiency gains by producing Reproducible Analytical Pipelines (RAP).

https://gss.civilservice.gov.uk/blog/learning-about-reproducible-analytical-pipelines-rap-two-weeks-with-the-gss-good-practice-team/

Why automate at all?

  • Consistently produced output
  • Remove human error
  • Process is defined within the code
  • Faster processing of data
SLIDE 14

Fares index preparation

What’s involved

  • 120 source files, 8 files of price information, 2 lookup tables, four file formats, 37 individual steps to follow

  • A total of 7.5 GB of data: equivalent to 7,600 digital photographs
  • Process Map below:
SLIDE 15

Fares index preparation: results

  • Produced in just over an hour, not three months
  • Expensive proprietary software can be discontinued, saving £12,000 annually
  • Manual step of cross checking with Avantix no longer needed
  • Time gained reinvested in deeper QA processes

Red items are outputs not produced in manual process:

  • 1 ‘superfile’ of 78,000,000 rows of data at 12 GB
  • 5 Diagnostic files
      • Checks for missing categories, products, duplicates in the RDG files, checksums
  • 3 subsets for checking
      • Earnings over £500,000
      • Annual price change greater than 20%
      • Annual price change less than -20%
  • 2 subsets for calculation: advanced and non-advanced files
  • 1 hour needed, exclusive of source file data extraction
  • Final output based on the ‘superfile’, advanced and non-advanced
  • 5 minutes needed to create final outputs
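
The three checking subsets listed above could be produced along the following lines. This is only a sketch under stated assumptions: the record layout, the field names and the sample rows are hypothetical and are not taken from the actual Fares Index code; only the three thresholds come from the slide.

```python
from dataclasses import dataclass


@dataclass
class FareRecord:
    # Hypothetical layout; the real 'superfile' columns are not shown in the slides.
    product: str
    earnings: float          # earnings in pounds
    price_last_year: float   # assumed to be non-zero
    price_this_year: float


def annual_change_pct(record):
    """Percentage change in price from last year to this year."""
    return 100.0 * (record.price_this_year - record.price_last_year) / record.price_last_year


def checking_subsets(records, earnings_limit=500_000, change_limit=20.0):
    """Split out the three subsets that a person would then review by hand."""
    return {
        "high_earnings": [r for r in records if r.earnings > earnings_limit],
        "large_increase": [r for r in records if annual_change_pct(r) > change_limit],
        "large_decrease": [r for r in records if annual_change_pct(r) < -change_limit],
    }


# Made-up rows, for illustration only.
sample_records = [
    FareRecord("A", 600_000, 100.0, 102.0),  # high earnings, +2%
    FareRecord("B", 1_000, 50.0, 65.0),      # +30%
    FareRecord("C", 2_000, 80.0, 60.0),      # -25%
    FareRecord("D", 100, 10.0, 10.5),        # +5%, appears in no subset
]
subsets = checking_subsets(sample_records)
for name, rows in subsets.items():
    print(name, [r.product for r in rows])
```

Writing the thresholds as parameters keeps the checks visible in the code, in line with the point that the process is defined within the code.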
SLIDE 16

What is “web scraping”?

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Source: https://en.wikipedia.org/wiki/Web_scraping

This presentation will cover web scraping through HTTP, rather than an API.

Direct web scraping via a URL: https://www.nationalrail.co.uk/

SLIDE 17

The problem

Economists want to see if the introduction of Open Access Operators changes the availability of tickets and/or the prices of tickets. This means collecting individual ticket prices based on particular routes, days of the week and times of day.

Manual data collection: go into the NRE website, request data for a route and time combination and copy down the results into a spreadsheet.

  • Time consuming
  • Search times change depending on day of week
  • Risk of human error
  • Adding routes increases workload
  • Data not collected for weekends
SLIDE 18

The solution

Automatic Data Collection through web scraping:

  • Speed: 60 seconds to collect a route with 54 items of data, with data being automatically appended
  • Runs automatically: runs every day in the background
  • Variables automatically calculated: code works out the day of the week and searches for appropriate times for each route
  • Accurate: the data is converted into CSV format and put into a spreadsheet without human intervention – no transcription errors
  • Flexible: a metadata sheet controls the nature and number of queries that will be made
  • A common standard: the CSV format is fixed for all users
  • Unlimited in scope: the code executes three times on Fridays, so that weekend data is captured
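
The day-of-week logic described above might look something like the sketch below. It rests on an assumption the slide does not confirm: that "executes three times on Fridays" means one query each for Friday, Saturday and Sunday. The function name and the search times are hypothetical; in the real tool the times come from the metadata sheet.

```python
from datetime import date, timedelta

# Hypothetical search times; the real values are held in the metadata sheet.
SEARCH_TIMES = {"weekday": ["0700", "1700"], "weekend": ["1000"]}

FRIDAY = 4  # date.weekday() counts Monday as 0


def queries_for_run(run_date):
    """Return (travel_date, search_time) pairs for a daily run.

    Assumed rule: on Fridays the run also covers Saturday and Sunday,
    because the job is not run at weekends.
    """
    travel_dates = [run_date]
    if run_date.weekday() == FRIDAY:
        travel_dates += [run_date + timedelta(days=1), run_date + timedelta(days=2)]

    queries = []
    for travel_date in travel_dates:
        kind = "weekend" if travel_date.weekday() >= 5 else "weekday"
        for search_time in SEARCH_TIMES[kind]:
            queries.append((travel_date, search_time))
    return queries


# 11 October 2019 was a Friday, so this run covers three travel dates.
for travel_date, search_time in queries_for_run(date(2019, 10, 11)):
    print(travel_date.isoformat(), search_time)
```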

SLIDE 19

The benefits

  • It’s been of genuine interest to users:
      • This has been rolled out to 6 ORR economists
      • Meeting with DfT Data Science team for a technical demonstration/explanation on 16th October
  • Enables team to collect more routes
  • Can add additional variables at little extra cost
  • Targets both:
      • changes in tickets a given number of days from the present day
      • changes in tickets for a fixed day in the future over a time series

SLIDE 20

Steps in web scraping: Direct web scraping

Step number   What to do                                Library used
1             Build your URL via string concatenation   Generic: string
2             Send URL to website                       Requests
3             Extract the underlying JSON data          BeautifulSoup
4             Parse JSON data                           json
5             Convert to CSV                            csv
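
The five steps can be sketched end to end as below. To keep the example runnable offline, it uses only the Python standard library: the network call in step 2 is replaced by a canned HTML string, and `html.parser` stands in for BeautifulSoup in step 3. The URL layout, the `application/json` script tag and the field names are assumptions made for illustration; they do not describe the real National Rail Enquiries pages.

```python
import csv
import io
import json
from html.parser import HTMLParser
from urllib.parse import quote

BASE_URL = "https://example.org/journeys"  # placeholder, not the real site layout


def build_url(origin, destination, travel_date, travel_time):
    # Step 1: build the URL via string concatenation.
    return "/".join([BASE_URL, quote(origin), quote(destination), travel_date, travel_time])


class JsonScriptExtractor(HTMLParser):
    """Collects the text of <script type="application/json"> elements."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._inside = True
            self.payloads.append("")

    def handle_data(self, data):
        if self._inside:
            self.payloads[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_fares(html):
    # Step 3: pull the embedded JSON out of the page.
    parser = JsonScriptExtractor()
    parser.feed(html)
    parser.close()
    # Step 4: parse the JSON text into Python objects.
    return [fare for payload in parser.payloads for fare in json.loads(payload)["fares"]]


def fares_to_csv(fares, fieldnames):
    # Step 5: convert to CSV with a fixed column order.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(fares)
    return buffer.getvalue()


# Step 2 would normally be a network call, e.g. requests.get(url, timeout=30).text
# with the Requests library. A canned response is used here instead.
SAMPLE_HTML = """<html><body>
<script type="application/json">{"fares": [{"ticket": "Advance Single", "price": "25.50"}, {"ticket": "Off-Peak Single", "price": "48.10"}]}</script>
</body></html>"""

url = build_url("KGX", "EDB", "2019-10-11", "0900")
fares = extract_fares(SAMPLE_HTML)
csv_text = fares_to_csv(fares, ["ticket", "price"])
print(url)
print(csv_text)
```

Fixing the column order in `fares_to_csv` is one way to get the common CSV standard mentioned on the earlier slide.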

SLIDE 21

Journey Focussed Management

Observed Passenger Movement through a Network

SLIDE 22

Agenda

  • Vision (NR)
  • Journey Focussed Management Project
  • Stakeholders
  • Outputs
  • Timeline and Next Steps
  • AOB
SLIDE 23

System Operator

We are ‘the glue that holds the network together’, enabling the seamless provision of cross-boundary services and coordinating capacity requests to make the best use of the network. Our vision is to become the recognised expert trusted by decision makers to plan Britain’s railway. To achieve this we provide a whole-system, long-term view, using the detailed knowledge we have from capacity planning and timetabling the network. Our work is informed by our industry-wide interfaces with every train operator, the Network Rail routes, other infrastructure managers, public and private funders, and franchising bodies.

SLIDE 24

Journey Focussed Management

Idea: The proposal is to work collaboratively with RDG, TfL and others to develop a digital twin of railway operations, including nowcasting, forecasting and scenario simulation capabilities. The resulting digital asset will provide a Rail Platform for Mobility as a Service (MaaS) that maximises benefits to all travellers and enables more optimal management of the railway, including the avoidance or delay of enhancements through operational interventions and safer and more effective management of perturbation. It will include the collection of passenger journey data that enables creation of new KPIs that more closely reflect passenger experience. The work will be aligned with digital twin activities being undertaken by Network Rail.

SLIDE 25

Journey Focussed Management

Initial Stakeholders:

Supported by Phani Chinchapatnam (Enterprise Architect) and Zach Naylor (Principal Engineer Systems Engineering).

  • David Harding - Head of Economics and Analysis
  • Stephen Draper - Performance Analysis Manager
  • NR Franchise Team
  • TfL
  • DfT
  • RDG

SLIDE 26

Journey Focussed Management

Initial Requirements from Economics and Analysis Team:

SLIDE 27

Journey Focussed Management

Funding Request: £50k

Outputs:

  1. Requirements exploration across NR departments and collaborating partner organisations to cover all expected use cases
  2. Initial conversations with DISIC
  3. Creation of high level system requirements & review with stakeholders
  4. First characterisation of benefits against use cases
  5. Benefit approximation using PDFH and Green Book
  6. Consideration of internal and external funding options including potential for cost sharing by collaboration partners
  7. Develop an R&D project plan consistent with proposed funding approach

SLIDE 28

RSSB and the Cross-Industry Research Programme

SLIDE 29

RSSB and the Cross-Industry Research Programme

■ Recently established an R&D Programme Coordination Group (PCG) which seeks to strengthen links and grow awareness across rail research, development and innovation activities.
■ The PCG information does not include the activities captured within the RSMG research register.
■ Introduction to SPARK, which offers a repository of research outputs (contributions are welcomed from others), including internationally.
■ Join up efforts, with the PCG likely best placed to have the broadest view of the full rail R,D&I landscape.
SLIDE 30

Learning from context with AI

SLIDE 31

Learning from context with AI

Every year we conduct around 60,000 interviews with passengers – the National Rail Passenger Survey (NRPS)

  • An underutilised resource – it’s hard to read and understand this volume
  • So we decided to deploy Signoi – an Artificial Intelligence (AI) driven pattern recognition platform – to identify themes and insights that we could all learn from
  • The way to think about this is ‘qualitative research, at scale’. A resource that can be mined for new insight, ongoing
  • These comments represent something different from the more structured ratings system – they reflect the broader contextual things that are on rail passengers’ minds…
  • In this exercise we’ve looked at four waves of data (from 2017/18)

SLIDE 32

What do these comments represent?

The comments are not entirely representative or balanced: people with negative experiences are more likely to add a comment. However, this is important because it helps us learn the ‘why’. Hence – qualitative, but at scale

People tend to use the open-ended question to either reinforce points made, or highlight contextual issues

Chart: Likelihood of meaningful comment (% giving open comment, by answer at Q16: Satisfaction with journey)

Answer at Q16         % giving open comment   N
Very Satisfied        51                      34,720
Fairly Satisfied      54                      40,967
Neither/nor           64                      9,002
Fairly Dissatisfied   77                      4,703
Very Dissatisfied     82                      2,245

Total comments: 51,674 across last four waves

SLIDE 33

What does Signoi enable us to do?

New generation AI platform to derive meaning and themes from unstructured open text data

Signal from Noise | Revelation, not search | Storytelling & interrogation

  • Accelerated reading; fast, rigorous decoding of messy data revealing: meanings, emotional energies, attitudes and feelings; complex energy rather than base sentiment; meaning rather than simple coding
  • Surfacing naturally emerging patterns and themes using neural nets, machine learning models, and advanced analytics. Shows what is there rather than just finding things you need to tell it to look for
  • Cutting analysis time and cost by removing a lot of the grunt work – allowing human minds to do more of what they’re best at – thinking and interpretation

SLIDE 34

Key themes

Signoi then uses a form of cluster analysis to automatically identify and quantify themes. Human analysts then name the themes based on the nature of the comments

Chart: Topic Presence % (33 themes, highest first): 8.8, 7.8, 7.2, 6.9, 5.3, 4.9, 4.4, 4.1, 3.9, 3.9, 3.7, 3.2, 2.9, 2.9, 2.8, 2.6, 2.3, 2.1, 2.1, 2.0, 2.0, 1.9, 1.9, 1.8, 1.7, 1.5, 1.4, 1.1, 0.8, 0.7, 0.6, 0.5, 0.3

  • 6. Misery
  • 20. Everything's fine
  • 19. Seating management
  • 1. Interconnected frequency
  • 29. Misleading information
  • 15. Crush hour
  • 17. Generally unreliable
  • 16. Human touch
  • 11. Accessible space
  • 8. Satisfied BUT
  • 32. Ticketing
  • 5. Old and shabby
  • 31. Peace and quiet
  • 4. Basic facilities
  • 18. Atypicality
  • 26. Feeling safe
  • 12. Travelling discomfort
  • 2. Dirty trains
  • 24. Fares
  • 3. No complaints
  • 25. What's going on?
  • 23. Faster trains
  • 33. Mobile office
  • 7. Minor quibbles
  • 14. Very expensive
  • 30. Renationalise!
  • 21. Value for money
  • 22. Announcements
  • 10. No hassle
  • 13. Basic efficiency
  • 27. Transport options
  • 9. Signalling issues
  • 28. General annoyance

NOTE: Online dashboards allow you to explore the themes in detail, by TOC, journey type, and so on

SLIDE 35

Key themes: summary

TOPIC: IN A NUTSHELL

  • 1. Interconnected frequency: Issues with short connection times or lack of frequent connections, delays have knock-on effect
  • 2. Dirty trains: Trains are not clean, sometimes smell, often old
  • 3. No complaints: Literally, no complaints!
  • 4. Basic facilities: On train and in station, availability of toilets, catering etc
  • 5. Old and shabby: Rolling stock is tired, rickety, and uncomfortable
  • 6. Misery: Debilitating delays and cancellations, often frequent
  • 7. Minor quibbles: Minor issues that do not impact satisfaction
  • 8. Satisfied BUT: Often contextualising against small delays
  • 9. Signalling issues: Explicit mentions of signal problems, track trespass, etc
  • 10. No hassle: Bland positive assessments
  • 11. Accessible space: Complex topic containing accessibility in all its forms (luggage, buggies, bikes, car parking, even escalators)
  • 12. Travelling in discomfort: Uncomfortable train environment – seats, temperature, etc
  • 13. Basic efficiency: Getting from A to B as advertised
  • 14. Very expensive: Complaints about ticket prices
  • 15. Crush hour: Stories of hellish rush hour journeys
  • 16. Human touch: Mentions of staff (station & train) usually, not always, positive
  • 17. Generally unreliable: Wider contextual comments about a route or an operator, beyond the specific journey
  • 18. Atypicality: Making point that while this journey may have been OK, it is generally worse
  • 19. Seating management: Availability of seat bookings, and whether they are honoured on the train itself
  • 20. Everything’s fine: Positive comments about a pleasant journey
  • 21. Value for money: Generally negative about value for money relative to service levels delivered
  • 22. Announcements: Unclear/inaudible announcements (train/station)
  • 23. Faster trains: General comments about slowness on some routes
  • 24. Fares: Complexity and/or price of getting the right fare
  • 25. What’s going on?: Confusing or absent information (especially stations) when trains are delayed / cancelled
  • 26. Feeling safe: Concerns about safety on board and at stations, rowdy passengers, need for guards often mentioned
  • 27. Transport options: Lack of choice in how to get from A to B
  • 28. General annoyance: A variety of irritations that do not sit well within other topics
  • 29. Misleading information: A major cause of stress, especially at stations – wrong information about platforms, etc
  • 30. Renationalise: Belief that TOCs’ need for profit undermines service
  • 31. Peace and quiet: The need for quiet journeys – includes issues with noisy passengers, quiet zones, etc
  • 32. Ticketing: Mainly about machines and lack of clarity about what ticket is valid on which route
  • 33. Mobile office: Onboard WiFi, power, etc, mainly for business people

NOTE: The themes aggregate comments together based on tone of voice as much as detailed content

SLIDE 36

Key themes: example

General unreliability is a strong driver of (lack of) trust

EXAMPLES

  • Southern Rail provide a terrible service and the franchise should be awarded to a train company who can provide a decent service
  • The rail service is awful, commute regularly and cannot rely on the trains to be on time, terrible
  • Generally, trains running through Streatham are late or cancelled at very short notice. A very unreliable rail service
  • This train (the 16.44) isn't too bad, but the DRS operated trains have terrible reliability and are slower
  • Looking forward to Crossrail - hope its an improvement. Heathrow Connect frequently cancels services
  • The rail service is appalling and overpriced for the service received
  • Govia provide a terrible service, overpriced. I have changed how I live my life in order to get to work!
  • West Croydon is fine. Southern trains are extremely unreliable though and the service is generally shambolic
  • Thameslink is poor and unreliable, constantly late and poor quality trains
  • Arriva Trains is very unreliable and makes no compensation. Overpriced for such poor service

SLIDE 37

Key themes: example

The human touch is very important to passengers

EXAMPLES

  • One of the staff at Sittingbourne helped another passenger with a young child on to the train and asked if she needed assistance at the other end. He was very helpful to her and it didn't go unnoticed
  • The staff on the train are always polite and friendly and very informative, always helpful
  • As a solo traveler I always appreciate being able to see a member of staff on the train
  • The lady member of staff at the station was extremely helpful and polite
  • Staff were extremely helpful and polite on the train and on the platform/station
  • Train and train station staff are always pleasant
  • Friendly staff at Beaconsfield station - always helpful and kind
  • Station staff very helpful especially two ladies on Clapham station

SLIDE 38

Journey purpose

In terms of tangible themes, commuters, business travellers and leisure travellers’ comments are very different in tone

Commuters

  • General (un)reliability of the service
  • Overcrowding
  • Fares and value for money
  • Making tight connections – small delays (by minutes, in some cases) have a big knock-on effect

Business travellers

  • Travelling in comfort
  • Comment on the state of trains (old rolling stock, shabby, dirty, and so on)
  • WiFi, power, “mobile office” is a priority
  • Price of tickets/VFM
  • On-train facilities (toilets, catering)
  • Pre-booked seats: (a) availability online and (b) actually getting the seat they booked

Leisure travellers

  • Peace and quiet
  • The human touch – staff helping them, at station and on train
  • Station facilities, information, announcements etc.
  • Car parking at stations, and luggage / storage / buggy space on trains
  • Comfort on trains, especially longer journeys

Most negative comments and emotions

Elective journeys elicit fewest negative comments and emotions

SLIDE 39

TOC overview

The types of comments individual TOCs receive are very different, reflecting their distinct characteristics. Example comparison:

HEATHROW EXPRESS

  • WiFi, power, mobile office a plus
  • Price of tickets a negative
  • Positive comments about speed, efficiency, no hassle

VIRGIN TRAINS

  • Travelling in comfort (often a negative)
  • Mobile office (power/WiFi not working)
  • Onboard facilities lacking, problems with toilets and so on

SOUTHERN

  • Delays and cancellations
  • General (un)reliability of the service
  • Overcrowding
  • Fares and value for money

Fewest negative comments

Most negative comments

SLIDE 40

Conclusions: how this helps us gain insight

People use the ‘any other comments’ opportunity to:

  • Reinforce the importance of points they may have made in the more structured questions (for example, fleshing out WHY they were dissatisfied with a journey)
  • Make wider contextual points about their (general) travelling experience – for example, talking about feeling safe not on ‘this’ journey, but on journeys in general (or on other journeys they have had)
  • Comment on the state of the nation – the national infrastructure, the way TOCs invest or do not invest (as they see it), the age of rolling stock, and so on
  • Talk about stations, especially in terms of accessibility, facilities, confusing announcements, hard-to-navigate ticketing, car parking, and more
  • Raise very specific points that may not be highlighted in the survey itself

The comments are a potential gold mine of insight

SLIDE 41

Conclusions: what we see in NRPS comments

In general

  • The things that drive trust are the basic truths of travel: on time, reliable, predictable
  • There is no one silver bullet for the passenger experience. It is often an accumulation of small things
  • Old rolling stock, dirty trains and so on feed a narrative of underinvestment
  • For frequent travellers, especially commuters, issues are exacerbated
  • TOCs vary significantly in terms of the issues raised in the comments – there are many individual learnings

At the station

  • Improve clarity of information and announcements: need to avoid confusion and stress
  • Accessibility is frequently referenced – especially by older travellers and leisure travellers with children etc
  • Station facilities, especially shops and toilets, can be improved (also WiFi)
  • Station staff are a great asset, although there are some stories of rudeness etc
  • Commuters / season ticket holders aside, there is confusion over ticket prices – transparency and clarity

On the train

  • Overcrowding and general misery is a constant refrain of commuters
  • The train environment is noticed by passengers – a smart, clean train with working toilets goes a long way
  • Where WiFi and power are advertised, they need to work. Key for business travellers
  • People need to feel safe. Guards play a strong role here, as do other passengers
  • Simple things like adequate luggage space, room for buggies, bikes, etc. make a big difference