ORR protects the interests of rail and road users, improving the safety, value and performance of railways and roads today and in the future
RSMG 10 October 2019
Using R to automate reports
Reproducible Analytical Pipelines Lucy Charlton
The problem…
■ Need to produce graphs and commentary every month for an internal report.
■ Data in same format, output in same format
■ Involves the same calculations, such as moving averages
■ Takes at least half a day
■ Manual steps make it time consuming and error-prone (and dull)
Current process: Data store (SQL) → Calculations in Excel → Power BI → Copy and paste charts into Word → Write commentary in Word → Final report
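The moving average mentioned above is the kind of repeated calculation a pipeline scripts once and reuses every month. A minimal sketch in Python (the talk itself uses R); the function name, the three-month window and the monthly figures are illustrative assumptions, not ORR data.

```python
from collections import deque
from typing import Iterable, List, Optional


def moving_average(values: Iterable[float], window: int) -> List[Optional[float]]:
    """Trailing moving average; positions without a full window are None."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # keeps only the last `window` values
    result: List[Optional[float]] = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            result.append(sum(recent) / window)
        else:
            result.append(None)
    return result


# Made-up monthly figures, for illustration only.
monthly = [10.0, 12.0, 14.0, 16.0, 18.0]
print(moving_average(monthly, window=3))
```

Because the calculation lives in one function, a change to the window length is made in a single place rather than in every spreadsheet.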
The solution: use RAP
Data store (SQL) → R functions → Output
What is R Markdown?
■ R is an open-source (free!) statistical programming language
■ R Markdown is a file format for making dynamic documents with R. An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code.
■ Do all editing and formatting in the R Markdown document.
■ Can include all graphs and tables in one place, so it is easy to update if the data updates.
■ Variety of outputs: Word, PDF, HTML, interactive dashboards, tables, websites…
How to do it….
■ Use R Markdown to automatically create graphs and dynamic reports
■ All graphs are consistent and presented in the same way
■ Saves time and no manual calculation!
■ First, had to agree on set text for the commentary and write the code
■ Success! Took less than 10 minutes to run the code, edit text and check all figures, compared with the 4+ hours of work it took before. Will be faster next time.
Dynamic report
■ Need nice tidy data
■ No more coding in R: update three fields every period.
■ Very easy to amend formatting of report (uses a Word document as template)
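The idea of agreed set text with a few fields updated each period can be sketched as a template filled from the data. This is an illustration in Python, not the R Markdown code used for the report; the wording of the commentary, the placeholder names and the figures are all hypothetical.

```python
from string import Template

# Hypothetical agreed wording; the placeholders stand in for the fields
# that change each reporting period.
COMMENTARY = Template(
    "In $period the measure was $value, $direction on the previous period."
)


def describe_change(current: float, previous: float) -> str:
    """Choose the wording for the direction of change."""
    if current > previous:
        return "up"
    if current < previous:
        return "down"
    return "unchanged"


def build_commentary(period: str, current: float, previous: float) -> str:
    """Fill the set text with this period's values."""
    return COMMENTARY.substitute(
        period=period,
        value=f"{current:.1f}",
        direction=describe_change(current, previous),
    )


print(build_commentary("October 2019", 91.2, 90.4))
```

The text only changes when the template is edited, which is what makes fixed commentary quick to produce but less flexible.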
Pros of using RAP
■ Auditability – records the process of how the report was created
■ Speed – quick and easy to update; small changes can be implemented across many tables/reports simultaneously, saving time on repetitive tasks and leaving more time for interesting analysis
■ Quality – can build QA into the pipeline
■ Knowledge transfer – all the information about how the figures are produced is embedded in the code, which makes handover easy
Challenges of using RAP
■ Set aside time for initial code
■ New skills to learn (R, Python, GitHub)
■ IT challenges
■ Finding suitable R packages: some only work for PDF or Word output, not both
■ Might not always want fixed commentary
Is RAP for me?
■ Do you repeat the same workflow more than twice?
■ Is it time consuming and hard to replicate without a lot of manual intervention?
■ Do you copy and paste a lot between different software?
■ What is the impact of errors in your spreadsheet or report?
■ Could you reproduce publication statistics from one year ago? Five years ago?
How do I get started?
Lots of online resources: DataCamp, Codecademy
GovDataScience Slack domain (rap_collaboration channel): https://govdatascience.slack.com
RAP MOOC by Mat Gregory (GDS): https://www.udemy.com/reproducible-analytical-pipelines/
Rail fares index automation and web scraping with Python
Greg Williams
Automation of existing processes
The focus of this presentation is on web scraping, but the automation of the production of the Fares Index is included as an example of an alternative application of Python in automating an existing process. This work is part of civil service efforts to drive efficiency gains by producing Reproducible Analytical Pipelines (RAP).
https://gss.civilservice.gov.uk/blog/learning-about-reproducible-analytical-pipelines-rap-two-weeks-with-the-gss-good-practice-team/
Why automate at all?
- Consistently produced output
- Remove human error
- Process is defined within the code
- Faster processing of data
Fares index preparation
What’s involved
- 120 source files, 8 files of price information, 2 lookup tables, 4 file formats, 37 individual steps to follow
- A total of 7.5 GB of data: equivalent to 7,600 digital photographs
- Process Map below:
Fares index preparation: results
- Produced in just over an hour, not three months
- Expensive proprietary software can be discontinued, saving £12,000 annually
- Manual step of cross checking with Avantix no longer needed
- Time gained reinvested in deeper QA processes
Red items are outputs not produced in manual process:
- 1 ‘superfile’ of 78,000,000 rows of data at 12 GB
- 5 diagnostic files
  - Checks for missing categories, products, duplicates in the RDG files, checksums
- 3 subsets for checking
  - Earnings over £500,000
  - Annual price change greater than 20%
  - Annual price change less than -20%
- 2 subsets for calculation: advanced and non-advanced files
- 1 hour needed, exclusive of source file data extraction
- Final output based on the ‘superfile’, advanced and non-advanced
  - 5 minutes needed to create final outputs
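The three checking subsets listed above amount to simple filters over the fares records. A sketch under stated assumptions: the `FareRecord` fields, the function name and the sample rows are hypothetical, since the slides do not show the actual code or column names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FareRecord:
    product: str              # hypothetical product identifier
    earnings: float           # earnings in pounds
    annual_change_pct: float  # year-on-year price change, in percent


def checking_subsets(records: List[FareRecord]) -> Dict[str, List[FareRecord]]:
    """Split records into the three checking subsets named on the slide."""
    return {
        "earnings_over_500k": [r for r in records if r.earnings > 500_000],
        "change_above_20pct": [r for r in records if r.annual_change_pct > 20],
        "change_below_minus_20pct": [r for r in records if r.annual_change_pct < -20],
    }


# Made-up rows, for illustration only.
sample = [
    FareRecord("A", 750_000.0, 3.1),
    FareRecord("B", 120_000.0, 25.0),
    FareRecord("C", 40_000.0, -32.5),
    FareRecord("D", 90_000.0, 2.0),
]
subsets = checking_subsets(sample)
for name, rows in subsets.items():
    print(name, [r.product for r in rows])
```

A record can appear in more than one subset (for example, high earnings and a large price change), which is usually what a reviewer wants when checking outliers.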
What is “web scraping”?
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Source: https://en.wikipedia.org/wiki/Web_scraping
This presentation covers web scraping through HTTP, rather than through an API.
Direct web scraping via a URL: https://www.nationalrail.co.uk/
The problem
Economists want to see if the introduction of Open Access Operators changes the availability and/or prices of tickets. This means collecting individual ticket prices for particular routes, days of the week and times of day.
Manual data collection: go into the NRE website, request data for each route and time combination, and copy the results into a spreadsheet.
- Time consuming
- Search times change depending on day of week
- Risk of human error
- Adding routes increases workload
- Data not collected for weekends
The solution
Automatic Data Collection through web scraping:
- Speed: 60 seconds to collect a route with 54 items of data, with data being automatically appended
- Runs automatically: runs every day in the background
- Variables automatically calculated: code works out the day of the week and searches for appropriate times for each route
- Accurate: the data is converted into CSV format for a spreadsheet without human intervention – no transcription errors
- Flexible: a metadata sheet controls the nature and number of queries that will be made
- A common standard: the CSV format is fixed for all users
- Unlimited in scope: the code executes three times on Fridays, so that weekend data is captured
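The day-of-week logic can be sketched as a small scheduling function. The slides do not say exactly what the three Friday runs cover, so this sketch assumes the job runs on weekdays only and that Friday's three runs stand in for Friday, Saturday and Sunday; the function name is hypothetical.

```python
from datetime import date, timedelta
from typing import List

FRIDAY = 4  # date.weekday() counts Monday as 0


def query_dates(run_date: date) -> List[date]:
    """Dates to query on a given run day, under the assumptions above."""
    weekday = run_date.weekday()
    if weekday > FRIDAY:
        return []  # assumed: nothing runs at the weekend
    if weekday == FRIDAY:
        # Assumed: one run each for Friday, Saturday and Sunday.
        return [run_date + timedelta(days=offset) for offset in range(3)]
    return [run_date]


for day in (date(2019, 10, 10), date(2019, 10, 11)):
    print(day.strftime("%A"), [d.isoformat() for d in query_dates(day)])
```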
The benefits
- It’s been of genuine interest to users:
  - This has been rolled out to 6 ORR economists
  - Meeting with DfT Data Science team for a technical demonstration/explanation on 16th October
- Enables team to collect more routes
- Can add additional variables at little extra cost
- Targets both:
  - changes in tickets a given number of days from the present day
  - changes in tickets for a fixed day in the future over a time series
Steps in web scraping: Direct web scraping
Step number   What to do                                 Library used
1             Build your URL via string concatenation    Generic: string
2             Send URL to website                        Requests
3             Extract the underlying JSON data           BeautifulSoup
4             Parse JSON data                            json
5             Convert to CSV                             csv
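The five steps can be sketched end to end as follows. To keep the example runnable without extra installs it swaps in the standard library for two of the named tools: `urllib.request` in place of Requests and `html.parser` in place of BeautifulSoup. The endpoint, the URL layout, the `<script type="application/json">` location of the data and the field names are all hypothetical; they are not the National Rail site's real structure, and the fetch step is defined but not called.

```python
import csv
import io
import json
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://example.org/journeys/"  # hypothetical endpoint


def build_url(origin: str, destination: str, day: str, time: str) -> str:
    """Step 1: build the URL via string concatenation."""
    return BASE_URL + quote(origin) + "/" + quote(destination) + "/" + day + "/" + time


def fetch_html(url: str) -> str:
    """Step 2: send the URL to the website (not called in this sketch)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


class JsonScriptExtractor(HTMLParser):
    """Step 3: collect the text inside <script type="application/json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def html_to_csv(html: str) -> str:
    """Steps 3 to 5: extract the embedded JSON, parse it, write CSV text."""
    parser = JsonScriptExtractor()
    parser.feed(html)
    parser.close()
    fares = json.loads("".join(parser.chunks))  # Step 4: parse JSON
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["departs", "ticket", "price"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(fares)  # Step 5: convert to CSV
    return buffer.getvalue()


# Made-up page content standing in for a real response.
SAMPLE_HTML = """<html><body>
<script type="application/json">[{"departs": "07:30", "ticket": "Advance", "price": 23.5}]</script>
</body></html>"""

print(build_url("KGX", "EDB", "2019-10-11", "0730"))
print(html_to_csv(SAMPLE_HTML))
```

Fixing the CSV column order in one place is what gives every user the same output format.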
Journey Focussed Management
Observed Passenger Movement through a Network
Agenda
- Vision (NR)
- Journey Focussed Management Project
- Stakeholders
- Outputs
- Timeline and Next Steps
- AOB
System Operator
We are ‘the glue that holds the network together’, enabling the seamless provision of cross-boundary services and coordinating capacity requests to make the best use of the network. Our vision is to become the recognised expert trusted by decision makers to plan Britain’s railway. To achieve this we provide a whole-system, long-term view, using the detailed knowledge we have from capacity planning and timetabling the network. Our work is informed by our industry-wide interfaces with every train operator, the Network Rail routes, other infrastructure managers, public and private funders, and franchising bodies.
Journey Focussed Management
Idea: The proposal is to work collaboratively with RDG, TfL and others to develop a digital twin of railway operations, including nowcasting, forecasting and scenario simulation capabilities. The resulting digital asset will provide a Rail Platform for Mobility as a Service (MAAS) that maximises benefits to all travellers and enables more optimal management of the railway, including the avoidance or delay of enhancements through operational interventions and safer and more effective management of perturbation. It will include the collection of passenger journey data that enables the creation of new KPIs that more closely reflect passenger experience. The work will be aligned with digital twin activities being undertaken by Network Rail.
Journey Focussed Management
Initial Stakeholders:
Supported by Phani Chinchapatnam (Enterprise Architect) and Zach Naylor (Principal Engineer Systems Engineering).
- David Harding - Head of Economics and Analysis
- Stephen Draper - Performance Analysis Manager
- NR Franchise Team
- TfL
- DfT
- RDG
Journey Focussed Management
Initial Requirements from Economics and Analysis Team:
Journey Focussed Management
Funding Request: £50k
Outputs:
1. Requirements exploration across NR departments and collaborating partner organisations to cover all expected use cases
2. Initial conversations with DISIC
3. Creation of high-level system requirements and review with stakeholders
4. First characterisation of benefits against use cases
5. Benefit approximation using PDFH and Green Book
6. Consideration of internal and external funding options, including potential for cost sharing by collaboration partners
7. Develop an R&D project plan consistent with the proposed funding approach
RSSB and the Cross-Industry Research Programme
■ Recently established an R&D Programme Coordination Group (PCG), which seeks to strengthen links and grow awareness across rail research, development and innovation activities.
■ The PCG information does not include the activities captured within the RSMG research register.
■ Introduction to SPARK, which offers a repository of research outputs (contributions are welcomed from others), including internationally.
■ Join up efforts, with the PCG likely best placed to have the broadest view of the full rail R,D&I landscape.
Learning from context with AI
Every year we conduct around 60,000 interviews with passengers – the National Rail Passenger Survey (NRPS)
- An underutilised resource – it’s hard to read and understand this volume
- So we decided to deploy Signoi – an Artificial Intelligence (AI) driven pattern recognition platform – to identify themes and insights that we could all learn from
- The way to think about this is ‘qualitative research, at scale’: a resource that can be mined for new insight on an ongoing basis
- These comments represent something different from the more structured ratings system – they reflect the broader contextual things that are on rail passengers’ minds…
- In this exercise we’ve looked at four waves of data (from 2017/18)
What do these comments represent?
The comments are not entirely representative or balanced: people with negative experiences are more likely to add a comment. However, this is important because it helps us learn the ‘why’. Hence – qualitative, but at scale
People tend to use the open-ended question either to reinforce points made or to highlight contextual issues
Chart: Likelihood of meaningful comment (% giving open comment, by answer at Q16: Satisfaction with journey)
Answer at Q16          % giving open comment   N
Very Satisfied         51                      34,720
Fairly Satisfied       54                      40,967
Neither/nor            64                      9,002
Fairly Dissatisfied    77                      4,703
Very Dissatisfied      82                      2,245
Total comments: 51,674 across the last four waves
What does Signoi enable us to do?
New generation AI platform to derive meaning and themes from unstructured open text data
Signal from Noise | Revelation, not search | Storytelling & interrogation
- Accelerated reading: fast, rigorous decoding of messy data, revealing meanings, emotional energies, attitudes and feelings; complex energy rather than base sentiment; meaning rather than simple coding
- Surfacing naturally emerging patterns and themes using neural nets, machine learning models, and advanced analytics. Shows what is there rather than just finding things you need to tell it to look for
- Cutting analysis time and cost by removing a lot of the grunt work – allowing human minds to do more of what they’re best at – thinking and interpretation
Key themes
Signoi then uses a form of cluster analysis to automatically identify and quantify themes. Human analysts then name the themes based on the nature of the comments
[Chart: Topic Presence % for each of the 33 themes. Values in descending order: 8.8, 7.8, 7.2, 6.9, 5.3, 4.9, 4.4, 4.1, 3.9, 3.9, 3.7, 3.2, 2.9, 2.9, 2.8, 2.6, 2.3, 2.1, 2.1, 2.0, 2.0, 1.9, 1.9, 1.8, 1.7, 1.5, 1.4, 1.1, 0.8, 0.7, 0.6, 0.5, 0.3. Theme labels from the chart:]
- 6. Misery
- 20. Everything's fine
- 19. Seating management
- 1. Interconnected frequency
- 29. Misleading information
- 15. Crush hour
- 17. Generally unreliable
- 16. Human touch
- 11. Accessible space
- 8. Satisfied BUT
- 32. Ticketing
- 5. Old and shabby
- 31. Peace and quiet
- 4. Basic facilities
- 18. Atypicality
- 26. Feeling safe
- 12. Travelling discomfort
- 2. Dirty trains
- 24. Fares
- 3. No complaints
- 25. What's going on?
- 23. Faster trains
- 33. Mobile office
- 7. Minor quibbles
- 14. Very expensive
- 30. Renationalise!
- 21. Value for money
- 22. Announcements
- 10. No hassle
- 13. Basic efficiency
- 27. Transport options
- 9. Signalling issues
- 28. General annoyance
NOTE: Online dashboards allow you to explore the themes in detail, by TOC, journey type, and so on
Key themes: summary
TOPIC: IN A NUTSHELL
- 1. Interconnected frequency: Issues with short connection times or lack of frequent connections; delays have a knock-on effect
- 2. Dirty trains: Trains are not clean, sometimes smell, often old
- 3. No complaints: Literally, no complaints!
- 4. Basic facilities: On train and in station, availability of toilets, catering etc
- 5. Old and shabby: Rolling stock is tired, rickety, and uncomfortable
- 6. Misery: Debilitating delays and cancellations, often frequent
- 7. Minor quibbles: Minor issues that do not impact satisfaction
- 8. Satisfied BUT: Often contextualising against small delays
- 9. Signalling issues: Explicit mentions of signal problems, track trespass, etc
- 10. No hassle: Bland positive assessments
- 11. Accessible space: Complex topic containing accessibility in all its forms: luggage, buggies, bikes, car parking, even escalators
- 12. Travelling in discomfort: Uncomfortable train environment – seats, temperature, etc
- 13. Basic efficiency: Getting from A to B as advertised
- 14. Very expensive: Complaints about ticket prices
- 15. Crush hour: Stories of hellish rush hour journeys
- 16. Human touch: Mentions of staff (station & train), usually but not always positive
- 17. Generally unreliable: Wider contextual comments about a route or an operator, beyond the specific journey
- 18. Atypicality: Making the point that while this journey may have been OK, it is generally worse
- 19. Seating management: Availability of seat bookings, and whether they are honoured on the train itself
- 20. Everything’s fine: Positive comments about a pleasant journey
- 21. Value for money: Generally negative about value for money relative to service levels delivered
- 22. Announcements: Unclear/inaudible announcements (train/station)
- 23. Faster trains: General comments about slowness on some routes
- 24. Fares: Complexity and/or price of getting the right fare
- 25. What’s going on?: Confusing or absent information (especially at stations) when trains are delayed / cancelled
- 26. Feeling safe: Concerns about safety on board and at stations, rowdy passengers, need for guards often mentioned
- 27. Transport options: Lack of choice in how to get from A to B
- 28. General annoyance: A variety of irritations that do not sit well within other topics
- 29. Misleading information: A major cause of stress, especially at stations – wrong information about platforms, etc
- 30. Renationalise: Belief that TOCs’ need for profit undermines service
- 31. Peace and quiet: The need for quiet journeys – includes issues with noisy passengers, quiet zones, etc
- 32. Ticketing: Mainly about machines and lack of clarity about what ticket is valid on which route
- 33. Mobile office: Onboard WiFi, power, etc, mainly for business people
NOTE: The themes aggregate comments together based on tone of voice as much as detailed content
Key themes: example
EXAMPLES
General unreliability is a strong driver of (lack of) trust
- Southern Rail provide a terrible service and the franchise should be awarded to a train company who can provide a decent service
- The rail service is awful, commute regularly and cannot rely on the trains to be on time, terrible
- Generally, trains running through Streatham are late or cancelled at very short notice. A very unreliable rail service
- This train (the 16.44) isn't too bad, but the DRS operated trains have terrible reliability and are slower
- Looking forward to Crossrail - hope its an improvement. Heathrow Connect frequently cancels services
- The rail service is appalling and overpriced for the service received
- Govia provide a terrible service, overpriced. I have changed how I live my life in order to get to work!
- West Croydon is fine. Southern trains are extremely unreliable though and the service is generally shambolic
- Thameslink is poor and unreliable, constantly late and poor quality trains
- Arriva Trains is very unreliable and makes no compensation. Overpriced for such poor service
Key themes: example
EXAMPLES
The human touch is very important to passengers
- One of the staff at Sittingbourne helped another passenger with a young child on to the train and asked if she needed assistance at the other end. He was very helpful to her and it didn't go unnoticed
- The staff on the train are always polite and friendly and very informative, always helpful
- As a solo traveler I always appreciate being able to see a member of staff on the train
- The lady member of staff at the station was extremely helpful and polite
- Staff were extremely helpful and polite on the train and on the platform/station
- Train and train station staff are always pleasant
- Friendly staff at Beaconsfield station - always helpful and kind
- Station staff very helpful especially two ladies on Clapham station
Journey purpose
In terms of tangible themes, commuters, business travellers and leisure travellers’ comments are very different in tone
Commuters
- General (un)reliability of the service
- Overcrowding
- Fares and value for money
- Making tight connections – small delays (by minutes, in some cases) have a big knock-on effect
Business travellers
- Travelling in comfort
- Comment on the state of trains (old rolling stock, shabby, dirty, and so on)
- WiFi, power, “mobile office” is a priority
- Price of tickets/VFM
- On-train facilities (toilets, catering)
- Pre-booked seats: (a) availability online and (b) actually getting the seat they booked
Leisure travellers
- Peace and quiet
- The human touch – staff helping them, at station and on train
- Station facilities, information, announcements etc.
- Car parking at stations, and luggage / storage / buggy space on trains
- Comfort on trains, especially longer journeys
Most negative comments and emotions
Elective journeys elicit fewest negative comments and emotions
TOC overview
The types of comments individual TOCs receive are very different, reflecting their distinct characteristics. Example comparison:
HEATHROW EXPRESS
- WiFi, power, mobile office a plus
- Price of tickets a negative
- Positive comments about speed, efficiency, no hassle
VIRGIN TRAINS
- Travelling in comfort (often a negative)
- Mobile office (power/WiFi not working)
- Onboard facilities lacking, problems with toilets and so on
SOUTHERN
- Delays and cancellations
- General (un)reliability of the service
- Overcrowding
- Fares and value for money
Fewest negative comments → Most negative comments
Conclusions: how this helps us gain insight
People use the ‘any other comments’ opportunity to:
- Reinforce the importance of points they may have made in the more structured questions (for example, fleshing out WHY they were dissatisfied with a journey)
- Make wider contextual points about their (general) travelling experience – for example, talking about feeling safe not on ‘this’ journey, but on journeys in general (or on other journeys they have had)
- Comment on the state of the nation – the national infrastructure, the way TOCs invest or do not invest (as they see it), the age of rolling stock, and so on
- Talk about stations, especially in terms of accessibility, facilities, confusing announcements, hard-to-navigate ticketing, car parking, and more
- Raise very specific points that may not be highlighted in the survey itself
The comments are a potential gold mine of insight
Conclusions: what we see in NRPS comments
In general
- The things that drive trust are the basic truths of travel: on time, reliable, predictable
- There is no one silver bullet for the passenger experience. It is often an accumulation of small things
- Old rolling stock, dirty trains and so on feed a narrative of underinvestment
- For frequent travellers, especially commuters, issues are exacerbated
- TOCs vary significantly in terms of the issues raised in the comments – there are many individual learnings
At the station
- Improve clarity of information and announcements: need to avoid confusion and stress
- Accessibility is frequently referenced – especially by older travellers and leisure travellers with children etc
- Station facilities, especially shops and toilets, can be improved (also WiFi)
- Station staff are a great asset, although there are some stories of rudeness etc
- Commuters / season ticket holders aside, there is confusion over ticket prices – transparency and clarity
On the train
- Overcrowding and general misery is a constant refrain of commuters
- The train environment is noticed by passengers – a smart, clean train with working toilets goes a long way
- Where WiFi and power are advertised, they need to work. Key for business travellers
- People need to feel safe. Guards play a strong role here, as do other passengers
- Simple things like adequate luggage space, room for buggies, bikes, etc. make a big difference