Building an analytic department: From Zero to TensorFlow
The Peter principle: People in a hierarchy tend to rise to their “level of incompetence”: An employee is promoted based on their success in previous jobs, until they reach a level at which they are no longer competent, as skills in one job do not necessarily translate to another.
Introductions
Antoine Desmet
Analytics manager – Smart Solutions, Komatsu
Hunter Valley
- 2000: The US Defense Department ended the purposeful degradation of GPS
- 2008: Komatsu releases Level 4 autonomy, driverless truck fleet. Operates even if the wireless link is lost
Real-time terrain mapping
- LIDAR on diggers
- Scans stitched together into terrain map
- Compare to plan
- Operator sees, in near real-time:
  - Red: over-dug
  - Blue: matches plan
  - Green: needs digging
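The compare-to-plan step can be sketched in a few lines. This is only an illustration, assuming the terrain map and the plan are both grids of elevations in metres; the function names and the 0.1 m tolerance are hypothetical, not taken from the actual system.

```python
from typing import List

Grid = List[List[float]]


def classify_cell(scanned: float, planned: float, tolerance: float = 0.1) -> str:
    """Colour one terrain cell by comparing scanned and planned elevation."""
    diff = scanned - planned
    if diff < -tolerance:
        return "red"    # surface is below the plan: over-dug
    if diff > tolerance:
        return "green"  # surface is above the plan: needs digging
    return "blue"       # within tolerance: matches plan


def classify_map(scanned: Grid, planned: Grid, tolerance: float = 0.1) -> List[List[str]]:
    """Apply classify_cell to every cell of two grids of the same shape."""
    return [
        [classify_cell(s, p, tolerance) for s, p in zip(scan_row, plan_row)]
        for scan_row, plan_row in zip(scanned, planned)
    ]


# Hypothetical 2x2 patch: the plan is a flat bench at 10.0 m.
plan = [[10.0, 10.0], [10.0, 10.0]]
scan = [[10.0, 9.5], [10.5, 10.05]]
colours = classify_map(scan, plan)
```

The tolerance band is the design choice that matters: without it, sensor noise would make every cell flicker between red and green.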
Topics
This is the story of a growing analytics team. It’s a business-oriented presentation.
A collection of thoughts and discoveries: sorry if I don’t have all the definitive answers.
- Background
- The beginning: small vs. big?
- Growth
- R&D
- Picking your projects
- Stakeholder management
- What’s next
Background
What data do we have, what we do with it… and WHY?
Ore extraction chain
- Cost = $20,000/hr
- Revenue = $40,000/hr
- Profit when operating = +$20,000/hr
- Profit on breakdown = -$15,000/hr
A leaking air hose:
- Time to fix = 1-2 hr
- Parts + labour = $300
- Loss of production = $15-30k
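The air hose arithmetic can be spelled out as a small worked example. It assumes the "profit on breakdown" figure is a loss per hour of downtime, which is how the 15-30 k$ range for a 1-2 hr fix appears to be derived; the constant and function names are made up for illustration.

```python
LOSS_PER_HOUR_DOWN = 15_000  # $/hr, assumed reading of "profit on breakdown = -15,000"
REPAIR_COST = 300            # $ parts + labour for the leaking air hose


def breakdown_cost(hours_down: float) -> float:
    """Total cost of an unplanned stop: lost production plus the repair itself."""
    return hours_down * LOSS_PER_HOUR_DOWN + REPAIR_COST


for hours in (1, 2):
    lost = hours * LOSS_PER_HOUR_DOWN
    print(f"{hours} hr down: ${lost:,} lost production + ${REPAIR_COST} repair "
          f"= ${breakdown_cost(hours):,.0f}")
```

Even for the shortest fix, the lost production is 50 times the cost of the part, which is why the focus is on planning the stop rather than saving the hose.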
The cost of downtime
- Payload
- Operator’s joysticks
- Temperatures
- Motor currents
- Machine’s motions
- Auto-lube system
- Air pressure
- Brakes status
800 sensors. Sampling rate: 100 ms max
What we provide
- The machine’s control system will “fault” if it detects a severe malfunction
- Unplanned downtime is extremely costly in the mining industry
- We analyse telemetry data to detect issues before they trigger a system fault
- It’s not so much about saving the part. By the time we can detect a malfunction, it’s often already beyond repair
- It’s about giving the customer time to plan maintenance for what would otherwise be a disruptive unplanned breakdown
In the beginning
At the peak of the “big data” hype cycle
Day 1
- 2014: one engineer (me) and one manager (sales)
- At the peak of the “Big Data” craze, but…
- In the midst of a mining downturn: no budget, pressure to deliver
- 6 years prior, a visionary set up dataloggers + backend to harvest data from hundreds of sensors at high resolution = lots of data
- Data locked-up in antiquated time-series databases
- Fragile infrastructure
- Zero process
You are here
The Skunk works
- Hired a couple of summer interns to boost output
- Version control = copy/paste in separate folders
That’s OK because there were only a couple of developers
- Built a rudimentary “model factory” data-dredging algorithm – without any hypothesis or prior assessment. Generally viewed as poor practice…
That’s OK because it’s machine data: correlations usually indicate something mechanically or electrically coupled. Feature engineering made it work. 3 months = wide “coverage” of the machine.
- Do everything on your laptop, then straight to Production
That’s OK because there were no contracts and nothing mission-critical. The only mission-critical thing was demonstrating value
Reflections: Small Vs. Big
Small / startup model:
- Loose plan, objectives and strategy
- Less capital investment from business, so lower expectations
- Pick problems yourself: those that seem relevant, and “safe bets” = quick wins in months
- High risk of picking the wrong projects. Fast but disorganised, bound to run into scaling issues
Big / corporate model:
- Large investment, financial targets set from the start
- Regimented methods, pressure to deliver may hinder creativity
- 1 year, 10 DS: explore, investigate use cases for analytics
- Well organised, safe-but-slow approach, prepared for the long-term
Growing
Product: tick – customers: tick – what’s next?
Another start-up that became bloated
Mech/Elec engs were very productive and creative… but things started to tear at the seams:
- Why document when everyone knows… bus factor!
- IT upgrading databases crippled us with rework.
- Lack of software engineering practices = poor reliability, readability and re-usability.
- Things started to slow down.
- Routine means you become blind to your own deficiencies.
- Hard to see the paradigm shift: “remember how we used to be faster, what happened?”
- Accept that things are the way they are. Getting a clean run or working faster isn’t possible.
Today
- 2-3 years later, we welcomed 3 team members, including a senior software dev.
- The software dev went on a crusade (still going) for: unit tests, doc, libraries
- The “old guard” had to lift their game and mature to integrate the “fresh blood”. This helped kick the old counter-productive habits, and work towards increasing quality and pace
Our team now has:
- 2 Data scientists: the theory
- 2 Engineers: make it work
- 2 Software developers: make it scale
- 1 Analyst / report developer: make it visible
- 3 Subject matter experts: make it relevant
Workflow challenges
The release cliff-hanger:
- Analysts are fluent at developing models on their laptop…
- Releasing an analytic into production is a rare event. Lack of practice = frequent fails
Trialling a solution:
- Start with a test release of a “skeleton”, instead of leaving release as the final step
- DevOps 101: release early and frequently!
[Figure: three “Ouch” failures on the way to PROD success]
Workflow challenges
From bench to streaming:
- R&D happens on a static block of time-series data (e.g. one month).
- Challenge = from static to live streaming: batch size, handover between batches, catching up (maintain full history) vs. forcing forward (satisfy real-time)
Standardise
- Build high-level functions & templates to abstract real-time execution aspects.
- Don’t lock down the process to the point where it is hard to build anything “non-standard”
- Standardising helps maintainability, collaboration, etc.
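To make the "handover between batches" point concrete, here is a minimal sketch of a rolling mean that keeps its window between calls, so the same code gives the same answer on one static block or on a stream of small batches. The class name and interface are hypothetical, not the team's actual template.

```python
from collections import deque
from typing import Deque, Iterable, List


class RollingMean:
    """Rolling mean whose window survives the handover between batches."""

    def __init__(self, window: int) -> None:
        # deque(maxlen=...) silently drops the oldest sample once full.
        self.buffer: Deque[float] = deque(maxlen=window)

    def process_batch(self, batch: Iterable[float]) -> List[float]:
        """Consume one batch and return one rolling mean per sample."""
        out: List[float] = []
        for value in batch:
            self.buffer.append(value)
            out.append(sum(self.buffer) / len(self.buffer))
        return out


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Bench: the whole static block in one go.
static_result = RollingMean(window=3).process_batch(data)

# Streaming: the same samples arriving in three uneven batches.
stream = RollingMean(window=3)
live_result = (
    stream.process_batch(data[:2])
    + stream.process_batch(data[2:5])
    + stream.process_batch(data[5:])
)
```

Hiding the state inside the object is the kind of abstraction a template can provide: the analyst writes the per-sample logic and does not have to think about batch boundaries.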
3 aspects of continuous improvement
- Streamline actioning the insights
- Streamline tools for faster analytics development
- Streamline analytics: generic and re-useable
R&D
Finance, industrial plants and insurance analytics
Industrial analytics are a niche application, no-one can help me! What could there be to gain by looking outside of my industry?
- Finance: f(A,B) = Ĉ, C is a share price, A and B the competitors’ share prices
  Ĉ >> C: sell, Ĉ << C: buy, Ĉ ≈ C: do nothing
- Insurance: f(A,B) = Ĉ, C is the amount claimed, A and B some parameters of the claim
  Ĉ ≈ C: do nothing, Ĉ << C: investigate a potentially fraudulent claim
- Plant analytics: f(A,B) = Ĉ, C is the temperature of a motor, A and B are bearing temps
  Ĉ ≈ C: do nothing, Ĉ << C: motor potentially overheating
At the right level of abstraction, it all becomes the same. Talk to people. But I’m preaching to the choir!
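The shared abstraction can be sketched as one comparison function plus a per-industry action table. This is an illustration only: the tolerance, the function name and the table layout are assumptions, and only the insurance and plant cases are shown.

```python
from typing import Dict


def compare(estimate: float, actual: float, tolerance: float) -> str:
    """Say where the actual value C sits relative to the estimate Ĉ."""
    if actual - estimate > tolerance:
        return "above"  # Ĉ << C: reality is well above what the model expects
    if estimate - actual > tolerance:
        return "below"  # Ĉ >> C: reality is well below what the model expects
    return "match"      # Ĉ ≈ C


# Same comparison, different meaning depending on the industry.
ACTIONS: Dict[str, Dict[str, str]] = {
    "insurance": {
        "above": "investigate a potentially fraudulent claim",
        "match": "do nothing",
    },
    "plant": {
        "above": "motor potentially overheating",
        "match": "do nothing",
    },
}

# Hypothetical reading: the model expects 70 degrees from the bearing
# temperatures, but the motor sensor reports 95.
outcome = compare(estimate=70.0, actual=95.0, tolerance=10.0)
action = ACTIONS["plant"][outcome]
```

Only the model f and the action table change between domains; the estimate-and-compare step stays the same.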
Interns for R&D
- Autonomy: R&D can be insulated from the production systems. Low risk to business.
Here’s a dataset, install [ your favourite toolset ] and go get it, tiger!
- This usually produces a proof-of-concept
- An intern can clear the fog on that high risk/high value project. You can make a sound decision to proceed, without having used any precious permanent employee time
- With the right intern: the newer the tech, the greater the challenge… the more they engage!
- Co-supervision with an academic will inject a lot of their knowledge in your project. This is often a better solution than directly engaging in a research project with academics
- You can hire the outstanding ones, risk free!
Picking projects
business value vs. geeky indulgence
A tale of two companies merging
Komatsu
- Mainly sells dump trucks
- A mine owns 50-200 + spare units
- Less expensive, a small loss is not mission-critical
- Analytics strategy: focus on compliance to scheduled maintenance, part sales, operator abuse
P&H
- Mainly sells primary digging equipment
- A mine owns 1-5 of them, no redundancy
- Very expensive, “top of the pyramid”
- Analytics strategy: focus on fault prediction & uptime maximisation: keep them running 24/7
The “no free lunch” of analytics
- Leaking air hose: recurrent, low impact, easy: supervised
- Gearbox failure: rare, extremely high impact, hard: unsupervised
TensorFlow to the rescue!
- Need a generic time-series pattern recognition
- Wary of the deep-learning hype: “hot topic” of 2016… At the peak of Gartner’s “hype curve”
- Is it just for images? An overkill?
- A summer intern ran the project with great success (accurate and generalises)
- CNN + LSTM is our standard approach to detect failure patterns in automated systems.
Interested in the details? Data Science Sydney Meetup - Tue 28 May
Anomaly detection with automated data-dredging
- Lots of correlations across coupled sensors
- Leverage robustness of ensembles. The fault you’re trying to detect doesn’t “spill” over everywhere
- Estimate sensor value, based on its coupled counterparts + lots of feature engineering
- Compare estimate with reality
- Build a model for each pair-wise combination where corr>threshold
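The pair-wise recipe above can be sketched with plain Python. This is a toy version under stated assumptions: a straight-line fit stands in for the real models and feature engineering, the 0.8 threshold is arbitrary, and all names and sensor values are hypothetical.

```python
from itertools import permutations
from math import sqrt
from statistics import mean
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (target sensor, predictor sensor)


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for y ~ x (x must not be constant)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def build_pair_models(data: Dict[str, List[float]], threshold: float = 0.8) -> Dict[Pair, Tuple[float, float]]:
    """One model per ordered sensor pair whose |correlation| exceeds the threshold."""
    models: Dict[Pair, Tuple[float, float]] = {}
    for target, predictor in permutations(data, 2):
        if abs(pearson(data[predictor], data[target])) > threshold:
            models[(target, predictor)] = fit_line(data[predictor], data[target])
    return models


def residuals(models: Dict[Pair, Tuple[float, float]], sample: Dict[str, float]) -> Dict[Pair, float]:
    """Compare estimate with reality: actual minus estimated value, per model."""
    return {
        (target, predictor): sample[target] - (slope * sample[predictor] + intercept)
        for (target, predictor), (slope, intercept) in models.items()
    }


# Hypothetical healthy history: the two temperatures move together, payload does not.
history = {
    "motor_temp": [60.0, 62.0, 64.0, 66.0, 68.0],
    "bearing_temp": [50.0, 51.0, 52.0, 53.0, 54.0],
    "payload": [3.0, 9.0, 1.0, 7.0, 5.0],
}
models = build_pair_models(history)
res = residuals(models, {"motor_temp": 80.0, "bearing_temp": 53.0, "payload": 5.0})
```

In this toy case only the two temperature models are built, and the new sample shows the motor 14 degrees hotter than its bearing counterpart predicts. With hundreds of coupled sensors, many such small models vote on each reading, which is where the ensemble robustness comes from.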
Stakeholder management
Warning: sarcastic content, rants, memes and exaggeration (for comical purposes only)
Data-driven vs. storytelling-driven
Data geek + subject geek = Successful analytic
- Field experts in mining are incredibly knowledgeable on the machines. Human-Wikipedia level. They are the authority. They know it better than those who designed it.
- Data-driven approach vs. storytelling approach:
  “Show me examples of what you are looking for” – “It’s a rare one, there are none in the DB”
  “Let me tell you how the machine works and fails” – “I don’t understand”
  Generate synthetic failure data?
- Explain the worth of machine learning, when they prefer describing things using a (long) series of logic statements.
- Then explain that most ML algos are “black-box”: you can’t “trace” an issue, the only fix is more training data.
Why unsupervised? I know everything!
- Dealing with people who spent a large chunk of their lives getting to know a particular piece of equipment, like the back of their hand.
- Realistically, they know 99% of what can go wrong
- Sell the value of anomaly detection – when they know every single way the machine can fail and would prefer a supervised approach
- Yet there are 800 sensors – sure enough, anomaly detection uncovers unknown (but rare) issues.
The tricky bits: dealing with the impact
Some potentially unwanted results:
- Applying proper statistics tells you “by how much you can’t be sure”. You are trading “ignorant certainty” for “educated uncertainty”… Often people feel like they lost out.
- Customer claims warranty on broken parts, but data shows customer abused the machine
- Onsite maintenance crews’ job insecurity: analytics are out to take my job!
- Some business models involve charging for machines by the “engine-running” hour. You uncover ways for customers to reduce idle time… Sales is going to love you
What’s next
The human element and the edge
Next steps
Optimising operations
- Machines have complex automation; ultimately automation = low variance = good models. A human operating the machine = immense variance. Yet there’s a lot of potential in optimising how they use and control the machine.
Streaming analytics
- Batch = 1,000 lines of code. Re-written in a stream context = 300 lines
Cloud vs. Edge analytics
- Cloud is great for development, but not for low latency. Edge is great for fast feedback. When an operator misbehaves, they need a notification within 5 seconds. Edge overcomes wireless network latency and reliability issues.
Sharing the insights with the business
- Connecting with the business: releasing insights into business databases (SalesForce, etc.)