IndabaX 2019 Malawi An Introduction to ML Amelia Taylor Lecturer in AI The Polytechnic, University of Malawi

A bit about myself ● I studied Mathematics and Computer Science and obtained a 1st Class degree (main project: Natural Language Processing) ● I obtained a James-Watt Scholarship to pursue a PhD in Mathematical Logic in Edinburgh, which I completed in 2006 ● I worked as a research associate (query language for astronomical data) ● I then switched to finance and worked on building risk models for funds management, asset allocation and trading.

ML ● For solving problems that require pattern recognition. ● The term machine learning is often used interchangeably with data mining and knowledge discovery in databases.

Applications ● Detecting financial fraud (cyber surveillance) ● Detecting spam emails (or phishing) ● Virtual assistants (Siri, Alexa, Google Now) ● Marketing and sales (analysing purchasing behaviour) ● Social media ● Health, e.g., wearable devices worn by the patient that provide information about the patient’s condition: heartbeat, blood pressure, etc.

Two Types of ML Algorithms ● Supervised Learning: the parameters of the algorithm are ‘tuned’ by running it on ‘training data’, i.e. inputs paired with their corresponding outputs – Input data is annotated with labels / categories – After the parameters are tuned, one gives a new, unlabeled input to the algorithm – One expects the algorithm to label that input – Classification – For example in biology – For example, supervised learning is used extensively in automatic translators
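The supervised workflow above (tune parameters on labeled training data, then label a new input) can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it uses a nearest-centroid classifier, where the "tuned parameters" are simply the average feature vector of each label. The features (number of links, number of exclamation marks) and all data values are made up for the example.

```python
import math
from collections import defaultdict


def fit_centroids(training_data):
    """'Tune' the parameters: compute the mean feature vector for each label."""
    grouped = defaultdict(list)
    for features, label in training_data:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Label a new input with the label of the closest centroid."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical labeled emails: (number_of_links, number_of_exclamation_marks)
training_data = [
    ((0, 1), "ham"),
    ((1, 0), "ham"),
    ((0, 0), "ham"),
    ((5, 4), "spam"),
    ((7, 6), "spam"),
    ((6, 5), "spam"),
]

centroids = fit_centroids(training_data)
print(predict(centroids, (6, 4)))  # new, unlabeled input
print(predict(centroids, (1, 1)))
```

The first call is closest to the "spam" centroid and the second to the "ham" centroid, so the classifier labels inputs it never saw during training.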

Two Types of ML Algorithms ● Unsupervised Learning: there is no training set of labeled data ● The most common approach to unsupervised learning is cluster analysis: finding hidden patterns or groupings in data.
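As a rough illustration of cluster analysis, the sketch below implements a bare-bones k-means: points are repeatedly assigned to their nearest centroid, and each centroid is moved to the mean of its members. k-means is one clustering algorithm among many and is chosen here only as an example; the points, the value of k, and the naive initialisation (the first k points) are assumptions made for the sketch.

```python
import math


def k_means(points, k, iterations=10):
    """Group unlabeled points into k clusters; returns (centroids, clusters)."""
    centroids = list(points[:k])  # assumption: naive initialisation with the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


# Made-up 2-D data with two visible groups; no labels are provided.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, clusters = k_means(points, k=2)
for members in clusters:
    print(members)
```

No labels are supplied anywhere: the grouping emerges from the distances between points alone.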

Why do we want Classification? ● Classification enables systems-level analysis of large data sets. ● Classification enables automation. ● Classification increases the ability to retrieve information from large data sets and enables the interpretation, discovery of new patterns, and acquisition of knowledge from large data sets.

Challenges in Classification ● Linear Regression. ● Neural Networks (perceptrons). ● Naive Bayes Classifier. ● Decision Trees. ● Use of Statistics In Input Data.

Decision Trees

Bayes Formula Example ● 1% of women have breast cancer (and therefore 99% do not). ● 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it). ● 9.6% of mammograms detect breast cancer when it’s not there (and therefore 90.4% correctly return a negative result). Put in a table, the probabilities look like this:

                 Cancer (1%)    No cancer (99%)
Test positive    80%            9.6%
Test negative    20%            90.4%

How Accurate Is The Test? ● Now suppose you get a positive test result. What are the chances you have cancer? 80%? 99%? 1%?

Bayes Theorem ● Pr(H|E) = Pr(E|H) Pr(H) / [ Pr(E|H) Pr(H) + Pr(E|not H) Pr(not H) ]

Applying Bayes on Our example ● Pr(H|E) = Chance of having cancer (H) given a positive test (E). This is what we want to know: How likely is it to have cancer with a positive result? ● Pr(E|H) = Chance of a positive test (E) given that you had cancer (H). This is the chance of a true positive, 80% in our case. ● Pr(H) = Chance of having cancer (1%). ● Pr(not H) = Chance of not having cancer (99%). ● Pr(E|not H) = Chance of a positive test (E) given that you didn’t have cancer (not H). This is a false positive, 9.6% in our case.
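To make the arithmetic concrete, the quantities above can be plugged into Bayes' theorem, Pr(H|E) = Pr(E|H) Pr(H) / [Pr(E|H) Pr(H) + Pr(E|not H) Pr(not H)]. The short sketch below does this with the numbers from the example; the function and parameter names are chosen for illustration only.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """Pr(H|E): probability of the hypothesis given a positive test."""
    # Pr(E): total chance of a positive test, from true and false positives.
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence


p = posterior(prior=0.01, true_positive_rate=0.80, false_positive_rate=0.096)
print(f"Chance of cancer given a positive test: {p:.1%}")  # 7.8%
```

The result is about 7.8%, far lower than the 80% detection rate, because false positives among the 99% of women without cancer greatly outnumber true positives among the 1% with it.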

Challenges in Clustering ● Data distribution. ● Large number of samples. The number of samples to be processed is very high, so algorithms have to be designed with scaling in mind. Like many interesting problems, clustering in general is NP-hard, and practical, successful data mining algorithms usually scale linearly or log-linearly. Quadratic and cubic scaling may also be acceptable, but linear behaviour is highly desirable. ● High dimensionality. The number of features is very high and may even exceed the number of samples. ● Sparsity. ● Strongly non-Gaussian distribution of feature values. The data is so skewed that it cannot be safely modeled by normal distributions. ● Significant outliers. Outliers may have significant importance. Finding these outliers is highly non-trivial, and removing them is not necessarily desirable. ● Legacy clusterings. Previous cluster analysis results are often available. This knowledge should be reused instead of starting each analysis from scratch. ● Distributed data. Large systems often have heterogeneous distributed data sources. Local cluster analysis results have to be integrated into global models.

Ohio Doctors Appointments Dataset ● www.kaggle.com/joniarroba/noshowappointments ● Goal: discover why losses are rising even though the number of appointments is going up. – If patients are not showing up for their scheduled appointments, come up with a method to predict whether a patient will show up on the basis of his/her characteristics. ● Knowing which patients are likely not to show up would enable the hospital to take countermeasures like the following: – Provide constant appointment reminders and confirmations – Bring the head count of doctors and hospital staff in line with the demand at hand
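A natural first step towards such a method is to compare no-show rates across patient characteristics. The sketch below shows the idea on a handful of made-up records; the field names (`age_group`, `sms_reminder`, `no_show`) and all values are hypothetical and are not taken from the Kaggle file or from the practical notebook.

```python
from collections import defaultdict


def no_show_rate_by(records, characteristic):
    """Fraction of missed appointments for each value of a characteristic."""
    totals = defaultdict(int)
    missed = defaultdict(int)
    for record in records:
        group = record[characteristic]
        totals[group] += 1
        if record["no_show"]:
            missed[group] += 1
    return {group: missed[group] / totals[group] for group in totals}


# Hypothetical appointment records, for illustration only.
records = [
    {"age_group": "adult", "sms_reminder": True, "no_show": False},
    {"age_group": "adult", "sms_reminder": False, "no_show": True},
    {"age_group": "child", "sms_reminder": True, "no_show": False},
    {"age_group": "adult", "sms_reminder": False, "no_show": True},
    {"age_group": "child", "sms_reminder": False, "no_show": False},
    {"age_group": "adult", "sms_reminder": True, "no_show": True},
]

rates = no_show_rate_by(records, "sms_reminder")
print(rates)
print(no_show_rate_by(records, "age_group"))
```

Characteristics whose groups show clearly different no-show rates are candidate inputs for a classifier that predicts whether a patient will show up.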

Practical ● Open the Jupyter notebook which handles the Ohio Data Set.

END ● Thank you.
