 
High-Performance Data Mining for Drug Effect Detection
A project funded by the Swedish Foundation for Strategic Research during 2012-2017

Henrik Boström*, Lars Asker, Hercules Dalianis, Mia Kvist, Aron Henriksson, Isak Karlsson, Jing Zhao
Department of Computer and Systems Sciences, Stockholm University

Ulf Johansson, Håkan Sundell, Karl Jansson, Henrik Linusson, Tuve Löfström
Department of Information Technology, University of Borås

• Project focus
• Organization
• Results
• Continuation

*At KTH Royal Institute of Technology since Oct. 2017

Project focus
To develop techniques and tools to support decision making and discovery of drug effects by analyzing electronic health records and chemical compound data

Data sources
● Chemical compound data
● Electronic health records (EHRs)

Technical challenges
● heterogeneous data
● high-dimensional, sparse and incomplete data
● temporal dependencies
● guarantees needed for predictions

Project organization

Research areas
● Adverse drug event (ADE) detection
● Clinical text mining
● Conformal prediction
● Parallel data mining

Research groups
● PhD students: Aron Henriksson, Maria Skeppstedt
  Supervisors: Hercules Dalianis, Martin Duneld, Mia Kvist
● PhD students: Isak Karlsson, Jing Zhao
  Supervisors: Lars Asker, Henrik Boström, Panos Papapetrou
● PhD student: Karl Jansson
  Supervisors: Håkan Sundell, Henrik Boström
● PhD students: Henrik Linusson, Tuve Löfström
  Supervisors: Ulf Johansson, Henrik Boström

Scientific output
● Publications
  - 21 journal papers
  - 45 conference papers
  - 12 workshop papers
  - 4 PhD theses
  - 1 Licentiate thesis
  - 1 forthcoming book
● Awards
  - The Börje Langefors Prize awarded by SISA to Aron Henriksson for best PhD thesis in informatics at a Swedish university in 2016
  - Carl H. Smith Award for best paper for I. Karlsson et al., 2016, Early Random Shapelet Forest, Proc. of the 19th International Conference on Discovery Science
  - Distinguished Paper Award to Jing Zhao et al., 2015, American Medical Informatics Association (AMIA) Annual Symposium

Main results on ADE detection
• Random forests are capable of screening EHRs that should be assigned ADE codes
• The random forest algorithm was extended to handle heterogeneous time evolving data by sampling temporal patterns
• High dimensionality can be handled by using random indexing to reduce dimensionality on EHRs
• Sparsity can be handled in random forests by resampling when predicting ADEs
• The choice of representation of EHRs has a high impact on the result, e.g., using concept hierarchies of clinical codes, patient-level vs. visit-level analysis, and how time dependencies are encoded

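As an illustration of the random indexing idea mentioned on this slide, the following is a minimal sketch, not the project's implementation. It assumes each clinical code receives a fixed sparse random vector with a few +1/-1 entries, and that a record is reduced by summing the vectors of its codes weighted by their counts. The function names, the dimensions and the example codes are hypothetical.

```python
import random


def index_vector(feature, dim=8, nonzeros=2, seed=0):
    """Sparse ternary index vector for one feature (hypothetical helper).

    The vector is derived deterministically from the feature name, so the
    same code always maps to the same vector.
    """
    rng = random.Random(f"{seed}:{feature}")
    positions = rng.sample(range(dim), nonzeros)
    vec = [0] * dim
    for j, pos in enumerate(positions):
        vec[pos] = 1 if j % 2 == 0 else -1  # alternate +1 and -1 entries
    return vec


def reduce_record(code_counts, dim=8, nonzeros=2, seed=0):
    """Project a sparse {code: count} record onto `dim` dimensions."""
    reduced = [0] * dim
    for code, count in code_counts.items():
        for i, value in enumerate(index_vector(code, dim, nonzeros, seed)):
            reduced[i] += count * value
    return reduced


if __name__ == "__main__":
    # Illustrative record: two occurrences of one code, one of another.
    record = {"I10": 2, "N02BE01": 1}
    print(reduce_record(record, dim=8))
```

The reduced vector has a fixed length regardless of how many distinct codes exist, which is the property that makes the approach attractive for high-dimensional, sparse records.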
Main results on clinical text mining
• Ensembles of semantic spaces created by manipulating underlying data and model hyper-parameters lead to improved performance on terminology development, NER, relation extraction and ADE detection
• Combining heterogeneous data from EHRs shown to lead to increased predictive performance – early fusion outperforming late fusion strategies
• Annotated corpus for learning to identify drugs and symptoms/disorders in clinical notes
• Distributional semantics extended to non-linguistic sequence data
  - Diagnosis codes (ICD)
  - Drug codes (ATC)
  - Clinical measurements

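The difference between early and late fusion can be sketched as follows. This is a toy illustration under stated assumptions, not the models used in the project: the nearest-centroid trainer, the source names ("codes", "measurements") and all function names are hypothetical stand-ins. Early fusion concatenates the feature vectors from all data sources before training one model, while late fusion trains one model per source and averages their scores.

```python
from typing import Callable, Dict, List

Vector = List[float]
Model = Callable[[Vector], float]  # maps features to a score in [0, 1]
Trainer = Callable[[List[Vector], List[int]], Model]


def train_centroid(rows: List[Vector], labels: List[int]) -> Model:
    """Toy nearest-centroid scorer; assumes both classes 0 and 1 occur."""
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    pos = mean([r for r, y in zip(rows, labels) if y == 1])
    neg = mean([r for r, y in zip(rows, labels) if y == 0])

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def score(v: Vector) -> float:
        dp, dn = dist(v, pos), dist(v, neg)
        return 0.5 if dp + dn == 0 else dn / (dp + dn)

    return score


def early_fusion(views: Dict[str, List[Vector]], labels: List[int], train: Trainer):
    """Concatenate per-source feature vectors, then train a single model."""
    names = sorted(views)
    fused = [sum((views[n][i] for n in names), []) for i in range(len(labels))]
    model = train(fused, labels)
    return lambda patient: model(sum((patient[n] for n in names), []))


def late_fusion(views: Dict[str, List[Vector]], labels: List[int], train: Trainer):
    """Train one model per source, then average the per-source scores."""
    models = {n: train(rows, labels) for n, rows in views.items()}
    return lambda patient: sum(m(patient[n]) for n, m in models.items()) / len(models)


if __name__ == "__main__":
    views = {
        "codes": [[1, 0], [1, 0], [0, 1], [0, 1]],
        "measurements": [[5.0], [6.0], [1.0], [2.0]],
    }
    labels = [1, 1, 0, 0]
    patient = {"codes": [1, 0], "measurements": [5.5]}
    print(early_fusion(views, labels, train_centroid)(patient))
    print(late_fusion(views, labels, train_centroid)(patient))
```

A practical consequence of the design is that early fusion lets the model exploit interactions between sources, whereas late fusion keeps each source's model independent and only combines their outputs.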
Main results on conformal prediction

[Figure: database and model producing the predictions below]
Standard prediction: Toxicity = 5.2, P(correct) = ?
Conformal prediction: Toxicity = (4.5, 5.9), P(correct) = 95%

● The framework was adapted to specific machine learning techniques
  – decision trees
  – random forests
  – ensembles of neural networks
● The conformal prediction framework was improved
  – tighter one-tailed predictions
  – sound procedure for using out-of-bag-instances for calibration
● Application to specific learning situations
  – handling imbalanced data
  – streaming data

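How a point prediction turns into an interval with a stated confidence can be sketched with split (inductive) conformal regression. The sketch below is a generic textbook-style illustration, not the project's improved procedures: it assumes absolute residuals on a held-out calibration set as nonconformity scores, and the function name and inputs are hypothetical.

```python
import math


def conformal_interval(calibration_residuals, point_prediction, confidence=0.95):
    """Return (low, high) around a point prediction.

    calibration_residuals: errors (true - predicted) of the underlying model
    on calibration examples that were not used for training.
    """
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Rank of the calibration score needed for the requested coverage.
    k = math.ceil((n + 1) * confidence)
    if k > n:
        # Too few calibration examples to support this confidence level.
        return (-math.inf, math.inf)
    half_width = scores[k - 1]
    return (point_prediction - half_width, point_prediction + half_width)


if __name__ == "__main__":
    residuals = list(range(1, 20))  # 19 illustrative calibration errors
    print(conformal_interval(residuals, 100, confidence=0.95))
```

Under the usual exchangeability assumption, an interval built this way contains the true value with probability at least the chosen confidence, which is the kind of guarantee the slide contrasts with a bare point prediction.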
Main results on parallel data mining
● Parallel implementations of Random Forests and Extremely Randomized Trees for GPU and CPU for both classification and regression tasks
● Streaming solutions for GPUs to support datasets larger than the memory available on the GPU
● GPU solutions were found to outperform state-of-the-art implementations
● GPU solutions were found to scale well (almost linear) with multiple GPUs

[Figure: running time in seconds against number of trees (×1000) for gpuERT, cpuERT, gpuRF, FastRF and WekaRF]

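Tree ensembles parallelize naturally because each tree is built independently of the others. The sketch below only illustrates that task decomposition on the CPU and is not the project's GPU code: it assumes one-split trees with a random feature and a random cut point (in the spirit of extremely randomized trees), uses Python threads purely to show the structure (CPython threads do not give a real speed-up for this workload), and all function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fit_random_stump(rows, labels, seed):
    """Fit a one-split tree on a bootstrap sample with a random cut point."""
    rng = random.Random(seed)
    n = len(rows)
    sample = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
    feature = rng.randrange(len(rows[0]))              # random feature
    values = [rows[i][feature] for i in sample]
    threshold = rng.uniform(min(values), max(values))  # random cut point

    def majority(indices, default):
        votes = [labels[i] for i in indices]
        return max(sorted(set(votes)), key=votes.count) if votes else default

    overall = majority(sample, labels[0])
    left = majority([i for i in sample if rows[i][feature] <= threshold], overall)
    right = majority([i for i in sample if rows[i][feature] > threshold], overall)
    return feature, threshold, left, right


def fit_forest(rows, labels, n_trees=25, workers=4, seed=0):
    """Build the trees as independent tasks; each task has its own seed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        seeds = range(seed, seed + n_trees)
        return list(pool.map(lambda s: fit_random_stump(rows, labels, s), seeds))


def predict(forest, row):
    """Majority vote over all trees."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return max(sorted(set(votes)), key=votes.count)


if __name__ == "__main__":
    rows = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]]
    labels = [0, 0, 0, 1, 1, 1]
    forest = fit_forest(rows, labels, n_trees=25)
    print(predict(forest, [0.05]), predict(forest, [1.05]))
```

Because every tree depends only on the data and its own seed, the same decomposition applies whether the workers are threads, processes or GPU thread blocks.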
Additional results
● Ethical permissions and access to EHRs:
  - 2 million electronic patient records from Karolinska University Hospital during 2007-2014
  - HEALTH BANK Infrastructure
● Software packages:
  - random forests (Julia, GPU, Java, Erlang)
  - text mining tools
  - conformal prediction (Python, Julia)
  - adverse event exploration and detection tools
● Critical mass of expertise in the area and established connections with stakeholders:
  - Karolinska University Hospital
  - Centre for Pharmacoepidemiology at Karolinska Institute
  - AstraZeneca
  - Swedish Toxicology Sciences Research Center

Continuation
● Nordic Center of Excellence in Health-Related e-Sciences (NIASC), Nordforsk, 44 MSEK, 2014-2018, partners: Karolinska Institutet, CBS, University of Copenhagen and Cancer Registry of Norway
● Analyzing registry data to find ways to improve treatment of heart failure patients, Stockholm County Council, 4.6 MSEK, 2017-2019
● Data Analytics for Research and Development, Knowledge Foundation, 4 MSEK, 2016-2018, partners: AstraZeneca and Scania R&D
● Data Driven Innovation – Algorithms, Platforms and Ecosystems, Knowledge Foundation, 12 MSEK, 2016-2020, partners: Ellos, Eton and Vinga of Sweden
● Temporal Data Mining for Detecting Adverse Events in Healthcare, Swedish Research Council, 3.4 MSEK, 2017-2020