
Stata/SQL/Python integration to emulate prospective cohort studies from big register data. Matteo Marrazzo, Nicola Orsini, Karolinska Institutet. 2019 Nordic and Baltic Stata Users Group meeting, Stockholm | 30 August


SLIDE 1

2019 Nordic and Baltic Stata Users Group meeting Stockholm | 30 August

Stata/SQL/Python integration to emulate prospective cohort studies from big register data

Matteo Marrazzo, Nicola Orsini
Karolinska Institutet

SLIDE 2

30 August 2019

Available sources

  • Data registers
      • Very large in size
      • Covering long periods of time
      • Solid study designs are needed to use them

SLIDE 3

Design valid epidemiological studies

  • Prospective cohorts
  • Measuring exposures
  • Defining outcomes
  • Including confounders and effect modifiers
  • Replication in different points in time
SLIDE 4

Relational Databases

  • Structured data
  • SQL language
  • Key data processing
  • ODBC
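
As a rough sketch of the key-based SQL processing these bullets refer to, the following uses Python's built-in sqlite3 module as a stand-in for a register database reached over ODBC. The tables (persons, prescriptions, diagnoses), their columns, the data, and the baseline dates are all hypothetical and not from the talk. The query fixes a baseline date, keeps people who are outcome-free at baseline, measures exposure from records before baseline, and takes the first outcome on or after it.

```python
import sqlite3

# Hypothetical register tables; an in-memory SQLite database stands in
# for the real relational database (an assumption for illustration).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE persons (id INTEGER PRIMARY KEY, birth_year INTEGER);
CREATE TABLE prescriptions (id INTEGER, rx_date TEXT);
CREATE TABLE diagnoses (id INTEGER, dx_date TEXT);
INSERT INTO persons VALUES (1, 1950), (2, 1960), (3, 1970);
INSERT INTO prescriptions VALUES (1, '2009-06-01'), (3, '2011-03-15');
INSERT INTO diagnoses VALUES
    (1, '2010-09-30'), (2, '2009-01-10'), (3, '2012-02-01');
""")

# Tables are linked through the person key `id`.
COHORT_SQL = """
SELECT p.id,
       EXISTS (SELECT 1 FROM prescriptions r
               WHERE r.id = p.id AND r.rx_date < :baseline) AS exposed,
       (SELECT MIN(d.dx_date) FROM diagnoses d
        WHERE d.id = p.id AND d.dx_date >= :baseline) AS first_event
FROM persons p
WHERE NOT EXISTS (SELECT 1 FROM diagnoses d
                  WHERE d.id = p.id AND d.dx_date < :baseline)
ORDER BY p.id
"""

def build_cohort(baseline):
    """Return (id, exposed, first_event) rows for people outcome-free at baseline."""
    return con.execute(COHORT_SQL, {"baseline": baseline}).fetchall()

cohort_2010 = build_cohort("2010-01-01")
cohort_2012 = build_cohort("2012-01-01")
```

Running the same statement with a different baseline is one way to replicate the design at different points in time.
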
SLIDE 5

ODBC Stata Integration

. odbc list
Data Source Name     Driver
odbc1                ODBC Driver
. odbc query "odbc1"
. odbc load, exec("`query'") dsn("odbc1")
SLIDE 6

Statistical Analysis

  • Poisson regression models to predict rates (poisson)
      Exposure + effect modifiers + confounders
  • Predictive margins (margins)
      Adjusted rates by exposure
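
One way to write down the model and the adjusted rates, as a sketch only. The symbols are notational assumptions rather than anything shown on the slides: Y_i is the number of events, T_i the person-time, X_i the exposure, M_i an effect modifier, and C_i the vector of confounders. A single exposure-by-modifier interaction is assumed.

```latex
\log \mathrm{E}[Y_i] = \log T_i + \beta_0 + \beta_1 X_i + \beta_2 M_i
                       + \beta_3 X_i M_i + \gamma^{\top} C_i

\widehat{\mathrm{rate}}(x) = \frac{1}{n} \sum_{i=1}^{n}
    \exp\!\left( \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 M_i
                 + \hat\beta_3 x M_i + \hat\gamma^{\top} C_i \right)
```

The second expression is the predictive margin for exposure level x: each subject's rate is predicted with the exposure set to x and their own observed covariates, and the predictions are then averaged over the sample.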

SLIDE 7

Python integration for visualization

use "C:\rates.dta", clear
python:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
# point Qt at the Anaconda plugin folder so the plot window can open
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = "C:\\Anaconda3\\Library\\plugins"
from sfi import Data
# copy the Stata variables exposure and rate into a pandas DataFrame
X = np.array(Data.get("exposure rate"))
df = pd.DataFrame({'Exposure': X[:, 0], 'Rate': X[:, 1]})
fig, ax1 = plt.subplots()
colorset = ["orange", "green"]
# one density curve per exposure group
# (seaborn >= 0.11 deprecates distplot; sns.kdeplot draws the same curve)
for i in range(0, 2):
    sns.distplot(df.loc[df['Exposure'] == i, "Rate"], color=colorset[i], label=i, hist=False)
plt.ylim(0, 1)
plt.legend(title='Exposure', loc='upper right', ncol=2, fancybox=True, shadow=True)
plt.xlabel('Rate')
plt.ylabel('Distribution')
plt.show()
end

SLIDE 8

Distribution of rates by exposure

SLIDE 9

Animations with Python: scatterplot

python:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import seaborn as sns
import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = "C:\\Anaconda3\\Library\\plugins"
from sfi import Data
X = np.array(Data.get("day exposure rate"))
df = pd.DataFrame({'day': X[:, 0], 'exposure': X[:, 1], 'rate': X[:, 2]})

Import libraries and create dataframe from Stata

SLIDE 10

Animations with Python

fig, ax = plt.subplots(figsize=(16, 9), dpi=90)
ax.set_xlim(0, 24)
ax.set_xlabel('Month')
ax.set_ylabel('Rate')
ax.set_ylim(4, 8)
ax.set_title('')
colorset = ["orange", "green"]
def get_data(day=0, exposure=0):
    # rows for one exposure group at one time point
    mask = (df['exposure'] == exposure) & (df['day'] == day)
    x = df.loc[mask, "day"]
    y = df.loc[mask, "rate"]
    return x, y

Create the basic plot figure and the function to get x and y
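
To make the role of get_data concrete, here is a dependency-free sketch of the same per-frame selection, using plain Python tuples instead of a pandas DataFrame. The records and the helper name select_frame are made up for illustration and do not come from the talk.

```python
# Hypothetical (day, exposure, rate) records standing in for the Stata dataset.
records = [
    (0, 0, 5.0), (0, 1, 6.5),
    (1, 0, 5.2), (1, 1, 6.1),
    (2, 0, 4.9), (2, 1, 6.8),
]

def select_frame(rows, day=0, exposure=0):
    """Return the x (day) and y (rate) values for one exposure group on one day."""
    hits = [(d, r) for d, e, r in rows if d == day and e == exposure]
    x = [d for d, _ in hits]
    y = [r for _, r in hits]
    return x, y

x, y = select_frame(records, day=1, exposure=1)
```

Each animation frame performs this selection once per exposure group, so a frame only plots the points recorded for that time point.
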

SLIDE 11

Animations with Python

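# initialization function
def init():
    artists = []
    for j in range(2):
        x, y = get_data(day=0, exposure=j)
        artists.append(ax.scatter(x, y, c=colorset[j], s=10))
    # return every artist drawn: with blit=True only returned artists are redrawn
    return artists
# animation function
def animate(i):
    artists = []
    for j in range(2):
        x, y = get_data(day=i, exposure=j)
        artists.append(ax.scatter(x, y, c=colorset[j], s=10))
    return artists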

Create initialization and animation functions

SLIDE 12

Animations with Python

Writer = animation.writers['ffmpeg']
writer = Writer(fps=5, metadata=dict(artist='Example'), bitrate=1800)
ani = animation.FuncAnimation(fig, animate, init_func=init, frames=25, interval=5000, blit=True, repeat=True)
ani.save("Animation.mp4", writer=writer)
end

Run the animation and save the file (‘ffmpeg’ required)
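
The length of the saved clip follows from the arguments above. A small arithmetic check, assuming Matplotlib's documented behaviour: a writer created with an explicit fps sets the frame rate of the saved file, while interval only controls the delay between frames when the animation is shown on screen.

```python
frames = 25          # frames=25 in FuncAnimation
fps = 5              # fps=5 in the ffmpeg writer
interval_ms = 5000   # interval=5000 in FuncAnimation

# Length of Animation.mp4: 25 frames played at 5 frames per second.
saved_duration_s = frames / fps

# Length of one pass if the animation were displayed in a window instead.
onscreen_duration_s = frames * interval_ms / 1000
```

Under these assumptions the saved file lasts 5 seconds, whereas one on-screen pass would take 125 seconds.
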

SLIDE 13

Animations with Python

SLIDE 14

Conclusions

  • We have shown how Stata can be integrated with relational databases and Python
  • Design, implementation, analysis and visualization can be simplified by taking the best of each software package
  • The new Python integration in Stata 16 works efficiently and provides a solid base for expanding Stata's capabilities
  • This integration can provide solutions to increasingly complex research questions