Exploratory data analysis



  1. Exploratory data analysis
     Predicting CTR with Machine Learning in Python
     Kevin Huo, Instructor

  2. A closer look at features
     int: an integer (1, 2, etc.)
     float: decimals (3.02, 4.56, etc.)
     object: string ("hello", "world", etc.)
     datetime: datetime (2018-01-01, etc.)

         print(df.columns)
             ['id', 'click', 'hour', 'C1', ... ]

         print(df.dtypes)
             id       object
             click     int64
             ...

         df.select_dtypes(include=['int', 'float'])
             click     int64
             ...
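
The type checks on this slide can be tried on a small frame. The sketch below is only an illustration: the three-row sample and its values are made up, and `include=["number"]` is used as a hedged alternative to listing `'int'` and `'float'` separately, since it selects every integer and float column regardless of bit width.

```python
import pandas as pd

# Hypothetical three-row sample; the real dataset has 24 columns.
df = pd.DataFrame({
    "id": ["a1", "b2", "c3"],                  # string identifier
    "click": [0, 1, 0],                        # integer target
    "hour": [14102100, 14102101, 14102102],    # integer, yymmddHH
    "C1": [1005.0, 1002.0, 1005.0],            # float
})

# One dtype per column.
print(df.dtypes)

# Keep only the numeric columns (all integer and float widths).
numeric = df.select_dtypes(include=["number"])
print(numeric.columns.tolist())
```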

  3. Missing data
         df.info()
             Data columns (total 24 columns):
             id    50000 non-null object
             ...

         df['id'].isnull()
             [False, False, False, False, ... ]

         df.isnull().sum(axis = 0)
             id    0
             ...
             dtype: object

         df.isnull().sum(axis = 0).sum()
             0
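
To see what these calls return when values really are missing, here is a minimal sketch on a made-up frame. The column `C1` and its gaps are assumptions for illustration; the slide's dataset reports zero missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing values in one column.
df = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "click": [0, 1, 0, 1],
    "C1": [1005.0, np.nan, 1005.0, np.nan],
})

# Missing values per column (axis=0 sums down the rows).
missing_per_column = df.isnull().sum(axis=0)

# Grand total across the whole frame.
total_missing = int(missing_per_column.sum())

# Share of missing values per column, often easier to read than raw counts.
missing_share = df.isnull().mean()

print(missing_per_column)
print(total_missing)
print(missing_share)
```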

  4. Looking at distributions
         df.groupby(['search_engine_type', 'click']).size()
             search_engine_type  click
             1002                0        940
                                 1        240
             ...

         df.groupby(['search_engine_type', 'click']).size().unstack()
             click               0    1
             search_engine_type
             1002                940  240
             ...

  5. Breakdown by CTR
         df.reset_index()
             click  search_engine_type    0    1
                    1002                940  240

         df.rename(columns = {0: 'non_clicks', 1: 'clicks'}, inplace = True)
             click  search_engine_type  non_clicks  clicks
                    1002                       940     240
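
The grouping, unstacking and renaming steps can be chained into one pipeline. The sketch below runs on made-up rows, and the final `ctr` column is an assumption about what the breakdown is for: the slides stop at the renamed counts. With the slide's own figures for type 1002, the same formula gives 240 / (940 + 240), roughly 0.20.

```python
import pandas as pd

# Hypothetical sample: four impressions for type 1002, two for type 1005.
df = pd.DataFrame({
    "search_engine_type": [1002, 1002, 1002, 1002, 1005, 1005],
    "click":              [0,    0,    0,    1,    0,    1],
})

counts = (
    df.groupby(["search_engine_type", "click"])
      .size()                                   # rows per (type, click) pair
      .unstack(fill_value=0)                    # click values become columns 0 and 1
      .rename(columns={0: "non_clicks", 1: "clicks"})
      .reset_index()                            # search_engine_type back to a column
)
counts.columns.name = None                      # drop the leftover "click" axis label

# Assumed follow-up step: click-through rate per search engine type.
counts["ctr"] = counts["clicks"] / (counts["clicks"] + counts["non_clicks"])
print(counts)
```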

  6. Let's practice!

  7. Feature engineering

  8. Dealing with dates
         print(df.hour.head(1))
             14102101

         df['hour'] = pd.to_datetime(df['hour'], format = '%y%m%d%H')
         df['hour_of_day'] = df['hour'].dt.hour
         print(df.hour.head(1))
             2014-10-21 01:00:00

         print(df.groupby('hour_of_day')['click'].sum())
                          click
             hour_of_day
             1             1092
             2             6546
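
A runnable version of the date handling, assuming the `hour` column stores integers in `yymmddHH` form as the sample value 14102101 suggests. The four rows are made up, and the cast to `str` is a defensive choice in this sketch rather than something the slide does.

```python
import pandas as pd

# Hypothetical sample: two impressions at 01:00 and two at 02:00.
df = pd.DataFrame({
    "hour": [14102101, 14102101, 14102102, 14102102],
    "click": [1, 0, 1, 1],
})

# Cast to str so the yymmddHH pattern is matched against text.
df["hour"] = pd.to_datetime(df["hour"].astype(str), format="%y%m%d%H")

# Extract the hour of day as a separate feature.
df["hour_of_day"] = df["hour"].dt.hour

# Total clicks per hour of day.
clicks_by_hour = df.groupby("hour_of_day")["click"].sum()
print(df["hour"].head(1))
print(clicks_by_hour)
```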

  9. Converting categorical variables via hashing
     Categorical features must be converted into a numerical format.
     Hash function: maps arbitrary input to an integer output, returning the exact same output for a given input.
     Lambda function: lambda x: f(x)
     Apply the hash function via f(x) = hash(x) as follows:
         df['site_id'] = df['site_id'].apply(lambda x: hash(x))
             83a0ad1a -> -9161053084583616050
             85f751fd -> 818242008494177460
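
One caveat when reproducing this: Python's built-in `hash()` salts string hashes per process unless `PYTHONHASHSEED` is fixed, so the integers can differ between runs. The sketch below swaps in a different technique, an MD5 digest reduced to a bucket index, to get values that are stable across runs. The helper `stable_hash` and the bucket count are hypothetical choices, not part of the slides.

```python
import hashlib

import pandas as pd


def stable_hash(value: str, n_buckets: int = 2**20) -> int:
    """Map a string to an integer in [0, n_buckets), identically on every run."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# Hypothetical sample reusing the two site_id values shown on the slide.
df = pd.DataFrame({"site_id": ["83a0ad1a", "85f751fd", "83a0ad1a"]})

# Series.apply calls the function once per element.
df["site_id_hashed"] = df["site_id"].apply(stable_hash)
print(df)
```

Reducing to a fixed number of buckets keeps the integers small at the cost of occasional collisions between distinct categories.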

  10. A closer look at features
      Examples of count() and nunique():
          df['ad_type'].count()
              50000

          df['ad_type'].nunique()
              31

  11. Creating features
      Most of the variables are categorical.
      Adding informative features can improve predictive power.
      Example of new features: impressions by device_id (user) and by search_engine_type:
          df['device_id_count'] = df.groupby('device_id')['click'].transform("count")
          df['search_engine_type_count'] = df.groupby('search_engine_type')['click'].transform("count")
          print(df.head(1))
              ... device_id_count  search_engine_type_count
              ...           40862                     47710
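
The count features rely on `transform("count")`, which returns one value per original row rather than one per group. A minimal sketch on made-up rows shows the effect. The column names follow the slide; the data is hypothetical.

```python
import pandas as pd

# Hypothetical sample: device d1 appears three times, d2 and d3 once each.
df = pd.DataFrame({
    "device_id": ["d1", "d1", "d2", "d1", "d3"],
    "search_engine_type": [1002, 1002, 1005, 1005, 1002],
    "click": [0, 1, 0, 0, 1],
})

# count() is the number of non-null rows, nunique() the number of distinct values.
n_rows = df["device_id"].count()
n_devices = df["device_id"].nunique()

# transform("count") broadcasts each group's size back onto its rows,
# so the result lines up with the original frame.
df["device_id_count"] = df.groupby("device_id")["click"].transform("count")
df["search_engine_type_count"] = (
    df.groupby("search_engine_type")["click"].transform("count")
)
print(n_rows, n_devices)
print(df)
```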

  12. Let's practice!

  13. Standardizing features

  14. Why standardization is important
      Standardization: ensuring your data fits the assumptions that models have.
      Certain features may have too high a variance, which might unfairly dominate models.
      Example: certain counts have too large a range of values due to one spam user.
      Does not apply to categorical variables such as site_id, app_id, device_id, etc.

  15. Log normalization
          df.var()
              click    1.294270e-01
              hour     1.123316e-01

          df.var().median()
              15.628476003312514

          print(df['device_id_count'].var())
              249362570.10134825

          df['device_id_count'] = df['device_id_count'].apply(lambda x: np.log(x))
          print(df['device_id_count'].var())
              0.7108583771671939
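
The effect of the log transform on a count column dominated by one heavy user can be reproduced on a toy example. The six values below are made up, with 40000 standing in for the spam user. The sketch calls `np.log` on the whole column, which is equivalent to applying it element by element.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: five ordinary users and one extreme outlier.
df = pd.DataFrame({"device_id_count": [1, 2, 3, 5, 8, 40000]})

# Variance before the transform is driven almost entirely by the outlier.
var_before = df["device_id_count"].var()

# Natural log compresses the large value far more than the small ones.
# Counts here are >= 1; np.log1p is a common choice when zeros can occur.
df["device_id_count"] = np.log(df["device_id_count"])

var_after = df["device_id_count"].var()
print(var_before, var_after)
```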

  16. Scaling data
      Standard scaling converts all features to have a mean of 0 and a standard deviation of 1.
      Generally a good practice for machine learning models.

  17. How to standard scale data
      Scaling can be done using StandardScaler() as follows:
          from sklearn.preprocessing import StandardScaler
          scaler = StandardScaler()
          X[numeric_cols] = scaler.fit_transform(X[numeric_cols])
              dtype: float64
              1    10.5 -> 0.85
              2    32.3 -> 1.54
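
A self-contained sketch of the scaling step, checking that the scaled columns end up with mean 0 and standard deviation 1. The frame `X`, its values and the choice of `numeric_cols` are assumptions for illustration. The hashed `site_id` column is left out of `numeric_cols` because standardization does not apply to categorical variables.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature frame: two count features and one hashed categorical.
X = pd.DataFrame({
    "device_id_count": [1.0, 2.0, 3.0, 10.0],
    "search_engine_type_count": [100.0, 200.0, 300.0, 400.0],
    "site_id": [11, 22, 33, 44],   # categorical stand-in, left unscaled
})
numeric_cols = ["device_id_count", "search_engine_type_count"]

# fit_transform learns each column's mean and standard deviation,
# then rescales the column to mean 0 and standard deviation 1.
scaler = StandardScaler()
X[numeric_cols] = scaler.fit_transform(X[numeric_cols])

# StandardScaler uses the population standard deviation, hence ddof=0.
col_means = X[numeric_cols].mean()
col_stds = X[numeric_cols].std(ddof=0)
print(col_means.round(6))
print(col_stds.round(6))
```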

  18. Let's practice!
