This repository was archived by the owner on Jun 22, 2022. It is now read-only.

Lessons Learned

Jakub edited this page May 19, 2018 · 3 revisions

Data Cleaning

Μαριος Μιχαηλιδης KazAnova (4th place)

Consider sorting rows whose click times tie within the same (ip, app, device, os) group so that the row with is_attributed == 1 always comes last, as was pointed out in the forums. (The test data was sorted by click time, then by target value.)
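A minimal sketch of this tie ordering in pandas, assuming the competition column names click_time and is_attributed. The function name and the toy rows are made up for illustration.

```python
import pandas as pd


def sort_ties_target_last(df: pd.DataFrame) -> pd.DataFrame:
    """Order clicks by time; among equal timestamps put is_attributed == 1 last."""
    # Sorting on (click_time, is_attributed) orders every ip/app/device/os
    # group the same way, because 0 sorts before 1 inside each tie.
    return df.sort_values(["click_time", "is_attributed"]).reset_index(drop=True)


# Toy data (made up for illustration).
clicks = pd.DataFrame({
    "ip": [1, 1, 1, 2],
    "app": [3, 3, 3, 3],
    "device": [1, 1, 1, 1],
    "os": [13, 13, 13, 13],
    "click_time": pd.to_datetime([
        "2017-11-07 09:30:38", "2017-11-07 09:30:38",
        "2017-11-07 09:30:37", "2017-11-07 09:30:38",
    ]),
    "is_attributed": [1, 0, 0, 0],
})
sorted_clicks = sort_ties_target_last(clicks)
```

Any lag or lead feature computed afterwards then sees the attributed click as the last one within its second.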

Performance

Our own insights:

Pandas's groupby turned out to be a processing bottleneck when creating time delta features. We addressed this by removing groupby and sorting (alas, only after the submission deadline).
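The page does not say what replaced groupby. One possible groupby-free approach is sketched below: it does a single explicit sort and then compares neighbouring rows with NumPy. This is an assumption about how such a rewrite could look, not our actual code, and the function name is hypothetical.

```python
import numpy as np
import pandas as pd


def next_click_delta(df, keys, time_col="click_time"):
    """Seconds until the next click in the same key group (NaN for the last one)."""
    if len(df) == 0:
        return pd.Series(dtype=float)
    # One sort puts each group's clicks next to each other in time order.
    order = df.sort_values(keys + [time_col]).index
    s = df.loc[order]
    t = s[time_col].to_numpy().astype("datetime64[s]").astype("int64")
    # A row and its successor belong to the same group if all keys match.
    same_group = np.ones(len(s) - 1, dtype=bool)
    for k in keys:
        col = s[k].to_numpy()
        same_group &= col[1:] == col[:-1]
    delta = np.full(len(s), np.nan)
    delta[:-1] = np.where(same_group, t[1:] - t[:-1], np.nan)
    # Return the result in the original row order (assumes a unique index).
    return pd.Series(delta, index=order).reindex(df.index)


# Toy data (made up for illustration).
clicks = pd.DataFrame({
    "ip": [1, 2, 1, 1],
    "app": [3, 3, 3, 3],
    "click_time": pd.to_datetime([
        "2017-11-07 10:00:05", "2017-11-07 10:00:00",
        "2017-11-07 10:00:00", "2017-11-07 10:00:09",
    ]),
})
clicks["next_click"] = next_click_delta(clicks, ["ip", "app"])
```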

Tkm2261 (11th place private LB)

Tkm2261 used BigQuery for almost all of the feature engineering. According to his notebook, running the BigQuery script on a GCP machine with 16 CPU cores and 100 GB of memory took about 20 minutes.

Feature engineering

Our own insights:

Base features (channel, os, hour, app, ip) were not included in our model; they are worth adding.

bestfitting (1st place public LB, 3rd place private LB [0.9840841])

Bestfitting used only 23 features. They are listed below together with the number reported for each (presumably its feature importance):

  • channel 1011
  • os 544
  • hour 472
  • app 468
  • ip_app_os_device_day_click_time_next_1 20
  • app_channel_os_mean_is_attributed 189
  • ip_app_mean_is_attributed 124
  • ip_app_os_device_day_click_time_next_2 120
  • ip_os_device_count_click_id 113
  • ip_var_hour 94
  • ip_day_hour_count_click_id 91
  • ip_mean_is_attributed 74
  • ip_count_click_id 73
  • ip_app_os_device_day_click_time_lag1 67
  • app_mean_is_attributed 67
  • ip_nunique_os_device 65
  • ip_nunique_app 63
  • ip_nunique_os 51
  • ip_nunique_app_channel 49
  • ip_os_device_mean_is_attributed 46
  • device 41
  • app_channel_os_count_click_id 37
  • ip_hour_mean_is_attributed 21

Kxx (0.968 private LB, top 33% public LB)

It might be worth including features that encode how frequently each value occurs for:

  • ip
  • app
  • channel
  • device
  • os
  • [ip,app]
  • [ip,device]
  • [ip,os]
  • [ip,channel]
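A frequency feature of this kind can be sketched in pandas as follows. The "_freq" column suffix and the function name are hypothetical choices, not taken from Kxx's solution.

```python
import pandas as pd


def add_frequency_features(df, groupings):
    """Add one column per grouping holding how often that value combination occurs."""
    out = df.copy()
    for cols in groupings:
        name = "_".join(cols) + "_freq"
        # transform("size") broadcasts the group size back to every row.
        out[name] = out.groupby(cols)[cols[0]].transform("size")
    return out


# Toy data (made up for illustration).
clicks = pd.DataFrame({"ip": [1, 1, 2], "app": [3, 4, 3]})
with_freq = add_frequency_features(clicks, [["ip"], ["app"], ["ip", "app"]])
```

Counting over train and test together, or over train only, is a separate design decision that the source does not specify.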

Tkm2261 (11th place private LB)

Tkm2261 used time delta features with multiple lag values (-2, -1, 1, 2), computed over various partitions:

  • [ip, day, app]
  • [ip, day, os]
  • [ip, day, channel]
  • [ip, app]
  • [ip, os]
  • [ip, channel]
  • [ip, app, os]
  • [ip, app, os, device]
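Tkm2261 built these features in BigQuery. The pandas sketch below only illustrates the idea of lagged time deltas per partition. The column naming, the sign convention (positive lags look back, negative lags look ahead) and the derived day column are assumptions.

```python
import pandas as pd


def add_time_deltas(df, partitions, lags=(-2, -1, 1, 2), time_col="click_time"):
    """Add click_time minus the click_time `lag` rows earlier in the same partition."""
    out = df.sort_values(time_col).copy()
    times = out[time_col]
    for cols in partitions:
        grouped = times.groupby([out[c] for c in cols])
        for lag in lags:
            name = f"{'_'.join(cols)}_delta_lag{lag}"
            # shift(1) is the previous click, shift(-1) the next one, so
            # lookahead deltas come out negative under this convention.
            out[name] = (times - grouped.shift(lag)).dt.total_seconds()
    return out.sort_index()


# Toy data (made up for illustration).
clicks = pd.DataFrame({
    "ip": [1, 1, 1, 2],
    "app": [3, 3, 3, 3],
    "click_time": pd.to_datetime([
        "2017-11-07 10:00:00", "2017-11-07 10:00:03",
        "2017-11-07 10:00:10", "2017-11-07 10:00:05",
    ]),
})
clicks["day"] = clicks["click_time"].dt.day
deltas = add_time_deltas(clicks, [["ip", "day", "app"], ["ip", "app"]], lags=(-1, 1))
```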

Darragh (9th place private LB)

This team incorporated split seconds into lead times: “this was done by turning the ordering per second, and count of clicks per second, into sub-second time - if there are 100 clicks in a second 16:00:00, first click gets 16:00:00.00, second gets 16:00:00.01, next 16:00:00.02 ... etc. This gave us good lift, around 0.001 early in the competition, but we did not test the lift later to see the drop by leaving it out.”
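The quoted idea can be sketched as follows, assuming the rows are already in their true click order within each second. The function and output column names are hypothetical.

```python
import pandas as pd


def add_subsecond_time(df, time_col="click_time"):
    """Spread the clicks that share a second evenly across that second."""
    out = df.copy()
    # Position of each click within its second (0, 1, 2, ...).
    rank = out.groupby(time_col).cumcount()
    # Number of clicks that share the same second.
    count = out.groupby(time_col)[time_col].transform("size")
    out["click_time_subsec"] = out[time_col] + pd.to_timedelta(rank / count, unit="s")
    return out


# Toy data (made up for illustration): four clicks in one second, one in the next.
clicks = pd.DataFrame({
    "click_time": pd.to_datetime(
        ["2017-11-07 16:00:00"] * 4 + ["2017-11-07 16:00:01"]
    ),
})
subsec = add_subsecond_time(clicks)
offsets = (subsec["click_time_subsec"] - subsec["click_time"]).dt.total_seconds()
```

Lead times computed on click_time_subsec then differ between clicks that share the same second.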

Entropy:

The team also used entropy features, particularly entropy over time-based features such as minute. For example, entropy over minutes measures whether clicks are spread evenly across minutes or arrive in bursts at certain minute marks.
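A minimal sketch of an entropy-over-minutes feature, assuming it is computed per ip. The grouping key and function names are assumptions, since the source does not state the grouping.

```python
import numpy as np
import pandas as pd


def shannon_entropy(values: pd.Series) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    p = values.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())


def minute_entropy_by(df, key="ip", time_col="click_time"):
    """Entropy of the minute-of-hour distribution for each value of `key`."""
    minutes = df[time_col].dt.minute
    return minutes.groupby(df[key]).apply(shannon_entropy)


# Toy data (made up): ip 1 clicks in four different minutes, ip 2 in a single one.
clicks = pd.DataFrame({
    "ip": [1, 1, 1, 1, 2, 2, 2, 2],
    "click_time": pd.to_datetime([
        "2017-11-07 10:00:00", "2017-11-07 10:01:00",
        "2017-11-07 10:02:00", "2017-11-07 10:03:00",
        "2017-11-07 10:00:01", "2017-11-07 10:00:02",
        "2017-11-07 10:00:03", "2017-11-07 10:00:04",
    ]),
})
entropy = minute_entropy_by(clicks)
clicks["ip_minute_entropy"] = clicks["ip"].map(entropy)
```

A high value means an even spread over minutes, while a value near zero means the clicks are concentrated in a few minute marks.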

CPMP (6th place private LB)

CPMP incorporated a few unique features in his solution:

  • “China days”: 24-hour periods that start at 4 pm. These were used for lag features based on previous day(s) data.
  • User id: (ip, device, os) triplets.
  • Delta with previous app.
  • Previous count by some grouping, and previous target mean by some grouping. The latter was a weighted average with the overall target mean, with weights chosen so that groups with few rows get a value closer to the overall average. This is a standard normalization in target encoding.
  • Ratio of the number of clicks per (ip, app) to the number of clicks per app.
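Two of these ideas are sketched below under stated assumptions. The 8-hour shift assumes timestamps are in UTC, so that a period starting at 4 pm lines up with a calendar day after shifting. The smoothing formula (sum + w * overall mean) / (count + w) is one common form of the weighted average described; the exact weights CPMP used are not given here, and prior_weight is a hypothetical parameter.

```python
import pandas as pd


def china_day(click_time: pd.Series) -> pd.Series:
    """Label each click with the 24-hour period (starting at 16:00) it falls in."""
    # Shifting by 8 hours turns a 16:00 boundary into a midnight boundary.
    return (click_time + pd.Timedelta(hours=8)).dt.floor("D")


def smoothed_target_mean(df, cols, target="is_attributed", prior_weight=10.0):
    """Per-group target mean, shrunk toward the overall mean for small groups."""
    overall_mean = df[target].mean()
    stats = df.groupby(cols)[target].agg(["sum", "count"])
    return (stats["sum"] + prior_weight * overall_mean) / (stats["count"] + prior_weight)


# Toy data (made up for illustration).
clicks = pd.DataFrame({
    "app": [1, 1, 1, 1, 2],
    "is_attributed": [1, 1, 0, 0, 1],
    "click_time": pd.to_datetime([
        "2017-11-07 15:59:59", "2017-11-07 16:00:00", "2017-11-07 18:00:00",
        "2017-11-08 03:00:00", "2017-11-08 15:00:00",
    ]),
})
clicks["china_day"] = china_day(clicks["click_time"])
encoded = smoothed_target_mean(clicks, ["app"], prior_weight=1.0)
```

Computing the encoding on the same rows it is applied to leaks the target. The description above avoids this by using data from previous days, so in practice the statistics would come from earlier china_day values only.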

Model

bestfitting (1st place public LB, 3rd place private LB)

bestfitting achieved an impressive score of 0.9817 (public LB) using a single LGBM model on 23 features. His best model (0.9835), however, was a weighted-average ensemble of NN and LGBM models. The GRU (RNN) architecture used by bestfitting can be found in the link above.
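A weighted average of model predictions can be sketched as below. The weights and prediction values are made up, and the source does not state which weights bestfitting used.

```python
import numpy as np


def weighted_average(predictions, weights):
    """Blend per-model prediction vectors with the given (non-negative) weights."""
    preds = np.asarray(predictions, dtype=float)  # shape: (n_models, n_rows)
    w = np.asarray(weights, dtype=float)          # shape: (n_models,)
    return (w[:, None] * preds).sum(axis=0) / w.sum()


# Hypothetical predictions from two models for two rows.
lgbm_pred = [0.2, 0.8]
nn_pred = [0.4, 0.6]
blend = weighted_average([lgbm_pred, nn_pred], weights=[3, 1])
```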

Tkm2261 (11th place private LB)

Tkm2261 used LightGBM and ensembled models trained with different seeds. The single best model scored 0.9823 on the public LB. The following parameters were used:

{'colsample_bytree': 0.6, 
'learning_rate': 0.1, 
'max_bin': 1023, 
'max_depth': -1, 
'metric': ['binary_logloss', 'auc'], 
'min_child_weight': 30, 
'min_split_gain': 0.0001, 
'num_leaves': 127, 
'objective': 'binary', 
'reg_alpha': 0, 
'scale_pos_weight': 1, 
'seed': 1142, 
'subsample': 0.9, 
'subsample_freq': 1, 
'verbose': -1}

Note: Class imbalance is handled by min_child_weight and num_leaves. Since AUC is the metric, the scale of the predictions is not important, hence 'scale_pos_weight': 1.
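The remark about scale can be illustrated with a small pairwise AUC computation: any strictly increasing transformation of the scores leaves AUC unchanged, because only the ranking matters. The sketch shows this property only. It does not show that training with a different scale_pos_weight yields the same model. The data and the rescaling function are made up.

```python
import numpy as np


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def rescale_odds(p, factor):
    """Multiply the odds of each probability by `factor` (a monotone transform)."""
    p = np.asarray(p, dtype=float)
    return p * factor / (p * factor + (1.0 - p))


# Toy labels and predicted probabilities (made up for illustration).
y = [0, 0, 1, 1, 0]
p = [0.1, 0.4, 0.35, 0.8, 0.2]
auc_original = auc(y, p)
auc_rescaled = auc(y, rescale_odds(p, factor=10.0))
```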