Lessons Learned

Data Cleaning

Μαριος Μιχαηλιδης KazAnova (4th place)

Consider sorting rows that share the same click time within each (ip, app, device, os) group so that the row with is_attributed == 1 always comes last, as was pointed out in the forums. (The test data was sorted by click time and then by target value.)
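A minimal sketch of this tie-breaking in pandas, using a toy frame with the competition's column names. Sorting by the group keys first is an assumption made here so that each group's clicks end up contiguous; the forum advice only specifies the order within ties.

```python
import pandas as pd

# Toy click log; the values are made up for illustration.
clicks = pd.DataFrame({
    "ip": [1, 1, 1, 2],
    "app": [3, 3, 3, 3],
    "device": [1, 1, 1, 1],
    "os": [13, 13, 13, 13],
    "click_time": pd.to_datetime([
        "2017-11-07 09:30:38",
        "2017-11-07 09:30:38",
        "2017-11-07 09:30:37",
        "2017-11-07 09:30:38",
    ]),
    "is_attributed": [1, 0, 0, 0],
})

group_cols = ["ip", "app", "device", "os"]

# Using is_attributed as the last sort key breaks click_time ties
# so that 0 comes before 1 inside each group.
clicks_sorted = clicks.sort_values(
    group_cols + ["click_time", "is_attributed"]
).reset_index(drop=True)
```

After sorting, the attributed click of ip 1 at 09:30:38 follows the non-attributed click with the same timestamp.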

Performance

Our own insights:

Pandas's groupby turned out to be a processing bottleneck when creating time delta features. We resolved this by removing groupby and sorting the data instead (alas, only after the submission deadline).
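One way such a groupby-free time delta could look is sketched below: sort by the group keys and time, take differences between neighbouring rows, and blank out differences that cross a group boundary. The function name next_click_delta and the toy data are hypothetical; this is not the exact code we used.

```python
import numpy as np
import pandas as pd

def next_click_delta(df, group_cols, time_col="click_time"):
    """Seconds until the next click in the same group, without groupby."""
    # After sorting, the rows of each group are contiguous and time-ordered.
    ordered = df.sort_values(group_cols + [time_col], kind="mergesort")
    seconds = (
        ordered[time_col].to_numpy().astype("datetime64[s]").astype("int64")
    )
    delta = np.full(len(ordered), np.nan)
    if len(ordered) > 1:
        keys = ordered[group_cols].to_numpy()
        # True where a row and its successor belong to the same group.
        same_group = (keys[1:] == keys[:-1]).all(axis=1)
        delta[:-1] = np.where(same_group, seconds[1:] - seconds[:-1], np.nan)
    # Return the result aligned with the original row order.
    return pd.Series(delta, index=ordered.index).reindex(df.index)

demo = pd.DataFrame({
    "ip": [1, 1, 2, 1],
    "click_time": pd.to_datetime([
        "2017-11-07 00:00:00",
        "2017-11-07 00:00:05",
        "2017-11-07 00:00:07",
        "2017-11-07 00:00:12",
    ]),
})
demo["next_click"] = next_click_delta(demo, ["ip"])
```

The sketch assumes the frame has a unique index, since the result is realigned with reindex.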

Tkm2261 (11th place private LB)

Tkm2261 used BigQuery for almost all of the feature engineering. According to his notebook, running the BigQuery script on a GCP machine with 16 CPU cores and 100 GB of memory took about 20 minutes.

Feature engineering

Our own insights:

Base features (channel, os, hour, app, ip) were not included in the model - it’s worth adding them.

bestfitting (1st place public LB, 3rd place private LB [0.9840841])

Bestfitting used only 23 features, listed here with their accompanying scores (presumably feature importances):

channel 1011
os 544
hour 472
app 468
ip_app_os_device_day_click_time_next_1 20
app_channel_os_mean_is_attributed 189
ip_app_mean_is_attributed 124
ip_app_os_device_day_click_time_next_2 120
ip_os_device_count_click_id 113
ip_var_hour 94
ip_day_hour_count_click_id 91
ip_mean_is_attributed 74
ip_count_click_id 73
ip_app_os_device_day_click_time_lag1 67
app_mean_is_attributed 67
ip_nunique_os_device 65
ip_nunique_app 63
ip_nunique_os 51
ip_nunique_app_channel 49
ip_os_device_mean_is_attributed 46
device 41
app_channel_os_count_click_id 37
ip_hour_mean_is_attributed 21

Kxx (0.968 private LB, top 33% public LB)

It might be worth including frequency (count) features for the values of:

ip
app
channel
device
os
[ip, app]
[ip, device]
[ip, os]
[ip, channel]
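The frequency features listed above could be built roughly as follows. The helper add_count_feature and the "_count" naming are hypothetical, chosen only for this sketch.

```python
import pandas as pd

def add_count_feature(df, cols):
    """Add a column with how often each value (or combination) of cols occurs."""
    name = "_".join(cols) + "_count"
    # transform("size") broadcasts the group size back to every row.
    df[name] = df.groupby(cols)[cols[0]].transform("size")
    return df

clicks = pd.DataFrame({
    "ip": [1, 1, 2, 1],
    "app": [3, 3, 3, 4],
})
for cols in (["ip"], ["app"], ["ip", "app"]):
    clicks = add_count_feature(clicks, cols)
```

The same loop extends to the remaining columns and pairs in the list.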

Tkm2261 (11th place private LB)

Tkm2261 used time delta features with multiple lag values (-2,-1,1,2), with various partitions:

[ip, day, app]
[ip, day, os]
[ip, day, channel]
[ip, app]
[ip, os]
[ip, channel]
[ip, app, os]
[ip, app, os, device]
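A rough pandas equivalent of such lag features is sketched below (Tkm2261 built his in BigQuery, so this is not his code). It assumes that a positive lag refers to a previous click and a negative lag to a later one, and it stores the absolute gap in seconds; both conventions and the column naming are assumptions.

```python
import pandas as pd

def add_time_deltas(df, partition, lags=(-2, -1, 1, 2), time_col="click_time"):
    """Add the absolute gap in seconds to the lag-th neighbouring click."""
    out = df.sort_values(time_col, kind="mergesort").copy()
    times = out.groupby(partition)[time_col]
    prefix = "_".join(partition)
    for lag in lags:
        # Assumed convention: lag > 0 looks back, lag < 0 looks ahead.
        direction = "prev" if lag > 0 else "next"
        gap = (out[time_col] - times.shift(lag)).dt.total_seconds().abs()
        out[f"{prefix}_{direction}_{abs(lag)}"] = gap
    # Restore the original row order.
    return out.sort_index()

clicks = pd.DataFrame({
    "ip": [1, 1, 1],
    "app": [3, 3, 3],
    "click_time": pd.to_datetime([
        "2017-11-07 00:00:00",
        "2017-11-07 00:00:03",
        "2017-11-07 00:00:10",
    ]),
})
features = add_time_deltas(clicks, ["ip", "app"])
```

Calling the function once per partition in the list above yields the full feature set. Given the groupby bottleneck noted under Performance, a sort-based variant may be preferable on the full data.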

Darragh (9th place private LB)

This team incorporated split seconds into lead times: “this was done by turning the ordering per second, and count of clicks per second, into sub-second time - if there are 100 clicks in a second 16:00:00, first click gets 16:00:00.00, second gets 16:00:00.01, next 16:00:00.02 ... etc. This gave us good lift, around 0.001 early in the competition, but we did not test the lift later to see the drop by leaving it out.”
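The idea in the quote can be sketched in pandas as below. It assumes the row order within a second reflects the true click order; the column name click_time_subsec is hypothetical, and this is not the team's actual code.

```python
import pandas as pd

clicks = pd.DataFrame({
    "click_time": pd.to_datetime(
        ["2017-11-07 16:00:00"] * 4 + ["2017-11-07 16:00:01"]
    ),
})

per_second = clicks.groupby("click_time")["click_time"]
rank = per_second.cumcount()            # 0, 1, 2, ... within each second
count = per_second.transform("size")    # number of clicks in that second

# Spread the clicks of each second evenly over that second.
clicks["click_time_subsec"] = clicks["click_time"] + pd.to_timedelta(
    rank / count, unit="s"
)
fraction = (
    clicks["click_time_subsec"] - clicks["click_time"]
).dt.total_seconds()
```

With 100 clicks in a second this gives offsets of 0.00, 0.01, 0.02 and so on, matching the description; lead times computed from click_time_subsec then no longer collapse to zero for clicks within the same second.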

Entropy:

The team also used entropy features, particularly entropy over time-based features such as minute. For example, entropy over minutes measures whether clicks are spread evenly across minutes or arrive in bursts at certain minute marks.
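As an illustration, Shannon entropy over the minute of each click can be computed as in the sketch below. The choice of base 2 and the toy inputs are assumptions; the team's exact definition is not given.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

even_minutes = [0, 15, 30, 45]   # one click in each of four different minutes
bursty_minutes = [7, 7, 7, 7]    # every click lands in the same minute

even_entropy = entropy(even_minutes)
bursty_entropy = entropy(bursty_minutes)
```

An even spread gives high entropy and a burst gives entropy zero, so the feature separates the two click patterns. In practice it would be computed per group, for example per ip.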

CPMP (6th place private LB)

CPMP incorporates a few unique features in his solution:

“China days”: 24-hour periods that start at 4 pm. These were used for lag features based on previous day(s) data.
User id: (ip, device, os) triplets.
Delta with previous app.
Previous count by some grouping, and previous target mean by some grouping. The latter was a weighted average with the overall target mean, weighted so that groups with few rows get a value closer to the overall average. This is a standard normalization in target encoding.
Ratio of the number of clicks per (ip, app) to the number of clicks per app.
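The weighted target mean can be sketched as follows. The additive smoothing form and the prior_weight value are assumptions, since CPMP's exact weights are not stated; the function name is hypothetical.

```python
import pandas as pd

def smoothed_target_mean(df, group_cols, target="is_attributed", prior_weight=10.0):
    """Group target mean shrunk toward the overall mean for small groups."""
    prior = df[target].mean()
    stats = df.groupby(group_cols)[target].agg(["sum", "count"])
    # With few rows the prior dominates; with many rows the group mean does.
    return (stats["sum"] + prior_weight * prior) / (stats["count"] + prior_weight)

clicks = pd.DataFrame({
    "app": [1, 1, 1, 1, 2],
    "is_attributed": [1, 1, 0, 0, 1],
})
encoded = smoothed_target_mean(clicks, ["app"])
```

Here app 2 has a single attributed click, so its raw mean of 1.0 is pulled strongly toward the overall mean of 0.6. Note that CPMP computed these statistics on previous data only; computing them on the same rows that are later used for training, as this toy example does, would leak the target.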

Model

bestfitting (1st place public LB, 3rd place private LB)

Bestfitting achieved an impressive score of 0.9817 (public LB) using a single LGBM model on 23 features. His best model (0.9835), however, was an ensemble (weighted average) of NN and LGBM models. The GRU (RNN) architecture used by bestfitting can be found in the link above.

Tkm2261 (11th place private LB)

Tkm2261 used LightGBM and ensembled models trained with different seeds. The single best model scored 0.9823 on the public LB. The following parameters were used:

{'colsample_bytree': 0.6, 
'learning_rate': 0.1, 
'max_bin': 1023, 
'max_depth': -1, 
'metric': ['binary_logloss', 'auc'], 
'min_child_weight': 30, 
'min_split_gain': 0.0001, 
'num_leaves': 127, 
'objective': 'binary', 
'reg_alpha': 0, 
'scale_pos_weight': 1, 
'seed': 1142, 
'subsample': 0.9, 
'subsample_freq': 1, 
'verbose': -1}

Note: Class imbalance is handled by min_child_weight and num_leaves. Since AUC is the metric, the scale of the predictions is not important, hence “'scale_pos_weight': 1”.
