Otomoto Data Mining

The goal is to scrape car advertisements from Otomoto and build models that predict a car's price from the details in its listing.

Data Storage

Data is saved as Parquet files in the data/ directory, with each page of results stored as page_XXX.parquet for efficient storage and analysis.
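The layout described above could be read back roughly as follows. This is a minimal sketch, not the project's own loader: the helper names `page_path` and `load_all_pages` are hypothetical, and the three-digit zero padding of the page number is an assumption based on the `page_XXX.parquet` pattern. Reading Parquet with pandas additionally requires `pyarrow` or `fastparquet` to be installed.

```python
from pathlib import Path


def page_path(page: int, data_dir: str = "data") -> Path:
    """Build the path for one page of results.

    Assumption: XXX is the page number zero-padded to three digits.
    """
    return Path(data_dir) / f"page_{page:03d}.parquet"


def load_all_pages(data_dir: str = "data"):
    """Concatenate every page_*.parquet file in data_dir into one DataFrame."""
    # Imported lazily so the path helper works without pandas installed.
    import pandas as pd

    files = sorted(Path(data_dir).glob("page_*.parquet"))
    if not files:
        raise FileNotFoundError(f"no page_*.parquet files found in {data_dir!r}")
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
```

Sorting the file names keeps the pages in order as long as the padding width is consistent across files.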

Results

  • Number of price classes for each bin step:
    • 14 classes at a step of 20_000
    • 10 classes at a step of 30_000
    • 7 classes at a step of 40_000
  • min_price=20_000
  • max_price=300_000
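
The class counts above follow from dividing the price range by the step and rounding up. The sketch below illustrates that arithmetic; the function names are hypothetical, and clipping out-of-range prices into the first or last class is an assumption, since the source does not say how such prices are handled.

```python
import math

MIN_PRICE = 20_000
MAX_PRICE = 300_000


def num_classes(step: int, min_price: int = MIN_PRICE, max_price: int = MAX_PRICE) -> int:
    """Number of linear bins of width `step` needed to cover the price range."""
    return math.ceil((max_price - min_price) / step)


def linear_class(price: int, step: int, min_price: int = MIN_PRICE, max_price: int = MAX_PRICE) -> int:
    """Map a price to its linear bin index.

    Assumption: prices outside [min_price, max_price] are clipped into range.
    """
    clipped = min(max(price, min_price), max_price)
    index = (clipped - min_price) // step
    # max_price itself would land one past the end, so cap at the last class.
    return min(index, num_classes(step, min_price, max_price) - 1)
```

With the 30_000 step the range does not divide evenly (280_000 / 30_000 ≈ 9.33), which is why that setting yields 10 classes rather than 9.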

| Model | Linear bins step* | Accuracy |
|---|---|---|
| DecisionTree + TFIDF | 20_000 | 40% |
| DecisionTree + TFIDF | 30_000 | 46% |
| DecisionTree + TFIDF | 40_000 | 52% |
| LogisticRegression + TFIDF | 20_000 | 49% |
| LogisticRegression + TFIDF | 30_000 | 55% |
| LogisticRegression + TFIDF | 40_000 | 63% |
| RF + TFIDF (only descr) | 20_000 | 51% |
| RF + TFIDF (only descr) | 30_000 | 55% |
| RF + TFIDF (only descr) | 40_000 | 62% |
| RFECV (num only) | 20_000 | 52% |
| RFECV (num only) | 30_000 | 59% |
| RFECV (num only) | 40_000 | 71% |
| RF + RFECV + TFIDF (descr + num) | 20_000 | 53% |
| RF + RFECV + TFIDF (descr + num) | 30_000 | 59% |
| RF + RFECV + TFIDF (descr + num) | 40_000 | 68% |
| HerBERT fine-tuned | 20_000 | 56% |
| HerBERT (Near-miss=1) | 20_000 | 85% |

*The steps above refer to linear bins. The final models use a logarithmic split instead, with the number of bins taken from the corresponding linear split.
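
One way such a logarithmic split could look is sketched below. The source does not specify the exact scheme, so geometrically spaced edges between min_price and max_price are an assumption, and `log_bin_edges` and `price_to_class` are hypothetical names. The bin count is the one derived from the linear step (for example 14 for a step of 20_000).

```python
from bisect import bisect_right


def log_bin_edges(n_bins: int, min_price: float = 20_000, max_price: float = 300_000) -> list[float]:
    """Return n_bins + 1 edges spaced evenly on a logarithmic scale (assumed scheme)."""
    ratio = max_price / min_price
    return [min_price * ratio ** (i / n_bins) for i in range(n_bins + 1)]


def price_to_class(price: float, edges: list[float]) -> int:
    """Index of the bin containing `price`, clipped to the first or last bin."""
    index = bisect_right(edges, price) - 1
    return max(0, min(index, len(edges) - 2))


# Same number of classes as the linear split with a 20_000 step.
edges = log_bin_edges(14)
```

Compared with linear bins, this spacing gives narrower classes at the cheap end of the range and wider ones at the expensive end.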