SGGmilagro/Airbnb-NLP-Text-Analytics

Airbnb NLP & Text Analytics

Word Frequency · Sentiment Analysis · TF-IDF Framework

This project investigates how Airbnb hosts communicate value across market segments by combining pricing analytics with text mining, applied to a global dataset pulled directly from MongoDB Atlas.

Overview

The objective is to understand the role of language in shaping perceived value on the Airbnb platform. Rather than relying solely on pricing metrics, this analysis treats listing descriptions as structured signals that reveal host communication strategy.

The framework uncovers:

  • Dominant convenience language across all listing types
  • Sentiment patterns by room type
  • Segment-specific differentiation hidden beneath surface-level uniformity
  • Pricing structure driven by property features

This approach is applicable to platform economics, market intelligence, and any domain requiring unstructured text as a signal layer on top of structured data.

Data

Dataset: MongoDB Atlas sample_airbnb — listingsAndReviews collection
Connection: mongolite R package
Markets covered: Istanbul, Montreal, Barcelona, Hong Kong, Sydney, New York, Rio de Janeiro, Porto, Oahu, Maui

Fields extracted include listing descriptions, room type, pricing, bedroom count, market, and review scores.
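
As a rough illustration of the extraction step, the sketch below queries the collection through mongolite. It is not the project's actual code: the `fetch_listings` helper, the `MONGODB_URI` environment variable, and the exact field paths in the projection are assumptions made for the example.

```r
# Sketch: pull a subset of fields from the Atlas sample dataset with mongolite.
# Assumptions: the connection string lives in a hypothetical environment
# variable named MONGODB_URI, and the projected field paths are illustrative.
fetch_listings <- function(uri = Sys.getenv("MONGODB_URI")) {
  con <- mongolite::mongo(
    collection = "listingsAndReviews",
    db         = "sample_airbnb",
    url        = uri
  )
  # Close the connection when the function exits, even on error.
  on.exit(con$disconnect(), add = TRUE)

  # Projection: 1 keeps a field, and "_id": 0 drops the document id.
  con$find(
    query  = "{}",
    fields = paste0(
      '{"_id": 0, "name": 1, "description": 1, "room_type": 1, ',
      '"price": 1, "bedrooms": 1, "address.market": 1, ',
      '"review_scores.review_scores_rating": 1}'
    )
  )
}
```

Calling `fetch_listings()` with a valid connection string should return a data frame with one row per listing. The price column may need an explicit `as.numeric()` conversion before analysis.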

Methodology

Pricing Analysis

Structured pricing data was analyzed to establish baseline market structure before text analysis:

  • Right-skewed price distribution → median used over mean to handle outliers
  • Median price by room type: Entire home > Shared room > Private room
  • Price scales with bedroom count, peaking at 5 bedrooms (~$415)
  • Supply concentrated in urban and tourism-focused markets
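
The median-over-mean choice and the grouped summaries can be sketched with dplyr as follows. The data frame is toy data invented for the example, not figures from the project, and the column names are assumptions.

```r
library(dplyr)

# Toy data for illustration only; these are not figures from the project.
listings <- tibble(
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Private room", "Private room", "Shared room"),
  bedrooms  = c(1, 2, 5, 1, 1, 1),
  price     = c(120, 180, 2500, 45, 60, 30)
)

# A single extreme listing pulls the mean far above the median.
price_summary <- listings %>%
  summarise(mean_price = mean(price), median_price = median(price))

# Median price per room type, highest first.
median_by_room <- listings %>%
  group_by(room_type) %>%
  summarise(median_price = median(price), n = n(), .groups = "drop") %>%
  arrange(desc(median_price))

# Median price per bedroom count.
median_by_bedrooms <- listings %>%
  group_by(bedrooms) %>%
  summarise(median_price = median(price), .groups = "drop")
```

In a right-skewed distribution like this one, the median stays near typical listings while the mean is dominated by the outlier, which is why the median makes the more robust baseline.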

Text Pipeline (tidytext)

Listing text fields (name, summary, description, space, neighborhood_overview, notes, transit, access, interaction, house_rules) were concatenated and tokenized.

Stop words were removed using the tidytext stop_words lexicon. Generic structural terms (apartment, room, bed, bedroom, house) were then filtered manually to expose more meaningful signal words.

Word Frequency

Dominant terms across all listings: walk, kitchen, minutes, located, beach, city, restaurants, living, guests, station. This indicates a standardized, convenience-first communication strategy with low surface-level differentiation.
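
The concatenation, tokenization, filtering, and counting steps can be sketched with tidytext as below. This is a minimal example under assumptions: the listings are toy data, only three of the ten text fields are included, and the object names (`structural_terms`, `tokens`, `word_freq`) are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy listings; only three of the ten text fields are shown for brevity.
listings <- tibble(
  id          = 1:2,
  name        = c("Sunny apartment near the beach", "Cozy room by the station"),
  summary     = c("Short walk to restaurants", "Ten minutes to the city"),
  description = c("Full kitchen and a large living area", NA)
)

# Generic structural terms to filter out after stop word removal.
structural_terms <- c("apartment", "room", "bed", "bedroom", "house")

tokens <- listings %>%
  # Concatenate the text fields into one column, skipping missing values.
  unite("text", name, summary, description, sep = " ", na.rm = TRUE) %>%
  # One lowercase word per row.
  unnest_tokens(word, text) %>%
  # Drop common English stop words, then the structural terms.
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% structural_terms)

# Word frequency across all listings, most frequent first.
word_freq <- tokens %>% count(word, sort = TRUE)
```

Filtering the structural terms after the stop word join keeps the two lists separate, so the custom list can be adjusted without touching the standard lexicon.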

Sentiment Analysis (Bing Lexicon)

Sentiment is overwhelmingly positive across all listing types (roughly 6:1 positive to negative). Entire home listings show the highest positive language density, consistent with aspirational framing to justify premium pricing. Because positivity is ubiquitous, it is useful for trust-building but ineffective as a differentiator.
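
A positive-to-negative ratio per room type could be computed along these lines. The tokens are toy data and the `pos_neg_ratio` column name is an assumption; the sketch only shows the shape of the calculation, not the project's results.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy tokens, one word per row, as produced by unnest_tokens().
tokens <- tibble(
  room_type = c(rep("Entire home/apt", 4), rep("Private room", 4)),
  word      = c("beautiful", "comfortable", "clean", "noisy",
                "clean", "comfortable", "beautiful", "dirty")
)

sentiment_by_room <- tokens %>%
  # Keep only words found in the Bing lexicon, which labels each word
  # as "positive" or "negative".
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(room_type, sentiment) %>%
  # One row per room type with positive and negative counts side by side.
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(pos_neg_ratio = positive / negative)
```

Words missing from the lexicon are dropped by the join, so the ratio reflects only the sentiment-bearing vocabulary.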

TF-IDF by Room Type

Segment-specific language emerges beneath the surface uniformity:

  • Entire home → experiential framing: "condo", "resort", "beaches"
  • Private room → locational framing: "manhattan", "brooklyn"
  • Shared room → affordability framing: "hostelworld", "dormitory"
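
To show why TF-IDF surfaces segment-specific terms, the sketch below treats each room type as one document and applies tidytext's `bind_tf_idf()`, which multiplies a word's within-document frequency by ln(number of documents / number of documents containing the word). The counts are toy values invented for the example.

```r
library(dplyr)
library(tidytext)

# Toy word counts per room type.
word_counts <- tibble(
  room_type = c("Entire home/apt", "Entire home/apt",
                "Private room", "Private room",
                "Shared room", "Shared room"),
  word      = c("walk", "resort", "walk", "brooklyn", "walk", "dormitory"),
  n         = c(10, 4, 8, 3, 5, 2)
)

room_tf_idf <- word_counts %>%
  # Each room type is treated as one "document".
  bind_tf_idf(word, room_type, n) %>%
  arrange(desc(tf_idf))

# Highest-scoring term per room type.
top_terms <- room_tf_idf %>%
  group_by(room_type) %>%
  slice_max(tf_idf, n = 1, with_ties = FALSE) %>%
  ungroup()
```

A word such as "walk" that appears in every room type has an inverse document frequency of zero and drops out, even though it tops the raw frequency list. This is how the two views can disagree.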

Interactive Dashboard (R Shiny)

Two-panel interactive dashboard filterable by market and room type:

  • Market Dashboard: price distribution, median by room type, top markets, average price by bedrooms
  • Text Analytics: word frequency, sentiment analysis, TF-IDF by segment
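
A skeleton for such a dashboard might look like the following. It assumes the two panels are tabs sharing one pair of filters, uses toy data, and leaves the text analytics plot as a placeholder; the input and output ids are hypothetical.

```r
library(shiny)
library(dplyr)
library(ggplot2)

# Toy data standing in for the listings pulled from MongoDB.
listings <- tibble(
  market    = c("Porto", "Porto", "Sydney", "Sydney"),
  room_type = c("Entire home/apt", "Private room",
                "Entire home/apt", "Shared room"),
  price     = c(90, 40, 210, 35)
)

ui <- fluidPage(
  titlePanel("Airbnb listings"),
  sidebarLayout(
    sidebarPanel(
      selectInput("market", "Market",
                  choices = c("All", unique(listings$market))),
      selectInput("room_type", "Room type",
                  choices = c("All", unique(listings$room_type)))
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Market Dashboard", plotOutput("price_plot")),
        tabPanel("Text Analytics", plotOutput("text_plot"))
      )
    )
  )
)

server <- function(input, output, session) {
  # Both tabs read from the same filtered data set.
  filtered <- reactive({
    listings %>%
      filter(input$market == "All" | market == input$market,
             input$room_type == "All" | room_type == input$room_type)
  })

  output$price_plot <- renderPlot({
    filtered() %>%
      group_by(room_type) %>%
      summarise(median_price = median(price), .groups = "drop") %>%
      ggplot(aes(room_type, median_price)) +
      geom_col() +
      labs(x = "Room type", y = "Median price")
  })

  # Placeholder: word frequency, sentiment, and TF-IDF plots would go here.
  output$text_plot <- renderPlot({
    plot.new()
    title("Text analytics for the current filter")
  })
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

Routing both tabs through one reactive expression means the market and room type filters are applied once and stay consistent across panels.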

Technologies

R · tidytext · mongolite · ggplot2 · Shiny · MongoDB Atlas

Key Findings

  • Airbnb listing language is highly standardized at the surface level, dominated by convenience signals (walk, kitchen, minutes).
  • Sentiment is overwhelmingly positive across all segments — effective for trust, not for differentiation.
  • TF-IDF reveals that segment-specific language strategy exists beneath the surface: hosts implicitly tailor messaging to customer expectations by room type without explicitly signaling it.
  • Pricing is structurally driven by property features (room type, bedroom count) rather than listing language quality.

Author

Regina Garfias — Risk Analytics

About

Text mining pipeline on Airbnb listings using R — word frequency, sentiment analysis (Bing lexicon), and TF-IDF segmentation by room type. Interactive Shiny dashboard with MongoDB Atlas integration across 10 global markets
