Word Frequency · Sentiment Analysis · TF-IDF Framework
This project investigates how Airbnb hosts communicate value across market segments by combining pricing analytics with text mining, applied to a global dataset pulled directly from MongoDB Atlas.
The objective is to understand the role of language in shaping perceived value on the Airbnb platform. Rather than relying solely on pricing metrics, this analysis treats listing descriptions as structured signals that reveal host communication strategy.
The framework uncovers:
- Dominant convenience language across all listing types
- Sentiment patterns by room type
- Segment-specific differentiation hidden beneath surface-level uniformity
- Pricing structure driven by property features
This approach is applicable to platform economics, market intelligence, and any domain requiring unstructured text as a signal layer on top of structured data.
Dataset: MongoDB Atlas sample_airbnb — listingsAndReviews collection
Connection: mongolite R package
Markets covered: Istanbul, Montreal, Barcelona, Hong Kong, Sydney, New York, Rio de Janeiro, Porto, Oahu, Maui
Fields extracted include listing descriptions, room type, pricing, bedroom count, market, and review scores.
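A connection along these lines could pull those fields with mongolite. This is a minimal sketch, not the project's actual code: the helper name `fetch_listings`, the `MONGO_URI` environment variable, and the exact field paths in the projection are assumptions and should be checked against your copy of the collection.

```r
# Hypothetical helper: fetch selected fields from the sample_airbnb database.
# The connection string is read from an environment variable (assumed name).
fetch_listings <- function(uri = Sys.getenv("MONGO_URI")) {
  con <- mongolite::mongo(
    collection = "listingsAndReviews",
    db = "sample_airbnb",
    url = uri
  )
  # The projection limits the download to fields used in the analysis.
  # Field paths are assumed from the public sample dataset's schema.
  con$find(
    query = "{}",
    fields = '{"name": 1, "description": 1, "room_type": 1, "price": 1,
               "bedrooms": 1, "address.market": 1,
               "review_scores.review_scores_rating": 1, "_id": 0}'
  )
}
```

Calling `fetch_listings()` with a valid Atlas connection string would return a data frame; nested fields such as the market may come back as nested columns that need flattening before analysis.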
Structured pricing data was analyzed to establish baseline market structure before text analysis:
- Right-skewed price distribution → median used over mean to handle outliers
- Median price by room type: Entire home > Shared room > Private room
- Price scales with bedroom count, peaking at 5 bedrooms (~$415)
- Supply concentrated in urban and tourism-focused markets
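The choice of median over mean can be illustrated with a small base R sketch. The data frame below is made-up toy data, not values from the dataset, and the column names are assumptions.

```r
# Toy data for illustration only; these are not values from the dataset.
listings <- data.frame(
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Private room", "Private room", "Private room"),
  price = c(100, 150, 2000, 40, 50, 60)
)

# One extreme listing pulls the mean far above the typical price,
# while the median stays near the bulk of the distribution.
mean_price <- tapply(listings$price, listings$room_type, mean)
median_price <- tapply(listings$price, listings$room_type, median)

print(rbind(mean = mean_price, median = median_price))
```

In this toy example a single outlier is enough to separate the two statistics for one group, which is the behavior a right-skewed price distribution produces at scale.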
Listing text fields (name, summary, description, space, neighborhood_overview, notes, transit, access, interaction, house_rules) were concatenated and tokenized.
Stop words were removed using the tidytext stop_words lexicon. Generic structural terms (apartment, room, bed, bedroom, house) were then filtered manually to expose meaningful signal words.
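The preprocessing steps above could be sketched with tidytext as follows. The toy listings, the `listing_id` column, and the use of only three text fields are assumptions made for brevity; the project's actual pipeline may differ.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy listings for illustration; only a subset of the text fields is shown.
listings <- tibble(
  listing_id = c(1, 2),
  name = c("Sunny apartment", "Cozy room"),
  summary = c("A short walk to the beach", "Five minutes from the station"),
  description = c("Full kitchen and a bright living area",
                  "Shared kitchen near restaurants")
)

# Generic structural terms filtered manually, as listed in the text.
structural_terms <- tibble(
  word = c("apartment", "room", "bed", "bedroom", "house")
)

tokens <- listings %>%
  # Concatenate the text fields into one column, skipping missing values.
  unite("text", name, summary, description, sep = " ", na.rm = TRUE) %>%
  # Split into one lowercase token per row.
  unnest_tokens(word, text) %>%
  # Drop common stop words, then the structural terms.
  anti_join(stop_words, by = "word") %>%
  anti_join(structural_terms, by = "word")
```

Keeping the structural terms in their own small table makes the manual filter easy to audit and extend.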
Word Frequency
Dominant terms across all listings: walk, kitchen, minutes, located, beach, city, restaurants, living, guests, station. This indicates a standardized, convenience-first communication strategy with low surface-level differentiation.
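A term ranking of this kind can be produced with a single `count()` call. The sketch below uses a toy token table; in practice the input would be the cleaned tokens, and the column name `word` is an assumption.

```r
library(dplyr)

# Toy tokens for illustration; not drawn from the dataset.
tokens <- tibble(
  word = c("walk", "kitchen", "walk", "beach", "walk", "kitchen")
)

# Frequency per term, most common first.
word_freq <- tokens %>%
  count(word, sort = TRUE)

# Keep the ten most frequent terms for plotting or reporting.
top_terms <- head(word_freq, 10)
```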
Sentiment Analysis (Bing Lexicon)
Sentiment is overwhelmingly positive across all listing types, with a positive-to-negative ratio of roughly 6:1. Entire home listings show the highest positive language density, consistent with aspirational framing to justify premium pricing. Because positivity is ubiquitous, it is useful for trust-building but ineffective as a differentiator.
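One way to compute a positive-to-negative ratio by room type is to join tokens against the Bing lexicon bundled with tidytext. This is a hedged sketch on toy tokens; the column names and the ratio definition (positive word count divided by negative word count) are assumptions about how the figure above was derived.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy tokens for illustration; not drawn from the dataset.
tokens <- tibble(
  room_type = c(rep("Entire home/apt", 4), rep("Private room", 3)),
  word = c("beautiful", "clean", "comfortable", "noisy",
           "clean", "great", "dirty")
)

sentiment_by_type <- tokens %>%
  # Keep only words that appear in the Bing lexicon.
  inner_join(get_sentiments("bing"), by = "word") %>%
  # Count positive and negative words within each room type.
  count(room_type, sentiment) %>%
  # One row per room type, with positive and negative counts as columns.
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(pos_neg_ratio = positive / negative)
```

Words absent from the lexicon are dropped by the inner join, so the ratio reflects only lexicon-matched vocabulary.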
TF-IDF by Room Type
Segment-specific language emerges beneath the surface uniformity:
- Entire home → experiential framing: "condo", "resort", "beaches"
- Private room → locational framing: "manhattan", "brooklyn"
- Shared room → affordability framing: "hostelworld", "dormitory"
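The segment comparison above can be sketched by treating each room type as one document and scoring terms with tidytext's `bind_tf_idf()`. The toy tokens below are assumptions chosen to show the mechanism: a term used by every room type scores zero, while a term unique to one segment scores above zero.

```r
library(dplyr)
library(tidytext)

# Toy tokens for illustration; not drawn from the dataset.
tokens <- tibble(
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Private room", "Private room",
                "Shared room", "Shared room"),
  word = c("walk", "resort", "resort",
           "walk", "brooklyn",
           "walk", "dormitory")
)

tfidf_by_type <- tokens %>%
  # Term counts within each room type (the "document").
  count(room_type, word) %>%
  # Adds tf, idf, and tf_idf columns.
  bind_tf_idf(word, room_type, n) %>%
  arrange(desc(tf_idf))
```

This is why ubiquitous convenience words such as "walk" drop out of the TF-IDF ranking even though they dominate raw frequency counts.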
Two-panel interactive dashboard filterable by market and room type:
- Market Dashboard: price distribution, median by room type, top markets, average price by bedrooms
- Text Analytics: word frequency, sentiment analysis, TF-IDF by segment
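A two-tab layout with shared filters might be structured as in the skeleton below. This is only a sketch under stated assumptions: the input and output IDs, the toy `listings` data frame, and the placeholder plots are hypothetical and stand in for the project's actual charts.

```r
library(shiny)

# Toy data and filter choices for illustration; not drawn from the dataset.
listings <- data.frame(
  market = c("Istanbul", "Montreal", "Montreal"),
  room_type = c("Private room", "Entire home/apt", "Private room"),
  price = c(50, 120, 60)
)
markets <- c("All", unique(listings$market))
room_types <- c("All", unique(listings$room_type))

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      selectInput("market", "Market", choices = markets),
      selectInput("room_type", "Room type", choices = room_types)
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Market Dashboard", plotOutput("price_plot")),
        tabPanel("Text Analytics", plotOutput("text_plot"))
      )
    )
  )
)

server <- function(input, output, session) {
  # Apply both filters once so every panel shares the same subset.
  filtered <- reactive({
    d <- listings
    if (input$market != "All") d <- d[d$market == input$market, ]
    if (input$room_type != "All") d <- d[d$room_type == input$room_type, ]
    d
  })

  output$price_plot <- renderPlot({
    d <- filtered()
    validate(need(nrow(d) > 0, "No listings match the filters"))
    hist(d$price, main = "Price distribution", xlab = "Price")
  })

  # Placeholder for the text analytics charts.
  output$text_plot <- renderPlot(plot.new())
}

app <- shinyApp(ui, server)
```

Centralizing the filter in one reactive expression keeps the two panels consistent when the market or room type selection changes.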
R · tidytext · mongolite · ggplot2 · Shiny · MongoDB Atlas
- Airbnb listing language is highly standardized at the surface level, dominated by convenience signals (walk, kitchen, minutes).
- Sentiment is overwhelmingly positive across all segments — effective for trust, not for differentiation.
- TF-IDF reveals that segment-specific language strategy exists beneath the surface: hosts implicitly tailor messaging to customer expectations by room type without explicitly signaling it.
- Pricing is structurally driven by property features (room type, bedroom count) rather than listing language quality.
Regina Garfias — Risk Analytics