A data-driven business intelligence study combining geospatial ride-hailing analysis with cryptocurrency social-media sentiment research. The ride-hailing component analyzes over 1.3 million NYC Yellow Taxi trip records to uncover spatial demand hotspots, temporal usage patterns, and popular travel corridors. The cryptocurrency component investigates the relationship between social media sentiment and Bitcoin price movements using Pearson, Spearman, rolling, and lagged cross-correlation techniques.
Conducted under the guidance of CMU Prof. Beibei Li, this research applies statistical and visual analytics to real-world business problems, demonstrating how data science turns raw datasets into actionable business insights.
| Analysis | Data Source | Records | Methods |
|---|---|---|---|
| Ride-Hailing GPS | NYC TLC Yellow Taxi | 1.3M trips | Zone heatmaps, temporal patterns, corridor analysis |
| Bitcoin Sentiment | Yahoo Finance + synthetic VADER scores | 210 trading days | Pearson/Spearman, rolling & lagged correlation |
    Raw Data Sources
    ├─ NYC TLC Parquet (1.3M taxi trips, Jan 2021)
    └─ Yahoo Finance API (BTC-USD, 210 trading days)

    Preprocessing
    ├─ Filter: distance, fare, duration thresholds
    ├─ Merge: taxi zone names & borough labels
    ├─ Derive: hourly/daily features, daily returns
    └─ Generate: synthetic VADER sentiment scores

    Taxi Analysis                     Crypto Analysis
    Zone pickup/dropoff counts        Pearson & Spearman correlation
    Hour × Day-of-week heatmap        Rolling 30-day correlation
    Top corridors (A → B pairs)       Lagged cross-correlation (0-7d)
    Fare & duration distributions     Volume & trend with 30-day MA

    Output: 12 publication-quality visualizations
Taxi Data Pipeline: NYC TLC Yellow Taxi parquet files are loaded with PyArrow and filtered by distance (0.1–100 mi), fare ($1–$500), and duration (1–120 min). Zone names are merged from the TLC lookup table, and time features (hour, day-of-week) are derived for downstream grouping.
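The filtering and feature-derivation step described above might look roughly like the sketch below. It assumes the standard TLC column names (`tpep_pickup_datetime`, `tpep_dropoff_datetime`, `trip_distance`, `fare_amount`) and treats the thresholds as inclusive; the function name `filter_trips` and the `duration_min` column are hypothetical, and the tiny sample frame is made up purely for illustration.

```python
import pandas as pd


def filter_trips(trips: pd.DataFrame) -> pd.DataFrame:
    """Keep plausible trips and derive time features (illustrative sketch)."""
    out = trips.copy()
    # Trip duration in minutes, from the pickup and dropoff timestamps.
    out["duration_min"] = (
        out["tpep_dropoff_datetime"] - out["tpep_pickup_datetime"]
    ).dt.total_seconds() / 60
    # Thresholds from the text: 0.1-100 mi, $1-$500, 1-120 min (inclusive here).
    mask = (
        out["trip_distance"].between(0.1, 100)
        & out["fare_amount"].between(1, 500)
        & out["duration_min"].between(1, 120)
    )
    out = out[mask].copy()
    # Time features used for downstream grouping.
    out["pickup_date"] = out["tpep_pickup_datetime"].dt.date
    out["pickup_day"] = out["tpep_pickup_datetime"].dt.day_name()
    out["pickup_hour"] = out["tpep_pickup_datetime"].dt.hour
    return out


# Made-up sample: one valid trip, one 20-second trip, one 150-mile trip.
sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2021-01-04 08:00:00", "2021-01-04 09:00:00", "2021-01-05 23:30:00"]
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2021-01-04 08:15:00", "2021-01-04 09:00:20", "2021-01-06 00:10:00"]
        ),
        "trip_distance": [2.5, 1.0, 150.0],
        "fare_amount": [12.0, 5.0, 60.0],
    }
)
clean = filter_trips(sample)
```

On real data, the input frame would come from something like `pd.read_parquet(path, engine="pyarrow")`, where `path` points at the downloaded TLC file.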
Geospatial Analysis: Trips are aggregated by pickup and dropoff zone to produce ranked bar charts. The top-N pickup→dropoff pairs are identified as popular corridors. Borough-level summaries compute average fare, distance, and trip duration for high-level business insights.
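As a rough illustration of the corridor ranking and borough summaries described above, the sketch below groups trips by pickup and dropoff zone and averages a few trip metrics per borough. The column names `pickup_zone`, `dropoff_zone`, `pickup_borough`, and `duration_min` are assumptions about how the merged frame is labeled, and the sample rows are invented for the example.

```python
import pandas as pd


def top_corridors(trips: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Rank pickup -> dropoff zone pairs by trip count (illustrative sketch)."""
    return (
        trips.groupby(["pickup_zone", "dropoff_zone"])
        .size()
        .reset_index(name="trips")
        .sort_values("trips", ascending=False)
        .head(n)
        .reset_index(drop=True)
    )


def borough_summary(trips: pd.DataFrame) -> pd.DataFrame:
    """Average fare, distance, and duration per pickup borough."""
    cols = ["fare_amount", "trip_distance", "duration_min"]
    return trips.groupby("pickup_borough")[cols].mean()


# Invented sample rows, only to show the shape of the inputs and outputs.
sample = pd.DataFrame(
    {
        "pickup_zone": ["Midtown Center"] * 3
        + ["JFK Airport"] * 2
        + ["Upper East Side South"],
        "dropoff_zone": ["Upper East Side South"] * 3 + ["Midtown Center"] * 3,
        "pickup_borough": ["Manhattan"] * 3 + ["Queens"] * 2 + ["Manhattan"],
        "fare_amount": [10.0, 12.0, 14.0, 52.0, 52.0, 8.0],
        "trip_distance": [1.5, 1.6, 1.4, 17.0, 17.2, 1.5],
        "duration_min": [10.0, 12.0, 11.0, 45.0, 50.0, 9.0],
    }
)
corridors = top_corridors(sample, n=2)
boroughs = borough_summary(sample)
```

Counting directed pairs keeps A → B and B → A as separate corridors, which matches the "A → B pairs" wording; an undirected variant would need the zone pair sorted before grouping.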
Temporal Patterns: Trips are binned by hour and day-of-week. The 24×7 heatmap reveals rush-hour peaks, midday lulls, and weekend late-night surges. Peak hour detection identifies the 5 busiest time slots for fleet allocation recommendations.
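One way the peak-slot detection could work is sketched below: count trips per (date, weekday, hour), average across dates, and keep the `k` largest slots. The function name `peak_slots` is hypothetical, the project may define "busiest" differently, and the sample rows are invented.

```python
import pandas as pd


def peak_slots(trips: pd.DataFrame, k: int = 5) -> pd.Series:
    """Return the k busiest (weekday, hour) slots by average trips per date."""
    # Trips in each concrete slot, e.g. Monday 2021-01-04 at 18:00.
    per_slot = trips.groupby(["pickup_date", "pickup_day", "pickup_hour"]).size()
    # Average over dates so each weekday/hour pair gets one typical value.
    avg_slot = per_slot.groupby(level=["pickup_day", "pickup_hour"]).mean()
    return avg_slot.nlargest(k)


# Invented sample: (date, weekday, hour) for each trip.
rows = (
    [("2021-01-04", "Monday", 18)] * 3
    + [("2021-01-11", "Monday", 18)] * 1
    + [("2021-01-05", "Tuesday", 8)] * 3
    + [("2021-01-04", "Monday", 2)] * 1
)
sample = pd.DataFrame(rows, columns=["pickup_date", "pickup_day", "pickup_hour"])
peaks = peak_slots(sample, k=2)
```

In this sample, Tuesday at 08:00 averages 3 trips per date and Monday at 18:00 averages 2, so those two slots are returned in that order.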
Sentiment Correlation: Daily BTC-USD returns are computed from closing prices. The synthetic VADER sentiment scores are correlated with returns using Pearson (r ≈ 0.45 overall), Spearman (rank-based, less sensitive to outliers), a 30-day rolling window (time-varying strength), and lagged cross-correlation (testing predictive horizons of 0–7 days).
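    from scipy import stats as sp_stats

    def compute_correlation(df):
        # Drop days missing either series so both tests use the same rows.
        valid = df.dropna(subset=["sentiment_score", "daily_return"])
        # Pearson measures linear association; Spearman is its rank-based counterpart.
        pearson_r, pearson_p = sp_stats.pearsonr(
            valid["sentiment_score"], valid["daily_return"]
        )
        spearman_r, spearman_p = sp_stats.spearmanr(
            valid["sentiment_score"], valid["daily_return"]
        )
        # CorrelationResult is the project's own result container, defined elsewhere.
        return CorrelationResult(pearson_r, pearson_p, spearman_r, spearman_p)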
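The rolling and lagged analyses mentioned under Sentiment Correlation could be sketched as follows. This is a minimal illustration rather than the project's code: the function names are hypothetical, the lag convention (sentiment on day t against the return on day t + lag) is an assumption, and the data is randomly generated so that the return trails sentiment by exactly two days.

```python
import numpy as np
import pandas as pd


def rolling_correlation(df: pd.DataFrame, window: int = 30) -> pd.Series:
    """Pearson correlation over a trailing window of `window` days."""
    return df["sentiment_score"].rolling(window).corr(df["daily_return"])


def lagged_correlation(df: pd.DataFrame, max_lag: int = 7) -> pd.Series:
    """Correlate sentiment on day t with the return on day t + lag."""
    out = {}
    for lag in range(max_lag + 1):
        # shift(-lag) pulls the future return back onto day t.
        future_return = df["daily_return"].shift(-lag)
        out[lag] = df["sentiment_score"].corr(future_return)
    return pd.Series(out, name="pearson_r")


# Random illustration data: the return follows sentiment with a 2-day delay.
rng = np.random.default_rng(0)
sentiment = pd.Series(rng.normal(size=210))
noise = pd.Series(rng.normal(scale=0.1, size=210))
df = pd.DataFrame(
    {"sentiment_score": sentiment, "daily_return": sentiment.shift(2) + noise}
)

rolling = rolling_correlation(df)
lagged = lagged_correlation(df)
best_lag = lagged.idxmax()
```

With this construction the lag-2 correlation dominates, which is the kind of peak a lagged analysis is meant to surface. The rolling series is NaN until a full window of paired observations is available.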
    def weekly_heatmap(df):
        # Count trips in each (date, weekday, hour) slot.
        per_slot = df.groupby(["pickup_date", "pickup_day", "pickup_hour"]).size()
        avg_slot = per_slot.reset_index(name="trips")
        # Average across dates so the heatmap shows a typical week.
        avg_slot = avg_slot.groupby(["pickup_day", "pickup_hour"])["trips"].mean()
        heatmap = avg_slot.reset_index().pivot(
            index="pickup_hour", columns="pickup_day", values="trips"
        )
        return heatmap[_DAY_ORDER]  # Monday first