Business Intelligence and Business Analytics

Research · Led by CMU Prof. Beibei Li · May–Jul 2021

A data-driven business intelligence study combining geospatial ride-hailing analysis with cryptocurrency social-media sentiment research. The ride-hailing component analyzes over 1.3 million NYC Yellow Taxi trip records to uncover spatial demand hotspots, temporal usage patterns, and popular travel corridors. The cryptocurrency component investigates the relationship between social media sentiment and Bitcoin price movements using Pearson, Spearman, rolling, and lagged cross-correlation techniques.

Conducted under the guidance of CMU Prof. Beibei Li, this research applies statistical and visual analytics to real-world business problems, showing how data science turns raw datasets into actionable business insights.

Figure: Top 20 NYC taxi pickup zones by trip volume (Jan 2021). Horizontal bar chart ranking the zones by trip count.
2 Analyses · 12 Visualizations · 1.3M Trip Records · 210 Trading Days

Analysis Comparison

Analysis           | Data Source            | Records | Methods
Ride-Hailing GPS   | NYC TLC Yellow Taxi    | 1.3M    | Zone heatmaps, temporal patterns, corridor analysis
Bitcoin Sentiment  | Yahoo Finance + VADER  | 210     | Pearson/Spearman, rolling & lagged correlation

Pipeline Architecture

Raw Data Sources
  ├― NYC TLC Parquet (1.3M taxi trips, Jan 2021)
  └― Yahoo Finance API (BTC-USD, 210 trading days)

Preprocessing
  ├― Filter: distance, fare, duration thresholds
  ├― Merge: taxi zone names & borough labels
  ├― Derive: hourly/daily features, daily returns
  └― Generate: synthetic VADER sentiment scores

Taxi Analysis                      Crypto Analysis
  Zone pickup/dropoff counts         Pearson & Spearman correlation
  Hour × Day-of-week heatmap         Rolling 30-day correlation
  Top corridors (A → B pairs)        Lagged cross-correlation (0–7d)
  Fare & duration distributions      Volume & trend with 30-day MA

Output: 12 publication-quality visualizations

Frameworks & Tools

Language: Python 3.8+
Data Processing: pandas, NumPy, PyArrow
Visualization: matplotlib, seaborn
Statistics: scipy.stats, scikit-learn
Sentiment: VADER (Valence Aware Dictionary and sEntiment Reasoner)
Market Data: yfinance (Yahoo Finance API)
Tags: Python · pandas · matplotlib · seaborn · VADER · yfinance · Geospatial · Sentiment Analysis · Correlation · Business Intelligence

How It Works

Taxi Data Pipeline: NYC TLC Yellow Taxi parquet files are loaded with PyArrow and filtered by distance (0.1–100 mi), fare ($1–$500), and duration (1–120 min). Zone names are merged from the TLC lookup table, and time features (hour, day-of-week) are derived for downstream grouping.
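The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`tpep_pickup_datetime`, `trip_distance`, `fare_amount`, `PULocationID`, and the lookup table's `LocationID`, `Zone`, `Borough`) are assumed to follow the TLC schema, and the function name `clean_trips`, the derived column names, and the inclusive threshold bounds are hypothetical.

```python
import pandas as pd


def clean_trips(trips: pd.DataFrame, zones: pd.DataFrame) -> pd.DataFrame:
    """Filter implausible trips, attach pickup zone labels, derive time features."""
    df = trips.copy()

    # Trip duration in minutes from the pickup and dropoff timestamps.
    df["duration_min"] = (
        df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    ).dt.total_seconds() / 60.0

    # Thresholds quoted in the text; treating them as inclusive is an assumption.
    keep = (
        df["trip_distance"].between(0.1, 100)
        & df["fare_amount"].between(1, 500)
        & df["duration_min"].between(1, 120)
    )
    df = df.loc[keep]

    # Attach pickup zone name and borough from the zone lookup table.
    pickup_zones = zones.rename(
        columns={
            "LocationID": "PULocationID",
            "Zone": "pickup_zone",
            "Borough": "pickup_borough",
        }
    )[["PULocationID", "pickup_zone", "pickup_borough"]]
    df = df.merge(pickup_zones, on="PULocationID", how="left")

    # Time features used for downstream grouping.
    df["pickup_date"] = df["tpep_pickup_datetime"].dt.date
    df["pickup_hour"] = df["tpep_pickup_datetime"].dt.hour
    df["pickup_day"] = df["tpep_pickup_datetime"].dt.day_name()
    return df
```

With the real data, `trips` would typically come from `pd.read_parquet(path, engine="pyarrow")` and `zones` from the TLC zone lookup CSV; both paths are left out here.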

Geospatial Analysis: Trips are aggregated by pickup and dropoff zone to produce ranked bar charts. The top-N pickup→dropoff pairs are identified as popular corridors. Borough-level summaries compute average fare, distance, and trip duration for high-level business insights.
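One way to express the corridor ranking and borough summaries with pandas is sketched below. The function names and the column names (`pickup_zone`, `dropoff_zone`, `pickup_borough`, `duration_min`) are illustrative assumptions, not taken from the project code.

```python
import pandas as pd


def top_corridors(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Rank pickup -> dropoff zone pairs by number of trips."""
    return (
        df.groupby(["pickup_zone", "dropoff_zone"])
        .size()
        .reset_index(name="trips")
        .sort_values("trips", ascending=False)
        .head(n)
        .reset_index(drop=True)
    )


def borough_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Trip count plus average fare, distance and duration per pickup borough."""
    return df.groupby("pickup_borough").agg(
        trips=("fare_amount", "size"),
        avg_fare=("fare_amount", "mean"),
        avg_distance=("trip_distance", "mean"),
        avg_duration_min=("duration_min", "mean"),
    )
```

Counting directed pairs keeps A → B and B → A as separate corridors, which matches the "pickup→dropoff" wording above; an undirected variant would need the two zone columns sorted within each row first.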

Temporal Patterns: Trips are binned by hour and day-of-week. The 24×7 heatmap reveals rush-hour peaks, midday lulls, and weekend late-night surges. Peak hour detection identifies the 5 busiest time slots for fleet allocation recommendations.
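Peak detection of this kind could look like the sketch below. The text does not say whether a "time slot" is an hour of the day or a (day, hour) pair, so the (day, hour) interpretation here is an assumption, as are the function name `peak_slots` and the column names `pickup_date`, `pickup_day`, and `pickup_hour`.

```python
import pandas as pd


def peak_slots(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Return the k (day-of-week, hour) slots with the highest average trip count."""
    # Trips per calendar date and hour.
    per_date = df.groupby(["pickup_date", "pickup_day", "pickup_hour"]).size()
    # Average across dates so weekdays that occur more often are not over-counted.
    avg = per_date.groupby(level=["pickup_day", "pickup_hour"]).mean()
    return avg.nlargest(k).rename("avg_trips").reset_index()
```

Averaging per date before ranking mirrors the weekly heatmap logic, so the ranked slots correspond to the brightest heatmap cells.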

Sentiment Correlation: Daily BTC-USD returns are computed from closing prices. Sentiment scores are correlated with returns using Pearson (r ≈ 0.45 overall), Spearman (rank-based robustness), a 30-day rolling window (time-varying strength), and lagged cross-correlation (testing 0–7 day predictive horizons).
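The daily-return, rolling, and lagged steps can be sketched with plain pandas as follows. This is an illustration under stated assumptions: the function names are hypothetical, both inputs are taken to be date-aligned Series, and a positive lag is read as sentiment on day t paired with the return on day t + lag.

```python
import pandas as pd


def daily_returns(close: pd.Series) -> pd.Series:
    """Simple day-over-day percentage change of the closing price."""
    return close.pct_change()


def rolling_corr(sentiment: pd.Series, returns: pd.Series, window: int = 30) -> pd.Series:
    """Pearson correlation over a trailing window (NaN until the window is full)."""
    return sentiment.rolling(window).corr(returns)


def lagged_corr(sentiment: pd.Series, returns: pd.Series, max_lag: int = 7) -> pd.Series:
    """Correlation of sentiment on day t with returns on day t + lag, for lag = 0..max_lag."""
    return pd.Series(
        {lag: sentiment.corr(returns.shift(-lag)) for lag in range(max_lag + 1)},
        name="pearson_r",
    )
```

A peak at lag 0 would suggest sentiment and returns move together on the same day, while a peak at a positive lag would be consistent with sentiment leading returns. Neither establishes causation.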

Sample Visualizations

Hour × Day-of-Week Demand Heatmap: 24-by-7 heatmap showing average taxi trip counts per hour and day of week.
Bitcoin Price vs Social Media Sentiment: dual-axis chart overlaying Bitcoin closing price with daily sentiment scores.
Hourly Taxi Demand Pattern: line chart showing average taxi trips per hour, with peaks at the morning and evening rush.
Rolling 30-Day Sentiment–Return Correlation: time series of the 30-day rolling Pearson correlation between sentiment and Bitcoin returns.

Code Highlights

Sentiment Correlation Analysis
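from collections import namedtuple
from scipy import stats as sp_stats

# Assumed result container; field names mirror the return statement below.
CorrelationResult = namedtuple(
    "CorrelationResult", "pearson_r pearson_p spearman_r spearman_p"
)

def compute_correlation(df):
    # Use only days that have both a sentiment score and a return.
    valid = df.dropna(subset=["sentiment_score", "daily_return"])
    pearson_r, pearson_p = sp_stats.pearsonr(
        valid["sentiment_score"], valid["daily_return"]
    )
    spearman_r, spearman_p = sp_stats.spearmanr(
        valid["sentiment_score"], valid["daily_return"]
    )
    return CorrelationResult(pearson_r, pearson_p, spearman_r, spearman_p)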
Weekly Demand Heatmap
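# Assumes pickup_day holds full day names; adjust if it stores integers.
_DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def weekly_heatmap(df):
    # Count trips per calendar date and hour, then average across dates
    # so weekdays that occur more often in the month are not over-weighted.
    per_slot = df.groupby(["pickup_date", "pickup_day", "pickup_hour"]).size()
    avg_slot = per_slot.reset_index(name="trips")
    avg_slot = avg_slot.groupby(["pickup_day", "pickup_hour"])["trips"].mean()
    # Rows are hours of the day, columns are days of the week.
    heatmap = avg_slot.reset_index().pivot(
        index="pickup_hour", columns="pickup_day", values="trips"
    )
    return heatmap[_DAY_ORDER]  # Monday first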

Challenges & Solutions