Back to Portfolio
NLP • Machine Learning • Data Pipeline • Real-Time Analytics

Reddit Sentiment
Analysis Pipeline

A production-grade ML pipeline that extracts Reddit threads, performs real-time sentiment analysis using VADER and BERT transformers, and delivers interactive analytics with 89% accuracy at 10,000 comments/hour throughput.

Thread Analysis

r/technology - AI Discussion

Live
Positive
62%
Neutral
24%
Negative
14%
Total Comments Analyzed1,247

Accuracy

89%

Processing Speed

10K/hr

API Rate Limit

60/min

Data Retention

90 days

Pipeline Capabilities

Production ML Infrastructure

Enterprise-grade data pipeline handling real-time social media sentiment at scale.

Real-Time Reddit Data Extraction

Production-grade ingestion pipeline using PRAW (Python Reddit API Wrapper) with OAuth 2.0 authentication. Extracts threads, comments, and metadata from subreddits with rate limiting (60 requests/min) and automatic retry logic. Data streamed to Apache Kafka for distributed processing.

PRAW APIApache KafkaRedis QueueDocker Containers

ML Sentiment Analysis Engine

Hybrid sentiment model combining VADER lexicon for social media and fine-tuned BERT transformer (distilbert-base-uncased-finetuned-sst-2) for context-aware analysis. Processes 10,000+ comments/hour with 89% accuracy. Multi-class classification: Positive, Negative, Neutral with confidence scores.

HuggingFace TransformersVADERPyTorchscikit-learn

Interactive Analytics Dashboard

Real-time visualization dashboard built with React and D3.js. Features sentiment distribution charts, time-series trend analysis, word clouds, and keyword frequency graphs. Supports CSV/JSON export with pandas DataFrame serialization for offline analysis.

React 18D3.jsRechartsMaterial-UIPandas
Machine Learning Pipeline

End-to-End ML Workflow

From raw Reddit data to actionable sentiment insights.

01

Data Collection

PRAW extracts Reddit threads with metadata (author, upvotes, timestamp, subreddit)

~60 requests/min, 10K comments/day
02

Preprocessing

spaCy tokenization, lemmatization, stopword removal, text normalization

Processing: 500 comments/sec
03

Feature Engineering

TF-IDF vectorization, n-gram extraction (1-3), sentiment lexicon matching

Feature dimension: 5000
04

Model Inference

VADER + DistilBERT ensemble, confidence thresholding (>0.75 for classification)

Latency: <200ms per batch (32)
05

Post-Processing

Aggregation by thread/subreddit, outlier detection, trend calculation

Storage: PostgreSQL + MongoDB

Platform Features

Comprehensive analytics toolkit for social media sentiment analysis.

Advanced Filtering

  • Filter by sentiment polarity (Positive/Negative/Neutral)
  • Time range selection (last hour, day, week, custom)
  • Subreddit multi-select with autocomplete
  • Keyword/phrase search with boolean operators (AND, OR, NOT)
  • Minimum comment karma threshold
  • Author filtering and blacklist support

Visualization Suite

  • Sentiment distribution pie chart with percentages
  • Time-series line chart tracking sentiment trends
  • Word cloud highlighting most frequent terms by sentiment
  • Heatmap showing sentiment intensity across time
  • Top keywords bar chart with frequency counts
  • Thread-level sentiment comparison graphs

Export & Reporting

  • Raw Reddit data export (JSON/CSV)
  • Sentiment-annotated dataset with confidence scores
  • Aggregated statistics report (PDF/Excel)
  • Time-series data for external BI tools
  • API access for programmatic data retrieval
  • Automated email reports with scheduled analysis

System Architecture

01

Reddit Data Ingestion

PRAW client authenticates via OAuth 2.0 and streams Reddit submissions/comments to Kafka topics. Celery workers consume from Kafka, performing data validation, deduplication (using Redis cache), and text preprocessing (lowercasing, URL removal, emoji handling).

02

Text Preprocessing & Cleaning

spaCy pipeline tokenizes text, removes stopwords, and performs lemmatization. Regex patterns clean markdown formatting, URLs, and special characters. Data normalized and batched (size: 32) for efficient ML inference.

03

Sentiment Classification

Dual-model approach: VADER provides quick lexicon-based sentiment for short comments, while DistilBERT transformer handles complex contextual analysis. Models deployed via TorchServe on GPU-enabled EC2 instances (p3.2xlarge) for sub-200ms inference latency.

04

Data Aggregation & Storage

Sentiment scores aggregated by thread, subreddit, and time window (hourly/daily). Results stored in PostgreSQL for structured queries and MongoDB for raw document storage. Elasticsearch indexes enable full-text search across 1M+ analyzed comments.

05

Analytics & Visualization

Next.js frontend queries PostgreSQL via GraphQL API (Apollo Server). D3.js renders interactive sentiment timelines, pie charts, and heatmaps. Real-time updates via WebSocket connections. Export functionality generates CSV/JSON via pandas with gzip compression.

Technology Stack

Industry-standard tools for production ML pipelines.

Data Ingestion

  • PRAW (Reddit API)
  • Apache Kafka
  • Redis
  • Celery Task Queue
  • Docker

ML/NLP Pipeline

  • HuggingFace Transformers
  • VADER Sentiment
  • spaCy
  • PyTorch
  • scikit-learn
  • NLTK

Data Storage

  • PostgreSQL
  • MongoDB
  • AWS S3
  • Elasticsearch
  • Apache Parquet

Frontend & Visualization

  • Next.js 14
  • React 18
  • D3.js
  • Recharts
  • TailwindCSS
  • TypeScript
Use Cases

Real-World Applications

Brand Monitoring

Track public sentiment about products, campaigns, and brand mentions across Reddit communities.

Market Research

Analyze consumer opinions on competitors, industry trends, and emerging topics in real-time.

Crisis Detection

Early warning system for negative sentiment spikes indicating PR crises or product issues.

This production ML pipeline demonstrates expertise in NLP, distributed data processing, and real-time analytics. By combining VADER lexicon analysis with transformer-based deep learning, the system achieves high accuracy while maintaining low latency for actionable insights.