Reddit Sentiment Analysis Pipeline
A production-grade ML pipeline that extracts Reddit threads, scores comment sentiment in near real time with VADER and a DistilBERT transformer, and serves interactive analytics. Reported figures: 89% classification accuracy at a throughput of 10,000 comments/hour.
Thread Analysis: r/technology - AI Discussion
Accuracy: 89%
Processing Speed: 10K comments/hr
API Rate Limit: 60 requests/min
Data Retention: 90 days
Production ML Infrastructure
Enterprise-grade data pipeline handling real-time social media sentiment at scale.
Real-Time Reddit Data Extraction
Production-grade ingestion pipeline using PRAW (Python Reddit API Wrapper) with OAuth 2.0 authentication. Extracts threads, comments, and metadata from subreddits with rate limiting (60 requests/min) and automatic retry logic. Extracted data is streamed to Apache Kafka for distributed processing.
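The rate limiting and retry behavior described above could look roughly like the following sketch. It uses only the standard library; the class and function names (SlidingWindowLimiter, with_retry) and the backoff schedule are illustrative assumptions, not the project's actual code, and the PRAW calls themselves are left out.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window.

    Hypothetical helper: the 60/min default mirrors the figure quoted above.
    `clock` and `sleep` are injectable so the class can be tested without waiting.
    """

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of calls still inside the window

    def acquire(self):
        """Block until one more call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            # Window is full: wait until the oldest call expires.
            self._sleep(self.period - (now - self._calls[0]))
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)


def with_retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on any exception retry with exponential backoff.

    Delays are base_delay * 2**attempt (1s, 2s, 4s, ... by default).
    The last failure is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would invoke limiter.acquire() before each API request and wrap the request itself in with_retry. In real code the except clause would be narrowed to the transient error types the client library raises.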
ML Sentiment Analysis Engine
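Hybrid sentiment model combining the VADER lexicon, which is tuned for social media text, with a fine-tuned DistilBERT transformer (distilbert-base-uncased-finetuned-sst-2-english) for context-aware analysis. Processes 10,000+ comments/hour with 89% accuracy. Output is three-class (Positive, Negative, Neutral) with confidence scores; because the SST-2 checkpoint on its own predicts only positive or negative, the Neutral class presumably comes from the ensemble step (VADER scores plus confidence thresholding).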
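One way the two model outputs might be merged into a three-class label is sketched below. The blending formula, the 0.7 weight, and the names Sentiment and combine are assumptions made for illustration; only the 0.75 threshold is taken from the workflow description on this page. The inputs are a VADER compound score in [-1, 1] and a binary classifier result (label plus probability), as a Hugging Face text-classification pipeline would return.

```python
from dataclasses import dataclass


@dataclass
class Sentiment:
    label: str         # "Positive", "Negative", or "Neutral"
    confidence: float  # strength of the blended polarity signal, in [0, 1]


def combine(vader_compound, bert_label, bert_score,
            bert_weight=0.7, threshold=0.75):
    """Blend a VADER compound score with a binary classifier output.

    Assumed scheme, not the project's documented method:
    1. Map the classifier output to a signed score in [-1, 1].
    2. Take a weighted average with the VADER compound score.
    3. Call the result Neutral unless its magnitude exceeds `threshold`.
    """
    # Probability of the positive class, whichever label was predicted.
    p_pos = bert_score if bert_label == "POSITIVE" else 1.0 - bert_score
    bert_signed = 2.0 * p_pos - 1.0
    blended = bert_weight * bert_signed + (1.0 - bert_weight) * vader_compound
    confidence = abs(blended)
    if confidence <= threshold:
        return Sentiment("Neutral", confidence)
    return Sentiment("Positive" if blended > 0 else "Negative", confidence)
```

With this scheme a comment is labeled Positive or Negative only when both models point the same way with some strength; a lukewarm classifier probability such as 0.6 lands in Neutral regardless of the lexicon score.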
Interactive Analytics Dashboard
Real-time visualization dashboard built with React and D3.js. Features sentiment distribution charts, time-series trend analysis, word clouds, and keyword frequency graphs. Supports CSV/JSON export with pandas DataFrame serialization for offline analysis.
End-to-End ML Workflow
From raw Reddit data to actionable sentiment insights.
Data Collection
PRAW extracts Reddit threads with metadata (author, upvotes, timestamp, subreddit)
Preprocessing
spaCy tokenization, lemmatization, stopword removal, text normalization
Feature Engineering
TF-IDF vectorization, n-gram extraction (1-3), sentiment lexicon matching
Model Inference
VADER + DistilBERT ensemble, confidence thresholding (>0.75 for classification)
Post-Processing
Aggregation by thread/subreddit, outlier detection, trend calculation
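As an illustration of the n-gram extraction (1-3) named in the Feature Engineering step, here is a minimal standard-library sketch. The function name and the whitespace-joined output format are assumptions; in practice scikit-learn's TfidfVectorizer with ngram_range=(1, 3) would likely cover both this and the TF-IDF weighting.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams of `tokens` for n_min <= n <= n_max.

    Each n-gram is returned as a single space-joined string, shortest first.
    """
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def ngram_counts(documents):
    """Count n-gram occurrences across a list of token lists."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens))
    return counts
```

The tokens are expected to be already preprocessed (tokenized, lemmatized, stopwords removed), so the counts reflect content words rather than filler.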
Platform Features
Comprehensive analytics toolkit for social media sentiment analysis.
Advanced Filtering
- Filter by sentiment polarity (Positive/Negative/Neutral)
- Time range selection (last hour, day, week, custom)
- Subreddit multi-select with autocomplete
- Keyword/phrase search with boolean operators (AND, OR, NOT)
- Minimum comment karma threshold
- Author filtering and blacklist support
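The filters listed above can be combined into a single predicate. The sketch below is one possible shape, with a hypothetical Comment record and parameter names; keyword matching is plain case-insensitive substring search, so a term like "ai" would also match inside "said".

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Comment:
    """Hypothetical record; field names are assumptions for this sketch."""
    author: str
    subreddit: str
    body: str
    karma: int
    created: datetime
    sentiment: str  # "Positive", "Negative", or "Neutral"


def keep(c, *, sentiments=None, subreddits=None, since=None, until=None,
         min_karma=None, blocked_authors=frozenset(),
         all_of=(), any_of=(), none_of=()):
    """Return True if comment `c` passes every active filter.

    A filter left at its default value is inactive.
    """
    text = c.body.lower()
    if sentiments is not None and c.sentiment not in sentiments:
        return False
    if subreddits is not None and c.subreddit not in subreddits:
        return False
    if since is not None and c.created < since:
        return False
    if until is not None and c.created > until:
        return False
    if min_karma is not None and c.karma < min_karma:
        return False
    if c.author in blocked_authors:
        return False
    # AND: every term must appear.
    if not all(term.lower() in text for term in all_of):
        return False
    # OR: at least one term must appear (only if any were given).
    if any_of and not any(term.lower() in text for term in any_of):
        return False
    # NOT: no excluded term may appear.
    if any(term.lower() in text for term in none_of):
        return False
    return True
```

Filtering a list is then a comprehension such as [c for c in comments if keep(c, sentiments={"Negative"}, min_karma=5)].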
Visualization Suite
- Sentiment distribution pie chart with percentages
- Time-series line chart tracking sentiment trends
- Word cloud highlighting most frequent terms by sentiment
- Heatmap showing sentiment intensity across time
- Top keywords bar chart with frequency counts
- Thread-level sentiment comparison graphs
Export & Reporting
- Raw Reddit data export (JSON/CSV)
- Sentiment-annotated dataset with confidence scores
- Aggregated statistics report (PDF/Excel)
- Time-series data for external BI tools
- API access for programmatic data retrieval
- Automated email reports with scheduled analysis
System Architecture
Reddit Data Ingestion
PRAW client authenticates via OAuth 2.0 and streams Reddit submissions/comments to Kafka topics. Celery workers consume from Kafka, performing data validation, deduplication (using Redis cache), and text preprocessing (lowercasing, URL removal, emoji handling).
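The Redis-backed deduplication step can be sketched with an atomic "set if not exists" call. The key prefix, TTL, and function names below are assumptions; InMemoryStore is a stand-in written for this example that mimics only the subset of redis-py's set(name, value, nx=..., ex=...) used here, so the code runs without a Redis server.

```python
class InMemoryStore:
    """Stand-in for a Redis client (illustration only).

    Mimics redis-py's `set` with `nx=True`: returns True when the key was
    written and None when it already existed. The `ex` expiry is accepted
    but ignored here.
    """

    def __init__(self):
        self._data = {}

    def set(self, name, value, nx=False, ex=None):
        if nx and name in self._data:
            return None
        self._data[name] = value
        return True


def is_first_sighting(store, comment_id, ttl_seconds=86400):
    """Atomically mark `comment_id` as seen; True only the first time."""
    return bool(store.set(f"seen:{comment_id}", 1, nx=True, ex=ttl_seconds))


def dedupe(store, comments):
    """Keep only comments whose `id` has not been seen before."""
    return [c for c in comments if is_first_sighting(store, c["id"])]
```

With a real Redis client passed in as store, the check and the write happen in a single command, so two workers processing the same comment concurrently cannot both treat it as new.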
Text Preprocessing & Cleaning
spaCy pipeline tokenizes text, removes stopwords, and performs lemmatization. Regex patterns clean markdown formatting, URLs, and special characters. Data normalized and batched (size: 32) for efficient ML inference.
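The regex cleaning and batching described above might be implemented along these lines. The specific patterns are assumptions chosen for illustration (the page does not list the actual ones), and the spaCy tokenization and lemmatization steps are omitted.

```python
import re

# Illustrative patterns; the real pipeline's regexes are not documented here.
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\([^)]*\)")         # [text](url) -> text
URL_RE = re.compile(r"https?://\S+")                      # bare URLs
MD_PREFIX_RE = re.compile(r"^\s*[>#]+\s*", re.MULTILINE)  # quote/heading markers
MD_EMPHASIS_RE = re.compile(r"[*~`]+")                    # bold, strike, code marks
WS_RE = re.compile(r"\s+")


def clean(text):
    """Strip markdown and URLs, collapse whitespace, and lowercase."""
    text = MD_LINK_RE.sub(r"\1", text)  # keep the link text, drop the target
    text = URL_RE.sub(" ", text)
    text = MD_PREFIX_RE.sub("", text)
    text = MD_EMPHASIS_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip().lower()


def batched(items, size=32):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Markdown links are handled before bare URLs so that the visible link text survives while the target is dropped.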
Sentiment Classification
Dual-model approach: VADER provides quick lexicon-based sentiment for short comments, while DistilBERT transformer handles complex contextual analysis. Models deployed via TorchServe on GPU-enabled EC2 instances (p3.2xlarge) for sub-200ms inference latency.
Data Aggregation & Storage
Sentiment scores aggregated by thread, subreddit, and time window (hourly/daily). Results stored in PostgreSQL for structured queries and MongoDB for raw document storage. Elasticsearch indexes enable full-text search across 1M+ analyzed comments.
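The hourly aggregation by thread can be sketched as a simple group-by. The input row shape (thread id, timestamp, signed score) and the output dictionary layout are assumptions for this example; in the described system the same grouping would more likely be a SQL query against PostgreSQL.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(ts):
    """Truncate a datetime to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def aggregate(rows):
    """Average sentiment per (thread, hour).

    `rows` is an iterable of (thread_id, created, score) tuples, where
    `created` is a datetime and `score` is a signed sentiment score.
    Returns {(thread_id, hour_start): {"mean": float, "count": int}}.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for thread_id, created, score in rows:
        key = (thread_id, hour_bucket(created))
        sums[key][0] += score
        sums[key][1] += 1
    return {k: {"mean": s / n, "count": n} for k, (s, n) in sums.items()}
```

Daily windows follow the same pattern with a bucket function that also zeroes the hour, and subreddit-level rollups only change the first element of the key.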
Analytics & Visualization
Next.js frontend queries PostgreSQL via GraphQL API (Apollo Server). D3.js renders interactive sentiment timelines, pie charts, and heatmaps. Real-time updates via WebSocket connections. Export functionality generates CSV/JSON via pandas with gzip compression.
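For the gzip-compressed export, the sketch below uses only the standard library as a stand-in for the pandas path described above (with pandas, DataFrame.to_csv accepts compression="gzip"). The function names and the choice of newline-delimited JSON are assumptions.

```python
import csv
import gzip
import io
import json


def to_csv_gz(rows, fieldnames):
    """Serialize a list of dicts to gzip-compressed CSV bytes."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return gzip.compress(buf.getvalue().encode("utf-8"))


def to_jsonl_gz(rows):
    """Serialize a list of dicts to gzip-compressed JSON Lines bytes."""
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in rows)
    return gzip.compress(payload.encode("utf-8"))
```

Both functions return bytes, so the result can be written to disk, uploaded to object storage, or returned directly as an HTTP response body.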
Technology Stack
Industry-standard tools for production ML pipelines.
Data Ingestion
- PRAW (Reddit API)
- Apache Kafka
- Redis
- Celery Task Queue
- Docker
ML/NLP Pipeline
- HuggingFace Transformers
- VADER Sentiment
- spaCy
- PyTorch
- scikit-learn
- NLTK
Data Storage
- PostgreSQL
- MongoDB
- AWS S3
- Elasticsearch
- Apache Parquet
Frontend & Visualization
- Next.js 14
- React 18
- D3.js
- Recharts
- TailwindCSS
- TypeScript
Real-World Applications
Brand Monitoring
Track public sentiment about products, campaigns, and brand mentions across Reddit communities.
Market Research
Analyze consumer opinions on competitors, industry trends, and emerging topics in real time.
Crisis Detection
Early warning system for negative sentiment spikes indicating PR crises or product issues.
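A negative-sentiment spike check of this kind could be as simple as a z-score against recent history. This is a sketch under assumptions: the page does not say how spikes are detected, and the threshold of 3 standard deviations, the minimum history length, and the function name are all illustrative.

```python
from statistics import mean, pstdev


def is_negative_spike(history, current, z_threshold=3.0, min_points=5):
    """Flag `current` if it sits far above the recent baseline.

    `history` holds the share of negative comments (0.0 to 1.0) in previous
    time windows; `current` is the share in the latest window.
    """
    if len(history) < min_points:
        return False  # not enough data for a meaningful baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Perfectly flat baseline: treat any increase as a spike.
        return current > mu
    return (current - mu) / sigma > z_threshold
```

For example, with hourly negative shares hovering around 10%, a jump to 40% is flagged while a reading of 11% is not.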
This production ML pipeline demonstrates expertise in NLP, distributed data processing, and real-time analytics. By combining VADER lexicon analysis with a DistilBERT transformer, the system reports 89% accuracy at sub-200ms inference latency.