NLP • Machine Learning • Data Pipeline • Real-Time Analytics

Reddit Sentiment
Analysis Pipeline

A production-grade ML pipeline that extracts Reddit threads, performs real-time sentiment analysis using VADER and BERT transformers, and delivers interactive analytics with 89% accuracy at 10,000 comments/hour throughput.

Thread Analysis

r/technology - AI Discussion

Live

Positive

62%

Neutral

24%

Negative

14%

Total Comments Analyzed1,247

Accuracy

89%

Processing Speed

10K/hr

API Rate Limit

60/min

Data Retention

90 days

Pipeline Capabilities

Production ML Infrastructure

Enterprise-grade data pipeline handling real-time social media sentiment at scale.

Real-Time Reddit Data Extraction

Production-grade ingestion pipeline using PRAW (Python Reddit API Wrapper) with OAuth 2.0 authentication. Extracts threads, comments, and metadata from subreddits with rate limiting (60 requests/min) and automatic retry logic. Data streamed to Apache Kafka for distributed processing.

PRAW APIApache KafkaRedis QueueDocker Containers

ML Sentiment Analysis Engine

Hybrid sentiment model combining VADER lexicon for social media and fine-tuned BERT transformer (distilbert-base-uncased-finetuned-sst-2) for context-aware analysis. Processes 10,000+ comments/hour with 89% accuracy. Multi-class classification: Positive, Negative, Neutral with confidence scores.

HuggingFace TransformersVADERPyTorchscikit-learn

Interactive Analytics Dashboard

Real-time visualization dashboard built with React and D3.js. Features sentiment distribution charts, time-series trend analysis, word clouds, and keyword frequency graphs. Supports CSV/JSON export with pandas DataFrame serialization for offline analysis.

React 18D3.jsRechartsMaterial-UIPandas

Machine Learning Pipeline

End-to-End ML Workflow

From raw Reddit data to actionable sentiment insights.

Data Collection

PRAW extracts Reddit threads with metadata (author, upvotes, timestamp, subreddit)

~60 requests/min, 10K comments/day

Preprocessing

spaCy tokenization, lemmatization, stopword removal, text normalization

Processing: 500 comments/sec

Feature Engineering

TF-IDF vectorization, n-gram extraction (1-3), sentiment lexicon matching

Feature dimension: 5000

Model Inference

VADER + DistilBERT ensemble, confidence thresholding (>0.75 for classification)

Latency: <200ms per batch (32)

Post-Processing

Aggregation by thread/subreddit, outlier detection, trend calculation

Storage: PostgreSQL + MongoDB

Platform Features

Comprehensive analytics toolkit for social media sentiment analysis.

Advanced Filtering

Filter by sentiment polarity (Positive/Negative/Neutral)
Time range selection (last hour, day, week, custom)
Subreddit multi-select with autocomplete
Keyword/phrase search with boolean operators (AND, OR, NOT)
Minimum comment karma threshold
Author filtering and blacklist support

Visualization Suite

Sentiment distribution pie chart with percentages
Time-series line chart tracking sentiment trends
Word cloud highlighting most frequent terms by sentiment
Heatmap showing sentiment intensity across time
Top keywords bar chart with frequency counts
Thread-level sentiment comparison graphs

Export & Reporting

Raw Reddit data export (JSON/CSV)
Sentiment-annotated dataset with confidence scores
Aggregated statistics report (PDF/Excel)
Time-series data for external BI tools
API access for programmatic data retrieval
Automated email reports with scheduled analysis

System Architecture

Reddit Data Ingestion

PRAW client authenticates via OAuth 2.0 and streams Reddit submissions/comments to Kafka topics. Celery workers consume from Kafka, performing data validation, deduplication (using Redis cache), and text preprocessing (lowercasing, URL removal, emoji handling).

Text Preprocessing & Cleaning

spaCy pipeline tokenizes text, removes stopwords, and performs lemmatization. Regex patterns clean markdown formatting, URLs, and special characters. Data normalized and batched (size: 32) for efficient ML inference.

Sentiment Classification

Dual-model approach: VADER provides quick lexicon-based sentiment for short comments, while DistilBERT transformer handles complex contextual analysis. Models deployed via TorchServe on GPU-enabled EC2 instances (p3.2xlarge) for sub-200ms inference latency.

Data Aggregation & Storage

Sentiment scores aggregated by thread, subreddit, and time window (hourly/daily). Results stored in PostgreSQL for structured queries and MongoDB for raw document storage. Elasticsearch indexes enable full-text search across 1M+ analyzed comments.

Analytics & Visualization

Next.js frontend queries PostgreSQL via GraphQL API (Apollo Server). D3.js renders interactive sentiment timelines, pie charts, and heatmaps. Real-time updates via WebSocket connections. Export functionality generates CSV/JSON via pandas with gzip compression.

Technology Stack

Industry-standard tools for production ML pipelines.

Data Ingestion

PRAW (Reddit API)
Apache Kafka
Redis
Celery Task Queue
Docker

ML/NLP Pipeline

HuggingFace Transformers
VADER Sentiment
spaCy
PyTorch
scikit-learn
NLTK

Data Storage

PostgreSQL
MongoDB
AWS S3
Elasticsearch
Apache Parquet

Frontend & Visualization

Next.js 14
React 18
D3.js
Recharts
TailwindCSS
TypeScript

Use Cases

Real-World Applications

Brand Monitoring

Track public sentiment about products, campaigns, and brand mentions across Reddit communities.

Market Research

Analyze consumer opinions on competitors, industry trends, and emerging topics in real-time.

Crisis Detection

Early warning system for negative sentiment spikes indicating PR crises or product issues.

This production ML pipeline demonstrates expertise in NLP, distributed data processing, and real-time analytics. By combining VADER lexicon analysis with transformer-based deep learning, the system achieves high accuracy while maintaining low latency for actionable insights.

Reddit Sentiment Analysis Pipeline