📋 Project Overview & Problem Statement
Challenge: Traditional sentiment analysis tools lump all geographic regions together, producing misleading global averages that mask critical cultural and regional differences. For example, analyzing "public opinion on alcohol consumption" globally would show mixed sentiment, completely missing that Saudi Arabia is 95% negative (religious/cultural context) while Germany is 70% positive (beer culture).
Solution: Public Sentiment Collection Agent uses geographic segmentation, source diversity tracking, and credibility scoring to provide accurate, context-aware sentiment analysis. The system automatically detects data quality issues like echo chamber bias and single-domain concentration.
🚨 Why Geographic Filtering Matters
Example: "Public opinion on alcohol consumption"
❌ WITHOUT geographic filtering: 60% negative, 30% neutral, 10% positive (MISLEADING - lumps all regions together)
✅ WITH geographic filtering:
- Saudi Arabia: 95% negative (religious/cultural context)
- Germany: 70% positive (beer culture)
- USA: 50/50 split (health concerns vs. social acceptance)
Key Benefits
- Geographic Segmentation: Analyze sentiment by country/region to reveal cultural nuances
- Source Diversity Tracking: Monitor news, social media, forums, blogs across regions
- Credibility Scoring: 0-100 score based on source diversity and sample size
- Automatic Bias Detection: Warns about echo chambers, single-domain concentration
- Comprehensive Output: 10 files per analysis (1 report, 4 charts, 5 CSV exports)
🔍 Source Diversity & Credibility Features
Source Type Classification
The system automatically classifies each source into one of five categories (a classification sketch follows the list):
- News: Professional journalism (BBC, CNN, Reuters, local news sites)
- Social Media: User-generated content (Reddit, Twitter, Facebook)
- Institutional: Government and academic sources (.gov, .edu)
- Blogs: Opinion and commentary (Medium, personal blogs)
- Forums: Community discussions and Q&A sites
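A minimal classification sketch in Python; the domain lists, category names, and fallback rule here are illustrative assumptions rather than the agent's exact logic:
from urllib.parse import urlparse

# Illustrative domain lists; the real agent may use broader rules.
SOURCE_CATEGORIES = {
    "news": {"bbc.com", "cnn.com", "reuters.com"},
    "social_media": {"reddit.com", "twitter.com", "x.com", "facebook.com"},
    "blogs": {"medium.com"},
}

def classify_source(url: str) -> str:
    """Bucket a source URL into news / social_media / institutional / blogs / forums."""
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    if domain.endswith(".gov") or domain.endswith(".edu"):
        return "institutional"
    for category, domains in SOURCE_CATEGORIES.items():
        if domain in domains or any(domain.endswith("." + d) for d in domains):
            return category
    if "forum" in domain or domain.endswith("stackexchange.com"):
        return "forums"
    return "news"  # simplification: treat unrecognised outlets as news

print(classify_source("https://www.reddit.com/r/de"))  # social_media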
Automatic Bias Warnings
The system issues warnings when data quality is compromised, for example (see the threshold sketch after these examples):
⚠️ Over 70% of sources are social media (potential echo chamber bias)
⚠️ 60% of sources from single domain: reddit.com
⚠️ Only 4 unique sources (low diversity)
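A threshold-check sketch matching the warnings above; the cut-offs (70% social media, 60% single domain, fewer than 5 unique domains) follow the examples, but the function name and input shape are assumptions:
from collections import Counter
from urllib.parse import urlparse

def bias_warnings(sources: list[dict]) -> list[str]:
    """Return data-quality warnings for sources shaped like {'url': ..., 'category': ...}."""
    warnings = []
    total = len(sources)
    if total == 0:
        return ["No sources collected"]
    # Echo chamber check: share of social media sources.
    social_share = sum(s["category"] == "social_media" for s in sources) / total
    if social_share > 0.70:
        warnings.append(f"Over 70% of sources are social media ({social_share:.0%}): potential echo chamber bias")
    # Single-domain concentration check.
    domains = Counter(urlparse(s["url"]).netloc.removeprefix("www.") for s in sources)
    top_domain, top_count = domains.most_common(1)[0]
    if top_count / total >= 0.60:
        warnings.append(f"{top_count / total:.0%} of sources from single domain: {top_domain}")
    # Low-diversity check.
    if len(domains) < 5:
        warnings.append(f"Only {len(domains)} unique sources (low diversity)")
    return warnings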
Credibility Score Calculation
Credibility Score (0-100) = (Source Diversity Score × 0.6) + (Sample Size Score × 0.4)
🟢 70-100: High confidence (diverse sources, adequate sample)
🟡 50-69: Medium confidence (some limitations present)
🔴 0-49: Low confidence (significant data quality concerns)
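A worked sketch of the formula and bands above; how the two component scores are derived (here: unique-domain count and sample size measured against illustrative targets of 10 domains and 30 sources) is an assumption:
def credibility_score(unique_domains: int, sample_size: int) -> tuple[float, str]:
    """Combine source diversity and sample size into a 0-100 credibility score."""
    # Assumed component scoring: full marks at 10+ unique domains / 30+ sources.
    diversity_score = min(unique_domains / 10, 1.0) * 100
    sample_score = min(sample_size / 30, 1.0) * 100
    score = diversity_score * 0.6 + sample_score * 0.4
    if score >= 70:
        band = "High confidence"
    elif score >= 50:
        band = "Medium confidence"
    else:
        band = "Low confidence"
    return round(score, 1), band

print(credibility_score(unique_domains=12, sample_size=15))  # (80.0, 'High confidence')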
🛠️ Technical Architecture & Implementation
AI & Analytics Stack
- Google Gemini 2.0 Flash
- Tavily Search API
- Python 3.9+
- Pandas 2.0+
- Matplotlib
- Seaborn
Multi-Agent Framework
- 5 Specialized Agents
- Web Search Integration
- NLP Sentiment Analysis (see the Gemini sketch after this list)
- Data Quality Scoring
- Auto Visualization
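A minimal sketch of the sentiment-analysis step using the google-generativeai SDK; the prompt wording, function name, and single-call design are assumptions, not the agent's actual prompts:
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

def analyse_sentiment(text: str, location: str) -> str:
    """Ask Gemini for a sentiment label, instructing it to weigh regional and cultural context."""
    prompt = (
        f"You are analysing public sentiment from {location}. "
        "Considering local cultural and regional context, classify the sentiment of the "
        f"following text as positive, negative, or neutral, and name the dominant emotion.\n\n{text}"
    )
    response = model.generate_content(prompt)
    return response.text

print(analyse_sentiment("Beer is simply part of our culture here.", "Germany"))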
Deployment Options
- Google Colab
- Jupyter Notebook
- Local Python
- Streamlit (Optional)
System Architecture
Pipeline Flow (an orchestration sketch follows the steps):
1. Geographic Listening → Web search with location filters
2. Source Analysis → Diversity tracking & bias detection
3. Sentiment Analysis → Gemini AI with cultural context
4. Credibility Scoring → Quality assessment (0-100)
5. Visualization → 4 professional charts
6. Data Export → 5 CSV files for Excel/Sheets
7. Report Packaging → Executive markdown report
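A high-level orchestration sketch of steps 1-4 for a single region, assuming the classification, warning, sentiment, and credibility sketches above are defined in the same session; TavilyClient.search is the real tavily-python call, while the query modifier, function names, and return shape are illustrative:
import os
from urllib.parse import urlparse
from tavily import TavilyClient

tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def collect_region(issue_keyword: str, location: str, num_sources: int = 15) -> dict:
    """Run steps 1-4 of the pipeline for one region; visualization, export, and reporting follow."""
    # 1. Geographic Listening: bias the web search toward one region via a query modifier.
    hits = tavily.search(query=f"{issue_keyword} public opinion {location}",
                         max_results=num_sources)["results"]
    # 2. Source Analysis: classify each hit, then check diversity and bias.
    sources = [{"url": h["url"], "category": classify_source(h["url"]),
                "content": h.get("content", "")} for h in hits]
    warnings = bias_warnings(sources)
    # 3. Sentiment Analysis: Gemini with cultural context.
    sentiments = [analyse_sentiment(s["content"], location) for s in sources]
    # 4. Credibility Scoring: source diversity plus sample size, 0-100.
    unique_domains = len({urlparse(s["url"]).netloc for s in sources})
    score, band = credibility_score(unique_domains, len(sources))
    return {"location": location, "sentiments": sentiments,
            "warnings": warnings, "credibility": (score, band)}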
📖 Development Setup & Usage Guide
Quick Start with Google Colab (Recommended)
- Open Colab Notebook: Click "Launch in Google Colab" button above
- Add API Keys: Add GOOGLE_API_KEY and TAVILY_API_KEY to Colab Secrets (🔑 icon); see the key-loading sketch below
- Run Setup Cells: Install dependencies and configure APIs
- Run Analysis: Execute run_enhanced_sentiment_pipeline() with your topic and locations
- Download Results: Get markdown report, 4 charts, and 5 CSV files
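A minimal key-loading sketch for step 2; google.colab.userdata is Colab's Secrets API, and the variable names match the keys listed under Required API Keys below:
import os
from google.colab import userdata  # available inside Google Colab only

# Read the keys saved in the 🔑 Secrets panel and expose them to the pipeline.
os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")
os.environ["TAVILY_API_KEY"] = userdata.get("TAVILY_API_KEY")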
Example Usage
# Example: Analyze firecracker ban opinions across cultures
results = run_enhanced_sentiment_pipeline(
    issue_keyword="Should firecrackers and fireworks be banned?",
    locations=["Malaysia", "Germany", "USA", "India"],
    num_sources_per_location=15,
    output_dir="."
)
# Output:
# - comparative_report_20251024_123456.md
# - regional_comparison_20251024_123456.png
# - credibility_dashboard_20251024_123456.png
# - source_diversity_20251024_123456.png
# - theme_frequency_20251024_123456.png
# - sentiment_distribution_20251024_123456.csv
# - emotion_frequency_20251024_123456.csv
# - theme_comparison_20251024_123456.csv
# - source_attribution_20251024_123456.csv
# - credibility_metrics_20251024_123456.csv
Required API Keys
- Google AI Studio API: Get from Google AI Studio (free tier available)
- Tavily Search API: Get from Tavily (free 1,000 searches/month)
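For the Jupyter Notebook or Local Python deployments, an equivalent sketch using environment variables (the getpass prompt is just one convenient option, not a required setup step):
import os
from getpass import getpass

# Prompt once per session for any key not already set in the shell environment.
for key in ("GOOGLE_API_KEY", "TAVILY_API_KEY"):
    if not os.environ.get(key):
        os.environ[key] = getpass(f"Enter {key}: ")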
⚠️ Limitations & Disclaimers
Data Collection Limitations
- Language Bias: Primarily English-language sources (non-English opinions underrepresented)
- Digital Divide: Only captures online populations (offline communities excluded)
- Platform Bias: Web search favors certain platforms over others
- Temporal Scope: Results are a snapshot in time; sentiment may shift rapidly
- Sample Size: Small samples may not represent entire populations
Geographic Filtering Challenges
- Geographic attribution is approximate (based on search query modifiers)
- Cross-border content may appear in multiple regions
- VPNs and global platforms complicate true location detection
Recommended Use
✅ Good for: Directional insights, trend detection, hypothesis generation
⚠️ Caution for: Policy decisions, legal proceedings, precise measurement
❌ Not for: Statistical inference about entire populations