🔍 About This Code Showcase
This curated code snippet demonstrates how the AI PDF Summarizer extracts, processes, and intelligently summarizes complex PDF documents using advanced NLP techniques.
Full deployment scripts, API integrations, and proprietary details are omitted for clarity and security. This showcase highlights the core document processing and AI summarization algorithms.
📖 Core Algorithm: Document Intelligence Engine
The foundation of the AI PDF Summarizer is its ability to extract meaningful content from PDFs, understand document structure, and generate intelligent summaries tailored to user needs:
import PyPDF2
import google.generativeai as genai
from typing import List, Dict, Optional
import re
class DocumentProcessor:
    """
    Advanced PDF processing engine that extracts, analyzes, and summarizes
    complex documents while preserving important context and structure.
    """

    def __init__(self, api_key: str):
        """Configure the Gemini client and register per-document-type strategies.

        Args:
            api_key: Google Generative AI API key used to configure ``genai``.
        """
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-pro')
        # Dispatch table: document type -> summarization strategy coroutine.
        # NOTE(review): the strategy methods are shown on SummarizationEngine in
        # this showcase; presumably they are mixed into this class in the full
        # codebase — confirm against the deployed source.
        self.summarization_strategies = {
            'research_paper': self._research_paper_strategy,
            'business_report': self._business_report_strategy,
            'technical_manual': self._technical_manual_strategy,
            'legal_document': self._legal_document_strategy,
            'general': self._general_strategy,
        }

    # BUG FIX: the original declared this as a plain ``def`` while using
    # ``await`` in the body, which is a SyntaxError — it must be a coroutine.
    async def process_pdf_intelligent(self, pdf_file, summary_type: str = 'general') -> Dict:
        """
        Process a PDF with content analysis and context-aware summarization.

        Args:
            pdf_file: Uploaded PDF file object (anything ``PyPDF2.PdfReader``
                accepts).
            summary_type: Type of document for specialized processing; unknown
                types fall back to the general strategy.

        Returns:
            Dictionary containing extracted text, document metadata, the AI
            summary, its insights, and processing statistics.
        """
        extracted_content = self._extract_structured_content(pdf_file)
        document_analysis = self._analyze_document_structure(extracted_content)
        content_chunks = self._intelligent_chunking(extracted_content, document_analysis)
        # Unknown summary types deliberately degrade to the general strategy.
        strategy = self.summarization_strategies.get(summary_type, self._general_strategy)
        summary = await strategy(content_chunks, document_analysis)
        return {
            'original_text': extracted_content['full_text'],
            'document_metadata': document_analysis,
            'summary': summary,
            'key_insights': summary['insights'],
            'processing_stats': self._generate_processing_stats(extracted_content, summary),
        }

    def _extract_structured_content(self, pdf_file) -> Dict:
        """
        Extract content while preserving document structure, headers, and
        formatting. This enables better understanding of document hierarchy
        and importance.

        Returns:
            Dict with ``full_text`` (pages joined by blank lines), per-page
            entries, flattened ``headers``/``sections``, and PDF ``metadata``.
        """
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        content_structure = {
            'full_text': '',
            'pages': [],
            'headers': [],
            'sections': [],
            'metadata': pdf_reader.metadata,
        }
        for page_num, page in enumerate(pdf_reader.pages):
            page_text = page.extract_text()
            headers = self._detect_headers(page_text)
            sections = self._identify_sections(page_text, headers)
            content_structure['pages'].append({
                'page_number': page_num + 1,  # 1-based for human-facing output
                'text': page_text,
                'headers': headers,
                'sections': sections,
            })
            content_structure['full_text'] += page_text + '\n\n'
            content_structure['headers'].extend(headers)
            content_structure['sections'].extend(sections)
        return content_structure

    def _intelligent_chunking(self, content: Dict, analysis: Dict) -> List[Dict]:
        """
        Smart content chunking that respects document structure and context.
        Prevents breaking related concepts across chunks for better AI
        processing.

        Returns:
            Chunks sorted by descending importance score.
        """
        chunks: List[Dict] = []
        current_chunk = ''
        max_chunk_size = 4000  # character budget per chunk sent to the model
        for section in content['sections']:
            section_text = section['content']
            if current_chunk and len(current_chunk) + len(section_text) > max_chunk_size:
                # Adding this section would overflow: flush and start fresh.
                chunks.append(self._build_chunk(current_chunk, analysis))
                current_chunk = section_text
            elif current_chunk:
                current_chunk += '\n\n' + section_text
            else:
                # BUG FIX: the original ran ``current_chunk += '\n\n' + text``
                # even when the chunk was empty, prefixing every first chunk
                # with a spurious blank-line separator.
                current_chunk = section_text
        if current_chunk:  # flush the trailing partial chunk
            chunks.append(self._build_chunk(current_chunk, analysis))
        return sorted(chunks, key=lambda c: c['importance_score'], reverse=True)

    def _build_chunk(self, text: str, analysis: Dict) -> Dict:
        """Package chunk *text* with its headers and computed importance score."""
        return {
            'content': text,
            'section_headers': self._extract_chunk_headers(text),
            'importance_score': self._calculate_importance(text, analysis),
        }
🧠 Advanced Summarization Engine
The summarization engine uses specialized strategies for different document types, ensuring optimal results for research papers, business reports, and technical manuals:
class SummarizationEngine:
    """
    Document-type-aware summarization engine.

    Each ``*_strategy`` coroutine tailors chunk selection and the model prompt
    to one class of document (research papers, business reports, ...), while
    ``_analyze_document_structure`` classifies the document up front.
    """

    # BUG FIX: the original read ``async _research_paper_strategy(...)`` — the
    # ``def`` keyword was missing, which is a SyntaxError.
    async def _research_paper_strategy(self, chunks: List[Dict], analysis: Dict) -> Dict:
        """
        Specialized summarization for academic research papers.
        Focuses on methodology, findings, and implications.

        Args:
            chunks: Importance-ranked content chunks.
            analysis: Document metadata (type, counts, complexity).

        Returns:
            Dict with the main summary, research insights, and excerpts of the
            methodology/findings sections (first 500 chars each, if found).
        """
        # Locate the canonical paper sections by their common titles.
        abstract_chunk = self._find_section(chunks, ['abstract', 'summary'])
        methodology_chunk = self._find_section(chunks, ['methodology', 'methods', 'approach'])
        results_chunk = self._find_section(chunks, ['results', 'findings', 'outcomes'])
        conclusion_chunk = self._find_section(chunks, ['conclusion', 'discussion', 'implications'])
        summary_prompt = f"""
        Analyze this research paper and provide a comprehensive academic summary:
        Document Analysis: {analysis}
        Focus on:
        1. Research question and objectives
        2. Methodology and experimental design
        3. Key findings and statistical significance
        4. Limitations and future research directions
        5. Practical implications and applications
        Provide citations to specific sections where possible.
        """
        research_summary = await self._generate_ai_summary(
            self._combine_priority_chunks(
                [abstract_chunk, methodology_chunk, results_chunk, conclusion_chunk]
            ),
            summary_prompt,
        )
        return {
            'summary_type': 'research_paper',
            'main_summary': research_summary,
            'insights': await self._extract_research_insights(chunks),
            'key_sections': {
                # Truncate excerpts so the response stays compact.
                'methodology': methodology_chunk['content'][:500] if methodology_chunk else None,
                'findings': results_chunk['content'][:500] if results_chunk else None,
            },
        }

    # BUG FIX: same missing ``def`` keyword as above.
    async def _business_report_strategy(self, chunks: List[Dict], analysis: Dict) -> Dict:
        """
        Business-focused summarization emphasizing key metrics and actionable
        insights. Optimized for executive summaries and strategic
        decision-making.

        Args:
            chunks: Importance-ranked content chunks.
            analysis: Document metadata; ``document_type`` is interpolated into
                the prompt (defaults to 'Business Report').

        Returns:
            Dict with the main summary plus extracted insights, action items,
            and key metrics.
        """
        executive_summary = self._find_section(chunks, ['executive summary', 'overview'])
        financial_data = self._find_section(chunks, ['financial', 'revenue', 'performance'])
        recommendations = self._find_section(chunks, ['recommendations', 'action items', 'next steps'])
        business_prompt = f"""
        Summarize this business document for executive review:
        Document Type: {analysis.get('document_type', 'Business Report')}
        Extract and highlight:
        1. Key performance indicators and metrics
        2. Strategic recommendations and action items
        3. Risk factors and opportunities
        4. Financial implications and ROI
        5. Timeline and implementation priorities
        Format for executive consumption with clear bullet points.
        """
        business_summary = await self._generate_ai_summary(
            self._combine_priority_chunks([executive_summary, financial_data, recommendations]),
            business_prompt,
        )
        return {
            'summary_type': 'business_report',
            'main_summary': business_summary,
            'insights': await self._extract_business_insights(chunks),
            'action_items': await self._extract_action_items(chunks),
            'metrics': await self._extract_key_metrics(chunks),
        }

    def _analyze_document_structure(self, content: Dict) -> Dict:
        """
        Analyze document characteristics to determine the optimal processing
        strategy. Uses keyword pattern matching to classify document type.

        Args:
            content: Structured extraction result with ``full_text``,
                ``headers``, ``pages``, and ``sections`` keys.

        Returns:
            Dict with the classified ``document_type`` plus size/complexity
            statistics for the document.
        """
        full_text = content['full_text'].lower()
        headers = [h.lower() for h in content['headers']]
        # Keyword-based classification; first match wins, 'general' otherwise.
        doc_type = 'general'
        if any(keyword in full_text for keyword in ['abstract', 'methodology', 'references', 'citation']):
            doc_type = 'research_paper'
        elif any(keyword in full_text for keyword in ['revenue', 'quarterly', 'executive summary', 'roi']):
            doc_type = 'business_report'
        elif any(keyword in full_text for keyword in ['installation', 'configuration', 'user manual', 'api']):
            doc_type = 'technical_manual'
        return {
            'document_type': doc_type,
            'page_count': len(content['pages']),
            'word_count': len(content['full_text'].split()),
            'complexity_score': self._calculate_complexity(content),
            'section_count': len(content['sections']),
            # More than 3 detected headers suggests genuinely structured content.
            'has_structured_content': len(headers) > 3,
        }