📁 Project Structure
The app has three layers: a vanilla JavaScript frontend that handles mic capture and audio playback, a FastAPI backend that hosts the Gemini Live session, and a pandas-backed MCP-style tool layer that queries the sales dataset.
```
sundry-shop-assistant/
├── Dockerfile            ← Cloud Run container definition
├── .dockerignore         ← Exclude .venv, __pycache__, .env from image
├── .env.example          ← Template for GEMINI_API_KEY
├── index.html            ← Welcome + Conversation screens
├── dataset.csv           ← 150 rows of POS transactions (March 2024)
├── system-prompt.txt     ← Santai BM persona rules for Adam
├── run-local.bat         ← One-click local dev (creates venv, runs server)
├── README.md             ← Project overview
├── deploy.md             ← Cloud Run deployment guide
│
├── backend/              ← FastAPI + Gemini Live + MCP bridge
│   ├── main.py           ← WebSocket /ws endpoint, serves static
│   ├── gemini_live.py    ← Live API session wrapper + tool loop
│   ├── tool_bridge.py    ← 10 Gemini FunctionDeclarations
│   ├── mcp_tools.py      ← 10 pandas-backed tool functions
│   └── requirements.txt  ← Pinned Python deps
│
├── css/
│   └── style.css         ← Mobile-first Poppins UI, pink accent
│
└── js/
    ├── pcm-processor.js  ← AudioWorklet for PCM capture
    ├── media-handler.js  ← 16kHz mic in, 24kHz audio out
    ├── gemini-client.js  ← WebSocket wrapper (mode=audio|text)
    └── app.js            ← UI logic, I/O toggles, event wiring
```
🔧 MCP Tool Definitions
Each tool is a regular Python function over the sales DataFrame. Return values must be JSON-serializable (native Python types — no numpy). The contract matches MCP's tools/list shape, making it swappable for a real MCP server later.
```python
def get_sales_by_category() -> dict:
    """Revenue and transaction count by product category, ranked best to worst."""
    df = _load()
    grouped = df.groupby("Product Category")["Total Sales"].agg(
        ["sum", "count", "mean"]
    )
    grouped = grouped.sort_values("sum", ascending=False)
    return {
        "categories": [
            {
                "category": cat,
                "revenue_rm": round(float(row["sum"]), 2),
                "transactions": int(row["count"]),
                "avg_basket_rm": round(float(row["mean"]), 2),
            }
            for cat, row in grouped.iterrows()
        ]
    }
```
```python
# Public mapping — registered with Gemini via tool_bridge.py
ALL_TOOLS = {
    "get_total_sales": get_total_sales,
    "get_top_day": get_top_day,
    "get_weekly_summary": get_weekly_summary,
    "get_sales_by_category": get_sales_by_category,
    "get_slowest_category": get_slowest_category,
    "compare_member_vs_visitor": compare_member_vs_visitor,
    "compare_gender": compare_gender,
    "get_payment_mix": get_payment_mix,
    "get_payment_by_customer_type": get_payment_by_customer_type,
    "get_basket_stats": get_basket_stats,
}
```
Why native Python types matter: pandas returns numpy scalars (np.int64, np.float64) instead of built-ins, and Python's JSON encoder rejects np.int64 outright, so the Gemini function-response payload fails to serialize. Every numeric return therefore goes through an explicit int() or float().
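The failure mode is easy to reproduce in isolation. A minimal sketch (the toy DataFrame below is illustrative, not the project's dataset) showing why a numpy scalar breaks `json.dumps` and how the explicit cast fixes it:

```python
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({"Total Sales": [10.5, 20.0, 30.25]})
count = df["Total Sales"].count()  # np.int64, not a Python int

# np.int64 does not subclass int, so the JSON encoder rejects it
try:
    json.dumps({"transactions": count})
except TypeError as e:
    print(f"serialization failed: {e}")

# Explicit casts produce plain Python types that serialize cleanly
payload = {
    "transactions": int(count),
    "revenue_rm": round(float(df["Total Sales"].sum()), 2),
}
print(json.dumps(payload))  # {"transactions": 3, "revenue_rm": 60.75}
```

Note that np.float64 happens to subclass Python's float and often serializes by accident; casting everything keeps the contract uniform rather than relying on that quirk.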
📜 Tool Bridge — FunctionDeclarations for Gemini
Each tool is declared to Gemini with a crisp description. Descriptions must be non-overlapping — in a voice conversation, the wrong tool choice causes awkward silence that's harder to recover from than in text chat.
```python
from google.genai import types

from mcp_tools import ALL_TOOLS

_FUNCTION_DECLARATIONS = [
    types.FunctionDeclaration(
        name="get_sales_by_category",
        description=(
            "Rank all product categories by revenue, including "
            "transaction count and average basket per category. "
            "Use this for 'kategori paling laku?' or 'top categories'."
        ),
        parameters=types.Schema(type=types.Type.OBJECT, properties={}),
    ),
    types.FunctionDeclaration(
        name="compare_member_vs_visitor",
        description=(
            "Compare Member vs Visitor customers: revenue share, "
            "average basket, and transaction count. Use this for "
            "loyalty program questions like 'member atau visitor "
            "spend lebih?'."
        ),
        parameters=types.Schema(type=types.Type.OBJECT, properties={}),
    ),
    # ... 8 more declarations
]


def get_tools() -> list[types.Tool]:
    return [types.Tool(function_declarations=_FUNCTION_DECLARATIONS)]


def get_tool_mapping() -> dict:
    return ALL_TOOLS
```
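Because the declarations and the handler mapping live in two files, they can drift apart silently. One cheap safeguard (my own suggestion, not something the project ships) is an import-time check that every declared name has a handler; the toy sets below stand in for `_FUNCTION_DECLARATIONS` and `ALL_TOOLS`:

```python
# Hypothetical sanity check: every tool name Gemini can call must have a
# matching Python handler. Toy stand-ins replace the real module attributes.
declared_names = {"get_sales_by_category", "compare_member_vs_visitor"}
implemented_names = {
    "get_total_sales",
    "get_sales_by_category",
    "compare_member_vs_visitor",
}

missing = declared_names - implemented_names
if missing:
    raise RuntimeError(f"Declared but unimplemented tools: {sorted(missing)}")
print("tool declarations and handlers are in sync")
```

In the real module this would compare `{d.name for d in _FUNCTION_DECLARATIONS}` against `ALL_TOOLS.keys()`, catching a misspelled declaration before a voice session hits it.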
🎙️ Gemini Live Session — Tool Call Loop
The heart of the app. When the model emits a tool call during a voice session, the Python handler runs, the result is sent back via session.send_tool_response(), and the model continues speaking with the real numbers.
```python
async for response in session.receive():
    server_content = response.server_content
    tool_call = response.tool_call

    if server_content:
        if server_content.model_turn:
            for part in server_content.model_turn.parts:
                if part.inline_data:
                    # Audio bytes streamed back to browser
                    await audio_output_callback(part.inline_data.data)

        # Input/output transcripts for the UI transcript panel
        if server_content.input_transcription:
            await event_queue.put({
                "type": "user",
                "text": server_content.input_transcription.text,
            })

    # Tool call handling — this is where MCP meets Live API
    if tool_call:
        function_responses = []
        for fc in tool_call.function_calls:
            func_name = fc.name
            args = fc.args or {}
            if func_name in self.tool_mapping:
                tool_func = self.tool_mapping[func_name]
                loop = asyncio.get_running_loop()
                # functools.partial binds func and kwargs up front;
                # run_in_executor itself takes no keyword arguments
                result = await loop.run_in_executor(
                    None, functools.partial(tool_func, **args)
                )
                function_responses.append(
                    types.FunctionResponse(
                        name=func_name,
                        id=fc.id,
                        response={"result": result},
                    )
                )
        # Send all tool results back; model resumes speaking
        await session.send_tool_response(
            function_responses=function_responses
        )
```
Latency matters more than MCP: During the tool call, the agent's audio stream pauses. If the tool takes 3+ seconds, the owner hears awkward silence. Every tool in this app returns in under 100ms via in-memory pandas — a deliberate latency choice over a more "correct" but slower MCP client/server hop.
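One way to keep that 100 ms budget honest during development (a sketch of my own, not something the project ships) is to wrap each blocking tool call in a timing guard; `call_tool_timed` and its budget parameter are hypothetical names:

```python
import asyncio
import functools
import time


async def call_tool_timed(tool_func, args: dict, budget_ms: float = 100.0):
    """Run a blocking pandas tool off the event loop and flag slow calls."""
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    # run_in_executor keeps the audio event loop responsive while pandas works
    result = await loop.run_in_executor(None, functools.partial(tool_func, **args))
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > budget_ms:
        print(f"{tool_func.__name__} took {elapsed_ms:.0f} ms — over budget")
    return result


# Toy tool standing in for a pandas-backed query
def get_total_sales() -> dict:
    return {"revenue_rm": 12847.50}


async def main():
    result = await call_tool_timed(get_total_sales, {})
    print(result)

asyncio.run(main())
```

Logging the overrun rather than cancelling the call keeps behavior unchanged in production while surfacing any tool that would cause audible silence.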
🌐 WebSocket Bridge — Browser to Gemini
The FastAPI endpoint accepts either PCM audio bytes (when the owner speaks) or JSON text (when they type). The ?mode= query parameter sets whether Gemini responds with audio or text — switching modes reconnects the session.
```python
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    mode = websocket.query_params.get("mode", "audio").lower()
    response_modality = "TEXT" if mode == "text" else "AUDIO"
    await websocket.accept()

    audio_input_queue = asyncio.Queue()
    text_input_queue = asyncio.Queue()

    async def audio_output_callback(data: bytes):
        await websocket.send_bytes(data)

    gemini = GeminiLive(
        api_key=GEMINI_API_KEY,
        model=MODEL,
        response_modality=response_modality,
        voice_name=VOICE_NAME,
    )

    async def receive_from_client():
        while True:
            message = await websocket.receive()
            if message.get("bytes") is not None:
                # Raw PCM 16kHz from the mic
                await audio_input_queue.put(message["bytes"])
            elif message.get("text") is not None:
                # JSON {"text": "..."} from keyboard input
                payload = json.loads(message["text"])
                await text_input_queue.put(payload["text"])

    # Run both loops in parallel — the reader task feeds the queues
    # while the Gemini session drives the events
    receiver_task = asyncio.create_task(receive_from_client())
    try:
        async for event in gemini.start_session(
            audio_input_queue=audio_input_queue,
            text_input_queue=text_input_queue,
            audio_output_callback=audio_output_callback,
        ):
            await websocket.send_json(event)
    finally:
        receiver_task.cancel()
```
🎚️ Frontend Audio Pipeline
An AudioWorkletProcessor captures PCM frames from the mic, the main-thread handler downsamples to 16kHz Int16, and raw bytes are shipped over WebSocket. Incoming 24kHz Float32 is scheduled for playback with a "next start time" cursor so audio frames don't overlap.
```javascript
async startAudio(onAudioData) {
  await this.initializeAudio();
  this.mediaStream = await navigator.mediaDevices.getUserMedia({
    audio: true,
  });
  const source = this.audioContext.createMediaStreamSource(this.mediaStream);
  this.audioWorkletNode = new AudioWorkletNode(
    this.audioContext,
    "pcm-processor"
  );
  this.audioWorkletNode.port.onmessage = (event) => {
    if (this.isRecording) {
      const downsampled = this.downsampleBuffer(
        event.data,
        this.audioContext.sampleRate,
        16000 // Gemini Live expects 16kHz
      );
      const pcm16 = this.convertFloat32ToInt16(downsampled);
      onAudioData(pcm16); // Sends bytes over WebSocket
    }
  };
  source.connect(this.audioWorkletNode);
  this.isRecording = true;
}
```
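The downsample and Int16-conversion steps are plain arithmetic. Here is the same math as a Python sketch (the function names mirror the JS methods but are my own port, not project code):

```python
def downsample_buffer(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Naive decimation: keep every (src_rate / dst_rate)-th sample."""
    ratio = src_rate / dst_rate
    return [samples[int(i * ratio)] for i in range(int(len(samples) / ratio))]


def float32_to_int16(samples: list[float]) -> list[int]:
    """Map [-1.0, 1.0] floats to signed 16-bit PCM, clamping out-of-range values."""
    out = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp to avoid integer wraparound
        out.append(int(s * 32767))
    return out


# 48 kHz capture decimated 3:1 down to the 16 kHz Gemini Live expects
frames = [0.0, 0.5, -0.5] * 16  # 48 samples at 48 kHz
print(len(downsample_buffer(frames, 48000, 16000)))  # → 16
print(float32_to_int16([1.0, -1.0, 0.5]))            # → [32767, -32767, 16383]
```

A production pipeline would low-pass filter before decimating to avoid aliasing; the browser code above makes the same simple keep-every-Nth-sample trade-off, which is acceptable for speech.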
💬 Santai Bahasa Malaysia System Prompt
The system instruction is the single biggest lever for the agent's feel. It explicitly enforces casual Malay register and gives examples of formal-vs-santai to prevent the model from defaulting to bahasa baku.
```text
You are Adam, an AI assistant inside Sundry Shop Assistant
for Malaysian sundry shop (kedai runcit) owners.

## Language style — Bahasa Malaysia santai (casual)
Speak Bahasa Malaysia santai, NOT formal bahasa baku. The owner
is a 52-year-old uncle running his own shop, not a government
officer at a meeting.

Santai style rules:
- Use "tak" not "tidak", "dah" not "sudah", "ni" not "ini"
- Softeners: "lah", "kan", "eh", "je", "Pak", "Bos"
- Code-switch to English for business terms ("transaksi",
  "basket", "average", "member", "card", "cash", "compare")
- Address the owner as "Pak" or "Bos" consistently

Examples of formal vs santai:
Formal (WRONG): "Jualan hari ini adalah sebanyak RM 847.23
dengan 23 transaksi."
Santai (RIGHT): "Hari ni dah RM 847, Pak. Dua puluh tiga
transaksi setakat ni."

## Never
- Invent numbers — always call an MCP tool before stating a figure
- Speak in long monologues — keep every answer under 30 seconds
- Use heavy English jargon (revenue, conversion, SKU) when BM has
  natural equivalents (jualan, untung, barang)
```