📁 Project Structure
The app has three layers: a vanilla JavaScript frontend that handles mic capture and audio playback, a FastAPI backend that hosts the Gemini Live session, and a pandas-backed MCP-style tool layer that queries the sales dataset.
```
sundry-shop-assistant/
├── Dockerfile            ← Cloud Run container definition
├── .dockerignore         ← Exclude .venv, __pycache__, .env from image
├── .env.example          ← Template for GEMINI_API_KEY
├── index.html            ← Welcome + Conversation screens
├── dataset.csv           ← 150 rows of POS transactions (March 2024)
├── system-prompt.txt     ← Santai BM persona rules for Adam
├── run-local.bat         ← One-click local dev (creates venv, runs server)
├── README.md             ← Project overview
├── deploy.md             ← Cloud Run deployment guide
│
├── backend/              ← FastAPI + Gemini Live + MCP bridge
│   ├── main.py           ← WebSocket /ws endpoint, serves static
│   ├── gemini_live.py    ← Live API session wrapper + tool loop
│   ├── tool_bridge.py    ← 10 Gemini FunctionDeclarations
│   ├── mcp_tools.py      ← 10 pandas-backed tool functions
│   └── requirements.txt  ← Pinned Python deps
│
├── css/
│   └── style.css         ← Mobile-first Poppins UI, pink accent
│
└── js/
    ├── pcm-processor.js  ← AudioWorklet for PCM capture
    ├── media-handler.js  ← 16kHz mic in, 24kHz audio out
    ├── gemini-client.js  ← WebSocket wrapper (mode=audio|text)
    └── app.js            ← UI logic, I/O toggles, event wiring
```
🔧 MCP Tool Definitions
Each tool is a regular Python function over the sales DataFrame. Return values must be JSON-serializable (native Python types — no numpy). The contract matches MCP's tools/list shape, making it swappable for a real MCP server later.
```python
def get_sales_by_category() -> dict:
    """Revenue and transaction count by product category, ranked best to worst."""
    df = _load()
    grouped = df.groupby("Product Category")["Total Sales"].agg(
        ["sum", "count", "mean"]
    )
    grouped = grouped.sort_values("sum", ascending=False)
    return {
        "categories": [
            {
                "category": cat,
                "revenue_rm": round(float(row["sum"]), 2),
                "transactions": int(row["count"]),
                "avg_basket_rm": round(float(row["mean"]), 2),
            }
            for cat, row in grouped.iterrows()
        ]
    }
```
```python
# Public mapping — registered with Gemini via tool_bridge.py
ALL_TOOLS = {
    "get_total_sales": get_total_sales,
    "get_top_day": get_top_day,
    "get_weekly_summary": get_weekly_summary,
    "get_sales_by_category": get_sales_by_category,
    "get_slowest_category": get_slowest_category,
    "compare_member_vs_visitor": compare_member_vs_visitor,
    "compare_gender": compare_gender,
    "get_payment_mix": get_payment_mix,
    "get_payment_by_customer_type": get_payment_by_customer_type,
    "get_basket_stats": get_basket_stats,
}
```
Why native Python types matter: pandas returns numpy scalars (np.int64, np.float64) instead of built-ins, and Python's JSON encoder rejects np.int64 outright, so the Gemini function-response payload fails to serialize. Every numeric return therefore goes through an explicit int() or float().
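The failure mode is easy to reproduce in isolation. A minimal sketch (the toy DataFrame below is illustrative, not the project's dataset) showing why a numpy scalar breaks `json.dumps` and how the explicit cast fixes it:

```python
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({"Total Sales": [10.5, 20.0, 30.25]})
count = df["Total Sales"].count()  # np.int64, not a Python int

# np.int64 does not subclass int, so the JSON encoder rejects it
try:
    json.dumps({"transactions": count})
except TypeError as e:
    print(f"serialization failed: {e}")

# Explicit casts produce plain Python types that serialize cleanly
payload = {
    "transactions": int(count),
    "revenue_rm": round(float(df["Total Sales"].sum()), 2),
}
print(json.dumps(payload))  # {"transactions": 3, "revenue_rm": 60.75}
```

Note that np.float64 happens to subclass Python's float and often serializes by accident; casting everything keeps the contract uniform rather than relying on that quirk.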
📜 Tool Bridge — FunctionDeclarations for Gemini
Each tool is declared to Gemini with a crisp description. Descriptions must be non-overlapping — in a voice conversation, the wrong tool choice causes awkward silence that's harder to recover from than in text chat.
```python
from google.genai import types

from mcp_tools import ALL_TOOLS

_FUNCTION_DECLARATIONS = [
    types.FunctionDeclaration(
        name="get_sales_by_category",
        description=(
            "Rank all product categories by revenue, including "
            "transaction count and average basket per category. "
            "Use this for 'kategori paling laku?' or 'top categories'."
        ),
        parameters=types.Schema(type=types.Type.OBJECT, properties={}),
    ),
    types.FunctionDeclaration(
        name="compare_member_vs_visitor",
        description=(
            "Compare Member vs Visitor customers: revenue share, "
            "average basket, and transaction count. Use this for "
            "loyalty program questions like 'member atau visitor "
            "spend lebih?'."
        ),
        parameters=types.Schema(type=types.Type.OBJECT, properties={}),
    ),
    # ... 8 more declarations
]


def get_tools() -> list[types.Tool]:
    return [types.Tool(function_declarations=_FUNCTION_DECLARATIONS)]


def get_tool_mapping() -> dict:
    return ALL_TOOLS
```
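Because the declarations and the handler mapping live in two files, they can drift apart silently. One cheap safeguard (my own suggestion, not something the project ships) is an import-time check that every declared name has a handler; the toy sets below stand in for `_FUNCTION_DECLARATIONS` and `ALL_TOOLS`:

```python
# Hypothetical sanity check: every tool name Gemini can call must have a
# matching Python handler. Toy stand-ins replace the real module attributes.
declared_names = {"get_sales_by_category", "compare_member_vs_visitor"}
implemented_names = {
    "get_total_sales",
    "get_sales_by_category",
    "compare_member_vs_visitor",
}

missing = declared_names - implemented_names
if missing:
    raise RuntimeError(f"Declared but unimplemented tools: {sorted(missing)}")
print("tool declarations and handlers are in sync")
```

In the real module this would compare `{d.name for d in _FUNCTION_DECLARATIONS}` against `ALL_TOOLS.keys()`, catching a misspelled declaration before a voice session hits it.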
🎙️ Gemini Live Session — Tool Call Loop
The heart of the app. When the model emits a tool call during a voice session, the Python handler runs, the result is sent back via session.send_tool_response(), and the model continues speaking with the real numbers.
```python
async for response in session.receive():
    server_content = response.server_content
    tool_call = response.tool_call

    if server_content:
        if server_content.model_turn:
            for part in server_content.model_turn.parts:
                if part.inline_data:
                    # Audio bytes streamed back to browser
                    await audio_output_callback(part.inline_data.data)

        # Input/output transcripts for the UI transcript panel
        if server_content.input_transcription:
            await event_queue.put({
                "type": "user",
                "text": server_content.input_transcription.text,
            })

    # Tool call handling — this is where MCP meets Live API
    if tool_call:
        function_responses = []
        for fc in tool_call.function_calls:
            func_name = fc.name
            args = fc.args or {}
            if func_name in self.tool_mapping:
                tool_func = self.tool_mapping[func_name]
                loop = asyncio.get_running_loop()
                # functools.partial binds func and kwargs up front;
                # run_in_executor itself takes no keyword arguments
                result = await loop.run_in_executor(
                    None, functools.partial(tool_func, **args)
                )
                function_responses.append(
                    types.FunctionResponse(
                        name=func_name,
                        id=fc.id,
                        response={"result": result},
                    )
                )
        # Send all tool results back; model resumes speaking
        await session.send_tool_response(
            function_responses=function_responses
        )
```
Latency matters more than MCP: During the tool call, the agent's audio stream pauses. If the tool takes 3+ seconds, the owner hears awkward silence. Every tool in this app returns in under 100ms via in-memory pandas — a deliberate latency choice over a more "correct" but slower MCP client/server hop.
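One way to keep that 100 ms budget honest during development (a sketch of my own, not something the project ships) is to wrap each blocking tool call in a timing guard; `call_tool_timed` and its budget parameter are hypothetical names:

```python
import asyncio
import functools
import time


async def call_tool_timed(tool_func, args: dict, budget_ms: float = 100.0):
    """Run a blocking pandas tool off the event loop and flag slow calls."""
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    # run_in_executor keeps the audio event loop responsive while pandas works
    result = await loop.run_in_executor(None, functools.partial(tool_func, **args))
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > budget_ms:
        print(f"{tool_func.__name__} took {elapsed_ms:.0f} ms — over budget")
    return result


# Toy tool standing in for a pandas-backed query
def get_total_sales() -> dict:
    return {"revenue_rm": 12847.50}


async def main():
    result = await call_tool_timed(get_total_sales, {})
    print(result)

asyncio.run(main())
```

Logging the overrun rather than cancelling the call keeps behavior unchanged in production while surfacing any tool that would cause audible silence.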
🌐 WebSocket Bridge — Browser to Gemini
The FastAPI endpoint accepts either PCM audio bytes (when the owner speaks) or JSON text (when they type). The ?mode= query parameter sets whether Gemini responds with audio or text — switching modes reconnects the session.
```python
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    mode = websocket.query_params.get("mode", "audio").lower()
    response_modality = "TEXT" if mode == "text" else "AUDIO"
    await websocket.accept()

    audio_input_queue = asyncio.Queue()
    text_input_queue = asyncio.Queue()

    async def audio_output_callback(data: bytes):
        await websocket.send_bytes(data)

    gemini = GeminiLive(
        api_key=GEMINI_API_KEY,
        model=MODEL,
        response_modality=response_modality,
        voice_name=VOICE_NAME,
    )

    async def receive_from_client():
        while True:
            message = await websocket.receive()
            if message.get("bytes") is not None:
                # Raw PCM 16kHz from the mic
                await audio_input_queue.put(message["bytes"])
            elif message.get("text") is not None:
                # JSON {"text": "..."} from keyboard input
                payload = json.loads(message["text"])
                await text_input_queue.put(payload["text"])

    # Run both loops in parallel — the reader task feeds the queues
    # while the Gemini session drives the events
    receiver_task = asyncio.create_task(receive_from_client())
    try:
        async for event in gemini.start_session(
            audio_input_queue=audio_input_queue,
            text_input_queue=text_input_queue,
            audio_output_callback=audio_output_callback,
        ):
            await websocket.send_json(event)
    finally:
        receiver_task.cancel()
```
🎚️ Frontend Audio Pipeline
An AudioWorkletProcessor captures PCM frames from the mic, the main-thread handler downsamples to 16kHz Int16, and raw bytes are shipped over WebSocket. Incoming 24kHz Float32 is scheduled for playback with a "next start time" cursor so audio frames don't overlap.
```javascript
async startAudio(onAudioData) {
  await this.initializeAudio();
  this.mediaStream = await navigator.mediaDevices.getUserMedia({
    audio: true,
  });
  const source = this.audioContext.createMediaStreamSource(this.mediaStream);
  this.audioWorkletNode = new AudioWorkletNode(
    this.audioContext,
    "pcm-processor"
  );
  this.audioWorkletNode.port.onmessage = (event) => {
    if (this.isRecording) {
      const downsampled = this.downsampleBuffer(
        event.data,
        this.audioContext.sampleRate,
        16000 // Gemini Live expects 16kHz
      );
      const pcm16 = this.convertFloat32ToInt16(downsampled);
      onAudioData(pcm16); // Sends bytes over WebSocket
    }
  };
  source.connect(this.audioWorkletNode);
  this.isRecording = true;
}
```
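The downsample and Int16-conversion steps are plain arithmetic. Here is the same math as a Python sketch (the function names mirror the JS methods but are my own port, not project code):

```python
def downsample_buffer(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Naive decimation: keep every (src_rate / dst_rate)-th sample."""
    ratio = src_rate / dst_rate
    return [samples[int(i * ratio)] for i in range(int(len(samples) / ratio))]


def float32_to_int16(samples: list[float]) -> list[int]:
    """Map [-1.0, 1.0] floats to signed 16-bit PCM, clamping out-of-range values."""
    out = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp to avoid integer wraparound
        out.append(int(s * 32767))
    return out


# 48 kHz capture decimated 3:1 down to the 16 kHz Gemini Live expects
frames = [0.0, 0.5, -0.5] * 16  # 48 samples at 48 kHz
print(len(downsample_buffer(frames, 48000, 16000)))  # → 16
print(float32_to_int16([1.0, -1.0, 0.5]))            # → [32767, -32767, 16383]
```

A production pipeline would low-pass filter before decimating to avoid aliasing; the browser code above makes the same simple keep-every-Nth-sample trade-off, which is acceptable for speech.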
💬 Santai Bahasa Malaysia System Prompt
The system instruction is the single biggest lever for the agent's feel. It explicitly enforces casual Malay register and gives examples of formal-vs-santai to prevent the model from defaulting to bahasa baku.
```text
You are Adam, an AI assistant inside Sundry Shop Assistant
for Malaysian sundry shop (kedai runcit) owners.

## Language style — Bahasa Malaysia santai (casual)
Speak Bahasa Malaysia santai, NOT formal bahasa baku. The owner
is a 52-year-old uncle running his own shop, not a government
officer at a meeting.

Santai style rules:
- Use "tak" not "tidak", "dah" not "sudah", "ni" not "ini"
- Softeners: "lah", "kan", "eh", "je", "Pak", "Bos"
- Code-switch to English for business terms ("transaksi",
  "basket", "average", "member", "card", "cash", "compare")
- Address the owner as "Pak" or "Bos" consistently

Examples of formal vs santai:
Formal (WRONG): "Jualan hari ini adalah sebanyak RM 847.23
dengan 23 transaksi."
Santai (RIGHT): "Hari ni dah RM 847, Pak. Dua puluh tiga
transaksi setakat ni."

## Never
- Invent numbers — always call an MCP tool before stating a figure
- Speak in long monologues — keep every answer under 30 seconds
- Use heavy English jargon (revenue, conversion, SKU) when BM has
  natural equivalents (jualan, untung, barang)
```