Skip to main content
Tilbake til blogg
OpenClawMarch 28, 202610 min

How to wire up Gmail, calendar, and contacts to your AI

How I connected Gmail to an AI agent, indexed 83K emails in ChromaDB for semantic search, and built an auto-classification system that actually works.

DB

David Bakke

Founder, Bakke & Co

PostShare
ForsidebildeOpenClaw

Why email was the first integration

When I started building OpenClaw in early February, I had a list of things I wanted to connect. Home automation, accounting, project management. But email came first because email is where business actually happens.

I run Bakke & Co (consulting), co-founded PortLink (maritime AI), and manage several side projects. I sing in Oslokoret. I DJ. All of these generate email. My inbox had over 83,000 emails when I started this project. Newsletters I never unsubscribed from. Receipts from SaaS products I forgot I was paying for. Important threads about potential partnerships buried under promotional noise.

The goal wasn't to build an email client. It was to build an email brain: something that could search by meaning rather than keywords, classify incoming mail automatically, draft replies in my voice, and give me a morning briefing so I'd know what needed my attention before I opened Gmail.

I called the agent Hermes, after the Greek messenger god. It seemed fitting.

The OAuth setup

Step one was connecting to Gmail. Google's API requires OAuth 2.0, and getting this right took more time than I'd like to admit.

The flow:

  1. Create a Google Cloud project
  2. Enable the Gmail API and Google Calendar API
  3. Configure OAuth consent screen (I used "internal" for testing, then switched to "external" later)
  4. Create OAuth 2.0 credentials (desktop application type)
  5. Run the auth flow to generate refresh tokens

I wrote a single auth script (gmail_auth.py) that handles the entire flow. It stores credentials securely using the macOS Keychain via 1Password, and the refresh tokens stay in the keychain rather than in plaintext config files.

The critical scoping decision: Hermes only gets gmail.readonly, gmail.insert, and gmail.labels. No gmail.send. No gmail.compose. No SMTP access. This wasn't paranoia. Early in February, a migration script had a bug that accidentally sent emails to my contacts. Nothing catastrophic, but enough to make me permanently lock down send permissions. Hermes can read emails and create drafts. It can label and archive. It can never, under any circumstances, send an email on its own.

That rule is in decisions.md, in MEMORY.md, and tattooed into the agent's soul. Some mistakes you only make once.

Indexing 83K emails with ChromaDB

Reading email through the Gmail API is fine for recent messages. But what if I want to find that one email from 2023 where a maritime consultant mentioned autonomous navigation regulations in the North Sea? Gmail's keyword search won't help me because I don't remember the exact words.

That's where ChromaDB comes in. It's a vector database that stores text as embeddings, which means you can search by meaning rather than exact text matches.

The indexing pipeline:

  1. Pull emails from Gmail via API (incremental sync every 30 minutes)
  2. Store metadata in SQLite (crm.db) -- sender, date, subject, labels, thread IDs
  3. Generate embeddings using the bge-m3 model (supports both Norwegian and English, which matters when half your email is in Norwegian)
  4. Store embeddings in ChromaDB

The initial indexing took about 8 hours for 83K emails. After that, the incremental sync handles new emails as they arrive. The ChromaDB instance now has over 89K documents (some emails generate multiple chunks if they're long).

A FastAPI server sits in front of this at localhost:8765, exposing search endpoints that Hermes can call. When I ask "find the email where someone asked about port call scheduling in Bergen," Hermes hits the semantic search endpoint and finds it, even if the email never used the exact phrase "port call scheduling."

The quality of the bge-m3 embeddings surprised me. It handles code-switching between Norwegian and English well, which is a real concern when you live in Norway but work internationally. An email that's half Norwegian and half English still gets indexed correctly and surfaces in searches from either language.

Label System V6: auto-classification

This is the part I'm both proudest of and most frustrated by. "V6" tells you how many times I rewrote it.

The idea is simple: every incoming email gets automatically classified with labels that tell me what kind of email it is and what I need to do about it. In practice, getting this right was five weeks of iteration.

The current system uses three types of labels:

Action labels (11 total) tell me what to do:

  • ACTION_REQUIRED -- someone needs a response
  • WAITING -- I'm waiting for someone else
  • INVOICE, RECEIPT -- financial documents
  • FYI -- informational, no action needed
  • NEWSLETTER, PROMOTION -- content I subscribed to (or didn't)
  • SCAM_DETECTED -- obvious spam that got through
  • And a few others (BOUNCED, AUTOREPLY, PENDING_REVIEW)

Context labels (8 total) tell me which part of my life it relates to:

  • PORTLINK, BAKKE_CO, OSLOKORET, PERSONAL, FINANCE, TECH, DJ, VAKTROMMET

Type labels (14 total) describe the email format:

  • Direct, Thread_Reply, Receipt, Invoice, Booking, Shipping, Security, Social, Service_Alert, Government, Newsletter, Promotion, Offer, Automated

The classification happens in two passes. First, a deterministic rules engine catches the easy stuff: SPF/DKIM failures become SCAM_DETECTED, known newsletter senders get NEWSLETTER, receipts from known vendors get RECEIPT. This handles maybe 40% of email correctly without any AI cost.

For the remaining 60%, I use GPT-4o-mini via OpenAI's Batch API. Not Claude. I tested both, and for bulk classification of email, GPT-4o-mini is good enough and significantly cheaper. When you're classifying thousands of emails, the per-token cost matters. That decision is logged in decisions.md so Hermes doesn't second-guess it.

The classification follows an intent-first approach. The question isn't "what kind of email is this?" It's "what does this email want David to do?" A direct message from a human expecting a reply is ACTION_REQUIRED. A shipping confirmation is FYI with type Shipping. A promotional email with a 20% discount is PROMOTION with type Offer (and the offer details get extracted into database fields).

PROMOTION and SCAM_DETECTED get auto-archived. Everything else stays in the inbox until I deal with it.

What you can do once it's wired up

Once the email brain is working, a bunch of things become possible:

Semantic search across years of email. "Find the conversation where Martin from the port authority discussed terminal capacity." It finds it. Even if the email was in Norwegian and Martin's title was "havnesjef," not "port authority."

Morning briefings. Every morning, Hermes scans emails that arrived overnight, classifies them, and posts a summary to Slack. Four action items, twelve FYI, three newsletters. The action items are listed with one-line summaries so I know what's urgent before I open Gmail.

Draft replies. I can ask Hermes to draft a reply to a specific email. It reads the full thread, understands the context, and writes a draft in my voice. The draft gets created via the Gmail API (using gmail.insert), so it shows up in my Gmail drafts folder. I review, edit if needed, and send manually. Hermes never touches the send button.

Contact tracking. The CRM layer in SQLite tracks who I communicate with, how often, and what about. When I ask "when did I last email someone at the Norwegian Maritime Authority?", Hermes can answer without me searching.

Correction learning. When I reclassify an email that Hermes got wrong, the correction gets stored in a david_decisions table. After two corrections on the same sender, Hermes creates an automatic rule for that sender. The system gets smarter over time, but the learning is bounded and explainable, not a black box.

What was hard to set up

The OAuth flow was fiddly. Google's documentation is thorough but scattered. Getting refresh tokens to persist correctly through the Keychain took three attempts. The first two approaches stored tokens in plaintext files that would get accidentally deleted during workspace cleanup.

ChromaDB indexing was slow initially. The bge-m3 model runs locally on my Mac Studio (98GB unified memory helps here), and the first full index of 83K emails took overnight. I tried a few embedding models before settling on bge-m3. Some were faster but couldn't handle Norwegian. Some handled Norwegian but produced poor results on technical English.

The label system rewrites were painful. Each version meant re-classifying thousands of emails. V5 to V6 involved stripping labels from 58,503 messages in 59 batches. I wrote a batch script that processed 1,000 messages at a time using the Gmail batchModify API. The delete-label API call hangs forever if the label is attached to thousands of messages. You have to strip the label from all messages first, then delete the empty label.

The biggest surprise: Gmail's API rate limits are aggressive. You get 250 quota units per second, and different operations cost different amounts. A batchModify that touches 1,000 messages eats into that quota fast. I had to add backoff logic that slows down when approaching the limit.

What still breaks

The classification isn't perfect. It gets maybe 92% of emails right on the first pass. The remaining 8% are edge cases that are genuinely ambiguous. Is an email from a SaaS company about a feature update a NEWSLETTER or FYI? Reasonable people could disagree.

The incremental sync occasionally misses emails during Gmail API hiccups. I run a full reconciliation once a week to catch anything that slipped through.

Calendar integration is connected but less mature than email. Hermes can read my calendar and check availability when someone proposes a meeting time. It can suggest times for meetings. But it doesn't create calendar events, for the same reason it doesn't send emails: I don't trust any AI enough to modify my calendar without my approval.

How to replicate this

If you want to build something similar, here's the practical path:

Start with OAuth. Get Gmail API access working before anything else. Use a service account or desktop OAuth flow depending on your needs. Lock down the scopes to the minimum you need.

For the vector store, ChromaDB works well for personal-scale email (under 500K emails). If you're building this for an organization, you might want something beefier. The embedding model matters more than the vector database. Pick one that handles your languages.

For classification, don't start with AI. Write deterministic rules first. Known senders, known patterns, header analysis. Let AI handle the ambiguous cases. This keeps your costs down and makes the system more predictable.

Build the correction loop early. The system should get better when you fix its mistakes, not just accept the correction silently.

And whatever you do, don't give it send permissions on day one. Earn that trust over months, not hours.

openclawgmailhermesemailcalendar