Why We Built Our Own Chat System
When we started building CosmiQ, the obvious choice was to use something like assistant-ui or similar chat libraries. They handle streaming, message history, and state out of the box. But our use case broke every assumption these tools made.
A CosmiQ reading is not a simple prompt and response. Every reading goes through a three-stage pipeline:
- Classification: A lightweight LLM analyzes the user's question to determine complexity, emotional tone, and how many tarot cards would best serve the reading. A simple “Will I get the job?” might need one card. A deep exploration of a life transition might need eight.
- Card Selection: This happens on the backend using `crypto.getRandomValues()` with an unbiased random index algorithm. We do not use `Math.random()` because modulo bias can skew distributions. When you are pulling cards that represent life decisions, the randomness needs to be cryptographically sound (see the sketch after this list).
- Interpretation: A pro-tier LLM generates the actual reading, with full context about the cards, the user's personal background, and conversation history.
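As a rough sketch of that selection step (helper names are illustrative, not CosmiQ's actual code), rejection sampling is what removes the modulo bias:

```typescript
// Unbiased random index in [0, max) using Web Crypto.
// Plain `value % max` over a raw 32-bit value would slightly favor low indices.
function secureRandomIndex(max: number): number {
  const range = 0x1_0000_0000; // 2^32 possible Uint32 values
  const limit = range - (range % max); // largest multiple of `max` we can accept
  const buf = new Uint32Array(1);
  let value: number;
  do {
    crypto.getRandomValues(buf);
    value = buf[0];
  } while (value >= limit); // reject the biased tail and redraw
  return value % max;
}

// Example: draw `count` distinct cards from a 78-card tarot deck.
function drawCards<T>(deck: readonly T[], count: number): T[] {
  const remaining = [...deck];
  const drawn: T[] = [];
  for (let i = 0; i < count; i++) {
    drawn.push(remaining.splice(secureRandomIndex(remaining.length), 1)[0]);
  }
  return drawn;
}
```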
No off-the-shelf chat library handles “pause after the first LLM call, do server-side cryptographic selection, then resume with a different model.” So we built our own.
We chose Zustand over React Context or Redux for a specific reason: state survives component unmounts. When a user navigates away mid-stream and comes back, the streaming content is still there. No optimistic updates that can get out of sync. No “stale closure” bugs. One global store that represents the true state, and components subscribe to exactly what they need. The tradeoff is that debugging requires inspecting the store directly, but the reduction in race conditions and state sync bugs made it worth it.
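As a minimal sketch of what such a store can look like (field and action names here are hypothetical, not CosmiQ's real store):

```typescript
import { create } from "zustand";

// Because the store lives outside the React tree, streamed content
// survives component unmounts, navigation, and tab switches.
interface ReadingState {
  readingId: string | null;
  streamedText: string;
  status: "idle" | "classifying" | "drawing" | "interpreting" | "done" | "error";
  appendChunk: (chunk: string) => void;
  setStatus: (status: ReadingState["status"]) => void;
  reset: () => void;
}

export const useReadingStore = create<ReadingState>((set) => ({
  readingId: null,
  streamedText: "",
  status: "idle",
  appendChunk: (chunk) =>
    set((state) => ({ streamedText: state.streamedText + chunk })),
  setStatus: (status) => set({ status }),
  reset: () => set({ readingId: null, streamedText: "", status: "idle" }),
}));

// Components subscribe to exactly what they need, e.g.:
// const text = useReadingStore((s) => s.streamedText);
```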
Solving the Token Cost Problem at Scale
LLM APIs charge by token. Every new message gets sent along with the full conversation history, so by the tenth message the first one has been retransmitted with every request since. Total cost grows quadratically as conversations lengthen.
We discovered users were having conversations with 500+ messages. These were not one-off readings; they were ongoing dialogues about life situations spanning weeks. Under a naive approach, every new message would send all 500 previous messages to the LLM.
Our first solution was a sliding window: keep the last 6 exchanges (12 messages) and drop older ones. This worked initially, but created a ceiling problem. Once you hit 6 exchanges, every subsequent message carried the same 11,000+ token load. For 500 messages, that meant 250 exchanges, each costing the same high amount.
The solution: intelligent summarization every 5 exchanges. A lightweight LLM compresses the history into a ~800-token summary: the card-to-question mappings plus a 400-500 word narrative. This summary replaces the full history, dropping the context from ~11K tokens back to ~3.5K. The pattern repeats, creating a sawtooth usage curve instead of a flat ceiling. Result: 68% token reduction (~850K vs ~2.7M tokens for 500 messages).
Critical insight: Beyond cost savings, summaries preserve context from the initial reading. With a sliding window, once you pass exchange 6, the original question and first cards pulled are completely lost. With summarization, the initial reading is always part of the compressed context.
Before: Sliding Window
- Plateaus at exchange 6; every message after costs the same
- Total for 500 messages: ~2.7M tokens

After: Summary Every 5 Exchanges
- Drops back down after each summary, creating the sawtooth pattern
- Total for 500 messages: ~850K tokens (68% less)
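Here is a simplified sketch of the compression logic behind that sawtooth, assuming a summary is folded in every 5 exchanges as described above (types and function names are illustrative):

```typescript
// Illustrative types; the real pipeline differs in detail.
interface Message {
  role: "user" | "assistant";
  content: string;
}

const EXCHANGES_PER_SUMMARY = 5; // compress every 5 exchanges (10 messages)

async function buildContext(
  history: Message[],
  priorSummary: string | null,
  summarize: (messages: Message[]) => Promise<string>, // lightweight LLM call
): Promise<{ context: Message[]; summary: string | null; history: Message[] }> {
  const exchanges = Math.floor(history.length / 2);
  let summary = priorSummary;

  // Fold everything so far (including any earlier summary) into a ~800-token
  // summary that keeps card-to-question mappings and a short narrative.
  if (exchanges > 0 && exchanges % EXCHANGES_PER_SUMMARY === 0) {
    const toCompress = summary
      ? [{ role: "assistant" as const, content: summary }, ...history]
      : history;
    summary = await summarize(toCompress);
    history = []; // later turns build on the summary, not the raw messages
  }

  const context: Message[] = summary
    ? [{ role: "assistant", content: `Summary of the reading so far:\n${summary}` }, ...history]
    : history;

  return { context, summary, history };
}
```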
Conversation Memories
When a user asks about their career, would it not be valuable to reference what they asked about their career three months ago? Traditional chat systems treat each conversation as isolated. We wanted continuity across all readings.
Every completed reading gets converted to a 768-dimensional vector embedding using Google's text-embedding-004 model, then stored in PostgreSQL via pgvector. When a new reading starts, we perform a cosine similarity search against the user's past readings with a recency boost that decays by 2% per day.
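A hedged sketch of what that lookup can look like with pgvector through Prisma; the table and column names are hypothetical, and the `POWER(0.98, days)` term is the 2%-per-day recency decay:

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Hypothetical table: reading_memories(id, user_id, summary, embedding vector(768), created_at).
// Cosine similarity (1 minus the `<=>` distance) is weighted by recency decay.
async function findRelevantMemories(userId: string, queryEmbedding: number[], limit = 5) {
  const vector = `[${queryEmbedding.join(",")}]`;
  return prisma.$queryRaw`
    SELECT
      id,
      summary,
      (1 - (embedding <=> ${vector}::vector))
        * POWER(0.98, EXTRACT(EPOCH FROM (NOW() - created_at)) / 86400) AS score
    FROM reading_memories
    WHERE user_id = ${userId}
    ORDER BY score DESC
    LIMIT ${limit};
  `;
}
```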
But raw similarity is not enough. We implemented a defense-in-depth strategy: after the initial vector search retrieves candidate memories, we use prompt engineering to have the LLM reason about whether each past conversation is actually relevant to the current context. The LLM validates that the retrieved memories make sense before incorporating them, filtering out false positives that passed the similarity threshold but are not contextually appropriate.
This two-layer approach combines the efficiency of vector search with the reasoning capability of the LLM, ensuring users only see memories that genuinely enhance their reading.
Personal Context for Deeper Readings
One of our most requested features was the ability to add persistent personal background. Users wanted to share things like “I am in a long-term relationship, working in tech, dealing with burnout” so readings could be interpreted in context rather than generically.
Now users can add their personal context once in settings, and it automatically injects into every prompt. The LLM does not just see “Will my project succeed?” but also sees the user's career history, relationship status, and current life challenges.
This single feature transformed reading quality. Generic interpretations became deeply personal ones. Users consistently report that CosmiQ “understands” their situation in ways other tools cannot.
Astrology
When we added Vedic astrology support, we did not just plug into an API. We built a complete chart generation system from first principles because we needed total control over data quality and real-time computation.
Three Complete Charts Per User
- Vedic (Sidereal): Whole-sign house system, 27 nakshatras, lunar nodes as Rahu/Ketu, planetary dignities
- Western (Tropical): Placidus house system, orb-based aspects (conjunction, trine, square, opposition), Chiron and Lilith
- Navamsa (D9): Harmonic divisional chart for marriage, dharma, and soul purpose analysis
For each chart, we compute positions for all 9 Vedic grahas plus outer planets (Uranus, Neptune, Pluto), calculate their house placements, nakshatras, padas, sign lords, nakshatra lords, and dignity status. That alone gives us over 100 data points per chart.
Data Computed Per Reading
Live transits are calculated in real-time. When a user asks “What is happening for me this week?”, we compute current planetary positions using the same astronomical library, then overlay them on the natal chart to identify active aspects. The LLM receives which natal planets are being activated by current transits.
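To illustrate the overlay step only: once natal and transiting ecliptic longitudes are known, finding active aspects is angular math like the following (the orb values are examples, not our production settings):

```typescript
// Example aspect angles and orbs; production orbs vary per planet and aspect.
const ASPECTS = [
  { name: "conjunction", angle: 0, orb: 8 },
  { name: "square", angle: 90, orb: 7 },
  { name: "trine", angle: 120, orb: 7 },
  { name: "opposition", angle: 180, orb: 8 },
];

interface PlanetPosition {
  planet: string;
  longitude: number; // ecliptic longitude in degrees
}

// Smallest angular separation between two longitudes (0-180 degrees).
function separation(a: number, b: number): number {
  const diff = Math.abs(a - b) % 360;
  return diff > 180 ? 360 - diff : diff;
}

// Overlay transiting planets on the natal chart and list aspects within orb.
function activeTransits(natal: PlanetPosition[], transits: PlanetPosition[]) {
  const hits: Array<{ transit: string; natal: string; aspect: string; orb: number }> = [];
  for (const t of transits) {
    for (const n of natal) {
      const sep = separation(t.longitude, n.longitude);
      for (const aspect of ASPECTS) {
        const orb = Math.abs(sep - aspect.angle);
        if (orb <= aspect.orb) {
          hits.push({ transit: t.planet, natal: n.planet, aspect: aspect.name, orb });
        }
      }
    }
  }
  return hits;
}
```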
For timing predictions, we use Google Search grounding to get accurate upcoming astronomical data: eclipse dates, retrograde stations, planetary ingresses. The prompt explicitly instructs the LLM to reference these dates. “Saturn ingresses Aries on May 24, 2025” instead of vague “late 2025.”
Vimshottari Dasha calculations deserve their own mention. This is the Vedic predictive timing system based on the Moon's nakshatra at birth. We compute the exact remaining balance of the birth dasha, then calculate forward through the 120-year cycle to find the current mahadasha (major period), antardasha (sub-period), and the next three upcoming period transitions. The LLM uses these exact dates when discussing life timing.
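For the curious, the birth-dasha balance comes from standard Vimshottari arithmetic, roughly as in this sketch (the lord sequence and 120-year total are fixed by the system; the function itself is illustrative):

```typescript
// Standard Vimshottari sequence: each nakshatra's dasha lord and the length
// of that lord's mahadasha in years (the nine periods total 120 years).
const DASHA_SEQUENCE = [
  { lord: "Ketu", years: 7 },
  { lord: "Venus", years: 20 },
  { lord: "Sun", years: 6 },
  { lord: "Moon", years: 10 },
  { lord: "Mars", years: 7 },
  { lord: "Rahu", years: 18 },
  { lord: "Jupiter", years: 16 },
  { lord: "Saturn", years: 19 },
  { lord: "Mercury", years: 17 },
] as const;

const NAKSHATRA_SPAN = 360 / 27; // 13°20' per nakshatra

// Given the sidereal Moon longitude at birth (degrees), return the birth
// mahadasha lord and how many years of that period remain at birth.
function birthDashaBalance(siderealMoonLongitude: number) {
  const nakshatraIndex = Math.floor(siderealMoonLongitude / NAKSHATRA_SPAN); // 0..26
  const fractionTraversed = (siderealMoonLongitude % NAKSHATRA_SPAN) / NAKSHATRA_SPAN;
  const { lord, years } = DASHA_SEQUENCE[nakshatraIndex % 9];
  return { lord, remainingYears: (1 - fractionTraversed) * years };
}

// Walking forward from that balance through the repeating 120-year cycle
// yields the current mahadasha, antardasha, and upcoming transitions.
```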
The depth of data is what makes CosmiQ readings feel accurate. When the LLM has exact planetary positions, dasha periods, current transits, and 500+ lines of astrological definitions to reference, it has little room to hallucinate generic advice. The data constrains it to be specific.
From Happy Path to Reliable Distributed Processing
Our first architecture was the obvious one: client sends a request, server streams the response back via WebSocket. The “happy path” worked beautifully in development.
Then reality hit. CosmiQ is a Progressive Web App, and PWAs have a fundamental constraint: when the app goes to background, WebSocket connections drop. A user asks a deep question, switches to check a text message, and when they return, the stream is dead. Their reading is lost.
Customer complaints started coming in: “My reading disappeared.” “I waited 30 seconds and nothing happened.” “The app is broken.” The happy path assumed perfect conditions. Real users on real networks broke those assumptions constantly.
Before: a single streamed chain. If ANY step fails or the connection drops, the entire chain breaks. No recovery. No retry. Reading lost.

After: durable steps on Inngest. Each step runs independently and saves its result to the database. If a step fails, Inngest retries just that step. The client polls the DB for updates, so connection drops do not matter.
The 256KB challenge: Inngest has a 256KB payload limit per event. With conversation history, birth chart data, and card definitions, we easily exceeded this. Solution: we fetch large data (history, cached charts) from the database within each step rather than passing it through the event payload. This required restructuring how data flows through the pipeline.
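A rough sketch of that pattern with Inngest step functions; the event carries only small IDs, and the helpers (`db`, `callProModel`) are placeholders for app-specific code:

```typescript
import { Inngest } from "inngest";

const inngest = new Inngest({ id: "cosmiq" });

// Placeholders for app-specific helpers (not real APIs):
declare const db: {
  getConversationHistory(userId: string): Promise<unknown>;
  getCachedBirthChart(userId: string): Promise<unknown>;
  saveReadingResult(readingId: string, result: string): Promise<void>;
};
declare function callProModel(input: unknown): Promise<string>;

export const generateReading = inngest.createFunction(
  { id: "generate-reading" },
  { event: "reading/requested" },
  async ({ event, step }) => {
    // The event carries only small identifiers, staying well under 256KB.
    const { readingId, userId } = event.data as { readingId: string; userId: string };

    const interpretation = await step.run("interpret", async () => {
      // Large inputs (history, cached charts) are fetched inside the step,
      // not passed through the event payload.
      const history = await db.getConversationHistory(userId);
      const chart = await db.getCachedBirthChart(userId);
      return callProModel({ readingId, history, chart });
    });

    await step.run("persist", async () => {
      await db.saveReadingResult(readingId, interpretation);
    });
  },
);
```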
The client uses a hybrid approach: Supabase Realtime when active, polling when backgrounded. The Zustand store persists streaming state across navigation and tab switches. You can start a reading, close the app entirely, reopen it 30 seconds later, and watch the response complete.
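A trimmed sketch of that hybrid subscription, assuming a `readings` table and supabase-js v2 (the 3-second polling interval is illustrative):

```typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
);

// Watch one reading row: Realtime while the page is active,
// polling while it is hidden. Returns a cleanup function.
function watchReading(readingId: string, onUpdate: (row: unknown) => void) {
  const channel = supabase
    .channel(`reading-${readingId}`)
    .on(
      "postgres_changes",
      { event: "UPDATE", schema: "public", table: "readings", filter: `id=eq.${readingId}` },
      (payload) => onUpdate(payload.new),
    )
    .subscribe();

  let poll: ReturnType<typeof setInterval> | null = null;
  const onVisibility = () => {
    if (document.hidden && !poll) {
      poll = setInterval(async () => {
        const { data } = await supabase.from("readings").select("*").eq("id", readingId).single();
        if (data) onUpdate(data);
      }, 3000);
    } else if (!document.hidden && poll) {
      clearInterval(poll);
      poll = null;
    }
  };
  document.addEventListener("visibilitychange", onVisibility);

  return () => {
    document.removeEventListener("visibilitychange", onVisibility);
    if (poll) clearInterval(poll);
    supabase.removeChannel(channel);
  };
}
```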
The trade-off is slight latency: Inngest adds a few hundred milliseconds of overhead. But for readings that take 10-30 seconds anyway, this is invisible. The reliability gain is enormous. Customer complaints about lost readings dropped to zero.
Data-Driven Product Development
Two internal dashboards drive our product decisions:
/analytics tracks user registrations, reading volumes, token consumption, reading mode popularity, and retention metrics over time. We can see which features users actually use versus which ones we thought they would use.
/feedback captures user sentiment after readings. Each rating includes the full chat context, so when someone says “did not resonate,” we can see exactly what the LLM said and why. This feedback loop has driven dozens of prompt engineering improvements.
Through A/B testing guided by these dashboards, we have improved conversion rates by 20%. Decisions come from data, not intuition.
Architecture Overview
The Stack
- TypeScript, Next.js 15, React 19, Tailwind CSS
- PostgreSQL via Supabase (RLS, pgvector, Realtime)
- Prisma ORM for type-safe queries
- Inngest for background jobs
- Stripe for billing
- Vercel for hosting
- Zustand for client state
The Result
CosmiQ is not a wrapper around an LLM. It is a carefully engineered system that handles the messy reality of production AI: unreliable networks, backgrounded apps, cost optimization, personalization at scale, and data-driven iteration.
Every decision traced back to a real user problem. The cryptographic randomness, the conversation compression, the hybrid polling, the RAG system: each exists because we hit a wall and engineered our way through it.