Building Reliable NL2SQL Systems for Sensitive Industrial Data
What this project taught me about reliability, safeguards, and trust
Building AI systems for real-world engineering data, I learned, is not just about getting a model to produce plausible answers; it is about building trust. In our NL2SQL system for tool condition monitoring data, reliability and safeguards were just as important as natural language quality.
The core goal was to let users query a PostgreSQL database in natural language, without requiring SQL expertise. But in an industrial setting, a wrong or unsafe query can have serious consequences. So from the beginning, I approached the system as a reliability problem first: how do we make sure generated SQL is grounded, safe, and consistently useful?
A big design decision was to avoid a monolithic “ask model, run query” flow. Instead, I built a staged pipeline with explicit steps: request classification, context handling, schema-aware table selection, SQL generation, safety validation, read-only execution, bounded retries, and answer summarization. This separation made failure modes visible and gave us control points for quality and safety.
One of the most valuable lessons was how much schema grounding matters. By narrowing prompts to relevant tables, relationships, and allowed schema elements, we significantly reduced hallucinated columns and invalid joins. In practice, metadata quality became one of the strongest predictors of answer quality.
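One way to make that grounding concrete is to build the prompt's schema context only from an allowlist. The snippet below is a sketch with made-up tables and column names, not the project's real metadata: only allowlisted tables, their columns, and known join relationships reach the prompt, so the model never "sees" schema elements it should not use.

```python
# Hypothetical allowlisted schema metadata (illustrative names only).
ALLOWED_SCHEMA = {
    "tool_wear": ["tool_id", "timestamp", "wear_mm"],
    "tools": ["tool_id", "tool_type"],
}
RELATIONSHIPS = [("tool_wear.tool_id", "tools.tool_id")]

def schema_context(tables: list[str]) -> str:
    """Render prompt context for the given tables, enforcing the allowlist."""
    lines = []
    for t in tables:
        if t not in ALLOWED_SCHEMA:  # hard stop for anything off-allowlist
            raise ValueError(f"table not allowlisted: {t}")
        lines.append(f"TABLE {t} ({', '.join(ALLOWED_SCHEMA[t])})")
    # Only surface joins whose both sides are among the selected tables.
    for left, right in RELATIONSHIPS:
        if left.split(".")[0] in tables and right.split(".")[0] in tables:
            lines.append(f"JOIN {left} = {right}")
    return "\n".join(lines)
```

Making relationships explicit in the context is what cuts down invalid joins: the model is told which join keys exist instead of being left to guess them.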
Protecting sensitive operational information also shaped many implementation decisions. The system validates SQL before execution, rejects non-read-only or suspicious patterns, executes inside read-only transactions with timeouts, and uses allowlisted schema context. These are not “extra features”; they are essential safeguards when AI interfaces touch production-adjacent data.
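A minimal version of that pre-execution gate might look like the following. This is a sketch of the idea, not a complete validator; a production system would parse the SQL properly rather than rely on string checks alone. The execution side is shown only as comments, using psycopg as an assumed client.

```python
import re

# Statements that must never pass, even inside an otherwise valid query.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|GRANT|CREATE|COPY)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    """Accept a single read-only statement; reject everything suspicious."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject multi-statement input outright
        return False
    if not stripped.upper().startswith(("SELECT", "WITH")):
        return False
    return not FORBIDDEN.search(stripped)

# Execution sketch (hypothetical psycopg usage): even a query that
# passes validation runs inside a read-only transaction with a
# server-side timeout, as defense in depth:
#
#   with conn.transaction():
#       conn.execute("SET TRANSACTION READ ONLY")
#       conn.execute("SET LOCAL statement_timeout = '5s'")
#       rows = conn.execute(sql).fetchall()
```

Layering these checks matters: the validator catches obvious problems cheaply, while the read-only transaction and timeout guarantee that anything the validator misses still cannot write to the database or run unbounded.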
I also learned that reliability is deeply tied to evaluation discipline. We did not rely on a few demo prompts. We used unit and integration tests, representative ground-truth runs, and benchmark comparisons across model setups. That process made latency, error rates, and answer behavior measurable, and helped us make informed decisions about deployment trade-offs.
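A ground-truth run of the kind described above can be captured in a small harness. This is a hypothetical sketch, not the project's actual evaluation code: each case pairs a question with its expected result, and the harness records correctness, errors, and per-query latency so that model setups can be compared on numbers rather than impressions.

```python
import time

def evaluate(cases, generate):
    """cases: list of (question, expected_rows) pairs.
    generate: callable mapping a question to result rows (the system under test)."""
    results = {"correct": 0, "errors": 0, "latencies": []}
    for question, expected in cases:
        start = time.perf_counter()
        try:
            rows = generate(question)
        except Exception:
            results["errors"] += 1
            continue
        finally:
            # Latency is recorded for both successes and failures.
            results["latencies"].append(time.perf_counter() - start)
        if rows == expected:
            results["correct"] += 1
    return results
```

Running the same case set against each candidate model setup turns "which configuration is better" into a comparison of accuracy, error rate, and latency distributions.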
Another practical takeaway was that model quality and system quality are different things. Even strong models can fail without good orchestration and validation. Conversely, a carefully designed workflow can make model behavior far more stable and transparent. This project reinforced my belief that robust AI products come from strong system design around the model, not only model selection.
Overall, this was an important experience in building AI that is usable and safe at the same time. It strengthened how I think about reliability engineering for LLM applications, especially when working with sensitive data and domain-specific workflows where trust must be earned through design.