Introducing DataScribe: Describe Data, Then See It as a Galaxy

The Seed Data Problem

Every project that touches data starts with the same friction. You need realistic rows to build against — a schema that hangs together, foreign keys that resolve, distributions that don’t look obviously fake, names that sound like names and not like Lorem Ipsum LLC. So you write a one-off script. Or you point an LLM at the problem and get back JSON that looks plausible until you try to join it. Or you copy a production snapshot you shouldn’t be copying.

None of these scale. The script becomes a maintenance burden. The LLM output is non-deterministic and quietly wrong about cardinality. The production snapshot is a compliance problem waiting to happen.

We wanted a tool that respected the actual shape of the problem: most seed data isn’t semantic. Ids are ids. Timestamps are timestamps. Quantities are draws from distributions. The handful of fields that benefit from a language model (names, titles, descriptions) are exactly the fields where a deterministic generator falls flat. The split is obvious once you say it out loud, and it changes the architecture.

So we built one.

Enter DataScribe

DataScribe is an open-source tool for designing and generating synthetic datasets through conversation. You describe what you need in plain language. An agent interviews you, writes a precise generation spec, and a deterministic engine seeds the rows. The language model is used only where it earns its keep — for the genuinely semantic fields.

The result is reproducible, joinable, and shaped like the data you actually wanted.

Try it: datascribe.dev

How It Works

DataScribe has four moving parts, and the boundaries between them are the whole point.

1. Chat

You start with a blank workspace and tell the data-architect agent what you’re trying to build. “A SaaS billing model with workspaces, users, subscriptions, and usage events.” The agent asks follow-up questions — how many of each, what the relationships are, what the time range looks like, which fields need to read like real text — and writes a spec by calling an update_spec tool. No free-form generation, no wandering. Structured questions in, structured spec out.
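
To make "structured spec out" a little more concrete, here is a rough sketch of what a tool definition for update_spec could look like. Only the outer keys (name, description, input_schema) follow the Anthropic tool-use format; everything inside input_schema, along with the looksLikeSpecInput helper, is an assumption made for illustration and not DataScribe's actual schema.

```typescript
// Hypothetical tool definition. The outer shape matches the Anthropic
// tool-use format; the schema contents are assumptions for illustration.
const updateSpecTool = {
  name: "update_spec",
  description: "Replace the current generation spec with a revised one.",
  input_schema: {
    type: "object",
    properties: {
      entities: {
        type: "array",
        items: {
          type: "object",
          properties: {
            entityName: { type: "string" },
            rowCount: { type: "integer", minimum: 1 },
            fields: { type: "array", items: { type: "object" } },
          },
          required: ["entityName", "rowCount", "fields"],
        },
      },
    },
    required: ["entities"],
  },
};

// A small guard an app could run on tool input before accepting it.
// It checks only the two fields this sketch cares about.
function looksLikeSpecInput(input: unknown): boolean {
  if (typeof input !== "object" || input === null) return false;
  const entities = (input as { entities?: unknown }).entities;
  if (!Array.isArray(entities)) return false;
  return entities.every((entity: unknown) => {
    if (typeof entity !== "object" || entity === null) return false;
    const candidate = entity as { entityName?: unknown; rowCount?: unknown };
    return (
      typeof candidate.entityName === "string" &&
      typeof candidate.rowCount === "number" &&
      Number.isInteger(candidate.rowCount)
    );
  });
}
```

The guard is deliberately shallow. The idea is only that the agent's output arrives as structured input that code can check, rather than as prose that has to be parsed.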

2. Spec

The spec is the contract. Every field has a generator: sequence, uuid, pattern, integer, float (uniform or normal), boolean, categorical, date, timeseriesLinear, timeseriesRandomWalk, foreignKey, and llm. The first eleven run in code. Only the last one calls a model.

That single distinction is doing a lot of work. It means a workspace with 50,000 usage events doesn’t trigger 50,000 LLM calls. It means timestamps come from a real distribution, not a model’s idea of a distribution. It means foreign keys actually resolve because the engine knows which parent rows exist when it’s generating children.
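
One way to picture that contract is as a tagged union of generators. The sketch below is our illustration rather than the real spec format: the generator names come from the list above (only a subset is shown), while the option names, the type names, and the countValuesByEngine helper are assumptions.

```typescript
// Generator names come from the post; the option names (start, min, max,
// values, entity, field, prompt) are assumptions made for this sketch.
type GeneratorSketch =
  | { kind: "sequence"; start: number }
  | { kind: "uuid" }
  | { kind: "integer"; min: number; max: number }
  | { kind: "categorical"; values: string[] }
  | { kind: "foreignKey"; entity: string; field: string }
  | { kind: "llm"; prompt: string };

interface FieldSketch {
  fieldName: string;
  generator: GeneratorSketch;
}

interface EntitySketch {
  entityName: string;
  rowCount: number;
  fields: FieldSketch[];
}

const billingSketch: EntitySketch[] = [
  {
    entityName: "workspaces",
    rowCount: 200,
    fields: [
      { fieldName: "id", generator: { kind: "sequence", start: 1 } },
      { fieldName: "name", generator: { kind: "llm", prompt: "A plausible SaaS company name" } },
    ],
  },
  {
    entityName: "usage_events",
    rowCount: 50000,
    fields: [
      { fieldName: "id", generator: { kind: "uuid" } },
      { fieldName: "workspace_id", generator: { kind: "foreignKey", entity: "workspaces", field: "id" } },
      { fieldName: "quantity", generator: { kind: "integer", min: 1, max: 500 } },
    ],
  },
];

// Count how many cell values need a model versus plain code.
function countValuesByEngine(spec: EntitySketch[]): { model: number; code: number } {
  let model = 0;
  let code = 0;
  for (const entity of spec) {
    for (const field of entity.fields) {
      if (field.generator.kind === "llm") model += entity.rowCount;
      else code += entity.rowCount;
    }
  }
  return { model, code };
}
```

For this made-up spec, countValuesByEngine(billingSketch) reports 200 model-generated values against 150,200 code-generated ones, which is the asymmetry the design leans on.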

3. Generate

Hit the button. A deterministic engine walks the spec — seeded PRNG, reproducible by design, same spec plus same seed yields identical rows every time. Then it batches only the llm fields into structured-output calls, so you get a human-readable company name or a plausible support-ticket title without paying for an LLM round-trip on every integer.
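
A minimal sketch of the "same spec plus same seed" idea, assuming a mulberry32-style PRNG (we are not claiming this is the generator DataScribe ships) and hypothetical row shapes. Parents are generated first, so every child's foreign key is drawn from ids that actually exist.

```typescript
// mulberry32: a small, well-known 32-bit seeded PRNG returning values in [0, 1).
// Swapped in here for illustration; the real engine's PRNG may differ.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical row shapes for a two-table billing model.
interface WorkspaceRow {
  id: number;
}

interface UsageEventRow {
  id: number;
  workspaceId: number;
  quantity: number;
}

function generateBilling(
  seed: number,
  workspaceCount: number,
  eventCount: number
): { workspaces: WorkspaceRow[]; events: UsageEventRow[] } {
  const rand = mulberry32(seed);

  // Parents first, so children can only reference ids that exist.
  const workspaces: WorkspaceRow[] = [];
  for (let i = 0; i < workspaceCount; i++) workspaces.push({ id: i + 1 });

  const events: UsageEventRow[] = [];
  for (let i = 0; i < eventCount; i++) {
    const parent = workspaces[Math.floor(rand() * workspaces.length)];
    if (parent === undefined) throw new Error("cannot generate children without parent rows");
    events.push({
      id: i + 1,
      workspaceId: parent.id,
      quantity: 1 + Math.floor(rand() * 100), // uniform integer in 1..100
    });
  }
  return { workspaces, events };
}
```

Calling generateBilling twice with the same arguments produces identical rows, because every random draw comes from the seeded generator and nothing reads the clock or Math.random.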

Without an API key, the deterministic side still runs end-to-end; llm fields fall back to placeholders. You can prototype the whole pipeline offline.
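
The batching and the offline fallback can be sketched as two small pure functions. Everything here is hypothetical: the LlmCell shape, the batch size, and the placeholder format are assumptions, and the actual model call (which would be asynchronous) is left out.

```typescript
// One pending semantic value: which entity, which row, which field.
interface LlmCell {
  entity: string;
  rowIndex: number;
  fieldName: string;
}

// Group pending cells into batches so one model call can fill many values.
function planBatches(cells: LlmCell[], batchSize: number): LlmCell[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: LlmCell[][] = [];
  for (let i = 0; i < cells.length; i += batchSize) {
    batches.push(cells.slice(i, i + batchSize));
  }
  return batches;
}

// Offline fallback: a readable stand-in value (the format is an assumption).
function placeholderFor(cell: LlmCell): string {
  return `[${cell.entity}.${cell.fieldName} #${cell.rowIndex + 1}]`;
}

// Without a key, every llm cell resolves to its placeholder and no call is made.
function resolveOffline(cells: LlmCell[]): string[] {
  return cells.map(placeholderFor);
}
```

With 200 workspace names and a batch size of 50, planBatches yields 4 batches, so 4 calls rather than 200. Offline, the first workspace name would come back as "[workspaces.name #1]".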

4. Visualize

Four views: Schema, Data, Charts, and Constellation.

The first three are what you’d expect. The fourth is why we’re writing this post.

The Constellation

Once your data is generated, you can flip to the Constellation tab and watch your dataset become a galaxy.

Each entity is a glowing cluster. Each row is a star orbiting its cluster’s core. The star’s distance from the core and its size both encode the same numeric field, so every cluster is also a radial value map — low values hug the center, high values ride the outer rim. Foreign-key relationships are drawn as faint filaments between related stars across clusters. The whole thing breathes: a slow swirl, twinkle on the stars, a soft pulse on the hubs. You can scroll to zoom, drag to pan, hover to see the underlying row.
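
The radial encoding can be sketched in a few lines. This is an illustration under assumptions, not the actual canvas code: the ClusterLayout fields, the linear mapping, and the function names are ours. The one thing taken from the description above is that a single normalized value drives both distance and size.

```typescript
// Hypothetical layout parameters for one cluster, in canvas pixels.
interface ClusterLayout {
  hubX: number;
  hubY: number;
  innerRadius: number; // where the lowest values sit
  outerRadius: number; // where the highest values sit
  minStarSize: number;
  maxStarSize: number;
}

// Map a value into [0, 1] relative to the field's observed range.
function normalizeValue(value: number, min: number, max: number): number {
  if (max <= min) return 0.5; // degenerate range: park everything mid-ring
  return Math.min(1, Math.max(0, (value - min) / (max - min)));
}

// The same normalized value t sets both how far out the star sits and how big it is.
function placeStar(
  value: number,
  min: number,
  max: number,
  angle: number,
  layout: ClusterLayout
): { x: number; y: number; size: number } {
  const t = normalizeValue(value, min, max);
  const distance = layout.innerRadius + t * (layout.outerRadius - layout.innerRadius);
  const size = layout.minStarSize + t * (layout.maxStarSize - layout.minStarSize);
  return {
    x: layout.hubX + distance * Math.cos(angle),
    y: layout.hubY + distance * Math.sin(angle),
    size,
  };
}
```

The angle is left as a free parameter here. In a sketch like this it carries no data, so it can be spread evenly or drawn from the seeded PRNG to keep stars from stacking.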

It is not the most efficient way to read your data. It is deliberately a step removed from useful.

That’s the point.

Why a Galaxy?

Most data visualizations are designed to answer a question you already have. You know what you’re looking for, you pick the right chart, the chart shows you the answer. That’s a perfectly good loop, and the Charts tab does exactly that — bar charts, distributions, sortable tables, the usual toolkit.

But there’s a different kind of seeing that happens earlier, before you know what to ask. You’re staring at a generated dataset and you don’t yet have a hypothesis. You want to notice something. You want a representation that surprises you into a question.

The Constellation is built for that moment. When your billing data renders as a galaxy and one cluster is visibly twice the radius of the others, you don’t need to read the numbers to know something is asymmetric. When the filaments between two entities tangle into a dense web while another pair sit politely apart, you’ve learned something about your relational structure without writing a single query. When the inner ring of your usage_events cluster is densely packed and the outer rim is sparse, you’ve seen the long tail of your distribution before you knew you were looking for one.

The visualization is a step removed from useful because being too close to “useful” forecloses on noticing. A scatter plot already knows what its axes are. A histogram already knows what’s interesting. A galaxy view doesn’t presume — it just makes shape legible and lets you find your own question.

Encoding Choices

The encoding is deliberately overdetermined. Distance encodes the numeric field. Size also encodes the numeric field. Hub size encodes row count (logarithmically capped, so a big table reads as a large core without becoming a blinding sun). Color encodes entity. Filaments encode foreign keys.
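
For the hub, "logarithmically capped" could mean something like the following. This is a guess at the shape of the mapping, with made-up constants (base, scale, maxRadius); the real tuning is hand-picked and lives in the canvas code.

```typescript
// Hub radius grows with the log of the row count and is clamped at maxRadius,
// so a 50,000-row table reads as a large core without dominating the canvas.
// All three constants are illustrative assumptions.
function hubRadius(rowCount: number, base = 8, scale = 6, maxRadius = 40): number {
  const count = Math.max(0, rowCount);
  return Math.min(maxRadius, base + scale * Math.log10(1 + count));
}
```

With these constants an empty table gets the base radius of 8, each tenfold increase in rows adds roughly 6 pixels, and anything past the cap stays at 40.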

Redundancy is a feature here. Any single encoding can be missed at a glance — a star slightly farther out, a hub slightly larger — but two encodings of the same value reinforce each other into something you actually perceive. You don’t read a galaxy. You take it in.

The faint guide rings around each hub aren’t decoration; they’re a hint about the radial encoding. The slow tangential swirl that gives the whole field its sense of life isn’t physics; it’s just a tuned bit of motion that makes the visualization feel alive enough to hold attention. The “Reheat” button lets you shake the system and watch it resettle, which is useful when clusters have collapsed into a stable configuration that’s hiding their structure.

Inspired By, Not Borrowed From

The Constellation isn’t a force-directed graph in disguise. It borrows the visual idiom — soft glow, additive blending, motion — but the layout is governed by domain semantics, not just connectivity. The radial position of every star means something about that row. The hub sizes mean something about the table. The filaments mean foreign keys.

The intent is to make a dataset feel like a place you can wander through, rather than a table you have to interrogate.

Why This Matters

Synthetic data is usually treated as a chore. Write the script, get the rows, move on. DataScribe is built on the bet that the act of generating data is itself a useful design step — that describing your schema out loud to an agent surfaces decisions you’d otherwise defer, and that seeing the result rendered as something other than a table reveals shape you’d otherwise miss.

This matters because most of the data work we do downstream — analytics, modeling, dashboards — only ever surfaces what we already thought to ask. The hardest part of working with data isn’t computing answers. It’s noticing the right questions. Any tool that nudges that earlier moment, even slightly, compounds across everything you build with it.

We also wanted to make a small point about the role of language models in data tooling. The dominant pattern right now is to throw an LLM at the whole problem and accept the cost and non-determinism as the price of admission. DataScribe is built the other way around: the engine is deterministic and predictable, and the model is invoked surgically, only for fields that genuinely need it. The result is faster, cheaper, reproducible, and, interestingly, more useful. You can re-run the same spec with the same seed a hundred times and get the same rows, which means you can actually build tests against generated data.

Open Source, Privacy-First

DataScribe is MIT licensed and open source.

You can:

✅ Use it freely at datascribe.dev
✅ Self-host for your team
✅ Fork and customize
✅ Run it locally during development
✅ Contribute improvements

Your conversation, spec, and generated data live in your browser session. We don’t store the specs you write or the datasets you generate. Bring your own Anthropic API key if you want LLM-generated semantic fields; otherwise the deterministic engine runs entirely on your machine.

Tech Stack

For the curious:

Framework: Next.js + React 18 + TypeScript
Styling: Tailwind CSS
AI: Claude Opus via the Anthropic SDK, with adaptive thinking
Engine: Custom deterministic generators with a seeded PRNG
Visualization: Pure HTML canvas + requestAnimationFrame — no charting or physics dependencies
Tests: Vitest, with coverage for the sandboxed expression evaluator, the generation engine, chart and stats helpers, and cross-table joins

The Constellation is roughly 800 lines of canvas code with hand-tuned forces. No three.js, no d3-force, no charting library. The goal was to keep the visualization legible and the dependency surface small enough that the whole repo stays approachable.

Run It Locally

git clone <your-fork>
cd DataScribe
cp .env.local.example .env.local   # add your ANTHROPIC_API_KEY
npm install
npm run dev                        # http://localhost:3000

Without a key, deterministic generation still works end-to-end; llm fields fall back to placeholders. You can prototype the entire pipeline before deciding whether you want semantic fields turned on.

npm test                           # run the Vitest suite

Use Cases

Engineers: Generate realistic seed data for development databases without scraping production. Specs are version-controllable, so the dataset your teammate sees matches the one you see.

Designers: Prototype against datasets that look like real domains. Names that read like names, dates that follow plausible distributions, relationships that resolve.

Product teams: Show stakeholders a working demo with believable data before the real pipeline is wired up.

Researchers and students: Generate datasets with known ground truth for teaching, testing, or benchmarking.

Anyone curious: Open the Constellation tab and just look at your data for a while. See what you notice.

The Bigger Picture

DataScribe reflects how we think about tools at Lab Z. Two ideas in particular.

One: language models are a component, not a strategy. The right architecture uses them where they’re irreplaceable and uses code everywhere else. Cheaper, faster, more reproducible, more honest about what’s actually happening.

Two: visualizations don’t have to be efficient to be valuable. The Charts tab is efficient. The Constellation is contemplative. Both belong in the same product, because the questions you bring to a dataset in minute one are different from the questions you bring in minute ten, and a good tool meets you at both.

If you’ve ever stared at a generated CSV and felt that something was off but couldn’t quite say what — try generating it in DataScribe and flipping to the Constellation. The data hasn’t changed. Your relationship to it has.

Try it: datascribe.dev
License: MIT

We’d love to hear what you notice. Open an issue, send a spec that broke, or just tell us what you saw in the galaxy.