ggsql: A Grammar of Graphics for SQL

Posted by thomasp85 20 hours ago

Comments

Comment by anentropic 19 hours ago

Maybe I skim read it too fast, but I did not find any clear description in the blog post or website docs of how this relates to SQL databases

I was kind of guessing that it doesn't run in a database, that it's a SQL-like syntax for a visualisation DSL handled by front end chart library.

That appears to be what is described in https://ggsql.org/get_started/anatomy.html

But then https://ggsql.org/faq.html has a section, "Can I use SQL queries inside the VISUALISE clause," which says, "Some parts of the syntax are passed on directly to the database".

The homepage says "ggsql interfaces directly with your database"

But it's not shown how that happens AFAICT

confused

Comment by thomasp85 19 hours ago

That is fair - it is somewhat of a special concept.

ggsql connects directly with your database backend (if you wish - you can also run it with an in-memory DuckDB backend). Your visual query is translated into a SQL query for each layer of the visualisation and the resulting table is then used for rendering.

E.g.

VISUALISE page_views AS x FROM visits DRAW smooth

will create a SQL query that calculates a smoothing kernel over the data and returns points along that. Those points are then used to create the final line chart

Comment by georgestagg 19 hours ago

ggsql has the concept of a "reader", which can be thought of as the way ggsql interfaces with a SQL database. It handles the connection to the database and generating the correct dialect of SQL for that database.

As an alpha, we support just a few readers today: duckdb, sqlite, and an experimental ODBC reader. We have largely been focusing development mainly around driving duckdb with local files, though duckdb has extensions to talk to some other types of database.

The idea is that ggsql takes your visualisation query, and then generates a selection of SQL queries to be executed on the database. It sends these queries using the reader, then builds the resulting visualisation with the returned data. That is how we can plot a histogram from very many rows of data, the statistics required to produce a histogram are converted into SQL queries, and only a few points are returned to us to draw bars of the correct height.

By default ggsql will connect to an in-memory duckDB database. If you are using the CLI, you can use the `--reader` argument to connect to files on-disk or an ODBC URI.

If you use Positron, you can do this a little easier through its dedicated "Connections" pane, and the ggsql Jupyter kernel has a magic SQL comment that can be issued to set up a particular reader. I plan to expand a little more on using ggsql with these external tools in the docs soon.

Comment by nojito 19 hours ago

Highly suggest leveraging adbc. I would love to use this against our bigquery tables.

Comment by data_ders 18 hours ago

plus 1 for ADBC!

Comment by gazpacho 4 hours ago

Another +1 for ADBC

Comment by chatmasta 16 hours ago

So we could use this with Postgres by putting DuckDB in front with its Postgres extension, pointing to the source data in PG?

Comment by georgestagg 15 hours ago

In principle, yes, that’s the idea! However I will say we have focused mainly on the grammar and using DuckDB for reading from local files for this alpha release, so I expect there may be some bugs around connecting to remote databases still to iron out!

Comment by chatmasta 12 hours ago

Dunno, I expect if DuckDB works as advertised it might just work! That's the beauty of how they've separated the syntax parsing into frontend/backend from the rest of the engine.

Comment by 3 hours ago

Comment by password4321 19 hours ago

Yes this was my question as well, an example showing all the plumbing/dependencies to generate a graph from an external database server would be very helpful.

Comment by thomasp85 19 hours ago

We certainly plan to create a few videos showing how to set it up and use it. If you use it in Positron with the ggsql extension it can interact directly with the connection pane to connect to the various backends you have there

Comment by anentropic 17 hours ago

Please just document the library itself before making a bunch of videos

I eventually found this readme https://github.com/posit-dev/ggsql/tree/main/ggsql-python which tells me far more than anything I found on the website

Comment by tantalor 19 hours ago

> SQL databases ... confused

"SQL" and "databases" are different things

SQL is a declarative language for data manipulation. You can use SQL to query a database, but there's nothing special about databases. You can also write SQL to query other non-database sources like flat files, data streams, or data in a program's memory.

Conversely, you can query a database without SQL.

Comment by philipallstar 12 hours ago

A SQL database is a database you can connect to and query with SQL.

Comment by johnthescott 16 hours ago

> Conversely, you can query a database without SQL.

fond memories of quel.

Comment by getnormality 19 hours ago

I skimmed the article for an explanation of why this is needed, what problem it solves, and didn't find one I could follow. Is the point that we want to be able to ask for visualizations directly against tables in remote SQL databases, instead of having to first pull the data into R data frames so we can run ggplot on it? But why create a new SQL-like language? We already have a package, dbplyr, that translates between R and SQL. Wouldn't it be more direct to extend ggplot to support dbplyr tbl objects, and have ggplot generate the SQL?

Or is the idea that SQL is such a great language to write in that a lot of people will be thrilled to do their ggplots in this SQL-like language?

EDIT: OK, after looking at almost all of the documentation, I think I've finally figured it out. It's a standalone visualization app with a SQL-like API that currently has backends for DuckDB and SQLite and renders plots with Vegalite. They plan to support more backends and renderers in the future. As a commenter below said, it's supposed to help SQL specialists who don't know Python or R make visualizations.

Comment by nchagnet 18 hours ago

I was quite psyched when I read this so maybe I can tell you why it's interesting to me, although I agree the announcement could have done a better job at it.

In my experience, the only thing data fields share is SQL (analysts, scientists and engineers). As you said, you could do the same in R, but your project may not be written in R, or Python, but it likely uses an SQL database and some engine to access the data.

Also I've been using marimo notebooks a lot of analysis where it's so easy to write SQL cells using the background duckdb that plotting directly from SQL would be great.

And finally, I have found python APIs for plotting to be really difficult to remember/get used to. The amount of boilerplate for a simple scatterplot in matplotlib is ridiculous, even with a LLM. So a unified grammar within the unified query language would be pretty cool.

Comment by levocardia 13 hours ago

I share your pain. You might enjoy Plotnine for python, helps ease the pain. The only bad thing about ggplot is that once you learn it you start to hate every other plotting system. Iteration is so fast, and it is so easy to go from scrappy EDA plot to publication-quality plotting, it just blows everything else out of the water.

Comment by mbreese 14 hours ago

But isn't this then just another tool that you're including in your project? I don't get why I would want to add this as a visualization tool to a project, if it's already using R, or Python, etc...

I mean, is it to avoid loading the full data into a dataframe/table in memory?

I just don't see what the pain point this solves is. ggplot solves quite a lot of this already, so I don't doubt that the authors know the domain well. I just don't see the why this.

Comment by nchagnet 4 hours ago

Well there's always going to be a dependency anyway: loading the data, making it a dataframe, visualizing it, this might be 3 libraries already.

In a sense I really get your complaint. It's the xkcd standard thing all over, we now have a new competing standard.

I think for me it's not so much the ggplot connection, or the fact that I won't need a dataframe library.

It's that this might be the first piece of a standard way of plotting: no matter which backend (matplotlib, vega, ggplot), no matter how you are getting your data (dataframes, database), where you're doing this (Jupyter or marimo notebook, python script, R, heck lokkerstudio?). You could have just one way of defining a plot. That's something I've genuinely dreamt about.

And what makes this different from yet another library api to me is that it's integrated within SQL. SQL has already won the query standardisation battle, so this is a very promising idea for the visualization standardisation.

Comment by lioeters 2 hours ago

I see, that's insightful. At first sight I thought of it as a kind of novelty, extending SQL with a visual grammar to integrate with a specific plotting library. But from your comments I can now imagine it has potential as a general solution for that space between data - wherever it comes from, it can typically be queried by SQL - and its visualization.

Thinking further, though, there might be value in extracting the specs of this "grammar of graphics" from SQL syntax and generalized, so other languages can implement the same interface.

Comment by philipallstar 12 hours ago

Anything to standardise some of the horrifying crap that data scientists write to visualise something.

Comment by wonger_ 18 hours ago

[dead]

Comment by epgui 16 hours ago

This isn't about ggplot (or any particular library) per se, it's about using a flavour of SQL with a grammar of graphics: https://en.wikipedia.org/wiki/Wilkinson%27s_Grammar_of_Graph...

What makes it interesting is the interface (SQL) coupled with the formalism (GoG). The actual visualization or runtime is an implementation detail (albeit an important one).

Comment by nojito 19 hours ago

It seems to be for sql users who don’t know python or r.

Comment by nchagnet 19 hours ago

I would even add that it fits into a more general trend where operations are done within SQL instead of in a script/program which would use SQL to load data. Examples of this are duckdb in general, and BigQuery with all its LLM or ML functions.

Comment by oofbey 14 hours ago

There’s certainly some benefit in a declarative language for creating charts from SQL. Obviously this doesn’t do anything that you can’t also do easily in R or Python / matplotlib using about the same number of lines of code. But safely sandboxing those against malicious input is difficult. Whereas with a declarative language like this you could host something where an untrusted user enters the ggsql and you give them the chart.

So it’s something. But for most uses just prompting your favorite LLM to generate the matplotlib code is much easier.

Comment by mstr_anderson 3 hours ago

[flagged]

Comment by lmeyerov 17 hours ago

This is great

We reached a similar conclusion for GFQL (oss graph dataframe query language), where we needed an LLM-friendly interface to our visualization & analytics stack, especially without requiring a code sandbox. We realized we can do quite rich GPU visual analytics pipelines with some basic extensions to opencypher . Doing SQL for the tabular world makes a lot of sense for the same reasons!

For the GFQL version (OpenCypher), an example of data loading, shaping, algorithmic enrichment, visual encodings, and first-class pipelines:

- overall pipelines: https://pygraphistry.readthedocs.io/en/latest/gfql/benchmark...

- declarative visual encodings as simple calls: https://pygraphistry.readthedocs.io/en/latest/gfql/builtin_c...

Comment by JHonaker 15 hours ago

I applaud the project, and I completely agree that the concepts maps nicely to SQL. The R equivalent of a WITH data prep block followed by the VISUALIZE is pretty much how all my plotting code is structured.

However, I don't see what the benefits of this are (other than having a simple DSL, but that creates the yet another DSL problme) over ggplot2. What do I gain by using this over ggplot2 in R?

The only problem, and the only reason I ever leave ggplot2 for visualizations, is how difficult it is to do anything "non-standard" that hasn't already had a geom created in the ggplot ecosystem. When you want to do something "different" it's way easier to drop into the primitive drawing operations of whatever you're using than it is to try to write the ggplot-friendly adapter.

Even wrapping common "partial specificiations" as a function (which should "just work" imo) is difficult depending on whether you're trying to wrap something that composes with the rest of the spec via `+` or via pipe (`|>`, the operator formerly known as `%>%`)

Comment by thomasp85 15 hours ago

We are not out to convince anyone to switch from ggplot2 (and we are not planning to stop developing that).

ggsql is (partly) about reaching new audiences and putting powerful visualisation in new places. If you live in R most of the time I wouldn't expect you to be the prime audience for this (though you may have fun exploring it since it contains some pretty interesting things ggplot2 doesn't have)

Comment by JHonaker 8 hours ago

Totally get that. I was mostly just long-windedly complaining that the one problem I have with it seems to be exacerbated by, not fixed, by this. I was also hoping someone would say “oh it’s actually way easier than you think, see (amazing link).”

I really do think it’s a good idea to explore! Sometimes I feel crazy because I’m the only one in my department that prefers to just write SQL to deal with our DBs instead of fiddling with a python/R connector that always has its own quirks.

Comment by almostjazz 15 hours ago

Side comment: |> and %>% aren't the same btw! The newish base pipe (|>) is faster but doesn't support using the dot (.) placeholder to pipe into something other than the first argument of a function, which can sometimes make things a little cleaner.

Comment by sinnsro 12 hours ago

The base pipe has an underscore as a placeholder. From the docs:

Usage:

     lhs |> rhs

Arguments:

     lhs: expression producing a value.

     rhs: a call expression.

Details: [...]

     It is also possible to use a named argument with the placeholder
     ‘_’ in the ‘rhs’ call to specify where the ‘lhs’ is to be
     inserted.  The placeholder can only appear once on the ‘rhs’.

Comment by Zedseayou 7 hours ago

I believe this wasn't added in the initial implementation of the base pipe so some didn't realize it got included later, and still does not let you use constructs like e.g. combining multiple transformations of the input on the rhs. But for most purposes it's certainly sufficient

Comment by asutekku 6 hours ago

> What do I gain by using this over ggplot2 in R?

You don't have to use R.

Comment by nicoritschel 17 hours ago

This is neat. I do wish there was a way for this to gracefully degrade in contexts without support for the grammar, though.

I devised a similar in spirit (inside SQL, very simplified vs GoG) approach that does degrade (but doesn't read as nice): https://sqlnb.com/spec

Comment by thomasp85 17 hours ago

I'm not quite sure I understand what you mean by "degrade in context" - care to elaborate?

Comment by nicoritschel 16 hours ago

If you're familiar with the percent format for jupyter notebooks, something like that— so things gracefully degrade in a more "basic" execution context.

# %%

foo = 1

# %%

print(foo)

Above is notebook with two "cells" & also a valid Python script. Perhaps it matters less with SQL vs Python, but it's a nice property.

Comment by thomasp85 16 hours ago

Ah - I don't think it really matters here, but if you find yourself in need then you can open a GitHub issue and we can discuss

Comment by kasperset 19 hours ago

Will this ever integrate rest of the ggplot2 dependent packages described here: https://exts.ggplot2.tidyverse.org/gallery/ in the near or distant future? Sorry if it already mentioned somewhere.

Comment by thomasp85 19 hours ago

I don't think we will get the various niche geoms that have been developed by the ggplot2 community anytime soon.

The point of this is not to superseed ggplot2 in any way, but to provide a different approach which can do a lot of the things ggplot2 can, and some that it can't. But ggplot2 will remain more powerful for a lot of tasks in many years to come I predict

Comment by tomjakubowski 7 hours ago

This is amazing work. I've always wanted to be able to make visualizations straight from a SQL REPL. I've been thinking about building something sort of like this with Perspective[1] + DuckDB. Perspective would give an interactive chart (and/or wicked datagrid), and using a Perspective virtual server, you could feed filter + pivot conditions in the UI back to the database query. Maybe this is the kick I need to get this out of ideation.

[1]: https://perspective-dev.github.io/

Comment by tmoertel 15 hours ago

First, ggsql looks awesome. Can't wait to try it out.

Feedback: A notable omission in the ggsql docs: I cannot find any mention of the possible outputs. Can I output a graphic in PDF? In SVG? PNG? How do I control things like output dimensions (e.g., width=8.5in, height=11in)?

The closest I got was finding these few lines of example code in the docs for the Python library:

    # Display or save
    chart.display()  # In Jupyter
    chart.save("chart.html")  # Save to file

Comment by georgestagg 15 hours ago

Currently our only writer module is for vegalite, the output is a vegalite spec (JSON). Tools already exist to render this kind of output to an interactive chart, SVG, PNG, etc. with their own controls for sizing and the like.

Our ggsql Jupyter kernel can use these vegalite specifications to output charts in a Quarto document, for example.

In the future we plan to create a new high performance writer module from scratch, avoiding this intermediate vegalite step, at which point we’ll have better answers for your questions!

Comment by RobinL 1 hour ago

Please retain vega lite speech as an option! It's incredible useful because you can tweak the chart or change it completely at a later date using e.g. the vega lite editor.

Comment by efromvt 19 hours ago

Love the layering approach - that solves a problem I’ve had with other sql/visual hybrids as you move past the basics charts.

Comment by refset 12 hours ago

Also in this vein is Shaper, a SQL-first approach for handling entire dashboards (and powered by DuckDB): https://taleshape.com/shaper/docs/getting-started/

Comment by semmulder 15 hours ago

This seems really cool, will try it out soon!

What made it click for me was the following snippet from: https://ggsql.org/get_started/grammar.html

> We’ve tried to make the learning curve as easy as possible by keeping the grammar close to the SQL syntax that you’re already familiar with. You’ll start with a classic SELECT statement to get the data that you want. Then you’ll use VISUALIZE (or VISUALISE ) to switch from creating a table of data to creating a plot of that data. Then you’ll DRAW a layer that maps columns in your data to aesthetics (visual properties), like position, colour, and shape. Then you tweak the SCALEs, the mappings between the data and the visual properties, to make the plot easier to read. Then you FACET the plot to show how the relationships differ across subsets of the data. Finally you finish up by adding LABELs to explain your plot to others. This allows you to produce graphics using the same structured thinking that you already use to design a SQL query.

Comment by thomasp85 20 hours ago

The new visualisation tool from Posit. Combines SQL with the grammar of graphics, known from ggplot2, D3, and plotnine

Comment by zcw100 19 hours ago

Don't forget Vega! https://vega.github.io/vega/

Comment by thomasp85 20 hours ago

I'm one of the authors - happy to take any questions!

Comment by mi_lk 20 hours ago

I don't think D3 uses grammar of graphics model?

Comment by thomasp85 19 hours ago

I'd say it does, though it is certainly much more low-level than e.g. ggplot2. But the basic premises of the building blocks described be Leland Wilkinson is there

Comment by jorin 18 hours ago

Really cool project! Would love to see a standard established for representing visualizations in SQL! I built a whole dashboarding tool on top of the idea: https://taleshape.com/shaper/docs/getting-started/ But Shaper takes a more pragmatic approach and just uses built in functionality to describe how to visualize the results. The most value I see with viz as SQL is that it's a great format for LLMs to specify what they want while making it easy to audit and reproduce. Just built a slack bot on top of that concept last week: https://taleshape.com/blog/build-your-own-data-analytics-sla...

Comment by tauroid 12 hours ago

Would be quite compelling if the CLI worked with nushell.

I see the (a?) backend is polars, which is good as well.

This in CLI might be the quickest / tersest way to go from "have parquet / csv / other table format" to "see graph", though a keen polars / matplotlib user would also get there pretty quick.

Comment by gh5000 19 hours ago

It is conceivable that this could become a duckdb extension, such that it can be used from within the duckdb CLI? That would be pretty slick.

Comment by thomasp85 19 hours ago

That is conceivable, not a top priority as we want to focus on this being a great experience for every backend, but certainly something we are thinking of

Comment by rustyconover 18 hours ago

With the new PEG parser this is just a Claude session away in DuckDB.

Comment by jiehong 18 hours ago

The cli only produces vega-lite[0] json graphics, right?

It would be nice if it included a rendering engine.

[0]: https://github.com/vega/vega-lite

Comment by thomasp85 17 hours ago

That is certainly in the pipeline. We chose to start with vegalite so we could focus on the internals of the representation

Comment by kasperset 20 hours ago

Looks intriguing. Brings plotting to Sql instead of “transforming” sql for plotting.

Comment by urams 15 hours ago

What feels like a lifetime ago, I made almost all of the R tooling at Uber and actually implemented what was effectively exactly this on top of the R DB tooling. Everyone, I think correctly in retrospect, I showed it to thought it was more or less useless. I had hoped it would be a nice way to pull and instantly visualize data but that was rarely all that valuable like this.

Comment by psadri 15 hours ago

What do you think is missing? I'm a big SQL fan and the idea of direct SQL to X seems appealing at least on the surface.

Comment by urams 14 hours ago

Probably the biggest issue is that it's primarily useful in the context of exploratory analysis and makes iteration on a plot much slower, requiring you to re-run the query to get a new viz. Iterating on a viz is best done with the data cached locally or elsewhere.

In the context of a query used for a dashboard in prod, you're likely using a different viz environment so it's not useful at all there.

Comment by jiehong 18 hours ago

Outstanding!

This can replace a lot of Excel in the end.

It makes so much sense now that it exists!

Comment by ericdfournier 14 hours ago

Very cool. Though it would be great to see this implemented as a PostgreSQL extension (if possible).

Comment by 19 hours ago

Comment by radarsat1 20 hours ago

Wow, love this idea.

Comment by data_ders 19 hours ago

ok, this is definitely up my alley. color me nerd-sniped and forgive the onslaught of questions.

my questions are less about the syntax, which i'm largely familiar with knowing both SQL and ggplot.

i'm more interested in the backend architecture. Looking at the Cargo.toml [1], I was surprised to not see a visualization dependency like D3 or Vega. Is this intentional?

I'm certainly going to take this for a spin and I think this could be incredible for agentic analytics. I'm mostly curious right now what "deployment" looks like both currently in a utopian future.

utopia is easier -- what if databases supported it directly?!? but even then I think I'd rather have databases spit out an intermediate representation (IR) that could be handed to a viz engine, similar to how vega works. or perhaps the SQL is the IR?!

another question that arises from the question of composability: how distinct would a ggplot IR be from a metrics layer spec? could i use ggsql to create an IR that I then use R's ggplot to render (or vise versa maybe?)

as for the deployment story today, I'll likely learn most by doing (with agents). My experiment will be to kick off an agent to do something like: extract this dataset to S3 using dlt [2], model it using dbt [3], then use ggsql to visualize.

p.s. @thomasp85, I was a big fan of tidygraph back in the day [4]. love how small our data world is.

[1]: https://github.com/posit-dev/ggsql/blob/main/Cargo.toml

[2]: https://github.com/dlt-hub/dlt

[3]: https://github.com/dbt-labs/dbt-fusion

[4]: https://stackoverflow.com/questions/46466351/how-to-hide-unc...

Comment by thomasp85 19 hours ago

Let me try to not miss any of the questions :-)

ggsql is modular by design. It consists of various reader modules that takes care of connecting with different data backends (currently we have a DuckDB, an SQLite, and an ODBC reader), a central plot module, and various writer modules that take care of the rendering (currently only Vegalite but I plan to write my own renderer from scratch).

As for deployment I can only talk about a utopian future since this alpha-release doesn't provide much tangible in that area. The ggsql Jupyter kernel already allows you to execute ggsql queries in Jupyter and Quarto notebooks, so deployment of reports should kinda work already, though we are still looking at making it as easy as possible to move database credentials along with the deployment. I also envision deployment of single .ggsql files that result in embeddable visualisations you can reference on websites etc. Our focus in this area will be Posit Connect in the short term

I'm afraid I don't know what IR stands for - can you elaborate?

Comment by stevedh 17 hours ago

Intermediate Representation

Comment by thomasp85 16 hours ago

Ah - yes, in theory you could create a "ggplot2 writer" which renders the plot object to an R file you can execute. It is not too far away from the current Vega-Lite writer we use. The other direction (ggplot2->ggsql) is not really feasible

Comment by persedes 15 hours ago

Soo can I put this on top of e.g. grafana?

Comment by breakfastduck 18 hours ago

This is fantastic. Feels like something that should've been in there from the start!

Comment by estetlinus 13 hours ago

I like it. I can see a world where these visuals become my serving layer in dbt. Small, clean and versioned .sql-files.

Please, for the love of god and in the name of everything holy, kill the Jupyter Notebook.

Comment by hei-lima 19 hours ago

Really cool!

Comment by rvba 18 hours ago

1) does this alllw to export to Excel?

2) how to make manual adjustments?

Comment by thomasp85 18 hours ago

My answers will probably disappoint

1) No (unless you count 'render to image and insert that into your excel document') 2) This is not possible - manual adjustments are not reproducible and we live by that ethos

Comment by tonyarkles 15 hours ago

> 2) This is not possible - manual adjustments are not reproducible and we live by that ethos

Just want to give you a high-five on that one. I've dealt with so many hand-adjusted plots in the past where they work until either the dataset changes just a little bit or the plot library itself gets upgraded... in both cases, the plots completely fall apart when you're not expecting it.

Comment by i000 14 hours ago

What makes ggplot great is that it allows manual adjustments AND has a nice declerative grammar. Hard for me to see the value of a plotting library without being able to adjust plots.