A Better R Programming Experience Thanks to Tree-sitter

Posted by sebg 17 hours ago

Comments

Comment by tylermw 12 hours ago

I read this article a week or so ago and immediately implemented a VS Code extension that I've always wanted: a static analysis tool for targets pipelines. targets is an R package which provides Make-like pipelines for data science and analysis work. You write your pipeline as a DAG and targets orchestrates the analysis and only re-runs downstream nodes if upstream ones are invalidated and the output changes. Fantastic tool, but at a certain level of complexity the DAG becomes a bit hard to navigate and reason about ("wait, what targets are downstream of this one again?"). This isn't really a targets problem, as this will happen with any analysis of decent complexity, but the structure targets adds to the analysis actually allows for a decent amount of static analysis of the environment/code. Enter tree-sitter.

I wrote a VS Code extension that analyzes the pipeline and provides useful hover information (like size, time last invalidated, computation time for that target, and children/parent info) as well as links to quickly jump to different targets and their children/parents. I've dogfooded the hell out of it and it's already vastly improved my targets workflow within a week. Things like providing better error hints in the IDE for targets-specific malformed inputs and showing which targets are emitting errors really take lots of the friction out of an analysis.
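To make the "what targets are downstream of this one again?" question concrete, here is a minimal Python sketch of the graph walk involved. It is not how tarborist or targets is implemented; the pipeline and its target names are invented for illustration, and the dependency map is assumed to have already been extracted from the pipeline source.

```python
from collections import deque

# Hypothetical pipeline: each target maps to the targets it depends on.
# These names are made up; they do not come from tarborist or targets.
DEPENDS_ON = {
    "raw_data": [],
    "clean_data": ["raw_data"],
    "model": ["clean_data"],
    "plot": ["clean_data"],
    "report": ["model", "plot"],
}

def downstream(target, depends_on):
    """Return every target that directly or indirectly depends on `target`."""
    # Invert the dependency map: parent -> list of direct children.
    children = {name: [] for name in depends_on}
    for name, parents in depends_on.items():
        for parent in parents:
            children[parent].append(name)

    # Breadth-first walk over the child edges.
    seen = set()
    queue = deque(children[target])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(children[node])
    return sorted(seen)

print(downstream("clean_data", DEPENDS_ON))  # ['model', 'plot', 'report']
```

The same inverted map answers the "parents" question if you walk `depends_on` directly instead of `children`.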

All that to say: nice work on extending tree-sitter to R!

tarborist: targets + tree-sitter https://open-vsx.org/extension/tylermorganwall/tarborist

GH: https://github.com/tylermorganwall/tarborist

Comment by kqr 7 hours ago

I only dabble in data analysis. I scratch the surface of what R can do, and my most complicated analysis fits in 100 or so lines of code I manage manually rather than with the help of tools like targets. What sort of work do you do where you get to play around with fun tools like that?

Comment by CrazyStat 3 hours ago

It's not necessarily the number of lines that motivates these tools. Say you're running an NLP pipeline where you want to do sentiment analysis on a large text corpus (tweets, for example) and then relate sentiment over time to some other variables. Each of those steps might only be a dozen lines of code, but the sentiment analysis might take a non-negligible amount of time. If you can avoid rerunning it when only the later analysis has changed, that can save you considerable time while iterating on the second step of the analysis.

The old fashioned way to do this in R is to use the REPL and only rerun the lines of the script that have changed, with the earlier part staying in the environment. But it's easy to make mistakes doing it manually that way; having the computer track what has changed and needs to be rerun is much less error-prone.
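A rough sketch of what "having the computer track what has changed" can look like, in Python. This is not how targets works internally; it assumes a hypothetical `StepCache` class that fingerprints each step by a manually supplied version number plus the repr of its inputs, and the `sentiment` function is a trivial stand-in for the slow step.

```python
import hashlib

class StepCache:
    """Re-run a step only when its version or inputs change (illustrative only)."""

    def __init__(self):
        self._store = {}   # step name -> (fingerprint, result)
        self.runs = []     # names of steps that were actually executed

    def run(self, name, func, version, *inputs):
        # Fingerprint the step by its declared version and the repr of its inputs.
        key = hashlib.sha256(repr((version, inputs)).encode()).hexdigest()
        cached = self._store.get(name)
        if cached is not None and cached[0] == key:
            return cached[1]          # unchanged: reuse the stored result
        result = func(*inputs)
        self._store[name] = (key, result)
        self.runs.append(name)
        return result

def sentiment(texts):
    # Stand-in for the slow step: count exclamation marks per text.
    return [t.count("!") for t in texts]

def summarize(scores, scale):
    return sum(scores) * scale

cache = StepCache()
tweets = ["great!!", "meh", "wow!"]

scores = cache.run("sentiment", sentiment, 1, tweets)
total = cache.run("summary", summarize, 1, scores, 1)

# Only the second step changes (new scale), so only it is re-run.
scores = cache.run("sentiment", sentiment, 1, tweets)
total = cache.run("summary", summarize, 1, scores, 10)

print(cache.runs)  # ['sentiment', 'summary', 'summary']
print(total)       # 30
```

The slow step runs once; changing only the later step's input leaves the cached sentiment result untouched.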

Comment by tylermw 59 minutes ago

Yes, the main benefit is caching and reproducibility: with targets (or any other DAG-based approach), you only recompute what needs to be recomputed and you are assured that no stale inputs or temporary analysis artifacts end up in the final product. If you don't own the underlying data sources and those sources can change at any point, a DAG-based approach helps ensure that.

Comment by adamalt 5 hours ago

Long time lurker on HN but this totally deserves my first (edit: second) ever post. Looks amazing, thank you!

Comment by davisvaughan 2 hours ago

It has been a lot of fun watching you iterate on this via bluesky updates!

Comment by tylermw 1 hour ago

Thanks for all the work you (and the rest of the contributors) have done putting this together! I think bringing tree-sitter to R has already shown massive benefits: Just air alone has been a big improvement to my workflow.

Comment by nomilk 14 hours ago

The article makes out like auto completion and help on hover are new things, but RStudio IDE has had them for years and years.

R/RStudio was my first language/IDE. I was horribly shocked when moving into other languages to discover they didn't have things you got out of the box with R/RStudio. "You mean I have to look up documentation for a function/method!?! - that's supposed to be automatic!".

R has a bunch of features which other languages lack to the degree that it's a rude shock to learn that other ecosystems lack them. One is the REPL with extremely convenient RStudio keyboard shortcuts to run lines of code (to achieve similar with ruby, I have an elaborate neovim/slime setup that took hours to configure and still isn't as good as RStudio gives out of the box).

A sign of a brilliant tool is when an idiot can get more done with it than an expert can with alternatives.

Comment by mscbuck 3 hours ago

In my opinion, RStudio is still the best data science IDE and it's not even close. I've been using Positron a bit more lately just for Claude Code reasons, as I prefer having the pane itself rather than using the terminal, but man it's really tough to shake RStudio. Even with the work put into configuring VSCode to get it kind of close to it, it still just always feels a bit janky.

Comment by chocochunks 2 hours ago

Emacs + ESS is superior IMO. RStudio has a bunch of frills I don't care about and doesn't let me configure files as I'd like. ESS showing the function signature in the minibuffer to me is the killer feature. Wish I could get that for EVERYTHING.

Comment by MostlyStable 14 hours ago

Maybe that explains why I was confused about this article. I kept wondering what exactly was on offer, and figured it couldn't be as simple as help on hover and auto-complete, because those seemed pretty basic and prevalent. It took me a few years to move to RStudio, but at this point, I literally don't know anyone who doesn't use it. To the point that I once had to explain to a labmate that R and RStudio were, in fact, not the same thing.

So either this is not that exciting, or else the additional things that are on offer are not very clearly explained to the point that I missed them.

Comment by nomilk 13 hours ago

I suspect the main benefits are twofold. One is portability: tree-sitter parsers can be compiled to WebAssembly and driven from JavaScript, so they can run in any webpage. The previous way of parsing R code needed an R runtime, so not just any old website could do it (e.g. a Shiny app probably could, because it has an R runtime available, but a standard HTML page couldn't). The other is that tree-sitter is a widely used tool, so anything that uses tree-sitter can now work with R, since the R grammar is available.

Looks like R's tree-sitter grammar has been in use for GitHub search for a while (since 2024), so it's a nice improvement due to R/tree-sitter, although we've probably been benefitting from it for a while already, perhaps without knowing exactly how it worked!

https://github.com/orgs/community/discussions/120397#discuss...

Comment by user3939382 12 hours ago

I believe this should let you do syntax highlighting for R in vim for example.

Comment by kqr 7 hours ago

The ESS package in Emacs has also had several of these features for R for a long time. The difference here is portability and generality. Tree-sitter is a partial solution to the n×m problem, and now R has been invited to participate in that solution. That's something to be celebrated, even if it doesn't have immediate impact on our day-to-day, because it means future innovations in tooling for programming languages get automatically shared to R, instead of having to be reimplemented.

(The n×m problem is that for n languages and m tools like autoformatting, etc., we need an implementation for each tool specific to each language. With tree-sitter, we get n+m implementations instead: generic tools that work across multiple languages.)
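The arithmetic behind that, as a tiny Python sketch. The counts below are arbitrary example numbers, not figures from the thread.

```python
def implementations_needed(n_languages, m_tools):
    """Compare per-pair implementations with a shared-intermediate design."""
    pairwise = n_languages * m_tools   # one implementation per (language, tool) pair
    shared = n_languages + m_tools     # one grammar per language plus one generic tool each
    return pairwise, shared

print(implementations_needed(20, 5))  # (100, 25)
```

Adding one more language costs m new implementations in the first design, but only one grammar in the second.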

Comment by stephbook 9 hours ago

What if you want to share something outside of your precious IDE?

- Merge request on GitHub
- Presentation with reveal.js (kind of like PowerPoint)

You'd be stuck with either bland, uncoloured, text-only characters, OR with a fuzzy PNG screenshot where you can't zoom or copy. Or maybe you "parse R" with Regex.

tree-sitter integrates into any web-based technology, allowing you to _share_ code.

Comment by nomilk 9 hours ago

Yes, your comment really should be the focus of the article, i.e. genuinely new capabilities and improvements, not existing capabilities done a slightly different way. In any case it’s a minor nitpick and it’s awesome progress for the language and tooling.

Comment by epistasis 15 hours ago

Tree-sitter is one of the finer engineering products out there, it enables so much. Thanks to its creator and everyone who has contributed to this project and its many grammars!

Comment by sundarurfriend 3 hours ago

tree-sitter's design has potential, but my impression is that even after all these years, it is yet to be realized. The speed claims turned out to be largely overstated in practice, for the general variety of usage (rather than single-task benchmarks or special cases). And the claim with the grammar system was that, given such a coherent system rather than the much-hated regex parsing, people would be able to write better grammars that are less buggy and less prone to edge-case problems. And maybe that's true in cases like this where someone gets paid to write the grammar and maintain it, but in most common cases, the actual quality of the grammars turns out to be much the same, but with more possibility of regression or breakage. It's possible that in ten years' time, tree-sitter will clearly be the way to go, with more polish all around, but at this point it doesn't feel like an easy strong recommend over the traditional parsing systems.

Comment by dash2 7 hours ago

I've been thinking about an R package, or maybe a more general treesitter-based package, to reorganize functions in a project. Something like a tui which shows you functions in files in folders and lets you copy and paste them around; and maybe use graph analysis to automate this, analysing function dependencies and putting each "community" of functions into one file.

Is there any interest in this? There are per-language complexities, for example R functions are often preceded by a roxygen block which ought to travel with it. Has anyone done something similar?
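One crude way to picture the "community of functions per file" idea, as a Python sketch. It uses connected components of the call graph as a stand-in for real community detection (which would split large components further), and it assumes the call graph has already been extracted, for example from a tree-sitter parse. The function names are invented.

```python
def group_functions(calls):
    """Group functions into connected components of the (undirected) call graph."""
    # Build an undirected adjacency map from the caller -> callees map.
    neighbours = {}
    for caller, callees in calls.items():
        neighbours.setdefault(caller, set())
        for callee in callees:
            neighbours[caller].add(callee)
            neighbours.setdefault(callee, set()).add(caller)

    groups, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        groups.append(sorted(component))
    return groups

# Hypothetical call graph; function names are invented for illustration.
CALLS = {
    "read_data": ["parse_dates"],
    "clean_data": ["parse_dates"],
    "fit_model": ["make_formula"],
    "plot_fit": ["fit_model"],
    "utils_log": [],
}

for group in group_functions(CALLS):
    print(group)
```

Each printed group is a candidate file. Anything attached to a function, such as a roxygen block, would need to be moved along with it.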

Comment by tylermw 56 minutes ago

I've done exactly what you're talking about using tree-sitter (via the tarborist VS Code extension), specific to targets pipelines:

https://bsky.app/profile/tylermw.com/post/3mjmcykuows2d

So yes, it is possible and quite useful!

Comment by mscbuck 3 hours ago

I think that'd be cool, but I'd say that Claude Code/Codex is often used for this exact thing and they do a decent job of it (at least in my experience with R). Usually once I've kind of wrapped up my model or data work I'll just ask "okay, now organize this so it makes sense", and it usually does a great job at organizing the helpers, etc.

Comment by fn-mote 14 hours ago

Do the tools built on this understand dplyr pipelines and columns in the data frames appearing as bare variables in the code? If so, I’m really impressed. R does some unusual stuff.

Comment by moffkalast 5 hours ago

People really do still be using R in 2026. Old habits I guess.

Comment by mscbuck 3 hours ago

Not that TIOBE or PYPL are the end-all, be-all, but R was in the Top 10 for the first time since 2020 and PYPL has it at #4. A lot of people use R in 2026, because it's still great for data science work, "tidy" language is still fantastic for working with data, and also it's caught up to Python in almost every way when it comes to putting models into production. Both are great "orchestrator" languages, and I've put both into production on sites that get hundreds of thousands of hits a day.

Comment by sieste 2 hours ago

2021 just called and wants its comment back.

Comment by TacticalCoder 14 hours ago

I moved to tree-sitter inside Emacs a while ago and I'd say tree-sitter is much easier than it looks.

I had a first little use case... For whatever reason, none of the options to align let bindings in Clojure code worked for me: whether I tried the "semantic" style or Tonsky's semi-standard way of formatting Clojure code (several tools adopted Tonsky's suggestion), and no matter which option/knob I turned on, I couldn't get the alignment I wanted.

I really, really, really hate the pure horrible chaos of this:

    (let [abc (+ a 2)
          d (inc b)
          vwxyz (+ abc d)]
      ...

But I love the perfection of this [1]:

    (let [abc     (+ a 2)
          d       (inc b)
          vwxyz   (+ abc d)]
      ...

And cljfmt is pretty agnostic about it: I can both use cljfmt from Emacs and have a hook forcing cljfmt, and it'll align everything but it won't mess with those nice vertical alignments.

Now, I know, I know: it is supposed to work directly from cljfmt but many options are, still in the latest version, labelled as experimental and I simply couldn't make it work on my setup, no matter which knob I turned on.

So what did I do? Claude Code CLI, tree-sitter, and three elisp functions.

And I added my own vertical indenting to Clojure let bindings. And it's compatible with cljfmt (as in: if I run cljfmt it doesn't remove my vertical alignments).
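The alignment step itself is small once the binding pairs are known. Here is a minimal Python sketch of it; this is not the elisp described above, it skips the tree-sitter part entirely, and it assumes the (name, expression) pairs have already been pulled out of the syntax tree. The `align_bindings` helper and its `gap` parameter are hypothetical.

```python
def align_bindings(pairs, gap=3):
    """Render (name, expr) let-binding pairs with the expressions in one column."""
    opener = "(let ["
    indent = " " * len(opener)          # continuation lines sit under the first name
    width = max(len(name) for name, _ in pairs)
    lines = []
    for i, (name, expr) in enumerate(pairs):
        prefix = opener if i == 0 else indent
        # Pad each name to the widest name, then add the fixed gap.
        lines.append(prefix + name.ljust(width + gap) + expr)
    return "\n".join(lines) + "]"

bindings = [("abc", "(+ a 2)"), ("d", "(inc b)"), ("vwxyz", "(+ abc d)")]
print(align_bindings(bindings))
```

With the three bindings from the example above, this prints the aligned form, with every expression starting in the same column.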

I'd say the tree-sitter syntax tree is incredibly verbose (and has to be) but it's not that hard to use tree-sitter.

P.S.: I'm not alone in liking this kind of alignment and, no, we're not receptive to the "but then you modify one line and several lines are detected as modified" argument. And we're less receptive by the day, now that we are beginning to have diffing tools that are indentation-agnostic and only do AST diffs.

Comment by eviks 11 hours ago

Can you move the closing ) to also be vertically aligned?

And the first +/inc in parentheses?