Show HN: I ported Tree-sitter to Go
by odvcencio
This started as a hard requirement for my TUI-based editor application, but it ended up going in a few different directions.
A suite of tools that help with semantic code entities: https://github.com/odvcencio/gts-suite
A next-gen version control system called Got: https://github.com/odvcencio/got
I think this has some pretty big potential! There are many classes of application (particularly legacy architecture) that can benefit from this kind of analysis tooling. My next post will be about composing all of these together, an exciting project I call GotHub. Thanks!
Oh this is really neat for the Bazel community, as depending on tree-sitter to build a gazelle language extension, with Gazelle written in Go, requires you to use CGO.
Now perhaps we can get rid of the CGO dependency and make it pure Go instead. I have pinged some folks to take a look at it.
would also be nice to have this support gopackagesdriver backend
thanks so much for the note! i really appreciate it. i built this precisely for folks like yourself with this specific pain, thanks again!
This is great, I was looking for something like this, thanks for making this!
I imagine this can be very useful for Go-based forges that need syntax highlighting (e.g. Gitea, Forgejo).
I have a strict no-cgo requirement, so I might use it in my project, which is Git+JJ forge https://gitncoffee.com.
Gitea is definitely watching this one. Initial tests show a 20x increase in syntax highlighting speed compared to the previous regexp-based approach.
thank you for the kind words! Very cool project! Very happy you can find some utility in it
This looks very interesting, but I wonder how the rewrite approach is going to impact long-term maintenance and porting changes _back_ from Tree-sitter.
As you mention WASM-readiness, did you consider using the official Tree-sitter WASM builds nicely packaged with wazero (a pure-Go WASM runtime)?
It may help stay in sync with upstream for the long term and, while probably a bit slower, has nice security and GC advantages too.
I also have a tree-sitter Rust rewrite, though I haven't found it to be more useful for end users... https://github.com/HerringtonDarkholme/tree-sitter
Wouldn't `got` be confused with OpenBSD's Got: https://gameoftrees.org/index.html
Why would people be confused by something that the vast majority have never heard of? Naming shouldn't have to account for non-mainstream projects.
oh wow! i really thought i was being too clever but i shouldve assumed nothing new under the sun. well im taking name suggestions now!
Well, find and sed have the modern "fd" and "sd" alternatives. Naming it "gt" lets you claim that your version saves 33% of the typing compared to "git".
Don't forget to check the new name with Claude:
Are there revision control systems named "got"?
Searched the web
Yes! There is at least one notable revision control system named "Got" — specifically Game of Trees (Got), developed for OpenBSD.
Game of Trees (Got) is a version control system which prioritizes ease of use and simplicity over flexibility. [gameoftrees.org] It's actively developed — version 0.123 was released on February 25, 2026 [gameoftrees.org], just yesterday.
Got uses Git repositories to store versioned data. At present, Got supports local version control operations only, and Git can be used for any functionality which has not yet been implemented in Got. It will always remain possible to work with both Got and Git on the same repository. [Lobsters]
So "Got" is essentially a friendlier, simpler front-end to Git's underlying storage format, designed with OpenBSD's philosophy of simplicity and clean code in mind.
Goty McGotface
gotgit, gotcha
uGOT / uGOTme? (sort of like the idea behind uTorrent.) But I agree that sbankowi's "Yet Another GOT" idea is great as well. +1 to that.
YAGOT (Yet Another GOT)
Probably taken already, better use YAGOT-NG (Next Generation) just to be safe.
might be taken too so just YAGOT2 would work
I'd be more concerned about confusing it with https://github.com/sindresorhus/got, which is well-established (15k stars on GitHub is nothing to sneeze at).
I've seen a lot of "I've ported X to Go/Rust" posts lately. Is it the expectation that we're all supposed to abandon the original projects in favor of the ported versions which use newer and shinier programming languages? Is development going to continue on those new "Go/Rust" ports or are they just one-off karma farming projects?
Do you have an equivalent of TreeCursors or tree-sitter-generate?
There are at least some use cases where neither queries nor walks are suitable. And I have run into cases where being able to regenerate and compile grammars on the fly is immeasurably helpful.
At least for my use cases, this would be unusable.
Also, what the hell is this:
> partial [..] missing external scanner
Why do you have a parsing mode that guarantees incorrect outputs on some grammars (html comes to mind) and then use it as your “90x faster” benchmark figure?
the 90x figure is on Go source, for an apples-to-apples comparison against CGO-bound tree-sitter.
your use case is not one i designed for although yeah maybe the readme has some sections too close. the only external scanner missing atm is norg. now that i know your use case i can probably think of a way to close it
So your benchmarks are primarily just “how fast is go’s c interop” rather than any algorithmic improvement on tree-sitter?
Edit: yep, you are just calling a c function in a loop. So your no-op benchmark is just the time it takes for cgo to function. I would not be able to get any perf benefits from e.g. rust
Was excited to try this in my project but it doesn't seem like it's truly a complete port.
Neat, but it really bothers me when projects don't use standard layouts.
Claude attempted a treesitter to go port
Better title
This was my first thought as well, just from reading the title.
well how did it do?
Hard to say. Claude’s very good at writing READMEs. In fact, Copilot often complains about docs that sound like they’re about current capabilities when in fact they’re future plans or just plain aspirational.
Without downloading and testing out your software, how can we know if it’s any good? Why would we do that if it’s obviously vibed? The dilemma.
I’m not at all against vibe coding. I’m just pointing out that having a nice README is trivial. And the burden of proof is on you.
Shouldn't you be able to answer that?
yes and if you clicked the links you would know that i did answer it in the readme.
But how do we know the readme isn't also vibecoded?
I read the README and did not find answers to my questions.
> Pure-Go tree-sitter runtime — no CGo, no C toolchain, WASM-ready.
No you didn't. The readme is obvious LLM slop. Em-dash, rule of three, "not x, y". Why should anyone spend effort reading something you couldn't be bothered to write? Why did you post it to HN from a burner account?
How is OP using Claude relevant?
OK for prototyping. Not OK for prod use if no one has actually read it line by line.
i am trying not to take issue with this comment because im aware of the huge stigma around ai generated code.
i needed this project so i made it for my use case and had to build on top of it. the only way to ensure quality is to read it all line by line.
if you give me code that you yourself have not reviewed i will not review it for you.
I’m just curious, what would need to happen for you to change your opinion about this? Are you basically of the opinion that it’s not good enough today, never will be good enough in the future, and we should just wind back the clock 3 years and pretend these tools don’t exist?
It feels to me like a lot of this is dogma. If the code is broken or needs more testing, that can be solved. But it’s orthogonal: the LLM can be used to implement the unit testing and fuzz testing that would beat this library into shape, if it’s not already there. It’s not about adding a human touch, it’s about pursuing completeness. And that’s true for all new projects going from zero to one, you have to ask yourself whether the author drove it to completeness or not. That’s always been true.
You want people to hedge their projects with disclaimers that it probably sucks and isn’t production worthy. You want them to fess up to the fact that they cheated, or something. But they’re giving it away for free! You can just not use it if you don’t want to! They owe you nothing, not even a note in the readme. And you don’t deserve more or less hacker points depending on whether you used a tool to generate the code or whether you wrote it by hand, because hacker points don’t exist, because the value of all of this is (and always will be) subjective.
To the extent that the modern tools and models can’t oneshot anything, they’re going to keep improving. And it doesn’t seem to me like there’s any identifiable binary event on the horizon that would make you change your mind about this. You’re just against LLMs, and that’s the way it is, and there’s nothing that anyone can do to change your mind?
I mean this in the nicest way possible: the world is just going to move on without you.
>I’m just curious, what would need to happen for you to change your opinion about this?
Imagine a machine that can calculate using logic circuits and one that uses a lookup table.
LLMs right now are the latter (please don't take this literally, it is just an example). You can argue that the look up table is so huge that it works most of the time.
But I (and probably the parent commenter) need it to be the former. And that answers your question.
So it does not matter how huge the lookup table will grow in the future so that it will work more often, it is still a lookup table.
So people are divided into two groups right now. One group that goes by appearance, and one that goes by what the thing actually is fundamentally, despite the appearances.
But logic circuits are look up tables.
Every computation/function can be a look up table!
right, so why are you asking me to imagine one machine that can calculate using logic circuits and another that can calculate using a lookup table when we’re in agreement that they’re the same thing?
I reject the premise of your analogy.
> that they’re the same thing
I said no such thing.
I think you will get a better response to a slightly different analogy. In genetic programming (and in machine learning), we have a concept of "overfitting". Overfitting can be understood as a program memorizing too much of its test/training data (i.e. so it is acting more like an oracle than a computation). This, intuitively, becomes less of a problem the greater the training-dataset becomes, but the problem will always be there. Noticing the problem is like noticing the invisible wall at the edge of the game-world.
The most insightful thing about LLMs, is just how _useful_ overfitting can be in practice, when applied to the entire internet. In some sense, stack-overflow-driven-development (which was widespread throughout the industry since at least 2012), was an indication that much of a programmer's job was finding specific solutions to recurring problems, that never seem to get permanently fixed (mostly for reasons of culture, conformity, and churn in the ranks).
The more I see the LLM-ification of software unfold (essentially an attempted controlled demolition of our industry and our culture), the more I think about Arthur Whitney (inventor of the K language and others). In this interview[1], he said two interesting things: (1) he likened programming to poetry, and (2) he said that he designed his languages to not have libraries, and everybody builds from the 50 basic operators that come with the language, resulting in very short programs (in terms of both source code size and compiled/runtime code size).
I wonder if our tendency to depend on libraries of functions, counterintuitively results in more source code (and more compiled/runtime code) in the long run -- similarly to how using LLMs for coding tends to be very verbose as well. In principle, libraries are collections of composable domain-verbs that should allow a programmer to solve domain-problems, and yet, it rarely feels that way. I have ripped out general libraries, and replaced them with custom subroutines more times than I can count, because I usually need a subset of functionality, and I need it to be correct (many libraries are complex and buggy because they have some edge-cases [for example, I once used an AVL library that would sometimes walk the tree in reverse instead of from least to greatest -- unfortunately, the ordering mattered, and I wrote a simpler bespoke implementation]).
Arguably, a buggy program or a buggy library or a buggy function, is just an overfit program, or library, or function (it is overfit to the mental-model of the problem-space in the library writer's mind). These overfit libraries, which are often used as blackboxes by someone rushing to meet a deadline, often result in programs that are themselves overfit to the buggy library, creating _less_ modularity instead of more. _Creating_ an abstraction is practically free, but maintaining it and (most disappointingly) _using_ it has real, often permanent long term costs. I have rarely been able to get two computers, that were meant to share data with NFS, to do so reliably, if they were not running the same exact OS (because the NFS client and server of each OS are bug-compatible, are overfit to each other).
In fact the rise of VMWare, and the big cloud companies, and containerization and virtualization technologies is, conceivably, caused by this very tendency to write software that is overfit to other software (the operating system, the standard library [on some OSes emacs has to be forced to link to glibc, because using any other memory allocator causes it to SEGFAULT, and don't get me started on how no two browser-canvases return the same output in different browser _nor_ on the same browser in a different OS]). (Maybe, just as debt keeps the economy from collapsing, technical debt is the only thing that keeps Silicon Valley from collapsing.)
In some ways, coding-LLMs exaggerate this tendency towards overfitting in comical ways, like fun-house mirrors. And now, a single individual, with nothing but a dream, can create technical debt at the same rate as a thousand employee software company could a decade ago. What a time to be alive.
>he likened programming to poetry
This I can definitely relate to...
I don't fully understand the rest, but I'll give it some thought.
This might be true, but we can continue to try and require the communities we have been part of for years to act a certain way regarding disclosures.
If the community majority changes its mind then so be it. But the fight will continue for quite some time until that is decided.
There never was a cohesive generic open source community. There are no meaningful group norms. This was and always will be a fiction.
I’m tempted to just start putting co-authored-by: Claude in every commit I make, even the ones that I write by hand, just to intentionally alienate people like you.
The best guardrails are linters, autoformatters, type checkers, static analyzers, fuzzers, pre-commit rules, unit tests and coverage requirements, microbenchmarks, etc. If you genuinely care about open source code quality, you should be investing in improving these tools and deploying them in the projects you rely on. If the LLMs are truly writing bad or broken code, it will show up here clearly.
But if you can’t rephrase your criticism of a patch in terms of things flagged by tools like those, and you’re not claiming there’s something architecturally wrong with the way it was designed, you don’t have a criticism at all. You’re just whining.
> There never was a cohesive generic open source community. There are no meaningful group norms. This was and always will be a fiction.
It's always been a bit splintered, but it was generally composed of 95%+ of people that know how to program. That is no longer the case in any sense.
> I’m tempted to just start putting co-authored-by: Claude in every commit I make, even the ones that I write by hand, just to intentionally alienate people like you.
I mean it sounds like you are already using claude for everything so this is probably a bit of a noop lol.
> But if you can’t rephrase your criticism of a patch in terms of things flagged by tools like those, and you’re not claiming there’s something architecturally wrong with the way it was designed, you don’t have a criticism at all. You’re just whining.
No, because doing that requires MORE rigor and work than what an LLM-driven project had put into it. That difference in effort is not tenable: it's shallow work being shown, so it's shallow criticisms thrown at it.
All sense of depth and integrity is gone and killed.
Is that what this was all about? Depth and integrity? Rigor and hard work? Because I thought it was all about writing useful programs for computers.
Yes, it was always about writing useful programs for computers. Which is why people moan about the use of LLMs: because then the writing aspect is gone!
Anyway, this stuff will resolve itself, one way or another.
I see this as the same argument as saying a GMO label isn't needed, or that there's no need to mention artificial flavours in food, etc.
I mean this in the nicest way possible: the world is just going to insist that AI generated output is marked clearly as AI produced output.
Not sure whether giving a LICENSE even makes sense.
I tried to control LLM output quality by different means, including fuzzing. Had several cases when LLM "cheated" on that too. So, I have my own shades and grades of being sure the code is not BS.
Well, that’s obviously bad.
But once you told it to stop cheating, did it eventually figure it out? I mean, correctly implementing fuzzer support for a project is entirely within the wheelhouse of current models. It’s not rocket science.
You’ve gotta read the code. It doesn’t matter how it got there, but if you don’t fully understand it (which implies reading it), then don’t get mad when people object to the slop you’re pushing on them. It’s the equivalent of asking an LLM to write an email for somebody else to read that you didn’t read yourself. It’s basic human trust: of course people get annoyed with you. You’re untrustworthy.
Pack it in everyone, the “OK for prod use” guy has spoken.
That ship has sailed, man…
No it has not - if it had, there'd be no need to shout down folk who disagree.
Not everyone buys into the inevitabilism. Why should I read code "author" didn't bother to write?
Sorry but these are just not accurate as blanket statements anymore, given how good the models have gotten.
As other similar projects have pointed out, if you have a good test suite and a way for the model to validate its correctness, you can get very good results. And you can continue to iterate, optimize, code review, etc.
Because the entire README doesn't even mention it, and it is an important factor in deciding whether it is ready for production use.
I, for one, am definitely not going to use this project for anything serious unless I have thoroughly reviewed the code myself. Prototyping is fine.
Should the README mention if a project was agile? Or if Bill Joy wrote some of the code? Or if they used VS Code?
Dude, you know you are trolling.
There is a material difference between whether you used VSCode or vim to write the code, and if you personally wrote or reviewed any code at all.
I'm not really trolling. I'm trying to push people to consider that the world is already in a state where "I used AI" is neither binary nor dispositive. I think we're used to a 2023 to mid-2025 framing where outside of some narrow, highly structured cases, the code is garbage.
If that's still true as a binary now, it won't be for long. As the robot likes to say, some of these changes are "in flight".
People should say what models/tools they used and even show the prompts.
Showing the prompts is not feasible when using agentic coding tools. I suppose one could persist all chat logs ever used in the project, but is that even useful?
I think it would be useful. I see lots of comments like "it one-shotted this" and am curious if they just had to write one sentence or many pages of instructions.
"show the prompts"
What would the prompt for this look like?
never mind the fact that the model is constantly reseeding itself against the files it’s reading from your working directory, so the prompts are useless on their own.
Because OP obviously downplayed this important fact, which typically shows lower quality/less tested code.
maintenance burden
AI often produces nonsense that a human wouldn't. If a project was written using AI the chances that it is a useless mess are significantly higher than if it was written by a human.
I work on a revision control system project, except the merge is CRDT-based. On Feb 22 there was a server break-in (I did not keep unencrypted sources on the client, and server login was YubiKey-only, but that is not a 100% guarantee). I reported the break-in on my Telegram channel that day.
My design docs https://replicated.wiki/blog/partII.html
I used tree-sitter for coarse AST. Some key parts were missing from the server as well, because I expected problems (had lots of adventures in East Asia, evil maids, various other incidents on a regular basis).
When I saw "tree-sitter in go" title, I was very glad at first. Solves some problems for me. Then I saw the full picture.
Wait, are you suggesting that OP broke in to your server and stole code and is republishing it as these repos?
I have questions. Have you reviewed the code here to see if it matches? What, more specifically, do you mean when you say someone broke in? What makes you think that this idea (which is nice but not novel) is worth stealing? If that sounds snarky, it’s not meant to be; just trying to understand what’s going on. Why is that more likely than someone using Claude to vibe up some software along the same lines?
1. Just saying, strange coincidence
2. How can we compare Claude's output in a different language?
3. Detecting break-ins and handling evil-maids: unless the trick is already known on the internets, I do not disclose. Odds are not in my favor.
4. Maybe worth, maybe not. I have my adaptations. Trying to make it not worthy of stealing, in fact.
Based on this and your other comments, including the one that’s no longer visible: Please phone a friend. Or find a professional to talk to. I say that with nothing but compassion.
For the people who are downvoting me: I’m being totally sincere. This is not an ad hominem attack. You didn’t see his other comment, it was genuinely concerning.
Also, evil maids, what?
I can't speak for the specificity of parent's "evil maids" phrase but the concept of an "Evil maid" is used in security scenarios.
A maid tends to be an example of a person who's mostly a stranger, but is given unmonitored access to your most private spaces for prolonged periods of time. So they theoretically become a good vector for a malicious actor to say "hey I'll give you $$ if you just plug in this USB drive in his bedroom laptop next time you're cleaning his house" - it's often used in the scenario of "ok what if someone has physical access to your resource for a prolonged period of time without you noticing? what are your protections there?"
I wonder if that's what OP meant? :-)
"Evil maids" (example): I put my laptop into a safe, seal the safe, seal the room, go to breakfast. On return, I see there was cleaning (not the usual time, I know the hotel), the cleaner looks strangely confused, the seal on the safe is detached (that is often done by applying ice; adhesive hardens, seal goes off). This level of paranoia was not my norm. Had to learn these tricks cause problems happened (repeatedly). In fact, I frequented that hotel, knew customs and the staff, so noticed irregularities.
And what does this have to do with the price of tea in China?
Ah right, thanks! But it seems he meant literal evil maids. Which I guess count as the figurative kind too.
LMFAO what is this
That is very very interesting. I work on a similar project https://replicated.wiki/blog/partII.html
I use CRDT merge though, cause 3-way metadata-less merges only provide very incremental improvements over e.g. git+mergiraf.
How do you see got's main improvement over git?
primarily, got is a structural VCS intended for concurrent edits of the same file.
it does this via gotreesitter and gts-suite abstractions that enable it to have:

- entity-aware diffs: not line by line but function by function
- structural blame: attribution resolution for the lifetime of the entity
- semver from structure: it can recommend bumps because it knows what is a breaking change vs minor vs patch
- entity history: because entities are tracked independently, file renames or moves don't affect the entity's history
when gotreesitter can't parse a language, a 3-way text merge happens as a fallback. what the structural merge enables is no conflicts unless the same entity has conflicting changes
I think I understand the situation.
gah, sincere apologies for the formatting of this post. i have been on HN for basically 10 years now without ever having made a post (:
use four spaces " " in front of a line for <pre> formatting
like "  this"

It's 2 or more spaces, not four.
today i learned
Doing a "rewrite" of a nice code base without mentioning it is vibe-coded is not great.
Essentially you used AI to somehow re-implement the original code base in a different language, made it somehow work, and claim it is xx times faster. It is misleading.
i really appreciated this comment the most because of how much work "somehow" is doing here
It looks like porting the custom C lexers is a big part of the trouble you had to go to do this.
yes, basically about 70% of the engineering effort was spent porting the external scanners and ensuring parity with the original (C) tree-sitter
Interesting. I have a similar usecase but intended to use CGo tree-sitter with Zig
Are these pretty up-to-date grammars? I'm awfully tempted to switch to your project
How large are your binaries getting? I was concerned about the size of some of the grammars
206 binary blobs = 15MB, so not crazy, but i built for the use case where you declare a registry of the languages you want to load and don't have to own all the grammar binaries by default
If all the languages together add up to 15MB that is a game changer for me.
It means the CLI I am working on can ship support for many languages whilst still being a smallish (sub-50MB) download.
I shall definitely check it out!
re: up to date grammars, yes i found the official grammars in use by the original tree-sitter library today
Can someone please explain what's the connection between this and LSP? For example in Helix can one use this instead of various language servers?
Tree-sitter is merely a tool for generating an AST for a given language. LSPs on the other hand have way more capabilities (formatting, diagnostics, project-wise go to definition, inlay hints, documentation on hover, etc.) as you can see in its specification.[0] They can't really replace one another.
[0]: https://microsoft.github.io/language-server-protocol/specifi...
Thanks!
How about making 'got' compatible with git repos like jujutsu? It would be a lot easier to try out.
it is interoperable with git. we like git when it's good but attempted to ease the UX pains somewhat. you can take advantage of got locally but still push to git remote forges just the same. when you pull in this way, got will load the entity history into the git repo, ensuring that you can still do got stuff locally (inspect entity histories, etc)
Is it a go-ism that source for implementation and test code lives in the root of the repo or is this an LLM thing?
yeah, the tests always live with the implementation code (Go thing), and the repo-root layout is more of a preference; main is an acceptable package to put stuff in (Go thing). i see this a lot with smaller projects or library-type projects