Coding

Further adventures in parse-land

I’ve tried a few more parser tools to try and achieve what I think is a fairly simple parser, and it feels like we’re in some kind of before time or something. Performance, ease of use, and end-user-friendliness: pick half of any one of these.

Tree-sitter is a really great choice if your project is in, or adjacent to, JavaScript. Rust-sitter looks interesting, but it has some severe drawbacks, and I dislike how it surfaces the use of recursion to implement repetition.

I think what I’m going to do is create a small collective project for trying out different parse tools in different languages (C++, Rust, Go, possibly Python or Ruby for contrast), and maybe I’ll throw in tree-sitter and ANTLR, although the threat of a Java runtime always makes me step away from actually using ANTLR.

Rust vs Parsing

I’ve been intermittently looking at Rust for the last several years, and each time it’s rankled me in some significant enough way to counter the buy-ins that brought me to it.

My latest foray found me looking at PyO3 and maturin to build a Python extension. Mighty was my glee at the ease with which I could create a package for macOS, Linux, and Windows in one shot.

It’s also my favorite kind of problem: it involves parsing. At SEMC we use a Python parser called Esrapy to parse some of our DSLs in the asset pipeline. It’s a decent tool, but the implementation comes with scaling issues, which we didn’t really feel until we introduced some machine-generated files to parse.

Python, the slow of the internet.

Unpopular Opinion: CPython is stupidly slow. CPython is the Python you’re using if you don’t know which Python you use.

Before Go, Python had taken a firm hold of systems admin coding, and a huge amount of Linux tooling is written in Python.

During the Great Python 3 Migration of 2019, Python libraries bloated with people introducing bidirectional compatibility, generally by just grabbing some 3rd-party libraries to minimize the footprint of change.

I’m not going to rant about people not knowing that the standard ‘dis’ module exists, or not knowing about timeit/%timeit… It’s not really an “optimization” issue, though.

Today’s Linux admin activities are agonizingly slow because so many Python developers hear adages about not optimizing Python code and conclude that you never need to worry about it, so they have no idea how expensive some very common practices are.

Sadly, CPython makes no-need-for-performance-thinking untrue in one really unfortunate detail, one detail that has been agonizingly inflated by the bloat of compatibility code:

Function call overhead :(
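To make the complaint concrete, here is a minimal sketch (not the notebook code from the post) that times the same arithmetic done inline and done through a trivial function call. The names add_one, loop_inline, loop_with_calls and measure are made up for illustration, and the ratio you see will depend on your CPython version and machine.

```python
import timeit


def add_one(x):
    # Trivial body: almost all of the cost of calling this is call overhead.
    return x + 1


def loop_inline(n):
    # Does the arithmetic directly inside the loop.
    total = 0
    for i in range(n):
        total += i + 1
    return total


def loop_with_calls(n):
    # Same arithmetic, but routed through a function call on every iteration.
    total = 0
    for i in range(n):
        total += add_one(i)
    return total


def measure(n=100_000, repeat=5):
    # Take the best of several runs to reduce noise from the rest of the system.
    inline = min(timeit.repeat(lambda: loop_inline(n), number=1, repeat=repeat))
    called = min(timeit.repeat(lambda: loop_with_calls(n), number=1, repeat=repeat))
    return inline, called


if __name__ == "__main__":
    inline, called = measure()
    print(f"inline: {inline:.4f}s  with calls: {called:.4f}s  "
          f"ratio: {called / inline:.2f}x")
```

Both loops compute the same total, so any difference in the timings comes from the extra call per iteration. Wrapper-on-wrapper compatibility layers multiply exactly this cost.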

The code from this post is in a Jupyter notebook in my github, here.

If you want to interact with it (run it for yourself), you can either use an online notebook viewer (e.g. https://nbviewer.jupyter.org/), or use Visual Studio Code, which now has really nice support for notebooks.

The golang example is here.

Ready to plus it up – a little

The foray into raw C has been fun, but there are three things that I miss: references from C++, multiple-return types from Go et al, and some variation of function-to-structure association.

References save a lot of null checking, and ‘.’ is two less keypresses than ‘->’.

Multiple return values let you sidestep the oddity of “pass me the address of the thing I’m going to give you”:

// what I mean:
read(char* into, size_t limit) returns (bool success, size_t bytesRead)

// what c libraries do:
ssize_t /* -1=err */ read(char *into, size_t limit)

// what I do:
error_t read(char *into, size_t limit, size_t *bytesReadp)

But that pointer on the right reads as an input parameter to my function, rather than the output I want it to denote.

Methods give you name spacing, first and foremost, but they also just provide a logical separation I find I’ve missed in going back to C.

In particular, Go’s method implementation really appeals to me.

Go doesn’t have the complication of header files, so unlike the C+… family of languages, you aren’t bogged down with boilerplate. It also didn’t feel the need to convolute structure definitions with pre-declarations of function implementations.

When viewed through a C++, Java, C#, etc. lens, the following will make your skin crawl:

type Buffer struct {
    data     []byte
    cursor   int
}

func (b *Buffer) Read(into []byte) (bytesRead int, err error) {
    ....
}

On the left, (b *Buffer) is the “receiver”: it’s how Read is associated with Buffer. To the right of the name is the parameter, into, and after that is the list of return values.

How are you supposed to know what methods a class has? That’s probably part of what made your skin itch.

The answer is: probably not by going and reading header files or code. Go has incredibly powerful tooling that makes it a breeze to integrate with IDEs and editors. It’s sickeningly fast to parse, and documentation is generally highly accessible.

I’ve about gotten the AMUL code to a place where I’m ready to think about trying language-only C++: that is, C++ language features but none of the STL.

The thing continues

Ordinarily, at this point, I would be sharing my github repos so you can see the code I’m tinkering with. Two things are stopping me.

The last time I shared the code for AMUL, it was with someone I was discussing a possible port to Linux with back in ’94. That someone said “Can’t help you” and I got distracted with my job at Demon Internet. A few years later I found that a lot of concepts from AMUL had surfaced in another MUD language across the 6 months immediately after sharing the source, and that the individual I’d shared it with happened to be the author of said language…

More importantly, my snapshot is not of the last release. I don’t know why that bothers me (re sharing), except that I’m hoping I’ll find a more recent version that has the huge swath of cool stuff I introduced between the version I have code for and the AMUL900 that’s still out there. I think it’s knowing that I have probably lost the files at some point, or they are buried so deep in my nested collection of archives that I’m not finding them :)

Meanwhile, I’m still really super-enjoying working in pure C on this project. I wouldn’t necessarily choose it for a follow-up project, but I’m spending radically less time on boilerplate.

I spent part of today formalizing bits of stuff I had in external folders, so now there’s a Dockerfile each for an Ubuntu and an Alpine build environment.

One advantage of doing this in C, at least first, is that it’s proving a lot easier to statically analyze and bug hunt.

Cold hard cache.

Time to crawl the interwebs. I’m looking for something relatively small and lightweight: a binary blob cache that I can drop into place, import as a module in Python, and have relatively easy access to.

The keys are likely to be large, and the blobs may be several MB. I don’t care a great deal about persistence.

What I’m looking to achieve is something like ‘distcc’ for asset conversion. The backend doesn’t need to know that, it’s just going to get semi-opaque key values that ultimately serve to compartmentalize hash spaces.

Spot the flaws


void someFunction(char* inputStr)
{
    char buffer[8];

    int n = snprintf(buffer, sizeof(buffer) - 1, inputStr);
    buffer[n] = 0;

    /* ... */
}

And while you’re at it – see if you can spot the motivations behind what’s being done.

 

Do Not Do


a[n++] = b[++n];

Hint: If you know what it does, you’re wrong.

 

Rewriting the grid.

Two of the core components of our game server systems are the poorly named “World Grid” and “Update System” (the names reflect what the systems do, but not their overall role in the game server cluster). The combination of these two is alternately called “The Grid” and “The Update System”.

In short: they divide the world up into small parcels of virtual space, track the players moving around in those spaces, and generate data packets describing movements between one player and another.
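As a rough illustration of the idea only (the real systems are not shown here, and this is generic spatial hashing rather than their actual design), dividing the world into parcels and tracking who is in each one can be sketched like this. CELL_SIZE, Grid, move and nearby are all hypothetical names.

```python
from collections import defaultdict

CELL_SIZE = 100.0  # assumed parcel edge length, in arbitrary world units


def cell_of(x, y, cell_size=CELL_SIZE):
    # Map a world position to the integer coordinates of its parcel.
    return (int(x // cell_size), int(y // cell_size))


class Grid:
    def __init__(self, cell_size=CELL_SIZE):
        self.cell_size = cell_size
        self.cells = defaultdict(set)  # parcel -> ids of players inside it
        self.where = {}                # player id -> parcel it currently occupies

    def move(self, player, x, y):
        # Record a position update, re-homing the player if it crossed a boundary.
        new_cell = cell_of(x, y, self.cell_size)
        old_cell = self.where.get(player)
        if old_cell != new_cell:
            if old_cell is not None:
                self.cells[old_cell].discard(player)
            self.cells[new_cell].add(player)
            self.where[player] = new_cell
        return new_cell

    def nearby(self, player):
        # Players in the same or adjacent parcels: the candidates that would
        # need to hear about this player's movements.
        cx, cy = self.where[player]
        found = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                found |= self.cells.get((cx + dx, cy + dy), set())
        found.discard(player)
        return found
```

The point of the parcels is that an update only has to consider a handful of neighbouring cells instead of every player in the world.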

Written by coders not happy to co-operate together, hodged and podged over 10 years, sprinkled with genuine pixie dust(*); neither component lends itself to maintenance, study or peace of mind.

They were developed for 1999 hardware and lovingly hand-optimized to run with magnificent efficiency on an old Pentium III 800. But you would really have to go out of your way to make them less hostile to a modern CPU like the dual quad-core 3GHz Xeons our servers currently run on.

For 7+ years, they have been the bane of my existence.

A* vs Shortest Path

Been dabbling with an A* pathing algorithm applied to the graph of CPs in the game, and getting interesting results. For some reason, I missed the fact that A* doesn’t ensure a shortest path unless its heuristic is admissible, that is, unless it never overestimates the remaining cost.
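A small sketch of how that bites, on a made-up four-node graph rather than the game’s CP graph: the same A* routine is run once with a heuristic that overestimates the cost from B, and once with a zero heuristic (which makes it behave like Dijkstra). The graph, the heuristic values and the function names are all assumptions for illustration.

```python
import heapq


def a_star(graph, start, goal, h):
    # graph: node -> {neighbour: edge cost}; h: node -> estimated cost to goal.
    # Returns (cost, path) for the first time the goal is popped, or None.
    open_heap = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found later
        for nxt, cost in graph.get(node, {}).items():
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt, path + [nxt]))
    return None


# Toy directed graph: S -> B -> G costs 3, S -> A -> G costs 4.
GRAPH = {
    "S": {"A": 1, "B": 2},
    "A": {"G": 3},
    "B": {"G": 1},
    "G": {},
}


def overestimating(node):
    # Inadmissible: claims B is 5 away from G when the true cost is 1.
    return {"S": 0, "A": 0, "B": 5, "G": 0}[node]


def zero(node):
    # Trivially admissible: never overestimates.
    return 0


if __name__ == "__main__":
    print(a_star(GRAPH, "S", "G", overestimating))  # (4, ['S', 'A', 'G'])
    print(a_star(GRAPH, "S", "G", zero))            # (3, ['S', 'B', 'G'])
```

With the overestimating heuristic, B looks too expensive to expand, so the search reaches G through A first and returns the longer route.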