ParseLand: The project

Following my previous post about adventures with parsing tools, I started a GitHub project to do some experiments and, for a change, actually capture the work in one place and against one set of specifications.

https://github.com/kfsone/ParseLand

I’m starting out by building a naive (of sorts) implementation in C++ as a reference implementation, to introduce and exercise ideas. Second implementation I’ll probably try using cpp-peglib, and then I’ll see if I can do something in Rust.

Further adventures in parse-land

I’ve tried a few more parser tools to try and achieve what I think is a fairly simple parser, and it feels like we’re in some kind of before time or something. Performance, ease of use, and end-user-friendliness: pick half of any one of these.

Tree-sitter is a really great choice if your project is in- or adjacent-to-javascript. Rust-sitter looks interesting, but has some severe drawbacks and I dislike how it surfaces the use of recursion to implement repetition the way it does.

I think what I’m going to do is create a small collective project for trying out different parse tools in different languages (C++, Rust, Go, possibly Python or Ruby for contrast), and maybe I’ll throw in tree-sitter and ANTLR, although the threat of a Java runtime always makes me step away from actually using ANTLR.

Async/Await because?

Maybe it’s because I took using coroutines for granted already 30 years ago, but the pairing of async vs await just never made sense to me.

Specifically, what would it mean to omit the word await?

async fn f1():
  z = await f4()
  # what would "y = z()" mean?

In Rust, you trade boilerplate for specificity that aims to increase safety by making sure you agree with the compiler on what your intent was.

My expectation is that you should only have to mark up when an async method is calling a non-async method that can block.

async fn f1():
  x = f2()  # async fn
  y = f3()  # non-blocking fn
  z = [mayblock] f4()

In fact, I would expect async to serve a double purpose rather than needing two keywords:

async fn f1():
  z = async f4()  # wrap f4 so if it blocks our function yields

Rust vs Parsing

I’ve been intermittently looking at Rust for the last several years, and each time it’s rankled me in some significant enough way to counter the buy-ins that brought me to it.

My latest foray found me looking at PyO3 and maturin to build a Python extension, and mighty was my glee at the ease with which I could create a package for Mac/Linux/Windows in one shot.

It’s also my favorite kind of problem: it involves parsing. At SEMC we use a Python parser called Esrapy to parse some of our DSLs in the asset pipeline, and it’s a decent tool, but the implementation comes with scaling issues, which we didn’t really feel until we introduced some machine-generated files to parse.

Pulsar: Lost Colony

I tried out Pulsar: Lost Colony. If you told me it was a beta or early access demo I’d tell you it had great potential.

It was launched 2 years ago.

The big hurt is that it’s meant to be played co-op by up to 5 people filling the roles of the crew, and the stand-in AI is inferior, but it’s a lot more fundamental than that. Either the game is missing some critical layers of functionality needed to make FPS-style AI function or they were written horrifically wrong.

The AI falls off or thru pretty much anything and everything. If there is backtracking in the AI’s planning, then it’s weighted on something stupid, if at all.

Their state machine appears to lack logic for constraining the lifetime of a decision – so AI will run off the edge of a staircase and fall to its death still trying to reach the x/y it had in mind, and not using its jetpack to right itself.

I was growing tired of the friendly AIs having no cutoff on how many times they’d repeat the same plan and as a result getting stuck running into a pole or stuck on top of one surface with no connection to the surface they want to reach; many of the teleport destinations on away missions have some kind of covering over them, and if the AI lands on that, there it will stay. Your only recourse is to dismiss the AI and add a new one, and set it up again.

Then I realized that the /enemy/ AI is just as terrible. I have a hunch that part of the reason this FPS-based game has no crouch or lean is because they could barely deal with the basic AI they already have. If I had to guess, I’d imagine these guys were rookies going boldly but cavalierly into new territory for themselves, but also blindly not taking the time to watch GDC talks or videos, or use any of the other ways they could have learned the many convergent ways in which different past studios have made these problems simpler and more manageable.

Each of the roles feels incredibly vacuous and shallow, of a depth I’d expect only in early access or beta. There are no little gimmicks for them to mess with, nothing to give a play thru any of the flair that you might have expected to be added before release, never mind in the 2 years since.

The AI doesn’t have voice lines, so the AI crew gives the game a haunted feeling.

Lastly instead of any kind of tutorial there’s literally a manual. Literally a giant PDF file in-game that explains what the buttons in the game do without really actually telling you anything about what the game is or your role in it.

I wouldn’t recommend this game to anyone but developers; it is a really nice, delightful framework for building a game, but balanced almost exactly by so many missed opportunities and so many what-not-tos.

Windows 254 character path limit

Few people know out-of-the-box Windows doesn’t allow you to create files whose full name is longer than 260 characters.

So, you can’t create C:\12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234.txt until you turn on a feature called “long paths”.

It’s not the file name but the whole path, so you can create the folder C:\1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456 but you can’t create files in it because adding the “\” separator will be the 260th character.

The reason is simple. Back in C when you needed somewhere to store a file, you created a little buffer of N characters to store it. Every operating system had its own limit, but to access that you had to include extra headers and then remember/type the name of the variable, and usually the header it was in wasn’t short or simple and if you couldn’t remember the name, finding it was a pain. (PATHMAX? MAXPATH? PATH_MAX? PATHLENMAX?)

And so, in the spirit of y2k and 64KB, they just typed one of 250, 255, 256, 260 or similar. Why would you need a longer filename? Well, knowing other programmers were likely going to use up to 256 in code using the operating system, Microsoft allowed 4 bytes of padding.

Anyway, I’ve found this to be a source of amusement observing how programmers deal with discovering this. Typically outrage at “stupid windows”.

Why?

Because times have changed. Today, instead of using the system-provided constant, they type 500, 510, 511, or 512 if they’re over 30; 1000, 1023, 1024 if they’re over 20 and 4096 if they’re under 24.

And my absolute favorite moment was this last week, reviewing code from a former colleague, friend and well-respected engineer who was all busy a-grumbling about having to work around this limitation and right there in the first line of his work-around code he had written:

// paraphrased
bool readFile(Type assetType, const string& name)
{
    char filename[250];
    const string& assetFolder = getAssetFolderFromType(assetType);
    sprintf(filename, "%s/%s", assetFolder.c_str(), name.c_str());
    std::ifstream instream(filename);
...

So I did the little me-thing of asking what this code was solving for (“the 260 character limit”) and he gave me an example filepath for testing it, and then I asked “and that will fit in your filename buffer?”, “yes, the name is just FlowerA.png <additional grumbling about the 260 limit>“, “but you’re combining the name with the path”.

Edit 1

bool readFile(Type assetType, const string& name)
{
    char filepath[250];

“Great, and our FlowerA.png filepath will fit into it?”

obviously as long as it’s under 250 characters

“which is less than 260”

what? fffuuufufufuuufuufuuu, damn Microsoft

“How is this their fault? They even gave you 10 extra characters if you want to do it the hard way, or 32767 characters if you opt-in to long paths. But the 260 limit is partly there to try and protect against your bug.”

what bug?

“Well you used 250 and sprintf throughout your code, and we already know that some of the paths alone are over 260 characters plus long filenames coming out to 300”

it’s not my bug if the operating system can’t handle a path that long and the paths get truncated

“That’s your second bug, that you hardcoded too small a number, but the first bug is that sprintf doesn’t truncate”

of course it does, where else can it put the rest of the string. oh. I hate windows, this has never been a problem on Mac

“Actually, it totally has, there’s a whole bunch of can’t-repro bugs that went away when people used a different setup, which you wrote off as them having done it wrong the first time. The difference is everyone else has their full user name and you have a three letter username, which brings you in at 249 characters for the example case”
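To make the "sprintf doesn't truncate" point concrete, here is a minimal sketch of the bounded alternative. The helper name formatPath is made up for illustration and is not from his code. It relies on snprintf returning the length the full string would have needed, so a result at or beyond the buffer size means the output was cut short.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the original code): formats "folder/name"
// into a fixed-size buffer without ever writing past its end.
// Returns true only if the whole path fit, including the terminating NUL.
bool formatPath(char* buffer, std::size_t size,
                const std::string& folder, const std::string& name)
{
    // snprintf writes at most size-1 characters plus a NUL, and returns the
    // number of characters the complete string would have required.
    const int needed = std::snprintf(buffer, size, "%s/%s",
                                     folder.c_str(), name.c_str());
    return needed >= 0 && static_cast<std::size_t>(needed) < size;
}
```

With sprintf there is no size argument at all, so the extra characters simply land on whatever sits after the buffer. With snprintf the caller at least gets a false back and can decide what to do about it.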

My alternative for his code:

bool readFile(Type assetType, const string& assetPart)
{
  filesystem::path fullpath(getAssetFolderFromType(assetType));
  fullpath /= assetPart;  // join with native directory separator.
  std::ifstream file(fullpath);
  if (!file.good())
    throw huston::fileaccess_exception("could not open file", getErrno());  // getErrno = his function
  ...

Him: “But how does that solve the 260 character limit?”

It doesn’t, but it also doesn’t either truncate the names to 249 characters or stomp the stack. The code opts in to long paths in its manifest, and the user needs to ensure they have long paths enabled. You can do a registry check or have your own path checking method to warn users if they do encounter a long path without having enabled the facility in the OS.
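A "path checking method" of the kind mentioned above could be as small as the sketch below. The names needsLongPaths and kLegacyMaxPath are made up for the example, and it assumes the caller passes the full absolute path and that the classic 260 limit counts the terminating NUL. It only flags paths for a warning; it does not query the registry or the OS.

```cpp
#include <cstddef>
#include <filesystem>

// Assumption: the classic Windows limit of 260 characters includes the
// terminating NUL, leaving 259 usable characters for the full path.
constexpr std::size_t kLegacyMaxPath = 260;

// Hypothetical helper: true if this (absolute) path is too long for the
// legacy limit and therefore needs long-path support to be enabled.
bool needsLongPaths(const std::filesystem::path& fullPath)
{
    // native() is the path in the platform's own string type;
    // the +1 accounts for the terminating NUL.
    return fullPath.native().size() + 1 > kLegacyMaxPath;
}
```

Calling this just before opening a file gives you somewhere to hang a "turn on long paths" warning instead of a mystery failure.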

It tickled me for two reasons. First, one of my current coworkers did the same 250 thing (but fortunately he knows to use snprintf if not filesystem/string as an arg to ifstream{}).

Secondly, he had gone thru all his code and done pretty much the same thing everywhere with the 250. This was a distraction for him that needed fixing superfast and he couldn’t task anyone else to it, and in his head what he’d been doing was ensuring the paths were below Microsoft’s limit, for the half a second he’d thought about it.

The performance issue was, of course, that he was using streams and finding them slow, but if he didn’t then his programs would run out of file handles when he was using FILE objects.

This guy is not an idiot. He’s written some truly amazing libraries that are very important to modern AI and graphics coding. This is just not a part of C/C++ he has spent much time in. He literally hadn’t used the layer-3 C io in C++ in a decade.

Quite the reminder that you should never think you know more than 60% of C++ :)

Pro-tip: Write Python like Python

My last post accused Python of being The Slow of the Internet, not because Python is bad but because bad Python is awful.

In many cases, Python is really not slow for the reasons you think it is

Python is a great glue language, a terrific scripting language, because it provides fantastic facilities for manipulating bulky amounts of data. The terrible language that makes our day-to-day lives slower and more miserable is actually anti-Python.

There are two sides to the Python problem: non-engineers using it to write runtime descriptions of data manipulations performed by non-python backends, and engineers writing it as an expose of their non-python backends.

Between the two groups, nobody is really here for Python.

Python, the slow of the internet.

Unpopular Opinion: CPython is stupidly slow. CPython is the Python you’re using if you don’t know which Python you use.

Before Go, Python had taken a firm hold of the systems admin coding, and huge amounts of Linux tooling is written in Python.

During the Great Python 3 Migration of 2019, Python libraries bloated with people introducing bidirectional compatibility, generally by just grabbing some 3rd-party libraries to minimize the footprint of change.

I’m not going to rant about people not knowing the standard ‘dis’ module exists, or not knowing about timeit/%timeit… It’s not really an “optimization” issue tho.

Today’s Linux admin activities are agonizingly slow because so many Python developers hear adages about not optimizing Python code that they think you never need to worry about it, so they have no idea how expensive some very common practices are.

Sadly, CPython makes no-need-for-performance-thinking untrue in one really unfortunate detail, one detail that has been agonizingly inflated by the bloat of compatibility code:

Function call overhead :(

The code from this post is in a Jupyter notebook in my github, here.

If you want to interact with it (run it for yourself), you can either use an online notebook viewer (e.g. https://nbviewer.jupyter.org/), or Visual Studio Code has really nice support for notebooks, now.

The golang example is here.

‘amul’ breathes again

in 1990 it was a 16-bit, AmigaDOS system

I’ve mucked about trying to resurrect AMUL here and there over the years, reworked a bunch of it under the moniker “SMUGL” but once I was done with the particular aspect (sockets, classes, whatever) I never got to the important task of making it work.

After another couple of forays into poking the bear recently I tasked myself with strictly focusing on finding what was making it crash and fail to load.

It turned out that some time in probably 97-98 I’d deleted a line of code necessary for the game to figure out how much data it was loading, oops.

Largely it turned out to be my lack of C++ understanding in ’92. Apparently I learned the lesson but decided not to apply the learnings.

I’ve tried to mostly stick to pretty much C code, but then luxuriated in a few “really does something useful” C++ features to let me focus on the important and fun parts. I’m going to try and keep the struct/class hierarchies flat, with a very limited amount of inheritance to fill in for the absence of interfaces in C++.

It consists of three main components:

Compiler: Which is a bit generous of a name for what it really does, in my mind anyway. It consumes various text files and spits out a set of binary data files.

Engine: AMUL had a server (manager) and copies of the “frame” (a combination client/server) directly manipulated its memory. SMUGL has a single executable which simply forks itself to become a client. Static data is visible to the child and shared memory homes the mutually-modifiable data.

Game code: Someone actually has to describe the game world, commands and behaviors.
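For the Engine piece, the fork-plus-shared-memory arrangement can be sketched roughly as below. This is an illustration rather than the real engine code: SharedState and runSharedCounterDemo are made-up names, and it uses an anonymous shared mmap for brevity rather than named POSIX shared memory, so it is POSIX-only.

```cpp
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the mutually-modifiable game data.
struct SharedState {
    int playersOnline;
};

// Forks a "client" that bumps a counter in shared memory. Returns the value
// the parent sees afterwards, or -1 if mmap or fork failed.
int runSharedCounterDemo()
{
    void* mem = mmap(nullptr, sizeof(SharedState), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    SharedState* state = static_cast<SharedState*>(mem);
    state->playersOnline = 0;

    const int staticData = 42;  // ordinary memory: the child gets its own copy

    const pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, sizeof(SharedState));
        return -1;
    }
    if (pid == 0) {
        // Child ("client"): can read the static data it inherited, and its
        // writes to the shared mapping are visible to the parent.
        if (staticData == 42)
            state->playersOnline += 1;
        _exit(0);
    }

    // Parent ("manager"): wait for the client, then read the shared value.
    int status = 0;
    waitpid(pid, &status, 0);
    const int seen = state->playersOnline;
    munmap(mem, sizeof(SharedState));
    return seen;
}
```

The split is the point: anything set up before the fork is visible to the child for free, and only the data that both sides modify has to live in the shared mapping.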

The language itself is part data definition and part functional programming, in that you are writing pattern matching sequences followed by lists of steps to execute when those patterns are matched.

verb=help           # How to deal with a player input starting with help
syntax=player=me    # more specific: "help me"
  respond "I'm just the narrator, ask another player?"
syntax=player
  checknear player  # make sure we're near the player you named
  if helping player fail "You're already helping them."
  if helping someone fail "You're already helping @hp"
  help player       # execute the built-in 'help' action
  tell player "@me is now helping you."
  respond "You are now helping @pl."
syntax=player=madhog verb=tank
  fail "I don't think it can be done."
syntax=player verb
  respond "That's not how this works, you tell *me* what *you* want to do."

The current language is overly specialized, but on a modern system I think I can quite comfortably address some of the issues that effectively forced my hand back in the day.

I’ll probably spend the next few days of hacking replacing the “Posix Shared Memory” code with memory-map backed storage so I can run it on Windows, and doing a better job of separating dynamic from static data. Then I’ll whittle down some of the “muscle work” code that I feel comfortable entrusting to the C++ standard library/my own simple classes/templates.

My big plans are for completely replacing the compiler’s parser and parser generators: I’ve fleshed out a design that intentionally falls short of being a complete parser generator but instead meets you half-way with a fairly clean and readable actual-code description of your language as a state machine.

// names starting lower-case are user-written functions that capture
// token(s) and validate them.
error_t newRoom(Token);     // write a function to capture a new room name
error_t addRoomFlag(Token); // write a function to capture/check each flag

auto RoomLine = Optional("room=") + (Identifier()->newRoom) +
                ZeroPlus(FromList(roomFlagList))->addRoomFlag +
                Peek(EndOfLine());

auto RoomShortDescription = LineOfText()->shortDescription;
auto RoomLongDescription = ZeroPlus(LineOfText())->longDescription;

auto RoomParser = RoomLine() + RoomShortDescription() + RoomLongDescription();
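To give a feel for how building blocks like these might compose, here is a rough, self-contained sketch. It is illustrative only and not the real implementation: the Matcher and Capture types and the Literal and Sequence helpers are made up for the example, and it uses plain function calls instead of the + and -> syntax above.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// A matcher consumes text from the front of `input` on success and leaves
// `input` untouched on failure.
using Matcher = std::function<bool(std::string_view& input)>;
// A capture is the user-written function that receives/validates a token.
using Capture = std::function<bool(std::string_view token)>;

// Matches an exact piece of text.
Matcher Literal(std::string text)
{
    return [text](std::string_view& input) {
        if (input.substr(0, text.size()) != std::string_view(text))
            return false;
        input.remove_prefix(text.size());
        return true;
    };
}

// Tries the inner matcher but succeeds either way.
Matcher Optional(Matcher inner)
{
    return [inner](std::string_view& input) {
        inner(input);
        return true;
    };
}

// Matches a run of letters, digits and underscores, then hands it to capture.
Matcher Identifier(Capture capture)
{
    return [capture](std::string_view& input) {
        std::size_t length = 0;
        while (length < input.size() &&
               (std::isalnum(static_cast<unsigned char>(input[length])) ||
                input[length] == '_'))
            ++length;
        if (length == 0 || !capture(input.substr(0, length)))
            return false;
        input.remove_prefix(length);
        return true;
    };
}

// Runs each part in order; rewinds the input if any part fails.
Matcher Sequence(std::vector<Matcher> parts)
{
    return [parts](std::string_view& input) {
        const std::string_view saved = input;
        for (const Matcher& part : parts) {
            if (!part(input)) {
                input = saved;
                return false;
            }
        }
        return true;
    };
}
```

A room line would then be something like Sequence({Optional(Literal("room=")), Identifier(newRoom)}), where newRoom is whatever function you wrote to record the name.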

Lastly I want to abandon the current fork-per-client model. I’ve written a small replacement library that emulates a portion of the Amiga’s “Message Ports” message-passing system. By using that and “thread_local” on my excessive number of globals, I think I can quickly get to a working single-process, multi-threaded version (and then get rid of all the globals).
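The general shape of that idea, as a minimal sketch: this is not the actual library, and Message, MessagePort and currentPlayerId are names invented for the example. The port is just a mutex-guarded queue with a condition variable, loosely in the spirit of the Amiga's PutMsg/GetMsg/WaitPort.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Hypothetical message type for the example.
struct Message {
    int sender;
    std::string text;
};

// A thread-safe mailbox: any thread can put(), the owner drains it.
class MessagePort {
public:
    // Queue a message and wake one waiting reader.
    void put(Message msg)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(msg));
        }
        ready_.notify_one();
    }

    // Non-blocking: returns the next message, or nothing if the port is empty.
    std::optional<Message> tryGet()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty())
            return std::nullopt;
        Message msg = std::move(queue_.front());
        queue_.pop_front();
        return msg;
    }

    // Blocking: sleeps until a message is available, then returns it.
    Message wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        Message msg = std::move(queue_.front());
        queue_.pop_front();
        return msg;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Message> queue_;
};

// What used to be a per-process global becomes per-thread state.
thread_local int currentPlayerId = -1;
```

Each client thread would own a port and its own copy of the thread_local "globals", which is what lets the fork-per-client code move into one process without rewriting everything up front.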

I’m trying to work up to opening up the source repos, but I’m not quite ready to do it.

FreeBSD!

The last functioning binary I have of my Linux port of my MUD language was apparently compiled on FreeBSD some time in the mid-2000s.

So I’m setting up a bunch of FreeBSD Hyper-V machines to try and get a working environment with which to demo it to someone.

And right now I’m all the way back to 4.2, but I have this hunch it was 2.x something, which will mean going back so far that I use CD ISOs.

I’d forgotten how 8-bit retro BSD feels even when it’s shiny new.