Week 10

ScummVM Server Connection

This week was spent creating and gluing together what will be the two most used pieces of this project: the GUI for the user to verify (or submit) checksums for their files, and the API endpoint on the server so ScummVM can communicate with it.

API Endpoint

An API endpoint is essentially a route (URL) that will read and respond to “requests” that you send to it.

For our application, we have a /api/validate route that accepts JSON requests, sent from ScummVM, processes the input (validates the checksums) using the functions we made last week, and sends back a JSON “response”. This is read by ScummVM, and the results of the entire process are displayed.


Once the project is finished, the user should be able to scan their game files and get a result of the verification. This all starts by clicking a button in the Game Options menu.

I added a “Check integrity” button

Clicking the button gives the user a small alert that this might take a while, and then the checksums are calculated. These values are then composed into the defined request format, and sent to the API endpoint as part of the body of a POST request.
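As a rough illustration, composing that request body might look like the following Python sketch. The field names here are hypothetical – the actual wire format is whatever the API spec defines:

```python
import json

def build_validate_request(game_id, files):
    """Compose a /api/validate request body from computed checksums.

    `files` maps each filename to a dict of checksum-type -> value.
    The field names used here are illustrative, not the real format.
    """
    return json.dumps({
        "gameid": game_id,
        "files": [
            {"name": name, "checksums": sums}
            for name, sums in sorted(files.items())
        ],
    })
```

The resulting JSON string is what would be sent as the body of the POST request.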

After the request is sent, it is received by the server, processed, and a response is sent back. Depending on whether or not the game fileset is present on the server, we can get varying results. If the fileset is available, we display the results of the server-side comparison on a file-by-file basis.

If the fileset is unavailable, it is inserted into the db as a user fileset and we show a dialog asking the user to send an email to us along with the ID of the new fileset so we can review it.

With all this set up, the user-facing part of the project is finally complete!

That’s all for now, hope you check in next week for more progress!

Thanks for reading!


Week 8 & 9

This week’s blog post will contain work from both week 8 and week 9, since I had college during week 8 and I wasn’t able to get much work done.

These two weeks were spent starting the final piece of the puzzle – handling user checksums, and adding some extra features to the website and CLI tool.

User Checksums

The entire point of this project is to allow users to validate their game files against those present in the db that we spent the first 2/3rds of this project creating.

The user checksums will be validated against whatever we have in the database, and there are two outcomes – the game variant in question has fully populated checksums, or it doesn’t. If it does, the process is pretty simple – just verify the integrity of the game files one-by-one, store the results in a JSON object, and send it back as a response.

If the game variant doesn’t exist in the database, we can insert the submitted checksums into the database. Since the user can submit even incorrect files for verification, we will be adding these filesets to a queue where a moderator can approve them manually.
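The verification branch described above can be sketched in Python like so (the server-side code is actually PHP, and the status labels here are made up for illustration):

```python
def verify_fileset(user_files, db_files):
    """Compare user-submitted checksums against the database copy,
    file by file. Both arguments map filename -> checksum; the
    returned status strings are illustrative, not the real API's.
    """
    results = {}
    for name, checksum in user_files.items():
        if name not in db_files:
            results[name] = "unknown_file"       # user has a file we don't know
        elif db_files[name] == checksum:
            results[name] = "ok"                 # checksums agree
        else:
            results[name] = "checksum_mismatch"  # file is present but differs
    for name in db_files:
        if name not in user_files:
            results[name] = "missing_file"       # user is missing a known file
    return results
```

The result dict is what would be serialized into the JSON response for ScummVM to display.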

Framework Improvements

This week introduced a whole bunch of improvements to the website, parser and CLI scanner.

I’ve introduced a megakey column in filesets to uniquely identify non-detection filesets. This means we can skip insertion of all filesets that already exist.
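A megakey like this could be derived by hashing the sorted file checksums, so that file order doesn’t matter. This is just a sketch of the idea – the real column may well be computed differently:

```python
import hashlib

def compute_megakey(files):
    """Derive a single key identifying a non-detection fileset.

    Sorting the (name, checksum) pairs first makes the key
    independent of the order files were scanned in, so two uploads
    of the same fileset produce the same megakey and the duplicate
    can be skipped.
    """
    parts = sorted(f"{name}:{md5}" for name, md5 in files.items())
    return hashlib.md5("|".join(parts).encode()).hexdigest()
```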

Game entries in games_list.php now have clickable rows that point to their respective fileset pages. You can also sort by any column by clicking on the heading.

In the fileset page, you can now view checksums of all types and sizes, thanks to a new ‘Expand table’ button. I’ve also added the ability to calculate checksums of AppleDouble and .rsrc files with the scan utility.

Log entries have also gotten an overhaul: now every upload gets its own ‘Transaction’ id, and every fileset inserted (or not inserted) is logged.

That’s all for now, hope you check in next week for more progress!

Thanks for reading!


Week 7

Revisiting the DATs

This week was spent adding additional features to the main framework of the project – the parser and the scan utility. I also changed up the UI of the website.


In classic Mac OS, one file could have both a resource fork and a data fork. These were physically separate files, but were treated as one object by the OS. While Mac OS handled this fine, transmission protocols couldn’t do the same.

Enter MacBinary – a file format that combines both forks into one file, suitable for transmission over the interwebs, and also capable of being used on other file systems. ScummVM relies on MacBinaries to handle files from classic Mac OS games, which means my scanner should handle them too.

MacBinary files need to be treated differently because the detection happens based on the checksums of the forks, not the whole file – which means I needed to create functionality to detect MacBinaries, and then calculate checksums of their forks for matching.
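Splitting a MacBinary into its forks boils down to reading two big-endian lengths from the 128-byte header; here's a minimal Python sketch (header validation, which the real detection code needs, is omitted):

```python
import struct

def split_forks(data):
    """Split a MacBinary file into (data_fork, resource_fork).

    The fork lengths live at fixed offsets in the 128-byte header:
    data fork length at offset 83 and resource fork length at
    offset 87, both big-endian 32-bit. The data fork starts right
    after the header; the resource fork follows, padded to the next
    128-byte boundary.
    """
    data_len, resource_len = struct.unpack_from(">II", data, 83)
    data_fork = data[128:128 + data_len]
    # resource fork begins at the next 128-byte boundary after the data fork
    res_start = 128 + ((data_len + 127) // 128) * 128
    resource_fork = data[res_start:res_start + resource_len]
    return data_fork, resource_fork
```

With the forks separated, each can be checksummed independently for matching.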

The difficulty of this part came from the identification of the MacBinaries. I had to translate C++ code into some complicated-looking Python, and implement a custom CRC function. Took a couple of tries, but it works now!
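The CRC in question is, as far as I can tell, CRC-16/XMODEM (polynomial 0x1021, initial value 0), which MacBinary uses to checksum its header. A bit-by-bit Python version looks like this:

```python
def crc16_xmodem(data):
    """CRC-16/XMODEM (poly 0x1021, init 0x0000), computed bit by bit.

    MacBinary stores this CRC of the first 124 header bytes at
    offset 124, which is one of the signals used to identify a
    MacBinary file.
    """
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc
```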

Incremental DATs

The parser is built to read and interpret DATs, and then insert them into the db as is. This feature would add an additional check to prevent the duplicate entries from being inserted when a DAT with the same entries is uploaded.

Implementing this feature efficiently took some thinking (and refactoring!), but I managed to get it to work!

I first had to add an extra column – timestamp – to the fileset table to determine whether a detection entry was outdated (meaning it wasn’t present in the latest detection entries DAT). These entries were marked as obsolete. During insertion, I checked if the detection fileset was already present in the table, with the help of the value in the key column (a unique identifier of sorts). These entries were skipped during insertion.

For DATs coming from the scanner or clrmamepro, it’s a little more tricky. If a game is detected, we can compare the filesets associated with that game, and mark the new fileset for removal if they are the same. We decided to leave removal of unmatched duplicate filesets for a later date, as handling this scenario needs a little more set up.

Some smaller changes

I also updated the scanner to handle filenames with punycode, and added support for custom checksum sizes (the printing didn’t work before).
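For the punycode handling, the idea is roughly this – Python's built-in punycode codec does the heavy lifting, though the exact prefixing scheme the scanner uses may differ from this sketch:

```python
def punycode_name(name):
    """Encode a filename containing non-ASCII characters using
    punycode, leaving plain-ASCII names untouched. The "xn--"
    prefix follows the usual punycode convention; treat it as an
    assumption about the scanner's actual scheme.
    """
    if name.isascii():
        return name
    return "xn--" + name.encode("punycode").decode("ascii")
```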

I also altered the UI for the filters in paginated tables – there’s one filter for each column now, with support for multiple filters.

Works on substring matching too!

That’s all for now, hope you check in next week for more progress!

Thanks for reading!



Week 6

Revamping the Website

Last week when I showed the detection entries webpage, it was far from impressive to say the least. This week was spent upgrading the website in both looks and functionality.

The Looks

Not too much to say here, except that I spent a little time to glam up the website and give it some functional UI improvements as well. The data is displayed in a labelled table, with a visual distinction on alternate rows. I also added a color change on hovering over a row so it’s easier to tell which row you’re looking at.

Pretty big improvement over last week I think!

The Functionality

Last week, I had made some very rudimentary pages for displaying the detection entries and logs. This week, I gave them essentially a full overhaul.

First, I extracted the common code for pagination into its own file. Then I added some more sophisticated navigation buttons, allowing users to skip to the first and last pages, along with the two pages immediately before and after the current page (as you can see in the screenshot). There’s also a text box to go to a specific page number if you don’t want to manually navigate with the buttons.
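The page-window logic amounts to collecting the first page, the last page, and the pages around the current one. A quick Python sketch of the idea (the real code is PHP):

```python
def page_window(current, last, radius=2):
    """Page numbers to render as navigation buttons: the first page,
    the last page, and up to `radius` pages on either side of the
    current one, clamped to the valid range.
    """
    nearby = {p for p in range(current - radius, current + radius + 1)
              if 1 <= p <= last}
    return sorted({1, last} | nearby)
```

Gaps in the returned sequence (e.g. between 1 and 3) are where an ellipsis would be rendered.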

As you can see at the top of the page, I’ve implemented filters to narrow down the entries shown. In the screenshot, I’m filtering only the entries that have platform: windows.

In the logs page, for matching logs, the log text now includes a hyperlink to the fileset page of the fileset that was matched.

The Fileset Page

Fileset info

The fileset page exists so developers can manage individual filesets, and it works even for the ones that don’t have a game matched to them.

Fileset details shows the source of the fileset, along with game data and status if it is matched to a game. The next table shows the files in the fileset (also with pagination, some games have a lot of files).

The next two parts are more interesting – Developer Actions and Fileset history. Developer Actions has a list of actions that the developer can perform on the fileset. Eventually, this will include approving/rejecting the fileset if it was submitted by a user, but for now it just includes a button to mark the fileset for deletion.

The Fileset history table displays the previous IDs that the current fileset had before it was merged. This is useful for redirecting older IDs to their new homes. As shown in the screenshot, there’s also a link to the logs page that shows the log entry for the matching of the filesets.

That’s all for now, hope you check in next week for more progress!

Thanks for reading!


Week 5

This week was spent completing the main foundation of the project by matching entries, and starting work on the website. I’ve also implemented a logging system for when DATs are uploaded and matched, and set up a test instance for my code.

Matching Entries

When we upload a DAT from, let’s say, our CLI scanner utility, we calculate checksums of the files in the specified folders. But how do we know what games those checksums correspond to? The directory names could be gibberish, and unless a developer manually adds metadata to every single fileset, there’s no way of knowing which game the fileset corresponds to by parsing the DAT alone. So after we parse, we have to match the entries to a specific game.

Matching entries involves identifying which game a fileset is for, by a method similar to what ScummVM itself uses. ScummVM stores the checksums of a couple files from the game, and then uses this to identify what game files a directory contains. In the same fashion, if we happen to find a detection entry where all detection files are found in the new entry, a match is made. After matching, we must merge the filesets, so we can actually use the files in the new fileset.
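In Python terms, the matching rule can be sketched as a subset check – every (filename, checksum) pair in a detection entry must appear in the new fileset. Case-insensitive filename comparison is an assumption of this sketch:

```python
def find_match(new_fileset, detection_filesets):
    """Match a new fileset against detection entries.

    `new_fileset` maps filename -> checksum for the uploaded files;
    `detection_filesets` maps a game id to its detection files. A
    detection fileset matches when all of its (filename, checksum)
    pairs are present in the new fileset.
    """
    have = {(name.lower(), md5) for name, md5 in new_fileset.items()}
    for game, detection_files in detection_filesets.items():
        needed = {(n.lower(), m) for n, m in detection_files.items()}
        if needed <= have:          # subset check: all detection files found
            return game
    return None
```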

Log your changes!

This part is pretty simple – know when the database is changed. Whenever we write to the database (i.e when we upload or match entries), we want to make sure that we have a ledger to keep track of our actions.

I added a logging table to the schema and store in it some useful info, like a timestamp, what kind of action was done, who did it, what changed etc. This will show up on the website we’ll be talking about in just a bit.

The Website

One of the biggest parts of the project is having a website that developers can visit and use as a dashboard of sorts to manage the stuff in the database.

Detection entries page. It doesn’t look like much, but it does work!

This week I implemented two webpages – one to display info about the detection entries, and one to show you the logs. Obviously showing the several thousand games on the same page is a bad idea, so I also implemented pagination to show a more manageable 25 entries per page.

I also spent a little bit of time setting up an apache2 vhost so my mentor and I could test the code in the same environment.

That’s all for now, hope you check in next week for more progress!

Thanks for reading!


Week 4

Finishing up the foundation

It’s already week 4 – and I’ve almost completed the first major part of the project! This week I finished the implementation of exporting detection entries from engines based on AdvancedDetector, and revisited the database schema and DAT parser.

This means I have now finished the code for creating DATs from game files, exporting detection entries from ScummVM and loading DATs from various sources. The only work with DATs that remains (for now) is creating logic for matching entries from untrusted sources with the detection entries.

Dealing with Detection Entries

Once I had the exporting working, I simply had to load it into the parser that I had already written. Only it wasn’t as simple as I had hoped…

The detection entries proved to be quite an interesting foe to tackle, causing all sorts of bugs in every nook and cranny. The first snag that I encountered was filenames having spaces and brackets, which were special characters that the parser depended on. Seems my idea to use regex to keep the code clean was going to make this quite a challenge to tackle!

The solution we came up with was to enclose filenames (and other metadata like game titles) in quotes, but getting regex to ignore matches inside quotes is a little challenging, and I wasn’t going to dig myself deeper into the hole I had made for myself. So I ended up rewriting the entire first part of the parser into a match_outermost_brackets() function.
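The idea behind match_outermost_brackets() can be sketched in Python as a linear scan that tracks bracket depth and skips over quoted strings. The real function is PHP; this just shows the concept:

```python
def match_outermost_brackets(text):
    """Return the text inside each top-level ( ... ) pair, ignoring
    brackets (and escaped characters) that appear inside
    double-quoted strings.
    """
    spans, depth, in_quotes, start, i = [], 0, False, 0, 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and in_quotes:
            i += 2                      # skip the escaped character
            continue
        if ch == '"':
            in_quotes = not in_quotes   # toggle quoted-string state
        elif not in_quotes:
            if ch == "(":
                depth += 1
                if depth == 1:
                    start = i + 1       # content begins after the bracket
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    spans.append(text[start:i])
    
        i += 1
    return spans
```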

We put text into quotes to get the parser to ignore its contents, but what about text with quotes in it? The answer is using an escape character, like a backslash (\). What if the text has a backslash in it? Well, escape that too! I had to write the function to add backslashes before exporting in ScummVM, but PHP has a neat built-in function stripslashes() that gets the original string back for you!
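The escape/unescape pair could look like this in Python. Here stripslashes mimics PHP's built-in, and addslashes stands in for the C++ code on the ScummVM side:

```python
def addslashes(text):
    """Backslash-escape backslashes and double quotes before writing
    a value into a DAT. Backslashes must be escaped first, so the
    escapes we add for quotes aren't themselves re-escaped.
    """
    return text.replace("\\", "\\\\").replace('"', '\\"')

def stripslashes(text):
    """Undo addslashes(): drop each escaping backslash, keeping the
    character that follows it (mirrors PHP's stripslashes())."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```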

Back to the tables

Exporting and parsing the entries is one thing, but now I needed to actually put the data into the db.

After some discussion, we came to the conclusion to modify the schema a little bit to accommodate some more metadata, and also make it easier to use. After updating schema.php to fit the new design, I worked on getting the data into the right places, and added some conditionals to alter the insertion behavior depending on the source of the DAT file.

One last addition to the parser was handling different sizes and types of checksums, from full checksums to the last 5000 bytes (tail) checksums. Once I got this done, I could finally get the data into the db without a hitch.
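A tail checksum only needs a seek before hashing; here is a Python sketch of the "last 5000 bytes" variant (md5 chosen for illustration):

```python
import hashlib

def tail_md5(path, tail_bytes=5000):
    """md5 of the last `tail_bytes` of a file. When the file is
    smaller than the tail size, this naturally degrades to hashing
    the whole file.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)                      # seek to end to learn the size
        size = f.tell()
        f.seek(max(0, size - tail_bytes))
        return hashlib.md5(f.read()).hexdigest()
```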

That’s all for now. Next week I’ll be writing the matching logic and moving on to creating the website to browse these entries. Hope you check in next week for more updates!

Thanks for reading!


Week 3

My expectations for the week

Week 3 of the GSoC summer has rolled around! This week was spent writing the CLI tool for developers to create their own DATs from game files on their computer, and dumping detection entries from ScummVM, also into DATs.

Part 1: Creating the CLI

The CLI needs to do a couple things – taking in the directory path that contains game files, scanning it for files in the root folder, calculating the checksums for these files, and finally combining them all in the right format so it can be parsed by the parser we wrote previously.
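The core of that pipeline could be sketched like this – checksum every regular file in the root of the given directory. Only md5 is computed here; the real tool handles several checksum types and sizes:

```python
import hashlib
import os

def scan_directory(path):
    """Checksum every regular file in the root of `path` (no
    recursion, matching the root-folder scan described above) and
    return {filename: (size, md5)} ready to be formatted as a DAT.
    """
    result = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            with open(full, "rb") as f:
                data = f.read()
            result[name] = (len(data), hashlib.md5(data).hexdigest())
    return result
```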

Most of the difficulty of this part came from trying to keep the script free from external dependencies, while also keeping it from getting too bloated. Long Python scripts can be a real eyesore. But after dropping in some list comprehensions and iterators to substitute my rudimentary implementation, I got the code looking pretty decent, I think.

Part 2: ScummVM detection entries

If you’re not aware, ScummVM detection entries are multiple file-checksum pairs that can be used to identify which variant of a game you are running. Every variant has some files specific to it, and the detection entries contain these files. This lets ScummVM run the game correctly.

Each engine has its own detection entries, and our goal is to extract these entries, format them the same way we formatted the data in the CLI application for our parser, and write them into DAT files. This functionality will be accessed with a command-line option (--dump-all-detection-entries) while running ScummVM from the command line.

Almost all engines in ScummVM, bar three of them (SCUMM, Sky, Glk), use AdvancedDetector for identifying the right game variant. We need to declare a virtual method in the MetaEngineDetection class, which all other detection classes inherit from, so that we can define it in the various detector classes.

While I haven’t implemented the dumping for the special engines yet, for engines that use AdvancedDetector, I overrode the virtual method that I declared in the MetaEngineDetection class and returned the protected _gameDescriptors variable.

This function is then called for every engine, and the data is formatted and dumped into DAT files, which will be run through our parser and inserted into the DB. Quite the pipeline!

That’s all for now, hope you check in next week for more progress!

Thanks for reading!


Week 2

My expectations for the week

It’s the second week of the GSoC summer! After creating the database as defined by the schema, it was time to populate it with some real values!

This week was dedicated to writing the parser for DAT files, inserting them into the db, and a CLI tool to create DATs from directories containing game files.

Part 1: Parsing DAT files

The first order of business was to actually come up with a good way to parse the DAT files. The hardest thing to do was to get the text inside the outermost brackets.

I could simply try something like \((.|\s)+\) to match opening and closing brackets, but that would end up matching the very first ‘(’ and the very last ‘)’ – something we don’t want if the file has multiple top-level brackets. So I had to get a little creative.

I had done some research in Week 1, and decided that recursive regex was the best option to keep the code small and maintainable (it’s not really a regular grammar anymore, but that’s beside the point). PHP uses a regex parser that is based on PCRE (Perl Compatible Regular Expressions), so recursion is built-in.

Now, to figure out how that works…


This is what I came up with: \((?:[^)(]+|(?R))*+\). The start and end are quite similar to my first guess, but let’s take a look at the inner part of the expression.

The (?:[^)(]+|(?R))*+ is a non-capturing group, and it matches either [^)(]+ or (?R) zero or more times. [^)(]+ simply matches anything that isn’t a bracket, but (?R) is the real special sauce. It will cause the pattern to recursively match itself, which means any nested brackets are matched in this group.

This leaves only the top-level closing bracket to match the outermost (the first) opening bracket, giving us exactly what we want!

regex101 – Yay it works!

Once we have this data, we can extract the checksum data inside the brackets using a much simpler parsing technique – splitting by spaces. Checksum data is in the format rom ( values ). We then split the values by spaces to get the name, size, and checksum value. Quite straightforward compared to what we just did! We can store this data as key-value pairs, which are called associative arrays in PHP.
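That inner parsing step might look like this in Python, where shlex stands in for the space-splitting since it also copes with quoted filenames containing spaces:

```python
import shlex

def parse_rom_values(values):
    """Turn the space-separated key-value text found inside
    rom ( ... ) into a dict, e.g. 'name "Game File" size 123
    md5 abc'. shlex keeps quoted filenames together as one token.
    """
    tokens = shlex.split(values)
    # alternate tokens are keys and values
    return dict(zip(tokens[::2], tokens[1::2]))
```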

Part 2: Inserting the data into the DB

I’ll keep this part short – I simply needed to loop through the data, extract the metadata we need (only the engine name for now) and insert into the right tables.

Everything was easy to do, but when I was testing it out with large DAT files, it took forever to actually run. Why? Because insert queries, when executed one by one, are very slow. The largest of the DAT files have ~100k files with 3 checksums each, and each file needs 4 insertions. That’s well over a million queries.

The fix for this was easy enough – just wrap it all in a transaction. This reduced the running time to a much more manageable 2 minutes. Good enough for now!
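Here's the same speedup demonstrated with Python's sqlite3. The real backend is MySQL driven from PHP, but the principle – one transaction (and a batched insert) instead of one commit per INSERT – is identical:

```python
import sqlite3

def bulk_insert(rows):
    """Insert all rows inside a single transaction and return the
    resulting row count. The `file` table here is a stand-in for
    the real schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file (name TEXT, size INTEGER)")
    with conn:  # one transaction wrapping every insert
        conn.executemany("INSERT INTO file VALUES (?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM file").fetchone()[0]
```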

Part 3: CLI application

The CLI application is still a work in progress at the moment, but I wanted to mention it here since I got the most important functionality out of the way – calculating the checksums of all the files in a given directory.

This gives the devs that have game files an easy way to create DAT files similar to the ones we were parsing earlier, so that they can then add the checksum data into the database.

There’s still stuff left to do – along with actually creating the interface part of the application, I also have to write the data into a DAT file. Shouldn’t be too hard, since it is basically the inverse of the parsing functionality we made earlier.

That’s all for now, hope you check in next week for more progress!

Thanks for reading!


Week 1

My expectations for the week

The first week of the official coding period has arrived!

This week I wanted to focus on the implementation of the DB schema, and filling it with its initial seed values.

Part 1: Database Schema

At the start of the summer, the ScummVM team gave me a database schema that they had decided on for the system, and tasked me with implementing the schema in the form of a MySQL database using PHP for the backend.

I spent the first day trying to decipher the complex-looking diagram and converting it to code. First order of business was to brush up on my database design knowledge. Figuring out the purpose of each table, understanding the relationships between their entities, and poking and prodding to see if I could spot any holes in the design. (Nothing as of yet!)

I wrote down the queries to create the db and its tables with the correct relations, but it was still missing something major – data!

Part 2: Adding data to the DB

I spent day 2 and 3 on creating dummy values for the db in an attempt to properly understand how the various entities related to each other. I ended up finding some issues with my schema implementation, so it was worth the effort!

The rest of the week was spent on a more pressing task – parsing checksum data from DAT files, created in clrmamepro, into the database. This is something I had no idea how to do in PHP, so I spent quite a while learning the ropes on how to handle strings and regex in PHP. (It’s surprisingly easy to do.)

While this part is not quite finished, I have a good idea of how to break up the data and pass it to insertion queries that will eventually send it to the db.

All in all, it’s been quite an eventful week, and I learnt a lot! I’ll probably post again soon once I finish up parsing the DAT files – hope you check in for that!

Thanks for reading!


Week 0


Hello! My name is Abhinav Chennubhotla, and throughout the GSoC summer I’m going to be working on a system for verifying the integrity of game files in ScummVM, validating their checksums against those present in a central database.

Throughout the summer, I’ll also be blogging my progress here at least weekly, hopefully more.

Project on the GSoC website. Find me on the ScummVM discord at #gsoc-integrity.


ScummVM relies on users to acquire their own game files, which quite often come from old, unreliable media. This project will help users confirm the integrity of the files on their computer, and will help developers determine if bugs are caused by issues in the game files, or ScummVM itself.


My tasks for the summer include:

  • Creating and implementing a database schema for storing checksums.
  • Creating a backend to work with the DB.
  • Building tools – both a website and CLI application – for developers to mass-populate the DB.
  • Creating and deploying an API for the aforementioned tools as well as ScummVM (for the end users) to communicate with the backend.
  • Adding the requisite functionality to ScummVM to calculate checksums, send requests, and receive responses.
    Also add the necessary GUI buttons and such.

The database schema and the API spec are mostly already defined. The website design will be somewhat similar to Adminer.

That’s all for now. Make sure to check in later for a continuation for my journey!

Thanks for reading!