Week 2

Welcome to the weekly blog.
After wrapping up work on macfiles last week, I finally moved on to testing the first two phases of the data upload workflow, starting with scummvm.dat (which contains detection entries from ScummVM, used to populate the database) and set.dat (data from older collections, which provides additional information).

Database Changes: File Size Enhancements

Before diving into the testing, there was a change Sev asked for — to extend the database schema to store three types of file sizes instead of just one. This was necessary due to the nature of macfiles, which have:

  • A data fork

  • A resource fork

  • A third size: the data section of the resource fork itself

This change required significant modifications to db_functions.py, which contains the core logic for working with .dat files, so I had to be careful to ensure nothing broke during the transition.
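For illustration, the migration boils down to something like the following. This is only a sketch: the table and column names are my shorthand rather than the exact schema, and I'm assuming a MySQL backend accessed via pymysql.

```python
import pymysql

# Hypothetical sketch of the schema change. The existing `size`
# column keeps the data-fork size; the two new columns hold the
# resource-fork size and the size of its data section.
conn = pymysql.connect(host="localhost", user="scummvm",
                       password="...", database="integrity")
with conn.cursor() as cursor:
    cursor.execute("ALTER TABLE file ADD COLUMN `size-r` BIGINT DEFAULT 0")
    cursor.execute("ALTER TABLE file ADD COLUMN `size-rd` BIGINT DEFAULT 0")
conn.commit()
```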

Punycode Encoding Logic Fixes

At the same time, I fixed the punycode logic in db_functions.py. Punycode encoding (or rather, an extended version of the standard used for internationalized domain names) is employed by ScummVM to convert filenames into filesystem-independent, ASCII-only representations.

There were inconsistencies between the punycode logic in db_functions.py and the original implementation in the Dumper Companion. I brought the two implementations into alignment and ran the Dumper Companion's unit tests to verify correctness.
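As a rough illustration of the idea (not ScummVM's exact variant, which also escapes characters that are unsafe on common filesystems), Python's built-in punycode codec shows the core transformation:

```python
def encode_filename(name: str) -> str:
    """Toy sketch: escape a filename to ASCII, punycode-style.

    ScummVM's real variant (mirrored from the Dumper Companion)
    extends standard punycode; this only shows the basic
    non-ASCII case.
    """
    if all(ord(c) < 128 for c in name):
        return name  # already plain ASCII, leave untouched
    # "xn--" marks an encoded name, as in internationalized domains
    return "xn--" + name.encode("punycode").decode("ascii")

print(encode_filename("Icône"))  # prints an ASCII-only "xn--..." name
```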

Feeding the Database – scummvm.dat

With those fixes in place, I moved on to populating the database with data from scummvm.dat. While Sev was working on the C++ side to add the correct filesize tags to detection entries, I ran manual tests using existing data. The parsing logic worked well, though I had to add support for the new “extra size” fields.

Additionally, I fixed the megakey calculation, which is used later when uploading an updated scummvm.dat. The fix involved sorting files alphabetically before computing the key, so the result is consistent regardless of the order in which files appear.
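A minimal sketch of the idea; the dict layout and field names here are invented for illustration, and the real calculation in db_functions.py differs in its exact inputs:

```python
import hashlib

def calc_megakey(fileset: dict) -> str:
    # Sorting by filename guarantees the same fileset always produces
    # the same megakey, whatever order the .dat file listed it in.
    parts = [fileset["name"]]
    for f in sorted(fileset["rom"], key=lambda f: f["name"].lower()):
        parts.append(f"{f['name']}:{f['md5']}")
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()
```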

I also introduced a small optimization: if a file is smaller than 5000 bytes, we can safely assume that all checksum types (e.g., md5-full_file, md5-5000B, md5-tail-5000B, md5-oneMB, or the macfile variants like -d/-r) will be the same, because each size-limited checksum then covers the entire file. In such cases, we now automatically fill every checksum field with the value used in detection.
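Sketched in Python, with the key names being illustrative rather than the exact ones used in db_functions.py:

```python
CHECKSUM_KEYS = ("md5-full_file", "md5-5000B", "md5-tail-5000B", "md5-oneMB")

def fill_small_file_checksums(entry: dict) -> None:
    # Under 5000 bytes, the full file, the first 5000 bytes, the last
    # 5000 bytes, and the first megabyte are all the same bytes, so
    # one detection checksum stands in for every variant.
    if entry["size"] < 5000 and "md5-5000B" in entry:
        for key in CHECKSUM_KEYS:
            entry.setdefault(key, entry["md5-5000B"])
```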

Uploading and Matching – set.dat

Finally, I worked on uploading set.dat to the database. A set.dat file usually contains the following: metadata (mostly irrelevant), full-file checksums only, and file sizes.

    • scummvm.dat doesn’t contain full-file checksums the way set.dat does, so a match between files from set.dat and scummvm.dat is only possible when a file’s size is less than the detection checksum size, generally 5000 bytes; only then do the full-file and detection checksums cover the same bytes. (See the sketch after this list.)
    • A match transitions the status from “detection” to “partial”: we now know all the files in the game, but not all checksum types.

    • If there is no match, we create a new entry in the database with the status “dat”.
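Putting the three cases together, the matching pass looks roughly like this. The table and column names and the dict layout are my shorthand, and the real logic in db_functions.py is more involved:

```python
def upload_set_entry(cursor, entry: dict) -> None:
    # Hypothetical sketch of the set.dat matching pass.
    if entry["size"] < 5000:
        # Below the detection checksum size, the full-file md5 from
        # set.dat equals the md5-5000B detection checksum.
        cursor.execute(
            "SELECT fileset FROM filechecksum WHERE checksum = %s",
            (entry["md5"],),
        )
        row = cursor.fetchone()
        if row is not None:
            # Match found: we now know every file in the game,
            # though not every checksum type -> status "partial".
            cursor.execute(
                "UPDATE fileset SET status = 'partial' WHERE id = %s",
                (row[0],),
            )
            return
    # No match (or file too large to compare): record the entry
    # with status "dat" for later processing.
    cursor.execute(
        "INSERT INTO fileset (status, src) VALUES ('dat', 'set.dat')"
    )
```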

Fixes

There was an issue with the session variable @fileset_last, which was mistakenly referencing the filechecksum table instead of the latest entry in the fileset table. This broke the logic for matching entries.
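In MySQL terms the fix is small; a simplified sketch of the corrected bookkeeping, with approximate column names:

```python
def insert_fileset(cursor, status: str, src: str) -> None:
    # Sketch: @fileset_last must point at the row just inserted
    # into `fileset`, not at anything in `filechecksum`.
    cursor.execute(
        "INSERT INTO fileset (status, src) VALUES (%s, %s)",
        (status, src),
    )
    cursor.execute("SET @fileset_last = LAST_INSERT_ID()")
```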

When a detection matched a file, only one checksum was previously being transferred. I fixed this to include all relevant checksums from the detection file.
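Conceptually, the fix turns a single-checksum copy into a loop over every variant; a sketch with invented names:

```python
def transfer_checksums(cursor, detection_checksums: dict, file_id: int) -> None:
    # Copy every checksum variant from the matched detection file,
    # not just the first one (the previous behaviour).
    for checktype, checksum in detection_checksums.items():
        cursor.execute(
            "INSERT INTO filechecksum (file, checktype, checksum) "
            "VALUES (%s, %s, %s)",
            (file_id, checktype, checksum),
        )
```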

Bug Fixes and Improvements

Fixed redirection logic in the logs: previously, when a matched detection entry was removed, the log URL still pointed to the deleted fileset ID. I updated this to redirect correctly to the matched fileset.

Updated the dashboard to show unmatched dat entries. These were missing earlier because the SQL query used an INNER JOIN with the game table, and since set.dat filesets don’t have game table references, they were filtered out. I replaced it with a LEFT JOIN on fileset to include them.
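Roughly, the query change looks like this (column names approximate):

```python
# Approximate before/after of the dashboard query.
OLD_QUERY = """
    SELECT fileset.id, game.name
    FROM fileset
    JOIN game ON game.id = fileset.game       -- drops filesets with no game row
"""
NEW_QUERY = """
    SELECT fileset.id, game.name
    FROM fileset
    LEFT JOIN game ON game.id = fileset.game  -- keeps unmatched dat entries
"""
```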


That’s everything I worked on this past week. I’m still a bit unsure about the set.dat matching logic, so I’ll be discussing it further with Sev to make sure everything is aligned.

Thanks for reading!
