
Week 4

Welcome to this week’s blog update. This week was focused on fixing bugs related to seeding and set.dat uploading, improving filtering mechanisms, and rewriting parts of the processing logic for scan.dat.

After some manual checks with the recent set.dat updates, a few issues surfaced:

1. Identical Detection Fix

Some detection entries had identical file sets, sizes, and checksums, which led to issues during the automatic merging of set.dat. Previously, we used a megakey to prevent such duplicates, but since it included not only file data but also language and platform information, some identical versions still slipped through.
To solve this, I replaced the megakey check with a more focused comparison based on filename, size, and checksum only, and added logging of the details of any clashing fileset.
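As a rough illustration (function and field names here are hypothetical, not the actual db_functions.py code), the duplicate check now boils down to something like this:

```python
import hashlib

def fileset_key(files):
    """Build a key from filename, size and checksum only, ignoring the
    language/platform metadata that the old megakey also included."""
    parts = sorted(f"{f['name']}:{f['size']}:{f['md5']}" for f in files)
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

def is_duplicate(new_files, existing_keys, log):
    """Skip a fileset whose key was already seen, logging the clash."""
    key = fileset_key(new_files)
    if key in existing_keys:
        log.warning("Duplicate fileset skipped, clashes with: %s", existing_keys[key])
        return True
    existing_keys[key] = [f["name"] for f in new_files]
    return False
```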

2. Punycode Encoding Misplacement

Filenames were being Punycode-encoded every time they were processed for database insertion. This encoding isn't needed at that point; it should happen earlier, ideally at the parsing stage, either in the scanning utility that generates the .dat files or in the application's upload interface. I have removed the encoding during database updates. Adding it on the scan utility side is still pending, and I'll do that this week.

3. Path Format Normalization

Another issue was related to inconsistent file paths. Some set.dat entries used Windows-style paths (xyz\abc), while their corresponding detection entries used Unix-style (xyz/abc). Since filtering was done using simple string matching, these mismatches caused failures. I resolved this by normalizing all paths to use forward slashes (/) before storing them in the database.
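A minimal sketch of the normalization step:

```python
def normalize_path(path):
    """Convert Windows-style separators to forward slashes before the
    path is stored or compared."""
    return path.replace("\\", "/")

# Example: both styles now compare equal after normalization.
assert normalize_path("xyz\\abc") == normalize_path("xyz/abc")
```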

4. Improving Extra File Detection with cloneof and romof

While analyzing filesets, I encountered a previously unnoticed cloneof field (similar to romof). These fields indicate that extra files might be listed elsewhere in the .dat file. The previous logic only looked in the resource section, but I found that:

  • Extra files could also exist in the game section.

  • The file references could chain across multiple sections (e.g., A → B → C).

To address this, I implemented an iterative lookup for extra files, ensuring all relevant files across multiple levels are properly detected.
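A minimal sketch of such an iterative lookup, assuming the parsed sections are available as dicts keyed by section name (names and data layout are illustrative):

```python
from collections import deque

def collect_extra_files(fileset, sections_by_name):
    """Follow cloneof/romof references across sections (A -> B -> C ...)
    and gather extra files from every referenced game/resource section."""
    extra, seen = [], set()
    queue = deque([fileset])
    while queue:
        current = queue.popleft()
        for ref_key in ("cloneof", "romof"):
            parent_name = current.get(ref_key)
            if parent_name and parent_name not in seen:
                seen.add(parent_name)
                parent = sections_by_name.get(parent_name)
                if parent is not None:
                    extra.extend(parent.get("files", []))
                    queue.append(parent)  # keep following the chain
    return extra
```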

scan.dat Processing Improvements

For scan.dat, I introduced a file update strategy that runs before full fileset matching. All files that match based on file size and checksum are updated first. This allows us to update matching files early, without relying solely on complete fileset comparisons.
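A minimal sketch of that early-update pass, assuming illustrative table and column names rather than the real schema:

```python
def update_matching_files(conn, scanned_files):
    """Before full fileset matching, copy the extra checksums from scan.dat
    onto any stored file that already matches on size and full-file checksum."""
    cur = conn.cursor()
    for f in scanned_files:
        # Find stored files that match on size and full-file checksum.
        cur.execute(
            "SELECT id FROM file WHERE size = %s AND checksum = %s",
            (f["size"], f["md5_full"]),
        )
        for (file_id,) in cur.fetchall():
            # Attach every checksum type reported by the scan utility.
            for checktype, value in f["checksums"].items():
                cur.execute(
                    "INSERT INTO filechecksum (file, checktype, checksum) "
                    "VALUES (%s, %s, %s)",
                    (file_id, checktype, value),
                )
    conn.commit()
```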

Minor Fixes & UI Enhancements

  • Prevented reprocessing of set.dat filesets on subsequent runs when their key already exists.

  • Passed the --skiplog CLI argument through to set.dat processing to suppress verbose logs during fileset creation and automatic merging.

  • Improved filtering in the dashboard by adding more fields such as engineid, transaction number, and fileset id, and fixed some older issues.

  • Introduced a new “Possible Merges” button in the filesets dashboard to manually inspect and confirm suggested merges. This feature is backed by a new database table that stores fileset matches for later manual review.
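As a rough illustration of the backing table, a hypothetical schema sketch (the real table and column names may differ):

```python
# Illustrative DDL only; not the actual schema used by the project.
POSSIBLE_MERGES_TABLE = """
CREATE TABLE IF NOT EXISTS possible_merges (
    id INT AUTO_INCREMENT PRIMARY KEY,
    child_fileset INT NOT NULL,    -- fileset from the uploaded set.dat
    parent_fileset INT NOT NULL,   -- candidate fileset it may merge into
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""
```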


Week 3

Welcome to this week’s blog update.

Most of this week was spent rewriting the logic for processing set.dat files, as the previous implementation had several inconsistencies. As shown in Figure 1, the earlier logic compared checksums directly to determine matches. However, this only worked when the file size was smaller than the detection checksum size (typically 5000 bytes for md5-5000), since set.dat files only include full-file checksums. This caused us to miss many opportunities to merge filesets that could otherwise be uniquely matched by filename and size alone.

Fig.1 : Previous query used in matching.

Since set.dat files only contain entries that are already present in the detection results (with the exception of some rare variants I discovered later in the week), we should typically expect a one-to-one mapping in most cases. However, some filesets in set.dat can correspond to multiple candidate entries. This happens when the name and size match but the checksum differs, often due to file variants. This case needs to be handled with a manual merge.

Previously, the logic for handling different .dat file types was tightly coupled, making it hard to understand and maintain. I started by decoupling the logic for set.dat entirely. Now, candidates for a set.dat fileset match are filtered by engine name, filename, and file size (when the size is not -1), with the checksum excluded. The filtering also ensures that every detection file (a file with the detection flag set to 1) satisfies these conditions.
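A simplified sketch of the new candidate filtering, with illustrative table and column names:

```python
def find_candidates(cur, engine_name, setdat_files):
    """Return fileset ids whose detection files (detection = 1) all match
    the set.dat entry on filename and, when known, file size; checksums
    are deliberately ignored at this stage."""
    provided = {f["name"].lower(): f["size"] for f in setdat_files}
    cur.execute(
        "SELECT fs.id, fl.name, fl.size FROM fileset fs "
        "JOIN game g ON g.id = fs.game "
        "JOIN engine e ON e.id = g.engine "
        "JOIN file fl ON fl.fileset = fs.id "
        "WHERE e.name = %s AND fl.detection = 1",
        (engine_name,),
    )
    by_fileset = {}
    for fileset_id, name, size in cur.fetchall():
        by_fileset.setdefault(fileset_id, []).append((name.lower(), size))

    def matches(name, size):
        # Size -1 means the size is unknown and is not compared.
        return name in provided and (size == -1 or provided[name] == size)

    return [fs for fs, files in by_fileset.items()
            if all(matches(n, s) for n, s in files)]
```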

 

Initially, I kept only the candidate fileset with the highest number of matching files, assuming it was the correct one. However, that approach isn’t reliable: sometimes the correct match isn’t in the largest group. So all of these candidates now go to manual merge. Only when all checksums match across candidates can we be confident in an automatic match.

I also added logic to handle partial or full matches of candidate filesets. This can happen when a set.dat is reuploaded with changes. In such cases, all files are compared: if there’s no difference, the fileset is dropped. If differences exist, the fileset is flagged for manual merge.

Finally, I handled an issue with Mac files in set.dat. These files aren’t correctly represented there: they lack prefixes, and their checksums are computed over the full file rather than over individual forks. Such filesets are dropped early by checking whether no candidates remain after the SQL filtering.

Other Updates

During seeding, I found some entries with the same megakey, differing only by game name or title. Sev advised treating them as a single fileset. So now, only the first such entry is added, and the rest are logged as warnings with metadata, including links to the conflicting fileset.

Other fixes this week included:

  • Removing support for m-type checksums entirely (Sev also removed them from the detections).

  • Dropping sha1 and crc checksums, which mainly came from set.dat.

Next Steps

With the seeding logic refined, the next step is to begin populating the database with individual set.dat entries and confirm everything works as expected.

After that, I’ll start working on fixing the scan.dat functionality. This feature will allow developers to manually scan their game data files and upload the relevant data to the database.


Week 2

Welcome to the weekly blog.
After wrapping up work on macfiles last week, I finally moved on to testing the first two phases of the data upload workflow, starting with scummvm.dat (which contains detection entries from ScummVM and is used to populate the database) and set.dat (data from older collections, which provides more information).

Database Changes: File Size Enhancements

Before diving into testing, there was a change Sev asked for: extending the database schema to store three types of file sizes instead of just one. This was necessary due to the nature of macfiles, which have:

  • A data fork

  • A resource fork

  • A third size: the data section of the resource fork itself

This change introduced significant modifications to db_functions.py, which contains the core logic for working with .dat files. I had to be careful to ensure nothing broke during this transition.
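As a rough sketch, and assuming hypothetical column names, the schema change amounts to adding two extra size columns alongside the existing one:

```python
# Illustrative only; the real column names in the schema may differ.
ADD_SIZE_COLUMNS = [
    "ALTER TABLE file ADD COLUMN `size-r`  BIGINT DEFAULT 0",   # resource fork size
    "ALTER TABLE file ADD COLUMN `size-rd` BIGINT DEFAULT 0",   # data section of the resource fork
]
# The existing `size` column continues to hold the data-fork size.
```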

Punycode Encoding Logic Fixes

At the same time, I fixed the punycode logic in db_functions.py. Punycode encoding (or rather, an extended version of the standard used for internationalized domain names) is employed by ScummVM to convert filenames into filesystem-independent, ASCII-only representations.

There were inconsistencies between punycode logic in db_functions.py and the original implementation in the Dumper Companion. I made sure both implementations now align, and I ran unit tests from the Dumper Companion to verify correctness.

Feeding the Database – scummvm.dat

With those fixes in place, I moved on to populating the database with data from scummvm.dat. While Sev was working on the C++ side to add the correct filesize tags for detections, I ran manual tests using existing data. The parsing logic worked well, though I had to add support for the new “extra size” fields.

Additionally, I fixed the megakey calculation, which is used later when uploading the scummvm.dat again with updates. This involved sorting files alphabetically before computing the key to ensure consistent results.
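A minimal sketch of the idea (field names such as rom, platform, and language are illustrative, not the exact db_functions.py code):

```python
import hashlib

def calc_megakey(fileset):
    """Sort files by name before hashing so the megakey stays stable
    regardless of the order files appear in scummvm.dat."""
    key_string = f"{fileset['platform']}:{fileset['language']}"
    for f in sorted(fileset["rom"], key=lambda x: x["name"].lower()):
        key_string += f":{f['name']}:{f['size']}:{f['md5']}"
    return hashlib.md5(key_string.encode("utf-8")).hexdigest()
```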

I also introduced a small optimization: if a file is less than 5000 bytes, we can safely assume that all checksum types (e.g., md5-full_file, md5-5000B, md5-tail-5000B, md5-oneMB, or the macfile variants like -d/-r ) will be the same. In such cases, we now automatically fill all checksum fields with the same value used in detection.
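A small sketch of that optimization, with illustrative checksum-type names:

```python
# Illustrative checksum-type names; the real set of types may differ.
CHECKSUM_TYPES = ["md5", "md5-5000", "md5-t-5000", "md5-1M"]

def fill_checksums_for_small_file(file_entry, detection_checksum, size):
    """For files under 5000 bytes the prefix, tail and 1 MB checksums all
    cover the whole file, so every checksum type equals the detection value."""
    if size < 5000:
        for checktype in CHECKSUM_TYPES:
            file_entry.setdefault(checktype, detection_checksum)
    return file_entry
```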

Uploading and Matching – set.dat

Finally, I worked on uploading set.dat to the database. A set.dat usually contains the following: metadata (mostly irrelevant), full-file checksums only, and file sizes.

  • scummvm.dat doesn’t contain full-file checksums like set.dat does, so a match between files from set.dat and scummvm.dat is only possible when a file’s size is less than the detection checksum size, generally 5000 bytes.

  • Such a match transitions the status from “detection” to “partial”: we now know all files in the game, but not all checksum types.

  • If there is no match, we create a new entry in the database with the status dat.

Fixes:

There was an issue with the session variable @fileset_last, which was mistakenly referencing the filechecksum table instead of the latest entry in the fileset table. This broke the logic for matching entries.

Previously, when a detection matched a file, only one checksum was transferred. I fixed this to include all relevant checksums from the detection file.

Bug Fixes and Improvements

Fixed redirection logic in the logs: previously, when a matched detection entry was removed, the log URL still pointed to the deleted fileset ID. I updated this to redirect correctly to the matched fileset.

Updated the dashboard to show unmatched dat entries. These were missing earlier because the SQL query used an inner JOIN with the game table, and since set.dat filesets don’t have game table references, they were filtered out. I replaced it with a LEFT JOIN on fileset to include them.
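A rough sketch of the kind of query change involved (table and column names are illustrative):

```python
# Old query: the INNER JOIN silently dropped filesets without a game reference.
OLD_QUERY = """
SELECT fs.id, g.name FROM fileset fs
JOIN game g ON g.id = fs.game
"""

# New query: the LEFT JOIN keeps set.dat filesets, with g.name as NULL.
NEW_QUERY = """
SELECT fs.id, g.name FROM fileset fs
LEFT JOIN game g ON g.id = fs.game
"""
```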


That’s everything I worked on this past week. I’m still a bit unsure about the set.dat matching logic, so I’ll be discussing it further with Sev to make sure everything is aligned.

Thanks for reading!


Week 1

Welcome to this week’s blog. Most of the time this week was spent fixing the portability of Mac files. The plan was to test the handling of Mac files on both the Python and C++ sides. While checking the C++ side, we realised halfway through that some code was broken and giving incorrect results. So Sev decided to take a look at it himself, while I started working on the same task on the Python side.

On the Python side, the code had three main issues:

  • Not all Mac file variants were being covered.

    Fig. 1 : 7 Mac file variants ( Image taken from macresman.cpp -> MacResManager::open() )
  • Instead of using the data section of the resource fork, the entire resource fork was being used for the checksum calculations, which was different from what the C++ side was doing.

    Fig. 2 : The data section of the resource fork had to be separately extracted
  • There was no file filtering, which caused problems when Mac files were present – specifically, AppleDouble and raw resource fork files, which had their forks spread over multiple files. Instead of showing a single file entry with all the checksums, extra entries were incorrectly displayed as non-Mac files.
Fig. 3 : First file entry should not be a part of this game entry.

I corrected all these issues. For filtering, I added 7 different categories for each file – NON_MAC, MAC_BINARY, APPLE_DOUBLE_RSRC, APPLE_DOUBLE_MACOSX, APPLE_DOUBLE_DOT_, RAW_RSRC and ACTUAL_FORK_MAC.
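For illustration, a heavily simplified sketch of how files could be bucketed into these categories; the real checks (for example full MacBinary header validation and actual resource-fork detection) are more involved:

```python
import os

def is_macbinary(path):
    """Very rough MacBinary sniff: 128-byte header starting with a zero byte
    and a Pascal-string filename (the real check also validates the CRC)."""
    try:
        with open(path, "rb") as f:
            header = f.read(128)
    except OSError:
        return False
    return len(header) == 128 and header[0] == 0 and 1 <= header[1] <= 63

def classify_file(path):
    """Bucket a path into one of the categories used for filtering."""
    name = os.path.basename(path)
    parent = os.path.basename(os.path.dirname(path))
    if parent == "__MACOSX":
        return "APPLE_DOUBLE_MACOSX"
    if name.startswith("._"):
        return "APPLE_DOUBLE_DOT_"
    if name.endswith(".rsrc"):
        return "RAW_RSRC"
    if parent == ".rsrc":
        return "APPLE_DOUBLE_RSRC"
    if is_macbinary(path):
        return "MAC_BINARY"
    # ACTUAL_FORK_MAC would require querying the filesystem's resource fork.
    return "NON_MAC"
```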

Fig. 4 shows consistent output for all the Mac file variants. The next task is to create proper test suites for verification and to check the workflow with the C++ side.

Fig. 4 : Checksum calculation of all 7 macfile variants on python side

Thank you for reading.


Week 0

Hi, I’m Shivang Nagta, a pre-final year Computer Science undergraduate. I’ll be sharing my weekly blogs here, with updates on my GSoC project — “System for Checking Game Files Integrity.”

My mentors for this project are Sev and Rvanlaar, and I’m really grateful to have them guiding me. This project has been part of the last two GSoC years, so a lot of work has already been done. Here’s the current status:

Work done by the previous developers :
1. Server Side – 
The server has been written in Flask. There’s a dashboard for proper visualization. The database schema and logic for feeding/updating the database have been implemented.

2. Client Side / ScummVM App :
There’s a Check Integrity button in the ScummVM application, which hits the server endpoint for validation with the checksum data of the game files.

Work done by me previously :
1. Client Side / ScummVM App :

  • Fixed the freezing issue in the Check Integrity dialog box. It was caused by the MD5 calculation of large files, which blocked synchronous screen updates. I solved it by implementing a callback system.
  • Engines like GLK and Scumm don’t use the Advanced Detector, so I worked on implementing a custom system to dump their detection entries. Some verification is still needed, as the current logic of these engines introduces complications in the implementation of the custom dumping systems.

2. Server Side :

  • I worked on two particular tasks: Punycode names and the different Mac files portability. Both tasks require final verification and testing. I’ve already mentioned them in the last section of the blog.

Work plan for Official Coding Phase:
1. Testing all the workflows on the server side :

  • Initial seeding by scummvm.dat (checksum data from the detection entries)
  • Uploading set.dat (checksum data from some old collections)
  • Uploading scan.dat (checksums uploaded by developers by scanning their local files with a command line utility provided on the server)
  • user.dat from the API (checksums coming from the client via the Check Integrity feature added to the ScummVM application)
  • Reupload scummvm.dat / set.dat

2. Moderation features :

  • Review the user submitted fileset
  • Have a list of unmatched fileset
  • Manual merge with search feature for a particular fileset ID followed by a merge screen
  • Remove filesets / undo changes, on a new upload (roll back feature)
  • Easy searching and filtering of filesets by different fields

3. Some fixes :

  • Different types of Mac files (like AppleDouble, MacBinary, and raw .rsrc) represent the forks of the same game data differently. The checksums of resource forks and data forks need to be extracted separately to create correct entries.
  • Often, filenames from one OS are not supported on another. To tackle this, Sev built a method on top of the classic Punycode encoding (used for internationalized domain names), but it needs proper integration and testing in this project.

Tomorrow marks the beginning of the official coding phase. Thank you for reading.