
GSoC: Final Blog

Goals of the Project

The aim of the project was to integrate a File Integrity Service into ScummVM, along with a supporting application for managing the database. The main goals were:

  1. Populating the database with the detection entries used by ScummVM for game detection, followed by existing game file collections containing full MD5 checksums.
  2. Developing a utility for scanning game directories and generating data to further populate the database.
  3. Allowing users to verify their game files through this integrity service, and contribute new game variants.
  4. Building an application for managing the server and the database with role based access.

What I Did

  • A large portion of the work involved rewriting the codebase, as the existing logic for filtering and matching the filesets was not correct.
  • Ensuring correctness while matching old game collections required a great deal of manual checking across more than 100 different engines.
  • For the scan utility, I extended support for legacy Mac file formats (AppleDouble, MacBinary, etc.), ensuring proper matching with filesets present in the database.
  • For the user integrity service, I worked on reporting information such as missing files, unknown files, mismatched files, and OK files. Further, I built a moderation queue system for user-submitted files and tightened the checks on user-submitted data so that only valid data can be submitted.
  • For the application, I added support for manually merging filesets and updating filesets within the application, and improved search filtering and logging. Further, I added a configuration page for customising some user display settings and integrated GitHub OAuth with role-based access control (Admins, Moderators, and Read-only users).

Current State and What’s Left

The application is in a complete working state. All the workflows are functioning properly. The remaining step is to populate the database with all existing data so the integrity service can start operating officially.

Code that got merged

List of PRs that were merged:

Server Side:

Github OAuth

Extending Web app features

User File Integrity Check, Web App updates and Project Restructuring

Scan Utility and Scanned fileset matching

Ruff formatter and linter

Macfiles support, Initial Seeding and matching with old game collections

Punycode

Readme Update

ScummVM:

Freeze issue for Integrity Dialog


Challenges and Learnings

The most challenging part was handling the variations across engines and game variants, and ensuring correctness in the filtering and matching process while populating the database. This often required manual validation of filesets. Working on this project taught me about the level of care needed to maintain the code and the importance of sharing my thoughts with the team. It was a highly rewarding experience working with ScummVM, and I am very grateful to my mentors Sev and Rvanlaar for their guidance and support throughout the project.


Week 12

Welcome to this week’s blog. This week, I added features related to updating metadata as well as file data directly from the application UI, along with some smaller fixes and improvements.

For any fileset, you can now update metadata fields directly from the UI. For user filesets in particular, there is an additional step of adding metadata first, particularly gameid and engineid, since these require creating entries in separate tables. To make filling metadata easier, I also added a dropdown feature that displays all existing values for a field from the database. This way, moderators can either type in a new value or directly choose an existing one. In addition to metadata, I added functionality to update individual files as well. This can be useful for tasks such as manually marking a file as a detection file or updating other fields.

For better reliability, confirmation dialogs have been added for most buttons, such as deleting/updating files and adding/updating metadata. Further, a separate button has been added for deleting the entire fileset. Another improvement is the ability to delete in bulk all filesets that appear in a filtered search result on the fileset search page.

To enhance logging for scanned files, a new field called data_path has been introduced. This field stores the relative path of the game directory, which is particularly useful when multiple files are scanned at once. This information can later be included in scan.dat related logs.

Lastly, I added an endpoint for sending a fileset ID as a mail notification. This is supposed to be triggered from the mail server whenever a user submits any fileset-related information, using a predefined mail structure in the ScummVM application. (This feature has not yet been integrated with the mail server.)


Week 11

Welcome to this week’s blog. This week, I worked on testing the workflow for scan data as well as the user file integrity service.

For the scan data, we tested with the WAGE game archives, which provided a good opportunity to test both the scan utility and the scan matching for Mac files. Some fixes were indeed needed for the matching process. Initially, I was using both size (data fork size) and size-rd (resource fork’s data section size) simultaneously while filtering filesets. However, this was incorrect, since detection filesets only contain one of these at a time. Additionally, I fixed how matched entries were being processed. Previously, entries matched with detection were placed for manual merge to add specific files while avoiding unnecessary ones like license or readme files from commercial games. However, it made more sense to merge them automatically and later remove such files if necessary—especially since, for archives like WAGE, the issue of extra files from commercial games would not occur.

I also carried out testing for the user integrity service, focusing on different response cases:

  1. All files are okay when a full fileset matches.

  2. Extra files are present.

  3. Some files are missing.

Another missing piece was reporting files with checksum mismatches, which previously were being classified under extra files. This is now fixed. I also reviewed the manual merge process for user filesets. Unlike set filesets, the source fileset (the user fileset here) should not be deleted after a manual merge, since it could be a possible new variant that would need additional metadata information. To support this, I implemented a feature to update fileset metadata—though it still requires some refinement. An additional thing that I need to add is an endpoint in the web server that can be triggered by the mail server. This endpoint will provide the mail information, particularly the user fileset ID, for which the user has provided some additional information via the pre-drafted email that is prompted when the user uses the ‘check integrity’ feature in the ScummVM application.

A few other fixes this week included:

  • Deleting multiple files from a fileset through the dashboard: Previously, the query was being generated incorrectly. Instead of DELETE FROM file WHERE id IN ('1', '2', '3'), it was generating DELETE FROM file WHERE id IN ('1, 2, 3'), which, of course, did not work. This issue is now fixed (a parameterized sketch follows this list).

  • Search filter issue: A bug occurred when a single quote (‘) was used as a value in search filters, breaking the query due to missing escaping for the quote. This has also been fixed.
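For reference, here is a minimal sketch of how such a deletion can be parameterized safely (assuming a DB-API cursor with pyformat placeholders, e.g. pymysql; this is illustrative, not the project's exact code):

```python
def delete_files(cursor, file_ids):
    # Build one placeholder per id so the driver escapes each value
    # individually, instead of joining the ids into a single quoted string.
    placeholders = ", ".join(["%s"] * len(file_ids))
    query = f"DELETE FROM file WHERE id IN ({placeholders})"
    cursor.execute(query, file_ids)
```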


Week 10

Welcome to this week’s blog. This week, my work focused on enhancing API security, adding GitHub authentication, refining the project structure, and introducing a faster Python package manager (UV).

API Security Improvements

I implemented some checks on the validation endpoint, which processes the user game files data sent from the ScummVM application. These checks are designed to prevent any kind of brute-force attempts –

Checks on validation endpoint

On top of that, I introduced rate limiting using Flask-Limiter. Currently, the validation endpoint allows a maximum of 3 requests per minute per user.
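A minimal sketch of this kind of setup with Flask-Limiter looks roughly as follows (the route name and exact configuration here are illustrative assumptions, not the project's actual code):

```python
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
# Rate-limit clients by their remote address.
limiter = Limiter(key_func=get_remote_address, app=app)

@app.route("/validate", methods=["POST"])
@limiter.limit("3 per minute")  # mirrors the limit described above
def validate():
    # ... process the submitted game file data ...
    return "ok"
```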

GitHub OAuth & Role-Based Access

GitHub OAuth authentication is now in place, introducing a three-level role-based system. I have tested it with my own dummy organisation; the integration with the ScummVM organisation is still pending (a small access-check sketch follows the role list below):

  • Admin – Full access, plus the ability to clear the database.

  • Moderators – Same permissions as Admin, except database clearing.

  • Read-Only – Logged-in users with viewing rights only.

Github OAuth
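As a rough illustration of how three roles can gate endpoints, something along these lines can be used (a hypothetical sketch, not the actual implementation; the session key and route are assumptions):

```python
from functools import wraps
from flask import abort, session

ROLE_LEVELS = {"readonly": 0, "moderator": 1, "admin": 2}

def require_role(minimum):
    """Allow the request only if the logged-in user's role is at least `minimum`."""
    def decorator(view):
        @wraps(view)
        def wrapped(*args, **kwargs):
            role = session.get("role")
            if role is None or ROLE_LEVELS.get(role, -1) < ROLE_LEVELS[minimum]:
                abort(403)
            return view(*args, **kwargs)
        return wrapped
    return decorator

# Example usage: only admins may clear the database.
# @app.route("/clear_database", methods=["POST"])
# @require_role("admin")
# def clear_database(): ...
```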
Project Restructuring & UV Integration

As suggested by my mentor Rvanlaar, I restructured the project into a Python module, making the import logic cleaner and improving overall modularity. I also added UV, a high-performance Python package and project manager, offering faster dependency handling compared to pip.

Other Fixes & Improvements
  • Updated the apache config file to use the Python virtual environment instead of the global installation.

  • Correctly decode MacBinary filenames from headers using MacRoman instead of UTF-8 (see the sketch after this list).

  • Improved error handling for the scan utility.

  • Use either size or size-rd (not both simultaneously) when filtering filesets for scan.dat in the case of Mac files.
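For the MacBinary filename fix mentioned above, a minimal sketch looks like this (the MacBinary header stores the original filename as a Pascal string starting at offset 1, encoded in MacRoman; illustrative only, not the utility's exact code):

```python
def macbinary_filename(header: bytes) -> str:
    # Byte 1 holds the filename length; the name itself follows from byte 2.
    name_len = header[1]
    return header[2:2 + name_len].decode("mac_roman")
```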

Week 9

Welcome to this week’s blog. This week was a busy one due to my college workload, but I mostly focused on enhancing the webpage. I worked on the configuration page, the manual merge dashboard, filtering, search-related improvements, and more.

  • Configuration Page:
    I added a new configuration page that allows users to customize their preferences, including:

    • Number of filesets per page

    • Number of logs per page

    • Column width percentages for the fileset search page

    • Column width percentages for the log page

    All these preferences are stored in cookies for persistence.

    User Configuration Page
  • Manual Merge Dashboard:
    I performed some refactoring of the codebase for manual merging. Additionally, I added options to:

    • Show either all files or only the common ones

    • Display either all fields of the files, or just the full-size MD5 and size (or size-rd in the case of Mac files)

  • Search Functionality:
    I improved the search system with the following features (a small parsing sketch follows this list):

    • Exact match: Values wrapped in double quotes are matched exactly

    • OR search: Multiple terms separated by spaces are treated as an OR

    • AND search: Terms separated by + are treated as an AND

  • Sorting Enhancements:
    The sorting feature now includes three states for each column: ascending, descending, and default (unsorted).
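As a small illustration of the search grammar above, the terms can be classified roughly like this (a sketch only; the real implementation builds SQL filters from these pieces):

```python
def parse_search(value: str):
    # "..."  -> exact match, '+'-separated -> AND, space-separated -> OR
    value = value.strip()
    if value.startswith('"') and value.endswith('"') and len(value) >= 2:
        return ("exact", [value[1:-1]])
    if "+" in value:
        return ("and", [t.strip() for t in value.split("+") if t.strip()])
    return ("or", value.split())
```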

Minor Fixes & Improvements
  • Added favicon to display on the webpage tab
  • Implemented checksum-based filtering in the fileset search page
  • Included metadata information in seeding logs (unless --skiplog is passed)
Goals for Next Week
  • Add GitHub-based authentication
  • Implement a three-tier user system: admin, moderator, and read-only
  • Add validation checks on user data to prevent brute force attacks
  • Refactor the entire project into a Python module for better structure and cleaner imports

Week 8

Welcome to this week’s blog. This week, I primarily worked on the moderation system for user.dat, improved the merge dashboard page for manual merges, and made several other changes related to the website’s user interface.

Moderation System for user.dat

For every incoming user fileset submission, we check whether we already have a matching full fileset. If a full fileset match is found, we return the status of the files. If not, the submission is added to the database as a possible new fileset or a variant, which then goes through a moderation queue.

I worked on implementing this moderation queue. A submitted fileset is added to the review queue only if it is submitted by a minimum number of different users (currently set to three). To prevent redundant count increases from the same user, there is an IP-based check in place. We already had masking logic that anonymizes the last octet of an IPv4 address (e.g., 127.0.0.X) to avoid storing Personally Identifiable Information (PII).
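The masking itself is simple in spirit; roughly (a sketch, not the exact code):

```python
def anonymize_ip(ip: str) -> str:
    # Replace the last octet of an IPv4 address so no full address (PII) is stored.
    parts = ip.split(".")
    if len(parts) == 4:
        parts[3] = "X"
    return ".".join(parts)

# anonymize_ip("203.0.113.42") -> "203.0.113.X"
```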

Merge Dashboard

I also worked on enhancing the merge dashboard. The goal was to implement a page that displays both filesets side-by-side for comparison, including checksums, file sizes, and metadata. Moderators can choose the checksums and file sizes from either fileset to perform the manual merge. A check was also added to prevent selecting both options for the same comparison field.

However, the page still needs some optimization, as it currently freezes when dealing with a large number of files.

Other Improvements

Some of the other fixes and improvements include:

  • Correctly redirecting filesets after automatic or manual matches

  • Enhancing the homepage

  • Adding error handling while parsing .dat files

  • Implementing proper transaction rollbacks in case of failure

  • Fixing filtering issues that previously caused some data with null fields to not be handled correctly by the filters.


Week 7

Welcome to this week’s blog update. This week, I focused primarily on completing the scan.dat processing, as well as working on user.dat handling, which covers the actual data coming from the user side. The scan processing was almost complete by the previous week; the remaining task was to add a modification time field to the file table in the database and reflect that change in the frontend.

One significant fix was introducing checksum-based filtering at the very beginning of the filtering logic for all dats. Previously, I had placed it after the maximum matched files were already filtered, which did not align with ScummVM’s detection algorithm. Further, the detection entries from the SCUMM engine had 1 MB checksums, so checksum-based filtering worked really well there. Another improvement was how I handled duplicate entries. Initially, I was dropping all entries in case of duplicates. However, it’s more efficient to retain the first entry and discard the rest, reducing the need for manually adding extra detection entries for such cases.

Then I worked on user.dat, where I rewrote some of the matching logic. The filtering approach remains consistent with scan.dat, and I made minimal changes to the existing response logic. There is some work left related to the moderation queue for reviewing user data and IP-based checking.

Fig. 1 – Testing user.dat

Other fixes and improvements:

  • Parameterized all old SQL queries: I had postponed this task for a while, but finally sat down to parameterize them all.

  • Formatting compute_hash.py: I had avoided running the Ruff formatter on this file because it was interfering with specific parts of the scan utility code. However, thanks to a suggestion from rvanlaar, I used # fmt: off and # fmt: on comments to selectively disable formatting for those sections.


Week 6

Welcome to this week’s blog. This week, I primarily worked on the scan utility and the scan processing logic.

Scan Utility

The scan utility was mostly complete in the first week, but I added three more features:

  1. Modification Time Filter:
    I added the modification time for scanned files into the .dat file. A command-line argument now allows users to specify a cutoff time, filtering out files updated after that time (except files modified today).
    Extracting the modification time was straightforward for non-Mac files since it could be retrieved from the OS. However, for Mac-specific formats—specifically MacBinary and AppleDouble—I had to extract the modification time from the Finder Info (see the sketch after this list).

  2. Size Fields:
    I added all size types (size, size-r, and size-rd) in the .dat file.

  3. Punycode Path Encoding:
    Filepath components are now punycode-encoded individually.
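For the Mac-specific case, one possible way to read a MacBinary modification time is sketched below; the MacBinary header stores it as a 32-bit big-endian count of seconds since 1904-01-01 at offset 95 (this is an illustrative snippet, not the utility's exact code, and AppleDouble needs a different lookup):

```python
import struct
from datetime import datetime, timedelta, timezone

HFS_EPOCH = datetime(1904, 1, 1, tzinfo=timezone.utc)

def macbinary_mtime(header: bytes) -> datetime:
    # The modification date lives at offset 95 in the 128-byte MacBinary header.
    seconds = struct.unpack_from(">I", header, 95)[0]
    return HFS_EPOCH + timedelta(seconds=seconds)
```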

Scan Processing Logic

For processing scan.dat, the first improvement was updating the checksum of all files in the database that matched both the checksum and file size.

The rest of the processing is similar to the set.dat logic:
Filtering is used to find candidates with matching detection filenames, sizes, and additionally checksums. (A pseudocode sketch of the candidate handling follows the list below.)

  • Single Candidate:
    • If the candidate’s status is partial, it’s upgraded to full (files are updated in case they were skipped earlier due to missing size info).

    • If the candidate’s status is detection, and the number of files in scan.dat is equal, the status is set to full. Otherwise, it’s flagged for manual merge.

    • If the candidate status is already full, all files are compared, and any differences are reported.

  • Multiple Candidates:
    All candidates are added for manual merging.
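Put together as pseudocode, the decision logic looks roughly like this (a simplification, not the actual code):

```python
def handle_scan_candidates(candidates, scan_files):
    # Simplified sketch of the decisions described in the list above.
    if len(candidates) > 1:
        return "manual_merge"
    candidate = candidates[0]
    if candidate.status == "partial":
        return "upgrade_to_full"      # also fill in files skipped earlier
    if candidate.status == "detection":
        if len(candidate.files) == len(scan_files):
            return "set_full"
        return "manual_merge"
    if candidate.status == "full":
        return "report_differences"   # compare all files, report mismatches
```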

Other Fixes and Improvements
  • Fix in set.dat Handling:
    Sometimes, filesets in the candidate list were updated during the same run by other filesets being processed. These updated filesets could then incorrectly appear as false positives for manual merge if their size changed. Now, if a fileset gets updated and its size no longer matches, it’s removed from the candidate list.

  • Database Schema Update:
    An extra column was added to the fileset table to store set.dat metadata.

  • Website Navbar:
    A new navbar has been added to the webpage, along with the updated logo provided by Sev.

  • Database Connection Fix in Flask:
    For development, a “Clear Database” button was added to the webpage. However, the Flask code previously used a global database connection object. This led to multiple user connections persisting and occasionally locking the database. I’ve refactored the code to eliminate the global connection, resolving the issue.


Week 5

Welcome to this week’s blog.

This week primarily involved manually reviewing more than 100 set.dat files—excluding a few, such as Scumm and all GLK engines, since their detection entries are not yet available for seeding.

Fig.1 – Result of matching set.dat files

During the process, I fixed several issues wherever possible, improved the matching, manually removed unwanted entries, and documented everything in a spreadsheet. Some key fixes included adding additional filtering based on platform to reduce the number of candidate matches. In some cases, the platform could be extracted from the gameid (e.g., goldenwake-win). This filtering was needed because many detection entries from the seeding process were missing file size information (i.e., size = -1). While I was already filtering candidates by file size, I also had to include those with size = -1 to avoid missing the correct match due to incomplete data. However, in some cases this approach significantly increased the number of candidates requiring manual merging. Introducing platform-based filtering helped reduce this count, though the improvement wasn’t as substantial as expected.
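As an example of the gameid-based extraction, something along these lines can work (the suffix table here is hypothetical; only the goldenwake-win style of id comes from the actual data):

```python
PLATFORM_SUFFIXES = {"win": "windows", "mac": "macintosh", "dos": "dos", "amiga": "amiga"}

def platform_from_gameid(gameid: str):
    # "goldenwake-win" -> "windows"; return None when no suffix is recognised.
    suffix = gameid.rsplit("-", 1)[-1]
    return PLATFORM_SUFFIXES.get(suffix)
```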

Another issue stemmed from duplicate files added during seeding for detection purposes. While the detection code in ScummVM intentionally includes these duplicates, I should have removed them during seeding. Cleaning them up did reduce the manual merging effort in some cases.

There were also complications caused by file paths. Initially, the filtering considered full file paths, but I later changed it to use only the filename as mentioned in the last blog. This led to situations where the same detection file appeared in multiple directories. I’ve now resolved this by designating only one file as the detection file and treating the others as non-detection files.

A significant portion of time also went into manually removing extra entries from set.dat, e.g., different language variants. These often caused dropouts in the matching process, but removing them allowed the main entry to be automatically merged.

Some smaller fixes included:

  • Ensuring all checksums are added on a match when the file size is less than the checksum size (since all checksums would be identical in that case). This logic was already implemented, but previously only applied when creating a new entry.

  • Increasing the log text size limit to prevent log creation from failing due to overly large text in the database.

Next, I’ll begin working on the scan utility while waiting for the Scumm and GLK detection entries to become available.


Week 4

Welcome to this week’s blog update. This week was focused on fixing bugs related to seeding and set.dat uploading, improving filtering mechanisms, and rewriting parts of the processing logic for scan.dat.

After some manual checks with the recent set.dat updates, a few issues surfaced:

1. Identical Detection Fix

Some detection entries had identical file sets, sizes, and checksums, which led to issues during the automatic merging of set.dat. Previously, we used a megakey to prevent such duplicates, but since it included not only file data but also language and platform information, some identical versions still slipped through.
To solve this, I replaced the megakey check with a more focused comparison based on filename, size, and checksum only, and added logging of the details of the clashing filesets.
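Conceptually, the duplicate check now boils down to a key built only from file data, for example (an illustrative sketch, not the exact code):

```python
import hashlib

def fileset_key(files):
    # files: iterable of (filename, size, checksum) tuples.
    # Language and platform are deliberately excluded, unlike the old megakey.
    parts = sorted(f"{name}:{size}:{md5}" for name, size, md5 in files)
    return hashlib.md5("|".join(parts).encode()).hexdigest()
```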

2. Punycode Encoding Misplacement

The filenames were being encoded using Punycode every time they were processed for database insertion. However, this encoding should have occurred earlier, ideally at the parsing stage: either in the scanning utility that generates .dat files or on the application’s upload interface. I have removed the encoding during database updates, though I still have to add it on the scan utility side, which I’ll do this week.

3. Path Format Normalization

Another issue was related to inconsistent file paths. Some set.dat entries used Windows-style paths (xyz\abc), while their corresponding detection entries used Unix-style (xyz/abc). Since filtering was done using simple string matching, these mismatches caused failures. I resolved this by normalizing all paths to use forward slashes (/) before storing them in the database.
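The normalization itself is trivial; roughly:

```python
def normalize_path(path: str) -> str:
    # Store every path with forward slashes so simple string matching is consistent.
    return path.replace("\\", "/")
```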

4. Improving Extra File Detection with cloneof and romof

While analyzing filesets, I encountered a previously unnoticed cloneof field (similar to romof). These fields indicate that extra files might be listed elsewhere in the .dat file. The previous logic only looked in the resource section, but I found that:

  • Extra files could also exist in the game section.

  • The file references could chain across multiple sections (e.g., A → B → C).

To address this, I implemented an iterative lookup for extra files, ensuring all relevant files across multiple levels are properly detected.
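A sketch of that iterative lookup (simplified; the section layout is an assumption):

```python
def collect_extra_files(sections, start_name):
    # Follow cloneof/romof references (A -> B -> C) and gather the extra
    # files from every referenced section, guarding against cycles.
    extra, seen, name = [], set(), start_name
    while name and name not in seen:
        seen.add(name)
        section = sections.get(name)
        if section is None:
            break
        extra.extend(section.get("files", []))
        name = section.get("cloneof") or section.get("romof")
    return extra
```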

scan.dat Processing Improvements

For scan.dat, I introduced a file update strategy that runs before full fileset matching. All files that match based on file size and checksum are updated first. This allows us to update matching files early, without relying solely on complete fileset comparisons.

Minor Fixes & UI Enhancements

  • Prevented reprocessing of filesets in set.dat if a key already exists in subsequent runs.

  • Passing the --skiplog CLI argument to set.dat processing to suppress verbose logs during fileset creation and automatic merging.

  • Improved filtering in the dashboard by adding more fields like engineid, transaction number, and fileset id, and fixing some older issues.

  • Introduced a new “Possible Merges” button in the filesets dashboard to manually inspect and confirm suggested merges. This feature is backed by a new database table that stores fileset matches for later manual review.