
Week 11

Welcome to this week’s blog. This week, I worked on testing the workflow for scan data as well as the user file integrity service.

For the scan data, we tested with the WAGE game archives, which provided a good opportunity to exercise both the scan utility and the scan matching for Mac files. Some fixes were indeed needed in the matching process. Initially, I was filtering filesets on both size (data fork size) and size-rd (resource fork’s data section size) simultaneously. However, this was incorrect, since detection filesets only contain one of these at a time. Additionally, I fixed how matched entries were being processed. Previously, entries that matched a detection fileset were queued for manual merge so that specific files could be added while avoiding unnecessary ones, like license or readme files from commercial games. However, it made more sense to merge them automatically and remove such files later if necessary—especially since, for archives like WAGE, the issue of extra files from commercial games does not arise.
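
As a rough sketch of the corrected filtering (field names are illustrative, not the actual schema), the idea is to filter on whichever of the two sizes a Mac detection entry actually carries:

```python
# Illustrative sketch: a Mac detection entry carries either size or size-rd, not both,
# so the filter should use whichever one is present instead of requiring both to match.
def mac_size_filter(detection_file):
    """Return the (column, value) pair to filter candidate files on."""
    if detection_file.get("size-rd", -1) != -1:
        return ("size-rd", detection_file["size-rd"])   # resource fork's data section size
    return ("size", detection_file.get("size", -1))     # fall back to the data fork size
```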

I also carried out testing for the user integrity service, focusing on different response cases:

  1. All files are okay when a full fileset matches.

  2. Extra files are present.

  3. Some files are missing.

Another missing piece was reporting files with checksum mismatches, which were previously being classified as extra files. This is now fixed. I also reviewed the manual merge process for user filesets. Unlike set filesets, the source fileset (the user fileset here) should not be deleted after a manual merge, since it could be a possible new variant that would need additional metadata. To support this, I implemented a feature to update fileset metadata—though it still requires some refinement. One more thing I need to add is an endpoint on the web server that can be triggered by the mail server. This endpoint will receive the mail information, particularly the user fileset ID for which the user has provided additional information via the pre-drafted email that is prompted when a user runs the ‘check integrity’ feature in the ScummVM application.
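
For context, here is a minimal sketch of how the response cases above can be classified; the names and the exact response shape are my own illustration, not the service’s actual code:

```python
def classify_user_files(user_files, fileset_files):
    """Compare a user's files (name -> checksum) against a matched fileset."""
    ok, missing, mismatch = [], [], []
    for name, checksum in fileset_files.items():
        if name not in user_files:
            missing.append(name)
        elif user_files[name] != checksum:
            mismatch.append(name)   # previously these were lumped in with "extra"
        else:
            ok.append(name)
    extra = [name for name in user_files if name not in fileset_files]
    return {"ok": ok, "missing": missing, "extra": extra, "checksum_mismatch": mismatch}
```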

A few other fixes this week included:

  • Deleting multiple files from a fileset through the dashboard: Previously, the query was being generated incorrectly. Instead of DELETE FROM file WHERE id IN ('1', '2', '3'), it was generating DELETE FROM file WHERE id IN ('1, 2, 3'), which, of course, did not work. This issue is now fixed (a minimal sketch of the corrected query building follows this list).

  • Search filter issue: A bug occurred when a single quote (') was used as a value in a search filter, breaking the query because the quote was not escaped. This has also been fixed.
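
As referenced above, here is a minimal sketch of the corrected bulk delete, assuming a standard DB-API cursor (e.g. pymysql); names are illustrative:

```python
def delete_files(cursor, file_ids):
    """Delete several files by id, binding one placeholder per value."""
    if not file_ids:
        return
    placeholders = ", ".join(["%s"] * len(file_ids))          # "%s, %s, %s"
    cursor.execute(f"DELETE FROM file WHERE id IN ({placeholders})", file_ids)
```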


Week 10

Welcome to this week’s blog. This week, my work focused on enhancing API security, adding GitHub authentication, refining the project structure, and introducing a faster Python package manager (UV).

API Security Improvements

I implemented some checks on the validation endpoint, which processes the user game file data sent from the ScummVM application. These checks are designed to prevent brute-force attempts:

Checks on validation endpoint

On top of that, I introduced rate limiting using Flask-Limiter. Currently, the validation endpoint allows a maximum of 3 requests per minute per user.
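
For illustration, the setup looks roughly like this (the exact constructor signature varies between Flask-Limiter versions, and the endpoint path here is made up):

```python
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app)   # key requests by the client address

@app.route("/api/validate", methods=["POST"])    # illustrative path
@limiter.limit("3 per minute")                   # mirrors the current limit
def validate():
    return {"status": "ok"}
```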

GitHub OAuth & Role-Based Access

GitHub OAuth authentication is now in place, introducing a three-level role-based system (a rough sketch of the role check follows below). I have tested it with my own dummy organisation so far; the integration with the ScummVM organisation is still pending:

  • Admin – Full access, plus the ability to clear the database.

  • Moderators – Same permissions as Admin, except database clearing.

  • Read-Only – Logged-in users with viewing rights only.

Github OAuth
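
A rough sketch of how such a role check can guard a route, assuming the role is stored in the session at login (this is an illustration, not the exact implementation):

```python
from functools import wraps
from flask import session, abort

def require_role(*allowed):
    """Reject the request unless the logged-in user's role is in `allowed`."""
    def decorator(view):
        @wraps(view)
        def wrapped(*args, **kwargs):
            if session.get("role") not in allowed:
                abort(403)
            return view(*args, **kwargs)
        return wrapped
    return decorator

# e.g. only admins may clear the database:
# @app.route("/clear_database", methods=["POST"])
# @require_role("admin")
# def clear_database(): ...
```
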
Project Restructuring & UV Integration

As suggested by my mentor Rvanlaar, I restructured the project into a Python module, making the import logic cleaner and improving overall modularity. I also added UV, a high-performance Python package and project manager, offering faster dependency handling compared to pip.

Other Fixes & Improvements
  • Updated the Apache config file to use the Python virtual environment instead of the global installation.

  • Correctly decode MacBinary filenames from headers using MacRoman instead of UTF-8 (a minimal sketch follows this list).

  • Improved error handling for the scan utility.

  • Use either size or size-rd (not both simultaneously) when filtering filesets for scan.dat in the case of Mac files.
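
As mentioned above, a minimal sketch of the MacBinary filename fix (offsets follow the MacBinary header layout; treat this as an illustration rather than the exact code):

```python
def macbinary_filename(header: bytes) -> str:
    """Byte 1 of a MacBinary header holds the name length, bytes 2..64 the name."""
    name_len = header[1]
    return header[2:2 + name_len].decode("mac_roman")   # MacRoman, not UTF-8
```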

Week 9

Welcome to this week’s blog. This week was a busy one due to my college workload, but I mostly focused on enhancing the webpage. I worked on the configuration page, the manual merge dashboard, filtering, search-related improvements, and more.

  • Configuration Page:
    I added a new configuration page that allows users to customize their preferences, including:

    • Number of filesets per page

    • Number of logs per page

    • Column width percentages for the fileset search page

    • Column width percentages for the log page

    All these preferences are stored in cookies for persistence.

    User Configuration Page
  • Manual Merge Dashboard:
    I performed some refactoring of the codebase for manual merging. Additionally, I added options to:

    • Show either all files or only the common ones

    • Display either all fields of the files, or just the full-size MD5 and size (or size-rd in the case of Mac files)

  • Search Functionality:
    I improved the search system with the following features (a rough sketch of the parsing follows this list):

    • Exact match: Values wrapped in double quotes are matched exactly

    • OR search: Multiple terms separated by spaces are treated as an OR

    • AND search: Terms separated by + are treated as an AND

  • Sorting Enhancements:
    The sorting feature now includes three states for each column: ascending, descending, and default (unsorted).
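
As noted above, here is a rough sketch of the search-term parsing; the returned structure is my own simplification of how the query conditions might be built:

```python
def parse_search(value: str):
    """Exact match for quoted values; spaces are OR; '+' inside a term is AND."""
    if value.startswith('"') and value.endswith('"') and len(value) > 1:
        return [("exact", value[1:-1])]
    groups = []
    for or_term in value.split():                     # space-separated terms are OR-ed
        groups.append(("and", or_term.split("+")))    # '+'-separated parts are AND-ed
    return groups
```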

Minor Fixes & Improvements
  • Added favicon to display on the webpage tab
  • Implemented checksum-based filtering in the fileset search page
  • Included metadata information in seeding logs (unless --skiplog is passed)
Goals for Next Week
  • Add GitHub-based authentication
  • Implement a three-tier user system: admin, moderator, and read-only
  • Add validation checks on user data to prevent brute force attacks
  • Refactor the entire project into a Python module for better structure and cleaner imports

Week 8

Welcome to this week’s blog. This week, I primarily worked on the moderation system for user.dat, improved the merge dashboard page for manual merges, and made several other changes related to the website’s user interface.

Moderation System for user.dat

For every incoming user fileset submission, we check whether we already have a matching full fileset. If a full fileset match is found, we return the status of the files. If not, the submission is added to the database as a possible new fileset or a variant, which then goes through a moderation queue.

I worked on implementing this moderation queue. A submitted fileset is added to the review queue only if it is submitted by a minimum number of different users (currently set to three). To prevent redundant count increases from the same user, there is an IP-based check in place. We already had masking logic that anonymizes the last octet of an IPv4 address (e.g., 127.0.0.X) to avoid storing Personally Identifiable Information (PII).
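
A small sketch of this moderation check (the table and column names are assumptions, not the real schema):

```python
def mask_ip(ip: str) -> str:
    """Anonymize the last octet of an IPv4 address, e.g. 127.0.0.42 -> 127.0.0.X."""
    parts = ip.split(".")
    return ".".join(parts[:3] + ["X"]) if len(parts) == 4 else ip

def ready_for_review(cursor, fileset_id, threshold=3):
    """Queue a user fileset only once enough distinct (masked) IPs have submitted it."""
    cursor.execute(
        "SELECT COUNT(DISTINCT masked_ip) FROM user_submission WHERE fileset_id = %s",
        (fileset_id,),
    )
    return cursor.fetchone()[0] >= threshold
```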

Merge Dashboard

I also worked on enhancing the merge dashboard. The goal was to implement a page that displays both filesets side by side for comparison, including checksums, file sizes, and metadata. Moderators can choose the checksums and file sizes from either fileset to perform a manual merge. A check was also added to prevent selecting both options for the same comparison field.

However, the page still needs some optimization, as it currently freezes when dealing with a large number of files.

Other Improvements

Some of the other fixes and improvements include:

  • Correctly redirecting filesets after automatic or manual matches

  • Enhancing the homepage

  • Adding error handling while parsing .dat files

  • Implementing proper transaction rollbacks in case of failure

  • Fixing filtering issues where entries with null fields were previously not being filtered correctly.


Week 7

Welcome to this week’s blog update. This week, I focused primarily on completing the scan.dat processing, as well as working on user.dat handling which is the actual data coming from the user side. The scan processing was almost complete by the previous week; the remaining task was to add a modification time field to the file table in the database and reflect that change in the frontend.

One significant fix was introducing checksum-based filtering at the very beginning of the filtering logic for all dats. Previously, I had placed it after the maximum matched files were already filtered, which did not align with ScummVM’s detection algorithm. Further, the detection entries from the SCUMM engine have 1 MB checksums, so checksum-based filtering worked really well there. Another improvement was how I handled duplicate entries. Initially, I was dropping all entries in case of duplicates. However, it’s more efficient to retain the first entry and discard the rest, reducing the need to manually add extra detection entries for such cases.
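
A minimal sketch of the keep-the-first-duplicate behaviour (the key used to detect duplicates is illustrative):

```python
def drop_duplicates_keep_first(entries, key=lambda e: e["name"]):
    """Keep the first occurrence of each duplicate entry instead of dropping all of them."""
    seen, kept = set(), []
    for entry in entries:
        k = key(entry)
        if k not in seen:
            seen.add(k)
            kept.append(entry)
    return kept
```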

Then I worked on user.dat, where I rewrote some of the matching logic. The filtering approach remains consistent with scan.dat, and I made minimal changes to the existing response logic. Some work is left on the moderation queue for reviewing user data and on the IP-based checking.

Fig. 1 – Testing user.dat

Other fixes and improvements:

  • Parameterized all old SQL queries: I had postponed this task for a while, but finally sat down to parameterize them all.

  • Formatting compute_hash.py: I had avoided running the Ruff formatter on this file because it was interfering with specific parts of the scan utility code. However, thanks to a suggestion from rvanlaar, I used # fmt: off and # fmt: on comments to selectively disable formatting for those sections.


Week-6

Welcome to this week’s blog. This week, I primarily worked on the scan utility and the scan processing logic.

Scan Utility

The scan utility was mostly complete in the first week, but I added three more features:

  1. Modification Time Filter:
    I added the modification time for scanned files into the .dat file. A command-line argument now allows users to specify a cutoff time, filtering out files updated after that time (except files modified today).
    Extracting the modification time was straightforward for non-Mac files since it could be retrieved from the OS. However, for Mac-specific formats—specifically MacBinary and AppleDouble—I had to extract the modification time from the Finder Info (a minimal sketch for MacBinary follows this list).

  2. Size Fields:
    I added all size types (size, size-r, and size-rd) in the .dat file.

  3. Punycode Path Encoding:
    Filepath components are now punycode-encoded individually.
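
As mentioned in point 1, a minimal sketch of reading the modification time from a MacBinary header: per the MacBinary layout, the 4-byte big-endian timestamp at offset 95 counts seconds since 1904-01-01 (AppleDouble needs its own handling, not shown here):

```python
import datetime
import struct

MAC_EPOCH_OFFSET = 2082844800   # seconds between 1904-01-01 and 1970-01-01

def macbinary_mtime(header: bytes) -> datetime.datetime:
    """Decode the modification date stored in a MacBinary header."""
    mac_seconds = struct.unpack_from(">I", header, 95)[0]
    return datetime.datetime.fromtimestamp(mac_seconds - MAC_EPOCH_OFFSET,
                                           tz=datetime.timezone.utc)
```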

Scan Processing Logic

For processing scan.dat, the first improvement was updating the checksum of all files in the database that matched both the checksum and file size.

The rest of the processing is similar to the set.dat logic:
Filtering is used to find candidates with matching detection filenames, sizes, and additionally checksums (a sketch of the single-candidate handling follows the list below).

  • Single Candidate:
    • If the candidate’s status is partial, it’s upgraded to full (files are updated in case they were skipped earlier due to missing size info).

    • If the candidate’s status is detection, and the number of files in scan.dat equals the candidate’s file count, the status is set to full. Otherwise, it’s flagged for manual merge.

    • If the candidate status is already full, all files are compared, and any differences are reported.

  • Multiple Candidates:
    All candidates are added for manual merging.
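
The single-candidate branch sketched above looks roughly like this (field names and the returned labels are illustrative):

```python
def single_candidate_action(candidate, scan_files):
    """Decide what to do with the one candidate found for a scanned fileset."""
    status = candidate["status"]
    if status == "partial":
        return "upgrade to full"        # also fill files skipped earlier for missing sizes
    if status == "detection":
        if len(candidate["files"]) == len(scan_files):
            return "upgrade to full"
        return "manual merge"
    if status == "full":
        return "compare all files and report differences"
    return "manual merge"
```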

Other Fixes and Improvements
  • Fix in set.dat Handling:
    Sometimes, filesets from the candidate list were updated during the same run due to other filesets. These updated filesets could incorrectly appear as false positives for manual merge if their size changed. Now, if a fileset gets updated and its size no longer matches, it’s removed from the candidate list.

  • Database Schema Update:
    An extra column was added to the fileset table to store set.dat metadata.

  • Website Navbar:
    A new navbar has been added to the webpage, along with the updated logo provided by Sev.

  • Database Connection Fix in Flask:
    For development, a “Clear Database” button was added to the webpage. However, the Flask code previously used a global database connection object. This led to multiple user connections persisting and occasionally locking the database. I’ve refactored the code to eliminate the global connection, resolving the issue.
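
The fix follows the usual Flask pattern of one connection per request, roughly like this (the driver and connection details are illustrative):

```python
from flask import Flask, g
import pymysql

app = Flask(__name__)

def get_db():
    """Open a connection for the current request if there isn't one yet."""
    if "db" not in g:
        g.db = pymysql.connect(host="localhost", user="user",
                               password="secret", db="integrity")
    return g.db

@app.teardown_appcontext
def close_db(exc):
    """Close the per-request connection when the request ends."""
    db = g.pop("db", None)
    if db is not None:
        db.close()
```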


Week 5

Welcome to this week’s blog.

This week primarily involved manually reviewing more than 100 set.dat files—excluding a few, such as Scumm and all GLK engines, since their detection entries are not yet available for seeding.

Fig.1 – Result of matching set.dat files

During the process, I fixed several issues wherever possible, improved the matching, manually removed unwanted entries, and documented everything in a spreadsheet. Some key fixes included adding additional filtering based on platform to reduce the number of candidate matches. In some cases, the platform could be extracted from the gameid (e.g., goldenwake-win). This filtering was needed because many detection entries from the seeding process were missing file size information (i.e., size = -1). While I was already filtering candidates by file size, I also had to include those with size = -1 to avoid missing the correct match due to incomplete data. However, in some cases this approach significantly increased the number of candidates requiring manual merging. Introducing platform-based filtering helped reduce this count, though the improvement wasn’t as substantial as expected.
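
A small sketch of the platform hint extracted from a gameid such as goldenwake-win (the suffix table is an illustrative subset, not the real mapping):

```python
PLATFORM_SUFFIXES = {"win": "Windows", "mac": "Macintosh", "dos": "DOS"}   # illustrative subset

def platform_from_gameid(gameid: str):
    """Return a platform hint from the gameid suffix, or None if there is none."""
    suffix = gameid.rsplit("-", 1)[-1]
    return PLATFORM_SUFFIXES.get(suffix)
```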

Another issue stemmed from duplicate files added during seeding for detection purposes. While the detection code in ScummVM intentionally includes these duplicates, I should have removed them during seeding. Cleaning them up did reduce the manual merging effort in some cases.

There were also complications caused by file paths. Initially, the filtering considered full file paths, but I later changed it to use only the filename as mentioned in the last blog. This led to situations where the same detection file appeared in multiple directories. I’ve now resolved this by designating only one file as the detection file and treating the others as non-detection files.

A significant portion of time also went into manually removing extra entries from set.dat, e.g., different language variants. These often caused dropouts in the matching process, but removing them allowed the main entry to be merged automatically.

Some smaller fixes included:

  • Ensuring all checksums are added on a match when the file size is less than the checksum size (since all checksums would be identical in that case). This logic was already implemented, but previously it only applied when creating a new entry.

  • Increasing the log text size limit to prevent log creation from failing due to overly large text in the database.

Next, I’ll begin working on the scan utility while waiting for the Scumm and GLK detection entries to become available.


Week 4

Welcome to this week’s blog update. This week was focused on fixing bugs related to seeding and set.dat uploading, improving filtering mechanisms, and rewriting parts of the processing logic for scan.dat.

After some manual checks with the recent set.dat updates, a few issues surfaced:

1. Identical Detection Fix

Some detection entries had identical file sets, sizes, and checksums, which led to issues during the automatic merging of set.dat. Previously, we used a megakey to prevent such duplicates, but since it included not only file data but also language and platform information, some identical versions still slipped through.
To solve this, I replaced the megakey check with a more focused comparison: filename, size, and checksum only. I also log the details of the clashing fileset.

2. Punycode Encoding Misplacement

The filenames were being encoded with Punycode every time they were processed for database insertion. However, this encoding isn’t needed there; it should happen earlier — ideally at the parsing stage, either in the scanning utility that generates the .dat files or in the application’s upload interface. I have removed the encoding during database updates, though I still have to add it on the scan utility side, which I’ll do this week.

3. Path Format Normalization

Another issue was related to inconsistent file paths. Some set.dat entries used Windows-style paths (xyz\abc), while their corresponding detection entries used Unix-style (xyz/abc). Since filtering was done using simple string matching, these mismatches caused failures. I resolved this by normalizing all paths to use forward slashes (/) before storing them in the database.

4. Improving Extra File Detection with cloneof and romof

While analyzing filesets, I encountered a previously unnoticed cloneof field (similar to romof). These fields indicate that extra files might be listed elsewhere in the .dat file. The previous logic only looked in the resource section, but I found that:

  • Extra files could also exist in the game section.

  • The file references could chain across multiple sections (e.g., A → B → C).

To address this, I implemented an iterative lookup for extra files, ensuring all relevant files across multiple levels are properly detected.
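
A rough sketch of that iterative lookup (the section structure and field access are simplified for illustration):

```python
def collect_extra_files(section, sections_by_name):
    """Follow romof/cloneof references (A -> B -> C) and gather the referenced files."""
    files, seen = [], set()
    current = section
    while current is not None:
        parent = current.get("romof") or current.get("cloneof")
        if not parent or parent in seen:
            break                                    # end of the chain, or a cycle
        seen.add(parent)
        current = sections_by_name.get(parent)
        if current:
            files.extend(current.get("files", []))
    return files
```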

scan.dat Processing Improvements

For scan.dat, I introduced a file update strategy that runs before full fileset matching. All files that match based on file size and checksum are updated first. This allows us to update matching files early, without relying solely on complete fileset comparisons.

Minor Fixes & UI Enhancements

  • Prevented reprocessing of filesets in set.dat if a key already exists in subsequent runs.

  • Passing the --skiplog CLI argument to set.dat processing to suppress verbose logs during fileset creation and automatic merging.

  • Improved filtering in the dashboard by adding more fields, such as engineid, transaction number, and fileset id, and fixed some older issues.

  • Introduced a new “Possible Merges” button in the filesets dashboard to manually inspect and confirm suggested merges. This feature is backed by a new database table that stores fileset matches for later manual review.


Week 3

Welcome to this week’s blog update.

Most of this week was spent rewriting the logic for processing set.dat files, as the previous implementation had several inconsistencies. As shown in Figure 1, the earlier logic directly compared checksums to determine matches. However, this approach only worked when the file size was smaller than the checksum size (typically md5-5000), since set.dat files only include full file checksums. This caused us to miss many opportunities to merge filesets that could otherwise be uniquely matched by filename and size alone.

Fig.1 : Previous query used in matching.

Since set.dat files only contain entries that are already present in the detection results (with the exception of some rare variants I discovered later in the week), we should typically expect a one-to-one mapping in most cases. However, some filesets in set.dat can correspond to multiple candidate entries. This happens when the name and size match, but the checksum differs—often due to file variants. This case needs to be handled with manual merge.

Previously, the logic for handling different .dat file types was tightly coupled, making it hard to understand and maintain. I started by decoupling the logic for set.dat entirely. Now, the candidates for matching set.dat filesets are filtered by engine name, filename, and file size (if it’s not -1), excluding the checksum. All detection files (files with the detection flag set to 1) must satisfy this condition.
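
Roughly, the per-file condition looks like this (my own simplification; engine filtering happens alongside it, and the field names are illustrative):

```python
def detection_files_covered(detection_files, set_files_by_name):
    """Every detection file (detection flag = 1) must match on filename, and on size
    unless the recorded size is -1; checksums are deliberately ignored at this stage."""
    for det in detection_files:
        match = set_files_by_name.get(det["name"].lower())
        if match is None:
            return False
        if det["size"] != -1 and match["size"] != -1 and det["size"] != match["size"]:
            return False
    return True
```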

 

Initially, I kept only the fileset with the highest number of matches, assuming it was the correct one. However, that approach isn’t reliable—sometimes the correct match might not be the largest group. So all these candidates need to go through manual merge. Only when all checksums match across candidates can we be confident in an automatic match.

I also added logic to handle partial or full matches of candidate filesets. This can happen when a set.dat is reuploaded with changes. In such cases, all files are compared: if there’s no difference, the fileset is dropped. If differences exist, the fileset is flagged for manual merge.

Finally, I handled an issue with Mac files in set.dat. These files aren’t correctly represented there: they lack prefixes, and their checksums are computed over the full file rather than the individual forks. So these filesets are dropped early, when no candidates are found for them after the SQL filtering.

Other Updates

During seeding, I found some entries with the same megakey, differing only by game name or title. Sev advised treating them as a single fileset. So now, only the first such entry is added, and the rest are logged as warnings with metadata, including links to the conflicting fileset.

Other fixes this week included:

  • Removing support for m-type checksums entirely (Sev also removed them from the detections).

  • Dropping sha1 and crc checksums, which mainly came from set.dat.

Next Steps

With the seeding logic refined, the next step is to begin populating the database with individual set.dat entries and confirm everything works as expected.

After that, I’ll start working on fixing the scan.dat functionality. This feature will allow developers to manually scan their game data files and upload the relevant data to the database.


Week 2

Welcome to the weekly blog.
After wrapping up work on macfiles last week, I finally moved on to testing the first two phases of the data upload workflow — starting with scummvm.dat (which contains detection entries from ScummVM used for populating the database) and set.dat (data from older collections, which provides more information).

Database Changes: File Size Enhancements

Before diving into the testing, there was a change Sev asked for — to extend the database schema to store three types of file sizes instead of just one. This was necessary due to the nature of macfiles, which have:

  • A data fork

  • A resource fork

  • A third size: the data section of the resource fork itself

This change introduced significant modifications to db_functions.py, which contains the core logic for working with .dat files. I had to be careful to ensure nothing broke during this transition.

Punycode Encoding Logic Fixes

At the same time, I fixed the punycode logic in db_functions.py. Punycode encoding (or rather, an extended version of the standard used in URL encoding) is employed by ScummVM to convert filenames into filesystem-independent ASCII-only representations.

There were inconsistencies between punycode logic in db_functions.py and the original implementation in the Dumper Companion. I made sure both implementations now align, and I ran unit tests from the Dumper Companion to verify correctness.

Feeding the Database – scummvm.dat

With those fixes in place, I moved on to populating the database with data from scummvm.dat. While Sev was working on the C++ side to add the correct filesize tags for detections, I ran manual tests using existing data. The parsing logic worked well, though I had to add support for the new “extra size” fields.

Additionally, I fixed the megakey calculation, which is used later when uploading the scummvm.dat again with updates. This involved sorting files alphabetically before computing the key to ensure consistent results.
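
Conceptually, the fixed megakey computation looks something like this; the exact fields and separators of the real key are not reproduced here, the point is simply that sorting makes the result order-independent:

```python
import hashlib

def compute_megakey(fileset):
    """Illustrative megakey: built from sorted file data plus language and platform."""
    key = ""
    for f in sorted(fileset["files"], key=lambda f: f["name"].lower()):   # sort for stable results
        key += ":" + f["name"] + ":" + str(f["size"])
    key += ":" + fileset.get("language", "") + ":" + fileset.get("platform", "")
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```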

I also introduced a small optimization: if a file is less than 5000 bytes, we can safely assume that all checksum types (e.g., md5-full_file, md5-5000B, md5-tail-5000B, md5-oneMB, or the macfile variants like -d/-r ) will be the same. In such cases, we now automatically fill all checksum fields with the same value used in detection.

Uploading and Matching – set.dat

Finally, I worked on uploading set.dat to the database, which usually contains the following – metadata (mostly irrelevant), full-file checksums only, and file sizes.

    • scummvm.dat doesn’t contain full-file checksums like set.dat does, so a match between files from set.dat and scummvm.dat is only possible when a file’s size is less than the detection checksum size, generally 5000 bytes.
    • This transitions the status from “detection” to “partial” — we now know all files in the game, but not all checksum types.

    • If there is no match, we create a new entry in the database with the status dat.

Fixes:

There was an issue with the session variable @fileset_last, which was mistakenly referencing the filechecksum table instead of the latest entry in the fileset table. This broke the logic for matching entries.

When a detection matched a file, only one checksum was previously being transferred. I fixed this to include all relevant checksums from the detection file.

Bug Fixes and Improvements

Fixed redirection logic in the logs: previously, when a matched detection entry was removed, the log URL still pointed to the deleted fileset ID. I updated this to redirect correctly to the matched fileset.

Updated the dashboard to show unmatched dat entries. These were missing earlier because the SQL query used an INNER JOIN with the game table, and since set.dat files don’t have game table references, they were filtered out. I replaced it with a LEFT JOIN on fileset to include them.
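
The shape of the corrected dashboard query, with assumed table and column names:

```python
# Illustrative only: the real query selects more columns and uses the actual schema.
DASHBOARD_QUERY = """
    SELECT fileset.id, fileset.status, game.name
    FROM fileset
    LEFT JOIN game ON game.id = fileset.game   -- was an INNER JOIN, which hid dat entries
    ORDER BY fileset.id
"""
```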


That’s everything I worked on this past week. I’m still a bit unsure about the set.dat matching logic, so I’ll be discussing it further with Sev to make sure everything is aligned.

Thanks for reading!