
Week 8

Welcome to this week’s blog. This week, I primarily worked on the moderation system for user.dat, improved the merge dashboard page for manual merges, and made several other changes related to the website’s user interface.

Moderation System for user.dat

For every incoming user fileset submission, we check whether we already have a matching full fileset. If a full fileset match is found, we return the status of the files. If not, the submission is added to the database as a possible new fileset or a variant, which then goes through a moderation queue.

I worked on implementing this moderation queue. A submitted fileset is added to the review queue only if it is submitted by a minimum number of different users (currently set to three). To prevent redundant count increases from the same user, there is an IP-based check in place. We already had masking logic that anonymizes the last octet of an IPv4 address (e.g., 127.0.0.X) to avoid storing Personally Identifiable Information (PII).
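
As an illustration, here is a minimal sketch of such masking and the duplicate-submitter check; the helper names and the non-IPv4 branch are illustrative assumptions, not the project's actual code:

    def anonymize_ip(ip_str: str) -> str:
        """Mask the last octet of an IPv4 address, e.g. 127.0.0.5 -> 127.0.0.X."""
        parts = ip_str.split(".")
        if len(parts) == 4:
            parts[-1] = "X"
            return ".".join(parts)
        return ip_str  # non-IPv4 input is left untouched in this sketch

    def already_counted(ip_str: str, seen_masked_ips: set[str]) -> bool:
        """True if a submission from this (masked) address was already counted."""
        return anonymize_ip(ip_str) in seen_masked_ips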

Merge Dashboard

I also worked on enhancing the merge dashboard. The goal was to implement a page that displays both filesets side by side for comparison, including checksums, file sizes, and metadata. Moderators can choose the checksums and file sizes from either fileset to perform a manual merge. A check was also added to prevent both options from being selected for the same comparison field (a rough sketch of that check follows below).

However, the page still needs some optimization, as it currently freezes when dealing with a large number of files.
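
For the both-options check mentioned above, the server-side validation is roughly the following, assuming the selection is posted as a mapping from each comparison field to the list of chosen sides (the payload shape and function name are illustrative assumptions):

    def validate_selection(selection: dict[str, list[str]]) -> list[str]:
        """Each comparison field must take its value from exactly one fileset."""
        errors = []
        for field, sides in selection.items():
            if len(sides) != 1 or sides[0] not in ("left", "right"):
                errors.append(f"{field}: pick a value from exactly one fileset")
        return errors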

Other Improvements

Some of the other fixes and improvements include:

  • Correctly redirecting filesets after automatic or manual matches

  • Enhancing the homepage

  • Adding error handling while parsing .dat files

  • Implementing proper transaction rollbacks in case of failure (a minimal sketch follows this list)

  • Fixing filtering issues where some entries with null fields were previously not filtered correctly.
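
For the rollback item, the pattern is roughly the following (a generic DB-API sketch; the real cursor handling in the project may differ):

    def run_as_transaction(conn, statements):
        """Run a group of statements as one unit; undo everything on failure."""
        cursor = conn.cursor()
        try:
            for sql, params in statements:
                cursor.execute(sql, params)
            conn.commit()
        except Exception:
            conn.rollback()  # leave the database exactly as it was before the batch
            raise
        finally:
            cursor.close()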

Week 7

Welcome to this week’s blog update. This week, I focused primarily on completing the scan.dat processing, as well as working on user.dat handling, which deals with the actual data coming from the user side. The scan processing was almost complete by the previous week; the remaining task was to add a modification time field to the file table in the database and reflect that change in the frontend.

One significant fix was introducing checksum-based filtering at the very beginning of the filtering logic for all dats. Previously, I had placed it after filtering by the maximum number of matched files, which did not align with ScummVM’s detection algorithm. Furthermore, the detection entries from the Scumm engine have 1 MB checksums, so checksum-based filtering worked really well there. Another improvement was how I handle duplicate entries. Initially, I was dropping all entries in case of duplicates. However, it’s more efficient to retain the first entry and discard the rest, reducing the need to manually add extra detection entries for such cases.
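
The duplicate handling boils down to keeping the first occurrence of each entry and skipping the rest; a minimal sketch (using a (name, size) key is an assumption about what counts as a duplicate):

    def dedupe_entries(entries):
        """Keep only the first entry for each (name, size) pair, in original order."""
        seen = set()
        kept = []
        for entry in entries:
            key = (entry["name"].lower(), entry.get("size"))
            if key in seen:
                continue  # drop later duplicates instead of discarding the whole group
            seen.add(key)
            kept.append(entry)
        return kept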

Then I worked on user.dat, where I rewrote some of the matching logic. The filtering approach remains consistent with scan.dat, and I made minimal changes to the existing response logic. Some work is still left on the moderation queue for reviewing user data and on the IP-based check.

Fig. 1 – Testing user.dat

Other fixes and improvements:

  • Parameterized all old SQL queries: I had postponed this task for a while, but finally sat down to parameterize them all (a small example follows this list).

  • Formatting compute_hash.py: I had avoided running the Ruff formatter on this file because it was interfering with specific parts of the scan utility code. However, thanks to a suggestion from rvanlaar, I used # fmt: off and # fmt: on comments to selectively disable formatting for those sections.
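
For the parameterization item above, the change is simply moving values out of the SQL string and into driver placeholders; for example (placeholder syntax depends on the driver, %s as with pymysql; the table and column names are illustrative):

    # Before: value interpolated into the query string (fragile and injection-prone)
    # cursor.execute(f"SELECT id FROM fileset WHERE game_id = '{game_id}'")

    # After: the driver quotes and escapes the value itself
    cursor.execute("SELECT id FROM fileset WHERE game_id = %s", (game_id,))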

Week 6

Welcome to this week’s blog. This week, I primarily worked on the scan utility and the scan processing logic.

Scan Utility

The scan utility was mostly complete in the first week, but I added three more features:

  1. Modification Time Filter:
    I added the modification time for scanned files into the .dat file. A command-line argument now allows users to specify a cutoff time, filtering out files updated after that time (except files modified today).
    Extracting the modification time was straightforward for non-Mac files, since it could be retrieved from the OS. However, for Mac-specific formats (MacBinary and AppleDouble), I had to extract the modification time from the Finder Info. A sketch of the cutoff filter follows this list.

  2. Size Fields:
    I added all size types (size, size-r, and size-rd) in the .dat file.

  3. Punycode Path Encoding:
    Filepath components are now punycode-encoded individually.
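
For the modification-time filter in item 1, the non-Mac path is roughly the sketch below; the "modified today" exception and the helper name are simplified assumptions:

    import os
    from datetime import date, datetime

    def keep_file(path: str, cutoff: datetime) -> bool:
        """Skip files modified after the cutoff, unless they were modified today."""
        mtime = datetime.fromtimestamp(os.path.getmtime(path))
        if mtime.date() == date.today():
            return True  # files touched today are always kept
        return mtime <= cutoff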

Scan Processing Logic

For processing scan.dat, the first improvement was updating the checksum of all files in the database that matched both the checksum and file size.

The rest of the processing is similar to the set.dat logic: filtering is used to find candidates with matching detection filenames, sizes, and additionally checksums. A condensed sketch of the decision logic follows the list below.

  • Single Candidate:
    • If the candidate’s status is partial, it’s upgraded to full (files are updated in case they were skipped earlier due to missing size info).

    • If the candidate’s status is detection and the number of files in the scan.dat fileset matches the candidate’s, the status is set to full. Otherwise, it’s flagged for a manual merge.

    • If the candidate status is already full, all files are compared, and any differences are reported.

  • Multiple Candidates:
    All candidates are added for manual merging.
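
Here is that condensed sketch, with each candidate represented as a (status, file_count) pair; the representation and the returned action names are illustrative, not the project's actual code:

    def decide_action(candidates, scan_file_count):
        """Decide what to do with a scan.dat fileset after filtering.

        The zero-candidate case (a brand new fileset) is left out of this sketch.
        """
        if len(candidates) > 1:
            return "manual_merge"  # every candidate is queued for manual review
        status, candidate_file_count = candidates[0]
        if status == "partial":
            return "upgrade_to_full"  # also fills in files skipped earlier
        if status == "detection":
            if candidate_file_count == scan_file_count:
                return "upgrade_to_full"
            return "manual_merge"
        # status == "full": compare every file and report any differences
        return "compare_and_report"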

Other Fixes and Improvements
  • Fix in set.dat Handling:
    Sometimes, filesets in the candidate list were updated during the same run by other incoming filesets. These updated filesets could then incorrectly show up as false positives for manual merge if their size changed. Now, if a fileset gets updated and its size no longer matches, it’s removed from the candidate list.

  • Database Schema Update:
    An extra column was added to the fileset table to store set.dat metadata.

  • Website Navbar:
    A new navbar has been added to the webpage, along with the updated logo provided by Sev.

  • Database Connection Fix in Flask:
    For development, a “Clear Database” button was added to the webpage. However, the Flask code previously used a global database connection object. This led to multiple user connections persisting and occasionally locking the database. I’ve refactored the code to eliminate the global connection, resolving the issue.
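
The per-request connection pattern that replaces the global object looks roughly like this in Flask (the connection details are placeholders):

    from flask import Flask, g
    import pymysql

    app = Flask(__name__)

    def get_db():
        """Open a connection for the current request if one does not exist yet."""
        if "db" not in g:
            g.db = pymysql.connect(host="localhost", user="user",
                                   password="password", database="integrity")
        return g.db

    @app.teardown_appcontext
    def close_db(exc):
        """Always close the connection when the request context ends."""
        db = g.pop("db", None)
        if db is not None:
            db.close()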

Week 5

Welcome to this week’s blog.

This week primarily involved manually reviewing more than 100 set.dat files, excluding a few, such as Scumm and all GLK engines, since their detection entries are not yet available for seeding.

Fig. 1 – Result of matching set.dat files

During the process, I fixed several issues wherever possible, improved the matching, manually removed unwanted entries, and documented everything in a spreadsheet. Some key fixes included adding additional filtering based on platform to reduce the number of candidate matches. In some cases, the platform could be extracted from the gameid (e.g., goldenwake-win). This filtering was needed because many detection entries from the seeding process were missing file size information (i.e., size = -1). While I was already filtering candidates by file size, I also had to include those with size = -1 to avoid missing the correct match due to incomplete data. However, in some cases this approach significantly increased the number of candidates requiring manual merging. Introducing platform-based filtering helped reduce this count, though the improvement wasn’t as substantial as expected.
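
A sketch of the gameid-based platform extraction (the suffix list is an assumption; the real mapping is likely broader):

    # Hypothetical suffix-to-platform mapping; the real list may differ.
    PLATFORM_SUFFIXES = {"win": "windows", "mac": "macintosh", "dos": "dos", "amiga": "amiga"}

    def platform_from_gameid(gameid: str):
        """Extract a platform hint from ids like 'goldenwake-win'; None if absent."""
        suffix = gameid.rsplit("-", 1)[-1]
        return PLATFORM_SUFFIXES.get(suffix)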

Another issue stemmed from duplicate files added during seeding for detection purposes. While the detection code in ScummVM intentionally includes these duplicates, I should have removed them during seeding. Cleaning them up did reduce the manual merging effort in some cases.

There were also complications caused by file paths. Initially, the filtering considered full file paths, but I later changed it to use only the filename as mentioned in the last blog. This led to situations where the same detection file appeared in multiple directories. I’ve now resolved this by designating only one file as the detection file and treating the others as non-detection files.

A significant portion of time also went into manually removing extra entries from set.dat, e.g., different language variants. These often caused dropouts in the matching process, but removing them allowed the main entry to be automatically merged.

Some smaller fixes included:

  • Ensuring all checksums are added on a match when the file size is less than the checksum prefix size (since all checksums would be identical in that case; a short illustration follows this list). This logic was already implemented, but previously it only applied when creating a new entry.

  • Increasing the log text size limit to prevent log creation from failing due to overly large text in the database.
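
To illustrate the first item: when a file is shorter than the checksum prefix length, the prefix checksum covers the whole file and therefore equals the full checksum, so every variant can be filled in from a single value. A small hashlib sketch (the prefix sizes are illustrative):

    import hashlib

    def prefix_md5(data: bytes, limit: int) -> str:
        """MD5 over at most the first `limit` bytes of the file contents."""
        return hashlib.md5(data[:limit]).hexdigest()

    data = b"tiny file contents"  # shorter than any prefix limit
    assert prefix_md5(data, 5000) == prefix_md5(data, 1024 * 1024) == hashlib.md5(data).hexdigest()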

Next, I’ll begin working on the scan utility while waiting for the Scumm and GLK detection entries to become available.