Categories
CLI website

Week 12 – Wrap-Up

With the end of the project approaching, this week’s work focused on wrapping things up and refining the details.

Concretely, I accomplished the following tasks:

  1. I thoroughly tested my code following the correct workflow.
  2. In the old version, matching filesets from the command line created the same fileset several times, which showed up in the logs as repeated “Updated fileset: xxxx” entries. After some debugging, I found the cause. As shown in the diagram, the old version automatically inserted a new fileset once every possible match had failed, which seems reasonable on its own. However, I had already implemented the logic of “creating a fileset first before matching” in a previous version. The result was duplicate logs, since the outer loop of this code was
    for matched_fileset_id, matched_count in matched_list:
    meaning that the log entry was repeated once for every potentially matched fileset. The issue was minor, but it stemmed from my own lack of careful consideration.
  3. I added several new features: for instance, I added checkboxes next to each file on the fileset detail page for developers to conveniently delete unnecessary files, included sorting functionality in the fileset table on the fileset page, and highlighted checksums corresponding to detection types. These features did not involve complex logic, as they were primarily frontend enhancements, and therefore were completed without difficulty.
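The duplicate-log behaviour from item 2 can be reproduced in miniature. The sketch below is not the project's code: both function names and the sample data are hypothetical, and "creating a fileset" is reduced to appending a log string. It only shows why a fallback placed inside the candidate loop produces one log entry per candidate.

```python
def match_with_fallback_in_loop(matched_list, fileset_id):
    """Old shape (simplified): the fallback runs inside the candidate loop."""
    logs = []
    for matched_fileset_id, matched_count in matched_list:
        # Assume every candidate fails to match, so the fallback fires each time.
        logs.append(f"Updated fileset: {fileset_id}")
    return logs


def match_with_single_fallback(matched_list, fileset_id):
    """Fixed shape (simplified): the fileset is created once, before matching."""
    logs = [f"Updated fileset: {fileset_id}"]
    for matched_fileset_id, matched_count in matched_list:
        pass  # matching attempts would go here; no further insert or log
    return logs


# Hypothetical (fileset id, match count) pairs.
candidates = [(101, 3), (102, 2), (103, 1)]
print(len(match_with_fallback_in_loop(candidates, 42)))  # 3 log entries
print(len(match_with_single_fallback(candidates, 42)))   # 1 log entry
```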

Looking back at my entire project, I have written nearly 3,000 lines of code over 12 weeks.

I am pleased that most of what I’ve accomplished so far has met expectations. However, some improvements are still needed before deployment, such as support for MacBinary and other Mac formats, and user IP identification. My sense of responsibility drives me to keep refining this project even after GSoC ends, until it can truly be put into production. I look forward to that day! 😄


Week 9 — Major Features Finished

Week 9 — Wrapping Up Major Features

This week, I mainly focused on the last scenario, which involves performing integrity checks submitted from the user side. Structurally, this task is not very different from the previous scenarios. However, I initially adopted an unreliable matching method:

I first searched the database for filesets where the game ID, engine ID, platform, and language matched the metadata provided by the user. Then, I used a two-pointer approach to check each file within these filesets.

If any file did not match, I marked it as ‘size_mismatch’ or ‘checksum_mismatch’. However, I quickly realized that this comparison approach was unreliable; it did not conduct boundary checks and overlooked many unexpected situations.

Given the files of a single game, comparing them one by one has acceptable time complexity (see the implementation of the match args in the previous dat_parser.py). Moreover, based on my experience playing games, reporting ‘checksum_mismatch’ or ‘size_mismatch’ is unnecessary; three responses, ‘ok’, ‘extra’, and ‘missing’, are sufficient. I therefore restructured this part of the logic and ran some tests, and it looks like it’s working very well now!
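As a rough illustration of the three-response idea, here is a minimal sketch. It assumes a file is identified by a (name, checksum) pair, so a file whose checksum differs is reported as both ‘missing’ and ‘extra’; that identity rule, the function name `classify_files`, and the sample data are assumptions for the example, not the project's actual logic.

```python
def classify_files(expected, submitted):
    """Compare two collections of (name, checksum) pairs.

    expected:  files recorded in the database fileset
    submitted: files reported from the user side
    """
    expected_set = set(expected)
    submitted_set = set(submitted)
    return {
        "ok": sorted(expected_set & submitted_set),       # present on both sides
        "missing": sorted(expected_set - submitted_set),  # in the fileset, not submitted
        "extra": sorted(submitted_set - expected_set),    # submitted, not in the fileset
    }


# Hypothetical sample data.
db_files = [("a.dat", "aa11"), ("b.dat", "bb22")]
user_files = [("a.dat", "aa11"), ("c.dat", "cc33")]
result = classify_files(db_files, user_files)
print(result)
```

Using set operations avoids the index bookkeeping of a two-pointer walk, so there are no boundary conditions to get wrong.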


Week 6 – Refinement

This week’s work went relatively smoothly, and I encountered nothing too challenging. At the beginning of the week, sev reminded me that my original regex was matching too slowly. I reviewed the code and realized I had fallen into an X-Y problem myself.

That is, I was so focused on using regex to solve the string-matching problem that I overlooked whether a regex was needed at all. My current expression does match every case in the existing dat files, but that generality is unnecessary. Since the structure within each “rom” section is fixed and different blocks are always separated by a space, token matching is sufficient. There’s no need for a complex regex that covers all edge cases, and the performance will be much better (linear time complexity).
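A minimal sketch of what token matching could look like, assuming a “rom” entry is a flat sequence of space-separated key/value tokens in which only the file name may be double-quoted. The sample line and the helper names `tokenize` and `parse_rom` are hypothetical, not taken from dat_parser.py.

```python
def tokenize(line):
    """Split a line on whitespace in a single pass, keeping quoted strings whole."""
    tokens = []
    current = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            if in_quotes:
                # Closing quote: emit the token even if it is empty.
                tokens.append("".join(current))
                current = []
            in_quotes = not in_quotes
        elif ch.isspace() and not in_quotes:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def parse_rom(line):
    """Pair up alternating key/value tokens into a dict."""
    tokens = tokenize(line)
    return dict(zip(tokens[0::2], tokens[1::2]))


# Hypothetical entry; note the space inside the quoted file name.
entry = parse_rom('name "my file.dat" size 42 md5-5000 abc123')
print(entry)
```

Each character is visited once, so the cost grows linearly with the length of the line and there is no backtracking.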

As planned last week, I added a detection_type column to the file table. This makes it clearer when recalculating the megakey.

Adding the detection_type column and set.dat means the code logic requires some extra handling. I therefore also refactored the original code, decoupling the matching-related code into modules to facilitate future development and expansion.


Week 5 – Merge

In the past few weeks, I completed most of the work on the detection part, including merging and deduplication. This week, I mainly focused on merging the remaining types of dat files.

After discussing with sev, I realized that my previous idea of merging was too simple.

Using a state diagram, it looks like this:

After clarifying my thoughts with this diagram, the subsequent work became much clearer.

First, I found that the original regex matching rules only applied to detection-type dat files and did not adapt well to scan-type files (mainly because the md5 types would fail to match). Therefore, I made some attempts and modified the regex from r'(\w+)\s+"([^"]*)"\s+size\s+(\d+)\s+md5-5000\s+([a-f0-9]+)' to r'(\w+)\s+"([^"]*)"\s+size\s+(\d+)((?:\s+md5(?:-\w+)?(?:-\w+)?\s+[a-f0-9]+)*)'.
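The difference between the two expressions can be checked directly. In the sketch below the two patterns are copied from the paragraph above, while the sample lines and the helper `parse_checksums` are hypothetical and only approximate what a detection-type and a scan-type entry might look like.

```python
import re

OLD_PATTERN = re.compile(
    r'(\w+)\s+"([^"]*)"\s+size\s+(\d+)\s+md5-5000\s+([a-f0-9]+)'
)
NEW_PATTERN = re.compile(
    r'(\w+)\s+"([^"]*)"\s+size\s+(\d+)((?:\s+md5(?:-\w+)?(?:-\w+)?\s+[a-f0-9]+)*)'
)

# Hypothetical sample lines.
detection_line = 'name "data.bin" size 1024 md5-5000 aa11'
scan_line = 'name "data.bin" size 1024 md5 bb22 md5-5000 aa11 md5-t-5000 cc33'


def parse_checksums(line):
    """Return {checksum type: value} using the new pattern, or None if no match."""
    match = NEW_PATTERN.search(line)
    if match is None:
        return None
    # Group 4 holds all "md5... <hex>" pairs as one string; split it into pairs.
    tokens = match.group(4).split()
    return dict(zip(tokens[0::2], tokens[1::2]))


print(OLD_PATTERN.search(detection_line) is not None)  # True
print(OLD_PATTERN.search(scan_line) is not None)       # False: plain "md5" comes first
print(parse_checksums(scan_line))
```

The new pattern captures every checksum pair in a single group, so the pairs still have to be split afterwards, as `parse_checksums` does.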

At the same time, I refactored my previous code for detecting duplicate files in detection and merging them. The original code did not query and match each entry from the checksum table during the merge, but this step is necessary to minimize collisions.

Initially, I wanted to reuse the code for viewing extended checksum tables within the fileset, but later I found that such reuse introduced bugs and made maintenance difficult. I was simply complicating things.

The logic is quite simple: for each file, compare its corresponding checktype checksum. If they match, it can be considered the same file. When merging scan into detection, this operation removes the original file from the fileset while inserting the file from scan into the fileset (since the information is more complete), but retains the detection status.
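The merge rule described above might be sketched as follows. The dictionary layout (`name`, `checksums`, `detection`), the default check type, and the decision to append scan-only files at the end are all assumptions made for the example; the real code works against database tables.

```python
def merge_scan_into_detection(detection_files, scan_files, checktype="md5-5000"):
    """Replace detection files with matching scan files, keeping detection status."""
    # Index scan files by their checksum of the chosen type.
    scan_by_checksum = {
        f["checksums"][checktype]: f
        for f in scan_files
        if checktype in f["checksums"]
    }
    merged = []
    matched = set()
    for det in detection_files:
        key = det["checksums"].get(checktype)
        scan = scan_by_checksum.get(key) if key is not None else None
        if scan is not None:
            # Same file: take the more complete scan record...
            replacement = dict(scan)
            # ...but retain the detection status of the original entry.
            replacement["detection"] = det["detection"]
            merged.append(replacement)
            matched.add(key)
        else:
            merged.append(det)
    # Assumption: scan files with no detection counterpart are simply added.
    for f in scan_files:
        if f["checksums"].get(checktype) not in matched:
            merged.append(f)
    return merged


# Hypothetical sample data.
detection = [{"name": "a.dat", "checksums": {"md5-5000": "aa11"}, "detection": True}]
scan = [
    {"name": "a.dat", "checksums": {"md5-5000": "aa11", "md5": "ff00"}, "detection": False},
    {"name": "b.dat", "checksums": {"md5": "ee99"}, "detection": False},
]
merged = merge_scan_into_detection(detection, scan)
print(merged)
```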

Speaking of detection status, a better practice would be to convert it from a Boolean type to a string type (indicating which md5 type it is). This would make recalculating the megakey more convenient. However, I haven’t had the opportunity to modify it this week due to the extensive logic involved. I’m considering adding a new column to the database instead of modifying the existing type. I plan to implement this idea in my work next week.


Week 4 – Implement

This week, I mainly focused on completing the automatic and manual merging of filesets. During my discussion with Sev, I realized that I had not fully understood the purpose and origin of the Megakey before, so I am documenting it here.

First, what is a Megakey:
Megakey is a combined key, coming from the detection entry.

Why do we need a Megakey:
The purpose of the Megakey is to recognize when we are dealing with the same detection entry. “You need this for updating the metadata in the DB since, over time, we will accumulate full sets, but still, we add game entries to the games on a regular basis.

Also, we do occasional target renames, so we cannot use that for the Megakey either.”

Where does it come from:
When you see that this is a detection set, then you need to compute the Megakey (on the Python side).

For example:
For any fileset, there should be a possibility to merge manually. So, let’s say we change the language of an entry from en to en-us. This will create a new fileset with en-us because the Megakey is different, but a developer could go to the log, click on the fileset reference, and merge it.

Or, say, a new file is added to the detection entry. The Megakey will not match, so you will again create a new entry, but the developer who made this change knows what they’re doing and can go and merge manually.
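To make the two examples concrete, here is one way such a combined key could be computed on the Python side. This is only a sketch under stated assumptions: the post does not list which fields go into the Megakey or how they are combined, so the choice of an MD5 digest over sorted metadata fields and file entries, the function name `compute_megakey`, and the sample values are all hypothetical. The target name is deliberately left out of the input, since target renames must not change the key.

```python
import hashlib


def compute_megakey(metadata, files):
    """Combine metadata fields and (name, size, checksum) entries into one key."""
    # Sort both parts so the key does not depend on input order.
    parts = [f"{key}={metadata[key]}" for key in sorted(metadata)]
    parts += [f"{name}:{size}:{checksum}" for name, size, checksum in sorted(files)]
    return hashlib.md5("\n".join(parts).encode("utf-8")).hexdigest()


# Hypothetical detection entry.
files = [("data.bin", 1024, "aa11"), ("intro.dat", 2048, "bb22")]
key_en = compute_megakey({"language": "en"}, files)
key_en_us = compute_megakey({"language": "en-us"}, files)
key_more_files = compute_megakey({"language": "en"}, files + [("new.dat", 10, "cc33")])

print(key_en != key_en_us)       # True: changing the language gives a new key
print(key_en != key_more_files)  # True: adding a file gives a new key
```

In both cases the differing key is what causes a new fileset to be created, which a developer can then merge manually.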

Additionally, I improved the query operations on the fileset page, which had previously performed many redundant calculations.

I enhanced the comparison page during the merge process by highlighting the items with differences.

So far, both automatic and manual merging are functioning correctly.