Categories
CLI

Week 9 — Wrapping Up Major Features

This week, I mainly focused on the last scenario, which involves performing integrity checks on data submitted from the user side. Structurally, this task is not very different from the previous scenarios. However, I initially adopted an unreliable matching method:

I first searched the database for filesets where the game ID, engine ID, platform, and language matched the metadata provided by the user. Then, I used a two-pointer approach to check each file within these filesets.

If any file did not match, I marked it as ‘size_mismatch’ or ‘checksum_mismatch’. However, I quickly realized that this comparison approach was unreliable: it performed no boundary checks and overlooked many edge cases.

For the files of a single game, comparing them one by one is acceptable in terms of time complexity (similar to the implementation of the match arguments in the earlier dat_parser.py). Moreover, based on my experience playing games, it is unnecessary to report ‘checksum_mismatch’ or ‘size_mismatch’; the three responses ‘ok’, ‘extra’, and ‘missing’ are sufficient. Therefore, I restructured this part of the logic and ran some tests, and it looks like it’s working very well now! A rough sketch of the comparison is below.
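
As a rough sketch, assuming both sides have already been normalized into dictionaries keyed by filename (all names here are hypothetical, and matching is done by filename only to keep the example short), the check boils down to:

    # Minimal sketch of the ok/extra/missing classification.
    # Assumes both sides are {filename: checksum} dictionaries;
    # all names here are hypothetical, not the real implementation.
    def classify_files(server_files, user_files):
        result = {"ok": [], "extra": [], "missing": []}
        for name in user_files:
            if name in server_files:
                result["ok"].append(name)      # present on both sides
            else:
                result["extra"].append(name)   # user has a file we don't track
        for name in server_files:
            if name not in user_files:
                result["missing"].append(name) # expected file the user lacks
        return result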

Categories
website

Week 8 – Application

In addition to continuing to finish some unfinished features, my work this week mainly focused on the final scenario: users uploading checksums of their local game files, which means we need to handle network interactions.

Previously, I had been using SSH port forwarding and tunneling requests through SSH to test my Flask application, but I had never actually accessed it through a browser using a domain name. When I was told that I needed to expose the localhost service running on the VM, I initially thought I could simply start my Flask app with Gunicorn and then access it directly via domain:port. However, I had underestimated the situation: the server only had port 80 open for external access, and it was already hosting several web pages.

After some investigation, I discovered that the web service was running on Apache2 rather than the Nginx I was familiar with or the increasingly popular serverless setups, which made things feel a bit tricky at first. After reading some documentation, I decided to use a reverse proxy to mount my Flask application under the corresponding subdomain on the server. (Unlike PHP, Flask does not serve requests on a per-file basis, so special configuration is required.)

However, after some discussion, I chose a different approach: using mod_wsgi, since it integrates better with the existing Apache setup than Gunicorn does. After quickly reading through the official documentation and modifying the Apache configuration file, I successfully replaced the original PHP pages. A rough sketch of the setup is below.
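
For context, a mod_wsgi deployment needs two pieces: a small WSGI entry-point file that exposes the Flask app under the name "application", and a WSGIScriptAlias directive in the Apache configuration pointing at it. A minimal sketch, with purely illustrative paths and names rather than the real server layout:

    # app.wsgi -- hypothetical entry point for mod_wsgi.
    # The Apache virtual host would contain something along the lines of:
    #   WSGIDaemonProcess integrity python-path=/path/to/project
    #   WSGIScriptAlias / /path/to/project/app.wsgi
    # (paths and names are illustrative, not the actual configuration)
    import sys
    sys.path.insert(0, "/path/to/project")  # make the project importable

    from app import app as application  # mod_wsgi looks for the name "application"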

Categories
website

Week 7 – Fixing Details

This week, progress on new feature work was limited (though I did manage to complete a few new features). On the one hand, there weren’t many features left to implement; on the other hand, I discovered some bugs from previous work and focused on fixing them.

The first issue was related to querying and displaying the history table for each fileset. Ideally, if we merge fileset 2 into fileset 23, and then fileset 23 into fileset 123, the history page for fileset 123 should trace back through all the filesets it was merged from. Initially, my code simply used SELECT * FROM history WHERE fileset = {id} OR oldfileset = {id}, which was clearly insufficient.

So, I initially thought about using something like an unoptimized union-find (a habit from competitive programming 🤣) to quickly query merged filesets. However, I realized this would require an additional table (or file) to store each node’s parent in the union-find structure, which seemed cumbersome. Therefore, I opted to simply write a recursive query function, which surprisingly performed well, and this naive method solved the problem. A sketch of the idea follows.
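
A sketch of that recursive lookup, assuming a query(sql, params) helper that returns rows as dictionaries (the helper and the exact column names are assumptions for illustration):

    # Naive recursive traversal of the merge history.
    # `query` is an assumed helper returning rows as dicts.
    def collect_merged_filesets(fileset_id, visited=None):
        if visited is None:
            visited = set()
        if fileset_id in visited:
            return visited                  # guard against cycles
        visited.add(fileset_id)
        rows = query("SELECT oldfileset FROM history WHERE fileset = %s", (fileset_id,))
        for row in rows:
            # Recurse into every fileset that was merged into this one.
            collect_merged_filesets(row["oldfileset"], visited)
        return visited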

Another issue arose on the manual match page for each fileset. Although the program always identifies the fileset with the most matches, I found that the number of matched files between two filesets was significantly off during testing. Upon investigation, I discovered that the displayed count was consistently five times the actual number of matches. Where did this factor of five come from? Initially, I was baffled.

After adding a few log statements, I finally pinpointed the problem. The manual match page retrieves data from the database, while the dat_parser uses data from DAT files. These sources store file data in slightly different formats. For example, the format from a DAT file might look like:

[{"name":"aaa", "size":123, "md5":"123456", "md5-5000":"45678"}, {"name":"bbb", "size":222, "md5":"12345678", "md5-5000":"4567890"}...]

But the format retrieved from the database might look like:

[{"name":"aaa", "size":123, "md5":"123456"}, {"name":"aaa", "size":123, "md5-5000":"45678"}...]

It then became clear that scan-type filesets stored in the database had five types of file_checksum (meaning each file was effectively treated as five separate files during matching), so the counter naturally increased by five for each such file encountered. After simply regrouping the retrieved database data, I resolved this bug; a sketch of the regrouping is below.
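
The fix is essentially to regroup the per-checksum rows coming from the database so that each file becomes a single dictionary again, in the same shape the dat_parser produces. A hedged sketch, with key names taken from the examples above:

    # Regroup per-checksum database rows into one dict per file,
    # matching the DAT-style format. Key names follow the examples above.
    from collections import defaultdict

    def regroup_db_files(rows):
        grouped = defaultdict(dict)
        for row in rows:
            # Merge every row for the same filename into a single dict,
            # so "aaa" ends up carrying size, md5, md5-5000, ... together.
            grouped[row["name"]].update(row)
        return list(grouped.values())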

Categories
CLI

Week 6 – Refinement

This week’s work went relatively smoothly, and I encountered nothing too challenging. At the beginning of the week, sev reminded me that my original regex was matching too slowly. I reviewed the code and realized I had fallen into an X-Y problem myself.

That is, I was focused on using regex to solve the string-matching problem, while overlooking the fact that an extra regex might not be needed at all. My current expression can indeed match all the cases in the existing dat files, but it simply isn’t necessary. Since the structure within each “rom” section is fixed and the blocks are always separated by a space, token matching is sufficient. There’s no need for a complex regex to cover all edge cases, and the performance is much better (linear time complexity).
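
As an illustration, a token-based pass over a single rom line could look like the sketch below. The line format is simplified here (quoted names containing spaces would need extra care), so treat it as the idea rather than the actual parser:

    # Simplified token-based parsing of a single "rom ( ... )" line.
    # Assumes key/value pairs separated by single spaces and no spaces
    # inside values -- a simplification of the real DAT format.
    def parse_rom_line(line):
        tokens = [t for t in line.strip().split(" ") if t and t not in ("rom", "(", ")")]
        rom = {}
        # Consecutive tokens form key/value pairs: name "aaa" size 123 md5 ...
        for key, value in zip(tokens[::2], tokens[1::2]):
            rom[key] = value.strip('"')
        return rom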

As planned last week, I added a detection_type column to the file table. This makes things clearer when recalculating the megakey.

Due to the addition of the detection_type column and set.dat, the code logic requires some extra handling. Therefore, I also refactored the original code, decoupling the matching-related logic through modularization to facilitate future development and extension.