Categories
CLI, website

Week 12 – Wrap-Up

This week’s work mainly focused on wrapping things up: the end of the project is approaching, and I am refining the remaining details.

Concretely, I accomplished the following tasks:

  1. I thoroughly tested my code following the correct workflow.
  2. In the old version, matching filesets from the command line would repeatedly create the same fileset (visible in the logs as multiple occurrences of “Updated fileset: xxxx”). After some debugging, I found the problem. As shown in the diagram, the old version automatically inserted a new fileset once all possible matches had failed, which seems reasonable on its own. However, I had already implemented the “create the fileset first, then match” logic in a previous version, so this insert was redundant. Because the outer loop of this code was
    for matched_fileset_id, matched_count in matched_list:,
    the duplicate log line appeared once for every potentially matched fileset. The issue was minor but stemmed from my lack of careful consideration (see the sketch after this list).
  3. I added several new features: for instance, I added checkboxes next to each file on the fileset detail page for developers to conveniently delete unnecessary files, included sorting functionality in the fileset table on the fileset page, and highlighted checksums corresponding to detection types. These features did not involve complex logic, as they were primarily frontend enhancements, and therefore were completed without difficulty.
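To make item 2 concrete, here is a minimal sketch of the fix (the function and variable names are hypothetical, not the project’s actual code): the fileset is created once before matching, so the per-candidate insert that produced the repeated log lines is gone.

```python
# Hypothetical sketch of the fix in item 2 (names invented for illustration).
def match_fileset(candidate_fileset, matched_list, create_fileset, merge_into):
    # Create the incoming fileset a single time, up front.
    fileset_id = create_fileset(candidate_fileset)

    # The old version also inserted the fileset inside this loop whenever a
    # candidate failed to match, producing one duplicate
    # "Updated fileset: xxxx" log line per candidate.
    for matched_fileset_id, matched_count in matched_list:
        merge_into(matched_fileset_id, fileset_id, matched_count)

    return fileset_id
```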

Looking back at the entire project, I have written nearly 3,000 lines of code over these 12 weeks.

I am pleased that most of what I’ve accomplished so far has met expectations. However, some improvements are still needed before deployment, such as support for MacBinary and other Mac formats, and user IP identification. My sense of responsibility drives me to keep refining this project even after GSoC ends, until it can truly be put into production. I look forward to that day! 😄

Categories
website

Week 11 – Testing

This week, I mainly focused on testing the system I previously developed to see if there were any overlooked issues. Fortunately, the problems I discovered were minimal. Below are the issues I fixed and the new features I implemented this week:

  1. Marking a fileset as full on the fileset page is straightforward; I just needed to add a button for that.
  2. The set.dat file may contain checksums for “sha1” and “crc,” which I initially forgot to ignore. As a result, these two checksum types appeared in the fileset.
  3. I added a “ready_for_review” page. I realized I could reuse the file_search page, so I simply created a redirect to this page.
  4. Speaking of the file_search page, I finally fixed the pagination error. The issue arose because the pagination did not account for filter conditions on the query page, so paging through results showed the entire table (i.e., the query ran without any conditions). The error was hidden in a small detail, but I managed to fix it (see the sketch after this list).
  5. I added a user_count based on distinct user IPs. I implemented a simple prototype, but I’m still considering whether there’s a better solution, such as creating a new table in the database to store user IP addresses. I’m not sure whether this is necessary, and I may consult my mentor for advice later.
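As a rough illustration of the pagination fix in item 4, here is a hedged Flask sketch (the route, parameter names, and query code are assumptions, not the project’s actual implementation): the filter conditions are read from the query string and carried into the page links, so paging no longer falls back to the unfiltered table.

```python
from flask import Flask, request, url_for

app = Flask(__name__)

@app.route("/file_search")
def file_search():
    page = request.args.get("page", 1, type=int)
    # Keep every non-page query parameter as a filter condition.
    filters = {k: v for k, v in request.args.items() if k != "page"}

    # ... run the filtered query with LIMIT/OFFSET derived from `page` ...

    # Build prev/next links that preserve the filters instead of dropping them.
    prev_url = url_for("file_search", page=max(page - 1, 1), **filters)
    next_url = url_for("file_search", page=page + 1, **filters)
    return f'<a href="{prev_url}">prev</a> <a href="{next_url}">next</a>'
```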

Overall, this week has been relatively easy, but I’m still thinking about areas for improvement.

Categories
website

Week 10 – Fixing

This week, my main focus was on filling in gaps and fixing issues. Here are two significant changes I made:

The database backend needs to return counts of matched, missing, and extra files when verifying file integrity for users. I quickly realized that returning only a few count mappings after the database query was insufficient, because other functions need the detailed information behind those counts. As a result, I changed the dictionary type from defaultdict(int) to defaultdict(list).
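To illustrate the change (the keys and record fields here are made up), switching to defaultdict(list) keeps the detailed rows, and the original counts are simply the lengths of each bucket:

```python
from collections import defaultdict

# Old: counts only.
counts = defaultdict(int)
counts["match"] += 1

# New: full detail per category; counts are derived with len().
details = defaultdict(list)
details["match"].append({"name": "aaa", "size": 123})
details["missing"].append({"name": "bbb", "size": 222})

match_count = len(details["match"])      # 1
missing_count = len(details["missing"])  # 1
```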

At the same time, I discovered that I couldn’t retrieve user filesets on the fileset query page, even though the logs showed that several user-status filesets had indeed been inserted into the database. After investigating, I found the root cause: I had not inserted the game metadata when the user-status filesets were added. After some modifications, the problem was resolved.

In summary, this week’s work was much easier than before; I just needed to test different scenarios and make fixes based on the results.

Categories
CLI

Week 9 – Wrapping Up Major Features

This week, I mainly focused on the last scenario: performing integrity checks on the data submitted from the user side. Structurally, this task is not very different from the previous scenarios. However, I initially adopted an unreliable matching method:

I first searched the database for filesets where the game ID, engine ID, platform, and language matched the metadata provided by the user. Then, I used a two-pointer approach to check each file within these filesets.

If any file did not match, I marked it as ‘size_mismatch’ or ‘checksum_mismatch’. However, I quickly realized that this comparison approach was unreliable: it performed no boundary checks and overlooked many unexpected situations.

For the files of a single game, comparing them one by one has acceptable time complexity (similar to the handling of the match arguments in the earlier dat_parser.py). Moreover, based on my experience playing games, reporting ‘checksum_mismatch’ or ‘size_mismatch’ is unnecessary; three responses, ‘ok’, ‘extra’, and ‘missing’, are sufficient. Therefore, I restructured this part of the logic and ran some tests, and it is now working very well!
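As a simple sketch of the restructured check (field names are assumptions, and files are matched by name only for brevity), the three responses fall out of basic set operations:

```python
def check_integrity(user_files, fileset_files):
    """Both arguments are iterables of dicts with at least a 'name' key."""
    user_names = {f["name"] for f in user_files}
    fileset_names = {f["name"] for f in fileset_files}
    return {
        "ok": sorted(user_names & fileset_names),       # present on both sides
        "extra": sorted(user_names - fileset_names),    # user has it, fileset does not
        "missing": sorted(fileset_names - user_names),  # fileset expects it, user lacks it
    }

check_integrity([{"name": "aaa"}, {"name": "ccc"}],
                [{"name": "aaa"}, {"name": "bbb"}])
# {'ok': ['aaa'], 'extra': ['ccc'], 'missing': ['bbb']}
```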

Categories
website

Week 8 – Application

In addition to finishing some remaining features, my work this week mainly focused on the final scenario: users uploading checksums of their local game files, which means we need to deal with network interactions.

Previously, I had been testing my Flask application through SSH port forwarding, tunneling requests over SSH, and I had never actually accessed it through a browser using a domain name. When I was told that I needed to expose the localhost service running on the VM, I initially thought I could simply start my Flask app with Gunicorn and access it directly via domain:port. However, I underestimated the situation: the server only has port 80 open for external access, and it already hosts several web pages there.

After some investigation, I discovered that the web service was using Apache2 instead of the Nginx I was familiar with or the increasingly popular serverless setup, which made things feel a bit tricky initially. After researching some information and documentation, I decided to use a reverse proxy to mount my Flask application under the corresponding subdomain on the server. (Unlike PHP, Flask does not process requests based on files, so special configuration is required.)

However, after discussions, I chose a different approach: using mod_wsgi, as it is more compatible with the existing Apache than Gunicorn. After quickly reading through the official documentation and modifying the Apache configuration file, I successfully replaced the original PHP web pages.
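For reference, a minimal WSGI entry point of the kind mod_wsgi loads looks roughly like this (the app here is a stand-in; in the real setup the project’s Flask app would be imported instead, and the Apache directives in the comment are illustrative only):

```python
from flask import Flask

# mod_wsgi looks for a module-level callable named `application`.
application = Flask(__name__)

@application.route("/")
def index():
    return "served through Apache + mod_wsgi"

# The Apache side boils down to a few directives along these lines
# (paths and names are placeholders):
#   WSGIDaemonProcess gamesdb python-home=/path/to/venv
#   WSGIScriptAlias /gamesdb /path/to/project/wsgi.py
#   WSGIProcessGroup gamesdb
```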

Categories
website

Week 7 – Fixing Details

This week, progress on the new feature work has been limited (though I did manage to complete some new features). On the one hand, there weren’t many features left to implement. On the other hand, I discovered some bugs from previous work this week and focused on fixing them.

The first issue was related to querying and displaying the history table for each fileset. Ideally, for example, if we merge fileset 2 into fileset 23, and then fileset 23 into fileset 123, the history page for fileset 123 should trace back to which filesets it was merged from. Initially, my code simply used SELECT * FROM history WHERE fileset = {id} OR oldfileset = {id}, which was clearly insufficient.

So, I initially thought about using a method similar to an unoptimized union-find (a habit from competitive programming 🤣) to quickly query merged filesets. However, I realized this would require an additional table (or file) to store the parent nodes of each node in the union-find structure, which seemed cumbersome. Therefore, I opted to straightforwardly write a recursive query function, which surprisingly performed well, and I used this naive method to solve the problem.
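Here is a naive recursive traversal in the spirit of what I described (the table and column names follow the SELECT above, while the cursor API and placeholder style are assumptions): starting from one fileset, follow every history row that merged something into it, then recurse into those sources.

```python
def collect_merged_sources(cursor, fileset_id, seen=None):
    """Return the set of fileset ids that were (directly or indirectly)
    merged into `fileset_id`, according to the history table."""
    if seen is None:
        seen = set()
    cursor.execute(
        "SELECT oldfileset FROM history WHERE fileset = %s", (fileset_id,)
    )
    for (old_id,) in cursor.fetchall():
        if old_id != fileset_id and old_id not in seen:
            seen.add(old_id)
            # The old fileset may itself have been the target of earlier merges.
            collect_merged_sources(cursor, old_id, seen)
    return seen
```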

Another issue arose on the manual match page for each fileset. Although the program always identifies the fileset with the most matches, I found that the number of matched files between two filesets was significantly off during testing. Upon investigation, I discovered that the displayed count was consistently five times the actual number of matches. Where did this factor of five come from? Initially, I was baffled.

After logging several entries, I finally pinpointed the problem. The manual match page retrieves data from the database, while the dat_parser uses data from DAT files. These sources store file data in slightly different formats. For example, the format from a DAT file might look like:

[{"name":"aaa", "size":123, "md5":"123456", "md5-5000":"45678"}, {"name":"bbb", "size":222, "md5":"12345678", "md5-5000":"4567890"}...]

But the format retrieved from the database might look like:

[{"name":"aaa", "size":123, "md5":"123456"}, {"name":"aaa", "size":123, "md5-5000":"45678"}...]

It then became clear that scan-type filesets stored in the database had five checksum types per file in file_checksum (meaning each file was treated as five separate files when matched), which made the counter increase by five for every such file encountered. After simply regrouping the retrieved database rows, I resolved this bug.
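The regrouping itself is tiny; here is a sketch using the two data shapes shown above (values are illustrative only): database rows carrying one checksum each are folded back into a single dict per file name before counting matches.

```python
from collections import defaultdict

db_rows = [
    {"name": "aaa", "size": 123, "md5": "123456"},
    {"name": "aaa", "size": 123, "md5-5000": "45678"},
    {"name": "bbb", "size": 222, "md5": "12345678"},
]

# Fold the per-checksum rows back into one record per file name.
merged = defaultdict(dict)
for row in db_rows:
    merged[row["name"]].update(row)

files = list(merged.values())
# [{'name': 'aaa', 'size': 123, 'md5': '123456', 'md5-5000': '45678'},
#  {'name': 'bbb', 'size': 222, 'md5': '12345678'}]
```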

Categories
CLI

Week 6 – Refinement

This week’s work went relatively smoothly, and I encountered nothing too challenging. At the beginning of the week, Sev reminded me that my original regex was matching too slowly. I reviewed the code and realized I had fallen into an X-Y problem.

That is, I was focused on using a regex to solve the string-matching problem, overlooking the fact that a regex might not be necessary at all. My expression does match every case in the existing dat files, but that generality is not needed: the structure within each “rom” section is fixed and the blocks are always separated by spaces, so token matching is sufficient. There is no need for a complex regex to cover every edge case, and performance is much better (linear time, with no backtracking).
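As a hedged illustration of the token idea (the line format is inferred from the regex quoted in the Week 5 post, so treat it as an assumption about the dat files), splitting on whitespace and walking key/value pairs is all that is needed:

```python
import shlex

def parse_rom_tokens(line):
    # shlex keeps the quoted file name together as a single token.
    tokens = shlex.split(line)
    entry = {"name": tokens[1]}   # tokens[0] is the leading key word, e.g. "name"
    rest = tokens[2:]
    # The remainder is a flat sequence of space-separated key/value pairs.
    for key, value in zip(rest[::2], rest[1::2]):
        entry[key] = value
    return entry

parse_rom_tokens('name "aaa" size 123 md5-5000 45678 md5 123456')
# {'name': 'aaa', 'size': '123', 'md5-5000': '45678', 'md5': '123456'}
```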

As planned last week, I added a detection_type column to the file table. This makes it clearer when recalculating the megakey.

With the addition of the detection_type column and set.dat support, the code logic needs some extra handling. I therefore also refactored the original code, modularizing the matching-related logic to make future development and extension easier.

Categories
CLI

Week 5 – Merge

In the past few weeks, I completed most of the work on the detection part, including merging and deduplication. This week, I mainly focused on merging the remaining types of dat files.

After discussing with sev, I realized that my previous idea of merging was too simple.

Using a state diagram, it looks like this:

After clarifying my thoughts with this diagram, the subsequent work became much clearer.

First, I found that the original regex matching rules only applied to detection-type dat files and did not adapt well to scan-type files (mainly because the md5 types would fail to match). Therefore, I made some attempts and modified the regex from r'(\w+)\s+"([^"]*)"\s+size\s+(\d+)\s+md5-5000\s+([a-f0-9]+)' to r'(\w+)\s+"([^"]*)"\s+size\s+(\d+)((?:\s+md5(?:-\w+)?(?:-\w+)?\s+[a-f0-9]+)*)'.
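To show how the widened regex can be consumed (the expression is the one quoted above; the surrounding parser code is simplified), the last group captures the whole run of md5 variants, which is then split into key/value pairs:

```python
import re

ROM_RE = re.compile(
    r'(\w+)\s+"([^"]*)"\s+size\s+(\d+)((?:\s+md5(?:-\w+)?(?:-\w+)?\s+[a-f0-9]+)*)'
)

line = 'name "aaa" size 123 md5-5000 45678 md5 123456'
m = ROM_RE.search(line)
entry = {"name": m.group(2), "size": int(m.group(3))}

# Group 4 holds alternating checksum keys and values.
checksum_tokens = m.group(4).split()
entry.update(dict(zip(checksum_tokens[::2], checksum_tokens[1::2])))
# {'name': 'aaa', 'size': 123, 'md5-5000': '45678', 'md5': '123456'}
```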

At the same time, I refactored my previous code for detecting duplicate files in detection and merging them. The original code did not query and match each entry from the checksum table during the merge, but this step is necessary to minimize collisions.

Initially, I wanted to reuse the code for viewing extended checksum tables within the fileset, but later I found that such reuse introduced bugs and made maintenance difficult. I was simply complicating things.

The logic is quite simple: for each file, compare its corresponding checktype checksum. If they match, it can be considered the same file. When merging scan into detection, this operation removes the original file from the fileset while inserting the file from scan into the fileset (since the information is more complete), but retains the detection status.
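A rough sketch of that rule (the record shape and the list of checktypes are assumptions): if any shared checksum type agrees, the two entries are treated as the same file, the richer scan data replaces the detection data, and the detection status is carried over.

```python
CHECK_KEYS = ("md5", "md5-5000")  # whichever checktypes both sides may carry

def same_file(detection_file, scan_file):
    return any(
        key in detection_file and key in scan_file
        and detection_file[key] == scan_file[key]
        for key in CHECK_KEYS
    )

def merge_file(detection_file, scan_file):
    merged = dict(scan_file)  # scan data is more complete, so keep all of it
    merged["detection"] = detection_file.get("detection", True)  # retain status
    return merged

det = {"name": "aaa", "md5-5000": "45678", "detection": True}
scan = {"name": "aaa", "size": 123, "md5": "123456", "md5-5000": "45678"}
if same_file(det, scan):
    det = merge_file(det, scan)
# {'name': 'aaa', 'size': 123, 'md5': '123456', 'md5-5000': '45678', 'detection': True}
```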

Speaking of detection status, a better practice would be to convert it from a Boolean type to a string type (indicating which md5 type it is). This would make recalculating the megakey more convenient. However, I haven’t had the opportunity to modify it this week due to the extensive logic involved. I’m considering adding a new column to the database instead of modifying the existing type. I plan to implement this idea in my work next week.

Categories
CLI

Week 4 – Implementation

This week, I mainly focused on completing the automatic and manual merging of filesets. During my discussion with Sev, I realized that I had not fully understood the purpose and origin of the Megakey before, so I am documenting it here.

First, what is a Megakey:
A Megakey is a combined key, derived from the detection entry.

Why do we need a Megakey:
The purpose of the Megakey is to understand that we are dealing with the same detection entry. “You need this for updating the metadata in the DB since, over time, we will accumulate full sets, but still, we add game entries to the games on a regular basis.

Also, we do occasional target renames, so we cannot use that for the Megakey either.”

Where does it come from:
When you see that this is a detection set, then you need to compute the Megakey (on the Python side).

For example:
For any fileset, there should be a possibility to merge manually. So, let’s say we change the language of an entry from en to en-us. This will create a new fileset with en-us because the Megakey is different, but a developer could go to the log, click on the fileset reference, and merge it.

Or, say, a new file is added to the detection entry. The Megakey will not match, so you will again create a new entry, but the developer who made this change knows what they’re doing and can go and merge manually.
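To make the idea tangible, here is a heavily hedged sketch of what computing a Megakey could look like (the actual field list and hashing scheme are the project’s own and are not reproduced here): the metadata fields and the detection files all feed into one hash, so changing the language (en to en-us) or adding a detection file yields a different key.

```python
import hashlib

def compute_megakey(fileset):
    # Field names below are illustrative, not the real schema.
    parts = [
        fileset.get("engine", ""),
        fileset.get("gameid", ""),
        fileset.get("platform", ""),
        fileset.get("language", ""),
    ]
    # Include the detection files so a changed detection entry changes the key.
    for f in sorted(fileset.get("rom", []), key=lambda f: f["name"]):
        parts.extend([f["name"], str(f.get("md5-5000", ""))])
    return hashlib.md5(":".join(parts).encode()).hexdigest()
```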

Additionally, I improved the query operations on the fileset page, as I mistakenly performed many redundant calculations before.

I also enhanced the comparison page used during merging by highlighting the items that differ.

So far, both automatic and manual merging are functioning correctly.

Categories
website

Week 3 – Rewrite and Design

The amount of information and work this week has increased compared to the previous two weeks.

At the end of last week, I completed rewriting the original web code in Python. What remains are the new features that still need to be implemented. I asked Sev about the upcoming tasks and understood the entire workflow.

I have summarized the four workflows as follows:

I also successfully deployed the test web page, which had previously only run locally, to the server. Additionally, the database is now populated with data generated from the actual scummvm.dat instead of data I fabricated myself.

I fixed dat_parser, and now it can properly parse strings and insert data into the server.

It seems like a lot of features have indeed been implemented, but there are still some bugs that haven’t been fixed yet (duplicate file detection, database operations being too slow, etc.).

In addition, here is Sev’s suggestion for the implementation:

“Since we do have entries that have equal set of files with the only difference is the language and/or platform, then add those to the megakey, so, in case there is a change in those fields, it will be treated as a newly created fileset and in order to manage it, implement a way for manual merge of filesets, which will lead to metadata of an old version be overridden by the metadata from the incoming one, pertaining history”

I will focus on these issues next week.