Categories
CLI

Week 5 – Merge

In the past few weeks, I completed most of the work on the detection part, including merging and deduplication. This week, I mainly focused on merging the remaining types of .dat files.

After discussing with Sev, I realized that my previous idea of merging was too simple.

Using a state diagram, it looks like this:

After clarifying my thoughts with this diagram, the subsequent work became much clearer.

First, I found that the original regex only applied to detection-type .dat files and did not adapt well to scan-type files (mainly because entries with other md5 checktypes failed to match). After some experimenting, I changed the regex from

r'(\w+)\s+"([^"]*)"\s+size\s+(\d+)\s+md5-5000\s+([a-f0-9]+)'

to

r'(\w+)\s+"([^"]*)"\s+size\s+(\d+)((?:\s+md5(?:-\w+)?(?:-\w+)?\s+[a-f0-9]+)*)'

so that any number of md5 variants following the size field are captured.
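A quick sketch of the new pattern in action. The sample line and file name below are hypothetical, not taken from a real dat file:

```python
import re

# The updated pattern: entry type, quoted path, size, then zero or more
# checksum pairs such as `md5 <hex>`, `md5-5000 <hex>`, or `md5-t-5000 <hex>`.
ROM_RE = re.compile(
    r'(\w+)\s+"([^"]*)"\s+size\s+(\d+)'
    r'((?:\s+md5(?:-\w+)?(?:-\w+)?\s+[a-f0-9]+)*)'
)

# Hypothetical scan-type line carrying two checksum variants.
line = ('rom "DATA/intro.dat" size 2048 '
        'md5-5000 0f343b0931126a20f133d67c2b018a3b '
        'md5-t-5000 60b27f004e454aca81b0480209cce508')

m = ROM_RE.match(line)
kind, path, size, blob = m.groups()
# Split the trailing checksum blob into (checktype, value) pairs.
pairs = re.findall(r'(md5(?:-\w+)?(?:-\w+)?)\s+([a-f0-9]+)', blob)
print(kind, path, size, pairs)
```

The old pattern would have rejected this line outright, because it required exactly one md5-5000 pair.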

At the same time, I refactored my earlier code for detecting duplicate filesets and merging them. The original code did not look up and match each entry against the checksum table during the merge, but that step is necessary to minimize collisions.

Initially, I wanted to reuse the code for viewing extended checksum tables within the fileset, but later I found that such reuse introduced bugs and made maintenance difficult. I was simply complicating things.

The logic is quite simple: for each file, compare the checksum of the corresponding checktype; if they match, the two can be considered the same file. When merging a scan into a detection, this removes the original file from the fileset and inserts the file from the scan in its place (since its information is more complete), while retaining the detection status.
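A minimal sketch of that rule, assuming a simplified in-memory representation (the dict fields here are placeholders, not the project's real schema):

```python
def merge_scan_into_detection(detection_files, scan_files):
    """Replace each detection file with its matching scan file,
    keeping the detection status. Files are dicts like
    {"name": ..., "size": ..., "checksums": {"md5-5000": ...}, "detection": bool}.
    """
    merged = []
    for det in detection_files:
        match = None
        for scan in scan_files:
            # Compare only the checktypes both sides actually have.
            common = set(det["checksums"]) & set(scan["checksums"])
            if common and all(det["checksums"][t] == scan["checksums"][t]
                              for t in common):
                match = scan
                break
        if match is not None:
            merged_file = dict(match)  # scan data is more complete
            merged_file["detection"] = det.get("detection", True)  # keep status
            merged.append(merged_file)
        else:
            merged.append(det)
    return merged
```

The real implementation works against the database rather than in-memory lists, but the matching rule is the same.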

Speaking of detection status, a better practice would be to convert it from a Boolean type to a string type (indicating which md5 type it is). This would make recalculating the megakey more convenient. However, I haven’t had the opportunity to modify it this week due to the extensive logic involved. I’m considering adding a new column to the database instead of modifying the existing type. I plan to implement this idea in my work next week.

Categories
CLI

Week 4 – Implement

This week, I mainly focused on completing the automatic and manual merging of filesets. During my discussion with Sev, I realized that I had not fully understood the purpose and origin of the Megakey before, so I am documenting it here.

First, what is a Megakey:
A Megakey is a combined key derived from the detection entry.

Why do we need a Megakey:
The purpose of the Megakey is to recognize that we are dealing with the same detection entry. As Sev put it: “You need this for updating the metadata in the DB since, over time, we will accumulate full sets, but still, we add game entries to the games on a regular basis.

Also, we do occasional target renames, so we cannot use that for the Megakey either.”

Where does it come from:
When you see that a fileset is a detection set, you need to compute the Megakey (on the Python side).

For example:
For any fileset, there should be a possibility to merge manually. So, let’s say we change the language of an entry from en to en-us. This will create a new fileset with en-us because the Megakey is different, but a developer could go to the log, click on the fileset reference, and merge it.

Or, say, a new file is added to the detection entry. The Megakey will not match, so you will again create a new entry, but the developer who made this change knows what they’re doing and can go and merge manually.
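To make these examples concrete, here is one hypothetical way a Megakey could be computed: hash the sorted file list together with distinguishing metadata such as language and platform. All field names here are assumptions, not the project's actual scheme. Changing en to en-us then naturally yields a different key:

```python
import hashlib

def compute_megakey(fileset):
    # Combine the distinguishing metadata with the sorted file list,
    # then hash the whole string. Field names are illustrative only.
    parts = [fileset.get("language", ""), fileset.get("platform", "")]
    for f in sorted(fileset["files"], key=lambda f: f["name"]):
        parts.extend([f["name"], f["checksum"]])
    return hashlib.md5(":".join(parts).encode()).hexdigest()

entry = {
    "language": "en",
    "platform": "dos",
    "files": [{"name": "intro.dat", "checksum": "0f343b09"}],
}
key_en = compute_megakey(entry)
key_enus = compute_megakey({**entry, "language": "en-us"})
print(key_en != key_enus)  # a language change produces a new Megakey
```

Adding or removing a file changes the key in the same way, which is exactly why such edits surface as new filesets that a developer then merges manually.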

Additionally, I improved the query operations on the fileset page, since my earlier version performed many redundant calculations.

I also enhanced the comparison page shown during a merge by highlighting the items that differ.

So far, both automatic and manual merging are functioning correctly.

Categories
website

Week 3 – Rewrite and Design

The amount of information and work this week has increased compared to the previous two weeks.

At the end of last week, I completed rewriting the original web code in Python. What remains are the new features that still need to be implemented. I asked Sev about the upcoming tasks and understood the entire workflow.

I have summarized the four workflows as follows:

I also successfully deployed the test webpage, which had previously only run locally, to the server. Additionally, the database is now populated with data generated from the actual scummvm.dat rather than data I fabricated myself.

I fixed dat_parser, and now it can properly parse strings and insert data into the server.

It seems like a lot of features have indeed been implemented, but there are still some bugs that haven’t been fixed yet (duplicate file detection, database operations being too slow, etc.).

Besides, here is Sev’s suggestion for the implementation:

“Since we do have entries that have an equal set of files, with the only difference being the language and/or platform, add those to the megakey. So, in case there is a change in those fields, it will be treated as a newly created fileset, and in order to manage it, implement a way for manual merge of filesets, which will lead to the metadata of an old version being overridden by the metadata from the incoming one, preserving history.”

I will focus on these issues next week.

Categories
website

Week 2 – Getting better

This week, my main tasks were parsing the .dat files and fixing errors in the original database functions. After closely examining some .dat files, I found their format very similar to JSON, which I am quite familiar with. Thus, I only needed to perform some bracket-matching operations.
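A rough sketch of that bracket-matching idea, using a made-up `.dat`-style snippet. The nesting of `key ( ... )` blocks is what makes the format feel JSON-like:

```python
import shlex

def parse_dat(text):
    """Parse nested `key ( ... )` blocks into lists of (key, value) pairs."""
    tokens = shlex.split(text)  # shlex handles the quoted strings for us

    def block(i):
        data = []
        while i < len(tokens) and tokens[i] != ")":
            key = tokens[i]
            if i + 1 < len(tokens) and tokens[i + 1] == "(":
                value, i = block(i + 2)   # recurse into a nested block
            else:
                value, i = tokens[i + 1], i + 2
            data.append((key, value))
        return data, i + 1  # skip the closing ')'

    return block(0)[0]

sample = 'game ( name "Foo Bar" rom ( name intro.dat size 100 ) )'
parsed = parse_dat(sample)
print(parsed)
```

The real parser has to deal with more edge cases (comments, escaping, malformed input), but the recursive structure is the same.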

However, when I tried to port the PHP code to Python, I encountered numerous errors. The issues came from Python needing additional boundary checks on strings and from differences in database handling between pymysql and PHP. Additionally, the original code contained some unnecessary transaction operations: a query should only require cursor.execute() rather than a following conn.commit(). After making the necessary fixes, I resolved these issues.
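To illustrate the execute/commit distinction, here is a tiny example using the stdlib sqlite3 module. The project itself uses pymysql, but the cursor mechanics are analogous, and the table name is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fileset (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO fileset (name) VALUES (?)", ("scummvm.dat",))
conn.commit()  # needed here: this write runs inside a transaction

cur.execute("SELECT name FROM fileset WHERE id = ?", (1,))
row = cur.fetchone()  # a read needs only cursor.execute(), no commit
print(row[0])
```

With autocommit enabled on the connection, even the explicit commit after writes becomes unnecessary, which is what made the ported code simpler.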

 

Currently, I have replicated most of the main functionalities of the original code. Only a few minor details related to appearance and database exception handling remain, which I plan to address next week.

Categories
website

Week 1 – Start

In the first week, my main task was to replace the original PHP server code with Python code. My plan was to use the Flask library to create a web server.

During the first week, I primarily focused on rewriting all the functions in the original db_functions.php file. This mainly involved operations related to MySQL and SQL statements, so I used the pymysql library to complete the rewrite. The commit record is here: https://github.com/InariInDream/scummvm-sites/commits/integrity/

 

However, the page markup cannot be generated as directly as in PHP with statements like echo "<h2><u>Fileset: {$id}</u></h2>";. It has to be rendered through Flask templates instead. Therefore, my focus for next week will be the design of the appearance (tables, forms).
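As a minimal sketch of what the Flask-side rendering could look like (the route and template below are illustrative, not the project's actual code):

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Template equivalent of the PHP echo above; Jinja substitutes the id.
TEMPLATE = "<h2><u>Fileset: {{ id }}</u></h2>"

@app.route("/fileset/<int:id>")
def fileset(id):
    return render_template_string(TEMPLATE, id=id)
```

In practice, larger pages would live in separate template files rendered with render_template(), which is where the table and form design comes in.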

As of now, I haven’t encountered any significant technical difficulties. It’s just that there is quite a bit of code that needs to be replaced, so it will take some time.