
Week 12 – Wrap-Up

This week’s work mainly focused on wrapping things up, as the end of the project is approaching and I am striving to refine the details.

Concretely, I accomplished the following tasks:

  1. I thoroughly tested my code following the correct workflow.
  2. In the old version, matching filesets from the command line would repeatedly create the same fileset (manifested in the logs as multiple occurrences of “Updated fileset: xxxx”). After some debugging, I found the problem. As shown in the diagram, the old version would automatically insert a new fileset after all possible matches had failed, which seems reasonable. However, I had already implemented the logic of “creating a fileset first, before matching” in a previous version, and the insertion sat inside the outer loop for matched_fileset_id, matched_count in matched_list:, so each potentially matched fileset produced one more duplicate log entry (see the sketch after this list). The issue was minor but stemmed from my lack of careful consideration.
  3. I added several new features: checkboxes next to each file on the fileset detail page so developers can conveniently delete unnecessary files, sorting for the fileset table on the fileset page, and highlighting of the checksums corresponding to detection types. These were primarily frontend enhancements with no complex logic, so they were completed without difficulty.
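In hindsight, the fix amounts to making sure the fileset is inserted, and its log line emitted, in exactly one place. A minimal sketch of the idea, with a hypothetical insert_new_fileset helper standing in for the real code:

    # Create the new fileset only after every candidate has been considered,
    # not once per candidate inside the loop.
    def match_or_create(fileset, matched_list, log):
        best_id, best_count = None, 0
        for matched_fileset_id, matched_count in matched_list:
            if matched_count > best_count:
                best_id, best_count = matched_fileset_id, matched_count
        if best_id is not None:
            log(f"Updated fileset: {best_id}")
            return best_id
        # Previously this insert (and its log line) could run once per
        # candidate, producing one duplicate log entry per match attempt.
        new_id = insert_new_fileset(fileset)  # hypothetical helper
        log(f"Updated fileset: {new_id}")
        return new_id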

Looking back at my entire project, I have written nearly 3,000 lines of code over 12 weeks.

I am pleased that most of what I’ve accomplished so far has met expectations. However, some improvements are still needed before deployment, such as support for MacBinary and other Mac formats, and user IP identification. My sense of responsibility drives me to keep refining this project even after GSoC ends, until it can truly be put into production. I look forward to that day! 😄


Week 11 – Testing

This week, I mainly focused on testing the system I previously developed to see if there were any overlooked issues. Fortunately, the problems I discovered were minimal. Below are the issues I fixed and the new features I implemented this week:

  1. Marking a fileset as full on the fileset page is straightforward; I just needed to add a button for that.
  2. The set.dat file may contain “sha1” and “crc” checksums, which I had initially forgotten to ignore; as a result, these two checksum types appeared in the fileset.
  3. I added a “ready_for_review” page. I realized I could reuse the file_search page, so I simply created a redirect to this page.
  4. Speaking of the file_search page, I finally fixed the pagination error. The issue arose because I hadn’t considered that the query page could include filter conditions; dropping them caused the results to show the entire table (i.e., as if no query conditions were applied). The error hid in a small detail, but I managed to fix it (a sketch of the idea appears after this list).
  5. I added a user_count based on distinct user IPs. I implemented a simple prototype, but I’m still considering whether there is a better solution, such as creating a new table in the database to store user IP addresses. I’m not sure this is necessary, and I may consult my mentor for advice later.
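For the pagination fix, the essence was to carry the filter conditions into the paginated query instead of dropping them after the first page. A rough sketch, with hypothetical column and parameter names:

    from flask import request

    # Build a paginated query that preserves whatever filter parameters the
    # request carries, so later pages show filtered rows, not the whole table.
    def build_fileset_query(base_query, allowed_filters, page, per_page=25):
        conditions, params = [], []
        for column in allowed_filters:
            value = request.args.get(column)
            if value:
                conditions.append(f"{column} = %s")
                params.append(value)
        if conditions:
            base_query += " WHERE " + " AND ".join(conditions)
        base_query += " LIMIT %s OFFSET %s"
        params += [per_page, (page - 1) * per_page]
        return base_query, params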

Overall, this week has been relatively easy, but I’m still thinking about areas for improvement.


Week 10 – Fixing

This week, my main focus was on filling in gaps and fixing issues. Here are two significant changes I made:

Because the database backend needs to return the counts of matched, missing, and extra files when verifying file integrity for users, I quickly realized that returning only a few count mappings after the database query was insufficient (other functions require the detailed information). As a result, I changed the dictionary type from defaultdict(int) to defaultdict(list).
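A minimal illustration of the change (the keys and records are made up for the example):

    from collections import defaultdict

    # Before: only tallies per category were kept.
    counts = defaultdict(int)
    counts["match"] += 1

    # After: the file records themselves are kept; the tally is still
    # available as the length of each list.
    details = defaultdict(list)
    details["match"].append({"name": "aaa", "size": 123, "md5": "123456"})
    match_count = len(details["match"])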

At the same time, I discovered that I couldn’t retrieve user filesets on the fileset query page, even though the logs showed that several user-status filesets had indeed been inserted into the database. After investigating, I found the root cause: I had not inserted metadata for the game when the user-status filesets were added. After some modifications, I resolved the problem.

In summary, this week’s work was much easier than before; I just needed to test different scenarios and make fixes based on the results.


Week 8 – Application

In addition to continuing to finish some unfinished features, my work this week mainly focused on the final scenario, which involves users uploading local game checksums, meaning we need to deal with network interactions.

Previously, I had been using SSH port forwarding and tunneling requests through SSH to test my Flask application, but I had never actually accessed it through a browser using a domain name. When I was informed that I needed to expose the localhost running on the VM, I initially thought that I could simply start my Flask app with Gunicorn and then access it directly via domain:port. However, I underestimated the situation. The server only had port 80 open for external access, hosting several web pages.

After some investigation, I discovered that the web service was running on Apache2 rather than the Nginx I was familiar with (or the increasingly popular serverless setups), which made things feel a bit tricky at first. After reading some documentation, I decided to use a reverse proxy to mount my Flask application under the corresponding subdomain on the server. (Unlike PHP, Flask does not process requests on a per-file basis, so special configuration is required.)

However, after discussions, I chose a different approach: using mod_wsgi, as it is more compatible with the existing Apache than Gunicorn. After quickly reading through the official documentation and modifying the Apache configuration file, I successfully replaced the original PHP web pages.
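The whole setup boils down to a small WSGI entry-point file plus a couple of Apache directives. The paths and names below are placeholders, not the actual server configuration:

    # app.wsgi — minimal mod_wsgi entry point. mod_wsgi imports this file
    # and calls the module-level callable named "application".
    import sys
    sys.path.insert(0, "/var/www/integrity")

    from app import app as application

    # The matching Apache virtual-host configuration is roughly:
    #   WSGIDaemonProcess integrity python-path=/var/www/integrity
    #   WSGIProcessGroup integrity
    #   WSGIScriptAlias / /var/www/integrity/app.wsgi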


Week 7 – Fixing Details

This week, progress on the new feature work has been limited (though I did manage to complete some new features). On the one hand, there weren’t many features left to implement. On the other hand, I discovered some bugs from previous work this week and focused on fixing them.

The first issue was related to querying and displaying the history table for each fileset. Ideally, if we merge fileset 2 into fileset 23, and then fileset 23 into fileset 123, the history page for fileset 123 should trace back to the filesets it was merged from. Initially, my code simply used SELECT * FROM history WHERE fileset = {id} OR oldfileset = {id}, which only looks one step back and was clearly insufficient.

So, I initially considered a method similar to an unoptimized union-find (a habit from competitive programming 🤣) to quickly query merged filesets. However, I realized this would require an additional table (or file) to store the parent of each node in the union-find structure, which seemed cumbersome. Therefore, I opted to write a straightforward recursive query function, which performed surprisingly well, and this naive method solved the problem.
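A sketch of that recursive function, assuming the history table from above with fileset and oldfileset columns:

    # Follow merge records backwards: for a given fileset, find every
    # fileset that was (transitively) merged into it.
    def collect_merge_history(cursor, fileset_id, seen=None):
        if seen is None:
            seen = set()
        if fileset_id in seen:  # guard against cycles in bad data
            return []
        seen.add(fileset_id)
        cursor.execute(
            "SELECT oldfileset FROM history WHERE fileset = %s", (fileset_id,)
        )
        history = []
        for (old_id,) in cursor.fetchall():
            history.append(old_id)
            history.extend(collect_merge_history(cursor, old_id, seen))
        return history

In the example above, collect_merge_history(cursor, 123) would return [23, 2].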

Another issue arose on the manual match page for each fileset. Although the program always identifies the fileset with the most matches, I found that the number of matched files between two filesets was significantly off during testing. Upon investigation, I discovered that the displayed count was consistently five times the actual number of matches. Where did this factor of five come from? Initially, I was baffled.

After logging several entries, I finally pinpointed the problem. The manual match page retrieves data from the database, while the dat_parser uses data from DAT files. These sources store file data in slightly different formats. For example, the format from a DAT file might look like:

[{"name":"aaa", "size":123, "md5":"123456", "md5-5000":"45678"}, {"name":"bbb", "size":222, "md5":"12345678", "md5-5000":"4567890"}...]

But the format retrieved from the database might look like:

[{"name":"aaa", "size":123, "md5":"123456"}, {"name":"aaa", "size":123, "md5-5000":"45678"}...]

It thus became clear that scan-type filesets stored in the database had five file_checksum entries per file (meaning each file was treated as five separate files when matched), so the counter naturally increased by five for each such file encountered. After reorganizing the data retrieved from the database, I resolved this bug.
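The fix, roughly, is to group the per-checksum rows back into one record per file before counting. A toy illustration using the data above:

    from collections import defaultdict

    # The database returns one row per (file, checksum type); merge rows
    # that describe the same file back into a single record.
    def regroup_files(rows):
        merged = defaultdict(dict)
        for row in rows:
            key = (row["name"], row["size"])  # identifies one file
            merged[key].update(row)           # fold checksum columns in
        return list(merged.values())

    rows = [
        {"name": "aaa", "size": 123, "md5": "123456"},
        {"name": "aaa", "size": 123, "md5-5000": "45678"},
    ]
    print(regroup_files(rows))
    # [{'name': 'aaa', 'size': 123, 'md5': '123456', 'md5-5000': '45678'}]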


Week 3 – Rewrite and Design

The amount of information and work this week has increased compared to the previous two weeks.

At the end of last week, I finished rewriting the original web code in Python. What remains are the new features that still need to be implemented. I asked Sev about the upcoming tasks and now understand the entire workflow.

I have summarized the four workflows in the diagram below.

[Diagram: the four workflows]

I also successfully deployed the test webpage, previously only running locally, to the server. Additionally, the data population now uses data generated from the actual scummvm.dat instead of data I fabricated myself.

I fixed dat_parser, and it can now properly parse strings and insert data into the server.

A lot of features have indeed been implemented, but some bugs remain unfixed (duplicate file detection, database operations being too slow, etc.).

Additionally, here is Sev’s suggestion for the implementation:

“Since we do have entries that have an equal set of files, with the only difference being the language and/or platform, add those to the megakey. So, in case there is a change in those fields, it will be treated as a newly created fileset; and in order to manage it, implement a way for manual merge of filesets, which will lead to the metadata of an old version being overridden by the metadata from the incoming one, preserving history.”
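As a very rough sketch of what folding language and platform into the megakey could look like (the function and field names here are my own, not the actual implementation):

    import hashlib

    # Include language and platform in the megakey, so that a change in
    # either field yields a different key and hence a new fileset.
    def compute_megakey(files, language, platform):
        parts = [language or "", platform or ""]
        for f in sorted(files, key=lambda f: f["name"]):
            parts.append(f'{f["name"]}:{f["size"]}:{f.get("md5", "")}')
        return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()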

I will focus on these issues next week.


Week 2 – Getting Better

This week, my main tasks were parsing the .dat files and fixing errors in the original database functions. After closely examining some .dat files, I found their format very similar to JSON, which I am quite familiar with. Thus, I only needed to perform some bracket-matching operations.

However, when I ported the PHP code to Python, I encountered numerous errors. The issues came from Python’s need for additional boundary checks on strings and from differences in database operations between PyMySQL and PHP. Additionally, the original code contained some unnecessary transaction operations: in those places only cursor.execute() is needed, without a following conn.commit(). After making the necessary fixes, I resolved these issues.
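To give a flavor of the bracket-matching idea, here is a toy recursive parser for a parenthesized key/value format like the one in ClrMamePro-style .dat files; this is an illustration, not the real dat_parser:

    import shlex

    # Read key/value tokens, recursing whenever a "(" opens a nested block.
    def parse_block(tokens, i=0):
        data = {}
        while i < len(tokens):
            tok = tokens[i]
            if tok == ")":
                return data, i + 1
            if i + 1 < len(tokens) and tokens[i + 1] == "(":
                child, i = parse_block(tokens, i + 2)
                data.setdefault(tok, []).append(child)
            else:
                data[tok] = tokens[i + 1]
                i += 2
        return data, i

    text = 'game ( name "aaa" rom ( name intro.bin size 123 md5 123456 ) )'
    parsed, _ = parse_block(shlex.split(text))
    # {'game': [{'name': 'aaa',
    #            'rom': [{'name': 'intro.bin', 'size': '123', 'md5': '123456'}]}]}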

Currently, I have replicated most of the main functionality of the original code. Only a few minor details related to appearance and database exception handling remain, which I plan to address next week.


Week 1 – Start

In the first week, my main task was to replace the original PHP server code with Python code. My plan was to use the Flask library to create a web server.

During the first week, I primarily focused on rewriting all the functions from the original db_functions.php file. This mainly involved MySQL operations and SQL statements, so I used the pymysql library for the rewrite. The commit record is here: https://github.com/InariInDream/scummvm-sites/commits/integrity/
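The rewritten functions follow the usual pymysql pattern; a minimal example with placeholder connection details and a placeholder query:

    import pymysql

    conn = pymysql.connect(
        host="localhost", user="user", password="secret", db="integrity",
        cursorclass=pymysql.cursors.DictCursor,
    )
    with conn.cursor() as cursor:
        # Parameterized queries instead of PHP-style string interpolation.
        cursor.execute("SELECT * FROM fileset WHERE id = %s", (1,))
        row = cursor.fetchone()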

However, the page markup cannot be generated as simply as in PHP, with statements like echo "<h2><u>Fileset: {$id}</u></h2>"; it needs to be rendered through Flask templates. Therefore, my focus for the next week will be on the design of the appearance (tables, forms).
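A minimal Flask counterpart (the route and template names are hypothetical):

    from flask import Flask, render_template

    app = Flask(__name__)

    @app.route("/fileset/<int:id>")
    def fileset(id):
        # fileset.html would contain markup such as:
        #   <h2><u>Fileset: {{ id }}</u></h2>
        return render_template("fileset.html", id=id)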

As of now, I haven’t encountered any significant technical difficulties. It’s just that there is quite a bit of code that needs to be replaced, so it will take some time.