Week 3 – Shivang's GSoC Blog

Welcome to this week’s blog update.

Most of this week went into rewriting the logic for processing set.dat files, as the previous implementation had several inconsistencies. As shown in Figure 1, the earlier logic compared checksums directly to determine matches. However, set.dat files only include full-file checksums, so this comparison only worked when the file was smaller than the checksum size (typically md5-5000, the MD5 of the first 5000 bytes), where the partial and full checksums coincide. As a result, we missed many opportunities to merge filesets that could otherwise be uniquely matched by filename and size alone.

Fig. 1: Previous query used in matching.
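To make the size limitation concrete, here is a minimal sketch (not the project's code) that compares a full-file MD5 with the MD5 of the first 5000 bytes. The helper names md5_full and md5_first_n are made up for illustration, and the sample data is synthetic.

```python
import hashlib


def md5_full(data: bytes) -> str:
    """MD5 over the whole file, as stored in set.dat."""
    return hashlib.md5(data).hexdigest()


def md5_first_n(data: bytes, n: int = 5000) -> str:
    """MD5 over the first n bytes only (the md5-5000 style checksum)."""
    return hashlib.md5(data[:n]).hexdigest()


small = b"a" * 1200    # fits inside the 5000-byte window
large = b"a" * 20000   # extends past the window

print(md5_full(small) == md5_first_n(small))  # True
print(md5_full(large) == md5_first_n(large))  # False
```

For the small file both checksums cover the same bytes, so a direct comparison works. For the large file they cover different ranges, so a direct comparison can never succeed even when the files are identical.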

Since set.dat files only contain entries that are already present in the detection results (with the exception of some rare variants I discovered later in the week), we should expect a one-to-one mapping in most cases. However, some filesets in set.dat can correspond to multiple candidate entries. This happens when the name and size match but the checksum differs, often because of file variants. Such cases have to be resolved through a manual merge.

Previously, the logic for handling different .dat file types was tightly coupled, making it hard to understand and maintain. I started by decoupling the logic for set.dat entirely. Now, the match candidates for a set.dat fileset are filtered by engine name, filename, and file size (if it's not -1), leaving the checksum out of the comparison. A fileset only qualifies as a candidate if all of its detection files (files with the detection flag set to 1) satisfy this condition.
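The filtering rule can be sketched in plain Python as follows. This is only an in-memory illustration, not the actual SQL used in the project: the FileEntry and Fileset classes and the function names are hypothetical, and treating a size of -1 on either side as "unknown" is my assumption here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileEntry:
    name: str
    size: int                # -1 means the size is unknown
    detection: bool = False  # True for detection files (flag set to 1)


@dataclass
class Fileset:
    id: int
    engine: str
    files: list = field(default_factory=list)


def file_matches(det_file, set_files):
    """True if some set.dat file has the same name and a compatible size."""
    for f in set_files:
        if f.name != det_file.name:
            continue
        # Assumption: a size of -1 on either side is skipped, not compared.
        if det_file.size == -1 or f.size == -1 or f.size == det_file.size:
            return True
    return False


def find_candidates(set_fileset, detection_filesets):
    """Keep filesets of the same engine whose detection files all match."""
    candidates = []
    for cand in detection_filesets:
        if cand.engine != set_fileset.engine:
            continue
        det_files = [f for f in cand.files if f.detection]
        if det_files and all(file_matches(f, set_fileset.files) for f in det_files):
            candidates.append(cand)
    return candidates


set_dat = Fileset(0, "sky", [FileEntry("sky.dnr", 40), FileEntry("sky.dsk", 100)])
existing = [
    Fileset(1, "sky", [FileEntry("sky.dnr", 40, True)]),    # name and size match
    Fileset(2, "sky", [FileEntry("sky.dnr", 41, True)]),    # size differs
    Fileset(3, "queen", [FileEntry("sky.dnr", 40, True)]),  # other engine
    Fileset(4, "sky", [FileEntry("sky.dnr", -1, True)]),    # unknown size
]
print([c.id for c in find_candidates(set_dat, existing)])  # [1, 4]
```

Checksums are deliberately absent from this sketch, since the whole point of the new filter is that it does not depend on them.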

Initially, I kept only the candidate fileset with the highest number of matching files, assuming it was the correct one. However, that approach isn't reliable: sometimes the correct match is not the largest group. So all of these candidates need to go through a manual merge. Only when all checksums match across candidates can we be confident in an automatic match.

I also added logic to handle partial or full matches of candidate filesets. This can happen when a set.dat is reuploaded with changes. In such cases, all files are compared: if there’s no difference, the fileset is dropped. If differences exist, the fileset is flagged for manual merge.
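A rough sketch of that comparison is shown below. It assumes each fileset can be represented as a dictionary mapping a filename to a (size, checksum) pair. The function names and the "drop" / "manual_merge" labels are hypothetical and chosen only for illustration.

```python
def diff_filesets(new_files, existing_files):
    """Return {name: (existing, new)} for every file that differs."""
    differences = {}
    for name in sorted(new_files.keys() | existing_files.keys()):
        old = existing_files.get(name)  # None if the file is missing
        new = new_files.get(name)
        if old != new:
            differences[name] = (old, new)
    return differences


def handle_reupload(new_files, existing_files):
    """Drop an identical fileset, otherwise flag it for manual merge."""
    differences = diff_filesets(new_files, existing_files)
    if not differences:
        return "drop", differences
    return "manual_merge", differences


existing = {"sky.dnr": (40, "aaa"), "sky.dsk": (100, "bbb")}
identical = dict(existing)
changed = {"sky.dnr": (40, "aaa"), "sky.dsk": (100, "ccc")}

print(handle_reupload(identical, existing)[0])  # drop
print(handle_reupload(changed, existing)[0])    # manual_merge
```

Returning the differences alongside the decision is a design choice of this sketch, so that whoever performs the manual merge can see exactly which files changed.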

Finally, I handled an issue with Mac files in set.dat. These files aren't correctly represented there: they lack prefixes, and their checksums are computed for the full file rather than for individual forks. Such filesets end up with no candidates after SQL filtering, so they are dropped early at that point.

Other Updates

During seeding, I found some entries with the same megakey, differing only by game name or title. Sev advised treating them as a single fileset. So now, only the first such entry is added, and the rest are logged as warnings with metadata, including links to the conflicting fileset.
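The deduplication could look roughly like the sketch below. The entry fields (id, megakey, gameid, title) and the function name seed_filesets are assumptions made for the example, and the fileset id in the log message stands in for the link mentioned above.

```python
import logging

logger = logging.getLogger("seeding")


def seed_filesets(entries):
    """Add only the first entry per megakey and log the rest as warnings."""
    seen = {}   # megakey -> id of the first fileset that used it
    added = []
    for entry in entries:
        key = entry["megakey"]
        if key in seen:
            logger.warning(
                "Skipping %s (%s): same megakey as fileset %s",
                entry["gameid"], entry["title"], seen[key],
            )
            continue
        seen[key] = entry["id"]
        added.append(entry)
    return added


entries = [
    {"id": 1, "megakey": "k1", "gameid": "sky", "title": "Beneath a Steel Sky"},
    {"id": 2, "megakey": "k1", "gameid": "sky", "title": "Beneath a Steel Sky (alt title)"},
    {"id": 3, "megakey": "k2", "gameid": "queen", "title": "Flight of the Amazon Queen"},
]
print([e["id"] for e in seed_filesets(entries)])  # [1, 3]
```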

Other fixes this week included:

Removing support for m-type checksums entirely (Sev also removed them from the detections).
Dropping sha1 and crc checksums, which mainly came from set.dat.

Next Steps

With the seeding logic refined, the next step is to begin populating the database with individual set.dat entries and confirm everything works as expected.

After that, I’ll start working on fixing the scan.dat functionality. This feature will allow developers to manually scan their game data files and upload the relevant data to the database.
