{"id":28,"date":"2025-06-23T09:45:37","date_gmt":"2025-06-23T09:45:37","guid":{"rendered":"https:\/\/blogs.scummvm.org\/shivangnagta\/?p=28"},"modified":"2025-06-23T09:45:37","modified_gmt":"2025-06-23T09:45:37","slug":"week-3","status":"publish","type":"post","link":"https:\/\/blogs.scummvm.org\/shivangnagta\/2025\/06\/23\/week-3\/","title":{"rendered":"Week 3"},"content":{"rendered":"<p data-start=\"207\" data-end=\"242\">Welcome to this week&#8217;s blog update.<\/p>\n<p data-start=\"244\" data-end=\"773\">Most of this week was spent rewriting the logic for processing <code data-start=\"307\" data-end=\"316\">set.dat<\/code> files, as the previous implementation had several inconsistencies. As shown in <em data-start=\"396\" data-end=\"406\">Figure 1<\/em>, the earlier logic directly compared checksums to determine matches. However, this approach only worked when the file size was smaller than the checksum size (typically <code data-start=\"576\" data-end=\"586\">md5-5000<\/code>), since <code data-start=\"595\" data-end=\"604\">set.dat<\/code> files only include full file checksums. 
This caused us to miss many opportunities to merge filesets that could otherwise be uniquely matched by filename and size alone.<\/p>\n<p data-start=\"775\" data-end=\"1224\">\n<figure id=\"attachment_29\" aria-describedby=\"caption-attachment-29\" style=\"width: 864px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-29\" src=\"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-content\/uploads\/sites\/81\/2025\/06\/Screenshot-2025-06-23-at-3.23.50\u202fAM.png\" alt=\"\" width=\"864\" height=\"276\" srcset=\"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-content\/uploads\/sites\/81\/2025\/06\/Screenshot-2025-06-23-at-3.23.50\u202fAM.png 864w, https:\/\/blogs.scummvm.org\/shivangnagta\/wp-content\/uploads\/sites\/81\/2025\/06\/Screenshot-2025-06-23-at-3.23.50\u202fAM-300x96.png 300w, https:\/\/blogs.scummvm.org\/shivangnagta\/wp-content\/uploads\/sites\/81\/2025\/06\/Screenshot-2025-06-23-at-3.23.50\u202fAM-768x245.png 768w\" sizes=\"auto, (max-width: 864px) 100vw, 864px\" \/><figcaption id=\"caption-attachment-29\" class=\"wp-caption-text\">Fig. 1: Previous query used in matching.<\/figcaption><\/figure>\n<p>Since <code data-start=\"781\" data-end=\"790\">set.dat<\/code> files only contain entries that are already present in the detection results (with the exception of some rare variants I discovered later in the week), we should expect a one-to-one mapping in most cases. However, some filesets in <code data-start=\"1032\" data-end=\"1041\">set.dat<\/code> can correspond to multiple candidate entries. This happens when the name and size match but the checksum differs, often due to file variants. This case needs to be handled with a manual merge.<\/p>\n<p>Previously, the logic for handling different <code data-start=\"1271\" data-end=\"1277\">.dat<\/code> file types was tightly coupled, making it hard to understand and maintain. 
I started by decoupling the logic for <code data-start=\"1391\" data-end=\"1400\">set.dat<\/code> entirely. Now, the match candidates for <code data-start=\"1421\" data-end=\"1430\">set.dat<\/code> filesets are filtered by engine name, filename, and file size (if it&#8217;s not <code data-start=\"1551\" data-end=\"1555\">-1<\/code>), leaving the checksum out. The query ensures that all detection files (files with the detection flag set to 1) satisfy these conditions.<\/p>\n<p>&nbsp;<\/p>\n<p data-start=\"1559\" data-end=\"1945\">Initially, I kept only the candidate fileset with the highest number of matches, assuming it was the correct one. However, that approach isn&#8217;t reliable: sometimes the correct match is not the largest group. So all of these candidates are now sent for manual merge. Only when all checksums match across candidates can we be confident in an automatic match.<\/p>\n<p data-start=\"1947\" data-end=\"2235\">I also added logic to handle partial or full matches of candidate filesets. This can happen when a <code data-start=\"2046\" data-end=\"2055\">set.dat<\/code> is reuploaded with changes. In such cases, all files are compared: if there&#8217;s no difference, the fileset is dropped. If differences exist, the fileset is flagged for manual merge.<\/p>\n<p data-start=\"2237\" data-end=\"2515\">Finally, I handled an issue with Mac files in <code data-start=\"2287\" data-end=\"2296\">set.dat<\/code>. These files aren&#8217;t correctly represented there: they lack prefixes and have checksums computed for the full file rather than for individual forks. So these filesets are dropped early: if SQL filtering returns no candidates for a fileset, it is discarded.<\/p>\n<h5 data-start=\"2522\" data-end=\"2539\">Other Updates<\/h5>\n<p data-start=\"2541\" data-end=\"2850\">During seeding, I found some entries with the same megakey, differing only by game name or title. Sev advised treating them as a single fileset. 
So now, only the first such entry is added, and the rest are logged as warnings with metadata, including links to the conflicting fileset.<\/p>\n<p data-start=\"2852\" data-end=\"2883\">Other fixes this week included:<\/p>\n<ul data-start=\"2884\" data-end=\"3060\">\n<li data-start=\"2884\" data-end=\"2983\">\n<p data-start=\"2886\" data-end=\"2983\">Removing support for <code data-start=\"2909\" data-end=\"2912\">m<\/code>-type checksums entirely (Sev also removed them from the detections).<\/p>\n<\/li>\n<li data-start=\"2984\" data-end=\"3060\">\n<p data-start=\"2986\" data-end=\"3060\">Dropping <code data-start=\"2997\" data-end=\"3003\">sha1<\/code> and <code data-start=\"3008\" data-end=\"3013\">crc<\/code> checksums, which mainly came from <code data-start=\"3050\" data-end=\"3059\">set.dat<\/code>.<\/p>\n<\/li>\n<\/ul>\n<h5 data-start=\"3067\" data-end=\"3081\">Next Steps<\/h5>\n<p data-start=\"3083\" data-end=\"3240\">With the seeding logic refined, the next step is to begin populating the database with individual <code data-start=\"3181\" data-end=\"3190\">set.dat<\/code> entries and confirm everything works as expected.<\/p>\n<p data-start=\"3242\" data-end=\"3425\">After that, I\u2019ll start working on fixing the <code data-start=\"3280\" data-end=\"3290\">scan.dat<\/code> functionality. This feature will allow developers to manually scan their game data files and upload the relevant data to the database.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to this week&#8217;s blog update. Most of this week was spent rewriting the logic for processing set.dat files, as the previous implementation had several inconsistencies. As shown in Figure 1, the earlier logic directly compared checksums to determine matches. 
However, this approach only worked when the file size was smaller than the checksum size [&hellip;]<\/p>\n","protected":false},"author":29,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-28","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-json\/wp\/v2\/posts\/28","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-json\/wp\/v2\/users\/29"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-json\/wp\/v2\/comments?post=28"}],"version-history":[{"count":1,"href":"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-json\/wp\/v2\/posts\/28\/revisions"}],"predecessor-version":[{"id":30,"href":"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-json\/wp\/v2\/posts\/28\/revisions\/30"}],"wp:attachment":[{"href":"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-json\/wp\/v2\/media?parent=28"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-json\/wp\/v2\/categories?post=28"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.scummvm.org\/shivangnagta\/wp-json\/wp\/v2\/tags?post=28"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}