Short-circuiting

I’ve just finished adding short-circuit detection to the decompiler. This detection is done as part of the initial group creation.

The algorithm for detecting short-circuiting is as follows: Two groups, A and B, that follow sequentially in the byte code (no instructions appear between the last instruction of A and the first instruction of B), are merged if:

  • Both groups A and B end with a conditional jump (each has two possible successors)
  • For each outgoing edge from A, the edge must either go to B, or to a successor of B – in other words, merging may not add any new successors. More formally, succ(A) ⊆ {B} ∪ succ(B), where succ(X) is the set of possible successors of X.

To make sure the merging happens correctly when several groups should be merged, the groups are iterated in reverse order (the group with the last instruction is processed first).

Here’s an example to show how this works. The number in parenthesis is the effect that instruction has on the stack, and for ease of reading, the jump addresses have been made absolute.

0x00: wordVarDec 123 (0)
0x03: pushWord 41 (1)
0x06: pushWordVar 321 (1)
0x09: eq (-1)
0x0a: jumpTrue 0x17 (-1)
0x0d: pushWord 42 (1)
0x10: pushWordVar 123 (1)
0x13: eq (-1)
0x14: jumpTrue 0x00 (-1)
0x17: stopObjectCodeA (0)

Normally, there would be 4 groups in this script: the first and last instructions would be groups on their own, and instructions 0x03 through 0x0a would be a group, as would 0x0d through 0x14. For simplicity, we’ll number these groups 1 through 4.

0x00: wordVarDec 123 (0)
===
0x03: pushWord 41 (1)
0x06: pushWordVar 321 (1)
0x09: eq (-1)
0x0a: jumpTrue 0x17 (-1)
===
0x0d: pushWord 42 (1)
0x10: pushWordVar 123 (1)
0x13: eq (-1)
0x14: jumpTrue 0x00 (-1)
===
0x17: stopObjectCodeA (0)

The possible successors for group 3 are groups 4 and 1, while the possible successors for group 2 are groups 3 and 4. Since group 4 is a shared successor, and the group 2 → group 3 edge becomes irrelevant after the merge, the algorithm detects groups 2 and 3 as a short-circuited condition check and merges them into a single group.
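To make the merge test concrete, here is a minimal Python sketch of the rule above (the decompiler itself is written in C++; the group representation and all names here are my own illustration, not the actual code):

```python
def merge_short_circuits(groups):
    """Merge short-circuited condition checks.

    groups: list of dicts with keys 'name', 'cond' (True if the group
    ends with a conditional jump) and 'succ' (set of successor names).
    """
    result = list(groups)
    # Iterate in reverse, so the group with the last instruction is
    # considered first; this lets chains of checks collapse correctly.
    i = len(result) - 2
    while i >= 0:
        a, b = result[i], result[i + 1]
        # Merge only if both end in a conditional jump and merging adds
        # no new successors: succ(A) ⊆ {B} ∪ succ(B).
        if a['cond'] and b['cond'] and a['succ'] <= ({b['name']} | b['succ']):
            # The merged group keeps B's successors; the A-to-B edge vanishes.
            result[i:i + 2] = [{'name': a['name'], 'cond': True,
                                'succ': set(b['succ'])}]
        i -= 1
    return result

# The four groups from the example script (1: 0x00, 2: 0x03-0x0a,
# 3: 0x0d-0x14, 4: 0x17):
groups = [
    {'name': 1, 'cond': False, 'succ': {2}},
    {'name': 2, 'cond': True,  'succ': {3, 4}},
    {'name': 3, 'cond': True,  'succ': {1, 4}},
    {'name': 4, 'cond': False, 'succ': set()},
]
merged = merge_short_circuits(groups)  # groups 2 and 3 collapse into one
```

Running this on the example leaves three groups: the merged condition check keeps group 3’s successors (groups 1 and 4), exactly as described above.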

I’m back!

New computer is assembled and seems to be running perfectly. It even seems a bit quieter than the previous one, but I haven’t had the GPU actually do real work yet, so maybe that’ll change things. 🙂

Hopefully, I can make up for the lost time once I’m done with my last exam on Wednesday.

Unexpected downtime…

As luck would have it, the CPU fan on my desktop machine has stopped working properly. That’s not good, because that’s the machine I do most of my development on.

I still have my laptop, and I should be able to get some work done here. Unfortunately, it’s not as comfortable to work with as my desktop, and that means there’s a good chance it’ll harm my productivity and have an impact on the schedule.

I was planning to replace most of the parts anyway, so I’m currently considering whether I should do that now, or wait and just buy a new fan. It’s not that a new fan is expensive, but if I’ll be replacing the CPU in a month or two anyway, it might make more sense to just replace everything now.

I’ll figure out what to do over these next few days – if I replace everything, then it’ll probably be about 2 weeks before I get it back up and running, or if I just replace the fan, it should be ready some time during the upcoming week.

Right now, I’m going to focus on my next exam on Tuesday, and at the same time, I’ll try to figure out what I’m going to do about this as well. After that, I can get back to work on the code flow analysis.

Initial code flow

Finally, I’ve been able to start work on the code flow graph.

So far, very little is prepared, but I can now make the most basic graph, with each instruction as its own vertex:
Initial code flow graph for samnmax/script-33.dmp

Additionally, jumps are also handled:
Initial code flow graph for samnmax/room-17-209.dmp

The next step is to combine some of these instructions into groups of instructions, to reduce the number of vertices in the graph.
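As an illustration, here is a rough Python sketch of how such a one-vertex-per-instruction graph with jump edges could be built (the tuple representation and all names are my own, not the decompiler’s):

```python
def build_flow_graph(instructions):
    """Build edges for a graph with one vertex per instruction.

    instructions: list of (addr, name, jump_target) tuples, where
    jump_target is None for non-jump instructions. Returns a dict
    mapping each address to its set of successor addresses.
    """
    edges = {addr: set() for addr, _, _ in instructions}
    for i, (addr, _, target) in enumerate(instructions):
        if target is not None:
            edges[addr].add(target)  # edge for the jump itself
        # Fall-through edge to the next instruction. (Assumption: all
        # jumps here are conditional, like jumpTrue, so they fall
        # through too; an unconditional jump would need special-casing.)
        if i + 1 < len(instructions):
            edges[addr].add(instructions[i + 1][0])
    return edges

# The example script shown earlier:
script = [
    (0x00, 'wordVarDec', None),
    (0x03, 'pushWord', None),
    (0x06, 'pushWordVar', None),
    (0x09, 'eq', None),
    (0x0A, 'jumpTrue', 0x17),
    (0x0D, 'pushWord', None),
    (0x10, 'pushWordVar', None),
    (0x13, 'eq', None),
    (0x14, 'jumpTrue', 0x00),
    (0x17, 'stopObjectCodeA', None),
]
edges = build_flow_graph(script)
```

Each conditional jump ends up with two outgoing edges (jump target and fall-through), and the final instruction has none.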

Grouping completed!

Well, that didn’t take long.

Vertices in the code flow graph are now grouped according to these rules:

  • Only consecutive instructions may be grouped.
  • If there is a jump, it must be the last instruction in the group.
  • If there is a jump to an instruction, that instruction must be the first instruction in the group.
  • Once the stack becomes balanced, the group ends with the instruction that balanced the stack.
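The rules above could be sketched in Python roughly like this (again, the names and data layout are my own illustration; the real decompiler is C++). The treatment of a balanced stack is a simplification: I simply end the group whenever the running stack effect returns to zero.

```python
def group_instructions(instructions):
    """Split instructions into groups per the four rules.

    instructions: list of (addr, name, stack_effect, jump_target)
    tuples, jump_target being None for non-jumps. Returns a list of
    groups, each a list of (addr, name) pairs.
    """
    jump_targets = {t for _, _, _, t in instructions if t is not None}
    groups, current, balance = [], [], 0
    for addr, name, effect, target in instructions:
        # Rule 3: a jump target must start a new group.
        if addr in jump_targets and current:
            groups.append(current)
            current, balance = [], 0
        current.append((addr, name))
        balance += effect
        # Rule 2: a jump ends its group; rule 4: so does a balanced stack.
        if target is not None or balance == 0:
            groups.append(current)
            current, balance = [], 0
    if current:
        groups.append(current)
    return groups

# The example script, with stack effects:
script = [
    (0x00, 'wordVarDec', 0, None),
    (0x03, 'pushWord', 1, None),
    (0x06, 'pushWordVar', 1, None),
    (0x09, 'eq', -1, None),
    (0x0A, 'jumpTrue', -1, 0x17),
    (0x0D, 'pushWord', 1, None),
    (0x10, 'pushWordVar', 1, None),
    (0x13, 'eq', -1, None),
    (0x14, 'jumpTrue', -1, 0x00),
    (0x17, 'stopObjectCodeA', 0, None),
]
grouped = group_instructions(script)
```

On the example script this yields the four groups described in the short-circuiting post, starting at 0x00, 0x03, 0x0d and 0x17.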

Here are the scripts from the last post, but now with grouping:

Grouped code flow graph for samnmax/script-33.dmp
Grouped code flow graph for samnmax/room-17-209.dmp

And finally, as an example of a really big graph:

Grouped code flow graph for samnmax/room-9-202.dmp

This concludes the third milestone. The next step is to analyze the graph to detect loops and conditionals.

First disassembler done!

The SCUMM disassembler is now complete, which finishes the second milestone.

As mentioned in my previous post, the next step is to write documentation on this, and then it’s on to the code flow graph.

One week in…

That concludes the first week of the project.

I finished the first milestone a few days ahead of schedule, and got started on the next one, the SCUMMv6 disassembler. Since the opcode documentation on the ScummVM wiki isn’t really complete, that requires me to jump back and forth between files and make sure I properly understand the code, so it doesn’t go as quickly as it probably could – however, I should still have plenty of time to finish the disassembler by the milestone deadline, June 8.

Once the disassembler’s done, I’ll write some documentation on creating a disassembler, while I’m still focused on that part of the decompiler.

And so it begins…

May 24th just passed – and so, it is now that coding begins.

In order to help me keep track of all the things I need to do, I have set up a local installation of Trac, where I add the individual subtasks and assign them to milestones.

This is a very useful tool on its own, but it would certainly be more useful if I could show my progress directly to the world. Because of this, I wrote a small tool which periodically extracts my task list and uploads it online. This way, everyone can keep up with exactly what’s happening.

The list is updated every 15 minutes, as long as my primary computer is switched on. You can see in the title bar when the list was last generated.

The task list is available at http://blog.birdiesoft.dk/tracreport/. I have also added a link in the sidebar.

Project outline

I figure it’s about time that I post a bit more about what I’ll be doing.

As I mentioned in my first post, my task is to create a generic script decompiler. That’s a bit vague, though – it can’t magically know how to decompile everything, and there’s no way I have time enough to teach it everything.

Instead, I’ll make it work with two different engines: SCUMMv6 and KYRA. For SCUMMv6, I own Sam & Max, and for KYRA, I have a demo of Hand of Fate, so both of these will come in handy when testing the decompiler.

Why those two engines? Well, apart from being pretty well understood, my mentors are also familiar with these engines. Although I could choose engines that neither of them know anything about, that would be a pretty foolish idea, since it’d be harder for me to get help if (well, most likely when) I need it.

The reason for taking two engines is simple: if I only did one engine, it’d be hard to show that I actually took a generic approach. It’s the same problem if I just did two different versions of an engine – it might need to be a bit more generic in order to allow for both versions, but it doesn’t say anything about whether the approach will easily extend to other engines.

So now that we know what the end goal is, let’s talk about how we’re going to get there. I’ve prepared a schedule, and here are the primary points on it:

  • June 1: Disassembly framework
  • June 8: First disassembler (SCUMMv6)
  • June 21: Generation of code flow graph
  • June 28: Code flow analysis
  • July 24: Code generation (SCUMMv6)
  • July 29: Second disassembler (KYRA)
  • August 6: Code generation (KYRA)

Of course, these dates are just estimates, and there are other subtasks on the schedule – I’ve just omitted them to create this overview.

So there you have it – a brief overview of my project. In later posts, I’m going to cover these steps in more detail.

Hello, world!

Hi! I’m Michael Madsen, I’m studying computer science at Aalborg University in Denmark, and I’ve been accepted as a GSoC student to work on ScummVM.

My task is to create a generic script decompiler – one that allows support for new instruction sets to be added without too much work.

Readers familiar with ScummVM GSoC projects might think this project sounds familiar – and that’s probably because it’s already been attempted twice, in 2007 and 2009.

So why would my attempt fare better than the previous ones, apart from the old proverb “Third time’s the charm”? Well, there is of course no guarantee that it will – but I believe I have a skill set which is suitable for this task. I enjoy working with low-level stuff, I have good logic skills, and I have some experience with writing compilers from the semester project we did at university a year ago, which should be useful when writing a decompiler.

Over this next month, I will be preparing for the coding process: studying documentation and the instruction sets I’ll be working with, looking for things that stand out as potential sources of problems, otherwise getting familiar with the engines, and maybe preparing a few tools which I expect will help me.
I’ve already prepared (very) rough sketches of the algorithms I need to use, and I will spend some time refining them to get them ready to implement.

I don’t really have much more to say, so if you have questions, you can e-mail me, leave a comment, or catch me on IRC (where I go by the name of Pidgeot). I look forward to working on the project, and I hope you will like the outcome.