GSOC ’23: Final Report

So originally, the goal for the project was to optimize the pixel blending code in the renderer of the AGS game engine in ScummVM. The problem was that I completed that goal about halfway through the coding period. So my mentors and I talked, and what I did afterwards was optimize the rendering code that most other engines in ScummVM use. I used SIMD CPU extensions to net a pretty huge performance gain.

Basically, the AGS renderer got a 5x improvement across the board and a 14x improvement in the best scenarios. The global rendering code shared by all engines got a 2x improvement across the board. Here are the speed-up results.

The most challenging part is knowing where to start. First, you must get to know your mentors really well, e.g. through calls, messaging, etc. If you don't, you'll be left alone, not knowing what to do. And second, if a coding project seems big, you should take three steps.
One: Figure out where you are and the actionable steps you can take to get where you want to be. What are the big milestones you have to hit along the way? Do you need to complete something else first to efficiently implement another feature? Should you write tests first? Etc.
Two: Get the bare minimum code written. I know that sounds funny, but you should just get stuff working to start. This gives you plenty of time to look at your code and move on to step three.
Three: Make your code the best code anyone's seen. Now that you have 90% of the code written, you can optimize it, make it cleaner, and tie up any loose ends like updating the tests, making a PR, etc.

Once again, I'd like to thank Google Summer of Code 2023 and ScummVM for the opportunity to work on a project like this and learn so much. And I'd like to thank my mentors for helping me when I was stuck and teaching me how to work in a team.


So this is the end?

So, I've updated the AGS blending code to use the new SIMD CPU feature detector and added AVX2 support for AGS. All I have left is to get my PR merged, tie up some loose ends, and submit my final submission for GSoC '23. I've had a great time coding with my mentors, @criezy, @lephilousophe, @ccawley2011, @somean, and @sev. They have been a great help and very responsive, and it's been such a blast this past summer coding with them! I'm at college now and have plenty of work ahead of me, so bye!


Optimizing Graphics::ManagedSurface/TransparentSurface

Last post I talked about how we wanted to continue with what I was doing for the AGS blending code and apply it to the TransparentSurface code (and then eventually move that functionality into ManagedSurface). Well, it seems like I've done it! Here's an overview of what I did: first, I didn't optimize anything, but started by refactoring the code and making a ManagedSurface version of TransparentSurface's blit function. Only after I had refactored everything to a standard I thought was pretty good did I start to optimize. After that, I did another round of refactoring, listened to my mentors' critiques of my code, and made the code as clean and fast as I could get it. Not much to show here; everything looks the same as it did before (hopefully lol). There was a cool bug where sprites were about twice the size they should be, but I didn't get a picture of it. Anyways, this might be my second or third to last blog post, so it's getting to the end here.


New Game Plus

Well guys, I already got optimizations done for x86_64 and ARM in the Adventure Game Studio engine, which is almost everything I wanted done with it anyways. All I have left is optimizing it for PowerPC (and possibly other architectures, but I can't think of any others). Anyways, my mentors and I thought it would be better to also work on the general Graphics::ManagedSurface code and optimize that instead of optimizing only AGS code. So my plan looks as follows:

  • Clean up the Graphics::TransparentSurface code by moving its blitting functions into graphics/blit-alpha.cpp.
  • Then add the relevant blitting methods into Graphics::ManagedSurface, and then…
  • Phase out Graphics::TransparentSurface, first by removing it from the Broken Sword 2.5 engine, and then possibly removing it from the other engines.
  • Then work out a way to put CPU extension detection into ScummVM, so I can use SSE2/SSE4/AVX depending on what's available (or NEON if the ARMv7 target supports it).
  • Actually implement the vectorized versions of the blitting/blending code for the new Graphics::ManagedSurface.


And when I finally get back to my AGS code, I will just have to make sure that it compiles on all of ScummVM's targets (even if I didn't specifically optimize for them, you know, just in case), and try to get PowerPC to work (it is proving quite difficult, as I don't have any PowerPC hardware, except for my Xbox 360 I guess).


Intel and AMD!

I finally ported my vectorized code over to Intel and AMD chips! And with time to spare, because my midterm evaluations are coming up (I'd like to thank my mentors; they have helped me so much, and I wouldn't have done all of this this quickly without their help). So yea, where to go from here? My current plans are to port the code to PowerPC's AltiVec extensions and make an AVX2 version of the SSE2 code I made for x86_64 processors. Other than that, here are some pictures I took of weird bugs while porting the ARM NEON code to x86 SSE2, with some comments (these pictures are dearly needed; this blog has been quite boring without pictures).

I think this was the first picture I took. Here is what the game "Kings Quest 2: AGDI" looks like with only 32-bit pixel graphics blitting (I hadn't implemented 16-bit pixel formats yet here).
Same build as the one above. As you can see by the water on the shore, I got alpha blending working correctly, but there is some off-by-one error at the right of the screen where it overdraws a pixel or two.
Yea, so when I finally did get 16-bit blitting/blending working, I noticed that scaled images were being messed up a lot and just looked completely borked.
This is probably the worst looking picture of them all. It's got the nasty off-by-one error, and the main character looks like something is not right…

Now don't worry, I fixed all the bugs. In fact, you'll be able to tell that it's fine once my PR (#5144) gets accepted. Hopefully it makes your games run a bit faster (even if you don't have vector extensions on your computer).


How Testing is Going! (And How to Easily Vectorize Any Algorithm)

Good morning! This post builds off of my last one here about the test code I'm writing. To actually test these functions, I moved the cobbled-together benchmarking code from some random function in the engine init code into a function already found in the AGS codebase: Test_Gfx. From there, I needed to implement the testing functions for the blender modes and for the actual drawing functions themselves, to make sure they work in pretty much the same fashion. So far I only have the blender mode unit test working; it is pretty much just a loop over every parameter of the blending function that asserts that the original function and the current ARM NEON one produce the same results. But right now I'm still ironing out a LOT of small edge cases in the blender function that don't really show up in real usage, but should still match.

Oh yea, I also thought it would be nice to let people know that using SIMD intrinsics is not as hard as it sounds. I'll speak for myself, since there are a lot of people reading this who won't think it's a difficult task to vectorize a function, but I sure did when I started this GSoC job. So here is a pretty simple way of making things SIMDized.

  1. Load your data into a SIMD data structure.
    For example, ARM NEON has a helpful function: vld1q_u32 (the SSE equivalent is _mm_lddqu_si128, which looks a little scarier).
    This takes a pointer to data and loads it into a uint32x4_t structure, which is pretty much just an array of 4 uint32_t's. Sometimes it's not this easy though, and you have to load elements in one at a time, serially. Storing is pretty much the same, but in the opposite direction.
  2. Just translate normal serial operations into SIMD ones.
    + becomes vaddq_u32
    * becomes vmulq_u32
  3. I've only told half the story though. While most of porting serial code to SIMD is this simple, you will have to mess with the order and structure of your vectors (uint32x4_t's and __m128i's) to make them work nicely. You may want to, for example, take a uint32x4_t of ARGB pixels and transform it into 4 uint32x4_t's holding the alpha, red, green, and blue components of each pixel to make the math operations easier; this is where the other functions that convert and swizzle vectors come in handy.

Always Test

Hi! I’m posting right before my week long vacation. Today I just wanted to show you guys some things that I did wrong this and last week.


So there are multiple pixel blending modes in AGS, right? One for alpha blending, one for opaque, one for RGB, one for ARGB, etc. Well, I thought the games I was testing would cover most if not all of the cases, but no, I was wrong. This week I figured out that my ARGB blending function and tint blending function were not working correctly. The ARGB one was easy to fix (it was just a typo), but the tint blender is proving much more difficult to debug. See you next time, when I try to fix that bug and hopefully move on to other tasks.


How I’ve optimized Adventure Game Studio (so far)

Hi! So far I've been doing pretty well with the optimizations (you can see the current results here), and I'd like to take you through my journey of making BITMAP::draw and BITMAP::stretchedDraw faster!


The drawing functions have a lot to deal with. They handle different source and destination pixel formats, clipping, blending, and color keying. So as you saw earlier, I took a look in a profiler to see which part was the worst culprit. Overall the worst culprit was converting between pixel formats, followed by blending. Before I started on those though, I wanted to eke out some performance gains elsewhere.

The Inner Loop

In the inner drawing loop (the loop that plots all the pixels row by row, column by column), the code first checks whether it's out of bounds of the screen/bitmap. I knew this could easily be avoided if the areas were clipped off at the start, so I moved that check to the start and got about a 5% performance boost (getting rid of branches helps a lot).

Compile Time

First, I wanted to move as much as I could to compile time. To do that, I put the inner drawing loop into its own function and then made variations that worked on different combinations of pixel formats. Understanding what data you're working with helps a lot when optimizing, and thankfully AGS almost exclusively uses 32-to-32 and 16-to-16 bits-per-pixel blits. Using that, I created versions of the inner loop for those cases.

Brute Force

After splitting those pixel format combinations into their own functions, I next wanted to leverage brute force to make the functions faster. Why plot one pixel at a time when you can do 4, 8, or even 16? At this step I created ARM NEON versions of the blending functions and the drawing loop, and that got me most of the performance gains. But I do have a few other things I want to point out before I end this post.

Smaller Performance Gains

So when loop-unrolling the drawing loop through SIMD functions, I had to do something about the leftover pixels at the end of a row… The extra overhead of normal pixel-by-pixel plotting at the end of a row was a big headache, so I had to think up a way to fix it. My solution was to just plot past the end of the row. In a normal scenario this would obviously mess up the bitmap, but what I did was create a mask that makes source pixels past the end of the row the same as the destination, effectively not drawing over the end of the row. This still wouldn't work on the last row, because I would run the risk of writing past the end of the pixel buffer into unknown memory, so I still had to use the pixel-by-pixel method on the last few pixels in the last row of a bitmap. But overall, it sped up rendering about 2 times.

What Next?

Next, I know I need to clean up my code with comments and try adding some micro-optimizations. Then I want to start porting my code to x86 processors. At that point I will have completed what I set out to do, but I plan on trying to port this blitting code to the general ScummVM blitting routines.


Messy Code. But It's Faster!

I got the drawing function 4.2x faster! Although, it only works if both the screen and the bitmap are ARGB8888. So what I'm going to focus on is making sure the engine keeps bitmaps in the same format as the screen, so I can make the code cleaner and make the optimizations more generic so they work for all formats.


Initial Results…

Before I start on any optimizations, I first have to measure how fast (or slow) the current code is.

First, I ran the code through callgrind and took a look at it in kcachegrind. What my mentors remembered about the bottlenecks was correct: the main bottlenecks in drawing were converting the pixels to ARGB, and blending the pixels themselves.

So now, let's measure the performance of this code. I wrote a pretty simple benchmark that measures the amount of time it takes for X number of BITMAP::blit calls to run. Here are the control group results:

So yea, that's where I'm starting from, and I plan on making it faster.