OpenCL 2.0 Review

This is going to be a review in the spirit of G-Truc's excellent OpenGL reviews. Without further ado, let's dive into the details of the new OpenCL specification. We'll take the three parts one-by-one, starting with the platform and runtime.

Platform & Runtime

The platform has several new features:

  • Extended image support
  • Shared memory
  • Pipes
  • Android Driver

Let's get started with the image support. Image support in OpenCL used to be very basic at best, with lots of limitations, for instance not being able to read and write the same image. This has been mostly addressed with OpenCL 2.0, which adds read_write as a modifier for images, provides mip-mapped images as well as images with more than one sample per pixel, and, most importantly, 3D image reads and writes. Additionally, sRGB support has been added. Still missing are writes to multi-sampled images, and mip-mapped images are only an optional feature. Interestingly, mip-mapped reads are supported for 3D textures as well, where they are really useful and hard (read: slow) to emulate otherwise.
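
To make this concrete, here is a minimal, hypothetical OpenCL 2.0 kernel using the new read_write qualifier (the kernel and argument names are made up); each work-item reads and writes only its own texel, so no extra fencing is needed:

// Read and write the same image object within one kernel (OpenCL 2.0).
kernel void brighten_inplace (read_write image2d_t img)
{
    const int2 pos = (int2)(get_global_id (0), get_global_id (1));
    const float4 value = read_imagef (img, pos);
    write_imagef (img, pos, value * 1.1f);
}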

OpenCL 2.0 also adds sharing of depth images and multi-sampled textures with OpenGL, similar to the OpenCL 1.2 extension. Unfortunately, the Direct3D11 sharing has not been updated accordingly. This is weird, as all that is required is to update the list of supported DXGI formats to add the depth formats and copy & paste the text from the OpenGL extension.

Shared virtual memory is an interesting feature which will most likely require new hardware. What it adds is a single address space which is shared between the host and device. Basically, you allocate a block of memory which can be read and written by both the host and the device. There are two possible implementations:

  • Coarse-grained sharing: Requires you to map and unmap the buffer for updates. Not really that interesting; all that you gain in this mode is that the pointer values are the same (if you store pointers inside the buffer, they will work on both the host and the device.) This can likely be implemented on current hardware by adjusting all pointer accesses into a shared memory region on the device.
  • Fine-grained sharing: No need for mapping/unmapping, things just work. With atomics, it's even possible to update parts of the data while a kernel is updating other parts. This is actually real shared memory.

I expect fine-grained sharing to appear on integrated GPUs first (AMD APUs, Intel's HD Graphics.) Discrete GPUs will eventually follow; AMD should be able to do it already, while NVIDIA will follow with Maxwell. This is one of the great new features of OpenCL 2.0 and will dramatically simplify tasks which require complicated data structures (trees, etc.) For instance, right now I have to flatten trees to node-relative indices to use them on the GPU, a step which can be skipped entirely with the shared memory features.
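
As a rough sketch of how this could look on the host side (assuming an existing context, queue and kernel; error handling omitted), coarse-grained SVM in OpenCL 2.0 boils down to the following:

// A pointer-based structure that would normally have to be flattened.
struct Node { Node* left; Node* right; int value; };

// Allocate memory that is addressable from both host and device.
void* block = clSVMAlloc (context, CL_MEM_READ_WRITE,
    1024 * sizeof (Node), 0 /* default alignment */);

// Coarse-grained sharing: map before touching the data on the host ...
clEnqueueSVMMap (queue, CL_TRUE, CL_MAP_WRITE, block,
    1024 * sizeof (Node), 0, nullptr, nullptr);
// ... build the tree in place using real pointers, then unmap again.
clEnqueueSVMUnmap (queue, block, 0, nullptr, nullptr);

// The kernel sees the same addresses; no index flattening required.
clSetKernelArgSVMPointer (kernel, 0, block);

With fine-grained sharing, the map and unmap calls simply go away.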

Next up are pipes, which are incredibly exciting as well, as you can use them to transfer data between multiple kernel invocations. Together with the ability to enqueue kernels from the device, this is going to be awesome. It will allow real data-flow programming and load-balancing on the GPU, removing one of the biggest bottlenecks in the current model. For instance, we can now write kernels which expand data and run the consumer kernels concurrently on the same device, instead of having to run the expansion kernel first (with low utilization, but high bandwidth usage) and then run the computation kernel next (which has to fetch all data back from memory again.) This is only really useful together with kernel creation on the device, but combined, the possibilities are infinite. A very nice addition!
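
To illustrate (a hypothetical sketch, not taken from the spec): the pipe is created on the host with clCreatePipe and passed to a producer and a consumer kernel, which can then be in flight at the same time:

kernel void producer (write_only pipe float out_pipe, global const float* input)
{
    const float value = input[get_global_id (0)] * 2.0f;
    // write_pipe returns 0 on success and a negative value if the pipe is full.
    write_pipe (out_pipe, &value);
}

kernel void consumer (read_only pipe float in_pipe, global float* output)
{
    float value;
    if (read_pipe (in_pipe, &value) == 0) {
        output[get_global_id (0)] = value;
    }
}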

Finally, there will be a way to load OpenCL implementations on Android systems. For anyone who has used Renderscript, this is a godsend :)

Language

On the language side, there are four major changes:

  • "Generic" address space
  • C11 atomics
  • Work-group functions
  • The awesome kernel creation and the corresponding language feature, blocks

The generic address space removes the need to overload functions based on address space. I have several functions which take either a local or a global pointer, which currently have to be written twice (or you use ugly macros.) With the generic address space, those functions can be written once and will work with arguments from any address space. A nice and helpful addition.
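
A small hypothetical example of what this buys (function and kernel names are made up): the helper below takes a generic pointer, so it accepts both global and local arguments without duplication:

// In OpenCL 2.0, an unqualified pointer in a non-kernel function is generic.
float sum3 (const float* data)
{
    return data[0] + data[1] + data[2];
}

kernel void use_both (global const float* g, local float* l, global float* out)
{
    // Both calls use the single generic version; in OpenCL 1.x this would
    // need one overload per address space (or a macro).
    out[get_global_id (0)] = sum3 (g) + sum3 (l);
}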

C11 atomics extend the current atomics and are very interesting when used in conjunction with the shared memory feature, as they allow much tighter integration between host and device code. Overall, this feature mostly extends and completes the atomic support and brings it to feature parity with C11/C++11.
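
As a hypothetical sketch of that combination: a counter living in fine-grained SVM can be incremented with the new C11-style atomics, even at all-SVM-devices scope so the host can observe it while the kernel is still running (names are made up):

kernel void count_hits (global const float* values, global atomic_int* counter)
{
    if (values[get_global_id (0)] > 0.5f) {
        // Relaxed ordering is enough for a simple counter.
        atomic_fetch_add_explicit (counter, 1,
            memory_order_relaxed, memory_scope_all_svm_devices);
    }
}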

Work-group functions add parallel primitives across a single work-group. This might not sound like a big deal, but it actually is. For example, let's assume I need the min/max depth of a 2D image tile. Using work-group functions, this is easily expressed as a work-group-wide reduction. The advantage here is that the hardware vendors can provide highly optimized implementations. Additionally, you get access to any/all and broadcast functions, which can be used to optimize kernel execution (if, for example, all work-items take the same path, it might be beneficial to load data into local memory instead of fetching it repeatedly, things like that.)
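
For instance, the tile min/max example could look roughly like this (a hypothetical kernel, one work-group per tile):

kernel void tile_depth_bounds (global const float* depth, global float2* bounds)
{
    const float d = depth[get_global_id (0)];

    // Every work-item gets the work-group-wide result back.
    const float tileMin = work_group_reduce_min (d);
    const float tileMax = work_group_reduce_max (d);

    if (get_local_id (0) == 0) {
        bounds[get_group_id (0)] = (float2)(tileMin, tileMax);
    }
}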

Finally, blocks and device-side kernel creation. Blocks are essentially C lambda functions and should be familiar to users of Apple's Grand Central Dispatch. The syntax is exactly the same and pretty similar to C++ lambdas (except that you cannot specify a capture list; enclosing variables are captured by value.) The important part here is that they can be used to create new kernels and schedule them from within a kernel! With OpenCL 2.0, kernels can dispatch new kernels without having to go through the host. This opens up completely new possibilities, where an algorithm can adapt to the work without having to go back and forth between host and device. What is even more interesting is that with the help of atomics, shared memory and pipes, we can now create producer-consumer queues on the device and control them from the host without ever having to do another OpenCL call after starting the kernels.
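
Here is a minimal, hypothetical sketch of device-side enqueue (it assumes the host created a default on-device queue with CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT): the parent kernel sizes and launches a child kernel, expressed as a block, without returning to the host:

kernel void parent (global const int* workCount, global float* data)
{
    if (get_global_id (0) == 0) {
        // Size the child launch from data computed on the device.
        const int count = workCount[0];

        enqueue_kernel (get_default_queue (),
            CLK_ENQUEUE_FLAGS_NO_WAIT,
            ndrange_1D (count),
            // The child kernel is a block capturing 'data'.
            ^{ data[get_global_id (0)] *= 2.0f; });
    }
}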

Extensions & SPIR

On the extensions side, it looks quite similar to OpenCL 1.2, with most of the interesting extensions not being available on a lot of hardware. Hopefully this will improve with OpenCL 2.0, but I don't see ubiquitous support for OpenGL multi-sampled texture sharing coming to AMD and NVIDIA soon.

SPIR has been updated; as far as I can tell, it has mostly been cleaned up. I'm not sure why this is still not part of the core OpenCL specification, but it is the right way forward and I can't wait for implementations to support it. If done correctly, SPIR will allow you to compile all your kernels using the compiler of your choice and just ship the SPIR code, instead of having to hope that the compiler on your client's machine works correctly.

Summary

Overall, OpenCL 2.0 looks like a viable long-term programming target. It resolves several nasty limitations (image support, generic address space) and provides forward-looking features (shared memory, device kernel creation). What's still missing for me is proper Direct3D11 interop (all formats, read & write) and support for the static C++ language subset that AMD is bringing forward. While not critical, it would make programming quite a bit simpler if I didn't have to write a stack for integers, one for floats, and so on.

OpenCL 2.0 is, however, much more than I would have expected, given that development seemed pretty slow over the last year. Now I can't wait to get my hands on the first OpenCL 2.0 compliant runtime.

If I missed something, please tell me in the comments, and I'll update this blog post. This is my first OpenCL review, so if you have suggestions, please go ahead so I can make those reviews more useful in the future. Thanks!

[Update]: 2014-09-13, added work-group wide functions.

Porting from DirectX11 to OpenGL 4.2: Tales from the trenches

I'm still busy porting my framework from DirectX 11 to OpenGL 4.2. I'm not going into the various reasons why someone might want to port to OpenGL, but rather take a look at my experience doing it.

Software design

In my framework, the render system is a really low-level interface which is just enough to abstract from Direct3D and OpenGL, and not much more. It explicitly deals with vertex buffers, shader programs, and rasterizer states, but does so in an API agnostic manner. Higher-level rendering services like rendering meshes are built on top of this API. For games, it might make sense to do the render layer abstraction slightly higher and deal with rendering of complete objects, but for me, low-level access is important and the abstraction has worked fine so far.

Every time I added a feature to my rendering backend, I double-checked the OpenGL and Direct3D documentation to make sure I only introduced functions that work with both APIs. Typically, I would stub out the method in the OpenGL backend and implement it properly in Direct3D, but remove any parameter which is not supported by the other API.

I'm targeting OpenGL 4.2 and Direct3D 11.0 in my framework. There are three reasons for OpenGL 4.2:

  • It's the first OpenGL version to support shader_image_load_store (the equivalent of UAVs in Direct3D.)
  • It's the latest version supported by both AMD and NVIDIA.
  • It has everything I need in core, so I can avoid using extensions.

Let me elaborate on the last point a bit. Previously, with OpenGL 2 and 3, I used extensions a lot. This turned out not to be a good solution, as I often ran into issues where one extension would be missing, incorrectly implemented, or otherwise problematic. In my new OpenGL backend, extensions are thus nearly completely banned. Currently, I only use three: debug output, anisotropic filtering and compressed texture format support. The debug output extension is only used in debug builds, and the reason I'm not using the core version is that AMD does not expose it yet.

Enough about the design, let's move on to practice!

Differences

Truth be told, OpenGL and Direct3D are really similar, and most of the porting is straightforward. There are some complications when binding resources to the shader stages; this includes texture handling as well as things like constant buffers. I've described my approach to textures and samplers previously; for constant buffers (uniform blocks in OpenGL), I use a similar scheme where I statically partition them between shader stages. One area I don't like at all, however, is how input layouts are handled. This is mostly solved by OpenGL 4.3, which provides vertex_attrib_binding, but currently I have to bind the layout after binding the buffer using lots of glVertexAttribPointer calls. Works, but not so nicely.
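
To show the difference, here is a rough sketch with a single, made-up vertex attribute (a position at attribute 0; vertexBuffer is assumed to be an existing buffer object). The 4.2 path re-specifies the layout every time the buffer changes, while the 4.3 path separates format from buffer:

struct Vertex { float position [3]; float uv [2]; };

// OpenGL 4.2: layout and buffer are latched together by glVertexAttribPointer,
// so this has to be repeated whenever the vertex buffer changes.
glBindBuffer (GL_ARRAY_BUFFER, vertexBuffer);
glEnableVertexAttribArray (0);
glVertexAttribPointer (0, 3, GL_FLOAT, GL_FALSE, sizeof (Vertex),
    reinterpret_cast<const void*> (offsetof (Vertex, position)));

// OpenGL 4.3 (vertex_attrib_binding): the format is specified once ...
glVertexAttribFormat (0, 3, GL_FLOAT, GL_FALSE, offsetof (Vertex, position));
glVertexAttribBinding (0, 0);
glEnableVertexAttribArray (0);
// ... and only the buffer binding changes per draw.
glBindVertexBuffer (0, vertexBuffer, 0, sizeof (Vertex));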

For texture storage, there's no difference now as OpenGL has texture_storage. For image buffers and UAVs, there's no difference either: you simply bind them to a slot and you're done; with OpenGL, it's a bit easier as you don't need to do anything when creating the resource. Shader reflection is very similar as well, except for how OpenGL exposes the names of variables stored in constant buffers. In Direct3D, you query a block for all its variables. In OpenGL, you'll get variables stored in buffers too if you just ask for all active uniforms, and if you query a buffer, you get the full name of each variable including the buffer name. This requires some special care to abstract away. Otherwise, the only major difference I can think of is the handling of the render window, which is pretty special in OpenGL as it also holds the primary rendering context.
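
For reference, a hedged sketch of the OpenGL side (program and blockIndex are assumed to exist; error handling omitted): enumerate the uniforms belonging to one block and normalise their names so they match what the Direct3D reflection reports:

GLint uniformCount = 0;
glGetActiveUniformBlockiv (program, blockIndex,
    GL_UNIFORM_BLOCK_ACTIVE_UNIFORMS, &uniformCount);

std::vector<GLint> indices (uniformCount);
glGetActiveUniformBlockiv (program, blockIndex,
    GL_UNIFORM_BLOCK_ACTIVE_UNIFORM_INDICES, indices.data ());

for (GLint i = 0; i < uniformCount; ++i) {
    char name [256] = { 0 };
    glGetActiveUniformName (program, static_cast<GLuint> (indices [i]),
        sizeof (name), nullptr, name);
    // Depending on how the block is declared, the name may come back as
    // "BlockName.variable" -- strip the prefix here so both backends agree.
}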

Interestingly, while I have been extending and fixing the OpenGL support, I didn't have to change the API. This shows how close the two APIs actually are, because all the checking I did early on was a brief look into the OpenGL specification to find the corresponding function and make sure I didn't expose parameters which are not available in both APIs.

Porting

For the actual porting, we have to look back a bit into the history of my framework. Originally, it started with an OpenGL 2.0 backend, and Direct3D 9 was added later. When I switched over to Direct3D 10, the OpenGL backend got updated to 3.0, and I mostly kept feature parity between the Direct3D 9 & 10 backends and the OpenGL 3.0 backend. Later, I removed the Direct3D 9 backend and rewrote the render API around the Direct3D 10/11 model. At that point, I also updated the OpenGL backend from 3.0 to 4.0, but I only implemented a subset of the complete renderer interface due to time constraints.

This year, I started porting by converting applications over one by one. Typically, I would take a small tool like the geometry viewer and make it run with OpenGL, implementing all stubs and fixing all problems along the way. For testing, I would simply run the application in question and switch the rendering backend using a configuration option. If both produced the same result, I would be happy and move along to the next application. This turned out to work quite well most of the time, but recently I've hit some major issues which required a more systematic and thorough approach to testing.

In particular, my voxel viewer, which is a pretty complex application with manual GPU memory management, UAVs, complex shaders and a highly optimized inner rendering loop, just didn't work when I implemented the three missing features in the OpenGL backend (texture buffers, image load/store support, and buffer copies.) And here is where the sad story starts: debugging OpenGL is currently in a terrible state. I tried:

  • apitrace, which mostly worked (seems to have some issues with separate shader objects and state inspection)
  • GPUPerfStudio2: Would make my application produce errors which it didn't have before (invalid parameters being passed to functions)
  • NSIGHT: Doesn't work with Visual Studio 2012 ...
  • Intel GPA: No OpenGL

Which means I had to fall back to simple debugging methods: writing small test applications for each individual feature. Behold, the triforce of instanced draw calls:

Instanced draw debug application running under OpenGL 4.2.
Instanced draw debug application running under Direct3D 11.

You have to believe me that the applications do actually use different backends, even though they look similar. Actually, they are pixel-perfect matches, to allow me to do automated testing some day in the future. Writing such small tests is how I spent the last two days, and while I haven't nailed down the particular issue with the voxel viewer yet, I've found quite a few quirks and real issues along the way. For the images above, this is the complete source code of the test application:

#include <iostream>

#include "niven.Core.Runtime.h"
#include "niven.Core.Exception.h"

#include "niven.Engine.BaseApplication3D.h"

#include "niven.Render.DrawCommand.h"
#include "niven.Render.Effect.h"
#include "niven.Render.EffectLoader.h"
#include "niven.Render.EffectManager.h"
#include "niven.Render.RenderContext.h"
#include "niven.Render.VertexBuffer.h"
#include "niven.Render.VertexLayout.h"

#include "niven.Core.Logging.Context.h"

using namespace niven;

namespace {
const char* LogCat = "DrawInstancedTest";
}

/////////////////////////////////////////////////////////////////////////////
class DrawInstancedTestApplication : public BaseApplication3D
{
    NIV_DEFINE_CLASS(DrawInstancedTestApplication, BaseApplication3D)

public:
    DrawInstancedTestApplication ()
    {
    }

private:

    void InitializeImpl ()
    {
        Log::Context ctx (LogCat, "Initialization");

        Super::InitializeImpl ();

        camera_->GetFrustum ().SetPerspectiveProjectionInfinite (
            Degree (75.0f),
            renderWindow_->GetAspectRatio (), 0.1f);

        camera_->SetPosition (0, 0, -16.0f);

        effectManager_.Initialize (renderSystem_.get (), &effectLoader_);
        shader_ = effectManager_.GetEffectFromBundle ("RenderTest",
            "DrawInstanced");

        static const Render::VertexElement layout [] = {
            Render::VertexElement (0, Render::VertexElementType::Float_3, Render::VertexElementSemantic::Position)
        };

        layout_ = renderSystem_->CreateVertexLayout (layout,
            shader_->GetVertexShaderProgram ());

        static const Vector3f vertices [] = {
            Vector3f (-6, -4, 4),
            Vector3f (0,   4, 4),
            Vector3f (6,  -4, 4)
        };

        vertexBuffer_ = renderSystem_->CreateVertexBuffer (sizeof (Vector3f),
            3, Render::ResourceUsage::Static, vertices);
    }

    void ShutdownImpl ()
    {
        Log::Context ctx (LogCat, "Shutdown");

        effectManager_.Shutdown ();

        Super::ShutdownImpl ();
    }

    void DrawImpl ()
    {
        shader_->Bind (renderContext_);

        Render::DrawInstancedCommand dc;
        dc.SetVertexBuffer (vertexBuffer_);
        dc.vertexCount = 3;
        dc.SetVertexLayout (layout_);
        dc.type = Render::PrimitiveType::TriangleList;
        dc.instanceCount = 3;

        renderContext_->Draw (dc);

        shader_->Unbind (renderContext_);
    }


private:
    Render::EffectManager       effectManager_;
    Render::EffectLoader        effectLoader_;

    Render::Effect*             shader_;
    Render::IVertexLayout*      layout_;
    Render::IVertexBuffer*      vertexBuffer_;
};

NIV_IMPLEMENT_CLASS(DrawInstancedTestApplication, AppDrawInstancedTest)

int main (int /* argc */, char* /* argv */ [])
{
    RuntimeEnvironment env;

    try {
        env.Initialize ();

        DrawInstancedTestApplication app;
        app.Initialize ();
        app.Run ();
        app.Shutdown ();
    } catch (const Exception& e) {
        std::cout << e << std::endl;
    } catch (const std::exception& e) {
        std::cout << e.what () << std::endl;
    }

    return 0;
}

While certainly time consuming, this seems to be the way forward to guarantee feature parity between my two backends and also to have easy-to-check test cases. In the future, this stuff seems like a good candidate for automated testing, but that requires quite a bit of work on the test runner side. In particular, it should be able to capture the screen contents from outside of the application itself, to guarantee that the result on screen is actually the same, and it should be written in a portable manner.

So much for now from the OpenGL porting front. If you have questions, just comment or drop me an email.

Porting from DirectX11 to OpenGL 4.2: Textures & samplers

This is going to be a fairly technical post on a single issue which arises if you port a DirectX11 graphics engine to OpenGL 4.2. The problem I'm going to talk about is shader resource binding, in particular, how to map the Direct3D model of textures and samplers to the OpenGL model of sampler objects, texture units and uniform samplers.

First of all, a word on naming. I'll be using texture unit from here on for the thing called uniform sampler in GLSL; otherwise, it would never be clear whether I mean a sampler object or a sampler in GLSL.

Ok, so where is the problem, actually? There are two differences between Direct3D and OpenGL that we need to work on. The first is that OpenGL was originally designed to have a single shader program which covers multiple stages, while Direct3D always had those separate shader stages with their own resources. The second is that Direct3D10 separated samplers from textures, so you can decide inside the shader which texture you want to sample using which sampler. In OpenGL, the sampler was historically bound to the texture itself.

Binding shaders

Let's look at the first problem: Having a separate shader for each stage is easy with OpenGL 4.2, as the separate shader objects extension has become core. We can now create shader programs for each stage just like in Direct3D, so nothing special to see here. The only minor difficulty is that we need to keep a program pipeline object around to attach our separate programs to and enable the right stages on it, but this is no more difficult than in Direct3D.
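
A minimal sketch of that setup (vertexSource and fragmentSource are assumed to be GLSL source strings; error checking omitted):

// One program per stage, just like in Direct3D.
GLuint vs = glCreateShaderProgramv (GL_VERTEX_SHADER, 1, &vertexSource);
GLuint fs = glCreateShaderProgramv (GL_FRAGMENT_SHADER, 1, &fragmentSource);

// The pipeline object ties the per-stage programs together.
GLuint pipeline = 0;
glGenProgramPipelines (1, &pipeline);
glUseProgramStages (pipeline, GL_VERTEX_SHADER_BIT, vs);
glUseProgramStages (pipeline, GL_FRAGMENT_SHADER_BIT, fs);

// Bind the pipeline instead of a monolithic program before drawing.
glBindProgramPipeline (pipeline);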

Binding textures

The real problem comes when we try to bind textures. Unlike Direct3D, OpenGL does not separate textures from samplers cleanly. Even with the sampler object extension, which allows you to specify a sampler independent of the texture, you still have to connect them before sending them to a shader. Inside the shader, every texture unit (remember, this means every uniform sampler) is a combination of texture data and a sampler.

The way I solve this issue is with a shader pre-processing step that does some patching. I expose the same system as in Direct3D to the user, that is, a number of texture slots and sampler slots per shader stage. While parsing the shader, I record which texture/sampler combinations are actually used, and those get assigned to a texture unit at compile time; I simply enumerate the units one by one. The users have to use a custom macro for texture sampling; otherwise, there is no real difference to HLSL here. HLSL? Yes, as the user has to write the names of the textures and samplers into the GLSL code -- the user never writes a uniform sampler statement though.

For each texture and sampler slot, I record which texture units it is bound to, and when the user sets a slot, all associated texture units get updated. Each shader stage gets a different texture unit range, so if the fragment program is changed, all vertex program bindings remain unaffected. So far, so good, but there is one tricky issue left which makes this not as great as it could be.
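
What actually happens when a slot is set is roughly the following (a hedged sketch with made-up names; the program, texture and sampler objects are assumed to exist):

void BindTextureSamplerPair (GLuint unit, GLuint texture, GLuint sampler,
    GLuint program, GLint samplerUniformLocation)
{
    // Attach the texture data to the chosen texture unit.
    glActiveTexture (GL_TEXTURE0 + unit);
    glBindTexture (GL_TEXTURE_2D, texture);

    // The sampler object overrides the texture's built-in sampling state.
    glBindSampler (unit, sampler);

    // Point the GLSL uniform sampler at that unit.
    glProgramUniform1i (program, samplerUniformLocation,
        static_cast<GLint> (unit));
}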

The easy life of a graphics developer

Let's think about this system for a moment. What it means is that each texture/sampler pair occupies a new texture unit. This is probably not an issue, as any recent graphics card supports 32 texture units per stage, so we should be fine, right? The problem here is called GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS. This is the maximum number of texture units you can bind at the same time for all shader stages combined. On AMD hardware, this is currently 32 -- if you statically partition them across 5 shader stages, you get 6 units per stage (and 8 for the fragment shader.) This is not that great. Ideally, you would like to have at least 32 slots per stage (or 160 in total), and then make sure not to use more than the maximum combined number of texture image units across all stages (unless the hardware limit is indeed 160 texture units. But hey, if you sample from 160 textures across 5 shader stages, you're not doing realtime graphics anymore ... I digress.)

Intel and NVIDIA seem to expose just that number (160 on NVIDIA, 80 on Intel), which makes for easier porting. I'm not sure why AMD actually exposes only 32 there (even on my HD 7970), as those texture image units don't have a real meaning. It's not like the hardware actually has textures associated with samplers; instead, the sampler state is passed along with each texture fetch. If you don't trust me, check the AMD instruction set reference :) In section 8.2.1, you can see the various image instructions, which can take a 128-bit sampler definition along with them. That's where everything necessary for sampling is stored. It's simply four consecutive scalar registers, so in theory, you should be able to define them even from within the shader code (and I'm 100% sure someone will do this on the PS4!)

The correct solution, which should work in all cases, is to defer the binding until the next draw call and do the texture/sampler slot to texture unit assignment there. This gives you the flexibility to assign all 32 texture units to a single shader stage, at the expense of having to (potentially) change the mappings on every shader change.

Conclusion

Right now, 6 texture units per stage (and 8 for the fragment shader) is plenty for me, and I guess that's true for most games as well. Remember that a texture unit can be bound to a texture array as well, so the 8 is not really a practical limit. If you're OK with these limits, my solution seems to be as good as any and allows for a nice and clean mapping between Direct3D11 and OpenGL. I would be curious to hear, though, how other people solve this problem, as there are surely more clever solutions out there!

Two more reasons why you should be using OpenCL

Are you planning to add parallel computing to your application, and you wonder what API to use? Here are two good reasons why you should be using OpenCL today. If you're not sure what OpenCL is about, take a look at a gentle introduction to OpenCL.

CPU/GPU portability

OpenCL runs on both AMD and NVIDIA graphics cards. That's not much different from DirectCompute, so why bother with OpenCL? The cool thing here is that OpenCL also works on CPUs and on mobile devices. The fact that you can run on CPUs is often overlooked, but there are two good reasons why you really want this:

  • CPUs are pretty fast these days, if you properly use the vector units. OpenCL makes this very easy and allows you to have a high-quality CPU fallback, which will guarantee that your application works reasonably for all your customers.
  • CPUs typically have much more memory. If you run into a scalability problem where you cannot process the problem on your compute device any more (graphics cards currently have at most 6-8 GiB of memory), it's trivial to run it on the CPU where memory is really cheap. For the price of a high-end consumer GPU, you can easily buy 64 GiB of ECC RAM.

The last argument is particularly important if you work on data sets which don't partition very well. For example, if you have a renderer, it is very likely that most scenes will fit on the GPU, but if your customer decides to throw a really complicated mesh at it, you can simply switch to the CPU. Performance will likely suffer, but it will work, and it requires no additional work from your side. There's no other API out there which is that flexible.
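
In OpenCL, that switch is just a different device choice at startup. A minimal sketch (the platform is assumed to have been queried already; error handling omitted):

// Prefer a GPU, fall back to the CPU if none is available (or if the
// problem is known not to fit into GPU memory).
cl_device_id device = nullptr;
if (clGetDeviceIDs (platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr)
    != CL_SUCCESS) {
    clGetDeviceIDs (platform, CL_DEVICE_TYPE_CPU, 1, &device, nullptr);
}

// Everything from here on -- context, queues, kernels -- stays the same.
cl_context context = clCreateContext (nullptr, 1, &device,
    nullptr, nullptr, nullptr);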

Graphics interop

DirectX has DirectCompute, and OpenGL now has compute shaders as well (as of 4.3.) But only OpenCL allows you to use the same compute code with both APIs. For OpenGL, the OpenCL interop is independent of your OpenGL version (you can use OpenGL 4.0 with OpenCL, or OpenGL 3.0 for that matter); the same applies for DirectX.

All that is needed to use OpenCL with both graphics APIs is a minimal interop layer (check the D3D11 sharing extension and the OpenGL sharing extension.) It basically boils down to two steps: First, mark the buffers & textures you want to share. Second, enqueue acquire and release commands into an OpenCL command queue to synchronise your graphics and compute code. Your OpenCL code stays the same, independent of whether the data is shared with a graphics API or allocated by OpenCL directly. The driver also gets full knowledge of the resource dependencies which allows it to make good scheduling decisions. This makes it possible to efficiently execute the compute kernels on the GPU without synchronising with the host.
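
A hedged sketch of those two steps for the OpenGL case (context, queue, kernel and the GL texture are assumed to exist, and the context must have been created with sharing enabled; clCreateFromGLTexture is the OpenCL 1.2 entry point, on 1.1 it is clCreateFromGLTexture2D):

// Step 1: wrap the existing OpenGL texture once.
cl_int err = CL_SUCCESS;
cl_mem shared = clCreateFromGLTexture (context, CL_MEM_WRITE_ONLY,
    GL_TEXTURE_2D, 0 /* mip level */, glTexture, &err);

// Step 2, per frame: hand the resource over, run the kernel, hand it back.
const size_t globalSize [2] = { 1024, 1024 };   // image dimensions, for illustration
clEnqueueAcquireGLObjects (queue, 1, &shared, 0, nullptr, nullptr);
clSetKernelArg (kernel, 0, sizeof (shared), &shared);
clEnqueueNDRangeKernel (queue, kernel, 2, nullptr, globalSize, nullptr,
    0, nullptr, nullptr);
clEnqueueReleaseGLObjects (queue, 1, &shared, 0, nullptr, nullptr);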

Another advantage is that the OpenCL compiler is better optimized for compute code than the compilers for your graphics API. In my experience, the DirectX compiler is notorious for long optimization times once loops are present. This is no surprise, as it has been originally written for short shaders and its heuristics are tuned for graphics code, not long compute kernels. On the other hand, all current OpenCL implementations are based on LLVM, a compiler framework designed to efficiently handle complex compute code like the SPEC benchmark suite.

If you have gotten curious about OpenCL and you want to give it a shot, head over to my OpenCL tutorial which should help to get you started. Have fun coding! If you have questions about OpenCL, feel free to comment or contact me directly.

Games 2012

Besides doing research and some secret side project, I'm also playing games on my PC -- and at this time of the year, it's time to look back at 2012 in games. This will be a very personal look back, and I'm going to spoil story elements without remorse, so if you haven't played one of the games yet, you should probably skip that section.

This year, I've played the following games (by "played", I mean I've done at least one complete run from intro to outro, or, for a multiplayer-only game, spent a bunch of hours with it):

Notice that some games are older or might have been released last year already. Without further ado, let's start with them one-by-one.

Alan Wake

Well, it took a long time until it finally appeared on the PC, but it was still a fun game. What I really liked were the interesting locations, like the clinic in the mountains and the farm towards the end, and the great characters. I wish more games would provide locations like Alan Wake's -- in particular, the clinic on the mountain is nicely visible from the lake, making you wonder if you'll ever reach it. It's simple, but really effective, and makes you want to continue the plot. Unfortunately, the fights were not that cool and highly repetitive, and the ending was mediocre at best. I would have wished for a more complete ending, but we'll see that endings seem to be kind of a problem for some other games as well.

Battlefield 3

Well, there's not too much to say about it. Graphics-wise, nothing comes close to it except for one game (see below.) It's also well polished by now, except for some annoying bugs which I assume will never get fixed. Two come to mind immediately: not being able to deny a medic revive while dead (you can only deny after being revived, but you can die again before you deny and get stuck in a revive cycle), and the fact that you can't change the options while dead (you have to actually enter the game and stand around in the field before you can change options.)

Otherwise, it's a great game, and it got lots of updates this year through the premium service. I have to admit that paying those 50€ once and resting assured that you get all DLCs and extras might be the only viable solution for me for large games. On the other hand, they could have made the game more expensive right away and patched for free ...

The DLCs have been mostly good; in particular, Close Quarters was a great update and Aftermath seems pretty nice so far. Armored Kill, on the other hand, wasn't that great: the balance was really crappy at first (gunship!), and it's still hard to find a server where you can actually enjoy an Armored Kill map. That said, some of the maps (in particular, Alborz Mountains) look absolutely gorgeous.

Dishonored

I had really high hopes for this game. The previews looked great, and I really fell in love with the steampunk world. Story-wise, it promised to be an interesting experience as well. I was a bit worried about the stealth part, but alternative solutions, being able to use force if you want, and the magic abilities made me hope for a killer game. Let's start with the not-so-great stuff: the texture resolution was really low in many places, making things look ugly up close. The sneaking is also not well done: in some cases your enemies spot you from far away, and in others you can walk behind them for minutes. Worse, sometimes the death of one enemy attracts lots of guards, while at other times you can slice through them without raising any suspicion. I know it's fairly hard to make a good stealth game, but as you can see below, it's possible. What also sucked big time for me was the water. Guys, if every mission in your game starts and ends with a trip on a boat, make sure the water looks amazing. On the PC, there is no excuse for crappy water any more.

On the other hand, there's one really great moment in Dishonored which was absolutely stunning, and this is the lighthouse at the end. Might be that it reminded me of the incredible Batman: Arkham City ending on the Wonder Tower, but that lighthouse was just very well done, the scenery was set up perfectly and it was a great fit. I have to say that I had the "high chaos" ending, which changes the weather to a rainstorm, making the whole level very impressive. It's really unfortunate that the ending couldn't keep up with the last mission -- in fact, it was rather dull. A short camera flight, an off-screen narrator, that's it. But that lighthouse is just incredible and raises the bar for great locations.

Dota 2

After finally getting a beta key, I played Dota 2 for a brief time. This was my first contact with Dota ever, so I sucked pretty badly in the first few games, but fortunately, I have some experienced Dota friends who taught me the basics of this incredibly deep and complex game. The main downside is that getting good at Dota 2 requires a lot of time. Anyway, if you like this type of game, Dota 2 is extremely well polished and has excellent balancing. Brace yourself, though, for hours and hours of training.

Deponia

A classic point-and-click adventure. Lots of nice gags, crazy puzzles and excellent voice acting. The ending is a bit weird, obviously designed for a sequel, but nonetheless interesting and a good fit. You'll probably play this only once, but you'll have a lot of fun in that one play-through. I'm also curious to see whether the next parts can keep up with the first.

Endless Space

I have to admit it: I'm a huge fan of the best 4X game ever made -- Master of Orion 2. Endless Space promised to be 4X in space in the spirit of Master of Orion 2, so it was an easy buy for me. Having played a few full games, I have to say that it's actually closer to Civilization than to Master of Orion 2, and some things are not done as well in Endless Space as they were in Master of Orion 2 (for instance, in MOO2 you could build transports to distribute food across your empire, but such elements haven't been included in Endless Space, for no good reason.)

Overall, it's a great game as it is right now, and I surely hope that it'll reach Master of Orion 2 awesomeness with the first paid add-on or, at worst, with Endless Space 2. Keep up the good work, guys, and don't be afraid to add some complexity like espionage!

Fallout: New Vegas

This was surely the game with the crappiest graphics I played this year; it doesn't even have proper shadows most of the time. But heck, the story, the characters, the world and everything else is just phenomenal, making it better than Fallout 3. It just feels right from the beginning to the end, which is also well done. Sure, the end sequence could have been fancier, but all of your decisions have an impact and a glorious ending wouldn't fit into the Fallout universe anyway.

Home

This is a mixed bag. I would count this as an experimental game; it was an interesting experience, but it wasn't my kind of game. For me, a better understanding of which decision had what impact would have been important, but all I can say is give it a try, as each run is different.

Mark of the Ninja

A stealth game, but this time the stealth is extremely well executed. Unlike Dishonored, the stealth mechanics work great here, the pacing is perfect, and the game ends at just the right moment. Unfortunately, the ending is also not that satisfying, but it does fit the scenario. It might have been enough to make it slightly longer. Anyway, if you like 2D stealth games, Mark of the Ninja is surely among the best in quite a few years.

Mass Effect 3

It's really hard to write about this game, as there are basically only two possible opinions about it: either you like the ending or you don't, and I'm firmly in the group that absolutely hates it, including the extended ending.

The game itself is just perfect right up until the last mission, even more so if you play the later-released DLCs (Leviathan in particular.) It's a fast-paced action movie, with an interesting story, lots of things going on, and cool characters. Unfortunately, there are some extremely stupid plot holes which make you cringe in your seat. If you need an example: Earth is being attacked, so you fly to the home planet of another race and watch it being destroyed from its moon. Looking down at the planet, a general of the alien race tells you that you can see his home burning, and the next thing you ask him is to abandon the planet and come over to Earth to stop the Reapers, cause, you know, Earth is your home planet, and, well, Earth is going to be important for the plot, sometime, hopefully. Gnah!

Besides such problems, everything is fine, there are really emotional moments when people from your team die, and then it comes: the most horrible ending in years. What makes me want to cry is that this is done by the people who made the Baldur's Gate 2 ending, which I remember as if it were yesterday because it was just epic. The ending in ME3 starts similarly: you talk with each of your party members, stating that this is the end, and we're going to fight together, united. You assemble your whole team; a group of characters you have spent hours and hours fighting with and dying for. Your team stands ready, the final speech is given, all of your people are ready to finish the Reapers, and then: boom, you have to pick two of them, because the game doesn't allow for more people. I would have paid 20€ more to get an ending where I could fight with all of them together, just to stand side-by-side with my team. After the final fight, which is actually not that bad, you wind up at the "set the color" stage, which is probably the most ridiculous ending ever devised. At this point, nothing you did in the game matters at all; instead, you choose the color of the ending sequence (green, red, or yellow.) That's it. It might be that they needed some consistent ending to continue with ME4 at some point in the future, but in my opinion, the endings in ME3 should have been at least as diverse as in Fallout: New Vegas, and ideally even more so. For me, the ending really ruined this otherwise great game, and I consider myself a huge ME3 fan.

The Walking Dead

I'm not sure if this can really be called a game, or rather an interactive social study. The story is dark, gritty and cruel. The game has many choices, but you always have the feeling that you only decide which kind of doom one person will eventually face. Story-wise, it's well written, and you definitely get sucked into the game. However, be aware that it's a very linear game; it mainly consists of dialogues and quick-time events. Sure, that's similar to some of the weaker Mass Effect 3 missions, and it allows for extremely dense storytelling, but at some points you might feel trapped by more than just the zombies. I have to say that I neither watched the series nor read the books, so I can only judge from the game. It's hard to judge who will enjoy this game; if you are unsure, try the first episode, as the rest is in a very similar spirit.

So much for the content; on the technical side, there are some weak points which are not really understandable in 2012. First of all, the graphics suffer from severe aliasing in many places; I don't get why I cannot turn on 16x SSAA in this game, given its minimal graphics requirements. Second, it stalls for a second or so at points during dialogues; given that you only move through tiny scenes with a few characters at best, this becomes an annoying issue very quickly. There are also various quirks, for instance, decals floating above the characters' clothing. Team Fortress 2 shows how an NPR game can be done these days, and I definitely hope that Telltale will make their technology more robust and cleaner in the future. These are really small issues if you look at them in isolation, but they stick out as the graphics are pretty minimal anyway.

Trine 2

The prize for best graphics goes to Battlefield 3 overall and to Trine 2. In particular, the scenes where you are inside the giant worm are stunning both gameplay- and graphics-wise, and the add-on ups the ante even further. The main problem for a potential Trine 3 is that it'll be hard to top Trine 2, which cleverly expands the gameplay over Trine 1 and improves on the graphics side as well.

What I really like about this game is that the progression is great, new gameplay elements are introduced at just the right time, and both the ending of the main game and of the add-on feel satisfying. Definitely one of the highlights of 2012.

X-Com

Firaxis making X-Com, focusing on the turn-based strategy; this sounds like nothing could go wrong. And yet, I was really disappointed by this game. Sure, it's well executed, but there are so many things which just feel wrong. It's also a contender for the worst ending of 2012; I'm not sure why, but the ending just sucks on so many levels. Some of the cutscenes were much better at conveying the "world under attack" feeling than the end sequence.

For turn-based games, there is still one big reference: Jagged Alliance 2 with the 1.13 patch (and drops set to "enemies drop all weapons they have".) If you have ever played Jagged Alliance 2 with that patch, you have seen the pinnacle of turn-based strategy; it doesn't get much better than that, and I'm not even sure it can get much better. It's still perfectly playable in 2012; in fact, I played it this summer from start to end. X-Com is a good game, but it's far from the dynamics of Jagged Alliance 2, especially when it comes to your team.

[Update] Added "The Walking Dead".