Shaderoids 2

Right, I left off last time with the beginnings of a game, now it’s time to add some sounds and other shizzle. It’s up on GitHub now (link at the end of the post).

Sounds

 

Unfortunately one thing that just isn’t possible right now is playing sounds from the GPU (booo). However, I can implement a system where our GPU side code can ‘request’ a given sound to be played. Then the CPU side will just act on those requests.

We’ll start with the simplest possible 1-shot ‘audio engine’ you can build in Unity…

    //needs: using System.Collections.Generic; using System.Linq; using UnityEngine;

    //array of clips + list of allocated audio sources
    AudioClip[] _clips;
    List<AudioSource> _audioSources;

    //load list of clips
    void LoadClips(params string[] names) {
        _audioSources = new List<AudioSource>();
        _clips = names.Select(a => Resources.Load<AudioClip>(a)).ToArray();
    }

    //find a free audio source, allocating if necessary, and play clip
    void PlayClip(int idx) {
        AudioSource src = null;
        for(int i = 0; i < _audioSources.Count; i++) {
            if(!_audioSources[i].isPlaying) {
                src = _audioSources[i];
                break;
            }
        }
        if (!src) {
            src = gameObject.AddComponent<AudioSource>();
            _audioSources.Add(src);
        }
        src.clip = _clips[idx];
        src.Play();
    }

I’ve used that pattern a few times for very quick prototypes. It basically lets you load a set of audio files from resources, then ask to play them once.

Let’s load 3 audio files I ripped out of the earlier YouTube vid of the actual game…

LoadClips("fire", "explode", "blop");

Next, a simple structure that will be mirrored CPU and GPU side. This represents a request to play a sound, and for now just contains the integer id of the sound to be played:

    public struct SoundRequest {
        public int id;
    }

I allocate a compute buffer to contain an array of those requests, alongside a CPU buffer to copy the requests into. On top, the counters structure now has a ‘numSoundRequests’ variable. At the end of the game update we can now read out and process the requests:


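    _soundRequestBuffer.GetData(_cpuSoundRequests);
    _countersBuffer.GetData(_cpuCounters);
    //the GPU counter can run past the end of the buffer, so clamp it
    int numRequests = Mathf.Min(_cpuCounters[0].numSoundRequests, _cpuSoundRequests.Length);
    for (int i = 0; i < numRequests; i++)
        PlayClip(_cpuSoundRequests[i].id);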

And for the last bit of plumbing, a GPU side function in the compute shader for making a request:

//matching ids of sounds loaded in game.cs
#define SND_FIRE 0
#define SND_EXPLODE 1
#define SND_BLOP 2

void PlaySound(int id) {
    int idx;
    InterlockedAdd(_counters[0].numSoundRequests, 1, idx);
    if (idx < _maxSoundRequests) {
        _soundRequestsRW[idx].id = id;
    }
}
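
The request pattern is easy to poke at on the CPU. What follows is a minimal C++ sketch rather than the actual Unity/HLSL code: `RequestQueue`, `kMaxSoundRequests` and `validCount` are names I've made up for illustration, and `std::atomic::fetch_add` stands in for `InterlockedAdd`. It also shows the one thing worth watching: the counter keeps climbing once the buffer is full, so whoever reads it back probably wants to clamp it first.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cassert>

// Hypothetical capacity; stands in for _maxSoundRequests.
constexpr int kMaxSoundRequests = 4;

struct SoundRequest { int id; };

struct RequestQueue {
    std::atomic<int> numRequests{0};                      // like _counters[0].numSoundRequests
    std::array<SoundRequest, kMaxSoundRequests> slots{};  // like _soundRequestsRW

    // 'GPU side': atomically grab a slot index, write only if it is in range.
    void playSound(int id) {
        assert(id >= 0);                     // ids are indices into the clip list
        int idx = numRequests.fetch_add(1);  // returns the value before the add
        if (idx < kMaxSoundRequests) {
            slots[idx].id = id;
        }
    }

    // 'CPU side': the counter may have overshot, so clamp before reading.
    int validCount() const {
        return std::min(numRequests.load(), kMaxSoundRequests);
    }
};
```

With 6 requests and 4 slots, the counter ends at 6 but only the first 4 requests are stored, which is why the reader loops to `validCount()` rather than the raw counter.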

Now I can play sounds from GPU code! For example, when the player hits the space bar to fire a bullet:

if (_keyStates[' '].pressed) {
    //...stuff for firing the bullet...
    PlaySound(SND_FIRE);
}

I’ve hooked this up to firing and exploding asteroids by now, and it’s already starting to feel gamey!

Text

Hmmmm.. text rendering on GPU. I want my GPU code to at least be able to ask to draw a given character – e.g. DrawCharacter(‘a’). I don’t really want to cheat and start using bitmap fonts, so… I guess I’m programming a vector font. I create Font.cginc, which is full of stuff like this….

    switch (character) {
    case 'a': 
    case 'A':
    {
        AddFontLine(idx++, float2(0, 0), float2(0, 0.5f));
        AddFontLine(idx++, float2(1, 0), float2(1, 0.5f));
        AddFontLine(idx++, float2(0.5f, 1), float2(0, 0.5f));
        AddFontLine(idx++, float2(0.5f, 1), float2(1, 0.5f));
        AddFontLine(idx++, float2(0, 0.5f), float2(1, 0.5f));

        break;
    }

Basically, I’m filling up a big buffer, which has 16 line slots for each character. i.e. for the character code 65 (‘A’), there are up to 16 lines stored in a font buffer (_font) at offset 16*65.
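
To make the layout concrete, here is a small CPU-side C++ sketch of the same idea: one flat array with 16 slots per character code. It is an illustration under my own assumptions, not the real Font.cginc; `makeFont`, `addFontLine` (which takes a character and slot rather than a running index) and `isDegenerate` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

constexpr int kLinesPerCharacter = 16;  // mirrors LINES_PER_CHARACTER
constexpr int kNumCharacters = 256;

struct Float2 { float x, y; };
struct Line { Float2 a, b; };

// One flat buffer: 16 line slots per character code, all starting out degenerate.
std::vector<Line> makeFont() {
    return std::vector<Line>(kNumCharacters * kLinesPerCharacter, Line{{0, 0}, {0, 0}});
}

// Write line 'slot' of 'character' at offset character * 16 + slot.
void addFontLine(std::vector<Line>& font, unsigned char character, int slot, Float2 a, Float2 b) {
    assert(slot >= 0 && slot < kLinesPerCharacter);
    font[character * kLinesPerCharacter + slot] = Line{a, b};
}

// A line whose end points coincide has zero length, so drawing it does nothing.
bool isDegenerate(const Line& l) {
    return l.a.x == l.b.x && l.a.y == l.b.y;
}
```

The fixed stride wastes a bit of memory on unused slots, but it means any character can be found with one multiply and no lookup table.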

I’ve created a simple compute shader called ‘BuildFont’, which gets dispatched with 256 threads – 1 per ASCII character, though I’ve obviously only filled in ‘A’-‘Z’ and a few other bits and pieces!

Now a simple test shader:

[numthreads(256, 1, 1)]
void DrawFont(uint3 id : SV_DispatchThreadID)
{
    int lineIdx = AllocLines(LINES_PER_CHARACTER);
    int charOffset = id.x * LINES_PER_CHARACTER;

    float2 pos = float2(id.x / 16, id.x % 16) * float2(20,25) + 100;
    float2 scl = float2(15,20);

    for (int i = 0; i < LINES_PER_CHARACTER; i++) {
        float2 a = _font[charOffset + i].a * scl + pos;
        float2 b = _font[charOffset + i].b * scl + pos;
        AddLine(lineIdx++, a, b);
    }
}

This renders a 16×16 grid of characters on screen. For each one, we draw all 16 lines (note: empty lines are degenerate so don’t do anything). The result:

[Image: vecfont]

Woohoo! 1 hardcoded vector font. No better way to spend the first chunk of a 10 hour plane flight. Unfortunately I discovered the plane has wifi, so was able to check out the actual asteroids font, and discovered it is way less curvy than mine. Oh well.

Teeny bit of UI

To get started on the UI and game loop, I’ll add a single ‘UpdateGame’ kernel.

[numthreads(1,1,1)]
void UpdateGame(uint3 id : SV_DispatchThreadID)
{
    int gameMode = _counters[0].gameMode;
    if (gameMode == 0) {
        //init
        gameMode = 1;

    }
    else if (gameMode == 1)
    { 
        //main menu
        DrawText_OneCoinOnePlay();
    }

}

This is probably the closest bit to a classic game, as it’s a single function that gets called once per frame to update general game state. At this stage, we only have the game mode, which starts off as 0 (init) then proceeds to 1 (main menu). I’ve also added our first bit of UI:

void DrawText_OneCoinOnePlay() {
    float2 pos = float2(350,150);
    float2 scl = float2(18, 20);
    float2 spacing = float2(scl.x*1.25, 0);

    DrawCharacter('1', pos, scl); pos += spacing;
    DrawCharacter(' ', pos, scl); pos += spacing;
    DrawCharacter('C', pos, scl); pos += spacing;
    DrawCharacter('O', pos, scl); pos += spacing;
    DrawCharacter('I', pos, scl); pos += spacing;
    DrawCharacter('N', pos, scl); pos += spacing;
    DrawCharacter(' ', pos, scl); pos += spacing;
    DrawCharacter('1', pos, scl); pos += spacing;
    DrawCharacter(' ', pos, scl); pos += spacing;
    DrawCharacter('P', pos, scl); pos += spacing;
    DrawCharacter('L', pos, scl); pos += spacing;
    DrawCharacter('A', pos, scl); pos += spacing;
    DrawCharacter('Y', pos, scl); pos += spacing;
}

At this point though I want to get some proper startup code in. Ideally when the player presses ‘start’ we would init the game state and fire them into the first level. This creates a need for compute shaders to trigger other compute shaders, which leads onto…

More indirect dispatching

Enter the indirect dispatch pattern. The basic concept of indirect dispatch is that sometimes you need a compute shader to control how a following compute shader will be dispatched. For example, our current ‘spawn asteroids’ shader is dispatched 10 times to spawn 10 asteroids, but what if we want to control how many asteroids are spawned from a compute shader instead?

For this purpose I create a new compute buffer called _dispatch, with the type ComputeBufferType.IndirectArguments. It’s just a buffer of uints, each of which will represent one of the arguments we’d normally provide to the Dispatch function.

For this simple purpose I’ll have a load of parameters that can be passed to shaders via what used to be the counters buffer, and is now the globals buffer. In addition, I’ll have a simple setup kernel that checks which kernel is to be dispatched, and configures the indirect buffer accordingly. This approach is not particularly scalable but simple and effective.

So our globals structure now has 2 new ‘request’ integers:

        public int requestClearAsteroids;
        public int requestSpawnAsteroids;

And the game update can make requests like this:

        //when 'c' is pressed, ask to clear and then spawn 4 new asteroids
        if (_keyStates['c'].pressed) {
            _globals[0].requestClearAsteroids = 1;
            _globals[0].requestSpawnAsteroids = 4;
            gameMode = 1;
        }

A special ‘SetupDispatch’ shader now takes a predefined kernel id constant to run, and then uses the request values to fill out the dispatch settings:

//dispatch setup function that takes the kernel id being setup for and runs whatever logic
//is necessary for it
[numthreads(1,1,1)]
void SetupDispatch(uint3 id: SV_DispatchThreadID)
{
    switch (_kernelIdRequested) {
    case KID_CLEAR_ASTEROIDS: {
        //the CLEAR ASTEROIDS dispatch will dispatch across nothing if not requested, or
        //all asteroids if requested
        _dispatch[0] =  _globals[0].requestClearAsteroids ? (_maxAsteroids+255)/256 : 0;
        break;
    }
    case KID_SPAWN_ASTEROIDS: {
        //SPAWN_ASTEROIDS dispatches the number of requested asteroids
        _dispatch[0] = (_globals[0].requestSpawnAsteroids + 255) / 256;
        break;
    }
    default: {
        _dispatch[0] = 0; 
        break;
    }
    }
    _dispatch[1] = 1;
    _dispatch[2] = 1;
}
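
The arithmetic in there is worth a quick sanity check. This is a CPU-side C++ sketch of the same decision logic, written under my own assumptions: `groupsFor`, `KID_OTHER` and the `Globals` struct as laid out here are illustrative, and the function returns the three group counts instead of writing them to a buffer.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kGroupSize = 256;  // matches [numthreads(256,1,1)]

// Smallest number of groups whose threads cover 'items' (0 items -> 0 groups).
std::uint32_t groupsFor(std::uint32_t items) {
    return (items + kGroupSize - 1) / kGroupSize;
}

enum KernelId { KID_CLEAR_ASTEROIDS, KID_SPAWN_ASTEROIDS, KID_OTHER };

struct Globals {
    int requestClearAsteroids;
    int requestSpawnAsteroids;
};

// Returns the {x, y, z} thread group counts an indirect dispatch would read.
std::array<std::uint32_t, 3> setupDispatch(KernelId kernel, const Globals& g,
                                           std::uint32_t maxAsteroids) {
    assert(g.requestSpawnAsteroids >= 0);
    std::uint32_t x = 0;
    switch (kernel) {
    case KID_CLEAR_ASTEROIDS:
        // nothing if not requested, every asteroid if requested
        x = g.requestClearAsteroids ? groupsFor(maxAsteroids) : 0;
        break;
    case KID_SPAWN_ASTEROIDS:
        x = groupsFor(static_cast<std::uint32_t>(g.requestSpawnAsteroids));
        break;
    default:
        break;
    }
    return {x, 1, 1};
}
```

Note that a request for 4 asteroids still launches a whole group of 256 threads, which is why the kernel itself has to compare its thread id against the requested count.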

And functions like our SpawnAsteroids kernel look almost identical, with the small caveat that they may need to read the request counter to avoid overrunning:

//spawns 'n' requested asteroids
[numthreads(256,1,1)] 
void SpawnAsteroids(uint3 id : SV_DispatchThreadID) {
    //read count (either from uniform or indirect request)
    int maxCount = _threadCount < 0 ? _globals[0].requestSpawnAsteroids : _threadCount;
    _globals[0].requestSpawnAsteroids = 0;
    if (id.x < maxCount)
    { 
        //usual stuff
    }
}


The last bit of plumbing is a new DispatchIndirect helper CPU side, which first dispatches the setup kernel before dispatching the main one:

    //dispatches the 'setup dispatch' shader to configure the dispatch args for a given
    //dispatch type, then does an actual dispatch indirect on the requested kernel
    void DispatchIndirect(int kernel, int dispatchIndex) {
        _asteroidsShader.SetInt("_kernelIdRequested", dispatchIndex);
        DispatchOne(kernelSetupDispatch);

        BindEverything(kernel);
        _asteroidsShader.SetInt("_threadCount", -1);
        _asteroidsShader.DispatchIndirect(kernel, _dispatchBuffer);
    }

And finally the CPU game update can first dispatch UpdateGame, then dispatch both ClearAsteroids and SpawnAsteroids on the off chance they have work to do.

        DispatchOne(kernelUpdateGame);
        _playerState.GetData(players);
        DispatchIndirect(kernelClearAsteroids, KID_CLEAR_ASTEROIDS);
        DispatchIndirect(kernelSpawnAsteroids, KID_SPAWN_ASTEROIDS);

Phew! Lots of setup, but that’s a nice simple example of indirect dispatch being used in anger.

Spawning the player

One final pain going into this lot is spawning the player from compute. The game has to wait until there are no asteroids in the way before spawning, which is actually non-trivial on GPU. To do so, all asteroids must be checked against the player. Thus I had to write a couple of new kernels. The first checks players that want to spawn, and marks them as ‘can spawn’:

//prepares all players that want to spawn for spawning, and defaults their 'canSpawn'
//to true
[numthreads(256, 1, 1)]
void PreparePlayerSpawning(uint3 id : SV_DispatchThreadID)
{
    if (id.x < _threadCount)
    {
        PlayerState player = _playersRW[id.x];
        if (player.wantsToSpawn) {
            player.position = float2(1024, 768) * 0.5;
            player.velocity = 0;
            player.rotation = 0;
            player.canSpawn = true;
        }
        _playersRW[id.x] = player;
    }
}

The 2nd is very similar to the original CollidePlayerAsteroid, which compares the players to all asteroids, and if they overlap, sets ‘can spawn’ back to false:

//checks all players that want to spawn against asteroids, and kills the 'canSpawn' flag
//for any that are too close to asteroids
[numthreads(256, 1, 1)]
void UpdatePlayerSpawning(uint3 id : SV_DispatchThreadID)
{
    if (id.x < _threadCount)
    {
        int playerIdx = id.x / _maxAsteroids;
        int asteroidIdx = id.x - (playerIdx*_maxAsteroids);

        PlayerState player = _playersRW[playerIdx];
        AsteroidState asteroid = _asteroidsRW[asteroidIdx];

        if (player.wantsToSpawn && asteroid.alive) {
            if (length(player.position - asteroid.position) < (asteroid.radius + 10 + 50)) {
                _playersRW[playerIdx].canSpawn = false;
            }
        }
    }
}
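
The index juggling at the top of that kernel is the bit that's easy to get backwards. Here is a tiny C++ sketch of it on its own, assuming (as the kernel implies) that the dispatch covers one thread per player/asteroid pair; `PairIndex` and `decodePair` are made-up names.

```cpp
#include <cassert>

struct PairIndex { int player; int asteroid; };

// Thread 'id' of a (numPlayers * maxAsteroids) wide dispatch -> one player/asteroid pair.
PairIndex decodePair(int id, int maxAsteroids) {
    assert(id >= 0 && maxAsteroids > 0);
    int player = id / maxAsteroids;
    int asteroid = id - player * maxAsteroids;  // same as id % maxAsteroids
    return {player, asteroid};
}
```

So with 100 asteroid slots, thread 250 checks player 2 against asteroid 50. Many threads may clear the same player's flag, but as they all write the same value it doesn't matter who gets there first.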

Then with this cheeky code in UpdateAndDrawPlayer:

        PlayerState player = _playersRW[id.x];
        if (player.wantsToSpawn && player.canSpawn) {
            player.wantsToSpawn = false;
            player.canSpawn = false;
            player.alive = true;
        }

We can detect players that both want to and can spawn, and spawn em!

With some tweaks to the game loop (partly to track current level), and the ability to spawn players, it’s all starting to look pretty cool:

More text and scores and things

Somewhere along the way I felt dirty requesting loads of lines for every render, so added a 2nd buffer and rendering pipeline for requesting characters (just like you can request lines). The basic concept is that, just like with lines, I can submit a list of requests for characters by code, size and position. I then have another procedural pass in DrawLines.shader that uses this function:

        //reads character id, then looks it up in lines buffer which should contain font
        v2f charvert(uint id : SV_VertexID, uint inst : SV_InstanceID)
        {
            int vertsPerCharacter = (LINES_PER_CHARACTER * 2);
            int charIdx = id / vertsPerCharacter;

            int charVertex = id % vertsPerCharacter;
            int charLine = charVertex / 2;
            int charLineVertex = charVertex % 2;

            Character c = characters[charIdx];

            Line l = lines[c.id*LINES_PER_CHARACTER + charLine];
            float2 p = charLineVertex==0 ? l.a : l.b;

            p = p * c.scl + c.pos;

            v2f o;
            o.vertex = toScreen(p);
            return o;
        }

Just like the old lines one, it takes a vertex id, breaks it down into character index, character vertex, character line and line vertex, then fiddles with it and outputs it!
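
For anyone wanting to check the breakdown by hand, here is the same decomposition as a standalone C++ sketch. `CharVertex` and `decodeCharVertex` are hypothetical names, and 16 lines per character is taken from the font layout described earlier.

```cpp
#include <cassert>

constexpr int kLinesPerCharacter = 16;  // mirrors LINES_PER_CHARACTER

struct CharVertex {
    int character;  // index into the character request buffer
    int line;       // which of the character's 16 lines
    int endpoint;   // 0 = start of the line, 1 = end of the line
};

// Splits a flat vertex id into character, line and end point.
CharVertex decodeCharVertex(int vertexId) {
    assert(vertexId >= 0);
    const int vertsPerCharacter = kLinesPerCharacter * 2;
    int charVertex = vertexId % vertsPerCharacter;
    return {vertexId / vertsPerCharacter, charVertex / 2, charVertex % 2};
}
```

Each character therefore costs 32 vertices, drawn or not, so vertex 70 lands on character 2, line 3, start point.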

Now for some numbers. All numbers in asteroids have a 0 on the end, cos it looks cooler. So the new DrawNumberTimes10 function will do just that…

void DrawNumberTimes10(int number, float2 pos, float2 scl, bool pad) {
    float2 spacing = float2(scl.x*1.25, 0);

    int idx = AllocCharacters(8);

    bool onFirstChar = true;
    for (int base = 1000000; base > 0; base /= 10) {
        int val = clamp(number / base,0,9);
        number -= val * base;
        if (val != 0 || !onFirstChar || base==1) {
            //output character
            DrawCharacter(idx++, '0'+val, pos, scl); pos += spacing;
            onFirstChar = false;
        }
        else {
            //output space and increment pos if padding requested
            DrawCharacter(idx++, ' ', pos, scl);
            if(pad)
                pos += spacing;
        }
    }

    //final 0
    DrawCharacter(idx++, '0', pos, scl); pos += spacing;
}
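
The digit logic is easier to test away from the GPU. This C++ sketch runs the same loop but builds a string instead of issuing draw requests; `numberTimes10` is a made-up name, and when padding is off the skipped leading zeros are simply left out of the string (on the GPU they are drawn as blanks without advancing the position).

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Returns the text a DrawNumberTimes10-style routine would show for 'number'.
std::string numberTimes10(int number, bool pad) {
    assert(number >= 0);
    std::string out;
    bool onFirstChar = true;
    for (int base = 1000000; base > 0; base /= 10) {
        int val = std::min(std::max(number / base, 0), 9);  // clamp to one digit
        number -= val * base;
        if (val != 0 || !onFirstChar || base == 1) {
            out += static_cast<char>('0' + val);
            onFirstChar = false;
        } else if (pad) {
            out += ' ';  // leading zero shown as a blank column
        }
    }
    out += '0';  // the trailing zero that is always drawn
    return out;
}
```

A score of 123 comes out as `1230`, right-aligned in 8 columns when padded, and a score of 0 still shows `00` thanks to the `base == 1` check.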

Now by adding a score value, I can draw the nice numbers on screen:

    DrawNumberTimes10(_playersRW[0].score, float2(100, 700), float2(18, 20), true);
    DrawNumberTimes10(0, float2(800, 700), float2(18, 20), false);
    DrawNumberTimes10(0, float2(500, 680), float2(18, 20)*0.75, false);

Not sure what the other 2 numbers are supposed to be yet. Presumably credits and player 2 score. Oh well who cares – it looks cool:

Summary

Well, after that brain dump of crazy I have quite a nice thing goin on. It’s not asteroids yet, but any game logic is completely separate from CPU. Compute shaders are responsible for init, game loop and all general physics and spawning etc. Next time I’ll probably get some effects and flying saucers in or something…

Also – code:
https://github.com/chriscummings100/shaderoids

Shaderoids 1

Ok, time for something really productive. I have decided to put my pretty solid knowledge of GPU coding to the best possible use, and produce a perfect (as possible) clone of the 1979 game, Asteroids… entirely written on GPU.

Why you ask? Frankly you should be ashamed for asking that. Also, did you not notice the title of this web site?

First up, some rules:

  • All game code, graphics, physics and rendering entirely driven by GPU
  • CPU may only:
    • Read hardware (such as inputs), and pass to GPU
    • Fire off a fixed set of dispatches
    • Read requests from GPU to write to hardware (such as audio)
  • Pure line rendering

fyi, since writing, I’ve plonked all the code up on GitHub.

Basic architecture / flow

The idea is to make the CPU side of things as minimal as possible. With that in mind, my basic flow is going to be a single MonoBehaviour in C# which:

  • Reads a set of input states (probably A-Z, 0-9 and a few others), and writes them into a compute buffer
  • Executes a series of dispatches to fire off various GPU compute kernels
  • Executes DrawProceduralIndirect to draw a load of lines generated by compute. (high bar: line renderer in compute!)
  • Read set of ‘audio requests’ written to buffer by GPU, and use to trigger some predefined sounds

So let’s start with 3 files:

  • Game.cs for our monobehaviour
  • Asteroids.compute to contain all the compute jobs
  • DrawLines.shader for a simple shader that draws a set of lines

using UnityEngine;

public class Game : MonoBehaviour 
{
    public struct Line {
        public Vector3 a;
        public Vector3 b;
    }

    public const int LINE_BUFFER_SIZE = 10000;

    ComputeShader _asteroidsShader;
    Shader _drawLinesShader;
    Material _drawLinesMaterial;

    ComputeBuffer _linesBuffer;


    public void Awake() {
        _asteroidsShader = Resources.Load<ComputeShader>("asteroids");
        _drawLinesShader = Shader.Find("DrawLines");
        _drawLinesMaterial = new Material(_drawLinesShader);
        _drawLinesMaterial.hideFlags = HideFlags.HideAndDontSave;

        _linesBuffer = ComputeBufferUtils.Alloc<Line>(LINE_BUFFER_SIZE);
    }

    public void OnPostRender() {

        Line[] testLines = new Line[3];
        testLines[0] = new Line { a = new Vector3(0, 0, 0), b = new Vector3(1, 0, 0) };
        testLines[1] = new Line { a = new Vector3(0, 0, 0), b = new Vector3(1, 1, 0) };
        testLines[2] = new Line { a = new Vector3(0, 0, 0), b = new Vector3(0, 1, 0) };
        _linesBuffer.SetData(testLines);

        _drawLinesMaterial.SetBuffer("lines", _linesBuffer);
        _drawLinesMaterial.SetPass(0);
        Graphics.DrawProcedural(MeshTopology.Lines, testLines.Length*2);
    }
}    

This code starts by loading up some shaders and building a material, then sets up a compute buffer that’ll contain a set of Line structures to be rendered. ComputeBufferUtils is a handy class for allocating structured buffers that I’ll upload at some point.

The OnPostRender function simply plonks 3 lines into the lines buffer, passes it to the draw line material and finally calls DrawProcedural, requesting line topology, and specifying 2 vertices per line.

DrawProcedural is a tasty beast. It allows us to write a vertex shader that takes just an integer index. It is then up to the shader to take this index and convert it into useful vertex positions, which is done using data in the provided compute buffer:

Shader "DrawLines"
{
    Properties
    {
    }
    SubShader
    {
        Tags { "RenderType"="Opaque" }
        LOD 100

        Pass
        {
            CGPROGRAM
            #pragma vertex vert
            #pragma fragment frag
            
            struct v2f
            {
                float4 vertex : SV_POSITION;
            };

            struct Line {
                float3 a;
                float3 b;
            };

            StructuredBuffer<Line> lines;
            
            
            v2f vert (uint id : SV_VertexID, uint inst : SV_InstanceID)
            {
                Line l = lines[id / 2];
                float3 p = (id & 1) ? l.a : l.b;
                
                v2f o;
                o.vertex = float4(p, 1);
                return o;
            }
            
            fixed4 frag (v2f i) : SV_Target
            {
                return 1;
            }
            ENDCG
        }
    }
}

This cheeky little shader uses the vertex id to access the lines buffer and work out whether the vertex represents the start or the end of the line.

Result…

[Image: firstlines]

Woohoo! Nearly there….?

Compute and indirect drawing

Right, time to write that compute shader I loaded earlier!

Let’s start by adding a new structure to the game code, and allocating a couple of extra buffers:

    public struct Counters {
        public int numLines;
    }

    _countersBuffer = ComputeBufferUtils.Alloc<Counters>(1);
    _dispatchBuffer = ComputeBufferUtils.Alloc<uint>(8, ComputeBufferType.IndirectArguments);

The compute shader (Asteroids.compute) mimics the CPU side structure and buffers:

//structures
struct Line {
    float3 a;
    float3 b;
};
struct Counters {
    int numLines;
};

//buffers
RWStructuredBuffer<Line> _linesRW;
RWStructuredBuffer<Counters> _counters;
RWStructuredBuffer<uint> _dispatch;

//general use uniform to limit dispatch thread counts
int _threadCount;


Now for some compute kernels. The first, I’ll dispatch once at the start of each frame to clear the line counter:

[numthreads(1,1,1)]
void ClearLines(uint3 id : SV_DispatchThreadID)
{
    _counters[0].numLines = 0;
}

This is designed to be executed as a single thread on the GPU, and just does one tiny bit of ‘setup’ work. This pattern is very common in GPU code – tiny compute jobs to pass around / clear / tweak some data, then large parallel dispatches to do actual work.

The next kernel is a more classic compute job, and designed to run in a highly parallel manner.

//creates lines based on dispatch thread
[numthreads(256, 1, 1)]
void GenerateTestLines(uint3 id : SV_DispatchThreadID)
{
    if (id.x < _threadCount)
    {
        //allocate space
        int lineIdx;
        InterlockedAdd(_counters[0].numLines, 1, lineIdx);

        //build line
        Line l;
        float ang = radians(id.x);
        float3 dir = float3(sin(ang), cos(ang), 0);
        l.a = dir * 0.1f;
        l.b = dir * 0.75f;
        _linesRW[lineIdx] = l;
    }
}

This uses InterlockedAdd to ‘allocate’ a slot in the lines buffer by atomically incrementing the numLines counter. A line is then generated based on the thread index, and written out to the correct slot. In future I’ll use this pattern for outputting all lines that need rendering.

Next, another simple setup kernel that takes the numLines counter and converts it into a set of arguments that can be passed to DrawProceduralIndirect:


//fills out indirect dispatch args 
[numthreads(1, 1, 1)]
void LineDispatchArgs(uint3 id : SV_DispatchThreadID)
{
    _dispatch[0] = _counters[0].numLines*2; //v count per inst (2 verts per line)
    _dispatch[1] = 1; //1 instance
    _dispatch[2] = 0; //verts start at 0
    _dispatch[3] = 0; //instances start at 0
}

This indirect technique is used with both drawing and dispatching, and allows the use of a compute job to set up the work for a following compute job or draw call. In this case I’m setting up the arguments required for DrawProceduralIndirect.

That little ‘*2’ in LineDispatchArgs took a while to spot! Interesting fun – if you’re on NVidia and force Gen lines to only build 64 lines, and remove the ‘*2’ you see it alternate between 2 groups of 32 lines. That’s cos you’re only rendering half the lines you generate. As an NVidia GPU works in warps of 32 threads you randomly get the 1st half or the 2nd half. Hardware showing its true form!

The Game.cs OnPostRender now looks roughly like this:

void OnPostRender()
{
        _asteroidsShader.SetBuffer(kernelClearLines, "_counters", _countersBuffer);
        DispatchOne(kernelClearLines);

        _asteroidsShader.SetBuffer(kernelGenerateTestLines, "_counters", _countersBuffer);
        _asteroidsShader.SetBuffer(kernelGenerateTestLines, "_linesRW", _linesBuffer);
        DispatchItems(kernelGenerateTestLines, 360);

        _asteroidsShader.SetBuffer(kernelLineDispatchArgs, "_counters", _countersBuffer);
        _asteroidsShader.SetBuffer(kernelLineDispatchArgs, "_dispatch", _dispatchBuffer);
        DispatchOne(kernelLineDispatchArgs);

        _drawLinesMaterial.SetBuffer("lines", _linesBuffer);
        _drawLinesMaterial.SetPass(0);
        Graphics.DrawProceduralIndirect(MeshTopology.Lines, _dispatchBuffer);
}

Where DispatchOne and DispatchItems are a couple of handy helpers for dispatching compute kernels:

    void DispatchOne(int kernel) {
        _asteroidsShader.Dispatch(kernel, 1, 1, 1);
    }
    void DispatchItems(int kernel, int items) {
        uint x,y,z;
        _asteroidsShader.GetKernelThreadGroupSizes(kernel, out x, out y, out z);
        _asteroidsShader.SetInt("_threadCount", items);
        _asteroidsShader.Dispatch(kernel, (items + (int)x - 1) / (int)x, 1, 1);
    }

Anyhoo, making it 2 degrees per line + rendering at 1024*768:

[Image: linecircleincompute]

(oh yeah – I changed the clear colour to black too!).

Next, I think I need to correct for aspect ratio and stuff. Gonna operate at fixed res of 1024/768 then upscale / downscale points. Hmmm… spent a while trying to make this work with any resolution but got bored. Instead I’ll just convert from (1024,768) to screen space:

p = 2 * (p - float2(1024,768)*0.5) / float2(1024,768);

Also changed lines to be float2 whilst there!
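
As a quick check of that conversion, here is the same formula as a C++ sketch with the two components written out. The `Float2` struct and the `kWidth`/`kHeight` constants are mine; the maths is just the line above.

```cpp
struct Float2 { float x, y; };

constexpr float kWidth = 1024.0f;
constexpr float kHeight = 768.0f;

// Maps game space, (0,0)..(1024,768), onto clip space, (-1,-1)..(1,1).
Float2 toScreen(Float2 p) {
    return { 2.0f * (p.x - kWidth * 0.5f) / kWidth,
             2.0f * (p.y - kHeight * 0.5f) / kHeight };
}
```

The centre of the play field, (512, 384), lands on (0, 0), and the corners land on the corners of clip space.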

A bit of game

It’s getting a bit big to post all the code up here, but I’ll eventually share it all on GitHub!

Let’s get inputs in. To stay true to the goal, I want to avoid any understanding of ‘gameplay’ on CPU, so I’ll just push a big list of keys into game. CPU side, I’ll setup a KeyState structure and read a load of buttons into it each frame:


    public struct KeyState {
        public bool down;
        public bool pressed;
        public bool released;
    }

        for(int i = 0; i < 26; i++) {
            _cpuKeyStates['a' + i] = new KeyState {
                down = Input.GetKey(KeyCode.A + i),
                pressed = Input.GetKeyDown(KeyCode.A + i),
                released = Input.GetKeyUp(KeyCode.A + i),
            };
        }

On top of ‘A’-‘Z’, I also added some handy keys like ‘0’-‘9’, ‘ ‘ and a few other bits.

Now to define a new PlayerState (which also comes with a new _playerState buffer).

    public struct PlayerState {
        public Vector2 position;
        public float rotation;
        public Vector2 velocity;
        public bool alive;
    }

For the moment, I’ll init this CPU side to get going

        PlayerState[] initPlayer = new PlayerState[1];
        initPlayer[0].position = new Vector2(1024f, 768f) * 0.5f;
        initPlayer[0].alive = true;
        _playerState.SetData(initPlayer);

The long term goal is to get a proper game loop going, so I don’t feel too dirty about a bit of CPU setup for now!

So compute side I’ve set up corresponding structures and buffers. Also added these few helper functions for drawing lines and manipulating vectors:

int AllocLines(int count) {
    int lineIdx;
    InterlockedAdd(_counters[0].numLines, count, lineIdx);
    return lineIdx;
}
void AddLine(int idx, float2 a, float2 b) {
    _linesRW[idx].a = a;
    _linesRW[idx].b = b;
}

float2 mulpoint(float3x3 trans, float2 p) {
    return mul(trans, float3(p, 1)).xy;
}
float2 mulvec(float3x3 trans, float2 p) {
    return mul(trans, float3(p, 0)).xy;
}
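
The difference between the two helpers is just the third component fed into the matrix. Here it is spelled out in a C++ sketch, assuming a row-major 3x3 with the translation in the last column (which matches how the ship's transform gets built); `Float2` and `Float3x3` are stand-in types.

```cpp
struct Float2 { float x, y; };
struct Float3x3 { float m[3][3]; };  // row-major, m[row][col]

// Transform a position: w = 1, so the translation column is applied.
Float2 mulpoint(const Float3x3& t, Float2 p) {
    return { t.m[0][0] * p.x + t.m[0][1] * p.y + t.m[0][2],
             t.m[1][0] * p.x + t.m[1][1] * p.y + t.m[1][2] };
}

// Transform a direction: w = 0, so the translation column is ignored.
Float2 mulvec(const Float3x3& t, Float2 v) {
    return { t.m[0][0] * v.x + t.m[0][1] * v.y,
             t.m[1][0] * v.x + t.m[1][1] * v.y };
}
```

With a matrix that scales by 2 and translates by (10, 20), the point (1, 1) ends up at (12, 22) while the vector (1, 1) only gets scaled to (2, 2).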

For now it’s 1 player, but I want to plan for a million, so I’ll work as though there’s more than 1. The basic compute kernel for player update looks like this:

[numthreads(256, 1, 1)]
void UpdateAndDrawPlayer(uint3 id : SV_DispatchThreadID)
{
    if (id.x < _threadCount)
    {
        PlayerState player = _playersRW[id.x];
        if(player.alive) {
            //do stuff
        }
        _playersRW[id.x] = player;

    }
}

The core update code splits into 2 sections. First, standard asteroid style inputs:

            float rot = 0;
            float thrust = 0;
            float rotPerSecond = 1;
            float thrustPerSecond = 100;

            if (_keyStates['A'].down) {
                rot += rotPerSecond * _timeStep;
            }
            if (_keyStates['D'].down) {
                rot -= rotPerSecond * _timeStep;
            }
            if (_keyStates['W'].down) {
                thrust += thrustPerSecond * _timeStep;
            }
            player.rotation += rot;

            float2 worldy = float2(sin(player.rotation), cos(player.rotation));
            float2 worldx = float2(-worldy.y, worldy.x);

            player.velocity += worldy * thrust;
            player.position += player.velocity * _timeStep;

Then the second half calculates a transform matrix and generates some hard coded lines:

            worldx *= 50;
            worldy *= 50;
            float3x3 trans = {
                worldx.x, worldy.x, player.position.x,
                worldx.y, worldy.y, player.position.y,
                0, 0, 1
            };

            int lineIdx = AllocLines(5);
            
            float2 leftcorner = mulpoint(trans, float2(-0.7, -1));
            float2 rightcorner = mulpoint(trans, float2(0.7, -1));
            float2 tip = mulpoint(trans, float2(0, 1));
            float2 leftback = mulpoint(trans, float2(-0.2, -0.7f));
            float2 rightback = mulpoint(trans, float2(0.2, -0.7f));

            AddLine(lineIdx++, leftcorner, tip);
            AddLine(lineIdx++, rightcorner, tip);
            AddLine(lineIdx++, leftcorner, leftback);
            AddLine(lineIdx++, rightcorner, rightback);
            AddLine(lineIdx++, leftback, rightback);


Those vectors look roughly right I think. Side by side with the video:

[Image: shipsidebyside]

After a bit of debugging, I’ve found the bool types in KeyState appeared to be causing problems. Seems odd, but I can fix it with ints and don’t fancy digging too deep right now. Once that’s sorted, we have a game sort of…

Quickly add screen wrapping…

            //wrap player (note: better version should handle overshoot amount)
            player.position = player.position >= 0 ? player.position : float2(1024, 768);
            player.position = player.position <= float2(1024,768) ? player.position  : 0;

This quick and dirty version could be better by handling the fact that if they overshoot by k pixels, they should come back k pixels from the other side. Pain in the ass though!
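
For reference, the overshoot-preserving version isn't too bad when written per axis. This is a C++ sketch of one way to do it, not what the game actually uses; `wrap` is a hypothetical helper that would be applied to x with 1024 and to y with 768.

```cpp
#include <cassert>
#include <cmath>

// Wraps v back into the 0..size range, keeping the overshoot:
// size + k comes back as k, and -k comes back as size - k.
float wrap(float v, float size) {
    assert(size > 0.0f);
    float r = std::fmod(v, size);  // result has the sign of v
    return r < 0.0f ? r + size : r;
}
```

So a ship at x = 1034 reappears at x = 10, and one at x = -10 reappears at x = 1014.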

The thrust in the video looks like it’s just a flashing triangle, maybe with a bit of randomness thrown in. To achieve this, I’ll pass through the frame number, which is fed into a Wang hash based random number generator. Then some cheeky code to add the extra triangle when thrusting:


            int thrustframe = (_frame / 4);
            if (thrust > 0 && (thrustframe &1)) {
                lineIdx = AllocLines(2);
                float2 thrustback = mulpoint(trans, float2(0.0f, -1.5f-wang_rand(thrustframe)*0.15f));
                AddLine(lineIdx++, leftback, thrustback);
                AddLine(lineIdx++, rightback, thrustback);
            }
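
The `wang_rand` helper isn't shown here, so this C++ sketch is only a guess at its shape: the hash is the commonly quoted 32-bit Wang hash, and the conversion to a 0 to 1 float is my own assumption. `wangHash` and `wangRand` are illustrative names.

```cpp
#include <cstdint>

// Thomas Wang's 32-bit integer hash, as commonly quoted for GPU use.
std::uint32_t wangHash(std::uint32_t seed) {
    seed = (seed ^ 61u) ^ (seed >> 16);
    seed *= 9u;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2du;
    seed = seed ^ (seed >> 15);
    return seed;
}

// Hash to a float in [0, 1). Only the top 24 bits are used, so every
// value is exactly representable in a float and 1.0 is never reached.
float wangRand(std::uint32_t seed) {
    return static_cast<float>(wangHash(seed) >> 8) * (1.0f / 16777216.0f);
}
```

Because the result depends only on the seed, feeding in `_frame / 4` gives a flame length that holds steady for 4 frames and then jumps, with no state to store between frames.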

Asteroids

Just like with players, I’ll define a structure for the asteroid state, with a bit of extra data in:

    public struct AsteroidState {
        public Vector2 position;
        public float rotation;
        public Vector2 velocity;
        public int alive;
        public float radius;
        public int level;
    }

As before, I create a buffer for this, and init a randomly placed, sensible set of asteroids:

        AsteroidState[] initAsteroids = new AsteroidState[MAX_ASTEROIDS];
        for(int i = 0; i < START_ASTEROIDS; i++) {
            while(true) {
                initAsteroids[i].position = new Vector2(Random.Range(0f, 1024f), Random.Range(0f, 768f));
                if ((initAsteroids[i].position - initPlayer[0].position).magnitude > 200f)
                    break;
            }
            initAsteroids[i].alive = 1;
            initAsteroids[i].radius = 30;
            initAsteroids[i].rotation = Random.Range(-Mathf.PI, Mathf.PI);
            initAsteroids[i].velocity = Random.insideUnitCircle * 50f;
            initAsteroids[i].level = 0;
        }
        _asteroidState.SetData(initAsteroids);

A note at this point, after a few bugs I’ve swizzled a few bits around:

  • I’ve split update and render out into the unity Update and OnPostRender functions, as reading input seemed to break a bit when called from render
  • I’ve added a helpful BindEverything function that just takes a compute kernel and binds all my buffers and variables to it. This would be terrible behaviour in production, but this isn’t production.
  • I also changed ClearLines to BeginFrame, cos it’s starting to look like it’ll do more than clear out lines

The BindEverything code:

    void BindEverything(int kernel) {
        _asteroidsShader.SetInt("_maxPlayers", MAX_PLAYERS);
        _asteroidsShader.SetInt("_maxAsteroids", MAX_ASTEROIDS);
        _asteroidsShader.SetFloat("_time", Time.time);
        _asteroidsShader.SetFloat("_timeStep", Time.deltaTime);
        _asteroidsShader.SetInt("_frame", Time.frameCount);

        _asteroidsShader.SetBuffer(kernel, "_dispatch", _dispatchBuffer);
        _asteroidsShader.SetBuffer(kernel, "_counters", _countersBuffer);
        _asteroidsShader.SetBuffer(kernel, "_linesRW", _linesBuffer);
        _asteroidsShader.SetBuffer(kernel, "_keyStates", _keyStates);
        _asteroidsShader.SetBuffer(kernel, "_playersRW", _playerState);
        _asteroidsShader.SetBuffer(kernel, "_asteroidsRW", _asteroidState);
    }

With the asteroids added, the dispatch now looks like this:

    private void Update() {
        //all my inputs are read into _cpuKeyStates here
        _keyStates.SetData(_cpuKeyStates);

        DispatchOne(kernelBeginFrame);
        DispatchItems(kernelUpdateAndDrawPlayer, MAX_PLAYERS);
        DispatchItems(kernelUpdateAndDrawAsteroid, MAX_ASTEROIDS);
        DispatchOne(kernelLineDispatchArgs);
    }

And render is still very simple:

    public void OnPostRender() {

        _drawLinesMaterial.SetBuffer("lines", _linesBuffer);
        _drawLinesMaterial.SetPass(0);
        Graphics.DrawProceduralIndirect(MeshTopology.Lines, _dispatchBuffer);
    }

So, the asteroids update kernel. It’s pretty similar to player update in functionality. The first half updates its position, and the second renders it.

//updates asteroid movement and outputs draw request
[numthreads(256, 1, 1)]
void UpdateAndDrawAsteroid(uint3 id : SV_DispatchThreadID)
{
    if (id.x < _threadCount)
    {
        AsteroidState asteroid = _asteroidsRW[id.x];
        if (asteroid.alive) {
            asteroid.position += asteroid.velocity * _timeStep;
            asteroid.position = asteroid.position >= 0 ? asteroid.position : float2(1024, 768);
            asteroid.position = asteroid.position <= float2(1024, 768) ? asteroid.position : 0;

            float scl = asteroid.radius;

            float2 worldy = float2(sin(asteroid.rotation), cos(asteroid.rotation));
            float2 worldx = float2(-worldy.y, worldy.x);
            worldx *= scl;
            worldy *= scl;
            float3x3 trans = {
                worldx.x, worldy.x, asteroid.position.x,
                worldx.y, worldy.y, asteroid.position.y,
                0, 0, 1
            };

            //alloc edges
            const int NUM_EDGES = 9;
            int lineIdx = AllocLines(NUM_EDGES);

            //build first point then start iterating
            float randscl = 0.75f;
            float2 first;
            {
                int i = 0;
                float ang = 0;
                float2 pos = float2(sin(ang), cos(ang));
                pos += randscl * float2(wang_rand(id.x*NUM_EDGES + i), wang_rand(id.x*NUM_EDGES * 2 + i));
                first = mulpoint(trans, pos);
            }
            float2 prev = first;
            for (int i = 1; i < NUM_EDGES; i++) {

                //offset each point using random numbers seeded from the asteroid index
                float ang = (i*3.1415927f*2.0f) / NUM_EDGES;
                float2 pos = float2(sin(ang), cos(ang));
                pos += randscl * float2(wang_rand(id.x*NUM_EDGES + i), wang_rand(id.x*NUM_EDGES * 2 + i));

                //add new line
                float2 curr = mulpoint(trans, pos);
                AddLine(lineIdx++, prev, curr); 
                prev = curr;
            }

            //add final line to join previous point to first point
            AddLine(lineIdx++, prev, first);

        }
        _asteroidsRW[id.x] = asteroid;
    }
}

I spent quite a while before I was happy with the look of the asteroids, and I’m still not fully satisfied. The above code basically generates a 9 edge circle, then randomly adjusts the position of each vertex. It also takes account of the asteroid’s radius, in anticipation of varying sizes.
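
To show the idea in isolation, here is a small CPU-side Python sketch of the outline generation. It is not the shader code: `asteroid_outline` is a made-up name, Python's `random.Random` seeded with the asteroid index stands in for the hash-based `wang_rand`, and rotation and position are left out. The point it illustrates is that the outline is rebuilt every frame, yet stays the same shape because the random offsets depend only on the asteroid's index.

```python
import math
import random

def asteroid_outline(asteroid_id, radius, num_edges=9, rand_scale=0.75):
    """Return a closed loop of line segments for one asteroid."""
    rng = random.Random(asteroid_id)  # stand-in for the id-seeded hash
    points = []
    for i in range(num_edges):
        ang = i * 2.0 * math.pi / num_edges
        # point on the unit circle, nudged by a random offset
        x = math.sin(ang) + rand_scale * rng.random()
        y = math.cos(ang) + rand_scale * rng.random()
        points.append((x * radius, y * radius))
    # join each point to the next, wrapping round to close the loop
    return [(points[i], points[(i + 1) % num_edges]) for i in range(num_edges)]
```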

[Image: first asteroids]

Bullets

Bullets are the first slightly cheeky one, as I need to be able to spawn them, which means allocating/freeing of some form. This kind of model is tricky on GPU, so I’m going to go for the simple option of a large circular buffer of bullets. If the buffer overflows it’ll start recycling bullets, but I’ll just make it big enough not to!

Aside from a lifetime, the bullet state is very simple:

    public struct BulletState {
        public Vector2 position;
        public Vector2 velocity;
        public float lifetime;
    }

And the update equally so:

//updates bullet movement and outputs draw request
[numthreads(256, 1, 1)]
void UpdateAndDrawBullet(uint3 id : SV_DispatchThreadID)
{
    if (id.x < _threadCount)
    {
        BulletState bullet = _bulletsRW[id.x];
        if (bullet.lifetime > 0) {

            bullet.position += bullet.velocity * _timeStep;
            if (any(bullet.position < 0) || any(bullet.position > float2(1024, 768))) {
                //kill bullets that leave the screen, writing the dead state back before bailing out
                bullet.lifetime = -1;
                _bulletsRW[id.x] = bullet;
                return;
            }
            bullet.lifetime -= _timeStep;

            float scl = 2;
            float3x3 trans = {
                scl, 0, bullet.position.x,
                0, scl, bullet.position.y, 
                0, 0, 1
            };

            //alloc edges
            const int NUM_EDGES = 6;
            int lineIdx = AllocLines(NUM_EDGES);

            //build first point then start iterating
            float2 first;
            {
                int i = 0;
                float ang = 0;
                float2 pos = float2(sin(ang), cos(ang));
                first = mulpoint(trans, pos);
            }
            float2 prev = first;
            for (int i = 1; i < NUM_EDGES; i++) {

                float ang = (i*3.1415927f*2.0f) / NUM_EDGES;
                float2 pos = float2(sin(ang), cos(ang));
                float2 curr = mulpoint(trans, pos);
                AddLine(lineIdx++, prev, curr);
                prev = curr;
            }

            //add final line to join previous point to first point
            AddLine(lineIdx++, prev, first);

        }
        _bulletsRW[id.x] = bullet;
    }
}

The main difference is that unlike the ship and asteroid, bullets have a lifetime and automatically die when it runs out. I couldn’t figure out whether bullets wrap in the original game, but it felt better when they didn’t, so they die when going off screen for the moment.

Now to implement the spawning. I’ll start with adding a nextBullet counter:

    public struct Counters {
        public int numLines;
        public int nextBullet;
    }

Now for the code in the player update to fire one:

            if (_keyStates[' '].pressed) {
                int nextBullet;
                InterlockedAdd(_counters[0].nextBullet, 1, nextBullet);
                BulletState b;
                b.position = player.position;
                b.velocity = worldy * 1000;
                b.lifetime = 3;
                _bulletsRW[nextBullet%_maxBullets] = b;
            }

When space is pressed, I increment the bullet counter, then a new bullet is created and written into the new slot. Note the ‘mod’ allows the nextBullet counter to get ever higher but still be used as an index into the _bulletsRW buffer.
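
The circular buffer idea is easy to see in a CPU-side Python sketch. `BulletPool` and `spawn` are names invented for this example, and a plain increment stands in for `InterlockedAdd` (which hands back the counter's value from before the add).

```python
class BulletPool:
    """Fixed-size ring of bullet slots with an ever-increasing counter."""

    def __init__(self, max_bullets):
        self.slots = [None] * max_bullets
        self.next_bullet = 0

    def spawn(self, bullet):
        # take the old counter value, then bump it (like InterlockedAdd)
        index = self.next_bullet
        self.next_bullet += 1
        # mod maps the growing counter back onto a valid slot
        slot = index % len(self.slots)
        self.slots[slot] = bullet
        return slot
```

With 4 slots, the fifth bullet lands back in slot 0 and overwrites whatever was there, which is the recycling behaviour described earlier.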

Collision

Ok! We have stuff. Now to smash it. If I hit millions of players and asteroids, collision will have to start using optimisation structures, but GPUs are fast and like nothing more than doing 1000s of identical calculations. So collision will be a classic brute force ‘test everything against everything else’ situation.

Let’s start with player vs asteroid collision:

[numthreads(256, 1, 1)]  
void CollidePlayerAsteroid(uint3 id : SV_DispatchThreadID) 
{
    if (id.x < _threadCount)
    {
        int playerIdx = id.x / _maxAsteroids;
        int asteroidIdx = id.x - (playerIdx*_maxAsteroids);

        PlayerState player = _playersRW[playerIdx];
        AsteroidState asteroid = _asteroidsRW[asteroidIdx];

        if (player.alive && asteroid.alive) {
            if (length(player.position - asteroid.position) < (asteroid.radius+10)) {
                _playersRW[playerIdx].alive = 0;
                _asteroidsRW[asteroidIdx].alive = 0;
            }
        }
    }
}

The actual collision code here is a basic circle test. If the player is closer to the asteroid’s centre than its radius plus 10 pixels, they die (and for now, so does the asteroid). The sneaky bit is at the top, where the player and asteroid indices are calculated. This makes more sense when you look at the dispatch CPU side:

        DispatchItems(kernelCollidePlayerAsteroid, MAX_PLAYERS * MAX_ASTEROIDS);

In effect, I dispatch 1 thread for every combination of player and asteroid. The funky calculations in CollidePlayerAsteroid simply decompose the thread id into player index and asteroid index, just like you might convert a pixel index to a pixel coordinate in an image.
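
Here is the same index trick as a CPU-side Python sketch, alongside the circle test. The function names `decompose` and `circles_overlap` are made up for the example.

```python
import math

def decompose(thread_id, max_asteroids):
    """Split a flat thread id into (player index, asteroid index)."""
    player_idx = thread_id // max_asteroids
    asteroid_idx = thread_id - player_idx * max_asteroids
    return player_idx, asteroid_idx

def circles_overlap(pos_a, pos_b, combined_radius):
    """True if the two centres are closer than the combined radius."""
    dx = pos_a[0] - pos_b[0]
    dy = pos_a[1] - pos_b[1]
    return math.hypot(dx, dy) < combined_radius
```

Running `decompose` over every id from 0 to players * asteroids - 1 visits each (player, asteroid) pair exactly once, which is why a single 1D dispatch covers the whole grid of tests.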

Shockingly enough, for now the bullet vs asteroid collision is pretty similar:

[numthreads(256, 1, 1)]
void CollideBulletAsteroid(uint3 id : SV_DispatchThreadID)
{
    if (id.x < _threadCount)
    {
        int bulletIdx = id.x / _maxAsteroids;
        int asteroidIdx = id.x - (bulletIdx*_maxAsteroids);

        BulletState bullet = _bulletsRW[bulletIdx];
        AsteroidState asteroid = _asteroidsRW[asteroidIdx];

        if (bullet.lifetime > 0 && asteroid.alive) {
            if (length(bullet.position - asteroid.position) < (asteroid.radius + 2)) {
                _bulletsRW[bulletIdx].lifetime = -1;
                _asteroidsRW[asteroidIdx].alive = 0;
            }
        }
    }
}

There’s one ingredient before we can declare real progress though – asteroids smash! To achieve smashing asteroids I’ll take a similar approach to the bullets. The asteroids buffer will be large enough to contain all the asteroids a level will ever need (big, medium and small), and I’ll just use a counter to allocate into it. Once that’s added, I can create a splitting function:

void SplitAsteroid(int idx) {
    int nextIndex;
    InterlockedAdd(_counters[0].nextAsteroid, 2, nextIndex);

    AsteroidState asteroid = _asteroidsRW[idx];
    if (asteroid.level < 2) {

        float childSpeed = 50; 

        AsteroidState child;
        child.position = asteroid.position;
        child.velocity = asteroid.velocity + (float2(wang_rand(nextIndex), wang_rand(nextIndex * 2)) * 2 - 1) * childSpeed;
        child.alive = 1;
        child.radius = asteroid.radius * 0.5;
        child.rotation = (wang_rand(nextIndex * 3) * 2 - 1) * 3.1415927f;
        child.level = asteroid.level + 1;
        _asteroidsRW[nextIndex++] = child;

        child.position = asteroid.position;
        child.velocity = asteroid.velocity + (float2(wang_rand(nextIndex), wang_rand(nextIndex * 2)) * 2 - 1) * childSpeed;
        child.alive = 1;
        child.radius = asteroid.radius * 0.5;
        child.rotation = (wang_rand(nextIndex * 3) * 2 - 1) * 3.1415927f;
        child.level = asteroid.level + 1;
        _asteroidsRW[nextIndex++] = child;
    }


    _asteroidsRW[idx].alive = 0;
}

This code allocates 2 new slots by incrementing nextAsteroid by 2. Assuming the asteroid level < 2 (i.e. it isn’t too small), I then proceed to generate 2 new smaller asteroids. These start at the position of the larger one, but take on a randomly adjusted velocity. Finally, the asteroid is killed off. I also just noticed I probably shouldn’t allocate child asteroids unless I intend to use them, but it doesn’t matter.
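
For what it’s worth, the ‘only allocate when needed’ variant is a small change. Here is a CPU-side Python sketch of it, not the shader code: asteroids are plain dicts, `split_asteroid` and `counters["next_asteroid"]` are illustrative names, and the random velocity and rotation are left out to keep it short.

```python
def split_asteroid(asteroids, counters, idx, max_level=2):
    """Kill asteroid idx, spawning two half-size children if it is big enough."""
    parent = asteroids[idx]
    if parent["level"] < max_level:
        # reserve two slots only when children will actually be created
        first = counters["next_asteroid"]
        counters["next_asteroid"] += 2
        for slot in (first, first + 1):
            asteroids[slot] = {
                "level": parent["level"] + 1,
                "radius": parent["radius"] * 0.5,
                "alive": 1,
            }
    parent["alive"] = 0
```

The benefit is just that the smallest asteroids no longer burn two unused slots each time one is shot, so the buffer can be sized a bit tighter.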

A quick tweak to the bullet code:

        if (bullet.lifetime > 0 && asteroid.alive) {
            if (length(bullet.position - asteroid.position) < (asteroid.radius + 2)) {
                _bulletsRW[bulletIdx].lifetime = -1;
                SplitAsteroid(asteroidIdx);
            }
        }

And it’s asteroid time:

Yay! Ok – I have to get on a plane. Hopefully by the time I’ve landed there’ll be sound effects or something…

For UI, sounds and game loops, head to the next post here: Shaderoids 2