Dealing with Dynamic Data (almost) without Reflection

Often you need to manipulate a data structure without knowing it’s shape at compile time i.e. when authoring classes that expose generic interfaces or dealing with data described by runtime metadata (database schema, JSON files etc.).

For example, consider a method that is designed to generate a hashcode from the public properties of an arbitrary type. Firstly, let’s look at a model and a typical hashcode implementation:

    class SimpleModel
    {
        public string Surname { get; set; }

        public string Forename { get; set; }

        public int Age { get; set; }

        public string Description { get; set; }

        public override bool Equals(object obj)
        {
            if (obj is SimpleModel comparisonObject)
            {
                return Equals(comparisonObject);
            }
            return false;
        }

        protected bool Equals(SimpleModel other)
        {
            return string.Equals(Surname, other.Surname) && string.Equals(Forename, other.Forename) && Age == other.Age;
        }

        public override int GetHashCode()
        {
            unchecked
            {
                var hashCode = (Surname != null ? Surname.GetHashCode() : 0);
                hashCode = (hashCode * 397) ^ (Forename != null ? Forename.GetHashCode() : 0);
                hashCode = (hashCode * 397) ^ (Description != null ? Description.GetHashCode() : 0);
                hashCode = (hashCode * 397) ^ Age;
                return hashCode;
            }
        }
    }

If we want to use a similar hashcode approach but against a generic type we might use reflection to implement something like the followiong:

    class HashcodeGenerator
    {
        public int GetHashCodeForArbitaryType<TType>(TType complexObject)
        {
            Type objectType = complexObject.GetType();
            PropertyInfo[] properties = objectType.GetProperties().OrderByDescending(x => x.Name).ToArray();

            object firstValue = properties[0].GetValue(complexObject);
            int hashCode = firstValue?.GetHashCode() ?? 0;

            for (int propertyIndex=1; propertyIndex < properties.Length; propertyIndex++)
            {
                object value = properties[propertyIndex].GetValue(complexObject);
                unchecked
                {
                    hashCode = (hashCode * 397) ^ (value?.GetHashCode() ?? 0);
                }
            }
            return hashCode;
        }
    }

This gets the properties from the type and then generates a hashcode in the same way. While this implementation will work, reflection is notoriously expensive. Generating 1 million hashcodes for our simple model takes around 4.5 seconds on my laptop but more complex examples can quickly escalate into a poor performance and the more complex the model the worse the performance. I’ve certainly worked on systems where the cost was simply too high and we’ve been forced to consider alternative approaches: eliminating reflection was definitely not a micro-optimisation.

One option is to use the Expression framework to build an abstract syntax tree and compile it into IL at runtime. While most people are familiar with the Func<> and Action<> support in C# for functional style programming, an Expression Tree framework introduced alongside it seems to have gone much more under the radar. Allowing you to compile code at runtime, the Expression Tree framework is incredibly useful as, when used appropriately, it can lead to massive performance improvements in data driven designs.

A syntax tree is, as the name suggests, a tree representation of source code with each node of the tree corresponding to a source code construct, for example, calling a method or a comparison operator. Consider the following C# code:

public static Func<int> GetSimpleAddFunc()
{
    BinaryExpression add = Expression.Add(Expression.Constant(5), Expression.Constant(4));
    Expression<Func<int>> lambda = Expression.Lambda<Func<int>>(add);
    return lambda.Compile();
}

public static void DoAddition()
{

    Func<int> addFunc = GetAddFunc();
    int result = addFunc();
}

The _GetAddFunc()_ function returns a typical Func<int>, but one that is compiled at runtime. To do this we first build a simple tree comprised of an addition operator and two constant nodes. The addition operator is a kind of binary expression – a node that has left and right child nodes. Whereas the constant nodes are simple value nodes that have no children.

Finally, we wrap the addition node in a lambda expression of type Func<int>. Lambda expressions represent an entry point for the execution of an expression and as such support compilation.

Following this we can use the compiled function exactly as we would a normal function we’ve compiled from source code.

What if we want to be able to pass the values to add into our function? We can use an expression node called a ParameterExpression to allow us to declare placeholders for values that are passed into the expression, and these will take the form of parameters on our func:

public static Func<int, int, int> GetAddFunc()
{
    ParameterExpression value1 = Expression.Parameter(typeof(int));
    ParameterExpression value2 = Expression.Parameter(typeof(int));

    BinaryExpression add = Expression.Add(value1, value2);
    Expression<Func<int, int, int>> lambda = Expression.Lambda<Func<int, int, int>>(add, value1, value2);
    return lambda.Compile();
}

public static void DoAddition()
{

    Func<int, int, int> addFunc = GetAddFunc();
    int result = addFunc(9,5);
}

There are a wide variety of expression nodes and amongst many other things we can call methods, get from properties, and perform comparisons – if you can write the code you can build the equivalent in a syntax tree.

Going back to our data driven hashcode problem, we can combine the Expression framework with our reflection example to inspect the C# types and compile code that generates a hashcode specifically for our type. Once we’ve undertaken the initial compilation we really are just using compiled C# code to process our runtime discovered type. It’s a bit of a leap in complexity and will introduce some further node types but an example implementation of this technique looks like this:

public Func<TType, int> GetHashCodeForArbitaryTypeFunc<TType>()
{
    Type objectType = typeof(TType);
    PropertyInfo[] properties = objectType.GetProperties().OrderByDescending(x => x.Name).ToArray();

    ParameterExpression parameter = Expression.Parameter(objectType);
    ParameterExpression hashcodeVariable = Expression.Variable(typeof(int));

    List<Expression> blockExpressions = new List<Expression>
    {
        Expression.Assign(hashcodeVariable, Expression.Condition(
            Expression.Equal(
                Expression.Convert(Expression.Property(parameter, properties[0]), typeof(object)), Expression.Constant(null)),
                Expression.Constant(0),
                Expression.Call(Expression.Property(parameter, properties[0]), properties[0].PropertyType.GetMethod("GetHashCode", new Type[0]))))
    };

    if (properties.Length > 1)
    {
        blockExpressions.AddRange(properties.Skip(1).Select(p =>
            Expression.Assign(hashcodeVariable, Expression.ExclusiveOr(
                Expression.Multiply(hashcodeVariable, Expression.Constant(397)),
                Expression.Condition(
                    Expression.Equal(
                        Expression.Convert(Expression.Property(parameter, p), typeof(object)), Expression.Constant(null)),
                        Expression.Constant(0),
                        Expression.Call(Expression.Property(parameter, p), p.PropertyType.GetMethod("GetHashCode", new Type[0])))
            ))));
    }

    Expression<Func<TType, int>> lambda = Expression.Lambda<Func<TType, int>>(
        Expression.Block(new[] { hashcodeVariable}, blockExpressions),
        parameter);

    return lambda.Compile();
}

Firstly, let’s compare the performance – does this yield the performance improvement I promised? On the same laptop the performance of this is much improved – while the reflection based example took 4.2 seconds for generating 1 million hashcodes this new approach gives exactly the same hashcode for a given object but generates 1 million hashcodes in only 0.5 seconds, so around an 8x improvement in performance. Repeated reruns give fairly consistent results. It’s worth also noting that the bigger the model the bigger the savings. If I run the same tests with a 10 property model the reflection approach takes 8 seconds while the expression based approach still clocks in at around 0.5 seconds.

If we start to break down the above, you can see how the structure roughly follows that of our reflection implementation but instead of directly evaluating the hashcode it expresses each step as part of a syntax tree, finally bundling it into a lambda and compiling it into a Func<> we can call directly.

One of the key new node types introduced above is the block expression. This basically represents a scoped set of lines of code – essentially the equivelant of a { } brace block in C#. As such it’s factory method (Expression.Block) accepts an array of expression nodes that will be run in sequence, and these are passed as the second parameter to the factory. As we are declaring a scope we can also state any variables that belong to it and in this case, we are making use of an integer variable to store our hashcode.

We also make use of conditional expression nodes created with the Expression.Condition factory method. These take an expression to be evaluated (the first parameter), if true will return the evaluation of the expression used as the second parameter, and if false the third parameter. We’re also doing casting to object as part of our conditional check so we can consistently look at value types and reference types. If we don’t cast this and compare a value type to a reference type then during compilation of the expression, at runtime, we will receive an error. Given we’re using code to write code and know if the type is a value type, we could always check to see if this is the case and emit the comparison to null – however for the sake of keeping the example clear I’ve left this step out.

Finally, we make use of methods and properties through expression nodes through the Expression.Property and Expression.Call factories respectively. In each case the first parameter is an expression that returns the object to execute against, and the second parameter is PropertyInfo or MethodInfo respectively.

It’s worth noting there is a compile cost to this approach, this is nominal (around 0.004 seconds for our example on my laptop), but given we’re likely considering this approach to deal with a performance sensitive scenario it’s best to buffer this by using a lazy caching factory pattern. Otherwise, if we build the expression and compile each time we run into the model type then our savings will be lost. An example of this is shown in the sample below:

class CachingHashCodeFuncFactory : ICachingHashCodeFuncFactor
{
    private readonly ConcurrentDictionary<Type, Delegate> _compiledFunctions =
        new ConcurrentDictionary<Type, Delegate>();

    public Func<TType, int> GetHashCodeFunc<TType>()
    {
        Type type = typeof(TType);
        Delegate func = _compiledFunctions.GetOrAdd(type, obj => GetHashCodeForArbitaryTypeFunc<TType>());
        return (Func<TType, int>)func;
    }

    private static Func<TType, int> GetHashCodeForArbitaryTypeFunc<TType>()
    {
        Type objectType = typeof(TType);
        PropertyInfo[] properties = objectType.GetProperties().OrderByDescending(x => x.Name).ToArray();

        ParameterExpression parameter = Expression.Parameter(objectType);
        ParameterExpression hashcodeVariable = Expression.Variable(typeof(int));

        List<Expression> blockExpressions = new List<Expression>
        {
            Expression.Assign(hashcodeVariable, Expression.Condition(
                Expression.Equal(
                    Expression.Convert(Expression.Property(parameter, properties[0]), typeof(object)), Expression.Constant(null)),
                Expression.Constant(0),
                Expression.Call(Expression.Property(parameter, properties[0]), properties[0].PropertyType.GetMethod("GetHashCode", new Type[0]))))
        };

        if (properties.Length > 1)
        {
            blockExpressions.AddRange(properties.Skip(1).Select(p =>
                Expression.Assign(hashcodeVariable, Expression.ExclusiveOr(
                    Expression.Multiply(hashcodeVariable, Expression.Constant(397)),
                    Expression.Condition(
                        Expression.Equal(
                            Expression.Convert(Expression.Property(parameter, p), typeof(object)), Expression.Constant(null)),
                        Expression.Constant(0),
                        Expression.Call(Expression.Property(parameter, p), p.PropertyType.GetMethod("GetHashCode", new Type[0])))
                ))));
        }

        Expression<Func<TType, int>> lambda = Expression.Lambda<Func<TType, int>>(
            Expression.Block(new[] { hashcodeVariable }, blockExpressions),
            parameter);

        return lambda.Compile();
    }
}

We’ve now got a dictionary lookup to account for but tests on my laptop show that this only increases the time to generate 1 million hashes to 0.6 seconds, a 0.1 second overhead, and still nearly 10x faster than the original reflection approach. I’ve used a concurrent dictionary in the above as it seems likely that a class like this would be registered as a singleton in your IoC container and need to be thread safe.

It is, I hope, easy to see how this kind of approach can be used for compiling code to do all sorts of things wherever metadata is available – for example instead of using reflection we could use the same approach for compiling code to manipulate JSON based on a JSON schema.

As a final aside, if you run the sample app and have done previous work with hashcodes in .Net, something that might strike you as odd is that the hashcode will change on each execution. This is because the hashcode algorithm has changed with .Net Core. Whereas in Framework the hashcode would only change “between frameworks and domain” (so between frameworks and potentially between compilations), in Core it will change on every execution. While it’s dangerous to on hashcodes in this manner I still suspect this is something that will catch people out.

Leave a Reply

Your email address will not be published. Required fields are marked *