Thomas Denney

Student and app developer

Comparing the implementation of generics

Generics are a common feature of statically typed programming languages that allow programmers to abstract algorithms and data structures over types. Typically implementations of generics - at the compiler and runtime level - will "instantiate" different concrete instances of generics types by replacing their type parameters with other types. The point at which this instantiation occurs, and whether it occurs per type or not varies between C++, Java, and C#.

C++

C++ implements generic classes, methods, and functions through templates, which are compiled into separate copies for each concrete type used with the generic type. Consider a simple generic class Box:

template<typename T>
class Box {
public:
    Box(T value) { _value = value; }
private;
    T _value;
};

Later when a function declares an instance of Box it is required to supply the name of a concrete type for the type parameter T:

Box<int> intBox(42);
Box<std::string> stringBox("Hello, world!");

For each compilation unit - generally a single C++ source file - instances are created by replacing T with the appropriate type and then compiling the generated code. The generated code is then type checked, to ensure that all used methods, fields, and operators are accessible. For example, the STL function std::sort is generic across all types, but will only compile when used with a type that overloads the < operator. It is an unfortunate disadvantage that C++ has no way of dictating this in the type system, but this is coming in the future with concepts.

A big advantage of the C++ approach is that there is no runtime performance cost compared to handwritten code, but it does come at the hefty disadvantage of compile time and code size. As each compilation unit includes copies of all the generic instances it uses, it is necessary for a linker to eliminate the redundant copies. For example, if both a.cpp and b.cpp used Box<int>, then implementations of Box<int> would feature in a.o and b.o, but if the two were linked together into program then a spare copy could be removed. Programmers can avoid this by declaring explicit instantiations in a header file and then forcing all of them to be implemented in a single object file, but this becomes awkward as the code size grows.

Slightly unique to C++ is the option to provide alternative specializations of generic classes for types where it may be more efficient to do so. For example, the STL specializes std::vector<bool> to represent boolean values with a single bit.

Java

The initial release of Java in 1995 did not include generics. The Generic Java project, started in 1998, was integrated into the language in 2004. As a result of the relatively late inclusion of generics, the authors of Generic Java were conscious that it would be necessary to maintain compatibility for existing code that used non-generic interfaces for classes, such as common collections, that could now be implemented using generics.

Prior to the introduction of generics, collections APIs used Object, Java's root reference type, as their element type. The Generic Java project therefore chose to 'erase' the generic type parameters by replacing them with Object in order to compile the generic class, and introduce the appropriate casts in client code. The following generic code:

public class Box<T> {
    public T value;
    public Box(T value) {
        this.value = value;
    }
}
//Client code
Box<Integer> intBox = new Box<Integer>(42);
Integer intFromBox = intBox.value;

Is therefore compiled to:

public class Box {
    public Object value;
    public Box(Object value) {
        this.value = value;
    }
}
//Client code
Box intBox = new Box(new Integer(42));
Integer intFromBox = (Integer)intBox.value;

A consequence of type erasure is that it restricts the type parameters to reference types, and as such it is not possible to create an instance of Box<int>, for example. In order to get around this, Java provides boxed classes which contain a field for Java's eight primitive value types. In Oracle's HotSpot JVM, this comes with the disadvantage that a Java object contains a header of one or two words, and to access any field at least one pointer traversal is required. There is therefore the initial cost of allocating memory on the heap for boxed instance, the memory cost of the object's header, the cost of traversing at least one pointer to get the boxed value, and the cost of garbage collecting the instance. In many cases it is difficult to efficiently use generics with primitive types.

Project Valhalla a recent effort to "incubate advanced Java VM and language features", supports the introduction of value types and generic specialisation to the language in order to make these cases more efficient.

Furthermore, without reflection it is not possible to create new instances of the generic argument either, i.e. object creation via new T() is not possible.

.NET

Most C++ compilers compile to native code "ahead of time" (AOT), whereas Java byte-code is compiled "just in time" (JIT) by a virtual machine, but .NET software can be compiled using either method. Typically, however, execution is via the JIT-compiled Common Language Runtime. At runtime, generic types are instantiated as they are required, so generics can be used with value and reference types. A single instance is shared for all reference types - effectively using type erasure like Java - but separate instances are created for value types. Rather than completely erasing types at compile time, the metadata section of .NET "portable executables" encodes the generic parameters of a type or method, allowing these to be replaced, and the method can then be JIT compiled. After an instance has been compiled it will be reused, and the generated native code remains in memory whilst the application is executed.

For most types found in .NET applications (references), this approach will offer similar performance to Java, as appropriate casts between the base reference type and the expected reference type must be inserted. Meanwhile, the .NET approach offers a significant advantage for value types as a separate instantiation is natively compiled for each value type used by the program, which avoids the need for boxing or unboxing, hence offering better performance.

Despite using type erasure for reference types, .NET allows new T() in generic methods by automatically using reflection methods, provided that a given type has the default constructor available.

As well as providing a JIT-environment for .NET applications, Microsoft provides a tool that compiles .NET bytecode assemblies to native code. This improves performance of .NET applications by removing the need for JIT compilation, and also reduces memory requirements.

Unlike native C++, .NET executables typically consist of multiple assemblies that are dynamically linked at runtime (either by the JIT or OS services). Suppose an assembly A.dll exposes the generic type List<T>, which is instantiated as List<string> and List<int> in both B.dll and C.dll, and that D.exe links to both B.dll and C.dll.

In the JIT environment the bytecode for List<T> will only be included in A.dll, but the native environment presents a greater challenge. The first option would be to include native code for List<string> and List<int> in the natively compiled B.dll and C.dll, but this violates .NET's principal that equal types should be identical.

In order to counteract this problem, Kennedy and Syme introduce further rules for the "home" of a particular instantiation:

  • The assembly that defines a generic type includes the instantiation for all reference types, like Java. In the case described above A.dll would include List<ref>
  • If an assembly needs a value type instantiation, then include it in the assembly. B.dll and C.dll would therefore require their own implementations of List<int>, but they could both use the implementation of List<ref> in A.dll for List<string>, as string is a reference type in .NET
  • At link time do not link to a generic type if an equal instantiation has already been linked to. The implementation of List<int> in C.dll would therefore be ignored if B.dll had already been linked in D.exe, thus resolving the identity issue

Conclusion

Generics are a useful addition to many programming languages, and there are further efforts to add higher-order "kinds" (a generalisation of parametrized types) to functional programming languages such as Haskell. The compilation techniques described above are used by main other programming languages, although recent languages such as Go remain resistant to their addition, citing increased compilation time and executable size as drawbacks.

The native code generated by compilers for statically typed languages is unlikely to change due to the stability of computer architectures, OS calling conventions, and memory layouts. Recent research and development into generics instead focusses on adding restrictions to the types that can be used with generics to increase type safety and programmer productivity whilst reducing the complexity of the compiler.