Note: the Vietnamese version of this article is available at the link below.

https://duongnt.com/string-interning-vie

As we know, the string type in C# is an immutable reference type. This can be wasteful if we have multiple instances with the same content, because we still need to allocate memory for them separately. To alleviate this issue, the CLR introduced a feature called string interning. With this feature, multiple strings with the same content can all reference the same object in memory.

String comparison and its caveat

When using the string type, sometimes it’s easy to forget that we are dealing with a reference type. This is because in .NET, the string type was designed so that comparison works as we might expect most of the time. Take the code below for example.

var s1 = "An ";
var s2 = "example";
var s3 = "An example";
var s4 = s1 + s2;

Console.WriteLine(s3 == s4); // True
Console.WriteLine(s3.Equals(s4)); // True

However, our s3 and s4 are actually two different objects.

Console.WriteLine(ReferenceEquals(s3, s4)); // False

The only reason s3 == s4 and s3.Equals(s4) both return True in the snippet above is that the string type overloads the == operator and overrides the Equals method. The body of Equals, excerpted below, performs a few quick checks and then calls EqualsHelper, which compares the characters of the two strings and returns true if their contents are identical.

if (this == null)                        //this is necessary to guard against reverse-pinvokes and
    throw new NullReferenceException();  //other callers who do not use the callvirt instruction

if (value == null)
    return false;

if (Object.ReferenceEquals(this, value))
    return true;

if (this.Length != value.Length)
    return false;

return EqualsHelper(this, value);
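One consequence is worth spelling out: the == operator for string is an overload chosen at compile time, whereas Equals is a virtual method. The sketch below (the variables o3 and o4 and the three bool names are introduced here purely for illustration) views the same two strings through object-typed variables. In that case == falls back to reference comparison, while Equals still compares contents.

```csharp
using System;

var s1 = "An ";
var s2 = "example";
var s3 = "An example";
var s4 = s1 + s2; // built at run time, so it is a separate object from s3

// The same two strings, seen through object-typed variables.
object o3 = s3;
object o4 = s4;

bool stringOperator = s3 == s4;     // string's overloaded ==, compares contents
bool objectOperator = o3 == o4;     // object's ==, compares references
bool virtualEquals = o3.Equals(o4); // dispatches to string.Equals, compares contents

Console.WriteLine(stringOperator); // True
Console.WriteLine(objectOperator); // False
Console.WriteLine(virtualEquals);  // True
```

This is why a string stored in an object-typed field or collection should be compared with Equals rather than ==.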

String interning for literals

Whenever we define a string literal, it will be interned. This means that when we create more string literals with the same content, they will all reference the same object in memory.

var s1 = "An example";
var s2 = "An example";
Console.WriteLine(ReferenceEquals(s1, s2)); // True

We can even do this.

var s1 = "An example";
var s2 = "An " + "example";
Console.WriteLine(ReferenceEquals(s1, s2)); // True
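The snippet above works because the C# compiler joins constant operands at compile time, so only the single literal "An example" ends up in the compiled code. This only applies when every operand is a compile-time constant. The sketch below contrasts the two cases; the names Prefix, Suffix, folded, part and joined are made up for this illustration.

```csharp
using System;

const string Prefix = "An ";
const string Suffix = "example";

// Both operands are constants, so the compiler folds this into the literal "An example".
string folded = Prefix + Suffix;

// 'part' is an ordinary variable, so this concatenation happens at run time.
string part = "An ";
string joined = part + "example";

bool foldedShared = ReferenceEquals("An example", folded);
bool joinedShared = ReferenceEquals("An example", joined);

Console.WriteLine(foldedShared);           // True
Console.WriteLine(joinedShared);           // False
Console.WriteLine("An example" == joined); // True, the contents still match
```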

With string interning, we have some advantages.

  • Because all strings with the same content refer to the same object, we don’t need to allocate as much memory. Consequently, we also lighten the load on the Garbage Collector.
  • Comparing interned strings is much faster because we only need to compare their references.

If so, why stop at literals? Why not intern dynamically created strings as well? The reason is explained in this blog post from Eric Lippert. In short, excessive interning can be harmful for the following reasons.

  • When there is a large number of interned strings, creating a new string becomes more expensive, because we need to check whether the new string is already interned. We can use a hash table to make the lookup faster, but maintaining a hash table is also costly, and calculating hash values for all strings is not free either.
  • Managing the lifetime of an interned string is also tricky. Currently, an interned string lives as long as the .NET application. This in itself can lead to a memory leak, which is ironic.
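To make the trade-off concrete, here is a minimal sketch of an application-level pool built on a Dictionary. It is not how the CLR implements interning; the pool variable and the Pool helper are hypothetical names used only for illustration. Every call has to hash the candidate string, which is the lookup cost described above. Unlike the CLR's intern pool, though, this one can be cleared or dropped when it is no longer needed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical application-level pool; NOT the CLR's intern table.
var pool = new Dictionary<string, string>(StringComparer.Ordinal);

// Returns the pooled instance with the same content, adding 'value' if it is new.
// Each call hashes 'value', which is the lookup cost mentioned above.
string Pool(string value)
{
    if (pool.TryGetValue(value, out var existing))
    {
        return existing;
    }
    pool[value] = value;
    return value;
}

var prefix = "An ";
var a = Pool(prefix + "example"); // first time: stored in the pool
var b = Pool(prefix + "example"); // second time: the stored instance is returned

bool sameInstance = ReferenceEquals(a, b);
int pooledCount = pool.Count;

Console.WriteLine(sameInstance); // True
Console.WriteLine(pooledCount);  // 1

// Unlike CLR-interned strings, pooled strings can be collected once the pool lets go of them.
pool.Clear();
```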

Manually intern a string

Having said all that, the framework still allows us to manually control string interning with the string.Intern and string.IsInterned static methods. When string.Intern is called with an interned string, it returns a reference to that string. When called with a non-interned string, it interns that string and then returns a reference to it.

var s1 = "An ";
var s2 = "example";
var s3 = "An example";
var s4 = string.Intern(s1 + s2);

Console.WriteLine(ReferenceEquals(s3, s4)); // True
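As a rough illustration of where manual interning might pay off, the sketch below builds 1,000 strings at run time that only ever take three distinct values, then counts how many separate objects each list refers to. The "status-" prefix and the variable names are invented for this example, and ReferenceEqualityComparer assumes .NET 5 or later.

```csharp
using System;
using System.Collections.Generic;

var plain = new List<string>();
var interned = new List<string>();

for (var i = 0; i < 1000; i++)
{
    var value = "status-" + (i % 3);    // a new string object on every iteration
    plain.Add(value);
    interned.Add(string.Intern(value)); // one shared object per distinct content
}

// Count distinct objects (by reference), not distinct contents.
var plainObjects = new HashSet<string>(plain, ReferenceEqualityComparer.Instance).Count;
var internedObjects = new HashSet<string>(interned, ReferenceEqualityComparer.Instance).Count;

Console.WriteLine(plainObjects);    // 1000
Console.WriteLine(internedObjects); // 3
```

Keep in mind that the three interned strings stay alive for the rest of the application's lifetime, so this only makes sense when the set of distinct values is small and long-lived.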

And we can use string.IsInterned to check if a string is already interned. If the string is interned, string.IsInterned will return a reference to that string. But if the string is not interned yet, string.IsInterned will return null without interning that string.

var s1 = "An ";
var s2 = "example";
var s3 = string.Intern(s1 + s2);

Console.WriteLine(string.IsInterned(s3)); // "An example"
Console.WriteLine(string.IsInterned(s3 + "1")); // prints an empty line, because the result is null
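Because Console.WriteLine prints nothing visible for a null string, an explicit null check makes the result of string.IsInterned easier to read. A small sketch, with variable names chosen for illustration:

```csharp
using System;

var s1 = "An ";
var s2 = "example";
var s3 = string.Intern(s1 + s2);

var notInterned = s3 + "1"; // built at run time and never interned

bool foundInterned = string.IsInterned(s3) != null;
bool foundOther = string.IsInterned(notInterned) != null;

// When a reference comes back, it is the pooled instance itself.
bool samePooledInstance = ReferenceEquals(string.IsInterned(s3), s3);

Console.WriteLine(foundInterned);      // True
Console.WriteLine(foundOther);         // False
Console.WriteLine(samePooledInstance); // True
```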

Keep in mind that all literals, even those that are never assigned to a variable, are automatically interned. Consider the code below.

var s1 = string.IsInterned("A random string, not used anywhere else.");
Console.WriteLine(s1); // "A random string, not used anywhere else."

Where is the storage for string interning?

For a dynamically interned string, the layout described below follows a diagram in the book Pro .NET Memory Management by Konrad Kokosa.

The hash of each interned string is stored in the StringLiteralMap inside the Private Heap, which is unmanaged memory. The StringLiteralMap stores the hash with the corresponding address of each entry in the LargeHeapHandleTable, which resides in the Large Object Heap of managed memory. And the LargeHeapHandleTable contains the reference to the actual string in the Managed Heap. Depending on the size of each string, that reference will point to the Small Object Heap or the Large Object Heap (for strings larger than 85,000 bytes).

For string literals, things are a bit different. When the source code is compiled, all string literals will be written into the metadata. If there are multiple string literals with the same content, the compiler is also smart enough to write that content just once. Then when that code is used, the string literals inside metadata will be checked against the StringLiteralMap, and a new string won’t be allocated if its hash is found in the StringLiteralMap.

Conclusion

String interning is an interesting feature, but it’s actually not something we frequently use in practice. And misusing it can certainly degrade the performance of our application. We should only consider it if we ever need to create a lot of strings with the same content.

A software developer from Vietnam who currently lives in Japan.
