Note: the Vietnamese version of this article is available at the link below.
https://duongnt.com/string-interning-vie
As we know, the string type in C# is an immutable reference type. This can be wasteful if we have multiple instances with the same content, because we still need to allocate memory for them separately. To alleviate this issue, the CLR introduced a feature called string interning. With this feature, multiple strings with the same content can all reference the same object in memory.
String comparison and its caveat
When using the string type, sometimes it’s easy to forget that we are dealing with a reference type. This is because in .NET, the string type was designed so that comparison works as we might expect most of the time. Take the code below for example.
var s1 = "An ";
var s2 = "example";
var s3 = "An example";
var s4 = s1 + s2;
Console.WriteLine(s3 == s4); // True
Console.WriteLine(s3.Equals(s4)); // True
However, our s3 and s4 are actually two different objects.
Console.WriteLine(ReferenceEquals(s3, s4)); // False
The only reason s3.Equals(s4) is True in the snippet above is that the Equals method of the string type has been overridden, as we can see here. The == operator is overloaded to perform the same comparison. Equals calls EqualsHelper to compare each character in the two strings and returns true if their contents are identical.
if (this == null)                        //this is necessary to guard against reverse-pinvokes and
    throw new NullReferenceException();  //other callers who do not use the callvirt instruction
if (value == null)
    return false;
if (Object.ReferenceEquals(this, value))
    return true;
if (this.Length != value.Length)
    return false;
return EqualsHelper(this, value);
String interning for literals
Whenever we define a string literal, it will be interned. This means that when we create more string literals with the same content, they will all reference the same object in memory.
var s1 = "An example";
var s2 = "An example";
Console.WriteLine(ReferenceEquals(s1, s2)); // True
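We can even do this, because the compiler folds the concatenation of two literals into a single literal at compile time.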
var s1 = "An example";
var s2 = "An " + "example";
Console.WriteLine(ReferenceEquals(s1, s2)); // True
With string interning, we have some advantages.
- Because all strings with the same content refer to the same object, we don’t need to allocate as much memory. Consequently, we also lighten the load on the Garbage Collector.
- Comparing interned strings is much faster because we only need to compare their references.
If so, why stop at literal strings? Why not intern dynamic strings as well? The reason is explained in this blog post from Eric Lippert. In short, excessive interning can be harmful for the following reasons.
- When there is a large number of interned strings, creating a new string becomes more expensive, because we need to check whether the new string is already interned. We can use a hash table to make the lookup easier. But maintaining a hash table is also costly, and calculating hash values for all strings is not free either.
- Managing the lifetime of an interned string is also tricky. Currently, an interned string will live as long as the .NET application. This in itself can lead to a memory leak, which is ironic.
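Both costs can be made concrete with a toy pool. The sketch below is not how the CLR implements interning. ToyInternPool is a hypothetical class, written only to show that every call has to hash the candidate string and probe a table, and that pooled strings stay reachable for as long as the pool itself does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; the CLR's intern table is implemented differently.
public sealed class ToyInternPool
{
    // Key and value are the same instance: the first string seen with a given content.
    private readonly Dictionary<string, string> _pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Number of distinct contents currently kept alive by the pool.
    public int Count => _pool.Count;

    // Returns the pooled instance with the same content, adding 'value' if none exists.
    public string Intern(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));

        // The lookup has to hash 'value' first, so even a hit is not free.
        if (_pool.TryGetValue(value, out var existing))
            return existing;

        // A miss stores the string; it is never removed, so it lives as long as the pool.
        _pool.Add(value, value);
        return value;
    }
}
```

Calling Intern twice with two separately allocated strings of equal content should return the same instance both times, and Count should stay at one. Nothing is ever evicted, which mirrors the lifetime problem described in the second bullet.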
Manually intern a string
Having said all that, the framework still allows us to manually control string interning with the string.Intern and string.IsInterned static methods. When string.Intern is called with an interned string, it will return a reference to that string. And when called with a non-interned string, it will intern that string and then return the reference.
var s1 = "An ";
var s2 = "example";
var s3 = "An example";
var s4 = string.Intern(s1 + s2);
Console.WriteLine(ReferenceEquals(s3, s4)); // True
And we can use string.IsInterned to check if a string is already interned. If the string is interned, string.IsInterned will return a reference to that string. But if the string is not interned yet, string.IsInterned will return null without interning that string.
var s1 = "An ";
var s2 = "example";
var s3 = string.Intern(s1 + s2);
Console.WriteLine(string.IsInterned(s3)); // "An example"
Console.WriteLine(string.IsInterned(s3 + "1")); // prints an empty line, because IsInterned returned null
Keep in mind that all literals, even those used as temporary variables, are automatically interned. Let’s see the code below.
var s1 = string.IsInterned("A random string, not used anywhere else.");
Console.WriteLine(s1); // "A random string, not used anywhere else."
Where is the storage for string interning?
For a dynamically interned string, we can refer to the diagram below from the book Pro .NET Memory Management by Konrad Kokosa.
The hash of each interned string is stored in the StringLiteralMap inside the Private Heap, which is unmanaged memory. The StringLiteralMap stores the hash with the corresponding address of each entry in the LargeHeapHandleTable, which resides in the Large Object Heap of managed memory. And the LargeHeapHandleTable contains the reference to the actual string in the Managed Heap. Depending on the size of each string, that reference will point to the Small Object Heap or the Large Object Heap (for strings larger than 85,000 bytes).
For string literals, things are a bit different. When the source code is compiled, all string literals will be written into the metadata. If there are multiple string literals with the same content, the compiler is also smart enough to write that content just once. Then when that code is used, the string literals inside metadata will be checked against the StringLiteralMap, and a new string won’t be allocated if its hash is found in the StringLiteralMap.
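A small sketch of what this means in practice, assuming both literals are compiled into the same assembly and run under the default JIT. LiteralDemo and its methods are hypothetical names used only for illustration.

```csharp
using System;

public static class LiteralDemo
{
    // Two separate methods that use a literal with the same content.
    public static string First() => "shared literal";
    public static string Second() => "shared literal";

    // Builds a string with the same content at run time.
    // new string(char[]) allocates a fresh object, so it is not the literal's instance.
    public static string BuiltAtRuntime() => new string("shared literal".ToCharArray());
}
```

Under those assumptions, ReferenceEquals(LiteralDemo.First(), LiteralDemo.Second()) should be true, because both methods load the same literal. ReferenceEquals(LiteralDemo.First(), LiteralDemo.BuiltAtRuntime()) should be false until the run-time string is passed through string.Intern.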
Conclusion
String interning is an interesting feature, but it’s actually not something we frequently use in practice. And misusing it can certainly degrade the performance of our application. We should only consider it if we ever need to create a lot of strings with the same content.
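As a rough illustration of that last case, the sketch below simulates reading many repeated values and counts how many distinct string objects end up in memory with and without string.Intern. DedupDemo, ReadStatuses and the "ACTIVE"/"INACTIVE" values are made up for this example, and the loop only stands in for real input parsing.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

public static class DedupDemo
{
    // Simulates values read from input: every iteration allocates a fresh string.
    public static List<string> ReadStatuses(int count, bool intern)
    {
        var result = new List<string>(count);
        for (var i = 0; i < count; i++)
        {
            var status = new string((i % 2 == 0 ? "ACTIVE" : "INACTIVE").ToCharArray());
            // With interning, equal contents collapse to one shared instance.
            result.Add(intern ? string.Intern(status) : status);
        }
        return result;
    }

    // Counts distinct object instances (not distinct contents) in 'items'.
    public static int CountDistinctInstances(IEnumerable<string> items)
    {
        var seen = new HashSet<string>(ReferenceComparer.Instance);
        foreach (var s in items) seen.Add(s);
        return seen.Count;
    }

    // Compares strings by reference identity instead of by content.
    private sealed class ReferenceComparer : IEqualityComparer<string>
    {
        public static readonly ReferenceComparer Instance = new ReferenceComparer();
        public bool Equals(string x, string y) => ReferenceEquals(x, y);
        public int GetHashCode(string obj) => RuntimeHelpers.GetHashCode(obj);
    }
}
```

With 1,000 simulated values, the non-interned list should hold 1,000 separate objects, while the interned list should hold only two. Whether that saving outweighs the lookup cost and the unbounded lifetime of interned strings depends on the workload, so it is worth measuring first.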