Note: the Vietnamese version of this article is available at the link below.

https://duongnt.com/linq-vie

LINQ (Language Integrated Query) is the name for a set of technologies that integrate query capabilities into the C# language. It was introduced as a major part of C# 3.0 and .NET Framework 3.5 back in 2007. In fact, most of the new features in C# 3.0 were built toward the larger goal of LINQ. It is one of my favourite features in C#, not only because of its usefulness, but also because of how it unifies several features into a coherent whole.

There are many flavours of LINQ: LINQ to Objects, LINQ to SQL, LINQ to XML, and so on. Today, we will focus on the most basic version, LINQ to Objects, and look at some common mistakes when using it.

We will use the class below in our examples.

public class Item
{
    private string _name;

    public int Id { get; set; }
    public string Name
    {
        get
        {
            Console.WriteLine($"Get name for Id: {Id}");
            return _name;
        }
        set
        {
            _name = value;
        }
    }
    public decimal Price { get; set; }

    public Item(int id, string name, decimal price)
    {
        Id = id;
        Name = name;
        Price = price;
    }
}

Be careful of deferred execution

We will start with a simple collection of Item objects.

IEnumerable<Item> items = new List<Item>
{
    new Item(1, "Book", 30.00m),
    new Item(2, "Toy Car", 31.50m),
    new Item(3, "Water Gun", 32m),
    new Item(4, "Headphone", 33.50m)
};

Below, we create two IEnumerable<int> objects using LINQ.

var maxPrice = 31m;
var itemUnder31 = items.Where(i => i.Price < maxPrice).Select(i => i.Id);

maxPrice = 33m;
var itemUnder33 = items.Where(i => i.Price < maxPrice).Select(i => i.Id);

// Code to print itemUnder31 and itemUnder33 to console

At first glance, one might think that the code above will print these values to console.

# itemUnder31: 1
# itemUnder33: 1, 2, 3

But the actual result is below.

# itemUnder31: 1, 2, 3
# itemUnder33: 1, 2, 3

This is because LINQ uses a technique called deferred execution. A LINQ query is only executed when we actually need its result, not when we create it. The lambda in each query captures the variable maxPrice itself, not the value it held when the query was created. By the time we iterate those collections, maxPrice == 33m, and that is the value LINQ uses for both queries. Because of this, we need to be extra careful when creating a query from local variables.

Generally speaking, LINQ only executes a query when it needs to iterate through the query’s result.
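
One way to pin a query to the value a variable has right now is to materialize it on the spot. The sketch below is a minimal illustration, not code from the original project: it assumes C# 9 or later (top-level statements) and uses (Id, Price) tuples as a hypothetical stand-in for the Item collection.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the Item collection: (Id, Price) tuples.
var items = new[]
{
    (Id: 1, Price: 30.00m),
    (Id: 2, Price: 31.50m),
    (Id: 3, Price: 32m),
    (Id: 4, Price: 33.50m),
};

var maxPrice = 31m;

// Deferred: the lambda captures the variable maxPrice, not its current value.
var lazyUnder31 = items.Where(i => i.Price < maxPrice).Select(i => i.Id);

// Materialized: ToList runs the query now, while maxPrice is still 31.
var snapshotUnder31 = items.Where(i => i.Price < maxPrice).Select(i => i.Id).ToList();

maxPrice = 33m;

Console.WriteLine(string.Join(", ", lazyUnder31));     // 1, 2, 3
Console.WriteLine(string.Join(", ", snapshotUnder31)); // 1
```

Copying maxPrice into a separate variable that never changes would work as well; ToList is simply the shortest way to freeze the result.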

Be careful of duplicated iteration

When calling methods like ToList or ToArray, it is obvious that we need to iterate through the whole collection. The same is true when we use the foreach keyword. But some other methods also perform iteration. Take the code below for example.

var names = items.Select(i => i.Name);
var isEmpty = names.Any();
var count = names.Count();

It will print the following lines to console.

Get name for Id: 1
Get name for Id: 1
Get name for Id: 2
Get name for Id: 3
Get name for Id: 4

We can see that the Any method iterated through the first element, and Count iterated through all four elements in names. This is because Any needs to see at least one element to determine whether a collection is empty. Likewise, to find the total number of elements, Count needs to iterate through everything. The exact lines may vary between runtime versions, since newer versions of .NET add shortcuts for some query shapes.
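
The same effect can be observed without the Item class. The following is a small sketch, assuming C# 9 or later; Names and the produced counter are hypothetical helpers that record how many elements a lazy sequence has handed out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var produced = 0;

// Hypothetical lazy source: counts every element it hands out.
IEnumerable<string> Names()
{
    foreach (var name in new[] { "Book", "Toy Car", "Water Gun", "Headphone" })
    {
        produced++;
        yield return name;
    }
}

var names = Names();

var hasAny = names.Any();   // pulls one element, then stops
var afterAny = produced;

var count = names.Count();  // restarts the sequence and pulls every element
var afterCount = produced;

Console.WriteLine($"After Any: {afterAny}");     // After Any: 1
Console.WriteLine($"After Count: {afterCount}"); // After Count: 5
```

A compiler-generated iterator like this one gets no special treatment from LINQ, so the counts do not depend on any runtime shortcut.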

Below are some common methods and their properties.

  • Iterate through all elements: Single, Max, Min, Average, Sum,…
  • Iterate up to the first applicable element: First, Any, ElementAt,…
  • Do not iterate when called (deferred): Where, Select, Intersect, Union, Skip, Take,…

However, just because some iterations happen does not mean we have a problem. Sometimes LINQ can skip iterating the collection, while other times duplicated iteration can severely hurt performance or cause bugs.

When LINQ can skip iterating the collection

If the IEnumerable<T> is actually an ICollection<T>, then there are cases where the Any or Count method won’t iterate through the collection. The relevant part of the Any method’s source code is below.

if (source is ICollection<TSource> collectionoft)
{
    return collectionoft.Count != 0;
}

We can see that LINQ is smart enough to use the Count property if it detects that the collection implements the ICollection<T> interface. Note that this is only true for the overload of Any that does not take a predicate. The overload that takes a predicate has no check for ICollection<T>; in that case the Count property wouldn’t be very useful anyway, because it cannot tell us whether any element satisfies the predicate.

The Count method uses the same optimization, as the excerpt below shows. As with Any, the check for ICollection<T> only applies when we call Count without a predicate.

if (source is ICollection<TSource> collectionoft)
{
    return collectionoft.Count;
}

When we want to check if an ICollection<T> is empty or not, should we use the Any method from LINQ or compare the property Count > 0? Some argue that we should use Any so that the intention is clear; while others say we should use Count > 0 (the property) because Any will use it anyway. Personally, I think either way is fine and the difference in performance is too small to matter.
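
If we want to know whether a sequence qualifies for this kind of shortcut, one option is Enumerable.TryGetNonEnumeratedCount. The sketch below assumes .NET 6 or later, where that method is available; Lazy is a hypothetical helper used only for contrast.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical lazy source with no known size.
IEnumerable<int> Lazy()
{
    yield return 1;
    yield return 2;
}

IEnumerable<int> list = new List<int> { 1, 2, 3, 4 };

// List<T> implements ICollection<T>, so the count is available without enumerating.
var listIsCheap = list.TryGetNonEnumeratedCount(out var listCount);

// A yield-based iterator exposes no count, so LINQ would have to walk it.
var lazyIsCheap = Lazy().TryGetNonEnumeratedCount(out var lazyCount);

Console.WriteLine($"List: {listIsCheap}, count = {listCount}"); // List: True, count = 4
Console.WriteLine($"Lazy: {lazyIsCheap}");                      // Lazy: False
```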

When duplicated iteration is not okay

There are two main cases where duplicated iteration can cause severe effects.

  • Iterating each element takes a long time.
  • Iterating each element has side effects.

A bug that almost went through

In one project, I encountered some code that looked similar to this.


private static int ParallelLimit = 10;

public async Task<CustomType> GetSomeInfoAsync(string url)
{
    // code to asynchronously retrieve some info from a url.
}

public async Task<IList<CustomType>> GetAllInfoAsync(IEnumerable<string> urls)
{
    var rs = new List<CustomType>();

    for (var iteration = 0; ; iteration++)
    {
        var retrieveTasks = urls.Skip(iteration * ParallelLimit)
            .Take(ParallelLimit)
            .Select(url => GetSomeInfoAsync(url));

        if (!retrieveTasks.Any())
        {
            break;
        }

        var infos = await Task.WhenAll(retrieveTasks);
        rs.AddRange(infos);
    }

    return rs;
}

The basic idea here is sound. We want to asynchronously retrieve the data from multiple URLs, but we also want to limit the number of concurrent requests so that we won’t overwhelm the remote server. However, the call retrieveTasks.Any() actually iterates through the first element of retrieveTasks, which invokes GetSomeInfoAsync for the first URL in the batch. Then when we call await Task.WhenAll(retrieveTasks), we iterate through the whole query again and create a fresh set of tasks. This means that for each batch, we make two calls to the first URL in that batch instead of one.

Fortunately, all GetSomeInfoAsync does is retrieve data without changing anything. Because of that, we can call that method twice on the same URL without corrupting any data (operations that can be called multiple times like this are called idempotent). But what if our operation is non-idempotent?

A money leak

Let’s say someone uses the code above as a reference to write the following snippet.

public async Task TransferMoneyAsync(Guid accountId, decimal amount)
{
    // code to asynchronously transfer an amount of money to an account
}

// Transfers the same amount to every account, ParallelLimit accounts at a time.
public async Task TransferMoneyToAllAsync(IEnumerable<Guid> accountIds, decimal amount)
{
    for (var iteration = 0; ; iteration++)
    {
        var transferTasks = accountIds.Skip(iteration * ParallelLimit)
            .Take(ParallelLimit)
            .Select(accountId => TransferMoneyAsync(accountId, amount));

        if (!transferTasks.Any())
        {
            break;
        }

        await Task.WhenAll(transferTasks);
    }
}

Suddenly, in every batch of 10 accounts, the first account receives the money twice.

How to fix this bug

To fix this, all we need to do is convert transferTasks from IEnumerable<Task> to List<Task>. The ToList method forces an iteration through the whole query, creating and starting all tasks in the process. After that, we can iterate the list of tasks as many times as we want without starting anything again. Of course, await Task.WhenAll(transferTasks) still waits until all tasks are completed.

var transferTasks = accountIds.Skip(iteration * ParallelLimit)
    .Take(ParallelLimit)
    .Select(accountId => TransferMoneyAsync(accountId, amount))
    .ToList();

Does that mean we should go and convert every IEnumerable<T> to List<T>? The answer is no. Converting an IEnumerable<T> to List<T> before it is needed incurs a memory and CPU penalty, especially if our collection has a lot of elements and each element is big. We should only convert it to a List<T> if we can predict that we will iterate through the same collection multiple times.
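
To check that the ToList version starts each task exactly once, we can count the calls. This is only a sketch under stated assumptions: it needs C# 9 or later, TransferAsync is a hypothetical stand-in for TransferMoneyAsync that records calls instead of moving money, and account ids are plain integers for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

const int ParallelLimit = 10;
var calls = new Dictionary<int, int>();

// Hypothetical stand-in for TransferMoneyAsync: records how often each account is hit.
async Task TransferAsync(int accountId)
{
    calls.TryGetValue(accountId, out var soFar);
    calls[accountId] = soFar + 1;
    await Task.Yield();
}

var accountIds = Enumerable.Range(1, 25);

for (var iteration = 0; ; iteration++)
{
    // ToList runs the Select once, so each task is created and started exactly once.
    var batch = accountIds.Skip(iteration * ParallelLimit)
        .Take(ParallelLimit)
        .Select(id => TransferAsync(id))
        .ToList();

    // Checking the list's Count property does not re-run the query.
    if (batch.Count == 0)
    {
        break;
    }

    await Task.WhenAll(batch);
}

Console.WriteLine($"Accounts hit: {calls.Count}, max calls per account: {calls.Values.Max()}");
// Accounts hit: 25, max calls per account: 1
```

The counter is updated before the first await inside TransferAsync, so every update happens while ToList is walking the query and no locking is needed in this sketch.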

Conclusion

LINQ helps us express logic with much more concise code in less time. However, because a simple LINQ query can hide a surprising amount of code, it’s entirely possible for bugs to creep into our code.

A software developer from Vietnam who is currently living in Japan.
