Note: the Vietnamese version of this article is available at the link below.

https://duongnt.com/read-numpy-files-in-csharp-vie


As introduced in an earlier article, we can load and run a TensorFlow model in C#. Occasionally, we also need to load data from Numpy npy or npz files. For example, in a voice processing project, I loaded the mean and standard deviation of the Mel Cepstrum Envelope from npz files. Today, we will use the NumSharp library to read Numpy files in C#. This article builds upon my two answers on Stack Overflow, which can be found here and here.

You can install NumSharp by running the following command.

dotnet add package NumSharp --version 0.30.0

Read Numpy files in npz format

Read a one-dimensional array which holds integer values

First, we use Python to create a sample file.

import numpy as np

my_data = np.random.randint(0, 255, size=(250)) # Generate an array with 250 elements; each element is an integer between 0~254
np.savez_compressed('data.npz', test_data=my_data) # Name that array 'test_data' and save it into 'data.npz' file

The C# code to read data.npz is as simple as this.

using NumSharp;

var dict = np.Load_Npz<int[]>("data.npz"); // 'dict' is of type NpzDictionary<int[]>
var myData = dict["test_data.npy"]; // Yes, the '.npy' part is necessary. And it is indeed 'npy', not 'npz'

The type of myData is int[], and you can access all its values using indexes.

Console.WriteLine(myData[0]); // Print the first element to console

Here, Load_Npz reads data from a file, but this method can also read data from a stream, or from a byte array.
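
The '.npy' suffix in the dictionary key comes from the layout of the file itself: an npz file is a zip archive, and Numpy stores each array in it as an entry named after the array plus '.npy'. Below is a minimal sketch that lists those entries with Python's standard zipfile module. The helper name list_npz_entries is my own and is not part of Numpy or NumSharp.

```python
import zipfile


def list_npz_entries(path):
    """Return the names of the entries stored inside an npz archive."""
    # An npz file is a zip archive, so the standard zipfile module can open it.
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

For the data.npz file created above, list_npz_entries('data.npz') should return ['test_data.npy'], which is exactly the key we pass to the NpzDictionary.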

Read a one-dimensional array which holds non-integer values

What if the values in data.npz are not integers? What if the Python code to generate the Numpy array looks like this?

my_data = np.random.rand(250) # Generate an array with 250 elements. The data type is np.double (float64)

In this case, the C# code will look like this.

var dict = np.Load_Npz<double[]>("data.npz"); // Type T is now 'double[]'
var myData = dict["test_data.npy"]; // myData is of type 'double[]'

As we can see, we only need to change the generic type parameter of Load_Npz from int[] to double[]. Below is the mapping between some common Numpy types and the corresponding ones in C#.

Numpy Type      Size      C# Type
numpy.byte      8 bit     byte
numpy.single    32 bit    float
numpy.double    64 bit    double

Read a multidimensional array

In a real-world scenario, we often have to deal with multidimensional arrays. For example, the code below will create a three-dimensional array with shape (250, 250, 3) and use double data type.

my_data = np.random.rand(250, 250, 3)
np.savez_compressed('data.npz', test_data=my_data)

The code to read the new npz file is surprisingly similar to the code to read a one-dimensional array.

var dict = np.Load_Npz<double[,,]>("data.npz"); // Type T is now 'double[,,]' (a three-dimensional array of type double)
var myData = dict["test_data.npy"]; // myData is of type 'double[,,]'

Again, we only need to change the generic type parameter of Load_Npz. Here, our data is a three-dimensional array, so we use double[,,]. If our data is two-dimensional, we use double[,], and so on.

Console.WriteLine(myData[0, 0, 0]); // Print the first element to console
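
The rule for the type parameter is mechanical: take the C# element type and add one comma per extra dimension. Here is a small Python sketch of that rule. Both the helper csharp_array_type and the lookup table are hypothetical; the table only pairs Numpy type codes with C# types of the same size and signedness, and it is not taken from NumSharp's source.

```python
# Hypothetical lookup table: Numpy type codes -> C# element types of the same
# size and signedness. This is an illustration, not NumSharp's own mapping.
CSHARP_ELEMENT_TYPES = {
    "i4": "int",     # 32-bit signed integer
    "i8": "long",    # 64-bit signed integer
    "f4": "float",   # numpy.single
    "f8": "double",  # numpy.double
}


def csharp_array_type(type_code, ndim):
    """Build a C# array type name such as 'double[,,]'."""
    if ndim < 1:
        raise ValueError("ndim must be at least 1")
    # Drop the byte-order prefix ('<', '>', '|' or '=') before the lookup.
    element = CSHARP_ELEMENT_TYPES[type_code.lstrip("<>|=")]
    # A C# array of rank n is written with n - 1 commas between the brackets.
    return element + "[" + "," * (ndim - 1) + "]"


print(csharp_array_type("<f8", 3))  # double[,,]
print(csharp_array_type("<i4", 1))  # int[]
```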

Perhaps you can already guess how to read a three-dimensional array of type integer.

var dict = np.Load_Npz<int[,,]>("data.npz"); // Type T is now 'int[,,]'
var myData = dict["test_data.npy"]; // myData is of type 'int[,,]'

Read npz file which contains multiple arrays

It is possible to store multiple arrays in one npz file. Because each array still has its own name, we can read them using their names.

my_data1 = np.random.rand(250, 250, 3)
my_data2 = np.random.rand(250, 250, 3)
my_data3 = np.random.rand(250, 250, 3)
np.savez_compressed('data.npz', test_data1=my_data1, test_data2=my_data2, test_data3=my_data3)

And the C# code to read that file.

var dict = np.Load_Npz<double[,,]>("data.npz"); // The arrays come from np.random.rand, so T is 'double[,,]'
var myData1 = dict["test_data1.npy"];
var myData2 = dict["test_data2.npy"];
var myData3 = dict["test_data3.npy"];

What if we don’t name the arrays in the npz file?

Remember that we need to use the array’s name when retrieving its data from the NpzDictionary. But we can omit the array’s name when writing it to an npz file.

np.savez_compressed('data.npz', my_data)

In this case, Numpy actually saves our array with the default name arr_0. Because of that, the C# code will simply become this.

var dict = np.Load_Npz<double[,,]>("data.npz");
var myData = dict["arr_0.npy"]; // Again, the '.npy' part is required

If our npz file contains multiple arrays, they will be named arr_0, arr_1, arr_2,…

np.savez_compressed('data.npz', my_data1, my_data2, my_data3)

And the C# code to read that file.

var dict = np.Load_Npz<double[,,]>("data.npz");
var myData1 = dict["arr_0.npy"];
var myData2 = dict["arr_1.npy"];
var myData3 = dict["arr_2.npy"];

Read Numpy files in npy format

The npz format is useful because it lets us store multiple arrays in the same file and, when written with savez_compressed, it also reduces the file size. But sometimes we want to read a vanilla npy file. NumSharp supports this as well, with the Load<T> method.

This is the Python code to create test data.

my_data = np.random.rand(250)
np.save('data.npy', my_data)

And the C# code to read data.npy.

var myData = np.Load<double[]>("data.npy"); // myData is of type 'double[]'

There can only be one array in an npy file, and that array is not named. To handle multidimensional arrays or arrays with different data types, all we need to do is change the generic type parameter T, just like when we call Load_Npz.

Similar to Load_Npz, Load also supports reading data from a stream or a byte array.
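
To choose the right T, it helps to know which data type and shape are stored in the file. An npy file begins with a small text header that records both. The sketch below reads that header using only the Python standard library, following the layout described in the Numpy format specification. The function name read_npy_header is my own; this is an illustration, not how NumSharp parses the file.

```python
import ast
import struct

MAGIC = b"\x93NUMPY"


def read_npy_header(stream):
    """Return (descr, fortran_order, shape) from the header of an npy stream."""
    if stream.read(6) != MAGIC:
        raise ValueError("not an npy file")
    major, _minor = struct.unpack("<BB", stream.read(2))
    # Version 1.0 stores the header length in 2 bytes, later versions in 4.
    if major == 1:
        (header_len,) = struct.unpack("<H", stream.read(2))
    else:
        (header_len,) = struct.unpack("<I", stream.read(4))
    # The header is the text of a Python dict literal, padded with spaces.
    header = ast.literal_eval(stream.read(header_len).decode("utf-8"))
    return header["descr"], header["fortran_order"], header["shape"]


# Example usage, assuming data.npy from the Python snippet above exists:
# with open("data.npy", "rb") as f:
#     print(read_npy_header(f))
```

For the data.npy file above, written on a little-endian machine, I would expect the descr to be '<f8' (a 64-bit float) and the shape to be (250,), which matches the double[] we passed to Load.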

Limitation of NumSharp

As we saw in the previous sections, NumSharp reads data from Numpy files into .NET arrays. The size of a .NET array is limited not only by the amount of RAM. This document gives the limits for .NET Core.

  • The array size is limited to a total of 4 billion elements for all dimensions.
  • The maximum index of any given dimension is 0X7FEFFFFF/2,146,435,071 (0X7FFFFFC7/2,147,483,591 for byte arrays and arrays of single-byte structures).

When using .NET Framework, we have an additional limit.

  • The maximum size of an Array is 2 gigabytes (GB). Note that on 64-bit environments, we can avoid this restriction by enabling the gcAllowVeryLargeObjects flag.

Does that mean NumSharp can read Numpy files with up to 4 billion elements, as long as no dimension has more than 2,146,435,071 elements? Unfortunately, the answer is no. And the bigger your data type, the smaller the array that NumSharp can read. To understand why, we need to dive into the code of NumSharp.

The inner workings of NumSharp

Both Load_Npz<T>(string path) and Load<T>(string path) eventually call the Load<T>(Stream stream) method, which in turn calls LoadMatrix(Stream stream). The source code of LoadMatrix can be found here.

int bytes;
Type type;
int[] shape;
if (!parseReader(reader, out bytes, out type, out shape))
    throw new FormatException();

Array matrix = Arrays.Create(type, shape);

if (type == typeof(String))
    return readStringMatrix(reader, matrix, bytes, type, shape);
return readValueMatrix(reader, matrix, bytes, type, shape);

Here, bytes is the size of your data type (4 bytes for int and float, 8 bytes for double,…). And shape is the shape of the array we are trying to read. These two values are then fed into the readValueMatrix method.

int total = 1;
for (int i = 0; i < shape.Length; i++)
    total *= shape[i];
var buffer = new byte[bytes * total];
// omitted

As we can see, NumSharp tries to create a one-dimensional byte array of size bytes * total. Here, total is the product of all the sizes in shape; in other words, it is the total number of elements in the Numpy file we are trying to read. We already know that a byte array cannot have more than 2,147,483,591 elements in one dimension. Knowing the size of each data type, we can calculate the maximum number of elements in a Numpy file that NumSharp can read.

Type         Size       Maximum element count
byte         1 byte     2,147,483,591
int/float    4 bytes    536,870,897
double       8 bytes    268,435,448
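
The numbers in the table are simply the byte array limit divided by the size of each type, rounded down. The Python sketch below reproduces them and also checks whether a given shape fits within that limit. The helper names max_elements and fits_in_buffer are my own; they only redo the arithmetic and do not call NumSharp.

```python
from math import prod

# Maximum length of a one-dimensional byte array in .NET (0x7FFFFFC7).
MAX_BYTE_ARRAY_LENGTH = 0x7FFFFFC7


def max_elements(item_size):
    """Largest element count whose byte buffer still fits in one byte array."""
    return MAX_BYTE_ARRAY_LENGTH // item_size


def fits_in_buffer(shape, item_size):
    """Check whether bytes * total stays within the byte array limit."""
    return prod(shape) * item_size <= MAX_BYTE_ARRAY_LENGTH


for name, size in [("byte", 1), ("int/float", 4), ("double", 8)]:
    print(f"{name}: {max_elements(size):,}")

print(fits_in_buffer((250, 250, 3), 8))   # True
print(fits_in_buffer((41000, 41000), 4))  # False
```

The last line shows that a 41000 x 41000 array of int is already far beyond what a single byte buffer can hold.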

Conclusion

Despite its limitations, the NumSharp library does a good enough job as a substitute for Numpy in .NET. Considering that I mostly train models in Python and only use C# to load pre-trained models and run predictions, I’ve always found NumSharp to be sufficient.

A software developer from Vietnam who is currently living in Japan.

3 Thoughts on “How to read Numpy files in C# with NumSharp”

  • I am using .NET Core 3.1. The documentation for .NET Core says that, by default on 64 bit, it can handle large arrays. For example, int[,] x = new Int32[41000, 41000]; works. And that is over 6GB large.

    When I try to read in a .npy file that has an array of size 41000 x 41000, the ‘readValueMatrix’ blows up. Your site above says “Note that on 64-bit environments, we can avoid this restriction by enabling the gcAllowVeryLargeObjects flag”. .NET Core is clearly allowing me to have very large objects, but not NumSharp.

    Any advice?

    • Hi Marc, as mentioned in the Limitation of NumSharp section, this is the problematic code in NumSharp.

      int total = 1;
      for (int i = 0; i < shape.Length; i++)
          total *= shape[i];
      var buffer = new byte[bytes * total];
      

      With a 41000 x 41000 array, total is equal to 41000 x 41000 == 1,681,000,000. For type int, the value of bytes is 4, which means the last line needs a one-dimensional array with 4 x 1,681,000,000 == 6,724,000,000 elements. This is more than the framework allows (the gcAllowVeryLargeObjects flag only allows arrays larger than 2GB; it does not increase the maximum number of elements allowed in one dimension, because the index in each dimension is of type int).

      Since this is a framework limit, I’m afraid there is little we can do.
