On Some NumPy Functions

Recently, I have been working on a recommender system model, and I frequently use NumPy to preprocess data before feeding it into the PyTorch model. Along the way I ran into some pitfalls with a few NumPy functions and data types, so I am sharing them here as part of my coding notes.

np.datetime64: This is a data type for date and time values in a NumPy array. In my experience, however, it is quite costly: indexing a matrix that has np.datetime64 column(s) becomes very slow. My advice is to avoid it where you can and use something simpler instead. For example, if you only need simple arithmetic on days, keep the column as an integer Unix timestamp in seconds and add or subtract n*24*60*60, rather than converting it to np.datetime64.
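The day arithmetic described above might look like the following. This is only a minimal sketch: the timestamp values and the names `ts` and `SECONDS_PER_DAY` are made up for illustration, and it assumes the timestamps are whole seconds since the Unix epoch.

```python
import numpy as np

SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical event timestamps, in seconds since the Unix epoch (int64).
ts = np.array([1_600_000_000, 1_600_086_400, 1_600_172_800], dtype=np.int64)

# Shift every timestamp forward by 3 days with plain integer arithmetic.
shifted = ts + 3 * SECONDS_PER_DAY

# Whole days elapsed between each event and the first one.
days_since_first = (ts - ts[0]) // SECONDS_PER_DAY

# Convert to datetime64 only at the edges, e.g. for display.
readable = shifted.astype("datetime64[s]")
print(days_since_first, readable)
```

Keeping the column as int64 through the heavy indexing and converting only when a human-readable date is needed avoids carrying the datetime64 dtype through the pipeline.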

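np.datetime64 values are stored as 64-bit integers in memory (a count of time units since the Unix epoch). If you try to store them as int32, they can overflow, depending on the unit: a nanosecond count for a present-day date is far outside the int32 range.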
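A small check of the storage size and the overflow, as a sketch. The date is arbitrary, and the nanosecond unit is an assumption chosen because it makes the overflow obvious.

```python
import numpy as np

# One arbitrary date, stored with nanosecond resolution.
d = np.array(["2024-01-01T00:00:00"], dtype="datetime64[ns]")

# Each element occupies 8 bytes, i.e. the same width as int64.
itemsize = d.dtype.itemsize

# Reinterpret the same memory as int64: nanoseconds since the epoch.
as_int64 = d.view(np.int64)

# Narrowing to int32 silently wraps around, so the value is corrupted.
narrowed = as_int64.astype(np.int32)

# With second resolution the same date still fits in int32 (until 2038).
seconds = d.astype("datetime64[s]").view(np.int64)
fits_in_int32 = bool(seconds[0] <= np.iinfo(np.int32).max)

print(itemsize, as_int64[0], narrowed[0], fits_in_int32)
```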

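Watch the result dtype when you concatenate integer matrices of different types. An int32 matrix concatenated with an int64 matrix is promoted to int64. The surprising case is mixing signed and unsigned 64-bit integers: int64 with uint64 is converted to float64 automatically, because no integer type can represent the full range of both, and float64 cannot represent every large integer exactly.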
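The promotion rules are easy to check directly on small arrays. A minimal sketch with made-up data: under NumPy's standard promotion rules, int32 with int64 gives int64, while int64 with uint64 gives float64.

```python
import numpy as np

a32 = np.array([1, 2], dtype=np.int32)
a64 = np.array([3, 4], dtype=np.int64)
u64 = np.array([5, 6], dtype=np.uint64)

# Two signed integer types: the wider one wins.
signed_mix = np.concatenate([a32, a64])

# Signed and unsigned 64-bit: no common integer type, so float64.
signed_unsigned_mix = np.concatenate([a64, u64])

# Casting explicitly before concatenating keeps the result integral.
explicit = np.concatenate([a64, u64.astype(np.int64)])

print(signed_mix.dtype, signed_unsigned_mix.dtype, explicit.dtype)
```

If a float64 result shows up unexpectedly, printing the dtype of every input before the concatenation is usually the fastest way to find the odd one out.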

np.isin: This function tests whether each element of one NumPy array exists in another. It is useful for filtering a matrix by one of its columns: given a matrix and a vector of allowed values, np.isin on the column yields a boolean mask that selects the matching rows. Other set functions did not fit this scenario for me: np.intersect1d returns the sorted common values rather than a mask, and np.in1d always works on flattened input. However, np.isin is also costly. So far, the only way I have found to speed it up is to split the work across multiple processes.
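The row filter described above can be sketched as follows. The matrix `interactions` and the vector `wanted` are hypothetical stand-ins (think user id and item id columns), not data from my project.

```python
import numpy as np

# Hypothetical matrix: column 0 is a user id, column 1 an item id.
interactions = np.array([
    [1, 10],
    [2, 20],
    [3, 30],
    [2, 40],
])

# Keep only the rows whose user id appears in this vector.
wanted = np.array([2, 3])

# Boolean mask with one entry per row of the matrix.
mask = np.isin(interactions[:, 0], wanted)
filtered = interactions[mask]

# For comparison: intersect1d gives the common values, not a row mask.
common = np.intersect1d(interactions[:, 0], wanted)

print(filtered)
print(common)
```

Because the mask is computed row by row, the matrix can be split into chunks of rows and each chunk masked in a separate process, which is the multi-process approach mentioned above.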

These are my observations. If you find an error or disagree, feel free to discuss it in the comments.
