Recently, I have been working on a recommender-system model, and I frequently use NumPy to preprocess data before feeding it into the PyTorch model. I have run into a few pitfalls with NumPy along the way, so I am sharing them here as part of my coding notes.

`np.datetime64`

This is a data type for date and time values in a NumPy array. However, it turns out to be quite costly: indexing a matrix that contains `np.datetime64` columns becomes very slow. My advice is to avoid it and use other means where possible. For example, for simple day arithmetic, it is better to add or subtract `n*24*60*60` directly on integer timestamps than to convert them to `np.datetime64`.
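A minimal sketch of that day-shift trick, using hypothetical toy timestamps (Unix epoch seconds kept as plain `int64`), with the `np.datetime64` round-trip shown only for comparison:

```python
import numpy as np

# Hypothetical toy data: Unix timestamps (seconds since the epoch) as int64.
timestamps = np.array([1_700_000_000, 1_700_086_400], dtype=np.int64)

n = 3  # shift forward by three days
shifted = timestamps + n * 24 * 60 * 60  # plain integer arithmetic, no dtype change

# The same shift via np.datetime64 round-trips through a datetime dtype:
as_dt = timestamps.astype("datetime64[s]") + np.timedelta64(n, "D")
assert np.array_equal(shifted, as_dt.astype(np.int64))
```

Both paths give the same numbers; the integer version simply skips the datetime dtype altogether.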

`np.datetime64` is stored as `int64` in memory. If you try to store it as `int32`, it will overflow.
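A quick sketch of that overflow, assuming nanosecond-resolution values (where the underlying count is far beyond the `int32` range):

```python
import numpy as np

# datetime64 values are 64-bit integer counts of their unit since the epoch.
t = np.array(["2024-01-01"], dtype="datetime64[ns]")
raw = t.astype(np.int64)                 # nanoseconds since 1970-01-01
assert raw[0] > np.iinfo(np.int32).max   # far outside the int32 range

# Forcing the raw count into int32 silently wraps around (overflow):
truncated = raw.astype(np.int32)
assert truncated[0] != raw[0]
```

Note that the cast does not raise; the value just wraps, which makes this bug easy to miss.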

When you concatenate an `int32` matrix with an `int64` matrix, the `int32` data will be promoted to `int64` automatically (**not** `float64`), since NumPy promotes to the smallest common type that can represent both inputs.
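A minimal check of that promotion rule:

```python
import numpy as np

a = np.array([[1, 2]], dtype=np.int32)
b = np.array([[3, 4]], dtype=np.int64)

out = np.concatenate([a, b])
# Mixed integer inputs are promoted to the wider integer type:
assert out.dtype == np.int64
# np.result_type exposes the same promotion logic directly:
assert np.result_type(np.int32, np.int64) == np.int64
```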

`np.isin`

This function tests whether each element of one NumPy array is present in another. It is useful for intersection-style operations between a matrix and a vector: specifically, filtering the rows of a matrix by intersecting one of its columns with a given vector. In that scenario, other set functions such as `np.in1d` or `np.intersect1d` are not applicable. However, `np.isin` is also costly, and so far the only way I have found to accelerate it is to split the work across multiple processes.

These are my observations. If you spot an error or disagree, feel free to discuss it below.
