Sparse Matrices: What makes them important for Machine Learning


Dense Matrix and Sparsity of the Matrix

There are two common matrix types: dense and sparse. The main difference is that a sparse matrix has many zero-valued elements, while a dense matrix has few or none. Below is an example of a 4-by-4 sparse matrix.

  1. If most of the elements in a matrix are nonzero, the matrix is considered dense.
  2. The number of zero-valued elements divided by the total number of elements (e.g., m × n for an m × n matrix) is sometimes referred to as the sparsity of the matrix.
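As a small sketch of the definition above, the sparsity of a matrix can be computed directly with NumPy (the 4-by-4 matrix here is illustrative):

```python
import numpy as np

# An illustrative 4-by-4 sparse matrix with only 3 non-zero entries
A = np.array([
    [1, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 3],
    [0, 0, 0, 0],
])

# Sparsity = number of zero-valued elements / total number of elements
sparsity = np.count_nonzero(A == 0) / A.size
print(sparsity)  # 13 of 16 elements are zero -> 0.8125
```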

Reasons to use Sparse Matrix representation

Storage

Since a sparse matrix has far fewer non-zero elements than zeros, much less memory is needed if we store only those elements.

So, instead of storing the zeros alongside the non-zero elements, we store only the non-zero elements (together with their positions).

Computing time

You can also save computation time by designing your data structures to traverse only the non-zero elements.
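To illustrate the idea (using scipy.sparse as an assumed example library), a CSR matrix keeps only its non-zero values in a compact array, so an operation like summing touches just those entries rather than every cell:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 1000x1000 matrix with only 2 non-zero cells
dense = np.zeros((1000, 1000))
dense[0, 0] = 5.0
dense[42, 7] = 3.0

sp = csr_matrix(dense)

# Summing over the stored values touches only 2 entries,
# not all 1,000,000 cells of the dense array.
total = sp.data.sum()
print(total)  # 8.0
```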

Usage of Sparse Matrices in Machine Learning

1. When storing and manipulating sparse matrices on a computer, it is advantageous and often necessary to use specialized algorithms and data structures that exploit the sparse structure of the matrix; specialized hardware has even been built for sparse matrix computation.

2. Operations using standard dense-matrix structures and algorithms are slow and inefficient when applied to large sparse matrices, because they waste time processing zeros and waste memory storing them.

3. Sparse data compresses easily and requires significantly less storage space. Some very large sparse matrices cannot be manipulated at all with standard dense-matrix algorithms.

Storage of sparse matrices

For sparse matrices, storing only non-zero entries can significantly reduce storage requirements. Depending on the number and distribution of non-zero entries, different data structures can be used, with significant memory savings compared to the basic approach.

Types of Formats:

  1. Those that support efficient modification,
    • such as DOK (Dictionary of keys), LIL (List of lists), or COO (Coordinate list).
    • These are typically used to construct the matrices.
  2. Those that support efficient access and matrix operations, such as
    • CSR (Compressed Sparse Row) or CSC (Compressed Sparse Column).
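A common pattern, sketched here with scipy.sparse, is to build a matrix in a modification-friendly format such as LIL and then convert it to CSR for fast arithmetic:

```python
from scipy.sparse import lil_matrix

# Build incrementally in LIL (efficient for element-wise assignment)
M = lil_matrix((3, 3))
M[0, 0] = 1
M[1, 2] = 2
M[2, 1] = 3

# Convert to CSR for efficient row access and matrix operations
C = M.tocsr()
print(C.nnz)              # 3 stored values
print((C @ C).toarray())  # matrix product computed in CSR form
```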

The COO format

The COO (Coordinate List) format stores three arrays: one for the values, one for the row index of each value, and one for the column index of each value. For a matrix with 11 non-zero values, COO therefore stores three arrays of 11 entries, 33 values in total, even though the matrix itself contains only 30 values. So what’s the point?

Suppose instead the matrix has 50 columns and 10,000 rows, i.e. 500,000 values, of which only 10,000 are non-zero (meaning it is 98% sparse). Storing 3 arrays of 10,000 values each, you keep 30,000 values instead of 500,000, roughly 17 times less storage. Sparse matrices are therefore useful when working with large data sets that exhibit high sparsity.
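The arithmetic above can be checked directly (a back-of-the-envelope sketch that ignores per-element byte sizes and index dtypes):

```python
rows, cols = 10_000, 50
nnz = 10_000  # number of non-zero values

dense_entries = rows * cols  # values a dense layout must store
coo_entries = 3 * nnz        # values + row indices + column indices

sparsity = 1 - nnz / dense_entries
print(dense_entries, coo_entries)  # 500000 30000
print(f"{sparsity:.0%} sparse, ~{dense_entries / coo_entries:.1f}x less storage")
```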

Here is an example of how to create and manipulate a sparse matrix using TensorFlow:

import tensorflow as tf

# Create a sparse matrix with only a few non-zero elements
indices = [[0, 0], [1, 2], [2, 1]]
values = [1, 2, 3]
shape = [3, 3]
sparse_matrix = tf.SparseTensor(indices, values, shape)

# Convert the sparse matrix to a dense matrix
dense_matrix = tf.sparse.to_dense(sparse_matrix)

# Perform matrix multiplication on the dense matrix
result = tf.matmul(dense_matrix, dense_matrix)

# Print the result
print(result)

This code will create a 3×3 sparse matrix with non-zero values at indices (0, 0), (1, 2), and (2, 1). It will then convert the sparse matrix to a dense matrix, perform matrix multiplication on the dense matrix, and print the result.
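Note that densifying is not strictly required for the multiplication itself: TensorFlow provides tf.sparse.sparse_dense_matmul, which multiplies a SparseTensor by a dense tensor directly. A sketch of the same 3×3 example (float values are used here, as the sparse matmul kernels are geared toward floating-point types):

```python
import tensorflow as tf

# Same sparse matrix as above, with float values
indices = [[0, 0], [1, 2], [2, 1]]
values = [1.0, 2.0, 3.0]
shape = [3, 3]
sparse_matrix = tf.SparseTensor(indices, values, shape)

dense_matrix = tf.sparse.to_dense(sparse_matrix)

# Multiply the sparse operand by the dense one without densifying it first
result = tf.sparse.sparse_dense_matmul(sparse_matrix, dense_matrix)
print(result)
```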

Improve scikit-learn code with sparse data

There are several ways to improve scikit-learn code when working with sparse data:

  1. Use sparse matrices: When working with sparse data, it is often more memory-efficient to use sparse matrices instead of dense matrices. scikit-learn provides several functions for creating and manipulating sparse matrices, including csr_matrix, csc_matrix, and coo_matrix.
  2. Pass sparse matrices directly to estimators: Many scikit-learn estimators accept a scipy.sparse matrix as input. For example, LogisticRegression can be fit on a CSR matrix directly, without converting it to a dense array first.
  3. Use the SelectKBest feature selection method: When working with sparse data, it is often useful to select a subset of the most informative features using the SelectKBest method. This can help reduce the dimensionality of the data and improve the performance of the model.
  4. Use the TruncatedSVD dimensionality reduction method: When working with high-dimensional sparse data, it can be helpful to use dimensionality reduction techniques such as truncated singular value decomposition (SVD) to reduce the number of features. The TruncatedSVD class in scikit-learn can be used for this purpose.
  5. Use the fit_transform method: When working with sparse data, it can be more efficient to use the fit_transform method of scikit-learn transformers instead of calling fit and transform separately, since some transformers can then avoid a second pass over the data.
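A minimal sketch of the first two tips, assuming SciPy and scikit-learn are installed: generate a random matrix in CSR format and fit LogisticRegression on it directly, without densifying (the data here is synthetic and purely illustrative):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 200 samples, 50 features, ~5% of entries non-zero, stored as CSR
X = sparse_random(200, 50, density=0.05, format="csr", random_state=0)
y = rng.integers(0, 2, size=200)  # synthetic binary labels

# The estimator accepts the sparse matrix as-is; no dense conversion needed
clf = LogisticRegression(max_iter=1000).fit(X, y)
preds = clf.predict(X)
print(preds.shape)  # (200,)
```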

By following these tips, you can improve the performance and memory efficiency of your scikit-learn code when working with sparse data.

Conclusion

In this article, we discussed why sparse matrices are relevant to machine learning and how sparse matrices can help reduce dataset storage and computational costs of running ML algorithms.
