Dense and Sparse Matrices
There are two common matrix types: dense and sparse. The main difference is that a sparse matrix has many zero-valued entries, while a dense matrix has few or none.
- If most of the elements of a matrix are nonzero, the matrix is considered dense.
- The number of zero-valued elements divided by the total number of elements (m × n for an m × n matrix) is referred to as the sparsity of the matrix.
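As a concrete illustration of these definitions, here is a 4-by-4 sparse matrix and its sparsity computed with NumPy (NumPy is an assumption here; the later examples in this article use TensorFlow and scikit-learn, which both build on it):

```python
import numpy as np

# A 4-by-4 matrix in which most entries are zero-valued.
A = np.array([
    [5, 0, 0, 0],
    [0, 8, 0, 0],
    [0, 0, 3, 0],
    [0, 6, 0, 0],
])

# Sparsity = number of zero-valued elements / total number of elements.
sparsity = np.count_nonzero(A == 0) / A.size
print(sparsity)  # 12 zeros out of 16 elements -> 0.75
```

With a sparsity of 0.75, only 4 of the 16 entries carry information, which is exactly what the storage schemes below exploit.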
Reasons to use a sparse matrix representation
- Since there are far fewer non-zero elements than zeros, less memory is needed to store only those elements.
- So, instead of storing zeros alongside the non-zero elements, we store only the non-zero elements.
- You can save computation time by designing your data structures to traverse only the non-zero elements.
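A minimal sketch of this idea in plain Python (a toy dictionary-of-keys store, not any particular library's format):

```python
# Toy dictionary-of-keys store: keep only the non-zero entries,
# mapping (row, col) -> value, alongside the matrix shape.
dense = [
    [0, 0, 3],
    [4, 0, 0],
    [0, 0, 0],
]
shape = (3, 3)
sparse = {(i, j): v
          for i, row in enumerate(dense)
          for j, v in enumerate(row) if v != 0}

print(sparse)  # {(0, 2): 3, (1, 0): 4}
# Reading an entry: missing keys are implicitly zero.
print(sparse.get((2, 1), 0))  # 0
# Traversal touches only the 2 stored values, not all 9 cells.
total = sum(sparse.values())
print(total)  # 7
```

The dictionary holds 2 entries instead of 9 cells, and any loop over `sparse.values()` skips the zeros entirely, which is the source of both the memory and compute savings.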
Usage of sparse matrices in machine learning:
1. When storing and manipulating sparse matrices in a computer, it is advantageous and often necessary to use special algorithms and data structures that exploit the sparse structure of the matrix.
2. Operations using standard dense-matrix structures and algorithms are slow and inefficient when applied to large sparse matrices, because processing the zeros wastes both time and memory.
3. Sparse data compresses easily, requiring significantly less storage space, and some very large sparse matrices simply cannot be manipulated with standard dense-matrix algorithms.
Storage of sparse matrices
For sparse matrices, storing only non-zero entries can significantly reduce storage requirements. Depending on the number and distribution of non-zero entries, different data structures can be used, with significant memory savings compared to storing the full dense array.
Types of formats:
- Those that support efficient modification, such as DOK (Dictionary of Keys), LIL (List of Lists), or COO (Coordinate List). These are typically used to construct the matrices.
- Those that support efficient access and matrix operations, such as CSR (Compressed Sparse Row) or CSC (Compressed Sparse Column).
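With SciPy's `scipy.sparse` module (an assumption here; the article's own examples use TensorFlow and scikit-learn, both of which interoperate with it), a common workflow is to build a matrix in a modification-friendly format and then convert it for arithmetic:

```python
import numpy as np
from scipy.sparse import lil_matrix

# Build incrementally in LIL (efficient modification)...
m = lil_matrix((3, 3))
m[0, 0] = 1
m[1, 2] = 2
m[2, 1] = 3

# ...then convert to CSR (efficient row access and arithmetic).
csr = m.tocsr()
result = csr @ csr  # sparse matrix product; zeros are never touched
print(result.toarray())
```

The conversion step is cheap compared to repeatedly modifying a CSR matrix, which is why libraries recommend constructing in one family of formats and computing in the other.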
The COO format
The Coordinate List (COO) format stores three arrays: one for the values, one for the row index of each value, and one for the column index of each value. For a small matrix this can cost more than dense storage: a matrix with 30 entries, 11 of them non-zero, needs three arrays of 11 values each, 33 stored values in total, more than the matrix itself contains. So what’s the point?
Now suppose the matrix has 50 columns and 10,000 rows, i.e. 500,000 entries, but only 10,000 of them are non-zero (meaning it is 98% sparse). Storing three arrays of 10,000 values each means storing 30,000 values instead of 500,000: roughly 17 times less storage. Sparse matrix formats are therefore useful when working with large data sets that exhibit high sparsity.
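The arithmetic above can be checked with a randomly generated matrix of the same shape (SciPy assumed; the non-zero count follows from the requested density):

```python
from scipy.sparse import random as sparse_random

# 10,000 x 50 matrix with 2% non-zero entries (~10,000 values).
m = sparse_random(10_000, 50, density=0.02, format="coo", random_state=0)

dense_count = m.shape[0] * m.shape[1]  # 500,000 values in dense storage
coo_count = 3 * m.nnz                  # row, column, and value arrays
print(dense_count, coo_count, round(dense_count / coo_count, 1))
```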
Here is an example of how to create and manipulate a sparse matrix using TensorFlow:
```python
import tensorflow as tf

# Create a sparse matrix with only a few non-zero elements
indices = [[0, 0], [1, 2], [2, 1]]
values = [1, 2, 3]
shape = [3, 3]
sparse_matrix = tf.SparseTensor(indices, values, shape)

# Convert the sparse matrix to a dense matrix
dense_matrix = tf.sparse.to_dense(sparse_matrix)

# Perform matrix multiplication on the dense matrix
result = tf.matmul(dense_matrix, dense_matrix)

# Print the result
print(result)
```
This code will create a 3×3 sparse matrix with non-zero values at indices (0, 0), (1, 2), and (2, 1). It will then convert the sparse matrix to a dense matrix, perform matrix multiplication on the dense matrix, and print the result.
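One caveat about the snippet above: converting to dense before multiplying gives up the memory savings. Assuming the same TensorFlow API, `tf.sparse.sparse_dense_matmul` performs the multiplication without materialising the left operand as a dense tensor:

```python
import tensorflow as tf

indices = [[0, 0], [1, 2], [2, 1]]
values = [1.0, 2.0, 3.0]
shape = [3, 3]
sparse_matrix = tf.SparseTensor(indices, values, shape)

# Only the right operand is dense; the left stays sparse throughout.
dense_matrix = tf.sparse.to_dense(sparse_matrix)
result = tf.sparse.sparse_dense_matmul(sparse_matrix, dense_matrix)
print(result.numpy())
```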
Improve scikit-learn code with sparse data
There are several ways to improve scikit-learn code when working with sparse data:
- Use sparse matrices: When working with sparse data, it is often far more memory-efficient to use sparse matrices instead of dense ones. scikit-learn works directly with the sparse matrix types provided by SciPy's `scipy.sparse` module, such as `csr_matrix` and `coo_matrix`.
- Pass sparse input to estimators: Many scikit-learn estimators accept a `scipy.sparse` matrix as input directly. For example, `LogisticRegression` can be fit on a CSR matrix without first converting it to a dense array.
- Use the `SelectKBest` feature selection method: When working with sparse data, it is often useful to select a subset of the most informative features using the `SelectKBest` method. This can help reduce the dimensionality of the data and improve the performance of the model.
- Use the `TruncatedSVD` dimensionality reduction method: When working with high-dimensional sparse data, it can be helpful to use dimensionality reduction techniques such as truncated singular value decomposition (SVD) to reduce the number of features. The `TruncatedSVD` class in scikit-learn accepts sparse input directly, unlike PCA, which requires a dense array.
- Use the `fit_transform` method: When working with sparse data, it can be more efficient to call the `fit_transform` method of scikit-learn transformers instead of calling `fit` and `transform` separately, since some transformers can avoid a second pass over the data.
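A minimal sketch tying a few of these tips together (scikit-learn and SciPy assumed; the data is synthetic and the shapes are arbitrary):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic high-dimensional sparse data: 200 samples, 1,000 features.
X = sparse_random(200, 1000, density=0.01, format="csr", random_state=0)

# TruncatedSVD accepts sparse input directly (PCA would require a
# dense array), and fit_transform fits and projects in one pass.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (200, 10)
```

Note that `X` is never densified: the SVD is computed on the CSR matrix itself, so memory usage scales with the number of non-zero entries rather than with 200 × 1,000.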
By following these tips, you can improve the performance and memory efficiency of your scikit-learn code when working with sparse data.
In this article, we discussed why sparse matrices are relevant to machine learning and how sparse matrices can help reduce dataset storage and computational costs of running ML algorithms.