In this article, we will compare the List and Vector data structures in Scala and analyze the pros and cons of using them in different use cases. As we all know List is the most popular data structure in scala and quite simply the most beginner friendly of Scala Collections. But is it the obvious choice for all if not most cases? Let us examine Scala’s Vector vs List and their performance in this article and help you pick the right choice for your use case.
What is a Vector?
Vector is a general-purpose, immutable data structure. It provides random access and updates in constant time and very fast appends and prepends. Vectors strike a good balance between fast random selections and fast random functional updates. They are currently the default implementation of immutable indexed sequences.
i.e you get the following result in REPL if you instantiate an IndexedSeq
val seq = IndexedSeq(1, 2)
val seq: IndexedSeq[Int] = Vector(1, 2)
Vector is a base-32 integer trie. A Trie is a type of n-ary search tree data structure that can efficiently store and search data. It is typically used to store strings but can be used to store ordered lists of any type such as digits or bits. Each node contains the value up to that point in the Trie. Each edge is the key we need to insert or search. Here is a sample example of a Trie
Vector is a similar Trie where the root node has 32 children nodes. Each children node can have 32 children and so on. Hence to hold 2^30 elements a Trie has just 5 levels.
This makes traversing an “effectively constant time” (eC) operation. Further
Vector elements exist in a batch of 32, thus providing a high locality. This means that they are stored in neighboring memory locations. Accessing such elements invokes the cache to store them hence providing quick access.
What is a List?
List in scala is an immutable linked-list implementation of ordered elements of type [A]. Hence the splitting of the head and tail is a constant operation.
List is a commonly used data structure in scala and is immutable. Adding elements to
List or removing elements from the
List returns a new one. It is a stack-type data structure optimal for last-in-first-out LIFO access patterns. It is not suited for random access or FIFO-type access patterns.
Vector Vs List in Scala
Lists provide structural sharing. Structural sharing means copying similar prefix or suffix parts to the new memory location (new variable). This offers some optimization in update operations where we create a new structure from the existing one. However, there are some fundamental performance differences.
Head / Tail
List is a linked list, the head or tail operations are incredibly fast and are a constant (C) operation. Whereas in
Vector it is nearly-a-constant-time operation(eC) considering factors such as the maximum size of the
List is a
LinearSeq, getting an element at a specified index is an O(k) time operation where k is the position of the element in the
Vector, on the other hand, is an
IndexedSeq and is way faster. It offers fast random access.
Vectors are Tries and just need to do a constant number of operations (up to 7 in real-time use cases) to look up sufficiently large data. Hence
List takes Linear time here while
Vector takes nearly constant time.
Prepend / Append
Lists are fast for prepend operations. They just update the new node’s pointer to point to the existing
Vectors also, provide fast prepend operations due to their tree-ish structure. Structural sharing helps them copy the existing structure and add a new element on top of it.
Lists prepend in constant (C) time whereas
Vectors prepend in nearly constant (eC) time.
As for append operations,
Vectors are significantly faster than
Lists traverse till the last node and then append the new element which takes a linear O(n) time.
Vector lookup takes place in O(1) i.e constant time. Hence appending to the element also takes a nearly constant time. With help from structural sharing,
Vectors can copy the prefix, add the new element, and return the new
Vector is an
IndexedSeq which modifies the existing element in nearly constant time. It looks up the position to update in constant time. The element is updated. Finally, the prefix/suffix structure up to that point is shared, returning a Vector with the updated element.
Lists on the other hand, have to traverse and find the element before updating them. This takes O(n) time. Hence
Lists take Linear time for updating while
Vectors take nearly constant time.
Lists are a
LinearSeq and are a good choice for iterative traversals. But
Vectors offer equally good traversal time. Iteration in
Lists is following each node’s pointer to its location. This can sometimes result in a cache miss. Addresses of those pointers reside in a different part of the memory and cannot be fetched from the cache.
Traversing elements is slightly more complicated in
Vectors. Still, due to their excellent locality, elements belonging to the same level, especially to a particular node are almost next to each other. Hence
Vectors reduce cache misses by a significant level. Tries traverse in O(n) time, and so
Vectors also traverse in O(n) time.
List does not work well with parallel algorithms. Since a
List requires O(n) traversals to create a sublist or access an element from it, it is not suitable to use
Lists when implementing parallel algorithms. An example of such an algorithm is splitting a list into sublists and sorting each of the sublists. But this cannot be done in isolation. To traverse each of the sublists, we need to traverse the whole list from the start. Furthermore, Lists themselves contain mutable states in their internal implementation making them gullible to race conditions. A note from the official docs on scala says thus,
Note: Despite being an immutable collection, the implementation uses mutable state internally during construction. These state changes are invisible in single-threaded code but can lead to race conditions in some multi-threaded scenarios.
It also makes sense as to why
List was not included in the parallel collections package in scala.
Vector is much more suitable in this aspect providing fast random access and updates, thus making it useful in parallel algorithms without any race conditions. Scala
ParVector is a specialized data structure that allows for efficient copying and splitting while keeping the data immutable in distributed environments.
As a summarization of the above differences, here are performance benchmarks from the docs, that list the differences between Vector vs List in Scala.
After a detailed analysis of Vector vs List in Scala, it is clear that
Vectors gain the upper hand over
Lists in case of random access, and updates while
Lists are better at the head, tail, and prepend operations.
Vector is the obvious choice when you need an
Indexed Sequence collection and when implementing parallel algorithms.
I hope by considering the above points you will be able to decide which data structure is better suited for your use case. Feel free to add to the discussion by dropping your suggestions, and feedback below. Thanks for reading and I hope you have a good day!