Apache Spark RDD sortByKey algorithm and time complexity

What is the Big-O time complexity for Apache Spark RDD sortByKey?

I am trying to assign row numbers to an RDD based on a particular order.

Say I have a {K,V} pair RDD and I want to order it by key using:

myRDD.sortByKey(true).zipWithIndex

What is the time complexity for this operation, in big-O form?

And what is happening under the covers? Bubble sort? I hope not! My dataset is very large and spans many partitions, so I'm curious whether sortByKey is optimal. Does it build some kind of intermediate data structure within each partition and then do something else across partitions to minimize message passing, or does it work some other way?
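To make the question concrete, here is a rough single-machine model of how a range-partitioned sort can work, written in plain Scala with no Spark dependency. As far as I understand it, sortByKey samples the keys to pick range boundaries, shuffles each record to the partition that owns its key range, and sorts each partition locally, so the concatenated partitions come out globally ordered. Under that model the comparison work would be on the order of n log n, plus the cost of one shuffle. Everything named below (`RangeSortSketch`, `rangeBounds`, `partitionFor`, `sortByKeyWithIndex`) is hypothetical and only illustrates the idea; it is not Spark's code, and it uses all keys as the "sample" where a real system would look at only a small subset.

```scala
// Simplified, single-machine model of a range-partitioned sort.
// All names are hypothetical; this is NOT Spark's implementation.
object RangeSortSketch {

  // Pick up to (numPartitions - 1) boundary keys from a sample of the keys.
  def rangeBounds(sample: Seq[Int], numPartitions: Int): Vector[Int] = {
    val sorted = sample.sorted.toVector
    if (sorted.isEmpty) Vector.empty
    else (1 until numPartitions)
      .map(i => sorted(i * sorted.length / numPartitions))
      .distinct
      .toVector
  }

  // A key goes to the partition whose index equals the number of
  // boundaries strictly below it, so partition p only holds keys
  // that are smaller than every key in partition p + 1.
  def partitionFor(key: Int, bounds: Vector[Int]): Int =
    bounds.count(_ < key)

  def sortByKeyWithIndex[V](data: Seq[(Int, V)],
                            numPartitions: Int): Seq[((Int, V), Long)] = {
    // Assumption: the whole key set stands in for a sample here.
    val bounds = rangeBounds(data.map(_._1), numPartitions)

    // "Shuffle": route every record to its range partition.
    val buckets = data.groupBy { case (k, _) => partitionFor(k, bounds) }

    // Sort each partition on its own, then read partitions in order.
    val sortedPartitions = (0 to bounds.length).map { p =>
      buckets.getOrElse(p, Seq.empty[(Int, V)]).sortBy(_._1)
    }

    // Concatenating ordered partitions gives a global order, so a
    // running index is a valid row number.
    sortedPartitions.flatten.zipWithIndex.map { case (kv, i) => (kv, i.toLong) }
  }
}
```

Calling `RangeSortSketch.sortByKeyWithIndex(pairs, 3)` on a small `Seq[(Int, String)]` returns the pairs in ascending key order, each tagged with a zero-based row number, which is the shape of result that `myRDD.sortByKey(true).zipWithIndex` is meant to produce.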