Local Parallelization

Introduction

In general we have seen that if we want to speed up data processing we simply use more computers. In Spark we would add more nodes to our cluster so we could operate on more partitions at once. But this exploration is going to shrink the scope a bit and look at what we can do on a single computer to speed up the processing of data. We will mainly look at two things: speeding up computations and speeding up data access.

Parallelizing Computations

In the long, long ago, you could look at a CPU (central processing unit), the main processor in a computer, and check how many GHz it was. That told you how many billions of clock cycles per second it could run, so a 2 GHz processor could do roughly 2 billion operations per second. The bigger the number, the better the performance. Somewhere in the vicinity of 4 GHz they started to reach a cap where they could not really get much faster.

So engineers started creating instruction pipelines. There would be a queue of instructions, and the processor could do some work on each instruction during each cycle. This worked well for deterministic computations where the processor knew in advance which instructions would be processed. If there were conditional instructions (like an if statement) it could make a guess, but it might need to throw away the work if it guessed wrong.

Since then processors have gotten smarter and can make educated guesses or work on both the true and false branches of a conditional. This can make things a lot faster, but it means we can no longer just look at the clock speed and know which processor offers better performance. And really, there is only so much performance to be wrung out of a single core.

Multiple Cores

The next advances came with multi-core processors. Historically multiple processors were reserved for servers: the motherboard might have two CPU sockets, so you would buy two CPUs. But now a single CPU rarely has only a single core. Most have 2, 4 or 8 cores, and a few have gotten as high as 32 cores, but the 2-8 range is the most common.

Basically this lets each core run its own independent program at the same time. So you might be watching a video on YouTube while parsing a CSV in Python. On a single core the cycles would have to be split between these two tasks. On a multi-core CPU, one core can process the video while another parses the CSV. These are totally independent, and one does not impact the performance of the other if they are running on different cores.

There are challenges, however, if you want to speed up a single program using multiple cores. The cores are isolated, so one does not know what the other is doing. That is where things like Spark come in. You can use Spark to spread the workload out between cores. It is overkill on a single machine, since the storage is already shared, but getting all the cores to split up the work and combine the results is not a terribly easy task to do yourself.
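As a minimal sketch of what splitting one program across cores can look like, here is Python's built-in multiprocessing module dividing a summing job between worker processes (the worker count and chunking scheme here are illustrative assumptions, not a prescription):

    from multiprocessing import Pool

    def sum_chunk(bounds):
        # Each worker process independently sums its own slice of the range.
        start, end = bounds
        return sum(range(start, end))

    if __name__ == "__main__":
        n = 10_000_000
        workers = 4  # illustrative; match this to your core count
        step = n // workers
        chunks = [(i * step, (i + 1) * step) for i in range(workers)]
        with Pool(workers) as pool:
            partial_sums = pool.map(sum_chunk, chunks)  # fan the work out to cores
        print(sum(partial_sums))  # combine the partial results

Note that the hard parts Spark handles for you, splitting up the work and combining the results, still show up here as the manual chunking and the final sum.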

Multi-core CPUs really shine when running multiple programs and not as much when running a single program.

GPUs

GPUs, or graphics processing units, are what have traditionally been used to render 3D graphics on home computers. They are really good at taking a ton of pieces of data and applying the same calculation to all of them. An example might be running the calculations to figure out how a triangle would be lit by a particular light source, and a scene is often composed of hundreds of thousands of these triangles.

So that is what a GPU really does: it applies one operation to thousands of pieces of data. A modern GPU might have around 2,000 processors to throw at a problem at once. The limitation is that they need to apply the same operation at the same time. Where a CPU can run totally independent programs on each core, a GPU needs to run the same program at the same time on each core.

So this means there are particular problems that GPUs are very well suited to solve. Obviously the previously mentioned triangle lighting problem, but also general matrix math where some operation is applied to every value in a matrix. They can also unroll loops. If you had a loop that summed the values from 1 to 1,000,000,000, a GPU could distribute each addition to a core and collect the results at the end. The key is that every core is doing an addition. There are never some cores doing addition while others are doing comparisons or data reads. They are all reading data, adding data, or whatever else, at the same time.
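As a rough illustration of this same-operation-on-all-data style, here is the pattern in NumPy. Libraries like CuPy expose a nearly identical array interface that runs these operations on a GPU, so treat the CuPy mention as an assumption about your setup (it requires a CUDA-capable card):

    import numpy as np  # with a CUDA GPU, `import cupy as np` is close to drop-in

    # Matrix math: one multiply-add applied to every value at once,
    # instead of a Python loop visiting each element.
    matrix = np.random.rand(1000, 1000)
    scaled = matrix * 2.0 + 1.0

    # The "unrolled loop" sum: every element is read and added in bulk.
    values = np.arange(1, 10_000_001)
    print(values.sum())  # the hardware spreads the additions across its cores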

RAID

RAID, or Redundant Array of Independent Disks, is the main way for a user of a single machine to parallelize I/O operations. RAID lets you split data across multiple hard drives so that you can read or write from several drives at once, depending on the implementation.

RAID Levels

There are many different RAID levels. These describe how the disks in the array are configured.

RAID 0
This is your main go-fast arrangement, at the expense of data safety. Data is striped onto multiple disks. Let's assume you have two disks and a file with the contents ABCD. Striping will put AC on one disk and BD on the other. On a read it will pull A off of disk 1 and B off of disk 2 and put them together, and the same with C and D. Writing works the same way: it splits up a file and writes multiple pieces at the same time to all the disks in the array. This has a major drawback. If any disk in the array fails, all of the data is corrupted. Essentially every file will be partly lost, because no one disk has the whole file and none of the data is redundant. In terms of size, we add up the capacity of every hard drive and that gives us the capacity of our array.
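To make striping concrete, here is a toy simulation in Python. The two variables standing in for disks and the character-by-character round-robin are simplifications; real arrays stripe in fixed-size blocks:

    data = "ABCD"

    # Write: round-robin the pieces across two toy "disks".
    disk1 = data[0::2]  # holds "AC"
    disk2 = data[1::2]  # holds "BD"

    # Read: pull from both disks at once and interleave the pieces back together.
    restored = "".join(a + b for a, b in zip(disk1, disk2))
    assert restored == "ABCD"

    # If either disk dies, every file loses half of its contents.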
RAID 1
This mirrors data across hard drives. So with drives 1 and 2 and data ABCD, the full file ABCD will be written to both drives 1 and 2. We can still split up reads, where disk 1 reads AC and disk 2 reads BD and they are stuck together to make ABCD. Writes cannot do this; they need to write the full file to every disk. So reading is sped up, writes are not, but our data is fully redundant. As long as at least one hard drive remains, we keep our data. In terms of size, the array is limited to the smallest hard drive in it. Once that drive is full we can no longer continue adding data.
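A matching toy sketch for mirroring, where every disk holds the full file and a read can be split between disks (again a simplification of how a real controller schedules reads):

    data = "ABCD"

    # Write: the full file goes to every disk, so writes are not split up.
    disk1 = data
    disk2 = data

    # Read: each disk can serve a different part of the file in parallel.
    part1 = disk1[0::2]  # disk 1 serves "AC"
    part2 = disk2[1::2]  # disk 2 serves "BD"
    assert "".join(a + b for a, b in zip(part1, part2)) == "ABCD"

    # Lose either disk and the other still has the complete file.
    assert disk2 == "ABCD"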
RAID 10 (1+0)
This is the RAID level that provides the best of both worlds. It is a striped set of mirrored drives. So imagine two RAID 1 arrays of 2 disks each. Each array consists of mirrored drives, but we can't write quickly to either of them, and each array is limited by its smallest drive. What RAID 10 has us do is create a RAID 0 array out of those RAID 1 arrays, striping data onto them. So ABCD gets split up, with AC written to array 1 and BD written to array 2. Within those arrays, AC is written to every drive in array 1 and BD is written to every drive in array 2. As long as at least one drive survives in each RAID 1 array, the data is preserved.
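Extending the same toy model to RAID 10 with four disks, two mirrored pairs with stripes across the pairs (still a character-level simplification of block-level striping):

    data = "ABCD"

    # Stripe across the two mirrored pairs...
    stripe1 = data[0::2]  # "AC" goes to pair 1
    stripe2 = data[1::2]  # "BD" goes to pair 2

    # ...then mirror within each pair, so each pair holds two full copies of its stripe.
    pair1 = [stripe1, stripe1]  # disks 1 and 2
    pair2 = [stripe2, stripe2]  # disks 3 and 4

    # Even after a failure, a read can use any surviving disk in each pair.
    pair1[0] = None  # disk 1 fails
    s1 = pair1[0] or pair1[1]
    s2 = pair2[0] or pair2[1]
    assert "".join(a + b for a, b in zip(s1, s2)) == "ABCD"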

In general you will see RAID 10 as one of the most common RAID implementations, and if you are researching a RAID setup yourself I would recommend you go with RAID 10 using at least 4 drives. You will get a very significant speedup in reads, a pretty good improvement in writes, and very good data protection. But remember, if you want to really protect your data, back up offsite. A flood or fire does not care that you have 10 hard drives; they are all underwater now.

Review

If you are excited about hardware, this section should help you consider what you might be able to do with your home computer or a single workstation. Even if you may not get the speed you need, being able to prototype using multiple cores of a CPU or a GPU can give you a good idea of what kind of speedup to expect if you deployed to multiple servers. The same goes for storage. It will help you figure out how much of your bottleneck is coming from storage and how much of an improvement you might see using distributed file storage like HDFS, which systems like Spark often run on.