INTRODUCTION

RSA is a public key cryptosystem, which was named after the creators of the algorithm: Rivest, Shamir, and Adleman [@STALLINGS]. It is widely used for both confidentiality and authentication. The main advantage for using RSA is the keys are created in such a way that the public key is publish and can be used to encrypt all messages to the owner of the public key. Unlike symmetric key schemes, RSA does not require sender and receiver to agree on a common key to encrypt and decrypt messages. To send an encrypted message, the encryptor only has to look up the public key of the intended recipient.

The RSA algorithm works as follows. First, a public key and private key pair are generated. The public key is published and can be used to encrypt messages and verify signatures of the owner of the public key. The corresponding private key is known only to the owner and can be used to decrypt messages encrypted with the owner’s public key. The private key can also be used to sign messages. The public and private keys are generated by choosing two distinct large primes, p and q. Then n is computed by multiplying p and q together. Since n is the product of two primes, we can compute Euler’s totient function by φ(n)=(p-1)(q-1). Then, an integer e is chosen such that 1<e<φ(n) and gcd(e,φ(n))=1, where gcd stands for greatest common denominator. Finally an integer d is computed such that d=e^{-1}(mod φ(n)). Thus, the public key is the modulus n and the encryption exponent e. The private key is the decryption exponent d.

RSA is criticized for the amount of time it takes to encrypt and decrypt messages. To encrypt a message, the sender uses the public key, (n,e). The message is converted to an integer m, such that 0<m<n. Then, the ciphertext message c is computed by c=m^{e}(mod n). The message can be recovered by using the private key, d, and the formula m=c^{d}(mod n). Currently to be considered secure, the size of n needs to be 1024 bits or more. This makes RSA computationally slow because arithmetic modulo large numbers is notoriously slow. For example, on small handheld devices, RSA decryption with 1024 bit modulus can take up to 40 seconds [@BONEH]. Also, RSA decryption can significantly reduce the number of requests a web server can handle. In addition, as computers become more powerful, 1024-bit RSA will not be considered secure. This means that in the future, the size of the modulus will increase. Therefore, we need to create methods we can use RSA in a practical fashion.

To improve the speed of RSA encryption and decryption, we propose to create an instruction set architecture that will be made solely for RSA. In this paper, we will talk strictly about using RSA for encrypting and decrypting messages. However the same instruction set architecture we propose in this paper can be used for signing and verifying messages with RSA.

CHARACTERISTICS OF RSA

There are three areas that the RSA can be optimized: finding the encryption and decryption exponent, prime number generation, and encrypting and decrypting messages. In this section we will describe how finding the encryption and decrypting exponent, prime number generation, and encrypting and decrypting the message is usually done, without specialized instructions.

ENCRYPTION AND DECRYPTION EXPONENT

The current approach to verifying that the encryption exponent is coprime to the φ(n) is by using the Euclidean Algorithm. To find the decryption exponent, the Extended Euclidean Algorithm is used. Therefore, we will include these algorithms in our architecture and create instructions to accommodate the Euclidean Algorithm and Extended Euclidean Algorithm.

The Euclidean Algorithm can be executed using the following pseudocode [@STALLINGS]:

;; load phi of n (computed using Euler's totient function) and encryption exponent
load r1, location_of_phi(n)
load r2, location_of_encryption_exponent

loop while r2 not equal to zero

div t0, r1, r2       ;; t0 = r1 / r2
mult t1, r2, t0         ;; t1 = r2 * r
sub t2, r1, t1       ;; t2 = r1 - t1

store r2, r1            ;; r1 = r2
store t2, r2            ;; r2 = t2

;; end loop

return r2                 ;; this is equal to the gcd of phi of n and the encryption exponent 

If the gcd(φ(n), e) = 1, then e is a valid encryption exponent for RSA, and we can now find the decryption exponent, d. If the gcd(φ(n), e) \neq 1, then a new encryption exponent e needs to be chosen such that 1<e<φ(n). Then, the Euclidean Algorithm needs to be ran again to verify that gcd(φ(n), e) = 1. Typically, to execute the Euclidean Algorithm, a divide, multiply, and subtract instruction would be needed in each iteration of the loop. In the design section, we will describe an instruction that will combine the divide, multiply, and subtract instruction into one instruction.

Once the encryption exponent, e is chosen, we need to find the decryption exponent d. The decryption exponent, d needs to satisfy the following formula: d*e = 1 (mod φ(n)). Another way of describing d is it is the multiplicative inverse of e modular φ(n). The Extended Euclidean Algorithm can be executed using the following pseudocode [@CORMEN]:

;; load phi of n (computed using Euler's totient function) and encryption exponent
load r1, location_of_phi(n)
load r2, location_of_encryption_exponent

;; values that will be used and updated in the loop of the algorithm
store 0, r3              ;; r3 = 0
store 0, r4              ;; r4 = 0
store 1, r5              ;; r5 = 1
store 1, r6              ;; r6 = 1

loop while r2 not equal to zero

div t0, r1, r2        ;; t0 = r1/r2
store r2, t1            ;; t1 = r2
mult t2, r1, t0      ;; t2 = r1 * t0
sub r2, r2, t2        ;; r2 = r2 - t2
store t1, r1            ;; r1 = t1
store r3, t3            ;; t3 = r3
mult t4, t0, r3      ;; t4 = t0 * r3
sub r3, r5, t4        ;; r3 = r5 - t4
store r3, r5            ;; r5 = r3
store r4, t5            ;; t5 = r4
mult t6, t0, r4      ;; t6 = t0 * r4
sub r4, r6, t6        ;; r4 = r6 - t6
store t5, r6            ;; r6 = t5

;; end loop

return r5, r6          ;; r5 and r6 are e and d

Since we chose an e such that gcd(φ(n), e) = 1, there will always exist a d such that ed = 1 (mod φ(n)). The Extended Euclidean Algorithm returns the integer d. Typically to execute the Extended Euclidean Algorithm, one divide, six stores, three multiplies, and three subtracts are needed within the loop. In the design section, we will use two instructions that will combine some of the instructions used in the Extended Euclidean Algorithm to reduce the number of stalls within the loop.

PRIME NUMBER GENERATION

The common approach to generating large primes in making encryption and decryption keys is to randomly select integers and test them for primality. Since for large numbers, being sure of primality can be an expensive operation, algorithms are used that find “probable primes.” These algorithms test numbers for primality with a certainty over a desired threshold. The particular algorithm we will investigate is known as Rabin’s test, whose high-level use in prime generators as follows[@BEAUCHEMIN]:

;;Generates a prime number of binary length l and certainty factor k
;;The number will have a exp(4,-k) chance of being composite (non-prime)
function genPrime (l, k):
    repeat
        n <- Randomly selected l-digit odd number
    until repeatRabin(n, k) = "prime"
    return n

function repeatRabin(n, k)
    from i <- 0 to k or until done
        done <- rabin(n) = "composite"
    end
    if done
        return "composite"
    else
        return "prime"

function rabin(n)
    s <- the number of trailing zeros in binary form of n-1
    t <- n-1 with trailing zeros removed
    a <- integer randomly selected between 2 and n-2
    x <- exp(a,d) mod n
    if x = 1 or x = n - 1
        return "prime"
    for r <- 1 ... s - 1
        x <- exp(x,2) mod n
        if x = 1 return "composite"
        if x = n - 1 return "prime"
    end for
    return "composite"

This technique will take a long amount of time for large values of t and s. For large values of t, exponentiation done in a typical manner would take an increasing number of iterations to calculate (as shown below). The instructions to calculate x^{2} (mod n) would be executed s times. These two factors indicate a heavy reliance on the ability of a system to calculate exponentiation.

ENCRYPTION AND DECRYPTION

One aspect of RSA to improve upon is performing large exponentiation. Currently the implementation of exponentiation is performed by the compiler, which creates inconsistencies in speed between compilers and is generally not optimal. The general practice for compilers is to loop for the value in the exponent while accumulating the total.

Current pseudocode for how compilers handle exponentiation [@COHEN]:

function power (base, exponent)
     set result = base
     loop from 1 to exponent do
          result = result * base
     return result

This technique is very slow, especially for large exponents. It will cause a lot of stalls as the processor waits for the results of the previous computation. It will also stall waiting for another multiplier to be available. We can lessen the number of stalls using a technique known as exponentiation by squaring. This technique is explained further in the design section.

DESIGN

In this section, we will describe specialized instructions that will be used for prime number generation, computing the encryption and decryption exponent, and encrypting a decrypting a message.

ENCRYPTION AND DECRYPTION EXPONENT

The issue with implementing the Euclidean Algorithm the traditional way is a divide, multiply, and subtract instructions are needed for each iteration of the loop. We noticed that the Euclidean Algorithm is finding a modulus in each iteration of the loop, thus we propose to use a modulus instruction for the Euclidean Algorithm, which is defined below.

mod x, y, z : x = z - ( (z/y) * y)

By using the modulus instruction, we have reduced the number of stalls in each loop. Unlike the traditional implementation, which requires both the multiply and subtraction instruction to stall until the instruction before it writes it’s value to memory, the modulus instruction has all the values immediately.

Therefore, we can implement the Euclidean Algorithm with the following pseudocode:

;; load phi of n (computed using Euler's totient function) and encryption exponent
load r1, location_of_phi(n)
load r2, location_of_encryption_exponent

loop while r2 not equal to zero

mod t0, r1, r2      ;; t0 = r2 - ( (r2/r1) * r1)

store r2, r1           ;;r1 = r2
store t0, r2           ;;r2 = t2

;; end loop

return r2                ;; this is equal to the gcd of phi of n and the encryption exponent 

Using the mod instruction, we can eliminate the use of the divide, multiply, and subtract instructions. Also, we use only one temporary register, as opposed to three temporary registers. We will provide an analysis to the speedup that the modulus instruction gives the implementation of the Euclidean Algorithm in the justification and analysis section.

For the implementation of the Extended Euclidean Algorithm, we propose to use two specialized instructions: a modular instruction and a multiply-subtract instruction. The modular instruction combines the divide, multiply, and subtract instruction, and is defined the same as it is above, in the Euclidean Algorithm. The multiply-subtract instruction is defined as follows:

 multsub w, x, y, z : w = x - (y * z)

Like in the Euclidean Algorithm, by using the modular instruction in the Extended Euclidean Algorithm, we are reducing the number of stalls in each loop iteration. We are further reducing the number of stalls by also using the multiply-subtract instruction. Thus, we can implement the Extended Euclidean Algorithm with the following pseudocode:

 ;; load phi of n (computed using Euler's totient function) and encryption exponent
load r1, location_of_phi(n)
load r2, location_of_encryption_exponent

store 0, r3              ;; r3 = 0
store 0, r4              ;; r4 = 0
store 1, r5              ;; r5 = 1
store 1, r6              ;; r6 = 1

loop while r2 not equal to zero

div t0, r1, r2                ;; t0 = r1/r2
store r2, t1                    ;; t1 = r2
mod r2, r1, r2                ;; r2 = r2 - ( (r2/r1) * r1)
store t1, r1                    ;; r1 = t1
store r3, t2                    ;; t2 = r3
multsub r3, r5, t0, r3       ;; r3 = r5 - (t0 * r3)
store t2, r5                    ;; r5 = t2
store r4, t3                    ;; t3 = r4
multsub  r4, r6, t0, r4      ;; r4 = r6 - (t0 * r3) 
store t3, r6                    ;; r6 = t3

;; end loop

return r5, r6                  ;; r5 and r6 are e and d

Using the modular and multiply-subtract instructions, we have reduced the amount of instructions in the loop of the implementation of the Extended Euclidean Algorithm from 13 instructions to 10 instructions. We also reduce the temporary registers from five to three. We will provide an analysis to the speedup given to the Extended Euclidean Algorithm by using the modular instruction and the multiply-subtract instruction in the justification and analysis section.

PRIME NUMBER GENERATION, ENCRYPTION, AND DECRYPTION

One issue already discussed in the previous section is that of stalls during large exponents. The way exponentiation is handled causes many stalls slowing down the program. One way to lessen the number of stalls is by using a technique called exponentiation by squaring. This technique uses the square of the result to accumulate the final solution instead of just the base. In this way it’ll divide the number of multiplies in half, thus divide the number of stalls in half.

The pseudocode for how exponentiation by squaring will work [@COHEN]:

function power (base, exponent)
     set binaryExp = (n1, n2, n3, ... nm) 
     ;; where binaryExp is the binary representation of exponent and n1 is 1
     set result = base
     set singleResult = 1
     loop for x from 0 to (length of binaryExp - 1) do
          result = result * result
          if binaryExp[x] == 1
                singleResult = singleResult * base

     result = result * singleResult
     return result

An example for calculating 3 to the 10th is as follows:

result := 3
binaryExp := "1010"

Iteration for digit 2:
  result := result^{2} = 3^{2} = 9
  1010bin - Digit equals "0"

Iteration for digit 3:
  result := result^{2} = (32)2 = 34  = 81
  1010bin - Digit equals "1" --> result := result*3 = (32)2*3 = 35  = 243

Iteration for digit 4:
  result := result2 = ((32)2*3)2 = 310  = 59049
  1010bin - Digit equals "0"

return result

The idea is to implement this exponentiation by squares algorithm in the processor using a pow command. This pow command takes the destination register, the register containing the base and the register containing the exponent. This command then runs the loop using one multiplier to accumulate the square and one multiplier to accumulate the value when a 1 is hit. The loop will grab one binary digit at a time from the exponent register. We will always run the result accumulator, then depending on the digit, may use the second accumulator. It will then run through one more multiplier to multiply the squares by the 1’s multiplier.

JUSTIFICATION AND ANALYSIS

In this section, we will describe how our specialized instructions will improve the performance of the RSA encryption.

ENCRYPTION AND DECRYPTION EXPONENT

Using the modular instruction in the Euclidean Algorithm, we can reduce the number of stalls needed. Instead of needing to stall for the result of the divide and the result of the multiply within the loop, the modular instruction does not have to stall for any previous results. In the loop of the Euclidean Algorithm without the modular instruction, two of the five instructions need to stall to wait for a previous result.

CPI = Ideal CPI + The number of stalls per instruction

CPIold = 1 + 2/5
CPIold =  1.4

CPInew = 1 (since no instructions in the loop need to stall for a result)

Thus the speedup for the Euclidean Algorithm in the loop is:
Speedup = CPIold / CPI new
Speedup = 1.4 / 1
Speedup = 1.4

Thus, with respect to stalls, we have a speedup of 1.4 by using the modular instruction in the Euclidean Algorithm. Of course, the latency of the modular instruction is likely to be longer than the latency of the divide, multiply, and subtract instructions, since it is performing three calculations in one instruction. However, since we are building a system only for RSA, we can use specialized hardware to reduce the latency of the modular instruction to receive the speedup of 1.4. As mentioned in the design section, another advantage to using the modular instruction is we reduce the number of temporary registers from three to one.

We are assume that the store instruction finishes in one clock cycle. Thus, in the Extended Euclidean Algorithm, three of the thirteen instructions in the loop need to stall for previous results. However, by using the modular and multipy-subtract instructions, no instruction needs to stall to wait for a previous result to be available.

CPI = Ideal CPI + The number of stalls per instruction

CPIold = 1 + 3/13
CPIold =  1.23

CPInew = 1 (since no instructions in the loop need to stall for a result)

Thus the speedup for the Euclidean Algorithm in the loop is:
Speedup = CPIold / CPI new
Speedup = 1.23 / 1
Speedup = 1.23

Thus with respect to stalls, we have a speedup of 1.23 by using the modular and multiply-subtract instructions in the Extended Euclidean Algorithm. As mentioned above, it is likely that the latency of the modular and multiply-subtract instructions are higher than the traditional instructions. However, we are only building a system for RSA encryption and decryption, so we can build hardware that will reduce the latency of the modular and multiply-subtract instructions to receive the speedup of 1.23. Also, an advantage to using the modular and multiply-subtract instructions is we reduce the number of temporary registers needed from five to three.

PRIME NUMBER GENERATION, ENCRYPTION, AND DECRYPTION

Using the pow command we can reduce stalls of large exponents in half. Since the algorithm breaks the exponent into a binary representation of its self and loops through the digits, the worse case scenario is looping half the number of times. One example of this is if the exponent is eight which is 1000 in base two. Since the number of loops is divided in half, the number of multiplies are also divided in half. Even with a large number of ones in the base two representation of the exponent, this can use a seperate multiplyer, since it doesn’t rely on the result of the squares.

Using the above pseudocode we can assume that approximately 1/4 of the instructions are multiplies, thus 1/4 will stall. Using that knowledge and the knowledge that the number of stalls is cut in half we can use the following equations to determin the overall speedup:

CPI = Ideal CPI + The number of stalls per instruction

CPI old = 1 + 1/4
CPI old = 1.25

CPI new = 1 + 1/4/2
CPI new = 1 + 1/8
CPI new = 1.125

Thus the speedup of this section is:
Speedup = CPI old / CPI new
Speedup = 1.25/1.125
Speedup = 1.11

CONCLUSIONS

In analyzing the typical algorithms used as a part of RSA, we have identified two primary bottlenecks in both encryption and decryption: modulus and exponentiation operations. We propose that because these two operations are so prevalent, they warrant specialized instructions. We also noticed that it is beneficial to use a multiply-subtract instruction to increase speedup while finding a decryption exponent. These specialized instructions improve performance by reducing the number of stalls, and therefore improving ideal CPI. The process of finding encryption and decryption exponents benefits from the increased efficiency of modulus operations. Likewise, the processes of prime generation and encryption and decryption in general, both benefit from a faster exponentiation operation.

We find that by implementing these instructions, the whole procedure finds an overall speedup, from specific speedups of 1.11 for all uses of exponentiation, 1.4 for finding an encryption exponent, and 1.23 for finding a decryption exponent. Thus, we met our goals for improving the performance of the RSA algorithm by creating a specialized instruction set architecture.

A further investigation we could make is to look at specific hardware that supports our instruction set architecture. To gain the maximum speedup, we need to use hardware that reduces the latency for all the specialized instructions to a negligible amount. Since we are creating a system with the sole task of encrypting and decrypting messages using RSA, we can create hardware that allows the specialized instructions to have a latency comparable to the traditional instructions.

Bibliography

3 Beauchemin, Pierre, Brassard, Crepeau, Claude, Goutier, Claude, and Pomerance, Carl The Generation of Random Numbers That Are Probably Prime

Bellare, M., Garay, J., and Rabin T. Fast Batch Verification for Modular Exponentiation and Digital Signatures. Advances in Cryptology- Eurocrypt 98 Proceedings, Lecture Notes in Computer Sciencey, 1998.

Boneh, Dan and Shacham, Horavl. Fast Variants of RSA. Appears in Cryptobytes, 2002. Cohen, H., Frey, G. (editors): Handbook of elliptic and hyperelliptic curve cryptography. Discrete Math. Appl., (Chapman and Hall/CRC 2006).

Cormen, T., Leiserson, C., Rivest, R., and Stein, C. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001.

Stallings, W., Cryptography and Network Security 2nd ed. (Prentice-Hall, 1998)