🍊柑橘 RSS 阅读器 - 鸿蒙首款原生 RSS 阅读器

In this part of the Cache Lab, the mission is simple yet devious: optimize matrix transposition for three specific sizes: 32x32, 64x64, and 61x67. Our primary enemy? Cache misses.

Matrix Transposition

A standard transposition swaps rows and columns directly:

void trans(int M, int N, int A[N][M], int B[M][N])
{
    int i, j, tmp;

    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            tmp = A[i][j];
            B[j][i] = tmp;
        }
    }    

}

While correct, this approach is a cache-miss nightmare because it ignores how data is actually stored in memory.

Cache Overview

To optimize effectively, we first have to understand our hardware constraints. The lab specifies a directly mapped cache with the following parameters:

Parameter	Value
Sets (S)	32
Block Size (B)	32 bytes
Associativity (E)	1 (Direct-mapped)
Integer Size	4 bytes
Capacity per line	8 integers

We will use Matrix Tiling and Loop Unrolling to optimize the codes.

32x32 Case

In this case, a row of the matrix needs 32/8 = 4 sets of cache to store. And cache conflicts occur every 32/4 = 8 rows. This makes 8x8 tiling the sweet spot.

By processing the matrix in $8 \times 8$ blocks, we ensure that once a line of A is loaded, we use all 8 integers before it gets evicted. We also use loop unrolling with 8 local variables to minimize the overhead of accessing B.

int i,j,k;
int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8;
for(i = 0; i<N; i+=8){
    for(j = 0; j<M; j+=8){
        for(k = i; k<N && k<i+8; k++) {
          	
            tmp1 = A[k][j];
            tmp2 = A[k][j+1];
            tmp3 = A[k][j+2];
            tmp4 = A[k][j+3];
            tmp5 = A[k][j+4];
            tmp6 = A[k][j+5];
            tmp7 = A[k][j+6];
            tmp8 = A[k][j+7];

          	
            B[j][k] = tmp1;
            B[j+1][k] = tmp2;
            B[j+2][k] = tmp3;
            B[j+3][k] = tmp4;
            B[j+4][k] = tmp5;
            B[j+5][k] = tmp6;
            B[j+6][k] = tmp7;
            B[j+7][k] = tmp8;
        }
    }
}

61x67 Case

Since 61 and 67 are not powers of two, the conflict misses don’t occur in a regular pattern like they do in the square matrices. This “irregularity” is actually a blessing. We can get away with simple tiling. A 16x16 block size typically yields enough performance to pass the miss-count threshold.

int BLOCK_SIZE = 16;
int i,j,k,l,tmp;
int a,b;
for(i = 0; i<N; i+=BLOCK_SIZE){
    for(j = 0; j<M; j+=BLOCK_SIZE){
        a = i+BLOCK_SIZE;
        b = j+BLOCK_SIZE;
        for(k = i; k<N && k<a; k++) {
            for(l = j; l<M && l<b; l++){
                tmp = A[k][l];
                B[l][k] = tmp;
            }
        }
    }
}

64x64 Case

This is the hardest part. In a 64x64 matrix, a row needs 8 sets, but conflict misses occur every $32/8 = 4$ rows. If we use 8x8 tiling, the bottom half of the block will evict the top half.

We can try a 4x4 matrix tiling first.

int BLOCK_SIZE = 4;
int i,j,k,l,tmp;
int a,b;
for(i = 0; i<N; i+=BLOCK_SIZE){
    for(j = 0; j<M; j+=BLOCK_SIZE){
        a = i+BLOCK_SIZE;
        b = j+BLOCK_SIZE;
        for(k = i; k<N && k<a; k++) {
            for(l = j; l<M && l<b; l++){
                tmp = A[k][l];
                B[l][k] = tmp;
            }
        }
    }
}

But this isn’t enough to pass the miss-count threshold.

We try a 8x8 matrix tiling. We solve this by partitioning the $8 \times 8$ block into four $4 \times 4$ sub-blocks and using the upper-right corner of B as a “buffer” to store data temporarily.

$\text{Block } A = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \quad \xrightarrow{\text{Transpose}} \quad \text{Block } B = \begin{pmatrix} A_{TL}^T & A_{BL}^T \\ A_{TR}^T & A_{BR}^T \end{pmatrix}$

Here are the steps:

Transpose $A_{TL}$ into $B_{TL}$ while simultaneously moving $A_{TR}$ into $B_{TR}$ (as a temp storage).
Move the stored $A_{TR}$ from $B_{TR}$ to its final position, while moving $A_{BL}$ into its spot.
Transpose $A_{BR}$ into $B_{BR}$ .

int i, j, k;
int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8;


for (i = 0; i < N; i += 8) {
    for (j = 0; j < M; j += 8) {
        
        


        for (k = 0; k < 4; k++) {
            
            tmp1 = A[i + k][j];     tmp2 = A[i + k][j + 1];
            tmp3 = A[i + k][j + 2]; tmp4 = A[i + k][j + 3]; 
            tmp5 = A[i + k][j + 4]; tmp6 = A[i + k][j + 5];
            tmp7 = A[i + k][j + 6]; tmp8 = A[i + k][j + 7]; 

            
            B[j][i + k]     = tmp1;
            B[j + 1][i + k] = tmp2;
            B[j + 2][i + k] = tmp3;
            B[j + 3][i + k] = tmp4;

            
            
            B[j][i + k + 4]     = tmp5;
            B[j + 1][i + k + 4] = tmp6;
            B[j + 2][i + k + 4] = tmp7;
            B[j + 3][i + k + 4] = tmp8;
        }

        


        for (k = 0; k < 4; k++) {
            
            tmp1 = A[i + 4][j + k]; tmp2 = A[i + 5][j + k];
            tmp3 = A[i + 6][j + k]; tmp4 = A[i + 7][j + k];
            
            
            tmp5 = A[i + 4][j + k + 4]; tmp6 = A[i + 5][j + k + 4];
            tmp7 = A[i + 6][j + k + 4]; tmp8 = A[i + 7][j + k + 4];

            
            int t1 = B[j + k][i + 4];
            int t2 = B[j + k][i + 5];
            int t3 = B[j + k][i + 6];
            int t4 = B[j + k][i + 7];

            
            B[j + k][i + 4] = tmp1;
            B[j + k][i + 5] = tmp2;
            B[j + k][i + 6] = tmp3;
            B[j + k][i + 7] = tmp4;

            
            B[j + k + 4][i]     = t1;
            B[j + k + 4][i + 1] = t2;
            B[j + k + 4][i + 2] = t3;
            B[j + k + 4][i + 3] = t4;

            
            B[j + k + 4][i + 4] = tmp5;
            B[j + k + 4][i + 5] = tmp6;
            B[j + k + 4][i + 6] = tmp7;
            B[j + k + 4][i + 7] = tmp8;
        }
    }
}

Note: The key trick here is traversing B by columns where possible (so B stays right in the cache) and utilizing local registers (temporary variables) to bridge the gap between conflicting cache lines.

Conclusion

Optimizing matrix transposition is less about the math and more about mechanical sympathy—understanding the underlying hardware to write code that plays nice with the CPU’s cache.

The jump from the naive version to these optimized versions isn’t just a marginal gain; it’s often a 10x reduction in cache misses. It serves as a stark reminder that in systems programming, how you access your data is just as important as the algorithm itself.

In this part of the Cache Lab, the mission is simple yet devious: optimize matrix transposition for three specific sizes: 32x32, 64x64, and 61x67. Our primary enemy? Cache misses.

Matrix Transposition

A standard transposition swaps rows and columns directly:

void trans(int M, int N, int A[N][M], int B[M][N])
{
    int i, j, tmp;

    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            tmp = A[i][j];
            B[j][i] = tmp;
        }
    }    

}

While correct, this approach is a cache-miss nightmare because it ignores how data is actually stored in memory.

Cache Overview

To optimize effectively, we first have to understand our hardware constraints. The lab specifies a directly mapped cache with the following parameters:

Parameter	Value
Sets (S)	32
Block Size (B)	32 bytes
Associativity (E)	1 (Direct-mapped)
Integer Size	4 bytes
Capacity per line	8 integers

We will use Matrix Tiling and Loop Unrolling to optimize the codes.

32x32 Case

In this case, a row of the matrix needs 32/8 = 4 sets of cache to store. And cache conflicts occur every 32/4 = 8 rows. This makes 8x8 tiling the sweet spot.

int i,j,k;
int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8;
for(i = 0; i<N; i+=8){
    for(j = 0; j<M; j+=8){
        for(k = i; k<N && k<i+8; k++) {
          	
            tmp1 = A[k][j];
            tmp2 = A[k][j+1];
            tmp3 = A[k][j+2];
            tmp4 = A[k][j+3];
            tmp5 = A[k][j+4];
            tmp6 = A[k][j+5];
            tmp7 = A[k][j+6];
            tmp8 = A[k][j+7];

          	
            B[j][k] = tmp1;
            B[j+1][k] = tmp2;
            B[j+2][k] = tmp3;
            B[j+3][k] = tmp4;
            B[j+4][k] = tmp5;
            B[j+5][k] = tmp6;
            B[j+6][k] = tmp7;
            B[j+7][k] = tmp8;
        }
    }
}

61x67 Case

int BLOCK_SIZE = 16;
int i,j,k,l,tmp;
int a,b;
for(i = 0; i<N; i+=BLOCK_SIZE){
    for(j = 0; j<M; j+=BLOCK_SIZE){
        a = i+BLOCK_SIZE;
        b = j+BLOCK_SIZE;
        for(k = i; k<N && k<a; k++) {
            for(l = j; l<M && l<b; l++){
                tmp = A[k][l];
                B[l][k] = tmp;
            }
        }
    }
}

64x64 Case

This is the hardest part. In a 64x64 matrix, a row needs 8 sets, but conflict misses occur every $32/8 = 4$ rows. If we use 8x8 tiling, the bottom half of the block will evict the top half.

We can try a 4x4 matrix tiling first.

int BLOCK_SIZE = 4;
int i,j,k,l,tmp;
int a,b;
for(i = 0; i<N; i+=BLOCK_SIZE){
    for(j = 0; j<M; j+=BLOCK_SIZE){
        a = i+BLOCK_SIZE;
        b = j+BLOCK_SIZE;
        for(k = i; k<N && k<a; k++) {
            for(l = j; l<M && l<b; l++){
                tmp = A[k][l];
                B[l][k] = tmp;
            }
        }
    }
}

But this isn’t enough to pass the miss-count threshold.

We try a 8x8 matrix tiling. We solve this by partitioning the $8 \times 8$ block into four $4 \times 4$ sub-blocks and using the upper-right corner of B as a “buffer” to store data temporarily.

Here are the steps:

Transpose $A_{TL}$ into $B_{TL}$ while simultaneously moving $A_{TR}$ into $B_{TR}$ (as a temp storage).
Move the stored $A_{TR}$ from $B_{TR}$ to its final position, while moving $A_{BL}$ into its spot.
Transpose $A_{BR}$ into $B_{BR}$ .

int i, j, k;
int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8;


for (i = 0; i < N; i += 8) {
    for (j = 0; j < M; j += 8) {
        
        


        for (k = 0; k < 4; k++) {
            
            tmp1 = A[i + k][j];     tmp2 = A[i + k][j + 1];
            tmp3 = A[i + k][j + 2]; tmp4 = A[i + k][j + 3]; 
            tmp5 = A[i + k][j + 4]; tmp6 = A[i + k][j + 5];
            tmp7 = A[i + k][j + 6]; tmp8 = A[i + k][j + 7]; 

            
            B[j][i + k]     = tmp1;
            B[j + 1][i + k] = tmp2;
            B[j + 2][i + k] = tmp3;
            B[j + 3][i + k] = tmp4;

            
            
            B[j][i + k + 4]     = tmp5;
            B[j + 1][i + k + 4] = tmp6;
            B[j + 2][i + k + 4] = tmp7;
            B[j + 3][i + k + 4] = tmp8;
        }

        


        for (k = 0; k < 4; k++) {
            
            tmp1 = A[i + 4][j + k]; tmp2 = A[i + 5][j + k];
            tmp3 = A[i + 6][j + k]; tmp4 = A[i + 7][j + k];
            
            
            tmp5 = A[i + 4][j + k + 4]; tmp6 = A[i + 5][j + k + 4];
            tmp7 = A[i + 6][j + k + 4]; tmp8 = A[i + 7][j + k + 4];

            
            int t1 = B[j + k][i + 4];
            int t2 = B[j + k][i + 5];
            int t3 = B[j + k][i + 6];
            int t4 = B[j + k][i + 7];

            
            B[j + k][i + 4] = tmp1;
            B[j + k][i + 5] = tmp2;
            B[j + k][i + 6] = tmp3;
            B[j + k][i + 7] = tmp4;

            
            B[j + k + 4][i]     = t1;
            B[j + k + 4][i + 1] = t2;
            B[j + k + 4][i + 2] = t3;
            B[j + k + 4][i + 3] = t4;

            
            B[j + k + 4][i + 4] = tmp5;
            B[j + k + 4][i + 5] = tmp6;
            B[j + k + 4][i + 6] = tmp7;
            B[j + k + 4][i + 7] = tmp8;
        }
    }
}

Note: The key trick here is traversing B by columns where possible (so B stays right in the cache) and utilizing local registers (temporary variables) to bridge the gap between conflicting cache lines.

Conclusion

Optimizing matrix transposition is less about the math and more about mechanical sympathy—understanding the underlying hardware to write code that plays nice with the CPU’s cache.

In this part of the Cache Lab, the mission is simple yet devious: optimize matrix transposition for three specific sizes: 32x32, 64x64, and 61x67. Our primary enemy? Cache misses.

Matrix Transposition

A standard transposition swaps rows and columns directly:

void trans(int M, int N, int A[N][M], int B[M][N])
{
    int i, j, tmp;

    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            tmp = A[i][j];
            B[j][i] = tmp;
        }
    }    

}

While correct, this approach is a cache-miss nightmare because it ignores how data is actually stored in memory.

Cache Overview

To optimize effectively, we first have to understand our hardware constraints. The lab specifies a directly mapped cache with the following parameters:

Parameter	Value
Sets (S)	32
Block Size (B)	32 bytes
Associativity (E)	1 (Direct-mapped)
Integer Size	4 bytes
Capacity per line	8 integers

We will use Matrix Tiling and Loop Unrolling to optimize the codes.

32x32 Case

In this case, a row of the matrix needs 32/8 = 4 sets of cache to store. And cache conflicts occur every 32/4 = 8 rows. This makes 8x8 tiling the sweet spot.

int i,j,k;
int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8;
for(i = 0; i<N; i+=8){
    for(j = 0; j<M; j+=8){
        for(k = i; k<N && k<i+8; k++) {
          	
            tmp1 = A[k][j];
            tmp2 = A[k][j+1];
            tmp3 = A[k][j+2];
            tmp4 = A[k][j+3];
            tmp5 = A[k][j+4];
            tmp6 = A[k][j+5];
            tmp7 = A[k][j+6];
            tmp8 = A[k][j+7];

          	
            B[j][k] = tmp1;
            B[j+1][k] = tmp2;
            B[j+2][k] = tmp3;
            B[j+3][k] = tmp4;
            B[j+4][k] = tmp5;
            B[j+5][k] = tmp6;
            B[j+6][k] = tmp7;
            B[j+7][k] = tmp8;
        }
    }
}

61x67 Case

int BLOCK_SIZE = 16;
int i,j,k,l,tmp;
int a,b;
for(i = 0; i<N; i+=BLOCK_SIZE){
    for(j = 0; j<M; j+=BLOCK_SIZE){
        a = i+BLOCK_SIZE;
        b = j+BLOCK_SIZE;
        for(k = i; k<N && k<a; k++) {
            for(l = j; l<M && l<b; l++){
                tmp = A[k][l];
                B[l][k] = tmp;
            }
        }
    }
}

64x64 Case

This is the hardest part. In a 64x64 matrix, a row needs 8 sets, but conflict misses occur every $32/8 = 4$ rows. If we use 8x8 tiling, the bottom half of the block will evict the top half.

We can try a 4x4 matrix tiling first.

int BLOCK_SIZE = 4;
int i,j,k,l,tmp;
int a,b;
for(i = 0; i<N; i+=BLOCK_SIZE){
    for(j = 0; j<M; j+=BLOCK_SIZE){
        a = i+BLOCK_SIZE;
        b = j+BLOCK_SIZE;
        for(k = i; k<N && k<a; k++) {
            for(l = j; l<M && l<b; l++){
                tmp = A[k][l];
                B[l][k] = tmp;
            }
        }
    }
}

But this isn’t enough to pass the miss-count threshold.

We try a 8x8 matrix tiling. We solve this by partitioning the $8 \times 8$ block into four $4 \times 4$ sub-blocks and using the upper-right corner of B as a “buffer” to store data temporarily.

Here are the steps:

Transpose $A_{TL}$ into $B_{TL}$ while simultaneously moving $A_{TR}$ into $B_{TR}$ (as a temp storage).
Move the stored $A_{TR}$ from $B_{TR}$ to its final position, while moving $A_{BL}$ into its spot.
Transpose $A_{BR}$ into $B_{BR}$ .

int i, j, k;
int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8;


for (i = 0; i < N; i += 8) {
    for (j = 0; j < M; j += 8) {
        
        


        for (k = 0; k < 4; k++) {
            
            tmp1 = A[i + k][j];     tmp2 = A[i + k][j + 1];
            tmp3 = A[i + k][j + 2]; tmp4 = A[i + k][j + 3]; 
            tmp5 = A[i + k][j + 4]; tmp6 = A[i + k][j + 5];
            tmp7 = A[i + k][j + 6]; tmp8 = A[i + k][j + 7]; 

            
            B[j][i + k]     = tmp1;
            B[j + 1][i + k] = tmp2;
            B[j + 2][i + k] = tmp3;
            B[j + 3][i + k] = tmp4;

            
            
            B[j][i + k + 4]     = tmp5;
            B[j + 1][i + k + 4] = tmp6;
            B[j + 2][i + k + 4] = tmp7;
            B[j + 3][i + k + 4] = tmp8;
        }

        


        for (k = 0; k < 4; k++) {
            
            tmp1 = A[i + 4][j + k]; tmp2 = A[i + 5][j + k];
            tmp3 = A[i + 6][j + k]; tmp4 = A[i + 7][j + k];
            
            
            tmp5 = A[i + 4][j + k + 4]; tmp6 = A[i + 5][j + k + 4];
            tmp7 = A[i + 6][j + k + 4]; tmp8 = A[i + 7][j + k + 4];

            
            int t1 = B[j + k][i + 4];
            int t2 = B[j + k][i + 5];
            int t3 = B[j + k][i + 6];
            int t4 = B[j + k][i + 7];

            
            B[j + k][i + 4] = tmp1;
            B[j + k][i + 5] = tmp2;
            B[j + k][i + 6] = tmp3;
            B[j + k][i + 7] = tmp4;

            
            B[j + k + 4][i]     = t1;
            B[j + k + 4][i + 1] = t2;
            B[j + k + 4][i + 2] = t3;
            B[j + k + 4][i + 3] = t4;

            
            B[j + k + 4][i + 4] = tmp5;
            B[j + k + 4][i + 5] = tmp6;
            B[j + k + 4][i + 6] = tmp7;
            B[j + k + 4][i + 7] = tmp8;
        }
    }
}

Note: The key trick here is traversing B by columns where possible (so B stays right in the cache) and utilizing local registers (temporary variables) to bridge the gap between conflicting cache lines.

Conclusion

Optimizing matrix transposition is less about the math and more about mechanical sympathy—understanding the underlying hardware to write code that plays nice with the CPU’s cache.