Using a memory mapped file for a huge matrix

Overview

Matrices can be very large, sometimes larger than you can hold in a single array. You can extend the maximum size by spreading the data across multiple arrays, but this can make your heap very large and inefficient. An alternative is to use a wrapper over a memory-mapped file. Memory-mapped files have very little impact on the heap, and the OS pages them in and out fairly transparently.
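As a minimal sketch of the underlying mechanism (the file name demo.dat is just a placeholder), this maps a small region of a file and reads and writes a double directly through the mapping:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedFileDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("demo.dat", "rw")) {
            // Map the first 4 KB of the file; the buffer lives outside the heap.
            MappedByteBuffer buffer = raf.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buffer.putDouble(0, 3.14159);            // writes through to the file
            System.out.println(buffer.getDouble(0)); // prints 3.14159
        }
    }
}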

A huge matrix

This code supports large matrices of double. It partitions the file into 1 GB mappings, as Java doesn't support a single mapping of 2 GB or more (a pet hate of mine ;). Since a ByteBuffer is indexed with an int, one mapping is limited to just under 2 GB.
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class LargeDoubleMatrix implements Closeable {
    private static final int MAPPING_SIZE = 1 << 30; // map the file in 1 GB chunks
    private final RandomAccessFile raf;
    private final int width;
    private final int height;
    private final List<MappedByteBuffer> mappings = new ArrayList<>();

    public LargeDoubleMatrix(String filename, int width, int height) throws IOException {
        this.raf = new RandomAccessFile(filename, "rw");
        try {
            this.width = width;
            this.height = height;
            long size = 8L * width * height; // total bytes: 8 per double
            for (long offset = 0; offset < size; offset += MAPPING_SIZE) {
                long size2 = Math.min(size - offset, MAPPING_SIZE); // the last mapping may be shorter
                mappings.add(raf.getChannel().map(FileChannel.MapMode.READ_WRITE, offset, size2));
            }
            }
        } catch (IOException e) {
            raf.close();
            throw e;
        }
    }

    // Row-major index of the element at column x, row y.
    protected long position(int x, int y) {
        return (long) y * width + x;
    }

    public int width() {
        return width;
    }

    public int height() {
        return height;
    }

    public double get(int x, int y) {
        assert x >= 0 && x < width;
        assert y >= 0 && y < height;
        long p = position(x, y) * 8; // byte offset of the element
        int mapN = (int) (p / MAPPING_SIZE); // which 1 GB mapping
        int offN = (int) (p % MAPPING_SIZE); // offset within that mapping
        return mappings.get(mapN).getDouble(offN);
    }

    public void set(int x, int y, double d) {
        assert x >= 0 && x < width;
        assert y >= 0 && y < height;
        long p = position(x, y) * 8;
        int mapN = (int) (p / MAPPING_SIZE);
        int offN = (int) (p % MAPPING_SIZE);
        mappings.get(mapN).putDouble(offN, d);
    }

    public void close() throws IOException {
        // The OS only releases a mapping once its MappedByteBuffer has been
        // garbage collected, so a System.gc() may be needed to remove the mappings promptly.
        mappings.clear();
        raf.close();
    }

}

import static org.junit.Assert.assertEquals;

import java.io.IOException;

import org.junit.Test;

public class LargeDoubleMatrixTest {
    @Test
    public void getSetMatrix() throws IOException {
        long start = System.nanoTime();
        final long used0 = usedMemory();
        LargeDoubleMatrix matrix = new LargeDoubleMatrix("ldm.test", 1000 * 1000, 1000 * 1000);
        for (int i = 0; i < matrix.width(); i++)
            matrix.set(i, i, i);
        for (int i = 0; i < matrix.width(); i++)
            assertEquals(i, matrix.get(i, i), 0.0);
        long time = System.nanoTime() - start;
        final long used = usedMemory() - used0;
        if (used == 0)
            System.err.println("You need to use -XX:-UseTLAB to see small changes in memory usage.");
        System.out.printf("Setting the diagonal took %,d ms, Heap used is %,d KB%n", time / 1000 / 1000, used / 1024);
        matrix.close();
    }

    private long usedMemory() {
        return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    }
}
The test above writes to each of the diagonal values of a one million * one million matrix, which is far too large to hope to create on the heap:
Setting the diagonal took 314,819 ms, Heap used is 2,025 KB

$ ls -l ldm.test
-rw-rw-r-- 1 peter peter 8000000000000 2011-12-30 12:42 ldm.test
$ du -s ldm.test 
4010600 ldm.test
That's 8,000,000,000,000 bytes, or about 7.3 TiB of virtual memory, in a Java process! This works because the OS only allocates, or pages in, the pages you actually use. The million diagonal values are each written more than a page apart, so each touches its own 4 KiB page. So while the file size is almost 8 TB, the actual disk space and memory used is about 4 GB. With a more modest 100K * 100K matrix you see something like the following. It's still an 80 GB matrix, which uses trivial heap space. ;)
Setting the diagonal took 110 ms, Heap used is 71 KB

$ ls -l ldm.test
-rw-rw-r-- 1 peter peter 80000000000 2011-12-30 12:49 ldm.test
$ du -s ldm.test 
400000 ldm.test
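To make those du figures concrete, here is a quick back-of-the-envelope sketch of the page arithmetic (the 4 KiB page size is an assumption, though it is the default on most Linux systems):

public class PageMaths {
    public static void main(String[] args) {
        long pageSize = 4096; // assumed OS page size

        // 1,000,000 x 1,000,000 matrix: consecutive diagonal elements are
        // 8 * (width + 1) bytes apart, far more than one page, so each of the
        // million writes dirties a page of its own.
        long fileSize = 8L * 1_000_000 * 1_000_000; // 8,000,000,000,000 bytes, ~7.3 TiB
        long diskUsed = 1_000_000L * pageSize;      // 4,000,000 KB, close to du's
                                                    // 4,010,600 KB (the remainder is
                                                    // likely file-system overhead)

        // 100,000 x 100,000 matrix: the same logic gives 100,000 pages.
        long diskUsed2 = 100_000L * pageSize;       // 400,000 KB, matching du exactly

        System.out.printf("%,d virtual; %,d and %,d resident bytes%n",
                fileSize, diskUsed, diskUsed2);
    }
}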



Comments

  1. That was very interesting. I have a few questions to clarify how this works:

    - Every double element is stored in the ArrayList "mappings", and you use a RandomAccessFile to store this?

    - "So while the file size is almost 8 TB, the actual disk space and memory used is about 4 GB." I understand where the 7.3 TiB is coming from (a million by million matrix of doubles) but not the 4 GB.

  2. The RandomAccessFile is used to obtain an ArrayList of MappedByteBuffer objects (as the maximum power-of-two size is 1 GB, you need quite a few). The doubles are stored in the MappedByteBuffers.

    There are one million doubles written more than one page (4 KB) apart. This means each value touches a different page, but only those pages touched use main memory or disk space. The total space required is 1,000,000 * 4096 bytes or about 4 GB.

  3. Hi Peter, aren't the pages loaded for a memory-mapped file outside the Java heap, and can't other processes also access the file? I too have shared my views on memory-mapped files in Java; let me know what you think of it.

  4. Is it possible to use this memory as shared memory?

  5. @Vineeth Yes, Java Chronicle does exactly that.

  6. This is awesome; I have a question though: in your close() you make use of a Sun API, making it not exactly portable. These things don't come with Javadoc, so it's a pain to track down what this is actually doing, and I'd rather have something portable if I could. If deleting the original file would be sufficient cleanup, I'm perfectly happy to do that.

  7. @levk close() is standard, so I assume you meant the call to clean(), which is not. In any case, it didn't work as I expected in Java 7, so I have removed it from the example. All the methods are now standard.

  8. This blog appears to be cloned to: dzone (dot) com (slash) articles (slash) using-memory-mapped-file-huge

    You may want to take them down.

    Replies
    1. DZone republish some of my articles, as does Java Code Geeks. If more people read the articles as a result, that is fine by me.

