Why do write calls sometimes block for a long time in Linux?

Update: Explanation at the bottom.

Anyone know why fwrite() calls sometimes block?

Here is a test I did. I sequentially append 4096 byte chunks to 10 files in a round robin fashion. Throughput is artificially limited to a fixed rate set at about 1/4 the maximum throughput of the drives, which I accomplish by sleeping at subsecond intervals in between my writes. I time each call to write. My expectation is that writes go immediately to pagecache and are asynchronously written out to disk (unless I were to call fsync, which I don’t).

In fact this is usually what happens, the average time is just a few microseconds, but sometimes the write calls block for somewhere between 400 ms and a few seconds. I am using Linux 3.6.32 (RHEL 6) with ext4. I am using default configurations otherwise (no change to any of the /proc/sys/vm stuff and fiddling with those parameters don’t seem to help).

Here is a trace of average and max latencies taken over 1 second intervals. Note the regular spikes. What is the cause of this? Locking between the flush threads and the write thread? Is there anything you can do to mitigate it? This effects anything that does logging—i.e. almost everything.

I understand why this would happen if I exceeded the throughput that Linux’s background flush threads can handle, but that is clearly not the case here as the drive can sustain 400MB writes over a long period.

I tried this on a few different machines, some with RAID, some with a single disk and all showed the same behavior to varying degrees.

Throughput (mb/sec)Avg. Latency (ms)Max Latency (ms)
99.9150.000 0.054
99.9730.000 0.087
100.0050.000 0.057
100.0890.000 0.041
99.9650.000 0.071
99.9770.000 0.098
99.9990.000 0.076
99.9950.000 0.104
99.9610.000 0.057
100.0160.000 0.226
56.9770.000756.174
99.9660.000 0.100
99.9250.000 0.093
100.0010.000 3.070
100.0740.000 0.084
100.1930.000 0.054
100.2070.000 0.053
99.9980.000 0.055
99.9080.000107.069
99.9800.000117.124
99.9850.000 0.054
99.9480.000 0.061
99.9910.000 0.090
99.9730.000 0.046
99.9890.000 11.923
100.0350.000 0.041
100.3550.000 2.698
99.9990.000 0.052
100.0000.000 0.055
99.9630.000 12.534
99.9750.000 0.058
100.3510.000 0.044
99.9900.000 2.889
100.2840.000 0.042
99.9310.000 0.042
100.2180.000 0.056
99.9920.000 0.065
100.1910.000 0.057
100.0230.000 0.401
99.9180.000 1.957
100.0040.000 61.265
99.9380.000 0.092
100.1790.000 0.057
99.9960.000 0.062

Update

A number of people gave interesting and helpful suggestions, such as “this is your punishment for not using Solaris.” The best suggestion was from Mark Wolfe which was to install latencytop and measure it. To do this on Red Hat you need to install their debug kernel and reboot with that, then latencytop will capture the highest latency kernel operations for each process. This gives a great deal of insight into what is going on.

For those who are curious here are a few of the traces that pop up as causing hundreds of ms of latency:

vfs_write()
  do_sync_write()
    ext4_file_write()
      generic_file_aio_write()
        ext4_da_write_begin() [in this case da means delayed allocation]
          block_write_begin()
            __block_prepare_write()
              ext4_da_get_block_prep()
                ext4_get_block_prep()
                  ext4_get_blocks()
                     call_rw_sem_down_read_failed()

This trace seems to be due to delayed allocation. Turing off delayed allocation makes it go away, though probably at the cost of some throughput.

Here is another one, this one seems to be related to journalling.
sys_write()
  vfs_write()
    do_sync_write()
      ext4_file_write()
        generic_file_aio_write()
          __generic_file_aio_write()
            file_update_time()
              __mark_inode_dirty()
                ext4_dirty_inode()
                  ext4_journal_start_sb()
                    jbd2_journal_start()
                      start_this_handle()
You can check out the details of the code in one of these nice Linux kernel code browsers. My take away from all this was that maybe it is time to look at XFS since that allegedly also has better file locking for fsync which is my other problem.

Notes

  1. boredandroid posted this