Ian Kent [Fri, 8 Sep 2017 23:16:27 +0000 (16:16 -0700)]
autofs: make disc device user accessible
The autofs miscellaneous device ioctls that shouldn't require
CAP_SYS_ADMIN need to be accessible to user space applications in order
to be able to get information about autofs mounts.
The module checks capabilities, so the miscellaneous device should be fine
with broad permissions.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Kent <[email protected]>
Cc: Colin Walters <[email protected]>
Cc: Ondrej Holy <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Ian Kent [Fri, 8 Sep 2017 23:16:24 +0000 (16:16 -0700)]
autofs: fix AT_NO_AUTOMOUNT not being honored
The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT which
is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering of an
automount by the call. But this flag is unconditionally cleared for all
stat family system calls except statx().
stat family system calls have always triggered mount requests for the
negative dentry case in follow_automount() which is intended but prevents
the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
In order to handle the AT_NO_AUTOMOUNT for both system calls the negative
dentry case in follow_automount() needs to be changed to return ENOENT
when the LOOKUP_AUTOMOUNT flag is clear (and the other required flags are
clear).
AFAICT this change doesn't have any noticeable side effects and may, in
some use cases (although I didn't see it in testing), prevent unnecessary
callbacks to the automount daemon.
It's also possible that a stat family call has been made with a path that
is in the process of being mounted by some other process. But stat family
calls should return the automount state of the path as it is "now" so it
shouldn't wait for mount completion.
This is the same semantic as the positive dentry case already handled.
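A minimal sketch of the idea (the exact set of LOOKUP_* flags tested in
follow_automount() is illustrative here, not the verbatim hunk):
	/* Don't trigger a mount for pure stat-family lookups. */
	if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY | LOOKUP_OPEN |
			   LOOKUP_CREATE | LOOKUP_AUTOMOUNT))) {
		if (path->dentry->d_inode)
			return -EISDIR;	/* positive dentry, as before */
		return -ENOENT;		/* negative dentry: report the "now" state */
	}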
Link: http://lkml.kernel.org/r/[email protected]
Fixes: deccf497d804a4c5fca ("Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()")
Signed-off-by: Ian Kent <[email protected]>
Cc: David Howells <[email protected]>
Cc: Colin Walters <[email protected]>
Cc: Ondrej Holy <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Daniel Micay [Fri, 8 Sep 2017 23:16:20 +0000 (16:16 -0700)]
init/main.c: extract early boot entropy from the passed cmdline
Feed the boot command-line to the /dev/random entropy pool.
Existing Android bootloaders usually pass data which may not be known by
an external attacker on the kernel command-line. It may also be the
case on other embedded systems. Sample command-line from a Google Pixel
running CopperheadOS:
console=ttyHSL0,115200,n8 androidboot.console=ttyHSL0
androidboot.hardware=sailfish user_debug=31 ehci-hcd.park=3
lpm_levels.sleep_disabled=1 cma=32M@0-0xffffffff buildvariant=user
veritykeyid=id:dfcb9db0089e5b3b4090a592415c28e1cb4545ab
androidboot.bootdevice=624000.ufshc androidboot.verifiedbootstate=yellow
androidboot.veritymode=enforcing androidboot.keymaster=1
androidboot.serialno=FA6CE0305299 androidboot.baseband=msm
mdss_mdp.panel=1:dsi:0:qcom,mdss_dsi_samsung_ea8064tg_1080p_cmd:1:none:cfg:single_dsi
androidboot.slot_suffix=_b fpsimd.fpsimd_settings=0
app_setting.use_app_setting=0 kernelflag=0x00000000 debugflag=0x00000000
androidboot.hardware.revision=PVT radioflag=0x00000000
radioflagex1=0x00000000 radioflagex2=0x00000000 cpumask=0x00000000
androidboot.hardware.ddr=4096MB,Hynix,LPDDR4 androidboot.ddrinfo=00000006
androidboot.ddrsize=4GB androidboot.hardware.color=GRA00
androidboot.hardware.ufs=32GB,Samsung androidboot.msm.hw_ver_id=268824801
androidboot.qf.st=2 androidboot.cid=11111111 androidboot.mid=G-2PW4100
androidboot.bootloader=8996-012001-1704121145
androidboot.oem_unlock_support=1 androidboot.fp_src=1
androidboot.htc.hrdump=detected androidboot.ramdump.opt=mem@2g:2g,mem@4g:2g
androidboot.bootreason=reboot androidboot.ramdump_enable=0 ro
root=/dev/dm-0 dm="system none ro,0 1 android-verity /dev/sda34"
rootwait skip_initramfs init=/init androidboot.wificountrycode=US
androidboot.boottime=1BLL:85,1BLE:669,2BLL:0,2BLE:1777,SW:6,KL:8136
Among other things, it contains a value unique to the device
(androidboot.serialno=FA6CE0305299), unique to the OS builds for the
device variant (veritykeyid=id:dfcb9db0089e5b3b4090a592415c28e1cb4545ab)
and timings from the bootloader stages in milliseconds
(androidboot.boottime=1BLL:85,1BLE:669,2BLL:0,2BLE:1777,SW:6,KL:8136).
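A minimal sketch of the change (hedged; assuming the command_line buffer
visible in start_kernel() is the one fed in, and noting that
add_device_randomness() stirs the pool without crediting entropy):
	/* init/main.c:start_kernel(), after setup_arch(&command_line) */
	add_device_randomness(command_line, strlen(command_line));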
[[email protected]: changelog tweak]
[[email protected]: line-wrapped command line]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Micay <[email protected]>
Signed-off-by: Laura Abbott <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Laura Abbott <[email protected]>
Cc: Nick Kralevich <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Laura Abbott [Fri, 8 Sep 2017 23:16:17 +0000 (16:16 -0700)]
init: move stack canary initialization after setup_arch
Patch series "Command line randomness", v3.
A series to add the kernel command line as a source of randomness.
This patch (of 2):
Stack canary initialization involves getting a random number. Getting this
random number may involve accessing caches or other architecture-specific
features which are not available until after the architecture is set up.
Move the stack canary initialization later to accommodate this.
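Roughly, the resulting ordering in start_kernel() (a sketch; surrounding
calls deliberately elided):
	setup_arch(&command_line);
	/* ... */
	boot_init_stack_canary();	/* moved: may now use arch facilities */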
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Laura Abbott <[email protected]>
Signed-off-by: Laura Abbott <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Nick Kralevich <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Markus Elfring [Fri, 8 Sep 2017 23:16:14 +0000 (16:16 -0700)]
binfmt_flat: delete two error messages for a failed memory allocation in decompress_exec()
Omit extra messages for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Markus Elfring <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jean Delvare [Fri, 8 Sep 2017 23:16:11 +0000 (16:16 -0700)]
checkpatch: add 6 missing types to --list-types
Unlike all other types, LONG_LINE, LONG_LINE_COMMENT and LONG_LINE_STRING
are passed to WARN() through a variable. This causes the parser in
list_types() to miss them and consequently they are not present in the
output of --list-types.
Additionally, types TYPO_SPELLING, FSF_MAILING_ADDRESS and AVOID_BUG are
passed with a variable level, causing the parser to miss them too.
So modify the regex to also catch these special cases.
Link: http://lkml.kernel.org/r/20170902175610.7e4a7c9d@endymion
Fixes: 3beb42eced39 ("checkpatch: add --list-types to show message types to show or ignore")
Signed-off-by: Jean Delvare <[email protected]>
Acked-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jean Delvare [Fri, 8 Sep 2017 23:16:07 +0000 (16:16 -0700)]
checkpatch: rename variables to avoid confusion
The variable name "$msg_type" is sometimes used to set the message type,
and sometimes used to set the message level. This works but is kind of
confusing. Use "$msg_level" in the latter case instead, to make the code
clearer.
Link: http://lkml.kernel.org/r/20170902175345.175db33a@endymion
Signed-off-by: Jean Delvare <[email protected]>
Acked-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jean Delvare [Fri, 8 Sep 2017 23:16:04 +0000 (16:16 -0700)]
checkpatch: fix typo in comment
Link: http://lkml.kernel.org/r/20170902175249.15bb77f2@endymion
Signed-off-by: Jean Delvare <[email protected]>
Acked-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Joe Perches [Fri, 8 Sep 2017 23:16:01 +0000 (16:16 -0700)]
checkpatch: add --strict check for ifs with unnecessary parentheses
An if statement test like
if ((foo == bar) && (baz != qux))
can arguably be better written without the parentheses as
if (foo == bar && baz != qux)
Add a test to find these cases.
Link: http://lkml.kernel.org/r/dcd0561ddd0fa43c51a420d53b550d738bf42001.1502734458.git.joe@perches.com
Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Takashi Iwai [Fri, 8 Sep 2017 23:15:58 +0000 (16:15 -0700)]
lib/oid_registry.c: X.509: fix the buffer overflow in the utility function for OID string
The sprint_oid() utility function doesn't properly check the buffer size,
which causes the warning in vsnprintf() to be triggered. For
example, on a v4.1 kernel:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 2357 at lib/vsprintf.c:1867 vsnprintf+0x5a7/0x5c0()
...
We can trigger this issue by injecting a maliciously crafted X.509 cert in
DER format: just use a hex editor to change the length of an OID to exceed
the length of the enclosing SEQUENCE container. For example:
0:d=0 hl=4 l= 980 cons: SEQUENCE
4:d=1 hl=4 l= 700 cons: SEQUENCE
8:d=2 hl=2 l= 3 cons: cont [ 0 ]
10:d=3 hl=2 l= 1 prim: INTEGER :02
13:d=2 hl=2 l= 9 prim: INTEGER :9B47FAF791E7D1E3
24:d=2 hl=2 l= 13 cons: SEQUENCE
26:d=3 hl=2 l= 9 prim: OBJECT :sha256WithRSAEncryption
37:d=3 hl=2 l= 0 prim: NULL
39:d=2 hl=2 l= 121 cons: SEQUENCE
41:d=3 hl=2 l= 22 cons: SET
43:d=4 hl=2 l= 20 cons: SEQUENCE <=== the SEQ length is 20
45:d=5 hl=2 l= 3 prim: OBJECT :organizationName
<=== the original length is 3, change the length of OID to over the length of SEQUENCE
Pawel Wieczorkiewicz reported this problem and Takashi Iwai provided a
patch to fix it by checking the bufsize in sprint_oid().
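A minimal sketch of the kind of check added (names follow sprint_oid()'s
locals; treat this as an illustration rather than the verbatim hunk):
	count = snprintf(buffer, bufsize, ".%lu", num);
	if (count >= bufsize)
		return -ENOBUFS;	/* would overflow; bail out instead */
	buffer += count;
	bufsize -= count;		/* can no longer wrap past zero */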
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: "Lee, Chun-Yi" <[email protected]>
Reported-by: Pawel Wieczorkiewicz <[email protected]>
Cc: David Howells <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Pawel Wieczorkiewicz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Eric Dumazet [Fri, 8 Sep 2017 23:15:54 +0000 (16:15 -0700)]
radix-tree: must check __radix_tree_preload() return value
__radix_tree_preload() only disables preemption if no error is returned.
So we really need to make sure callers always check the return value.
The idr_preload() contract is to always disable preemption, so we need
to add a missing preempt_disable() if an error happened.
Similarly, ida_pre_get() only needs to call preempt_enable() in the
case no error happened.
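A minimal sketch of the idr_preload() side of the fix (the ida_pre_get()
side is the mirror image, calling preempt_enable() only on success):
	void idr_preload(gfp_t gfp_mask)
	{
		/* Contract: preemption is disabled on return, even on error. */
		if (__radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE))
			preempt_disable();
	}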
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: <[email protected]> [4.11+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Baoquan He [Fri, 8 Sep 2017 23:15:51 +0000 (16:15 -0700)]
lib/cmdline.c: remove meaningless comment
One line of code was commented out with a C++-style comment for debugging,
but was never removed.
Clean it up.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Dan Carpenter [Fri, 8 Sep 2017 23:15:48 +0000 (16:15 -0700)]
lib/string.c: check for kmalloc() failure
This is mostly to keep the number of static checker warnings down so we
can spot new bugs instead of them being drowned in noise. This function
doesn't return normal kernel error codes but instead the return value is
used to display exactly which memory failed. I chose -1 as hopefully
that's a helpful thing to print.
Link: http://lkml.kernel.org/r/20170817115420.uikisjvfmtrqkzjn@mwanda
Signed-off-by: Dan Carpenter <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Heikki Krogerus <[email protected]>
Cc: Daniel Micay <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:15:45 +0000 (16:15 -0700)]
lib/rhashtable: fix comment on locks_mul default value
As of commit 4cf0b354d92 ("rhashtable: avoid large lock-array
allocations"), the default value for the locks multiplier was reduced
from 128 to 32.
Update the header file to reflect this.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Florian Westphal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Yury Norov [Fri, 8 Sep 2017 23:15:41 +0000 (16:15 -0700)]
bitmap: introduce BITMAP_FROM_U64()
The macro is the compile-time analogue of bitmap_from_u64() with the same
purpose: convert the 64-bit number to the properly ordered pair of 32-bit
parts, suitable for filling the bitmap in a 32-bit BE environment.
Use it to make test_bitmap_parselist() correct for 32-bit BE ABIs.
Tested on BE mips/qemu.
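A sketch of the macro (modulo the exact header plumbing): on 64-bit the
value passes through unchanged, while on 32-bit it expands to two array
initializers, low 32 bits first, matching bitmap word order.
	#if __BITS_PER_LONG == 64
	#define BITMAP_FROM_U64(n)	(n)
	#else
	#define BITMAP_FROM_U64(n)	((unsigned long) ((u64)(n) & ULONG_MAX)), \
					((unsigned long) ((u64)(n) >> 32))
	#endif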
[[email protected]: tweak code comment]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yury Norov <[email protected]>
Cc: Noam Camus <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Yury Norov [Fri, 8 Sep 2017 23:15:38 +0000 (16:15 -0700)]
lib/test_bitmap.c: add test for bitmap_parselist()
Do some basic checks for bitmap_parselist().
[[email protected]: fix printk warning]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yury Norov <[email protected]>
Cc: Noam Camus <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Yury Norov [Fri, 8 Sep 2017 23:15:34 +0000 (16:15 -0700)]
lib/bitmap.c: make bitmap_parselist() thread-safe and much faster
The current implementation of bitmap_parselist() uses a static variable to
save local state while setting bits in the bitmap. It is obviously wrong
if we assume execution in a multiprocessor environment. Fortunately, it's
possible to rewrite this portion of code to avoid using the static
variable.
It is also possible to set bits in the mask per-range with bitmap_set(),
rather than per-bit with set_bit() as is implemented now, which is way
faster (see the sketch below).
The important side effect of this change is that setting bits in this
function from now is not per-bit atomic and less memory-ordered. This is
because set_bit() guarantees the order of memory accesses, while
bitmap_set() does not. I think that it is the advantage of the new
approach, because the bitmap_parselist() is intended to initialise bit
arrays, and user should protect the whole bitmap during initialisation if
needed. So protecting individual bits looks expensive and useless. Also,
other range-oriented functions in lib/bitmap.c don't worry much about
atomicity.
With all that, setting 2k bits in a map with a pattern like 0-2047:128/256
becomes ~50 times faster after applying the patch in my testing
environment (arm64 hosted on qemu).
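A condensed sketch of the per-range fill for one "a-b:used/group" term of
the parse (variable names are illustrative):
	while (a <= b) {
		off = min(b - a + 1, used_size);
		bitmap_set(maskp, a, off);	/* one call per chunk, not per bit */
		a += group_size;
	}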
The second patch of the series adds the test for bitmap_parselist(). It's
not intended to cover all tricky cases, just to make sure that I didn't
screw up during rework.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yury Norov <[email protected]>
Cc: Noam Camus <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Florian Fainelli [Fri, 8 Sep 2017 23:15:31 +0000 (16:15 -0700)]
lib: add test module for CONFIG_DEBUG_VIRTUAL
Add a test module that allows testing that CONFIG_DEBUG_VIRTUAL works
correctly, at least that it can catch invalid calls to virt_to_phys()
against the non-linear kernel virtual address map.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Florian Fainelli <[email protected]>
Cc: "Luis R. Rodriguez" <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Andy Shevchenko [Fri, 8 Sep 2017 23:15:28 +0000 (16:15 -0700)]
lib/hexdump.c: return -EINVAL in case of error in hex2bin()
In some cases the caller would like to use the error code directly,
without shadowing it.
-EINVAL feels like the right code to return in case of error in hex2bin().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:15:25 +0000 (16:15 -0700)]
block/cfq: cache rightmost rb_node
For the same reasons we already cache the leftmost pointer, apply the same
optimization for rb_last() calls. Users must explicitly do this as
rb_root_cached only deals with the smallest node.
[[email protected]: brain fart #1]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:15:21 +0000 (16:15 -0700)]
mem/memcg: cache rightmost node
Such that we can optimize __mem_cgroup_largest_soft_limit_node(). The
only overhead is the extra footprint for the cached pointer, but this
should not be an issue for mem_cgroup_tree_per_node.
[[email protected]: brain fart #2]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:15:18 +0000 (16:15 -0700)]
fs/epoll: use faster rb_first_cached()
... such that we can avoid the tree walks to get the node with the
smallest key. Semantically the same, as the previously used rb_first(),
but O(1). The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for epoll.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:15:15 +0000 (16:15 -0700)]
procfs: use faster rb_first_cached()
... such that we can avoid the tree walks to get the node with the
smallest key. Semantically the same, as the previously used rb_first(),
but O(1). The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for procfs.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:15:12 +0000 (16:15 -0700)]
lib/interval-tree: correct comment wrt generic flavor
interval_tree.h _is_ the generic flavor.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:15:08 +0000 (16:15 -0700)]
lib/interval_tree: fast overlap detection
Allow interval trees to quickly check for overlaps to avoid unnecessary
tree lookups in interval_tree_iter_first().
As of this patch, all interval tree flavors will require using a
'rb_root_cached' such that we can have the leftmost node easily
available. While most users will make use of this feature, those with
special functions (in addition to the generic insert, delete, search
calls) will avoid using the cached option as they can do funky things
with insertions -- for example, vma_interval_tree_insert_after().
[[email protected]: fix deadlock from typo vm_lock_anon_vma()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Jérôme Glisse <[email protected]>
Acked-by: Christian König <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Doug Ledford <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Christian Benvenuti <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:15:05 +0000 (16:15 -0700)]
block/cfq: replace cfq_rb_root leftmost caching
... with the generic rbtree flavor instead. No changes
in semantics whatsoever.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:15:01 +0000 (16:15 -0700)]
locking/rtmutex: replace top-waiter and pi_waiters leftmost caching
... with the generic rbtree flavor instead. No changes
in semantics whatsoever.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:14:58 +0000 (16:14 -0700)]
sched/deadline: replace earliest dl and rq leftmost caching
... with the generic rbtree flavor instead. No changes
in semantics whatsoever.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:14:55 +0000 (16:14 -0700)]
sched/fair: replace cfs_rq->rb_leftmost
... with the generic rbtree flavor instead. No changes
in semantics whatsoever.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:14:52 +0000 (16:14 -0700)]
lib/rbtree_test.c: support rb_root_cached
We can work with a single rb_root_cached root to test both cached and
non-cached rbtrees. In addition, also add a test to measure latencies
between rb_first and its fast counterpart.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:14:49 +0000 (16:14 -0700)]
lib/rbtree_test.c: add (inorder) traversal test
This adds a second test for regular rb-tree testing in that there is no
need to repeat it for the augmented flavor.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:14:46 +0000 (16:14 -0700)]
lib/rbtree_test.c: make input module parameters
Allows for more flexible debugging.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:14:42 +0000 (16:14 -0700)]
rbtree: add some additional comments for rebalancing cases
While overall the code is very nicely commented, it might not be
immediately obvious from the diagrams what is going on. Add a very
brief summary of each case. Opposite cases where the node is the left
child are left untouched.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:14:39 +0000 (16:14 -0700)]
rbtree: optimize root-check during rebalancing loop
The only times the nil-parent (root node) condition is true is when the
node is the first in the tree, or after fixing rbtree rule #4 and the
case 1 rebalancing made the node the root. Such conditions do not apply
most of the time:
(i) The common case in an rbtree is to have more than a single node,
so this is only true for the first rb_insert().
(ii) While there is a chance only one first rotation is needed, cases
where the node's uncle is black (cases 2,3) are more common as we can
have the following scenarios during the rotation looping:
case1 only, case1+1, case2+3, case1+2+3, case3 only, etc.
This patch, therefore, adds an unlikely() optimization to this
conditional. When profiling with CONFIG_PROFILE_ANNOTATED_BRANCHES, a
kernel build shows that the incorrect rate is less than 15%, and
insert-mostly workloads tend to have an incorrect rate of less than 2%
over time.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Davidlohr Bueso [Fri, 8 Sep 2017 23:14:36 +0000 (16:14 -0700)]
rbtree: cache leftmost node internally
Patch series "rbtree: Cache leftmost node internally", v4.
A series extending rbtrees to internally cache the leftmost node, such
that we can have a fast overlap-check optimization for all interval tree
users[1]. The benefits of this series are that:
(i) Unify users that do internal leftmost node caching.
(ii) Optimize all interval tree users.
(iii) Convert at least two new users (epoll and procfs) to the new interface.
This patch (of 16):
Red-black tree semantics imply that nodes with smaller or greater (or
equal for duplicates) keys always be to the left and right,
respectively. For the kernel this is extremely evident when considering
our rb_first() semantics. Enabling lookups for the smallest node in the
tree in O(1) can save a good chunk of cycles in not having to walk down
the tree each time. To this end there are a few core users that
explicitly do this, such as the scheduler and rtmutexes. There is also
the desire for interval trees to have this optimization allowing faster
overlap checking.
This patch introduces a new 'struct rb_root_cached' which is just the
root with a cached pointer to the leftmost node. The reason why the
regular rb_root was not extended instead of adding a new structure was
that this allows the user to have the choice between memory footprint
and actual tree performance. The new wrappers on top of the regular
rb_root calls are:
- rb_first_cached(cached_root) -- which is a fast replacement
for rb_first.
- rb_insert_color_cached(node, cached_root, new)
- rb_erase_cached(node, cached_root)
In addition, augmented cached interfaces are also added for basic
insertion and deletion operations, which becomes important for the
interval tree changes.
With the exception of the inserts, which add a bool for updating the
new leftmost, the interfaces are kept the same. To this end, porting rb
users to the cached version becomes really trivial, and keeping current
rbtree semantics for users that don't care about the optimization
requires zero overhead.
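A condensed sketch of the new structure and one of the wrappers (see
include/linux/rbtree.h for the full set):
	struct rb_root_cached {
		struct rb_root rb_root;
		struct rb_node *rb_leftmost;
	};
	#define rb_first_cached(root)	(root)->rb_leftmost
	static inline void rb_insert_color_cached(struct rb_node *node,
						  struct rb_root_cached *root,
						  bool leftmost)
	{
		if (leftmost)		/* caller tracked the insert position */
			root->rb_leftmost = node;
		rb_insert_color(node, &root->rb_root);
	}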
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Matthias Kaehlcke [Fri, 8 Sep 2017 23:14:33 +0000 (16:14 -0700)]
bitops: avoid integer overflow in GENMASK(_ULL)
GENMASK(_ULL) performs a left-shift of ~0UL(L), which technically
results in an integer overflow. clang raises a warning if the overflow
occurs in a preprocessor expression. Clear the low-order bits through a
subtraction instead of the left-shift to avoid the overflow.
(akpm: no change in .text size in my testing)
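The reworked macros look roughly like this: subtracting (1UL << l) from
~0UL (plus one) yields a mask with bits l and above set, with no
overflowing shift of ~0UL itself.
	#define GENMASK(h, l) \
		(((~0UL) - (1UL << (l)) + 1) & \
		 (~0UL >> (BITS_PER_LONG - 1 - (h))))
	#define GENMASK_ULL(h, l) \
		(((~0ULL) - (1ULL << (l)) + 1) & \
		 (~0ULL >> (BITS_PER_LONG_LONG - 1 - (h))))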
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthias Kaehlcke <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Babu Moger [Fri, 8 Sep 2017 23:14:29 +0000 (16:14 -0700)]
include: warn for inconsistent endian config definition
We have seen some generic code use config parameter CONFIG_CPU_BIG_ENDIAN
to decide the endianness.
Here are the few examples.
include/asm-generic/qrwlock.h
drivers/of/base.c
drivers/of/fdt.c
drivers/tty/serial/earlycon.c
drivers/tty/serial/serial_core.c
Display a warning if CPU_BIG_ENDIAN is not defined on a big endian
architecture, and also warn if it is defined on little endian architectures.
Here is our original discussion
https://lkml.org/lkml/2017/5/24/620
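A sketch of the check (one per fixed-endian generic byteorder header; the
exact guard and wording of the warning may differ from this illustration):
	/* include/linux/byteorder/big_endian.h */
	#if defined(__KERNEL__) && !defined(CONFIG_CPU_BIG_ENDIAN)
	#warning inconsistent configuration, needs CONFIG_CPU_BIG_ENDIAN
	#endif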
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Babu Moger <[email protected]>
Suggested-by: Arnd Bergmann <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]> (powerpc)
Cc: Michal Simek <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Babu Moger [Fri, 8 Sep 2017 23:14:25 +0000 (16:14 -0700)]
arch/microblaze: add choice for endianness and update Makefile
Microblaze architectures can be configured for either little or big endian
formats. Add a choice option for the user to select the correct endian
format (defaulting to big endian).
Also update the Makefile so the toolchain can compile for the format it is
configured for.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Babu Moger <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]> (powerpc)
Cc: Peter Zijlstra <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Babu Moger [Fri, 8 Sep 2017 23:14:22 +0000 (16:14 -0700)]
arch: define CPU_BIG_ENDIAN for all fixed big endian archs
Patch series "Define CPU_BIG_ENDIAN or warn for inconsistencies", v3.
While working on enabling queued rwlocks on SPARC, I found the following
code in include/asm-generic/qrwlock.h, which uses CONFIG_CPU_BIG_ENDIAN to
clear a byte.
static inline u8 *__qrwlock_write_byte(struct qrwlock *lock)
{
	return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
}
The problem is that many of the fixed big endian architectures don't define
CPU_BIG_ENDIAN, so this clears the wrong byte.
Define CPU_BIG_ENDIAN for all the fixed big endian architecture to fix it.
Also found few more references of this config parameter in
drivers/of/base.c
drivers/of/fdt.c
drivers/tty/serial/earlycon.c
drivers/tty/serial/serial_core.c
Be aware that this may cause regressions if someone has already worked
around problems in the above code. Remove the work-around.
Here is our original discussion
https://lkml.org/lkml/2017/5/24/620
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Babu Moger <[email protected]>
Suggested-by: Arnd Bergmann <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Acked-by: David S. Miller <[email protected]>
Acked-by: Stafford Horne <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Michael Ellerman <[email protected]> (powerpc)
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Greg KH <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Alexey Dobriyan [Fri, 8 Sep 2017 23:14:18 +0000 (16:14 -0700)]
treewide: make "nr_cpu_ids" unsigned
First, the number of CPUs can't be negative.
Second, different signedness leads to suboptimal code in the following
cases:
1)
kmalloc(nr_cpu_ids * sizeof(X));
"int" has to be sign extended to size_t.
2)
while (loff_t *pos < nr_cpu_ids)
MOVSXD is 1 byte longer than the same MOV.
Other cases exist as well. Basically the compiler is told that nr_cpu_ids
can't be negative, which can't be deduced if it is "int".
Code savings on allyesconfig kernel: -3KB
add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
function old new delta
coretemp_cpu_online 450 512 +62
rcu_init_one 1234 1272 +38
pci_device_probe 374 399 +25
...
pgdat_reclaimable_pages 628 556 -72
select_fallback_rq 446 369 -77
task_numa_find_cpu 1923 1807 -116
Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Matthew Wilcox [Fri, 8 Sep 2017 23:14:15 +0000 (16:14 -0700)]
vga: optimise console scrolling
Where possible, call memset16(), memmove() or memcpy() instead of using
open-coded loops. I don't like the calling convention that uses a byte
count instead of a count of u16s, but it's a little late to change that.
Reduces code size of fbcon.o by almost 400 bytes on my laptop build.
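For instance, the generic scr_memsetw() can become a thin wrapper (a
sketch; note the byte count being converted to a count of u16s):
	static inline void scr_memsetw(u16 *s, u16 c, unsigned int count)
	{
		memset16(s, c, count / 2);
	}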
[[email protected]: fix build]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: David Miller <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Matthew Wilcox [Fri, 8 Sep 2017 23:14:11 +0000 (16:14 -0700)]
drivers/scsi/sym53c8xx_2/sym_hipd.c: convert to use memset32
memset32() can be used to initialise these three arrays. Minor code
footprint reduction.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Matthew Wilcox [Fri, 8 Sep 2017 23:14:07 +0000 (16:14 -0700)]
drivers/block/zram/zram_drv.c: convert to using memset_l
zram was the motivation for creating memset_l(). Minchan Kim sees a 7%
performance improvement on x86 with 100MB of non-zero deduplicatable
data:
perf stat -r 10 dd if=/dev/zram0 of=/dev/null
vanilla:  0.232050465 seconds time elapsed ( +- 0.51% )
memset_l: 0.217219387 seconds time elapsed ( +- 0.07% )
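The fill path becomes roughly (a sketch of the converted helper):
	static void zram_fill_page(void *ptr, unsigned long len,
				   unsigned long value)
	{
		WARN_ON_ONCE(!IS_ALIGNED(len, sizeof(unsigned long)));
		memset_l(ptr, value, len / sizeof(unsigned long));
	}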
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Tested-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Matthew Wilcox [Fri, 8 Sep 2017 23:14:04 +0000 (16:14 -0700)]
alpha: add support for memset16
Alpha already had an optimised fill-memory-with-16-bit-quantity
assembler routine called memsetw(). It has a slightly different calling
convention from memset16() in that it takes a byte count, not a count of
words. That's the same convention used by ARM's __memset routines, so
rename Alpha's routine to match and add a memset16() wrapper around it.
Then convert Alpha's scr_memsetw() to call memset16() instead of
memsetw().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Matthew Wilcox [Fri, 8 Sep 2017 23:14:00 +0000 (16:14 -0700)]
ARM: implement memset32 & memset64
Reuse the existing optimised memset implementation to implement an
optimised memset32 and memset64.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Russell King <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Matthew Wilcox [Fri, 8 Sep 2017 23:13:56 +0000 (16:13 -0700)]
x86: implement memset16, memset32 & memset64
These are single instructions on x86. There's no 64-bit instruction for
x86-32, but we don't yet have any user for memset64() on 32-bit
architectures, so don't bother to implement it.
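The 16-bit variant looks roughly like this (a hedged sketch; the 32- and
64-bit versions are the same shape with stosl/stosq):
	static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
	{
		long d0, d1;
		/* rep stosw: store v to [rdi], n times */
		asm volatile("rep stosw"
			     : "=&c" (d0), "=&D" (d1)
			     : "a" (v), "1" (s), "0" (n)
			     : "memory");
		return s;
	}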
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Matthew Wilcox [Fri, 8 Sep 2017 23:13:52 +0000 (16:13 -0700)]
lib/string.c: add testcases for memset16/32/64
[[email protected]: minor tweaks]
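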
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Matthew Wilcox [Fri, 8 Sep 2017 23:13:48 +0000 (16:13 -0700)]
lib/string.c: add multibyte memset functions
Patch series "Multibyte memset variations", v4.
A relatively common idiom we're missing is a function to fill an area of
memory with a pattern which is larger than a single byte. I first
noticed this with a zram patch which wanted to fill a page with an
'unsigned long' value. There turn out to be quite a few places in the
kernel which can benefit from using an optimised function rather than a
loop; sometimes text size, sometimes speed, and sometimes both. The
optimised PowerPC version (not included here) improves performance by
about 30% on POWER8 on just the raw memset_l().
Most of the extra lines of code come from the three testcases I added.
This patch (of 8):
memset16(), memset32() and memset64() are like memset(), but allow the
caller to fill the destination with a value larger than a single byte.
memset_l() and memset_p() allow the caller to use unsigned long and
pointer values respectively.
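The generic fallback is a plain loop; a sketch of memset32() (memset16()
and memset64() are analogous, and memset_l() maps to whichever matches
BITS_PER_LONG):
	void *memset32(uint32_t *s, uint32_t v, size_t count)
	{
		uint32_t *xs = s;
		while (count--)
			*xs++ = v;
		return s;
	}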
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Masahiro Yamada [Fri, 8 Sep 2017 23:13:45 +0000 (16:13 -0700)]
linux/kernel.h: move DIV_ROUND_DOWN_ULL() macro
This macro is useful to avoid a link error on 32-bit systems.
We have the same definition in two drivers, so move it to
include/linux/kernel.h.
While we are here, refactor DIV_ROUND_UP_ULL() by using
DIV_ROUND_DOWN_ULL().
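A sketch of the pair (do_div() is what makes this safe on 32-bit, where a
plain 64-bit "/" would pull in a libgcc helper and fail to link):
	#define DIV_ROUND_DOWN_ULL(ll, d) \
		({ unsigned long long _tmp = (ll); do_div(_tmp, d); _tmp; })
	#define DIV_ROUND_UP_ULL(ll, d) \
		DIV_ROUND_DOWN_ULL((unsigned long long)(ll) + (d) - 1, (d))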
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Mark Brown <[email protected]>
Cc: Cyrille Pitchen <[email protected]>
Cc: Jaroslav Kysela <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Liam Girdwood <[email protected]>
Cc: Boris Brezillon <[email protected]>
Cc: Marek Vasut <[email protected]>
Cc: Brian Norris <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: David Woodhouse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
David Rientjes [Fri, 8 Sep 2017 23:13:41 +0000 (16:13 -0700)]
fs, proc: unconditional cond_resched when reading smaps
If there are large numbers of hugepages to iterate while reading
/proc/pid/smaps, the page walk never does cond_resched(). On archs
without split pmd locks, there can be significant and observable
contention on mm->page_table_lock, which causes lengthy delays without
rescheduling.
Always reschedule in smaps_pte_range() if necessary since the pagewalk
iteration can be expensive.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Alexey Dobriyan [Fri, 8 Sep 2017 23:13:38 +0000 (16:13 -0700)]
proc: uninline proc_create()
Save some code from ~320 invocations, all of which clear the last argument.
add/remove: 3/0 grow/shrink: 0/158 up/down: 45/-702 (-657)
function old new delta
proc_create - 17 +17
__ksymtab_proc_create - 16 +16
__kstrtab_proc_create - 12 +12
yam_init_driver 301 298 -3
...
cifs_proc_init 249 228 -21
via_fb_pci_probe 2304 2280 -24
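The out-of-line version is just the obvious wrapper (a sketch):
	struct proc_dir_entry *proc_create(const char *name, umode_t mode,
					   struct proc_dir_entry *parent,
					   const struct file_operations *proc_fops)
	{
		return proc_create_data(name, mode, parent, proc_fops, NULL);
	}
	EXPORT_SYMBOL(proc_create);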
Link: http://lkml.kernel.org/r/20170819094702.GA27864@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Michal Hocko [Fri, 8 Sep 2017 23:13:35 +0000 (16:13 -0700)]
fs, proc: remove priv argument from is_stack
Commit b18cb64ead40 ("fs/proc: Stop trying to report thread stacks")
removed the priv parameter user in is_stack, so the argument is
redundant. Drop it.
[[email protected]: remove unused variable]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Anshuman Khandual [Fri, 8 Sep 2017 23:13:32 +0000 (16:13 -0700)]
mm/mempolicy.c: remove BUG_ON() checks for VMA inside mpol_misplaced()
VMA and its address bounds checks are too late in this function. They
must have been verified earlier in the page fault sequence. Hence just
remove them.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Suggested-by: Vlastimil Babka <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
David Rientjes [Fri, 8 Sep 2017 23:13:29 +0000 (16:13 -0700)]
mm/swapfile.c: fix swapon frontswap_map memory leak on error
Free frontswap_map if an error is encountered before enable_swap_info().
Signed-off-by: David Rientjes <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]> [4.12+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Darrick J. Wong [Fri, 8 Sep 2017 23:13:25 +0000 (16:13 -0700)]
mm: kvfree the swap cluster info if the swap file is unsatisfactory
If initializing a small swap file fails because the swap file has a
problem (holes, etc.) then we need to free the cluster info as part of
cleanup. Unfortunately a previous patch changed the code to use kvzalloc
but did not change all the vfree calls to use kvfree.
Found by running generic/357 from xfstests.
Link: http://lkml.kernel.org/r/20170831233515.GR3775@magnolia
Fixes: 54f180d3c181 ("mm, swap: use kvzalloc to allocate some swap data structures")
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]> [4.12+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Tetsuo Handa [Fri, 8 Sep 2017 23:13:22 +0000 (16:13 -0700)]
mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attempt
We are erroneously initializing alloc_flags before gfp_allowed_mask is
applied. This could cause problems after pm_restrict_gfp_mask() is called
during a suspend operation. Apply gfp_allowed_mask before initializing
alloc_flags so that the first allocation attempt uses the correct flags.
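The fix is just an ordering change early in __alloc_pages_nodemask(),
roughly (a sketch; the surrounding helpers are elided):
	gfp_mask &= gfp_allowed_mask;	/* apply the restriction first ... */
	alloc_mask = gfp_mask;		/* ... so the first attempt sees it too */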
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 83d4ca8148fd9092 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath")
Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Cyrill Gorcunov [Fri, 8 Sep 2017 23:13:19 +0000 (16:13 -0700)]
tools/testing/selftests/kcmp/kcmp_test.c: add KCMP_EPOLL_TFD testing
KCMP's KCMP_EPOLL_TFD mode was merged in commit 0791e3644e5ef2 ("kcmp: add
KCMP_EPOLL_TFD mode to compare epoll target files"), but we've had no
selftest for it yet (except in the criu development list). Thus add it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Cyrill Gorcunov <[email protected]>
Cc: Andrey Vagin <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Michal Hocko [Fri, 8 Sep 2017 23:13:15 +0000 (16:13 -0700)]
mm/sparse.c: fix typo in online_mem_sections
online_mem_sections() accidentally marks online only the first section
in the given range. This is a typo which hasn't been noticed because I
haven't tested large 2GB blocks previously. All users of
pfn_to_online_page() would get confused on the rest of the pfn range
in the block.
All we need to fix this is to use the iterator (pfn) rather than start_pfn.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes")
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Laurent Dufour [Fri, 8 Sep 2017 23:13:12 +0000 (16:13 -0700)]
mm/memory.c: fix mem_cgroup_oom_disable() call missing
Seen while reading the code: in handle_mm_fault(), in the case where
arch_vma_access_permitted() fails, the call to
mem_cgroup_oom_disable() is not made.
To fix that, move the call to mem_cgroup_oom_enable() to after the call to
arch_vma_access_permitted(), as it should not have entered the memcg OOM.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: bae473a423f6 ("mm: introduce fault_env")
Signed-off-by: Laurent Dufour <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Roman Gushchin [Fri, 8 Sep 2017 23:13:09 +0000 (16:13 -0700)]
mm: memcontrol: use per-cpu stocks for socket memory uncharging
We've noticed a quite noticeable performance overhead on some hosts with
significant network traffic when socket memory accounting is enabled.
Perf top shows that socket memory uncharging path is hot:
2.13% [kernel] [k] page_counter_cancel
1.14% [kernel] [k] __sk_mem_reduce_allocated
1.14% [kernel] [k] _raw_spin_lock
0.87% [kernel] [k] _raw_spin_lock_irqsave
0.84% [kernel] [k] tcp_ack
0.84% [kernel] [k] ixgbe_poll
0.83% < workload >
0.82% [kernel] [k] enqueue_entity
0.68% [kernel] [k] __fget
0.68% [kernel] [k] tcp_delack_timer_handler
0.67% [kernel] [k] __schedule
0.60% < workload >
0.59% [kernel] [k] __inet6_lookup_established
0.55% [kernel] [k] __switch_to
0.55% [kernel] [k] menu_select
0.54% libc-2.20.so [.] __memcpy_avx_unaligned
To address this issue, the existing per-cpu stock infrastructure can be
used.
refill_stock() can be called from mem_cgroup_uncharge_skmem() to move
charge to a per-cpu stock instead of calling atomic
page_counter_uncharge().
To prevent the uncontrolled growth of per-cpu stocks, refill_stock()
will explicitly drain the cached charge, if the cached value exceeds
CHARGE_BATCH.
This significantly optimizes the load:
1.21% [kernel] [k] _raw_spin_lock
1.01% [kernel] [k] ixgbe_poll
0.92% [kernel] [k] _raw_spin_lock_irqsave
0.90% [kernel] [k] enqueue_entity
0.86% [kernel] [k] tcp_ack
0.85% < workload >
0.74% perf-11120.map [.] 0x000000000061bf24
0.73% [kernel] [k] __schedule
0.67% [kernel] [k] __fget
0.63% [kernel] [k] __inet6_lookup_established
0.62% [kernel] [k] menu_select
0.59% < workload >
0.59% [kernel] [k] __switch_to
0.57% libc-2.20.so [.] __memcpy_avx_unaligned
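The uncharge side then looks roughly like this (a hedged sketch of the
relevant lines in mem_cgroup_uncharge_skmem(); the stats helper is assumed
from the surrounding code):
	mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
	refill_stock(memcg, nr_pages);	/* was: page_counter_uncharge() */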
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Shakeel Butt [Fri, 8 Sep 2017 23:13:05 +0000 (16:13 -0700)]
mm: fadvise: avoid fadvise for fs without backing device
The fadvise() manpage is silent on fadvise()'s effect on memory-based
filesystems (shmem, hugetlbfs & ramfs) and pseudo file systems (procfs,
sysfs, kernfs). The current implementation of fadvise is mostly a noop
for such filesystems, except for FADV_DONTNEED which will trigger
expensive remote LRU cache draining. This patch makes the noop of
fadvise() on such file systems very explicit.
However this change has two side effects for ramfs and one for tmpfs.
First fadvise(FADV_DONTNEED) could remove the unmapped clean zero'ed
pages of ramfs (allocated through read, readahead & read fault) and
tmpfs (allocated through read fault). Also fadvise(FADV_WILLNEED) could
create such clean zero'ed pages for ramfs. This change removes those
possibilities.
One of our generic libraries does fadvise(FADV_DONTNEED). Recently we
observed high latency in fadvise() and noticed that the users have
started using tmpfs files and the latency was due to expensive remote
LRU cache draining. For normal tmpfs files (have data written on them),
fadvise(FADV_DONTNEED) will always trigger the unneeded remote cache
draining.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Greg Thelen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Matthias Kaehlcke [Fri, 8 Sep 2017 23:13:02 +0000 (16:13 -0700)]
mm/zsmalloc.c: change stat type parameter to int
zs_stat_inc/dec/get() use enum zs_stat_type for the stat type, but
some callers pass an enum fullness_group value. Change the type to int to
reflect the actual use of the functions and get rid of 'enum-conversion'
warnings.
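A sketch of the adjusted helper (body as in mm/zsmalloc.c; only the type
of the second parameter changes):

	static inline void zs_stat_inc(struct size_class *class, int type,
				       unsigned long cnt)
	{
		class->stats.objs[type] += cnt;
	}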
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthias Kaehlcke <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Doug Anderson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Joonsoo Kim [Fri, 8 Sep 2017 23:12:59 +0000 (16:12 -0700)]
mm/mlock.c: use page_zone() instead of page_zone_id()
page_zone_id() is a specialized function for comparing the zone of pages
that are within the same memory section. If the pages' sections are
different, page_zone_id() can differ even if their zone is the same.
This wrong usage doesn't cause any actual problem, since
__munlock_pagevec_fill() would be called again with the failed index.
However, it's better to use the more appropriate function here.
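The substance of the fix, sketched against the pte loop in
__munlock_pagevec_fill() in mm/mlock.c (zone is the zone the caller is
operating on):

	page = vm_normal_page(vma, start, *pte);
	/*
	 * Compare the actual zone, not the zone id, which is only
	 * meaningful for pages within the same memory section.
	 */
	if (!page || page_zone(page) != zone)
		break;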
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Kemi Wang [Fri, 8 Sep 2017 23:12:55 +0000 (16:12 -0700)]
mm: consider the number in local CPUs when reading NUMA stats
To avoid deviation, the per-cpu portion of the NUMA stats in
vm_numa_stat_diff[] is included when a user *reads* the NUMA stats.
Since NUMA stats are not read frequently, and the kernel does not
need them to make decisions, it is not a problem to make the readers
more expensive.
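A sketch of the snapshot taken on the read side (per the patch; the
per-cpu diffs are folded into the global atomic counter's value):

	static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
						enum numa_stat_item item)
	{
		long x = atomic_long_read(&zone->vm_numa_stat[item]);
		int cpu;

		/* add what each CPU has accumulated locally but not yet flushed */
		for_each_online_cpu(cpu)
			x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];

		return x;
	}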
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemi Wang <[email protected]>
Reported-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Ying Huang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Kemi Wang [Fri, 8 Sep 2017 23:12:52 +0000 (16:12 -0700)]
mm: update NUMA counter threshold size
There is significant cache-bouncing overhead caused by zone counters
(NUMA-associated counters) being updated in parallel in multi-threaded
page allocation (suggested by Dave Hansen).
This patch updates the NUMA counter threshold to a fixed size of MAX_U16 - 2,
as a small threshold greatly increases the update frequency of the global
counter from the local per-cpu counters (suggested by Ying Huang).
The rationale is that these statistics counters don't affect the
kernel's decisions, unlike other VM counters, so it's not a problem to use
a large threshold.
With this patchset, we see a 31.3% drop in CPU cycles (537-->369) for
single page allocation and reclaim on Jesper's page_bench03 benchmark.
Benchmark provided by Jesper D Brouer (increase loop times to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
Threshold   CPU cycles    Throughput(88 threads)
32          799           241760478
64          640           301628829
125         537           358906028  <==> system by default (base)
256         468           412397590
512         428           450550704
4096        399           482520943
20000       394           489009617
30000       395           488017817
65533       369(-31.3%)   521661345(+45.3%)  <==> with this patchset
N/A         342(-36.3%)   562900157(+56.8%)  <==> disable zone_statistics
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemi Wang <[email protected]>
Reported-by: Jesper Dangaard Brouer <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Ying Huang <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Kemi Wang [Fri, 8 Sep 2017 23:12:48 +0000 (16:12 -0700)]
mm: change the call sites of numa statistics items
Patch series "Separate NUMA statistics from zone statistics", v2.
Each page allocation updates a set of per-zone statistics with a call to
zone_statistics(). As discussed at the 2017 MM summit, these are a
substantial source of overhead in the page allocator and are very rarely
consumed. This overhead comes from cache bouncing caused by zone
counters (NUMA-associated counters) being updated in parallel in
multi-threaded page allocation (pointed out by Dave Hansen).
A link to the MM summit slides:
http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
To mitigate this overhead, this patchset separates NUMA statistics from
the zone statistics framework and updates the NUMA counter threshold to a
fixed size of MAX_U16 - 2, as a small threshold greatly increases the
update frequency of the global counter from the local per-cpu counters
(suggested by Ying Huang). The rationale is that these statistics
counters don't need to be read often, unlike other VM counters, so it's
not a problem to use a large threshold and make readers more expensive.
With this patchset, we see a 31.3% drop in CPU cycles (537-->369, see
below) for single page allocation and reclaim on Jesper's
page_bench03 benchmark. Meanwhile, this patchset keeps the same style
of virtual memory statistics with little end-user-visible effect (it
only moves the NUMA stats to appear after the zone page stats; see the
first patch for details).
I did an experiment with single page allocation and reclaim running
concurrently, using Jesper's page_bench03 benchmark on a 2-socket
Broadwell-based server (88 processors with 126G memory), with different
sizes of the pcp counter threshold.
Benchmark provided by Jesper D Brouer (increase loop times to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
Threshold   CPU cycles    Throughput(88 threads)
32          799           241760478
64          640           301628829
125         537           358906028  <==> system by default
256         468           412397590
512         428           450550704
4096        399           482520943
20000       394           489009617
30000       395           488017817
65533       369(-31.3%)   521661345(+45.3%)  <==> with this patchset
N/A         342(-36.3%)   562900157(+56.8%)  <==> disable zone_statistics
This patch (of 3):
In this patch, NUMA statistics are separated from the zone statistics
framework, and all the call sites of NUMA stats are changed to use
numa-stats-specific functions. There is no functional change except
that the NUMA stats are shown after the zone page stats when users
*read* the zone info.
E.g. cat /proc/zoneinfo
***Base*** ***With this patch***
nr_free_pages 3976 nr_free_pages 3976
nr_zone_inactive_anon 0 nr_zone_inactive_anon 0
nr_zone_active_anon 0 nr_zone_active_anon 0
nr_zone_inactive_file 0 nr_zone_inactive_file 0
nr_zone_active_file 0 nr_zone_active_file 0
nr_zone_unevictable 0 nr_zone_unevictable 0
nr_zone_write_pending 0 nr_zone_write_pending 0
nr_mlock 0 nr_mlock 0
nr_page_table_pages 0 nr_page_table_pages 0
nr_kernel_stack 0 nr_kernel_stack 0
nr_bounce 0 nr_bounce 0
nr_zspages 0 nr_zspages 0
numa_hit 0 *nr_free_cma 0*
numa_miss 0 numa_hit 0
numa_foreign 0 numa_miss 0
numa_interleave 0 numa_foreign 0
numa_local 0 numa_interleave 0
numa_other 0 numa_local 0
*nr_free_cma 0* numa_other 0
... ...
vm stats threshold: 10 vm stats threshold: 10
... ...
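A sketch of the separation introduced here (item names per the patch;
the new enum gets its own per-zone atomic counters and per-cpu diffs,
parallel to the existing zone stats):

	enum numa_stat_item {
		NUMA_HIT,		/* allocated in intended node */
		NUMA_MISS,		/* allocated in non intended node */
		NUMA_FOREIGN,		/* was intended here, hit elsewhere */
		NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
		NUMA_LOCAL,		/* allocation from local node */
		NUMA_OTHER,		/* allocation from other node */
		NR_VM_NUMA_STAT_ITEMS
	};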
The next patch updates the numa stats counter size and threshold.
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemi Wang <[email protected]>
Reported-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ying Huang <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Anshuman Khandual [Fri, 8 Sep 2017 23:12:45 +0000 (16:12 -0700)]
mm/memory.c: remove redundant check for write access
The flags argument has been copied into vmf.flags and is not changed in
between. Hence a single write access check can be used for both PUD and
PMD.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Andrea Arcangeli [Fri, 8 Sep 2017 23:12:42 +0000 (16:12 -0700)]
userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS
This is an enhancement to avoid a non-cooperative userfaultfd manager
having to unregister all regions before it can close the uffd after all
userfaultfd activity has completed.
UFFDIO_UNREGISTER would serialize against handle_userfault by
taking the mmap_sem for writing, but we can simply repeat the page fault
if we detect that the uffd was closed, so the regular page fault paths
take over.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Laurent Dufour [Fri, 8 Sep 2017 23:12:39 +0000 (16:12 -0700)]
mm: remove useless vma parameter to offset_il_node
While reading the code I found that offset_il_node() has a vm_area_struct
pointer parameter which is unused.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Laurent Dufour <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:12:35 +0000 (16:12 -0700)]
mm/hmm: fix build when HMM is disabled
Combinatorial Kconfig is painful. With this patch all of the combinations
below build:
1) (no HMM option set)
2) CONFIG_HMM_MIRROR=y
3) CONFIG_DEVICE_PRIVATE=y
4) CONFIG_DEVICE_PUBLIC=y
5) CONFIG_HMM_MIRROR=y CONFIG_DEVICE_PUBLIC=y
6) CONFIG_HMM_MIRROR=y CONFIG_DEVICE_PRIVATE=y
7) CONFIG_DEVICE_PRIVATE=y CONFIG_DEVICE_PUBLIC=y
8) CONFIG_HMM_MIRROR=y CONFIG_DEVICE_PRIVATE=y CONFIG_DEVICE_PUBLIC=y
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:12:32 +0000 (16:12 -0700)]
mm/hmm: avoid bloating arch that do not make use of HMM
This moves all new code including new page migration helper behind kernel
Kconfig option so that there is no codee bloat for arch or user that do
not want to use HMM or any of its associated features.
arm allyesconfig (without all the patchset, then with and this patch):
text data bss dec hex filename
83721896 46511131 27582964 157815991 96814b7 ../without/vmlinux
83722364 46511131 27582964 157816459 968168b vmlinux
[[email protected]: struct hmm is only used by HMM mirror functionality]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix build (arm multi_v7_defconfig)]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:12:28 +0000 (16:12 -0700)]
mm/hmm: add new helper to hotplug CDM memory region
Unlike unaddressable memory, coherent device memory has a real resource
associated with it on the system (as the CPU can address it). Add a new
helper to hotplug such memory within the HMM framework.
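A sketch of the new helper's shape (per the patch; it mirrors
hmm_devmem_add() but takes the device's real resource instead of picking
an unused physical address range):

	struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
						   struct device *device,
						   struct resource *res);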
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Reviewed-by: Balbir Singh <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:12:24 +0000 (16:12 -0700)]
mm/device-public-memory: device memory cache coherent with CPU
Platforms with an advanced system bus (like CAPI or CCIX) allow device
memory to be accessible from the CPU in a cache-coherent fashion. Add a
new type of ZONE_DEVICE to represent such memory. The use cases are the
same as for the un-addressable device memory, but without all the corner
cases.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:12:21 +0000 (16:12 -0700)]
mm/migrate: allow migrate_vma() to alloc new page on empty entry
This allows callers of migrate_vma() to allocate a new page for an empty
CPU page table entry (pte_none or backed by the zero page). This is only
for anonymous memory, and it won't allow a new page to be instantiated if
userfaultfd is armed.
This is useful for device drivers that want to migrate a range of virtual
addresses and would rather allocate new memory than have to fault later
on.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:12:17 +0000 (16:12 -0700)]
mm/migrate: support un-addressable ZONE_DEVICE page in migration
Allow unmapping and restoring the special swap entry of un-addressable
ZONE_DEVICE memory.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:12:13 +0000 (16:12 -0700)]
mm/migrate: migrate_vma() unmap page from vma while collecting pages
The common case for migration of a virtual address range is that pages
are mapped only once inside the vma in which migration is taking place.
Because we already walk the CPU page table for that range, we can
directly do the unmap there and set up the special migration swap entry.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:12:09 +0000 (16:12 -0700)]
mm/migrate: new memory migration helper for use with device memory
This patch adds a new memory migration helper, which migrates the memory
backing a range of virtual addresses of a process to different memory
(which can be allocated through a special allocator). It differs from
NUMA migration by working on a range of virtual addresses, and thus by
doing migration in chunks that can be large enough to use a DMA engine or
a special copy-offloading engine.
Expected users are anyone with heterogeneous memory where different
memories have different characteristics (latency, bandwidth, ...). As an
example, IBM platforms with a CAPI bus can make use of this feature to
migrate between regular memory and CAPI device memory. New CPU
architectures with a pool of high-performance memory not managed as a
cache but presented as regular memory (while being faster and with lower
latency than DDR) will also be prime users of this patch.
Migration to private device memory will be useful for devices that have
a large pool of such memory, like GPUs; NVidia plans to use HMM for that.
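A sketch of the helper's shape as introduced by this series (signatures
per the patch, bodies elided; the two callbacks bracket the kernel's
collect/unmap and remap steps so the driver can do the actual copy, e.g.
with its DMA engine):

	struct migrate_vma_ops {
		void (*alloc_and_copy)(struct vm_area_struct *vma,
				       const unsigned long *src,
				       unsigned long *dst,
				       unsigned long start,
				       unsigned long end,
				       void *private);
		void (*finalize_and_map)(struct vm_area_struct *vma,
					 const unsigned long *src,
					 const unsigned long *dst,
					 unsigned long start,
					 unsigned long end,
					 void *private);
	};

	int migrate_vma(const struct migrate_vma_ops *ops,
			struct vm_area_struct *vma,
			unsigned long start,
			unsigned long end,
			unsigned long *src,
			unsigned long *dst,
			void *private);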
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:12:06 +0000 (16:12 -0700)]
mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
Introduce a new migration mode that allows offloading the copy to a
device DMA engine. This changes the workflow of migration, and not all
address_space migratepage callbacks can support this.
This is intended to be used by migrate_vma(), which itself is used for
things like HMM (see include/linux/hmm.h).
No additional per-filesystem migratepage testing is needed. I disabled
MIGRATE_SYNC_NO_COPY in all problematic migratepage() callbacks and
added comments to those to explain why (part of this patch). To be
clear: any callback that wishes to support this new mode needs to be
aware of the difference in the migration flow from the other modes.
Some of these callbacks do extra locking while copying (aio, zsmalloc,
balloon, ...), and for DMA to be effective you want to copy multiple
pages in one DMA operation. But in the problematic cases you cannot
easily hold the extra lock across multiple calls to this callback.
The usual flow is:
For each page {
1 - lock page
2 - call migratepage() callback
3 - (extra locking in some migratepage() callback)
4 - migrate page state (freeze refcount, update page cache, buffer
head, ...)
5 - copy page
6 - (unlock any extra lock of migratepage() callback)
7 - return from migratepage() callback
8 - unlock page
}
The new mode MIGRATE_SYNC_NO_COPY:
1 - lock multiple pages
For each page {
2 - call migratepage() callback
3 - abort in all problematic migratepage() callback
4 - migrate page state (freeze refcount, update page cache, buffer
head, ...)
} // finished all calls to migratepage() callback
5 - DMA copy multiple pages
6 - unlock all the pages
To support MIGRATE_SYNC_NO_COPY in the problematic cases we would need a
new callback, migratepages() for instance, that deals with multiple
pages in one transaction.
Because the problematic cases are not important for current usage, I did
not want to complicate this patchset even more for no good reason.
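For reference, a sketch of where the new mode sits (per
include/linux/migrate_mode.h after this patch; comments paraphrased):

	enum migrate_mode {
		MIGRATE_ASYNC,		/* never blocks */
		MIGRATE_SYNC_LIGHT,	/* may block, but not on writeback */
		MIGRATE_SYNC,		/* may block and wait */
		MIGRATE_SYNC_NO_COPY,	/* migrate state, caller does the copy */
	};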
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:12:02 +0000 (16:12 -0700)]
mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory
This introduces a dummy HMM device class so a device driver can use it to
create an hmm_device for the sole purpose of registering device memory.
It is useful to device drivers that want to manage multiple physical
device memories under the same struct device umbrella.
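A sketch of the helper pair (per the patch; the returned struct wraps a
plain struct device plus a minor number):

	struct hmm_device *hmm_device_new(void *drvdata);
	void hmm_device_put(struct hmm_device *hmm_device);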
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:11:58 +0000 (16:11 -0700)]
mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
This introduces a simple struct and associated helpers for device drivers
to use when hotplugging un-addressable device memory as ZONE_DEVICE. It
will find an unused physical address range and trigger memory hotplug for
it, which allocates and initializes struct pages for the device memory.
Device drivers should use this helper during device initialization to
hotplug the device memory. They should only need to remove the memory
once the device is going offline (shutdown or hotremove). There should
not be any userspace API to hotplug memory, except maybe for a host
device driver to allow adding more memory to a guest device driver.
The device's memory is managed by the device driver, and HMM only
provides helpers to that effect.
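A sketch of the entry point (per the patch; ops carries the page_free
and page_fault callbacks the driver must implement):

	struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
					  struct device *device,
					  unsigned long size);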
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:11:54 +0000 (16:11 -0700)]
mm/memcontrol: support MEMORY_DEVICE_PRIVATE
HMM pages (private or public device pages) are ZONE_DEVICE pages and thus
need special handling when it comes to lru or refcount. This patch makes
sure that memcontrol properly handles those pages when it faces them.
Those pages are used like regular pages in a process address space,
either as anonymous pages or as file-backed pages. So from the memcg
point of view we want to handle them like regular pages, for now at
least.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:11:50 +0000 (16:11 -0700)]
mm/memcontrol: allow to uncharge page without using page->lru field
HMM pages (private or public device pages) are ZONE_DEVICE pages, and
thus the page->lru fields of those pages cannot be used. This patch
re-arranges the uncharge path to allow a single page to be uncharged
without modifying the lru field of the struct page.
There is no change to the memcontrol logic; it is the same as it was
before this patch.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:11:46 +0000 (16:11 -0700)]
mm/ZONE_DEVICE: special case put_page() for device private pages
A ZONE_DEVICE page that reaches a refcount of 1 is free, ie it no longer
has any user. For device private pages this is important to catch, and
thus we need to special-case put_page() for this.
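A sketch of the special case (shape as in the merged series; the helper
name and config guard are abridged, and the public-memory check comes
from a later patch in the series):

	static inline void put_page(struct page *page)
	{
		page = compound_head(page);

		/*
		 * For device private pages we must catch the 2 -> 1 refcount
		 * transition: at refcount 1 the page is free from the kernel's
		 * point of view and the owning driver must be notified.
		 */
		if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
		    is_device_public_page(page))) {
			put_zone_device_private_or_public_page(page);
			return;
		}

		if (put_page_testzero(page))
			__put_page(page);
	}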
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:11:43 +0000 (16:11 -0700)]
mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory
HMM (heterogeneous memory management) needs struct page to support
migration from system main memory to device memory. The reasons for HMM
and migration to device memory are explained in the HMM core patch.
This patch deals with device memory that is un-addressable (ie the CPU
cannot access it). Hence we do not want those struct pages to be managed
like regular memory. That is why we extend ZONE_DEVICE to support
different types of memory.
A persistent memory type is defined for existing users of ZONE_DEVICE,
and a new device un-addressable type is added for the un-addressable
memory type. There is a clear separation between what is expected from
each memory type; existing users of ZONE_DEVICE are unaffected by the new
requirement and the new use of the un-addressable type. All type-specific
code paths are protected with tests against the memory type.
Because the memory is un-addressable, we use a new special swap type for
when a page is migrated to device memory (this reduces the maximum number
of swap files).
The two main additions besides the memory type to ZONE_DEVICE are two
callbacks. The first, page_free(), is called whenever the page refcount
reaches 1 (which means the page is free, as a ZONE_DEVICE page never
reaches a refcount of 0). This allows the device driver to manage its
memory and the associated struct pages.
The second callback, page_fault(), happens when there is a CPU access to
an address that is backed by a device page (which is un-addressable by
the CPU). This callback is responsible for migrating the page back to
system main memory. The device driver cannot block migration back to
system memory; HMM makes sure that such pages cannot be pinned in device
memory.
If the device is in some error condition and cannot migrate the memory
back, then a CPU page fault to device memory should end with SIGBUS.
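A sketch of the resulting types and hooks (names per this series as
merged, including the public type added later in the series; fields
abridged):

	enum memory_type {
		MEMORY_DEVICE_HOST = 0,	/* persistent memory, existing user */
		MEMORY_DEVICE_PRIVATE,	/* un-addressable device memory */
		MEMORY_DEVICE_PUBLIC,	/* cache-coherent device memory */
	};

	struct dev_pagemap {
		dev_page_fault_t page_fault;	/* CPU touched a device page */
		dev_page_free_t page_free;	/* refcount dropped to 1 */
		/* ... */
		enum memory_type type;
	};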
[[email protected]: fix warning]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Michal Hocko [Fri, 8 Sep 2017 23:11:39 +0000 (16:11 -0700)]
mm/memory_hotplug: introduce add_pages
There are new users of memory hotplug emerging. Some of them require a
different subset of arch_add_memory. Some only require allocation of
struct pages without mapping those pages into the kernel address space.
We currently have __add_pages for that purpose, but this is rather
low-level and not very suitable for code outside of memory hotplug.
E.g. x86_64 wants to update max_pfn, which should be done by the caller.
Introduce add_pages(), which should take care of those details if they
are needed. Each architecture should define its implementation and
select CONFIG_ARCH_HAS_ADD_PAGES; all others use the currently existing
__add_pages.
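A sketch of the fallback wiring (per the patch; an architecture that
selects CONFIG_ARCH_HAS_ADD_PAGES, x86_64 here, provides its own
definition that also updates things like max_pfn):

	#ifndef CONFIG_ARCH_HAS_ADD_PAGES
	static inline int add_pages(int nid, unsigned long start_pfn,
				    unsigned long nr_pages, bool want_memblock)
	{
		return __add_pages(nid, start_pfn, nr_pages, want_memblock);
	}
	#else /* ARCH_HAS_ADD_PAGES */
	int add_pages(int nid, unsigned long start_pfn,
		      unsigned long nr_pages, bool want_memblock);
	#endif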
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Jérôme Glisse <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:11:35 +0000 (16:11 -0700)]
mm/hmm/mirror: device page fault handler
This handles page faults on behalf of a device driver; unlike
handle_mm_fault(), it does not trigger migration of device memory back
to system memory.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:11:31 +0000 (16:11 -0700)]
mm/hmm/mirror: helper to snapshot CPU page table
This does not use the existing page table walker because we want to share
the same code with our page fault handler.
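A sketch of the snapshot API's shape (signatures as in this series'
include/linux/hmm.h; treat the details as illustrative):

	int hmm_vma_get_pfns(struct vm_area_struct *vma,
			     struct hmm_range *range,
			     unsigned long start,
			     unsigned long end,
			     hmm_pfn_t *pfns);
	/* returns false if the snapshot was invalidated while in use */
	bool hmm_vma_range_done(struct vm_area_struct *vma,
				struct hmm_range *range);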
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:11:27 +0000 (16:11 -0700)]
mm/hmm/mirror: mirror process address space on device with HMM helpers
This is heterogeneous memory management (HMM) process address space
mirroring. In a nutshell it provides an API to mirror a process address
space on a device. This boils down to keeping the CPU and device page
tables synchronized (we assume that both the device and the CPU are cache
coherent, as PCIe devices can be).
This patch provides a simple API for device drivers to achieve address
space mirroring, thus avoiding each device driver growing its own CPU
page table walker and its own CPU page table synchronization mechanism.
This is useful for NVidia GPUs >= Pascal, Mellanox IB >= mlx5, and more
hardware in the future.
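A sketch of the mirror registration API (shape per this series; the
single callback is driven from an mmu_notifier and tells the driver to
update its page table for the given range):

	struct hmm_mirror_ops {
		void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
						   enum hmm_update_type update_type,
						   unsigned long start,
						   unsigned long end);
	};

	int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
	void hmm_mirror_unregister(struct hmm_mirror *mirror);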
[[email protected]: fix hmm for "mmu_notifier kill invalidate_page callback"]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:11:23 +0000 (16:11 -0700)]
mm/hmm: heterogeneous memory management (HMM for short)
HMM provides 3 separate types of functionality:
- Mirroring: synchronize CPU page table and device page table
- Device memory: allocating struct page for device memory
- Migration: migrating regular memory to device memory
This patch introduces the common helpers and definitions shared by all
three types of functionality.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Jérôme Glisse [Fri, 8 Sep 2017 23:11:19 +0000 (16:11 -0700)]
hmm: heterogeneous memory management documentation
Patch series "HMM (Heterogeneous Memory Management)", v25.
Heterogeneous Memory Management (HMM) (description and justification)
Today device drivers expose a dedicated memory allocation API through
their device file, often relying on a combination of IOCTL and mmap
calls. The device can only access and use memory allocated through this
API. This effectively splits the program address space into objects
allocated for the device and usable by the device, and other regular
memory (malloc, mmap of a file, shared memory, ...) only accessible by
the CPU (or in a very limited way by a device, by pinning memory).
Allowing different isolated components of a program to use a device thus
requires duplication of the input data structures using the device memory
allocator. This is reasonable for simple data structures (arrays, grids,
images, ...) but gets extremely complex with advanced data structures
(lists, trees, graphs, ...) that rely on a web of memory pointers. This
is becoming a serious limitation on the kind of workload that can be
offloaded to devices like GPUs.
New industry standards like C++, OpenCL or CUDA are pushing to remove
this barrier. This requires a shared address space between the GPU device
and the CPU so that the GPU can access any memory of a process (while
still obeying memory protections like read-only). This kind of feature
is also appearing in various other operating systems.
HMM is a set of helpers to facilitate several aspects of address space
sharing and device memory management. Unlike existing sharing mechanisms
that rely on pinning pages used by a device, HMM relies on mmu_notifier
to propagate CPU page table updates to the device page table.
Duplicating the CPU page table is only one aspect necessary for
efficiently using a device like a GPU. GPU local memory has bandwidth in
the terabytes/second range, but it is connected to main memory through a
system bus like PCIe that is limited to 32 gigabytes/second (PCIe 4.0
16x). Thus it is necessary to allow migration of process memory from
main system memory to device memory. The issue is that on platforms that
only have PCIe, the device memory is not accessible by the CPU with the
same properties as main memory (cache coherency, atomic operations, ...).
To allow migration from main memory to device memory, HMM provides a set
of helpers to hotplug device memory as a new type of ZONE_DEVICE memory
which is un-addressable by the CPU but still has struct pages
representing it. This allows most of the core kernel logic that deals
with process memory to stay oblivious of the peculiarities of device
memory.
When a page backing an address of a process is migrated to device memory,
the CPU page table entry is set to a new specific swap entry. CPU access
to such an address triggers a migration back to system memory, just as if
the page had been swapped out to disk. HMM also blocks anyone from
pinning a ZONE_DEVICE page so that it can always be migrated back to
system memory if the CPU accesses it. Conversely, HMM does not migrate
to device memory any page that is pinned in system memory.
To allow efficient migration between device memory and main memory, a new
migrate_vma() helper is added with this patchset. It allows leveraging
the device DMA engine to perform the copy operation.
This feature will be used by upstream drivers like nouveau and mlx5, and
probably others in the future (amdgpu is the next suspect in line). We
are actively working on nouveau and mlx5 support. To test this patchset
we also worked with NVidia's closed-source driver team; they have more
resources than us to test this kind of infrastructure, and also a bigger
and better userspace ecosystem with various real industry workloads that
can be used to test and profile HMM.
The expected workload is that a program builds a data set on the CPU
(from disk, from network, from sensors, ...). The program uses the GPU
API (OpenCL, CUDA, ...) to give hints on memory placement for the input
data and also for the output buffer. The program calls the GPU API to
schedule a GPU job; this happens using a device-driver-specific ioctl.
All this is hidden from the programmer's point of view in the case of a
C++ compiler that transparently offloads some parts of a program to the
GPU. The program can keep doing other stuff on the CPU while the GPU is
crunching numbers.
It is expected that the CPU will not access the same data set as the GPU
while the GPU is working on it, but this is not mandatory. In fact we
expect some small memory objects to be actively accessed by both GPU and
CPU concurrently, as synchronization channels and/or for monitoring
purposes. Such objects will stay in system memory and should not be
bottlenecked by system bus bandwidth (rare write and read accesses from
both CPU and GPU).
As we are relying on the device driver API, HMM does not introduce any
new syscalls, nor does it modify any existing ones. It does not change
any POSIX semantics or behaviors. For instance, the child after a fork
of a process that is using HMM will not be impacted in any way, nor is
there any data hazard between child COW or parent COW of memory that was
migrated to the device prior to the fork.
HMM assumes a number of hardware features. The device must allow the
device page table to be updated at any time (ie device jobs must be
preemptable). The device page table must provide memory protection such
as read-only. The device must track write accesses (dirty bit). The
device must have a minimum granularity that matches PAGE_SIZE (ie 4k).
Reviewer (just a hint):
Patch 1  HMM documentation
Patch 2  introduce core infrastructure and definition of HMM, pretty
         small patch and easy to review
Patch 3  introduce the mirror functionality of HMM; it relies on
         mmu_notifier, and thus someone familiar with that part would be
         in a better position to review
Patch 4  is a helper to snapshot the CPU page table while synchronizing
         with concurrent page table updates. Understanding mmu_notifier
         makes review easier.
Patch 5  is mostly a wrapper around handle_mm_fault()
Patch 6  add new add_pages() helper to avoid modifying each arch memory
         hotplug function
Patch 7  add a new memory type for ZONE_DEVICE and also add all the logic
         in various core mm to support this new type. Dan Williams and
         any core mm contributor are the best people to review each half
         of this patchset
Patch 8  special case HMM ZONE_DEVICE pages inside put_page(); Kirill and
         Dan Williams are the best people to review this
Patch 9  allow to uncharge a page from a memory cgroup without using the
         lru list field of struct page (best reviewer: Johannes Weiner or
         Vladimir Davydov or Michal Hocko)
Patch 10 add support to uncharge ZONE_DEVICE pages from a memory cgroup
         (best reviewer: Johannes Weiner or Vladimir Davydov or Michal
         Hocko)
Patch 11 add helper to hotplug un-addressable device memory as a new type
         of ZONE_DEVICE memory (new type introduced in patch 3 of this
         series). This is boilerplate code around memory hotplug; it
         also picks a free range of physical addresses for the device
         memory. Note that the physical addresses do not point to
         anything (at least as far as the kernel knows).
Patch 12 introduce a new hmm_device class as a helper for device drivers
         that want to expose multiple device memories under a common fake
         device driver. This is useful for multi-GPU configurations.
         Anyone familiar with device driver infrastructure can review
         this. Boilerplate code really.
Patch 13 add a new migrate mode. Anyone familiar with page migration is
         welcome to review.
Patch 14 introduce a new migration helper (migrate_vma()) that allows
         migrating a range of virtual addresses of a process using a
         device DMA engine to perform the copy. It is not limited to
         copies from and to the device, but can also copy between any
         kind of source and destination memory. Again, anyone familiar
         with the migration code should be able to verify the logic.
Patch 15 optimize the new migrate_vma() by unmapping pages while we are
         collecting them. This can be reviewed by any mm folks.
Patch 16 add unaddressable memory migration to the helper introduced in
         patch 7; this can be reviewed by anyone familiar with the
         migration code
Patch 17 add a feature that allows a device to allocate non-present pages
         on the GPU when migrating a range of addresses to device memory.
         This is a helper for device drivers to avoid having to first
         allocate system memory before migration to device memory
Patch 18 add a new kind of ZONE_DEVICE memory for cache coherent device
         memory (CDM)
Patch 19 add a helper to hotplug CDM memory
Previous patchset posting :
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423
v5 https://lkml.org/lkml/2014/11/3/759
v6 http://lwn.net/Articles/619737/
v7 http://lwn.net/Articles/627316/
v8 https://lwn.net/Articles/645515/
v9 https://lwn.net/Articles/651553/
v10 https://lwn.net/Articles/654430/
v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
v12 http://www.kernelhub.org/?msg=972982&p=2
v13 https://lwn.net/Articles/706856/
v14 https://lkml.org/lkml/2016/12/8/344
v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html
v16 http://www.spinics.net/lists/linux-mm/msg119814.html
v17 https://lkml.org/lkml/2017/1/27/847
v18 https://lkml.org/lkml/2017/3/16/596
v19 https://lkml.org/lkml/2017/4/5/831
v20 https://lwn.net/Articles/720715/
v21 https://lkml.org/lkml/2017/4/24/747
v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html
v23 https://www.mail-archive.com/[email protected]/msg1404788.html
v24 https://lwn.net/Articles/726691/
This patch (of 19):
This adds documentation for HMM (Heterogeneous Memory Management). It
presents the motivation behind it, the features necessary for it to be
useful, and gives an overview of how this is implemented.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Naoya Horiguchi [Fri, 8 Sep 2017 23:11:15 +0000 (16:11 -0700)]
mm: memory_hotplug: memory hotremove supports thp migration
This patch enables thp migration for memory hotremove.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Naoya Horiguchi [Fri, 8 Sep 2017 23:11:12 +0000 (16:11 -0700)]
mm: migrate: move_pages() supports thp migration
This patch enables thp migration for move_pages(2).
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Naoya Horiguchi [Fri, 8 Sep 2017 23:11:08 +0000 (16:11 -0700)]
mm: mempolicy: mbind and migrate_pages support thp migration
This patch enables thp migration for mbind(2) and migrate_pages(2).
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Naoya Horiguchi [Fri, 8 Sep 2017 23:11:04 +0000 (16:11 -0700)]
mm: soft-dirty: keep soft-dirty bits over thp migration
The soft dirty bit is designed to be preserved across page migration.
This patch makes it work in the same manner for thp migration too.
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Zi Yan [Fri, 8 Sep 2017 23:11:01 +0000 (16:11 -0700)]
mm: thp: check pmd migration entry in common path
When THP migration is being used, memory management code needs to handle
pmd migration entries properly. This patch uses !pmd_present() or
is_swap_pmd() (depending on whether pmd_none() needs separate code or
not) to check pmd migration entries at the places where a pmd entry is
present.
Since pmd-related code uses split_huge_page(), split_huge_pmd(),
pmd_trans_huge(), pmd_trans_unstable(), or
pmd_none_or_trans_huge_or_clear_bad(), this patch:
1. adds pmd migration entry split code in split_huge_pmd(),
2. takes care of pmd migration entries whenever pmd_trans_huge() is present,
3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.
Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
them.
After this commit, a pmd entry should be one of the following:
1. pointing to a pte page,
2. is_swap_pmd(),
3. pmd_trans_huge(),
4. pmd_devmap(), or
5. pmd_none().
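For reference, a sketch of the new predicate used in those checks (per
the patch; a pmd that is neither none nor present holds a swap entry,
which for thp migration is a pmd migration entry):

	static inline int is_swap_pmd(pmd_t pmd)
	{
		return !pmd_none(pmd) && !pmd_present(pmd);
	}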
Signed-off-by: Zi Yan <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Zi Yan [Fri, 8 Sep 2017 23:10:57 +0000 (16:10 -0700)]
mm: thp: enable thp migration in generic path
Add thp migration's core code, including conversions between a PMD entry
and a swap entry, setting a PMD migration entry, removing a PMD migration
entry, and waiting on PMD migration entries.
This patch makes it possible to support thp migration. If you fail to
allocate a destination page as a thp, you just split the source thp as
we do now and then enter the normal page migration path. If you succeed
in allocating a destination thp, you enter thp migration. Subsequent
patches actually enable thp migration for each caller of page migration
by allowing its get_new_page() callback to allocate thps.
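A sketch of the PMD <-> swap entry conversions added here (per the patch,
under CONFIG_ARCH_ENABLE_THP_MIGRATION; the __pmd_to_swp_entry /
__swp_entry_to_pmd halves are provided per-arch):

	static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
	{
		swp_entry_t arch_entry;

		arch_entry = __pmd_to_swp_entry(pmd);
		return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
	}

	static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
	{
		swp_entry_t arch_entry;

		arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
		return __swp_entry_to_pmd(arch_entry);
	}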
[[email protected]: fix gcc-4.9.0 -Wmissing-braces warning]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix x86_64 allnoconfig warning]
Signed-off-by: Zi Yan <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Naoya Horiguchi [Fri, 8 Sep 2017 23:10:53 +0000 (16:10 -0700)]
mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
functionality to x86_64, which should be safer as a first step.
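A hedged sketch of the generic-side guard this enables; the merged
series adds a helper along these lines so callers can fall back to
splitting the thp on architectures that do not select the option:
static inline bool thp_migration_supported(void)
{
	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
}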
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Naoya Horiguchi [Fri, 8 Sep 2017 23:10:49 +0000 (16:10 -0700)]
mm: thp: introduce separate TTU flag for thp freezing
TTU_MIGRATION is used to convert ptes into migration entries while a
thp split is in progress. This behavior conflicts with the thp
migration added by later patches in this series, so let's introduce a
new TTU flag specifically for freezing.
try_to_unmap() is used both for thp split (via freeze_page()) and page
migration (via __unmap_and_move()). In freeze_page(), ttu_flag given
for head page is like below (assuming anonymous thp):
(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)
and ttu_flag given for tail pages is:
(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
TTU_MIGRATION)
__unmap_and_move() calls try_to_unmap() with ttu_flag:
(TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)
Now I'm trying to insert a branch for thp migration at the top of
try_to_unmap_one() like below:
static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
		unsigned long address, void *arg)
{
	...
	/* this branch sits inside the page_vma_mapped_walk() loop,
	   hence the continue statements */
	/* PMD-mapped THP migration entry */
	if (!pvmw.pte && (flags & TTU_MIGRATION)) {
		if (!PageAnon(page))
			continue;
		set_pmd_migration_entry(&pvmw, page);
		continue;
	}
	...
}
so try_to_unmap() for tail pages called by thp split can go into the
thp migration code path (which converts a *pmd* into a migration
entry), while the expectation is to freeze the thp (which converts a
*pte* into a migration entry).
I detected this failure as a "bad page state" error in a testcase where
split_huge_page() is called from queue_pages_pte_range().
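A hedged sketch of the fix (the dedicated flag is TTU_SPLIT_FREEZE in
the merged code): freeze_page() stops passing TTU_MIGRATION, so the
branch above fires only for real page migration:
	/* freeze_page(), before: the migration flag doubled as "freeze" */
	ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
		    TTU_RMAP_LOCKED | TTU_MIGRATION;

	/* freeze_page(), after: a dedicated freeze flag; only
	   __unmap_and_move() still passes TTU_MIGRATION */
	ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
		    TTU_RMAP_LOCKED | TTU_SPLIT_FREEZE;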
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Naoya Horiguchi [Fri, 8 Sep 2017 23:10:46 +0000 (16:10 -0700)]
mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
_PAGE_PSE is used to distinguish between a truly non-present
(_PAGE_PRESENT=0) PMD, and a PMD which is undergoing a THP split and
should be treated as present.
But _PAGE_SWP_SOFT_DIRTY currently uses the _PAGE_PSE bit, which would
cause confusion between one of those PMDs undergoing a THP split and a
soft-dirty PMD. Dropping the _PAGE_PSE check in pmd_present() does not
work well either, because it would hurt the TLB-handling optimization
in thp split. Thus, we need to move the bit.
In the current kernel, bits 1-4 are not used in the non-present format
since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
Bit 7 is used as reserved (always clear), so please don't use it for
other purposes.
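A hedged sketch of the resulting x86 definition (pgtable_types.h
naming; bit 1 is _PAGE_RW in the present format and free in the
non-present one):
/* before: aliased bit 7 (_PAGE_PSE), colliding with the thp-split marker */
#define _PAGE_SWP_SOFT_DIRTY	_PAGE_PSE

/* after: reuse bit 1, which is unused in the non-present format */
#define _PAGE_SWP_SOFT_DIRTY	_PAGE_RW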
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Naoya Horiguchi [Fri, 8 Sep 2017 23:10:42 +0000 (16:10 -0700)]
mm: mempolicy: add queue_pages_required()
Patch series "mm: page migration enhancement for thp", v9.
Motivations:
1. THP migration becomes important in the upcoming heterogeneous memory
systems. As David Nellans from NVIDIA pointed out in other threads
(http://www.mail-archive.com/[email protected]/msg1349227.html),
future GPUs or other accelerators will have their memory managed by
operating systems. Moving data into and out of these memory nodes
efficiently is critical to applications that use GPUs or other
accelerators. Existing page migration only supports base pages, which
results in very low memory bandwidth utilization. My experiments (see
below) show THP migration can migrate pages more efficiently.
2. Base page migration vs THP migration throughput.
Here are cross-socket page migration results from calling the
move_pages() syscall (a minimal usage sketch follows this list):
On x86_64, an Intel two-socket E5-2640v3 box,
- a single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
- a single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
- migrating 512 4KB base pages takes 1987.38 us, using 0.98 GB/s BW.
On ppc64, a two-socket Power8 box,
- a single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
- a single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
- migrating 256 64KB base pages takes 2543.65 us, using 6.14 GB/s BW.
THP migration thus gives about 3x (2.97 vs 0.98 GB/s) and 1.15x (7.10
vs 6.14 GB/s) the throughput of base page migration on x86_64 and
ppc64 respectively.
You can test it out using the code here:
https://github.com/x-y-z/thp-migration-bench
3. Existing page migration splits a THP before migration and cannot
guarantee that the migrated pages are still contiguous. Contiguity is
always what GPUs and accelerators look for. Without THP migration,
khugepaged needs to do extra work to reassemble the migrated pages
back into THPs.
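For context, a minimal userspace sketch of the kind of move_pages(2)
call the benchmark issues (libnuma's numaif.h; the target node and
buffer sizing are illustrative, and a real THP measurement would also
need madvise(MADV_HUGEPAGE) plus timing around the call):
#include <numaif.h>	/* move_pages(2); link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t len = 2UL << 20;			/* one 2MB THP worth */
	void *buf = aligned_alloc(2UL << 20, len);
	void *pages[1] = { buf };
	int nodes[1] = { 1 };			/* illustrative target node */
	int status[1];

	memset(buf, 1, len);			/* fault the memory in */
	if (move_pages(0 /* self */, 1, pages, nodes, status,
		       MPOL_MF_MOVE) != 0)
		perror("move_pages");
	else
		printf("first page now on node %d\n", status[0]);
	return 0;
}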
This patch (of 10):
Introduce a separate check routine for the MPOL_MF_INVERT flag (a
sketch of the factored-out helper follows). This patch is just a
cleanup; there is no behavioral change.
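A hedged sketch of the factored-out helper (named
queue_pages_required() per the subject; struct queue_pages is assumed
to carry the nodemask and flags of the mempolicy walk):
static inline bool queue_pages_required(struct page *page,
					struct queue_pages *qp)
{
	int nid = page_to_nid(page);
	unsigned long flags = qp->flags;

	/* queue iff node membership matches, inverted by MPOL_MF_INVERT */
	return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
}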
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Nellans <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>