Commit Graph

261 Commits

Author SHA1 Message Date
Lioncash
1591f208c0
memory: Move AddressSpaceDispatch from AddressSpace to FlatView
As we are going to share FlatView's between AddressSpace's,
and AddressSpaceDispatch is a structure to perform quick lookup
in FlatView, this moves ASD to FlatView.

After previosly open coded ASD rendering, we can also remove
as->next_dispatch as the new FlatView pointer is stored
on a stack and set to an AS atomically.

flatview_destroy() is executed under RCU instead of
address_space_dispatch_free() now.

This makes mem_begin/mem_commit to work with ASD and mem_add with FV
as later on mem_add will be taking FV as an argument anyway.

This should cause no behavioural change.

Backports commit 66a6df1dc6d5b28cc3e65db0d71683fbdddc6b62 from qemu
2018-03-11 20:40:24 -04:00
Alex Bennée
e56ed38819
include/exec/helper-head.h: support f16 in helper calls
This allows us to explicitly pass float16 to helpers rather than
assuming uint32_t and dealing with the result. Of course they will be
passed in i32 sized registers by default.

Backports commit 35737497008aeabce5dc381a41d3827bec486192 from qemu
2018-03-08 12:28:05 -05:00
Paolo Bonzini
c88064b52c
memory: remove memory_region_test_and_clear_dirty
It is unused after g364fb has been converted to use DirtyBitmapSnapshot.

Backports commit 77302fb5df05ffca9f41b5b54e3b67c601719d57 from qemu
2018-03-08 09:02:06 -05:00
Laurent Vivier
0aecb15f3b
accel/tcg: add size paremeter in tlb_fill()
The MC68040 MMU provides the size of the access that
triggers the page fault.

This size is set in the Special Status Word which
is written in the stack frame of the access fault
exception.

So we need the size in m68k_cpu_unassigned_access() and
m68k_cpu_handle_mmu_fault().

To be able to do that, this patch modifies the prototype of
handle_mmu_fault handler, tlb_fill() and probe_write().
do_unassigned_access() already includes a size parameter.

This patch also updates handle_mmu_fault handlers and
tlb_fill() of all targets (only parameter, no code change).

Backports commit 98670d47cd8d63a529ff230fd39ddaa186156f8c from qemu
2018-03-06 10:56:34 -05:00
Richard Henderson
7fe5f620df
tcg: Dynamically allocate TCGOps
With no fixed array allocation, we can't overflow a buffer.
This will be important as optimizations related to host vectors
may expand the number of ops used.

Use QTAILQ to link the ops together.

Backports commit 15fa08f8451babc88d733bd411d4c94976f9d0f8 from qemu
2018-03-05 16:34:40 -05:00
Peter Xu
1bb34aadf9
cpu: refactor cpu_address_space_init()
Normally we create an address space for that CPU and pass that address
space into the function. Let's just do it inside to unify address space
creations. It'll simplify my next patch to rename those address spaces.

Backports commit 80ceb07a83375e3a0091591f96bd47bce2f640ce from qemu
2018-03-05 14:39:25 -05:00
Marc-André Lureau
ffa45adb57
memory: remove unused memory_region_set_global_locking()
This was never used since its introduction in commit
196ea13104f8 ("memory: Add global-locking property to memory
regions").

Backports commit e2fbe20851ceec5ccd7b539a89db0420393fb85d from qemu
2018-03-05 14:14:43 -05:00
Richard Henderson
d450156414
tcg: Remove GET_TCGV_* and MAKE_TCGV_*
The GET and MAKE functions weren't really specific enough.
We now have a full complement of functions that convert exactly
between temporaries, arguments, tcgv pointers, and indices.

The target/sparc change is also a bug fix, which would have affected
a host that defines TCG_TARGET_HAS_extr[lh]_i64_i32, i.e. MIPS64.

Backports commit dc41aa7d34989b552efe712ffe184236216f960b from qemu
2018-03-05 09:12:26 -05:00
Richard Henderson
2bb5011b18
tcg: Introduce tcgv_{i32,i64,ptr}_{arg,temp}
Transform TCGv_* to an "argument" or a temporary.
For now, an argument is simply the temporary index.

Backports commit ae8b75dc6ec808378487064922f25f1e7ea7a9be from qemu
2018-03-05 08:46:12 -05:00
Emilio G. Cota
8552d95c52
exec-all: extract tb->tc_* into a separate struct tc_tb
In preparation for adding tc.size to be able to keep track of
TB's using the binary search tree implementation from glib.

Backports commit e7e168f41364c6e83d0f75fc1b3ce7f9c41ccf76 from qemu
2018-03-05 02:57:22 -05:00
Emilio G. Cota
5fc83f3eb2
exec-all: introduce TB_PAGE_ADDR_FMT
And fix the following warning when DEBUG_TB_INVALIDATE is enabled
in translate-all.c:

CC mipsn32-linux-user/accel/tcg/translate-all.o
/data/src/qemu/accel/tcg/translate-all.c: In function ‘tb_alloc_page’:
/data/src/qemu/accel/tcg/translate-all.c:1201:16: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘tb_page_addr_t {aka unsigned int}’ [-Werror=format=]
printf("protecting code page: 0x" TARGET_FMT_lx "\n",
^
cc1: all warnings being treated as errors
/data/src/qemu/rules.mak:66: recipe for target 'accel/tcg/translate-all.o' failed
make[1]: *** [accel/tcg/translate-all.o] Error 1
Makefile:328: recipe for target 'subdir-mipsn32-linux-user' failed
make: *** [subdir-mipsn32-linux-user] Error 2
cota@flamenco:/data/src/qemu/build ((18f3fe1...) *$)$

Backports commit 67a5b5d2f6eb6d3b980570223ba5c478487ddb6f from qemu
2018-03-05 02:49:44 -05:00
Emilio G. Cota
b4a7d8b773
exec-all: bring tb->invalid into tb->cflags
This gets rid of a hole in struct TranslationBlock.

Backports commit 84f1c148da2b35fbb5a436597872765257e8914e from qemu
2018-03-05 02:46:21 -05:00
Emilio G. Cota
210d13ec49
tcg: consolidate TB lookups in tb_lookup__cpu_state
This avoids duplicating code. cpu_exec_step will also use the
new common function once we integrate parallel_cpus into tb->cflags.

Note that in this commit we also fix a race, described by Richard Henderson
during review. Think of this scenario with threads A and B:

(A) Lookup succeeds for TB in hash without tb_lock
(B) Sets the TB's tb->invalid flag
(B) Removes the TB from tb_htable
(B) Clears all CPU's tb_jmp_cache
(A) Store TB into local tb_jmp_cache

Given that order of events, (A) will keep executing that invalid TB until
another flush of its tb_jmp_cache happens, which in theory might never happen.
We can fix this by checking the tb->invalid flag every time we look up a TB
from tb_jmp_cache, so that in the above scenario, next time we try to find
that TB in tb_jmp_cache, we won't, and will therefore be forced to look it
up in tb_htable.

Performance-wise, I measured a small improvement when booting debian-arm.
Note that inlining pays off:

Performance counter stats for 'taskset -c 0 qemu-system-arm \
-machine type=virt -nographic -smp 1 -m 4096 \
-netdev user,id=unet,hostfwd=tcp::2222-:22 \
-device virtio-net-device,netdev=unet \
-drive file=jessie.qcow2,id=myblock,index=0,if=none \
-device virtio-blk-device,drive=myblock \
-kernel kernel.img -append console=ttyAMA0 root=/dev/vda1 \
-name arm,debug-threads=on -smp 1' (10 runs):

Before:
18714.917392 task-clock # 0.952 CPUs utilized ( +- 0.95% )
23,142 context-switches # 0.001 M/sec ( +- 0.50% )
1 CPU-migrations # 0.000 M/sec
10,558 page-faults # 0.001 M/sec ( +- 0.95% )
53,957,727,252 cycles # 2.883 GHz ( +- 0.91% ) [83.33%]
24,440,599,852 stalled-cycles-frontend # 45.30% frontend cycles idle ( +- 1.20% ) [83.33%]
16,495,714,424 stalled-cycles-backend # 30.57% backend cycles idle ( +- 0.95% ) [66.66%]
76,267,572,582 instructions # 1.41 insns per cycle
12,692,186,323 branches # 678.186 M/sec ( +- 0.92% ) [83.35%]
263,486,879 branch-misses # 2.08% of all branches ( +- 0.73% ) [83.34%]

19.648474449 seconds time elapsed ( +- 0.82% )

After, w/ inline (this patch):
18471.376627 task-clock # 0.955 CPUs utilized ( +- 0.96% )
23,048 context-switches # 0.001 M/sec ( +- 0.48% )
1 CPU-migrations # 0.000 M/sec
10,708 page-faults # 0.001 M/sec ( +- 0.81% )
53,208,990,796 cycles # 2.881 GHz ( +- 0.98% ) [83.34%]
23,941,071,673 stalled-cycles-frontend # 44.99% frontend cycles idle ( +- 0.95% ) [83.34%]
16,161,773,848 stalled-cycles-backend # 30.37% backend cycles idle ( +- 0.76% ) [66.67%]
75,786,269,766 instructions # 1.42 insns per cycle
12,573,617,143 branches # 680.708 M/sec ( +- 1.34% ) [83.33%]
260,235,550 branch-misses # 2.07% of all branches ( +- 0.66% ) [83.33%]

19.340502161 seconds time elapsed ( +- 0.56% )

After, w/o inline:
18791.253967 task-clock # 0.954 CPUs utilized ( +- 0.78% )
23,230 context-switches # 0.001 M/sec ( +- 0.42% )
1 CPU-migrations # 0.000 M/sec
10,563 page-faults # 0.001 M/sec ( +- 1.27% )
54,168,674,622 cycles # 2.883 GHz ( +- 0.80% ) [83.34%]
24,244,712,629 stalled-cycles-frontend # 44.76% frontend cycles idle ( +- 1.37% ) [83.33%]
16,288,648,572 stalled-cycles-backend # 30.07% backend cycles idle ( +- 0.95% ) [66.66%]
77,659,755,503 instructions # 1.43 insns per cycle
12,922,780,045 branches # 687.702 M/sec ( +- 1.06% ) [83.34%]
261,962,386 branch-misses # 2.03% of all branches ( +- 0.71% ) [83.35%]

19.700174670 seconds time elapsed ( +- 0.56% )

Backports commit f6bb84d53110398f4899c19dab4e0fe9908ec060 from qemu
2018-03-05 02:42:46 -05:00
Emilio G. Cota
68ddc0cb08
exec-all: fix typos in TranslationBlock's documentation
Backports commit eb5e2b9e3b141de0c435eedc31c26cbbdefbee1b from qemu
2018-03-05 02:10:28 -05:00
Richard Henderson
31b8b67cd3
tcg: Move USE_DIRECT_JUMP discriminator to tcg/cpu/tcg-target.h
Replace the USE_DIRECT_JUMP ifdef with a TCG_TARGET_HAS_direct_jump
boolean test. Replace the tb_set_jmp_target1 ifdef with an unconditional
function tb_target_set_jmp_target.

While we're touching all backends, add a parameter for tb->tc_ptr;
we're going to need it shortly for some backends.

Move tb_set_jmp_target and tb_add_jump from exec-all.h to cpu-exec.c.

Backports commit a85833933628384d74ec412024d55cf012640287 from qemu
2018-03-04 21:52:35 -05:00
Lluís Vilanova
ed7225e685
tcg: Add generic translation framework
Backports commit bb2e0039dc07177f928f9fe24758967da02d60a2 from qemu
2018-03-04 14:31:16 -05:00
Paolo Bonzini
6997a5a090
gen-icount: check cflags instead of use_icount global
Backports commit cd42d5b23691ad73edfd6dbcfc935a960a9c5a65 from qemu
2018-03-04 14:26:26 -05:00
Lluís Vilanova
3a196c62ae
target: [tcg] Use a generic enum for DISAS_ values
Used later. An enum makes expected values explicit and
bounds the value space of switches.

Backports commit 77fc6f5e28667634916f114ae04c6029cd7b9c45 from qemu
2018-03-04 14:08:43 -05:00
Richard Henderson
b8a16f841a
tcg: Add generic DISAS_NORETURN
This will allow some amount of cleanup to happen before
switching the backends over to enum DisasJumpType.

Backports commit 5dc66895b0113034cd37fd5e65911d7959fc26a9 from qemu
2018-03-04 13:49:18 -05:00
Peter Maydell
26c8f31d9e
memory.h: Move MemTxResult type to memattrs.h
Move the MemTxResult type to memattrs.h. We're going to want to
use it in cpu/qom.h, which doesn't want to include all of
memory.h. In practice MemTxResult and MemTxAttrs are pretty
closely linked since both are used for the new-style
read_with_attrs and write_with_attrs callbacks, so memattrs.h
is a reasonable home for this rather than creating a whole
new header file for it.

Backports commit 3114d092b1740f9db9aa559aeb48ee387011e1da from qemu
2018-03-04 13:10:47 -05:00
Alexey Kardashevskiy
e723b8dd49
memory: Open code FlatView rendering
We are going to share FlatView's between AddressSpace's and per-AS
memory listeners won't suit the purpose anymore so open code
the dispatch tree rendering.

Since there is a good chance that dispatch_listener was the only
listener, this avoids address_space_update_topology_pass() if there is
no registered listeners; this should improve starting time.

This should cause no behavioural change.

Backports commit 1b04a1580917d9e41fd37ca62cbff9b4bf061e96 from qemu
2018-03-04 02:06:48 -05:00
Lluís Vilanova
32b3c3815d
tcg: Pass generic CPUState to gen_intermediate_code()
Needed to implement a target-agnostic gen_intermediate_code()
in the future.

Backports commit 9c489ea6bed134fecfd556b439c68bba48fbe102 from qemu
2018-03-03 23:34:18 -05:00
Richard Henderson
fc52eea5e2
tcg: Expand glue macros before stringifying helper names
Backports commit 44368ac62dc5ba014b68b2c1a8ec6fedc3242a5d from qemu
2018-03-03 23:07:21 -05:00
Alex Bennée
7d02489baf
include/exec/exec-all: document common exit conditions
As a precursor to later patches attempt to come up with a more
concrete wording for what each of the common exit cases would be.

Backports commit df0311e634828fdc99ca59352aef68503d631aad from qemu
2018-03-03 22:31:28 -05:00
Peter Maydell
3bd5694a0a
memory: Rename memory_region_init_rom() and _rom_device() to _nomigrate()
Rename memory_region_init_rom() to memory_region_init_rom_nomigrate()
and memory_region_init_rom_device() to
memory_region_init_rom_device_nomigrate().

Backports commit b59821a95bd1d7cb4697fd7748725c910582e0e7 from qemu
2018-03-03 22:29:01 -05:00
Peter Maydell
7b0027a828
memory: Rename memory_region_init_ram() to memory_region_init_ram_nomigrate()
Rename memory_region_init_ram() to memory_region_init_ram_nomigrate().
This leaves the way clear for us to provide a memory_region_init_ram()
which does handle migration.

Backports commit 1cfe48c1ce219b60a9096312f7a61806fae64ab3 from qemu
2018-03-03 22:25:39 -05:00
Peter Maydell
152c56f6a9
memory: Document that the RAM MR initializers do not handle migration
The various functions for initializing RAM MemoryRegions do not do
anything to cause the data in the MemoryRegion to be migrated.
Note in their documentation comments that this is the responsibility
of the caller.

(We will shortly add a new function that *does* do this for you.)

Backports commit a5c0234bb2754f5248e67929a34c843dbe039da5 from qemu
2018-03-03 22:20:32 -05:00
Pranith Kumar
d0a70720a3
Revert "exec.c: Fix breakpoint invalidation race"
Now that we have proper locking after MTTCG patches have landed, we
can revert the commit. This reverts commit

a9353fe897ca2687e5b3385ed39e3db3927a90e0.

Backports commit 406bc339b0505fcfc2ffcbca1f05a3756e338a65 from qemu
2018-03-03 22:14:35 -05:00
Yang Zhong
1135db176f
tcg: add CONFIG_TCG guards in headers
Add CONFIG_TCG around TLB-related functions and structure declarations.
Some of these functions are defined in ./accel/tcg/cputlb.c, which will
not be linked in if TCG is disabled, and have no stubs; therefore, their
callers will also be compiled out for --disable-tcg.

Backports commit b11ec7f2e44b285a3967d629b55d1a6970b06787 from qemu
2018-03-03 21:37:52 -05:00
Yang Zhong
d70c141675
tcg: move page_size_init() function
translate-all.c will be disabled if tcg is disabled in the build,
so page_size_init() function and related variables will be moved
to exec.c file.

Backports commit a0be0c585f5dcc4d50a37f6a20d3d625c5ef3a2c from qemu
2018-03-03 21:30:08 -05:00
Thomas Huth
cf5d583ef0
cpu: Introduce a wrapper for tlb_flush() that can be used in common code
Commit 1f5c00cfdb8114c ("qom/cpu: move tlb_flush to cpu_common_reset")
moved the call to tlb_flush() from the target-specific reset handlers
into the common code qom/cpu.c file, and protected the call with
"#ifdef CONFIG_SOFTMMU" to avoid that it is called for linux-user
only targets. But since qom/cpu.c is common code, CONFIG_SOFTMMU is
*never* defined here, so the tlb_flush() was simply never executed
anymore. Fix it by introducing a wrapper for tlb_flush() in a file
that is re-compiled for each target, i.e. in translate-all.c.

Backports commit 2cd53943115be5118b5b2d4b80ee0a39c94c4f73 from qemu
2018-03-03 21:24:55 -05:00
Emilio G. Cota
1a4e5da043
gen-icount: use tcg_ctx.tcg_env instead of cpu_env
We are relying on cpu_env being defined as a global, yet most
targets (i.e. all but arm/a64) have it defined as a local variable.
Luckily all of them use the same "cpu_env" name, but really
compilation shouldn't break if the name of that local variable
changed.

Fix it by using tcg_ctx.tcg_env, which all targets set in their
translate_init function. This change also helps paving the way
for the upcoming "translation loop common to all targets" work.

Backports commit 53f6672bcf57d82b794a2cc3a3469be7d35c8653 from qemu
2018-03-03 21:08:58 -05:00
Richard Henderson
68275ba6f3
tcg/arm: Use indirect branch for goto_tb
Backports commit 3fb53fb4d12f2e7833bd1659e6013237b130ef20 from qemu
2018-03-03 17:11:18 -05:00
Emilio G. Cota
d3ada2feb5
tcg: allocate TB structs before the corresponding translated code
Allocating an arbitrarily-sized array of tbs results in either
(a) a lot of memory wasted or (b) unnecessary flushes of the code
cache when we run out of TB structs in the array.

An obvious solution would be to just malloc a TB struct when needed,
and keep the TB array as an array of pointers (recall that tb_find_pc()
needs the TB array to run in O(log n)).

Perhaps a better solution, which is implemented in this patch, is to
allocate TB's right before the translated code they describe. This
results in some memory waste due to padding to have code and TBs in
separate cache lines--for instance, I measured 4.7% of padding in the
used portion of code_gen_buffer when booting aarch64 Linux on a
host with 64-byte cache lines. However, it can allow for optimizations
in some host architectures, since TCG backends could safely assume that
the TB and the corresponding translated code are very close to each
other in memory. See this message by rth for a detailed explanation:

https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html
Subject: Re: GSoC 2017 Proposal: TCG performance enhancements

Backports commit 6e3b2bfd6af488a896f7936e99ef160f8f37e6f2 from qemu
2018-03-03 17:05:49 -05:00
Emilio G. Cota
7d0440dec4
tb-hash: improve tb_jmp_cache hash function in user mode
Optimizations to cross-page chaining and indirect branches make
performance more sensitive to the hit rate of tb_jmp_cache.
The constraint of reserving some bits for the page number
lowers the achievable quality of the hashing function.

However, user-mode does not have this requirement. Thus,
with this change we use for user-mode a hashing function that
is both faster and of better quality than the previous one.

Measurements:

Note: baseline (i.e. speedup == 1x) is QEMU v2.9.0.

- SPECint06 (test set), x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz

2.2x +-+--------------------------------------------------------------------------------------------------------------+-+
| |
| jr |
2x +jr+multhash +....................................................+++++...................................+-+
| jr+hash |$$$ |
| |$+$ |
| ### $ |
1.8x +-+......................................................................#|#.$...................................+-+
| ++#+# $ |
| |# # $ |
1.6x +-+....................................................................***.#.$....................++$$$..........+-+
| $$$ *+* # $ |$+$ |
| ++$$$ ### $ * * # $ +++|$ $ |
| ++###+$ # # $ * * # $ ### ****## $ |
1.4x +-+...................***+#.$.........***.#.$..........................*.*.#.$...........#+#$$.*++*|#.$..........+-+
| *+* # $ * * # $ * * # $ # # $ * *+# $ |
| * * # $ +++++ * * # $ * * # $ *** # $ * * # $ ###$$ |
1.2x +-+...................*.*.#.$.***##$$.*.*.#.$..........................*.*.#.$.........*.*.#.$.*..*.#.$.***+#+$..+-+
| * * # $ *+* # $ * * # $ +++ * * # $ ++###$$ * * # $ * * # $ * * # $ |
| ***##$$ * * # $ * * # $ * * # $ ***##$$ ++### * * # $ *** #+$ * * # $ * * # $ * * # $ |
| *+*+#+$ ***##$$$ * * # $ * * # $ * * # $ *+* # $ ++####$$ ***+# * * # $ * * # $ * * # $ * * # $ * * # $ |
1x +-++-*+*+#+$+*+*+#-+$+*+*-#+$+*+*+#+$+*+*+#+$+*-*+#+$+***++#+$+*+*+#$$+*+*+#+$+*+*+#+$+*+*-#+$+*+-*+#+$+*+*+#+$-++-+
| * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ |
| * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ |
0.8x +-+--***##$$-***##$$$-***##$$-***##$$-***##$$-***##$$-***###$$-***##$$-***##$$-***##$$-***##$$-****##$$-***##$$--+-+
astar bzip2 gcc gobmk h264ref hmmlibquantum mcf omnetpperlbench sjengxalancbmk hmean
png: http://imgur.com/4UXTrEc

Here I also tried the hash function suggested by Paolo ("multhash"):

return ((uint64_t) (pc * 2654435761) >> 32) & (TB_JMP_CACHE_SIZE - 1);

As you can see it is just as good as the other new function ("hash"),
which is what I ended up going with.

- SPECint06 (train set), x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz

2.6x +-+--------------------------------------------------------------------------------------------------------------+-+
| |
| jr ### |
2.4x +jr+hash...........................................................................................#.#...........+-+
| # # |
| # # |
2.2x +-+................................................................................................#.#...........+-+
| # # |
| # # |
2x +-+................................................................................................#.#...........+-+
| **** # |
| * * # |
1.8x +-+.............................................................................................*..*.#...........+-+
| +++ * * # |
| #### #### * * # |
1.6x +-+......................................####.............................#..#.****..#..........*..*.#...........+-+
| +++ #++# **** # * * # #### * * # |
| ### # # * * # * * # # # * * # |
1.4x +-+...................****+#..........****..#..........................*..*..#.*..*..#....#..#..*..*.#...........+-+
| *++* # * * # * * # * * # *** # * * # #### |
| * * # #### * * # * * # * * # * * # * * # **** # |
1.2x +-+...................*..*.#..****++#.*..*..#..........................*..*..#.*..*..#..*.*..#..*..*.#..*..*..#..+-+
| ****### * * # * * # * * # * * # * * # * * # * * # * * # |
| * * # ***### * * # * * # * * # ****## * * # * * # * * # * * # * * # |
1x +-+--****###--***###--****##--****###-****###--***###--***###--****##--****###-****###--***###--****##--****###--+-+
astar bzip2 gcc gobmk h264ref hmmlibquantum mcf omnetpperlbench sjengxalancbmk hmean
png: http://imgur.com/ArCbHqo

- NBench, x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz

1.12x +-+-------------------------------------------------------------------------------------------------------------+-+
| |
| jr +++ |
1.1x +jr+hash...........................................................####.........................................+-+
| +++#| # |
| | #++# |
1.08x +-+................................+++................+++.+++..*****..#.........................................+-+
| | +++ | | * | * # |
| | | | | *+++* # |
1.06x +-+................................****###.............|...|...*...*..#.........................+++.............+-+
| *| * |# ****### * * # | |
| *| *++# *| * |# * * # #### |
1.04x +-+................................*++*..#............*|.*.|#..*...*..#........................#.|#.............+-+
| * * # *++*++# * * # +++#++# |
| * * # * * # * * # | # # +++#### |
1.02x +-+................................*..*..#......+++...*..*..#..*...*..#.....................****..#..*****++#...+-+
| +++ * * # +++ | * * # * * # +++ *| * # *+++* # |
| +++ | +++ +++ ++++++ * * # *****### * * # * * # | +++ ++++++ *++* # * * # |
1x +-++-+++++####++****###++++-+####+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-+++####-+*****###++*++*++#++*+-+*++#+-++-+
| *****| # *++* |# *****| # * * # * *++# * * # * * # **** |# * * # * * # * * # |
| * | *| # * *++# * | *++# * * # * * # * * # * * # *| *++# * * # * * # * * # |
0.98x +-+...*.|.*++#..*..*..#..*+++*..#..*..*..#..*...*..#..*..*..#..*...*..#..*++*..#..*...*..#..*..*..#..*...*..#...+-+
| *+++* # * * # * * # * * # * * # * * # * * # * * # * * # * * # * * # |
| * * # * * # * * # * * # * * # * * # * * # * * # * * # * * # * * # |
0.96x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
ASSIGNMENT BITFIELD FOURFP EMULATION HUFFMAN LU DECOMPOSITIONEURAL NNUMERIC SOSTRING SORT hmean
png: http://imgur.com/ZXFX0hJ

- NBench, arm-linux-user. Host: Intel i7-4790K @ 4.00GHz

1.3x +-+-------------------------------------------------------------------------------------------------------------+-+
| #### |
| jr # # +++ |
1.25x +jr+hash.....................#..#...........................................####................................+-+
| # # # # |
| # # # # |
1.2x +-+..........................#..#...........................................#..#................................+-+
| # # # # |
| # # # # |
1.15x +-+..........................#..#...........................................#..#................................+-+
| # # #### # # |
| # # # # # # |
1.1x +-+..........................#..#..................................#..#.....#..#................................+-+
| # # # # # # +++ |
| # # #### # # # # #### |
1.05x +-+..........................#..#...............#..#.....####......#..#.....#..#.........................#..#...+-+
| # # # # # # # # # # +++ # # |
| +++ ***** # #### ***** # # # +++# # **** # ****### # # |
1x +-++-+*****###++****+++++*+-+*++#+-****++#-+*+++*-+#+++++#++#++*****++#+-*++*++#-+*****-++++*++*++#++*****++#+-++-+
| * * # * * | * * # * * # * * # **** # * * # * * # * *### * *++# * * # |
| * * # * *### * * # * * # * * # * * # * * # * * # * * # * * # * * # |
0.95x +-+...*...*..#..*..*.|#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#...+-+
| * * # * * |# * * # * * # * * # * * # * * # * * # * * # * * # * * # |
| * * # * * |# * * # * * # * * # * * # * * # * * # * * # * * # * * # |
0.9x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
ASSIGNMENT BITFIELD FOURFP EMULATION HUFFMAN LU DECOMPOSITIONEURAL NNUMERIC SOSTRING SORT hmean
png: http://imgur.com/FfD27ey

Backports commit 6f1653180f5701c6a8f1b35b89a80b1e3260928e from qemu
2018-03-03 14:11:29 -05:00
Emilio G. Cota
8f4f15e5f5
tcg: Introduce goto_ptr opcode and tcg_gen_lookup_and_goto_ptr
Instead of exporting goto_ptr directly to TCG frontends, export
tcg_gen_lookup_and_goto_ptr(), which calls goto_ptr with the pointer
returned by the lookup_tb_ptr() helper. This is the only use case
we have for goto_ptr and lookup_tb_ptr, so having this function is
very convenient. Furthermore, it trivially allows us to avoid calling
the lookup helper if goto_ptr is not implemented by the backend.

Backports commit cedbcb01529cb6cf9a2289cdbebbc63f6149fc18 from qemu
2018-03-02 21:05:18 -05:00
Peter Xu
fce1b469e5
memory: tune last param of iommu_ops.translate()
This patch converts the old "is_write" bool into IOMMUAccessFlags. The
difference is that "is_write" can only express either read/write, but
sometimes what we really want is "none" here (neither read nor write).
Replay is an good example - during replay, we should not check any RW
permission bits since thats not an actual IO at all.

Backports commit bf55b7afce53718ef96f4e6616da62c0ccac37dd from qemu
2018-03-02 18:59:12 -05:00
Paolo Bonzini
c27870520a
exec: revert MemoryRegionCache
MemoryRegionCache did not know about virtio support for IOMMUs (because the
two features were developed at the same time). Revert MemoryRegionCache
to "normal" address_space_* operations for 2.9, as it is simpler than
undoing the virtio patches.

Backports commit 90c4fe5fc517a045e7a7cf2f23472e114042ca29 from qemu
2018-03-02 14:30:41 -05:00
Dr. David Alan Gilbert
55d79cf4c0
RAMBlocks: qemu_ram_is_shared
Provide a helper to say whether a RAMBlock was created as a
shared mapping.

Backports commit 463a4ac23bcf0f0b65c850fa66f5ae6e43edd243 from qemu
2018-03-02 13:05:35 -05:00
Dr. David Alan Gilbert
5dfbee8930
memory_region: Fix name comments
The 'name' parameter to memory_region_init_* had been marked as debug
only, however vmstate_region_ram uses it as a parameter to
qemu_ram_set_idstr to set RAMBlock names and these form part of the
migration stream.

Backports commit e8f5fe2de125a0bfbefbaa6a69af81f4817cb7a0 from qemu
2018-03-02 13:01:23 -05:00
Yongji Xie
23f5b17a08
memory: Introduce DEVICE_HOST_ENDIAN for ram device
At the moment ram device's memory regions are DEVICE_NATIVE_ENDIAN. It's
incorrect. This memory region is backed by a MMIO area in host, so the
uint64_t data that MemoryRegionOps read from/write to this area should be
host-endian rather than target-endian. Hence, current code does not work
when target and host endianness are different which is the most common case
on PPC64. To fix it, this introduces DEVICE_HOST_ENDIAN for the ram device.

This has been tested on PPC64 BE/LE host/guest in all possible combinations
including TCG.

Backports commit c99a29e702528698c0ce2590f06ca7ff239f7c39 from qemu
2018-03-02 11:24:32 -05:00
Alex Bennée
454932263c
cputlb and arm/sparc targets: convert mmuidx flushes from varg to bitmap
While the vargs approach was flexible the original MTTCG ended up
having munge the bits to a bitmap so the data could be used in
deferred work helpers. Instead of hiding that in cputlb we push the
change to the API to make it take a bitmap of MMU indexes instead.

For ARM some the resulting flushes end up being quite long so to aid
readability I've tended to move the index shifting to a new line so
all the bits being or-ed together line up nicely, for example:

tlb_flush_page_by_mmuidx(other_cs, pageaddr,
(1 << ARMMMUIdx_S1SE1) |
(1 << ARMMMUIdx_S1SE0));

Backports commit 0336cbf8532935d8e23c2aabf3e2ce2c0697b6ac from qemu
2018-03-02 10:12:40 -05:00
Alex Bennée
e3e57ca08e
cputlb: drop flush_global flag from tlb_flush
We have never has the concept of global TLB entries which would avoid
the flush so we never actually use this flag. Drop it and make clear
that tlb_flush is the sledge-hammer it has always been.

Backports commit  d10eb08f5d8389c814b554d01aa2882ac58221bf from qemu
2018-03-01 19:36:04 -05:00
Jason Wang
29932d0719
memory: handle alias in memory_region_is_iommu()
Backports commit 12d37882f0c0def5dee1c21be5d8fea9c21baada from qemu
2018-03-01 13:06:18 -05:00
Jason Wang
fdca6292a1
exec: introduce address_space_get_iotlb_entry()
This patch introduces a helper to query the iotlb entry for a
possible iova. This will be used by later device IOTLB API to enable
the capability for a dataplane (e.g vhost) to query the IOTLB.

Backports commit 052c8fa9983f553fdfa0d61034774070dd639c2b from qemu
2018-03-01 13:05:08 -05:00
Paolo Bonzini
81ad780e5e
exec: introduce MemoryRegionCache
Device models often have to perform multiple access to a single
memory region that is known in advance, but would to use "DMA-style"
functions instead of address_space_map/unmap. This can happen
for example when the data has to undergo endianness conversion.
Introduce a new data structure to cache the result of
address_space_translate without forcing usage of a host address
like address_space_map does.

Backports commit 1f4e496e1fc2eb6c8bf377a0f9695930c380bfd3 from qemu
2018-03-01 10:50:30 -05:00
Paolo Bonzini
88ad0f4f6e
exec: introduce memory_ldst.inc.c
Templatize the address_space_* and *_phys functions, so that we can add
similar functions in the next patch that work with a lightweight,
cache-like version of address_space_map/unmap.

Backports commit 0ce265ffef87f19f4dd1ff0663e09a63d66ae408 from qemu
2018-03-01 09:59:34 -05:00
Paolo Bonzini
9404dbf74e
cpu-exec: fix icount out-of-bounds access
When icount is active, tb_add_jump is surprisingly called with an
out of bounds basic block index. I have no idea how that can work,
but it does not seem like a good idea. Clear *last_tb for all
TB_EXIT_ICOUNT_EXPIRED cases, even when all you have to do is
refill icount_extra.

Backports commit d8dea6fbcbed177ca5d23ab77b3834a9437f0e88 from qemu
2018-03-01 09:17:26 -05:00
Bobby Bingham
d46e52d9d0
cpu_ldst.h: use correct guest address parameter
In the user emulation code path, tlb_vaddr_to_host erronesously passed
vaddr as the guest address to be translated, instead of addr, the parameter
which actually contained the guest address.

This resulted in incorrect addresses being used when emulating block copy
(mvc/mvpg) and block clear (xc) instructions for the s390x target.

Backports commit c2a85316902e67530da9d6548139fcce73c0cac6 from qemu
2018-03-01 08:56:37 -05:00
Paolo Bonzini
9d64a89acf
tcg: comment on which functions have to be called with tb_lock held
softmmu requires more functions to be thread-safe, because translation
blocks can be invalidated from e.g. notdirty callbacks. Probably the
same holds for user-mode emulation, it's just that no one has ever
tried to produce a coherent locking there.

This patch will guide the introduction of more tb_lock and tb_unlock
calls for system emulation.

Note that after this patch some (most) of the mentioned functions are
still called outside tb_lock/tb_unlock. The next one will rectify this.

Backports commit 7d7500d99895f888f97397ef32bb536bb0df3b74 from qemu
2018-02-28 10:26:28 -05:00
Alex Bennée
7aab0bd9a6
translate-all: add DEBUG_LOCKING asserts
This adds asserts to check the locking on the various translation
engines structures. There are two sets of structures that are protected
by locks.

The first the l1map and PageDesc structures used to track which
translation blocks are associated with which physical addresses. In
user-mode this is covered by the mmap_lock.

The second case are TB context related structures which are protected by
tb_lock which is also user-mode only.

Currently the asserts do nothing in SoftMMU mode but this will change
for MTTCG.

Backports commit 301e40ed8005306c009978be295ed9a4b725178b from qemu
2018-02-28 08:56:15 -05:00
Yongbok Kim
79e4c001a9
softmmu: Add probe_write()
Probe for whether the specified guest write access is permitted.
If it is not permitted then an exception will be taken in the same
way as if this were a real write access (and we will not return).
Otherwise the function will return, and there will be a valid
entry in the TLB for this access.

Backports commit 3b4afc9e75ab1a95f33e41f462921093f8a109c4 from qemu
2018-02-27 12:20:50 -05:00
Richard Henderson
e35aacd5ae
tcg: Add EXCP_ATOMIC
When we cannot emulate an atomic operation within a parallel
context, this exception allows us to stop the world and try
again in a serial context.

Backports commit fdbc2b5722f6092e47181a947c90fd4bdcc1c121 from qemu

Also backports parts of commit 02d57ea115b7669f588371c86484a2e8ebc369be
2018-02-27 11:57:58 -05:00
Peter Maydell
db8b0a82b1
cpu: Support a target CPU having a variable page size
Support target CPUs having a page size which isn't knownn
at compile time. To use this, the CPU implementation should:
* define TARGET_PAGE_BITS_VARY
* not define TARGET_PAGE_BITS
* define TARGET_PAGE_BITS_MIN to the smallest value it
might possibly want for TARGET_PAGE_BITS
* call set_preferred_target_page_bits() in its realize
function to indicate the actual preferred target page
size for the CPU (and report any error from it)

In CONFIG_USER_ONLY, the CPU implementation should continue
to define TARGET_PAGE_BITS appropriately for the guest
OS page size.

Machines which want to take advantage of having the page
size something larger than TARGET_PAGE_BITS_MIN must
set the MachineClass minimum_page_bits field to a value
which they guarantee will be no greater than the preferred
page size for any CPU they create.

Note that changing the target page size by setting
minimum_page_bits is a migration compatibility break
for that machine.

For debugging purposes, attempts to use TARGET_PAGE_SIZE
before it has been finally confirmed will assert.

Backports commit 20bccb82ff3ea09bcb7c4ee226d3160cab15f7da from qemu
2018-02-26 12:29:08 -05:00
Paolo Bonzini
eb75004013
memory: add a per-AddressSpace list of listeners
This speeds up MEMORY_LISTENER_CALL noticeably. Right now,
with many PCI devices you have N regions added to M AddressSpaces
(M = # PCI devices with bus-master enabled) and each call looks
up the whole listener list, with at least M listeners in it.
Because most of the regions in N are BARs, which are also roughly
proportional to M, the whole thing is O(M^3). This changes it
to O(M^2), which is the best we can do without rewriting the
whole thing.

Backports commit 9a54635dcb51a3fcf7507af630168f514a8cd4e7 from qemu
2018-02-26 10:46:50 -05:00
Paolo Bonzini
4b06e8bbb7
memory: eliminate global MemoryListeners
There is none, so just drop the code.

Backports commit d45fa784cd0c111131696808d1168259d66b7519 from qemu
2018-02-26 10:19:28 -05:00
Richard Henderson
66d79ac959
tcg: Merge GETPC and GETRA
The return address argument to the softmmu template helpers was
confused. In the legacy case, we wanted to indicate that there
is no return address, and so passed in NULL. However, we then
immediately subtracted GETPC_ADJ from NULL, resulting in a non-zero
value, indicating the presence of an (invalid) return address.

Push the GETPC_ADJ subtraction down to the only point it's required:
immediately before use within cpu_restore_state_from_tb, after all
NULL pointer checks have been completed.

This makes GETPC and GETRA identical. Remove GETRA as the lesser
used macro, replacing all uses with GETPC.

Backports commit 01ecaf438b1eb46abe23392c8ce5b7628b0c8cf5 from qemu
2018-02-26 02:54:44 -05:00
Paolo Bonzini
30845ae475
tcg: Prepare TB invalidation for lockless TB lookup
When invalidating a translation block, set an invalid flag into the
TranslationBlock structure first. It is also necessary to check whether
the target TB is still valid after acquiring 'tb_lock' but before calling
tb_add_jump() since TB lookup is to be performed out of 'tb_lock' in
future. Note that we don't have to check 'last_tb'; an already invalidated
TB will not be executed anyway and it is thus safe to patch it.

Backports commit 6d21e4208f382dd8ca1f7995a6dd9ea7ca281163 from qemu
2018-02-26 01:48:13 -05:00
Alex Williamson
fe66c2e088
memory: Don't use memcpy for ram_device regions
With a vfio assigned device we lay down a base MemoryRegion registered
as an IO region, giving us read & write accessors. If the region
supports mmap, we lay down a higher priority sub-region MemoryRegion
on top of the base layer initialized as a RAM device pointer to the
mmap. Finally, if we have any quirks for the device (ie. address
ranges that need additional virtualization support), we put another IO
sub-region on top of the mmap MemoryRegion. When this is flattened,
we now potentially have sub-page mmap MemoryRegions exposed which
cannot be directly mapped through KVM.

This is as expected, but a subtle detail of this is that we end up
with two different access mechanisms through QEMU. If we disable the
mmap MemoryRegion, we make use of the IO MemoryRegion and service
accesses using pread and pwrite to the vfio device file descriptor.
If the mmap MemoryRegion is enabled and results in one of these
sub-page gaps, QEMU handles the access as RAM, using memcpy to the
mmap. Using either pread/pwrite or the mmap directly should be
correct, but using memcpy causes us problems. I expect that not only
does memcpy not necessarily honor the original width and alignment in
performing a copy, but it potentially also uses processor instructions
not intended for MMIO spaces. It turns out that this has been a
problem for Realtek NIC assignment, which has such a quirk that
creates a sub-page mmap MemoryRegion access.

To resolve this, we disable memory_access_is_direct() for ram_device
regions since QEMU assumes that it can use memcpy for those regions.
Instead we access through MemoryRegionOps, which replaces the memcpy
with simple de-references of standard sizes to the host memory.

With this patch we attempt to provide unrestricted access to the RAM
device, allowing byte through qword access as well as unaligned
access. The assumption here is that accesses initiated by the VM are
driven by a device specific driver, which knows the device
capabilities. If unaligned accesses are not supported by the device,
we don't want them to work in a VM by performing multiple aligned
accesses to compose the unaligned access. A down-side of this
philosophy is that the xp command from the monitor attempts to use
the largest available access weidth, unaware of the underlying
device. Using memcpy had this same restriction, but at least now an
operator can dump individual registers, even if blocks of device
memory may result in access widths beyond the capabilities of a
given device (RTL NICs only support up to dword).

Backports commit 1b16ded6a512809f99c133a97f19026fe612b2de from qemu
2018-02-25 23:06:36 -05:00
Alex Williamson
5db45219c9
memory: Replace skip_dump flag with ram_device
Setting skip_dump on a MemoryRegion allows us to modify one specific
code path, but the restriction we're trying to address encompasses
more than that. If we have a RAM MemoryRegion backed by a physical
device, it not only restricts our ability to dump that region, but
also affects how we should manipulate it. Here we recognize that
MemoryRegions do not change to sometimes allow dumps and other times
not, so we replace setting the skip_dump flag with a new initializer
so that we know exactly the type of region to which we're applying
this behavior.

Backports commit ca83f87a66d19fdaabf23d4f5ebb49396fe232c1 from qemu
2018-02-25 23:00:45 -05:00
Richard Henderson
1547048a22
tcg: Reorg TCGOp chaining
Instead of using -1 as end of chain, use 0, and link through the 0
entry as a fully circular double-linked list.

Backports commit dcb8e75870e2de199db853697f8839cb603beefe from qemu
2018-02-25 21:44:50 -05:00
Igor Mammedov
62c89b9cd4
exec: Reduce CONFIG_USER_ONLY ifdeffenery
Backports commit 1bc7e522d9cf1b58f2de9c8f1737be0bb5129c35 from qemu
2018-02-25 20:57:48 -05:00
Markus Armbruster
c2ffbc575d
Clean up decorations and whitespace around header guards
Cleaned up with scripts/clean-header-guards.pl.

Backports commit 175de52487ce0b0c78daa4cdf41a5a465a168a25 from qemu
2018-02-25 04:26:02 -05:00
Markus Armbruster
1275b9b459
Clean up ill-advised or unusual header guards
Cleaned up with scripts/clean-header-guards.pl.

Backports commit 2a6a4076e117113ebec97b1821071afccfdfbc96 from qemu
2018-02-25 04:22:46 -05:00
Markus Armbruster
9ae2fc4d9e
Clean up header guards that don't match their file name
Header guard symbols should match their file name to make guard
collisions less likely. Offenders found with
scripts/clean-header-guards.pl -vn.

Cleaned up with scripts/clean-header-guards.pl, followed by some
renaming of new guard symbols picked by the script to better ones.

Backports commit 121d07125bb6d7079c7ebafdd3efe8c3a01cc440 from qemu
2018-02-25 04:18:42 -05:00
Markus Armbruster
60e8836b74
Use #include "..." for our own headers, <...> for others
Tracked down with an ugly, brittle and probably buggy Perl script.

Also move includes converted to <...> up so they get included before
ours where that's obviously okay.

Backports commit a9c94277f07d19d3eb14f199c3e93491aa3eae0e from qemu
2018-02-25 04:10:33 -05:00
Sergey Sorokin
d1e4ac0451
Fix confusing argument names in some common functions
There are functions tlb_fill(), cpu_unaligned_access() and
do_unaligned_access() that are called with access type and mmu index
arguments. But these arguments are named 'is_write' and 'is_user' in their
declarations. The patches fix the arguments to avoid a confusion.

Backports commit b35399bb4e9968296a12303b00f9f2066470e987 from qemu
2018-02-25 03:58:27 -05:00
Sergey Sorokin
e4d123caa9
tcg: Improve the alignment check infrastructure
Some architectures (e.g. ARMv8) need the address which is aligned
to a size more than the size of the memory access.
To support such check it's enough the current costless alignment
check implementation in QEMU, but we need to support
an alignment size specifying.

Backports commit 1f00b27f17518a1bcb4cedca49eaec96a4d560bd from qemu
2018-02-25 02:23:28 -05:00
Peter Maydell
efc6cc2b83
memory: Assert that memory_region_init_rom_device() ops aren't NULL
It doesn't make sense to pass a NULL ops argument to
memory_region_init_rom_device(), because the effect will
be that if the guest tries to write to the memory region
then QEMU will segfault. Catch the bug earlier by sanity
checking the arguments to this function, and remove the
misleading documentation that suggests that passing NULL
might be sensible.

Backports commit 39e0b03dec518254fabd2acff29548d3f1d2b754 from qemu
2018-02-25 00:29:52 -05:00
Peter Maydell
334e951ec1
memory: Provide memory_region_init_rom()
Provide a new helper function memory_region_init_rom() for memory
regions which are read-only (and unlike those created by
memory_region_init_rom_device() don't have special behaviour
for writes). This has the same behaviour as calling
memory_region_init_ram() and then memory_region_set_readonly()
(which is what we do today in boards with pure ROMs) but is a
more easily discoverable API for the purpose.

Backports commit a1777f7f6462c66e1ee6e98f0d5c431bfe988aa5 from qemu
2018-02-25 00:28:17 -05:00
Alexey Kardashevskiy
7187d77cfa
memory: Add MemoryRegionIOMMUOps.notify_started/stopped callbacks
The IOMMU driver may change behavior depending on whether a notifier
client is present. In the case of POWER, this represents a change in
the visibility of the IOTLB, for other drivers such as intel-iommu and
future AMD-Vi emulation, notifier support is not yet enabled and this
provides the opportunity to flag that incompatibility.

Backports commit d22d8956b185c002b50a4d0883aff61f857347ef from qemu
2018-02-25 00:23:00 -05:00
Alexey Kardashevskiy
096ca207af
memory: Add reporting of supported page sizes
Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
uses when translating, however this information is not available outside
the translate context for various checks.

This adds a get_min_page_size callback to MemoryRegionIOMMUOps and
a wrapper for it so IOMMU users (such as VFIO) can know the minimum
actual page size supported by an IOMMU.

As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
as fallback.

This removes vfio_container_granularity() and uses new helper in
memory_region_iommu_replay() when replaying IOMMU mappings on added
IOMMU memory region.

Backports the relevant parts of commit f682e9c244af7166225f4a50cc18ff296bb9d43e from qemu
2018-02-24 19:23:28 -05:00
Emilio G. Cota
ae3e22a689
tb hash: hash phys_pc, pc, and flags with xxhash
For some workloads such as arm bootup, tb_phys_hash is performance-critical.
The is due to the high frequency of accesses to the hash table, originated
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html

To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.

The appended addresses this with two changes:

1) Use xxhash as the hash table's hash function. xxhash is a fast,
high-quality hashing function.

2) Feed the hashing function with not just tb_phys, but also pc and flags.

This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TB's, while others getting very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.

Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though. UPDATE:
On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
> consisting of only a delay slot).
> It may well still turn out to be reasonable to ignore cs_base for hashing.

BTW, after this change the hash table should not be called "tb_hash_phys"
anymore; this is addressed later in this series.

This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time

Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.

Backports commit 42bd32287f3a18d823f2258b813824a39ed7c6d9 from qemu
2018-02-24 18:00:14 -05:00
Emilio G. Cota
9ef9de9cf8
exec: add tb_hash_func5, derived from xxhash
This will be used by upcoming changes for hashing the tb hash.

Add this into a separate file to include the copyright notice from
xxhash.

Backports commit dc8b295d05ec35a8c032f9abca421772347ba5d4 from qemu
2018-02-24 17:36:35 -05:00
Peter Maydell
d7dccff836
cpu-exec: Rename cpu_resume_from_signal() to cpu_loop_exit_noexc()
The function cpu_resume_from_signal() is now always called with a
NULL puc argument, and is rather misnamed since it is never called
from a signal handler. It is essentially forcing an exit to the
top level cpu loop but without raising any exception, so rename
it to cpu_loop_exit_noexc() and drop the useless unused argument.

Backports commit 6886b98036a8f8f5bce8b10756ce080084cef11b from qemu
2018-02-24 17:25:28 -05:00
Peter Maydell
8d0faac1dc
qemu-common.h: Drop WORDS_ALIGNED define
The WORDS_ALIGNED #define is not used anywhere, and hasn't been since
2013 when commit 612d590ebc6cef rewrote the various ld<type>_<endian>_p
functions to not use it. Remove the #define and the comment describing it.
Also remove the line in the comment about TARGET_WORDS_ALIGNED, since
it has never actually existed.

Backports commit 0d5c21f2b3bf1e0b562a2c74e353d2e03f2f50ef from qemu
2018-02-24 17:01:55 -05:00
Paolo Bonzini
8df5ad80b1
exec: hide mr->ram_addr from qemu_get_ram_ptr users
Let users of qemu_get_ram_ptr and qemu_ram_ptr_length pass in an
address that is relative to the MemoryRegion. This basically means
what address_space_translate returns.

Because the semantics of the second parameter change, rename the
function to qemu_map_ram_ptr.

Backports commit 0878d0e11ba8013dd759c6921cbf05ba6a41bd71 from qemu
2018-02-24 16:17:49 -05:00
Paolo Bonzini
b2e1b34bcc
memory: split memory_region_from_host from qemu_ram_addr_from_host
Move the old qemu_ram_addr_from_host to memory_region_from_host and
make it return an offset within the region. For qemu_ram_addr_from_host
return the ram_addr_t directly, similar to what it was before
commit 1b5ec23 ("memory: return MemoryRegion from qemu_ram_addr_from_host",
2013-07-04).

Backports commit 07bdaa4196b51bc7ffa7c3f74e9e4a9dc8a7966a from qemu
2018-02-24 16:06:49 -05:00
Paolo Bonzini
918c626847
exec: remove ram_addr argument from qemu_ram_block_from_host
Of the two callers, one does not use it, and the other can compute
it itself based on the other output argument (offset) and the RAMBlock.

Backports commit f615f39616c4fd1a3a3b078af8d75bb4be6390de from qemu
2018-02-24 03:37:40 -05:00
Paolo Bonzini
f26f1f123c
memory: remove qemu_get_ram_fd, qemu_set_ram_fd, qemu_ram_block_host_ptr
Remove direct uses of ram_addr_t and optimize memory_region_{get,set}_fd
now that a MemoryRegion knows its RAMBlock directly.

Backports commit 4ff87573df3606856a92c14eef3393a63d736d11 from qemu
2018-02-24 03:34:44 -05:00
Fam Zheng
fb8135cd0d
memory: Remove code for mr->may_overlap
The collision check does nothing and hasn't been used. Remove the
variable together with related code.

Backports commit b61359781958759317ee6fd1a45b59be0b7dbbe1 from qemu
2018-02-24 02:55:25 -05:00
Gonglei
feff56cc11
memory: drop find_ram_block()
On the one hand, we have already qemu_get_ram_block() whose function
is similar. On the other hand, we can directly use mr->ram_block but
searching RAMblock by ram_addr which is a kind of waste.

Backports commit fa53a0e53efdc7002497ea4a76aacf6cceb170ef from qemu
2018-02-24 02:52:20 -05:00
Paolo Bonzini
d0d3712417
hw: remove pio_addr_t
pio_addr_t is almost unused, because these days I/O ports are simply
accessed through the address space. cpu_{in,out}[bwl] themselves are
almost unused; monitor.c and xen-hvm.c could use address_space_read/write
directly, since they have an integer size at hand. This leaves qtest as
the only user of those functions.

On the other hand even portio_* functions use this type; the only
interesting use of pio_addr_t thus is include/hw/sysbus.h. I guess I
could move it there, but I don't see much benefit in that either. Using
uint32_t is enough and avoids the need to include ioport.h everywhere.

Backports commit 89a80e7400f7225d9401b35ef32454b4ab29dc67 from qemu
2018-02-24 02:43:16 -05:00
Paolo Bonzini
9485b7c2e1
cpu: move exec-all.h inclusion out of cpu.h
exec-all.h contains TCG-specific definitions. It is not needed outside
TCG-specific files such as translate.c, exec.c or *helper.c.

One generic function had snuck into include/exec/exec-all.h; move it to
include/qom/cpu.h.

Backports commit 63c915526d6a54a95919ebece83fa9ca631b2508 from qemu
2018-02-24 02:39:08 -05:00
Paolo Bonzini
58693409ea
exec: extract exec/tb-context.h
TCG backends do not need most of exec-all.h; extract what they actually
need to a separate file or move it directly to tcg.h. The next patch
will stop including exec-all.h from everywhere.

Backports commit 00f6da6a1a5d1ce085334eccbb50ec899ceed513 from qemu
2018-02-24 02:09:58 -05:00
Paolo Bonzini
37f26922dd
qemu-common: push cpu.h inclusion out of qemu-common.h
Backports commit 33c11879fd422b759483ed25fef133ea900ea8d7 from qemu
2018-02-24 01:50:56 -05:00
Paolo Bonzini
78fd1aab94
cpu: move endian-dependent load/store functions to cpu-all.h
Disentangle cpu-common.h and memory.h from NEED_CPU_H. Prototypes are
not defined for !NEED_CPU_H, so remove them from poison.h too. Only
macros need poisoning.

Backports commit a7d6039cb35592683ecc56d2b37817da2d2f8b00 from qemu
2018-02-24 01:04:26 -05:00
Sergey Fedorov
ba9a237586
tcg: Rework tb_invalidated_flag
'tb_invalidated_flag' was meant to catch two events:
* some TB has been invalidated by tb_phys_invalidate();
* the whole translation buffer has been flushed by tb_flush().

Then it was checked:
* in cpu_exec() to ensure that the last executed TB can be safely
linked to directly call the next one;
* in cpu_exec_nocache() to decide if the original TB should be provided
for further possible invalidation along with the temporarily
generated TB.

It is always safe to patch an invalidated TB since it is not going to be
used anyway. It is also safe to call tb_phys_invalidate() for an already
invalidated TB. Thus, setting this flag in tb_phys_invalidate() is
simply unnecessary. Moreover, it can prevent from pretty proper linking
of TBs, if any arbitrary TB has been invalidated. So just don't touch it
in tb_phys_invalidate().

If this flag is only used to catch whether tb_flush() has been called
then rename it to 'tb_flushed'. Declare it as 'bool' and stick to using
only 'true' and 'false' to set its value. Also, instead of setting it in
tb_gen_code(), just after tb_flush() has been called, do it right inside
of tb_flush().

In cpu_exec(), this flag is used to track if tb_flush() has been called
and have made 'next_tb' (a reference to the last executed TB) invalid
for linking it to directly call the next TB. tb_flush() can be called
during the CPU execution loop from tb_gen_code(), during TB execution or
by another thread while 'tb_lock' is released. Catch for translation
buffer flush reliably by resetting this flag once before first TB lookup
and each time we find it set before trying to add a direct jump. Don't
touch in in tb_find_physical().

Each vCPU has its own execution loop in multithreaded mode and thus
should have its own copy of the flag to be able to reset it with its own
'next_tb' and don't affect any other vCPU execution thread. So make this
flag per-vCPU and move it to CPUState.

In cpu_exec_nocache(), we only need to check if tb_flush() has been
called from tb_gen_code() called by cpu_exec_nocache() itself. To do
this reliably, preserve the old value of the flag, reset it before
calling tb_gen_code(), check afterwards, and combine the saved value
back to the flag.

This patch is based on the patch "tcg: move tb_invalidated_flag to
CPUState" from Paolo Bonzini <pbonzini@redhat.com>.

Backports commit 6f789be56d3f38e9214dafcfab3bf9be7191f370 from qemu
2018-02-23 23:34:51 -05:00
Sergey Fedorov
d60af028c5
tcg: Clarify thread safety check in tb_add_jump()
The check is to make sure that another thread hasn't already done the
same while we were outside of tb_lock. Mention this in a comment.

Backports commit 9962c478b153a18fe88a6509fe58cd178aff8abc from qemu
2018-02-23 21:32:47 -05:00
Sergey Fedorov
fbc0a1105f
tcg: Use uintptr_t type for jmp_list_{next|first} fields of TB
These fields do not contain pure pointers to a TranslationBlock
structure. So uintptr_t is the most appropriate type for them.
Also put some asserts to assure that the two least significant bits of
the pointer are always zero before assigning it to jmp_list_first.

Backports commit c37e6d7e3589ecb96914faa21025ad7ba6654aea from qemu
2018-02-23 21:28:19 -05:00
Sergey Fedorov
e60c24cecf
tcg: Clean up direct block chaining data fields
Briefly describe in a comment how direct block chaining is done. It
should help in understanding of the following data fields.

Rename some fields in TranslationBlock and TCGContext structures to
better reflect their purpose (dropping excessive 'tb_' prefix in
TranslationBlock but keeping it in TCGContext):
tb_next_offset => jmp_reset_offset
tb_jmp_offset => jmp_insn_offset
tb_next => jmp_target_addr
jmp_next => jmp_list_next
jmp_first => jmp_list_first

Avoid using a magic constant as an invalid offset which is used to
indicate that there's no n-th jump generated.

Backports commit f309101c26b59641fc1aa8fb2a98a5441cdaea03 from qemu
2018-02-23 21:28:19 -05:00
Sergey Fedorov
c5b234ed1f
tcg: Note requirement on atomic direct jump patching
Backports commit 10b4f4855537dd421e193a7d0416513116370558 from qemu
2018-02-23 21:28:18 -05:00
Sergey Fedorov
52e2972300
tcg/arm: Make direct jump patching thread-safe
Ensure direct jump patching in ARM is atomic by using
atomic_read()/atomic_set() for code patching.

Backports commit 7d14e0e2d661479985197203589c38840e1066df from qemu
2018-02-23 21:28:18 -05:00
Sergey Fedorov
57359fbe6c
tcg/s390: Make direct jump patching thread-safe
Ensure direct jump patching in s390 is atomic by:
* naturally aligning a location of direct jump address;
* using atomic_read()/atomic_set() for code patching.

Backports commit ed3d51ecd7fe248d3959e469d53890ac9ffe0cd2 from qemu
2018-02-23 21:28:18 -05:00
Sergey Fedorov
5eb2d6618f
tcg/i386: Make direct jump patching thread-safe
Ensure direct jump patching in i386 is atomic by:
* naturally aligning a location of direct jump address;
* using atomic_read()/atomic_set() for code patching.

Backports commit 0d07abf05e98903c7faf204a9a90f7d45b7554dc from qemu
2018-02-23 21:28:17 -05:00
Emilio G. Cota
170f6e0b3b
tb: consistently use uint32_t for tb->flags
We are inconsistent with the type of tb->flags: usage varies loosely
between int and uint64_t. Settle to uint32_t everywhere, which is
superior to both: at least one target (aarch64) uses the most significant
bit in the u32, and uint64_t is wasteful.

Compile-tested for all targets.

Backports commit 89fee74a0f066dfd73830a7b5fa137e87888c870 from qemu
2018-02-23 21:28:11 -05:00
Edgar E. Iglesias
bfc74c4da2
gen-icount: Use tcg_set_insn_param
Use tcg_set_insn_param() instead of directly accessing internal
tcg data structures to update an insn param.

Backports commit 25caa94c4a26daaab1e65c6d887e2972aeb5749e from qemu
2018-02-23 20:01:17 -05:00
Lioncash
87130fc884
exec-all: Remove externs
These are unused
2018-02-23 12:43:03 -05:00
Peter Crosthwaite
576f1752a6
include/exec: Move cputlb exec.c defs out
Move the architecture agnostic function prototypes for exec.c out of
cputlb.h to exec-all.h. This allows hiding of the arch specific
cputlb.h from exec.c which should be getting close to having no
architecture specifics. Prepares support for multi-arch, which will have
a minimal cpu.h that services exec.c but not cputlb.h.

Backports commit dfccc7602374c9fd3b083208b552d62daa244811 from qemu
2018-02-23 10:52:25 -05:00
Peter Crosthwaite
97c9423ee8
cputlb: move CPU_LOOP() for tlb_reset() to exec.c
To prepare for multi-arch, cputlb.c should only have awareness of one
single architecture. This means it should not have access to the full
CPU lists which may be heterogeneous. Instead, push the CPU_LOOP() up
to the one and only caller in exec.c.

Backports commit 9a13565d52bfd321934fb44ee004bbaf5f5913a8 from qemu
2018-02-23 10:46:31 -05:00