Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.0.0 causes crashes during execution #403

Open
dibyendumajumdar opened this issue Jun 5, 2024 · 30 comments
Open

v1.0.0 causes crashes during execution #403

dibyendumajumdar opened this issue Jun 5, 2024 · 30 comments

Comments

@dibyendumajumdar
Copy link
Contributor

Ravi tests crash during execution, and I am not yet able to narrow it down. Initial investigation appears to show that if I isolate the code that fails, then it passes, but when run in the larger test program, it fails. This makes me think that its something to do with the size of the function being compiled.

I am flagging this for two reasons:

  • It is likely there is some bug in the code generation that needs to be fixed
  • I'd like some ideas on how to isolate the failure

The failure is basically causing runtime assertions to trigger in Ravi - that means some corruption somewhere.

Example of failure (there are other failures similar to this)

https://github.com/dibyendumajumdar/ravi/actions/runs/9374498129/job/25810662116

@mingodad
Copy link

mingodad commented Jun 5, 2024

Maybe if you have the code that was working before using this mir version and dump the generated asm and comparing then can give a clue ?

@dibyendumajumdar
Copy link
Contributor Author

The master branch runs the previous version of MIR (v0.x) which works fine.

I could try dumping out the asm for both.

@dibyendumajumdar
Copy link
Contributor Author

I am not sure how to generate assembly output.

I generated the MIR output from the two versions, along with the C input file.
These are attached

test.mir_v1.0.txt
test.mir_v0.0.txt
test.c.txt

@mingodad
Copy link

mingodad commented Jun 6, 2024

After doing a search and replace with regular expressions on test.mir_v1.0.txt with this ):[^,]+ -> ) here is some recurrent differences I see in the diff between v0.0 and v1.0:

--- <v0.0>
+++ <v1.0>
@@ -2,16 +2,14 @@
 	mov	U0_base, u64:32(U0_ci)
 	add	U_10390, U0_base, 144
 	mov	U3277_ra, U_10390
-	mov	i64:160(fp), 0
+	mov	I3277_i, 0
 	bnes	L6487, u16:8(U3277_ra), 19
 L6486:
-	mov	U_10392, 160
-	add	U_10392, U_10392, fp
+	addr	U_10392, I3277_i
 	mov	i64:(U_10392), i64:(U3277_ra)
 	mov	i_10393, 1
 	jmp	L6488
 L6487:
-	mov	U_10395, 160
-	add	U_10395, U_10395, fp
+	addr	U_10395, I3277_i
 	call	proto15, luaV_tointegerns, i_10394, U3277_ra, U_10395, 0
 	mov	i_10393, i_10394

Also in a several functions the type of some variables have changed from v0.0 to v1.0:

--- <v0.0>
+++ <v1.0>
@@ -3,12 +3,14 @@
 	local	i64:U0_k, i64:I_3, i64:U0_base, i64:U_4, i64:I_5, i64:U_6, i64:I_7, i64:U_8
 	local	i64:I_9, i64:U_10, i64:I_11, i64:U_12, i64:I_13, i64:U_14, i64:I_15, i64:U_16
 	local	i64:I_17, i64:U_18, i64:I_19, i64:U_20, i64:I_21, i64:U_22, i64:I_23, i64:U1_ra
-	local	i64:U_24, i64:i_25, i64:U_26, i64:i_27, i64:i_28, i64:U_29, i64:I_30, i64:U3_stackbase
-	local	i64:i3_wanted, i64:i_31, i64:i_32, i64:i3_available, i64:i_33, i64:i3_j, i64:i_34, i64:U5_src_reg
-	local	i64:U_35, i64:U5_dst_reg, i64:U_36, i64:I_37, i64:i_38, i64:i_39, i64:U_40, i64:U_41
-	local	i64:U_42, i64:I_43, i64:i_44, i64:i_45, i64:U_46, i64:U_47, i64:U_48, i64:U_49
-# 1 arg, 64 locals
-	alloca	fp, 176
+	local	i64:U_24, i64:I1_i, i64:i_25, i64:U_26, i64:i_27, i64:i_28, i64:U_29, i64:I_30
+	local	i64:U3_stackbase, i64:i3_wanted, i64:i_31, i64:i_32, i64:i3_available, i64:i_33, i64:i3_j, i64:i_34
+	local	i64:U5_src_reg, i64:U_35, i64:U5_dst_reg, i64:U_36, i64:I_37, i64:i_38, i64:i_39, i64:U_40
+	local	i64:U_41, i64:U_42, i64:I_43, i64:i_44, i64:i_45, i64:U_46, i64:U_47, i64:U_48
+	local	i64:U_49
+
+# 1 arg, 65 locals, 0 globals
+	alloca	fp, 160
 	mov	i0_raviX__error_code, 0
 	mov	i0_result, 0
 	mov	U0_ci, u64:32(U0_L)
@@ -61,17 +63,15 @@
 L15221:
 	add	U_24, U0_base, 0
 	mov	U1_ra, U_24
-	mov	i64:160(fp), 0
+	mov	I1_i, 0
 	bnes	L15226, u16:8(U1_ra), 19
 L15225:
-	mov	U_26, 160
-	add	U_26, U_26, fp
+	addr	U_26, I1_i
 	mov	i64:(U_26), i64:(U1_ra)
 	mov	i_27, 1
 	jmp	L15227
 L15226:
-	mov	U_29, 160
-	add	U_29, U_29, fp
+	addr	U_29, I1_i
 	call	proto15, luaV_tointegerns, i_28, U1_ra, U_29, 0
 	mov	i_27, i_28
 L15227:

@dibyendumajumdar
Copy link
Contributor Author

thank you @mingodad

I need to find a reduced example of the issue - I will try to see if I can minimize failing tests; but if the issue is size related then it may be difficult to minimize.

@mingodad
Copy link

mingodad commented Jun 6, 2024

Probably what I said here Also in a several functions the type of some variables have changed from v0.0 to v1.0 is not correct but I noticed that in v0.0 there is less locals (64) than in v1.0 (65) but v0.0 call alloca fp, 176 and v1.0 call alloca fp, 160 (all the above are for __ravifunc_159: func i32, u64:U0_L).

@mingodad
Copy link

mingodad commented Jun 6, 2024

Running on Ubuntu-18.04 64bits with valgrind:

***** FILE 'errors.lua'*****
testing errors
==12900== Conditional jump or move depends on uninitialised value(s)
==12900==    at 0x543C8AA: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
==12900== Use of uninitialised value of size 8
==12900==    at 0x543883B: _itoa_word (_itoa.c:179)
==12900==    by 0x543BEDD: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
==12900==    at 0x5438845: _itoa_word (_itoa.c:179)
==12900==    by 0x543BEDD: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
==12900==    at 0x543BFE4: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
==12900==    at 0x543CB1C: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
+
testing stack overflow

@mingodad
Copy link

mingodad commented Jun 6, 2024

Here is my modified script to run the tests:

#!/bin/sh

LUA=$1
RUN_TRACEHOOK_TESTS=$2

if [ "$LUA" = "" ]
then
  echo "Please pass path to Ravi executable"
  exit 1
fi

run_lua53_tests() {
  arg=$1
  errmsg=$2
  cd lua53
  valgrind $LUA -e"$arg" all.lua
  rc=$?
  cd ..
  if [ $rc != 0 ]
  then
    echo "$errmsg"
    exit 1
  fi
}

run_ravi_tests() {
  dir=$1
  script=$2
  arg=$3
  cd $dir
  valgrind $LUA -e"$arg" $script
  rc=$?
  cd ..
  if [ $rc != 0 ]
  then
    echo "Test $script failed"
    exit 1
  fi
}

run_lua53_tests "_port=true ravi.jit(false)" "Lua53 interpreter test failed"
run_lua53_tests "_port=true ravi.auto(true)" "Lua53 auto JIT test failed"
run_lua53_tests "_port=true ravi.auto(true,1)" "Lua53 auto JIT all test failed"

run_ravi_tests language ravi_tests1.ravi "ravi.jit(false)"
run_ravi_tests language ravi_tests1.ravi "ravi.auto(true,1)"
run_ravi_tests language ravi_tests2.ravi "ravi.jit(false)"
run_ravi_tests language ravi_tests2.ravi "ravi.auto(true,1)"
run_ravi_tests language defer_tests.ravi "ravi.jit(false)"
run_ravi_tests language defer_tests.ravi "ravi.auto(true,1)"
run_ravi_tests language ravi_tests3.ravi "ravi.auto(true,1)"
run_ravi_tests language ravi_errors.ravi "ravi.auto(true,1)"
run_ravi_tests language basics.lua "ravi.auto(true,1)"
run_ravi_tests extra gaussian2.lua "ravi.auto(true,1)"
run_ravi_tests extra bittest.lua "ravi.auto(true,1)"
run_ravi_tests performance fannkuchen.ravi "ravi.auto(true,1)"
run_ravi_tests performance mandel1.ravi "ravi.auto(true,1)"
run_ravi_tests performance matmul1_ravi.lua "ravi.auto(true,1)"
run_ravi_tests performance sieve.ravi "ravi.auto(true,1)"
run_ravi_tests performance pisum.ravi "ravi.auto(true,1)"
run_ravi_tests performance md5test.lua "ravi.auto(true,1)"
run_ravi_tests comptests test.lua "ravi.jit(false)"
exit 0

@dibyendumajumdar
Copy link
Contributor Author

thank you

@dibyendumajumdar
Copy link
Contributor Author

One more piece of info - I wasn't setting an opt level, so it was using the default level.

Opt level 0 and 1 do not appear to have the bug.
Opt level 2 fails.

@dibyendumajumdar
Copy link
Contributor Author

dibyendumajumdar commented Jun 6, 2024

It seems related to size somehow.

One of the failures is in a file containing many tests (the C code above is from that file); test 88 fails.
If I remove first 10 tests then test 98 fails.
If I remove first 20 tests then all tests pass.

@dibyendumajumdar
Copy link
Contributor Author

dibyendumajumdar commented Jun 6, 2024

Literally, if I remove one test, it fails after running 1 more test from the end.

@dibyendumajumdar
Copy link
Contributor Author

I wonder if the source file is too large causing something to overflow; maybe an integer value is overflowing? Seems to be related to opt level 2.

@rofl0r
Copy link

rofl0r commented Jun 7, 2024

for hard to debug stuff like this git bisect could be the right tool.

@dibyendumajumdar
Copy link
Contributor Author

dibyendumajumdar commented Jun 7, 2024

@rofl0r thanks for the suggestion; it may indeed by be a good idea to look back at the commits to see when it starts to fail. But the problem is that the bbv branch had many commits, and I not sure at what points in the commit history the branch was usable.

@rofl0r
Copy link

rofl0r commented Jun 7, 2024

just go back to where it branched off. if it works there, you can tag it good, and start bisecting. the good thing about bisecting is that it needs only a few attempts to find the culprit even for hundreds or thousands of commits, because you always split the candidate commits in half.

@vnmakarov
Copy link
Owner

Thank you for working on this bug. Even providing the test case requires a lot of efforts. I really appreciate your involvement. Although I tested the release well but one person is not enough.

In order to work on this issue I need some standalone C program (the bug can be in C code, e.g. C undefined behavior can be used, the bug can be in c2mir compiler, although I don't see it in your comparison of different mir versions). The C program should be executable with some result which says that the program works or not. The C code can be very big. It does not matter for me. Having this program I can bisect optimizations and functions of the program and find the bug origin.

I understand that providing the standalone program requires even more efforts from you but without such program it is practically impossible to find the bug reason and fix the bug.

@dibyendumajumdar
Copy link
Contributor Author

I guess the issue might be related to the size of a function; Lua code that is all part of a single script goes into a single top level function, making it large. It is hard to find such examples in regular C code.

Its impossible to create a standalone C program.

I think the best approach is as suggested by @rofl0r - but even this is a lot of effort, but I will have a go at it over the weekend. If we can narrow down the commit from when the bug started then it might help find the issue.

@mingodad
Copy link

mingodad commented Jun 7, 2024

But have you fixed the problems that valgrind found ?

@dibyendumajumdar
Copy link
Contributor Author

@mingodad don't think those are real problems - its reporting issues with standard C functions.

I have ASAN enabled and no failures reported there - but of course I don't think ASAN covers JIT output.

@dibyendumajumdar
Copy link
Contributor Author

dibyendumajumdar commented Jun 8, 2024

Results from git bisect:

git bisect bad e489efb93d8199f8e6d410535005e51e198c03b6
e489efb93d8199f8e6d410535005e51e198c03b6 is the first bad commit
commit e489efb93d8199f8e6d410535005e51e198c03b6
Author: Vladimir N. Makarov <vmakarov@redhat.com>
Date:   Thu Mar 30 14:18:20 2023 -0400

    Implement semi-working (no-bootstrap) ssa combine:
      Remove assert for call results in target_machinize.
      Always check memory for non-strict in pattern_match_p.  Use op_ref instead of op in pattern_match_p and out_insn.
      Use const in get_ref_value.
      Add ssa_combine_ctx, free_ssa_edge, remove_ssa_edge_1, remove_disconnected_use_ssa_edge, merge_ssa_edge_lists.
      Remove assert change_ssa_edge_list_def.
      Remove edges only from definition in undo_build_ssa.  Add fixed_place_insn_p, use it for gvn_insn_p.
      Mave ssa code elimination and commutative_insn_code up.
      Add get_int_const, var_plus_const, var_mult_const, find_ssa_edge, update_op_combine_data, multiple_bb_def_p,
         no_mem_p, ssa_combine_op, processed_bb_insn_p, mark_bb_insn_as_processed, ssa_combine,
         init_ssa_combine, finish_ssa_combine
      Rename selection_ctx to combine_ctx.
      Call ssa_combine between conventional ssa and undo_build_ssa.

 mir-gen-x86_64.c | 145 +++++++++++-----------
 mir-gen.c        | 864 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------
 2 files changed, 752 insertions(+), 257 deletions(-)

I realize that this may be an interim issue that may have been fixed - but I would require patch to continue bisecting.

CORRECTED

@dibyendumajumdar
Copy link
Contributor Author

I am going to do a different bisect run to see if I can find a later point with a different issue

@dibyendumajumdar
Copy link
Contributor Author

dibyendumajumdar commented Jun 8, 2024

okay so my second bisect run ignores other errors and focuses on the issue with "large" functions.

git bisect good
3f693641b509a14a1e527600ce848696ff1d626e is the first bad commit
commit 3f693641b509a14a1e527600ce848696ff1d626e
Author: Vladimir N. Makarov <vmakarov@redhat.com>
Date:   Tue Jun 13 10:23:56 2023 -0400

    Constrain conflict matrix size:
      Add MIR_MAX_COALESCE_VARS.  Add scan_collected_moves.  Call scan_collected_moves in consider_move_vars_only.
      Extract new func collect_moves from coalesce.

 mir-gen.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 50 insertions(+), 32 deletions(-)

3f69364

To me this looks like a bug::

  scan_vars_num = 0;
  scan_collected_moves (gen_ctx);
  return scan_vars_num > 0;

@mingodad
Copy link

mingodad commented Jun 9, 2024

It seems that you didn't understood the valgrind output:

***** FILE 'errors.lua'*****
testing errors
==12900== Conditional jump or move depends on uninitialised value(s)
...
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
==12900== Use of uninitialised value of size 8
...
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
...
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
==12900==    at 0x543BFE4: vfprintf (vfprintf.c:1642)
==12900==    by 0x54688AF: vsnprintf (vsnprintf.c:114)
==12900==    by 0x5444F9E: snprintf (snprintf.c:33)
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
==12900== Conditional jump or move depends on uninitialised value(s)
...
==12900==    by 0x131722: tostringbuff (lobject.c:377)
==12900==    by 0x131CFE: addnum2buff (lobject.c:481)
==12900==    by 0x131FC4: luaO_pushvfstring (lobject.c:512)
==12900==    by 0x1324F0: luaO_pushfstring (lobject.c:564)
==12900==    by 0x123D75: luaG_addinfo (ldebug.c:677)
==12900==    by 0x1241D4: luaG_runerror (ldebug.c:702)
==12900==    by 0x149E9F: luaV_idiv (lvm.c:892)
==12900==    by 0x153D05: luaV_execute (lvm.c:1635)
==12900==    by 0x126BA5: luaD_call (ldo.c:604)
==12900== 
+
testing stack overflow

Valgrind is interpreting the machine code and detected use of not initialized memory in libc but that has the origin in your code just around formatt %a %A that ultimately probably is calling printf like in libc, I'm not confirm that this is the problem but they are highly correlated and I'll not be surprised if it's indeed the causing of the failure (and other not known yet bugs).

testing strings and string library
testing 'format %a %A'
............./home/runner/work/ravi/build/ravi: strings.lua:366: bad argument #1 to 'assert' (value expected)
>>> closing state <<<

stack traceback:
	[C]: in function 'assert'
	strings.lua:366: in main chunk
	[C]: in function 'dofile'
	all.lua:167: in main chunk
	[C]: in ?
Lua53 auto JIT test failed

@dibyendumajumdar
Copy link
Contributor Author

dibyendumajumdar commented Jun 9, 2024

It is possible that there is a separate bug (notice that the call stack has luaG_runerror - so it is already in the process of reporting an error). I will have a look at it, but right now I don't think it is related.

The code that is being reported is this:

    if (n == 0)
      luaG_runerror(L, "attempt to divide by zero");

runerror tries to get more info - and its looking for a line number; I think the bug is that this doesn't make sense when using the source to JIT compiler. I will fix this.

@dibyendumajumdar
Copy link
Contributor Author

Well, I checked this, and actually what I said above does not apply in Lua tests, because they are compiled from Lua bytecodes.
Because the ravicomp tests were not even run yet - I am unsure what valgrind is complaining about - but it may be that the Lua value union has some part of it uninitialized?

@dibyendumajumdar
Copy link
Contributor Author

okay so my second bisect run ignores other errors and focuses on the issue with "large" functions.

git bisect good
3f693641b509a14a1e527600ce848696ff1d626e is the first bad commit
commit 3f693641b509a14a1e527600ce848696ff1d626e
Author: Vladimir N. Makarov <vmakarov@redhat.com>
Date:   Tue Jun 13 10:23:56 2023 -0400

    Constrain conflict matrix size:
      Add MIR_MAX_COALESCE_VARS.  Add scan_collected_moves.  Call scan_collected_moves in consider_move_vars_only.
      Extract new func collect_moves from coalesce.

 mir-gen.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 50 insertions(+), 32 deletions(-)

3f69364

To me this looks like a bug::

  scan_vars_num = 0;
  scan_collected_moves (gen_ctx);
  return scan_vars_num > 0;

I can confirm that reverting this commit fixes the issue.

@vnmakarov
Copy link
Owner

To me this looks like a bug::

  scan_vars_num = 0;
  scan_collected_moves (gen_ctx);
  return scan_vars_num > 0;

Yes, you are absolutely right. I fixed this by 19d4a62 .

Although I have a question: how is important generated code speed for the test case on which the problem occurs?

W/o this optimization (coalescing), the generated code will be quite bad. It will have a lot of moves as for each phi result and operand a copy is added. When there are a lot vars (and consequently conflict matrix is big), I could do coalescing on live range basis. The current coalescing based on conflict graph has more quality than coalescing based on live-range uses.

@dibyendumajumdar
Copy link
Contributor Author

I can confirm the fix worked, thank you for looking into it. I assume this was size related; how is this feature tested? Presumably you do not have a test case that has a large enough function to hit the limit?

Re impact, I do not know yet as I haven't measured, but its okay I think to penalize very large functions, rather than fail completely or fail at runtime. I can of course limit JIT compilation based on function size. Usually such large functions are the top level script and the performance of these are unlikely to be important.

Another question is that in v0 there was no limit, but this limit was added in v1, is that right? What was the reason for it?

@vnmakarov
Copy link
Owner

Another question is that in v0 there was no limit, but this limit was added in v1, is that right? What was the reason for it?

v0 finds conflicts based on live-ranges. It is difficult to evaluate complexity of the algorithm but I would say it is "close to" linear. Therefore there is no limit.

v1 finds conflicts based on conflict graph. It is more superior. For example, v0 considers p1 and p2 conflicting in the following case (it is a pretty frequent case after out-of-SSA pass):

p2 = p1
...
use p1 and p1 becomes dead
...
use p2 

when v1 recognizes that there is no conflict. Complexity (in time and space) of conflict graph based algorithm can be quadratic in the worst case. Therefore I introduced a limit.

Using v0 algorithm as a backup when we hit the limit of v1 algorithm could be a solution. I'll work on it. It will be not a big addition to the MIR-generator code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants