Skip to content

MDEV-21248: Prevent optimizing out buf argument in check_stack_overrun.#1424

Merged
an3l merged 1 commit into
MariaDB:10.5from
marxin:MDEV-21248-lto-with-gcc-10
Feb 8, 2020
Merged

MDEV-21248: Prevent optimizing out buf argument in check_stack_overrun.#1424
an3l merged 1 commit into
MariaDB:10.5from
marxin:MDEV-21248-lto-with-gcc-10

Conversation

@marxin

@marxin marxin commented Dec 8, 2019

Copy link
Copy Markdown
Contributor

When using LTO, one can see optimization of stack variables that
are passed to check_stack_overrun as argument buf. That prevents
proper stack overrun detection.

@claassistantio

claassistantio commented Dec 8, 2019

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@an3l an3l added this to the 10.5 milestone Dec 11, 2019
@svoj

svoj commented Dec 11, 2019

Copy link
Copy Markdown
Contributor

Hi @marxin, unfortunately what you suggest is a scalability bottleneck. Have you managed to find out why this parameter was needed and why it helps in your case?

@marxin

marxin commented Dec 11, 2019

Copy link
Copy Markdown
Contributor Author

Hi @marxin, unfortunately what you suggest is a scalability bottleneck. Have you managed to find out why this parameter was needed and why it helps in your case?

Well, the argument is passed to the function check_stack_overrun in order to not optimize a stack variable in a caller. That way the caller can be sure that it will have enough stack space. However, this stopped working as GCC compiler can make a cross inlining (with LTO) and deduce that the argument is actually not used and the variable in caller can be optimized out.
An alternative solution would be a volatile store to the passed argument.

You can also take at a similar issue I saw recently with GCC 10:
python/cpython#17462

@svoj

svoj commented Dec 11, 2019

Copy link
Copy Markdown
Contributor

Please correct me if I'm wrong, if a stack variable is optimised away, it means less stack is used and check_stack_overrun should happily accept it?

@marxin

marxin commented Dec 11, 2019

Copy link
Copy Markdown
Contributor Author

Please correct me if I'm wrong, if a stack variable is optimised away, it means less stack is used and check_stack_overrun should happily accept it?

Yes, I suspect that in the problematic test-case check_stack_overrun will not trigger at the right time. Then I see the following test failure:

[  341s] main.execution_constants                 w75 [ fail ]
[  341s]         Test ended at 2019-12-04 11:37:46
[  341s] 
[  341s] CURRENT_TEST: main.execution_constants
[  341s] mysqltest: At line 75: query '$query_head 0 $query_tail' failed with wrong errno 1064: 'memory exhausted near '110609198224, 98224,  0 ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ' at line 1', instead of 0...
[  341s] 
[  341s] The result from queries just before the failure was:
[  341s] CREATE TABLE `t_bug21476` (
[  341s] `ID_BOARD` smallint(5) unsigned NOT NULL default '0',
[  341s] `ID_MEMBER` mediumint(8) unsigned NOT NULL default '0',
[  341s] `logTime` int(10) unsigned NOT NULL default '0',
[  341s] `ID_MSG` mediumint(8) unsigned NOT NULL default '0',
[  341s] PRIMARY KEY  (`ID_MEMBER`,`ID_BOARD`),
[  341s] KEY `logTime` (`logTime`)
[  341s] ) ENGINE=MyISAM DEFAULT CHARSET=cp1251 COLLATE=cp1251_bulgarian_ci;
[  341s] INSERT INTO `t_bug21476` VALUES (2,2,1154870939,0),(1,2,1154870957,0),(2,183,1154941362,0),(2,84,1154904301,0),(1,84,1154905867,0),(2,13,1154947484,10271),(3,84,1154880549,0),(1,6,1154892183,0),(2,25,1154947581,10271),(3,25,1154904760,0),(1,25,1154947373,10271),(1,179,1154899992,0),(2,179,1154899410,0),(5,25,1154901666,0),(2,329,1154902026,0),(3,329,1154902040,0),(1,329,1154902058,0),(1,13,1154930841,0),(3,85,1154904987,0),(1,183,1154929665,0),(3,13,1154931268,0),(1,85,1154936888,0),(1,169,1154937959,0),(2,169,1154941717,0),(3,183,1154939810,0),(3,169,1154941734,0);
[  341s] 
[  341s]  - saving '/home/abuild/rpmbuild/BUILD/mariadb-10.3.20/build/mysql-test/var/75/log/main.execution_constants/' to '/home/abuild/rpmbuild/BUILD/mariadb-10.3.20/build/mysql-test/var/log/main.execution_constants/'

@svoj

svoj commented Dec 11, 2019

Copy link
Copy Markdown
Contributor

Feels like check_stack_overrun() call is reordered before stack variable allocation, or something like this. I wonder if something like __asm__ __volatile__ ("":::"memory") in check_stack_overrun() will be helpful. Or probably there's some LTO hint we could make use of?

@vaintroub

Copy link
Copy Markdown
Member

#pragma GCC push_options
#pragma GCC optimize ("O0")
.... function where gcc is buggy
#pragma GCC pop_options

If you would not want to optimize this function, then you can tell that directly, rather than introducing magic volatiles.

@marxin

marxin commented Dec 11, 2019

Copy link
Copy Markdown
Contributor Author

Feels like check_stack_overrun() call is reordered before stack variable allocation, or something like this.

Note that the compiler is free to remove such a stack variable if it's unused. One example:

static ha_rows get_quick_record_count(THD *thd, SQL_SELECT *select,
				      TABLE *table,
				      const key_map *keys,ha_rows limit)
{
  int error;
  DBUG_ENTER("get_quick_record_count");
  uchar buff[STACK_BUFF_ALLOC];
  if (unlikely(check_stack_overrun(thd, STACK_MIN_SIZE, buff)))
    DBUG_RETURN(0);                           // Fatal error flag is set
  if (select)
  {
    select->head=table;
    table->reginfo.impossible_range=0;
    if (likely((error=
                select->test_quick_select(thd, *(key_map *)keys,
                                          (table_map) 0,
                                          limit, 0, FALSE,
                                          TRUE,     /* remove_where_parts*/
                                          FALSE)) ==
               1))
      DBUG_RETURN(select->quick->records);
    if (unlikely(error == -1))
    {
      table->reginfo.impossible_range=1;
      DBUG_RETURN(0);
    }
    DBUG_PRINT("warning",("Couldn't use record count on const keypart"));
  }
  DBUG_RETURN(HA_POS_ERROR);			/* This shouldn't happend */
}

Here buf can be removed.

I wonder if something like __asm__ __volatile__ ("":::"memory") in check_stack_overrun() will be helpful. Or probably there's some LTO hint we could make use of?

Yes, this will likely work here.

@marxin

marxin commented Dec 11, 2019

Copy link
Copy Markdown
Contributor Author

You can play with the following example:

char *global;

static int foo(char *ptr)
{
  // this does not work
  __asm__ __volatile__ ("":::"memory");
  static volatile char *x;
  x = ptr;

  // this works:
  // global = ptr;
  return 0;
}

int main()
{
  char v[100];
  return foo(v);
}
$ gcc stack.c -O2 -fdump-tree-optimized=/dev/stdout

;; Function main (main, funcdef_no=1, decl_uid=1915, cgraph_uid=2, symbol_order=2) (executed once)

main ()
{
  <bb 2> [local count: 1073741824]:
  __asm__ __volatile__("" :  :  : "memory");
  return 0;
}

So one can either assign the address to a global variable, or use #pragma GCC optimize("-O0"), which is not ideal.

@vaintroub

Copy link
Copy Markdown
Member

How a stack variable is unused in your example if it is used directly after you declare it.

uchar buff[STACK_BUFF_ALLOC];
if (unlikely(check_stack_overrun(thd, STACK_MIN_SIZE, buff)))
DBUG_RETURN(0); // Fatal error flag is set

How a compiler would remove that? You mean the compiler , for which the obscure attribute(unused) was written?

@svoj

svoj commented Dec 11, 2019

Copy link
Copy Markdown
Contributor

@vvaintroub in this particular case it may be considered unused indeed. I just wonder why did we need that variable at all?

@marxin

marxin commented Dec 11, 2019

Copy link
Copy Markdown
Contributor Author

How a compiler would remove that? You mean the compiler , for which the obscure attribute(unused) was written?

The attribute only drives warning emission, it will not prevent optimizations to remove the variables:

unused
This attribute, attached to a variable, means that the variable is meant to be possibly unused. GCC will not produce a warning for this variable.

@marxin

marxin commented Dec 11, 2019

Copy link
Copy Markdown
Contributor Author

@vvaintroub in this particular case it may be considered unused indeed. I just wonder why did we need that variable at all?

It's just a placeholder that helps you to detect that you are close stack overflow. Otherwise, you'll make a call of another function (or call a alloca or VLA) you hit the overflow without a detection.

@vaintroub

Copy link
Copy Markdown
Member

do you actually know what LTO has done to this function? Do you have a dissassemly example of that?
Maybe it just inlined the function. against that there are probably a couple of methods, obscure attribute noinline and what not

@marxin

marxin commented Dec 11, 2019

Copy link
Copy Markdown
Contributor Author

do you actually know what LTO has done to this function? Do you have a dissassemly example of that?

Sure, I can provide an assembly. Do you which function should catch the stack overflow in the mentioned test-case?

@vaintroub

Copy link
Copy Markdown
Member

before we jump into that, can you share what happens if you deoptimize the function in a documented way

like #1424 (comment)
or attribute noinline could be interesting, as well.

Thank you very much.

@marxin

marxin commented Jan 2, 2020

Copy link
Copy Markdown
Contributor Author

before we jump into that, can you share what happens if you deoptimize the function in a documented way

like #1424 (comment)
or attribute noinline could be interesting, as well.

Sorry for a long wait for the reply. Yes, adding -O0 for the function will work. It's actually what Python selected to do for a similar issue:
python/cpython#17462

Thank you very much.

When using LTO, one can see optimization of stack variables that
are passed to check_stack_overrun as argument buf. That prevents
proper stack overrun detection.
@marxin marxin force-pushed the MDEV-21248-lto-with-gcc-10 branch from eeca2b7 to c1584b7 Compare January 22, 2020 09:23
@marxin

marxin commented Jan 29, 2020

Copy link
Copy Markdown
Contributor Author

May I please ping this?

@vuvova vuvova assigned an3l and unassigned svoj Feb 5, 2020
@svoj

svoj commented Feb 5, 2020

Copy link
Copy Markdown
Contributor

@marxin sorry for this delay. I prototyped solution without extra argument. Could you check if either fb265ac or 66cb908 solves this problem?

@marxin

marxin commented Feb 6, 2020

Copy link
Copy Markdown
Contributor Author

@marxin sorry for this delay. I prototyped solution without extra argument. Could you check if either fb265ac or 66cb908 solves this problem?

No, it does not help with the issue.

@svoj

svoj commented Feb 6, 2020

Copy link
Copy Markdown
Contributor

That's unfortunate. Could you share your build parameters either here or in jira task?

@marxin

marxin commented Feb 6, 2020

Copy link
Copy Markdown
Contributor Author

That's unfortunate. Could you share your build parameters either here or in jira task?

Sure. I basically add -flto into C*FLAGS. For complete build log, you can look at the Jira issue where I attached mariadb_log.txt.bz2 file.

@svoj

svoj commented Feb 6, 2020

Copy link
Copy Markdown
Contributor

@marxin, attached log shows successful build with gcc-9. It didn't fail on my side either. How do I properly install gcc10 with tumbleweed?

@marxin

marxin commented Feb 6, 2020

Copy link
Copy Markdown
Contributor Author

@marxin, attached log shows successful build with gcc-9. It didn't fail on my side either. How do I properly install gcc10 with tumbleweed?

The TW does not have gcc10 as system compiler. You can however use open build service and build the following package:
https://build.opensuse.org/package/show/openSUSE:Factory:Staging:Gcc7/mariadb

It uses GCC 10.
Btw. what's wrong with my suggested patch?

@svoj

svoj commented Feb 6, 2020

Copy link
Copy Markdown
Contributor

We do believe that buff argument is redundant and want to try fixing the problem by removing it. If it helps, we won't have to introduce compiler specific attributes. But as you say, it didn't help and I'd like to find out why.

@marxin

marxin commented Feb 6, 2020

Copy link
Copy Markdown
Contributor Author

You are right, the buff variable is redundant, but it's key to preserve buff and long stack_used to be placed on corresponding stack frames so that available_stack_size(thd->thread_stack, &stack_used) can detect we are close to the end of a stack.

@vaintroub

vaintroub commented Feb 6, 2020

Copy link
Copy Markdown
Member

@svoj it is much better to deoptimize GCC by introducing compiler specific pragma , than to make everyone wonder why the heck the unused variable is doing there.

There are too many ways to check for buffer overrun already. my_thread_var->stack_ends_here is used in mysys/lf_alloc-pin.c maybe it helps here too.

@marxin, If LTO is producing a buggy code, why do you use it? Or rather, why don't you bug gcc maintainers, but us?

@marxin

marxin commented Feb 6, 2020

Copy link
Copy Markdown
Contributor Author

@marxin, If LTO is producing a buggy code, why do you use it? Or rather, why don't you bug gcc maintainers, but us?

Strong words here ;) Well, to be honest, an optimizing compiler is free to remove any unused variable. In this particular case, the buf argument address is not used. The compiler can prove that in LTO mode.

We use it because it brings a lot of optimization opportunities (mostly cross-module inlining) and you end up with faster and smaller binaries. We believe it's what users want.

If you believe it's a GCC issue, feel free to open an issue with a reduced test-case and we can discuss that.

@vaintroub

Copy link
Copy Markdown
Member

The question is why do you believe it is our issue? I mean, where, exactly, is the bug in the code? Users want something that works, and only then cross-module inlining and the goodies link time optimization brings. I think -O2 would suffice most of the time. Add -g and -fno-omit-frame-pointer for debuggability in production, and for the good stack traces, and that would be fine, with me. It is nothing bad to do some LTO/PGO research, but in the end, what's the point in whooping 5% improvement on a sysbench benchmark, if generated code can't pass the test suite?

@marxin

marxin commented Feb 6, 2020

Copy link
Copy Markdown
Contributor Author

The question is why do you believe it is our issue? I mean, where, exactly, is the bug in the code?

Well, you basically rely on the following:

$ cat stack1.c
void check_overflow(char *stack_start, char *stack_end);

void foo(char *stack_start)
{
  char buffer[1000];
  check_overflow (stack_start, buffer);
}

void thread_start()
{
  char my_stack;
  foo (&my_stack);
}

int main()
{
  thread_start ();
  return 0;
}

$ cat stack2.c
void
check_overflow(char *stack_start, char *stack_end)
{
  char stack;
  int diff = stack_start - &stack;
  __builtin_printf ("Diff: %d\n", diff);
  if (diff <= 1000)
    __builtin_abort ();
  __builtin_printf ("We're close to stack end.\n");
}

Which works with non-LTO mode:
$ gcc stack1.c stack2.c -O2 && ./a.out 
Diff: 1056
We're close to stack end.

But not with LTO:
$ gcc stack1.c stack2.c -O2 -flto && ./a.out 
Diff: 32
Aborted (core dumped)

You will end up with the following optimized code:

foo (char * stack_start)
{
  int diff;
  char stack;
  long int _6;

  <bb 2> [local count: 1073741824]:
  _6 = stack_start_2(D) - &stack;
  diff_7 = (int) _6;
  __builtin_printf ("Diff: %d\n", diff_7);
  if (diff_7 <= 1000)
    goto <bb 3>; [0.00%]
  else
    goto <bb 4>; [100.00%]

  <bb 3> [count: 0]:
  __builtin_abort ();

  <bb 4> [local count: 1073741824]:
  __builtin_puts (&"We\'re close to stack end."[0]); [tail call]
  stack ={v} {CLOBBER};
  return;

}



;; Function main (main, funcdef_no=3, decl_uid=4389, cgraph_uid=4, symbol_order=2) (executed once)

main ()
{
  char my_stack;

  <bb 2> [local count: 1073741824]:
  foo (&my_stack);
  my_stack ={v} {CLOBBER};
  return 0;

}

Moreover, `clang` behaves more aggressively even in non-LTO mode:
$  clang stack1.c stack2.c -O2 && ./a.out 
Diff: 32
Aborted (core dumped)

$ clang stack1.c stack2.c && ./a.out 
Diff: 1072
We're close to stack end.

Users want something that works, and only then cross-module inlining and the goodies link time optimization brings.

Well, users want ideally both.

I think -O2 would suffice most of the time. Add -g and -fno-omit-frame-pointer for debuggability in > production, and for the good stack traces, and that would be fine, with me. It is nothing bad to do > some LTO/PGO research, but in the end, what's the point in whooping 5% improvement on a sysbench benchmark, if generated code can't pass the test suite?

We're on the cutting edge. People want to have correct programs that run maximally fast. LTO is a technique that exposes much more defects in a user code base and we help people to explain the violations. And yes, sometimes LTO causes a miscompilation and we're eager to fix all these issues.

@vaintroub

Copy link
Copy Markdown
Member

Ok, the "buff" and gcc unused are both obscure. Can you please expand then on #1424 (comment) . The buff parameter is redundant, and there should not be any buffs of size 352 (STACK_BUFF_ALLOC) to check the stack size. Instead , the size can be added to STACK_MIN_SIZE, right, this is what @svoj did in 66cb908 . You said it did not help, and why it did not?

@marxin

marxin commented Feb 6, 2020

Copy link
Copy Markdown
Contributor Author

With some luck (and proper tuning of the margin) it can help, but the check is fundamentally problematic. Let's consider being in a function foo. You would like to verify first the function will have enough stack frame to do a corresponding call to different functions.

Let's imagine foo calls A with calls B, etc. You can end up with either:
call chain: A->B->C->.. or the whole chain will be inlined into foo and so that all local variables of A, B, ... will be added to stack frame of foo. And you lost because you'll reach a stack overflow at the very beginning of function foo (during the function prologue). And the decision is based on inliner heuristics (which are complex).

So that explains my current patch in this pull request: it adds an optimization barrier to check_stack_overrun which will really end up with another stack frame. But it's not a bullet-proof solution as I've just explained.

@svoj

svoj commented Feb 6, 2020

Copy link
Copy Markdown
Contributor

FWIW I managed to deploy opensuse with gcc10. My patches didn't work indeed. Even worse they break this fix. OTOH this fix alone works.

I do believe that MariaDB is guilty for this bug: it is relying on non-standard behaviour, specifically stack layout (or is it defined anywhere?)

If I fail to find out what's going on during the next week, I will vote to apply this PR. My current guess is: either stack got optimised and became smaller or stack_used got moved to an earlier position.

@svoj

svoj commented Feb 7, 2020

Copy link
Copy Markdown
Contributor

It came to be that the buffer is needed to reserve stack space for recursive fix_fields() calls. I was wrong at my assumption that fix_fields() is not called recursively. Without the buffer stack usage in fix_fields() becomes much lower, so that the parser rans out of its stack earlier than fix_fields() and test is failing with wrong error.

We thought about alternatives like adding new THD member and reserving STACK_BUFF_ALLOC there, but it feels a bit less optimal than reserving buffer on stack.

Another alternatives were:

template<size_t size> union Reserve_stack
{
  char buf[size];
  ulong stack_used;
  Reserve_stack(THD *thd)
  { stack_used= available_stack_size(thd->thread_stack, &stack_used); }
  check_overrun(THD *thd, long margin);
};

or

template<size_t size> class Reserve_stack
{
  ulong stack_used[size / sizeof(long) + 1];
  Reserve_stack(THD *thd)
  { stack_used[0]= available_stack_size(thd->thread_stack, &stack_used); }
  check_overrun(THD *thd, long margin);
};

But I guess very smart compiler may optimise these as well.

In other words we couldn't come up with anything better and your patch is approved.

an3l pushed a commit that referenced this pull request Feb 7, 2020
When using LTO, one can see optimization of stack variables that
are passed to check_stack_overrun as argument buf. That prevents
proper stack overrun detection.

Closes #1424
@an3l an3l merged commit 35c2778 into MariaDB:10.5 Feb 8, 2020
@an3l

an3l commented Feb 8, 2020

Copy link
Copy Markdown
Contributor

@marxin thanks for contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants