MDEV-21248: Prevent optimizing out buf argument in check_stack_overrun. by marxin · Pull Request #1424 · MariaDB/server

marxin · 2019-12-08T09:54:16Z

When using LTO, one can see optimization of stack variables that
are passed to check_stack_overrun as argument buf. That prevents
proper stack overrun detection.

claassistantio · 2019-12-08T09:55:02Z

All committers have signed the CLA.

svoj · 2019-12-11T11:31:41Z

Hi @marxin, unfortunately what you suggest is a scalability bottleneck. Have you managed to find out why this parameter was needed and why it helps in your case?

marxin · 2019-12-11T11:57:43Z

Hi @marxin, unfortunately what you suggest is a scalability bottleneck. Have you managed to find out why this parameter was needed and why it helps in your case?

Well, the argument is passed to the function check_stack_overrun in order to not optimize a stack variable in a caller. That way the caller can be sure that it will have enough stack space. However, this stopped working as GCC compiler can make a cross inlining (with LTO) and deduce that the argument is actually not used and the variable in caller can be optimized out.
An alternative solution would be a volatile store to the passed argument.

You can also take at a similar issue I saw recently with GCC 10:
python/cpython#17462

svoj · 2019-12-11T12:19:04Z

Please correct me if I'm wrong, if a stack variable is optimised away, it means less stack is used and check_stack_overrun should happily accept it?

marxin · 2019-12-11T12:22:37Z

Please correct me if I'm wrong, if a stack variable is optimised away, it means less stack is used and check_stack_overrun should happily accept it?

Yes, I suspect that in the problematic test-case check_stack_overrun will not trigger at the right time. Then I see the following test failure:

[  341s] main.execution_constants                 w75 [ fail ]
[  341s]         Test ended at 2019-12-04 11:37:46
[  341s] 
[  341s] CURRENT_TEST: main.execution_constants
[  341s] mysqltest: At line 75: query '$query_head 0 $query_tail' failed with wrong errno 1064: 'memory exhausted near '110609198224, 98224,  0 ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ' at line 1', instead of 0...
[  341s] 
[  341s] The result from queries just before the failure was:
[  341s] CREATE TABLE `t_bug21476` (
[  341s] `ID_BOARD` smallint(5) unsigned NOT NULL default '0',
[  341s] `ID_MEMBER` mediumint(8) unsigned NOT NULL default '0',
[  341s] `logTime` int(10) unsigned NOT NULL default '0',
[  341s] `ID_MSG` mediumint(8) unsigned NOT NULL default '0',
[  341s] PRIMARY KEY  (`ID_MEMBER`,`ID_BOARD`),
[  341s] KEY `logTime` (`logTime`)
[  341s] ) ENGINE=MyISAM DEFAULT CHARSET=cp1251 COLLATE=cp1251_bulgarian_ci;
[  341s] INSERT INTO `t_bug21476` VALUES (2,2,1154870939,0),(1,2,1154870957,0),(2,183,1154941362,0),(2,84,1154904301,0),(1,84,1154905867,0),(2,13,1154947484,10271),(3,84,1154880549,0),(1,6,1154892183,0),(2,25,1154947581,10271),(3,25,1154904760,0),(1,25,1154947373,10271),(1,179,1154899992,0),(2,179,1154899410,0),(5,25,1154901666,0),(2,329,1154902026,0),(3,329,1154902040,0),(1,329,1154902058,0),(1,13,1154930841,0),(3,85,1154904987,0),(1,183,1154929665,0),(3,13,1154931268,0),(1,85,1154936888,0),(1,169,1154937959,0),(2,169,1154941717,0),(3,183,1154939810,0),(3,169,1154941734,0);
[  341s] 
[  341s]  - saving '/home/abuild/rpmbuild/BUILD/mariadb-10.3.20/build/mysql-test/var/75/log/main.execution_constants/' to '/home/abuild/rpmbuild/BUILD/mariadb-10.3.20/build/mysql-test/var/log/main.execution_constants/'

svoj · 2019-12-11T12:38:10Z

Feels like check_stack_overrun() call is reordered before stack variable allocation, or something like this. I wonder if something like __asm__ __volatile__ ("":::"memory") in check_stack_overrun() will be helpful. Or probably there's some LTO hint we could make use of?

vaintroub · 2019-12-11T13:41:23Z

#pragma GCC push_options
#pragma GCC optimize ("O0")
.... function where gcc is buggy
#pragma GCC pop_options

If you would not want to optimize this function, then you can tell that directly, rather than introducing magic volatiles.

marxin · 2019-12-11T13:55:04Z

Feels like check_stack_overrun() call is reordered before stack variable allocation, or something like this.

Note that the compiler is free to remove such a stack variable if it's unused. One example:

static ha_rows get_quick_record_count(THD *thd, SQL_SELECT *select,
				      TABLE *table,
				      const key_map *keys,ha_rows limit)
{
  int error;
  DBUG_ENTER("get_quick_record_count");
  uchar buff[STACK_BUFF_ALLOC];
  if (unlikely(check_stack_overrun(thd, STACK_MIN_SIZE, buff)))
    DBUG_RETURN(0);                           // Fatal error flag is set
  if (select)
  {
    select->head=table;
    table->reginfo.impossible_range=0;
    if (likely((error=
                select->test_quick_select(thd, *(key_map *)keys,
                                          (table_map) 0,
                                          limit, 0, FALSE,
                                          TRUE,     /* remove_where_parts*/
                                          FALSE)) ==
               1))
      DBUG_RETURN(select->quick->records);
    if (unlikely(error == -1))
    {
      table->reginfo.impossible_range=1;
      DBUG_RETURN(0);
    }
    DBUG_PRINT("warning",("Couldn't use record count on const keypart"));
  }
  DBUG_RETURN(HA_POS_ERROR);			/* This shouldn't happend */
}

Here buf can be removed.

I wonder if something like __asm__ __volatile__ ("":::"memory") in check_stack_overrun() will be helpful. Or probably there's some LTO hint we could make use of?

Yes, this will likely work here.

marxin · 2019-12-11T14:06:55Z

You can play with the following example:

char *global;

static int foo(char *ptr)
{
  // this does not work
  __asm__ __volatile__ ("":::"memory");
  static volatile char *x;
  x = ptr;

  // this works:
  // global = ptr;
  return 0;
}

int main()
{
  char v[100];
  return foo(v);
}

$ gcc stack.c -O2 -fdump-tree-optimized=/dev/stdout

;; Function main (main, funcdef_no=1, decl_uid=1915, cgraph_uid=2, symbol_order=2) (executed once)

main ()
{
  <bb 2> [local count: 1073741824]:
  __asm__ __volatile__("" :  :  : "memory");
  return 0;
}

So one can either assign the address to a global variable, or use #pragma GCC optimize("-O0"), which is not ideal.

vaintroub · 2019-12-11T14:08:00Z

How a stack variable is unused in your example if it is used directly after you declare it.

uchar buff[STACK_BUFF_ALLOC];
if (unlikely(check_stack_overrun(thd, STACK_MIN_SIZE, buff)))
DBUG_RETURN(0); // Fatal error flag is set

How a compiler would remove that? You mean the compiler , for which the obscure attribute(unused) was written?

svoj · 2019-12-11T14:10:05Z

@vvaintroub in this particular case it may be considered unused indeed. I just wonder why did we need that variable at all?

marxin · 2019-12-11T14:10:45Z

How a compiler would remove that? You mean the compiler , for which the obscure attribute(unused) was written?

The attribute only drives warning emission, it will not prevent optimizations to remove the variables:

unused
This attribute, attached to a variable, means that the variable is meant to be possibly unused. GCC will not produce a warning for this variable.

marxin · 2019-12-11T14:12:01Z

@vvaintroub in this particular case it may be considered unused indeed. I just wonder why did we need that variable at all?

It's just a placeholder that helps you to detect that you are close stack overflow. Otherwise, you'll make a call of another function (or call a alloca or VLA) you hit the overflow without a detection.

vaintroub · 2019-12-11T14:46:38Z

do you actually know what LTO has done to this function? Do you have a dissassemly example of that?
Maybe it just inlined the function. against that there are probably a couple of methods, obscure attribute noinline and what not

marxin · 2019-12-11T14:48:01Z

do you actually know what LTO has done to this function? Do you have a dissassemly example of that?

Sure, I can provide an assembly. Do you which function should catch the stack overflow in the mentioned test-case?

vaintroub · 2019-12-11T16:31:04Z

before we jump into that, can you share what happens if you deoptimize the function in a documented way

like #1424 (comment)
or attribute noinline could be interesting, as well.

Thank you very much.

marxin · 2020-01-02T08:30:43Z

before we jump into that, can you share what happens if you deoptimize the function in a documented way

like #1424 (comment)
or attribute noinline could be interesting, as well.

Sorry for a long wait for the reply. Yes, adding -O0 for the function will work. It's actually what Python selected to do for a similar issue:
python/cpython#17462

Thank you very much.

When using LTO, one can see optimization of stack variables that are passed to check_stack_overrun as argument buf. That prevents proper stack overrun detection.

marxin · 2020-01-29T10:08:31Z

May I please ping this?

svoj · 2020-02-05T19:43:48Z

@marxin sorry for this delay. I prototyped solution without extra argument. Could you check if either fb265ac or 66cb908 solves this problem?

marxin · 2020-02-06T10:55:31Z

@marxin sorry for this delay. I prototyped solution without extra argument. Could you check if either fb265ac or 66cb908 solves this problem?

No, it does not help with the issue.

svoj · 2020-02-06T11:11:13Z

That's unfortunate. Could you share your build parameters either here or in jira task?

marxin · 2020-02-06T11:18:36Z

That's unfortunate. Could you share your build parameters either here or in jira task?

Sure. I basically add -flto into C*FLAGS. For complete build log, you can look at the Jira issue where I attached mariadb_log.txt.bz2 file.

svoj · 2020-02-06T12:44:55Z

@marxin, attached log shows successful build with gcc-9. It didn't fail on my side either. How do I properly install gcc10 with tumbleweed?

marxin · 2020-02-06T12:54:51Z

@marxin, attached log shows successful build with gcc-9. It didn't fail on my side either. How do I properly install gcc10 with tumbleweed?

The TW does not have gcc10 as system compiler. You can however use open build service and build the following package:
https://build.opensuse.org/package/show/openSUSE:Factory:Staging:Gcc7/mariadb

It uses GCC 10.
Btw. what's wrong with my suggested patch?

svoj · 2020-02-06T13:00:43Z

We do believe that buff argument is redundant and want to try fixing the problem by removing it. If it helps, we won't have to introduce compiler specific attributes. But as you say, it didn't help and I'd like to find out why.

marxin · 2020-02-06T13:50:18Z

You are right, the buff variable is redundant, but it's key to preserve buff and long stack_used to be placed on corresponding stack frames so that available_stack_size(thd->thread_stack, &stack_used) can detect we are close to the end of a stack.

vaintroub · 2020-02-06T18:12:38Z

@svoj it is much better to deoptimize GCC by introducing compiler specific pragma , than to make everyone wonder why the heck the unused variable is doing there.

There are too many ways to check for buffer overrun already. my_thread_var->stack_ends_here is used in mysys/lf_alloc-pin.c maybe it helps here too.

@marxin, If LTO is producing a buggy code, why do you use it? Or rather, why don't you bug gcc maintainers, but us?

marxin · 2020-02-06T19:48:08Z

@marxin, If LTO is producing a buggy code, why do you use it? Or rather, why don't you bug gcc maintainers, but us?

Strong words here ;) Well, to be honest, an optimizing compiler is free to remove any unused variable. In this particular case, the buf argument address is not used. The compiler can prove that in LTO mode.

We use it because it brings a lot of optimization opportunities (mostly cross-module inlining) and you end up with faster and smaller binaries. We believe it's what users want.

If you believe it's a GCC issue, feel free to open an issue with a reduced test-case and we can discuss that.

vaintroub · 2020-02-06T20:03:19Z

The question is why do you believe it is our issue? I mean, where, exactly, is the bug in the code? Users want something that works, and only then cross-module inlining and the goodies link time optimization brings. I think -O2 would suffice most of the time. Add -g and -fno-omit-frame-pointer for debuggability in production, and for the good stack traces, and that would be fine, with me. It is nothing bad to do some LTO/PGO research, but in the end, what's the point in whooping 5% improvement on a sysbench benchmark, if generated code can't pass the test suite?

marxin · 2020-02-06T20:33:09Z

The question is why do you believe it is our issue? I mean, where, exactly, is the bug in the code?

Well, you basically rely on the following:

$ cat stack1.c
void check_overflow(char *stack_start, char *stack_end);

void foo(char *stack_start)
{
  char buffer[1000];
  check_overflow (stack_start, buffer);
}

void thread_start()
{
  char my_stack;
  foo (&my_stack);
}

int main()
{
  thread_start ();
  return 0;
}

$ cat stack2.c
void
check_overflow(char *stack_start, char *stack_end)
{
  char stack;
  int diff = stack_start - &stack;
  __builtin_printf ("Diff: %d\n", diff);
  if (diff <= 1000)
    __builtin_abort ();
  __builtin_printf ("We're close to stack end.\n");
}

Which works with non-LTO mode:
$ gcc stack1.c stack2.c -O2 && ./a.out 
Diff: 1056
We're close to stack end.

But not with LTO:
$ gcc stack1.c stack2.c -O2 -flto && ./a.out 
Diff: 32
Aborted (core dumped)

You will end up with the following optimized code:

foo (char * stack_start)
{
  int diff;
  char stack;
  long int _6;

  <bb 2> [local count: 1073741824]:
  _6 = stack_start_2(D) - &stack;
  diff_7 = (int) _6;
  __builtin_printf ("Diff: %d\n", diff_7);
  if (diff_7 <= 1000)
    goto <bb 3>; [0.00%]
  else
    goto <bb 4>; [100.00%]

  <bb 3> [count: 0]:
  __builtin_abort ();

  <bb 4> [local count: 1073741824]:
  __builtin_puts (&"We\'re close to stack end."[0]); [tail call]
  stack ={v} {CLOBBER};
  return;

}



;; Function main (main, funcdef_no=3, decl_uid=4389, cgraph_uid=4, symbol_order=2) (executed once)

main ()
{
  char my_stack;

  <bb 2> [local count: 1073741824]:
  foo (&my_stack);
  my_stack ={v} {CLOBBER};
  return 0;

}

Moreover, `clang` behaves more aggressively even in non-LTO mode:
$  clang stack1.c stack2.c -O2 && ./a.out 
Diff: 32
Aborted (core dumped)

$ clang stack1.c stack2.c && ./a.out 
Diff: 1072
We're close to stack end.

Users want something that works, and only then cross-module inlining and the goodies link time optimization brings.

Well, users want ideally both.

I think -O2 would suffice most of the time. Add -g and -fno-omit-frame-pointer for debuggability in > production, and for the good stack traces, and that would be fine, with me. It is nothing bad to do > some LTO/PGO research, but in the end, what's the point in whooping 5% improvement on a sysbench benchmark, if generated code can't pass the test suite?

We're on the cutting edge. People want to have correct programs that run maximally fast. LTO is a technique that exposes much more defects in a user code base and we help people to explain the violations. And yes, sometimes LTO causes a miscompilation and we're eager to fix all these issues.

vaintroub · 2020-02-06T20:52:08Z

Ok, the "buff" and gcc unused are both obscure. Can you please expand then on #1424 (comment) . The buff parameter is redundant, and there should not be any buffs of size 352 (STACK_BUFF_ALLOC) to check the stack size. Instead , the size can be added to STACK_MIN_SIZE, right, this is what @svoj did in 66cb908 . You said it did not help, and why it did not?

marxin · 2020-02-06T21:19:57Z

With some luck (and proper tuning of the margin) it can help, but the check is fundamentally problematic. Let's consider being in a function foo. You would like to verify first the function will have enough stack frame to do a corresponding call to different functions.

Let's imagine foo calls A with calls B, etc. You can end up with either:
call chain: A->B->C->.. or the whole chain will be inlined into foo and so that all local variables of A, B, ... will be added to stack frame of foo. And you lost because you'll reach a stack overflow at the very beginning of function foo (during the function prologue). And the decision is based on inliner heuristics (which are complex).

So that explains my current patch in this pull request: it adds an optimization barrier to check_stack_overrun which will really end up with another stack frame. But it's not a bullet-proof solution as I've just explained.

svoj · 2020-02-06T21:26:34Z

FWIW I managed to deploy opensuse with gcc10. My patches didn't work indeed. Even worse they break this fix. OTOH this fix alone works.

I do believe that MariaDB is guilty for this bug: it is relying on non-standard behaviour, specifically stack layout (or is it defined anywhere?)

If I fail to find out what's going on during the next week, I will vote to apply this PR. My current guess is: either stack got optimised and became smaller or stack_used got moved to an earlier position.

svoj · 2020-02-07T12:47:15Z

It came to be that the buffer is needed to reserve stack space for recursive fix_fields() calls. I was wrong at my assumption that fix_fields() is not called recursively. Without the buffer stack usage in fix_fields() becomes much lower, so that the parser rans out of its stack earlier than fix_fields() and test is failing with wrong error.

We thought about alternatives like adding new THD member and reserving STACK_BUFF_ALLOC there, but it feels a bit less optimal than reserving buffer on stack.

Another alternatives were:

template<size_t size> union Reserve_stack
{
  char buf[size];
  ulong stack_used;
  Reserve_stack(THD *thd)
  { stack_used= available_stack_size(thd->thread_stack, &stack_used); }
  check_overrun(THD *thd, long margin);
};

or

template<size_t size> class Reserve_stack
{
  ulong stack_used[size / sizeof(long) + 1];
  Reserve_stack(THD *thd)
  { stack_used[0]= available_stack_size(thd->thread_stack, &stack_used); }
  check_overrun(THD *thd, long margin);
};

But I guess very smart compiler may optimise these as well.

In other words we couldn't come up with anything better and your patch is approved.

When using LTO, one can see optimization of stack variables that are passed to check_stack_overrun as argument buf. That prevents proper stack overrun detection. Closes #1424

an3l · 2020-02-08T05:01:31Z

@marxin thanks for contribution!

an3l added this to the 10.5 milestone Dec 11, 2019

an3l assigned svoj Dec 11, 2019

an3l added the license-mca label Dec 31, 2019

MDEV-21248: Prevent optimizing out buf argument in check_stack_overrun.

c1584b7

When using LTO, one can see optimization of stack variables that are passed to check_stack_overrun as argument buf. That prevents proper stack overrun detection.

marxin force-pushed the MDEV-21248-lto-with-gcc-10 branch from eeca2b7 to c1584b7 Compare January 22, 2020 09:23

vuvova assigned an3l and unassigned svoj Feb 5, 2020

an3l merged commit 35c2778 into MariaDB:10.5 Feb 8, 2020

Uh oh!

Uh oh!

Conversation

marxin commented Dec 8, 2019

Uh oh!

claassistantio commented Dec 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

svoj commented Dec 11, 2019

Uh oh!

marxin commented Dec 11, 2019

Uh oh!

svoj commented Dec 11, 2019

Uh oh!

marxin commented Dec 11, 2019

Uh oh!

svoj commented Dec 11, 2019

Uh oh!

vaintroub commented Dec 11, 2019

Uh oh!

marxin commented Dec 11, 2019

Uh oh!

marxin commented Dec 11, 2019

Uh oh!

vaintroub commented Dec 11, 2019

Uh oh!

svoj commented Dec 11, 2019

Uh oh!

marxin commented Dec 11, 2019

Uh oh!

marxin commented Dec 11, 2019

Uh oh!

vaintroub commented Dec 11, 2019

Uh oh!

marxin commented Dec 11, 2019

Uh oh!

vaintroub commented Dec 11, 2019

Uh oh!

marxin commented Jan 2, 2020

Uh oh!

marxin commented Jan 29, 2020

Uh oh!

svoj commented Feb 5, 2020

Uh oh!

marxin commented Feb 6, 2020

Uh oh!

svoj commented Feb 6, 2020

Uh oh!

marxin commented Feb 6, 2020

Uh oh!

svoj commented Feb 6, 2020

Uh oh!

marxin commented Feb 6, 2020

Uh oh!

svoj commented Feb 6, 2020

Uh oh!

marxin commented Feb 6, 2020

Uh oh!

vaintroub commented Feb 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marxin commented Feb 6, 2020

Uh oh!

vaintroub commented Feb 6, 2020

Uh oh!

marxin commented Feb 6, 2020

Uh oh!

vaintroub commented Feb 6, 2020

Uh oh!

marxin commented Feb 6, 2020

Uh oh!

svoj commented Feb 6, 2020

Uh oh!

svoj commented Feb 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

an3l commented Feb 8, 2020

Uh oh!

Reviewers

Assignees

claassistantio commented Dec 8, 2019 •

edited

Loading

vaintroub commented Feb 6, 2020 •

edited

Loading

svoj commented Feb 7, 2020 •

edited

Loading