Closed Bug 571118 Opened 14 years ago Closed 14 years ago

investigate abnormal increase in fx 3.6.4 crashes beginning June 7, 16:15

Categories

(Socorro :: General, task)

Platform: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chofmann, Assigned: laura)

References

Details

Attachments

(21 files, 2 obsolete files)

http://people.mozilla.com/~chofmann/crash-stats/20100608/crash-counts.txt

shows a pretty dramatic increase in crashes on June 7 and 8 that doesn't correspond to the growth of the beta population over the same period.  It's possible that we just had a volume increase, but we want to make sure there aren't Socorro infrastructure changes that are giving us different numbers.

I've run a few checks to try to match up data from web queries, the daily .csv files, and the temporary crash stats reports listed above, and they all seem to confirm the increase.
So, comment 1 shows one place where the web interface/database is matching data in the .csv files and http://people.mozilla.com/~chofmann/crash-stats/20100608/

I also reopened bug 568872 to get some other direct SQL queries to compare the crash database against data in the .csv files and crash-stats reports.

One place that isn't showing the increase is the crash-per-100-users report at

 http://crash-stats.mozilla.com/daily?form_selection=by_version&p=Firefox&v[]=3.6.4&throttle[]=100&v[]=&throttle[]=100&v[]=&throttle[]=100&v[]=&throttle[]=100&os[]=Windows&os[]=Mac&os[]=Linux&date_start=2010-05-25&date_end=2010-06-08&submit=Generate

It doesn't have the full set of data for June 8th to work from yet, but it should have shown an increase for the 7th.  It looks to be consistently reporting a lower number of total reports than the .csv files.

date        crashes  ADU      throttle  crashes/100 ADU
2010-06-08   8,719   792,675  100%      1.1%
2010-06-07  18,647   781,924  100%      2.38%
2010-06-06  19,047   650,091  100%      2.93%
2010-06-05  20,107   635,033  100%      3.17%

The count of total reports in this listing also doesn't have hang data removed, but it should roughly correspond to column 2 of http://people.mozilla.com/~chofmann/crash-stats/20100608/crash-counts.txt; it's not lining up that way.

date      ADU     total-rpts  fx-crashes  fx-crashes/100 ADU
20100605  635033  24365       2227        0.351
20100606  650091  23696       2307        0.355
20100607  781924  30323       7297        0.933
20100608  792675  44122       18378       2.318
A scaled ( * 0.1) total 3.6.4 beta population is also shown.  Overall crash volume is up over the last two days, but the big increase is in the last day.
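Assuming the column labels above are right (my reading; not confirmed against crash-counts.txt itself), the last column is just column 4 as a percentage of column 2, e.g. 18378 / 792675 * 100 = 2.318 for 20100608.  Over a whole file with this layout it could be recomputed with something like:

$ awk '{printf "%s %.3f\n", $1, ($4/$2)*100}' crash-counts.txt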
Sounds like the only configuration change in the last few days on the socorro side is related to https://bugzilla.mozilla.org/show_bug.cgi?id=570478 

It's possible that these two problems are connected, but after talking with lars at the socorro meeting today it seems unlikely, unless a very high percentage of the 3.6.4 crashes were being routed through the processor that had the problem.
re: comment 1

It looks like there is uniform growth of about 10x across most of the top 40 signatures from 2010-06-08.  There are a few exceptions, like the \N crashes.

The chart below compares crash volumes across the top 40 3.6.4 signatures for June 8th and June 6th; a sketch of how counts like these can be pulled from the daily .csv files follows the table.

0606/0608  0608   0606
  ratio    count  count  signature

0.091   656     60  UserCallWinProcCheckWow 
0.097   528     51  SkypeFfComponent.dll@0x440c3 
0.108   398     43  nsGlobalWindow::cycleCollection::UnmarkPurple(nsISupports*) 
0.974   378    368  \N 
0.126   238     30  _PR_MD_SEND 
0.100   230     23  nsIFrame::GetOffsetTo(nsIFrame const*) 
0.125   224     28  RtlpWaitForCriticalSection | RtlEnterCriticalSection 
0.066   196     13  _woutput_l 
0.123   163     20  js_TraceObject 
0.056   160      9  JS_CallTracer 
0.097   154     15  StrChrIA 
0.158   146     23  GraphWalker::DoWalk(nsDeque&) 
0.159   145     23  GCGraphBuilder::NoteXPCOMChild(nsISupports*) 
0.117   137     16  nsDocument::cycleCollection::UnmarkPurple(nsISupports*) 
0.085   117     10  mozilla::plugins::PStreamNotifyParent::Send__delete__(mozilla::plugins::PStreamNotifyParent*, short const&) 
0.219   114     25  arena_dalloc_small | arena_dalloc | free | XPT_DestroyArena 
0.053   114      6  BaseThreadStart 
0.072   111      8  nsFrame::BoxReflow(nsBoxLayoutState&, nsPresContext*, nsHTMLReflowMetrics&, nsIRenderingContext*, int, int, int, int, int) 
0.152   105     16  JS_TraceChildren 
0.104    96     10  @0x0 | SkypeFfComponent.dll@0x440c3 
0.097    93      9  MultiByteToWideChar 
0.141    92     13  GetRealWindowOwner 
0.198    91     18  RtlEnterCriticalSection 
0.033    90      3  connect 
0.090    89      8  nsCycleCollector::MarkRoots(GCGraphBuilder&) 
0.170    88     15  RtlpCoalesceFreeBlocks 
0.200    80     16  js_GC 
0.114    79      9  ntdll.dll@0x38c39 
0.077    78      6   
0.027    75      2  ExecuteTree 
0.183    71     13  libc-2.11.1.so@0x329a5
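A rough sketch of the kind of one-liner that produces these per-signature counts from a daily .csv (field 1 = signature, field 8 = version, per the .csv field list later in this bug; file names are placeholders for the daily files):

$ awk -F'\t' '$8 ~ /3\.6\.4/ {print $1}' 20100606-crashdata.csv | sort | uniq -c | sort -nr | head -40
$ awk -F'\t' '$8 ~ /3\.6\.4/ {print $1}' 20100608-crashdata.csv | sort | uniq -c | sort -nr | head -40

The ratio column is then just the 0606 count divided by the 0608 count for each signature.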
We have also seen an increase in the number of signatures in the past few days, indicating growth in the volume and length of the long tail of different crashes.  Those are usually indicators of a larger, or more active, beta test group.  (A sketch for counting the distinct signatures per day follows the table.)

day      tl-rpts num-sigs crash hang   -- for 3.6.4
20100604 25378 4190 10675 14703
20100605 24365 4110 10361 14004
20100606 23696 3999 10144 13552
20100607 30323 5693 14057 16266
20100608 44122 8205 24514 19608
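The num-sigs column can be reproduced with something like this (same field assumptions as above; file name is a placeholder):

$ awk -F'\t' '$8 ~ /3\.6\.4/ {print $1}' 20100608-crashdata.csv | sort -u | wc -l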
Looks like we might have started to see the uptick in 3.6.4 crash volume on June 7th at 4pm (2010060716), just when the daily taper should have started to kick in, and then the volume increase continued for the full day on June 8.  (A sketch of the hourly bucketing follows the list.)

2010060705	1159
2010060706	1148
2010060707	1290
2010060708	1294
2010060709	1338
2010060710	1284
2010060711	1201
2010060712	1174
2010060713	1154
2010060714	1185
2010060715	1079

2010060716	1633
2010060717	1853
2010060718	1888
2010060719	1830
2010060720	1550
2010060721	1496
2010060722	1328
2010060723	1319
2010060800	1220
2010060801	1324
2010060802	1364
2010060803	1418
2010060804	1844
2010060805	2082
2010060806	2229
2010060807	2354
2010060808	2343
2010060809	2293
2010060810	2199
2010060811	2083
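These hourly counts can be produced by bucketing on the YYYYMMDDHH prefix of the processed timestamp.  A rough sketch, assuming field 5 (date_processed) carries a full date and time and stripping the separators (file name is a placeholder):

$ awk -F'\t' '$8 ~ /3\.6\.4/ {d=$5; gsub(/[^0-9]/,"",d); print substr(d,1,10)}' 20100607-crashdata.csv | sort | uniq -c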
Narrowing down further, it looks like just after 16:15 is when the crash volume started to climb:


201006071605	16
201006071606	20
201006071607	24
201006071608	21
201006071609	17
201006071610	17
201006071611	16
201006071612	20
201006071613	21
201006071614	31
201006071615	20
201006071616	26
201006071617	32
201006071618	27
201006071619	41
201006071620	36
201006071621	28
201006071622	28
201006071623	37
201006071624	33
201006071625	24
201006071626	33
201006071627	33
201006071628	37
201006071629	45
dm-breakpad-stagedb has some recent data (snapshot of prod)

When did the crashes start?
http://pastebin.mozilla.org/733511
What versions were affected?
http://pastebin.mozilla.org/733515

Drill down on 6/3 which is the first day of the spike:

Three Fx versions all show roughly 15% Errno 24 (errors as a share of non-errors):
3.6.4 errors 3056  non-errors 20135   (15.2%)
3.6.3 errors 30065 non-errors 199578  (15.1%)
3.5.9 errors 3764  non-errors 24947   (15.1%)
The total increase in 3.6.4 crashes was roughly 20,000 (gross), by chofmann's numbers.

Daniel shared
https://metrics.mozilla.com/pentaho/content/pentaho-cdf-dd/Render?solution=metrics&path=/weblogs/crashdata/crashdata_1&file=crashdata.wcdf
(In reply to comment #10)
If those pastebins expire... I've set them to keep forever at
http://pastebin.mozilla.org/733546
http://pastebin.mozilla.org/733547
The increases seem to be affecting all OSes; a sketch of how the per-OS counts can be pulled from the .csv files follows the table.

Windows going from 18-22k to 40k (~2x)
Mac going from around 140-200 crashes per day to 1090 (~5x)
Linux going from 30-40 to 142 (~3x)


day     tl-rpts   win mac linux -- for 3.6.4
20100601 24694  22435 194 46
20100602 27161  24641 218 40
20100603 23863  18849 139 31
20100604 26082  20620 146 24
20100605 25009  19618 124 28
20100606 24342  18954 133 46
20100607 31050  27012 459 46
20100608 44751  40926 1090 132
20100609 44014  40123 1091 142
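A rough sketch for the per-OS breakdown (field 8 = version, field 11 = os_name per the .csv field list later in this bug; file name is a placeholder):

$ awk -F'\t' '$8 ~ /3\.6\.4/ {print $11}' 20100609-crashdata.csv | sort | uniq -c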
marcia was going to try to check hendrix and twitter feeds to see if user comments about Firefox 3.6.4 crashes started increasing around June 7 @ 16:15.  There is going to be a lot more noise in that data, but it might be the only source of user validation of an increase/change in behavior that users are actually seeing.  Asa, got any ideas on other ways to harvest user feedback on this?  I'll also look to see whether the number of submitted crash comments on 3.6.4 had any relative changes.
What was the exact date when we enabled splitting hangs vs crashes in the
reports?  I heard May 12th.  Is that correct?  dbaron has some data which seems
to suggest it was May 15th, but I don't want to pollute the answer.
Just to see if there were previous changes in the data, I looked at the number of nsGlobalWindow::cycleCollection::UnmarkPurple for 3.6.4 crashes going back to May 14, the oldest correlation reports we currently have.  It looks pretty stable (slightly increasing, probably with usage) until May 7.
(In reply to comment #15)
> What was the exact date when we enabled splitting hangs vs crashes in the
> reports?  I heard May 12th.  Is that correct?  dbaron has some data which seems
> to suggest it was May 15th, but I don't want to pollute the answer.

The reason I ask:  I'm wondering if there was a drop in the Fx crashes at that
time, indicating that we were obscuring them somehow, and then on June
4-7th we all of a sudden stopped obscuring them.  I understand that the
oopsie to fx crash ratio stayed the same, but maybe we need to double check?
(In reply to comment #16)
> stable (slightly increasing, probably with usage) until May 7.

You mean June 7, right?
Comment 16 and its attachment match up with a lot of the other topcrasher signatures in comment 5.  All of those have been stable up to June 6, then jump on June 7 and 8.  On June 9 it looks like we might be leveling off at a new plateau.  We will have more data in the June 10 .csv to confirm that the increase is not continuing.
* Bug 570478 says 115802 crash reports were not handled for ~7 days
* 115802 / ~7 = ~16500, which we would expect to be a little low as it includes
a weekend
* That number is suspiciously close to the 20,000 increase in comment 11. That number was calculated, I believe, by taking the total on June 8th (new plateau) and subtracting the total on June 6th (old plateau)
* clock time the processor restarted in bug 570478 does not match the timing in
comment 8, though perhaps that's how long it took for the numbers to catch up

This seems to point to bug 570478 for me.
(In reply to comment #17)
> (In reply to comment #15)
> > What was the exact date when we enabled splitting hangs vs crashes in the
> > reports?  I heard May 12th.  Is that correct?  dbaron has some data which seems
> > to suggest it was May 15th, but I don't want to pollute the answer.
> 
> The reason I ask:  I'm wondering if there was a drop in the Fx crashes at that
> time, indicating that we were obscuring them somehow, and then on June
> 4-7th we all of a sudden stopped obscuring them.  I understand that the
> oopsie to fx crash ratio stayed the same, but maybe we need to double check?

On May 12, "process_type" was added to the .csv files as field 25.  That started allowing us to segregate plugin crashes from Firefox crashes; see the sketch after the field list below.

The format of the .csv files after May 12 is:

1 signature
2 url
3 uuid_url
4 client_crash_date
5 date_processed
6 last_crash
7 product
8 version
9 build
10 branch
11 os_name
12 os_version
13 cpu_name
14 address
15 bug_list
16 user_comments
17 uptime_seconds
18 email
19 adu_count
20 topmost_filenames
21 addons_checked
22 flash_version
23 hangid
24 reason
25 process_type
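A rough sketch of using field 25 to see the browser vs. plugin split for 3.6.4 (this just counts the distinct process_type values rather than assuming what they contain; file name is a placeholder):

$ awk -F'\t' '$8 ~ /3\.6\.4/ {print $25}' 20100608-crashdata.csv | sort | uniq -c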
(In reply to comment #20)
> * Bug 570478 says 115802 crash reports were not handled for ~7 days
> * 115802 / ~7 = ~16500, which we would expect to be a little low as it includes
> a weekend
> * That number is suspiciously close to the 20,000 increase in comment 11. That
> number was calculated I believe by taking total on June 8th (new plateau) and
> subtracting total on June 6th (old plateau)
> * clock time the processor restarted in bug 570478 does not match the timing in
> comment 8, though perhaps that's how long it took for the numbers to catch up
> 
> This seems to point to bug 570478 for me.

But I think we decided not to (re)process that backlog.  Lars would have to confirm.  I can also try to do a sanity check to see whether we have a lot of cases where client_crash_time is a couple of days older than process_time, beginning where the uptick starts.
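A rough sketch of that sanity check: compare client_crash_date (field 4) against date_processed (field 5) at day granularity and look for a heavy tail of old client days.  Date handling is naive and the file name is a placeholder:

$ awk -F'\t' '$8 ~ /3\.6\.4/ {c=$4; p=$5; gsub(/[^0-9]/,"",c); gsub(/[^0-9]/,"",p);
              print substr(p,1,8), substr(c,1,8)}' 20100607-crashdata.csv | sort | uniq -c | sort -nr | head

A backlog being reprocessed would show up as large counts for (processed day, client crash day) pairs that are several days apart.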
This is based on Firefox 3.6.4 data in the CSV files from May 25 through June 9.  I excluded client crash dates that were outside of May 24-June 10.

This makes it pretty clear there wasn't processing of backlogs.
Oh, and I meant to fix the labels on that graph to point out that the client crash date is (I presume) in an arbitrary timezone, so we expect a ~24-hour banding.
Confirming what dbaron sees in comments 23 and 24: process_time is PDT and uniformly increasing, and client_crash_time is in the time zone and clock setting of the PC.  We are kind of at the mercy of clock settings on the PC for those times.

The raw data I looked at for all of June 7 doesn't show any evidence of backlog processing.
This is crashes per hour on the same set of data, Firefox 3.6.4 crashes processed on May 25 through June 9, inclusive, but separated by crash type:
 - red is browser crashes
 - green is plugin crashes
 - blue is the browser side of a plugin hang
 - purple is the plugin side of a plugin hang

It's clear the spike was primarily in browser crashes.
We have looked at a lot of theories on the crash-volume side of the "crash per 100 users" equation and haven't turned up anything that helps to explain what we are seeing so far.

Daniel is going to look at the ADU side of the equation.

1. ADU info that goes into Socorro comes from Metrics' processing of blocklist pings.
2. We don't have an unusual spike in the number of reported 3.6.4 blocklist pings.
3. If there were somehow a population of 3.6.4 users that were *not* being tracked as 3.6.4 users in blocklist processing, this could account for the problem

If we were overcounting ADUs and users for 3.6.4 before 6/7, or undercounting ADUs or actual users after that date, that could account for the changes we see.

This scenario is a bit far-fetched, but a few checks could dismiss it quickly.

For example if we somehow actually pushed 3.6.4 bits to 7 million users on 6/7 but the adu numbers didn't reflect this, then the spike in 3.6.4 crashes in the socorro data makes sense.
Given comment 27, I don't think it's a misunderstanding of ADUs.  We saw a spike in browser crashes but not a spike in the other types of crash reports.
Though it would account for the crash increase while the oopsies stayed constant, we can pretty much rule out something like a crash on startup...the increases weren't a new signature, they were existing signatures ballooning which likely corresponds with real browsing/usage. This was mentioned before in a meeting but I wanted to note it here.
yeah, we have seen top signatures increase 10x between 6/6 and 6/8, and we have seen the long tail grow.  Both of those are indicators of a growing pool of people using the builds.

re: comment 29.  yeah, that might be enough to dismiss the adu problem theory.  It could also be the case that people are getting a new build and not getting to farmville, youtube, or other flash/silverlight-intensive sites yet, and so not yet growing the number of plugin crashes and hangs.

The alternate ADU theory is that we have been overcounting ADUs for several months on 3.6.4, e.g. the 3.6.4 beta population has actually only been under 100k.  It's harder to come up with a scenario where that could have happened.
My theory: the number of open file descriptors, the JVM crash, and the queue might all be interrelated.  Notes below:

On Monday, June 7th, lars/aravind noticed that processor bp02 processor_01 was having problems, and the problem started around 2010-06-04.
On June 7th, daniel noticed that the JVM crashed on one of our boxes.  These two issues might be correlated, and a low file-descriptor limit could be the cause.
aravind increased the file-descriptor limit from 1024 to ~8192, daniel (or someone) restarted one of the machines, and normal data processing resumed.

We have multiple threads/processors for socorro, but based on my limited knowledge, python's not that good at handing out threads to different cores, and that job is best left to the underlying OS.  For some reason, the majority of 3.6.4 crashes were sent to this processor, causing the queue to grow and resulting in skewed numbers.
lars suggested looking at possible domains/urls that might have caused a spike.  I don't see anything out of the ordinary in the top 20 domains: facebook, orkut, youtube... long tail.  The number of \N urls has only gone from 12k to 15k and isn't growing with the rest of the spike.  Not sure what that explains.

$ awk -F'\t' '$8 ~ /3.6\.4/ {print $2}' 20100606* | awk -F/ '{print $1,$3}' | sort | uniq -c | sort -nr | head -20
12736 \N 
2530 http: apps.facebook.com
1138 http: www.orkut.com.br
 526 http: www.facebook.com
 426 http: my.mail.ru
 394 http: www.youtube.com
 255 about:blank 
 227  
 180 http: vkontakte.ru
 125 http: www.farmville.com
 108 http: fb-0.treasure.zynga.com
  95 http: www.habbo.com.br
  70 http: www.kaixin001.com
  56 http: www.google.com.br
  50 http: www.hulu.com
  47 http: www.google.com
  43 http: www.gamezer.com
  41 http: user.qzone.qq.com
  41 http: fb-1.farmville.com
  40 http: v.youku.com

$ awk -F'\t' '$8 ~ /3.6\.4/ {print $2}' 20100609* | awk -F/ '{print $1,$3}' | sort | uniq -c | sort -nr | head -20
15847 \N 
3414 http: apps.facebook.com
2325 http: www.orkut.com.br
1899  
1236 http: www.facebook.com
 903 about:blank 
 849 http: www.youtube.com
 507 http: my.mail.ru
 471 http: www.google.com.br
 340 http: vkontakte.ru
 242 http: fb-0.treasure.zynga.com
 229 http: www.google.com
 155 https: mail.google.com
 151 http: www.farmville.com
 144 about:sessionrestore 
 141 http: gamefutebol.globoesporte.globo.com
 128 http: www.baixaki.com.br
 117 http: www.habbo.com.br
 106 http: www.kaixin001.com
 101 chrome: coralietab
Better-labeled version of the graph in comment 27.
Attachment #450559 - Attachment is obsolete: true
This shows that the spike happened for both build 4 (yellow) and build 6 (purple).  Build 5 (red) doesn't seem used enough anymore to really tell, and likewise for all the other builds.
My latest hypothesis is that crashes from other releases are somehow being mislabeled as 3.6.4 crashes.  That would explain why the only increase is in browser crashes, since they're the only crashes that other releases have.

To check that theory, I looked to see if that pattern held for other browser top crashes.  The other really high-volume crash is one that's actually OOPP-related:  RtlEnterCriticalSection.  The pattern here is very different from the pattern in comment 16; there was no spike around June 7.
(In reply to comment #36)

> My latest hypothesis is that crashes from other releases are somehow being
> mislabeled as 3.6.4 crashes.  That would explain that the only increase is in
> browser crashes, since they're the only crashes that other releases have.
> 

I like this theory.  Can somebody in build easily audit the 3.5 builds for buildids and version strings in builds available from 6/7?
I have started a timeline for possibly related issues.  Feel free to add/edit
https://intranet.mozilla.org/Socorro:Spike_Timeline
replaces attachment 450581 [details] with an additional day's (interesting) data
Attachment #450581 - Attachment is obsolete: true
Depends on: 571553
This graph is showing, within each hour, the number of crashes with a hangid divided by the number of unique hangids.

The temporary dip (from 2010-06-03 between 01:00 and 02:00 until 2010-06-07 around 10:00) is what I'd expect from the processor problem then.  (Not sure what the dips before then were.)

But the fact that there's no spike in this graph seems to contradict the theory that we're somehow reprocessing crashes that we already got.  We'd expect this number to be around 2 (slightly less due to network problems between client and server or slow networks that cause the two halves to be processed in different hours; perhaps slightly more due to any resubmission that occurs).
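A rough sketch of computing that per-hour ratio from a daily .csv (field 5 = date_processed, field 23 = hangid; how empty hangids are encoded is an assumption here and may need adjusting):

$ awk -F'\t' '$8 ~ /3\.6\.4/ && $23 != "" && $23 != "\\N" {
        h=$5; gsub(/[^0-9]/,"",h); hour=substr(h,1,10);
        crashes[hour]++; pairs[hour"\t"$23]=1 }
    END { for (k in pairs) { split(k,a,"\t"); uniq[a[1]]++ }
          for (hr in crashes) printf "%s %.2f\n", hr, crashes[hr]/uniq[hr] }' 20100607-crashdata.csv | sort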
Depends on: 571558
Depends on: 571559
(In reply to comment #42)
I'm also not seeing results to support the re-processing theory.
http://pastebin.mozilla.org/733954
The 6/7 partition doesn't have enough dups per hour to support the theory.

There are duplicates in the 5/31 partition, in the same volume and pattern.
From yesterday's analysis of the effects of the "Too many files" faulty Socorro processor:

I sampled 24257 of the raw json files that did NOT get inserted into the database from Sunday, 2010/06/06.  1630, or roughly 6.7%, self-identified as being from FF3.6.4.

This compares to a total of 333886 crashes inserted into the database that day, of which 23697, or roughly 7.1%, were identified as being from FF3.6.4.
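A rough sketch of how that sampling could be done over the raw .json files, assuming the stored metadata uses the standard crash-annotation keys (the "Version" key name and the directory path are assumptions here):

$ grep -l '"Version": *"3.6.4"' /path/to/deferred/20100606/*.json | wc -l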
(In reply to comment #36)
> Created an attachment (id=450590) [details]
> 3.6.4 RtlEnterCriticalSection numbers back to may 14
> 
> My latest hypothesis is that crashes from other releases are somehow being
> mislabeled as 3.6.4 crashes.  That would explain that the only increase is in
> browser crashes, since they're the only crashes that other releases have.
> 
> To check that theory, I looked to see if that pattern held for other browser
> top crashes.  The other really high crash is one that's actually OOPP-related: 
> RtlEnterCriticalSection.  The pattern here is very different from the pattern
> in comment 16; there was no spike around June 7.

dbaron's attachment grouped crash counts for RtlEnterCriticalSection around UTC days.  When I group around PDT directly from the .csv files, it's easier to see a trend of increasing counts for this plugin crash on June 7 as well.  We got down to a low of 186 crashes on June 6, but we have been increasing and are now at 301 crashes for 6/10.

awk -F'\t' '$1 ~ /^RtlEnterCriticalSection$/ && $8 ~ /3.6\.4/ {print $8,$1}' $file | wc -l

<previous days in May roughly map to dbaron's attachment with a slow reduction of RtlEnterCriticalSection  as more people pick up builds with the bug fix for https://bugzilla.mozilla.org/show_bug.cgi?id=563847 on 5/11 >


20100515-crashdata.csv    2607
20100516-crashdata.csv    1648
20100517-crashdata.csv    1329  42% of 3.6.4 users on 20100513144105
20100518-crashdata.csv     997  54% on 20100513144105
20100519-crashdata.csv     699  65% on 20100513144105
20100520-crashdata.csv     603  73% on 20100513144105
20100521-crashdata.csv     540  77% on 20100513144105
20100522-crashdata.csv     468  79% on 20100513144105
20100523-crashdata.csv     440  80% on 20100513144105
20100524-crashdata.csv     402
20100525-crashdata.csv     367
20100526-crashdata.csv     393  83% on 20100513144105
20100527-crashdata.csv     338
20100528-crashdata.csv     336  64% on 20100513144105 +  22% 20100523185824 
20100529-crashdata.csv     310
20100530-crashdata.csv     257
20100531-crashdata.csv     278
20100601-crashdata.csv     265
20100602-crashdata.csv     248
20100603-crashdata.csv     166
20100604-crashdata.csv     206
20100605-crashdata.csv     181
20100606-crashdata.csv     186  70% on 20100527093236 + 19% on 20100513144105

                                <- plugin crash starts to rise

20100607-crashdata.csv     246  70% on 20100527093236 + 21% 20100513144105 
20100608-crashdata.csv     267
20100609-crashdata.csv     293
20100610-crashdata.csv     301  74% on 20100527093236 + 17% 20100513144105
Depends on: 571574
re: duplicate hang comments

There is some analysis and rationale for these reports in bug 568849 -
duplicated and missing hang reports in hang id pairs in 3.6.4 crash reports
This is a comparison of counts of the domains associated with crashes of FF 3.6.4 for the dates of 2010-06-07 and 2010-05-31.  

No particular domain stands out as a spike culprit.
I took a look at the top ten crash signatures and top ten domains (where blank is "had a URL but it didn't have a domain"), for Firefox 3.6.4 build 6 Windows crashes that were browser crashes.  I compared the top signatures and domains for those crashes between (a) the period of 7 days before the spike and (b) the period 3 days after the spike (leaving a 1 hour buffer around the spike time).  The top ten set for both signatures and domains was quite similar, and only had some slight reordering:

> top_bysig_before

                                    UserCallWinProcCheckWow 
                                                        277 
                               SkypeFfComponent.dll@0x440c3 
                                                        256 
nsGlobalWindow::cycleCollection::UnmarkPurple(nsISupports*) 
                                                        229 
                     nsIFrame::GetOffsetTo(nsIFrame const*) 
                                                        144 
                                                 _woutput_l 
                                                        111 
       RtlpWaitForCriticalSection | RtlEnterCriticalSection 
                                                        102 
                                                   StrChrIA 
                                                         93 
                                                _PR_MD_SEND 
                                                         92 
               GCGraphBuilder::NoteXPCOMChild(nsISupports*) 
                                                         84 
    nsDocument::cycleCollection::UnmarkPurple(nsISupports*) 
                                                         80 

> top_bysig_after

                                    UserCallWinProcCheckWow 
                                                       1508 
                               SkypeFfComponent.dll@0x440c3 
                                                       1349 
nsGlobalWindow::cycleCollection::UnmarkPurple(nsISupports*) 
                                                        944 
                     nsIFrame::GetOffsetTo(nsIFrame const*) 
                                                        593 
                                                _PR_MD_SEND 
                                                        496 
       RtlpWaitForCriticalSection | RtlEnterCriticalSection 
                                                        494 
                                                 _woutput_l 
                                                        400 
                                                   StrChrIA 
                                                        375 
    nsDocument::cycleCollection::UnmarkPurple(nsISupports*) 
                                                        361 
               GCGraphBuilder::NoteXPCOMChild(nsISupports*) 
                                                        324 

> top_bydom_before

                   www.facebook.com  www.orkut.com.br       about:blank 
              843               448               413               291 
apps.facebook.com   www.youtube.com    www.google.com www.google.com.br 
              244               195               170               155 
     vkontakte.ru   mail.google.com 
              114                72 

> top_bydom_after

                   www.orkut.com.br  www.facebook.com       about:blank 
             3591              1904              1761              1160 
apps.facebook.com   www.youtube.com    www.google.com www.google.com.br 
              872               737               710               699 
     vkontakte.ru   mail.google.com 
              478               247 


Given that, I think it's highly unlikely that a Web site change could have caused this spike.  (And I think the fact that the spike was at the same time on Mac and Windows, and in roughly the same proportion, makes it highly unlikely to be somebody else's software deployment.)

So, in other words, I don't think this crash spike represents an increase in crashes observed by our users; I think it's either an increased ability to submit crashes or something that changed about processing of crashes on our end.


Also, it seems like the post-spike numbers may well be more believable (relative to ADUs) than the pre-spike numbers, though chofmann may have more to say about that.
re:
> Also, it seems like the post-spike numbers may well be more believable
> (relative to ADUs) than the pre-spike numbers, though chofmann may have more to
> say about that.

Socorro now has the ability to split out hangs from crashes in the crash-per-100-users chart, so this gives us more insight.

http://crash-stats.mozilla.com/daily?form_selection=by_version&p=Firefox&v[]=3.6.4&throttle[]=100&v[]=3.6.3&throttle[]=15&v[]=&throttle[]=100&v[]=&throttle[]=100&hang_type=crash&os[]=Windows&os[]=Mac&os[]=Linux&date_start=2010-05-27&date_end=2010-06-10&submit=Generate

shows 3.6.4 firefox_crashes + plugin_crashes just above current levels of 3.6.3 crashes.   Firefox 3.6.4=2.78 v. firefox 3.6.3=2.34  after the spike.

If you compare 3.6.4 firefox_crashes + plugin_crashes + hangs against 3.6.3 you get 3.6.4 = 5 v. 3.6.3 = 2 per hundred.  See:

http://crash-stats.mozilla.com/daily?form_selection=by_version&p=Firefox&v[]=3.6.4&throttle[]=100&v[]=3.6.3&throttle[]=15&v[]=&throttle[]=100&v[]=&throttle[]=100&hang_type=any&os[]=Windows&os[]=Mac&os[]=Linux&date_start=2010-05-27&date_end=2010-06-10&submit=Generate

However, we are still missing the key metric, which is 3.6.4 firefox_crashes against the 3.6.3 crashes that both take the browser down.  Reports for that come with Bug 571526, and will give us side-by-side comparisons of how much "the browser goes down" in 3.6.4 v. 3.6.3.

We can estimate that 3.6.4 still goes down less than 3.6.3, since OOP plugin crashes are some percentage of the 3.6.4 total crashes.  It's hard to get a handle on exactly what that percentage is with the current fluctuation in the data.

On 2010-06-06 it was 2227 fx_crashes v. 3880 plugin_crashes,
but on 2010-06-08 it was 18378 fx_crashes v. 3928 plugin_crashes.

and it sounds like June 10 has an increase on the plugin crash side again, although I haven't looked at that data.

I think we can say that the data is telling us that even after the events on June 7, 3.6.4 beta users have the same (or slightly lower) frequency of browser crashes that require a browser restart as Firefox 3.6.3.

Is this believable?  Yeah, we could construct some theories that could make the numbers believable, before and after June 7.  The spike, or "unblocking of crash submissions" that might have been happening for several weeks, still casts some suspicion, and the size of the beta population (~800k v. 70 million) still leaves open the question of whether these 3.6.4 numbers will be good predictors of how stable 3.6.4 is going to be when it gets wider exposure.

Overall I think we are still in good shape to ship on Tuesday, and to keep digging to understand the data and the events around June 7 better.

I'd still like to see us do staged rollouts of these mini-feature releases, with momentary 1- or 2-day stop points at 5, 10, 15 million users to try to spot regressions before we get to the full rollout, but we have already had that discussion a few times.
re: which data is believable...

I combined the .csv from the crash-stats crash_per_user chart, and the data I've been collecting about just the 3.6.4 firefox crashes.  The resulting chart is attached.

I guess the best way to sum up is that for almost a month before June 7, the total number of 3.6.4 incidents (firefox crashes + plugin crashes + hang pairs) was tracking with 3.6.3.  After June 7, firefox crashes on 3.6.4 are tracking with firefox crashes on 3.6.3.

Browser crashes that require restart should be reduced by at least 20-30% in 3.6.4 with OOPP turned on, and the data isn't showing that after June 7.  

If the data post June 7 is accurate, it suggests we need to look closer for regression crashes in the long tail of 3.6.4 data, or at the volume of some of the firefox-side OOPP crash bugs we know about that have been deferred to 3.6.6.
https://bug571118.bugzilla.mozilla.org/attachment.cgi?id=450855 also shows that the increase in hang reports we saw on Thursday continued on Friday (green line continued trending upward)
I filed a separate bug, bug 571691, on the spike in plugin hang reports on June 10; I think that spike is related to a change in farmville.
Summary: investigate abnormal increase in fx 3.6.4 crashes in the last two days → investigate abnormal increase in fx 3.6.4 crashes beginning June 7, 16:15
growth in the farmville/plugin hangs slowed down, and other trends remained stable.
here is a chart that shows crash per 100 user data when 3.6 was at 800k users prior to launch and shortly after when it had near 15 million.

It also shows a couple of grey bands that reflect an estimate that we had that flash crashes made up about 20-30% of all crashes.  Between the grey bands is the area where we might expect the OOPP effect to be for 3.6.4 during release candidates, and in post release deployment.   After the strange spike on June 7, 3.6.4 is falling within these grey bands.
I took a look at the trends for all the recent 3.6.x releases, and after weeding out some wild data points for days when socorro had capacity or other problems, it looks like all these releases trended downward at about a 3% slope over their lifetime.  That helps to support the theory that it might be more realistic to compare 3.6.4 to the early lifetimes of other 3.6 releases than to compare it to the current 3.6.3, which has been in the field for a while.
Hm, so we are seeing a global climb in crash data across all versions. That's interesting.
If you remove the dip caused by the "Errno 24" errors talked about in comment 10, which resulted in a 15% reduction in crash reports processed across all versions during June 3-6, then it really doesn't look like an "increase across versions."

To me it looks a bit more like daily fluctuations.
Also removed those Errno 24 days where there was a 15% reduction in reports processed.
There are a few med/low volume regressions still around in build 6.  Those would be the ones to keep an eye on as we push out the final release, to spot if we see changes in the volume.  This is after removing hangs and looking 100 signatures deep into the long tail.

Bug 556643  - [OOPP] NS_RUNTIMEABORT("Corrupted plugin stream data.") crash in [@ mozilla::plugins::PluginModuleParent::StreamCast(_NPP*, _NPStream*) ]

bug 572131 Summary: Firefox 3.6.4 Crash [@ @0x0 | Plugin.dll@0xfdb8 ]

Bug 572187  - Firefox 3.6.4 Crash Report [@ mozilla::ipc::SyncChannel::ShouldContinueFromTimeout() ]
I was asked about diversity of IPs:
sqlite> select count(distinct(ip)) from crash_report_ips where date='06';
65921
sqlite> select count(distinct(ip)) from crash_report_ips where date='07';
66371
sqlite> select count(distinct(ip)) from crash_report_ips where date='08';
66618
Nothing especially interesting here.  The spike was across Flash versions.  Flash version composition started shifting noticeably to 10.1 a few days after the spike, though.
Depends on: 572248
Depends on: 572419
Depends on: 572501
Depends on: 573381
(In reply to comment #56)
> Created an attachment (id=451093) [details]
> crash per 100 user chart with 3.6RC and expected OOPP effect on 3.6RC's and
> 3.6.3
> 
> here is a chart that shows crash per 100 user data when 3.6 was at 800k users
> prior to launch and shortly after when it had near 15 million.
> 
> It also shows a couple of grey bands that reflect an estimate that we had that
> flash crashes made up about 20-30% of all crashes.  Between the grey bands is
> the area where we might expect the OOPP effect to be for 3.6.4 during release
> candidates, and in post release deployment.   After the strange spike on June 7,
> 3.6.4 is falling within these grey bands.

The latest data on this shows the red line still inside the grey band, meaning 3.6.4 is inside the expected crashes per hundred users of the 3.6 RC/initial release and the current 3.6.3.

3.6.4 fx_crashes + plugin_crashes + hangs continues to be slightly higher than total 3.6.3 crashes at initial release.
Summarizing final solution:

On April 10, we were requested to throttle 3.6.4 crashes at 100% (i.e. process all of them).  IT made a change to the collector config to support this, and as usual, the config was pushed out to all the collectors using git.

At this point git experienced stuck locks.  We have seen this issue before in web pushes on SUMO and AMO.  What this meant was that one collector had the new config, and the other two were processing 3.6.4 crashes as if they were release version crashes, that is, at 15%.

On June 7 a different (Hbase related) change was made to the config, and at this point when the configs were pushed, the two 15% collectors were updated to the newer config and began throttling at 100% also.  This accounts for the jump in crashes.
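As a rough consistency check on this explanation (my arithmetic, assuming crash submissions were spread roughly evenly across the three collectors): the effective sampling rate before the config caught up would have been about (100% + 15% + 15%) / 3 = ~43%, and the daily 3.6.4 non-hang crash counts did jump from roughly 10.1-10.7k to roughly 24.5k across June 7, a ratio of about 0.42-0.44.  The hang counts did not move by the same factor, so the even-split assumption is only approximate.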

We have audited deferred crash storage and Hbase to confirm this theory.

This issue was uncovered in 
https://bugzilla.mozilla.org/show_bug.cgi?id=573381
Assignee: nobody → laura
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro