Oh joy ... a crash on our Amazon EC2 hosted web server

By joe

October 22, 2012 - 4 minutes read - 713 words

[update 2] Yuppers, Amazon US East N. Virginia is experiencing issues. C.f. here.

Doug thought I did this with our IP update (larger block of static). So did I for a moment until I logged in. Partial failure on my part for not having the backup live/ready. Wlll remedy over the next day or two.

[update] I think we can call this one a #fail.

Imagine if some small business with no technical acumen, sold on the “push this button to run your website” saw that error message.

We moved the company web server to EC2 during our site move. So far, I’ve been pretty happy with arrangement, though it is not cheap to run a web server in the cloud. So a few minutes ago, we noticed the site was down. I log in and find this:

[14723518.261905] type=1400 audit(1350525262.856:2): apparmor="STATUS" operation="profile_load" name="/sbin/dhclient" pid=307 comm="apparmor_parser"
[14723518.262331] type=1400 audit(1350525262.856:3): apparmor="STATUS" operation="profile_load" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=307 comm="apparmor_parser"
[14723518.262574] type=1400 audit(1350525262.856:4): apparmor="STATUS" operation="profile_load" name="/usr/lib/connman/scripts/dhclient-script" pid=307 comm="apparmor_parser"
[14723518.788693] init: udev-fallback-graphics main process (387) terminated with status 1
[14723519.586454] ISOFS: Unable to identify CD-ROM format.
[14723519.651291] ISOFS: Unable to identify CD-ROM format.
[14723520.158420] ISOFS: Unable to identify CD-ROM format.
[14723520.212266] ISOFS: Unable to identify CD-ROM format.
[14723520.461771] kjournald starting.  Commit interval 5 seconds
[14723520.462199] EXT3-fs (xvdb): using internal journal
[14723520.462204] EXT3-fs (xvdb): mounted filesystem with ordered data mode
[14723520.500658] Adding 4194300k swap on /dev/xvdf.  Priority:-1 extents:1 across:4194300k SS
[14723520.694194] init: failsafe main process (562) killed by TERM signal
[14723521.082805] type=1400 audit(1350525265.676:5): apparmor="STATUS" operation="profile_replace" name="/sbin/dhclient" pid=629 comm="apparmor_parser"
[14723521.083261] type=1400 audit(1350525265.676:6): apparmor="STATUS" operation="profile_replace" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=629 comm="apparmor_parser"
[14723521.083504] type=1400 audit(1350525265.676:7): apparmor="STATUS" operation="profile_replace" name="/usr/lib/connman/scripts/dhclient-script" pid=629 comm="apparmor_parser"
[14723521.147769] type=1400 audit(1350525265.740:8): apparmor="STATUS" operation="profile_load" name="/usr/sbin/mysqld" pid=630 comm="apparmor_parser"
[14723521.174302] type=1400 audit(1350525265.768:9): apparmor="STATUS" operation="profile_load" name="/usr/sbin/tcpdump" pid=632 comm="apparmor_parser"
[14723521.559871] type=1400 audit(1350525266.152:10): apparmor="STATUS" operation="profile_replace" name="/usr/sbin/mysqld" pid=712 comm="apparmor_parser"
[14723524.552633] postgres (816): /proc/816/oom_adj is deprecated, please use /proc/816/oom_score_adj instead.
[14723526.930550] init: plymouth-upstart-bridge main process (598) killed by TERM signal
[14723529.685708] eth0: no IPv6 routers present
[15128036.049267] INFO: task jbd2/xvda1-8:158 blocked for more than 120 seconds.
[15128036.049284] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[15128036.049293] jbd2/xvda1-8    D ffff8800efc13700     0   158      2 0x00000000
[15128036.049302]  ffff8800026d1ac0 0000000000000246 000000002044eb1b ffffffff81cd9a00
[15128036.049307]  ffff8800026d1fd8 ffff8800026d1fd8 ffff8800026d1fd8 0000000000013700
[15128036.049310]  ffffffff81c0d020 ffff8800028e2dc0 ffff8800026d1a90 ffff8800efc13fc0
[15128036.049315] Call Trace:
[15128036.049327]  [<ffffffff81114640>] ? __lock_page+0x70/0x70
[15128036.049337]  [<ffffffff81651e5f>] schedule+0x3f/0x60
[15128036.049348]  [<ffffffff81651f0f>] io_schedule+0x8f/0xd0
[15128036.049351]  [<ffffffff8111464e>] sleep_on_page+0xe/0x20
[15128036.049355]  [<ffffffff8165272f>] __wait_on_bit+0x5f/0x90
[15128036.049359]  [<ffffffff811147b8>] wait_on_page_bit+0x78/0x80
[15128036.049364]  [<ffffffff810890f0>] ? autoremove_wake_function+0x40/0x40
[15128036.049368]  [<ffffffff811148cc>] filemap_fdatawait_range+0x10c/0x1a0
[15128036.049371]  [<ffffffff8111498b>] filemap_fdatawait+0x2b/0x30
[15128036.049379]  [<ffffffff8125ce70>] journal_finish_inode_data_buffers+0x70/0x170
[15128036.049382]  [<ffffffff8125d765>] jbd2_journal_commit_transaction+0x665/0x1240
[15128036.049391]  [<ffffffff81076142>] ? try_to_del_timer_sync+0x92/0x130
[15128036.049394]  [<ffffffff81261feb>] kjournald2+0xbb/0x220
[15128036.049397]  [<ffffffff810890b0>] ? add_wait_queue+0x60/0x60
[15128036.049400]  [<ffffffff81261f30>] ? commit_timeout+0x10/0x10
[15128036.049402]  [<ffffffff8108860c>] kthread+0x8c/0xa0
[15128036.049407]  [<ffffffff8165e474>] kernel_thread_helper+0x4/0x10
[15128036.049410]  [<ffffffff8165c523>] ? int_ret_from_sys_call+0x7/0x1b
[15128036.049414]  [<ffffffff816541fc>] ? retint_restore_args+0x5/0x6
[15128036.049417]  [<ffffffff8165e470>] ? gs_change+0x13/0x13
[15128036.049422] INFO: task rs:main Q:Reg:611 blocked for more than 120 seconds.
[15128036.049428] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[15128036.049434] rs:main Q:Reg   D ffff8800e711f000     0   611      1 0x00000000
[15128036.049438]  ffff8800e83157e8 0000000000000286 ffffea0002bbf680 ffff8800effc02a0
[15128036.049444]  ffff8800e8315fd8 ffff8800e8315fd8 ffff8800e8315fd8 0000000000013700
[15128036.049448]  ffff8800e75a44a0 ffff8800029796e0 ffff8800e83157f8 ffff8800e8315858
[15128036.049452] Call Trace:
[15128036.049455]  [<ffffffff81651e5f>] schedule+0x3f/0x60
[15128036.049458]  [<ffffffff8125c29d>] do_get_write_access+0x27d/0x4f0
[15128036.049465]  [<ffffffff8115ffc7>] ? __cmpxchg_double_slab.isra.22+0x17/0x90
[15128036.049468]  [<ffffffff810890f0>] ? autoremove_wake_function+0x40/0x40
[15128036.049472]  [<ffffffff811a6807>] ? __getblk+0x27/0x60
[15128036.049475]  [<ffffffff81653efe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[15128036.049479]  [<ffffffff8125c650>] jbd2_journal_get_write_access+0x30/0x50
[15128036.049485]  [<ffffffff8123e73e>] __ext4_journal_get_write_access+0x3e/0x80
[15128036.049490]  [<ffffffff81215cb8>] ext4_reserve_inode_write+0x78/0xa0
[15128036.049493]  [<ffffffff81215d2e>] ext4_mark_inode_dirty+0x4e/0x200
[15128036.049496]  [<ffffffff812181ec>] ? ext4_dirty_inode+0x3c/0x60
[15128036.049499]  [</ffffffff812181ec><ffffffff812181ec>] ext4_dirty_inode+0x3c/0x60
[15128036.049504]  [<ffffffff8119d9a0>] __mark_inode_dirty+0x40/0x2a0
[15128036.049507]  [<ffffffff811a7c64>] ? block_write_end+0x44/0x80
[15128036.049510]  [<ffffffff811a7d0a>] generic_write_end+0x6a/0xa0
[15128036.049513]  [<ffffffff81216c02>] ext4_da_write_end+0xf2/0x360
[15128036.049516]  [<ffffffff81114afa>] generic_perform_write+0x13a/0x210
[15128036.049519]  [<ffffffff81114c2d>] generic_file_buffered_write+0x5d/0x90
[15128036.049523]  [<ffffffff81116599>] __generic_file_aio_write+0x229/0x440
[15128036.049528]  [<ffffffff810543cd>] ? set_next_entity+0xad/0xd0
[15128036.049531]  [<ffffffff81116822>] generic_file_aio_write+0x72/0xe0
[15128036.049535]  [<ffffffff8120d8af>] ext4_file_write+0xbf/0x260
[15128036.049539]  [<ffffffff8109cbf8>] ? futex_wait+0x108/0x210
[15128036.049544]  [<ffffffff8117495a>] do_sync_write+0xda/0x120
[15128036.049550]  [<ffffffff812d3e68>] ? apparmor_file_permission+0x18/0x20
[15128036.049555]  [<ffffffff8129963c>] ? security_file_permission+0x2c/0xb0
[15128036.049558]  [<ffffffff81174f01>] ? rw_verify_area+0x61/0xf0
[15128036.049560]  [<ffffffff81175263>] vfs_write+0xb3/0x180
[15128036.049563]  [<ffffffff8117558a>] sys_write+0x4a/0x90
[15128036.049565]  [<ffffffff8165c302>] system_call_fastpath+0x16/0x1b
[15128036.049572] INFO: task flush-202:1:779 blocked for more than 120 seconds.
</ffffffff8165c302></ffffffff8117558a></ffffffff81175263></ffffffff81174f01></ffffffff8129963c></ffffffff812d3e68></ffffffff8117495a></ffffffff8109cbf8></ffffffff8120d8af></ffffffff81116822></ffffffff810543cd></ffffffff81116599></ffffffff81114c2d></ffffffff81114afa></ffffffff81216c02></ffffffff811a7d0a></ffffffff811a7c64></ffffffff8119d9a0></ffffffff812181ec></ffffffff81215d2e></ffffffff81215cb8></ffffffff8123e73e></ffffffff8125c650></ffffffff81653efe></ffffffff811a6807></ffffffff810890f0></ffffffff8115ffc7></ffffffff8125c29d></ffffffff81651e5f></ffffffff8165e470></ffffffff816541fc></ffffffff8165c523></ffffffff8165e474></ffffffff8108860c></ffffffff81261f30></ffffffff810890b0></ffffffff81261feb></ffffffff81076142></ffffffff8125d765></ffffffff8125ce70></ffffffff8111498b></ffffffff811148cc></ffffffff810890f0></ffffffff811147b8></ffffffff8165272f></ffffffff8111464e></ffffffff81651f0f></ffffffff81651e5f></ffffffff81114640>

So I am working on rebooting it. Interesting. I can’t. Happily we have full daily dumps of the database and backups of the web server, so at worst, this is a few hour rebuild somewhere else if we can’t get it fully up. Very annoying that there is no “Force this (&(&%*&^ machine to reboot hard” button. Or even a rough equivalent. There is a stop/start pair, and I am in the process of trying that now. Started before I started typing this. Hasn’t gotten it back yet. Rethinking that aspect of our cloud strategy now … if we can’t restart a machine from a known state (isn’t this part and parcel for what the cloud is), then why exactly is this appealing versus running this in-house?