SQL Server 2016 SP1 CU3 清除計劃記憶體
每隔幾分鐘到大約一小時,proc 記憶體就會被刷新(但不是全部!)。如果我執行:
SELECT count (*) FROM sys.dm_exec_cached_plans
…在清理髮生之後,計劃的數量下降到幾百個,然後逐漸增加到大約 2,000 個,然後再次清理,依此類推。
伺服器在 VMWare 上執行,它有128 GB的 RAM(SQL Server 最大伺服器記憶體設置為102 GB,最小伺服器記憶體設置為72 GB)。根據 SentryOne 的輸出,我可以看到緩衝池消耗了**~61 GB**。我沒有看到任何記憶體壓力指標。
我還觀察到,當我將最小伺服器記憶體減少到 16 GB 時,清除 proc 記憶體的頻率急劇增加。
我的 SQL Server 版本如下:
Microsoft SQL Server 2016 (SP1-CU3) (KB4019916) - 13.0.4435.0 (X64)
還有什麼:
- 我還沒有嘗試過 LPIM,因為它需要重新啟動 SQL Server,但即使它是解決方案,我也很想了解為什麼會發生此問題。此外,布倫特建議它僅在緩衝池修剪的情況下才有幫助。在我的情況下,緩衝池看起來完好無損,只有 proc 記憶體受到影響。
- 我沒有找到任何可以為此負責的工作,也找不到任何真正的模式。
- 如果有人發現這對修復/調查此問題很有用,我可以訪問 vCenter。但如果是這樣的話,請你能不能讓我知道我應該在那裡檢查什麼。
我懷疑 VMWare 可能在這裡扮演了一些角色,但是當我檢查來賓端可用的性能計數器時,我沒有發現任何可疑之處。
根據Brent 的建議,我還收集了sp_Blitz的輸出:
**Priority 10: Performance**: - Query Store Disabled - The new SQL Server 2016 Query Store feature has not been enabled on this database. * Database1 * Database2 * distribution * Database3 * Database3Integration * Database3IntegrationSandBox * Database4 * Database5 * Database6 * Database6Sandbox * Database7 * Database8 * Database9 **Priority 50: Performance**: - Poison Wait Detected: Serializable Locking - 0:04:49:23 of LCK_M_R% waits have been recorded. This wait often indicates killer performance problems. **Priority 50: Reliability**: - Transaction Log Larger than Data File * Database2 - The database [Database2] has a 1 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough. * Database4 - The database [Database4] has a 1 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough. * Database8 - The database [Database8] has a 1 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough. * Database9 - The database [Database9] has a 1 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough. **Priority 100: Performance**: - Fill Factor Changed Database3 - The [Database3] database has objects with fill factor < 80%. This can cause memory and storage performance problems, but may also prevent page splits. - Many Plans for One Query - 180 plans are present for a single query in the plan cache - meaning we probably have parameterization issues. - Stored Procedure WITH RECOMPILE distribution - [distribution].[dbo].[sp_MSdelete_dodelete] has WITH RECOMPILE in the stored procedure code, which may cause increased CPU usage due to constant recompiles of the code. **Priority 110: Performance**: - Active Tables Without Clustered Indexes * Database3 - The [Database3] database has heaps - tables without a clustered index - that are being actively queried. * Database3Integration - The [Database3Integration] database has heaps - tables without a clustered index - that are being actively queried. * Database3IntegrationSandBox - The [Database3IntegrationSandBox] database has heaps - tables without a clustered index - that are being actively queried. * msdb - The [msdb] database has heaps - tables without a clustered index - that are being actively queried. * Database6 - The [Database6] database has heaps - tables without a clustered index - that are being actively queried. * Database6Sandbox - The [Database6Sandbox] database has heaps - tables without a clustered index - that are being actively queried. * Database9 - The [Database9] database has heaps - tables without a clustered index - that are being actively queried. **Priority 150: Performance**: - Deadlocks Happening Daily - 116 deadlocks have been recorded since startup. - Foreign Keys Not Trusted * Database3 - The [Database3] database has foreign keys that were probably disabled, data was changed, and then the key was enabled again. Simply enabling the key is not enough for the optimizer to use this key - we have to alter the table using the WITH CHECK CHECK CONSTRAINT parameter. * Database6 - The [Database6] database has foreign keys that were probably disabled, data was changed, and then the key was enabled again. Simply enabling the key is not enough for the optimizer to use this key - we have to alter the table using the WITH CHECK CHECK CONSTRAINT parameter. - Inactive Tables Without Clustered Indexes * distribution - The [distribution] database has heaps - tables without a clustered index - that have not been queried since the last restart. These may be backup tables carelessly left behind. * Database3 - The [Database3] database has heaps - tables without a clustered index - that have not been queried since the last restart. These may be backup tables carelessly left behind. * msdb - The [msdb] database has heaps - tables without a clustered index - that have not been queried since the last restart. These may be backup tables carelessly left behind. - Queries Forcing Join Hints - 1106 instances of join hinting have been recorded since restart. This means queries are bossing the SQL Server optimizer around, and if they don't know what they're doing, this can cause more harm than good. This can also explain why DBA tuning efforts aren't working. - Queries Forcing Order Hints - 1106 instances of order hinting have been recorded since restart. This means queries are bossing the SQL Server optimizer around, and if they don't know what they're doing, this can cause more harm than good. This can also explain why DBA tuning efforts aren't working. - Triggers on Tables * Database3 - The [Database3] database has 2 triggers. * Database9 - The [Database9] database has 2 triggers. **Priority 170: Reliability**: - Max File Size Set * Database3 - The [Database3] database file Database3Prod_log has a max file size set to 40960MB. If it runs out of space, the database will stop working even though there may be drive space available. * Database8 - The [Database8] database file db263426220 has a max file size set to 1024MB. If it runs out of space, the database will stop working even though there may be drive space available. * Database9 - The [Database9] database file Database9_log has a max file size set to 10000MB. If it runs out of space, the database will stop working even though there may be drive space available. **Priority 200: Backup**: - MSDB Backup History Not Purged msdb - Database backup history retained back to Feb 27 2017 3:48PM **Priority 200: Informational**: - Agent Jobs Starting Simultaneously - Multiple SQL Server Agent jobs are configured to start simultaneously. For detailed schedule listings, see the query in the URL. - Cluster Node - This is a node in a cluster. - Collation is SQL_Latin1_General_CP1_CI_AS - Collation differences between user databases and tempdb can cause conflicts especially when comparing string values * Database3Integration * Database3IntegrationSandBox * Database7 * Database8 - Replication In Use * distribution - Database [distribution] is a replication publisher, subscriber, or distributor. * Database3 - Database [Database3] is a replication publisher, subscriber, or distributor. - Tables in the MSDB Database - The SQLSentryAlertLog_20 table in the msdb database was created by end users on Mar 10 2017 4:40PM. Tables in the msdb database may not be restored in the event of a disaster. - The SQLSentryDBEmails_20 table in the msdb database was created by end users on Mar 10 2017 4:40PM. Tables in the msdb database may not be restored in the event of a disaster. - The SQLSentryDBEmails_Attachments_20 table in the msdb database was created by end users on Mar 10 2017 4:40PM. Tables in the msdb database may not be restored in the event of a disaster. - The SQLSentryEmails_20 table in the msdb database was created by end users on Mar 10 2017 4:40PM. Tables in the msdb database may not be restored in the event of a disaster. - The SQLSentryLogCache_20 table in the msdb database was created by end users on Mar 10 2017 4:40PM. Tables in the msdb database may not be restored in the event of a disaster. - The SQLSentryLogData_20 table in the msdb database was created by end users on Mar 10 2017 4:40PM. Tables in the msdb database may not be restored in the event of a disaster. - The SQLSentryObjectVersion_20 table in the msdb database was created by end users on Mar 10 2017 4:40PM. Tables in the msdb database may not be restored in the event of a disaster. - The SQLSentryQueueLog_20 table in the msdb database was created by end users on Mar 10 2017 4:40PM. Tables in the msdb database may not be restored in the event of a disaster. - TraceFlag On - Trace flag 3226 is enabled globally. **Priority 200: Monitoring**: - Agent Jobs Without Failure Emails - The job Agent history clean up: distribution has not been set up to notify an operator if it fails. - The job Database1_Job_DeleteExpiredSessions has not been set up to notify an operator if it fails. - The job CommandLog Cleanup has not been set up to notify an operator if it fails. - The job DatabaseBackup - SYSTEM_DATABASES - FULL has not been set up to notify an operator if it fails. - The job DatabaseBackup - USER_DATABASES - DIFF has not been set up to notify an operator if it fails. - The job DatabaseBackup - USER_DATABASES - FULL has not been set up to notify an operator if it fails. - The job DatabaseBackup - USER_DATABASES - LOG has not been set up to notify an operator if it fails. - The job DatabaseIntegrityCheck - SYSTEM_DATABASES has not been set up to notify an operator if it fails. - The job DatabaseIntegrityCheck - USER_DATABASES has not been set up to notify an operator if it fails. - The job Distribution clean up: distribution has not been set up to notify an operator if it fails. - The job Expired subscription clean up has not been set up to notify an operator if it fails. - The job IndexOptimize - USER_DATABASES has not been set up to notify an operator if it fails. - The job IntegrationAPI_SSIS_Integration has not been set up to notify an operator if it fails. - The job Output File Cleanup has not been set up to notify an operator if it fails. - The job Replication agents checkup has not been set up to notify an operator if it fails. - The job sp_cycle_errorlog has not been set up to notify an operator if it fails. - The job sp_delete_backuphistory has not been set up to notify an operator if it fails. - The job sp_purge_jobhistory has not been set up to notify an operator if it fails. - The job SSIS Server Maintenance Job has not been set up to notify an operator if it fails. - The job StatisticsUpdate - USER_DATABASES has not been set up to notify an operator if it fails. - Alerts Disabled - The following Alert is disabled, please review and enable if desired: Peer-to-peer conflict detection alert - The following Alert is disabled, please review and enable if desired: Replication: agent custom shutdown - The following Alert is disabled, please review and enable if desired: Replication: agent retry - The following Alert is disabled, please review and enable if desired: Replication: agent success - The following Alert is disabled, please review and enable if desired: Replication: expired subscription dropped - The following Alert is disabled, please review and enable if desired: Replication: Subscriber has failed data validation - The following Alert is disabled, please review and enable if desired: Replication: Subscriber has passed data validation - The following Alert is disabled, please review and enable if desired: Replication: Subscription reinitialized after validation failure - No Alerts for Corruption - SQL Server Agent alerts do not exist for errors 823, 824, and 825. These three errors can give you notification about early hardware failure. Enabling them can prevent you a lot of heartbreak. - No Alerts for Sev 19-25 - SQL Server Agent alerts do not exist for severity levels 19 through 25. These are some very severe SQL Server errors. Knowing that these are happening may let you recover from errors faster. - No failsafe operator configured - No failsafe operator is configured on this server. This is a good idea just in-case there are issues with the [msdb] database that prevents alerting. - Not All Alerts Configured - Not all SQL Server Agent alerts have been configured. This is a free, easy way to get notified of corruption, job failures, or major outages even before monitoring systems pick it up. **Priority 200: Non-Default Server Config**: - Agent XPs - This sp_configure option has been changed. Its default value is 0 and it has been set to 1. - backup checksum default - This sp_configure option has been changed. Its default value is 0 and it has been set to 1. - backup compression default - This sp_configure option has been changed. Its default value is 0 and it has been set to 1. - clr enabled - This sp_configure option has been changed. Its default value is 0 and it has been set to 1. - cost threshold for parallelism - This sp_configure option has been changed. Its default value is 5 and it has been set to 50. - Database Mail XPs - This sp_configure option has been changed. Its default value is 0 and it has been set to 1. - max degree of parallelism - This sp_configure option has been changed. Its default value is 0 and it has been set to 8. - max server memory (MB) - This sp_configure option has been changed. Its default value is 2147483647 and it has been set to 106496. - min server memory (MB) - This sp_configure option has been changed. Its default value is 0 and it has been set to 73728. - optimize for ad hoc workloads - This sp_configure option has been changed. Its default value is 0 and it has been set to 1. - remote admin connections - This sp_configure option has been changed. Its default value is 0 and it has been set to 1. - scan for startup procs - This sp_configure option has been changed. Its default value is 0 and it has been set to 1. - show advanced options - This sp_configure option has been changed. Its default value is 0 and it has been set to 1. **Priority 200: Performance**: - User-Created Statistics In Place * distribution - [distribution] has 10 user-created statistics. This indicates that someone is being a rocket scientist with the stats, and might actually be slowing things down, especially during stats updates. * Database3 - [Database3] has 7 user-created statistics. This indicates that someone is being a rocket scientist with the stats, and might actually be slowing things down, especially during stats updates. **Priority 210: Non-Default Database Config**: - ANSI NULL Default Enabled - This database setting is not the default. * Database3Integration * Database3IntegrationSandBox - Read Committed Snapshot Isolation Enabled - This database setting is not the default. * Database6 * Database6Sandbox **Priority 240: Wait Stats**: - 1 - CXPACKET - 817.9 hours of waits, 259.3 minutes average wait time per hour, 0.7% signal wait, 45111405 waiting tasks, 65.3 ms average wait time. - 2 - TRACEWRITE - 188.8 hours of waits, 59.9 minutes average wait time per hour, 0.0% signal wait, 354542 waiting tasks, 1917.4 ms average wait time. **Priority 250: Server Info**: - Default Trace Contents - The default trace holds 28 hours of data between Jul 31 2017 3:30AM and Aug 1 2017 7:55AM. The default trace files are located in: E:\MSSQL13.MSSQLSERVER\MSSQL\Log - Drive C Space - 35530.00MB free on C drive - Drive E Space - 129384.00MB free on E drive - Drive F Space - 135632.00MB free on F drive - Drive G Space - 13597.00MB free on G drive - Drive H Space - 102956.00MB free on H drive - Drive I Space - 4207.00MB free on I drive - Drive Q Space - 9619.00MB free on Q drive - Drive R Space - 255214.00MB free on R drive - Hardware - Logical processors: 8. Physical memory: 128GB. - Hardware - NUMA Config - Node: 0 State: ONLINE Online schedulers: 8 Offline schedulers: 0 Processor Group: 0 Memory node: 0 Memory VAS Reserved GB: 160 - Server Last Restart - Jul 23 2017 8:07PM - Server Name - SecretServerName - Services - Service: SQL Full-text Filter Daemon Launcher (MSSQLSERVER) runs under service account NT Service\MSSQLFDLauncher. Last startup time: not shown.. Startup type: Manual, currently Running. - Service: SQL Server (MSSQLSERVER) runs under service account SecretDomain\SQLSERVER. Last startup time: Jul 24 2017 10:39AM. Startup type: Manual, currently Running. - Service: SQL Server Agent (MSSQLSERVER) runs under service account SecretDomain\SQLAGENT. Last startup time: not shown.. Startup type: Manual, currently Running. - SQL Server Last Restart - Jul 24 2017 10:39AM - SQL Server Service - Version: 13.0.4435.0. Patch Level: SP1. Edition: Standard Edition (64-bit). AlwaysOn Enabled: 0. AlwaysOn Mgr Status: 2 - Virtual Server - Type: (HYPERVISOR) - Windows Version - You're running a pretty modern version of Windows: Server 2012R2 era, version 6.3 **Priority 254: Rundate**: - Captain's log: stardate something and something...
這是輸出的前幾行:
SELECT * FROM sys.dm_exec_query_stats WITH (NOLOCK) ORDER BY creation_time ASC
如果我錯了,請糾正我,但這表明在 15:38:48 左右發生了一些事情並刪除了計劃,並且由於某種原因它沒有觸及在 15:33 創建的計劃。從 15:38:48.117 開始,計劃再次開始記憶體。
plan_handle creation_time last_execution_time 0x0500070027F2492EC09D25603E00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:33:55.160 2017-08-01 15:59:18.223 0x05000700CD381A65E0AD58482900000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.117 2017-08-01 15:59:50.737 0x05000700CD381A65E0AD58482900000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.120 2017-08-01 15:59:50.737 0x0500070091273D3040B458482900000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.120 2017-08-01 15:59:56.963 0x05000700CD381A65E0AD58482900000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.127 2017-08-01 15:59:50.737 0x05000700AB256B35A0BC58482900000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.133 2017-08-01 15:59:50.533 0x05000700B31CF10530BE58482900000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.163 2017-08-01 15:59:58.687 0x0500070051C21425C0BE58482900000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.167 2017-08-01 15:59:56.003 0x0500070055138E3B50BF58482900000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.177 2017-08-01 15:59:55.017 0x05000700ED867A0F30FFB4B53200000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.180 2017-08-01 15:59:56.643 0x050006005D05B621B0D1B6803C00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.183 2017-08-01 15:59:58.590 0x0500070036D29E2540D3B6803C00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.190 2017-08-01 15:59:44.230 0x050007008228A749A0DEB6803C00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.203 2017-08-01 15:59:58.620 0x050007002C6CA30830DFB6803C00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.207 2017-08-01 15:59:43.103 0x050007000F2BB316E06D90782E00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.230 2017-08-01 15:59:52.497 0x05000700022A9029706E90782E00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.237 2017-08-01 15:59:56.000 0x050007004B75B43FF0CEE1104200000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.247 2017-08-01 15:59:50.907 0x050007002F3E632220DEE1104200000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.260 2017-08-01 15:59:58.047 0x050007003EC92476307D90782E00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.263 2017-08-01 15:59:58.610 0x05000700119B6E78C07E90782E00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.283 2017-08-01 15:59:54.340 0x05000700B13E0511507F90782E00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.317 2017-08-01 15:59:50.640 0x05000700F888E46D30BFC37B3D00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.333 2017-08-01 15:59:50.903 0x050007006015F81980B9C37B3D00000001000000000000000000000000000000000000000000000000000000 2017-08-01 15:38:48.340 2017-08-01 15:59:44.077
以及多達 700 多行…
更新 02.08.2017
最後我看到了Jonathan Kehayias 的一篇文章,我決定檢查環形緩衝區和繁榮!,結果我收到來自資源監視器的通知說:“低物理記憶體”。此通知的出現非常適合 proc 記憶體清除。 現在的案例是如何解釋這些結果以及如何找到負責任的過程。您可以在查詢結果中找到:
SELECT EventTime, record.value('(/Record/ResourceMonitor/Notification)[1]', 'varchar(max)') as [Type], record.value('(/Record/ResourceMonitor/IndicatorsProcess)[1]', 'int') as [IndicatorsProcess], record.value('(/Record/ResourceMonitor/IndicatorsSystem)[1]', 'int') as [IndicatorsSystem], record.value('(/Record/MemoryRecord/AvailablePhysicalMemory)[1]', 'bigint') AS [Avail Phys Mem, Kb], record.value('(/Record/MemoryRecord/AvailableVirtualAddressSpace)[1]', 'bigint') AS [Avail VAS, Kb] FROM ( SELECT DATEADD (ss, (-1 * ((cpu_ticks / CONVERT (float, ( cpu_ticks / ms_ticks ))) - [timestamp])/1000), GETDATE()) AS EventTime, CONVERT (xml, record) AS record FROM sys.dm_os_ring_buffers CROSS JOIN sys.dm_os_sys_info WHERE ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR') AS tab ORDER BY EventTime DESC;
儘管可用物理記憶體的值保持在同一水平,但我們可以觀察到“低物理記憶體”標誌。此外,如果我沒有誤會上述結果表明內部記憶體壓力(IndicatorsProcess = 2,這對我來說很奇怪,因為 Sentry 一直表明 SQL Server 沒有充分利用分配的記憶體。這是 Sentry 擷取的記憶體使用情況作為範例早上 8 點拍攝:
所有的線條都很平坦。對我來說也很奇怪的是那組事件:
- RESOURCE_MEMPHYSICAL_LOW
- RESOURCE_MEM_STEADY
- RESOURCE_MEMPHYSICAL_HIGH
同時發生。所以這種壓力需要幾毫秒或更短的時間(也許這也是 Sentry 不擷取任何東西的原因,因為它收集數據的頻率較低)。
現在我想知道如何確定這種記憶體壓力的根本原因?
最好的祝福,
格雷格
最終,LPIM 做了一個詭計並解決/掩蓋了這個問題,但根本原因仍然不確定,甚至可能是 SQL Server 中的一個錯誤。
如果有人對進一步的發現感興趣,請導航到 MSDN 論壇上的討論:
你在這裡有幾個不同的問題:
**“鎖定記憶體中的頁面”設置是否保留計劃記憶體?**只是間接的。LPIM 意味著如果 SQL Server 在記憶體壓力下不會分頁到磁碟,但是當作業系統處於壓力下時 SQL Server仍然會放棄記憶體。Jonathan Kehayias 詳細介紹了這一點。簡短的回答是,當盒子承受壓力時,給定 16GB 的最小記憶體設置,即使使用 LPIM,您仍然會看到 SQL Server 放棄記憶體(並減少計劃記憶體的內容)。
**SQL 2016 SP1 CU3 是否定期清除計劃記憶體?**與其他版本沒有什麼不同。您的標題表明您在應用 CU 後看到了不同的行為 - 如果確實如此,最好的辦法是解除安裝 CU(返回到以前的 CU2),然後查看行為是否發生變化。不過,我敢打賭,您看到的行為恰好是最近才開始的。
**我如何查看正在清除計劃記憶體的部分內容?**由於您寫道計劃記憶體並未完全清除,因此您需要查看正在消失的計劃中涉及哪些數據庫/表/索引/統計資訊。例如,您可能有一個表達到了 20% 的更改門檻值,觸發了自動統計更新。
由於這不是真正的緊急情況,因此我將為您提供我會採取的臨時、低勞動強度的方法來解決它。我會定期執行sp_BlitzCache(免責聲明:我是合著者之一,儘管我的程式碼很糟糕),了解前 10 個資源密集型查詢。(甚至可以使用 OutputDatabaseName、OutputSchemaName 和 OutputTableName 參數將其記錄到表中。)然後,當計劃記憶體清除時,執行它並查看哪些計劃消失了。他們有什麼共同點嗎,比如他們都打了一個主要的銷售表嗎?如果是這樣,那張桌子上的變化率是多少?您可以使用rowmodctr粗略了解哪些表更改的頻率。
由於您使用的是 SQL Sentry,因此您還可以基於此設置自定義監視器:
SELECT COUNT(*) FROM sys.dm_exec_query_stats WITH (NOLOCK)
這將為您提供一個折線圖,以確定您的計劃記憶體何時下降。如果它以非常特定的時間間隔(例如每小時)發生,那麼我將使用此答案中的技術來追踪導致它的原因。