- SQL Server Reference Guide
Last week, we learned about a simple framework to follow to build programs. This week, we'll use that framework to create a complex statistical algorithm.
You'll recall from last week's chapter that, while a more formal approach is called for in larger projects, our simplified development framework has four main phases:
Understand the goal or problem as completely as possible
Comment the process
Code the comments
Optimize the code
We'll use these phases to create today's algorithm.
Understand the goal or problem as completely as possible
Let's begin the process by understanding, as completely as possible, what we want to accomplish. There may not be time to become an expert in a particular area, but try to understand as far as you can. When you reach your limit, bring topic experts into the design process.
Today's concept is regression statistics. Regression analysis shows relationships between sets of data. With this method, we're trying to show whether one thing is related to another. Be careful here: just because we brush our teeth each morning and the sun comes up each morning, we can't say that one causes the other!
In a simple example of a regression, we can examine one dependent variable in relation to one independent variable. The formula looks like this:
Y = a + bX
Where Y is the value of the dependent variable, X is the value of the independent variable, a is the intercept of the regression line on the Y axis when X = 0, and b is the slope of the regression line. We can take the data, plug it into the formula, and plot the points.
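To make the formula concrete, here's a minimal sketch that plugs numbers into it. The intercept and slope values are made up purely for illustration; they aren't derived from any data in this article.

```sql
-- Hypothetical values, chosen only to show how Y = a + bX is evaluated
DECLARE @a float, @b float, @X float
SET @a = 2.0    -- intercept: the value of Y when X = 0
SET @b = 0.5    -- slope: how much Y changes for each unit of X
SET @X = 10.0   -- the independent variable

SELECT @a + (@b * @X) AS Y   -- 2.0 + (0.5 * 10.0) = 7.0
```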
But there's a simpler way to look at this kind of data. Statistical algorithms allow us to understand the data without having to process it all. This is the "sample" we've learned about.
Let's take a look at a concrete example. Below we have the data for the number of days of rain and the growth, in inches, of the jalapeño pepper plant in my garden:
Growth (inches) | Days of rain |
10 | 6 |
7 | 4 |
12 | 7 |
12 | 8 |
9 | 10 |
16 | 7 |
12 | 10 |
18 | 15 |
8 | 5 |
12 | 6 |
14 | 11 |
16 | 13 |
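The queries that follow read this data from a table named test. The article doesn't show how that table was built, so here's one way to set it up. The int column types and the choice of the pubs database (which the later listing uses) are assumptions.

```sql
-- Assumed schema: whole-number columns fit the data shown above
USE pubs
GO

CREATE TABLE test (Growth int NOT NULL, Days int NOT NULL)
GO

-- One INSERT per row, so the script also runs on older SQL Server versions
INSERT INTO test (Growth, Days) VALUES (10, 6)
INSERT INTO test (Growth, Days) VALUES (7, 4)
INSERT INTO test (Growth, Days) VALUES (12, 7)
INSERT INTO test (Growth, Days) VALUES (12, 8)
INSERT INTO test (Growth, Days) VALUES (9, 10)
INSERT INTO test (Growth, Days) VALUES (16, 7)
INSERT INTO test (Growth, Days) VALUES (12, 10)
INSERT INTO test (Growth, Days) VALUES (18, 15)
INSERT INTO test (Growth, Days) VALUES (8, 5)
INSERT INTO test (Growth, Days) VALUES (12, 6)
INSERT INTO test (Growth, Days) VALUES (14, 11)
INSERT INTO test (Growth, Days) VALUES (16, 13)
GO
```

If you don't have pubs installed, any scratch database will do; just change the USE statement to match.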
Rather than solving the formula above, we can plot this data graphically. Once we have the data plotted on an X and Y axis, we can see a pattern emerge, which seems to indicate that growth follows rain. Here's a query that draws a rough plot in plain text:
SELECT REPLICATE(' ',Growth) + '*' FROM test ORDER BY days DESC
And here's the output:
                  *
                *
              *
         *
            *
            *
            *
                *
          *
            *
        *
       *
As we can see, a line drawn as close as possible to all of these points slopes distinctly upward, showing a possible relationship between the number of days of rain and the growth in inches.
In the code snippet above we've used a space to indent the result, and print an asterisk to mark the spot. We then ordered the data by day to see if it "scattered" around a particular point.
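If you'd like to match each asterisk back to its row, a small variation on that query helps. This is only a sketch: the extra columns and the tie-breaker on Growth are additions, there so that rows with the same number of days come back in a repeatable order.

```sql
-- Same text plot, with the source values shown beside each point
SELECT Days,
       Growth,
       REPLICATE(' ', Growth) + '*' AS Plot   -- indent by Growth, then mark the spot
FROM test
ORDER BY Days DESC, Growth DESC               -- tie-breaker is an assumption, not in the original query
```

Run it with results sent to text rather than to a grid, so that the leading spaces line up.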
OK, this is probably not very useful - problems arise with large data sets or values above 256 - but it is fun to see what can be accomplished with simple T-SQL! The point of this exercise is that sometimes we have to look at the graphical movement of data rather than the discrete numbers. It's important to pick the best method to examine the data at hand. While the graphing method is useful, it's really a number that we're after. We'd like to find some sort of number that will show us whether one set of data could be related to another, and how strongly.
There is a formula that will help us: the correlation coefficient, usually written r (you'll also see it described loosely as a regression coefficient). While the exact formula is a bit on the symbolic side, here's what it looks like:
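Pieced together from the steps described in the sections that follow, the formula can be written as shown here. Treat the notation as a sketch: x and y are the paired values, n is the number of pairs, and r is the name the later code uses for the result.

```latex
r = \frac{\sum xy - \dfrac{\left(\sum x\right)\left(\sum y\right)}{n}}
         {\sqrt{\left(\sum x^{2} - \dfrac{\left(\sum x\right)^{2}}{n}\right)
                \left(\sum y^{2} - \dfrac{\left(\sum y\right)^{2}}{n}\right)}}
```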
Wait! It's actually not as bad as it looks. What we're given by this monster is a number between -1 and 1. If the number is closer to -1, then items are inversely related. That is, the less rain, the higher the growth. If the number is closer to +1, then more rain means more growth. Closer to 0 and the two aren't considered to be related.
Let's break the formula down into comments, one part at a time. The only statistical symbol you have to learn is the big "E"-looking letter, the Greek capital sigma (Σ). It simply means "the sum of."
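In T-SQL terms, sigma maps straight onto the SUM() aggregate. Here's a minimal sketch against the test table, assuming (as the later listing does) that x is Growth and y is Days:

```sql
-- Each "sum of" in the formula becomes a SUM() over the rows
SELECT SUM(Growth)        AS Ex,    -- sum of the x values
       SUM(Days)          AS Ey,    -- sum of the y values
       SUM(Growth * Days) AS Exy,   -- multiply each pair first, then sum
       COUNT(*)           AS n      -- number of pairs
FROM test
```

For our twelve rows these come to 146, 102, 1334 and 12, the same figures that show up in the final output. Notice that Exy (1334) is nowhere near Ex times Ey (14892); the order of multiplying and summing matters.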
Comment the process
To create our algorithm, we'll treat it as a simple program. Let's take the formula and re-write it as a word problem:
/* First, we examine the whole formula. We see that there are several sums required, and a few of them are sums of numbers we don't have, such as x*y and x^2 or y^2. It might be easier to construct a temporary table that has those numbers already computed. We see that we need the x's raised to a power of two as well as the y's, and we also need each x times each y. */

/* While we're at it, let's declare some variables to hold the parts of the formula. */

/* Now, we take on the numerator. The first part is the sum of the x's times the y's. Notice that it's not the sum of x times the sum of y, but each x and y multiplied, and then summed. */

/* Next, we take that number and subtract the following: the sum of all the x's times the sums of all the y's divided by the number of items in the set. We now have our numerator. */

/* We move to the denominator. We need to take the square root of the whole denominator. We'll leave that for the last step. We notice that the two parts of the denominator are the same, except that the numbers are x's and y's. We may be able to use that later in the optimization step. First, we need to get the sum of all the x's which have been raised to the power of two. Note that we don't take the sum of x's and then raise that value to a power of two, it's every value that we're after. Then we need to take that number, and subtract from it the very number we weren't looking for a moment ago (the sum of x's and then raise that value to a power of two), divided by the number of values in the set. */

/* Now we multiply those two values together, and then take the whole denominator and take the square root. We can also take the completed numerator and divide it by the completed denominator. */

/* And we're there. Now we select the answers. */

/* Don't forget to clean up! */
Notice the level of granularity we've chosen here. We think that we've broken down a complex process as simply as we can. We'll find out if the comments need to be "tweaked" in the next step.
Code the comments
Now it's just a matter of applying some of our programming know-how to those comments. Here we go:
/* First, we examine the whole formula. We see that there are
   several sums required, and a few of them are sums of numbers
   we don't have, such as x*y and x^2 or y^2. It might be easier
   to construct a temporary table that has those numbers already
   computed. We see that we need the x's raised to a power of two
   as well as the y's, and we also need each x times each y. */
USE pubs
GO
CREATE TABLE #Regression(x int, y int, x2 int, y2 int, xy int)
INSERT INTO #Regression(x, y, x2, y2, xy)
SELECT Growth, Days, POWER(Growth, 2), POWER(Days, 2), Growth * Days
FROM test

/* While we're at it, let's declare some variables to hold the
   parts of the formula. */
DECLARE @Exy as int
, @Ex as int
, @Ey as int
, @n as int
, @Ex2 as int
, @Ey2 as int
, @r as decimal(10, 5)
, @a as int
, @b as int
, @c as int

/* Now we take on the numerator. The first part is the sum of the
   x's times the y's. Notice that it's not the sum of x times the
   sum of y, but each x and y multiplied, and then summed. */
SET @Exy = (SELECT SUM(xy) FROM #Regression)

/* Next, we take that number and subtract the following: the sum
   of all the x's times the sums of all the y's divided by the
   number of items in the set. We now have our numerator. */
SET @Ex = (SELECT SUM(x) FROM #Regression)
SET @Ey = (SELECT SUM(y) FROM #Regression)
SET @n = (SELECT COUNT(x) FROM #Regression)
SET @a = @Exy - ((@Ex*@Ey)/@n)

/* Next we move to the denominator. We notice that we need to take
   the square root of the whole denominator. We'll leave that for
   the last step. We notice that the two parts of the denominator
   are the same, except that the numbers are x's and y's. We may be
   able to use that later in the optimization step. First, we need
   to get the sum of all the x's which have been raised to the power
   of two. Note that we don't take the sum of x's and then raise
   that value to a power of two, it's every value that we're after.
   Then we need to take that number, and subtract from it the very
   number we weren't looking for a moment ago (the sum of x's and
   then raise that value to a power of two), divided by the number
   of values in the set. */
SET @Ex2 = (SELECT SUM(POWER(x, 2)) FROM #Regression)
SET @b = @Ex2 - ((POWER(@Ex, 2))/@n)

/* We just need to repeat the two steps above for y. */
SET @Ey2 = (SELECT SUM(POWER(y ,2)) FROM #Regression)
SET @c = @Ey2 - ((POWER(@Ey, 2))/@n)

/* Now we multiply those two values together, and then take the
   whole denominator and take the square root. We can also take the
   completed numerator and divide it by the completed denominator. */
SET @r = (@a/(SQRT(@b*@c)))

/* And we're there. Now we select the answers. */
SELECT @Ex AS 'Ex'
, @Ey AS 'Ey'
, @Ex2 AS 'Ex2'
, @Ey2 AS 'Ey2'
, @Exy AS 'Exy'
, @n AS 'n'
, @a AS 'a'
, @b AS 'b'
, @c AS 'c'
, @r AS 'r'

/* Don't forget to clean up! */
DROP TABLE #Regression
GO

-----------
(12 row(s) affected)

Ex          Ey          Ex2         Ey2         Exy         n           a           b           c           r
----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ------------
146         102         1902        990         1334        12          93          126         123         .74704

(1 row(s) affected)
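There we have it. The number is fairly close to +1, so we can say that more days of rain tend to go along with more growth in our garden. (I bet you already knew that!) Remember the caution from earlier, though: the number tells us how strongly the two move together, not that one causes the other.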
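One detail is worth checking before we move on. The listing declares @a, @b and @c as int, so each division by @n drops its fractional part; that's how @b comes out as 126 when the unrounded value is about 125.67. Here's a sketch of the same formula using float variables instead. It's a precision check on the same test table, not the author's method, and the variable layout is just one way to arrange it.

```sql
-- Same formula as the listing, but float variables keep the fractions
DECLARE @n float, @Ex float, @Ey float, @Ex2 float, @Ey2 float, @Exy float

-- Gather every sum the formula needs in one pass over the table
SELECT @n   = COUNT(*),
       @Ex  = SUM(Growth),
       @Ey  = SUM(Days),
       @Ex2 = SUM(Growth * Growth),
       @Ey2 = SUM(Days * Days),
       @Exy = SUM(Growth * Days)
FROM test

-- Numerator divided by the square root of the two denominator parts
SELECT (@Exy - (@Ex * @Ey) / @n)
       / SQRT((@Ex2 - (@Ex * @Ex) / @n) * (@Ey2 - (@Ey * @Ey) / @n)) AS r
```

With our data this should land near 0.748 rather than 0.747. It's a small difference here, but integer truncation can matter more with other data sets.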
Why not just do all this in one or two lines? Well, we could, but that wouldn't make for a very good tutorial! Plus, that's what the optimization step is for.
Optimize the code
To optimize - no, wait. Let's hear from you. Send me an e-mail (woodyb@hotmail.com) with the subject line of "Statistical Regression." I'll post a few of the responses in our upcoming articles.
Online Resources
Will Hopkins has a good article with an explanation of the statistical formulas around the regression coefficient.
InformIT Tutorials and Sample Chapters
Katrina Maxwell has a worthwhile application of statistical analysis in her sample chapter, A Data Analysis Methodology for Software Managers.