關(guān)系數(shù)據(jù)庫(kù)系統(tǒng)和混合/云數(shù)據(jù)管理解決方案的用戶都可以使用SQL靈活地訪問(wèn)業(yè)務(wù)數(shù)據(jù),并以創(chuàng)新的方式進(jìn)行轉(zhuǎn)換或顯示。
對(duì)業(yè)務(wù)數(shù)據(jù)庫(kù)結(jié)構(gòu)的透徹了解,對(duì)上游數(shù)據(jù)進(jìn)行轉(zhuǎn)換和聚合的巧妙解決方案sql 查前 條平均值,對(duì)于高效,完善的ETL至關(guān)重要。這是我在構(gòu)建復(fù)雜的管道時(shí)學(xué)到的一些技巧,這些技巧使我的工作輕松而有趣。
一、計(jì)算滾動(dòng)平均
使用時(shí)間序列數(shù)據(jù)時(shí),為觀察值計(jì)算滾動(dòng)平均值或附加歷史值可能會(huì)有所幫助。假設(shè)我想獲取一家公司每天售出的小部件數(shù)量。我可能想包括7天移動(dòng)平均線,或附上上周出售的工作日小部件,以查看業(yè)務(wù)與上周相比的表現(xiàn)。我可以通過(guò)將數(shù)據(jù)集連接到自身上,并使用日期列上的操作來(lái)選擇單個(gè)值或觀察范圍來(lái)做到這一點(diǎn)。
首先,讓我們開(kāi)始使用Db2庫(kù)中名為的表中的以下代碼sql 查前 條平均值,按日銷售小部件:
select t1.date
, sum(t1.widgets_sold) as total_widgets_sold
from db2.widget_order_history t1
where t1.date between ‘2021–04–05’ and ‘2021–05–01’
group by t1.date
如果我們想在表的另一個(gè)變量(例如特定存儲(chǔ))上添加任何過(guò)濾條件,則可以添加一個(gè)簡(jiǎn)單的WHERE語(yǔ)句:
select t1.date
, sum(t1.widgets_sold) as total_widgets_sold
from db2.widget_order_history t1
where (t1.date between ‘2021–04–05’ and ‘2021–05–01’)
and t1.store = 1234
group by t1.date
位于其他表上的其他條件(即)將需要附加的INNER JOIN:
select t1.date
, sum(t1.widgets_sold) as total_widgets_sold
from db2.widget_order_history t1

inner join (
select store
from db2.store_data
where state = ‘NY’
) t2
on t1.store = t2.store
where t1.date between ‘2021–04–05’ and ‘2021–05–01’
group by t1.date
從提供的代碼生成的示例時(shí)間序列數(shù)據(jù):
二、自連接附加歷史數(shù)據(jù)
現(xiàn)在,如果我想附加4/25 / 21–5 / 1/21這一周的7天滾動(dòng)平均值,可以通過(guò)將表連接到自身上并利用在SUM()函數(shù)。
當(dāng)您只想滿足表中的特定條件時(shí),可以使用此技術(shù)來(lái)使用分組功能(即SUM(),COUNT(),MAX())。它只會(huì)對(duì)滿足WHEN子句中包含的規(guī)則的值求和。
在下面的示例中,如果表B的值在表A上當(dāng)前觀察日期的前7天之內(nèi),我們可以將這些銷售量相加并除以7,以獲得表A的每一行的每周滾動(dòng)平均值:
select a.date
, a.total_widgets_sold
, sum(
case when (b.date between a.date-7 and a.date-1)
then b.total_widgets_sold
else 0
end)/7 as seven_day_avg

from (
select date
, sum(widgets_sold) as total_widgets_sold
from db2.widget_order_history
where date between ‘2021–04–25’ and ‘2021–05–01’
group by date
) a
left join (
select date
, sum(widgets_sold) as total_widgets_sold
from db2.widget_order_history
where date between ‘2021–04–05’ and ‘2021–05–01’
group by date
) b
on a.date = b.date
group by a.date
, a.total_widgets_sold
order by a.date
2021日歷年第17周的小部件銷售,其7天平均值處于滾動(dòng)狀態(tài):
如果要將歷史值附加到每個(gè)觀察值,則可以避免聚合,而只需根據(jù)指定間隔時(shí)間的日期加入表即可。
下面的示例將表B聯(lián)接到表A上,以將日期回溯7天以獲取前一個(gè)工作日的小部件銷售:
select a.date
, a.total_widgets_sold
, b.total_widgets_sold as prev_wkday_sales
from (
select date
, sum(widgets_sold) as total_widgets_sold
from db2.widget_order_history
where date between ‘2021–04–25’ and ‘2021–05–01’
group by date
) a
left join (
select date
, sum(widgets_sold) as total_widgets_sold
from db2.widget_order_history
where date between ‘2021–04–04’ and ‘2021–05–01’
group by date
) b
on a.date -7 = b.date
第20周第2021日歷年的窗口小部件銷售以及上周的工作日窗口小部件銷售:
將表聯(lián)接到自身上是一種非常靈活的方式,可以向數(shù)據(jù)集添加匯總列和計(jì)算列。
分組功能(例如SUM()和COUNT()與CASE()語(yǔ)句)的創(chuàng)造性使用為功能工程,分析報(bào)告和各種其他用例帶來(lái)了巨大的機(jī)會(huì)。
在實(shí)踐中,如果查詢通過(guò)子查詢加入自身,并且查詢量很大,則可以預(yù)期運(yùn)行時(shí)間很長(zhǎng)。解決此問(wèn)題的一種方法是使用臨時(shí)表來(lái)保存具有特定問(wèn)題標(biāo)準(zhǔn)的初步結(jié)果。
例如,在SAS的WORK庫(kù)中為整個(gè)時(shí)間范圍創(chuàng)建一個(gè)小部件銷售表,并多次查詢?cè)摫怼8咝У拇a結(jié)構(gòu)(例如使用索引)也可以提高效率。
三、使用CASE語(yǔ)句處理復(fù)雜的邏輯
CASE語(yǔ)句的語(yǔ)法與整個(gè)數(shù)據(jù)科學(xué)中其他常用編程語(yǔ)言的邏輯不同(請(qǐng)參閱: / R)。
通過(guò)使用偽代碼對(duì)邏輯規(guī)則進(jìn)行周到的設(shè)計(jì)可以幫助避免由于不正確/不一致的規(guī)則而導(dǎo)致的錯(cuò)誤。了解如何在SQL中編碼嵌套邏輯對(duì)于釋放數(shù)據(jù)中的潛力至關(guān)重要。
假設(shè)有一張購(gòu)物者表,其中包含給定時(shí)間范圍內(nèi)的年齡,家庭狀態(tài)和銷售情況等大量特征。有針對(duì)性的營(yíng)銷活動(dòng)正用于嘗試提高普通購(gòu)物者的銷售額(已將平均購(gòu)物者確定為消費(fèi)在$ 100- $ 200之間的人)。
一旦被識(shí)別,Z世代/千禧一代購(gòu)物者將獲得數(shù)字優(yōu)惠券,所有其他購(gòu)物者將被郵寄一張印刷優(yōu)惠券,該打印優(yōu)惠券將根據(jù)他們所居住的州而有所不同。為簡(jiǎn)單起見(jiàn),只有三個(gè)州的購(gòu)物者居住。
這是在R和SQL中如何編碼此邏輯的方法:
## Example of Nested Logic in R
if(shoppers$sales<=0){ print("Error: Negative/No Sales")}
else if(shoppers&sales<=100){ print("Shopper has below-average sales.")}
else if(shoppers&sales<=200){
if(shopper$age<41){print("Shopper has average sales and is Gen Z/Millennial.")}
else{
if(shopper$state=='NJ'){print("Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in New Jersey.")}
else if(shopper$state=='NY'){print("Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in New York.")
else(shopper$state=='CT'){print("Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in Connecticut.")}
}
}
else{print("Shopper has above-average sales.")}

*Example of nested logic in SQL. No need to actually nest statements!;
, case when sales < 0
then 'Error: Negative/No Sales.'
when sales <=100
then 'Shopper has below-average sales.'
when sales <=200 and age <41
then 'Shopper has average sales and is Gen Z/Millennial.'
when sales <=200 and state = 'NJ'
then 'Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in New Jersey.'
when sales <=200 and state = 'NY'
then 'Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in New York.'
when sales <=200 and state = 'CT'
then 'Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in Connecticut.'
else 'Shopper has above-average sales.'
end as shopper_classification
周到地使用CASE語(yǔ)句將使您能夠構(gòu)建復(fù)雜業(yè)務(wù)邏輯的任何組合。
但是,SQL邏輯與其他編程語(yǔ)言所需要的思維方式略有不同。
結(jié)合分組功能,這些工具可以為數(shù)據(jù)科學(xué)家提供競(jìng)爭(zhēng)優(yōu)勢(shì),以獲取和轉(zhuǎn)換用于特征工程,商業(yè)智能,分析報(bào)告等的數(shù)據(jù)源!