Thesis: Data Mining in SQL Server 2012
MINISTRY OF EDUCATION AND TRAINING
THANG LONG UNIVERSITY
--o0o--
GRADUATION THESIS
DATA MINING IN SQL SERVER 2012
Supervisor: Tran Quang Duy
Students: Doan Thanh Cong (A11278), Nguyen Duc Hoang (A11500)
Major: Information Technology
HANOI, 2014
PREFACE
The development of information technology and its application in many areas of daily life, the economy, and society over the years has also meant that the volume of data collected and stored by organizations keeps accumulating. Organizations store this data because they believe it holds some value. According to statistics, however, only a small portion of it (roughly 5% to 10%) is ever analyzed. For the rest, they do not know what they should or could do with it, yet they keep collecting it at great cost for fear that something important might be discarded and then needed later. On the other hand, in a competitive environment people need more and more information, delivered quickly, to support decision making, and a growing number of qualitative questions must be answered from the huge volume of data already available. For these reasons, traditional methods of database administration and exploitation increasingly fail to meet practical needs, which has given rise to a new technical trend: Knowledge Discovery and Data Mining (KDD).
Knowledge discovery and data mining techniques have been, and are being, studied and applied in many fields around the world. In Vietnam the technique is still relatively new, but it is also being researched and gradually put into use. The most important step of this process is data mining, which helps users obtain useful knowledge from databases or other huge data sources. Many businesses and organizations worldwide have applied data mining to their operations and gained substantial benefits.
For these reasons we chose the topic "Data mining and its application in SQL Server 2012", hoping to learn about the methods, models, and techniques of data mining. The work is useful not only from a theoretical point of view but also as a practical application built on a model, used to verify the value that data mining techniques deliver. We start from basic knowledge and gradually move to the more complex issues surrounding data mining algorithms. Although the study remains at a basic and simple level, it touches on the open problems and the capabilities of data mining applications, especially in the SQL Server 2012 database management system.
The graduation thesis report consists of:
Preface
List of abbreviations
Chapter 1: Overview of data mining
Chapter 2: Data mining tasks
Chapter 3: Data mining in SQL Server 2012
Chapter 4: Applying data mining in SQL Server 2012
Conclusion
References
LIST OF SYMBOLS AND ABBREVIATIONS
Abbreviation   English term                         Meaning
DM             Data Mining                          Mining of data
BI             Business Intelligence                Business intelligence
CSDL/DB        Database                             Database
OLAP           Online Analytical Processing         Online processing and analysis of data
KDD            Knowledge Discovery in Databases     Discovery of knowledge in databases
SSIS           SQL Server Integration Services      SQL Server integration services that support data mining
ERP            Enterprise Resource Planning         Management of enterprise resources
ODBC           Open Database Connectivity           Open database connectivity
TABLE OF CONTENTS
CHAPTER 1. OVERVIEW OF DATA MINING
1.1. The concept of data mining
1.1.1. Introduction to data mining
1.1.2. Definition of data mining
1.2. Steps in data mining
1.2.1. Data mining techniques
1.2.2. Data flow
1.2.3. Life cycle of a data mining project
1.2.4. Data mining standards
1.3. Approaches to the data mining problem
1.3.1. Architecture of a data mining system
1.3.2. Main functions of data mining
1.3.3. Types of data that can be mined
1.3.4. Difficulties in data mining
1.4. Research trends and current applications of data mining
1.4.1. Research directions
1.4.2. Applications of data mining in practice
1.4.3. Applications of data mining in solving groups of business problems
CHAPTER 2. DATA MINING TECHNIQUES
2.1. Data classification
2.1.1. Decision tree classification model
2.1.2. Bayes classification model
2.2. Data clustering
2.3. Regression
2.4. Association rules
2.5. Forecasting
2.6. Summarization
2.7. Dependency modeling
2.8. Change and deviation detection
CHAPTER 3. DATA MINING IN SQL SERVER 2012
3.1. The OLE DB model in SQL Server
3.1.1. Introduction
3.1.2. Basic concepts in OLE DB for Data Mining
3.1.3. Data Mining Extensions to SQL (DMX)
3.2. Data mining algorithms in SQL Server 2012
3.2.1. Microsoft Decision Trees
3.2.2. Microsoft Clustering
3.2.3. Microsoft Naive Bayes
3.2.4. Microsoft Sequence Clustering
3.2.5. Microsoft Time Series
3.2.6. Microsoft Association Rules
3.2.7. Microsoft Neural Network
3.2.8. Microsoft Linear Regression
3.2.9. Microsoft Logistic Regression
3.3. Principles for choosing an algorithm
CHAPTER 4. APPLYING DATA MINING IN SQL SERVER 2012
4.1. Introduction to Business Intelligence Development Studio
4.2. Applications in SQL
4.2.1. Using the Microsoft Decision Trees and Microsoft Naive Bayes algorithms
4.2.2. Using the Microsoft Association Rules algorithm
CONCLUSION
REFERENCES
CHAPTER 1. OVERVIEW OF DATA MINING
1.1. The concept of data mining
1.1.1. Introduction to data mining
In recent years, the strong development of information technology and the hardware industry has made the capacity of information systems to collect and store information grow at a dizzying pace. In addition, the massive and rapid computerization of production, business, and many other activities has produced a huge volume of stored data. Millions of databases are used in production, business, and management, many of them extremely large, on the order of gigabytes or even terabytes. This explosion creates an urgent need for new techniques and tools that automatically turn such huge volumes of data into useful knowledge. As a result, data mining techniques have become a topical field in information technology worldwide.
1.1.2. Definition of data mining
Knowledge discovery in databases is the process of identifying patterns or models in data that are valid, novel, useful, and understandable.
Data mining is a relatively new term that appeared in the late 1980s. There are many different definitions of data mining. Professor Tom Mitchell defined it as follows: "Data mining is the use of historical data to discover rules and improve future decisions." Taking a more application-oriented approach, Dr. Fayyad stated: "Data mining, often regarded as knowledge discovery in databases, is the process of extracting hidden, previously unknown, and potentially useful information in the form of regularities, constraints, and rules in a database." Statisticians, in turn, view "data mining as an analytical process designed to explore very large amounts of data in order to find suitable patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of the data."
Page 1/90    A11278 - Doan Thanh Cong
A11500 - Nguyen Duc Hoang
In summary, data mining is one step in the knowledge discovery process. It consists of specialized data mining algorithms that, under acceptable limits on computational efficiency, find patterns or models in the data.
1.2. Steps in data mining
1.2.1. Data mining techniques
Although data mining is a relatively new term, most data mining techniques have existed for many years. The roots of data mining lie in three fields: statistics, machine learning, and databases. Several data mining algorithms, including regression, time series, and decision trees, were invented by statisticians. Regression has existed for centuries. Time series algorithms have been studied for decades. The decision tree algorithm is one of the more recent techniques, dating from the mid-1980s.
Data mining focuses on the automatic or semi-automatic discovery of patterns. Some machine learning algorithms applied in data mining are:
a. Neural networks
This is one of the most widely applied data mining techniques today. It is built on a solid mathematical foundation, and its ability to be trained is modeled on the human central nervous system.
What a neural network learns can be used to build forecasting and prediction models with high accuracy and reliability. It can detect complex trends that other conventional techniques find hard to detect. However, the neural network approach is quite complex and difficult to carry out: it requires a lot of time, a lot of data, and many rounds of testing and experimentation.
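To make the idea of "training" concrete, here is a minimal sketch of a single artificial neuron learning the logical AND function by gradient descent. It is an illustration only and is not taken from the thesis: the function names, the learning rate, the number of epochs, and the AND data set are all assumptions. Real neural networks stack many such units.

```python
import math

def train_neuron(samples, epochs=2000, lr=0.5):
    """Train one sigmoid neuron on (inputs, target) pairs with gradient descent."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            z = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1.0 / (1.0 + math.exp(-z))
            error = output - target  # gradient of the log-loss with respect to z
            weights = [w - lr * error * x for w, x in zip(weights, inputs)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, inputs):
    """Return class 1 when the weighted sum is positive, otherwise class 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

# Hypothetical training data: the truth table of logical AND.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_neuron(and_samples)
predictions = [predict(weights, bias, x) for x, _ in and_samples]
```

Even this toy case needs thousands of weight updates, which hints at why the text stresses the time and data that larger networks require.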
b. Genetic algorithms
A genetic algorithm simulates natural evolution. Its main idea rests on the laws of heredity, variation, natural selection, and evolution in biology.
Building a biologically inspired genetic algorithm to find the best solutions involves the following steps:
- Create a genetic encoding in the form of strings over a restricted alphabet.
- Set up an artificial environment on the computer in which candidate solutions "struggle for survival" against one another, in order to define a measure of success or failure, also called "fitness".
- Develop "crossover operators" so that solutions can be combined. The genetic strings of the father and mother solutions are cut and rearranged, and during this reproduction various kinds of mutation can be applied.
- Provide a reasonably diverse initial population of solutions and let the computer play the "evolution game" by removing solutions from each generation and replacing them with offspring or mutations of good solutions. The algorithm ends when a family of successful solutions has been produced.
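The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis authors' implementation: the toy fitness function (count of 1-bits), the population size, the mutation rate, and the fixed number of generations are all hypothetical choices.

```python
import random

def fitness(chromosome):
    """Toy fitness measure: the number of 1-bits in the bit string."""
    return sum(chromosome)

def crossover(mother, father, rng):
    """Single-point crossover: cut both parents and splice the pieces."""
    point = rng.randrange(1, len(mother))
    return mother[:point] + father[point:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

def evolve(length=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Step 1: encode solutions as strings over a restricted alphabet {0, 1}.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(fitness(c) for c in population)
    for _ in range(generations):
        # Step 2: "struggle for survival", keep the fitter half unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # Steps 3 and 4: refill the population with mutated offspring.
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            children.append(mutate(crossover(mother, father, rng), mutation_rate, rng))
        population = survivors + children
    return initial_best, max(population, key=fitness)

initial_best, best = evolve()
```

Because the fitter half survives unchanged in every generation, the best fitness can never decrease. The sketch stops after a fixed number of generations rather than testing for "a family of successful solutions", which is a simplification.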
Data mining is the automatic extraction of knowledge from a large data store. This knowledge usually takes the form of patterns that are non-trivial and hidden (not explicit), but that can bring great benefit when used in the right place. Data mining can be regarded as the core of the Knowledge Discovery in Databases (KDD) process.
1.2.2. Data flow
Data mining is an important member of the data warehouse family. Where does data mining fit into the data flows of a typical business scenario?
The following figure illustrates a typical enterprise data flow in which data mining can be applied at different stages.
Figure 1: Enterprise data mining model. Diagram labels: Application, Online Transaction Processing (OLTP), Online Analytical Processing, Data Mining.
A business application stores its transaction data in an online transaction processing (OLTP) database. The OLTP data is regularly extracted, transformed, and loaded into the data warehouse. The schema of a data warehouse usually differs from an OLTP schema. A typical data warehouse schema has the shape of a star or a snowflake, with the transaction (fact) table at the center of the schema, surrounded by a set of dimension tables.
First, and most commonly, data mining can be applied to data warehouses, where the data has already been cleaned. The patterns discovered by the mining models can be presented to marketing managers through reports.
Data mining can also be linked directly to business applications, most commonly through predictions. Embedding data mining in business applications is becoming increasingly common.
Example: in a Web sales scenario, each time a customer places a product in the shopping cart, a data mining prediction query is executed to obtain a list of recommended products based on the analysis.
Data mining can also be applied to analyze an OLAP cube, which is a multidimensional database with many dimensions and measures. A dimension may contain millions of records, which makes it hard to find patterns of interest. Data mining techniques can be applied to discover hidden patterns in an OLAP cube.
Example: an association algorithm can be applied to a sales cube to analyze customers' purchasing patterns for a specific region and time period. We can also apply data mining techniques to forecast measures such as sales and profit.
1.2.3. Life cycle of a data mining project
Figure 2: Life cycle of a data mining project. Diagram labels: Gathering, Selection, Cleansing and Preprocessing, Transformation, Data Mining, Evaluation of results; intermediate data sets: Target Data, Cleansed and Preprocessed Data.
a. Data gathering and data selection
Data gathering: collecting data is the first step in data mining. This step takes data from a database, a data warehouse, or even from Web sources.
Data selection: at this stage the data is selected and partitioned according to certain criteria.
c. Data cleansing and preprocessing
Data cleansing: this is the process of removing or reducing noise and handling missing values. This step reduces ambiguity during learning.
Relevance analysis: many attributes in the data may be irrelevant or unnecessary for classification. Relevance analysis is therefore performed on the data to remove any irrelevant or unnecessary attributes. In machine learning this step is called feature selection. It makes classification more efficient and improves scalability.
This stage is often neglected, but it is in fact a very important step in the data mining process. Common errors made while gathering data are incomplete, inconsistent, or loosely defined data. As a result the data often contains meaningless values or values that cannot be linked to other data, for example a student with age = 200. This stage handles such data (meaningless data and data that cannot be linked), which is usually treated as redundant information with no value. If the data is not cleaned, preprocessed, and prepared beforehand, it will cause seriously distorted results later on.
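The "age = 200" case can be handled with a simple range check, as in the following sketch. It is only one possible cleaning policy and not the thesis authors' method: the valid range of 0 to 120 and the choice to replace bad or missing values with the median are assumptions made for illustration.

```python
from statistics import median

def clean_ages(records, min_age=0, max_age=120):
    """Treat out-of-range ages as missing, then fill missing ages with the median.

    The valid range and the median fill are assumptions; other policies
    (dropping the record, asking the data owner) are equally possible.
    """
    valid = [r["age"] for r in records
             if r["age"] is not None and min_age <= r["age"] <= max_age]
    fill = median(valid)
    cleaned = []
    for r in records:
        age = r["age"]
        if age is None or not (min_age <= age <= max_age):
            age = fill
        cleaned.append({**r, "age": age})  # copy, so the raw data stays untouched
    return cleaned

# Hypothetical student records.
students = [
    {"name": "A", "age": 19},
    {"name": "B", "age": 200},   # the meaningless value from the text
    {"name": "C", "age": None},  # a missing value
    {"name": "D", "age": 21},
    {"name": "E", "age": 20},
]
cleaned = clean_ages(students)
```

The function returns new records instead of modifying the input, so the raw data remains available if the cleaning rule later turns out to be wrong.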
d. Data transformation
In this stage the data can be reorganized and reused. The purpose of data transformation is to make the data better suited to the goals of data mining.
Data can be generalized to higher-level concepts. This is very useful for attributes with continuous values. For example, the numeric values of the income attribute can be generalized into discrete ranges such as low, medium, and high. Similarly, categorical attributes such as street can be generalized to a higher-level concept such as city. This reduces the number of input/output operations during processing.
Data can also be normalized, especially when neural networks or methods that use distance measures are involved in the processing steps. Normalization scales all values of a given attribute so that they fall within a specified range such as [-1.0, 1.0] or [0, 1.0]. This prevents attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes).
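Both transformations described above, generalizing a numeric attribute into ranges and scaling values into a fixed interval, fit in a short sketch. The cut-off points for "low", "medium", and "high" and the sample incomes are hypothetical. The scaling shown is min-max normalization, one common way to map values into a range such as [0, 1] or [-1, 1].

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale `values` so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # all values equal: nothing to spread
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

def discretize_income(income, low_cut=15, high_cut=40):
    """Generalize a numeric income into a discrete range (cut-offs are assumed)."""
    if income < low_cut:
        return "low"
    if income < high_cut:
        return "medium"
    return "high"

# Hypothetical monthly incomes, in millions of VND.
incomes = [10, 20, 50]
scaled_01 = min_max_scale(incomes)              # range [0, 1]
scaled_pm1 = min_max_scale(incomes, -1.0, 1.0)  # range [-1, 1]
labels = [discretize_income(v) for v in incomes]
```

After scaling, income and a binary attribute both vary over a range of comparable width, so neither dominates a distance computation.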
e. Pattern extraction and discovery
This is the reasoning step of data mining. In this stage various algorithms are used to extract patterns from the data. The algorithms commonly used for this are classification, association, and sequential data modeling algorithms.
It is one of the most important and most time-consuming steps of the KDD process, in which intelligent methods are used to distill patterns from the data. These are mainly machine learning techniques for mining and extracting the patterns and the notable relationships in the data.
The models may contain no usable patterns. The data may be completely random, or it may contain too much noise. In that case the cleansing and transformation steps must be repeated to distill more meaningful data. This is an iterative process that improves step by step until it yields information that is relevant and meaningful to the manager.
f. Evaluation of results and knowledge presentation
This is the final stage of the data mining process. At this stage the patterns have been extracted by the data mining software. Not every pattern is useful, and some may even be misleading. Criteria are therefore needed to rank the patterns so that the necessary knowledge can be drawn from them.
Knowledge presentation: techniques are used to represent and visualize the results for the user. The representations should be familiar and easy to understand, such as graphs or trees, so that the resulting reports help managers make important decisions.
1.2.4. Data mining standards
SAS: the largest vendor of data mining products by market share, and a leader in statistics for decades. Base SAS contains a very rich set of statistical functions that can be used for all kinds of data analysis. It supports text mining, offers a graphical environment for building models, and includes popular data mining algorithms such as decision trees, neural networks, and regression.
SPSS: its data mining products include "SPSS base" and Answer Tree. It inherited the Clementine data mining package, one of the first to introduce the concept of a data mining flow, which lets users clean data, transform data, and run experimental models.
IBM: its data mining product is Intelligent Miner. It contains a set of algorithms and visualization tools, and it exports data mining models in the Predictive Model Markup Language (PMML). PMML files are XML files that describe the model patterns and the statistics of the training data for prediction purposes.
Microsoft: the first major database vendor to include data mining features in a relational database. SQL Server 2000 has two data mining algorithms: Microsoft Decision Trees and Microsoft Clustering. In the subsequent versions, SQL Server 2005, 2008, and 2012, the data mining features have been steadily upgraded, and Microsoft's product has gained an increasing share of the market.
Oracle: Oracle 9i shipped in 2000 with a pair of data mining algorithms based on association rules and Naive Bayes. Oracle 10g includes more data mining tools and algorithms. Oracle also integrates the Java Data Mining API, a software package for data mining.
Angoss: mainly builds decision tree and cluster analysis algorithms and predictive models that let users understand their data from many points of view. The algorithms are backed by powerful visualization tools for explaining the mined knowledge, and the product also works well with the utilities of Microsoft SQL Server.
KXEN: provides several data mining algorithms, such as SVM, regression, time series, and segmentation, as well as data mining solutions for OLAP cubes. It also provides an Excel add-in for mining data within Excel.
1.3. Approaches to the data mining problem
1.3.1. Architecture of a data mining system
Database: this includes data warehouses and other ways of storing information (database, data warehouse, World Wide Web, information repositories). It is one database or a set of databases, data warehouses, spreadsheets, or other forms of information storage. In specific situations, this component is the input to data integration and cleansing techniques.
Database or data warehouse server: this server is responsible for fetching the relevant data based on the user's mining requests.
[Figure: architecture of a data mining system. Layers from top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleansing and integration; databases and data warehouse.]
Knowledge base: used to guide the search and to evaluate the resulting patterns. The knowledge base may consist of concept hierarchies, user beliefs, constraints or thresholds, metadata, and so on.
Data mining engine: this component contains the functional modules that carry out data mining tasks such as characterization, association, classification, clustering, and evolution analysis.
Pattern evaluation module: this component can use interestingness thresholds to filter the discovered patterns. Depending on how the mining method is implemented, it may be integrated into the data mining engine.
Graphical user interface: this component supports the interaction between the user and the data mining system.
- The user can specify a query or a data mining task.
- The user can be given information that supports the search and deeper mining based on intermediate mining results.
- The user can also browse the database or data warehouse schemas and data structures, evaluate the mined patterns, and visualize the patterns in different forms.
1.3.2. Main functions of data mining
These functions are as follows.
a. Characterization and discrimination:
Characterization summarizes the general characteristics or properties of a target class of data. The data corresponds to a class that the user specifies with a database query. The output of characterization can be presented in various formats.
b. Association analysis:
Association analysis discovers association rules in a large data set. Association rules express relationships between attribute values that we observe from how frequently the values occur together.
Association rules are discovered from large sets of business transaction records, and the meaningful rules can help business people make decisions.
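The "frequency of occurring together" is usually quantified with two standard measures, support and confidence, which this passage does not name. The sketch below computes both for a rule such as "bread implies milk". The shopping baskets are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"milk"},
]
rule_support = support(baskets, ["bread", "milk"])          # 2 of 5 baskets
rule_confidence = confidence(baskets, ["bread"], ["milk"])  # 2 of the 4 bread baskets
```

A rule is typically kept only when both values exceed user-chosen thresholds, which is how the "meaningful" rules are separated from the rest.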
c. Classification and prediction:
Classification is the process of finding a set of models (or functions) that describe and distinguish classes of data. These models are then used to predict the class of objects.
The model is built by analyzing a set of training data and can be represented in many forms: classification rules, decision trees, neural networks, and so on.
Classification and prediction may be preceded by relevance analysis. This analysis identifies the attributes that do not contribute to classification and prediction so that they can be excluded.
d. Clustering:
Unlike classification and prediction, clustering analyzes data objects whose class labels are not known.
Clustering groups objects according to the following principle: objects in the same group are as similar as possible, and objects in different groups are as dissimilar as possible.
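As one concrete instance of this principle, the sketch below uses k-means, a common clustering algorithm that the passage does not name, on one-dimensional data. The points, the starting centroids, and the fixed iteration count are assumptions for illustration.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points: assign each to its nearest centroid, then move the centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of its members; an empty cluster keeps its old centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[1.0, 12.0])
```

No class labels are supplied at any point. The groups emerge purely from the distances between the points.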
e. Outlier analysis:
Some databases may contain data objects that do not follow the general model of the data. Such objects are called outliers.
Most data mining methods treat outliers as noise and discard them. However, in some applications, such as noise detection, rare events are of more interest than those that occur regularly.
Outlier analysis is the mining of these outliers. There are several ways to detect outliers: statistical tests based on an assumed data distribution or a probability model for the data, and deviation-based methods that examine differences in the main characteristics of the objects in a group.
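A simple example of the statistical approach is the z-score test, which flags values that lie far from the mean in units of standard deviation. The sketch assumes the data is roughly normally distributed. The threshold of 2.0 and the sample values are illustrative choices, not values from the thesis.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing can stand out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical measurements with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 50]
outliers = z_score_outliers(readings)
```

Whether the flagged value is discarded as noise or examined as the interesting event depends on the application, as the text notes.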
1.3.3. Types of data that can be mined
As we know, human knowledge is the sum of closely and logically related relationships, stored as data of one form or another. In practice there are many database models, and in different application areas we can define and distinguish many types of data so that they are most convenient to use. Data mining can work with the following kinds of data:
Relational databases: data organized according to the relational data model, which is very common in many industries. Most database management systems therefore support relational databases, for example Oracle, MS SQL Server, IBM DB2, and MS Access.
Multidimensional databases (multidimensional structures, data warehouses): this is also a form of operational data whose records are usually transactions. This type of data is also common today.
Object-relational databases: a hybrid of the relational and object-oriented models.
Spatial, temporal, and time series data: data with spatial attributes, such as maps of a telephone cable network, or temporal attributes, such as telephone billing data, newspaper circulation, and stock indexes.
Multimedia databases: audio, image, text, and WWW data. This type of data is very rich and diverse and is widespread, especially on the Internet.
1.3.4. Difficulties in data mining
a. Database issues
The input to a data mining system is usually raw data that is often incomplete and noisy. In practice the data also changes constantly and is continually extended, forming a huge volume of data that contains both useful and useless information. In any data mining system, therefore, the first task is to analyze and examine the database that the system will mine.
b. Large databases
Online analysis tools cannot fully exploit the information in the current database, so the people who process the data have no choice but to store it for later use. The stored data contains both useful and useless information. This accumulation keeps growing, and today databases with millions of records reach terabytes in size. How redundant data and meaningless information are removed differs from one application to another. Data processing methods are therefore very diverse and complex, and no single rule fits every application.
c. SO chi...
2.6. Summarization
Summarization involves methods for finding a description of a subset of the data. Concept description and summarization techniques are often applied in exploratory data analysis and automated reporting. The main task is to produce characteristic descriptions of a class. Such a description is a kind of summary of the properties shared by all or most items of a class. Characteristic descriptions are expressed as rules of the following form: "If an item belongs to the class named in the antecedent, then the item has all the properties stated in the conclusion." Rules of this form differ from classification rules: a characteristic rule for a class is produced only from items that already belong to that class.
2.7. Dependency modeling
Dependency modeling searches for a model that describes the dependencies between variables or attributes at two levels. At the structural level, the model states (usually as a graph) which variables depend locally on which others. At the quantitative level, the model describes the strength of the dependencies. These dependencies are often expressed as "if-then" rules: if the antecedent is true, then the conclusion is true. In principle, both the antecedent and the conclusion can be logical combinations of attribute values. In practice, the antecedent is usually a group of attribute values and the conclusion is a single attribute. Furthermore, the system can discover classification rules in which every rule must have the same user-specified attribute in its conclusion. Dependencies can also be represented as a Bayesian belief network. This is a directed acyclic graph whose nodes represent attributes and whose edge weights represent the dependencies between those nodes.
2.8. Change and deviation detection
This task focuses on discovering the most significant changes in the data relative to previously measured or normative values, and on detecting significant deviations between the content of a subset of the actual data and the expected content. The two commonly used deviation models are deviation over time and deviation between groups. Deviation over time is a significant change in the data over time. Deviation between groups is the difference between the data in two subsets, including the case where one subset is contained in the other. In other words, the task is to determine whether the data in a subgroup of objects differs significantly from the data for all objects. In this way, data errors and deviations from normal values are detected.
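Both deviation models can be illustrated with very small functions. This is a simplified sketch, not the thesis authors' method: the expected value, the tolerance, and the sales figures are hypothetical, and a real system would use a statistical test rather than a fixed tolerance.

```python
from statistics import mean

def deviation_over_time(series, expected, tolerance):
    """Return (index, value) pairs that stray from `expected` by more than `tolerance`."""
    return [(i, v) for i, v in enumerate(series) if abs(v - expected) > tolerance]

def deviation_between_groups(subgroup, population):
    """Difference between the subgroup mean and the mean of the whole population."""
    return mean(subgroup) - mean(population)

# Hypothetical daily sales figures; the first three days form the subgroup.
daily_sales = [100, 102, 98, 101, 160, 99]
changes = deviation_over_time(daily_sales, expected=100, tolerance=10)
first_days = daily_sales[:3]
gap = deviation_between_groups(first_days, daily_sales)
```

Here the subgroup is contained in the whole data set, which is the special case of group deviation that the text mentions.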
CHAPTER 3. DATA MINING IN SQL SERVER 2012
3.1. The OLE DB model in SQL Server
3.1.1. Introduction
Duqc giei thieu vio thing 7 Nam 2000. N6 co nguen geoc tir hai ding nghe ca se
der lieu chinh: OLE DB vi SQL. Tieu chu& nay thong qua cac khai niem co sa de lieu
quan he va nhieu ap dung cita chting vac linh yip khai thac der lieu. Phan cot lOi cea
OLE DB la Data Mining eXtensions (DMX), mot neon ngir- truy yin SQL-style cho
khai thac der lieu. Dic to nay thingbao gOm mot danh sach cac chin nang du bao duqc
xac djnh fru& va mot be schema rowsets. Cac schema rowsets cho phep cac img dung
cita ban kham phi cac me hinh khai thac vi cac (Lich vu khai thac to tong.
Muc dich chinh ctla OLE DB la cung cip met cach thac chuin de truy cip vao
bang dr( lieu. Truck khi OLE DB ra deri, each IMO bien Mit de truy cip vac) met ca se
dO lieu quan he duqc thong qua Open Database Connectivity (ODBC), mot API Oa
tren chuin SQL C Level Interface. ODBC cung cap met cach de clang truy van cac
loci co se lieu quan he. Tuy hau bet cac dO lieu khong duqc Itru trong co se do lieu
quan he. DO lieu duqc tim they trong cac tip tin van ban, email, bang tinh Excel, tai
lieu Word,... Bon muen truy cip vat) tit ca do lieu tren theo cach twang to nhu cach
ban truy cip dir lieu quan he, tea nhit la thong qua cling met API. OLE DB duct gieri
thieu cho now dich nay.
Figure 5: Architecture of Object Linking and Embedding Database (OLE DB)
Application programs can connect to different data mining sources through OLE DB or ADO connections. Each OLE DB provider for a data mining source supplies a set of data mining algorithms. These algorithms can access any tabular data source through OLE DB. The source data can be stored in many forms, such as relational databases, OLAP cubes, text files, or email.
To serve as a common standard for data mining, OLE DB defines a set of interfaces. These interfaces are implemented by objects, which include:
- Data Source Object: a COM object through which application programs connect to a data source. OLE DB implements a separate object class for each data source. To connect to an OLE DB data source, an application must first instantiate this class. The Data Source Object implements the IDBCreateSession interface and supports describing metadata information.
- Session Object: provides the context for a transaction session. It is created through the IDBCreateSession interface, and a Data Source Object can create any number of sessions. The Session Object implements the IDBCreateCommand interface.
- Rowset Object: the central object that lets every OLE DB data source expose its data in tabular form. Conceptually, a rowset is a set of rows, each of which has columns of data. A program iterates over the rowset to retrieve the data. The result of a query is returned as a rowset in tabular form (columns and rows).
Figure 6: Objects in OLE DB
3.1.2. Basic concepts in OLE DB for Data Mining
Case: data mining is the analysis of cases. Each case is a set of attributes, and each attribute can take a set of values called states. For example, the attribute Gender has two states: male and female.
Case Key: the attribute that uniquely identifies each case. It is usually the primary key of a relational table. Occasionally a case has a composite key made up of several attributes. For example, First Name and Last Name can be combined into a candidate key.
Nested Key: although the case key usually corresponds to a primary key, the nested key is quite different from a foreign key. The case key only establishes uniqueness and carries no patterns (it is usually ignored by data mining algorithms), whereas the nested key is the most important attribute of the nested part. The other attributes in the nested part describe the nested key.
Case Tables and Nested Tables: a case table holds the information related to the base part of a case. A nested table holds the information related to the nested part of a case. It is usually a transaction table, for example purchase history or Web access logs. A nested table can be joined to the case table through the case key. To join a case table and a nested table hierarchically, OLE DB defines the Shape operator.
Scalar Column and Table Column: a column in a mining model is similar to a column in a relational table; it is also called a variable or an attribute in statistical terminology.
Depending on its purpose, a column in a data mining model can be one of four kinds: key, input, predictable, or both input and predictable. Some algorithms, such as clustering, do not require predictable columns. In that case the mining model may contain only input columns.
There are two column structures: scalar and table. Most columns are scalar. A scalar column holds a single value for each record; for example, Age and Gender are scalar columns. A table column is a special column that contains a table. For example, the attribute Purchases is a table column (it holds the products and the quantities that the customer has bought). OLE DB has the notion of a hierarchical rowset: the base part holds the scalar columns, and the hierarchical part holds the table columns.
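To make the distinction between scalar and table columns concrete, here is a small sketch of a single case held as a Python dictionary. The column names and values are hypothetical, and the `flatten` helper is only an illustration of how a nested table relates to a flat rowset; it is not part of OLE DB.

```python
# One case: scalar columns at the top level, plus a table column
# ("Purchases") that holds a nested table keyed by product name.
case = {
    "CustomerID": 1,        # case key
    "Gender": "Male",       # scalar column
    "Age": 35,              # scalar column
    "Purchases": [          # table column (nested table)
        {"ProductName": "Milk", "Quantity": 2},   # ProductName is the nested key
        {"ProductName": "Bread", "Quantity": 1},
    ],
}

def flatten(case, table_column):
    """Join the case-level columns to each nested row, as a flat rowset would."""
    scalars = {k: v for k, v in case.items() if k != table_column}
    return [{**scalars, **row} for row in case[table_column]]

rows = flatten(case, "Purchases")
```

Flattening repeats the case-level values on every nested row, which is the redundancy the hierarchical rowset avoids.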
Data Mining Model: a data mining model can be thought of as similar to a relational table. It contains key columns, input columns, and predictable columns. Each model is associated with the data mining algorithm on which it is trained. Training a mining model means finding patterns in the data set by running the specified data mining algorithm with suitable parameters. After training, the mining model stores the patterns that the algorithm found. Whereas a relational table is a set of records, a data mining model is a set of patterns.
Model Creation: creating a model simply means creating an empty data mining model, much like creating a new table.
Model Training: also called model processing. It invokes the data mining algorithm to discover knowledge from the training data set. After training, the patterns are stored in the mining model.
Model Prediction: prediction applies the patterns of a trained mining model to a new data set in order to predict an outcome for each new case.
3.1.3. Data Mining Extensions to SQL (DMX)
a. Definition:
DMX (Data Mining Extensions) is a data mining query language defined in OLE DB for Data Mining. DMX adopts most relational concepts, and its syntax is based on the SQL query language.
In SQL Server 2012, besides using SQL Server Data Tools to mine data visually through a graphical interface, we can also issue DMX queries against the server to automate building models, training them, making predictions, querying the discovered knowledge, and displaying the results in a user interface.
b. Data mining steps using the DMX language:
Building the mining model: this is similar to creating a table in a relational database. A mining model consists of:
- Input data columns
- Predictable columns
- The associated algorithm
Example of a statement that builds a mining model for predicting a customer's membership card type using the decision tree algorithm:
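CREATE MINING MODEL MemberCard_prediction
(
    CustomerID  LONG KEY,
    Gender      TEXT DISCRETE,
    -- Age, Profession and HouseOwner are referenced by the training and
    -- prediction statements; their types here are assumed.
    Age         LONG CONTINUOUS,
    Profession  TEXT DISCRETE,
    Income      LONG CONTINUOUS,
    HouseOwner  TEXT DISCRETE,
    MemberCard  TEXT DISCRETE PREDICT,
    Purchase    TABLE
    (
        ProductName TEXT KEY,
        Quantity    LONG CONTINUOUS
    )
)
USING Microsoft_Decision_Trees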
Training the mining model: in this step the data mining algorithm begins analyzing the input data. Depending on how effective each algorithm is, it reveals the correlations between the attribute values.
DMX statement for training the model:
INSERT INTO MemberCard_prediction
    (CustomerID, Gender, Age, Profession, Income, HouseOwner, MemberCard)
OPENROWSET('sqloledb', 'myserver';'mylogin';'mypass',
    'SELECT CustomerID, Gender, Age, Profession, Income, HouseOwner, MemberCard FROM customers')
Prediction: to make predictions we need a trained model and a new data set. The prediction functions defined in DMX are used to produce the predictions.
Example of a DMX prediction statement:
SELECT t.CustomerID, t.LastName, M.MemberCard
FROM MemberCard_prediction AS M
PREDICTION JOIN
    OPENROWSET('Provider=Microsoft.Jet.OLEDB; data source=c:\customer.mdb',
        'select * from customers') AS t
ON  M.Gender = t.Gender
AND M.Age = t.Age
AND M.Profession = t.Profession
AND M.Income = t.Income
AND M.HouseOwner = t.HouseOwner
WHERE t.Age > 30
c. Some predefined prediction functions:
[Table: predefined DMX prediction functions. The table contents are not legible in the source.]
3.2. Data mining algorithms in SQL Server 2012
A data mining algorithm is a technique for creating mining models. To create a model, an algorithm first analyzes a set of data, looking for specific patterns and trends. The algorithm then uses the results of this analysis to define the parameters of the mining model.
The mining model that an algorithm creates can take various forms, including:
- A set of rules that describe how products are grouped together in a transaction.
- A decision tree that predicts whether a particular customer will buy a product.
- A mathematical model that forecasts sales.
- A set of clusters that describe how the cases in a dataset are related.
Microsoft SQL Server Analysis Services provides several algorithms for your data mining solutions. These algorithms are a subset of all the algorithms that can be used for data mining. You can also use third-party algorithms that comply with the OLE DB for Data Mining specification.
3.2.1. Microsoft Decision Trees
The Microsoft Decision Trees algorithm supports both classification and regression, and it works well for predictive modeling. It can predict both discrete and continuous attributes.
When building a model, the algorithm examines how each input attribute in the dataset affects the outcome of the predictable attribute. It then uses the input attributes with the strongest relationships to create a series of splits, called nodes. As new nodes are added to the model, a tree structure forms. The top node of the tree describes the statistical breakdown of the predictable attribute over all the cases. Each additional node is created based on the distribution of the states of the predictable attribute, compared against the input data. If an input attribute is seen to cause the predictable attribute to favor one state over another, a new node is added to the model. The model keeps growing until none of the remaining attributes produces a split that improves the prediction over the existing nodes. The model looks for a combination of attributes and their states that yields a disproportionate distribution of states in the predictable attribute, which is what allows the outcome of the predictable attribute to be predicted as well as possible.
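The text does not say which score is used to rank candidate splits. The sketch below uses entropy-based information gain, a common textbook choice, purely to illustrate how an input attribute that produces a disproportionate distribution of the predictable attribute is preferred. The function names and the four training cases are hypothetical, and this is not a description of the Microsoft implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction in `target` obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training cases.
rows = [
    {"Gender": "M", "HouseOwner": "Y", "MemberCard": "Gold"},
    {"Gender": "F", "HouseOwner": "Y", "MemberCard": "Gold"},
    {"Gender": "M", "HouseOwner": "N", "MemberCard": "Silver"},
    {"Gender": "F", "HouseOwner": "N", "MemberCard": "Silver"},
]
best = max(["Gender", "HouseOwner"],
           key=lambda a: information_gain(rows, a, "MemberCard"))
```

In this toy data HouseOwner separates the two card types completely, while Gender leaves them evenly mixed, so HouseOwner would be chosen for the first split.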
3.2.2. Microsoft Clustering
This algorithm uses iterative techniques to group the records of a dataset into clusters that share similar characteristics. With these clusters you can explore the data and learn about relationships that exist but are not easy to find through casual observation. In addition, you can make predictions from the clustering model that the algorithm creates.
Example: consider a group of people who live in the same area, drive the same kind of car, eat the same kind of food, and buy the same product. This is one cluster of data. Another cluster may consist of people who go to the same restaurant, have similar salaries, and vacation abroad twice a year. By observing how these clusters are distributed, we can better understand how the records in a dataset interact, and how that interaction affects the outcome of the predictable attribute.
3.2.3. Microsoft Naive Bayes
This algorithm builds mining models faster than the other algorithms and is used for classification and prediction. For each state of the predictable attribute, it calculates the probability of each possible state of every input attribute. These probabilities can then be used to predict the outcome of the predictable attribute from the known input attributes. The algorithm supports only discrete or discretized attributes, and it treats the input attributes as independent of one another. It yields a simple mining model (it can be regarded as a starting point for data mining): because almost all of the calculations used to build the model are generated while the cube is processed, results are returned quickly.
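The counting scheme described above can be sketched in a few lines. This is a minimal illustration of the Naive Bayes idea on discrete attributes, without smoothing; the function names and the five cases are hypothetical, and it does not reproduce how Analysis Services stores or computes its model.

```python
from collections import Counter, defaultdict

def train(cases, target):
    """Count target states and, per target state, the input attribute states."""
    class_counts = Counter(c[target] for c in cases)
    cond_counts = defaultdict(Counter)  # (attribute, class) -> Counter of values
    for c in cases:
        for attr, value in c.items():
            if attr != target:
                cond_counts[(attr, c[target])][value] += 1
    return class_counts, cond_counts

def predict(model, inputs):
    """Return the target state with the highest unnormalised posterior score."""
    class_counts, cond_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior P(class)
        for attr, value in inputs.items():
            # P(value | class), multiplied in as if inputs were independent.
            score *= cond_counts[(attr, cls)][value] / n
        scores[cls] = score
    return max(scores, key=scores.get)

# Hypothetical training cases.
cases = [
    {"Gender": "M", "HouseOwner": "Y", "BikeBuyer": 1},
    {"Gender": "F", "HouseOwner": "Y", "BikeBuyer": 1},
    {"Gender": "M", "HouseOwner": "N", "BikeBuyer": 0},
    {"Gender": "M", "HouseOwner": "N", "BikeBuyer": 0},
    {"Gender": "F", "HouseOwner": "Y", "BikeBuyer": 0},
]
model = train(cases, "BikeBuyer")
```

Because training is only counting, the model can be built in a single pass over the data, which is consistent with the speed claim in the text.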
3.2.4. Microsoft Sequence Clustering
The Sequence Clustering algorithm analyzes sequence-oriented data objects, which consist of series of discrete values. The sequence attribute of a series usually maps to a set of events in a definite order. By analyzing the transitions between the states of a series, the algorithm can predict future states in related series.
The Sequence Clustering algorithm is a hybrid of sequence and clustering algorithms. It groups cases that have sequence attributes into segments based on the similarity of their sequences. A typical use of event sequences with this algorithm is analyzing the Web customers of a portal site. A portal is a set of affiliated domains such as news, weather, money, mail, and sport. Each customer is associated with a sequence of Web clicks on these domains.
The Sequence Clustering algorithm can group these Web customers into more or less homogeneous groups based on their behavior patterns. These groups can then be visualized, giving a detailed picture of how customers use the site.
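The "transition between states" part of the description can be illustrated by counting how often one click follows another. The sketch below estimates first-order transition probabilities from hypothetical click paths; it covers only the sequence side of the algorithm, not the clustering step, and the function name and data are assumptions.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next state | current state) from ordered event sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

# Hypothetical click paths over the portal's domains.
clicks = [
    ["News", "Weather", "Mail"],
    ["News", "Weather", "Sport"],
    ["News", "Money"],
]
probs = transition_probabilities(clicks)
```

Customers whose click paths give similar transition tables would be candidates for the same segment.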
3.2.5. Microsoft Time Series
The Time Series algorithm creates models that are used to predict continuous variables over time from OLAP and relational data sources.
Example: use this algorithm to forecast sales and profit based on the historical data in a cube.
With the Time Series algorithm you can choose one or more variables to predict, but the variables must be continuous. Each model has a case series, which identifies a position in the series, such as the date when looking at sales over the preceding months or years.
A case may contain a set of variables (for example, sales at different stores). The algorithm can use the correlation between changes in these variables in its predictions.
Example: earlier sales at one store can be very useful in forecasting current sales at other stores.
3.2.6. Microsoft Association Rules
This algorithm is designed for market basket analysis of customer transactions. Association rule analysis shows which products are frequently sold together and how a particular product is sold together with other products. For example, 5% of customers buy a laptop, a wireless mouse, and a cooling pad together, and 90% of the customers who bought a laptop and a wireless mouse also buy a cooling pad.
The association rules algorithm treats each attribute/value pair as an item. An itemset is a set of items in a single transaction. The algorithm scans the dataset looking for itemsets that appear in many transactions. The Support parameter defines how many transactions an itemset must appear in before it is considered significant.
Example of a frequent itemset:
{Gender="Male", Marital Status="Married", Age="30-35"}
This itemset contains the items Gender=Male; Marital Status=Married; Age=30-35.
In the item Gender=Male, Gender is the attribute and Male is the value of the Gender attribute.
The algorithm also finds association rules between itemsets. For example, an association rule has the form X,Y=>Z. When X, Y, and Z are all frequent itemsets, we say that Z is predicted from X,Y.
Another important property in association rule analysis is Probability. The probability of an association rule A=>B is computed as the Support of the itemset (A,B) divided by the Support of the itemset A. In the data mining research literature this probability is called confidence.
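The support and probability definitions above translate directly into code. In this sketch, support is counted as a number of transactions, and the five baskets are hypothetical; the function names are chosen for the example.

```python
def support(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))

def probability(transactions, antecedent, consequent):
    """Probability (confidence) of the rule antecedent => consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical baskets.
transactions = [
    {"laptop", "wireless mouse", "cooling pad"},
    {"laptop", "wireless mouse", "cooling pad"},
    {"laptop", "wireless mouse"},
    {"laptop"},
    {"cooling pad"},
]
rule_probability = probability(
    transactions, {"laptop", "wireless mouse"}, {"cooling pad"}
)
```

Here three baskets contain both a laptop and a wireless mouse, and two of those also contain a cooling pad, so the rule's probability is 2/3.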
3.2.7. Microsoft Neural Network
The Neural Network algorithm creates classification and regression mining models by constructing a multilayer perceptron network of neurons. Like the decision tree algorithm, given each state of the predictable attribute, the Neural Network algorithm calculates the probability of each possible state of the input attributes. The algorithm processes the entire set of cases, iteratively comparing the predicted classification of the cases with their known classification. The error from the initial classification (the first iteration) over the entire set of cases is fed back into the network and used to modify the network's behavior for the next iteration, and so on. These probabilities can then be used to predict the outcome of the predictable attributes based on the input attributes.
One of the main differences between the Neural Network algorithm and the decision tree algorithm is that its learning process optimizes the network parameters to minimize the error, whereas the decision tree splits rules in order to maximize information gain. The algorithm supports both discrete and continuous attributes.
3.2.8. Microsoft Linear Regression
Microsoft Linear Regression is a particular configuration of the Microsoft Decision Trees algorithm, obtained by disabling splits (the entire regression formula is built in a single root node). The algorithm supports the prediction of continuous attributes.
3.2.9. Microsoft Logistic Regression
Microsoft Logistic Regression is a particular configuration of the Microsoft Neural Network algorithm, obtained by removing the hidden layers. The algorithm supports the prediction of both discrete and continuous attributes.
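A network with no hidden layer reduces to a weighted sum of the inputs passed through an output function. The sketch below shows that structure with the logistic (sigmoid) function; the weights, bias and inputs are hypothetical placeholders rather than trained values, and nothing here reflects how Analysis Services fits its parameters.

```python
from math import exp

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_probability(weights, bias, inputs):
    """Output of a network with no hidden layer: sigmoid of a weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(z)

# Hypothetical parameters for two numeric inputs.
p = predict_probability([1.0, -1.0], 0.0, [2.0, 2.0])
```

With these placeholder values the two inputs cancel out, so the weighted sum is zero and the predicted probability is exactly one half.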
3.3. Principles for choosing an algorithm
SQL Server includes the following kinds of algorithms:
- Classification algorithms: predict one or more discrete (non-continuous) variables based on the other attributes in the dataset (Microsoft Decision Trees Algorithm).
- Regression algorithms: predict one or more continuous variables, such as profit or loss, based on the other attributes in the dataset (Microsoft Time Series Algorithm).
- Segmentation algorithms: divide data into groups, or clusters, of items that have similar properties (Microsoft Clustering Algorithm).
- Association algorithms: find correlations between different attributes in a dataset. The most common application of this kind of algorithm is creating association rules, which can be used in market basket analysis (Microsoft Association Algorithm).
- Sequence analysis algorithms: summarize sequences that occur frequently or rarely in the data (Microsoft Sequence Clustering Algorithm).
Choosing the right algorithm for a specific business task can be difficult. Although different algorithms can be used to perform the same task, each algorithm produces a different result, and some algorithms can produce more than one type of result.
Example 1: the Microsoft Decision Trees algorithm can be used not only for prediction but also as a way to reduce the number of columns in a dataset, because the decision tree can identify the columns that do not affect the final mining model.
Algorithms also do not have to be used on their own. In a single data mining solution, some algorithms can be used to explore the data, and other algorithms can then be used to predict a specific outcome based on that data.
Example 2: a clustering algorithm, which recognizes patterns, can be used to break the data into homogeneous groups, and the results can then be used to create a better decision tree model.
Example 3: a regression tree algorithm can be used to obtain financial forecasting information, and a rule-based algorithm to carry out a market basket analysis.
Mining models can predict values, produce summaries of data, and find hidden correlations. To help choose an algorithm for a data mining solution, the table below suggests which algorithms to use for specific tasks:

Task                                                  Algorithms to use
Predicting a continuous attribute                     Microsoft Decision Trees algorithm
(for example, forecasting next year's revenue)        Microsoft Time Series algorithm
Finding groups of common items in transactions        Microsoft Association algorithm
(for example, using market basket analysis to         Microsoft Decision Trees algorithm
suggest additional products to a customer)
CHAPTER 4. DATA MINING APPLICATIONS IN SQL SERVER 2012
4.1. Introduction to Business Intelligence Development Studio
a. Introduction
SQL Server Data Tools - Business Intelligence Development Studio (BIDS) is Microsoft's tool for organizing, managing, and exploiting a data warehouse (online analytical processing) and for building data mining models. It is easy to use and effective.
SQL Server Data Tools runs inside Visual Studio. The user can create an Analysis Services project to mine data. The process of building a data mining model with BIDS is as follows:
- Create a new project (Analysis Services Project).
- Create a Data Source.
- Create a Data Source View.
- Create a Mining Structure.
- Create the Mining Models.
- Explore the Mining Models.
- Test the accuracy of the Mining Models.
- Use the Mining Models for prediction.
b. System installation requirements
Install SQL Server 2012.
When installing SQL Server 2012, also install SQL Server Data Tools for Visual Studio 2012. SQL Server Data Tools for Visual Studio 2012 is the environment used to create and run the project.
You must make sure that the Analysis Services service is running.
Figure 7: Starting the SQL Server and Analysis Services services
4.2. Applications in SQL
4.2.1. Using the Microsoft Decision Trees and Microsoft Naive Bayes algorithms
a. Database and mining objectives
Database
The database used for illustration in this report is AdventureWorksDW2012, the data warehouse of a company specializing in manufacturing ...
Default view of the Prediction Query Builder
Using the Design and Query views, you can build and review your query. You can run the query and see its results in the Result view.
Creating the query
- In the Select Input Table(s) box, click Select case table. The Select Table dialog box opens; choose an input table that contains the test data to use in the prediction query.
- In the Select Table dialog box, choose Adventure Works DW2012 from the Data Source list.
- Choose vTargetMail from the Table/View list and click OK.
The mining structure columns are automatically mapped to the columns with the same names in the input table, as shown in the figure below.
Note that you would normally have a separate table of prospective customers, and you want to predict whether each customer will buy a bike (that is, the Bike Buyer column) based on the other known information (the other columns). Here vTargetMail serves as the table of prospective customers.
After you select the input table, Prediction Query Builder creates a default mapping between the mining model and the input table based on the column names, as shown in the following figure.
Building the prediction query
- In the Source column, click the cell in the first empty row, and then select the vTargetMail table.
- In the Field column, next to the entry you created in the previous step, select CustomerKey. This adds the unique identifier to the prediction query so that you can tell who is and who is not likely to buy a bike.
- Click the next cell in the Source column, and then select the Targeted Mail mining model.
- In the Field cell, select Bike Buyer.
- Click the next cell in the Source column, and then select Prediction Function.
- For Prediction Function, in the Field column, select PredictProbability. Prediction functions provide information about the model's predictions. The PredictProbability function returns the probability of the predicted value. You can specify the arguments of a prediction function in the Criteria/Argument column.
- In the Criteria/Argument column, type [vTargeted Mail].[Bike Buyer]. The screen looks like this:
By clicking the icon in the upper-left corner of the view, you can switch to the query view and review the DMX code that Prediction Query Builder generated. You can also run the query, modify it, and run the modified query, but the modifications are not kept if you switch back to Design view.
Viewing the results
You can run the query by clicking the arrow next to the icon in the upper-left corner of the tab and then selecting Result.
The CustomerKey, BikeBuyer, and Expression columns show the ID of each prospective customer, whether the customer is a bike buyer (a value of 0 or 1), and the probability that the prediction is correct. You can split this result table into two parts and save them to the database:
- A list of special customers (the most likely to buy a bike), with Bike Buyer equal to 1.
- A list of ordinary customers (not likely to buy a bike), with Bike Buyer equal to 0.
These results can therefore be used to decide who should be sent an advertisement.
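The split described above can be sketched as follows, assuming the query result has been loaded into a list of dictionaries. The rows below are made-up placeholder values, not output from the actual model.

```python
# Hypothetical rows shaped like the query result: customer key, predicted
# Bike Buyer value (0 or 1) and the probability of that prediction.
results = [
    {"CustomerKey": 1, "BikeBuyer": 1, "Expression": 0.62},
    {"CustomerKey": 2, "BikeBuyer": 0, "Expression": 0.71},
    {"CustomerKey": 3, "BikeBuyer": 1, "Expression": 0.88},
]

likely_buyers = [r for r in results if r["BikeBuyer"] == 1]
other_customers = [r for r in results if r["BikeBuyer"] == 0]

# Contact the most confident predictions first.
likely_buyers.sort(key=lambda r: r["Expression"], reverse=True)
mailing_list = [r["CustomerKey"] for r in likely_buyers]
```

Sorting by the probability column is an optional refinement; the text itself only requires the split by the Bike Buyer value.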
4.2.2. Using the Microsoft Association Rules algorithm
a. Data description and mining objectives
Database
In this problem we continue to use the Adventure Works DW 2012 database. The data to be mined are the two views vAssocSeqOrders and vAssocSeqLineItems.
Mining objectives
The company's sales department wants to understand the associations between products that are sold together, based on completed transactions. Specifically, if a person buys items A and B, will they tend to buy some additional item C? This is very important: sales staff rely on it to understand buyers' preferences and to market products accordingly. The task of Microsoft Association Rules is to discover these associations.
b. Running the application
Creating an Analysis Services Project
First, create an Analysis Services Project named "Association Rule Model", create the data connection, and create a Data Source and a Data Source View containing the two views vAssocSeqOrders and vAssocSeqLineItems.
- Add the Data Source.
- Under Data Source Views, right-click and choose New Data Source View. Choose the Adventure Works DW 2012 database. Click Next.
- Choose vAssocSeqOrders as the Case table and vAssocSeqLineItems as the Nested table, and click Next.
- Create a many-to-one relationship by dragging the OrderNumber attribute from vAssocSeqLineItems to vAssocSeqOrders.
- In the Solution Explorer window, right-click Mining Structures and click New
Mining Structure.
- Click Next.
- Click From existing relational database or data warehouse, then click Next.
- Under What data mining technique do you want to use?, choose Microsoft
Association Rules. Click Next.
- Click Next.
- Click Next.
- The Input Table dialog appears. Mark vAssocSeqOrders as Case and
vAssocSeqLineItems as Nested.
- Click Next. Choose the input, predict, and key attributes:
- Input: OrderNumber in the vAssocSeqOrders table.
- Model in the vAssocSeqLineItems table: here it serves as input, predict, and
key at once.
- Click Next.
- A dialog appears asking what percentage of the data to use. Keep the
defaults, 30% and 1000 rows. Click Next.
- Name the mining structure and click Finish.
Tune the model parameters:
In the Mining Models window, right-click Microsoft Association
Rules, choose Set Algorithm Parameters, and set two parameters:
MINIMUM_PROBABILITY to 0.1 and MINIMUM_SUPPORT to 0.01.
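To make the effect of the two thresholds concrete, the sketch below applies them to counts by hand. It assumes that a MINIMUM_SUPPORT value below 1 is read as a fraction of all cases and a larger value as an absolute count, and that MINIMUM_PROBABILITY is a lower bound on a rule's probability. The function names and the total of 20000 cases are assumptions for illustration, not values taken from the model.

```python
def meets_support(count, total_cases, minimum_support=0.01):
    """Check an itemset count against the support threshold.

    Assumption: values below 1 are a fraction of all cases,
    values of 1 or more are an absolute number of cases.
    """
    if minimum_support < 1:
        return count / total_cases >= minimum_support
    return count >= minimum_support


def meets_probability(probability, minimum_probability=0.1):
    """Check a rule probability against the probability threshold."""
    return probability >= minimum_probability


total_cases = 20000  # hypothetical number of orders
print(meets_support(1146, total_cases))  # 1146 / 20000 is above 0.01
print(meets_support(150, total_cases))   # 150 / 20000 is below 0.01
print(meets_probability(0.05))           # below the 0.1 threshold
```

Lowering either threshold admits more itemsets and rules, at the cost of more noise and a longer processing time.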
After tuning the parameters in Mining Models, press F5 to run the
model.
Exploring the mining model
The results of Microsoft Association Rules appear in the Mining Model
Viewer tab, which has three main parts: Itemsets, Rules, and Dependency Network.
Itemsets:
Itemsets show the key information about the association rules, such as Support (the
support of the itemset) and Size (the number of items in the itemset). To display only the itemsets that contain
a given item (for example the Mountain-200 bike model), type Mountain-200 in the Filter
Itemset box.
[Figure: Itemsets tab of the Mining Model Viewer, with columns Support, Size, and Itemset. The last row shown has Support 1146 and Size 2: Mountain Bottle Cage, Water Bottle.]
In the figure above, the itemset with Support 1146 contains two items, Mountain Bottle
Cage and Water Bottle. This means that, across all transactions, there are 1146 in
which the customer bought both a Mountain Bottle Cage and a Water Bottle.
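The Support and Size columns, and the effect of the filter box, can be reproduced on toy data. The sketch below is a brute-force count over invented transactions, not the algorithm Analysis Services uses; count_itemsets and filter_itemsets are hypothetical helper names.

```python
from collections import Counter
from itertools import combinations

# Invented transactions; each set is the models bought in one order.
transactions = [
    {"Mountain Bottle Cage", "Water Bottle"},
    {"Mountain Bottle Cage", "Water Bottle", "Mountain-200"},
    {"Water Bottle"},
    {"Mountain-200", "Sport-100"},
]


def count_itemsets(transactions, max_size=2):
    """Count, for every itemset up to max_size, the orders containing it."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            # Sorting gives each itemset one canonical tuple form.
            for combo in combinations(sorted(items), size):
                counts[combo] += 1
    return counts


def filter_itemsets(counts, text):
    """Keep only itemsets in which some item name contains text."""
    return {k: v for k, v in counts.items() if any(text in item for item in k)}


counts = count_itemsets(transactions)
pair = ("Mountain Bottle Cage", "Water Bottle")
print(counts[pair], len(pair))  # Support and Size of the pair
filtered = filter_itemsets(counts, "Mountain-200")
print(len(filtered))
```

The pair's count plays the role of the 1146 in the viewer: the number of orders that contain both items.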
Rules tab:
This tab lists the association rules discovered by the model. The information
shown for each rule includes:
Probability: the probability that the rule holds.
Importance: a measure of the rule's usefulness; the higher the value, the better
the rule.
Rules: the association rules themselves, in the form X -> Y.
[Figure: Rules tab of the Mining Model Viewer, with columns Probability, Importance, and Rule, shown with a minimum probability of 0.10.]