Programatically retrieving data from a website into a database

Summary: Our client wanted to extract data from a web portal. We wrote a VBA script to the requested information and store it in an Access database. The example below demonstrates the same technology, but the data is stored in Excel.

Edit: 7/23/2017 – An improvement on this solution.

Edit: 9/9/2016 – I posted a download to the file here.

Technologies

  • Microsoft Access: Data and VBA
  • Microsoft Internet Controls
  • Microsoft HTML Document Control*

An understanding of the HTML Document Object Model (DOM) will help you A LOT.

 

Scrape web content using VBA

Screen scraping used to be a way to grab characters off of mainframes and AS/400’s. It is much easier to scrape web site information due to the DOM. Actually, it’s not scraping at all – more of a structured retrieval of data from HTML.

I cannot provide the exact code from the above project as it was a bought and paid for by our client and belongs to them. I’ll explain the methodology and provide sample code to retrieve web content. I’ll also demonstrate the retrieving information from the DOM.

I will provide the source code for you to retrieve information from my blog’s root page – the one listing the most recent 10 blog entries found at the link below:
http://www.pulseinfomatics.com/blog/

I’ll explain a couple critical differences in the code provided and the code we built for our client.

 

Create an Access table: tblWebData

With the following fields:

  • wd_uno (autonumber, key)
  • ArticleTitle (Short Text, 255)
  • ArticleURL (Short Text, 255)
  • ArticleAuthor (Short Text, 255)
  • ArticleSummary (Long Text)
  • DateRetrieved (date/time)

Note, I would size them differently but I was in a hurry. I’ll leave that to you.

 

Create an Access form (frmWebTest)

I put a single button on the form:btnGetWebData

This is the code for that button plus another sub-routine.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
Private Sub btnGetWebData_Click()
Dim strURL
Dim iPage As InternetExplorer
Dim iHTML As HTMLDocument
Set iPage = New InternetExplorer
iPage.Navigate strURL
 
While iPage.ReadyState <> READYSTATE_COMPLETE
' I'll make a big assumption that it will get to the page, otherwise, you'd want to have a way to catch a timeout
 DoEvents
Wend
 
Set iHTML = iPage.Document 'put the HTML from the navigated webpage into the HTML Document object
 
'look for the "article" tag and return number of articles on the page
intNumA = iHTML.getElementsByTagName("article").length
 
For x = 1 To intNumA 'loop through each article
 
 strAURL = iHTML.getElementsByTagName("article").Item(x - 1).getElementsByTagName("header").Item(0).getElementsByTagName("a").Item(0).href
 strATitle = FixQuote(iHTML.getElementsByTagName("article").Item(x - 1).getElementsByTagName("header").Item(0).getElementsByTagName("a").Item(0).innerText)
 strAAuthor = iHTML.getElementsByTagName("article").Item(x - 1).getElementsByTagName("header").Item(0).getElementsByTagName("a").Item(1).innerText
 strASummary = FixQuote(iHTML.getElementsByTagName("article").Item(x - 1).getElementsByTagName("p").Item(0).innerText)
 SaveWebInfo strAURL, strATitle, strAAuthor, strASummary
  
Next
 
Set iHTML = Nothing
Set iPage = Nothing
 
End Sub
 
Sub SaveWebInfo(inURL, inTitle, inAuthor, inSummary)
Dim strSQL As String
strSQL = "INSERT INTO tblWebData (ArticleURL,ArticleTitle,ArticleAuthor,ArticleSummary,DateRetrieved) " & _
 "VALUES ('" & inURL & "','" & inTitle & "', '" & inAuthor & "','" & inSummary & "', #" & Now() & "#)"
DoCmd.SetWarnings False
DoCmd.RunSQL strSQL
DoCmd.SetWarnings True
End Sub

 

FixQuotes Function

Variations of this functionality can be found in various places around the web. I grabbed mine from this blog.

http://mikeperris.com/access/escaping-quotes-Access-VBA-SQL.html

It ensures you don’t get errors when trying to save text with quotes.

1
2
3
4
5
6
7
8
9
10
11
12
' courtesy "dschmidt333", utteraccess.com
Function FixQuote(FQText As String) As String
On Error GoTo Err_FixQuote
FixQuote = Replace(FQText, "'", "''")
FixQuote = Replace(FixQuote, """", """""")
Exit_FixQuote:
Exit Function
Err_FixQuote:
MsgBox Err.Description, , "Error in Function Fix_Quotes.FixQuote"
Resume Exit_FixQuote
Resume 0 '.FOR TROUBLESHOOTING
End Function

 

Some DOM elements and nesting explained

My goal is not to teach you how to navigate the DOM. You can start here to learn more about the DOM.

경축! 아무것도 안하여 에스천사게임즈가 새로운 모습으로 재오픈 하였습니다.
어린이용이며, 설치가 필요없는 브라우저 게임입니다.
https://s1004games.com

intNumA = iHTML.getElementsByTagName(“article”).length
Returns the number of “article” tags on the page. For my blog, each article is contained in an article tag.

 

GetElementsByTagName(“article”)
Returns an array of any elements with the tag name (“article”)

You can now loop through each article to retrieve article-specific data.

iHTML.GetElementsByTagName(“article”).Item(x – 1)
Refers to a specific article tag and it’s associated HTML. x started counting 1. Arrays start at 0. x – 1 ensures you are referring to the correct article tag.

Once you refer to a specific article (x – 1) the objects you refer to after that relate only to that specific article. Elements that are not encapsulated (nested) in that “article” tag are ignored. Remember that when reading the href and innerTEXT references below.

 

.getElementsByTagName(“a”).Item(0).href
The first a tag href is the link to the blog entry.

The second href is the blog author. Subsequent hrefs, if they exist, link to categories. But I am only concerned with item(0) and item(1).

The innerTEXT of each hrefs returns the linked text. The article URL’s innerTEXT is the title. The author URL’s innerTEXT is the author name.

 

getElementsByTagName(“p”).Item(0).innerText
References the only paragraph in each article, the summary text.

That gives you some idea of how to reference items in the HTML Document Object Model.

 

How my client’s project differed.

In order to retrieve several pages of profiles, I had to:

  • navigate to a page
  • loop through information from a list
  • present the user information and choices as to what they wanted to copy
  • retrieve the data from a list
  • navigate to sub-pages if they exist
  • retrieve information from sub-pages
  • move to the next page and repeat

 

Example Video

Here’s an example where I navigate the DOM on our website

 

 

Download File

pulse-grab-web-data.zip

[출처] https://www.pulseinfomatics.com/programatically-retrieving-data-from-a-website-into-a-database/

본 웹사이트는 광고를 포함하고 있습니다.
광고 클릭에서 발생하는 수익금은 모두 웹사이트 서버의 유지 및 관리, 그리고 기술 콘텐츠 향상을 위해 쓰여집니다.
번호 제목 글쓴이 날짜 조회 수
1196 [ 一日30分 인생승리의 학습법] supabase 폼 미쳤다 newfile 졸리운_곰 2024.04.27 1
1195 [ 一日30分 인생승리의 학습법] VBA Web Scraping: How Can VBA Be Used To Scrape Website Data? file 졸리운_곰 2024.04.13 3
1194 [ 一日30分 인생승리의 학습법] 윈도우 실행파일 구조(PE파일) file 졸리운_곰 2024.03.31 3
1193 [ 一日30分 인생승리의 학습법] [Analysis] PE(Portable Executable) 파일 포맷 공부 file 졸리운_곰 2024.03.31 3
1192 [ 一日30分 인생승리의 학습법] 성공하는 메타버스의 3가지 조건 file 졸리운_곰 2024.03.30 7
1191 [ 一日30分 인생승리의 학습법] REST, REST API, RESTful 과 HATEOAS file 졸리운_곰 2024.03.10 9
1190 [ 一日30分 인생승리의 학습법] 렌더링 삼형제 CSR, SSR, SSG 이해하기 file 졸리운_곰 2024.03.10 2
1189 [ 一日30分 인생승리의 학습법] 엑셀 VBA에서 셀레니움 사용을 위한 Selenium Basic 설치 file 졸리운_곰 2024.02.23 11
1188 [ 一日30分 인생승리의 학습법]500 Lines or Less Blockcode: A Visual Programming Toolkit : 500줄 이하의 블록코드: 시각적 프로그래밍 툴킷 졸리운_곰 2024.02.12 4
1187 [ 一日30分 인생승리의 학습법] 구글 클라이언트(앱) 아이디를 발급받으려면 어떻게 해야 하나요? 졸리운_곰 2024.01.28 3
1186 [ 一日30分 인생승리의 학습법] 빅뱅 프로젝트를 성공적으로 오픈하기 위한 팁 졸리운_곰 2023.12.27 16
1185 [ 一日30分 인생승리의 학습법]“빅뱅 전환보다 단계적 전환 방식이 이상적 애자일팀과 협업 쉽게 체질 개선을” file 졸리운_곰 2023.12.27 12
1184 [ 一日30分 인생승리의 학습법] Big-bang / phased 접근 file 졸리운_곰 2023.12.27 3
1183 [ 一日30分 인생승리의 학습법] CodeDragon 메뉴 데이터 전환의 개념 이해 - 데이터 전환의 개념, 데이터 전환방식, 데이터 전환방식 및 장단점 비교, 데이터전환 이후 검토해야 할 사항 졸리운_곰 2023.12.27 5
1182 [ 一日30分 인생승리의 학습법] 블록체인과 IPFS를 이용한 안전한 데이터 공유 플랫폼 - 분쟁 해결 시스템 file 졸리운_곰 2023.12.27 6
1181 [ 一日30分 인생승리의 학습법] 블록체인과 IPFS를 이용한 안전한 데이터 공유 플랫폼 - 개념과 리뷰 시스템 file 졸리운_곰 2023.12.27 4
1180 [ 一日30分 인생승리의 학습법] 소켓 CLOSE_WAIT 발생 현상 및 처리 방안 file 졸리운_곰 2023.12.03 7
1179 [ 一日30分 인생승리의 학습법] robots 설정하기 졸리운_곰 2023.12.03 3
1178 [ 一日30分 인생승리의 학습법] A Tutorial and Elementary Trajectory Model for the Differential Steering System of Robot Wheel Actuators : 로봇 휠 액츄에이터의 차동 조향 시스템에 대한 튜토리얼 및 기본 궤적 모델 file 졸리운_곰 2023.11.29 6
1177 [ 一日30分 인생승리의 학습법] Streamline Your MLOps Journey with CodeProject.AI Server : CodeProject.AI 서버로 MLOps 여정을 간소화하세요 file 졸리운_곰 2023.11.25 2
대표 김성준 주소 : 경기 용인 분당수지 U타워 등록번호 : 142-07-27414
통신판매업 신고 : 제2012-용인수지-0185호 출판업 신고 : 수지구청 제 123호 개인정보보호최고책임자 : 김성준 sjkim70@stechstar.com
대표전화 : 010-4589-2193 [fax] 02-6280-1294 COPYRIGHT(C) stechstar.com ALL RIGHTS RESERVED