[Web Scraping with Excel, Part 2] Scrape webpage data using Excel & VBA
2016.10.09 20:12
This post is going to get a bit technical. If you're just looking for a quick tool to scrape data off pages and you're not trying to set up the code yourself, I'd recommend this post instead: Search source code of any sites with Excel. There you can download the completed VBA file and get some ideas about practical marketing applications for scraping webpage data.
Considerations for coding a basic VBA script to scrape webpage data
In the past, I’ve highlighted one method of coding VBA to scrape webpage data by automating Internet Explorer. While IE is good for certain applications, if all you need is the page’s source HTML, then direct HTTP requests (made through the ServerXMLHTTP object, called oHTTP in the code below) let VBA get through far more pages in less time, and they generally encounter fewer errors.
If you’ve never coded VBA, I’d recommend looking at my introductory piece before diving into this particular code.
Writing the code
Since Excel will be driving the process, the starting place is generally going to be some sort of list of URLs that you need to extract data from. The basic outline of what you will be coding is:
For Each cell In URL-List
    Navigate oHTTP to cell.Value
    Open the site source
    Perform some action with the HTML
    Output the results of that action in some way
Next
The actual code you’ll need to scrape webpage data follows. First, you’ll need to define your variables and set your HTTP object.
Sub Site_Source_Scraper()
    Dim navtar As String
    Dim oHTTP As Object
    Dim page_html As String
    Dim cell As Range

    Set oHTTP = CreateObject("msxml2.ServerXMLHTTP")
In this first block of code, we are simply declaring the variables and objects the script needs. navtar is the string holding the URL of the website we’ll be scraping data off of, oHTTP is the HTTP object we’ll use for most of our commands, page_html holds the complete HTML of the page as a string, and of course cell is just the cell we are currently reading a URL from. Lastly, setting oHTTP to a ServerXMLHTTP object is what allows us to make URL requests through it.
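If the generic ProgID above gives you trouble on a newer machine, one small variation (not part of the original script) is to pin the MSXML version explicitly; MSXML2.ServerXMLHTTP.6.0 is the version that ships with modern Windows:

```vba
' Variation on the line above: request MSXML 6.0 explicitly
Set oHTTP = CreateObject("MSXML2.ServerXMLHTTP.6.0")
```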
The next section is the loop that does the actual scraping.
For Each cell In Selection
    cell.Activate
    navtar = Replace(Replace(Replace(ActiveCell.Value, "https://", ""), "http://", ""), "www.", "") 'Clean URL
    navtar = "http://" & navtar
    On Error GoTo HTTPErr
    oHTTP.Open "GET", navtar, False
    oHTTP.send
    page_html = oHTTP.responseText
    'put all the awesome code here!
LoopPickup:
Next
Exit Sub
The whole block of code is wrapped in a For loop that looks up the webpage for every cell selected when the script is executed. The first thing that happens inside the loop is cleaning the URL. The URL has to take the form http://URLHERE.COM (or .net, .org, etc.); a leading www. will cause an error. The nested Replace calls strip out the extraneous prefixes from the URL. Then, in the next line, “http://” is added back onto the beginning of the URL.
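To see the cleaning step on its own, here is what the nested Replace calls do to a sample address (a standalone sketch; the URL is just an illustration):

```vba
Dim navtar As String
' Strip the scheme and the www. prefix, innermost call first
navtar = Replace(Replace(Replace("https://www.example.com/page", "https://", ""), "http://", ""), "www.", "")
' navtar is now "example.com/page"
navtar = "http://" & navtar
' navtar is now "http://example.com/page"
```

Note that Replace removes every occurrence of its search string, not just a leading one, so a URL whose path happens to contain "www." would also be altered.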
To keep errors from crashing the script, the On Error statement right before the .Open command routes any problems to the error handler (see below). The actual “get this data” step happens next. The .Open command takes the form OBJECT.Open “GET”, theURL, False to request a page; the .send that follows executes the request. The result lands in the variable page_html: a giant string containing the entire HTML of the webpage. If you’re looking for ideas of things to do with it, see my post on why a marketer would be searching through HTML.
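As one hypothetical example of the “awesome code” you might drop in after the response arrives, this sketch flags whether the page mentions a keyword and writes the result in the column next to the URL (the keyword “analytics” and the Offset column are illustrative choices, not part of the original script):

```vba
' Hypothetical: flag pages whose HTML mentions "analytics" (case-insensitive)
If InStr(1, page_html, "analytics", vbTextCompare) > 0 Then
    ActiveCell.Offset(0, 1).Value = "Found"
Else
    ActiveCell.Offset(0, 1).Value = "Not found"
End If
```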
Lastly, the error handler must be built.
HTTPErr:
    If Err.Number <> 0 Then
        ActiveCell.Offset(0, 1).Value = "Error: " & Err.Description
    End If
    Resume LoopPickup
End Sub
When you put those three parts together, you’ll have a VBA script that lets you work through a massive list of sites in moments, scraping webpage data for whatever you might need. Leave comments below about what you used this for!
Complete VBA code:
Sub Site_Source_Scraper()
    Dim navtar As String
    Dim oHTTP As Object
    Dim page_html As String
    Dim cell As Range

    Set oHTTP = CreateObject("msxml2.ServerXMLHTTP")

    For Each cell In Selection
        cell.Activate
        navtar = Replace(Replace(Replace(ActiveCell.Value, "https://", ""), "http://", ""), "www.", "") 'Clean URL
        navtar = "http://" & navtar
        On Error GoTo HTTPErr
        oHTTP.Open "GET", navtar, False
        oHTTP.send
        page_html = oHTTP.responseText
        'put all the awesome code here!
LoopPickup:
    Next
    Exit Sub

HTTPErr:
    If Err.Number <> 0 Then
        ActiveCell.Offset(0, 1).Value = "Error: " & Err.Description
    End If
    Resume LoopPickup
End Sub
[Source] http://tipsformarketers.com/scrape-webpage-data-using-excel-vba/