[Accessing sites that block the Selenium driver with Selenium]

How to avoid Selenium webdriver from being detected as bot or web spider


Before we start using php-webdriver and Selenium for web scraping and social media auto-posting, we need to adjust some settings in code and modify some files so that our script is not detected as a web bot or spider. I have listed some ways to hide Selenium automation below. The methods apply to other programming languages as well. Please note that this is not a complete list, and from time to time website operators find new methods to detect and block Selenium automation. Still, we should factor all known methods into our scripts to reduce the chance of detection.

1. Remove browser control flag

2. Remove signature in javascript

3. Set User-Agent

4. Avoid using headless browser

5. Use maximum resolution

6. Follow page flow

7. Use proxy or VPN

8. Insert random delay

9. Use cookies to login

Previous article : How to install php-webdriver + Selenium for screen scrapping and auto-post

1. Remove browser control flag

If you run Selenium with default settings, you will see the notification "Chrome is being controlled by automated test software" at the top of the Chrome browser. I am not sure whether a web server can see this notification or detect the flag that turns it on. But since it is right there on the screen, we may as well turn the flag off.


In php-webdriver, use setExperimentalOption on the ChromeOptions object to exclude the enable-automation switch. The notification will no longer appear after this.

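use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

$ops = new ChromeOptions();
$ops->setExperimentalOption("excludeSwitches", array("enable-automation"));
$capabilities = DesiredCapabilities::chrome();
$capabilities->setCapability(ChromeOptions::CAPABILITY, $ops);
// $host is the URL of your Selenium server
$driver = RemoteWebDriver::create($host, $capabilities);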

2. Remove signature in javascript

Inside chromedriver.exe (the same reportedly applies to geckodriver for Firefox and edgedriver for Edge) there is a JavaScript signature that is used by bot detection software such as FingerprintJS, Imperva, or even Google's CAPTCHA. I use Agent Ransack to search for the "cdc_" signature in the chromedriver.exe binary file.


The signature is "$cdc_asdjflasutopfhvcZLmcfl_". What we can do is change "cdc" to another string of the same length, so that the size and layout of the binary do not change. For example, I changed "cdc" to "tch". You can use anything, such as "abc" or "xyz".

Since this is a binary file, we cannot edit the signature with a normal text editor. I use vim for this purpose.

Go to https://www.vim.org/download.php and download the self-installing executable.


After installation, go to C:\Program Files (x86)\Vim\vim82 (the installer does not add vim to the PATH environment variable automatically, but you can do that yourself if you want).

Run the command "vim.exe <path to>\chromedriver.exe".

vim displays the binary file as mostly unreadable characters. Type ":%s/cdc_/tch_/g" to globally replace the string "cdc_" with "tch_", then press Enter to execute the command.


Then type ":wq!" to exit vim. This saves the changes under the same file name, chromedriver.exe in this case.


There might be some intermediate files (with ~ at the end of the file name) in the same webdriver directory. Just delete them.


Now if we run the same search for "cdc_", nothing is found. You can also verify the change by searching for the new string. The signature has now changed!
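The same replacement can also be scripted instead of done by hand in vim. The following is only a minimal sketch of that idea in PHP, not the method used above: patchSignature() is a hypothetical helper name, and the defaults "cdc_" and "tch_" simply mirror the strings from this article. It reads the whole file, swaps the signature for a string of equal length, and writes the result to a separate file so the original stays untouched.

```php
<?php
// Sketch: same-length signature replacement in a driver binary.
// patchSignature() is a hypothetical helper name, not part of any library.
function patchSignature(string $source, string $target, string $old = 'cdc_', string $new = 'tch_'): int
{
    // A different length would shift every byte after the match and break the executable.
    if (strlen($old) !== strlen($new)) {
        throw new InvalidArgumentException('Replacement must be the same length as the original.');
    }

    $data = file_get_contents($source);
    if ($data === false) {
        throw new RuntimeException("Cannot read $source");
    }

    // str_replace() is binary-safe; $count receives the number of replacements made.
    $patched = str_replace($old, $new, $data, $count);

    if (file_put_contents($target, $patched) === false) {
        throw new RuntimeException("Cannot write $target");
    }

    return $count;
}
```

A call such as patchSignature('chromedriver.exe', 'chromedriver_patched.exe') would return the number of occurrences replaced; both file names here are placeholders. A return value of 0 means the signature was not found in that build of the driver.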


3. Set User-Agent

Social media sites most likely keep track of our IP address and user agent when we use a browser to create an account or publish a new post. So it is a good idea to use the same user agent during Selenium automation.

From the browser that we use for social media activities, go to https://www.whatismybrowser.com/detect/what-is-my-user-agent


Copy and paste the user-agent string and set it in Selenium. Example:

$chrome_options = array('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4324.104 Safari/537.36');
$ops = new ChromeOptions();
$ops->addArguments($chrome_options);
$capabilities = DesiredCapabilities::chrome();
$capabilities->setCapability(ChromeOptions::CAPABILITY, $ops);
$driver = RemoteWebDriver::create($host, $capabilities);

4. Avoid using headless browser 

Chrome can run in headless mode, that is, without creating a visible browser window, which brings benefits such as wider testing reach, better speed and test performance, and multitasking. However, in the real world, people perform social media activities through a visible web browser. So it is a good idea to leave the browser window open during auto-posting.

5. Use maximum resolution

Since we want to leave the browser window open, we can set it to a reasonable size. To check your browser window size, go to http://howbigismybrowser.com/howbigismybrowser

You can set the window to the maximum screen size or to a specific size in pixels in Selenium.

// Use one of the following:
$chrome_options = array('--start-maximized');        // set to max screen size
$chrome_options = array('--window-size=1400,900');   // set to 1400x900

6. Follow page flow

Unlike scraping with cURL, it is better to follow the page flow when using a webdriver. For example, if a user can only reach page C after browsing page A and then page B, don't go directly to page C. Try to imitate the browsing actions of a human user.

7. Use proxy or VPN

Never use your own IP address for scraping or auto-posting. Websites like Amazon will block your IP address if the server detects unusual activity. Always use a VPN or proxy. Even then, do not keep scraping the same website through the same VPN or proxy. Switch to a new VPN address or proxy regularly to avoid detection.
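As a rough illustration of switching proxies between sessions, the sketch below picks a random proxy from a list while avoiding the one used last, and turns it into Chrome's --proxy-server command-line argument. The helper names pickProxy() and proxyArgument() are hypothetical, and the addresses are placeholders from a documentation-only IP range, so substitute your own proxy list.

```php
<?php
// Sketch: choose a different proxy for each session.
// The addresses below are placeholders from the documentation range 203.0.113.0/24.
$proxies = array(
    '203.0.113.10:8080',
    '203.0.113.11:8080',
    '203.0.113.12:3128',
);

// pickProxy() is a hypothetical helper: returns a random proxy, skipping the previous one.
function pickProxy(array $proxies, ?string $previous = null): string
{
    $candidates = array_values(array_diff($proxies, array($previous)));
    if (count($candidates) === 0) {
        throw new InvalidArgumentException('No alternative proxy available.');
    }
    return $candidates[random_int(0, count($candidates) - 1)];
}

// Build the Chrome command-line argument for the chosen proxy.
function proxyArgument(string $proxy): string
{
    return '--proxy-server=' . $proxy;
}

$proxy = pickProxy($proxies);
$chrome_options = array(proxyArgument($proxy));
// $chrome_options can then be passed to $ops->addArguments() as in the User-Agent example.
```

Remembering the last proxy (for example in a small file) and passing it as $previous keeps two consecutive runs from going out through the same address.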

8. Insert random delay

Always insert a random delay between two actions. For example, after login, insert a random delay of 5 to 10 seconds before going to the next page. That way the server sees a different delay from one action to the next in every cycle.

sleep(rand(5, 10));
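Whole-second delays only produce six distinct values between 5 and 10. If finer variation is wanted, one possible variant is to draw the delay in milliseconds and sleep with usleep(). This is a sketch only, and randomDelayMs() and humanPause() are hypothetical helper names.

```php
<?php
// Sketch: random delay with millisecond granularity.
// randomDelayMs() and humanPause() are hypothetical helper names.
function randomDelayMs(int $minSeconds, int $maxSeconds): int
{
    if ($minSeconds < 0 || $maxSeconds < $minSeconds) {
        throw new InvalidArgumentException('Expected 0 <= min <= max.');
    }
    // Pick any millisecond value in the inclusive range.
    return random_int($minSeconds * 1000, $maxSeconds * 1000);
}

function humanPause(int $minSeconds = 5, int $maxSeconds = 10): int
{
    $ms = randomDelayMs($minSeconds, $maxSeconds);
    usleep($ms * 1000); // usleep() takes microseconds
    return $ms;         // returned so the caller can log the actual delay
}
```

Calling humanPause() with no arguments waits somewhere between 5 and 10 seconds, matching the range used above.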

9. Use cookies to login

This is important for social media posting. Store the cookies in a file after logging in with username and password for the first time. Then use the cookies for subsequent logins. Logging in with username and password, without cookies, too frequently in a day might get your account blocked.
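Storing and reloading the cookies could look roughly like the sketch below. It works on plain arrays and keeps the file format as JSON. saveCookies() and loadCookies() are hypothetical helpers, and the key names ('name', 'value', 'domain', 'expiry') are an assumption about how the cookies are represented.

```php
<?php
// Sketch: persist cookies between runs as JSON.
// saveCookies() and loadCookies() are hypothetical helpers working on plain arrays
// with keys such as 'name', 'value', 'domain' and an optional 'expiry' (Unix time).
function saveCookies(array $cookies, string $file): void
{
    if (file_put_contents($file, json_encode($cookies, JSON_PRETTY_PRINT)) === false) {
        throw new RuntimeException("Cannot write $file");
    }
}

function loadCookies(string $file, ?int $now = null): array
{
    if (!is_file($file)) {
        return array(); // no stored session yet: fall back to a username/password login
    }
    $cookies = json_decode(file_get_contents($file), true);
    if (!is_array($cookies)) {
        return array();
    }
    $now = $now ?? time();
    // Drop cookies that have already expired; session cookies have no 'expiry' key.
    return array_values(array_filter($cookies, function ($c) use ($now) {
        return is_array($c) && (!isset($c['expiry']) || $c['expiry'] > $now);
    }));
}
```

With php-webdriver, the arrays would presumably come from the cookies returned by $driver->manage()->getCookies() and be restored with $driver->manage()->addCookie() after first opening a page on the matching domain. That wiring is not shown here and should be checked against the library version in use. An empty result from loadCookies() is the signal to log in with username and password again.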

 

[Source] http://php8legs.com/en/php-web-scraper/51-how-to-avoid-selenium-webdriver-from-being-detected-as-bot-or-web-spider

 

 

 
