Chung Cheuk Hang Michael, Java Web Developer

Web Scraping (Part 2) - Selenium/Selenide

Continued from previous post:
Web Scraping (Part 1) - jsoup


3 Hands-on: Scraping Dynamic Web Pages with Selenium/Selenide

3.1 When to Use Selenium/Selenide

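Selenium is a testing library for web pages, but its use cases are not limited to testing: it can drive all kinds of web automation workflows.
Selenide is a library built on top of Selenium that extends its API.
Like jsoup from the previous post, Selenium and Selenide offer a powerful API for locating DOM nodes by the tags and attributes in the HTML source. The difference is that they use a browser driver (provided by the browser's official developers) to connect to a real, running browser, so they can additionally wait until a DOM node appears on the page (either present in the DOM, or actually visible).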
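To make the idea of waiting more concrete, here is a rough, library-free sketch of what "wait until a DOM node appears" boils down to: re-check a condition at a fixed interval until it holds or a timeout runs out. The class PollingWait and its waitUntil method are made up for this illustration and are not how Selenium or Selenide actually implement it; the condition in main is only a stand-in for a real DOM check.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

class PollingWait {

    /**
     * Re-checks the condition every pollInterval until it is true or the timeout elapses.
     * Returns true if the condition became true in time, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        final long start = System.currentTimeMillis();
        // Stand-in for "the DOM node has appeared": becomes true after about 300 ms
        final boolean appeared = waitUntil(
                () -> System.currentTimeMillis() - start >= 300,
                Duration.ofSeconds(4),
                Duration.ofMillis(100));
        System.out.println("appeared = " + appeared);
    }
}
```

With a real browser the condition would be something like "this element exists" or "this element is visible", which corresponds to the two flavours of waiting mentioned above.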
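The advantage of Selenium/Selenide over jsoup is that, on top of every lookup method jsoup supports, we can also locate DOM nodes with XPath. XPath is similar to CSS selectors but more powerful. The main differences are:
  • XPath can locate a DOM node by its text using text(); CSS selectors have no way to locate a DOM node by its text
  • XPath can step back up one or more levels to a parent/ancestor node using /.., parent::* or ancestor::*; CSS selectors cannot go back to a parent/ancestor node and can only navigate the DOM tree from top to bottom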
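The two points above can be tried without a browser. The sketch below uses the XPath engine bundled with the JDK (javax.xml.xpath) on a small, hand-written, well-formed snippet loosely modelled on a blog post item. The snippet, the class name XPathDemo and the helper eval are assumptions for illustration; Selenium/Selenide would evaluate the same kind of expressions against the live page instead.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathDemo {

    // Hand-written sample markup (an assumption), loosely modelled on one blog post item
    static final String SAMPLE =
            "<div class='item'>"
            + "<div class='content'>"
            + "<a class='header' href='#/blog/git-basics'>Git Basics</a>"
            + "<div class='meta'><span>2020-09-27</span></div>"
            + "</div>"
            + "</div>";

    /** Evaluates the expression against SAMPLE and returns the string value of the first match. */
    static String eval(String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(SAMPLE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Locate a node by its text with text()
        System.out.println(eval("//a[text()='Git Basics']/@href"));
        // Step up to the parent with /.. (parent::* is the long form)
        System.out.println(eval("//span[text()='2020-09-27']/../@class"));
        // Climb to an ancestor, then come back down to a different descendant
        System.out.println(eval("//span/ancestor::div[@class='item']//a/@href"));
    }
}
```

The last expression is the kind of lookup that is awkward with a purely top-down selector: start from the date, climb to the enclosing item, then pick the link inside it.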
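Note that because Selenium/Selenide has to drive a browser, it is noticeably slower than having jsoup fetch a page and parse it. So if jsoup is enough for the job, use jsoup; reach for Selenium/Selenide only when JavaScript has to be handled.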

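3.2 Scenario

To show a simple case where we must use Selenium/Selenide and cannot use jsoup to scrape data, this e-portfolio website is an ideal example: it is written in React JS without server-side rendering, so JavaScript is required to make the components appear in the DOM tree at the moment the user views the page.
This time we want to write a program that visits the All Blog Posts page of my e-portfolio and checks whether there are any new blog posts. Assuming the program runs automatically in the background on a schedule, what we need to do is process the information on the page and find the titles of all blog posts.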
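As one possible shape for the "is there a new blog post" check, assuming the titles have already been scraped into a list: remember the titles seen so far and report only the ones that were not seen before. NewPostDetector and findNew are hypothetical names, and keeping the seen titles in memory is just one simple option; a job that is restarted between runs would need to persist them somewhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NewPostDetector {

    // Titles reported by earlier runs
    private final Set<String> seenTitles = new LinkedHashSet<>();

    /** Returns the titles not seen in any earlier call, in page order, and remembers them. */
    List<String> findNew(List<String> currentTitles) {
        final List<String> fresh = new ArrayList<>();
        for (String title : currentTitles) {
            // Set.add returns true only when the title was not in the set yet
            if (seenTitles.add(title)) {
                fresh.add(title);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        final NewPostDetector detector = new NewPostDetector();
        // First run: every title is new
        System.out.println(detector.findNew(Arrays.asList("Post B", "Post A")));
        // Later run: only the title added since the previous run is reported
        System.out.println(detector.findNew(Arrays.asList("Post C", "Post B", "Post A")));
    }
}
```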

3.3 Analysis

The URL of the All Blog Posts page on this e-portfolio website is https://blackr1234.github.io/eportfolio/#/blog/
Looking at the page, blog post titles appear in only one place.
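One thing worth noticing in this URL is the #/blog/ part: it is a URL fragment, which is not sent to the server in the HTTP request, so the blog route has to be resolved by JavaScript running in the browser. A small sketch with java.net.URI splits the two parts; HashRouteDemo and its methods are names made up for this illustration.

```java
import java.net.URI;

class HashRouteDemo {

    /** Returns the part after '#', which stays in the browser and is handled by client-side code. */
    static String fragment(String url) {
        return URI.create(url).getFragment();
    }

    /** Returns the URL without its fragment, i.e. what an HTTP client actually requests. */
    static String withoutFragment(String url) {
        final int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        final String url = "https://blackr1234.github.io/eportfolio/#/blog/";
        System.out.println("requested from the server: " + withoutFragment(url));
        System.out.println("resolved in the browser:   " + fragment(url));
    }
}
```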

3.4 Adding Maven dependencies

The following dependency is needed in pom.xml:

    <dependency>
        <groupId>com.codeborne</groupId>
        <artifactId>selenide</artifactId>
        <version>5.15.1</version>
    </dependency>

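3.5 Writing the Java code

3.5.1 What happens if we download the page HTML with jsoup?

Referring back to "Web Scraping (Part 1) - jsoup - Downloading the page HTML" from last time, does the same code work for downloading this page?

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class Main {

        public static void main(String[] args) throws Exception {

            // Fetch the page over HTTP and parse the response; no JavaScript is executed
            final Document doc = Jsoup.connect("https://blackr1234.github.io/eportfolio/#/blog/").get();
            final String html = doc.body().html();

            System.out.println(html);
        }
    }

Let's look at the HTML source we get back. Unfortunately, jsoup does not work this time: it has no JavaScript engine, so it cannot execute JavaScript, and the actual content never appears in the HTML source.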

    <noscript>
     You need to enable JavaScript to run this app.
    </noscript>
    <div id="root"></div>
    <script>!function(e){function r(r){for(var n,l,f=r[0],i=r[1],p=r[2],c=0,s=[];c<f.length;c++)l=f[c],Object.prototype.hasOwnProperty.call(o,l)&&o[l]&&s.push(o[l][0]),o[l]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(a&&a(r);s.length;)s.shift()();return u.push.apply(u,p||[]),t()}function t(){for(var e,r=0;r<u.length;r++){for(var t=u[r],n=!0,f=1;f<t.length;f++){var i=t[f];0!==o[i]&&(n=!1)}n&&(u.splice(r--,1),e=l(l.s=t[0]))}return e}var n={},o={1:0},u=[];function l(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,l),t.l=!0,t.exports}l.m=e,l.c=n,l.d=function(e,r,t){l.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},l.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},l.t=function(e,r){if(1&r&&(e=l(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(l.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)l.d(t,n,function(r){return e[r]}.bind(null,n));return t},l.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return l.d(r,"a",r),r},l.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},l.p="/eportfolio/";var f=this.webpackJsonpeportfolio=this.webpackJsonpeportfolio||[],i=f.push.bind(f);f.push=r,f=f.slice();for(var p=0;p<f.length;p++)r(f[p]);var a=i;t()}([])</script>
    <script src="/eportfolio/static/js/2.aa0813bd.chunk.js"></script>
    <script src="/eportfolio/static/js/main.c5cc27be.chunk.js"></script>

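3.5.2 Downloading the page HTML with Selenide

First, we get it to download the HTML of the All Blog Posts page:

    import com.codeborne.selenide.Selenide;
    import com.codeborne.selenide.WebDriverRunner;

    public class Main {

        public static void main(String[] args) throws Exception {

            // Open the page in a real browser so that its JavaScript runs
            Selenide.open("https://blackr1234.github.io/eportfolio/#/blog/");
            // Print the HTML source of the page as the browser currently has it
            System.out.println(WebDriverRunner.source());
            Selenide.closeWebDriver();
        }
    }

Running the program gives the result below. It is pretty-printed only because an HTML formatter was applied.
We can see that all the blog post titles we want are there.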

    <html lang="en">
     <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width,initial-scale=1">
      <link rel="shortcut icon" href="/eportfolio/favicon.ico">
      <link rel="apple-touch-icon" sizes="180x180" href="/eportfolio/apple-touch-icon.png">
      <link rel="icon" type="image/png" sizes="32x32" href="/eportfolio/favicon-32x32.png">
      <link rel="icon" type="image/png" sizes="16x16" href="/eportfolio/favicon-16x16.png">
      <link rel="mask-icon" href="/eportfolio/safari-pinned-tab.svg" color="#5bbad5">
      <link rel="manifest" href="/eportfolio/manifest.json">
      <meta name="msapplication-TileColor" content="#da532c">
      <meta name="theme-color" content="#f8bbd0">
      <title>Michael Chung's e-Portfolio</title>
      <link href="/eportfolio/static/css/2.e4bde519.chunk.css" rel="stylesheet">
      <link href="/eportfolio/static/css/main.5f04d1e0.chunk.css" rel="stylesheet">
      <style data-styled="active" data-styled-version="5.2.0"></style>
     </head>
     <body data-aos-easing="ease" data-aos-duration="500" data-aos-delay="0">
      <noscript>You need to enable JavaScript to run this app.</noscript>
      <div id="root">
       <a href="#/blog/"><button class="ui yellow big circular compact icon button sc-fnVYJo aqTZb" style="opacity: 0; pointer-events: none;"><i aria-hidden="true" class="chevron up icon"></i></button></a>
       <div class="sc-fFSRdu iimMZr" style="opacity: 1;"><span class="sc-iemXMA KstKn top"></span><span class="sc-iemXMA KstKn middle"></span><span class="sc-iemXMA KstKn bottom"></span></div>
       <div class="sc-bkbjAj gDLPQT" style="opacity: 1;"></div>
       <div id="nav-menu-buttons-background" class="sc-dIvqjp ciTSxc"></div>
       <div class="sc-hBMVcZ hahZCc">
        <nav>
         <ul>
          <li data-is-nav-menu-open="false"><a class="code" href="#/">Home</a></li>
          <li data-is-nav-menu-open="false"><a class="code" href="#/workExp">Work Experiences</a></li>
          <li data-is-nav-menu-open="false"><a class="code" href="#/hobbyProject">Hobby Projects</a></li>
          <li data-is-nav-menu-open="false"><a class="code" href="#/personality">Personality</a></li>
          <li data-is-nav-menu-open="false"><a class="code" href="#/blog">Blog</a></li>
         </ul>
        </nav>
       </div>
       <div style="width: 100%; opacity: 1; transition: all 250ms ease 0s;">
        <div class="sc-bdnylx jMhaxE"><code class="sc-gtssRu gmjWml">Chung Cheuk Hang Michael</code><code class="sc-dlnjPT cuIYFB">Web Developer</code></div>
        <div class="sc-hKFyIo bdDYJz" style="padding-bottom: 1em;">
         <div class="ui text center aligned container">
          <div class="ui tiny borderless compact inverted stackable four item menu">
           <a draggable="false" class="blue item" href="#/" style="text-align: center;">Home</a>
           <div role="listbox" aria-expanded="false" class="ui item simple dropdown" tabindex="0">
            <div aria-atomic="true" aria-live="polite" role="alert" class="divider text">My Experiences</div>
            <i aria-hidden="true" class="dropdown icon"></i>
            <div class="menu transition"><a draggable="false" role="option" aria-checked="false" class="item" href="#/workExp" style="text-align: center;">Work</a><a draggable="false" role="option" aria-checked="false" class="item" href="#/hobbyProject" style="text-align: center;">Hobby Projects</a></div>
           </div>
           <a draggable="false" class="purple item" href="#/personality" style="text-align: center;">Personality</a><a draggable="false" class="teal item" href="#/blog" style="text-align: center;">Blog</a>
          </div>
         </div>
        </div>
        <div class="sc-hKFyIo bdDYJz">
         <div class="ui huge header">All Blog Posts</div>
         <div class="ui section divider"></div>
         <div class="ui vertical segment">
          <div class="ui divided relaxed items sc-iqAbSa gfcIZQ">
           <div class="item">
            <a class="ui tiny image" href="#/blog/git-basics" style="margin: auto;"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Git-logo.svg/1280px-Git-logo.svg.png"></a>
            <div class="content">
             <a class="header" href="#/blog/git-basics">Git 基本操作</a>
             <div class="meta"><span>2020-09-27</span></div>
             <div class="description">如何使用 Git 既基本 commands 進行版本控制</div>
             <div class="extra">
              <a class="ui pink basic compact right floated button" role="button" href="#/blog/git-basics">Continue reading<i aria-hidden="true" class="right chevron icon"></i></a>
              <div class="sc-eCApGN htLRdL">
               <div class="ui labels"><a class="ui orange label"><i aria-hidden="true" class="github icon"></i>Git</a></div>
              </div>
             </div>
            </div>
           </div>
           <div class="item">
            <a class="ui tiny image" href="#/blog/web-scraping" style="margin: auto;"><img src="https://selenide.org/images/selenide-logo-big.png"></a>
            <div class="content">
             <a class="header" href="#/blog/web-scraping">網頁抓取(Web scraping)</a>
             <div class="meta"><span>2020-09-26</span></div>
             <div class="description">如何使用 jsoup 及 Selenium/Selenide 工具抓取網頁,喺網頁獲取有用資訊</div>
             <div class="extra">
              <a class="ui pink basic compact right floated button" role="button" href="#/blog/web-scraping">Continue reading<i aria-hidden="true" class="right chevron icon"></i></a>
              <div class="sc-eCApGN htLRdL">
               <div class="ui labels"><a class="ui blue label"><i aria-hidden="true" class="code icon"></i>Java</a><a class="ui brown label"><i aria-hidden="true" class="chrome icon"></i>Web Scraping</a></div>
              </div>
             </div>
            </div>
           </div>
           <div class="item">
            <a class="ui tiny image" href="#/blog/spring-mapstruct" style="margin: auto;"><img src="https://mapstruct.org/images/mapstruct.png"></a>
            <div class="content">
             <a class="header" href="#/blog/spring-mapstruct">在 Spring Boot 中使用 MapStruct</a>
             <div class="meta"><span>2020-09-23</span></div>
             <div class="description">MapStruct 的用途,以及如何在 Spring Boot 中使用 MapStruct</div>
             <div class="extra">
              <a class="ui pink basic compact right floated button" role="button" href="#/blog/spring-mapstruct">Continue reading<i aria-hidden="true" class="right chevron icon"></i></a>
              <div class="sc-eCApGN htLRdL">
               <div class="ui labels"><a class="ui blue label"><i aria-hidden="true" class="code icon"></i>Java</a><a class="ui green label"><i aria-hidden="true" class="leaf icon"></i>Spring</a></div>
              </div>
             </div>
            </div>
           </div>
           <div class="item">
            <a class="ui tiny image" href="#/blog/coding-java-6" style="margin: auto;"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Maven_logo.svg/1200px-Maven_logo.svg.png"></a>
            <div class="content">
             <a class="header" href="#/blog/coding-java-6">Java 開發筆記(六)</a>
             <div class="meta"><span>2020-09-15</span></div>
             <div class="description">Java 開發筆記 - Dependency Management</div>
             <div class="extra">
              <a class="ui pink basic compact right floated button" role="button" href="#/blog/coding-java-6">Continue reading<i aria-hidden="true" class="right chevron icon"></i></a>
              <div class="sc-eCApGN htLRdL">
               <div class="ui labels"><a class="ui blue label"><i aria-hidden="true" class="code icon"></i>Java</a><a class="ui red label"><i aria-hidden="true" class="code icon"></i>Maven</a></div>
              </div>
             </div>
            </div>
           </div>
           <div class="item">
            <a class="ui tiny image" href="#/blog/coding-java-5" style="margin: auto;"><img src="https://upload.wikimedia.org/wikipedia/fr/thumb/2/2e/Java_Logo.svg/1200px-Java_Logo.svg.png"></a>
            <div class="content">
             <a class="header" href="#/blog/coding-java-5">Java 開發筆記(五)</a>
             <div class="meta"><span>2020-09-15</span></div>
             <div class="description">Java 開發筆記 - 幾款超有用必學 3rd party libraries</div>
             <div class="extra">
              <a class="ui pink basic compact right floated button" role="button" href="#/blog/coding-java-5">Continue reading<i aria-hidden="true" class="right chevron icon"></i></a>
              <div class="sc-eCApGN htLRdL">
               <div class="ui labels"><a class="ui blue label"><i aria-hidden="true" class="code icon"></i>Java</a></div>
              </div>
             </div>
            </div>
           </div>
           <div class="item">
            <a class="ui tiny image" href="#/blog/coding-java-4" style="margin: auto;"><img src="https://cdn.worldvectorlogo.com/logos/spring-3.svg"></a>
            <div class="content">
             <a class="header" href="#/blog/coding-java-4">Java 開發筆記(四)</a>
             <div class="meta"><span>2020-09-15</span></div>
             <div class="description">Java 開發筆記 - Spring 基礎知識</div>
             <div class="extra">
              <a class="ui pink basic compact right floated button" role="button" href="#/blog/coding-java-4">Continue reading<i aria-hidden="true" class="right chevron icon"></i></a>
              <div class="sc-eCApGN htLRdL">
               <div class="ui labels"><a class="ui blue label"><i aria-hidden="true" class="code icon"></i>Java</a><a class="ui green label"><i aria-hidden="true" class="leaf icon"></i>Spring</a></div>
              </div>
             </div>
            </div>
           </div>
           <div class="item">
            <a class="ui tiny image" href="#/blog/coding-java-3" style="margin: auto;"><img src="https://upload.wikimedia.org/wikipedia/fr/thumb/2/2e/Java_Logo.svg/1200px-Java_Logo.svg.png"></a>
            <div class="content">
             <a class="header" href="#/blog/coding-java-3">Java 開發筆記(三)</a>
             <div class="meta"><span>2020-09-15</span></div>
             <div class="description">Java 開發筆記 - Java 基礎知識</div>
             <div class="extra">
              <a class="ui pink basic compact right floated button" role="button" href="#/blog/coding-java-3">Continue reading<i aria-hidden="true" class="right chevron icon"></i></a>
              <div class="sc-eCApGN htLRdL">
               <div class="ui labels"><a class="ui blue label"><i aria-hidden="true" class="code icon"></i>Java</a></div>
              </div>
             </div>
            </div>
           </div>
           <div class="item">
            <a class="ui tiny image" href="#/blog/coding-java-2" style="margin: auto;"><img src="https://upload.wikimedia.org/wikipedia/fr/thumb/2/2e/Java_Logo.svg/1200px-Java_Logo.svg.png"></a>
            <div class="content">
             <a class="header" href="#/blog/coding-java-2">Java 開發筆記(二)</a>
             <div class="meta"><span>2020-09-15</span></div>
             <div class="description">Java 開發筆記 - 建立 project</div>
             <div class="extra">
              <a class="ui pink basic compact right floated button" role="button" href="#/blog/coding-java-2">Continue reading<i aria-hidden="true" class="right chevron icon"></i></a>
              <div class="sc-eCApGN htLRdL">
               <div class="ui labels"><a class="ui blue label"><i aria-hidden="true" class="code icon"></i>Java</a></div>
              </div>
             </div>
            </div>
           </div>
           <div class="item">
            <a class="ui tiny image" href="#/blog/coding-java-1" style="margin: auto;"><img src="https://upload.wikimedia.org/wikipedia/fr/thumb/2/2e/Java_Logo.svg/1200px-Java_Logo.svg.png"></a>
            <div class="content">
             <a class="header" href="#/blog/coding-java-1">Java 開發筆記(一)</a>
             <div class="meta"><span>2020-09-15</span></div>
             <div class="description">Java 開發筆記 - 安裝所需程式</div>
             <div class="extra">
              <a class="ui pink basic compact right floated button" role="button" href="#/blog/coding-java-1">Continue reading<i aria-hidden="true" class="right chevron icon"></i></a>
              <div class="sc-eCApGN htLRdL">
               <div class="ui labels"><a class="ui blue label"><i aria-hidden="true" class="code icon"></i>Java</a></div>
              </div>
             </div>
            </div>
           </div>
          </div>
         </div>
        </div>
        <div class="sc-hKFyIo bdDYJz">
         <div class="ui horizontal section divider">
          <h5 class="ui header" style="opacity: 0.5;"><i aria-hidden="true" class="react icon"></i>Reach Me Now!</h5>
         </div>
         <div class="ui stackable padded three column grid">
          <div class="column">
           <a href="tel:+85263301333">
            <h4 class="ui header" style="opacity: 0.75; transform: scale(1);">
             <i aria-hidden="true" class="teal phone clockwise rotated icon"></i>
             <div class="content">
              Mobile
              <div class="sub header">6330 1333</div>
             </div>
            </h4>
           </a>
          </div>
          <div class="column">
           <a href="mailto:michaelboyboy@gmail.com">
            <h4 class="ui header" style="opacity: 0.75; transform: scale(1);">
             <i aria-hidden="true" class="red mail icon"></i>
             <div class="content">
              Email
              <div class="sub header">michaelboyboy@gmail.com</div>
             </div>
            </h4>
           </a>
          </div>
          <div class="column">
           <a href="https://www.linkedin.com/in/mickchung" target="_blank" rel="external nofollow noopener noreferrer">
            <h4 class="ui header" style="opacity: 0.75; transform: scale(1);">
             <i aria-hidden="true" class="blue linkedin icon"></i>
             <div class="content">
              LinkedIn
              <div class="sub header">www.linkedin.com/in/mickchung</div>
             </div>
            </h4>
           </a>
          </div>
         </div>
         <div class="ui divider"></div>
         <div>
          <div style="color: rgb(75, 163, 199);">
           <div class="sc-eCApGN htLRdL" style="display: inline-block;">Copyright © 2020 Chung Cheuk Hang Michael. All rights reserved.</div>
           <div class="sc-eCApGN htLRdL" style="display: inline-block; float: right;">Last Updated On: <span title="10 Oct 2020 06:09:14" style="cursor: help;">10 Oct 2020</span> / <span title="8abd2641a6dd67cff979ead5b07edf5ca6f89db6" style="cursor: help;">8abd264</span></div>
          </div>
         </div>
         <div class="ui hidden divider"></div>
        </div>
       </div>
       <div class="particles-wrapper">
        <canvas width="3355" height="1705" style="width: 100%; height: 100%;"></canvas>
       </div>
      </div>
      <script>!function(e){function r(r){for(var n,l,f=r[0],i=r[1],p=r[2],c=0,s=[];c<f.length;c++)l=f[c],Object.prototype.hasOwnProperty.call(o,l)&&o[l]&&s.push(o[l][0]),o[l]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(a&&a(r);s.length;)s.shift()();return u.push.apply(u,p||[]),t()}function t(){for(var e,r=0;r<u.length;r++){for(var t=u[r],n=!0,f=1;f<t.length;f++){var i=t[f];0!==o[i]&&(n=!1)}n&&(u.splice(r--,1),e=l(l.s=t[0]))}return e}var n={},o={1:0},u=[];function l(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,l),t.l=!0,t.exports}l.m=e,l.c=n,l.d=function(e,r,t){l.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},l.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},l.t=function(e,r){if(1&r&&(e=l(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(l.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)l.d(t,n,function(r){return e[r]}.bind(null,n));return t},l.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return l.d(r,"a",r),r},l.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},l.p="/eportfolio/";var f=this.webpackJsonpeportfolio=this.webpackJsonpeportfolio||[],i=f.push.bind(f);f.push=r,f=f.slice();for(var p=0;p<f.length;p++)r(f[p]);var a=i;t()}([])</script><script src="/eportfolio/static/js/2.aa0813bd.chunk.js"></script><script src="/eportfolio/static/js/main.c5cc27be.chunk.js"></script>
     </body>
    </html>

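3.5.3 Extracting specific information

With the HTML source in hand, we can use the DOM position of the blog post titles to extract every title value. In fact, once we have a proper HTML source (that is, JavaScript has run and the DOM tree is fully built), switching back to the jsoup parser also works (using Jsoup.parse(html)).
Below is an excerpt showing one blog post item and its title: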

    <div class="item">
     <a class="ui tiny image" href="#/blog/git-basics" style="margin: auto;"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Git-logo.svg/1280px-Git-logo.svg.png"></a>
     <div class="content">
      <a class="header" href="#/blog/git-basics">Git 基本操作</a>
      <div class="meta"><span>2020-09-27</span></div>
      <div class="description">如何使用 Git 既基本 commands 進行版本控制</div>
      <div class="extra">
       <a class="ui pink basic compact right floated button" role="button" href="#/blog/git-basics">Continue reading<i aria-hidden="true" class="right chevron icon"></i></a>
       <div class="sc-eCApGN htLRdL">
        <div class="ui labels"><a class="ui orange label"><i aria-hidden="true" class="github icon"></i>Git</a></div>
       </div>
      </div>
     </div>
    </div>

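Because blog post titles contain no specific wording we can track, we fall back to locating them by class="header".

    import org.openqa.selenium.By;

    import com.codeborne.selenide.Selenide;
    import com.codeborne.selenide.SelenideElement;

    public class Main {

        public static void main(String[] args) throws Exception {

            Selenide.open("https://blackr1234.github.io/eportfolio/#/blog/");
            // Find every element whose class list contains "header" and print its text
            Selenide.$$(By.className("header"))
                    .stream()
                    .map(SelenideElement::getText)
                    .forEach(System.out::println);
            Selenide.closeWebDriver();
        }
    }

The result we get: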

    All Blog Posts
    網頁抓取(二) - Selenium/Selenide
    Git 基本操作
    網頁抓取(一) - jsoup
    在 Spring Boot 中使用 MapStruct
    Java 開發筆記(六)
    Java 開發筆記(五)
    Java 開發筆記(四)
    Java 開發筆記(三)
    Java 開發筆記(二)
    Java 開發筆記(一)
    Reach Me Now!
    Mobile
    6330 1333
    6330 1333
    Email
    michaelboyboy@gmail.com
    michaelboyboy@gmail.com
    LinkedIn
    www.linkedin.com/in/mickchung
    www.linkedin.com/in/mickchung

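Although it contains what we want, a lot of extra information has been pulled out too, because blog post titles are not the only elements whose class includes header. Let's also require the tag to be <a> to narrow it down.

    import org.openqa.selenium.By;

    import com.codeborne.selenide.Selenide;
    import com.codeborne.selenide.SelenideElement;

    public class Main {

        public static void main(String[] args) throws Exception {

            Selenide.open("https://blackr1234.github.io/eportfolio/#/blog/");
            // Only <a> elements whose class list contains "header"
            Selenide.$$(By.cssSelector("a.header"))
                    .stream()
                    .map(SelenideElement::getText)
                    .forEach(System.out::println);
            Selenide.closeWebDriver();
        }
    }

The CSS selector above matches only <a> tags whose Element.classList contains header.
And this time the result is exactly right!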

    網頁抓取(二) - Selenium/Selenide
    Git 基本操作
    網頁抓取(一) - jsoup
    在 Spring Boot 中使用 MapStruct
    Java 開發筆記(六)
    Java 開發筆記(五)
    Java 開發筆記(四)
    Java 開發筆記(三)
    Java 開發筆記(二)
    Java 開發筆記(一)
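
To spell out what a.header checks, here is a small plain-Java sketch of the matching rule: the tag must be a, and header must be one of the whitespace-separated class tokens. matchesAHeader is a hypothetical helper written for illustration; when Selenide runs, the browser's own selector engine is what does the matching.

```java
import java.util.Arrays;

class SelectorSemantics {

    /** Mimics the rule behind the selector a.header for a single element. */
    static boolean matchesAHeader(String tagName, String classAttribute) {
        final boolean isAnchor = "a".equalsIgnoreCase(tagName);
        // Class matching works on whole tokens, so "subheader" does not count as "header"
        final boolean hasHeaderClass = classAttribute != null
                && Arrays.asList(classAttribute.trim().split("\\s+")).contains("header");
        return isAnchor && hasHeaderClass;
    }

    public static void main(String[] args) {
        // A blog post title link: matches
        System.out.println(matchesAHeader("a", "header"));
        // The "All Blog Posts" heading: has the class but is a <div>, so no match
        System.out.println(matchesAHeader("div", "ui huge header"));
        // The thumbnail link: is an <a> but lacks the class, so no match
        System.out.println(matchesAHeader("a", "ui tiny image"));
    }
}
```

This is why the second attempt drops entries such as "All Blog Posts" and "Reach Me Now!": they carry the header class but are not <a> elements.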

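3.6 Custom Selenide configuration

The customizations below are all provided by Selenide (which changes the Selenium configuration for us behind the scenes). They just need to be set before calling Selenide.open(relativeOrAbsoluteUrl).

3.6.1 Hiding the browser window

Because Selenium/Selenide launches the locally installed browser (such as Chrome) on startup, we can set headless mode if we don't want a browser window popping up every time.

    Configuration.headless = true;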

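3.6.2 Browser installation location

If we want to switch from the Chrome browser to Chromium, we can download a portable binary distribution of Chromium:
  1. Chromium History Versions Download
  2. Based on the supported version number written in the versions.properties file inside the Selenide JAR (or on GitHub), find the closest version on the page above (preferably a version number no higher than the one Selenide supports)
  3. Clicking it opens the chromium-browser-snapshots page; click the zip file name there to download it
  4. Unzip the zip file
Then write the following:

    // Change this to the location of the unzipped binary
    Configuration.browserBinary = "path/to/chrome.exe";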
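
The rule of thumb in step 2 (closest version, but not higher than the supported one) could be sketched as below. VersionPicker is a hypothetical helper, it assumes purely numeric dotted version strings, and the version numbers in main are made-up placeholders rather than real Selenide or Chromium values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class VersionPicker {

    /** Compares dotted numeric versions component by component; missing components count as 0. */
    static int compare(String a, String b) {
        final String[] x = a.split("\\.");
        final String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            final long xi = i < x.length ? Long.parseLong(x[i]) : 0;
            final long yi = i < y.length ? Long.parseLong(y[i]) : 0;
            if (xi != yi) {
                return Long.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Picks the highest available version that does not exceed the supported one, if any. */
    static Optional<String> closestNotAbove(List<String> available, String supported) {
        return available.stream()
                .filter(v -> compare(v, supported) <= 0)
                .max(VersionPicker::compare);
    }

    public static void main(String[] args) {
        // Placeholder version numbers, for illustration only
        final List<String> available = Arrays.asList("84.0.4147.0", "85.0.4183.0", "86.0.4240.0");
        System.out.println(closestNotAbove(available, "85.0.4183.87").orElse("none"));
    }
}
```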