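1 About web scraping
Web scraping means extracting data from web pages, and the same techniques can also be used for automation. As long as the target is a web page, we can handle it with the third-party Java libraries covered in this article.
1.1 Technical overview
Web scraping techniques fall roughly into two types: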
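Type | Description | Example |
---|---|---|
HTML parser | The target information must already be present in the page's static HTML; it cannot extract information that the page obtains by running JavaScript | jsoup |
HTML parser with JavaScript support (browser automation) | Handles static pages and can also extract information that the page obtains by running JavaScript. Because it drives a real browser through the browser's automation API, a real browser has to run alongside it (it can run in automated testing mode), together with a browser driver. The driver is usually available from the browser's official site; for example, the Chrome/Chromium developer site provides chromedriver | Selenium |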
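Note: "information that the page obtains by running JavaScript" means the target information is loaded through additional HTTP requests issued by JavaScript (a technique known as AJAX), which then update the page's HTML DOM dynamically, rather than arriving all at once as the complete HTML source in a single HTTP GET request (the page download). Examples include some real-time stock quote pages, which may use JavaScript (for example jQuery) to manipulate the DOM.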
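1.2 Caveats
1.2.1 Feasibility
Most ordinary web pages can be scraped, because even if they have measures against DDoS, they have quite likely neglected to guard against web scraping.
However, the sites that take security most seriously, such as Google (Google Search and so on), and sites that use a CAPTCHA or reCAPTCHA to prove you are not a robot when logging in or performing other actions, may not let us scrape without limit. They have mature anti-scraping measures, for example using JavaScript to analyze the cursor's movement path or the exact position where a button was clicked. Some sites also use Cloudflare's anti-DDoS solution, so we may not be able to scrape those either.
As long as the HTML DOM structure of the target page does not change substantially, our program will keep working.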
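1.2.2 Legal risks
Information on web pages is covered by all kinds of copyright. If the scraped information is for commercial use, make sure you understand the relevant laws first. In addition, if an automated web scraper runs so frequently that it is treated as a DDoS attack, that can also carry legal risk.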
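One way to keep a scraper from hitting a site too often is to enforce a minimum interval between requests. The following is only a minimal sketch and not a legal safeguard; the class name PoliteThrottle and the 2-second interval are assumptions chosen for illustration.

```java
import java.time.Duration;

/** Hypothetical helper: enforces a minimum gap between consecutive requests. */
public class PoliteThrottle {

    private final long minIntervalMillis;
    private long lastRequestAtMillis = -1; // -1 means no request has been recorded yet

    public PoliteThrottle(Duration minInterval) {
        this.minIntervalMillis = minInterval.toMillis();
    }

    /** Returns how many milliseconds to wait before a request at nowMillis is allowed. */
    public long millisToWait(long nowMillis) {
        if (lastRequestAtMillis < 0) {
            return 0;
        }
        long elapsed = nowMillis - lastRequestAtMillis;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    /** Records that a request was sent at nowMillis. */
    public void recordRequest(long nowMillis) {
        lastRequestAtMillis = nowMillis;
    }

    /** Blocks until the minimum interval has passed, then records the request time. */
    public void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        recordRequest(System.currentTimeMillis());
    }

    public static void main(String[] args) throws InterruptedException {
        PoliteThrottle throttle = new PoliteThrottle(Duration.ofSeconds(2)); // assumed interval
        for (int i = 1; i <= 3; i++) {
            throttle.acquire();
            System.out.println("Request " + i + " may be sent now");
        }
    }
}
```

Calling acquire() before every page download spaces the requests out; what interval is reasonable depends on the target site.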
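2 Hands-on: scraping a static page with jsoup
2.1 When to use jsoup
jsoup is an HTML parser with a powerful API that lets us locate DOM nodes by the tags and attributes in the HTML source.
It is only suitable for static pages. If the information on the page comes from HTTP requests (the ones listed under the XHR filter of the Network tab in Chrome developer tools), and those requests are not simple enough to reproduce by hand in a tool such as Postman, then jsoup is the wrong tool. In that case we need something more powerful that can run JavaScript (that is, something with a JavaScript engine), such as Selenium/Selenide, rather than a plain HTML parser.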
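When the XHR is simple enough to reproduce in Postman, it can usually also be replayed straight from Java without any HTML parsing. Here is a rough sketch using the java.net.http client (Java 11 or later). The endpoint URL, the symbol parameter and the class name XhrReplay are all made up for illustration; substitute whatever the Network tab actually shows.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class XhrReplay {

    // Hypothetical endpoint: replace with the URL seen in the Network tab.
    static final String ENDPOINT = "https://example.com/api/quote?symbol=";

    /** Builds a GET request similar to the one the page's JavaScript would send. */
    static HttpRequest buildRequest(String symbol) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT + symbol))
                .header("Accept", "application/json")
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    /** Sends the request and returns the raw response body. */
    static String fetch(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("ABC");
        // Only print the request here; call fetch(request) once ENDPOINT points at a real URL.
        System.out.println(request.method() + " " + request.uri());
    }
}
```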
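2.2 Scenario
Imagine we want to write a program that fetches version information from various software websites, so that we know which entries on our software list have updates available (this is essentially the Software Version Checker tool I wrote).
One general yet simple and direct approach is to use the jsoup HTML parser library to download the whole page's HTML, parse it, and locate the information we want.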
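To decide whether an entry on the list has an update, the scraped version has to be compared with the installed one. Below is a minimal sketch of such a comparison. It assumes versions are dot-separated non-negative integers such as 7.8.9; the class and method names are hypothetical and are not taken from the Software Version Checker tool.

```java
public class VersionCheck {

    /**
     * Compares two dot-separated numeric versions.
     * Returns a negative number, zero, or a positive number when a is older than,
     * equal to, or newer than b. Missing parts count as 0, so "7.9" equals "7.9.0".
     */
    static int compare(String a, String b) {
        String[] left = a.split("\\.");
        String[] right = b.split("\\.");
        int length = Math.max(left.length, right.length);
        for (int i = 0; i < length; i++) {
            int l = i < left.length ? Integer.parseInt(left[i]) : 0;
            int r = i < right.length ? Integer.parseInt(right[i]) : 0;
            if (l != r) {
                return Integer.compare(l, r);
            }
        }
        return 0;
    }

    /** True when the latest published version is newer than the installed one. */
    static boolean hasUpdate(String installed, String latest) {
        return compare(installed, latest) < 0;
    }

    public static void main(String[] args) {
        System.out.println(hasUpdate("7.8.9", "7.9")); // true
        System.out.println(hasUpdate("7.9", "7.9.0")); // false
    }
}
```

Comparing part by part as integers avoids the trap of plain string comparison, where "7.10" would sort before "7.9".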
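2.3 Analysis
Looking at the Notepad++ downloads page (https://notepad-plus-plus.org/downloads/), two places show the latest version:
- Current Version on the left-hand side
- the first item under Downloads in the middle
Since the page spells out Current Version explicitly, we will trust that value.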
2.4 Adding Maven dependencies
pom.xml needs the following dependency:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
To build a runnable JAR, add the Maven Shade plugin to the build:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.4</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer
                                implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>code.Main</mainClass> <!-- change this to your main class -->
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
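2.5 Writing the Java code
2.5.1 Downloading the page HTML
First, we get the program to download the HTML of the Notepad++ downloads page:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) throws Exception {

        // Download and parse the page, then print the HTML inside <body>.
        final Document doc = Jsoup.connect("https://notepad-plus-plus.org/downloads/").get();
        final String html = doc.body().html();

        System.out.println(html);
    }
}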
Running the program prints:
1<a href="#main">skip to content</a>
2<svg style="display: none"> <symbol id="bookmark" viewbox="0 0 40 50">
3 <g transform="translate(2266 3206.2)">
4 <path style="stroke:currentColor;stroke-width:3.2637;fill:none" d="m-2262.2-3203.4-.2331 42.195 16.319-16.318 16.318 16.318.2331-42.428z" />
5 </g>
6 </symbol> <symbol id="w3c" viewbox="0 0 127.09899 67.763">
7 <text font-size="83" style="font-size:83px;font-family:Trebuchet;letter-spacing:-12;fill-opacity:0" letter-spacing="-12" y="67.609352" x="-26.782778">
8 W3C
9 </text>
10 <text font-size="83" style="font-size:83px;font-weight:bold;font-family:Trebuchet;fill-opacity:0" y="67.609352" x="153.21722" font-weight="bold">
11 SVG
12 </text>
13 <path style="fill:currentColor;image-rendering:optimizeQuality;shape-rendering:geometricPrecision" d="m33.695.377 12.062 41.016 12.067-41.016h8.731l-19.968 67.386h-.831l-12.48-41.759-12.479 41.759h-.832l-19.965-67.386h8.736l12.061 41.016 8.154-27.618-3.993-13.397h8.737z" />
14 <path style="fill:currentColor;image-rendering:optimizeQuality;shape-rendering:geometricPrecision" d="m91.355 46.132c0 6.104-1.624 11.234-4.862 15.394-3.248 4.158-7.45 6.237-12.607 6.237-3.882 0-7.263-1.238-10.148-3.702-2.885-2.47-5.02-5.812-6.406-10.022l6.82-2.829c1.001 2.552 2.317 4.562 3.953 6.028 1.636 1.469 3.56 2.207 5.781 2.207 2.329 0 4.3-1.306 5.909-3.911 1.609-2.606 2.411-5.738 2.411-9.401 0-4.049-.861-7.179-2.582-9.399-1.995-2.604-5.129-3.912-9.397-3.912h-3.327v-3.991l11.646-20.133h-14.062l-3.911 6.655h-2.493v-14.976h32.441v4.075l-12.31 21.217c4.324 1.385 7.596 3.911 9.815 7.571 2.22 3.659 3.329 7.953 3.329 12.892z" />
15 <path style="fill:currentColor;image-rendering:optimizeQuality;shape-rendering:geometricPrecision" d="m125.21 0 1.414 8.6-5.008 9.583s-1.924-4.064-5.117-6.314c-2.693-1.899-4.447-2.309-7.186-1.746-3.527.73-7.516 4.938-9.258 10.13-2.084 6.21-2.104 9.218-2.178 11.978-.115 4.428.58 7.043.58 7.043s-3.04-5.626-3.011-13.866c.018-5.882.947-11.218 3.666-16.479 2.404-4.627 5.954-7.404 9.114-7.728 3.264-.343 5.848 1.229 7.841 2.938 2.089 1.788 4.213 5.698 4.213 5.698l4.94-9.837z" />
16 <path style="fill:currentColor;image-rendering:optimizeQuality;shape-rendering:geometricPrecision" d="m125.82 48.674s-2.208 3.957-3.589 5.48c-1.379 1.524-3.849 4.209-6.896 5.555-3.049 1.343-4.646 1.598-7.661 1.306-3.01-.29-5.807-2.032-6.786-2.764-.979-.722-3.486-2.864-4.897-4.854-1.42-2-3.634-5.995-3.634-5.995s1.233 4.001 2.007 5.699c.442.977 1.81 3.965 3.749 6.572 1.805 2.425 5.315 6.604 10.652 7.545 5.336.945 9.002-1.449 9.907-2.031.907-.578 2.819-2.178 4.032-3.475 1.264-1.351 2.459-3.079 3.116-4.108.487-.758 1.276-2.286 1.276-2.286l-1.276-6.644z" />
17 </symbol> <symbol id="tag" viewbox="0 0 177.16535 177.16535">
18 <g transform="translate(0 -875.2)">
19 <path style="fill-rule:evenodd;stroke-width:0;fill:currentColor" d="m159.9 894.3-68.79 8.5872-75.42 77.336 61.931 60.397 75.429-76.565 6.8495-69.755zm-31.412 31.835a10.813 10.813 0 0 1 1.8443 2.247 10.813 10.813 0 0 1 -3.5174 14.872l-.0445.0275a10.813 10.813 0 0 1 -14.86 -3.5714 10.813 10.813 0 0 1 3.5563 -14.863 10.813 10.813 0 0 1 13.022 1.2884z" />
20 </g>
21 </symbol> <symbol id="balloon" viewbox="0 0 141.73228 177.16535">
22 <g transform="translate(0 -875.2)">
23 <g>
24 <path style="fill:currentColor" d="m68.156 882.83-.88753 1.4269c-4.9564 7.9666-6.3764 17.321-5.6731 37.378.36584 10.437 1.1246 23.51 1.6874 29.062.38895 3.8372 3.8278 32.454 4.6105 38.459 4.6694-.24176 9.2946.2879 14.377 1.481 1.2359-3.2937 5.2496-13.088 8.886-21.623 6.249-14.668 8.4128-21.264 10.253-31.252 1.2464-6.7626 1.6341-12.156 1.4204-19.764-.36325-12.93-2.1234-19.487-6.9377-25.843-2.0833-2.7507-6.9865-7.6112-7.9127-7.8436-.79716-.20019-6.6946-1.0922-6.7755-1.0248-.02213.0182-5.0006-.41858-7.5248-.22808l-2.149-.22808h-3.3738z" />
25 <path style="fill:currentColor" d="m61.915 883.28-3.2484.4497c-1.7863.24724-3.5182.53481-3.8494.63994-2.4751.33811-4.7267.86957-6.7777 1.5696-.28598 0-1.0254.20146-2.3695.58589-5.0418 1.4418-6.6374 2.2604-8.2567 4.2364-6.281 7.6657-11.457 18.43-12.932 26.891-1.4667 8.4111.71353 22.583 5.0764 32.996 3.8064 9.0852 13.569 25.149 22.801 37.517 1.3741 1.841 2.1708 2.9286 2.4712 3.5792 3.5437-1.1699 6.8496-1.9336 10.082-2.3263-1.3569-5.7831-4.6968-21.86-6.8361-33.002-.92884-4.8368-2.4692-14.322-3.2452-19.991-.68557-5.0083-.77707-6.9534-.74159-15.791.04316-10.803.41822-16.162 1.5026-21.503 1.4593-5.9026 3.3494-11.077 6.3247-15.852z" />
26 <path style="fill:currentColor" d="m94.499 885.78c-.10214-.0109-.13691 0-.0907.0409.16033.13489 1.329 1.0675 2.5976 2.0723 6.7003 5.307 11.273 14.568 12.658 25.638.52519 4.1949.24765 14.361-.5059 18.523-2.4775 13.684-9.7807 32.345-20.944 53.519l-3.0559 5.7971c2.8082.76579 5.7915 1.727 8.9926 2.8441 11.562-11.691 18.349-19.678 24.129-28.394 7.8992-11.913 11.132-20.234 12.24-31.518.98442-10.02-1.5579-20.876-6.7799-28.959-.2758-.4269-.57803-.86856-.89617-1.3166-3.247-6.13-9.752-12.053-21.264-16.131-2.3687-.86369-6.3657-2.0433-7.0802-2.1166z" />
27 <path style="fill:currentColor" d="m32.52 892.22c-.20090-.13016-1.4606.81389-3.9132 2.7457-11.486 9.0476-17.632 24.186-16.078 39.61.79699 7.9138 2.4066 13.505 5.9184 20.562 5.8577 11.77 14.749 23.219 30.087 38.74.05838.059.12188.1244.18052.1838 1.3166-.5556 2.5965-1.0618 3.8429-1.5199-.66408-.32448-1.4608-1.3297-3.8116-4.4602-5.0951-6.785-8.7512-11.962-13.051-18.486-5.1379-7.7948-5.0097-7.5894-8.0586-13.054-6.2097-11.13-8.2674-17.725-8.6014-27.563-.21552-6.3494.13041-9.2733 1.775-14.987 2.1832-7.5849 3.9273-10.986 9.2693-18.07 1.7839-2.3656 2.6418-3.57 2.4409-3.7003z" />
28 <path style="fill:currentColor" d="m69.133 992.37c-6.2405.0309-12.635.76718-19.554 2.5706 4.6956 4.7759 9.935 10.258 12.05 12.625l4.1272 4.6202h11.493l3.964-4.4516c2.0962-2.3541 7.4804-7.9845 12.201-12.768-8.378-1.4975-16.207-2.6353-24.281-2.5955z" />
29 <rect style="stroke-width:0;fill:currentColor" ry="2.0328" height="27.746" width="22.766" y="1017.7" x="60.201" />
30 </g>
31 </g>
32 </symbol> <symbol id="info" viewbox="0 0 41.667 41.667">
33 <g transform="translate(-37.035 -1004.6)">
34 <path style="stroke-linejoin:round;stroke:currentColor;stroke-linecap:round;stroke-width:3.728;fill:none" d="m76.25 1030.2a18.968 18.968 0 0 1 -23.037 13.709 18.968 18.968 0 0 1 -13.738 -23.019 18.968 18.968 0 0 1 23.001 -13.768 18.968 18.968 0 0 1 13.798 22.984" />
35 <g transform="matrix(1.1146 0 0 1.1146 -26.276 -124.92)">
36 <path style="stroke:currentColor;stroke-linecap:round;stroke-width:3.728;fill:none" d="m75.491 1039.5v-8.7472" />
37 <path style="stroke-width:0;fill:currentColor" transform="scale(-1)" d="m-73.193-1024.5a2.3719 2.3719 0 0 1 -2.8807 1.7142 2.3719 2.3719 0 0 1 -1.718 -2.8785 2.3719 2.3719 0 0 1 2.8763 -1.7217 2.3719 2.3719 0 0 1 1.7254 2.8741" />
38 </g>
39 </g>
40 </symbol> <symbol id="warning" viewbox="0 0 48.430474 41.646302">
41 <g transform="translate(-1.1273 -1010.2)">
42 <path style="stroke-linejoin:round;stroke:currentColor;stroke-linecap:round;stroke-width:4.151;fill:none" d="m25.343 1012.3-22.14 37.496h44.28z" />
43 <path style="stroke:currentColor;stroke-linecap:round;stroke-width:4.1512;fill:none" d="m25.54 1027.7v8.7472" />
44 <path style="stroke-width:0;fill:currentColor" d="m27.839 1042.8a2.3719 2.3719 0 0 1 -2.8807 1.7143 2.3719 2.3719 0 0 1 -1.718 -2.8785 2.3719 2.3719 0 0 1 2.8763 -1.7217 2.3719 2.3719 0 0 1 1.7254 2.8741" />
45 </g>
46 </symbol> <symbol id="menu" viewbox="0 0 50 50">
47 <rect style="stroke-width:0;fill:currentColor" height="10" width="50" y="0" x="0" />
48 <rect style="stroke-width:0;fill:currentColor" height="10" width="50" y="20" x="0" />
49 <rect style="stroke-width:0;fill:currentColor" height="10" width="50" y="40" x="0" />
50 </symbol> <symbol id="link" viewbox="0 0 50 50">
51 <g transform="translate(0 -1002.4)">
52 <g transform="matrix(.095670 0 0 .095670 2.3233 1004.9)">
53 <g>
54 <path style="stroke-width:0;fill:currentColor" d="m452.84 192.9-128.65 128.65c-35.535 35.54-93.108 35.54-128.65 0l-42.881-42.886 42.881-42.876 42.884 42.876c11.845 11.822 31.064 11.846 42.886 0l128.64-128.64c11.816-11.831 11.816-31.066 0-42.9l-42.881-42.881c-11.822-11.814-31.064-11.814-42.887 0l-45.928 45.936c-21.292-12.531-45.491-17.905-69.449-16.291l72.501-72.526c35.535-35.521 93.136-35.521 128.64 0l42.886 42.881c35.535 35.523 35.535 93.141-.001 128.66zm-254.28 168.51-45.903 45.9c-11.845 11.846-31.064 11.817-42.881 0l-42.884-42.881c-11.845-11.821-11.845-31.041 0-42.886l128.65-128.65c11.819-11.814 31.069-11.814 42.884 0l42.886 42.886 42.876-42.886-42.876-42.881c-35.54-35.521-93.113-35.521-128.65 0l-128.65 128.64c-35.538 35.545-35.538 93.146 0 128.65l42.883 42.882c35.51 35.54 93.11 35.54 128.65 0l72.496-72.499c-23.956 1.597-48.092-3.784-69.474-16.283z" />
55 </g>
56 </g>
57 </g>
58 </symbol> <symbol id="doc" viewbox="0 0 35 45">
59 <g transform="translate(-147.53 -539.83)">
60 <path style="stroke:currentColor;stroke-width:2.4501;fill:none" d="m149.38 542.67v39.194h31.354v-39.194z" />
61 <g style="stroke-width:25" transform="matrix(.098003 0 0 .098003 133.69 525.96)">
62 <path d="m220 252.36h200" style="stroke:currentColor;stroke-width:25;fill:none" />
63 <path style="stroke:currentColor;stroke-width:25;fill:none" d="m220 409.95h200" />
64 <path d="m220 488.74h200" style="stroke:currentColor;stroke-width:25;fill:none" />
65 <path d="m220 331.15h200" style="stroke:currentColor;stroke-width:25;fill:none" />
66 </g>
67 </g>
68 </symbol> <symbol id="tick" viewbox="0 0 177.16535 177.16535">
69 <g transform="translate(0 -875.2)">
70 <rect style="stroke-width:0;fill:currentColor" transform="rotate(30)" height="155" width="40" y="702.99" x="556.82" />
71 <rect style="stroke-width:0;fill:currentColor" transform="rotate(30)" height="40" width="90.404" y="817.99" x="506.42" />
72 </g>
73 </symbol>
74</svg>
75<div class="wrapper">
76 <header class="intro-and-nav" role="banner">
77 <div>
78 <div class="intro"> <a class="logo" href="/" aria-label="Notepad++ home page"> <img src="https://notepad-plus-plus.org/images/logo.svg" alt=""> </a>
79 <p class="library-desc"> <a href="/downloads/v7.9/"><strong>Current Version 7.9</strong></a> </p>
80 </div>
81 <nav id="patterns-nav" class="patterns" role="navigation">
82 <h2 class="vh">Main navigation</h2> <button id="menu-button" aria-expanded="false">
83 <svg viewbox="0 0 50 50" aria-hidden="true" focusable="false"> <use xlink:href="#menu"></use>
84 </svg> Menu </button>
85 <ul id="patterns-list">
86 <li class="pattern"> <a href="/">
87 <svg class="bookmark-icon" aria-hidden="true" focusable="false" viewbox="0 0 40 50"> <use xlink:href="#bookmark"></use>
88 </svg> <span class="text">Home</span> </a> </li>
89 <li class="pattern"> <a href="/downloads/" aria-current="page">
90 <svg class="bookmark-icon" aria-hidden="true" focusable="false" viewbox="0 0 40 50"> <use xlink:href="#bookmark"></use>
91 </svg> <span class="text">Download</span> </a> </li>
92 <li class="pattern"> <a href="/news/">
93 <svg class="bookmark-icon" aria-hidden="true" focusable="false" viewbox="0 0 40 50"> <use xlink:href="#bookmark"></use>
94 </svg> <span class="text">News</span> </a> </li>
95 <li class="pattern"> <a href="/online-help/">
96 <svg class="bookmark-icon" aria-hidden="true" focusable="false" viewbox="0 0 40 50"> <use xlink:href="#bookmark"></use>
97 </svg> <span class="text">Online Help</span> </a> </li>
98 <li class="pattern"> <a href="/resources/">
99 <svg class="bookmark-icon" aria-hidden="true" focusable="false" viewbox="0 0 40 50"> <use xlink:href="#bookmark"></use>
100 </svg> <span class="text">Resources</span> </a> </li>
101 <li class="pattern"> <a href="/index.xml">
102 <svg class="bookmark-icon" aria-hidden="true" focusable="false" viewbox="0 0 40 50"> <use xlink:href="#bookmark"></use>
103 </svg> <span class="text">RSS</span> </a> </li>
104 <li class="pattern"> <a href="/donate/">
105 <svg class="bookmark-icon" aria-hidden="true" focusable="false" viewbox="0 0 40 50"> <use xlink:href="#bookmark"></use>
106 </svg> <span class="text">Donate</span> </a> </li>
107 <li class="pattern"> <a href="/author/">
108 <svg class="bookmark-icon" aria-hidden="true" focusable="false" viewbox="0 0 40 50"> <use xlink:href="#bookmark"></use>
109 </svg> <span class="text">Author</span> </a> </li>
110 </ul>
111 </nav>
112 <div id="carbon-block"></div>
113 <script>
114
115
116if (window.location.href === "https://notepad-plus-plus.org/downloads/" || window.location.href.indexOf('/downloads/') === -1){
117 var carbonScript = document.createElement("script");
118 carbonScript.src = "//cdn.carbonads.com/carbon.js?serve=CKYIE53I&placement=notepad-plus-plusorg";
119 carbonScript.id = "_carbonads_js";
120 document.getElementById("carbon-block").appendChild(carbonScript);
121}
122else {
123 try {
124 fetch(new Request("https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js", { method: 'HEAD', mode: 'no-cors' })).then(function(response) {
125 return true;
126 }).catch(function(e) {
127 var carbonScript = document.createElement("script");
128 carbonScript.src = "//cdn.carbonads.com/carbon.js?serve=CE7DVKJM&placement=notepad-plus-plusorg";
129 carbonScript.id = "_carbonads_js";
130 document.getElementById("carbon-block").appendChild(carbonScript);
131 });
132 } catch (error) {
133 console.log(error);
134 }
135}
136
137</script>
138 </div>
139 </header>
140 <div class="main-and-footer">
141 <div>
142 <main id="main">
143 <h1>Downloads</h1>
144 <ul class="patterns-list">
145 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.9/">
146 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
147 </svg> Notepad++ 7.9: Stand with Hong Kong </a> </h2> </li>
148 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.8.9/">
149 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
150 </svg> Notepad++ 7.8.9: Stand with Hong Kong </a> </h2> </li>
151 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.8.8/">
152 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
153 </svg> Notepad++ 7.8.8 release </a> </h2> </li>
154 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.8.7/">
155 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
156 </svg> Notepad++ 7.8.7 release </a> </h2> </li>
157 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.8.6/">
158 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
159 </svg> Notepad++ 7.8.6 release </a> </h2> </li>
160 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.8.5/">
161 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
162 </svg> Notepad++ 7.8.5 release </a> </h2> </li>
163 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.8.4/">
164 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
165 </svg> Notepad++ 7.8.4 release </a> </h2> </li>
166 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.8.3/">
167 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
168 </svg> Notepad++ 7.8.3: Free Uyghur </a> </h2> </li>
169 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.8.2/">
170 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
171 </svg> Notepad++ 7.8.2: Free Uyghur </a> </h2> </li>
172 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.8.1/">
173 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
174 </svg> Notepad++ 7.8.1: Free Uyghur </a> </h2> </li>
175 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.8/">
176 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
177 </svg> Notepad++ 7.8 release </a> </h2> </li>
178 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.7.1/">
179 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
180 </svg> Notepad++ 7.7.1 release </a> </h2> </li>
181 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.7/">
182 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
183 </svg> Notepad++ 7.7 release </a> </h2> </li>
184 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.6.6/">
185 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
186 </svg> Notepad++ 7.6.6 release </a> </h2> </li>
187 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.6.4/">
188 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
189 </svg> Notepad++ 7.6.4 release </a> </h2> </li>
190 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.6.3/">
191 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
192 </svg> Notepad++ 7.6.3 release </a> </h2> </li>
193 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.6.2/">
194 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
195 </svg> Notepad++ 7.6.2 Gilet Jaune Edition </a> </h2> </li>
196 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.5.6/">
197 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
198 </svg> Notepad++ 7.5.6 release </a> </h2> </li>
199 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.5.4/">
200 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
201 </svg> Notepad++ 7.5.4 release </a> </h2> </li>
202 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.3.3/">
203 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
204 </svg> Notepad++ 7.3.3 - CIA Hack fixed </a> </h2> </li>
205 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v7.0/">
206 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
207 </svg> Notepad++ 7 - 64 bits </a> </h2> </li>
208 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v6.9/">
209 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
210 </svg> Notepad++ 6.9 </a> </h2> </li>
211 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v6.8.7/">
212 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
213 </svg> Notepad++ 6.8.7 Black Friday Discount </a> </h2> </li>
214 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v6.7.4/">
215 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
216 </svg> Notepad++ 6.7.4 - Je suis Charlie edition </a> </h2> </li>
217 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v6.6.6/">
218 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
219 </svg> Notepad++ 666 </a> </h2> </li>
220 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v6.6.4/">
221 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
222 </svg> Notepad++ 6.6.4 - Tiananmen June Fourth Incident Edition </a> </h2> </li>
223 <li> <h2> <a href="https://notepad-plus-plus.org/downloads/v6.2.3/">
224 <svg class="bookmark" aria-hidden="true" viewbox="0 0 40 50" focusable="false"> <use xlink:href="#bookmark"></use>
225 </svg> Notepad++ 6.2.3 release </a> </h2> </li>
226 </ul>
227 </main>
228 <footer role="contentinfo">
229 <div> <label for="themer"> dark theme: <input type="checkbox" id="themer" class="vh"> <span aria-hidden="true"></span> </label>
230 </div>
231 </footer>
232 </div>
233 </div>
234</div>
235<script src="https://notepad-plus-plus.org/js/prism.js"></script>
236<script src="https://notepad-plus-plus.org/js/dom-scripts.js"></script>
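2.5.2 Extracting the targeted information
With the HTML source in hand, we can locate Current Version in the DOM and extract just the version value.
Here is the excerpt that contains Current Version:
<p class="library-desc"> <a href="/downloads/v7.9/"><strong>Current Version 7.9</strong></a> </p>
In fact, the text Current Version alone is enough to locate the node.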
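import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Main {

    public static void main(String[] args) throws Exception {

        final Document doc = Jsoup.connect("https://notepad-plus-plus.org/downloads/").get();

        // Find the first element whose own text matches the regex "Current Version".
        final Element currentVersionElement = doc.body().getElementsMatchingOwnText("Current Version").first();
        if (currentVersionElement == null) {
            // first() returns null when nothing matched, e.g. after a page redesign.
            throw new IllegalStateException("\"Current Version\" not found; the page structure may have changed");
        }
        final String currentVersion = currentVersionElement.ownText();

        System.out.println(currentVersion);
    }
}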
The result:
Current Version 7.9
After that, all that is left is to pull out the version number itself; string manipulation is not covered in detail here.
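For completeness, one possible way to isolate the number is a regular expression. This is just a sketch: the class name VersionExtractor is made up, and the pattern assumes the version consists of digit groups separated by dots.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionExtractor {

    // One or more digit groups separated by dots, e.g. "7.9" or "7.8.9".
    private static final Pattern VERSION = Pattern.compile("\\d+(?:\\.\\d+)*");

    /** Returns the first version-like token in the text, or empty if none is found. */
    static Optional<String> extractVersion(String text) {
        Matcher matcher = VERSION.matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractVersion("Current Version 7.9").orElse("unknown")); // prints 7.9
    }
}
```

Returning an Optional forces the caller to decide what to do when the text no longer contains a version.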
2.6 Ways to locate DOM nodes
In the example, we used Element#getElementsMatchingOwnText(regex) to locate the DOM node containing the text Current Version, but there are several other ways.
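The one shortcoming is that jsoup, at least in the 1.13.1 release used here, does not support XPath selectors, which are more powerful than CSS selectors. (Newer releases, from 1.14.3 onward, add selectXpath.)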
2.6.1 Locating by text
This is the simplest and most direct method; sometimes we do not even need Chrome developer tools.
Methods:
getElementsContainingText(searchText)
getElementsContainingOwnText(searchText)
getElementsMatchingText(regex)
getElementsMatchingOwnText(regex)
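The difference between the plain and the "Own" variants is easiest to see on a small fragment. The sketch below parses an inline copy of the Current Version excerpt, so it needs no network access; the class name TextLocatorDemo is made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TextLocatorDemo {

    // Inline fragment modelled on the downloads page, so no network access is needed.
    static final String HTML =
            "<p class=\"library-desc\"><a href=\"/downloads/v7.9/\">"
            + "<strong>Current Version 7.9</strong></a></p>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // Matches every element whose text, including its descendants' text, contains the phrase.
        Elements containing = doc.getElementsContainingText("Current Version");
        // Matches only elements that hold the phrase directly in their own text nodes.
        Elements containingOwn = doc.getElementsContainingOwnText("Current Version");
        // Same idea as the "Own" variant, but the argument is a regular expression.
        Elements matchingOwn = doc.getElementsMatchingOwnText("Current Version \\d+");

        System.out.println(containing.size() + " elements contain the text somewhere inside them");
        System.out.println(containingOwn.first().tagName()); // strong
        System.out.println(matchingOwn.first().ownText());   // Current Version 7.9
    }
}
```

The non-"Own" variants also return every ancestor of the matching node, which is why the "Own" variants are usually the more precise choice.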
2.6.2 Locating by id
Another precise way is to locate the DOM node by the id or name attribute seen in Chrome developer tools.
Method:
getElementById(id)
2.6.3 Locating by class
When a tag has a class attribute visible in Chrome developer tools, such as class="btn", we can use it to locate the DOM node.
Method:
getElementsByClass(className)
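As a small illustration of locating by id and by class, the sketch below uses jsoup's getElementById and getElementsByClass on an inline fragment modelled on the site's navigation list. The fragment is trimmed by hand and the class name IdAndClassDemo is made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class IdAndClassDemo {

    // Hand-trimmed fragment modelled on the navigation list of the downloads page.
    static final String HTML =
            "<ul id=\"patterns-list\">"
            + "<li class=\"pattern\"><a href=\"/\">Home</a></li>"
            + "<li class=\"pattern\"><a href=\"/news/\">News</a></li>"
            + "</ul>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // An id should be unique in a page, so this returns a single element (or null).
        Element navList = doc.getElementById("patterns-list");
        // A class can be shared, so this returns a collection of elements.
        Elements items = doc.getElementsByClass("pattern");

        System.out.println(navList.tagName());   // ul
        System.out.println(items.size());        // 2
        System.out.println(items.last().text()); // News
    }
}
```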
2.6.4 Locating by attribute
Attributes other than id and class that are visible in Chrome developer tools, such as name, style and href, can also be used to locate DOM nodes.
Methods:
getElementsByAttribute(key)
getElementsByAttributeValueMatching(key, regex)
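Here is a short sketch of both methods on an inline fragment. The fragment and the class name AttributeDemo are made up for illustration, and the regular expression assumes download links look like /downloads/v7.9/.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class AttributeDemo {

    // Hand-written fragment: two links with href and one span without it.
    static final String HTML =
            "<a href=\"/downloads/v7.9/\">Notepad++ 7.9</a>"
            + "<a href=\"/news/\">News</a>"
            + "<span>no href here</span>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // Every element that has an href attribute, whatever its value.
        Elements withHref = doc.getElementsByAttribute("href");
        // Only elements whose href value matches the regular expression.
        Elements downloadLinks =
                doc.getElementsByAttributeValueMatching("href", "^/downloads/v[\\d.]+/$");

        System.out.println(withHref.size());                    // 2
        System.out.println(downloadLinks.first().attr("href")); // /downloads/v7.9/
    }
}
```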
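2.6.5 Locating by tag
The tag name seen in Chrome developer tools, such as <div>, <h3> or <ul>, can also be used to locate DOM nodes.
In practice, though, we usually first narrow down with a more precise method to within two or three levels of the target node, and only then use this method to locate a child node or a sibling node.
Method: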
getElementsByTag(tagName)
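The "narrow down first, then search by tag" idea can be sketched as follows. The fragment is a hand-trimmed imitation of the Downloads list and the class name TagDemo is made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TagDemo {

    // Hand-trimmed imitation of the Downloads list on the page.
    static final String HTML =
            "<ul class=\"patterns-list\">"
            + "<li><h2><a href=\"/downloads/v7.9/\">Notepad++ 7.9</a></h2></li>"
            + "<li><h2><a href=\"/downloads/v7.8.9/\">Notepad++ 7.8.9</a></h2></li>"
            + "</ul>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // Step 1: narrow down to the list with a more precise locator (its class).
        Element list = doc.getElementsByClass("patterns-list").first();
        // Step 2: search by tag only inside that list, not in the whole document.
        Elements links = list.getElementsByTag("a");

        System.out.println(links.size());         // 2
        System.out.println(links.first().text()); // Notepad++ 7.9
    }
}
```

Searching the whole document for a common tag such as a would also return navigation links, which is why the narrowing step comes first.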
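2.6.6 Locating by CSS selector
Anyone who has written web pages should know how to write CSS selectors. Where the other methods sometimes require chaining several calls, a CSS selector is a comparatively concise option.
Methods: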
select(cssQuery)
selectFirst(cssQuery)
Example: the following uses jsoup's DSL methods:
final String smallestVersionOnDownloadsPage = Jsoup.connect("https://notepad-plus-plus.org/downloads/").get()
        .getElementsByClass("patterns-list").first()
        .getElementsByTag("li").last()
        .getElementsByTag("h2").first()
        .getElementsByTag("a").first()
        .ownText();
And the following does the same with a CSS selector:
final String smallestVersionOnDownloadsPage = Jsoup.connect("https://notepad-plus-plus.org/downloads/").get()
        .selectFirst(".patterns-list li:last-child h2 a")
        .ownText();
Both produce the same result:
Notepad++ 6.2.3 release