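Overview
If you write a crawler in Go and need to parse HTML, the friendliest approach is a jQuery-like interface for selecting elements in the document.
This post introduces goquery, which gives you an interface about as convenient as jQuery's.
The example below scrapes all the links on the GitHub home page.
It only prints the links and does not dig any deeper; extending it is left to the reader.
Once you have the links, you can join them into absolute URLs, de-duplicate them, crawl them again, and so on. You could also store the URLs in a database for long-running or traversal crawls, and throttle the request rate. Those are general crawler topics and are out of scope for this post.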
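package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

// Scrape fetches the page at url and prints the href of every <a> element.
func Scrape(url string) {
	// Note: NewDocument is marked deprecated in newer goquery releases, which
	// recommend net/http plus NewDocumentFromReader instead.
	doc, err := goquery.NewDocument(url)
	if err != nil {
		log.Fatal(err)
	}

	// Find the urls
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		val, exists := s.Attr("href")
		if exists && val != "" {
			fmt.Printf("href:%s\n", val)
		}
	})
}

func main() {
	Scrape("https://github.com")
}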
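The hrefs printed by a program like this are often relative (for example /features) or differ only by a fragment. As a rough illustration of the joining and de-duplication step mentioned earlier, here is a minimal sketch that uses only the standard net/url package. The function name normalizeLinks, the sample hrefs, and the rule of keeping only http(s) links are assumptions made for this example, not part of goquery.

```go
package main

import (
	"fmt"
	"net/url"
)

// normalizeLinks (hypothetical helper) resolves each href against base,
// keeps only http/https URLs, drops fragments, and removes duplicates
// while preserving the order in which links were first seen.
func normalizeLinks(base string, hrefs []string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	seen := make(map[string]bool)
	var out []string
	for _, h := range hrefs {
		ref, err := url.Parse(h)
		if err != nil {
			continue // skip hrefs that cannot be parsed
		}
		abs := baseURL.ResolveReference(ref)
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue // skip mailto:, javascript:, and similar schemes
		}
		abs.Fragment = "" // treat /page and /page#top as the same URL
		key := abs.String()
		if !seen[key] {
			seen[key] = true
			out = append(out, key)
		}
	}
	return out, nil
}

func main() {
	// Sample hrefs chosen for illustration only.
	links, err := normalizeLinks("https://github.com", []string{
		"/features",
		"/features#top",
		"mailto:a@example.com",
		"https://github.com/pricing",
	})
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

In a real crawler the seen set would typically live in a database or another persistent store rather than an in-memory map.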
Brief API documentation:
package goquery // import "github.com/PuerkitoBio/goquery"
Package goquery implements features similar to jQuery, including the chainable syntax, to manipulate and query an HTML document.
It brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go’s net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery’s stateful manipulation functions (like height(), css(), detach()) have been left off.
Also, because the net/html parser requires UTF-8 encoding, so does goquery: it is the caller’s responsibility to ensure that the source document provides UTF-8 encoded HTML. See the repository’s wiki for various options on how to do this.
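The wiki's options are not reproduced here. As a minimal standard-library sketch of the caller's side of this requirement, the code below checks whether a byte slice is valid UTF-8 before it would be handed to a parser, and falls back to treating it as ISO-8859-1. The helper names ensureUTF8 and latin1ToUTF8 and the Latin-1 fallback are assumptions for illustration only; a real page should be decoded according to its declared charset.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToUTF8 (hypothetical helper) converts ISO-8859-1 bytes to a UTF-8
// string. Each Latin-1 byte value equals its Unicode code point, so every
// byte maps to exactly one rune.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// ensureUTF8 (hypothetical helper) returns the input unchanged when it is
// already valid UTF-8. Otherwise it assumes the input is ISO-8859-1, which
// is a simplification made for this sketch.
func ensureUTF8(b []byte) string {
	if utf8.Valid(b) {
		return string(b)
	}
	return latin1ToUTF8(b)
}

func main() {
	latin1 := []byte{'c', 'a', 'f', 0xE9} // "café" encoded as ISO-8859-1
	fmt.Println(utf8.Valid(latin1))        // false: 0xE9 alone is not valid UTF-8
	fmt.Println(ensureUTF8(latin1))        // café
}
```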
Syntax-wise, it is as close as possible to jQuery, with the same method names when possible, and that warm and fuzzy chainable interface. jQuery being the ultra-popular library that it is, writing a similar HTML-manipulating library was better to follow its API than to start anew (in the same spirit as Go’s fmt package), even though some of its methods are less than intuitive (looking at you, index()…).
It is hosted on GitHub, along with additional documentation in the README.md file: https://github.com/puerkitobio/goquery
Please note that because of the net/html dependency, goquery requires Go1.1+.
The various methods are split into files based on the category of behavior. The three dots (…) indicate that various “overloads” are available.
- array.go : array-like positional manipulation of the selection.
    - Eq()
    - First()
    - Get()
    - Index…()
    - Last()
    - Slice()
- expand.go : methods that expand or augment the selection’s set.
    - Add…()
    - AndSelf()
    - Union(), which is an alias for AddSelection()
- filter.go : filtering methods, that reduce the selection’s set.
    - End()
    - Filter…()
    - Has…()
    - Intersection(), which is an alias of FilterSelection()
    - Not…()
- iteration.go : methods to loop over the selection’s nodes.
    - Each()
    - EachWithBreak()
    - Map()
- manipulation.go : methods for modifying the document
    - After…()
    - Append…()
    - Before…()
    - Clone()
    - Empty()
    - Prepend…()
    - Remove…()
    - ReplaceWith…()
    - Unwrap()
    - Wrap…()
    - WrapAll…()
    - WrapInner…()
- property.go : methods that inspect and get the node’s properties values.
    - Attr*(), RemoveAttr(), SetAttr()
    - AddClass(), HasClass(), RemoveClass(), ToggleClass()
    - Html()
    - Length()
    - Size(), which is an alias for Length()
    - Text()
- query.go : methods that query, or reflect, a node’s identity.
    - Contains()
    - Is…()
- traversal.go : methods to traverse the HTML document tree.
    - Children…()
    - Contents()
    - Find…()
    - Next…()
    - Parent[s]…()
    - Prev…()
    - Siblings…()
- type.go : definition of the types exposed by goquery.
    - Document
    - Selection
    - Matcher
- utilities.go : definition of helper functions (and not methods on a *Selection) that are not part of jQuery, but are useful to goquery.
    - NodeName
    - OuterHtml

func NodeName(s *Selection) string
func OuterHtml(s *Selection) (string, error)
func CloneDocument(doc *Document) *Document
func NewDocument(url string) (*Document, error)
func NewDocumentFromNode(root *html.Node) *Document
func NewDocumentFromReader(r io.Reader) (*Document, error)
func NewDocumentFromResponse(res *http.Response) (*Document, error)
type Document struct { … }
type Matcher interface { … }
type Selection struct { … }
Aside
Besides goquery, you can also use x/net/html directly. Documentation: https://godoc.org/golang.org/x/net/html
It is somewhat more cumbersome to use, though.