BeautifulSoup 4

Beautiful Soup is a Python library for extracting data from HTML and XML documents.

The code below parses an HTML document into a BeautifulSoup object.

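from bs4 import BeautifulSoup

# Sample document (the "three sisters" example from the official documentation)
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html_doc, "html.parser")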

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

With the parsed soup object, you can navigate the document and get results like the following.

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']  (class is a multi-valued attribute, so a list is returned)

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Extract every URL found in the page's <a> tags
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# Extract all of the text in the page
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
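
The link and text extraction shown above can be made slightly more defensive. The following is a minimal sketch, not part of the original example: `sample_html` is a hypothetical, shortened variant of the document in which one link deliberately has no `href`. It relies on `find_all` accepting attribute filters (`href=True`, `class_=...`) and on `get_text(strip=True)`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration):
# the second <a> tag intentionally has no href attribute.
sample_html = """
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>.</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True matches only <a> tags that actually have an href attribute,
# so the resulting list never contains None.
urls = [a["href"] for a in soup.find_all("a", href=True)]

# class_="sister" filters by CSS class; strip=True trims surrounding whitespace.
names = [a.get_text(strip=True) for a in soup.find_all("a", class_="sister")]

print(urls)
print(names)
```

By contrast, `link.get('href')` returns `None` for a tag without the attribute, which is fine for printing but usually needs filtering before the URLs are used.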

BeautifulSoup supports the following parsers.

I mostly use html.parser.

Parser	Typical usage	Advantages	Disadvantages
Python's html.parser	BeautifulSoup(markup, "html.parser")	Batteries included; decent speed	Not as fast as lxml; less lenient than html5lib
lxml's HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient	External C dependency
lxml's XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Very fast; the only supported XML parser	External C dependency
html5lib	BeautifulSoup(markup, "html5lib")	Extremely lenient; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency
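
Because lxml and html5lib are optional installs, one way to choose a parser is to try the fast one first and fall back to the built-in one. The sketch below assumes this preference order; `make_soup` is a hypothetical helper name, not part of BeautifulSoup. It relies on BeautifulSoup raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    `make_soup` and its default order are assumptions made for this sketch.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not available in the environment; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello, <b>world</b></p>")
text = soup.get_text()
print(text)
```

Different parsers can build different trees from the same invalid markup, so a fallback like this is best suited to reasonably well-formed input.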

# References

https://www.crummy.com/software/BeautifulSoup/bs4/doc.ko/

Sengwoolee’s blog
