Scraping information from webpages is a comparatively superior activity that, till lately, required a level of technical talent. The thought of diving into code or scripts for information extraction appeared overwhelming for a lot of, myself included.
Knowledge scraping can energy many SEO tasks, resembling auditing, competitor evaluation, and inspecting web site and information construction.
Google sheets affords simple solutions to assist.
A kind of options is the IMPORTXML perform that enables customers to scrape webpage information utilizing just some parameters. It makes information extraction accessible to a wider viewers, particularly to those that weren’t well-versed in programming languages.
Whereas this perform is spectacular, the actual breakthrough got here with the adoption and integration of generative AI into the combo.
On this information, we’ll present you the best way to use Google Sheets and AI, significantly ChatGPT, for internet scraping while not having superior coding expertise.
The Instruments: AI And Chatbots
We are actually all acquainted with AI, ChatGPT, and similar chatbots.
In reality, many people use options like ChatGPT to jot down our personal code, scripts, and packages with or without very restricted programming information.
It is so simple as offering detailed directions within the type of prompts and dealing with the chatbot to construct instruments that solely till lately we believed have been manner above us.
However most significantly, these are instruments which can be deeply altering the way in which we strategy our day-to-day work.
For instance, if we ask ChatGPT the next query, “What’s the IMPORTXML perform and the way can I take advantage of it in Google Sheets to scrape the title of an HTML webpage? Present the required code to do this in Google Sheets,” the response is extraordinarily correct. In a matter of seconds, we now have our components prepared to make use of in Google Sheets.
However to be sincere, that was a really primary and easy activity that we might have simply accomplished with out ChatGPT.
The Activity
So, how does this work if we wish to extract information that could be a bit much less normal in comparison with a web page title or description?
For instance, how does this work if we wish to extract the next information from the PPC entrance web page of Search Engine Journal?
Record all featured articles, their authors, the hyperlink URLs, and the article description for the columns listed on https://www.searchenginejournal.com/class/paid-media/pay-per-click/.
Can we do this instantly with ChatGPT?
Executing With ChatGPT
When creating prompts, it took just a few makes an attempt to offer directions that have been detailed sufficient for the chatbot to completely perceive the target of the duty and return good outcomes.
In lots of instances, it felt just like the AI was beneath strain to return fast outcomes regardless of their accuracy.
However let me clarify.
The duty was to research the web page and record all featured articles, their authors, the hyperlink URLs, and the outline for every of the 30 articles listed on the web page. Then compile the information right into a desk and eventually export it right into a CSV file.
Easy proper?
At first, ChatGPT returned only a pattern of seven articles and solely their titles and URLs; after a reworked immediate, it managed to record and export all 30 articles and their hyperlinks.
Now, that was good. So, to finish the duty, we simply wanted so as to add the authors and the article descriptions.
However right here is the place the bot stumbled and was not capable of present an correct description of every article regardless of us offering examples of the web page component it wanted to seek out and duplicate.
ChatGPT saved ignoring the directions and offering its personal article descriptions time and time once more.
ChatGPT even failed once we tried with a distinct strategy and downloaded and uploaded a duplicate of the web page HTML.
-
Screenshot from ChatGPT, February 2024
This time, it was capable of present correct information for seven articles however couldn’t go previous that. The difficulty reported:
“…the construction and content material of the web page current vital challenges for complete information extraction in a single session.
The web page is sort of in depth and complicated, and it’s not possible to extract all 30 articles within the present format of interplay.”
-
Screenshot from ChatGPT, February 2024
ChatGPT + Google Sheets
So, going again to IMPORTXML and Google Sheets.
This time, getting ChatGPT to offer the formulation for every area was like a breeze.
-
Screenshot from ChatGPT, February 2024
Listed here are a few of the formulation, as advised by the chatbot, you could simply strive your self in Google Sheets to extract:
Title
=IMPORTXML("https://www.searchenginejournal.com/class/paid-media/pay-per-click/", "//*[@id='archives-wrapper']/article/div/div[2]/h2/a")
Creator Title
=IMPORTXML("https://www.searchenginejournal.com/class/paid-media/pay-per-click/", "//*[@id='archives-wrapper']/article/div/div[2]/p[1]/a")
URL Hyperlink
=IMPORTXML("https://www.searchenginejournal.com/class/paid-media/pay-per-click/", "//*[@id='archives-wrapper']/article/div/div[2]/h2/a/@href")
Description
=IMPORTXML("https://www.searchenginejournal.com/class/paid-media/pay-per-click/", "//*[@id='archives-wrapper']/article/div/div[2]/p[2]")
Very quickly, we have been capable of extract the information into the spreadsheet.
-
Screenshot from Google Sheets, February 2024
Moreover, by utilizing merely constructed nested formulation, we are able to rapidly pull the information from a number of pages on the identical time.
Within the instance under, I used to be capable of extract the identical information associated to every article (title, creator, URL hyperlink, and outline) for the primary 10 pages of the PPC part.
The result’s a complete of 300 articles scraped in lower than a minute!
-
Screenshot from Google Sheets, February 2024
Evaluating The Two
So, how do ChatGPT vs. ChatGPT + Google Sheets IMPORTXML evaluate?
In my expertise, I couldn’t discover a straightforward and fast manner to make use of ChatGPT to scrape the information I used to be in search of – thoughts you, that doesn’t imply that this isn’t potential, and there is perhaps a number of methods to do that, however I didn’t discover any.
What labored for me was a mix of the completely different instruments, and that served me rather well for my supposed function.
ChatGPT was extraordinarily helpful for writing the IMPORTXML formulation I wanted to make use of in Google Sheets, and people formulation did the remainder.
A further bonus of the ChatGPT + Google Sheets possibility is you could simply use the free 3.5 model of ChatGPT and get the device to construct your IMPORTXML formulation, as a substitute of getting model 4 to scan the web page and extract the information.
Key Takeaway
This highlights a essential side of how AI has remodeled how we expect and work.
The perfect device for the job isn’t merely utilizing AI, Google Sheets, or any particular software program alone however fairly a combination of tools and skills.
It’s on this built-in strategy that we develop workflows which can be environment friendly and efficient, thus enhancing our total productiveness.
Extra sources:
Featured Picture: Visible Era/Shutterstock