
Web scraping has the power to help innovators collect data, which can then be used in a number of ways, from tracking online pricing fluctuations to creating the AI models of the future. As such, access to web data is a priority for many organizations — something that isn’t always easy to maintain.
We spoke to Giedrius Steimantas, engineering manager at Oxylabs, about the demand for data, and the role that web scraping is playing in this landscape.
BN: What’s driving the growing demand for data?
GS: To me, three clear factors are driving this:
- E-commerce and general decision-making software needs real-time, real-world data to perform tasks successfully. With dynamic pricing, for example, up-to-date data on buying patterns allows companies to adjust prices in real time and keep them competitive for customers.
- LLMs, alongside other AI tools, require vast quantities of training data to be successful — this demand can be eased with web scraping, making it an essential component in the AI revolution.
- Public data use cases are increasing — this was seen in the UK government’s January plan to turbocharge AI, as it set out to create a new National Data Library to safely and securely unlock the value of public data and support AI development.
These examples are evidence of the growing demand for data, making it clear why many are saying that this information should not sit solely in the hands of Big Tech — it should be equally accessible to the public and the wider market.
It’s worth noting at this stage that the push and pull between Big Tech and the wider market does come with some negative sentiment, especially surrounding the AI boom (which is to be expected whenever big changes shake the technology industry). While this debate rages, it becomes a double-edged sword: sites tighten restrictions, and the knock-on effects spread. A lack of access affects every use case that has traditionally depended on public web data, including academic research, investigative journalism, and business applications such as price comparison and flight fare aggregators, which used web scraping long before AI drove data demand higher.
BN: What are the trade-offs between openness versus full restriction of web data collection?
GS: Maintaining, or even widening, public data access means that organizations of all sizes have the data they need both to fuel innovation and to continue operating as they do today. This is where web scraping plays a key role. By giving everyone equal access to the public data needed to fuel future technology in a cost-effective way, we stop a small group of big organizations from having all the power.
It’s clear that web data collection is also a crucial part of building AI tools of the future. Ethics are a key element in the trade-off between openness and full restriction because AI built on a broad spectrum of public data and larger datasets is less likely to replicate human biases. In simple terms, open access provides enterprises with a larger, more balanced input, and therefore, the output is a fairer representation.
It’s important to remember that this is an ever-changing landscape, and one that enterprises need to stay ahead of. More than ever before, customers are prioritizing AI-embedded technologies, and if this market is at risk of bias or data shortages, it could have a huge knock-on effect. Staying ahead of consumer concerns is a high priority.
BN: What are the societal impacts of building AI on limited datasets?
GS: Building AI models on limited datasets creates a gap between reality and the data used to train them. This could increase biases if the data doesn’t fairly reflect the whole of society, potentially resulting in discrimination against underrepresented groups in the outputs.
Additionally, reducing data access for AI training could lead to gaps in knowledge, resulting in AI models providing misleading or distorted information to users. When this is then used in decision-making at a high level, inequalities could be reflected or exaggerated.
BN: How has the rise of AI agents impacted web scraping?
GS: To work efficiently, AI agents need real-time, real-world data from a wide range of accessible data sources, and the main method of acquiring this data is web scraping. Currently, AI agents have an estimated $5.1bn market size, which is expected to grow to $47.1bn by the end of 2030. This represents huge revenue potential for companies that control data sources, which tempts some of them to tighten their grip. If they succeed and monopolize public data access, it will slow down innovation and reduce competitive product offerings, which will hurt the end user the most.
On the other hand, AI agents are increasingly used in public data collection. They can make it more cost-effective and efficient. Additionally, they can help democratize access to public web data for smaller companies that don’t have the resources to hire enough engineers for scraping. AI agents can pick up some of the slack.
BN: Why is it so important that we keep open access to data to fuel competition in the web scraping industry?
GS: For an innovative and competitive AI landscape, we must embrace open access to public data for all. This requires a step away from the data gatekeeping practiced by some large companies, ensuring that smaller startups get fair access to the same data. Ultimately, public web data should be in the hands of all players in the free market, not the few. Innovation thrives in a competitive market, where monopolies are rejected and everyone is given an equal opportunity to do great things with public resources.
Image credit: monsit/depositphotos.com
