The Kangaroo LLM project today announced the launch of an extensive web crawling initiative to create Australia's first open-source artificial intelligence model. Under this effort, the project's custom web crawler, "Kangaroo Bot," will begin collecting data from 754,000 Australian websites on September 25th to build the VegeMighty dataset, a comprehensive corpus of Australian English content.
Australia has more than 4.2 million registered domains, and this initial phase represents a significant step towards developing an AI model that genuinely understands and represents Australian language and culture.
"This initiative marks a pivotal moment in Australia's AI journey," said Vinod Bijlani, AI Practice Leader at Hewlett Packard Enterprise (HPE) and a key partner in the Kangaroo LLM consortium. "By ethically harvesting data from 754,000 websites in this first phase, we're laying the groundwork for an AI that will not only understand Australian English but will also grasp the nuances of our diverse digital landscape. This is more than just data collection; it's about capturing the essence of Australian online communication and culture."
Key aspects of the web crawling initiative include:
- Extensive Scope: Targeting 754,000 Australian websites in the first phase to create a diverse and comprehensive dataset.
- Ethical Data Collection: Adhering to responsible web crawling practices and respecting website owners' preferences.
- Transparency: Commitment to publishing the full list of websites to be crawled, fostering trust and open dialogue.
- Data Sovereignty: All collected data will be processed and stored within Australia, ensuring compliance with national regulations.
- Immediate Commencement: Web crawling will begin on September 25th, 2024.
The Kangaroo LLM project is committed to responsible data collection. Website owners who wish to opt out of the Kangaroo Bot crawl can do so by adding the following to their robots.txt file:
User-agent: Kangaroo Bot
Disallow: /
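Website owners who want to confirm that such a rule reads as intended could check it with a short script. The following is a minimal sketch using Python's standard-library robots.txt parser, not the project's own tooling: the function name `is_crawl_allowed` and the `example.com.au` URL are illustrative assumptions, and how Kangaroo Bot itself interprets robots.txt is not described in this announcement.

```python
from urllib.robotparser import RobotFileParser

# The opt-out rule from the announcement, one directive per line.
OPT_OUT_RULES = """\
User-agent: Kangaroo Bot
Disallow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration; it reflects how Python's
    standard-library parser reads the rules, which may differ from
    how any particular crawler reads them.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com.au is a placeholder domain, not one named in the announcement.
    page = "https://example.com.au/about"
    print(is_crawl_allowed(OPT_OUT_RULES, "Kangaroo Bot", page))
    print(is_crawl_allowed(OPT_OUT_RULES, "SomeOtherBot", page))
```

Under this parser, the rule blocks only the agent named in the User-agent line; other crawlers are unaffected unless the file lists them separately.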
"This extensive data collection effort is not just about creating an AI model; it's about building a foundation for Australia's AI future," Bijlani added. "We're inviting all Australians to be part of this groundbreaking journey, whether by allowing us to include their sites in our dataset or by following our progress."
The Kangaroo LLM consortium, which includes industry leaders such as Katonic, RackCorp, NextDC, Hitachi Vantara, and HPE, views this initiative as a crucial step towards establishing Australia as a leader in ethical AI development.