Lingyuan's blog

Recently, I created my first web scraper using the Selenium WebDriver. I wanted to share this since finding and extracting data are both very important parts of building machine learning models. Whether you are working with supervised or even unsupervised models, you cannot get anywhere without good data to train on, so this is a valuable skill.

Using Selenium allowed me to interact with the webpage as if the program were a real user. For example, I can use methods on the Selenium driver to find and interact with elements in the website's HTML, which makes building a web scraper surprisingly easy.
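Here is a minimal sketch of what that looks like in Python; the URL is just a placeholder, not the site I actually scraped:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()            # launch a browser that Selenium controls
driver.get("https://example.com")      # placeholder URL for illustration

# Find the first link on the page and click it, just like a real user would
link = driver.find_element(By.TAG_NAME, "a")
link.click()

driver.quit()
```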

When interacting with any element using Selenium, you need a way to specify exactly which element it is. This can be a unique class, id, or other attribute assigned to it. Beyond that, you can specify the element type, like 'a' for links or 'p' for paragraphs of text. If that is not enough, you can narrow things down by specifying the element's parent elements as well, such as filtering for 'p' elements within a 'div' element. And if even that fails, you can retrieve all matching elements and pick out the one you want by index. Thankfully it didn't come to that in my scraper. You can find all of this information by going to the website you are scraping and opening Inspect Element with Ctrl + Shift + I, which lets you see the website's HTML and CSS, along with other relevant details. So when you are coding any specific interaction, you only need to find a way to identify each element, and you are good to go.
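To make that concrete, here are the locator strategies described above as I would write them in Python. The selectors are made up for illustration; the real ones come from whatever you find in Inspect Element:

```python
from selenium.webdriver.common.by import By

# Hypothetical selectors standing in for whatever the real page uses
button = driver.find_element(By.ID, "submit-btn")                # unique id
cards = driver.find_elements(By.CLASS_NAME, "profile-card")      # class attribute
links = driver.find_elements(By.TAG_NAME, "a")                   # element type
nested = driver.find_elements(By.CSS_SELECTOR, "div.results p")  # 'p' inside a 'div'
third_para = driver.find_elements(By.TAG_NAME, "p")[2]           # last resort: index
```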

Now for my actual project: I was scraping a freelancer website for example client requests and freelancer profiles, for a potential future project. To do this, I simply did what I described above. After navigating to the page that contained the relevant information, I iterated through each item and copied it to a local document.
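The core loop looked roughly like the sketch below. The URL and class name are placeholders rather than the real site's, but the shape is the same: collect the elements, then write their text out to a file:

```python
# Placeholder URL and class name; substitute whatever the real page uses
driver.get("https://freelance-site.example.com/requests")
items = driver.find_elements(By.CLASS_NAME, "request-card")

with open("requests.txt", "w", encoding="utf-8") as f:
    for item in items:
        f.write(item.text + "\n\n")  # copy each item's text to a local document
```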

While working on my project, I encountered some problems. For example, at one point my code raised an error when I asked it to click a button. It complained that the element could not be clicked, even though I had double and even triple checked that the id I used was accurate. It turned out that Selenium can only click elements that are actually visible in the viewport, which I had not realized. The fix was easy: use a bit of JavaScript to scroll the page to where the button was located, ensuring it was on screen. There were also some minor problems when logging in, caused by the overly complex login system of the website I was trying to scrape, but those could be navigated with careful programming.
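The scrolling fix was just a couple of lines. This is roughly what I did, with a made-up id standing in for the real button's:

```python
button = driver.find_element(By.ID, "load-more")  # hypothetical id for illustration

# Scroll the page until the button is in the viewport, then click it
driver.execute_script("arguments[0].scrollIntoView(true);", button)
button.click()
```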

Overall, I am proud of how my scraper turned out. It is not the most efficient or most elegant scraper, but it got the job done and was an interesting challenge compared to my usual projects like USACO problems or machine learning models.