Using Selenium for web scraping jobs, connections, etc.

Back in Web Scraping #1, I walked through how to automate the login process for LinkedIn. So, now that we are logged in, the fun begins with Selenium. I am going to walk through searching LinkedIn and scraping the search results page.
This process picks up immediately after you have logged into LinkedIn. If you would rather just log in manually ::scoff::, you can simply create the WebDriver instance, copy the URL from your browser, and pass it to the driver (don't forget the 'string' notation) so that Selenium can take control of your browser.
from selenium import webdriver

driver = webdriver.Chrome(PATH)  # PATH to chromedriver, as set up in Web Scraping #1
driver.get('https://www.linkedin.com/*')  # paste the URL copied from your browser, as a string
You'll have an instance of the Chrome browser open on your computer, indicated by the "Chrome is being controlled by automated test software" banner at the top of the browser window.

You should see the LinkedIn search bar at the top of the page in your browser, which should now be controlled by Selenium.

To understand how the HTML and CSS are functioning on the page, press CTRL+SHIFT+I to open the developer tools, or, since I know the element I am looking for, simply right-click on the search box.

Click "Inspect". The blue-highlighted HTML refers to the search bar typeahead, which is the element we are going to target with our Selenium WebDriver, or 'driver'; the driver does not need to be re-created if you followed the login process in Web Scraping #1.

Let’s just go ahead and declare a variable for the term for which we are searching.
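A minimal sketch of that line; the term itself is just an example, so substitute whatever you are searching for:

search_item = 'data scientist'  # example search term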

Next, we will direct the driver to the search bar. Here, I found the container element using its id; then, after locating the element that accepts the text input for the search (using the input tag and class name from the HTML on the page), I call '.click()' on it.
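Roughly, that block looks like the following; the id and class-name values are assumptions based on LinkedIn's markup at the time of writing, so double-check them in the inspector:

from selenium.webdriver.common.by import By

search_box = driver.find_element(By.ID, 'global-nav-typeahead')  # assumed id of the search container
search_bar = search_box.find_element(By.CSS_SELECTOR, 'input.search-global-typeahead__input')  # assumed class
search_bar.click()                 # activate the cursor in the search bar
search_bar.send_keys(search_item)  # type out the search term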

So, in the code above, I first activate my cursor in the search bar; then I simply send the 'search_item' variable, which is the string above containing my search term. The 'send_keys()' function enters the keys for my search term. Below is the result of the code up to this point, which you can see in your automated browser window.

Next, we run this line of code, which sends the ‘RETURN’ or ‘ENTER’ key after the search term has been typed out.
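Using Selenium's Keys helper, that line looks like this:

from selenium.webdriver.common.keys import Keys

search_bar.send_keys(Keys.RETURN)  # submit the search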

Once this code runs, the browser opens the page containing the results for your search, with sections such as "Jobs", "People", and "Posts". First, you can define whether you are searching for jobs, people, or posts as the 'target' variable. Next, I assign the variable 'button' to the element's link text, which is visible in the banner but can be verified in the HTML.
target = 'people'
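A sketch of those two lines; locating the filter button by its visible link text is an assumption (the banner text may differ), so verify it against the HTML:

button = driver.find_element(By.LINK_TEXT, target.capitalize())  # the 'People' filter in the banner
driver.execute_script('arguments[0].click();', button)           # click it via JavaScript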

In the second line of code, I use a JavaScript command, because the Action Chains and Python click commands prove tricky with buttons or clickable areas that prompt new URL windows to open. For now, I am going to extract the information for only the first item returned in the search container; this will all be automated later.
In the first line of code in the next block, I find the first element by specifying 'find_element' rather than 'find_elements', so only the first instance of the CSS selector, the class 'entity-result__title', is selected; this refers to the box that contains the name of the person on the profile. Both items I need are located here, for now.
In the second line, I basically piggy-back off of the first: the WebDriver now retrieves information only from within the first search result, because 'result', rather than 'driver', is the instance being searched. Additionally, I split the string, so a list is returned for the result: text[0] contains the person's name field, and text[1] contains the string "View (name)'s Profile". These are located within the HTML/CSS for the 'result' element, under the attribute 'innerText', which is verified with a simple 'print' statement. The person's LinkedIn profile URL is held in the 'pathname' attribute, again verified with a 'print' statement.
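Pieced together, a sketch of those lines; the class name and the newline split are based on LinkedIn's markup at the time, so treat them as assumptions to verify:

result = driver.find_element(By.CSS_SELECTOR, '.entity-result__title')  # first result's title box only
link = result.find_element(By.TAG_NAME, 'a')                            # the anchor inside the title
text = link.get_attribute('innerText').split('\n')                      # [name, "View (name)'s Profile"]
print(text[0])                         # the person's name
print(link.get_attribute('pathname'))  # the profile URL path, e.g. /in/...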
The primary and secondary subtitles tend to be populated by the company where the person works and their geographical location, which I located using the CSS selector correlating with each. For the primary and secondary details, aka 'deets', I am no longer using the 'result' element from earlier, because it is just the 'title' line of the individual search result and its corresponding <a> tag.
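The subtitle selectors here are assumptions in the same spirit; LinkedIn has used 'entity-result__primary-subtitle' and 'entity-result__secondary-subtitle', but confirm them in the inspector:

primary_deets = driver.find_element(By.CSS_SELECTOR, '.entity-result__primary-subtitle').text      # e.g. company
secondary_deets = driver.find_element(By.CSS_SELECTOR, '.entity-result__secondary-subtitle').text  # e.g. location
print(primary_deets, secondary_deets)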

Here is the output for the above code. This is exactly what I am looking for. Next, I will create a function and incorporate pagination in my automation so that I can get hundreds of results and add them to a Pandas DataFrame. From there, I can assess the data, filter it, and decide which profiles I want to fully scrape and connect with in the next installment of "Web Scraping".

For a quick, more extensive search, a loop can be implemented, as sketched below.
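A sketch of such a loop, switching to 'find_elements' so every result on the page is processed (same assumed selectors as above):

results = driver.find_elements(By.CSS_SELECTOR, '.entity-result__title')
for result in results:
    link = result.find_element(By.TAG_NAME, 'a')
    text = link.get_attribute('innerText').split('\n')
    print(text[0])                         # name
    print(link.get_attribute('pathname'))  # profile URL path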
