Tutorial | Enrich web logs in the Prepare recipe#

Logs can be difficult and time-consuming to parse. Let’s see how a number of processors in the Prepare recipe can make this job easier!

Get started#

Objectives#

In this tutorial, you will:

  • Parse and enrich web log data using Prepare recipe processors, such as Resolve GeoIP, Classify User-Agent, and Split URL.

Prerequisites#

Create the project#

As a toy example, we provide a randomly generated dataset to work with.

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Core Designer > Web Log Enrichment.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Explore the data#

Even for standardized formats, such as Apache log files, you often need to write several regular expressions to extract all the information you need. Common fields include:

  • IP address

  • User (not available in this example)

  • Timestamp

  • Requested URL

  • Request status code

  • Size of the object returned

  • Referer (from where the user made the request)

  • User agent (user’s device)
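To give a sense of the hand-written parsing that the automatic format detection spares you, here is a minimal sketch of a regular expression for the Apache combined log format. The pattern, the `parse_line` helper, and the sample line are illustrative assumptions, not taken from the starter project or from Dataiku's own parser.

```python
import re

# Pattern for the Apache "combined" log format:
# host ident user [time] "request" status size "referer" "user-agent"
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None


# Made-up sample line, for illustration only.
sample = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /products/42 HTTP/1.1" 200 2326 '
    '"http://example.com/home" "Mozilla/5.0 (X11; Linux x86_64)"'
)
fields = parse_line(sample)
print(fields["ip"], fields["status"], fields["referer"])
```

Even this single pattern ignores edge cases such as escaped quotes inside the request or user-agent strings, which is part of why hand-rolled log parsing tends to grow over time.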

In the starter project, let’s look at how Dataiku has automatically detected the Apache combined log format and parsed the log on your behalf.

  1. Open the toylogs dataset.

  2. Navigate to the Settings tab near the top right corner.

  3. Click on the Format / Preview subtab.

  4. Click Preview This Dataset.

  5. Note that Dataiku has recognized the file as an Apache combined log.

Dataiku screenshot of the format preview of a dataset.

Create a visual analysis#

We could prepare this log data in a Prepare recipe or a visual analysis. Let’s choose the latter.

  1. From the toylogs dataset, go to the Lab in the Actions sidebar.

  2. Under Visual analyses, click New Analysis.

  3. Click Create Analysis.

Dataiku screenshot of the dialog to create a visual analysis.

Resolve GeoIP addresses#

Let’s first extract geographic information from the IP address.

  1. Open the column header dropdown for the ip column.

  2. Select Resolve GeoIP.

  3. Extract only the country and GeoPoint columns.

Dataiku screenshot of the Resolve GeoIP processor.

Tip

It’s possible to visualize the extracted geographic data on a map. See Tutorial | No-code maps to learn more.

Parse user-agent information#

In a web log analysis, the user-agent information can help answer questions like:

  • Is the user on a computer, a mobile phone, or a tablet?

  • Which browser is the most used on a website?

  • Which browser generates the most error statuses?

  • Is there a correlation between the device used and the probability of a sale?

Let’s extract this information now!

  1. Open the column header dropdown for the user_Agent column.

  2. Select Classify User-Agent.

  3. Using the column header dropdown, delete the columns user_Agent_category, user_Agent_version, user_Agent_osversion, and user_Agent_osflavor, so that only user_Agent_type, user_Agent_brand, and user_Agent_os remain.

Dataiku screenshot of the Classify User-Agent processor.
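For intuition about what a user-agent classifier has to do, the sketch below guesses a device type and browser family from substrings of the raw string. It is a deliberately crude heuristic with made-up output keys (`device`, `browser`); it is not how the Classify User-Agent processor works internally, and real classifiers rely on much larger rule sets.

```python
def classify_user_agent(ua):
    """Very rough guess at device type and browser family from a User-Agent string."""
    text = ua.lower()

    if "bot" in text or "spider" in text or "crawler" in text:
        device = "bot"
    elif "ipad" in text or "tablet" in text:
        device = "tablet"
    elif "mobi" in text or "iphone" in text:
        device = "mobile"
    else:
        device = "computer"

    # Order matters: Chrome's string also contains "Safari",
    # and Edge's string also contains "Chrome".
    browser = "other"
    for token, name in (
        ("edg/", "Edge"),
        ("firefox", "Firefox"),
        ("chrome", "Chrome"),
        ("safari", "Safari"),
    ):
        if token in text:
            browser = name
            break

    return {"device": device, "browser": browser}


# Sample strings in the style of common browsers, for illustration only.
desktop = classify_user_agent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
phone = classify_user_agent(
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1"
)
print(desktop, phone)
```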

Note

Dataiku suggests this particular step from the processor library because it has inferred the meaning of this column to be User-Agent.

Clear invalid values based on column meaning#

Now let’s turn our attention to the referer URL, which tells us where visitors are coming from. The data quality bar underneath the column header shows that not all values in the column match the auto-detected meaning for the referer column.

Let’s find and clear these invalid values.

  1. Open the column header dropdown for the referer column.

  2. Select Filter.

  3. Select NOK to show only the rows not matching the auto-detected column meaning of URL.

  4. Click on a referer cell value that does not match the URL meaning, and select Clear invalid cells for meaning.

Dataiku screenshot of a dialog for clearing invalid meanings.
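In code, clearing values that do not match an expected meaning amounts to validating each cell and replacing failures with an empty value. The sketch below uses Python's standard `urllib.parse.urlsplit` with a simple rule (an `http` or `https` scheme plus a host). Both the rule and the `clear_if_not_url` helper are assumptions for illustration; Dataiku's own check for the URL meaning may be stricter or looser.

```python
from urllib.parse import urlsplit


def clear_if_not_url(value):
    """Return value if it looks like an absolute http(s) URL, else None."""
    if not value:
        return None
    try:
        parts = urlsplit(value)
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs.
        return None
    if parts.scheme in ("http", "https") and parts.netloc:
        return value
    return None


# Made-up referer values, for illustration only.
referers = ["http://example.com/home", "-", "not a url", ""]
cleaned = [clear_if_not_url(r) for r in referers]
print(cleaned)
```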

Parse a referer URL#

Based on the auto-detected meaning of URL, Dataiku suggests a step for splitting the URL into host, port, and many other entities. In particular, let’s assume our interest is analyzing the path for these URLs.

  1. Delete the filter on the referer column.

  2. Once again, open the column header dropdown for the referer column.

  3. Select Split URL into host, port….

  4. On the left, leave only Extract path selected. Uncheck the other options.

Dataiku screenshot of the Split URL processor.
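For comparison, the same decomposition can be sketched with Python's standard library. `urllib.parse.urlsplit` returns the scheme, host, port, path, query, and fragment of a URL; the sample URL below is made up, and the output column naming used by Dataiku is not reproduced here.

```python
from urllib.parse import urlsplit

# Made-up referer URL, for illustration only.
url = "http://example.com:8080/products/shoes?color=red#reviews"
parts = urlsplit(url)

print(parts.scheme)    # protocol
print(parts.hostname)  # host
print(parts.port)      # port, as an int (None when absent)
print(parts.path)      # the component kept in this tutorial
print(parts.query)
print(parts.fragment)
```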

Split a column#

Let’s keep working on the extracted URL paths.

  1. Click Add New Step near the bottom left.

  2. Select Split column from the processors library.

  3. Provide referer_path as the column to split.

  4. Provide / as the delimiter.

  5. Check the box Truncate, and keep the default maximum of 1 column.

Dataiku screenshot of a split column step.
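The step above can be approximated in plain Python as a split followed by a truncation. In this sketch, `split_column` and its `max_columns` parameter are hypothetical names standing in for the processor and its Truncate option. The `prefix_0, prefix_1, ...` naming follows the referer_path_0 column mentioned later in this tutorial.

```python
def split_column(value, delimiter, prefix, max_columns=None):
    """Split value on delimiter and name the chunks prefix_0, prefix_1, ...

    max_columns is a hypothetical stand-in for the Truncate option:
    when set, only that many leading chunks are kept.
    """
    chunks = value.split(delimiter)
    if max_columns is not None:
        chunks = chunks[:max_columns]
    return {f"{prefix}_{i}": chunk for i, chunk in enumerate(chunks)}


# Made-up path without a leading delimiter. In plain Python, a leading "/"
# would produce an empty first chunk; Dataiku's handling may differ.
row = split_column("products/shoes/42", "/", "referer_path", max_columns=1)
print(row)
```

Keeping a single output column means only the first path segment survives, which is usually enough to group requests by top-level section of a site.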

What’s next?#

Congratulations! You’ve gained practice using Prepare recipe processors for extracting information from web log data without the need to code any regular expressions.

Before deploying the script as a Prepare recipe, you could continue this analysis in a variety of ways:

  • Use the Analyze tool on the referer_path_0 column to study where most requests come from.

  • Parse and extract date components from the apache_time column.

  • Split the request column as done for the referer column.

  • Visualize results with charts.
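As a rough idea of what extracting date components involves, the sketch below parses a timestamp written in the usual Apache log layout with Python's `datetime.strptime`. The sample value and the assumption that the raw text looks like `10/Oct/2023:13:55:36 +0000` are illustrative; the apache_time column in the project may already be stored in a different, parsed form.

```python
from datetime import datetime

# Usual Apache log timestamp layout: day/month-abbreviation/year:time offset.
# %b expects English month abbreviations under the default locale.
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

# Made-up sample value, for illustration only.
raw = "10/Oct/2023:13:55:36 +0000"
parsed = datetime.strptime(raw, APACHE_TIME_FORMAT)

components = {
    "year": parsed.year,
    "month": parsed.month,
    "day": parsed.day,
    "hour": parsed.hour,
}
print(components)
```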

Note

You can find more information about all available Prepare recipe processors in the reference documentation.